<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>data lineage - Anjana Data</title>
	<atom:link href="https://anjanadata.com/en/tag/linaje-del-dato/feed/" rel="self" type="application/rss+xml" />
	<link>https://anjanadata.com/en</link>
	<description>Data Governance &amp; Analytics</description>
	<lastBuildDate>Fri, 04 Sep 2020 09:24:08 +0000</lastBuildDate>
	<language>en-GB</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://anjanadata.com/wp-content/uploads/2020/03/cropped-favicon_anjanadata-32x32.png</url>
	<title>data lineage - Anjana Data</title>
	<link>https://anjanadata.com/en</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>How to obtain a complete data lineage of Spark processes thanks to Anjana Data</title>
		<link>https://anjanadata.com/en/how-to-obtain-a-complete-data-lineage-of-spark-processes-thanks-to-anjana-data/</link>
					<comments>https://anjanadata.com/en/how-to-obtain-a-complete-data-lineage-of-spark-processes-thanks-to-anjana-data/#respond</comments>
		
		<dc:creator><![CDATA[Juan Sobrino]]></dc:creator>
		<pubDate>Fri, 04 Sep 2020 09:24:08 +0000</pubDate>
				<category><![CDATA[Data governance]]></category>
		<category><![CDATA[data lineage]]></category>
		<guid isPermaLink="false">https://anjanadata.com/?p=3484</guid>

					<description><![CDATA[Knowing the lineage of data is very important from a data governance point of view: knowing how information flows throughout the organisation, understanding the data value chain and its processes, knowing the lifecycle of data assets and how [...]]]></description>
										<content:encoded><![CDATA[<h3><img fetchpriority="high" decoding="async" class="aligncenter wp-image-3492 size-full" src="https://anjanadata.com/wp-content/uploads/2020/09/linaje-de-datos-1.png" alt="" width="1024" height="512" srcset="https://anjanadata.com/wp-content/uploads/2020/09/linaje-de-datos-1.png 1024w, https://anjanadata.com/wp-content/uploads/2020/09/linaje-de-datos-1-300x150.png 300w, https://anjanadata.com/wp-content/uploads/2020/09/linaje-de-datos-1-768x384.png 768w" sizes="(max-width: 1024px) 100vw, 1024px" /></h3>
<p>&nbsp;</p>
<p><span style="font-weight: 400;">Knowing the lineage of data is very important from a data governance perspective in order to:</span></p>
<ul>
<li style="font-weight: 400;"><span style="font-weight: 400;">Knowing how information flows throughout the organisation</span></li>
<li style="font-weight: 400;"><span style="font-weight: 400;">Understanding the data value chain and its processes</span></li>
<li style="font-weight: 400;"><span style="font-weight: 400;">Understanding the lifecycle of data assets and how they are being generated</span></li>
<li style="font-weight: 400;"><span style="font-weight: 400;">Visualising dependencies between data assets and processes in order to manage the potential impact of changes and modifications</span></li>
<li style="font-weight: 400;"><span style="font-weight: 400;">Facilitating the diagnosis of process errors, quality issues, service degradation, etc.</span></li>
</ul>
<p><span style="font-weight: 400;"><strong>Data lineage</strong> comes in many forms, and perhaps the best known is technical lineage: knowing, from a technical point of view, how data moves from one place to another through the processes executed on it, whether automatically or manually.</span></p>
<p><span style="font-weight: 400;">However, even though this is technical lineage, the information must still be useful for data governance. For lineage information to be valuable it must be interpretable, and in the vast majority of cases a raw string of activity logs spewed out by a machine is of little use. That is why technical lineage usually has to be captured, interpreted, and translated before it is of any use.</span></p>
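<p><span style="font-weight: 400;">The "capture, interpret, translate" step above can be sketched in a few lines. The log format below is entirely hypothetical (real audit-log formats vary by platform and configuration); the point is only that raw lines must be parsed into structured lineage events before they are useful:</span></p>

```python
import re

# Hypothetical audit-log line (real formats differ per platform):
#   2020-09-04T09:24:08Z job=etl_42 read=raw.orders write=mart.sales
LOG_RE = re.compile(
    r"(?P<ts>\S+) job=(?P<job>\S+) read=(?P<read>\S+) write=(?P<write>\S+)"
)

def parse_lineage_event(line):
    """Translate one raw audit-log line into a structured lineage event."""
    m = LOG_RE.match(line)
    if not m:
        return None  # uninterpretable lines must simply be tolerated
    return {"job": m["job"], "input": m["read"], "output": m["write"]}
```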
<h3><b>Obtaining the lineage</b></h3>
<p><span style="font-weight: 400;">In this context, in order to obtain the technical lineage of the processes that move data from one site to another, we can rely on several techniques:</span></p>
<ul>
<li style="font-weight: 400;"><span style="font-weight: 400;">Extract lineage through the identification or inference of relationships between objects, which are declared in the form of metadata: This is typically the case with ETLs that operate with parameters and define all the processes to be executed by metadata.</span></li>
<li style="font-weight: 400;"><span style="font-weight: 400;">Extracting lineage through source code parsing: This is no easy task, and its complexity depends greatly on both the programming language (SQL is not the same as Java) and on how the programmer wrote the code (the functions, methods, and variables used). Since arbitrary code can contain almost anything, the guarantees that can be offered are usually very low, and in the vast majority of cases a significant gap must be assumed.</span></li>
<li style="font-weight: 400;"><span style="font-weight: 400;">Extracting lineage from the recovery and interpretation of audit logs: This is what we at Anjana Data call dynamic lineage, and in many cases it is the only way to obtain the most complete trace possible, even though you will only be able to capture what is executed. To implement this, you need to carry out native integration with data platforms and technologies to understand how logs work, know where and how to retrieve them, and finally be able to interpret the captured information, which usually has to be translated to be valuable.</span></li>
</ul>
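<p><span style="font-weight: 400;">To illustrate the second technique, and why it offers such weak guarantees, here is a deliberately naive table-level lineage extractor for a single SQL statement. It handles only the happy path (no CTEs, subqueries, quoting, or dialect quirks); everything in it is an illustrative sketch, not part of any real product:</span></p>

```python
import re

def infer_table_lineage(sql):
    """Naive table-level lineage from one SQL statement.

    A production parser must handle CTEs, subqueries, quoting and
    dialects; this sketch shows why pure source parsing leaves gaps.
    """
    sql = re.sub(r"\s+", " ", sql.strip())
    # Output tables: INSERT INTO ... / CREATE TABLE ...
    targets = re.findall(r"(?:INSERT INTO|CREATE TABLE)\s+([\w.]+)", sql, re.I)
    # Input tables: FROM ... / JOIN ...
    sources = re.findall(r"(?:FROM|JOIN)\s+([\w.]+)", sql, re.I)
    return {"inputs": sorted(set(sources)), "outputs": sorted(set(targets))}
```

<p><span style="font-weight: 400;">Even on this toy example, a subquery or a quoted identifier would already defeat the regexes, which is exactly the gap described above.</span></p>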
<p><span style="font-weight: 400;">Furthermore, given the variability of languages, platforms, technologies, etc., the spectrum we encounter becomes too broad to cover completely. That is why, when we talk about lineage, we usually have to seek compromises and apply Pareto's Law, especially in certain specific scenarios.</span></p>
<h3><b>Spark and the data lineage</b></h3>
<p><span style="font-weight: 400;">As we have already seen, not all data processing technologies make it easy for us to obtain the internal lineage of their processes, and among all of them, Spark is one of the most complex.</span></p>
<p><span style="font-weight: 400;">Spark is an open-source distributed processing technology that makes the most of the possibilities of distributed data clusters. As it is an open-source project, different distributions are available from various technology providers, such as <a href="https://es.cloudera.com/">Cloudera</a>, AWS EMR, GCP Dataproc, or <a href="https://databricks.com/">Databricks</a>, the most famous of these, which is also offered natively by <a href="https://www.microsoft.com/es-es">Microsoft</a> in its Azure cloud.</span></p>
<p><span style="font-weight: 400;">One of Spark's most distinctive features is that it does not have its own storage: it uses the memory of the machines that make up the cluster where it runs and can retrieve data from different types of storage, such as HDFS, S3, Cloud Storage, or Blob Storage, or from streaming systems such as Kafka.</span></p>
<p><span style="font-weight: 400;">In this regard, Spark distributes tasks across the different nodes of the cluster during execution, using the memory and processors of the different machines. That is why, by its very definition, it seems difficult to obtain a complete trace of the processes executed that move or transform data. Furthermore, this is not something that is completely and natively resolved in any of the current implementations of Spark, as they are all designed and optimised for data processing and not for data governance.</span></p>
<p><span style="font-weight: 400;">When we talk about data governance and Spark appears in the picture as the technology involved, the vast majority of professionals throw up their hands in horror or assume there is a significant gap. Certainly, activating low-level log capture can provide us with a lot of information about processes, but it also penalises performance, so a compromise solution must be found.</span></p>
<h3><b>Obtaining Spark lineage with Anjana Data</b></h3>
<p><span style="font-weight: 400;">As we have seen, obtaining the trace of Spark processes is not an easy task, mainly for the following reasons:</span></p>
<ul>
<li style="font-weight: 400;"><span style="font-weight: 400;">Since it does not have its own storage, the metadata that can be obtained from Spark processes at rest is practically nonexistent. When not running, the most we can obtain is the metadata of the datasets that participate in those processes as inputs or outputs, but we cannot know anything about what happens in between or how the output data is generated from the input data.</span></li>
<li style="font-weight: 400;"><span style="font-weight: 400;">The processing of datasets, whether as RDDs or Spark Datasets, can be written in several languages (Scala, Java, Python), and operations can be masked depending on how they are coded. Parsing source code to obtain traces is therefore not a viable option when it comes to Spark.</span></li>
<li style="font-weight: 400;"><span style="font-weight: 400;">The information that can be obtained from the execution logs is never complete or interpretable and depends largely on both the Spark distribution and the audit configuration. In many cases, the most that can be obtained is a coarse input-output relationship; on other occasions, the volume of logs to be interpreted will be unmanageable.</span></li>
</ul>
<p><span style="font-weight: 400;">However, thanks to Anjana Data's approach and implementation, you can get a very reliable picture at a granular level (field level and applied functions) that no other solution on the market is able to offer. How? We can't tell you everything, but we can give you a few brief glimpses. <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f642.png" alt="🙂" class="wp-smiley" style="height: 1em; max-height: 1em;" /></span></p>
<p><span style="font-weight: 400;">Essentially, what Anjana Data does is apply a combination of the three techniques mentioned above, intercepting each of the processes just before they are executed.</span></p>
<p><span style="font-weight: 400;">To execute processes, Spark creates an execution plan (DAG) before dividing the process into tasks that are launched in parallel. This is the only moment when all the information is available in a single point, just before it is distributed to all the elements responsible for its execution. By including a specific agent that is always invoked in each and every default execution, all this information can be captured and extracted with the required level of detail. Furthermore, all this can be done centrally and non-invasively, without the need for programmers to include anything in their processes.</span></p>
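<p><span style="font-weight: 400;">Conceptually, capturing lineage at plan time means walking the execution plan's tree before it is split into tasks, collecting the datasets its leaf nodes read. The sketch below uses illustrative node names, not Spark's actual plan classes, and says nothing about how Anjana Data's agent is implemented:</span></p>

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class PlanNode:
    op: str                          # e.g. "Project", "Join", "Relation"
    source: Optional[str] = None     # set only on leaf relation nodes
    children: List["PlanNode"] = field(default_factory=list)

def leaf_sources(plan):
    """Collect every input dataset referenced anywhere in the plan tree."""
    if not plan.children:
        return {plan.source} if plan.source else set()
    found = set()
    for child in plan.children:
        found |= leaf_sources(child)
    return found
```

<p><span style="font-weight: 400;">Intercepting the plan at this single point is what makes the capture both complete and non-invasive: the whole tree is available before it is distributed across the cluster.</span></p>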
<p><span style="font-weight: 400;">Finally, all this information is processed and interpreted by one of the components of Anjana Data's architecture and then served to the solution's CORE, where it is cross-referenced with governed information to generate a lineage of valuable data that is made available to the end user.</span></p>
<p><span style="font-weight: 400;">Anjana Data can do this and much more... Want to find out more? </span></p>
<p><strong>Request a <a href="https://anjanadata.com/en/request-a-demo/">demonstration</a> and we'll tell you all about it!</strong></p>]]></content:encoded>
					
					<wfw:commentRss>https://anjanadata.com/en/how-to-obtain-a-complete-data-lineage-of-spark-processes-thanks-to-anjana-data/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
	</channel>
</rss>