Data lineage: Buy it or build it?
Data lineage is a core component of observability. But can you build your own lineage solution, or buy one off-the-shelf? We explore, and make some recommendations.
Data lineage is a critical aspect of data management. Lineage enables organizations to track existing data sources how they are used.
Small, young companies can understand their data lineage in relatively short order. But as a company grows, the number of teams producing and consuming data multiplies. The challenges in fully understanding data lineage also multiply.
The confusion leads to scenarios like the one recently described by the analytics team of a large bank: “We consume data. We know what tables we use. Where they’re coming from? We don’t know. If there’s a problem, we don’t know. If we find something weird, we kick it upstream, and tell the next team that there's something’s going on.”
Despite the increased emphasis on data lineage in recent years, many teams have only a vague sense of how it can benefit them. In this post, we interviewed engineers and data analysts from inside our organization to find out what you need to know when deciding how to actually implement data lineage in your organization.
Data lineage: feature, not product
A core belief at Bigeye is that lineage is a feature, not a full product on its own. You can often find it tucked into various data quality tools or data catalogs.
“Lineage is a feature that helps you do other things,” Egor Gryaznov, Bigeye’s CTO says. “For example, Bigeye needs lineage so we can help our users get to the root cause of the problem and understand what's being impacted. It's a tool that helps the user get to their end goal, which is to figure out where the problem is coming from.”
For a data catalog, presenting data lineage as a graph that you can click around helps users discover relevant tables for their work. As Gryaznov puts it: “I want to navigate to things that are upstream, read about them, understand who owns them and what they are used for, and how they are used in catalogs.”
OpenLineage is the industry-standard open-source framework for data lineage. It provides a standard interface for publishing lineage in a structured way to a standardized repository. It allows for a consistent experience when integrating lineage with many different tools.
Frameworks like OpenLineage don’t automatically collect the lineage. Rather, OpenLineage is analogous to an API to which you then declare your lineage .
Is lineage available in data warehouses?
BigQuery, Snowflake and Redshift comprise some amount of data lineage information out-of-the-box. They break down queries run in the data warehouse and turn them into lineage maps.
For example, if you have a query that reads from many tables and writes into one, the data warehouse will log that these objects were accessed by this query, and that this object was written into by that same query.
Subsequently, users can go into the data warehouse and ask, for each query, which tables were read from and which tables were written into.
While this gives you basic lineage information, you run into limitations fairly quickly. For example, dbt models are expressed as views rather than materialized tables. You'll have a SQL statement base, without "writing into" a table.
Note: More traditional “transactional” systems like MySQL, Postgres, SQL Server, etc., do not typically provide lineage information. Each day, transactional databases see more queries than analytical databases, and they are unable to store the logs for more than a few days.
Is lineage available in dbt?
Dbt's lineage capabilities are focused around DAGs. If you use dbt for your data transformations in-warehouse, relationships between data sources and models can be automatically inferred. If you use dbt Cloud, you can view and explore a visualization of the DAG that represents data lineage.
DAGS, or Directed Acyclic Graphs, are a type of graph whose nodes directionally relate to each other and don’t form a directional closed loop. Often, DAGs visually represent the relationships between data models. If you look at your data models and can't identify all the upstream and downstream dependencies, dbt recommends building or looking into your existing DAG.
The DAG provides a visual representation of the upstream dependencies, the models that must be processed, the downstream relationships, and the models impacted by a data pipeline. It also shows the flow of data transformations in a defined and non-cyclical manner.
Is lineage available in Airflow?
For many organizations, a large percentage of their data pipelines are scheduled with Airflow. If this is the case for your organization, you can embed your data lineage ingestion into Airflow.
Using a hosted Airflow like Astronomer, the Airflow already comes with an OpenLineage integration. Certain common Airflow operators, when used in a DAG, will emit OpenLineage events. You can also to write custom lineage extractors for your own Airflow operators.
Even if you are using open-source Airflow, off-the-shelf lineage services will usually provide hooks/callbacks that can be used to send metadata about Airflow tasks to the lineage service.
Is lineage available from off-the-shelf solutions like Atlan or Collibra?
Once you’ve collected your lineage metadata information from dbt or Airflow or your dashboards, it still needs to be additionally processed and assembled into a graph.
At this point, you might turn to an off-the-shelf data lineage solution like Atlan or Collibra. These solutions will usually ask you to connect your data warehouse and assemble the data lineage automatically via some combination of:
- Piggy-backing off the parsing that warehouses have already done
- Extracting information from DAG/task/job name metadata sent from callbacks in Airflow or dashboards
- Writing their own custom SQL parsers to parse and extract lineage from SQL
You may be wondering why SQL-parsing is necessary if the metadata has already been declared from something like Airflow. The biggest advantage to extracting lineage from production SQL is scalability. While metadata/job documentation may get stale, the SQL populating the tables is always ground truth.
Buy vs. Build
It’s unlikely that lineage is a core competency or problem for your team. You may have select situations where a highly-customized ETL process necessitates building lineage into the ETL framework. For example, Uber did so several years ago with its SQL templating ETL framework. But these situations are uncommon, and tend to happen in fast-paced environments with colossal amounts of data.
In most cases, you'll find observability tools that already have data lineage bundled in as a feature. That's why generally speaking, we recommend purchasing rather than building a custom lineage solution.
Building your own lineage graph
Despite the proliferation of off-the-shelf lineage services, there are still reasons why you might want to build your own. For example, you may need column-level lineage and language support.
Bigeye's client has an ETL pipeline that consists of data in Snowflake, with Python and R scripts that read the data, compute, and write it back out. To derive lineage from that pipeline, they need access to that Python code, plus the ability to parse and understand it.
In such a case, it is difficult from an engineering perspective to write all the parsers and the automation. As much as possible, they declare lineage explicitly in their data pipeline job configurations.
If you'd like to read more on how some large organizations built their own data catalog/data lineage services, check out this post on Netflix's data lineage journey or Slack's data lineage construction and automation.
Or, to see Bigeye's data lineage in action, request a demo.
Schema change detection