Strategies for handling bad data in data pipelines
Deciding how and when to deal with “bad data” should be a sooner-rather-than-later choice, when you work with data sets and data pipelines. This article explores some of the strategies that data engineers use to handle “bad data.”
Deciding how and when to deal with “bad data” should be a sooner-rather-than-later choice, when you work with data sets and data pipelines. This article explores some of the strategies that data engineers use to handle “bad data.” We’ll focus on unexpected data (null, badly formatted, typed) or incorrect data (out of normal ranges), and exclude operational failures (late data due to failed jobs).
There are two main strategies when it comes to how to treat bad data:
- Don’t let any bad data in. Eschew risk.
- Let it all in. Embrace risk. Refine so bad data isn’t presented.
Strategy 1: Don’t let any bad data in
Classical ETL techniques for dealing with “bad data” come in a few flavors, centered around preventing bad data from entering the data warehouse. This technique posits that no data is better than bad data. This strategy mimics an industrial quality control process, where the line stops when errors are detected, thus eliminating the possibility of error.
This caution derives from the fact that historically, data warehouses had limited capacity and were only for “first class” data. That said, these techniques are applicable in the modern data stack with today’s cloud data warehouses and cloud data lakes.
Let’s assume our dataset is frequently updated. One or a combination of these tactics are usually used in the data pipeline to exclude bad data:
- A coarse-grained approach with circuit breakers
- A fine-grained approach with in-line fixup/drop
- A fine-grained approach with dead-letter/quarantine queues
Tactic 1: Coarse-grained approach with circuit breakers
This approach is an all-or-nothing approach applied to whole tables, or bulk incremental partitions being added to a table.
It is best applied in data sets where any missing data may cause problems. For example, if the data set is a financial ledger, any missing line items could mean money or debt is missing. That's a serious problem for any financial reporting.
Example: Intuit blog from 2018
In practice ETL pipelines or workflow systems stop dependent tasks from executing if an error threshold is met. This effectively stops new data from being incrementally added, unless an incremental backfill is used to repair or reprocess the data.
In the modern data stack, folks who use orchestration systems such as Airflow, Prefect or Dagster would add a checking operator to abort a workflow when some error threshold is met.
Example: Bigeye can be triggered via API using the Python SDK to do metric runs after an incremental load (e.g. with our custom Airflow operator).
Tactic 2: Fine-grained with inline fixup/drop
This approach is fine-grained and applies row-by-row checks during ingest. The idea is to filter and accept the good, and then to try to make clean values where known bad ones are detected. An alternative is to drop the “bad” rows.
In practice, this takes the stream of incoming rows, performs bespoke rule checks, and forks the stream into the clean data stream and the dirty. The dirty data stream gets extra compute applied, and potentially third-party data quality tools to clean up the data.
Operationally, this approach is nice because the cleanup isn’t exposed. It is encapsulated into the bigger picture ingest. If all data can be cleaned up this way, we have achieved our goal.
Bigeye could be used after-the-fact to confirm that the “fixed up data” still meets the data quality goals.
Tactic 3: Fine-grained with dead letter/quarantine queues
This approach is fine-grained. Like the inline-fixup tactic, it applies row-by-row checks during ingest. The difference here is that rejected rows get sent to a separate quarantine stream, and potentially stored in an alternate table to be dealt with later.
Any quarantine data will be missing until a fine-grained backfill mechanism deals with the quarantined data. The good news is that new valid data keeps feeding the system, and the data is more “real time." Sometimes, this temporary table is handled with another process, with more transforms creating an ELTL pattern.
Strategy 2: Don’t present bad data
This approach lets all raw data in, warts and all, with transform steps clean it up before it is presented. Ideally, data users would only access the refined data sets with their BI, ML, or rETL tools. The main benefits here include dead simple and performant ingest, and compatibility with data lakes and semi-structured data.
This approach defers decisions that would cause you to lose data, and allows you to reprocess all the data if necessary. With today’s data lakes and the cloud data warehouses, compute and storage can scale to handle almost any company’s data (as long as they have the budget!).
This approach is likely considered “wrong” by classic ETL standards. However, it is de rigueur for users of the modern data stack, and folks coming from data lakes, to embrace an ELT approach (Extract, Load, Transform).
With this approach you, do not expect perfection. Bad data will inevitably be in the raw data. The refining can be deferred because you can always rebuild your data. The transformations and data that may have been errors in the past can be updated and cleaned up with more precision and more accuracy after the fact in subsequent iterations.
Using the trite “data is the new oil” meme, this treats data as a raw material and your data pipeline as a data refinery gradually improves the quality of the output. The good news is that you still have the raw material and can reprocess it again later.
There are few approaches that we see in ELT style pipelines:
- Staging inserts
Tactic 1: YOLO
The “you only live once” approach just throws data into production datasets as it arrives and may not have the benefit of ACID guarantees. This is often raw data and what organically grows out of ad-hoc efforts at data warehousing.
Bigeye can be used in these scenarios by pointing it at these tables. It can automatically deploy metrics and infer DQ thresholds. As data becomes more refined, Bigeye’s metric and monitors can be quality gates, data summaries, and also notify data users of changes to the data’s properties.
Tactic 2: Staging inserts
This approach creates a temp or staging table where new data lands. xDQ tests are applied and the “don’t let bad data in tactics” can be used, with an atomic insert / add partition to the production table instead of directly going to the production table.
This technique is used in the classic ETL world and in the modern data stack. This is a more advanced approach for classic ETL folks, but a common practice for data lake folks.
Bigeye can be triggered via API to do metric runs after an incremental load. Because new data is staged, we can check data before it is inserted into production avoiding S1.T1’s and S2.T1’s lack of isolation window. Bigeye’s deltas mechanism may be usable against the updated and snapshotted version of the table.
Bigeye has several customers who use this approach in their data pipeline. This is where Bigeye’s Deltas feature comes into play – dev branches are compared against the production dataset before promotion and used to see the changes in the data and decide if the updated pipeline and changed data should be promoted.
Tactic 3: Branches
This approach creates a separate version of the table that isn’t “live” for most data consumers to use until it is promoted. Implementation wise , this could be creating a new partition of data and creating a view with union of the original table and the new table with the added partition. It could be a snapshot of the original table, and then additions inserted. It could be a dbt users’s dev db. It could be a Databricks delta lake table, an Apache Iceberg table version, or an Apache Hudi snapshot. The key is that you get a version of a table that you can apply DQ tests to before promoting it.
Here’s where dbt shines, and frankly why it is a star in the modern data stack. Its ref operator and source operator provide key abstraction points to give the illusion of a fork using views, instead of having to use costly fully materialized tables or snapshot/rollback mechanisms. It enables data engineers / analytics engineers to do devops style CI/CD on updates to data transforms with a test, review and promotion process.
You get the good behavior of circuit breakers with stronger isolation. You enable more people to work against the data in parallel, in a similar fashion to software engineering. You get the ability to test data before it is promoted.
So users who choose to only use production grade data only get clean data. Also, instead of a black and white good data/bad data decision, advisory decisions can be made. You can allow folks who want less qualified version of the data that doesn’t pass all data quality tests, to use it, similar to how some people use beta versions of a software service.
Today’s data engineers come from multiple backgrounds. The landscape has room for multiple techniques. All of the strategies and tactics described above can be valid, depending on the criticality of the data and the results.
The good news is that a data observability platform such as Bigeye can be integrated into any of them. The even better news is that modern data observability tools and data reliability platforms provide more than static checks. They come with the ability to monitor over time, infer data quality constraints, and notify users when problems are detected and human judgment is desired.
Schema change detection