The rising interest in data quality goes beyond data and engineering teams. Search volume for the term in 2022, for example, was up 30% over prior years.
While the concept of “data quality” is decades old, newer techniques such as data observability and pipeline testing have helped formalize the practice. Historically, data quality was understood as a technical discipline that would “fix the issues.” Today, that understanding has evolved: unlike previous paradigms, modern data quality techniques don’t aim to fix the data at all. Why is that? Let’s dig in.
The traditional approach to data quality
Legacy tools changed the data itself. Several years ago, the term “data wrangling” was frequently bandied about. It referred to a data scientist going through several steps to clean up the data, reshape it, and ready it for use. Data wrangling might involve Trifacta, Pandas (a Python library), or dplyr (an R package). That “wrangling” also encompassed data quality tools and processes.
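To make the distinction concrete, here is a minimal wrangling sketch in Pandas. The order data and column names are hypothetical; the point is that this style of data quality rewrites the data itself rather than flagging problems at the source.

```python
import pandas as pd

# Hypothetical raw order data with the kinds of issues wrangling addresses:
# duplicate rows, unparseable amounts, inconsistent country codes.
raw = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "amount": ["10.50", "N/A", "N/A", "7.25"],
    "country": ["us", "US ", "US ", "CA"],
})

clean = (
    raw.drop_duplicates(subset="order_id")  # remove duplicate orders
       .assign(
           # "N/A" becomes NaN instead of raising an error
           amount=lambda df: pd.to_numeric(df["amount"], errors="coerce"),
           # normalize country codes: trim whitespace, uppercase
           country=lambda df: df["country"].str.strip().str.upper(),
       )
       .dropna(subset=["amount"])  # drop rows with unusable amounts
)
print(clean)
```

The cleaned frame looks pristine, but the web form or upstream feed that produced the bad rows is untouched, so the same defects arrive again tomorrow.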
This process addressed data quality at the last possible moment in the data pipeline: immediately before the data was used for analysis. That’s one problem.
Another problem lies in master data management (“MDM”). Traditional MDM systems aimed to create a clean, central copy of the data. Cleansing is key to MDM because the master copy must be clean; IBM DataStage, Informatica, and SAP all use terms like “cleanse,” “enrich,” and “validate” in their MDM-oriented data quality tooling.
But these techniques only clean up the master copies. Making data more reliable and accurate over the long term requires solving the root causes, not just fixing one copy of the data.
The modern approach: Pipeline testing
Pipeline testing emerged at companies like Uber, Netflix, Intuit, and Airbnb as a way to identify problems within their data pipelines. Pipeline tests check data for factors like freshness and completeness. These tests sometimes stop the pipeline, but they are rarely used to actually change the data.
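A minimal sketch of what such checks might look like, assuming hypothetical table metadata (a last-load timestamp and a row count) handed in by a scheduler. Note that the tests flag or halt; they never edit the data.

```python
from datetime import datetime, timedelta, timezone

def check_freshness(last_loaded_at: datetime, max_age: timedelta) -> bool:
    """Freshness test: has the table been updated recently enough?"""
    return datetime.now(timezone.utc) - last_loaded_at <= max_age

def check_completeness(row_count: int, expected_min_rows: int) -> bool:
    """Completeness test: did we receive at least the expected volume?"""
    return row_count >= expected_min_rows

# Hypothetical metadata for today's load
last_loaded_at = datetime.now(timezone.utc) - timedelta(hours=2)
row_count = 10_000

# Halt the pipeline on failure so bad data never reaches downstream users
if not (check_freshness(last_loaded_at, timedelta(hours=6))
        and check_completeness(row_count, expected_min_rows=1_000)):
    raise RuntimeError("Pipeline halted: data failed quality checks")
```

In practice these assertions live in an orchestrator task or a dbt test, but the shape is the same: detect, alert, and let an engineer fix the cause.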
Pipeline testing is analogous to the testing conducted in software engineering. Unit tests are used to identify symptoms of problems. Then, the engineer can track down the root cause, solve it, and rerun the unit test to confirm the solution worked. The tests themselves aren’t used to modify what the program is actually doing.
That brings us to the one big limitation many large data teams ran into as they adopted pipeline testing: scale. Hand-writing and maintaining tests for every table and every possible failure mode quickly becomes impractical.
The promises of observability
Observability is used in software engineering to detect problems with the live performance of infrastructure and applications. If software goes down, it doesn’t matter whether a unit test should have prevented it; somebody needs to know. Observability solves that problem.
In data engineering, data observability fills a similar role for the operational health of the pipeline and the quality of the data inside. If anything goes wrong, a data observability platform lets data engineers and data scientists know where, how, and ultimately why the issue occurred.
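As a toy illustration of the detection side, here is a volume monitor over hypothetical daily row counts. Real observability platforms learn thresholds from historical metadata automatically; a fixed-sigma check is only a stand-in for that idea.

```python
import statistics

def detect_volume_anomaly(daily_row_counts: list[int],
                          threshold_sigmas: float = 3.0) -> bool:
    """Flag today's load if it deviates sharply from recent history.

    A simplified stand-in for the statistical monitors that
    observability platforms run against pipeline metadata.
    """
    *history, today = daily_row_counts
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return today != mean  # perfectly flat history: any change is anomalous
    return abs(today - mean) > threshold_sigmas * stdev

# Hypothetical history: steady ~10k rows/day, then a sudden drop
counts = [10_120, 9_980, 10_050, 10_200, 9_900, 2_300]
print(detect_volume_anomaly(counts))  # the drop should trigger an alert
```

The monitor raises an alert and points at the symptom; it never rewrites the table.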
From there, the solution happens at the root cause. That might mean fixing a web form that doesn’t validate for manual data entry errors. Or, it might mean replacing an expired API token, or fixing a bug in a dbt model, or whatever else caused the pipeline issue or data quality problem.
In all of these cases, data teams aren’t looking to “cleanse” the data; that isn’t how they see data quality monitoring. They care less about a singular central data model looking pristine (even if, under the hood, it relies on covering up various potholes). Their priority is ensuring the pipeline runs smoothly day to day.
MDM remains an important technique, especially in enterprises merging unmatched data from multiple lines of business, often accumulated over the course of multiple acquisitions. In those cases, cleansing may be required because hunting down and solving the real root causes isn’t always practical.
But for everyone else, finding the root cause and fixing it ASAP is the key to data quality. It’s the key to preventing those quality issues from haunting the organization for weeks, months, or years to come.
Data observability works when engineers and data teams partner to create a problem-solving culture, rather than treating data quality as “another team’s problem.” While data observability itself doesn’t fix anything, it works in concert with a robust data quality culture to strengthen data management across the board.
By putting a detection and resolution plan in place, data teams make their pipelines increasingly antifragile. That robustness leads to higher reliability and less toil for the data science and analytics teams who put the data to work.