April 7, 2023

Data pipeline monitoring vs. data quality monitoring: What's the difference?

Why does the difference between data pipeline monitoring and data quality monitoring matter? In this post, we'll define both, and explain what the difference means.

Liz Elfman

Data monitoring often conflates the health of the data pipeline with the health of the data itself. In fact, these are two separate disciplines: data pipeline monitoring and data quality monitoring. In this post, we'll delve into the key differences between the two and why it's essential to have both in place.

Data pipeline monitoring: Ensuring smooth data flow

Data pipeline monitoring (DPM) focuses on the jobs and tables that move the data, e.g. jobs orchestrated in Airflow and tables stored in Snowflake. The main aspects of DPM are freshness (when each table was last updated), volume (how many rows are being moved), and job run durations. DPM is typically the responsibility of data engineering or data platform teams.

By monitoring the data pipeline, you ensure that your ETL (Extract, Transform, Load) processes are running smoothly and that data is flowing between the stages of the pipeline. This helps you avoid bottlenecks and keeps your data up to date and ready for analysis.
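As a rough illustration, the three DPM signals described above (freshness, volume, and job run duration) can be sketched as simple threshold checks. The `TableRun` record, the `pipeline_alerts` function, and every threshold below are hypothetical names and values chosen for this example; they are not part of Snowflake, Airflow, or any particular monitoring tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class TableRun:
    """Hypothetical metadata about the most recent load of one table."""
    table: str
    last_updated: datetime    # when the table was last refreshed
    row_count: int            # rows moved by the latest run
    job_duration: timedelta   # how long the load job took


def pipeline_alerts(run, now, max_staleness, expected_rows, max_duration,
                    volume_tolerance=0.5):
    """Return the names of the pipeline checks that failed for this run."""
    alerts = []
    # Freshness: the table has not been updated recently enough.
    if now - run.last_updated > max_staleness:
        alerts.append("freshness")
    # Volume: the row count deviates too far from what is expected.
    if abs(run.row_count - expected_rows) > volume_tolerance * expected_rows:
        alerts.append("volume")
    # Duration: the job ran longer than the allowed time.
    if run.job_duration > max_duration:
        alerts.append("duration")
    return alerts


now = datetime(2023, 4, 7, 12, 0)
run = TableRun("orders", datetime(2023, 4, 6, 0, 0), 2500, timedelta(minutes=20))
print(pipeline_alerts(run, now, timedelta(hours=24), 1000, timedelta(minutes=30)))
# prints ['freshness', 'volume']
```

In a real deployment the thresholds would more likely be learned from each table's history than hard-coded, but the shape of the checks stays the same.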

Data quality monitoring: Assessing the contents of the data

Data quality monitoring (DQM), on the other hand, focuses on the contents of the data. DQM includes aspects such as freshness (how old the values are), completeness (rate of nulls, blanks, etc.), duplication, and format compliance. DQM is often the responsibility of data science and analytics teams, who need to ensure that the data they use is accurate and reliable.

By implementing DQM, you can identify issues such as null values, duplicates, and outliers that may affect the accuracy of your data-driven insights. With proper DQM in place, your ML models and analytics work from high-quality data, which supports better decision-making.
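To make the DQM checks more concrete, here is a minimal sketch of completeness, duplication, and format compliance checks over rows held in memory. The function names, the sample rows, and the deliberately loose email pattern are assumptions made for this example; a production system would typically run equivalent checks as queries against the warehouse.

```python
import re
from collections import Counter


def null_rate(rows, column):
    """Completeness: fraction of rows where the column is None or blank."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(column) in (None, ""))
    return missing / len(rows)


def duplicated_values(rows, column):
    """Duplication: values that appear more than once in the column."""
    counts = Counter(row.get(column) for row in rows
                     if row.get(column) is not None)
    return sorted(value for value, n in counts.items() if n > 1)


def format_violations(rows, column, pattern):
    """Format compliance: non-empty values that do not match the pattern."""
    regex = re.compile(pattern)
    return [row[column] for row in rows
            if row.get(column) not in (None, "")
            and not regex.fullmatch(row[column])]


# Hypothetical sample data with one blank, one null, one bad format,
# and one duplicated id.
rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": ""},
    {"id": 2, "email": "not-an-email"},
    {"id": 3, "email": None},
]

print(null_rate(rows, "email"))          # prints 0.5
print(duplicated_values(rows, "id"))     # prints [2]
print(format_violations(rows, "email", r"[^@\s]+@[^@\s]+\.[^@\s]+"))
# prints ['not-an-email']
```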

The importance of both data pipeline and data quality monitoring

While DPM and DQM can be done with two separate systems, to truly understand the behavior of your pipeline, you should correlate information from both sources. For instance, if you notice that a table has been refreshed later than usual with a larger number of rows, and you also find a significant number of duplicated IDs, this could indicate an issue with an ETL job. In this case, combining data pipeline monitoring (freshness and volume) with data quality monitoring (duplicates) can help you identify and resolve the problem.
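The correlation described above can be sketched as a small rule that reads pipeline signals and quality signals together. The `TableSignals` record, the `diagnose` function, and the default thresholds are hypothetical and exist only to illustrate the idea; they do not describe how any specific tool combines the two.

```python
from dataclasses import dataclass


@dataclass
class TableSignals:
    """Hypothetical signals gathered for one table refresh."""
    minutes_late: float       # pipeline signal: refresh delay vs. usual time
    row_count: int            # pipeline signal: rows loaded
    expected_rows: int        # pipeline signal: typical row count
    duplicate_id_count: int   # quality signal: rows sharing an ID


def diagnose(s, late_threshold_min=60, volume_ratio=1.5, duplicate_ratio=0.05):
    """Combine pipeline and quality findings into one diagnosis string."""
    pipeline_flags = []
    if s.minutes_late > late_threshold_min:
        pipeline_flags.append("late refresh")
    if s.row_count > volume_ratio * s.expected_rows:
        pipeline_flags.append("inflated volume")

    quality_flags = []
    if s.duplicate_id_count > duplicate_ratio * s.row_count:
        quality_flags.append("duplicated IDs")

    # Findings on both sides at once point toward the ETL job itself.
    if pipeline_flags and quality_flags:
        return "suspect ETL job: " + ", ".join(pipeline_flags + quality_flags)
    if pipeline_flags:
        return "pipeline issue: " + ", ".join(pipeline_flags)
    if quality_flags:
        return "data quality issue: " + ", ".join(quality_flags)
    return "ok"


signals = TableSignals(minutes_late=90, row_count=2000,
                       expected_rows=1000, duplicate_id_count=900)
print(diagnose(signals))
# prints suspect ETL job: late refresh, inflated volume, duplicated IDs
```

Either signal alone is ambiguous: a late, oversized load might be a legitimate backfill, and a few duplicates might come from the source system. Seeing them together is what narrows the search to the job that wrote the table.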

Prioritize data pipeline monitoring before data quality monitoring. If the data isn't flowing smoothly through the pipeline, there's no point in worrying about data quality. Once the data engineering team has ensured the smooth operation of the data pipeline, it can hand over responsibility for data quality monitoring to the data science and analytics teams. This division of labor allows each team to focus on its area of expertise, and ensures that both aspects of data management are adequately addressed.

The role of analytics engineers in pipeline and quality monitoring

With the rise of tools like dbt, the role of analytics engineer has evolved into a mix of data analyst and data engineer. Analytics engineers understand how the data is consumed in dashboards and statistical models, and write SQL to perform data transformations. They can serve as a valuable bridge in the correlation work mentioned above.

In practice: The intersection of data pipeline and data quality monitoring

In reality, the division between data pipeline monitoring and data quality monitoring is not always clear-cut. However, having a strong understanding of the two concepts and their respective responsibilities can help organizations make informed decisions about which aspects of their data management processes need attention.

| Resource | Monthly cost ($) | Number of resources | Time (months) | Total cost ($) |
| --- | --- | --- | --- | --- |
| Software/Data engineer | $15,000 | 3 | 12 | $540,000 |
| Data analyst | $12,000 | 2 | 6 | $144,000 |
| Business analyst | $10,000 | 1 | 3 | $30,000 |
| Data/product manager | $20,000 | 2 | 6 | $240,000 |
| Total cost | | | | $954,000 |

| Role | Goals | Common needs |
| --- | --- | --- |
| Data engineers | Overall data flow. Data is fresh and operating at full volume. Jobs are always running, so data outages don't impact downstream systems. | Freshness + volume monitoring; Schema change detection; Lineage monitoring |
| Data scientists | Specific datasets in great detail. Looking for outliers, duplication, and other (sometimes subtle) issues that could affect their analysis or machine learning models. | Freshness monitoring; Completeness monitoring; Duplicate detection; Outlier detection; Distribution shift detection; Dimensional slicing and dicing |
| Analytics engineers | Rapidly testing the changes they're making within the data model. Move fast and not break things, without spending hours writing tons of pipeline tests. | Lineage monitoring; ETL blue/green testing |
| Business intelligence analysts | The business impact of data. Understand where they should spend their time digging in, and when they have a red herring caused by a data pipeline problem. | Integration with analytics tools; Anomaly detection; Custom business metrics; Dimensional slicing and dicing |
| Other stakeholders | Data reliability. Customers and stakeholders don't want data issues to bog them down, delay deadlines, or provide inaccurate information. | Integration with analytics tools; Reporting and insights |
