March 30, 2021

Lessons Learned from Uber: Designing an Intelligent Data Quality Monitor

Henry Li

While at Uber, I led the development of the Data Quality Monitor (DQM) to track the data health of critical platforms. In this blog post from my time at Uber, I discuss the statistical modeling approach that enabled my team to monitor data quality at scale, across petabytes of data and thousands of data pipeline jobs. While the solution was architected specifically for Uber, there are universal lessons from the design and development process that any company undertaking a DQM project should consider.

In this blog, I will discuss the considerations that should be made before undertaking a DQM initiative. The three important factors to consider are:

1) The necessary complexity and scope of the DQM

2) The cost of developing a DQM system

3) The opportunities and potential pitfalls that come with open source DQM projects

Why Build a DQM?

In the past decade, mature database infrastructure technologies like Redshift, Databricks, and Snowflake have enabled companies to scale data-driven decision making and create new digital services, from improved health diagnostics to ridesharing. In short, data has become core to the value that companies provide in all industries.

Data-driven companies rely on ingesting the data they need, computing the state of the world, and automating downstream business actions based on the service supply-and-demand. If bad data gets through this process, everything downstream is negatively affected, the customer experience is tarnished if not outright ruined, and the business begins to lose trust in the data.

A data quality monitoring system (DQM) alerts the data teams to incidents of bad data and guides them through incident mitigation and recovery. If implemented correctly, the monitor helps data engineers and analysts catch issues before bad data affects downstream business processes.

But building a DQM is a much more complex, time-consuming, and costly process than it might seem to be at first glance. Any company must take these factors into account when weighing the value of developing a DQM internally.

A Complex System at Scale

On the surface, data quality monitoring might seem simple. With some back-of-the-napkin analysis, you might be thinking:

A data scientist can put together a Python or R script with existing packages to monitor for anomalies in a set of metadata metrics.
Or a data engineer can use a data testing framework to identify when data services are down.

Applying these approaches as your data scales, however, quickly becomes unwieldy once you pass a few hundred tables. And modern data teams are reaching the hundred-table milestone earlier and earlier.

Some problems we have observed are:

Teams struggle to figure out which metrics to measure, spending a lot of effort to get even simple coverage.
When the business and underlying data are continually changing, the team faces a growing maintenance burden.
Non-comprehensive coverage only creates more risk because bad data from one area of the business can be quickly replicated and stored over and over again elsewhere, polluting the tables downstream.

And on top of it all, to be effective, the DQM must be able to scale. This requires an infrastructure that contains systems and services that can reliably produce data quality test metrics, monitoring results, and visualization tools for producing actionable insights. This in turn calls for meticulous, long-term coordination between data scientists, data engineers, and other business stakeholders.

This cross-platform collaborative planning, development, testing, implementation, and maintenance quickly becomes complex. Once the DQM matures, development work may plateau, but the system still requires maintenance as new data needs arise from changes such as data migrations and new data types and tables. Maintenance of the DQM infrastructure is perennial.

An Expensive Endeavour

Ultimately, building a successful DQM system requires a dedicated team of several engineers and data scientists with very specialized skills, including experience with data infrastructure engineering, modern time series analysis, and tooling selection rooted in extensive knowledge of applied statistics. Without the necessary talent, the team may not be able to address serious issues and blind spots in the engineering process. For example, the DQM can easily drown the data engineers in false positives, bring down the data warehouse with taxing queries, and create a tangled mess of configurations that is impossible to understand a year out from initial development.

On the other hand, if the team has the necessary skills, they will need to be dedicated not only to the development but also to the maintenance of the system, indefinitely. Before undertaking a DQM project, it’s important to consider the cost of a dedicated team, full of hard-to-find talent.

At Uber, the DQM project took 12 months and a team of 5 data scientists and engineers, and will require ongoing investment to maintain. At Uber’s scale there may be enough ROI from a highly customized in-house product to justify its existence, but this is rare. For most companies, resources are better spent on the core business.

Even exceptions like Airbnb, another leader in developing cutting edge data management tools (like Airflow, Dataportal, and Superset), have had similar experiences. At Airbnb, it took a team of six-to-seven engineers 12 months to design and create a system for visualizing data timeliness, one critical aspect of data quality. These problems are incredibly difficult to solve at scale.

Example 1

Let’s say the underlying system uses a standard deviation approach to find anomalies (a common approach among engineering teams), as shown in Figure 1. While this looks reasonable on the surface, it may not work well for a highly seasonal data metric. The point-of-contact employee would have to implement a change to the monitoring system to improve anomaly capture.

Figure 1: A sinusoidal data metric with an incident (circled). A common distribution-based approach to anomaly detection does not capture the seasonality patterns, and does not alert in this case. On the other hand, an intelligent DQM implementation captures the data issue and alerts properly. Note that the DQM also needs to be resilient to issues such as missing data as many real world events could derail the anomaly detection process.
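
To make the contrast in Figure 1 concrete, here is a minimal sketch in Python. It is not the Uber DQM implementation: the synthetic metric, the 24-sample period, and the helper names (`make_metric`, `zscore_flags`, `seasonal_residuals`) are all assumptions made for illustration. It compares a global 3-sigma check with the same check applied after subtracting a per-phase median baseline.

```python
import math
import statistics

PERIOD = 24  # assumed: hourly samples with a daily cycle


def make_metric(n_cycles=7, incident_at=150):
    """Synthetic seasonal metric with one injected bad value (hypothetical data)."""
    values = [
        100.0
        + 50.0 * math.sin(2 * math.pi * t / PERIOD)  # seasonal component
        + 2.0 * math.sin(1.7 * t)                    # small deterministic wiggle
        for t in range(n_cycles * PERIOD)
    ]
    # The incident: a seasonal peak (about 150) collapses to the overall mean.
    values[incident_at] = 100.0
    return values


def zscore_flags(values, k=3.0):
    """Indices whose value lies more than k standard deviations from the mean."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values)
    return [i for i, v in enumerate(values) if abs(v - mean) > k * std]


def seasonal_residuals(values, period=PERIOD):
    """Subtract from each point the median of all points at the same phase."""
    baseline = [statistics.median(values[phase::period]) for phase in range(period)]
    return [v - baseline[i % period] for i, v in enumerate(values)]


metric = make_metric()
global_alerts = zscore_flags(metric)                        # distribution-based check
seasonal_alerts = zscore_flags(seasonal_residuals(metric))  # seasonality-aware check
print("global 3-sigma alerts:  ", global_alerts)
print("seasonal 3-sigma alerts:", seasonal_alerts)
```

On this synthetic series the global check stays silent, because the bad value sits right at the overall mean, while the seasonality-aware check flags index 150. A production monitor would also need to handle missing data, noise, and trends, none of which this sketch attempts.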

Example 2

In another example, shown in Figure 2, an incident occurs when a historically constant data metric jumps to an elevated value. Usually, a simple anomaly detection tool will eventually emit boundaries that take the level change into account. However, from the standpoint of data quality, this is not desired behavior because the DQM should not be influenced by bad data values. To build a comprehensive DQM, a research team would need to work through the unique set of problems in the data quality monitoring world.

Figure 2: A flat data metric. Usually, a simple anomaly detection system would react to the underlying pattern. This behavior is not desired when a constant metric has an incident (circled). The DQM should alert, but the thresholds should not adapt to the bad metric values until the potential incident is fixed. A team that manages the DQM has to investigate and research cases like this, as the data quality monitoring world has a unique set of problems not present elsewhere.
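
The behavior in Figure 2 can be sketched with a rolling-threshold monitor, again in Python and again as an illustration rather than the actual Uber design. The function name `monitor`, the window size, the minimum band width, and the `learn_from_alerts` switch are all hypothetical. With `learn_from_alerts=True` the history absorbs every point, as a simple detector would. With `False`, flagged points are kept out of the history so the thresholds stay anchored to the last known-good level.

```python
import statistics
from collections import deque


def monitor(values, window=12, k=3.0, min_band=1.0, learn_from_alerts=True):
    """Return the indices that alert under rolling mean +/- band thresholds."""
    # Assumption: the first `window` points are clean and seed the history.
    history = deque(values[:window], maxlen=window)
    alerts = []
    for i in range(window, len(values)):
        v = values[i]
        center = statistics.fmean(history)
        # A constant metric has zero variance, so enforce a minimum band width.
        band = max(k * statistics.pstdev(history), min_band)
        is_anomaly = abs(v - center) > band
        if is_anomaly:
            alerts.append(i)
        # The naive variant learns from every point; the guarded variant
        # refuses to learn from points it has just flagged.
        if learn_from_alerts or not is_anomaly:
            history.append(v)
    return alerts


# Hypothetical flat metric: 24 good points at 50, then an unresolved incident at 80.
metric = [50.0] * 24 + [80.0] * 24

naive_alerts = monitor(metric, learn_from_alerts=True)
guarded_alerts = monitor(metric, learn_from_alerts=False)
print("naive alerts:  ", naive_alerts)
print("guarded alerts:", guarded_alerts)
```

In this run the naive variant alerts only on the first two bad points (indices 24 and 25) and then goes quiet as its thresholds drift up to the new level, while the guarded variant keeps alerting for as long as the bad values persist. A real system would also need a way to resume learning once the incident is resolved or the level change is confirmed as legitimate, which this sketch leaves out.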

These are just two examples of the problems a practical DQM system will need to overcome. We also have to think about other issues, such as data quality metric collection methodologies and cadence, which can generate different metric progressions. A comprehensive DQM must be able to handle these issues and many more at scale. For the vast majority of companies, this calculus is incredibly burdensome, and focusing resources on perfecting existing core business services would likely provide much greater ROI over time.

Data Quality Monitoring for All

Companies are investing a great deal into being data-driven, and data quality shouldn’t be allowed to derail those efforts. Now is a great time for data teams to really understand their data needs and use their data to the fullest extent to become more cost-effective and operationally lean.

For more information on data quality in modern data warehouses like Snowflake, read our latest guide on building trustworthy data. At Bigeye, we automate scalable data quality monitoring for teams in many industries.

| Resource | Monthly cost ($) | Number of resources | Time (months) | Total cost ($) |
| --- | --- | --- | --- | --- |
| Software/Data engineer | $15,000 | | | $540,000 |
| Data analyst | $12,000 | | | $144,000 |
| Business analyst | $10,000 | | | $30,000 |
| Data/product manager | $20,000 | | | $240,000 |
| Total cost | | | | $954,000 |

| Role | Goals | Common needs |
| --- | --- | --- |
| Data engineers | Overall data flow. Data is fresh and operating at full volume. Jobs are always running, so data outages don't impact downstream systems. | Freshness + volume monitoring; Schema change detection; Lineage monitoring |
| Data scientists | Specific datasets in great detail. Looking for outliers, duplication, and other, sometimes subtle, issues that could affect their analysis or machine learning models. | Freshness monitoring; Completeness monitoring; Duplicate detection; Outlier detection; Distribution shift detection; Dimensional slicing and dicing |
| Analytics engineers | Rapidly testing the changes they’re making within the data model. Move fast and not break things, without spending hours writing tons of pipeline tests. | Lineage monitoring; ETL blue/green testing |
| Business intelligence analysts | The business impact of data. Understand where they should spend their time digging in, and when they have a red herring caused by a data pipeline problem. | Integration with analytics tools; Anomaly detection; Custom business metrics; Dimensional slicing and dicing |
| Other stakeholders | Data reliability. Customers and stakeholders don’t want data issues to bog them down, delay deadlines, or provide inaccurate information. | Integration with analytics tools; Reporting and insights |