Lessons Learned from Uber: Designing an Intelligent Data Quality Monitor
While at Uber, I led the development of the Data Quality Monitor (DQM) to track the data health of critical platforms. In this post, I discuss the statistical modeling approach that enabled my team to monitor data quality at scale: petabytes of data flowing through thousands of data pipeline jobs. While the solution was architected specifically for Uber, the design and development process holds universal lessons for any company undertaking a DQM project.
Before starting a DQM initiative, there are three important factors to consider:
1) The necessary complexity and scope of the DQM
2) The cost of developing a DQM system
3) The opportunities and potential pitfalls that come with open source DQM projects
Why Build a DQM?
In the past decade, mature data infrastructure technologies like Redshift, Databricks, and Snowflake have enabled companies to scale data-driven decision making and create new digital services, from improved health diagnostics to ridesharing. In short, data has become core to the value that companies provide in all industries.
Data-driven companies rely on ingesting the data they need, computing the state of the world, and automating downstream business actions based on service supply and demand. If bad data gets through this process, everything downstream is negatively affected: the customer experience is tarnished, if not outright ruined, and the business begins to lose trust in its data.
A data quality monitoring system (DQM) alerts the data teams to incidents of bad data and guides them through incident mitigation and recovery. If implemented correctly, the monitor helps data engineers and analysts catch issues before bad data affects downstream business processes.
But building a DQM is a much more complex, time-consuming, and costly process than it might seem to be at first glance. Any company must take these factors into account when weighing the value of developing a DQM internally.
A Complex System at Scale
On the surface, data quality monitoring might seem simple. With some back-of-the-napkin analysis, you might be thinking:
- A data scientist can put together a Python or R script with existing packages to monitor for anomalies in a set of metadata metrics.
- Or a data engineer can use a data testing framework to identify when data services are down.
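To make the first approach concrete, here is a minimal sketch of such a back-of-the-napkin monitor. The table, row counts, and threshold are hypothetical illustrations, not metrics from any real system:

```python
# Naive "back-of-the-napkin" monitor: flag today's value if it falls
# outside mean ± k standard deviations of recent history.
import statistics

def is_anomalous(history: list[float], today: float, k: float = 3.0) -> bool:
    """Return True if `today` is more than k standard deviations from the mean."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return abs(today - mean) > k * stdev

# Two weeks of daily row counts for a hypothetical table.
row_counts = [10_120.0, 9_980.0, 10_230.0, 10_050.0, 9_900.0, 10_310.0, 10_150.0,
              10_020.0, 10_180.0, 9_950.0, 10_090.0, 10_240.0, 10_070.0, 10_130.0]

print(is_anomalous(row_counts, today=10_100))  # a typical day passes
print(is_anomalous(row_counts, today=4_200))   # a partial load is flagged
```

This works fine for one well-behaved metric; the difficulty, as described below, is doing it across thousands of tables whose metrics don't behave this nicely.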
These approaches, however, quickly become unwieldy as your data scales, even at just a few hundred tables. And modern data teams are reaching the hundred-table milestone earlier and earlier.
Some problems we have observed are:
- Teams struggle to figure out which metrics to measure, spending a lot of effort to get even simple coverage.
- When the business and underlying data are continually changing, the team faces a growing maintenance burden.
- Non-comprehensive coverage only creates more risk because bad data from one area of the business can be quickly replicated and stored over and over again elsewhere, polluting the tables downstream.
And on top of it all, to be effective, the DQM must be able to scale. This requires an infrastructure that contains systems and services that can reliably produce data quality test metrics, monitoring results, and visualization tools for producing actionable insights. This in turn calls for meticulous, long-term coordination between data scientists, data engineers, and other business stakeholders.
This cross-platform collaborative planning, development, testing, implementation, and maintenance quickly becomes complex. Once the DQM matures, development work may plateau, but maintenance continues as new data needs arise from changes such as data migrations, new data types, and new tables. Maintenance of the DQM infrastructure is perennial.
An Expensive Endeavour
Ultimately, building a successful DQM system requires a dedicated team of several engineers and data scientists with very specialized skills, including experience with data infrastructure engineering, modern time series analysis, and tooling selection rooted in extensive knowledge of applied statistics. Without the necessary talent, the team may not be able to address serious issues and blind spots in the engineering process. For example, the DQM can easily drown data engineers in false positives, bring down the data warehouse with taxing queries, and create a tangled mess of configurations that is impossible to understand a year out from initial development.
On the other hand, if the team has the necessary skills, they will need to be dedicated not only to the development but also to the maintenance of the system, indefinitely. Before undertaking a DQM project, it’s important to consider the cost of a dedicated team, full of hard-to-find talent.
At Uber, the DQM project took 12 months and a team of 5 data scientists and engineers, and will require ongoing investment to maintain. At Uber’s scale there may be enough ROI from a highly customized in-house product to justify its existence, but this is rare. For most companies, resources are better spent on the core business.
Even exceptions like Airbnb, another leader in developing cutting-edge data management tools (like Airflow, Dataportal, and Superset), have had similar experiences. At Airbnb, it took a team of six to seven engineers 12 months to design and create a system for visualizing data timeliness, one critical aspect of data quality. These problems are incredibly difficult to solve at scale.
Let’s say the underlying system uses a standard deviation approach to find anomalies (a common approach among engineering teams), as shown in Figure 1. While this looks reasonable on the surface, it may not work well for a highly seasonal metric. The on-call engineer would then have to modify the monitoring system to improve anomaly capture.
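A synthetic sketch of that failure mode (illustrative numbers, not Uber's actual data or model): with a strong weekly pattern, the global standard deviation is inflated by the seasonality itself, so the bands become wide enough to miss real incidents.

```python
# Fixed mean ± 3σ bands on a weekly-seasonal metric (synthetic data).
import statistics

# Weekdays run around 1,000 events; weekends around 2,000.
weekday, weekend = [1000.0] * 5, [2000.0] * 2
history = (weekday + weekend) * 4  # four weeks of observations

mean = statistics.mean(history)
stdev = statistics.stdev(history)   # inflated by the weekly swing
lo, hi = mean - 3 * stdev, mean + 3 * stdev

# A Tuesday that suddenly doubles to weekend-level volume is a real
# incident, yet it sits comfortably inside the global bands.
bad_tuesday = 2000.0
print(lo <= bad_tuesday <= hi)  # the anomaly goes undetected
```

A seasonality-aware model (e.g., comparing against the same weekday in prior weeks) avoids this, which is exactly the kind of modeling work that makes a DQM more than a thresholding script.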
In another example, shown in Figure 2, an incident occurs when a historically constant data metric jumps to an elevated value. Typically, a simple anomaly detection tool will eventually emit boundaries that absorb the level change. From a data quality standpoint, however, this is undesirable behavior, because the DQM's expectations should not be influenced by bad data values. To build a comprehensive DQM, a research team would need to work through this unique set of problems in the data quality monitoring world.
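A synthetic sketch of that second failure mode (again, illustrative numbers rather than a production model): a rolling-window monitor flags the onset of a level shift, but once the bad values dominate the window, its bounds absorb the new level and the alerts stop, even though the data is still wrong.

```python
# Rolling mean ± 3σ monitor absorbing a sustained level shift (synthetic data).
import statistics

# A historically constant metric jumps to a bad elevated level and stays there.
series = [100.0] * 30 + [180.0] * 10  # incident begins at index 30

WINDOW = 7
flags = []
for i in range(WINDOW, len(series)):
    window = series[i - WINDOW:i]
    mean = statistics.mean(window)
    stdev = max(statistics.stdev(window), 1.0)  # floor for zero-variance windows
    flags.append(abs(series[i] - mean) > 3 * stdev)

# Only the onset of the incident is flagged; as bad values enter the
# window they inflate both the mean and the stdev, silencing the monitor.
print(flags[23:33])  # flags[23] corresponds to index 30, the incident's start
```

Preventing this requires the monitor to exclude suspected-bad points from its own training data, another problem a DQM research team has to solve deliberately.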
These are just two examples of the problems a practical DQM system will need to overcome. We have to also think about other issues such as data quality metric collection methodologies and cadence, which can generate different metric progressions. A comprehensive DQM must be able to handle these issues and many more at scale. For the vast majority of companies, this calculus is incredibly burdensome, and focusing resources on perfecting existing core business services would likely provide much greater ROI over time.
Data Quality Monitoring for All
Companies are investing a great deal into being data-driven, and data quality shouldn’t be allowed to derail those efforts. Now is a great time for data teams to really understand their data needs and use their data to the fullest extent for becoming more cost-effective and operationally-lean.
For more information on data quality in modern data warehouses like Snowflake, read our latest guide on building trustworthy data. At Bigeye, we automate scalable data quality monitoring for teams in many industries.