Data in Practice: Systematizing data quality at Uber-scale
In the "Data in Practice" series, we talk to real data engineers who have put data reliability engineering concepts into practice, learning from their challenges and successes in the real world.
Uber revolutionized transportation by connecting millions of bikes, riders, drivers, and restaurants. Behind this transformation lies a complex data stack. In this blog post, adapted from a presentation at Meta’s Data Observability Learning Summit by Sriharsha Chintalapani and Sanjay Sundaresan, we look at some of the challenges Uber faced in maintaining data quality at Uber-scale, plus the solutions they implemented to tackle them.
History of Uber’s data infrastructure
Uber's data infrastructure has significantly evolved since the company's launch. In the early days, Uber had a monolithic data pipeline that was responsible for collecting, storing, and processing all data. At the height of this platform, there were 300,000+ unowned datasets. More specifically, the data pipeline consisted of:
- A sharded MySQL database as the “online database”
- A CDC pipeline powered by Hive that took data from the online database and pushed it to the data lake (in a 24-hour process)
- Once in the data lake, data was categorized as “raw tables”
- A data warehouse team that would turn raw tables into dimension tables and fact tables
- Utilities and tools on top of the data warehouse for data scientists' usage
The need to build a data observability platform
With the huge number of pipelines and datasets, several issues arose. In particular:
- Data duplication: No one knew which data existed, so teams felt it was most convenient to create their own version of the data
- No visibility into data lineage and freshness: No one knew when exactly certain data was landing in tables
- Data quality: There was no way to gauge the quality of the data teams were seeing (e.g. in dashboards)
In the years of Uber’s hypergrowth (2015-2016), teams concentrated on scaling the data infra itself, rather than investing in the data product. Soon after, Uber realized that a stronger data foundation was a top priority.
Uber’s principles for data
Uber applied the following principles to arrive at a better data culture:
- Data as code: Data is treated as code and is managed in a similar way to software. The artifacts are reviewed, and any schema change done in production goes through the review process. Producers of the data, as well as consumers of the data, are tagged during the review process. This approach makes it easier to track changes, version data, and collaborate with others.
- Data is owned: Data must be owned by the business or functional teams that use it. The teams must clearly define the intent of the data product and artifact, own it, and provide guarantees around the data. This approach ensures that teams are responsible for the quality of the data they use and are motivated to improve it.
- Data quality is known: Data quality is continuously monitored and measured, with SLA targets used as part of the assertions. Every dataset is assigned a tier, and the tier determines its default SLA values. This approach enables teams to identify and fix data quality issues quickly and easily.
With the implementation of these principles, Uber moved from a platform of self-serving tools to a more regulated, owned, and responsible data platform.
Data observability at Uber in 2021
Fast forward to the present day: Uber has built out a data observability platform made up of several components, starting with data tiering.
At Uber, not all data is equally important. The company implemented a tiering concept for its data assets (tables, pipelines, ML models, and dashboards). Tier 1 indicates an extremely important dataset and Tier 5 indicates an individually-owned dataset, generated in staging environments, without any guarantees.
After all datasets were tiered, the company identified 2500 Tier 1 and Tier 2 tables (out of 130k+) that were extremely important. That way, Uber could focus its efforts on ensuring the quality of the most important data, while still providing visibility into all data.
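The tiering idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `DataAsset` class, the `DEFAULT_FRESHNESS_SLA` table, and every SLA value in it are hypothetical and do not come from Uber's platform.

```python
from dataclasses import dataclass
from datetime import timedelta

# Hypothetical default freshness SLAs per tier (illustrative values only).
DEFAULT_FRESHNESS_SLA = {
    1: timedelta(hours=6),
    2: timedelta(hours=12),
    3: timedelta(hours=24),
    4: timedelta(hours=48),
    5: None,  # Tier 5: individually owned, no guarantees
}


@dataclass(frozen=True)
class DataAsset:
    name: str
    kind: str  # "table", "pipeline", "ml_model", or "dashboard"
    tier: int  # 1 (most important) through 5 (no guarantees)


def default_sla(asset):
    """Look up the default freshness SLA implied by the asset's tier."""
    return DEFAULT_FRESHNESS_SLA[asset.tier]


def critical_assets(assets):
    """Keep only Tier 1 and Tier 2 assets, preserving input order."""
    return [a for a in assets if a.tier in (1, 2)]
```

Filtering a catalog with `critical_assets` gives the small set of assets that receive the strictest guarantees, while every other asset still carries a tier and stays visible.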
Databook, Uber’s data catalog
Uber's in-house catalog is called Databook. Databook makes data exploration and discovery much easier for Uber’s engineers, data scientists, and operations teams. It serves as a user interface on top of dataset metadata like:
- Quality signals
- Data asset owners
- Products enabled by the data
Databook also provides information about lineage, or the relationships between different datasets. Information around lineage helps engineers understand how data flows through the pipeline, from source to destination.
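To make the lineage idea concrete, here is a minimal sketch that models lineage as a directed graph and walks it to find everything downstream of a dataset. The `LINEAGE` mapping and the dataset names are invented for illustration; this is not Databook's data model or API.

```python
from collections import deque

# Hypothetical lineage edges: upstream dataset -> datasets derived from it.
LINEAGE = {
    "raw.trips": ["dwh.fact_trips"],
    "raw.riders": ["dwh.dim_riders"],
    "dwh.fact_trips": ["dash.city_metrics"],
    "dwh.dim_riders": ["dash.city_metrics"],
}


def downstream(dataset, lineage):
    """Return every dataset reachable from `dataset`, in breadth-first order."""
    seen = []
    queue = deque(lineage.get(dataset, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # already visited via another path
        seen.append(node)
        queue.extend(lineage.get(node, []))
    return seen
```

A traversal like this is what lets an engineer answer "which tables and dashboards are affected if this raw table is late or wrong?"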
Uber's data quality system going forward
To ensure data quality, Uber implemented a data quality system. Once a certain dataset is labeled as Tier 1 or Tier 2, it automatically onboards into a set of data quality checks and foundational guarantees, ensuring that the data is:
- Connected to PagerDuty on-call
These guarantees essentially mean that the data asset is treated like a service. Additionally, all Tier 1/Tier 2 data assets are monitored on a set of metrics including:
- Freshness: measures how recent the data in a dataset is. It can be determined by comparing the timestamp of the data to the current time, or by comparing it to a known source of truth.
- Completeness: measures how much of the expected data is present in a dataset. It can be determined by comparing the number of rows or columns to a known expected value. This metric also ensures that all data present upstream is also present downstream.
- xDC consistency: measures the consistency of data across different data centers. It can be determined by comparing data in different data centers for the same key or by using a hashing function to compare data across data centers.
- Duplicates: measures the number of duplicate records in a dataset. It can be determined by comparing primary keys or by using a hashing function to compare records.
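The four metrics above can each be approximated with a small function. The sketch below is a simplified, in-memory illustration; the function names, the use of SHA-256 for row hashing, and the dictionary-based rows are assumptions made for the example, not a description of how Uber computes these metrics.

```python
import hashlib
from collections import Counter
from datetime import datetime, timedelta, timezone


def is_fresh(last_updated, max_age, now=None):
    """Freshness: the newest data is no older than `max_age`."""
    now = now or datetime.now(timezone.utc)
    return now - last_updated <= max_age


def completeness(actual_rows, expected_rows):
    """Completeness: ratio of observed rows to expected rows."""
    if expected_rows == 0:
        return 1.0
    return actual_rows / expected_rows


def missing_downstream(upstream_keys, downstream_keys):
    """Keys present upstream that never arrived downstream."""
    return set(upstream_keys) - set(downstream_keys)


def row_hash(row):
    """Hash a row (a dict with string column names) in a column-order-independent way."""
    canonical = repr(sorted(row.items()))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def xdc_mismatches(dc_a, dc_b):
    """xDC consistency: keys whose rows differ, or are missing, across two data centers."""
    keys = set(dc_a) | set(dc_b)
    return sorted(
        k for k in keys
        if k not in dc_a or k not in dc_b or row_hash(dc_a[k]) != row_hash(dc_b[k])
    )


def duplicate_keys(rows, key):
    """Duplicates: primary-key values that occur more than once, with their counts."""
    counts = Counter(r[key] for r in rows)
    return {k: n for k, n in counts.items() if n > 1}
```

In a real warehouse these comparisons would more likely run as queries against the tables themselves; the functions here only show the logic each metric encodes.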
Additionally, Uber allows users to set up custom checks on top of these standard metrics. This gives users the ability to define specific checks that are relevant to their use case and to monitor the data in a way that is most meaningful to them.
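One way to picture custom checks is as user-supplied functions registered against a dataset and run alongside the standard metrics. The `CheckRegistry` class below is a hypothetical sketch of that pattern, not Uber's interface.

```python
from typing import Callable, Dict, List

Row = dict
Check = Callable[[List[Row]], bool]


class CheckRegistry:
    """Hypothetical registry mapping dataset names to named custom checks."""

    def __init__(self):
        self._checks: Dict[str, Dict[str, Check]] = {}

    def register(self, dataset, name, check):
        """Attach a custom check (a function of the rows) to a dataset."""
        self._checks.setdefault(dataset, {})[name] = check

    def run(self, dataset, rows):
        """Run every check registered for `dataset`; return name -> pass/fail."""
        checks = self._checks.get(dataset, {})
        return {name: bool(check(rows)) for name, check in checks.items()}
```

For example, a team could register a check that fares are never negative, and a failing result would then surface next to the standard freshness and completeness signals:

```python
registry = CheckRegistry()
registry.register(
    "dwh.fact_trips",
    "fare_non_negative",
    lambda rows: all(r["fare"] >= 0 for r in rows),
)
results = registry.run("dwh.fact_trips", [{"fare": 3.5}, {"fare": -1}])
```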
Uber's approach to data reliability and quality is built on the principles of data as code, data ownership, and data quality. To put these principles into practice, the company intentionally constructed processes and tooling for visibility into all data and metadata, with a particular emphasis on key pipelines and datasets.
Schema change detection