Modern, data-driven companies need reliable data. But on the path to building trustworthy analytics, ML models, and data products, your data team is bound to hit some roadblocks. In theory, data reliability is straightforward. But when it comes to the messy business of actually implementing it, what do real-life teams do?
Organizations like Lyft, Walmart, and LinkedIn have applied data reliability techniques to solve their data challenges. In this post, we highlight 20 of those real-world examples.
1. Monitoring data freshness/staleness
LinkedIn built a system called Data Health Monitor (DHM) that automatically monitors the freshness and staleness of datasets. With this system, they can detect issues like pipelines unintentionally using an older dataset version.
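A freshness check of this kind can be sketched in a few lines. This is a minimal illustration, not LinkedIn's DHM: the function name is_stale and the 24-hour threshold are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone

def is_stale(last_updated: datetime, now: datetime, max_age: timedelta) -> bool:
    """Return True when the newest data is older than the allowed age."""
    return now - last_updated > max_age

# Hypothetical timestamps for two dataset versions.
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
fresh_version = datetime(2024, 1, 2, 3, 0, tzinfo=timezone.utc)
old_version = datetime(2023, 12, 31, 3, 0, tzinfo=timezone.utc)

print(is_stale(fresh_version, now, timedelta(hours=24)))  # False
print(is_stale(old_version, now, timedelta(hours=24)))    # True
```

A pipeline that accidentally reads the older version would trip the second check.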
2. Monitoring data volume changes
LinkedIn’s DHM also monitors for sudden drops or spikes in data volume: a drop can indicate partial data, and a spike can indicate insufficient resources. Being aware of volume changes helps LinkedIn maintain pipeline and data quality.
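One simple way to flag volume changes is to compare today's row count against a recent baseline. The sketch below assumes a plain average as the baseline and a 50% tolerance; both are illustrative choices, not details of DHM.

```python
from statistics import mean
from typing import Optional, Sequence

def volume_anomaly(history: Sequence[int], today: int,
                   tolerance: float = 0.5) -> Optional[str]:
    """Classify today's row count as a 'drop', a 'spike', or None (normal)."""
    baseline = mean(history)
    if baseline == 0:
        return None  # no usable baseline to compare against
    change = (today - baseline) / baseline
    if change < -tolerance:
        return "drop"
    if change > tolerance:
        return "spike"
    return None

recent_counts = [1000, 1100, 900]  # hypothetical daily row counts
print(volume_anomaly(recent_counts, 300))   # drop
print(volume_anomaly(recent_counts, 2500))  # spike
print(volume_anomaly(recent_counts, 1050))  # None
```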
3. Monitoring the quality of offline data
Lyft built Verity, a check-based system to monitor the quality of offline data. It allows users to define checks that run queries to validate expectations, for example, checking for null values in a column. Verity checks can be configured to run automatically on a schedule or as part of data pipelines. The check results are stored to enable debugging when failures occur.
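To make the idea of a query-based check concrete, here is a small sketch using an in-memory SQLite table. It does not use Verity's actual interface; the run_check helper, the rides table, and the result format are all hypothetical.

```python
import sqlite3

# Hypothetical table with one null driver_id.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rides (ride_id INTEGER, driver_id INTEGER)")
conn.executemany("INSERT INTO rides VALUES (?, ?)",
                 [(1, 10), (2, None), (3, 12)])

def run_check(conn, name, query, expected):
    """Run a query that returns one value and compare it to the expectation."""
    actual = conn.execute(query).fetchone()[0]
    # Returning the full result lets it be stored for later debugging.
    return {"check": name, "expected": expected,
            "actual": actual, "passed": actual == expected}

result = run_check(
    conn,
    "driver_id_not_null",
    "SELECT COUNT(*) FROM rides WHERE driver_id IS NULL",
    expected=0,
)
print(result["passed"], result["actual"])  # False 1
```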
4. Bringing objectivity to quality
Walmart built DQAF (Data Quality Assessment Framework), which is the company’s product for Continuous Data Quality. The DQAF enables stakeholders to define objective thresholds for quality metrics based on what "good quality" means to them. This makes quality less subjective. For example, a business user can set a threshold that a critical column must be 95-100% complete.
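The completeness threshold in that example can be expressed as a short calculation. This is only a sketch of the idea; the function names and sample column are assumptions, not part of DQAF.

```python
def completeness(values):
    """Percentage of values that are not null (None)."""
    if not values:
        return 0.0
    non_null = sum(1 for v in values if v is not None)
    return 100.0 * non_null / len(values)

def meets_threshold(score, lower=95.0, upper=100.0):
    """True when the score falls inside the agreed quality band."""
    return lower <= score <= upper

# Hypothetical critical column: 19 populated values and 1 null.
column = ["x"] * 19 + [None]
score = completeness(column)
print(score, meets_threshold(score))  # 95.0 True
```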
5. Clarifying ownership
Walmart’s framework also assigns ownership of quality scores for different data domains to the relevant teams. For example, the "Orders" team owns order data quality scores. This functionality delineates responsibilities across siloed teams.
6. Tracking improvements
By storing quality scores over time, Walmart can quantify improvements as data stewards fix issues. If the completeness score for a column goes from 80% to 95% after fixes, it demonstrates the business impact of quality efforts.
7. Depicting interconnectedness
Even though teams own data in silos, quality scores in Walmart’s framework show interdependencies between data sets. For example, Orders team data quality is connected to the Customer team data quality.
8. Enabling custom algorithms
At Walmart, teams can define custom data quality algorithms tailored to their specific data needs. For example, the Orders team could create a validity check unique to a column in the Orders table.
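One way to let teams plug in their own checks is a small registry keyed by table and column. The sketch below is an assumption about how such a mechanism could look; the register_check decorator, the order_status column, and the allowed values are hypothetical.

```python
CUSTOM_CHECKS = {}  # maps (table, column) to a scoring function

def register_check(table, column):
    """Decorator that registers a team-defined check for one column."""
    def decorator(fn):
        CUSTOM_CHECKS[(table, column)] = fn
        return fn
    return decorator

@register_check("orders", "order_status")
def valid_order_status(values):
    """Validity score: percentage of values in the allowed set."""
    allowed = {"placed", "shipped", "delivered", "cancelled"}
    if not values:
        return 0.0
    valid = sum(1 for v in values if v in allowed)
    return 100.0 * valid / len(values)

check = CUSTOM_CHECKS[("orders", "order_status")]
print(check(["placed", "shipped", "unknown", "delivered"]))  # 75.0
```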
9. Answering questions about data quality
Through Walmart's quality score tracking, analysts can also answer questions like "Why did data quality dip last month?" These analyses provide data-driven narratives around quality that can be surfaced to executives.
10. Making data discoverable
LinkedIn built "Super Tables", which are centralized, well-documented datasets that have been pre-computed and normalized. They aim to be the “go-to datasets” for certain domains, e.g.:
JOBS Super Table: Consolidates data from 57+ different job-related data sources into a single table with 158 columns. Provides precomputed information commonly needed for job analytics and insights.
Ad Events Super Table: Consolidates data from 7 different ad-related tables, including ad impressions, clicks, video views, etc. Joins in campaign and advertiser dimensions. Provides 150+ columns for ad analytics and reporting.
The goal of both Super Tables is to simplify data discovery, reduce redundant joins and storage, and precompute commonly used data for downstream analytics.
11. Guaranteeing table availability
LinkedIn’s Super Tables also have well-defined service level agreements (SLAs) that specify availability, supportability, and change management commitments.
For availability, the goal is to achieve 99%+ uptime. For a daily Super Table flow, this translates to about one SLA miss per quarter. To improve availability, Super Tables can be materialized in multiple clusters with active-active configurations. This provides redundancy in case of failures.
Upstream data sources must also commit to SLAs that enable the Super Table to meet its own SLA. The SLAs of upstream sources are tracked and monitored.
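The availability arithmetic above is easy to verify. The calculation below assumes one run per day and a quarter of roughly 91 days.

```python
def allowed_misses(runs: float, availability: float) -> float:
    """Number of runs that may miss the SLA at a given availability target."""
    return runs * (1.0 - availability)

runs_per_quarter = 365 / 4  # about 91 daily runs
misses = allowed_misses(runs_per_quarter, 0.99)
print(round(misses, 2))  # 0.91
```

At 99% availability, a daily flow can miss a little under one run per quarter.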
12. Managing schema changes in upstream sources
By default, schema changes (additions, deletions, etc.) in upstream source data do not automatically affect the Super Table schema.
If a new column is added in a source, it does not appear in the Super Table. If a source column is deleted, its value is nullified in the Super Table.
The Super Table governance body is notified of source schema changes that could potentially impact the table. All planned schema changes to the Super Table itself are documented and communicated to downstream consumers, and there is a monthly release cadence for accepting schema change requests to the Super Table.
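The default behavior described above, where new source columns are ignored and deleted ones become null, amounts to projecting each source row onto a fixed schema. A minimal sketch, with hypothetical column names:

```python
# Hypothetical fixed schema of the Super Table.
SUPER_TABLE_COLUMNS = ("job_id", "title", "company_id")

def project_row(source_row: dict) -> dict:
    """Keep only Super Table columns; columns missing upstream become None."""
    return {col: source_row.get(col) for col in SUPER_TABLE_COLUMNS}

# The source added a column: it does not appear in the Super Table row.
print(project_row({"job_id": 1, "title": "Engineer",
                   "company_id": 7, "new_col": "x"}))

# The source dropped a column: its value is nullified.
print(project_row({"job_id": 1, "title": "Engineer"}))
```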
13. Reducing alert fatigue
Uber used tiering to classify and prioritize its various data assets, such as tables, pipelines, machine learning models, and dashboards. By assigning different tiers to these assets, Uber is able to manage its resources more efficiently, ensuring that only the most important data gets alerted on:
Tier 0: These are the most critical data assets that are foundational for the business to operate. Any disruption in these assets could have severe consequences. Kafka as a service, for example, falls under this category.
Tier 1: Extremely important datasets that could be essential for decision-making, analytics, or operational aspects. These could be things like user data, transaction data, etc.
Tier 2: Important but not critical datasets. These could be important for some departments or features but aren't as universally crucial.
Tiers 3, 4: Less critical data that may still be useful for specific analyses or features.
Tier 5: These are individually owned datasets, often generated in staging or test environments. They have no guarantees of quality or availability and are the least prioritized.
By identifying just 2,500 Tier 1 and Tier 2 tables out of over 130,000 tables, Uber focused its efforts on a manageable but critically important subset of its data, allowing for better quality, reliability, and resource allocation.
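Tier-based alerting can be sketched as a simple filter. The cutoff below (alert on Tier 2 and more critical) and the asset names are assumptions for illustration, not Uber's actual configuration.

```python
from dataclasses import dataclass

@dataclass
class DataAsset:
    name: str
    tier: int  # 0 is the most critical, 5 the least

def should_alert(asset: DataAsset, max_alerting_tier: int = 2) -> bool:
    """Alert only on assets at or above the chosen criticality cutoff."""
    return asset.tier <= max_alerting_tier

assets = [
    DataAsset("kafka_service", 0),
    DataAsset("trips", 1),
    DataAsset("scratch_tmp", 5),
]
alerting = [a.name for a in assets if should_alert(a)]
print(alerting)  # ['kafka_service', 'trips']

# Share of tables in the focused subset, using the figures quoted above.
share = 2500 / 130000
print(f"{share:.1%}")  # 1.9%
```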
14. Reducing manual data issue debugging
Stripe built a centralized observability platform and internal UI that allowed users to select different runs of a data job and compare metrics like runtime, data volume processed, and logs across those runs.
Based on current runtime progression and historical runtimes, the UI would also predict estimated completion time for running jobs, which would help address stakeholder questions.
Finally, users could configure standardized fallback behaviors for different failure cases, and data tests, through the UI.
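The source does not say how Stripe computed its completion estimates. As one possible approach, the sketch below blends a projection from current progress with the median historical runtime; the function name, the equal weighting, and the inputs are all assumptions.

```python
from statistics import median

def estimate_remaining_seconds(elapsed, fraction_done, historical_runtimes):
    """Estimate seconds left by averaging two projections of total runtime."""
    historical_total = median(historical_runtimes)
    if fraction_done <= 0:
        # No progress signal yet: fall back to history alone.
        projected_total = historical_total
    else:
        progress_total = elapsed / fraction_done
        projected_total = (progress_total + historical_total) / 2
    return max(0.0, projected_total - elapsed)

# Hypothetical job: 600 s elapsed, half done, past runs took 1000 to 1200 s.
print(estimate_remaining_seconds(600, 0.5, [1000, 1100, 1200]))  # 550.0
```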
15. On-call training
Playbooks and runbooks are documents that outline the steps for responding to specific types of issues/incidents. In the context of running a data organization, they ensure that everyone involved has a shared understanding of the plan of action. More specifically, they provide a checklist of action items so that nothing is forgotten. This checklist can also be used to train new staff on data issue response.
16. Data producer-consumer alignment
Convoy pioneered data contracts. These are API-based agreements between software engineers who own services and business-focused data consumers, with the goal of generating well-modeled, high-quality, trusted data. They allow a service to define the entities and application-level events they own, along with their schema and semantics.
Data contracts ensure that production-grade data pipelines are treated as part of the product, with clear SLAs and ownership. They also orient everyone in the same direction so that problem-solving work is effective.
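A contract's schema portion can be enforced with a small validator. This is a simplified sketch, not Convoy's implementation: the contract layout, the shipment entity, and the field names are hypothetical.

```python
# Hypothetical contract: an entity and the fields (with types) it guarantees.
CONTRACT = {
    "entity": "shipment",
    "fields": {"shipment_id": str, "weight_kg": float, "delivered": bool},
}

def violations(event: dict, contract: dict) -> list:
    """List every way an event breaks the contract's schema."""
    problems = []
    for field, expected_type in contract["fields"].items():
        if field not in event:
            problems.append(f"{field}: missing")
        elif not isinstance(event[field], expected_type):
            actual = type(event[field]).__name__
            problems.append(
                f"{field}: expected {expected_type.__name__}, got {actual}")
    return problems

good = {"shipment_id": "S1", "weight_kg": 12.5, "delivered": False}
bad = {"shipment_id": "S1", "weight_kg": "12"}
print(violations(good, CONTRACT))  # []
print(violations(bad, CONTRACT))
```

A producer could run a check like this before publishing an event, so that breaking changes surface in the owning service rather than downstream.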
17. Preventing degradation in machine learning model performance
At Lyft, input features to models are validated in real time against valid value ranges. This catches issues like incorrect units or data types being passed to models.
They also monitor distributions of model score outputs with time series alerts, and analyze historical logs of features and predictions to catch unusual statistical deviations that could imply model degradation. If upstream feature changes or data drift are detected, they automatically retrain models to prevent performance from declining.
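Both ideas, range validation on inputs and drift detection on outputs, can be sketched briefly. The feature names, ranges, and z-score threshold below are assumptions for illustration and do not reflect Lyft's actual checks.

```python
from statistics import mean, stdev

# Hypothetical valid ranges for two model input features.
FEATURE_RANGES = {
    "trip_distance_km": (0.0, 500.0),
    "driver_rating": (1.0, 5.0),
}

def out_of_range(features: dict) -> list:
    """Names of features that are missing, non-numeric, or outside range."""
    bad = []
    for name, (low, high) in FEATURE_RANGES.items():
        value = features.get(name)
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not (low <= value <= high):
            bad.append(name)
    return bad

def scores_drifted(baseline, recent, z_threshold=3.0):
    """Flag drift when the recent mean score is far from the baseline mean."""
    mu, sigma = mean(baseline), stdev(baseline)
    if sigma == 0:
        return mean(recent) != mu
    return abs(mean(recent) - mu) / sigma > z_threshold

# A distance sent in meters instead of kilometers fails the range check.
print(out_of_range({"trip_distance_km": 12000.0, "driver_rating": 4.8}))

baseline_scores = [0.1, 0.2, 0.3, 0.2, 0.1, 0.3]
print(scores_drifted(baseline_scores, [0.9, 0.95, 0.85]))  # True
```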
18. Making it easier for business users to answer data questions
Pinterest built Querybook, an open-source data collaboration platform for sharing SQL queries, datasets, and insights. Querybook also has a ChatGPT-like interface to automatically generate and execute SQL queries from plain text questions. For example, users can ask natural language questions like "How many daily active users in the past month?" and it will generate the appropriate SQL query.
19. Making data incidents less stressful
Following the principles of data reliability will hopefully mean you face fewer data incidents, but it also means that even when data incidents occur, they’re less stressful.
You can apply standard incident response frameworks to data incidents too. That includes the response process (incident detection, response, root cause analysis, resolution, and a blameless post-mortem) and the response team (incident leader, SME, liaison, scribe). Therein lies your tried-and-true plan of attack.
20. Encouraging data-driven business decisions
Ultimately, you’re not collecting and analyzing all this data at a company for fun: it should be in service of making business or product decisions. Data reliability principles ensure that analyses and reports are accurate, that metrics and trends can be tracked over time, and that key financial information is always up-to-date and correct for compliance reasons.
Modern data stacks enable tremendous analytical capabilities but also introduce reliability challenges from complexity and scale. Companies like Lyft, LinkedIn, Uber, Walmart, and Pinterest apply data reliability principles to build trust and confidence in their data products and make better business choices.