Thought leadership
July 26, 2023

Tackling data correctness: Lessons from Stripe

How does Stripe ensure end-to-end data reliability? Here, we walk through some real-life lessons learned and best practices for doing just that.

Liz Elfman

Exceptional data correctness is essential for companies dealing with sensitive data, especially financial transactions. At Stripe, data correctness is a top priority given their handling of payments and money movement. Some of Stripe's key systems, like treasury, billing, and reconciliation, rely entirely on the accuracy of their data pipelines.

Stripe's Data Platform team is responsible for maintaining all of Stripe's data infrastructure and for building safeguards for data observability and correctness at every point in the system, including automated checks and fallback procedures. This post dives into some of the lessons we can learn from Stripe about ensuring end-to-end data reliability.

Automated checks and robust fallback procedures

Stripe runs automated checks after data processing jobs to validate their results. These checks are implemented as "Airflow decorators" and range from simple validations, like ensuring data volumes are increasing, to more complex checks that verify primary keys exist in upstream data sources.
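
To make that concrete, here is a minimal sketch of what a post-job validation decorator could look like. This is an illustration in plain Python, not Stripe's actual library; the check names and the result shape are hypothetical.

```python
from functools import wraps

def validate_output(check_fns):
    """Wrap a pipeline task; run each check against its result before returning."""
    def decorator(task_fn):
        @wraps(task_fn)
        def wrapper(*args, **kwargs):
            result = task_fn(*args, **kwargs)
            for check in check_fns:
                ok, message = check(result)
                if not ok:
                    # Failing loudly here is what lets fallback logic take over.
                    raise ValueError(f"Data check failed: {message}")
            return result
        return wrapper
    return decorator

def volume_is_increasing(result):
    """Simple check: today's row count should not be lower than yesterday's."""
    return (result["row_count"] >= result["previous_row_count"],
            "row count decreased versus previous run")

def primary_keys_exist_upstream(result):
    """Every primary key this job emits should exist in the upstream source."""
    missing = result["primary_keys"] - result["upstream_primary_keys"]
    return (not missing, f"{len(missing)} primary keys missing upstream")

@validate_output([volume_is_increasing, primary_keys_exist_upstream])
def daily_settlement_job():
    # ... run the actual transformation; return a summary for the checks ...
    return {
        "row_count": 1_000_500,
        "previous_row_count": 1_000_000,
        "primary_keys": {"ch_1", "ch_2"},
        "upstream_primary_keys": {"ch_1", "ch_2", "ch_3"},
    }

summary = daily_settlement_job()  # raises if any check fails
```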

That's not all. Stripe has also established fallback procedures for when these checks fail or jobs don't complete successfully. For less critical data, the system may automatically fall back to the previous day's data. For higher-priority data, it may rerun the pipeline using the previous day's data. And for the most critical data, the system halts the pipeline entirely until the issue is addressed. These procedures help avoid moving forward with questionable data.
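
As a rough sketch, that tiered fallback logic might look like the following. The tiers and actions mirror the description above; the functions are illustrative stubs, not Stripe's real interface.

```python
from enum import Enum

class Criticality(Enum):
    LOW = "low"            # e.g., internal dashboards
    HIGH = "high"          # e.g., reporting datasets
    CRITICAL = "critical"  # e.g., money movement and reconciliation

# Illustrative stubs standing in for real pipeline operations.
def publish_previous_day(dataset): print(f"{dataset}: serving previous day's data")
def rerun_with_previous_day(dataset): print(f"{dataset}: rerunning with previous day's inputs")
def halt_and_page(dataset): print(f"{dataset}: pipeline halted, paging on-call")

def handle_failed_run(dataset: str, criticality: Criticality) -> None:
    """Decide what happens when a job's checks fail or it doesn't complete."""
    if criticality is Criticality.LOW:
        publish_previous_day(dataset)     # acceptable staleness
    elif criticality is Criticality.HIGH:
        rerun_with_previous_day(dataset)  # retry before serving anything
    else:
        halt_and_page(dataset)            # nothing moves until a human signs off

handle_failed_run("merchant_dashboard_daily", Criticality.LOW)
```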

Dealing with eventual consistency

In the past, S3 followed an "eventual consistency" model. This means that after a successful write operation (such as PUT, POST, DELETE), subsequent read operations (GET) might not reflect the change for a brief period of time. This is because it takes a short while for the change to propagate to all replicas of the data across Amazon's infrastructure.

So, for instance, if you updated or deleted an object and immediately tried to read it, you might receive the old version of the object or encounter a "not found" error. Until 2020, when S3 became strongly consistent, Stripe dealt with this issue by building a "metadata layer" that stored extra information alongside S3, only allowing the data pipeline to continue once all necessary data was in place. This metadata layer eventually evolved into a new table format, providing more efficiency and insight into what data was needed for each job.
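
A toy version of that gating logic might look like this. The in-memory dict stands in for the metadata store; in reality it was a separate durable system that writers updated as objects landed, consulted instead of a potentially stale S3 listing.

```python
import time

# Stand-in for the durable metadata store that writers update as objects land.
metadata_store = {
    "s3://payments/2020-06-01/": {"expected": 3, "landed": {"part-0", "part-1", "part-2"}},
}

def inputs_ready(prefix: str) -> bool:
    entry = metadata_store[prefix]
    return len(entry["landed"]) >= entry["expected"]

def wait_for_inputs(prefix: str, poll_seconds: float = 30, timeout_seconds: float = 3600):
    """Block the pipeline until every expected object is recorded as landed,
    rather than trusting a potentially stale S3 LIST."""
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        if inputs_ready(prefix):
            return
        time.sleep(poll_seconds)
    raise TimeoutError(f"inputs under {prefix} never became complete")

wait_for_inputs("s3://payments/2020-06-01/")  # returns immediately in this toy example
```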

Type-safe libraries

Stripe implemented type-safe libraries for Apache Spark to enforce correctness in data operations. Since Spark's built-in encoders didn't cover all of Stripe's data types, they implemented their own encoders to support every type used at Stripe. With type-safe data and operations, pipelines were less prone to failure, since these libraries caught issues early and prevented errors.
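
Stripe's libraries are built on Spark's encoder machinery (a Scala-side concept); as a language-neutral illustration of the same idea, the Python sketch below declares a record type once and rejects ill-typed rows at the pipeline boundary. The Charge type and its fields are hypothetical.

```python
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class Charge:
    charge_id: str
    amount_cents: int
    currency: str

def decode(row: dict) -> Charge:
    """Fail fast on missing fields or wrong types, instead of deep inside a job."""
    kwargs = {}
    for f in fields(Charge):
        value = row.get(f.name)
        if not isinstance(value, f.type):
            raise TypeError(
                f"{f.name}: expected {f.type.__name__}, got {type(value).__name__}"
            )
        kwargs[f.name] = value
    return Charge(**kwargs)

# Well-typed rows decode; a bad row raises before the pipeline does real work.
charge = decode({"charge_id": "ch_1", "amount_cents": 1250, "currency": "usd"})
```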

Data observability platform

Stripe created an observability platform specifically for monitoring data and providing a comprehensive view of its data platform. It integrates with Airflow and the metadata layer to track key metrics for each data job, like runtime, data size, and column names and types. This UI-based solution reduces stakeholders' dependency on the data science and platform teams: they can see for themselves why required data isn't available yet, and whether the cause is a failure or a bottleneck in the pipeline. While initially basic, the platform gave Stripe a foundation to build on for specifying fallback logic, test definitions, and other data correctness procedures.
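
The kind of per-run metadata such a platform surfaces might be sketched like this. The shape below is illustrative; in Stripe's case these fields come from Airflow and the metadata layer rather than from the jobs themselves.

```python
import time
from dataclasses import dataclass

@dataclass
class JobRunMetrics:
    job_name: str
    runtime_seconds: float
    row_count: int
    column_types: dict  # column name -> type name

def run_with_metrics(job_name, job_fn):
    """Run a job and capture the kind of metrics the observability UI displays."""
    start = time.monotonic()
    rows = job_fn()
    metrics = JobRunMetrics(
        job_name=job_name,
        runtime_seconds=time.monotonic() - start,
        row_count=len(rows),
        column_types={k: type(v).__name__ for k, v in (rows[0].items() if rows else [])},
    )
    return rows, metrics

rows, metrics = run_with_metrics(
    "daily_charges", lambda: [{"charge_id": "ch_1", "amount_cents": 1250}]
)
print(metrics)
```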

While Stripe, as a large tech company with many resources, chose to build their data observability platform in-house, this is where an off-the-shelf data observability platform like Bigeye can be helpful. Bigeye's autometrics and autothresholds make it easy to monitor the data in your data warehouse and to receive automatic alerts when metrics go out of bounds.

Periodic reprocessing

To ensure maximum accuracy for its most critical data, Stripe frequently recomputes the "universe" of data, reprocessing everything from the beginning. While this adds latency and cost, the benefits to data correctness outweigh the trade-offs. Recomputation also provides an opportunity to optimize infrastructure. Although full recomputation may be infeasible at the very largest data volumes, Stripe's data has remained manageable enough to recompute when needed.
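
Conceptually, a full recompute is just a deterministic rebuild of every partition from raw, immutable inputs, along these lines (the function names and dates are hypothetical):

```python
from datetime import date, timedelta

def recompute_universe(build_partition, start: date, end: date) -> None:
    """Rebuild every daily partition from scratch, trading cost and latency
    for correctness: any past bug or bad input is washed out of the output."""
    day = start
    while day <= end:
        build_partition(day)  # deterministic rebuild from raw, immutable inputs
        day += timedelta(days=1)

recompute_universe(lambda d: print(f"rebuilding {d}"), date(2023, 7, 1), date(2023, 7, 3))
```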

Striking a balance: Trade-offs and efficiencies

Stripe's focus on correctness requires trade-offs, like increased latency, unpredictability, and cost. But these stringent correctness procedures have also driven Stripe to implement efficiency tools like the metadata layer and the Apache Iceberg table format. The metadata layer provides the information needed to reprocess only data that has changed, while Iceberg enables storing data in multiple locations and querying only what a specific job needs. The robust, self-service data systems built around correctness allow Stripe to have confidence in their data.
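
For example, Iceberg's incremental-read support lets a job consume only the snapshots added since its last run, instead of rescanning the whole table. The sketch below uses PySpark with an Iceberg catalog; the table name and snapshot IDs are placeholders, and this illustrates the pattern rather than Stripe's actual code.

```python
from pyspark.sql import SparkSession

# Assumes the Iceberg Spark runtime jar and a catalog are already configured.
spark = SparkSession.builder.appName("incremental-reprocess").getOrCreate()

# Read only the rows appended between two snapshots, not the whole table.
changed = (spark.read.format("iceberg")
           .option("start-snapshot-id", "1111111111111111111")  # placeholder IDs
           .option("end-snapshot-id", "2222222222222222222")
           .load("warehouse.payments"))

# Recompute downstream aggregates only over the changed slice of data.
changed.groupBy("merchant_id").count().show()
```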

Final thoughts

Data correctness has always been a top priority for Stripe, and they have built it into every layer of their data stack. Comprehensive monitoring, automated checks, a metadata layer, type-safe libraries, an observability platform, and regular full recomputations of data provide multiple levels of validation to ensure accurate data. By building data correctness into systems and processes from end to end, Stripe has established a model of data reliability engineering other companies would do well to follow.

If you'd like to improve your data correctness with Bigeye, schedule a demo here.

