Tackling data correctness: Lessons from Stripe
How does Stripe ensure end-to-end data reliability? Here, we walk through some real-life lessons learned and best practices for doing just that.
Exceptional data correctness is essential for companies dealing with sensitive data, especially financial transactions. At Stripe, data correctness is a top priority given its role in handling payments and money movement: key systems like treasury, billing, and reconciliation rely entirely on the accuracy of its data pipelines.
Stripe’s Data Platform team is responsible for maintaining all of Stripe’s data infrastructure and for building data observability and correctness checks at every point in the system, including automated checks, fallback procedures, and other safeguards. This post dives into some of the lessons we can learn from Stripe on ensuring end-to-end data reliability.
Automated checks and robust fallback procedures
Stripe has implemented automated checks that run after data processing jobs to validate the results. These checks are implemented as "Airflow decorators" and include simple validations like ensuring data volumes are increasing, as well as more complex checks that verify primary keys exist in upstream data sources.
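Stripe hasn't published its decorator code, but a post-job check in this spirit might look like the following Python sketch. The names (`check_volume_increasing`, `daily_charges_job`, `JobResult`) are hypothetical, and a real version would wrap an Airflow task rather than a plain function:

```python
from dataclasses import dataclass
from functools import wraps

@dataclass
class JobResult:
    row_count: int

def check_volume_increasing(get_previous_count):
    """Run a row-count validation after the wrapped job completes."""
    def decorator(job):
        @wraps(job)
        def wrapper(*args, **kwargs):
            result = job(*args, **kwargs)
            previous = get_previous_count()
            if result.row_count < previous:
                raise ValueError(
                    f"data volume shrank: {result.row_count} < {previous}"
                )
            return result
        return wrapper
    return decorator

# Hypothetical daily job; yesterday's run produced 1,000 rows.
@check_volume_increasing(get_previous_count=lambda: 1_000)
def daily_charges_job():
    return JobResult(row_count=1_050)

daily_charges_job()  # passes: 1,050 >= 1,000
```

The decorator pattern keeps validations declarative: each job lists its checks at the definition site, and a failed check fails the task before downstream consumers see bad data.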
That's not all. Stripe has also established fallback procedures for when these checks fail or jobs don't complete successfully. For less critical data, the system may automatically use the previous day's data. For higher priority data, it may rerun the pipeline using the previous day's data. And for the most critical data, the system halts the pipeline entirely until the issue is addressed. These procedures help avoid moving forward with questionable data.
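The tiered fallback policy described above can be sketched as a simple dispatch on dataset criticality. This is a minimal illustration, not Stripe's implementation; the criticality levels and action functions are assumptions standing in for real pipeline operations:

```python
from enum import Enum

class Criticality(Enum):
    LOW = "low"            # fall back to yesterday's data
    HIGH = "high"          # rerun the pipeline on yesterday's data
    CRITICAL = "critical"  # halt until a human investigates

# Stubs standing in for real pipeline operations.
def use_previous_day(dataset):
    return f"{dataset}: served previous day's data"

def rerun_with_previous_day(dataset):
    return f"{dataset}: reran pipeline on previous day's data"

def halt_pipeline(dataset):
    return f"{dataset}: pipeline halted pending investigation"

def handle_failed_job(dataset, criticality):
    """Dispatch to a fallback action based on dataset criticality."""
    if criticality is Criticality.LOW:
        return use_previous_day(dataset)
    if criticality is Criticality.HIGH:
        return rerun_with_previous_day(dataset)
    return halt_pipeline(dataset)
```

The key design choice is that "do nothing and continue" is never an option: every failure mode maps to an explicit, pre-agreed fallback.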
Dealing with eventual consistency
In the past, S3 followed an "eventual consistency" model. This means that after a successful write operation (such as PUT, POST, DELETE), subsequent read operations (GET) might not reflect the change for a brief period of time. This is because it takes a short while for the change to propagate to all replicas of the data across Amazon's infrastructure.
So, for instance, if you updated or deleted an object and immediately tried to read it, you might receive the old version of the object or encounter a "not found" error. Until S3 became strongly consistent in 2020, Stripe dealt with this issue by building a “metadata layer” to store extra information alongside S3, only allowing the data pipeline to continue once all necessary data was in place. This metadata layer eventually evolved into a new table format, providing more efficiency and insight into which data each job needs.
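The gating behavior of such a metadata layer can be sketched as a readiness check: the pipeline proceeds only once every expected partition has been registered as fully written. This is a toy in-memory illustration under assumed names (`MetadataStore`, `ready_to_proceed`), not Stripe's system:

```python
class MetadataStore:
    """Toy in-memory stand-in for a metadata layer kept alongside S3."""
    def __init__(self):
        self._partitions = set()

    def register(self, partition):
        """Record that a partition has been completely written."""
        self._partitions.add(partition)

    def list_partitions(self):
        return set(self._partitions)

def ready_to_proceed(store, expected):
    """Let the pipeline continue only once every expected partition
    is registered, regardless of what a raw S3 listing might show."""
    return set(expected) <= store.list_partitions()

store = MetadataStore()
store.register("charges/2023-01-01")
ready_to_proceed(store, ["charges/2023-01-01", "refunds/2023-01-01"])  # False
store.register("refunds/2023-01-01")
ready_to_proceed(store, ["charges/2023-01-01", "refunds/2023-01-01"])  # True
```

Because readiness is decided by the metadata store rather than by listing S3 directly, an eventually consistent (or partially written) object listing can't trick a downstream job into starting early.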
Type-safe libraries for Apache Spark
Stripe implemented type-safe libraries for Apache Spark to enforce correctness in data operations. Since Spark's built-in encoders didn't support all of Stripe's data types, Stripe implemented its own encoders covering every data type used at the company. With type-safe data and operations, pipelines became less prone to failure: the libraries catch issues early and stop errors from propagating downstream.
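Stripe's encoders are built against Spark's internal APIs, but the underlying idea, reject data that doesn't match its declared schema at the boundary rather than deep inside a job, can be illustrated in plain Python. Everything here (`Charge`, `encode_row`) is a hypothetical sketch:

```python
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class Charge:
    charge_id: str
    amount_cents: int

def encode_row(row_cls, raw):
    """Reject rows whose values don't match the declared field types,
    so bad data fails fast at ingestion instead of corrupting a job."""
    kwargs = {}
    for f in fields(row_cls):
        value = raw.get(f.name)
        if not isinstance(value, f.type):
            raise TypeError(
                f"{f.name}: expected {f.type.__name__}, "
                f"got {type(value).__name__}"
            )
        kwargs[f.name] = value
    return row_cls(**kwargs)

encode_row(Charge, {"charge_id": "ch_1", "amount_cents": 500})  # ok
```

A malformed row, say `amount_cents` arriving as the string `"500"`, raises immediately with a field-level error message, which is far cheaper to debug than a silently wrong aggregate hours later.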
Data observability platform
Stripe created an observability platform specifically for monitoring data and providing a comprehensive view of Stripe’s data platforms. It integrates with Airflow and the metadata layer to track key metrics for each data job, such as runtime, data size, and column names and types. This UI-based solution reduces stakeholders' dependence on the data science and platform teams: instead of filing a request, they can see directly whether required data is delayed by a failure or a bottleneck in the pipeline. While initially basic, the platform gave Stripe a foundation to build on for specifying fallback logic, test definitions, and other data correctness procedures.
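The per-job metrics the post describes can be pictured as a small structured record emitted after each run. This is a minimal sketch with assumed field names, not Stripe's schema:

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class JobMetrics:
    job_name: str
    runtime_seconds: float
    output_bytes: int
    columns: dict  # column name -> type name

def emit(metrics):
    # In a real system this record would land in the observability
    # store; here we just serialize it.
    return json.dumps(asdict(metrics), sort_keys=True)

record = JobMetrics(
    job_name="daily_charges",
    runtime_seconds=412.5,
    output_bytes=7_300_000,
    columns={"charge_id": "string", "amount": "long"},
)
emit(record)
```

Capturing column names and types per run is what makes drift visible: a comparison of today's record against yesterday's surfaces schema changes and size anomalies automatically.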
While Stripe, as a large tech company with many resources, chose to build its data observability platform in-house, this is where an off-the-shelf data observability platform like Bigeye can help. Bigeye’s autometrics and autothresholds make it easy to monitor the data in your warehouse and to receive automatic alerts when metrics drift out of bounds.
To ensure maximum accuracy for its most critical data, Stripe frequently recomputes the “universe” of data, reprocessing everything from the beginning. While this adds latency and cost, the benefits to data correctness outweigh these trade-offs. Re-computation also provides an opportunity to optimize infrastructure. Although recomputing at massive scale may be infeasible for companies with enormous data volumes, Stripe's data has remained manageable enough to recompute when needed.
Striking a balance: Trade-offs and efficiencies
Stripe's focus on correctness requires trade-offs: increased latency, unpredictability, and cost. But these stringent correctness procedures have also driven Stripe to build efficiency tools like the metadata layer and the Apache Iceberg table format. The metadata layer provides the information needed to reprocess only data that has changed, while Iceberg enables storing data in multiple locations and querying only what a specific job needs. The robust, self-service data systems built around correctness give Stripe confidence in its data.
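The "reprocess only what changed" role of the metadata layer can be sketched as a timestamp comparison per partition. The function and timestamp scheme here are illustrative assumptions, not Stripe's internals:

```python
def partitions_to_reprocess(last_processed, last_modified):
    """Using metadata timestamps, select only the partitions whose
    inputs changed after the last successful run (or were never run)."""
    return sorted(
        partition
        for partition, modified in last_modified.items()
        if modified > last_processed.get(partition, 0)
    )

# Toy timestamps: partition -> epoch-style counter.
last_processed = {"2023-01-01": 100, "2023-01-02": 100}
last_modified = {"2023-01-01": 90, "2023-01-02": 110, "2023-01-03": 105}

partitions_to_reprocess(last_processed, last_modified)
# → ["2023-01-02", "2023-01-03"]
```

Skipping the unchanged partition is where the savings come from: the full-universe recomputation remains available as a correctness backstop, while day-to-day runs touch only the delta.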
Data correctness has always been a top priority for Stripe, and the company has built it into every layer of its data stack. Comprehensive monitoring, automated checks, a metadata layer, type-safe libraries, observability platforms, and regular complete re-computations of data provide multiple levels of validation to ensure accurate data. By building data correctness into systems and processes from end to end, Stripe has established a model of data reliability engineering other companies would do well to follow.
Schema change detection