Seven Principles for Reliable Data Pipelines
How we applied Google’s SRE principles to data at Uber
It’s getting easier and faster to stand up data infrastructure and make data available for production use cases like self-service business intelligence, machine learning, and A/B testing. The barriers to writing and deploying data pipelines keep falling, and analysts (or analytics engineers) are able to own more of the modeling process, with fewer blocking dependencies on their data engineering teammates.
As analysts are increasingly empowered to directly author and update their own data pipelines, they also take on increasing responsibility for the reliability and quality of the data coming out.
Platformizing data pipelines at Uber
The founding team at Bigeye is largely made up of former members of Uber’s data platform. Between about 2015 and 2019, a similar pattern emerged there as internal data tooling got easier and easier to use.
Platformizing Uber’s pipeline writing tool enabled analysts anywhere in the company to move logic out of (sometimes hilariously) complicated SQL queries, into materialized datasets. Those datasets could then be searched in the catalog, and enabled far simpler, more performant queries. This helped feed the appetite for using data in nearly everything the company does.
But Uber is a large and complex organization, and not every team was equally good at reliably operating whatever pipelines they decided to spin up.
And while it wasn’t okay when pipeline problems affected internal dashboards, there was a whole new level of not okay when pipelines were feeding ML models and in-app features.
Principles for Data Reliability Engineering
“In general, an SRE team is responsible for availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning.” -Ben Treynor, VP Engineering at Google
Looking back, the single biggest differentiator between teams with “boring” pipelines that just worked and teams with “exciting” pipelines and regular firefights was how much a team looked at their section of the data model from a Site Reliability Engineering (SRE) perspective.
Here’s a look at those seven principles from Google’s SRE handbook, rewritten for analytics and data engineers, with accompanying examples from some of the internal tools and practices at Uber that supported them.
Principle #1 — Embrace risk
The only way to have perfectly reliable data is to not have any data. Software breaks in unexpected ways, and so do data pipelines. Software incidents get expensive quickly, and so do data incidents if you’re actually “data driven”. The best teams budget time to plan for identifying, mitigating, communicating, and learning from inevitable data pipeline incidents.
Uber example: put simply, teams that didn’t embrace risk (those who cranked out pipelines while postponing testing, skipping code review, and waiting for incidents to make themselves known) were the ones who had to explain themselves when an outage was discovered after weeks or months in the wild.
Principle #2 — Set standards
All those new pipelines are only useful if the business actually depends on them. And when someone depends on you, it’s wise to clarify what exactly they can depend on with hard numbers. Data teams can draw on the existing framework of SLAs (Service Level Agreements), SLOs (Service Level Objectives), and SLIs (Service Level Indicators).
Read more from Google themselves here, but in a nutshell:
1. Measure specific aspects of pipeline performance with Service Level Indicators (SLIs). Software engineering example: “pages returned in under 90ms” or “storage capacity remaining”. Analytics engineering example: “hours since dataset refreshed” or “percentage of values that match a UUID regex”.
2. Give each one a target called a Service Level Objective (SLO). Software engineering example: “99% of pages in the last 7 days returned in under 90ms”. Analytics engineering example: “less than 6 hours since dataset refreshed” or “at least 99.9% of values match a UUID regex”.
3. Put those targets together into a Service Level Agreement (SLA), describing in hard numbers how the dataset should perform. Software engineering example: how fast should a user expect the website to behave. Analytics engineering example: what can an analyst expect to be true when querying this dataset.
Uber example: an internal testing harness existed so pipeline authors could set up recurring data integration tests. The body of tests effectively enabled SLIs, SLOs, and thus an SLA to be created for any dataset and published into the data catalog (though they weren’t named SLI/SLO/SLA explicitly at the time).
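To make the SLI/SLO/SLA layering concrete, here is a minimal Python sketch built on the analytics examples above. The function names, the `SLOS` mapping, and the sample values are hypothetical and chosen for illustration; this is not Uber’s test harness or any particular tool’s API.

```python
import re
from datetime import datetime, timezone
from typing import Dict, List

UUID_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$",
    re.IGNORECASE,
)


# SLIs: each function measures one aspect of the dataset.
def hours_since_refresh(last_refreshed: datetime, now: datetime) -> float:
    return (now - last_refreshed).total_seconds() / 3600


def uuid_match_rate(values: List[str]) -> float:
    if not values:
        return 0.0  # assumption: an empty column should fail the SLO
    return sum(1 for v in values if UUID_RE.match(v)) / len(values)


# SLOs: one target per SLI, using the thresholds from the examples in the text.
SLOS = {
    "freshness_hours": lambda hours: hours < 6,
    "uuid_match_rate": lambda rate: rate >= 0.999,
}


def evaluate_sla(slis: Dict[str, float]) -> Dict[str, bool]:
    """The SLA is the full set of SLOs; report which ones are currently met."""
    return {name: SLOS[name](value) for name, value in slis.items()}


# Usage with made-up sample data.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
last_refreshed = datetime(2024, 1, 1, 8, 0, tzinfo=timezone.utc)
ids = ["123e4567-e89b-12d3-a456-426614174000", "not-a-uuid"]

report = evaluate_sla({
    "freshness_hours": hours_since_refresh(last_refreshed, now),
    "uuid_match_rate": uuid_match_rate(ids),
})
print(report)
```

With this sample data the freshness target is met and the UUID target is not, which is the kind of per-dataset report that could be published alongside a catalog entry.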
Principle #3 — Reduce toil
“Toil” is Google’s chosen word for the human work needed to operate your system — operational work — as opposed to engineering work improving it somehow. Examples: kicking an Airflow job or updating a schema manually.
Engineering data infrastructure to remove wasteful toil pays back in reduced overhead. Using tools like Fivetran can reduce the toil in ingesting data. Running Looker training sessions can reduce the toil of responding to BI requests. It’s worth spending time to find toil in your data architecture and automate it.
Uber example: internal tools like Marmaray and Piper reduced the amount of manual process placed on data engineering teams for things like ingesting new data into the data lake or creating a new pipeline.
Principle #4 — Monitor everything
It’s impossible for an engineering team to tell how a site is doing without knowing both performance metrics like Apdex and system metrics like CPU utilization. While most data teams have some degree of system monitoring like Airflow job statuses, fewer of them monitor their pipeline content the way a Google SRE would monitor application performance.
Uber example: the anomaly-detection based Data Quality Monitor identifies, prioritizes, and alerts on pipeline issues at scale, without the level of manual configuration needed to use the test harness mentioned above.
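As a rough illustration of monitoring pipeline content rather than job status, the sketch below flags a daily row count that deviates sharply from recent history using a simple z-score. This is a stand-in technique chosen for brevity, not the method used by Uber’s Data Quality Monitor; the function name, threshold, and sample counts are assumptions.

```python
from statistics import mean, stdev
from typing import Sequence


def is_anomalous(history: Sequence[float], latest: float, threshold: float = 3.0) -> bool:
    """Flag `latest` if it sits more than `threshold` standard deviations from the mean of `history`."""
    if len(history) < 2:
        return False  # not enough history to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return latest != mu  # perfectly flat history: any change is unusual
    return abs(latest - mu) / sigma > threshold


# Hypothetical row counts for the last seven daily partitions.
daily_row_counts = [10_120, 9_980, 10_050, 10_210, 9_940, 10_080, 10_000]

print(is_anomalous(daily_row_counts, 10_100))  # an ordinary day
print(is_anomalous(daily_row_counts, 4_200))   # a partial load
```

The same check can be pointed at other content metrics, such as null rates or distinct counts, without writing a hand-tuned test for each one.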
Principle #5 — Use automation
Data platform complexity can easily grow exponentially, but the capacity to manage it manually grows only linearly with headcount, which gets expensive. Automating manual processes frees up brainpower and time for tackling higher-order problems. This pays dividends during a data incident when time is money.
Uber example: the Experimentation Platform team scripted a common backfill pattern so their data engineers could easily fix impacted partitions in their incremental datasets during an incident response.
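A scripted backfill of this kind might look roughly like the sketch below, which assumes daily partitions and a caller-supplied `rebuild_partition` function. Both are hypothetical; the Experimentation Platform team’s actual script is not described in detail here.

```python
from datetime import date, timedelta
from typing import Callable, Dict, Iterator


def daily_partitions(start: date, end: date) -> Iterator[date]:
    """Yield each daily partition from `start` to `end`, inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


def backfill(start: date, end: date, rebuild_partition: Callable[[date], None]) -> Dict[date, str]:
    """Rebuild every impacted partition, recording failures instead of stopping at the first one."""
    results: Dict[date, str] = {}
    for partition in daily_partitions(start, end):
        try:
            rebuild_partition(partition)
            results[partition] = "ok"
        except Exception as exc:  # keep going so one bad day doesn't block the rest
            results[partition] = f"failed: {exc}"
    return results


# Usage: a stand-in rebuild function that just records what it was asked to do.
rebuilt = []
summary = backfill(date(2024, 3, 1), date(2024, 3, 3), rebuilt.append)
print(summary)
```

Returning a per-partition summary makes it easy to rerun only the partitions that failed.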
Principle #6 — Control releases
Making changes is ultimately how things improve, and how things break. Google defines a release process that teams can follow to avoid breakage, and automates the toil involved in deploying changes. This is a place data teams can borrow pretty directly: code review, CI/CD pipelines, and other DevOps practices all apply to releasing data pipeline code. After all, pipeline code is code at the end of the day.
Uber example: Piper pipeline jobs could be run locally first, then in a staging zone — on actual data but where they couldn’t cause harm — before finally running in production.
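The local, staging, production progression can be sketched as a simple promotion gate. The `promote` function and stage names below are illustrative assumptions, not Piper’s API.

```python
from typing import Callable, List, Sequence, Tuple

STAGES = ("local", "staging", "production")


def promote(job: Callable[[str], bool], stages: Sequence[str] = STAGES) -> Tuple[List[str], bool]:
    """Run `job` in each environment in order and stop at the first failure.

    Returns the stages that passed and whether the job reached the end.
    """
    passed: List[str] = []
    for stage in stages:
        if not job(stage):
            return passed, False  # later stages are never attempted
        passed.append(stage)
    return passed, True


# Usage: a hypothetical job whose checks fail on real data in staging.
attempted = []


def job(stage: str) -> bool:
    attempted.append(stage)
    return stage != "staging"


print(promote(job))
```

Because the gate stops at the first failing stage, a job that breaks on staging data never gets the chance to run in production.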
Principle #7 — Maintain simplicity
The enemy of reliability is complexity. Complexity can’t be completely eliminated — the pipeline is doing something to the data after all — but it can be reduced. Minimizing and isolating the complexity in any one pipeline job goes a long way toward keeping it reliable.
Uber example: the Experimentation Platform team’s pipeline for crunching A/B test results was complex, but breaking out logic into individual jobs with specific tasks gave each stage materialized results that could be more easily debugged.
Barriers to building data infrastructure have fallen dramatically, and a relatively small team can set up a data platform and crank out pipelines in a fraction of the time it took just a few years ago. The decoupling of content-focused analytics engineers and infrastructure-focused data engineers brings a new level of speed when it comes to putting data into production.
The principles behind Google’s SRE function have helped it achieve some of the most reliable products in the world in terms of uptime. Data teams don’t need to reinvent the wheel.
If there’s something to be learned from Uber’s pipeline platformization exercise several years ago, it’s that teams who don’t make time for at least some level of SRE thinking are the most likely to eventually slow down in the face of mounting reliability challenges. And hopefully their pipelines aren’t the ones being used in production :)