October 1, 2021

Seven Principles for Reliable Data Pipelines

Kyle Kirwan

Introduction

It’s getting easier and faster to stand up data infrastructure and make data available for production use cases like self-service business intelligence, machine learning, and A/B testing. The barriers to writing and deploying data pipelines keep falling, and analysts (or analytics engineers) are able to own more of the modeling process, with fewer blocking dependencies on their data engineering teammates.

As analysts are increasingly empowered to directly author and update their own data pipelines, they also take on increasing responsibility for the reliability and quality of the data coming out.

Platformizing data pipelines at Uber

The founding team at Bigeye is largely made up of former members of Uber’s data platform. Between about 2015 and 2019, a similar pattern emerged there as internal data tooling got easier and easier to use.

Platformizing Uber’s pipeline writing tool enabled analysts anywhere in the company to move logic out of (sometimes hilariously) complicated SQL queries, into materialized datasets. Those datasets could then be searched in the catalog, and enabled far simpler, more performant queries. This helped feed the appetite for using data in nearly everything the company does.

But Uber is a large and complex organization, and not every team was equally good at reliably operating whatever pipelines they decided to spin up.

And while it wasn’t okay when pipeline problems affected internal dashboards, there was a whole new level of not okay when pipelines were feeding ML models and in-app features.

“Soooo…I’m gonna go tell the mobile app people about this now.”

Principles for Data Reliability Engineering

“In general, an SRE team is responsible for availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning.” -Ben Treynor, VP Engineering at Google

Looking back, the single biggest differentiator between teams with “boring” pipelines that just worked and teams with “exciting” pipelines and regular firefights was how much a team looked at their section of the data model from a Site Reliability Engineering (SRE) perspective.

Here’s a look at the seven principles from Google’s SRE handbook, rewritten for analytics and data engineers, with accompanying examples from some of the internal tools and practices at Uber that supported them.

Principle #1 — Embrace risk

The only way to have perfectly reliable data is to not have any data. Software breaks in unexpected ways, and so do data pipelines. Software incidents get expensive quickly and, if you’re actually “data driven”, so do data incidents. The best teams budget time to plan for identifying, mitigating, communicating, and learning from inevitable data pipeline incidents.

Uber example: put simply, teams that didn’t embrace risk (who cranked out pipelines while postponing testing, skipping code review, and waiting for incidents to make themselves known) were the ones who had to explain themselves when an outage was discovered after weeks or months in the wild.

Principle #2 — Set standards

All those new pipelines are only useful if the business actually depends on them. And when someone depends on you, it’s wise to clarify what exactly they can depend on with hard numbers. Data teams can draw on the existing framework of SLAs (Service Level Agreements), SLOs (Service Level Objectives), and SLIs (Service Level Indicators).

Google’s own SRE material covers these in more depth, but in a nutshell:

1. Measure specific aspects of pipeline performance with Service Level Indicators (SLIs). Software engineering example: “pages returned in under 90ms” or “storage capacity remaining”. Analytics engineer example: “hours since dataset refreshed” or “percentage of values that match a UUID regex”.

2. Give each one a target called a Service Level Objective (SLO). Software engineering example: “99% of pages in the last 7 days returned in under 90ms”. Analytics engineering example: “less than 6 hours since dataset refreshed” or “at least 99.9% of values match a UUID regex”.

3. Put those targets together into a Service Level Agreement (SLA), describing in hard numbers how the dataset should perform. Software engineering example: how fast should a user expect the website to behave. Analytics engineering example: what can an analyst expect to be true when querying this dataset.
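The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the Uber test harness: the function names and the sample data are hypothetical, and the thresholds simply reuse the analytics engineering examples from the list (under 6 hours since refresh, at least 99.9% of values matching a UUID regex).

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Sequence

# Hypothetical helper names; thresholds come from the examples in the list above.
UUID_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$",
    re.IGNORECASE,
)


def hours_since_refresh(last_refreshed: datetime, now: datetime) -> float:
    """SLI: hours elapsed since the dataset was last refreshed."""
    return (now - last_refreshed).total_seconds() / 3600


def uuid_match_rate(values: Sequence[str]) -> float:
    """SLI: fraction of values that match the UUID pattern."""
    if not values:
        return 0.0
    return sum(1 for v in values if UUID_RE.match(v)) / len(values)


def evaluate_sla(last_refreshed: datetime, values: Sequence[str], now: datetime) -> dict:
    """Compare each SLI to its SLO; the SLA holds only if every SLO is met."""
    slos = {
        "freshness_under_6h": hours_since_refresh(last_refreshed, now) < 6,
        "uuid_match_at_least_99_9_pct": uuid_match_rate(values) >= 0.999,
    }
    return {"slos": slos, "sla_met": all(slos.values())}


# Made-up sample: refreshed 2 hours ago, 2 malformed IDs out of 1,000.
now = datetime(2021, 10, 1, 12, 0, tzinfo=timezone.utc)
ids = ["123e4567-e89b-12d3-a456-426614174000"] * 998 + ["not-a-uuid", ""]
report = evaluate_sla(now - timedelta(hours=2), ids, now)
print(report)
```

In this made-up sample the freshness SLO passes but the UUID SLO does not, so the SLA as a whole is reported as unmet. Publishing a report like this next to the dataset is one way to tell consumers what they can depend on.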

Uber example: an internal testing harness existed so pipeline authors could set up recurring data integration tests. The body of tests effectively enabled SLIs, SLOs, and thus an SLA to be created for any dataset and published into the data catalog (though they weren’t named SLI/SLO/SLA explicitly at the time).

Principle #3 — Reduce toil

“Toil” is Google’s chosen word for the human work needed to operate your system (operational work), as opposed to engineering work that improves it somehow. Examples: manually kicking off an Airflow job or updating a schema by hand.

Engineering data infrastructure to remove wasteful toil pays back in reduced overhead. Using tools like Fivetran can reduce the toil in ingesting data. Doing Looker training sessions can reduce the toil of responding to BI requests. It’s worth spending time to find toil in your data architecture and automate it.

Uber example: internal tools like Marmaray and Piper reduced the amount of manual process placed on data engineering teams for things like ingesting new data into the data lake or creating a new pipeline.

Principle #4 — Monitor everything

It’s impossible for an engineering team to tell how a site is doing without knowing both performance metrics like Apdex and system metrics like CPU utilization. While most data teams have some degree of system monitoring like Airflow job statuses, fewer of them monitor their pipeline content the way a Google SRE would monitor application performance.

Uber example: the anomaly-detection based Data Quality Monitor identifies, prioritizes, and alerts on pipeline issues at scale, without the level of manual configuration needed to use the test harness mentioned above.
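To make content monitoring concrete, here is a deliberately simple sketch that flags an unusual daily row count with a z-score. It is not how Uber’s Data Quality Monitor works (the source doesn’t describe its internals); the function name, the threshold of 3 standard deviations, and the sample counts are all assumptions chosen for illustration.

```python
from statistics import mean, stdev


def is_anomalous(history, latest, z_threshold=3.0):
    """Flag `latest` if it is more than `z_threshold` standard deviations from the mean of `history`."""
    if len(history) < 2:
        return False  # not enough history to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return latest != mu  # any change from a perfectly flat history is suspicious
    return abs(latest - mu) / sigma > z_threshold


# Made-up daily row counts for one table.
daily_row_counts = [10_120, 9_980, 10_050, 10_210, 9_940, 10_080, 10_010]
print(is_anomalous(daily_row_counts, 10_100))  # a typical day
print(is_anomalous(daily_row_counts, 4_300))   # a sharp drop in volume
```

The appeal of this style of check is that the threshold is learned from the table’s own history, so nobody has to hand-write an expected row count for every dataset.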

Principle #5 — Use automation

Data platform complexity can easily grow exponentially, but a team’s capacity to manage it manually grows only linearly with headcount, which gets expensive. Automating manual processes frees up brainpower and time for tackling higher-order problems. This pays dividends during a data incident, when time is money.

Uber example: the Experimentation Platform team scripted a common backfill pattern so their data engineers could easily fix impacted partitions in their incremental datasets during an incident response.
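A scripted backfill might look roughly like the sketch below. The source doesn’t show the Experimentation Platform team’s actual script, so this assumes daily partitions keyed by ISO date and a caller-supplied `rebuild_partition` function; both are hypothetical.

```python
from datetime import date, timedelta


def partitions_to_backfill(start: date, end: date):
    """Return the daily partition keys, inclusive, affected by an incident."""
    if end < start:
        raise ValueError("end must not be before start")
    days = (end - start).days
    return [(start + timedelta(days=i)).isoformat() for i in range(days + 1)]


def backfill(start: date, end: date, rebuild_partition):
    """Rebuild each affected partition in order and return the keys that were rebuilt."""
    rebuilt = []
    for key in partitions_to_backfill(start, end):
        rebuild_partition(key)  # e.g. re-run the incremental job for that one day
        rebuilt.append(key)
    return rebuilt


# Placeholder rebuild step; a real one would trigger the pipeline for `key`.
done = backfill(date(2021, 9, 28), date(2021, 10, 1), rebuild_partition=lambda key: None)
print(done)
```

Having the date arithmetic and the loop written down once means that, mid-incident, the engineer only has to supply the affected date range.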

Principle #6 — Control releases

Making changes is ultimately how things improve, and how things break. Google defines a process that teams can follow to avoid breakage, and automates the toil involved in deploying it. This is a place data teams can borrow pretty directly: code review, CI/CD pipelines, and other DevOps practices all apply to releasing data pipeline code. After all, pipeline code is code at the end of the day.

Uber example: Piper pipeline jobs could be run locally first, then in a staging zone — on actual data but where they couldn’t cause harm — before finally running in production.
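The local-then-staging-then-production flow can be sketched as a simple gate. This is not Piper’s API (which the source doesn’t describe); the stage names follow the paragraph above, while the job, the check, and the sample rows are hypothetical.

```python
PRE_PRODUCTION_STAGES = ("local", "staging")


def first_failing_stage(job, inputs_by_stage, check):
    """Run `job` on each pre-production input in order.

    Returns the name of the first stage whose output fails `check`,
    or None if every stage passes and the job can go to production.
    """
    for stage in PRE_PRODUCTION_STAGES:
        output = job(inputs_by_stage[stage])
        if not check(output):
            return stage
    return None


def fares_in_dollars(rows):
    """Hypothetical job: convert fares from cents to dollars."""
    return [row["fare_cents"] / 100 for row in rows]


def no_negative_fares(fares):
    """Hypothetical check applied to the job's output."""
    return all(fare >= 0 for fare in fares)


inputs = {
    # Small hand-made sample used on a laptop.
    "local": [{"fare_cents": 1250}, {"fare_cents": 830}],
    # Realistic data in an isolated zone, including a case the sample missed.
    "staging": [{"fare_cents": 1250}, {"fare_cents": -400}],
}
blocked_at = first_failing_stage(fares_in_dollars, inputs, no_negative_fares)
print(blocked_at or "safe to release")
```

Here the hand-made local sample passes, but the staging data contains a negative fare the sample didn’t anticipate, so the release is blocked before it reaches production.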

Principle #7 — Maintain simplicity

The enemy of reliability is complexity. Complexity can’t be completely eliminated — the pipeline is doing something to the data after all — but it can be reduced. Minimizing and isolating the complexity in any one pipeline job goes a long way toward keeping it reliable.

Uber example: the Experimentation Platform team’s pipeline for crunching A/B test results was complex, but breaking out logic into individual jobs with specific tasks gave each stage materialized results that could be more easily debugged.
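As a rough sketch of that idea, the example below splits a toy A/B computation into three small jobs and keeps each stage’s output. The stage functions and event fields are hypothetical and far simpler than a real experimentation pipeline; an in-memory dict stands in for materialized tables.

```python
def run_stages(stages, data):
    """Run each (name, fn) stage in order and keep every intermediate result."""
    materialized = {}
    for name, fn in stages:
        data = fn(data)
        materialized[name] = data
    return materialized


def drop_unassigned(events):
    """Stage 1: discard events with no experiment variant."""
    return [e for e in events if e.get("variant") is not None]


def count_by_variant(events):
    """Stage 2: count users and conversions per variant."""
    counts = {}
    for e in events:
        c = counts.setdefault(e["variant"], {"users": 0, "conversions": 0})
        c["users"] += 1
        c["conversions"] += e["converted"]
    return counts


def conversion_rates(counts):
    """Stage 3: turn counts into a conversion rate per variant."""
    return {v: c["conversions"] / c["users"] for v, c in counts.items()}


events = [
    {"variant": "A", "converted": 1},
    {"variant": "A", "converted": 0},
    {"variant": "B", "converted": 1},
    {"variant": None, "converted": 1},
    {"variant": "B", "converted": 1},
]
results = run_stages(
    [("cleaned", drop_unassigned), ("counts", count_by_variant), ("rates", conversion_rates)],
    events,
)
print(results["counts"])
print(results["rates"])
```

If the final rates look wrong, the intermediate `cleaned` and `counts` outputs can be inspected directly instead of re-deriving them from one large query.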

Closing thoughts

Barriers to building data infrastructure have fallen dramatically, and a relatively small team can set up a data platform and crank out pipelines in a fraction of the time it took just a few years ago. The decoupling of content-focused analytics engineers and infrastructure-focused data engineers brings a new level of speed when it comes to putting data into production.

The principles behind Google’s SRE function have helped it achieve some of the most reliable products in the world in terms of uptime—data teams don’t need to reinvent the wheel.

If there’s something to be learned from Uber’s pipeline platformization exercise several years ago, it’s that teams who don’t make time for at least some level of SRE thinking are the most likely to eventually slow down in the face of mounting reliability challenges. And hopefully their pipelines aren’t the ones being used in production :)

If you want to learn more, reach out to Kyle Kirwan and request a demo to see Bigeye in action.

about the author

Kyle Kirwan

Chief Product Officer, Bigeye

Kyle Kirwan is Co-Founder and Chief Strategy Officer of Bigeye, where he leads strategic partnerships, prototype development, and other zero-to-one projects.

Kyle’s journey to founding Bigeye began at Uber, where he helped scale the company’s experimentation and data platforms during a period of hypergrowth. As a product leader and former founding data scientist on Uber’s experimentation platform, he worked on standardizing metrics across thousands of A/B tests that shaped rider, driver, and pricing experiences for millions of users.

It was at Uber that Kyle met Egor Gryaznov. Shortly after Egor joined, he launched Uber’s first SQL bootcamp. Kyle signed up partly out of curiosity, and partly to make sure the new guy actually knew his stuff. They quickly bonded over giving each other increasingly complex SQL challenges to solve.

As Uber’s data ecosystem grew to hundreds of petabytes and thousands of weekly users, Kyle saw a pattern emerge: testing the data pipelines was valuable but didn’t scale. His team experimented with using machine learning models on the daily data profiles of tables in the data lake to see if anomalies could be identified without manually writing data quality checks. This technique would later be termed data observability.

In 2019, Kyle and Egor co-founded Bigeye to use the lessons learned at Uber to transform data management in the enterprise. Today Bigeye serves some of the world’s largest organizations and ensures their data is trustworthy, and that their enterprise AI initiatives are grounded in that trusted data.
