How to become a Data Reliability Engineer
There's a need for a new kind of role within the data organization – one dedicated to the quality, observability, and maintenance of data. Enter the data reliability engineer.
These days nearly every business is consuming more data than ever before: for analytics about their products and services, fraud detection, in-app features like product recommendations, financial forecasting, and a whole litany of other applications. That’s resulting in more data sources, more data in their warehouses, more complex pipelines, and—crucially—higher stakes when any of those moving pieces breaks.
All this points to a need for a new kind of role within the data organization – one dedicated to the quality, observability, and maintenance of data. Somebody who helps keep everything reliable so these high-stakes use cases don’t break, despite the increasing scale and complexity of their pipelines.
What is data reliability engineering?
Data Reliability Engineering (DRE), a term inspired by Google's Site Reliability Engineering, refers to the work of creating standards, processes, alignment, and tooling to keep data applications (like dashboards and ML models) reliable, without slowing down the organization’s ability to handle more data and evolve its data pipelines. Data Reliability Engineers keep data quality high and data moving on time, keep analytics tools and machine learning models reliable, and make sure data engineering and data science teams don’t have to slow down.
Despite the increasingly critical role data engineering and data science play in creating customer experiences and driving revenue, many data teams still rely on ad-hoc practices: spot-checking data, putting out fires, and hand-rolling unscalable tools like SQL-and-Grafana monitoring. Too often they are the last to find out about data problems.
As an industry, if we’re going to shift from these bespoke, manual one-off fixes to organized, repeatable, scalable processes, we should borrow from the pressure-tested best practices of modern software engineering and DevOps. These techniques are used at companies like Google, Facebook, and Stripe to ship changes to massive and complex applications, at high speed, with a spectacularly low overall rate of impact to their customers.
Here are seven principles adapted from Site Reliability Engineering:
The Seven Principles of Data Reliability Engineering
Accept risk: In any complex system it’s unavoidable that something will go wrong at some point. The only way to have zero defects is to prevent any change to the entire system—which isn’t possible in today’s fast moving data environments. While it’s important to reduce the rate at which problems occur, it’s also important to have processes and controls in place to minimize the impact of problems when they do inevitably happen.
Monitor everything: If problems aren't recognized, they can't be controlled or minimized. Monitoring and alerting provide teams with the visibility they need to figure out what's wrong and how to remedy it. For infrastructure and apps, observability tooling is well-established, but for data, it is still in its infancy.
Set standards: Is the data of good quality or not? For teams to make progress, they have to move from subjectivity to clear, measurable, and agreed-upon definitions. It will be difficult to do anything if the concept of good or bad is ambiguous or lacks alignment. Standards-setting tools such as service level indicators (SLIs), service level objectives (SLOs), and service level agreements (SLAs) should be put in place as an explicit benchmark for data expectations.
Reduce toil: "Toil" refers to the human labor required to simply run your system — the pure operational work — as opposed to engineering work that enhances it. Manually restarting an Airflow process or rotating the credentials used in a data replication job are two examples. Every bit of toil removed from the system reduces the number of error-prone human tasks that can lead to an outage, and it frees up human brainpower to spend on improvements.
Use automation: Data platform complexity can grow exponentially in a short amount of time. If you choose to manage it manually, you need to add expensive headcount and resources to manage it. If you automate, you free up your team’s brainpower to tackle problems of a higher order, plus you minimize the need for headcount. Your streamlined, automated teams will pay dividends during a data incident, when time is money.
Control releases: Making changes is how things improve in the end, but it's also how they break. Code review, testing outputs from an ETL job in a staging schema, and data pipeline tests are examples of techniques that can help catch errors before releasing potentially breaking changes into production.
Maintain simplicity: Complexity is the enemy of reliability. Complexity cannot be totally eliminated — after all, the pipeline is altering the data — but it may be decreased. It goes a long way toward keeping a pipeline operation reliable if the complexity is minimized and isolated.
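To make the "Set standards" principle concrete, here is a minimal sketch of a freshness SLI measured against an SLO. The `FreshnessSLO` class, the one-hour threshold, the 95% target, and the sample lag values are all illustrative assumptions, not something prescribed by any particular tool.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Dict, Sequence


@dataclass(frozen=True)
class FreshnessSLO:
    """Hypothetical SLO: the table may lag by at most `max_lag`
    in at least `target` (a fraction) of all checks."""
    max_lag: timedelta
    target: float


def freshness_sli(observed_lags: Sequence[timedelta], max_lag: timedelta) -> float:
    """SLI: fraction of checks at which the table was fresh enough."""
    if not observed_lags:
        raise ValueError("need at least one observation")
    good = sum(1 for lag in observed_lags if lag <= max_lag)
    return good / len(observed_lags)


def evaluate(observed_lags: Sequence[timedelta], slo: FreshnessSLO) -> Dict[str, object]:
    """Compare the measured SLI with the SLO target."""
    sli = freshness_sli(observed_lags, slo.max_lag)
    return {"sli": sli, "slo_met": sli >= slo.target}


# Made-up sample: minutes of lag seen at ten consecutive checks.
slo = FreshnessSLO(max_lag=timedelta(hours=1), target=0.95)
lags = [timedelta(minutes=m) for m in (10, 20, 45, 90, 30, 15, 50, 5, 25, 40)]
report = evaluate(lags, slo)
print(report)
```

With these sample values, one check in ten exceeds the one-hour limit, so the SLI comes out below the 95% target and the SLO is reported as missed. In practice the lags would come from warehouse metadata or an observability tool rather than a hard-coded list.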
What do Data Reliability Engineers do?
Data Reliability Engineers can be one role within a versatile data team that might also include data engineers, software engineers, analytics engineers, and even data scientists. Depending on the level of specialization within the team, they might work on infrastructure like Snowflake and Airflow, pipeline code like dbt models, observability and testing tools like Datadog, Bigeye, and Great Expectations, processes that are defined in spreadsheets and documents, or all of the above!
Like Site Reliability Engineers (SREs), their analogs in software engineering, DREs have the goal of setting and maintaining standards for defending the reliability of production data, while enabling velocity for data and analytics engineers. They are also the organization’s commander in the case of a data outage: when there’s a problem with the data that could impact production (an analytics dashboard, a machine learning model, etc.), they advise on fixing data quality and availability issues. Beyond being reactive, they are also in charge of preemptively identifying and fixing potential problems, and coming up with automated ways of testing and validating data.
Areas that DREs would have purview over include:
- Data lifecycle procedures (e.g. when and how data gets deprecated)
- Data SLA definition and documentation
- Data observability strategy and implementation
- Data pipeline code review and testing framework
- Data outage triage and response process
- Data ownership strategy and documentation
- Education and culture-building (e.g. internal roadshow to explain data SLAs)
Like SREs, DREs don’t just put out fires. They put the guardrails in place to prevent the fires from happening. They enable agility for analytics engineers and data scientists, keeping them moving quickly knowing that safety guards are in place to prevent changes to the data model from impacting production. Data teams are always balancing speed with reliability. The Data Reliability Engineer owns the strategies for achieving that balance.
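One guardrail of the kind described above is a pre-release check that compares the schema of a staging table with the schema production consumers depend on. The sketch below is a minimal illustration: the column names and types are invented, and treating added columns as non-breaking is an assumption that will not hold for every consumer.

```python
from typing import Dict, List


def diff_schema(production: Dict[str, str], staging: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare two {column_name: type} mappings and report the differences."""
    prod_cols, stage_cols = set(production), set(staging)
    return {
        "removed": sorted(prod_cols - stage_cols),
        "added": sorted(stage_cols - prod_cols),
        "retyped": sorted(
            col for col in prod_cols & stage_cols if production[col] != staging[col]
        ),
    }


def is_breaking(diff: Dict[str, List[str]]) -> bool:
    """Assumption: dropped or retyped columns can break downstream consumers;
    newly added columns are treated as safe."""
    return bool(diff["removed"] or diff["retyped"])


# Hypothetical schemas for an orders table.
production = {"order_id": "INTEGER", "amount": "NUMERIC", "created_at": "TIMESTAMP"}
staging = {
    "order_id": "INTEGER",
    "amount": "TEXT",  # type changed in the proposed release
    "created_at": "TIMESTAMP",
    "currency": "TEXT",  # new column
}

diff = diff_schema(production, staging)
print(diff, "breaking:", is_breaking(diff))
```

A check like this could run in CI before a pipeline change is promoted, with the two mappings read from the warehouse's information schema instead of being hard-coded.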
Technical skills needed to become a Data Reliability Engineer
Data Reliability Engineers are expected to be familiar with tools and concepts like:
- OLTP databases
  - PostgreSQL, MySQL, SQL Server, and others
- OLAP databases / data warehouses
  - Snowflake, BigQuery, and Redshift
- Data lake technologies
  - Databricks, Presto, Azure Synapse, and Hive
- Observability and testing for detecting and resolving issues
  - Infrastructure observability tools like Datadog
  - Data observability tools like Bigeye
  - Data testing tools like Great Expectations and dbt tests
  - Definition and tracking of data SLAs, SLOs, and SLIs
- Discovery and governance tools for managing data ownership and documentation
  - Alation, Collibra, and Immuta
  - Select Star, Stemma, and Castor
- Data pipeline tools
  - Orchestration tools like Airflow, Prefect, and Dagster
  - Transformation tools like dbt
  - Reverse-ETL tools like Census and Hightouch
- Analytics and ML tools used by their counterparts in Analytics
- Infrastructure provisioning and automation
  - Kubernetes, Terraform
- Broad knowledge of networking
  - HTTPS, FTP, VPNs, DNS, firewalls, load balancing
Why should I change my title to Data Reliability Engineer?
It’s possible that you’re already performing many of the duties and have many of the skills listed above, just under a different title, like data engineer, data scientist, software engineer, or DevOps engineer. Making the switch official can help your organization recognize the importance of that work, and give you the freedom to prioritize it appropriately, making sure it doesn’t get sidelined in favor of other projects like extending your data model. Ultimately this benefits the organization by ensuring someone is thinking comprehensively about how to measure and defend the reliability of your data, while maintaining data engineering and data science velocity.
Some reasons to make the switch:
- More recognition for the work you’re probably already doing!
- Easier for you to fully prioritize data reliability projects
- Easier to hire and interview for more DRE teammates in the future
- Centralizes responsibility for reliability engineering with you, freeing up your teammates
- Clearer career progression: just as site reliability engineering has become a separate ladder within engineering in many organizations, having a specific, separate title for data reliability can accommodate certain distinct demands or criteria, for instance additional compensation for being on call for data pipelines.
In short: you’ll be getting clear buy-in from your organization that data reliability matters, is worth dedicating real effort toward, and is worth compensating for appropriately if successful.
Persuading your company that they need to focus on data reliability
In the early days of an organization, particularly a fast-growing one, there will often be little focus on data quality, freshness, and overall reliability. Many times, it’s only a crisis (an outage that wasn’t caught, a machine learning model that made an embarrassing determination) that finally spurs organizations to change.
It’s natural to focus on advancing the business and ignore potential risks from data outages until the organization has experienced one. Don’t get too hung up on this. Unless it’s truly an existential threat to the business, only small investments in data reliability engineering are likely to get buy-in. But once an outage does occur, you can be prepared with an action plan, and get buy-in while everyone still feels the heat and is ready to dedicate resources toward prevention.
In as quantitative a fashion as possible, demonstrate the cost of a lack of data reliability, including:
- How does a lack of data reliability directly affect engineer productivity? How much time does the average data engineer, analytics engineer, or data scientist spend reacting to data reliability issues?
- How does data reliability affect executives’ ability to make good business decisions? If you surveyed the executive team, how many would say they trust the data they make decisions with, on an 8-out-of-10 or higher level? What percentage of decisions are still made without data backing them up?
- How reliable are the organization’s machine learning models, and how expensive might an outage be if they were impacted? What’s the order of magnitude for the dollars-per-hour of outage? $1000/hr? $10,000/hr? At companies like Facebook and Uber, $1M-$10M/hr is entirely feasible depending on the model affected.
- How might a lack of data reliability open the company up to embarrassment and error? Are there compliance or PR risks if customer data is inaccurate? You can try the Wall Street Journal test for this: what if the WSJ discovered that all your customers were sent invoices for 3x their actual usage? Would the business be impacted by this news coverage?
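The first and third questions above reduce to simple arithmetic once you have estimates in hand. The sketch below shows one way to add them up. Every input is a placeholder to be replaced with your own organization's figures (the $10,000/hr outage cost is borrowed from the range mentioned above), and the function names and the 48-week working year are assumptions made for illustration.

```python
def annual_toil_cost(engineers: int, hours_per_week: float,
                     hourly_cost: float, weeks_per_year: int = 48) -> float:
    """Yearly cost of engineer time spent reacting to data reliability issues."""
    return engineers * hours_per_week * hourly_cost * weeks_per_year


def annual_outage_cost(outages_per_year: float, hours_per_outage: float,
                       cost_per_hour: float) -> float:
    """Yearly cost of outages in data-dependent applications such as ML models."""
    return outages_per_year * hours_per_outage * cost_per_hour


# Placeholder inputs: substitute survey results and incident records.
toil = annual_toil_cost(engineers=8, hours_per_week=6, hourly_cost=90)
outages = annual_outage_cost(outages_per_year=4, hours_per_outage=3, cost_per_hour=10_000)

print(f"Reactive engineering time: ${toil:,.0f} per year")
print(f"Outage impact:             ${outages:,.0f} per year")
print(f"Estimated total:           ${toil + outages:,.0f} per year")
```

Even a rough total like this gives leadership a concrete number to weigh against the cost of a reliability initiative.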
You may then present a Data Reliability Initiative plan with concrete tasks, for example:
- Defining data owners for each pipeline step from origin through to the end application
- Setting data SLAs, SLOs, and SLIs
- Investing in data observability tools like Bigeye
- Investing in data pipeline testing with dbt or Great Expectations
- Defining an on-call rotation for data reliability problems identified through testing and observability
- Designating certain existing engineers as data reliability engineers (or hiring for this role in a new posting)
Have you successfully made the push for Data Reliability Engineering in your org? We want to hear from you, and share your learnings with the community. Reach out to email@example.com and tell us what’s worked for you!