May 20, 2022

A day in the life of a data reliability engineer

We researched recent job posts to gather a series of common responsibilities that candidates might expect to find in a DRE role.

Kyle Kirwan

Data’s role in the modern organization continues to evolve, but one thing is certain: data sits at the forefront of the most important business decisions.

The volume of available data has increased exponentially in recent years. There’s been an explosion of tools, new technologies, and sources from which to pull data. “Big data” has been a premier buzzword in the startup world for more than a decade. And while all of that data presents new opportunities for a competitive edge, someone needs to make sense of it first.

Historically, that responsibility has fallen to a hodgepodge of business, operations, and technical individuals and teams. But today, many teams are choosing to formalize the role, elevating it as a key driver of organizational strategy and decision-making.

When data impacts everything from collaboration to customer experience to marketing to product development, the data reliability engineer is a linchpin that unites teams and informs organizational cohesion.

What is data reliability engineering?

The emerging field of data reliability engineering (DRE for short) focuses on maintaining fresh, accurate, high-quality data while streamlining repetitive tasks that block data and engineering teams.

In the early 2000s, Google convened the first Site Reliability Engineering (SRE) team to reduce the downtime and latency that are typical byproducts of software development. Since then, the principles and practices of SRE have been widely applied to infrastructure and operations problems, wherever teams are looking to create reliable, scalable software.

DRE brings a similar set of principles and practices to data. DRE takes a holistic approach to managing all parts of the data infrastructure within an organization. That infrastructure may include data pipelines, databases, CI/CD, data warehouses, archives, change management, deployments, and cross-team communication. In short, the main aim of DRE is to bring cohesion, discipline, and automation to the data infrastructure.

The seven principles of data reliability engineering

At Bigeye, we looked at the overarching practices that data teams use to maintain quality and reliability in their data. From there, we developed a set of seven principles for reliable data pipelines. Those principles are:

  1. Embracing Risk – The only way to have perfectly reliable data is to not have any data at all. Data pipelines break in unexpected ways—embrace the risk and plan for how to manage it effectively.
  2. Set Standards – When someone depends on data, it's wise to clarify what exactly they can depend on with precise definitions, hard numbers, and explicit cross-team agreements.
  3. Reduce Toil – Removing repetitive manual tasks needed to operate your data platform repays dividends in reduced overhead and fewer human errors.
  4. Monitor Everything – It's impossible for a data team to understand how their data and infrastructure are behaving without comprehensive, always-on monitoring.
  5. Use Automation – Automating manual processes reduces manual mistakes and frees up brainpower and time for tackling higher-order problems.
  6. Control Releases – Making changes is ultimately how things improve, and how things break, and having a process for reviewing and releasing data pipeline code helps you ship improvements without causing breakage.
  7. Maintain Simplicity – The enemy of reliability is complexity. Minimizing and isolating the complexity in any one pipeline job goes a long way toward keeping it reliable.  
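In practice, several of these principles—Set Standards, Monitor Everything, Use Automation—reduce to small automated checks against an agreed threshold. As a minimal sketch (the one-hour SLA and the function name are illustrative assumptions, not anything specific to Bigeye's product), a data freshness check might look like:

```python
from datetime import datetime, timedelta, timezone

def is_fresh(last_loaded_at, max_lag=timedelta(hours=1), now=None):
    """Return True when the most recent load falls within the agreed
    freshness SLA -- the cross-team "hard number" from principle 2."""
    now = now or datetime.now(timezone.utc)
    return (now - last_loaded_at) <= max_lag

# A table loaded 30 minutes ago meets a 1-hour SLA; one loaded
# 2 hours ago should trigger an alert instead.
checkpoint = datetime(2022, 5, 20, 12, 0, tzinfo=timezone.utc)
fresh = is_fresh(checkpoint - timedelta(minutes=30), now=checkpoint)
stale = is_fresh(checkpoint - timedelta(hours=2), now=checkpoint)
```

A scheduler would run a check like this every time a pipeline lands and page the team only on failure—the "monitor everything, automate the toil" loop in miniature.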

Data reliability engineering qualifications and responsibilities

Companies like Equifax, Procore, and Mythical Games are actively seeking to grow DRE teams. A typical DRE role seeks candidates who are looking to help transform business through data analysis, investigation, and automation. One job posting from Microsoft invites DRE candidates to “come and help us build the most reliable & efficient datacenter infrastructure on the planet.” So, what are some of the common responsibilities and qualifications that companies seek as they expand their DRE operations?


We researched recent job posts to gather a series of common responsibilities that candidates might expect to find in a DRE role. Those responsibilities include:

  • Researching and identifying data-related problems or potential problems and driving them to resolution
  • Defining business rules that determine governance and data quality
  • Assisting in writing tests that validate business rules
  • Rigorously testing to ensure data quality
  • Working closely with application, data platform, product, and data engineering teams to optimize data pipelines for monitoring and reliability
  • Driving data postmortems and diagnosing data incidents
  • Planning for data reliability from requirements through to deployment, including the assessment of end-user and business needs and team scheduling and engagement
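Two of the responsibilities above—defining business rules and writing tests that validate them—often amount to codifying each rule as a check that returns its violations. A hedged sketch (the rules, field names, and currency set here are hypothetical):

```python
def validate_order(order):
    """Apply two hypothetical business rules to an order record and
    return a list of violations (an empty list means the record passes)."""
    errors = []
    if order.get("amount", 0) <= 0:
        errors.append("amount must be positive")
    if order.get("currency") not in {"USD", "EUR", "GBP"}:
        errors.append("currency must be a supported code")
    return errors

passing = validate_order({"amount": 42.0, "currency": "USD"})
failing = validate_order({"amount": -1, "currency": "XYZ"})
```

Checks like this can run inside the pipeline itself, so a record that violates a governance rule is quarantined before it reaches downstream consumers.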


Similarly, across DRE job posts we found a series of common qualifications that candidates might expect to fulfill in a DRE role. Those qualifications include:

  • End-to-end experience with data engineering across big data solutions like SQL, Azure, and others
  • Experience creating and leveraging data visualization platforms like Tableau and Power BI
  • Experience managing Infrastructure as Code or building and operating cloud-based, distributed, and scalable databases with large amounts of data
  • Experience with data-centric applications like Hadoop, Kusto, or equivalent
  • Familiarity with general-purpose programming languages like Java, C/C++, C#, Python, or JavaScript
  • The ability to deal with ambiguity and drive actionable solutions
  • A computer science degree or the equivalent work experience

Interview with a data reliability engineer: Miriah Peterson

We recently sat down and spoke with an experienced data reliability engineer. Miriah Peterson is a technologist with a background in machine learning, data engineering, and data architecture strategy. In a recent blog post, she noted that “We, as Data Reliability Engineers, want to understand how SRE standards for software and SysAdmin-styled best practices apply to data systems, data pipelines, and other areas of traditional data-based infrastructure.” Read our conversation below.

Q. Walk us through a typical day in your role as a data reliability engineer.

A: The DRE day-to-day can be pretty similar to most engineering roles. You have your prioritized work, and you’re busy making sure that your initiatives align with the strategy and plan of your team. You tend to drive this alignment through sprints, and you’re focused on serving other engineers and other teams.

Inevitably, issues arise that interrupt your team’s initiatives. Your priority becomes unblocking that team and making sure they have the resources available that they need to keep working. You help them make necessary changes. You also ensure that the pipelines they need to work on are up and running, and that they have tools they need to finish their project.

In terms of workload management, we use the same tools typical of any engineering team. We use ticketing systems, like a Jira board, to track issues and prioritize them. Those are integrated with a bug ticketing tool or some kind of operations or error reporting tool that keeps track of the state of the pipeline. We typically ensure that we have a good metrics engine, like Datadog, Griffin, or Prometheus, which provides alerting and visibility into the data services we’re running.

Q. If you were writing a job description for DRE, what backgrounds and responsibilities would you think are important?

A. Ultimately, you want a data-focused engineer who uses software-style problem solving. You want someone who has good experience with metrics and SRE-style skills like reporting, understanding data stores, and fluency in basic infrastructure.

You also want someone with a good software background who understands the array of tools at a software developer’s disposal. You’re pulling in a lot from the data side, so it’s key to understand pipeline tools, data stores, ETL frameworks, and other elements that support a data organization. The software-style problem solving applied to data issues is the new twist that comes alongside the typical data work.

Q. What does a data reliability engineer do that sets them apart from other engineers?

A. In short, it’s about scope of work and a holistic approach. A data reliability engineer is going to think about how their data product is being used and made available to other teams. A lot of engineers are concerned about their product, their customers, and web services. Data reliability engineers care about those things, but we also care about data services and the people that use them.

We’re obsessed with making sure our internal teams have a flawless and uninterrupted experience with data services and data projects. While those teams are internal to an organization, they’re likely going to be external to our team or department. A wide variety of teams use data, from finance to operations to sales. We make sure that the data being delivered to them is uninterrupted and reliable. We’re an internal resource, in the same way that site reliability engineers are used.

Q. How did you come into this role?

A. Our organization needed to revamp some of our existing data pipelining tools and services. We had a lack of observability into the tools, and a lack of understanding as far as their uptime and how customers were experiencing and interacting with data. There was also a lack of understanding of how other teams used our data.

Management asked us to help them understand the data side of the software we were providing. I was put onto the team that researched this problem, and we started to see how reliability was so important to our stakeholders. In fact, reliability was important to the customers that used our product, but also the other teams that used this data to refine the product. It became a natural rebrand of sorts, to say that we were working on reliability, uptime, and observability while creating a better and more sustainable data service.

Q. Is the need for DRE growing? What does the future of DRE look like?

A. There's definitely a long-standing and enduring need for DRE. There are a lot of holes in the data stack as it stands today. It's really hard to get reliable data, and to make sure data is fresh and up to date. Due to that challenge, it can be really hard to build services around data, and to build pipelines around data that expose it to the appropriate teams at the right time. There is a constant challenge in making sure all services and pipelines run in the expected way.

The principles of DRE - whether the term sticks around or gets abbreviated or changed - touch on something fundamental and enduring. The fact that teams are recognizing the need for a reliability engineer that specializes in data services? That’s around for the foreseeable future. Because of what the data landscape looks like today, teams need someone who truly understands the data side and can drive it back to creating more value for the business.

The Data Reliability Engineering conference—DRE-Con—is taking place on May 25th and 26th. Hear speakers from Verizon, Strava, and DoorDash; attend a hands-on workshop; and participate in live Q&A. DRE-Con is virtual and free to attend.

Common needs

  • Data engineers care about overall data flow: data is fresh and operating at full volume, and jobs are always running so data outages don't impact downstream systems. Relevant capabilities: freshness + volume monitoring, schema change detection, lineage monitoring.
  • Data scientists care about specific datasets in great detail, looking for outliers, duplication, and other—sometimes subtle—issues that could affect their analysis or machine learning models. Relevant capabilities: freshness monitoring, completeness monitoring, duplicate detection, outlier detection, distribution shift detection, dimensional slicing and dicing.
  • Analytics engineers care about rapidly testing the changes they’re making within the data model: moving fast without breaking things, and without spending hours writing tons of pipeline tests. Relevant capabilities: lineage monitoring, ETL blue/green testing.
  • Business intelligence analysts care about the business impact of data: understanding where they should spend their time digging in, and when they have a red herring caused by a data pipeline problem. Relevant capabilities: integration with analytics tools, anomaly detection, custom business metrics, dimensional slicing and dicing.
  • Other stakeholders care about data reliability: customers and stakeholders don’t want data issues to bog them down, delay deadlines, or provide inaccurate information. Relevant capabilities: integration with analytics tools, reporting and insights.
