The data reliability engineering dictionary
Here is your go-to guide to the dense terminology of data reliability engineering. Seasoned pros and new practitioners alike, read on.
This dictionary was created specifically for the field of data reliability engineering (DRE). In a rapidly evolving digital landscape where "data is the new oil," DRE is more critical than ever, and as an emerging technical discipline it comes with a dense vocabulary of its own.
To help you understand and apply DRE concepts, we illuminate some of them, breaking down complex terminologies into easily digestible explanations. Whether you're a seasoned practitioner wanting to stay updated with the latest advancements, or a newbie taking your first steps into this field, this dictionary is here to guide your journey.
Anomaly detection
Detecting data points, events, and/or information that fall outside of a dataset's normal behavior. Anomaly detection helps companies flag potential issues in their data pipelines.
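A common starting point for anomaly detection is a simple statistical threshold. The sketch below flags values whose z-score exceeds a cutoff; the row counts and the threshold are illustrative assumptions, not a real monitoring system's defaults.

```python
# Minimal z-score anomaly detection sketch.
# The daily row counts and the threshold are made-up examples.
from statistics import mean, stdev

def detect_anomalies(values, threshold=3.0):
    """Return values whose z-score exceeds the threshold."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Six normal days, then a suspicious spike in row volume.
daily_row_counts = [1000, 1020, 980, 1010, 995, 1005, 5000]
anomalies = detect_anomalies(daily_row_counts, threshold=2.0)
```

Production systems typically use rolling windows, seasonality-aware models, or learned baselines rather than a single global z-score, but the idea is the same: define "normal," then alert on departures from it.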
Blue-green deployment
A technique for releasing changes with no downtime. The current "blue" system remains active while the updated "green" system is deployed alongside it. Traffic is then routed to the green system; if issues arise, traffic can quickly be switched back to blue.
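The mechanics reduce to flipping a pointer between two live environments. This sketch uses made-up environment names and versions to show the cutover and the instant rollback path.

```python
# Blue-green switch sketch: traffic is routed by flipping a pointer
# between two deployed environments. Names/versions are illustrative.
environments = {"blue": "app v1 (current)", "green": "app v2 (new)"}
active = "blue"

def route_request():
    """Serve the request from whichever environment is active."""
    return environments[active]

before = route_request()   # served by blue
active = "green"           # cut over to the updated system
after = route_request()    # served by green
# If issues arise, roll back instantly: active = "blue"
```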
Bronze layer/Silver layer/Gold layer
In a data warehouse architecture, data is often organized into layers based on how "refined" or aggregated it is. The layers are:
- Bronze layer: The raw data from source systems, loaded into the data warehouse as-is. No transformations are applied in this layer. It serves as a "backup" of the raw data in its original form.
- Silver layer: Lightly transformed and cleansed data from the bronze layer. Simple aggregates, deduplication, and consistency checks may be applied. This layer is used for exploratory analysis where speed is more important than perfection.
- Gold layer: Highly refined, standardized, and aggregated data optimized for business intelligence and reporting. Complex joins, calculations, and summaries are applied to promote a "single source of truth." This layer serves as the primary basis for decision making.
For many data teams, a bronze-silver-gold model strikes a good balance, but some opt for just two layers (raw and refined) or additional layers (like platinum) for their needs. The right approach depends on factors like data volumes, infrastructure, business requirements, and available resources.
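The layering above can be sketched end to end on a few in-memory records. The records, field names, and transformations here are illustrative assumptions, not a prescribed schema.

```python
# Illustrative bronze -> silver -> gold flow on in-memory records.
bronze = [  # raw data, loaded as-is from the source system
    {"order_id": 1, "amount": "10.50", "region": "us"},
    {"order_id": 1, "amount": "10.50", "region": "us"},   # duplicate
    {"order_id": 2, "amount": "7.25",  "region": "EU"},
]

# Silver: light cleansing -- deduplicate, fix types, standardize casing.
seen = set()
silver = []
for row in bronze:
    if row["order_id"] in seen:
        continue
    seen.add(row["order_id"])
    silver.append({"order_id": row["order_id"],
                   "amount": float(row["amount"]),
                   "region": row["region"].upper()})

# Gold: aggregate into a reporting-ready summary per region.
gold = {}
for row in silver:
    gold[row["region"]] = gold.get(row["region"], 0.0) + row["amount"]
```

Note that bronze is never mutated: it keeps the raw data intact so silver and gold can always be rebuilt from it.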
Continuous integration (CI): A software development practice in which team members integrate their work frequently, with each integration verified by an automated build (including tests) to detect integration errors as quickly as possible.
Continuous deployment (CD): A software practice in which the project is automatically deployed after each code commit.
Releasing changes to data systems in a gradual, staged manner. This could include canary releases, A/B testing, and blue-green deployments.
Data as code
The practice of defining and manipulating data through code and tools typically used by software engineers. Enables versioning, review processes, and automation of data. A data reliability engineering best practice.
Data downtime
Periods when data or pipelines are unavailable, resulting in a lack of access to information. Data SLAs help minimize data downtime.
Data incident management
The process for detecting, diagnosing, and remediating issues with data or data pipelines. A key focus of data reliability engineering.
Data monitoring
Continuously checking data quality and pipeline health to detect issues early. A core part of data reliability engineering, data monitoring enables data incident management and encompasses both data pipeline monitoring and data quality monitoring.
DataOps
An agile methodology for collaborating on and automating data-centric processes to improve data quality and reduce "toil". DataOps applies DevOps principles to data work.
Data pipeline monitoring
Data pipeline monitoring is concerned with the jobs and tables that move the data. It is typically the responsibility of data engineering or data platform teams. The main aspects monitored include the freshness of the data (when each table was last updated), the volume of data (how many rows are being moved), and job run durations. Its primary aim is to ensure that the Extract, Transform, Load (ETL) processes are functioning smoothly, allowing for seamless data flow between different stages of the pipeline, thereby avoiding bottlenecks and ensuring that data is up-to-date for analysis.
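A minimal version of the freshness check described above compares each table's last update time against an allowed staleness window. The table names, timestamps, and window below are illustrative assumptions, not a real monitoring API.

```python
# Freshness-check sketch: flag tables not updated within max_age.
from datetime import datetime, timedelta

def stale_tables(last_updated, max_age, now):
    """Return table names whose last update is older than max_age."""
    return [t for t, ts in last_updated.items() if now - ts > max_age]

now = datetime(2024, 1, 2, 12, 0)
last_updated = {
    "orders":    datetime(2024, 1, 2, 11, 30),  # updated 30 min ago
    "customers": datetime(2024, 1, 1, 9, 0),    # over a day old
}
alerts = stale_tables(last_updated, max_age=timedelta(hours=6), now=now)
```

Volume and job-duration checks follow the same pattern: record the metric on each run, then alert when it drifts outside an expected range.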
Data quality monitoring
Data quality monitoring is about validating the accuracy, consistency, and usability of the data itself. It’s not just about whether data is moving correctly from point A to point B, but whether the data itself is reliable and usable for the purposes it’s intended for.
For instance, data quality monitoring might involve looking for issues like:
- Missing or null values in critical data fields.
- Inconsistencies or contradictions within the data.
- Data that doesn't match expected patterns or values.
- Outdated or duplicated data.
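The checks listed above can be expressed directly in code. This sketch runs them against a handful of in-memory rows; the field names, email pattern, and rows are illustrative assumptions.

```python
# Data quality check sketch: nulls, pattern mismatches, duplicates.
import re

rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},            # missing value
    {"id": 3, "email": "not-an-email"},  # fails the expected pattern
    {"id": 1, "email": "a@example.com"}, # duplicate id
]

issues = []
seen_ids = set()
for row in rows:
    if row["email"] is None:
        issues.append((row["id"], "null email"))
    elif not re.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", row["email"]):
        issues.append((row["id"], "bad email pattern"))
    if row["id"] in seen_ids:
        issues.append((row["id"], "duplicate id"))
    seen_ids.add(row["id"])
```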
Data reliability engineering
The discipline of ensuring reliable and trustworthy data through engineering best practices. These include:
- Monitoring: Gain visibility into the health and quality of data and data systems through metrics, logs, and dashboards. Track key performance and quality indicators to detect issues quickly.
- Automation: Reduce repetitive manual work and human error through test automation, CI/CD, and infrastructure as code. Automate as much as possible to scale with data growth.
- Prevention: Build reliability into the design of data architectures, pipelines, and processes. Use techniques like graceful degradation, throttling, and circuit breakers to prevent catastrophic failures.
- Response: Develop playbooks and runbooks for resolving incidents when they occur. Perform blameless postmortems to determine root causes and update systems to prevent future issues.
- Improvement: Continually optimize data reliability through incremental refinements to processes, tests, documentation, dashboards, and automation. Hold post-incident reviews to apply lessons learned.
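One of the prevention techniques named above, the circuit breaker, can be sketched in a few lines: after a run of consecutive failures the breaker "opens" and short-circuits further calls instead of hammering a failing dependency. The threshold and the failing function here are illustrative assumptions.

```python
# Minimal circuit-breaker sketch.
class CircuitBreaker:
    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.failures = 0

    @property
    def open(self):
        return self.failures >= self.max_failures

    def call(self, fn, *args):
        if self.open:
            raise RuntimeError("circuit open: call skipped")
        try:
            result = fn(*args)
            self.failures = 0  # a success resets the count
            return result
        except Exception:
            self.failures += 1
            raise

def flaky():
    raise IOError("upstream unavailable")

breaker = CircuitBreaker(max_failures=2)
for _ in range(2):  # two consecutive failures trip the breaker
    try:
        breaker.call(flaky)
    except IOError:
        pass
```

Real implementations also add a "half-open" state that periodically probes the dependency so the breaker can close again once it recovers.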
The history of data reliability engineering traces back to SRE, a discipline created by Google to apply software engineering principles to operations. SRE focuses on aspects like monitoring, incident response, capacity planning, and automation to improve the reliability of software systems and services.
Data reliability engineering applies these same principles to data platforms, data pipelines, and analytics.
Data reliability engineer
An engineer who practices data reliability engineering. Data reliability engineers build and maintain data pipelines, define data SLAs, implement data monitoring, conduct root cause analysis for data incidents, and work to reduce "data toil" through building self-service data tools and automation. Some organizations are beginning to hire for dedicated data reliability engineers, and it is becoming its own discrete career ladder; other organizations split the responsibilities of DREs between other roles, like data engineers and software engineers.
Data SLA
Service level agreements that define the reliability, availability, and quality expected of data. Data SLAs include data SLIs and data SLOs.
Data SLI
Service level indicators, the metrics that are measured to determine if a data SLA is being met. For example, metrics like data accuracy, completeness, and freshness.
Data SLO
Service level objectives, the target thresholds for data SLIs that must be achieved to meet the data SLA. For example, "Data must be 99% accurate and 95% complete."
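Evaluating SLIs against SLOs is a straightforward comparison. This sketch uses the example targets above; the measured metric values are illustrative assumptions.

```python
# SLI-vs-SLO check sketch, mirroring the example thresholds above.
slis = {"accuracy": 0.992, "completeness": 0.93}   # measured indicators
slos = {"accuracy": 0.99,  "completeness": 0.95}   # target thresholds

violations = [metric for metric, target in slos.items()
              if slis[metric] < target]
sla_met = not violations
```

Here accuracy meets its 99% target but completeness misses 95%, so the SLA as a whole is not met.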
Data versioning
The ability to track changes to datasets and data models over time. Enables reproducibility, rollbacks, and comparison of data changes. An important data reliability engineering tool for monitoring data quality.
Data warehouse
A centralized data system used for analysis and reporting. Data from various sources is extracted, transformed, and loaded into the warehouse. Popular data warehouses include Snowflake, Google BigQuery, and Amazon Redshift.
dbt tests
dbt is a popular open-source transformation tool that allows data analysts to write data transformations in the data warehouse using SQL. dbt tests are part of the dbt framework: SQL statements that validate assumptions about data within a data warehouse and enable data teams to test the input or output of their dbt transformations.
Since dbt tests are easy to write and come out of the box with dbt, many teams use them as a first step toward implementing data quality checks. However, data teams should ultimately move to more comprehensive data observability solutions that provide continuous, automated monitoring and root cause analysis capabilities. Unlike data testing, data observability is proactive rather than reactive and gives a bigger-picture view into the overall health of an organization's data.
ELT
Data integration processes used to load data into a data warehouse. ELT stands for:
1. Extract: Pull data from source systems.
2. Load: Load the data into the data warehouse.
3. Transform: Cleanse and transform the data in the data warehouse.
In ELT, the order of the L and T steps is reversed compared to traditional ETL, with the data transformation happening after the data has been loaded into the data warehouse.
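The three steps can be sketched with SQLite standing in for the warehouse: raw rows are extracted and loaded as-is, then transformed with SQL inside the database. The table and column names are illustrative assumptions.

```python
# ELT sketch: SQLite acts as a stand-in "warehouse".
import sqlite3

# 1. Extract: pull raw rows from a source system.
source_rows = [("alice", "US"), ("bob", "us"), ("carol", "EU")]

# 2. Load: land the data in the warehouse untransformed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_users (name TEXT, region TEXT)")
conn.executemany("INSERT INTO raw_users VALUES (?, ?)", source_rows)

# 3. Transform: use the warehouse's own SQL engine to cleanse
#    (normalize casing) and aggregate.
conn.execute("""
    CREATE TABLE users_by_region AS
    SELECT UPPER(region) AS region, COUNT(*) AS n
    FROM raw_users
    GROUP BY UPPER(region)
""")
counts = dict(conn.execute("SELECT region, n FROM users_by_region"))
```

Because the raw table is loaded untouched, transformations can be rewritten and re-run without going back to the source system.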
ELT is a more “modern” approach that has become popular in recent years, as data warehouses have become more powerful and cost-effective, allowing teams to leverage the warehouses’ computing resources for transformation, instead of needing a separate transformation engine. Previously, teams would build Spark or Hadoop clusters to perform transformations.
Infrastructure as code
Managing infrastructure (networks, virtual machines, load balancers, etc.) through machine-readable definition files, rather than physical hardware configuration. This enables versioning, testing, and release management of infrastructure.
Root cause analysis
A systematic approach to determine the underlying cause of an issue. Data reliability engineers conduct root cause analysis to fully diagnose data incidents and prevent future recurrences.
Runbook
A runbook is a set of written procedures that outline the steps required to complete a specific process or resolve an issue. Runbooks are a key part of data reliability engineering and incident management.
Some examples of data reliability engineering runbooks include:
- Data pipeline failure runbook: The steps to take if an ETL or ELT pipeline fails or experiences an error. This could include checking monitoring dashboards, reviewing logs, rolling back to the last known good version, and re-running the pipeline.
- Dashboard outage runbook: The procedures to follow if a key business dashboard or report goes down. This may include checking if the underlying data is accessible, determining if the issue is with the dashboard tool or code, and steps to restore the dashboard to working order.
- Data restore runbook: The steps required to restore data from a backup if data has been lost or corrupted in the primary data system. This includes ensuring you have proper backups in place, testing restore procedures, and documenting how to fully recover your data.
Data reliability engineering runbooks unlock quicker, more standardized incident responses and collect information that may be stuck in the heads of a few subject matter experts. They are also useful for training and accessibility.
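A runbook's ordered steps can also be encoded directly, which makes the response auditable and repeatable. The step functions below are hypothetical placeholders sketching the pipeline-failure runbook above, not a real incident-response API.

```python
# Sketch: a pipeline-failure runbook as ordered, automatable steps.
# Each step function is a hypothetical placeholder.
def check_dashboards():
    return "dashboards checked"

def review_logs():
    return "logs reviewed"

def rerun_pipeline():
    return "pipeline re-run"

PIPELINE_FAILURE_RUNBOOK = [check_dashboards, review_logs, rerun_pipeline]

def execute_runbook(steps):
    """Run each step in order and record what was done."""
    return [step() for step in steps]

audit_log = execute_runbook(PIPELINE_FAILURE_RUNBOOK)
```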
Toil
Repetitive, tedious tasks that data teams must execute manually. A goal of data reliability engineering is to reduce "data toil" through automation and self-service tools.
If you're interested in taking Bigeye's data reliability tool for a spin, book a demo here.
Schema change detection