The data reliability engineering dictionary
Here is your go-to guide to the dense terminology of data reliability engineering. Seasoned pros and new practitioners alike, read on.
This dictionary was created specifically for the field of data reliability engineering (DRE). In a rapidly evolving digital landscape where "data is the new oil," DRE is more critical than ever, and as an emerging technical discipline it comes with a dense vocabulary of its own.
To help you understand and apply DRE concepts, we illuminate some of them, breaking down complex terminologies into easily digestible explanations. Whether you're a seasoned practitioner wanting to stay updated with the latest advancements, or a newbie taking your first steps into this field, this dictionary is here to guide your journey.
Anomaly detection
Detecting data points, events, and/or information that fall outside of a dataset's normal behavior. Anomaly detection helps companies flag potential issues in their data pipelines.
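A common starting point for anomaly detection is a simple statistical threshold. The sketch below flags values whose z-score exceeds a cutoff; the row counts and the threshold are illustrative assumptions, not a real monitoring system's defaults.

```python
# Minimal z-score anomaly detection sketch.
# The daily row counts and the threshold are made-up examples.
from statistics import mean, stdev

def detect_anomalies(values, threshold=3.0):
    """Return values whose z-score exceeds the threshold."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Six normal days, then a suspicious spike in row volume.
daily_row_counts = [1000, 1020, 980, 1010, 995, 1005, 5000]
anomalies = detect_anomalies(daily_row_counts, threshold=2.0)
```

Production systems typically use rolling windows, seasonality-aware models, or learned baselines rather than a single global z-score, but the idea is the same: define "normal," then alert on departures from it.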
Blue-green deployment
A technique for releasing changes with no downtime. The current "blue" system remains active while the updated "green" system is deployed alongside it. Traffic is then routed to the green system; if issues arise, traffic can quickly be switched back to blue.
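The mechanics reduce to flipping a pointer between two live environments. This sketch uses made-up environment names and versions to show the cutover and the instant rollback path.

```python
# Blue-green switch sketch: traffic is routed by flipping a pointer
# between two deployed environments. Names/versions are illustrative.
environments = {"blue": "app v1 (current)", "green": "app v2 (new)"}
active = "blue"

def route_request():
    """Serve the request from whichever environment is active."""
    return environments[active]

before = route_request()   # served by blue
active = "green"           # cut over to the updated system
after = route_request()    # served by green
# If issues arise, roll back instantly: active = "blue"
```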
Bronze layer/Silver layer/Gold layer
In a data warehouse architecture, data is often organized into layers based on how "refined" or aggregated it is. The layers are:
- Bronze layer: The raw data from source systems, loaded into the data warehouse as-is. No transformations are applied in this layer. It serves as a "backup" of the raw data in its original form.
- Silver layer: Lightly transformed and cleansed data from the bronze layer. Simple aggregates, deduplication, and consistency checks may be applied. This layer is used for exploratory analysis where speed is more important than perfection.
- Gold layer: Highly refined, standardized, and aggregated data optimized for business intelligence and reporting. Complex joins, calculations, and summaries are applied to promote a "single source of truth." This layer serves as the primary basis for decision making.
For many data teams, a bronze-silver-gold model strikes a good balance, but some opt for just two layers (raw and refined) or additional layers (like platinum) for their needs. The right approach depends on factors like data volumes, infrastructure, business requirements, and available resources.
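The layering above can be sketched end to end on a few in-memory records. The records, field names, and transformations here are illustrative assumptions, not a prescribed schema.

```python
# Illustrative bronze -> silver -> gold flow on in-memory records.
bronze = [  # raw data, loaded as-is from the source system
    {"order_id": 1, "amount": "10.50", "region": "us"},
    {"order_id": 1, "amount": "10.50", "region": "us"},   # duplicate
    {"order_id": 2, "amount": "7.25",  "region": "EU"},
]

# Silver: light cleansing -- deduplicate, fix types, standardize casing.
seen = set()
silver = []
for row in bronze:
    if row["order_id"] in seen:
        continue
    seen.add(row["order_id"])
    silver.append({"order_id": row["order_id"],
                   "amount": float(row["amount"]),
                   "region": row["region"].upper()})

# Gold: aggregate into a reporting-ready summary per region.
gold = {}
for row in silver:
    gold[row["region"]] = gold.get(row["region"], 0.0) + row["amount"]
```

Note that bronze is never mutated: it keeps the raw data intact so silver and gold can always be rebuilt from it.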
Continuous integration (CI): A software development practice in which team members integrate their work frequently, with each integration verified by an automated build (including tests) to detect integration errors as quickly as possible.
Continuous deployment (CD): A software practice in which the project is automatically deployed after each code commit.
Releasing changes to data systems in a gradual, staged manner. This could include canary releases, A/B testing, and blue-green deployments.
Data as code
The practice of defining and manipulating data through code and tools typically used by software engineers. Enables versioning, review processes, and automation of data. A data reliability engineering best practice.
Data downtime
Periods when data or pipelines are unavailable, resulting in a lack of access to information. Data SLAs help minimize data downtime.
Data incident management
The process for detecting, diagnosing, and remediating issues with data or data pipelines. A key focus of data reliability engineering.
Data monitoring
Continuously checking data quality and pipeline health to detect issues early. A core part of data reliability engineering, data monitoring enables data incident management and encompasses both data pipeline monitoring and data quality monitoring.
DataOps
An agile methodology for collaborating on and automating data-centric processes to improve data quality and reduce "toil". DataOps applies DevOps principles to data work.
Data pipeline monitoring
Data pipeline monitoring is concerned with the jobs and tables that move the data. It is typically the responsibility of data engineering or data platform teams. The main aspects monitored include the freshness of the data (when each table was last updated), the volume of data (how many rows are being moved), and job run durations. Its primary aim is to ensure that the Extract, Transform, Load (ETL) processes are functioning smoothly, allowing for seamless data flow between different stages of the pipeline, thereby avoiding bottlenecks and ensuring that data is up-to-date for analysis.
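A minimal version of the freshness check described above compares each table's last update time against an allowed staleness window. The table names, timestamps, and window below are illustrative assumptions, not a real monitoring API.

```python
# Freshness-check sketch: flag tables not updated within max_age.
from datetime import datetime, timedelta

def stale_tables(last_updated, max_age, now):
    """Return table names whose last update is older than max_age."""
    return [t for t, ts in last_updated.items() if now - ts > max_age]

now = datetime(2024, 1, 2, 12, 0)
last_updated = {
    "orders":    datetime(2024, 1, 2, 11, 30),  # updated 30 min ago
    "customers": datetime(2024, 1, 1, 9, 0),    # over a day old
}
alerts = stale_tables(last_updated, max_age=timedelta(hours=6), now=now)
```

Volume and job-duration checks follow the same pattern: record the metric on each run, then alert when it drifts outside an expected range.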
Data quality monitoring
Data quality monitoring is about validating the accuracy, consistency, and usability of the data itself. It’s not just about whether data is moving correctly from point A to point B, but whether the data itself is reliable and usable for the purposes it’s intended for.
For instance, data quality monitoring might involve looking for issues like:
- Missing or null values in critical data fields.
- Inconsistencies or contradictions within the data.
- Data that doesn't match expected patterns or values.
- Outdated or duplicated data.
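The checks listed above can be expressed directly in code. This sketch runs them against a handful of in-memory rows; the field names, email pattern, and rows are illustrative assumptions.

```python
# Data quality check sketch: nulls, pattern mismatches, duplicates.
import re

rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},            # missing value
    {"id": 3, "email": "not-an-email"},  # fails the expected pattern
    {"id": 1, "email": "a@example.com"}, # duplicate id
]

issues = []
seen_ids = set()
for row in rows:
    if row["email"] is None:
        issues.append((row["id"], "null email"))
    elif not re.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", row["email"]):
        issues.append((row["id"], "bad email pattern"))
    if row["id"] in seen_ids:
        issues.append((row["id"], "duplicate id"))
    seen_ids.add(row["id"])
```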
Data reliability engineering
The discipline of ensuring reliable and trustworthy data through engineering best practices. These include:
- Monitoring: Gain visibility into the health and quality of data and data systems through metrics, logs, and dashboards. Track key performance and quality indicators to detect issues quickly.
- Automation: Reduce repetitive manual work and human error through test automation, CI/CD, and infrastructure as code. Automate as much as possible to scale with data growth.
- Prevention: Build reliability into the design of data architectures, pipelines, and processes. Use techniques like graceful degradation, throttling, and circuit breakers to prevent catastrophic failures.
- Response: Develop playbooks and runbooks for resolving incidents when they occur. Perform blameless postmortems to determine root causes and update systems to prevent future issues.
- Improvement: Continually optimize data reliability through incremental refinements to processes, tests, documentation, dashboards, and automation. Hold post-incident reviews to apply lessons learned.
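One of the prevention techniques named above, the circuit breaker, can be sketched in a few lines: after a run of consecutive failures the breaker "opens" and short-circuits further calls instead of hammering a failing dependency. The threshold and the failing function here are illustrative assumptions.

```python
# Minimal circuit-breaker sketch.
class CircuitBreaker:
    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.failures = 0

    @property
    def open(self):
        return self.failures >= self.max_failures

    def call(self, fn, *args):
        if self.open:
            raise RuntimeError("circuit open: call skipped")
        try:
            result = fn(*args)
            self.failures = 0  # a success resets the count
            return result
        except Exception:
            self.failures += 1
            raise

def flaky():
    raise IOError("upstream unavailable")

breaker = CircuitBreaker(max_failures=2)
for _ in range(2):  # two consecutive failures trip the breaker
    try:
        breaker.call(flaky)
    except IOError:
        pass
```

Real implementations also add a "half-open" state that periodically probes the dependency so the breaker can close again once it recovers.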
The history of data reliability engineering traces back to SRE, a discipline created by Google to apply software engineering principles to operations. SRE focuses on aspects like monitoring, incident response, capacity planning, and automation to improve the reliability of software systems and services.
Data reliability engineering applies these same principles to data platforms, data pipelines, and analytics.
Data reliability engineer
An engineer who practices data reliability engineering. Data reliability engineers build and maintain data pipelines, define data SLAs, implement data monitoring, conduct root cause analysis for data incidents, and work to reduce "data toil" through building self-service data tools and automation. Some organizations are beginning to hire for dedicated data reliability engineers, and it is becoming its own discrete career ladder; other organizations split the responsibilities of DREs between other roles, like data engineers and software engineers.
Data SLA
Service level agreements that define the reliability, availability, and quality expected of data. Data SLAs include data SLIs and data SLOs.
Data SLI
Service level indicators, the metrics that are measured to determine if a data SLA is being met. For example, metrics like data accuracy, completeness, and freshness.
Data SLO
Service level objectives, the target thresholds for data SLIs that must be achieved to meet the data SLA. For example, "Data must be 99% accurate and 95% complete."
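Evaluating SLIs against SLOs is a straightforward comparison. This sketch uses the example targets above; the measured metric values are illustrative assumptions.

```python
# SLI-vs-SLO check sketch, mirroring the example thresholds above.
slis = {"accuracy": 0.992, "completeness": 0.93}   # measured indicators
slos = {"accuracy": 0.99,  "completeness": 0.95}   # target thresholds

violations = [metric for metric, target in slos.items()
              if slis[metric] < target]
sla_met = not violations
```

Here accuracy meets its 99% target but completeness misses 95%, so the SLA as a whole is not met.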
Data versioning
The ability to track changes to datasets and data models over time. Enables reproducibility, rollbacks, and comparison of data changes. An important data reliability engineering tool for monitoring data quality.
Data warehouse
A centralized data system used for analysis and reporting. Data from various sources is extracted, transformed, and loaded into the warehouse. Popular data warehouses include Snowflake, Google BigQuery, and Amazon Redshift.
dbt tests
dbt is a popular open-source transformation tool that allows data analysts to write data transformations in the data warehouse using SQL. dbt tests are part of the dbt framework: SQL statements that validate assumptions about data within a data warehouse and enable data teams to test the input or output of their dbt transformations.
Since dbt tests are easy to write and come out of the box with dbt, many teams use them as a first step toward implementing data quality checks. However, data teams should ultimately move to more comprehensive data observability solutions that provide continuous, automated monitoring and root cause analysis capabilities. Unlike data testing, data observability is proactive rather than reactive and gives a bigger-picture view into the overall health of an organization's data.
ELT
Data integration processes used to load data into a data warehouse. ELT stands for:
1. Extract: Pull data from source systems.
2. Load: Load the data into the data warehouse.
3. Transform: Cleanse and transform the data in the data warehouse.
In ELT, the order of the L and T steps is reversed compared to traditional ETL, with the data transformation happening after the data has been loaded into the data warehouse.
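The three steps can be sketched with SQLite standing in for the warehouse: raw rows are extracted and loaded as-is, then transformed with SQL inside the database. The table and column names are illustrative assumptions.

```python
# ELT sketch: SQLite acts as a stand-in "warehouse".
import sqlite3

# 1. Extract: pull raw rows from a source system.
source_rows = [("alice", "US"), ("bob", "us"), ("carol", "EU")]

# 2. Load: land the data in the warehouse untransformed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_users (name TEXT, region TEXT)")
conn.executemany("INSERT INTO raw_users VALUES (?, ?)", source_rows)

# 3. Transform: use the warehouse's own SQL engine to cleanse
#    (normalize casing) and aggregate.
conn.execute("""
    CREATE TABLE users_by_region AS
    SELECT UPPER(region) AS region, COUNT(*) AS n
    FROM raw_users
    GROUP BY UPPER(region)
""")
counts = dict(conn.execute("SELECT region, n FROM users_by_region"))
```

Because the raw table is loaded untouched, transformations can be rewritten and re-run without going back to the source system.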
ELT is a more “modern” approach that has become popular in recent years, as data warehouses have become more powerful and cost-effective, allowing teams to leverage the warehouses’ computing resources for transformation, instead of needing a separate transformation engine. Previously, teams would build Spark or Hadoop clusters to perform transformations.
Infrastructure as code
Managing infrastructure (networks, virtual machines, load balancers, etc.) through machine-readable definition files, rather than physical hardware configuration. This enables versioning, testing, and release management of infrastructure.
Root cause analysis
A systematic approach to determine the underlying cause of an issue. Data reliability engineers conduct root cause analysis to fully diagnose data incidents and prevent future recurrences.
Runbook
A runbook is a set of written procedures that outline the steps required to complete a specific process or resolve an issue. Runbooks are a key part of data reliability engineering and incident management.
Some examples of data reliability engineering runbooks include:
- Data pipeline failure runbook: The steps to take if an ETL or ELT pipeline fails or experiences an error. This could include checking monitoring dashboards, reviewing logs, rolling back to the last known good version, and re-running the pipeline.
- Dashboard outage runbook: The procedures to follow if a key business dashboard or report goes down. This may include checking if the underlying data is accessible, determining if the issue is with the dashboard tool or code, and steps to restore the dashboard to working order.
- Data restore runbook: The steps required to restore data from a backup if data has been lost or corrupted in the primary data system. This includes ensuring you have proper backups in place, testing restore procedures, and documenting how to fully recover your data.
Data reliability engineering runbooks unlock quicker, more standardized incident responses and collect information that may be stuck in the heads of a few subject matter experts. They are also useful for training and accessibility.
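A runbook's ordered steps can also be encoded directly, which makes the response auditable and repeatable. The step functions below are hypothetical placeholders sketching the pipeline-failure runbook above, not a real incident-response API.

```python
# Sketch: a pipeline-failure runbook as ordered, automatable steps.
# Each step function is a hypothetical placeholder.
def check_dashboards():
    return "dashboards checked"

def review_logs():
    return "logs reviewed"

def rerun_pipeline():
    return "pipeline re-run"

PIPELINE_FAILURE_RUNBOOK = [check_dashboards, review_logs, rerun_pipeline]

def execute_runbook(steps):
    """Run each step in order and record what was done."""
    return [step() for step in steps]

audit_log = execute_runbook(PIPELINE_FAILURE_RUNBOOK)
```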
Toil
Repetitive, tedious tasks that data teams must execute manually. A goal of data reliability engineering is to reduce "data toil" through automation and self-service tools.
If you're interested in taking Bigeye's data reliability tool for a spin, book a demo here.
Schema change detection