Complete guide to understanding data observability
This is your one-stop shop for all things data observability. Learn what data observability is, who it’s for and who uses it, how it benefits organizations, how it works, key terms, use cases, and best practices.
What is data observability?
Data observability refers to an organization's ability to see and understand the state of its data at all times. By "state" we mean things like: where is the data coming from and going within our pipelines? Is it moving on time and with the volume we expect? Is the quality high enough for our use cases? Is it behaving normally, or did it change recently?
Here are some questions you could answer with data observability:
- Is the customers table getting fresh data on time, or is it delayed?
- Do we have any duplicated shopping cart transactions and how many?
- Was the huge decrease in average purchase size just a data problem or a real thing?
- Will I be impacting anyone if I delete this table from our data warehouse?
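To make two of these questions concrete, here is a minimal sketch of the kinds of checks an observability platform automates. It uses an in-memory SQLite database, and the table and column names (`transactions`, `cart_id`, `loaded_at`) are illustrative, not any particular platform's schema:

```python
import sqlite3
from datetime import datetime, timedelta

# Hypothetical example table with one duplicated cart transaction.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (cart_id TEXT, loaded_at TEXT)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [("c1", "2024-01-02 11:30:00"),
     ("c2", "2024-01-02 11:45:00"),
     ("c2", "2024-01-02 11:45:00")],  # duplicate
)
now = datetime(2024, 1, 2, 12, 0, 0)

# "Is the table getting fresh data on time?" -> time since last load
(last_loaded,) = conn.execute("SELECT MAX(loaded_at) FROM transactions").fetchone()
lag = now - datetime.strptime(last_loaded, "%Y-%m-%d %H:%M:%S")
is_fresh = lag <= timedelta(hours=1)

# "Do we have duplicated shopping cart transactions, and how many?"
(dupes,) = conn.execute("""
    SELECT COALESCE(SUM(n - 1), 0)
    FROM (SELECT COUNT(*) AS n FROM transactions GROUP BY cart_id)
    WHERE n > 1
""").fetchone()

print(is_fresh, dupes)  # True 1
```

An observability platform runs checks like these continuously and across every table, rather than as one-off ad-hoc queries.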
Observability platforms aim to give a continuous and comprehensive view into the state of data moving through data pipelines, so questions like these can be easily answered.
Common data observability activities include:
- Monitoring the operational health of the data to ensure it's fresh and complete
- Detecting and surfacing anomalies that could indicate data accuracy issues
- Mapping data lineage to upstream tables to quickly identify the root causes of problems
- Mapping lineage downstream to analytics and machine learning applications to understand the impacts of problems
Once data teams unlock these activities, they can systematically understand when, where, and why data quality problems occur in their pipelines. They can then stop those problems from impacting the business, and work to prevent them from recurring.
Data observability unlocks these basic activities, so it’s the first stepping stone toward every organization’s ultimate data wishlist: healthier pipelines, data teams with more free time, more accurate information, and happier customers.
Why is data observability important?
Organizations push relentlessly to better use their data for strategic decision making, user experience, and efficient operations. All of those use cases assume that the data they run on is reliable.
The reality is that all data pipelines will experience failures. It's not a question of if, but when, and how often. What the data team can control is how often issues occur, how big their impact is, and how stressful they are to resolve.
A data team that lacks this control will lose the trust of their organization, limiting organizational willingness to invest in things like analytics, machine learning, and automation. On the other hand, a data team that consistently delivers reliable data can win the trust of their organization, and fully leverage data to drive the business forward.
Data observability is important because it is the first step toward having the level of control needed to ensure reliable data pipelines that win the trust of the organization and ultimately unlock more value from the data.
What are the benefits of data observability?
What do you get once you have total observability over your data pipelines? The bottom line is that the data team can ensure that data reaching the business is fresh, high quality, and reliable—which unlocks trust in the data.
Let’s break down the tangible benefits of data observability a little further:
- Decreased impacts from data issues—when problems do occur, they’ll be understood and resolved faster; ideally before they reach a single stakeholder. Data outages will always be a risk, but with observability, their impacts are greatly reduced.
- Less firefighting for the data team—you’ll spend less time firefighting data outages, and being reactive. That means more time building things, creating automation, and the other fun parts of data engineering and data science.
- Increased trust in the data by stakeholders—once they stop seeing questionable data in their analytics, and stop hearing about ML model issues, they’ll start trusting the data and assuming it’s good for making decisions with or integrating into their products and services.
- Increased investment in data from the business—once stakeholders can trust the data, they can feel comfortable using data in more places across the business, which means allowing a bigger budget on data and the data team.
Who uses data observability?
Data observability can touch many departments within an organization. If you're wondering who it's for and who works with it, the sections below explore the relationship between data observability and some common roles on data teams.
The history of data observability
The concept of “data observability” emerged in the late 2010s. It was initially inspired by internal efforts at companies like Uber, Netflix, Airbnb, and Lyft to improve data quality and monitor data pipelines and tables.
Most of these data teams developed some sort of pipeline testing system first, before moving on to developing true data observability tools.
Eventually, smaller companies with leaner technical teams also recognized the need for observability capabilities. However, they didn't have the horsepower to build these solutions in-house. And thus, data observability SaaS solutions were created to fill the gap. Take a look at the data observability origin stories from some companies you've heard of:
What’s the difference between data observability and…
What’s the difference between data observability and data testing?
Testing is a manual expression of a condition that we expect to be true. For example:
- An engineer expects a data table to be populated with entries. They can test that there is a non-zero number of rows in a data table
- A data scientist creates complex queries to stress test a database and check its responsiveness
- A data reliability engineer uses a sample of input data, runs it through a development version of a data pipeline, and compares the actual output with the expected output
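The first check in the list above can be sketched in a few lines. This is an illustrative example, not any particular testing framework; the `orders` table name is made up:

```python
import sqlite3

# Test that a table is populated: assert a non-zero number of rows.
def table_is_populated(conn, table):
    (n,) = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
    return n > 0

# Hypothetical in-memory table standing in for a real warehouse table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER)")
conn.execute("INSERT INTO orders VALUES (1)")
assert table_is_populated(conn, "orders")
```

Note how opinionated this is: the engineer had to know in advance that "zero rows" is the failure mode worth checking.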
Testing is generally the first line of defense for data quality issues. Testing is manual and opinionated; something either works or it doesn’t. In comparison with data observability, testing is less all-encompassing. Testing can be a component of a larger data observability program; it feeds into observability. In fact, it’s best practice for both facets to be hard at work in a healthy, high-quality data pipeline.
What’s the difference between data observability and data monitoring?
Data monitoring is not opinionated. It simply reports back the conditions of the system on a regular time cadence. A good metaphor is the dashboard of a car. Monitoring tools allow teams to watch and understand the state of their systems, based on predefined sets of metrics.
In contrast to data monitoring, data observability is more like a vehicle diagnostics test: it scans your data system components to check for issues, reports them, and tells you where they’re coming from and why they’re happening.
Rather than simply reporting stats, data observability tools provide analytical insights that allow teams to actively address data problems, based on the exploration of properties and patterns that weren’t defined in advance.
For example, data monitoring might issue an alert if a set of data rows suddenly explodes in quantity, and pre-defined rules explicitly state that there should only be ten rows of data. Data observability, on the other hand, can pinpoint exactly where and why that explosion in volume occurred. Data observability can layer onto data monitoring in order to detect not only that there is an issue, but why and where there is an issue, as well as other details about the issue, like what changes were made, by whom and when, and who is potentially impacted downstream.
What’s the difference between data observability and data quality?
Data quality is a traditional, all-encompassing term; historically, data quality work focused on fixing data issues reactively. Data quality itself refers to the general state of your data: how healthy is it?
In contrast to data quality, data observability constantly surveys the state of the data pipeline and proactively diagnoses issues. Data observability platforms can help to ensure data quality.
What’s the difference between data observability and data reliability?
Data observability is a method by which you achieve data reliability. Data reliability is the end state and the goal; data observability is how you get there. You can deliver reliability through an observability platform.
What’s the difference between data observability and machine learning observability?
Data engineering, science, and analytics teams use data observability tools like Bigeye to understand the current state of all their data-related systems – data tables, databases, data pipelines – at all times. Data observability deals with metrics like data freshness (how up to date the data is), data completeness (how much missing data there is), and data accuracy (how much of the data is mis-formatted, or contains typos of some sort).
Machine learning observability involves knowing the current state of your machine learning model’s data and performance across its lifecycle. It deals with issues like model drift, data quality issues, and anomalous performance degradations using baselines.
What’s the difference between data observability and observability (the regular kind)?
IT operations and site reliability engineering (SRE) teams use infrastructure observability tools like Datadog and SolarWinds to understand the current state of their infrastructure, and the applications that run on them, at all times. For most companies, this means servers, micro-services, and applications. Infrastructure observability typically deals with metrics like uptime, error rate, and number of requests.
Metrics in infrastructure observability generally remain fairly constant over time. Data observability, on the other hand, can have metrics that fluctuate wildly over time. Data observability as a concept is inspired by infra observability, but differs from it considerably in use case and user.
In general, data observability encompasses monitoring, alerting, and lineage – in other words, not just understanding the state of the system, but reacting to problems in the system.
For more definitions, check out our data observability dictionary.
How does data observability work?
In general, a good data observability platform should monitor overall warehouse health (freshness, volume, accurate formatting), data quality issues in the tables themselves (anomalies, outliers, potentially erroneous entries), and where problems and improvements will have an impact on the operation (lineage, root cause analysis).
There are two types of monitoring within data observability: warehouse metadata monitoring, and deep statistical monitoring.
General warehouse health visibility is achieved through warehouse metadata monitoring. It’s fairly straightforward and low impact to your data environment, so it’s easy to apply across the entire warehouse.
Deep statistical monitoring goes deeper. It is achieved by connecting to a read-only account and applying metrics to your data. This approach is very similar to connecting a BI tool. But a good data observability tool won’t actually copy any of your data. It will simply monitor the source and store observability information—limiting operational strain on the database and minimizing security risks.
Companies generally only apply deep data quality checks on the most critical tables. For most companies, that’s about 20% or less of all the data in their warehouse!
Common deep monitoring attributes to track include completeness, duplication, format errors, outliers, and distributional statistics. These attributes can’t be collected from metadata. They require direct interrogation of the data itself. This is necessarily more impactful on the warehouse, but also enables the detection of issues inside the datasets themselves.
Finally, a data observability tool will parse logs and use AI to map out the data model and flow of your environment. This process helps you understand the potential impacts of changes upstream and down.
NOTE: In both cases, only aggregates need to be returned and stored in the data observability platform. Raw data isn’t needed at this stage, and doesn’t need to leave the data source.
Basic data observability concepts
If you're completely new to data observability, there are some foundational concepts to digest first. The following summaries are bite-sized primers to help you familiarize yourself with the core aspects of a data observability system.
1. Data freshness
Data freshness refers to how up-to-date data is, e.g. the amount of time since a data table was last refreshed. Your data's freshness is one of the main things that an observability platform will track, ensuring that no data goes stale and out-of-date.
2. Database schema
A database schema is an abstract design that represents the storage of your data in a database. It describes both the organization of data, and the relationships between tables in a given database.
3. Metadata
Metadata is data about data. In the data observability context, metadata tells you about what's being done to the data by the infrastructure, e.g. an INSERT to append new rows to a table, or an Airflow job that failed and didn't write anything to its intended destination.
Metadata can give you key information about, for example:
- Data freshness - e.g. time since a data table was last refreshed
- Data volume - e.g. number of rows inserted per day
- Read query volume - e.g. number of queries run per day
Metadata is important for observing the state of your data system. Out-of-date data (for instance in dashboards) is often extremely visible to non-technical leadership, and impedes their decision-making ability. Meanwhile, a massive drop in data volume or data query volume almost certainly indicates that an upstream or downstream dependency is down.
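The signals above can be derived from metadata alone, without touching the underlying rows. Here is a minimal sketch; the `table_metadata` records mimic what an information-schema-style query might return, and all names and thresholds are illustrative:

```python
from datetime import datetime

# Hypothetical metadata rows, as might come from a warehouse's
# information schema: last refresh time and row count per table.
table_metadata = [
    {"table": "orders",   "last_altered": datetime(2024, 1, 2, 11, 0), "row_count": 120_000},
    {"table": "sessions", "last_altered": datetime(2023, 12, 30, 9, 0), "row_count": 0},
]

now = datetime(2024, 1, 2, 12, 0)

# Freshness signal: tables not refreshed within the last 24 hours.
stale = [m["table"] for m in table_metadata
         if (now - m["last_altered"]).total_seconds() > 24 * 3600]

# Volume signal: tables with suspiciously empty row counts.
empty = [m["table"] for m in table_metadata if m["row_count"] == 0]

print(stale, empty)  # ['sessions'] ['sessions']
```

Because checks like these read only metadata, they are cheap enough to apply across the whole warehouse.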
4. Data lineage
A data breakage or outage happens somewhere along the pipeline. The first question is: where? Data lineage helps you find out, by surfacing which upstream and downstream sources were impacted. It also ties that information back to the teams that generate and access the data. Lineage is a key piece of any data observability program. When a data issue occurs, your lineage serves as a map for how governance, business, and technical teams are impacted.
To be more specific, data lineage is the path that data takes through your data system, from creation, through any databases and transformation jobs, all the way down to final destinations like analytics dashboards and feature stores. Data lineage is an important tool for data observability because it provides context - it tells you:
- For each data pipeline job, which dataset it’s reading from and the dataset to which it is writing
- Whether problems are localized to just one dataset, or cascade down your pipeline
This helps to answer questions about where data problems originate, and how widespread their impact is:
- If I change the schema of this table, what other tables will start having problems?
- If I see a problem in this table, how can I identify where that problem originated, and how far it propagates?
Lineage is most often collected by parsing the logs of queries that write into each table. By inspecting what’s happening inside the query, you can see which tables are being read from, and which tables are being written into.
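As a rough sketch of that log-parsing step: given a write query from the log, extract the table being written to and the tables being read from. Real systems use a full SQL parser; this regex only handles the simple `INSERT INTO ... SELECT ... FROM/JOIN` shape, and the query text is made up:

```python
import re

# Extract table-level lineage from a single write query in the log.
def lineage(query):
    target = re.search(r"INSERT\s+INTO\s+(\w+)", query, re.I).group(1)
    sources = re.findall(r"(?:FROM|JOIN)\s+(\w+)", query, re.I)
    return target, sources

# Hypothetical query pulled from a warehouse query log.
target, sources = lineage(
    "INSERT INTO daily_revenue "
    "SELECT o.day, SUM(o.amount) FROM orders o JOIN refunds r "
    "ON o.id = r.order_id GROUP BY o.day"
)
print(target, sources)  # daily_revenue ['orders', 'refunds']
```

Accumulating these (target, sources) edges across every logged query yields the table-to-table lineage graph.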
But lineage can (and should) go further than just table-to-table or column-to-column relationships within a single source. Companies like AirBnB and Uber have been modeling lineage all the way upstream to the source database or Kafka topic, and all the way downstream to the user level, so they can communicate data problems or changes all the way up to the relevant humans.
5. Anomaly detection
Anomaly detection is the practice of detecting data points, events, or patterns that fall outside of a dataset's normal behavior. It helps companies flag areas of their data pipelines that might have issues.
The anomaly detection system will need to learn the historical patterns present in each data quality attribute, learn what abnormal behavior looks like, and ultimately fire alerts that indicate real issues while ignoring behavior that’s slightly off but not indicative of a real problem. This can be especially difficult when there are hundreds or thousands of data quality attributes being tracked simultaneously.
Naive techniques like Gaussian models—which simply look at a number of standard deviations above or below the historical mean—fall apart on many commonly occurring time series patterns. A good model needs to adapt to the various patterns that regularly occur in these metadata attributes over time.
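To make that limitation concrete, here is the naive Gaussian check sketched in a few lines, run against made-up row counts with low weekend volume:

```python
import statistics

# Naive Gaussian anomaly check: alert when a value lands more than
# k standard deviations from the historical mean.
def is_anomalous(history, value, k=3.0):
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return abs(value - mean) > k * stdev

# Four weeks of weekday-only row counts (illustrative numbers).
weekday_rows = [1000, 1020, 980, 1010, 990] * 4

assert not is_anomalous(weekday_rows, 1005)  # normal weekday: no alert
assert is_anomalous(weekday_rows, 300)       # real drop: caught

# Mix normal weekend lows into the history and the variance balloons,
# so the same real drop no longer trips the threshold -- the failure
# mode that seasonality-aware models are built to avoid.
mixed = weekday_rows + [200, 190] * 4
assert not is_anomalous(mixed, 300)          # real drop: missed
```

A production anomaly detection model would instead learn the weekly pattern and compare each value against its expected seasonal level.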
6. Data mesh
A data mesh is a type of data architecture. It assigns ownership of various parts of the data pipeline to specific business teams - Marketing, Sales, Product, etc. A data mesh eliminates bottlenecks associated with one monolithic data system, and makes data observability more effective by pinpointing data changes, movement, access requests, and other events at a more granular level within the data pipeline.
7. Service level agreements
Service Level Agreements (SLAs) are a foundational part of data observability. SLAs were originally designed to clarify and document expectations between a service provider and its users. In a data observability context, they allow for clear communication about what "good" and "not good" data looks like.
They’re built up in three stages:
- SLIs: Service Level Indicators—what attributes will be tracked and what are acceptable levels
- SLOs: Service Level Objectives—how often will the SLIs be within their expected range
- SLAs: Service Level Agreements—who does what if the SLOs are not met
For example:
- SLI: Percent of null user_uuids per partition <= 0.5% in last 2 partitions
- SLO: 99% as tracked daily over a trailing 30 day window
- SLA: Data engineering will halt pipeline changes until resolved and SLO back within 99%
The development of SLIs, SLOs, and SLAs creates a clear framework for data consumers and data teams to align on exactly what "high quality" data means, and what will be done if that definition isn't met. It prevents ambiguity and eliminates arguments during high-pressure situations when something does go wrong.
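Evaluating the example SLO above amounts to a small calculation: did the SLI pass on at least 99% of days in the trailing 30-day window? A minimal sketch, with made-up daily null rates:

```python
# One made-up null-rate measurement per day, trailing 30-day window.
daily_null_rates = [0.001] * 29 + [0.02]  # one bad day out of 30

# SLI: null rate must be at or below 0.5% that day.
sli_pass = [rate <= 0.005 for rate in daily_null_rates]

# SLO: the SLI must pass on 99% of days in the window.
attainment = sum(sli_pass) / len(sli_pass)
slo_met = attainment >= 0.99

print(f"{attainment:.1%}", slo_met)  # 96.7% False -> the SLA's remediation kicks in
```

With the SLO missed, the SLA spells out exactly what happens next: data engineering halts pipeline changes until attainment is back above 99%.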
Architecture and location within the data stack
Data observability systems usually consist of the following elements:
- The control plane—the interface where everything is managed from
- The collection system—how the metrics and metadata get collected from data sources
- The anomaly detection system—takes in metrics, builds models, and fires alerts
- The notification connectors—outputs to Slack, PagerDuty, data catalogs, etc.
The collection system usually operates by running queries on the underlying data sources. This can be done via an agent, or it can be agentless with a direct connection to the data source. In either case a service account will exist on the data source, from which the metric and metadata collection queries can be run.
Queries are run on the data source, the metrics or metadata are summarized, and sent back to the control plane where they’re stored. This means raw data stays on the data source, and only aggregates are transmitted back to the control plane.
Once stored in the control plane, the metrics and metadata results are fed to the anomaly detection system so anomaly detection models can be trained on them and alerts can be fired if needed.
Finally, if an alert is needed, the control plane needs to be connected to the desired alert destination, like a Slack workspace, a PagerDuty workspace, a data catalog like Alation, or an analytics tool like Tableau.
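The aggregate-only collection step described above can be sketched as follows. The metric query runs on the data source, and only summary values are sent onward; the `users` table and the payload shape are illustrative, not any vendor's actual protocol:

```python
import sqlite3

# Hypothetical source table, queried via a read-only service account.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE users (user_uuid TEXT)")
source.executemany("INSERT INTO users VALUES (?)",
                   [("u1",), ("u2",), (None,)])

# Metric collection query: computed entirely on the data source.
row_count, null_count = source.execute(
    "SELECT COUNT(*), SUM(CASE WHEN user_uuid IS NULL THEN 1 ELSE 0 END) "
    "FROM users"
).fetchone()

# Only this aggregate payload travels to the control plane -- raw rows
# never leave the source.
payload = {"table": "users",
           "row_count": row_count,
           "null_rate": null_count / row_count}
print(payload)
```

This design is what limits both operational strain on the source and the security exposure of moving raw data.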
Use cases to solve with data observability
When should you use data observability? When do teams start to explore data observability platforms, and are there universal signs that it's time to deploy one? Let's walk through the use cases for which we prescribe data observability as the cure.
1. Keeping analytics accurate and trustworthy
Executives at large organizations regularly complain that “they don’t trust the data”. This is often because the data is internally inconsistent, or because there are discrepancies between the expected numbers and what the dashboard is showing. Ensuring that analytics dashboards and reports display accurate data is probably the top use case for data observability.
2. Protecting machine learning performance
Machine learning models are "garbage in, garbage out." In order for recommendation, fraud detection, and computer vision systems to generate accurate recommendations and predictions, the data inputs must be accurate and up-to-date. Data observability helps ensure these standards.
3. Accelerating ETL/ELT development velocity
Data model blue-green testing is a framework for deploying changes to data pipeline jobs. The framework expects two schemas in your database, one staging (blue), and one production (green). When you make a change to your job in your data pipeline, you initially run the changed job only in the staging schema. You then compare the outputted data to what is in the production schema. The only differences should be the expected ones from your change.
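The comparison step can be sketched as a simple diff between the two schemas' outputs. This is an illustrative simplification (real comparisons run in the warehouse and handle much larger outputs); the rows and the expected change are made up:

```python
# Output of the current job in the production schema.
production_rows = {("2024-01-01", 100), ("2024-01-02", 120)}

# Output of the changed job, run only in the staging schema. The
# change is expected to add one new daily partition.
staging_rows = {("2024-01-01", 100), ("2024-01-02", 120), ("2024-01-03", 90)}

missing_from_staging = production_rows - staging_rows
added_in_staging = staging_rows - production_rows

# The only differences should be the expected ones from the change.
assert not missing_from_staging
assert added_in_staging == {("2024-01-03", 90)}
```

If unexpected rows appear in either diff, the change is held back before it ever touches production.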
4. Accelerating data warehouse migrations
When you invest in a cloud data warehouse, you want to get up and running and start seeing value quickly. Unfortunately, migration complexity and manual validation processes can often kill momentum and force data teams to burn time and resources chasing migration issues instead of adding value to the business. Data observability can reduce the friction and produce faster, smoother migrations by replacing manual checks with automated migration validation and comprehensive reporting—helping ensure every row of data is delivered safe and sound.
Common signs that it’s the right time for data observability
You just experienced a high-severity data outage
The most obvious time to invest in data observability is right after an outage has been resolved. All organizations are busy, and getting buy-in to take preventative measures against a future outage can be difficult. The moments following an outage are the absolute best time to invest in data observability, because all stakeholders are aligned in wanting to prevent future problems from occurring.
Your pipelines have gotten complex
A lot of your teams are all about data: where does it come from, where is it going, and what is it telling you about how to run your business better and reach more customers? Data environments are constantly growing, shifting, and evolving. They truly take on a life of their own.
The “modern data stack” involves more tools now than ever before. That means more pipelines, more tables, and more opportunities for failure or disruption along with them. At all stages of the data pipeline, there are complex transactions occurring. Data is flowing between storage, processing, transformation, modeling, and business intelligence.
Teams can't afford to wait to be blindsided by inaccurate, broken, or stale data. One schema change can cause a furious uproar and catastrophic consequences. Change means growth, but it also means unpredictability. Data observability is technology's answer to that unpredictability; data observability platforms introduce predictability and reliability back into your complex data pipelines. You can't manually keep data catalogs up-to-date with spreadsheets and the occasional debriefing meeting. You need sharper visibility into your data pipelines and anomalies as soon as they occur.
You’ve moved to a hub-and-spoke data team structure
Data touches a lot of hands on its journey through your organization. As teams add data scientists, analysts, data reliability engineers, and business analysts, the ownership of data functions might shift or change hands, even within specific teams. And what about multiple teams that partially own some of the same data?
Data observability can help teams understand how work fits into the larger puzzle of data in your organization. Schema changes, new data sources, and pipeline additions are tracked and communicated with data observability. That way, teams can understand the impact of changes that feel minor, but might cause major ripple effects. Data observability is an effective communication tool; as your data writes a story, data observability serves as the transcript.
Building vs. Buying
There’s no one-size-fits-all answer to the age-old question, “Do we build it or do we buy it?” You can look at some of the following factors as you consider the best path forward for your data observability.
1. Is the observability technology core to your business?
Ask whether the technology is core to your business, and whether a data observability build would impact your core product. For example, Uber built their own mapping system. The bespoke "build" approach made sense because maps are central to their core business. However, another business for whom a map is a "nice-to-have" would not want to build in-house mapping infrastructure. They're a "buy" candidate.
2. Does the tool require huge engineering changes?
When buying a data tool off-the-shelf, it often comes with a certain framework. For example, a tool might assume all your data is going into a cloud data warehouse.
If your data isn’t going into a cloud data warehouse, the convenience of the third-party tool is rendered irrelevant. You’ll have to put engineering resources on the case to fully integrate the tool, and that’s not what you want.
3. What is the cost and time-to-value?
Manual checks take time. We estimate that building a custom solution in-house can cost hundreds of thousands of dollars, and take months to build. Not to mention the strain on your data and engineering teams. Building in-house observability means redirecting engineering time from customer problems to internal work. It’s likely you’d prefer to keep engineers focused on issues that plague paying customers.
4. Does the vendor have an experienced, communicative, and flexible team?
A good third-party vendor can provide some benefits of customization, without your having to build them in-house. A data observability vendor with a track record of supporting scaling, complex companies is ideal. Ask about their support, training, customer success, and integrations. Do they have a roster of enterprise customers?
What does their product roadmap look like in the near future? You might benefit from feature requests that the vendor gets from their entire customer base, even when it's not something you would have requested.
Signs you should reevaluate your current solution
Your data is likely changing frequently, and your data observability should change with it. On a periodic basis (maybe once a year), take stock of your data observability stack and tooling and re-evaluate whether it still suits your organizational needs.
This might mean conducting an audit of all the tables in your data warehouse, figuring out which of them might be decommissioned or modified, and removing or editing the associated metrics and alerts.
Best practices for data observability
Whether you have a data observability solution in place or not, there are steps you can take to make your organization more observability-friendly. These best practices can be applied at any point in your data observability journey.
1. Start defining the business-critical data
Before jumping into data observability, it’s important to first go through your data and understand which data matters most, and which data can be ignored. After all, if you monitored all of your data at all times, you’d quickly reach your maximum capacity to deal with the onslaught of that information.
Instead, look at what data is being consumed in the most business-critical ways. For example, the “Orders” table consumed by an analytics dashboard that is being regularly reviewed by executives is a high priority.
Data lineage products that help visualize the path data takes through your data system, from creation, through any databases and transformation jobs, all the way down to final destinations like analytics dashboards and feature stores, can also be helpful.
2. Roll out data observability in a T-shape
Once you understand which data matters most, you can apply T-shaped monitoring. Not everything in your organization needs the same level of monitoring; some areas of your data pipeline need special attention and care, and should be prioritized.
T-shaped Monitoring is an approach to data observability that tracks fundamentals across all your data, while applying deeper monitoring on the most critical datasets such as those used for financial planning, machine learning models, and executive-level dashboards.
Here’s how it works:
- Track data freshness first
- Then, select some business-critical datasets on which you can apply deep monitoring. Feel free to use a blend of metrics that Bigeye suggests for each table from its library of 70+ pre-built data quality metrics.
- Add custom metrics with templates and virtual tables to ensure custom business logic is being monitored for defects.
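The steps above can be sketched as plain configuration: fundamentals tracked everywhere, with deep metrics layered only onto the critical tables. All table and metric names here are illustrative, not Bigeye's actual API:

```python
# Hypothetical warehouse inventory and critical-table selection.
all_tables = ["orders", "sessions", "staging_tmp", "finance_summary"]
critical_tables = {"orders", "finance_summary"}

# T-shape: freshness/volume on everything (the top of the T), deeper
# quality metrics only on business-critical tables (the stem).
plan = {
    table: ["freshness", "volume"]
    + (["null_rate", "duplicates", "distribution"] if table in critical_tables else [])
    for table in all_tables
}

print(plan["sessions"])         # ['freshness', 'volume']
print(plan["finance_summary"])  # ['freshness', 'volume', 'null_rate', 'duplicates', 'distribution']
```

Keeping the deep-metric stem narrow is what keeps monitoring cost and alert noise under control as the warehouse grows.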
For more information about T-shaped monitoring, read on here.
3. Assign ownership and stewardship of the data pipeline
Each step in each data pipeline should be assigned an owner. For example, if data moves from an online service to Kafka to Snowflake, undergoes transformations, and lands in tables that feed a fraud detection algorithm, the online-to-Kafka segment might be "owned" by the Kafka team, while the transformations for the fraud tables might be owned by the fraud team. The assigned owner is responsible for the data pertaining to that segment.
Getting started with data observability
How do you roll out your data observability solution? It helps to ask your team a few questions first. Namely:
- What can break?
- What is impacted?
- Where do breaks occur?
- Why do breaks occur?
- What are the root causes of these potential breaks?
Data observability will help you get better answers to these questions. It works in a continuous feedback loop. A successful data observability practice looks like this:
- You understand data quality issues as they occur in production
- You get immediate insight into the impact of each issue
- You can pinpoint the spots along the data pipeline where something broke
- You can take action to fix the issue
- You can learn from the issue so it (and similar issues) don’t reoccur
With a data observability tool, you'll be equipped to understand the impact of data issues, so you can take action, fix them, and learn from them so they don't reoccur.
If you’re ready to build trust in your data across your whole organization, Bigeye can help. We recommend the following resources:
Schema change detection