In this article, we’ll answer the following questions:
What is data observability?
Is data observability something you need?
What are the primary use cases for data observability?
What should you look for in a data observability platform?
How does data observability help you keep data reliable?
Data observability enables a step-change in how much you can confidently do with your data by making it easy to keep data fresh and high quality 24/7.
What is data observability?
Data observability is the ability to have constant and complete knowledge of what’s happening inside all the tables and pipelines in your data stack.
SRE and DevOps teams have long used observability to keep applications and infrastructure working around the clock. Data observability reimagines the concept for the world of data engineering and data science. Instead of knowing that servers or containers are working, and that app performance looks healthy, data observability helps to ensure that data is flowing and is high enough quality for the analytics or machine learning applications it’s being used in.
Unlike the older business-rules approach to data quality, or even modern data pipeline testing that relies on pass/fail conditions, observability continuously collects signals from datasets and applies monitoring and anomaly detection to those signals. Because of this, observability can be applied automatically at any scale, without human subject matter experts or data engineers manually defining everything they expect from the data upfront. Check out our blog on testing vs observability for more.
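To make the contrast concrete, here is a minimal sketch in Python (all names and thresholds are hypothetical, not any particular platform's API): a business rule is a fixed pass/fail assertion defined upfront, while an observability signal is just a metric collected on every run and kept as a time series that monitoring can analyze afterward.

```python
from datetime import datetime, timezone

# Business-rule style: a fixed pass/fail condition a human had to define upfront.
def rule_check(rows):
    return len(rows) > 1000  # passes or fails; no history, no context

# Observability style: collect signals (metrics) from the dataset on every run...
def collect_signals(table_name, rows):
    return {
        "table": table_name,
        "collected_at": datetime.now(timezone.utc).isoformat(),
        "row_count": len(rows),
        "null_rate": sum(1 for r in rows if r is None) / max(len(rows), 1),
    }

# ...and append them to a time series that anomaly detection can run over later.
signal_history = []
signal_history.append(collect_signals("orders", [1, 2, None, 4]))
```

The key design difference: the rule encodes one expectation, while the signals are expectation-free measurements, so new monitors can be layered on later without touching the pipeline.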
How do you know when you need data observability?
Data observability matters as soon as your organization starts any initiative that puts data directly in front of executives, non-data-team members, or customers.
These stakeholders expect the data they’re interacting with—a dashboard, a machine learning model, a report, etc.—to be accurate, and there isn’t a human in the loop to validate all the data by hand before they see it.
Data observability creates a safety net for these data applications, ensuring that people on the data team can spring into action before stakeholders are affected, ultimately working toward a high level of trust in the data. So as soon as you have data being used in an “online” use case like self-service analytics or machine learning, you need some level of data observability to ensure it’s working reliably.
Take a company with an ML model that suggests products to customers based on their interactions on an e-commerce site. The company depends on this model to recommend products their customers might be interested in, ultimately increasing sales. If the pipeline feeding that model stops running, and the model stops recommending or starts making poor recommendations, every minute until the issue is identified and fixed means confused customers and lost sales. On top of that, the product team loses trust in the model, and the CRO starts to question the data team on reliability, and whether the budget being spent on machine learning projects should just be diverted to marketing.
With data observability providing a 24/7 safety net, the data team gets notified of the pipeline problem as soon as it occurs, enabling them to respond quickly and minimize the model’s downtime. They can give their stakeholders confidence that even with all the constant changes happening within the data model, there won’t be uncaught outages.
What are the primary use cases for data observability?
Unblocking data engineering teams
The data engineering team is the beating heart of data, and where the most pressure tends to collect when data quality starts becoming an issue. These teams are often already understaffed but get slowed down further when:
They’re reactively root causing and fixing data quality issues reported by their stakeholders
They need to closely couple with subject matter experts to set up manual business rules or data pipeline tests
Data observability unblocks these teams and enables them to serve the organization with fresher, more reliable data. It helps data engineers get proactive with earlier warnings and more context about problems in their pipelines. It also decouples them from the subject matter experts, giving those stakeholders the ability to subscribe to whatever monitoring and alerting is relevant to them, without heavy coordination to write business rules or tests.
Here are some examples of what data engineering teams gain from adding a data observability platform to their data stack:
Their data science stakeholders are happier because freshness issues that would otherwise slow them down are quickly identified and fixed.
The whole organization trusts the data more because the data engineering team is always the first to know about any data outage, rather than executives or end users.
The data engineering team can work more efficiently, even at large scale, by moving from manual pipeline testing methods to self-configuring monitoring.
Enabling data science teams to move faster
Data science teams still spend a ton of time each week doing exploratory analysis, wrangling, and effectively playing defense against the data to make sure their insights and ML models aren’t degraded. Data observability gives these teams time back by screening out entire classes of data quality issues that the data science team would otherwise have to repetitively check for.
Some data science teams are able to request pipeline tests from their data engineering neighbors, but firing off a Jira ticket and waiting for it to be prioritized wastes both teams’ time. Data observability platforms decouple this workflow and enable the data science team to add whatever monitoring they need to protect their own use cases.
Here are some examples of monitoring that data science teams might enable:
Detecting new categorical values in the feature store data that didn’t exist when the model was last trained, prompting a model retrain to protect prediction accuracy
Warning them of unexpected duplicates in training data that might create bias in their model the next time it retrains
Freshness checks for the tables upstream from dashboards they’ve created for key executives, which are expected to refresh on schedule and provide an up-to-date view of the business
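The first two monitors above can be approximated in a few lines of Python (the column values and function names here are illustrative assumptions, not a platform API):

```python
def new_categories(training_values, live_values):
    """Return categorical values seen in live data but absent at training time."""
    return set(live_values) - set(training_values)

def duplicate_rows(rows):
    """Return rows that appear more than once in a training set."""
    seen, dupes = set(), set()
    for row in rows:
        if row in seen:
            dupes.add(row)
        seen.add(row)
    return dupes

# Example: a new product category appeared after the model was last trained,
# which would prompt a retrain to protect prediction accuracy.
trained_on = ["electronics", "books", "toys"]
now_seen = ["electronics", "books", "toys", "grocery"]
drift = new_categories(trained_on, now_seen)
```

A real platform would compute these signals continuously against the warehouse and alert on changes, but the underlying checks are this simple in spirit.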
Keeping data replication running smoothly
The financial services industry consumes an enormous amount of external data to drive investment decisions and other high-dollar-amount use cases. Data observability helps Venture Capital funds like SignalFire know that all of the external data being consumed is landing on time so they can keep their models and analytics fresh.
Here are some examples of data replication that benefit from observability:
Ensuring data integration tools like Fivetran aren’t just running on time, but actually moving the expected volume of data into the data warehouse
Third party data purchased from vendors or data marketplaces that may not always arrive on time, triggering credits owed under their contracts
Operational data and reverse-ETL
Getting data out of the warehouse and into operational tools like HubSpot, Salesforce, Marketo, and Jira is a key part of the promise of central data platforms. Push all the data into the warehouse, transform it to combine and aggregate and enrich everything, then push it into the places the business needs to use it. Solutions like Census and Hightouch are helping companies operationalize their data via reverse-ETL. Data observability becomes incredibly valuable in these applications, because uncaught data problems turn into operational errors quickly and at scale.
Imagine millions of marketing emails going out at 9:00AM with missing or obviously incorrect values. With data observability, duplicates or missing values are easily detectable before the emails are sent, saving the company from an embarrassing mistake in front of prospective customers.
Working more efficiently
Data teams, and indeed all teams across an organization, work more efficiently with proper data observability in play. Learn more about automating data observability at scale here.
What to look for in a data observability platform?
Ease, speed, and accuracy of coverage
At the most fundamental level, a data observability platform should take on the burden of writing and maintaining manual tests and move data teams from reacting to data quality fires to proactively resolving them. That means the platform should leverage automation wherever possible, including automatically recommending and implementing data quality metrics, and detecting and alerting when issues occur.
The platform should ultimately be able to detect the full range of issues that can occur, from simple monitoring like freshness and row counts that you want tracked on every table, to deeper monitoring for things like distribution shifts that are necessary on the critical tables that drive key dashboards and models.
Great anomaly detection performance
Data science is the unsung hero of effective data observability, responsible for some of the most important parts of a data observability system, including anomaly detection. Bad anomaly detection that simply relies on moving averages, or shoehorns an open source forecasting project into an anomaly detection role, risks creating a giant noise-making machine.
With solid anomaly detection in place, data teams can proactively detect issues (even “unknown unknowns”) and leadership can home in on the root causes of business problems quickly. Without advanced anomaly detection, data teams have less assurance that their data observability system will catch all of the issues that matter without flooding them with false positives.
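As a toy illustration of why naive approaches get noisy, here is a simple moving-average detector in Python; the window and threshold are illustrative assumptions, and production systems use far more sophisticated models that account for seasonality and trend.

```python
from statistics import mean, stdev

def naive_anomalies(series, window=7, z=3.0):
    """Flag points more than z standard deviations from the trailing mean.

    This ignores seasonality and trend entirely, so on real metrics
    (e.g. weekday vs weekend row counts) it tends to either miss real
    issues or fire constantly -- the "giant noise-making machine" problem.
    """
    flagged = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and abs(series[i] - mu) > z * sigma:
            flagged.append(i)
    return flagged

# Daily row counts for a table; the final day the pipeline stopped running.
row_counts = [1000, 1010, 990, 1005, 995, 1002, 998, 0]
```

It catches a dead pipeline, but any metric with weekly seasonality would trip the same threshold on normal variation, which is why detection quality is worth evaluating closely.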
Helps you monitor replications and migrations
Data is constantly being updated, optimized, and migrated. Cloud migrations happened 24x faster during the pandemic, and now, more data than ever is being moved into data warehouses and lakes with tools like Airbyte, Fivetran, and Matillion. A data observability platform can ensure that the data you moved from point A landed unbroken at point B, providing greater validation at a fraction of the time normally needed to compare datasets.
This is done by comparing datasets and identifying differences. Ideally the comparison is automatic: columns in the source and target tables are mapped, metrics are computed for each, and the differences tell you whether anything is being dropped or mutated as it moves from one place to another.
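A back-of-the-envelope version of that comparison in Python (the metrics and column mapping here are deliberately simplified assumptions; a real platform computes many more metrics and maps columns automatically):

```python
def table_metrics(rows, columns):
    """Compute a few per-table and per-column metrics for comparison."""
    metrics = {"row_count": len(rows)}
    for i, col in enumerate(columns):
        values = [r[i] for r in rows]
        metrics[f"{col}.null_count"] = sum(1 for v in values if v is None)
        metrics[f"{col}.distinct_count"] = len(set(values))
    return metrics

def diff_metrics(source, target):
    """Return every metric whose value disagrees between source and target."""
    return {k: (source[k], target.get(k))
            for k in source if source[k] != target.get(k)}

source = table_metrics([(1, "a"), (2, "b"), (3, None)], ["id", "name"])
target = table_metrics([(1, "a"), (2, "b")], ["id", "name"])  # a row was dropped in flight
drift = diff_metrics(source, target)  # row_count and several column metrics disagree
```

Comparing metric summaries instead of row-by-row values is what makes this cheap enough to run continuously during a migration.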
Supports upward/outward reporting
Once the platform is up and running, you’ll want to be able to quantify how much you’re improving data reliability for your organization. Features like built-in analytics dashboards, or the ability to export data into your warehouse for custom analysis will help your team quantify and communicate your progress with leadership and other teams.
Integration with your stack
You need to ensure that the data observability platform will work with the other tools in your stack. Here are the major considerations, and you can check out Bigeye’s integrations here.
How data observability helps you keep data reliable
Data observability gives data teams the information they need to predict, diagnose, understand, and ultimately prevent problems before they impact their users. And when they start identifying problems earlier, fixing them faster, and preventing them from happening again, their organizations can start to use data in higher-risk, higher-reward applications.
That’s what they really want at the end of the day—to put their data to work in the highest value applications they possibly can.
Data teams using Bigeye have reported 50%+ reductions in incidents that reach their users, 20-30% less time spent working on data quality in general, and problem detection and resolution times down from several days to a few hours.
If you want to see how Bigeye can help your data team, schedule a demo with us today.