May 5, 2022

Automating data observability at scale

Scale observability so that data users can focus on solving business problems instead of dealing with data issues.

Henry Li

Data is becoming core to operations and strategic decision-making across all industries. Anyone who has worked in data for a while knows that datasets are often riddled with problems and can change at any time; harnessing data insights correctly is essential to any data-driven business. Because of this, dynamic data quality and data process monitoring are two of the most important areas in the Modern Data Infrastructure (MDI). Today, statistical and AI/ML (SAM) methodologies are beginning to power automation in observability. The goal is to scale observability so that data users can focus on solving business problems instead of dealing with data issues.

A data observability system alerts data teams to incidents of bad data and guides them through incident mitigation and recovery. If implemented correctly, the observer helps engineers, analysts, and scientists catch issues before bad data affects downstream business processes.

Data observability has been implemented at scale, as with Uber's DQM, but only at a few companies with massive data engineering budgets. Today, companies can choose from a range of data observability tools on the market, saving multiple developer-years of cost compared with building from scratch.

Companies developing data observability tools must serve and manage data tasks for many types of companies, across different industries, over time. More complicated still, within each company there are stakeholders in different roles: data engineers, analysts, scientists, leadership, and so on. Each role has different needs for data, such as observing data ingestion versus tracking metrics on an executive dashboard. Observability should therefore work on multiple levels throughout the data infrastructure.

This presents a unique problem and opportunity: how do we build one data observability system that works for all?

The challenges of data observability at scale

There are several major challenges in this space due to regulatory and technical constraints.

Data security and privacy

When talking about data, security and localization come to mind, and various rules and regulations govern how data is collected, stored, and accessed. Region-specific regulations such as GDPR and CCPA restrict data access within and across companies. In addition, the policy landscape is rapidly evolving, with new regulations in the United States aimed at keeping data generated by foreign companies' applications within the country. In short, a data observability system needs to function in a localized way to maintain full data security and privacy.

Data integrity

Maintaining data integrity means that the observability tool should not modify or insert data into user databases or computing systems. Because it never modifies data, the tool preserves data integrity 100% of the time.
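As a minimal sketch of this read-only principle, an observer can gather quality signals with SELECT statements alone. The table name and metrics below are hypothetical illustrations, not any product's actual API:

```python
# Sketch of read-only metric collection: the observer issues only
# SELECT statements, so user data is never modified. The "orders"
# table and the chosen metrics are hypothetical.
import sqlite3

def collect_metrics(conn: sqlite3.Connection, table: str) -> dict:
    """Gather data-quality signals without writing to the database."""
    cur = conn.cursor()
    row_count = cur.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    null_ids = cur.execute(
        f"SELECT COUNT(*) FROM {table} WHERE id IS NULL"
    ).fetchone()[0]
    return {"row_count": row_count, "null_ids": null_ids}

# Demo fixture only; in practice the observer connects to an existing
# database and never performs writes of its own.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (None, 3.0)])
print(collect_metrics(conn, "orders"))  # {'row_count': 2, 'null_ids': 1}
```

Because the observer reads only aggregates, it can run against production systems without any risk of altering user data.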

Local optimization

The deployment of an observability system and its alerting should work out of the box, and it should optimize over time based on user feedback, an approach sometimes called reinforcement learning for anomaly detection. The objective is to generate a deployment model that suits users’ needs by surfacing the issues they care about; however, this model should still warn users of data quality issues in the “unknown unknowns” category.

Given these constraints, the observability system must automatically evolve over time if it is to capture data outages (true positives) with low noise (few false-positive alerts). The key is to capture user feedback on whether a set of issues is relevant, as the system learns from various data quality signal metrics in the form of time series (a high-dimensional statistical problem). And if an issue truly reflects a data degradation event, then there needs to be a mechanism that allows local deployments to evolve: collecting information on which system parameters work and which don't, so that the SAM models can be adjusted.
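One way to picture this feedback loop is a z-score detector whose alert threshold adapts to user labels. This is an illustrative sketch; the class name and the multiplicative update rule are assumptions, not Bigeye's actual algorithm:

```python
# Feedback-driven tuning sketch: a z-score detector flags anomalies in
# a metric time series, and false-positive feedback from users widens
# the threshold over time (true positives tighten it slightly).
from statistics import mean, stdev

class FeedbackTunedDetector:
    def __init__(self, threshold: float = 3.0):
        self.threshold = threshold  # z-score cutoff, starts at a default

    def is_anomaly(self, history: list[float], value: float) -> bool:
        mu, sigma = mean(history), stdev(history)
        if sigma == 0:
            return value != mu
        return abs(value - mu) / sigma > self.threshold

    def record_feedback(self, was_true_positive: bool) -> None:
        # Widen the cutoff on false alarms so the deployment adapts
        # to local data dynamics; tighten it on confirmed outages.
        self.threshold *= 0.95 if was_true_positive else 1.05

history = [100, 102, 98, 101, 99, 100, 103, 97]  # e.g. daily row counts
detector = FeedbackTunedDetector()
print(detector.is_anomaly(history, 180))  # sudden spike -> True
detector.record_feedback(was_true_positive=False)  # user says: not an issue
```

A default threshold keeps the system useful out of the box, while the feedback updates move each deployment toward the issues its users actually care about.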

A federated learning framework

What we described above is akin to the self-driving car problem: deploy cars in different locations and let each car gather local data to optimize its driving settings. The cars then send measurements back to the company to tune the overall driving model(s).

This approach is called federated learning, and its use has been on the rise in the modern SAM world. And this analogy serves as a framework for thinking about how we can achieve data observability at scale in MDI.
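In this spirit, each local deployment could tune detector parameters on its own feedback and send back only the parameters, never the underlying data; the center then aggregates them, FedAvg-style. A simplified, hypothetical sketch:

```python
# Federated-averaging sketch: combine locally tuned parameter dicts
# into a global model, weighting each site by how much data it
# observed. Parameter names and weights here are illustrative.

def federated_average(local_params: list[dict], weights: list[float]) -> dict:
    """Weighted average of parameter dicts from local deployments."""
    total = sum(weights)
    keys = local_params[0].keys()
    return {
        k: sum(p[k] * w for p, w in zip(local_params, weights)) / total
        for k in keys
    }

site_a = {"z_threshold": 3.2, "min_rows": 1000.0}  # tuned on site A's feedback
site_b = {"z_threshold": 2.8, "min_rows": 3000.0}  # tuned on site B's feedback
global_params = federated_average([site_a, site_b], weights=[1.0, 3.0])
print(global_params)  # z_threshold near 2.9, min_rows 2500.0
```

Only parameters cross the boundary, which is what lets this pattern coexist with the localization and privacy constraints described earlier.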

But even with this setup, data observability faces a set of interesting SAM challenges. Data dynamics in the real world are not the same for every company, organization, and user. Different dynamics can arise from the ebbs and flows of business cycles, glitches in pipelines, manual data modifications, and more. As a result, potential anomalies flagged by an observability tool may look different across the board.

Another layer of complexity on top of this is that user feedback is not always consistent. When an issue is presented to the user, the user could label it as a false-positive alert, a true-positive alert, an expected dynamic, or something else. And some users may consider an alert useful while others might not, due to factors like whether they have time to address the problem, or what their relationship to the data is in the first place.

For instance, evaluation of model performance, usually via a ROC curve, applies to one model at a time. Given a diverse set of observability data metric types and their combinations across all user spaces, applying the calculations blindly would not be statistically sound. Adjusting system parameters can also introduce unexpected problems, so we must ensure that each improvement iteration still corrects for all past deficiencies; i.e., a past deficiency should not resurface in future deployments.
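For a single model, ROC AUC can be computed directly from alert scores and user labels; the caveat above is that this is only meaningful per model and metric type, not pooled blindly across heterogeneous metrics. A self-contained sketch with made-up scores:

```python
# Per-model evaluation sketch: ROC AUC from anomaly scores and user
# feedback labels (1 = confirmed outage, 0 = false alarm). Uses the
# rank-based formulation: the probability that a random positive
# outscores a random negative, with ties counted as half.

def roc_auc(scores: list[float], labels: list[int]) -> float:
    """Area under the ROC curve for one model's alerts."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.3, 0.1]   # model's anomaly scores for past alerts
labels = [1, 1, 0, 0]           # user feedback on those alerts
print(roc_auc(scores, labels))  # 1.0: every outage outscored every false alarm
```

Running this separately per metric type keeps the evaluation statistically sound; a pooled curve over incomparable score scales would not be.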

High-quality data observability for all

Modern data infrastructure is proving to be a complex and fast-developing space. In the past, we only cared about one data pipeline or one database. Now we look at a much bigger infrastructure that includes data collection, ingestion, storage, transformation, machine learning model training, and more, with each part becoming increasingly specialized. For data users, navigating this infrastructure and deciding what to observe and monitor can be very challenging, especially for users who are far from the processes that move data. However, within the data observability community, we have SAM tools and frameworks, such as federated learning, to draw from in delivering high-quality data observability systems to all data users.

At Bigeye, we know that there are different users and organizations that care about a vast, diverse set of data issues and specific data needs. That’s why we are building an intuitive user experience and feedback system to solve users’ data challenges while helping them tune the observability system to capture problems that they care about over time. Through understanding these data and human complexities, and approaching them with a federated learning framework, we believe we can scale data observability to serve any type of data problem — small or large.

Common needs

Data engineers
Care about: overall data flow. Data is fresh and operating at full volume, and jobs are always running, so data outages don't impact downstream systems.
Needs: freshness and volume monitoring; schema change detection; lineage monitoring.

Data scientists
Care about: specific datasets in great detail. They look for outliers, duplication, and other, sometimes subtle, issues that could affect their analysis or machine learning models.
Needs: freshness monitoring; completeness monitoring; duplicate detection; outlier detection; distribution shift detection; dimensional slicing and dicing.

Analytics engineers
Care about: rapidly testing the changes they’re making within the data model. They want to move fast and not break things, without spending hours writing tons of pipeline tests.
Needs: lineage monitoring; ETL blue/green testing.

Business intelligence analysts
Care about: the business impact of data. They want to understand where they should spend their time digging in, and when they have a red herring caused by a data pipeline problem.
Needs: integration with analytics tools; anomaly detection; custom business metrics; dimensional slicing and dicing.

Other stakeholders
Care about: data reliability. Customers and stakeholders don’t want data issues to bog them down, delay deadlines, or provide inaccurate information.
Needs: integration with analytics tools; reporting and insights.
