May 5, 2022

Automating data observability at scale

Scale observability so that data users can focus on solving business problems instead of dealing with data issues.

Henry Li

Data is becoming core to operations and strategic decision-making across all industries. Anyone who has worked in data for a while knows that datasets are often riddled with problems and can change at any time; harnessing data insights correctly is essential to any data-driven business. Because of this, dynamic data quality and data process monitoring are two of the most important areas in the Modern Data Infrastructure (MDI). Today, statistical and AI/ML (SAM) methodologies are beginning to power automation in observability. The goal is to scale observability so that data users can focus on solving business problems instead of dealing with data issues.

A data observability system alerts data teams to incidents of bad data and guides them through incident mitigation and recovery. If implemented correctly, the observer helps engineers, analysts, and scientists catch issues before bad data affects downstream business processes.

Data observability has been implemented at scale, as with Uber's DQM, but only at a few companies with massive data engineering budgets. Today, companies can choose from a range of data observability tools on the market, saving multiple developer-years of cost compared with building from scratch.

Companies developing data observability tools must serve and manage data tasks for many types of companies, across different industries, over time. More complicated still, within each company there are stakeholders in different roles: data engineers, analysts, scientists, leadership, and so on. Each role has different needs for data, such as observing data ingestion versus tracking metrics on an executive dashboard. Observability should therefore work on multiple levels throughout the data infrastructure.

This presents a unique problem and opportunity: how do we build one data observability system that works for all?

The challenges of data observability at scale

There are several major challenges in this space due to regulatory and technical constraints.

Data security and privacy

When talking about data, security and localization come to mind, and various rules and regulations govern how data is collected, stored, and accessed. Region-specific regulations such as GDPR and CCPA restrict data access within and across companies. In addition, the policy landscape is rapidly evolving, with new regulations in the United States aimed at keeping data generated by foreign companies' applications within the country. In short, a data observability system needs to function in a localized way to maintain full data security and privacy.

Data integrity

Maintaining data integrity means that the observability tool should not modify or insert data into user databases or computing systems. Because it never modifies data, the tool preserves data integrity 100% of the time.
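As a minimal sketch of this read-only principle, an observer can gather quality signals with SELECT statements alone. The table name and metrics below are hypothetical illustrations, not any product's actual API:

```python
# Sketch of read-only metric collection: the observer issues only
# SELECT statements, so user data is never modified. The "orders"
# table and the chosen metrics are hypothetical.
import sqlite3

def collect_metrics(conn: sqlite3.Connection, table: str) -> dict:
    """Gather data-quality signals without writing to the database."""
    cur = conn.cursor()
    row_count = cur.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    null_ids = cur.execute(
        f"SELECT COUNT(*) FROM {table} WHERE id IS NULL"
    ).fetchone()[0]
    return {"row_count": row_count, "null_ids": null_ids}

# Demo fixture only; in practice the observer connects to an existing
# database and never performs writes of its own.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (None, 3.0)])
print(collect_metrics(conn, "orders"))  # {'row_count': 2, 'null_ids': 1}
```

Because the observer reads only aggregates, it can run against production systems without any risk of altering user data.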

Local optimization

The deployment of an observability system and its alerting should work out of the box, and it should optimize over time based on user feedback, an approach sometimes called reinforcement learning for anomaly detection. The objective is to generate a deployment model that suits users’ needs by surfacing the issues they care about; however, this model should still warn users of data quality issues in the “unknown unknowns” category.

Given these constraints, the observability system must automatically evolve over time if it is to capture data outages (true positives) with low noise (few false-positive alerts). The key is to capture user feedback on whether a set of issues is relevant, as the system learns from various data quality signal metrics in the form of time series (a high-dimensional statistical problem). And if an issue truly reflects a data degradation event, then there needs to be a mechanism that allows local deployments to evolve: collecting information on which system parameters work and which don't, so that the SAM models can be adjusted.
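One way to picture this feedback loop is a z-score detector whose alert threshold adapts to user labels. This is an illustrative sketch; the class name and the multiplicative update rule are assumptions, not Bigeye's actual algorithm:

```python
# Feedback-driven tuning sketch: a z-score detector flags anomalies in
# a metric time series, and false-positive feedback from users widens
# the threshold over time (true positives tighten it slightly).
from statistics import mean, stdev

class FeedbackTunedDetector:
    def __init__(self, threshold: float = 3.0):
        self.threshold = threshold  # z-score cutoff, starts at a default

    def is_anomaly(self, history: list[float], value: float) -> bool:
        mu, sigma = mean(history), stdev(history)
        if sigma == 0:
            return value != mu
        return abs(value - mu) / sigma > self.threshold

    def record_feedback(self, was_true_positive: bool) -> None:
        # Widen the cutoff on false alarms so the deployment adapts
        # to local data dynamics; tighten it on confirmed outages.
        self.threshold *= 0.95 if was_true_positive else 1.05

history = [100, 102, 98, 101, 99, 100, 103, 97]  # e.g. daily row counts
detector = FeedbackTunedDetector()
print(detector.is_anomaly(history, 180))  # sudden spike -> True
detector.record_feedback(was_true_positive=False)  # user says: not an issue
```

A default threshold keeps the system useful out of the box, while the feedback updates move each deployment toward the issues its users actually care about.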

A federated learning framework

What we described above is akin to the self-driving car problem: deploy cars in different locations and let each car gather local data to optimize its driving settings. The cars then send measurements back to the company to tune the overall driving model(s).

This approach is called federated learning, and its use has been on the rise in the modern SAM world. And this analogy serves as a framework for thinking about how we can achieve data observability at scale in MDI.
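In this spirit, each local deployment could tune detector parameters on its own feedback and send back only the parameters, never the underlying data; the center then aggregates them, FedAvg-style. A simplified, hypothetical sketch:

```python
# Federated-averaging sketch: combine locally tuned parameter dicts
# into a global model, weighting each site by how much data it
# observed. Parameter names and weights here are illustrative.

def federated_average(local_params: list[dict], weights: list[float]) -> dict:
    """Weighted average of parameter dicts from local deployments."""
    total = sum(weights)
    keys = local_params[0].keys()
    return {
        k: sum(p[k] * w for p, w in zip(local_params, weights)) / total
        for k in keys
    }

site_a = {"z_threshold": 3.2, "min_rows": 1000.0}  # tuned on site A's feedback
site_b = {"z_threshold": 2.8, "min_rows": 3000.0}  # tuned on site B's feedback
global_params = federated_average([site_a, site_b], weights=[1.0, 3.0])
print(global_params)  # z_threshold near 2.9, min_rows 2500.0
```

Only parameters cross the boundary, which is what lets this pattern coexist with the localization and privacy constraints described earlier.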

But even with this setup, data observability faces a set of interesting SAM challenges. Data dynamics in the real world are not the same for every company, organization, and user. Different dynamics can arise from the ebbs and flows of business cycles, glitches in pipelines, manual data modifications, and more. As a result, potential anomalies flagged by an observability tool may look different across the board.

Another layer of complexity on top of this is that user feedback is not always consistent. When an issue is presented to the user, the user could label it as a false-positive alert, a true-positive alert, an expected dynamic, or something else. And some users may consider an alert useful while others might not, due to factors like whether they have time to address the problem, or what their relationship to the data is in the first place.

For instance, evaluation of model performance, usually via a ROC curve, applies to one model at a time. Given a diverse set of observability data metric types and their combinations across all user spaces, applying the calculations blindly would not be statistically sound. Adjusting system parameters can also introduce unexpected problems, so we must ensure that each improvement iteration still corrects for all past deficiencies; i.e., a past deficiency should not resurface in future deployments.
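For a single model, ROC AUC can be computed directly from alert scores and user labels; the caveat above is that this is only meaningful per model and metric type, not pooled blindly across heterogeneous metrics. A self-contained sketch with made-up scores:

```python
# Per-model evaluation sketch: ROC AUC from anomaly scores and user
# feedback labels (1 = confirmed outage, 0 = false alarm). Uses the
# rank-based formulation: the probability that a random positive
# outscores a random negative, with ties counted as half.

def roc_auc(scores: list[float], labels: list[int]) -> float:
    """Area under the ROC curve for one model's alerts."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.3, 0.1]   # model's anomaly scores for past alerts
labels = [1, 1, 0, 0]           # user feedback on those alerts
print(roc_auc(scores, labels))  # 1.0: every outage outscored every false alarm
```

Running this separately per metric type keeps the evaluation statistically sound; a pooled curve over incomparable score scales would not be.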

High-quality data observability for all

Modern data infrastructure is proving to be a complex and fast-developing space. In the past, we only cared about one data pipeline or one database. Now we look at a much bigger infrastructure that includes data collection, ingestion, storage, transformation, machine learning model training, and more, with each part becoming increasingly specialized. For data users, navigating this infrastructure and deciding what to observe and monitor can be very challenging, especially for users who are far from the processes that move data. However, within the data observability community, we have SAM tools and frameworks, such as federated learning, to draw from in delivering high-quality data observability systems to all data users.

At Bigeye, we know that there are different users and organizations that care about a vast, diverse set of data issues and specific data needs. That’s why we are building an intuitive user experience and feedback system to solve users’ data challenges while helping them tune the observability system to capture problems that they care about over time. Through understanding these data and human complexities, and approaching them with a federated learning framework, we believe we can scale data observability to serve any type of data problem — small or large.

Common needs

Data engineers
Care about: overall data flow. Data is fresh and operating at full volume, and jobs are always running, so data outages don't impact downstream systems.
Needs: freshness and volume monitoring; schema change detection; lineage monitoring.

Data scientists
Care about: specific datasets in great detail. They look for outliers, duplication, and other, sometimes subtle, issues that could affect their analysis or machine learning models.
Needs: freshness monitoring; completeness monitoring; duplicate detection; outlier detection; distribution shift detection; dimensional slicing and dicing.

Analytics engineers
Care about: rapidly testing the changes they’re making within the data model. They want to move fast and not break things, without spending hours writing tons of pipeline tests.
Needs: lineage monitoring; ETL blue/green testing.

Business intelligence analysts
Care about: the business impact of data. They want to understand where they should spend their time digging in, and when they have a red herring caused by a data pipeline problem.
Needs: integration with analytics tools; anomaly detection; custom business metrics; dimensional slicing and dicing.

Other stakeholders
Care about: data reliability. Customers and stakeholders don’t want data issues to bog them down, delay deadlines, or provide inaccurate information.
Needs: integration with analytics tools; reporting and insights.
