Product
December 2, 2022

Automatically quarantining bad data with Bigeye and dbt

Egor Gryaznov

At Bigeye, we’re always working with the tools that our customers use and love in order to find new ways to bring data observability right into their most important workflows. Normally, we don’t reveal these capabilities until they are production-ready, but with dbt’s Coalesce Conference just around the corner, I wanted to preview an exciting new workflow I’ve built by connecting Bigeye to our internal dbt Cloud instance.  

For the uninitiated, dbt is a popular data transformation tool that lets data analysts build their own data pipelines using SQL, and Bigeye is a powerful data observability tool that automatically identifies stale, damaged, and unexpected data. While these tools mostly focus on different portions of the data stack — dbt performing transformations inside the data warehouse and Bigeye sitting on top of the data warehouse to monitor data quality — they’re definitely interoperable.

Let’s look at how I built an automatic data quarantining workflow by leveraging Bigeye’s automatic anomaly detection and dbt’s data transformation capabilities.

Bigeye → dbt

Here's a common scenario: you're a product manager about to go into an important meeting with an executive when you realize that the numbers on your BI dashboard don't make sense. Maybe there are six hours from yesterday with no data, or there are suddenly four subscription categories instead of three, with the fourth being a misspelling.

Bigeye automatically detects these types of data issues by monitoring 60+ data quality metrics, such as the percentage of nulls in a given column. Bigeye intelligently learns thresholds for these metrics based on the data's historical patterns and alerts you when the data behaves anomalously. No more getting caught off guard by data quality issues: your data team will know about them before your stakeholders are affected.
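To make that concrete, a null-percentage metric boils down to a simple aggregate query against the monitored column. The sketch below is illustrative only: the table, column, and time window are hypothetical, and the exact SQL Bigeye generates depends on your warehouse and configuration.

```sql
-- Illustrative null-rate metric over the last day of data.
-- Table and column names here are hypothetical.
SELECT
  COUNT(*) AS row_count,
  100.0 * SUM(CASE WHEN subscription_category IS NULL THEN 1 ELSE 0 END)
    / NULLIF(COUNT(*), 0) AS pct_null
FROM analytics.subscriptions
WHERE loaded_at >= CURRENT_DATE - INTERVAL '1 day';
```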

But what happens once an issue has been detected? The next step is figuring out the root cause of the bad or missing data.

Because all Bigeye data quality metrics are computed using dynamic queries generated at runtime, we can also dynamically generate a follow-up query to run when anomalies are detected. These follow-up queries can identify specific rows that contributed to the anomaly, and we show previews of these rows in the UI to speed up root cause analysis for our users.
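Continuing the hypothetical null-rate example above, the follow-up query can simply turn the metric's aggregate into a row-level filter, so you see the offending records rather than a single percentage:

```sql
-- Illustrative follow-up query: surface the rows behind the null-rate spike.
SELECT *
FROM analytics.subscriptions
WHERE subscription_category IS NULL
  AND loaded_at >= CURRENT_DATE - INTERVAL '1 day'
LIMIT 100;
```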

But what if we could not only surface the anomalous rows for debugging, but also quarantine them so they don't impact our stakeholders' dashboards?

To find out, I connected Bigeye to our dbt Cloud account and wired it up to take our dynamically generated query and produce a job within dbt that pulls all affected rows out into a side table. This keeps them out of the stakeholders' dashboards and makes it easy for the data engineer to inspect them, all automatically and, crucially, in a way that can be easily reversed.
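As a rough sketch of what such a generated quarantine model could look like (this is illustrative, not the exact code the integration produces, and it assumes a source named analytics.subscriptions is defined in the dbt project):

```sql
-- models/subscriptions_quarantine.sql (hypothetical model name)
-- Captures the rows flagged by the anomaly's predicate into a side table.
{{ config(materialized='incremental') }}

select *
from {{ source('analytics', 'subscriptions') }}
where subscription_category is null
  and loaded_at >= current_date - interval '1 day'

{% if is_incremental() %}
  -- Don't re-capture rows already sitting in the quarantine table.
  and loaded_at > (select max(loaded_at) from {{ this }})
{% endif %}
```

Downstream models can then anti-join against this table (or a post-hook can delete the captured rows from the source) so the flagged rows stay out of the dashboard.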

If I don’t like the rows Bigeye vacuumed out of our source table, I can easily generate another job to run the process in reverse and put the same rows back.
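The reversal could be expressed as something as simple as a dbt macro invoked with dbt run-operation. The sketch below is hypothetical and assumes the quarantine table from the previous example; the actual job generated for the reverse direction may differ.

```sql
-- macros/restore_quarantined_rows.sql (hypothetical sketch)
{% macro restore_quarantined_rows() %}
  {# Put the quarantined rows back into the source table... #}
  {% do run_query("insert into analytics.subscriptions select * from analytics.subscriptions_quarantine") %}
  {# ...and empty the quarantine table so the restore is not repeated. #}
  {% do run_query("delete from analytics.subscriptions_quarantine") %}
{% endmacro %}
```

Invoked with `dbt run-operation restore_quarantined_rows`, this puts the same rows back where they came from.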

Next Steps for Bigeye <> dbt

The beauty of dbt is that it's a complete programming environment for data modeling. This opens up a number of possibilities for further Bigeye-dbt integrations.

We rarely share projects before they are in production, but we just couldn't help ourselves in this case! If you get as excited about these kinds of projects as we do, we'd love to talk to you. And if you're interested in leveraging Bigeye and dbt to automatically quarantine bad data, send me a note at egor@bigeye.com.

