Product
March 1, 2023

How Bigeye detected issues that our dbt CI jobs missed

This is a story about broken data, and how Bigeye pinpointed the issue, minimizing the impact to our internal analytics. Crisis averted. Read on to learn more.

Christian MacDonald

At Bigeye we use a variety of tools in our data stack to analyze our customers' usage patterns. We use dbt to transform our data into meaningful insights, and Bigeye to observe our data reliability and quality.

As a relatively new user of dbt, I unknowingly broke our internal dbt runs and created stale data. This is the story of how Bigeye detected that problem, and helped us minimize the impact to our internal analytics.

The dbt seed file change

We use dbt seeds to map our customer metric usage data to overarching metric themes. For example, we categorize our freshness and volume metrics under a pipeline reliability theme, while formatting metrics like null percentage or percent UUID fall under a data quality theme. If you are unfamiliar with dbt seeds, they are a good fit for static data that changes infrequently, such as this mapping of metric categories to overarching business themes.
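A minimal sketch of what such a seed might look like, purely for illustration (the column names and values here are hypothetical, not our actual metric_category.csv):

    metric_type,metric_theme
    FRESHNESS,pipeline_reliability
    VOLUME,pipeline_reliability
    PERCENT_NULL,data_quality
    PERCENT_UUID,data_quality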

In a seemingly simple update, I added a new column to our metric_category.csv seed file in dbt to support a different way of categorizing metric types that the customer success team preferred. It was at a slightly lower grain, which we wanted to use to improve our customer usage tracking. The update was very simple: I went into the seed file and added a new column with a new mapping for each of the 80+ column-level metric types that Bigeye offers.
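Continuing the hypothetical sketch above, the change amounted to appending one extra column of mappings to the CSV, with a value for every metric type (the new column name and values are again illustrative):

    metric_type,metric_theme,metric_category
    FRESHNESS,pipeline_reliability,operational
    VOLUME,pipeline_reliability,operational
    PERCENT_NULL,data_quality,formatting
    PERCENT_UUID,data_quality,formatting

The point is that the change touched only the seed file, not any of the models built on top of it.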

Discovering the data error

After adding in this new column, I pushed the commit and created a new pull request to merge this change into production. As a best practice, we run CI checks on every pull request, which dbt Cloud enables via Slim CI jobs. This CI check is intended to give users confidence that their new code is building as expected and not introducing errors when merged into the main branch. So when the CI job passed, I felt good and merged my small change into production. That night, however, all of our dbt runs failed due to the changes made in that PR.
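dbt Cloud sets up the Slim CI job for you, but conceptually it builds only the resources you changed, plus anything downstream of them, while deferring to production artifacts for everything else. A rough CLI equivalent, with an assumed artifacts path, looks like this:

    dbt build --select state:modified+ --defer --state ./prod-artifacts

Presumably the modified seed built cleanly in the CI environment, where no conflicting table already existed, which is why the check passed even though the change would later collide with the existing production table.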

Fortunately, since we use Bigeye to monitor all of our internal analytics tables for freshness and volume, I was quickly notified the next morning that our datasets were stale. The image below shows the exact metric we configured in Bigeye to report the stale table. In this example, we created a Bigeye metric that checks the hours since the latest value in the _DBT_RUN_AT column every 60 minutes using our autothresholds. Since the autothresholds had learned the patterns in our update frequency, when the observed value climbed above the typical hours since the latest value, we were alerted via Slack.
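Under the hood, an "hours since the latest value" freshness check boils down to a query shaped roughly like the one below (a simplified sketch of the idea, not Bigeye's actual implementation; the table name is assumed):

    -- How many hours have passed since the most recent dbt run recorded in the table?
    SELECT DATEDIFF('hour', MAX(_DBT_RUN_AT), CURRENT_TIMESTAMP) AS hours_since_latest_value
    FROM analytics.customer_metric_usage;

Autothresholds then compare each observed value against the range the metric has historically stayed within, so a missed nightly run shows up as soon as the value drifts past its usual ceiling.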

The broken data and the Bigeye fix

After being notified by Bigeye that morning, I quickly acknowledged the issue and went into investigation mode to determine the root cause and total impact. It didn’t take long to realize that our dbt runs were failing because the dbt seed was trying to truncate the existing table and re-insert data that included a column the table did not have. The solution was to temporarily update our dbt job to use dbt seed --full-refresh. The --full-refresh flag forces dbt to drop (cascade) the existing table before rebuilding it, which fixed the problem.
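The temporary job step was essentially the following (the --select argument is illustrative; --full-refresh can also be applied to the whole seed step):

    dbt seed --select metric_category --full-refresh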

After the updated dbt job ran successfully, I refreshed the metric in Bigeye; it returned to a healthy state and the issue on the table was auto-healed. I could also have waited for the next scheduled 60-minute run, letting the metric execute and the healing happen without any action on my part, but I was eager to close the issue and move on with my day.

Final thoughts

Sometimes the most seemingly minor changes can create significant issues in our data. Even when we follow best practices like CI jobs to keep these issues out of our production environments, some still find their way in. A data observability solution like Bigeye can detect issues when all else fails and provide additional confidence in the reliability of our data.
