July 6, 2023

Triaging data issues with Bigeye (Part 1)

Jon Hsieh

So you’ve found a data issue with Bigeye. Now what? In this two-part series, we’ll walk through several scenarios and share tips and tricks for handling the data issues that Bigeye uncovers.

In part one, we’ll start with the simplest and fastest ways to triage a single issue. Then we'll work our way up to handling dozens or even hundreds of data issues in minutes in part two.

Let’s start by setting up our scenario:

The team

Frank is a typical data engineer. He is on call for operational data pipeline reliability and is a Bigeye user. He’s part of a small team responsible for the overall health of the data and the data pipeline.

Becky is a data engineer who builds and maintains new data pipelines—focusing on the transformations and combinations of data from different sources. She cares about pipeline reliability, but is more focused on making sure the data is high quality and data values seem reasonable.

The environment

So what does their data stack look like? They primarily pull customer data and app data from an operational database into their data warehouse/data lake. They watch replication from integration tools (e.g. Stitch or Fivetran) and make sure that the dbt jobs that transform data into the company’s data model are running correctly.

The modeled data is then connected to BI tools to generate reports or embed analytics into their app, or fed back to source systems using a reverse ETL tool.

For data observability, they use Bigeye.

At this point, Becky and Frank have set up Bigeye monitoring on tables in their data warehouse/data lake. Production tables get deep metric coverage (grouped metrics, data quality metrics, distribution metrics) recommended by Bigeye’s Autometrics, while the rest of their tables are lightly covered with pipeline reliability Autometrics (freshness metrics, volume metrics, and schema change tracking). They use Bigeye collections to group important metrics together, and they have connected Bigeye to Slack and PagerDuty for alerts.

Scenario 1: Triage in Slack results in a closed issue

Frank is on call and gets pinged via Slack and PagerDuty that his pipeline has some problems.

Frank can see that the value is slightly outside of the threshold (2.22k vs. 2.21k). This is actually a good thing, since he expects the metric to grow. He could complete the triage of this issue by selecting the monitoring state with adapt feedback, which tells Bigeye that this is an expected value and that the issue should close once the thresholds have adapted and the metric has had a healthy run.

But maybe Frank decides to double-check by opening up the Slack thread. He sees that Bigeye has sent a message to the thread saying that the metric has already returned to health!

Instead of using a monitoring state, Frank can change the issue to the closed state. When Bigeye asks for feedback on the thresholds, he can choose to give it either adapt feedback (learn the changes) or maintain feedback (since the value may have returned to normal).
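To make the difference between the two feedback options concrete, here is a toy Python sketch. It is not how Bigeye computes or adapts thresholds; the `Thresholds` class, the `apply_feedback` function, and the lower bound of 2,000 are hypothetical, and only the 2.21k threshold and 2.22k value come from the scenario above. The sketch assumes that adapt feedback widens the accepted range to include the new value, while maintain feedback leaves the range alone.

```python
from dataclasses import dataclass


@dataclass
class Thresholds:
    """Hypothetical accepted range for a metric (not a Bigeye type)."""
    lower: float
    upper: float

    def contains(self, value: float) -> bool:
        return self.lower <= value <= self.upper


def apply_feedback(thresholds: Thresholds, value: float, feedback: str) -> Thresholds:
    """Toy model of threshold feedback.

    "adapt"    -> treat the value as a new normal and widen the range to include it.
    "maintain" -> treat the value as a one-off and keep the range unchanged.
    """
    if feedback == "adapt":
        return Thresholds(min(thresholds.lower, value), max(thresholds.upper, value))
    if feedback == "maintain":
        return thresholds
    raise ValueError(f"unknown feedback: {feedback!r}")


# Upper bound and observed value from the scenario; the lower bound is made up.
current = Thresholds(lower=2000.0, upper=2210.0)
observed = 2220.0

adapted = apply_feedback(current, observed, "adapt")
maintained = apply_feedback(current, observed, "maintain")
```

In this toy model, `current.contains(observed)` is false (which is why the issue was raised), `adapted.contains(observed)` is true, and `maintained` is identical to `current`.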

If Frank wanted to be even more confident in his decision, he could click into the app to look at the recent metric values. Looking at the chart, he can see there was a jump right at the edge of the threshold values. As mentioned before, this is a metric he hopes will go up. Since this is considered a new normal, he reaches the same conclusion: close the issue with adapt feedback.

After these actions, Frank has effectively resolved the issue and has no more follow-up to do. Victory!

Scenario 2: Triage in Slack results in an Ack’ed issue

Frank looks at another alert in Slack and it says delivery of data to a table is late. This notification is troubling because the daily leadership report needs to be delivered by a certain time each day and this warning puts that in jeopardy.

Unfortunately, when he looks at the Slack thread, there is no message about the metric being healthy, and he doesn’t have enough information to quickly resolve the issue. In this case, he decides to punt for now and follow up on the issue later. So, in Slack, he changes the issue to the ack’ed state. This signals to his colleagues that the issue has been seen and that he is working on it. The status update shows up both in the app and in Slack.

Maybe Frank has a hunch about the problem: perhaps the data is just slightly late, and a little more detail could help resolve it. Frank navigates from Slack directly to the issue in Bigeye to get more details. He can now see a chart and schema change details, which provide insight into why the issue was created.

By reviewing the “Hours since last load” chart, Frank confirms that the regular loading pattern has changed. The last schema change was months ago and can be ruled out. So, at the moment there is no obvious root cause. Frank has other new issues to triage so he decides to punt this one for now and come back later.
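The idea behind an “Hours since last load” metric can be pictured with a small Python sketch. This is not Bigeye's implementation; the function names, the 26-hour limit, and the timestamps below are all hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta, timezone


def hours_since_last_load(last_load: datetime, now: datetime) -> float:
    """Elapsed hours between the most recent load and the time of the check."""
    return (now - last_load).total_seconds() / 3600.0


def is_late(last_load: datetime, now: datetime, max_hours: float) -> bool:
    """True when the table has gone longer than max_hours without a load."""
    return hours_since_last_load(last_load, now) > max_hours


# Hypothetical example: a daily table whose last load finished 30 hours ago,
# checked against an assumed limit of 26 hours.
check_time = datetime(2023, 7, 6, 12, 0, tzinfo=timezone.utc)
last_load_time = check_time - timedelta(hours=30)

elapsed = hours_since_last_load(last_load_time, check_time)
late = is_late(last_load_time, check_time, max_hours=26.0)
```

A fixed limit like this is the simplest possible version. A change in the regular loading pattern, as Frank sees in the chart, would show up as `elapsed` climbing past the values seen on previous days.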

Frank leaves the issue in the ack’ed state, so he knows it’s been flagged for further investigation after the triage pass. Bigeye automatically updates the issue and the Slack notifications with the issue status to keep everyone in the loop.

We've covered two scenarios for triaging individual issues, but what about when you want to manage groups of issues in bulk? Stay tuned for part 2 of this series where we'll discuss scenarios for triaging issues at scale.

