Product
April 13, 2023

Efficiency with issue monitoring states

In this post, we'll take a general look at the lifecycle of an issue, from triage to close, using Bigeye.

Jon Hsieh

As a data engineer or data reliability engineer, you need to decide quickly how to handle the data quality issues that Bigeye detects. Those decisions need to be efficient, and they should let you act with confidence, knowing that data quality improves with each issue handled. You also need a mechanism that focuses attention on the urgent problems, avoids duplicate effort across teammates, and makes it easy to report progress to your stakeholders. Changing the status of issues is the primary way to do all of this in Bigeye.

Initially, Bigeye emulated how observability tools handle anomalies and issues in metrics: a triage state for new issues or issues that need to be re-evaluated, an acknowledged state for issues that are being investigated, and a closed state for problems that are resolved. Here's the general lifecycle of an issue.

In practice, we've found that by looking at a metric's chart, you can often tell whether an issue is due to the data or due to the metric's thresholds. The problem is that pipeline fixes may take hours or even days to deploy and to push corrected data through. That leaves a long gap between the moment you reach a conclusion and the moment there is proof that the problem is fixed. Data engineers need a way to separate the issues they have already reached conclusions on from newly arriving issues and issues that require further investigation.

This is where Bigeye's issue monitoring states come in. With them, data engineers can specify whether an issue is a data problem or a threshold problem, and set expectations for how the anomalous metric should behave going forward. This eliminates toil because Bigeye takes on the responsibility of monitoring the metric and sends a notification when the expectations are met. Here's the updated lifecycle of an issue.
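To make that lifecycle concrete, here is a minimal sketch of the four-state machine in Python. The state names come from this post, but the transition rules and every identifier are illustrative assumptions, not Bigeye's API or implementation:

```python
from enum import Enum

class IssueState(Enum):
    TRIAGE = "triage"              # new, or needs re-evaluation
    ACKNOWLEDGED = "acknowledged"  # under active investigation
    MONITORING = "monitoring"      # conclusion made; waiting for a healthy run
    CLOSED = "closed"              # resolved

# Assumed legal transitions in the updated lifecycle. A healthy metric run
# auto-closes a MONITORING issue; a recurrence can reopen a CLOSED one.
TRANSITIONS = {
    IssueState.TRIAGE:       {IssueState.ACKNOWLEDGED, IssueState.MONITORING, IssueState.CLOSED},
    IssueState.ACKNOWLEDGED: {IssueState.TRIAGE, IssueState.MONITORING, IssueState.CLOSED},
    IssueState.MONITORING:   {IssueState.CLOSED, IssueState.TRIAGE},
    IssueState.CLOSED:       {IssueState.TRIAGE},
}

def transition(current: IssueState, target: IssueState) -> IssueState:
    """Move an issue to a new state, rejecting moves the lifecycle doesn't allow."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"cannot move {current.value} -> {target.value}")
    return target
```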

Each state now has a single purpose, which makes it easier to prioritize issues based on whether or not they need human intervention. Each state's purpose is summarized below:

  - Triage: new issues, or issues that need to be re-evaluated. Requires human attention.
  - Acknowledged: issues under active investigation that haven't reached a conclusion.
  - Monitoring: issues where a conclusion has been made; Bigeye watches the metric and auto-closes the issue when it returns to health. No human intervention needed.
  - Closed: issues that are resolved.

The states make triaging issue lists more efficient. When you visit an issue list, prioritize the issues in the triage state first. If there are none, look into the acknowledged issues, which are being investigated but haven't reached a conclusion. If your issue list only has items in monitoring, you're all set: those issues are just waiting for a healthy run. Used this way, the states help you track progress and show your stakeholders that progress is being made.
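As a sketch of that triage pass, the snippet below sorts a backlog so that issues needing human attention surface first. The Issue record, the priority values, and the example issue names are all hypothetical:

```python
from dataclasses import dataclass

# Hypothetical issue record; the fields and example names are illustrative,
# not Bigeye's schema.
@dataclass
class Issue:
    name: str
    state: str  # "triage" | "acknowledged" | "monitoring" | "closed"

# Issues that need human attention sort first; monitoring issues are
# just waiting on a healthy metric run.
PRIORITY = {"triage": 0, "acknowledged": 1, "monitoring": 2, "closed": 3}

def triage_order(issues: list[Issue]) -> list[Issue]:
    return sorted(issues, key=lambda issue: PRIORITY[issue.state])

backlog = [
    Issue("rows_inserted drop on orders", "monitoring"),
    Issue("null spike in user_id", "triage"),
    Issue("freshness delay on events", "acknowledged"),
]
for issue in triage_order(backlog):
    print(issue.state.ljust(12), issue.name)
```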

An issue's state can be set from Slack, from issue list pages, and from issue details pages. The timeline on the issue details page shows the state transitions and comments, summarizing the lifecycle of the issue and providing context when you revisit it days later.
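Conceptually, that timeline is an append-only log of state changes with the reasoning attached. A minimal sketch, with every name invented for illustration:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical timeline entry: one state change plus the reasoning behind it.
@dataclass
class TimelineEvent:
    at: datetime
    state: str
    comment: str

@dataclass
class IssueTimeline:
    events: list[TimelineEvent] = field(default_factory=list)

    def record(self, state: str, comment: str) -> None:
        self.events.append(TimelineEvent(datetime.now(timezone.utc), state, comment))

    def summary(self) -> str:
        return "\n".join(f"{e.at:%Y-%m-%d %H:%M} {e.state}: {e.comment}"
                         for e in self.events)

timeline = IssueTimeline()
timeline.record("triage", "Row count dropped 40% overnight")
timeline.record("monitoring", "Upstream load fix deployed; expecting recovery")
print(timeline.summary())
```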

So, what does this mean in practice?  Here are three scenarios where monitoring states can help you monitor your most critical datasets:

  1. When a fix is coming
  2. When the data is changing
  3. When downstream tables are impacted

Scenario 1: When a fix is coming

You've been alerted about an anomaly and have concluded that it's a bug caused by a change in a data pipeline or a delayed upstream data delivery. You know why the problem exists and may have already deployed the fix. In either case, you expect the metric to return to within the existing thresholds once the pipeline updates and executions complete. You want Bigeye to maintain the current thresholds, but not alert again if the same anomalous values recur before the metric has first returned to health.

In this situation, you would transition the issue to the new monitoring state and leave a comment with the reasoning. If the metric uses autothresholds, you'd also give Bigeye feedback to maintain the thresholds. The issue will then auto-close and send a notification when the metric returns to health, without any further intervention.
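Sketched in code, this behavior looks like a small check that runs after each metric evaluation: stay quiet while the value is still out of bounds, then auto-close and notify on the first healthy run. The function, thresholds, and sample values below are assumptions for illustration, not Bigeye's implementation:

```python
from typing import Optional

# Minimal sketch of the Scenario 1 behavior: the issue sits in "monitoring"
# with its thresholds maintained, stays quiet while values remain anomalous,
# and auto-closes with a notification on the first healthy run.
def check_monitored_metric(state: str, value: float, lower: float,
                           upper: float) -> tuple[str, Optional[str]]:
    """Return (new_state, notification); never re-alerts while out of bounds."""
    if state != "monitoring":
        return state, None
    if lower <= value <= upper:
        return "closed", "metric returned to health; issue auto-closed"
    return "monitoring", None  # still anomalous: keep waiting, don't re-notify

# The pipeline fix lands between the second and third run; only the
# recovery produces a notification. Assumed thresholds: 900-1100 rows.
state = "monitoring"
for value in (412.0, 405.0, 980.0):
    state, note = check_monitored_metric(state, value, 900.0, 1100.0)
    print(f"{value:>7} -> {state} {note or ''}")
```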

Changing the state to monitoring lowers the issue's urgency and removes the need to revisit it in subsequent triage passes while you wait for the data to return to normal.

Scenario 2: When the data is changing

You've been alerted about an anomaly but know that it reflects a legitimate change. Maybe a new data set has been added, or maybe the thresholds need to be more permissive to account for the new normal.

If you are using manual thresholds, you can modify your metric's threshold configuration, put the issue into the monitoring state, and leave a comment with the reasoning while you wait for the metric to return to health.

If you are using autothresholds, you want Bigeye to adapt the current thresholds to accept the new values going forward without triggering a notification. You would transition the issue to the new monitoring state, give Bigeye feedback to adapt the thresholds, and leave a comment with the reasoning. The autothresholds will update, and Bigeye will auto-close the issue when the metric returns to health.
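One way to picture "adapt thresholds" feedback is to fold the newly accepted values into the window the thresholds are fit on, so the new normal stops alerting. The sketch below uses a plain mean plus or minus k standard deviations as a stand-in; Bigeye's actual autothreshold models are more sophisticated, and every name and number here is invented:

```python
import statistics

# Simplified sketch of "adapt thresholds" feedback: fold the newly accepted
# values into the fitting window so the new regime falls inside the bounds.
def adapted_bounds(history: list[float], accepted_values: list[float],
                   k: float = 3.0) -> tuple[float, float]:
    window = history + accepted_values  # accept the new regime
    mean = statistics.fmean(window)
    sd = statistics.stdev(window)
    return mean - k * sd, mean + k * sd

history = [1000.0, 980.0, 1020.0, 995.0]
accepted_values = [1500.0, 1480.0]  # e.g., a new data set added upstream
low, high = adapted_bounds(history, accepted_values)
print(f"adapted thresholds: {low:.0f} .. {high:.0f}")
```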

As in the previous scenario, changing the state to monitoring lowers the issue's urgency and removes the need to revisit it in subsequent triage passes while the thresholds adapt.

Scenario 3: When downstream tables are impacted

Monitoring states can also dramatically improve the efficiency of issue management when combined with Bigeye's lineage capabilities. With lineage, you can set a monitoring state and give threshold feedback on an issue and on all the issues downstream of it with a single action. So if an issue is detected at an upstream table, you can apply the same monitoring state and feedback to that table's metrics and to the metrics of every downstream table.
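That single action can be pictured as a breadth-first walk over the lineage graph. The graph, the table names, and the cascade_state function below are hypothetical stand-ins for what Bigeye does internally:

```python
from collections import deque

# Hypothetical lineage graph: table -> direct downstream tables.
LINEAGE = {
    "raw.orders": ["staging.orders"],
    "staging.orders": ["mart.daily_revenue", "mart.order_facts"],
    "mart.daily_revenue": [],
    "mart.order_facts": [],
}

def cascade_state(root: str, state: str, feedback: str) -> dict[str, tuple[str, str]]:
    """Apply one state change and threshold feedback to a table and
    everything downstream of it, via a breadth-first walk."""
    applied, queue, seen = {}, deque([root]), {root}
    while queue:
        table = queue.popleft()
        applied[table] = (state, feedback)
        for downstream in LINEAGE.get(table, []):
            if downstream not in seen:
                seen.add(downstream)
                queue.append(downstream)
    return applied

for table, action in cascade_state("raw.orders", "monitoring", "maintain").items():
    print(table, "->", action)
```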

Because Bigeye monitors each of these issues and notifies you as they auto-close upon returning to health, your pipeline's status should improve as new metric runs cascade through. This eliminates the follow-up interactions that would otherwise be required to resolve each related downstream issue individually.

If the fix leaves unresolved problems in the pipeline, those issues remain open. That is a good thing: the open issue list has been whittled down, and a fresh investigation can focus on what remains.

Conclusion

Bigeye's monitoring state is a simple mechanism that lets you set the expected behavior in response to an issue and lets the system take care of the rest. The workflow for managing data issues becomes clearer because issues that require investigation are separated from those where a resolution decision has been made but the metric needs time to return to health. In scenarios involving multiple issues, the states reduce the number of repeated interactions with individual issues. And as pipelines get deeper, applying monitoring states in bulk down lineage relations helps issue management scale.

