Thought leadership
September 9, 2022

Data anomaly detection requires hindsight and foresight

Henry Li

We are experiencing a profound shift in the way data is stored and used. New data infrastructure and observability applications such as Snowflake, Databricks, Datadog, and Bigeye are transforming the way engineers and data scientists interact with data pipelines and their contents.

One of these shifts is happening specifically in anomaly detection, or the practice of identifying when data behaves in a way that falls outside the norm. In this post, we’ll walk through the two major approaches to data anomaly detection, and what you should expect from an anomaly detection tool.

Ex ante and post hoc anomaly detection

When we think about anomaly detection in data, we think about finding something interesting that does not fit the mold of certain models. We seek extreme changes, subtle yet significant differences, or something different altogether. We may compare two or more data objects, or we may look at the before and after states of the same objects as they are processed over time. Anomaly detection is a broad idea that encompasses all of the above.

While anomaly detection comes in many forms, there are two major approaches: ex ante and post hoc. Ex ante means before the event, which is a prediction of the future. Post hoc means after this, which is reasoning after all events have already occurred.

Ex ante anomaly detection ensures that you will always have a set of tolerances for what your data should look like going into the future. That way, you can anticipate problematic data behaviors as well as define what normal looks like.

Post hoc anomaly detection helps you understand potentially problematic data behaviors that you may have missed in the past. Data practitioners should be aware of these differences in the operational aspects of a data observability system.

In data quality root cause analysis, it is common for data users to fall into hindsight bias (a post hoc trap), believing that some data pipeline problems should have been captured or prevented earlier. The truth is that any company will run into data quality problems, and it is the companies that understand how to find meaningful anomalies that will be successful in a data-driven world.

Anomaly detection in many data applications is post hoc: the data is already provided, and as a data scientist or engineer, you need to determine whether there are any anomalous data points that would affect downstream analyses and model training. One example is an on-call engineer on the infrastructure team looking at service error metrics (5xx, 4xx, etc.) after a service outage has been declared. The engineer wants to find the root cause(s) of the service issues by examining when the time series of error counts exceeded certain predetermined thresholds or changed patterns in the recent past. Post hoc detection works well here because the data metric is relatively stable, meaning that the past and present are nearly identical as time progresses, a hallmark of great system engineering.
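
To make the post hoc case concrete, here is a minimal Python sketch of that workflow: the full history of error counts is already in hand, and we scan it retrospectively against a predetermined threshold. The counts and the threshold value are hypothetical, chosen only for illustration.

# Post hoc sketch: the complete history is available up front, so detection
# is simply a retrospective scan against a predetermined tolerance.
error_counts = [3, 5, 4, 6, 5, 48, 52, 7, 4, 5]  # 5xx errors per minute (hypothetical)
THRESHOLD = 20  # predetermined tolerance (hypothetical)
anomalies = [(i, c) for i, c in enumerate(error_counts) if c > THRESHOLD]
print(anomalies)  # [(5, 48), (6, 52)] -- the window to investigate for root cause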

However, when the operational task is to look for anomalies that will occur in the future, with datasets that may or may not behave stably, as is often the case with real-world datasets, then ex ante detection is very effective. As we have discussed in a previous blog post, one challenge for most data teams is making sure that bad data does not creep into the database over time, which can render downstream automation and manual analyses inaccurate; to scale data monitoring, it is important to automatically set anomaly tolerances for monitoring dynamic datasets.
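
As a rough sketch of what automatic tolerance setting can look like in its simplest ex ante form, the snippet below derives the tolerance for an incoming point only from the trailing history, before the point arrives. The window size, multiplier, and row counts are assumptions made for the example, not a description of Bigeye's model.

import statistics

def ex_ante_check(history, new_value, k=3.0, min_history=10):
    # Decide whether an incoming point is anomalous using only past data:
    # the tolerance band is fixed before the new value is ever seen.
    if len(history) < min_history:
        return False  # not enough context yet to set tolerances
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    return not (mean - k * stdev <= new_value <= mean + k * stdev)

# Hypothetical daily row counts; the detector only ever sees the past.
history = [1000, 1020, 980, 1010, 995, 1005, 990, 1015, 1000, 998]
print(ex_ante_check(history, 1012))  # False: within tolerance
print(ex_ante_check(history, 400))   # True: alert before downstream use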

One way to distinguish between ex ante and post hoc is the phrase “hindsight is 20/20”. When someone looks back on the full stock price history of, say, the major tech companies, they may discover interesting features that point to moments when wonderful financial decisions could have been made. These features may not play out in the near future, and could bias one’s future judgment if interpreted incorrectly. Of course, someone with a time machine might exploit these features to become an extraordinarily successful investor, but nobody has invented a time machine yet. On the other hand, if we have the perspective of operating in the present to prepare for the future, our goal is to use ex ante analysis to minimize the risk of loss while optimizing the chance of success. Here, we take real-world uncertainties into account. We build dynamic detection models as well as rules to achieve the best ex ante analysis and monitoring.

Two use cases of hindsight and foresight in data observability

Let’s look at two examples that distinguish between hindsight and foresight in data observability. Below are two data metrics progressing over time: one has a seasonal pattern, while the other has a step-change (level shift) pattern.

Figure 1. A segment of a seasonal data metric. When ex ante analysis takes place, the anomaly detector alerts on the ascent to the first peak and the drop to the first trough. As the metric progresses in time, the detector learns the variation and starts to model the underlying seasonality more closely. The post hoc detector, on the other hand, sees all of the data and models the seasonality pattern well, with a tighter fit.

Figure 2. A segment of a step-up data metric. When ex ante analysis takes place, the anomaly detector alerts on one of the first points of the first step; we then see multiple alerts fire, with the detector eventually adapting to the underlying variation once it learns that a new normal has taken hold. As in Figure 1, we see a much tighter fit in the post hoc scenario, with alerting points at every increase and drop.

In the above two scenarios, we observe that the alerting locations sometimes overlap for ex ante and post hoc. Looking closely (you can cover up the figures and then slowly reveal the metric charts from left to right), we can see that the alerts can mean more than just “there was an extreme point.” Operationally, ex ante tells us that if we were to play time forward, normal behavior would stay within the thresholds of the detector. Over time, the detector has to decide what normal is, and whether there is a new normal (of course, the detector should allow input from users when available), in order to shift and/or widen the tolerance range.
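
To illustrate that adaptation, here is a toy Python sketch of a detector whose running mean and deviation update exponentially: when a level shift arrives it fires a few alerts, then stops once its band has shifted to the new normal. The smoothing factor, multiplier, and data are assumptions for the example, not how Bigeye's detector is actually built.

# Toy adaptive detector: the decision at time t uses only estimates formed
# before t (the ex ante constraint); the band then shifts toward the new level.
ALPHA, K = 0.2, 3.0
metric = [100] * 30 + [160] * 30  # hypothetical metric that steps up to a new normal
mean, dev = metric[0], 5.0        # seed the running estimates
for t, value in enumerate(metric[1:], start=1):
    if abs(value - mean) > K * dev:
        print(f"t={t}: alert ({value} vs {mean:.0f} +/- {K * dev:.0f})")
    mean = ALPHA * value + (1 - ALPHA) * mean                        # update after the check
    dev = max(ALPHA * abs(value - mean) + (1 - ALPHA) * dev, 1.0)    # floor keeps the band from collapsing on flat data
# Alerts fire only at t=30 and t=31, right where the step occurs, and then
# stop once the detector has accepted the new normal.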

In Figure 1, the user is alerted to the first peak and trough of the seasonal metric. While in hindsight (post hoc) the seasonal pattern is obvious, it would not have been clear while the data was being generated and monitored from the start. As a data scientist, I would want to know when the data is changing over time, and monitor the data pattern to confirm that the dynamic is normal. Another interesting distinction between the two approaches can be seen in Figure 2, in which the sense of normal behavior is propagated more prominently in the ex ante case. While it may seem more impressive that post hoc fits more tightly, the ex ante thresholds are much more informative through time because they give us a sense of what is normal based on what has been seen so far. Depending on the anomaly detection approach, the interpretation of alerts can carry different meanings.
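
The Figure 1 contrast can also be sketched in code: an ex ante band built only from the points seen so far, versus a post hoc band fit to the complete series grouped by hour of day. The metric, multiplier, and grouping rule below are hypothetical, and are only meant to show why the post hoc fit looks tighter while the ex ante band is the one available as events unfold.

import math
import statistics
from collections import defaultdict

# Hypothetical hourly metric with a 24-hour seasonal cycle plus mild noise.
metric = [100 + 30 * math.sin(2 * math.pi * t / 24) + (t % 5) for t in range(24 * 7)]
K = 3.0

def ex_ante_band(t):
    # Use only hours 0..t-1. If this hour of day has not yet been seen at
    # least twice, fall back to a wide global band; otherwise use the
    # per-hour history, so the band tightens as seasonality is learned.
    past = metric[:t]
    same_hour = [v for i, v in enumerate(past) if i % 24 == t % 24]
    sample = same_hour if len(same_hour) >= 2 else past
    mean, stdev = statistics.fmean(sample), statistics.stdev(sample)
    return mean - K * stdev, mean + K * stdev

# Post hoc: the whole history is available, so we can group by hour of day
# and fit a tight band around every point of the seasonal cycle.
by_hour = defaultdict(list)
for t, value in enumerate(metric):
    by_hour[t % 24].append(value)
post_hoc_bands = {
    hour: (statistics.fmean(vals) - K * statistics.stdev(vals),
           statistics.fmean(vals) + K * statistics.stdev(vals))
    for hour, vals in by_hour.items()
}

print(ex_ante_band(30))    # wide: falls back to a global band early on
print(ex_ante_band(150))   # tight: this hour of day has now been seen several times
print(post_hoc_bands[6])   # tight from the start: full per-hour history is available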

Striking the ex ante and post hoc balance

In practice, both foresight and hindsight are critical components of any good data observability and anomaly detection system.

Bigeye uses both frameworks, ex ante and post hoc, to generate Autothresholds, which reliably explain anomalies in user data. Post hoc identification highlights potentially anomalous and/or broken data points in past history. And while post hoc detection plays a role, ex ante detection is the true core of Autothresholds: it is the forward-looking perspective on what actually constitutes “anomalous” data.

In addition to post hoc detection for individual data fields, Bigeye also offers Deltas, which focuses on changes at the table level. Through Deltas, users can confirm similarities and differences between data tables, both over time and across warehouses.
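
For intuition only, and not a description of how Deltas is implemented, a table-level comparison can be as simple as summarizing each table with a few column statistics and diffing the summaries. The tables, column, and values below are hypothetical.

import sqlite3

# Hypothetical tables in an in-memory database; in practice these would be
# tables in one or more warehouses.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders_prod (amount REAL);
    CREATE TABLE orders_staging (amount REAL);
    INSERT INTO orders_prod VALUES (10.0), (12.5), (9.0), (11.0);
    INSERT INTO orders_staging VALUES (10.0), (12.5), (9.0);
""")

def summarize(table, column):
    # Collapse a table into a handful of comparable statistics.
    count, avg, lo, hi = conn.execute(
        f"SELECT COUNT(*), AVG({column}), MIN({column}), MAX({column}) FROM {table}"
    ).fetchone()
    return {"row_count": count, "avg": avg, "min": lo, "max": hi}

prod = summarize("orders_prod", "amount")
staging = summarize("orders_staging", "amount")
drift = {stat: (prod[stat], staging[stat]) for stat in prod if prod[stat] != staging[stat]}
print(drift)  # {'row_count': (4, 3), 'avg': (10.625, 10.5)}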

In the data world, observability tools are proliferating. Distinguishing between ex ante and post hoc anomaly detection can help inform your understanding of which anomalous behaviors truly fall outside of the norm. As you browse your observability options, make sure to choose the platform that encompasses the full spectrum of anomaly detection.

