5 misconceptions about data observability
Think data observability isn't for you? Maybe you're right. But make sure you aren't operating under some of these common misconceptions.
Organizations that would benefit from investing in data observability or data reliability often forgo the investment based on a few common misconceptions. In this blog post, we discuss five of them and provide clarity on each.
1. Data observability is solely about monitoring
A common misconception about data observability is that it only involves monitoring data pipelines and systems. In reality, it also includes metrics, anomaly detection, alerting, and lineage.
Metrics: Metrics are statistics calculated on data that are then monitored by data observability solutions. Some examples include:
- %null - percentage of rows in a column that are null
- average - mean of the value of the rows in a column
- volume - number of rows written in a table in the last 24 hours
Metrics are the core building block of every data observability solution. When a metric goes above or below certain thresholds considered “normal”, the data observability solution will classify it as a data quality issue.
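The three metrics above can be sketched as simple aggregate functions. This is a minimal illustration over an in-memory list of rows; the table name, column values, and timestamps are all hypothetical, and a real data observability tool would compute these via SQL against the warehouse.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical rows from an "orders" table: (amount, created_at).
# None represents a NULL amount.
now = datetime(2024, 1, 2, tzinfo=timezone.utc)
rows = [
    (100.0, now - timedelta(hours=1)),
    (None,  now - timedelta(hours=3)),
    (50.0,  now - timedelta(hours=30)),  # written outside the 24h window
    (150.0, now - timedelta(hours=5)),
]

def pct_null(values):
    """%null: percentage of rows in a column that are null."""
    return 100.0 * sum(v is None for v in values) / len(values)

def average(values):
    """average: mean of the non-null values in a column."""
    non_null = [v for v in values if v is not None]
    return sum(non_null) / len(non_null)

def volume(timestamps, now, window=timedelta(hours=24)):
    """volume: number of rows written in the trailing window."""
    return sum(ts >= now - window for ts in timestamps)

amounts = [r[0] for r in rows]
print(pct_null(amounts))                  # 25.0
print(average(amounts))                   # 100.0
print(volume([r[1] for r in rows], now))  # 3
```

A data observability solution would track each of these values over time and compare new values against thresholds.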
Anomaly Detection: Anomaly detection refers to using historical data to understand dynamic patterns in data. For example, an anomaly detection algorithm might understand that a metric has a certain weekly seasonality, with peaks and troughs. A “normal” data point for Saturday night might be an abnormal one for Monday morning.
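The weekly-seasonality idea can be sketched by grouping history by weekday and scoring new points against same-weekday statistics. This is a simplified illustration with made-up row counts, not any particular vendor's algorithm; real anomaly detection models are considerably more sophisticated.

```python
import statistics

# Hypothetical history of daily row counts, keyed by weekday (0=Mon .. 6=Sun).
# Weekend volume is naturally lower, so a flat global threshold would misfire;
# grouping history by weekday captures the weekly seasonality.
history = {
    0: [1000, 1050, 980, 1020],  # Mondays
    5: [200, 210, 190, 205],     # Saturdays
}

def is_anomalous(value, weekday, history, z_threshold=3.0):
    """Flag a point whose z-score vs. same-weekday history exceeds the threshold."""
    samples = history[weekday]
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)
    return abs(value - mean) / stdev > z_threshold

# 200 rows is a normal Saturday...
print(is_anomalous(200, 5, history))  # False
# ...but the same 200 rows on a Monday indicates a likely data issue.
print(is_anomalous(200, 0, history))  # True
```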
Alerting: It’s not sufficient to just monitor your data and detect problems. You also need to be notified when such a data issue arises. Data observability systems usually let you configure notifications either through email or Slack. To avoid being inundated with alerts, you need to make sure that your data observability solution’s anomaly detection is accurate, and more importantly, configurable.
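One common configurability pattern is routing alerts by severity and suppressing repeats of the same alert. The channel names and function below are hypothetical, shown only to illustrate why configurable alerting keeps teams from being inundated.

```python
# Track which (metric, severity) pairs have already been alerted on,
# so repeated detections of the same issue don't flood the team.
sent = set()

def route_alert(metric, severity):
    """Return the destination for an alert, or None if it's a duplicate."""
    key = (metric, severity)
    if key in sent:
        return None  # already notified; suppress the repeat
    sent.add(key)
    # High-severity issues page a Slack channel; the rest go to email.
    return "#data-incidents" if severity == "high" else "email:data-team@example.com"

print(route_alert("orders.pct_null", "high"))  # #data-incidents
print(route_alert("orders.pct_null", "high"))  # None (suppressed duplicate)
```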
Lineage: In data observability, lineage refers to the ability to trace data from its source to its destination through the data pipeline, identifying any changes or transformations that occur along the way. Lineage helps users to identify the cause of data issues and troubleshoot problems more effectively.
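Lineage can be modeled as a graph of tables and their upstream sources; when a downstream metric misbehaves, walking that graph narrows the search for the root cause. The table names below are hypothetical.

```python
# Hypothetical lineage graph: each table maps to its immediate upstream sources.
lineage = {
    "dashboard_revenue": ["agg_orders"],
    "agg_orders": ["stg_orders"],
    "stg_orders": ["raw_orders"],
    "raw_orders": [],
}

def upstream_of(table, lineage):
    """Walk the lineage graph to list every upstream dependency of a table."""
    seen, stack = [], list(lineage.get(table, []))
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.append(t)
            stack.extend(lineage.get(t, []))
    return seen

# An anomaly on dashboard_revenue narrows troubleshooting to three tables:
print(upstream_of("dashboard_revenue", lineage))
# ['agg_orders', 'stg_orders', 'raw_orders']
```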
2. Data observability is a one-time implementation
Data observability isn’t just a one-time implementation – it’s a continuous process that requires maintenance, iteration, and learning.
For example, when onboarding to a new data observability solution, the data team will generally choose a set of data quality metrics to enable. However, this is not a set-it-once-and-done situation. It’s equally important to continue tracking and analyzing these metrics over time. As data systems change, some metrics may no longer be needed, while others become more important.
Another example is false positive alert feedback. When a data observability solution generates an alert, it’s crucial to investigate and determine whether it’s a true positive or a false positive (i.e. whether it indicates a real data issue). In the case of false positives, it’s important to provide feedback to the data observability solution to help it learn and avoid false positives in the future.
Fortunately, if you choose the right data observability solution, it should make it easy to provide this sort of maintenance: Bigeye, for example, automatically profiles any new tables in connected data sources, generating auto-metrics. In the case of false positives, Bigeye gives users the option to tell the algorithm to ignore that piece of data for anomaly detection training in the future.
3. Data observability is only for data scientists
While data scientists are some of the most prominent consumers of data in an organization, data observability isn’t just important for them. In reality, data observability is important for anyone who works with data or consumes data, including analysts, business users, and decision-makers.
4. Data observability is too expensive and uses up warehouse compute
Cost is a major concern for organizations, whether they build or buy data observability tools. As organizations move to ELT setups where data transformations are performed in-warehouse, every query run incurs a cost. And while ELT warehouse costs are considered table stakes, many organizations balk at paying for observability, considering it a “nice-to-have.” This is short-sighted thinking.
The cost of implementing data observability is often far outweighed by the cost savings resulting from early detection and issue remediation. If you monitor your raw tables as they land in the data warehouse, you can pinpoint stale data quickly and ultimately avoid expensive backfills and pipeline re-runs.
Furthermore, data observability tools like Bigeye are optimized to run as few queries as possible:
- Bigeye profiles each table only once, up front, to compute auto-metrics
- Freshness and volume metadata metrics are pulled from the warehouse logs, which do not incur costs
- Deeper columnar checks run only when the customer enables them
- Bigeye stores metric results in its own database as they come back – unless you open the preview page, Bigeye is not live-querying your data
- Bigeye batches metric queries together; for instance, rather than running three separate queries for a max, a min, and an average, it runs a single query that computes all three, keeping costs down.
- Bigeye is aware of warehouse particularities and works around them. For example, since Snowflake charges customers based on how long the warehouse is running, Bigeye tries to run all the Snowflake queries at the same time, rather than, say, every ten minutes, which would keep the customer’s instance up all night, even if it’s technically “spread out.”
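The query-batching idea can be sketched as folding several aggregates into one SELECT so the warehouse scans the table once instead of three times. The helper function and table/column names below are hypothetical, not Bigeye's actual implementation.

```python
# Naive approach: one warehouse query per metric (three separate table scans).
naive = [
    "SELECT MAX(amount) FROM orders",
    "SELECT MIN(amount) FROM orders",
    "SELECT AVG(amount) FROM orders",
]

def batch_metric_query(table, column, metrics):
    """Fold several aggregate metrics into a single SELECT over one table."""
    exprs = ", ".join(f"{m}({column}) AS {m.lower()}_{column}" for m in metrics)
    return f"SELECT {exprs} FROM {table}"

# Batched approach: one query, one scan, all three metrics.
print(batch_metric_query("orders", "amount", ["MAX", "MIN", "AVG"]))
# SELECT MAX(amount) AS max_amount, MIN(amount) AS min_amount, AVG(amount) AS avg_amount FROM orders
```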
5. Systems monitoring, observability, and data monitoring tools are all the same
"Data reliability," "site reliability," "data monitoring," and "systems monitoring" all sound similar. Executives may be tempted to think you can use a blanket tool for all processes. As an organization, if you’ve already invested in a solution like Datadog or ServiceNow for systems monitoring, why do you need something else?
In every case, the goal is the same: to fully observe the state of a system through instrumentation and measurement. But the granularity differs. For systems observability, you might want to know how many requests a service received and what percentage of those were errors. For data observability, that level of information might not be sufficient: a data pipeline might complete successfully without errors, yet the data can still have issues that need to be surfaced and alerted on. While you can probably make a tool like Datadog work for the data observability use case, it will require significant engineering effort.
Data reliability and data observability are easier and cheaper to implement than usually assumed; and they're often more critical than assumed. So make sure to check your misconceptions before assuming you've already got it covered.