Product
March 27, 2023

The difference between data observability and legacy data quality platforms

The data quality products of yore originated in the on-prem era. Modern data observability is a whole different ballgame. From complexity to scale to the cloud, here's the difference between data quality systems of today and those of yesterday.

Liz Elfman

Unlike legacy data quality products, which originated in the on-premise era, modern data observability platforms like Bigeye were built from the ground up to handle the complexity and scale of cloud data environments.

In this blog post, we’ll explore how data observability platforms outperform legacy data quality products, and why organizations should make the switch.

What is Informatica?

Informatica, founded in 1993, is an all-in-one data integration platform that provides an interface for building ETL pipelines. The platform helps organizations integrate data from various sources into a data warehouse or data lake. Informatica's no-code interface lets non-technical users build ETL pipelines without writing any code. In ethos and paradigm, it's comparable to something like UIPath.

Informatica was founded in the pre-cloud era, and many large companies still have on-premise Informatica deployments. Over the years, Informatica has also expanded into adjacent products in data quality and governance.

How do Informatica’s data quality products work?

Legacy data quality products like Informatica use rule-based approaches to ensure data quality. That means users define and implement rules for validating, cleansing, and transforming data from different sources.
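To make the contrast concrete, here is a minimal sketch of what a rule-based approach looks like in practice. The table names, columns, and check types below are hypothetical, and this is not Informatica's actual rule syntax; the point is simply that every check is written and maintained by hand.

```python
# A minimal sketch of rule-based data quality checks.
# Tables, columns, and thresholds are hypothetical; real rule-based tools
# express this through their own DSLs or UIs, but the shape is the same:
# each rule is hand-written and hand-maintained.

RULES = [
    {"table": "orders", "column": "order_id",   "check": "not_null"},
    {"table": "orders", "column": "amount",     "check": "range", "min": 0, "max": 100_000},
    {"table": "orders", "column": "created_at", "check": "not_null"},
    # ...hundreds more rules, one per column per table, updated by hand
    # whenever schemas or business requirements change.
]

def run_rule(conn, rule):
    """Translate one rule into a SQL check returning the number of violations.

    Assumes a connection whose execute() returns a cursor (e.g. sqlite3, duckdb).
    """
    if rule["check"] == "not_null":
        sql = f"SELECT COUNT(*) FROM {rule['table']} WHERE {rule['column']} IS NULL"
    elif rule["check"] == "range":
        sql = (
            f"SELECT COUNT(*) FROM {rule['table']} "
            f"WHERE {rule['column']} NOT BETWEEN {rule['min']} AND {rule['max']}"
        )
    else:
        raise ValueError(f"Unknown check type: {rule['check']}")
    return conn.execute(sql).fetchone()[0]
```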

But rule-based approaches have limitations:

  • They still require writing and maintaining hundreds or thousands of custom checks for each data source or pipeline
  • They do not adapt well to changes in data schemas, formats, or business requirements
  • They do not provide comprehensive visibility into data health and performance across the entire data ecosystem
  • As a result of everything described above, they require heavy professional services resources to set up and maintain

By contrast, modern data observability products like Bigeye take a much more automated and proactive approach. They enable users to monitor and improve data quality at scale using intelligent methods. In particular, Bigeye offers features like:

Autometrics

One of the key features of Bigeye’s data observability product is its Autometrics capability. Autometrics are smart recommendations for monitoring coverage based on an analysis of the user’s data, including column type, semantic details, and formatting. When users connect their database to Bigeye, it automatically indexes their source to generate the catalog and suggest basic autometrics. Customers can then go through and enable the autometrics they want.
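Bigeye doesn't publish the internals of Autometrics, but the underlying idea, profiling each column and mapping what's found to candidate metrics, can be sketched in a few lines. The column metadata and metric names below are hypothetical and purely illustrative, not Bigeye's actual catalog or API.

```python
# Illustrative only: a toy version of "profile a column, suggest metrics".
# Column metadata and metric names here are hypothetical, not Bigeye's.

import re

def suggest_metrics(column_name: str, column_type: str) -> list[str]:
    """Recommend monitoring metrics based on column type and naming hints."""
    suggestions = ["null_rate"]  # almost always worth tracking

    # Type-based suggestions
    if column_type in ("INT", "FLOAT", "NUMERIC"):
        suggestions += ["min", "max", "average"]
    if column_type in ("VARCHAR", "TEXT"):
        suggestions += ["empty_string_rate"]

    # Semantic hints from the column name
    if re.search(r"email", column_name, re.IGNORECASE):
        suggestions.append("valid_email_format_rate")
    if re.search(r"(_at|_date|timestamp)$", column_name, re.IGNORECASE):
        suggestions.append("hours_since_max_timestamp")
    if re.search(r"(^id$|_id$)", column_name, re.IGNORECASE):
        suggestions.append("duplicate_rate")

    return suggestions

# Example: profiling a hypothetical "users.email" column
print(suggest_metrics("email", "VARCHAR"))
# ['null_rate', 'empty_string_rate', 'valid_email_format_rate']
```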

Autothresholds

When you deploy a standard autometric, Bigeye builds in "Autothresholds" computed from historical data through machine learning models. These autothresholds are dynamic and configurable. You can set them to "narrow," "normal," or "wide." You can also give feedback to the machine learning models on the "Issues" page, if you find that the autothresholds are inaccurate.
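Bigeye's models aren't public, but the core idea of a dynamic threshold can be sketched simply: derive an expected band from the metric's recent history and widen or narrow it according to the user's setting. The width multipliers below are illustrative assumptions, not Bigeye's actual parameters.

```python
# A minimal sketch of dynamic thresholds derived from a metric's history.
# The width multipliers are illustrative assumptions, not Bigeye's settings.

from statistics import mean, stdev

WIDTHS = {"narrow": 2.0, "normal": 3.0, "wide": 4.0}  # std-devs around the mean

def autothreshold(history: list[float], width: str = "normal") -> tuple[float, float]:
    """Compute a (lower, upper) alert band from a metric's recent history."""
    mu, sigma = mean(history), stdev(history)
    k = WIDTHS[width]
    return (mu - k * sigma, mu + k * sigma)

# Example: daily row counts for the past week (hypothetical data)
row_counts = [10_200, 9_950, 10_400, 10_100, 9_800, 10_350, 10_050]
lower, upper = autothreshold(row_counts, width="normal")
print(f"Alert if rows inserted falls outside {lower:,.0f} to {upper:,.0f}")
```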

Anomaly detection

Anomaly detection is the process of identifying deviations from expected values or trends. Bigeye has sophisticated anomaly detection algorithms that understand trends and seasonality. These algorithms even recognize hard-to-detect anomalies like "slow degradation." Bigeye can identify and adjust to pattern changes in data, removing the need for manual intervention.

Bigeye’s anomaly detection improves over time through reinforcement learning and anomaly exclusion. The system learns from user feedback to fine-tune its detection and alerting. Further, Bigeye includes root-cause and impact analysis to help engineers resolve detected anomalies.
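To illustrate why "slow degradation" is hard to catch with static rules, the toy example below shows a metric that never crosses a fixed threshold even though its trend over a rolling window is clearly downward. The data and the static floor are hypothetical, and this is not Bigeye's detection algorithm.

```python
# Toy illustration of "slow degradation": every daily value stays above a
# static threshold, but the trend over a rolling window is clearly downward.
# Not Bigeye's algorithm; just a sketch of why trend-awareness matters.

def slope(values: list[float]) -> float:
    """Least-squares slope of the values over their index (simple linear trend)."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    den = sum((x - x_mean) ** 2 for x in range(n))
    return num / den

# Hypothetical daily "% of rows passing validation", drifting down slowly.
pass_rate = [99.1, 98.8, 98.5, 98.0, 97.7, 97.3, 96.9, 96.5, 96.1, 95.8]

STATIC_FLOOR = 95.0                 # a typical hand-written rule: alert below 95%
weekly_trend = slope(pass_rate[-7:])  # trend over the last seven days

print(all(v > STATIC_FLOOR for v in pass_rate))  # True: the static rule never fires
print(round(weekly_trend, 2))                    # -0.38: the pass rate drops every day
```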

Metadata metrics

In Bigeye, you can access metadata metrics immediately upon connecting your data warehouse. Examples include "data freshness" and "volume," which indicate the general success or failure of a data pipeline. These metrics primarily focus on whether a table has been updated and/or accessed. The table below presents some examples:

| Metadata metric name | API name | Description |
| --- | --- | --- |
| Hours since last load | HOURS_SINCE_LAST_LOAD | The time elapsed (in hours) since a table was last modified with an INSERT, COPY, or MERGE operation. Recommended as an automatic metric for each table. |
| Rows inserted | ROWS_INSERTED | The total number of rows added to the table through INSERT, COPY, or MERGE statements in the last 24 hours. Recommended as an automatic metric for each table. |
| Read queries | COUNT_READ_QUERIES | The total number of SELECT queries executed on a table in the last 24 hours. Recommended as an automatic metric for each table. |
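As a rough illustration of what sits behind a metric like HOURS_SINCE_LAST_LOAD, the sketch below reads last-modified timestamps from a warehouse's table metadata. It assumes a Snowflake-style INFORMATION_SCHEMA and a generic DB-API connection; it is not how Bigeye collects the metric internally.

```python
# Rough illustration of a "hours since last load" style metadata metric.
# Assumes a Snowflake-style INFORMATION_SCHEMA and a DB-API connection;
# this is not how Bigeye computes HOURS_SINCE_LAST_LOAD internally.

from datetime import datetime, timezone

QUERY = """
    SELECT table_name, last_altered
    FROM information_schema.tables
    WHERE table_schema = %(schema)s
"""

def hours_since_last_load(conn, schema: str) -> dict[str, float]:
    """Return hours elapsed since each table in the schema was last modified."""
    now = datetime.now(timezone.utc)
    cursor = conn.cursor()
    cursor.execute(QUERY, {"schema": schema})
    return {
        table_name: (now - last_altered).total_seconds() / 3600
        for table_name, last_altered in cursor.fetchall()
    }

# Example usage (connection setup omitted):
# staleness = hours_since_last_load(conn, "ANALYTICS")
# Alert on any table whose staleness exceeds its expected load cadence.
```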

Metadata metrics are a crucial component of Bigeye's T-shaped monitoring approach, which advises tracking basic metrics across all data while implementing more in-depth monitoring for the most critical datasets, such as those used in financial planning, machine learning models, and high-level executive dashboards.

Conclusion

Rule-based data quality solutions like Informatica (or even more modern tools like dbt tests) require users to manually define each metric and its associated thresholds. These rules can be complex and time-consuming to maintain and update as the data changes over time. Additionally, because these products lack sophisticated anomaly detection algorithms, the anomalies they do flag may be false positives or not actionable. By comparison, Bigeye provides instant coverage for the entire data warehouse from the moment customers connect, with the option to go deeper with little additional effort.

If you want to learn more about how Bigeye can help you achieve better data quality at scale, request a demo.
