January 13, 2022

4 Data Quality Categories to Watch in 2022

Arpit Choudhury

Arpit Choudhury is founder of astorik, a community to explore and learn about modern data tools.

“No data is better than bad data” is a phrase you’ve probably heard from a CEO who started their keynote with these words — only to then suggest that every forward-looking organization needs to collect every data point they can in order to become data-driven.

On a serious note, though, it’s unlikely that an organization has only bad data or only good data. A ridiculously high number of data sources is the norm, the likelihood of inaccuracy is high, and the reasons for anomalies are plenty. Data, the kind your team uses every day, might be the best advantage your company has over the competition. But one inaccurate report, as a result of a broken pipeline, is all it takes to lose your team’s trust in your data.

Enter data quality, or DQ. Good-quality data is accurate, consistent, complete, and fresh. Ensuring reliability and usability for every data set is not trivial, but purpose-built DQ tools can do the heavy lifting for data teams.

What does a data quality tool do?  

There are many factors that can affect the quality of data, and based on your needs and available resources, you can adopt one or more types of data quality tools. For instance, data transformation or modeling can be the first step towards ensuring data accuracy. However, an unexpected format or an incomplete job can make carefully modeled data look bad when an executive discovers an incorrect figure in their dashboard.

We’ve collated twelve DQ tools and placed each under one of four categories based on the primary need it fulfills.

The data quality tooling landscape

Data observability

Data observability burst onto the scene in 2021, introducing tools that let you understand the internal state and behavior of your data. With effective data observability in place, data teams can catch data quality issues before they adversely impact the business.

Pros

  • Detect data quality issues, including “unknown unknowns”, before the business is affected.
  • A high level of automation and flexibility saves data teams time while ensuring broad data quality coverage.
  • Speed up resolution times by providing teams with context on what went wrong.

Popular data observability tools include:

Bigeye automatically detects data quality anomalies and speeds up resolution before the data reaches end-users.

Monte Carlo monitors critical data assets and proactively alerts stakeholders to data quality issues.

Anomalo automatically detects data quality issues without writing code, configuring rules, or setting thresholds.
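To make the idea of automated anomaly detection concrete, here is a minimal sketch that flags a table whose latest daily row count strays too far from its recent history. It is an illustration only and does not reflect how any of the tools above work internally; the function name, the sample row counts, and the three-standard-deviation threshold are all assumptions.

```python
from statistics import mean, stdev


def is_anomalous(history, latest, threshold=3.0):
    """Return True if `latest` lies more than `threshold` standard
    deviations away from the mean of `history` (a simple z-score check)."""
    if len(history) < 2:
        return False  # not enough history to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # History is perfectly flat, so any change counts as an anomaly.
        return latest != mu
    return abs(latest - mu) / sigma > threshold


# Hypothetical daily row counts for one table over the past week.
daily_row_counts = [10_120, 9_980, 10_050, 10_210, 9_940, 10_075, 10_130]

print(is_anomalous(daily_row_counts, 10_090))  # a typical day
print(is_anomalous(daily_row_counts, 4_300))   # a sudden drop in volume
```

The first call prints False and the second prints True. A real observability tool would likely account for seasonality and trends, and would track many metrics beyond row counts, such as freshness and null rates.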

Data transformation

As the name suggests, data transformation tools let you transform your data into a usable form and apply data quality rules before the data is analyzed or acted upon. The market offers a good selection of both established and emerging tools, so there is a good chance that one will work with your data environment.

Pros

  • Bring software engineering best practices, such as version control, quality assurance, documentation, and modularity, to data transformation workflows.
  • Using a data transformation tool as a testing framework can stop bad data from flowing downstream.
  • Range of emerging and established tooling means there’s a good chance that you’ll find a tool that integrates with your data environment.

Popular data transformation tools include:

dbt (data build tool) is an open-source tool that enables you to transform data in the warehouse using SQL.

Dataform is similar to dbt and also SQL-based, but since its acquisition by Google Cloud it works only with BigQuery.

Trifacta offers a visual interface to transform data but also integrates with dbt Core, the open-source version of dbt. Trifacta was recently acquired by its biggest competitor, Alteryx.
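As a rough illustration of transforming data with SQL while applying quality rules along the way, the sketch below builds a cleaned table from a raw one. It uses Python's built-in sqlite3 module as a stand-in for a data warehouse and is not dbt or Dataform syntax; the table names, column names, and rules are hypothetical.

```python
import sqlite3


def build_clean_orders(conn):
    """Rebuild clean_orders from raw_orders, normalizing values and
    filtering out rows that break basic quality rules."""
    conn.executescript(
        """
        DROP TABLE IF EXISTS clean_orders;
        CREATE TABLE clean_orders AS
        SELECT
            order_id,
            LOWER(TRIM(customer_email)) AS customer_email,  -- normalize format
            ROUND(amount_cents / 100.0, 2) AS amount_usd    -- convert units
        FROM raw_orders
        WHERE order_id IS NOT NULL   -- rule: every order needs an id
          AND amount_cents >= 0;     -- rule: no negative amounts
        """
    )


# An in-memory database with a few hypothetical raw rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE raw_orders (order_id INTEGER, customer_email TEXT, amount_cents INTEGER)"
)
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [
        (1, "  Ana@Example.com ", 1250),
        (2, "bo@example.com", -300),    # negative amount, filtered out
        (None, "cy@example.com", 900),  # missing id, filtered out
    ],
)

build_clean_orders(conn)
rows = conn.execute(
    "SELECT order_id, customer_email, amount_usd FROM clean_orders ORDER BY order_id"
).fetchall()
print(rows)
```

Only the first row survives, with its email trimmed and lowercased and its amount converted to dollars. Keeping the transformation in a version-controlled SQL statement like this is what makes the review and testing practices in the list above possible.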

Data testing

Data testing tools borrow from traditional software engineering practices and adopt pipeline testing to catch data quality issues.

Pros

  • Test your data to catch key data quality issues.
  • Prevent bad data from flowing downstream.
  • Act like documentation for your data.

Popular data testing tools include:

Great Expectations is a data quality tool to validate data via tests to ensure that the data appears as expected.

Soda is a data quality tool to validate data via tests and monitor the data to ensure that it appears as expected.

Deequ is an open-source library built on top of Apache Spark to define unit tests for data.
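The sketch below shows, in plain Python, the general shape of a data test: a declared expectation about a column that either passes or fails. It deliberately avoids the APIs of Great Expectations, Soda, and Deequ, so every function name here is hypothetical, as are the sample rows.

```python
def expect_not_null(rows, column):
    """Fail if any row has a missing value in `column`."""
    failures = [r for r in rows if r.get(column) is None]
    return {"test": f"{column} is not null", "passed": not failures,
            "failed_rows": len(failures)}


def expect_unique(rows, column):
    """Fail if any non-null value in `column` appears more than once."""
    values = [r[column] for r in rows if r.get(column) is not None]
    duplicates = len(values) - len(set(values))
    return {"test": f"{column} is unique", "passed": duplicates == 0,
            "failed_rows": duplicates}


def expect_between(rows, column, low, high):
    """Fail if any non-null value in `column` falls outside [low, high]."""
    failures = [r for r in rows
                if r.get(column) is not None and not (low <= r[column] <= high)]
    return {"test": f"{column} is between {low} and {high}",
            "passed": not failures, "failed_rows": len(failures)}


def run_suite(rows):
    """The suite doubles as documentation of what the data should look like."""
    return [
        expect_not_null(rows, "user_id"),
        expect_unique(rows, "user_id"),
        expect_between(rows, "age", 0, 120),
    ]


# Hypothetical rows containing one problem per test.
users = [
    {"user_id": 1, "age": 34},
    {"user_id": 2, "age": 150},    # out-of-range age
    {"user_id": 2, "age": 28},     # duplicate id
    {"user_id": None, "age": 41},  # missing id
]

results = run_suite(users)
for result in results:
    print(result)
```

Each of the three tests fails with one offending row. In a pipeline, a failing result would typically halt the job or raise an alert so that the bad rows never reach downstream consumers.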

Data lineage

Data lineage tools enable you to understand the impact of bad data downstream and can help find the root cause of a data quality issue.

Pros

  • A strong understanding of data lineage is important for impact analysis.
  • Resolve data quality issues faster by pinpointing the correct source of errors.
  • Understand which downstream analytics have been affected by data quality issues.

Popular data lineage tools include:

MANTA is a data lineage tool that integrates with a data catalog to automatically scan the data and build a map of all data flows.

Alvin is a data lineage tool that offers column-level, cross-system data lineage.

Datakin is a data lineage tool that automatically traces data lineage and builds a graph that illustrates the upstream and downstream relationships for each dataset.
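Lineage is naturally modeled as a directed graph, and both impact analysis and root-cause analysis reduce to walking that graph in one direction or the other. The following is a minimal sketch with a hand-written graph; the dataset names are hypothetical, and real lineage tools build this graph automatically, often at column level.

```python
from collections import deque

# Each key is a dataset; its list holds the datasets built directly from it.
LINEAGE = {
    "raw_orders": ["clean_orders"],
    "raw_customers": ["clean_customers"],
    "clean_orders": ["revenue_daily", "customer_ltv"],
    "clean_customers": ["customer_ltv"],
    "revenue_daily": ["exec_dashboard"],
    "customer_ltv": [],
    "exec_dashboard": [],
}


def downstream(graph, start):
    """Impact analysis: every dataset reachable from `start` (breadth-first)."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def upstream(graph, start):
    """Root-cause candidates: every dataset that feeds into `start`."""
    reversed_graph = {}
    for parent, children in graph.items():
        for child in children:
            reversed_graph.setdefault(child, []).append(parent)
    return downstream(reversed_graph, start)


print(sorted(downstream(LINEAGE, "clean_orders")))
print(sorted(upstream(LINEAGE, "customer_ltv")))
```

The first call lists what is affected if clean_orders is wrong (customer_ltv, exec_dashboard, and revenue_daily), and the second lists where to look when customer_ltv is wrong (both clean tables and both raw tables).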

Which tool is right for you?

With so many flavors of data quality solutions, figuring out the solution or solutions that are right for your use case can be daunting, especially due to the overlapping capabilities of tools with distinct core offerings.

I recommend opting for tools that are extensible by design and integrate with modern data tools. Most importantly, the data quality tools you choose must cater to your primary needs, whether that is to transform data, run tests to ensure that data is as expected, detect anomalies, monitor data pipelines, track changes in data, or visualize data lineage. That’s not all, though: dedicated tools are available to monitor ML models, business KPIs, or even your entire data infrastructure, and you can expect more fragmentation as companies in the data quality space continue to innovate.

Therefore, it’s a good idea to invest in best-in-class tools that solve a core problem really well. If you are looking to build a modern data stack, astorik is a great place to start exploring data tools.
