Thought leadership

May 9, 2023

Can ChatGPT identify and handle data quality issues?

min read

Liz Elfman

Get Data Insights Delivered

Join hundreds of data professionals who subscribe to the Data Leaders Digest for actionable insights and expert advice.

Stay Informed

Sign up for the Data Leaders Digest and get the latest trends, insights, and strategies in data management delivered straight to your inbox.

Get the Best of Data Leadership

Subscribe to the Data Leaders Digest for exclusive content on data reliability, observability, and leadership from top industry experts.

Get the Best of Data Leadership

Subscribe to the Data Leaders Digest for exclusive content on data reliability, observability, and leadership from top industry experts.

Stay Informed

Sign up for the Data Leaders Digest and get the latest trends, insights, and strategies in data management delivered straight to your inbox.

Get Data Insights Delivered

Join hundreds of data professionals who subscribe to the Data Leaders Digest for actionable insights and expert advice.

Amidst all the fervor around ChatGPT, most of the applications and use cases have had to do with words and code. But what about data? How well is ChatGPT able to analyze and process data, especially messy data?

In this blog post, we evaluate ChatGPT on a variety of common data quality issues and tasks.

1. Missing data

Missing or incomplete data is probably the most common data observability issue, whether it’s because of data entry error, or because ETL jobs are failing. Here’s an example of a table with some missing data:

Customer IDCustomer NameEmailPhone1John Doejohn.doe@example.com555-12342Jane Smith555-56783Mary Johnsonmary.johnson@example.com4James Brown

ChatGPT can easily identify that there are missing fields, and which rows have missing fields. This might not be the most efficient way of doing so, however, given that null checks are fairly easy to write even in SQL.

2. Inconsistent formatting

Another common data reliability issue is inconsistent formatting. For instance, a customer's first and last name may be in different capitalizations or a date might not be in the right format. String columns are particularly prone to this issue. To address this, historically, the data must be standardized across the source system or, at the very least, in the data pipeline before being fed into the data lake or warehouse.

Here’s an example table with the "Date of Birth" column in several different date formats.

Customer IDCustomer NameEmailDate of Birth1John Doejohn.doe@example.com12-05-19902Jane Smithjane.smith@example.com05/14/19853Mary Johnsonmary.johnson@example.com1983-11-074James Brownjames.brown@example.com23-08-1978

Again, it looks like ChatGPT can identify the inconsistent date format as an issue. However, form validation on the data entry side, or a simple regex check, might be computationally cheaper. Where ChatGPT improves on the status quo is in the ease of fixing the issue - you’re able to give it instruction to standardize the date format in natural language, which is a much better user experience than having to write a regex.

3. Out-of-range values

LLMs are cool because they have some understanding of column names, and what they might mean semantically. For example, in the table below, ChatGPT can identify that a negative price is probably an error. Rather than having to manually set thresholds on the columns, you can just give a broad instruction: “Check this data for data quality issues and fix them if there are any”, and it will do so.

Product IDProduct NamePrice1Laptop15002Tablet-3003Monitor2504Keyboard0

4. Erroneous data types

Again, since LLMs have a semantic understanding of column names, it’s able to tell that the column Age, which should be an integer number, contains values like lambda and 41.5 that are probably not the right data type.

Employee IDEmployee NameAge1John Doelambda 2Jane Smith293Mary Johnson41.54James Brown32

5. Calculating simple statistics

Now let’s try something a little harder - calculating simple statistics like mean, median, and variance, which are the bread-and-butter of data quality checks.

Product IDProduct NamePrice1Laptop15002Tablet8003Monitor2504Keyboard20

ChatGPT gets the mean correct (although the response is fairly wordy), but it does not in fact get the variance correct. In fact, the variance should be 433891.6666666667. Performance on these sorts of simple math calculations will probably improve once GPT plugins (for instance, a calculator plugin) are fully enabled.

6. Unknown, new data

Let’s use ChatGPT to explore a raw, large data file. This is a common task for data scientists, and ChatGPT does a really good job of describing, in an organized way, what the data in a table is, from just the column names. Integrating LLMs into data catalogs seems like a great way to automate metadata maintenance.

7. Anomalies in timeseries data

Finally, let’s look at how ChatGPT is able to handle anomaly detection. We are using some mock timeseries data with synthetic anomalies inserted. It looks like this:

Now, let’s ask ChatGPT to identify the anomalies (note that we’ve truncated the data because the whole timeseries doesn’t fit in the context window - this is probably the biggest problem with using LLMs on large datasets. While you can chunk up the data and feed it piecemeal to the LLM, this isn’t great if you have a long timeseries, where later values have some relationship to earlier values). While some of the anomalies identified are indeed anomalies, for example timestamp 1217, others that are identified, like all the 0s, are not. We ask the anomaly detection algorithm to be less sensitive, and it gives a better answer.

Overall, it seems that ChatGPT is most useful in tackling data quality issues that have to do with formatting and data exploration. It’s able to extrapolate human-readable metadata from machine-friendly column names. However, its non-deterministic nature and lack of high-level quantitative reasoning ability make it a poor fit for other tasks; at least for now!

share this episode

Resource

Monthly cost ($)

Number of resources

Time (months)

Total cost ($)

Software/Data engineer

$15,000

$540,000

Data analyst

$12,000

$144,000

Business analyst

$10,000

$30,000

Data/product manager

$20,000

$240,000

Total cost

$954,000

Role

Goals

Common needs

Data engineers

Overall data flow. Data is fresh and operating at full volume. Jobs are always running, so data outages don't impact downstream systems.

Freshness + volume
Monitoring
Schema change detection
Lineage monitoring

Data scientists

Specific datasets in great detail. Looking for outliers, duplication, and other—sometimes subtle—issues that could affect their analysis or machine learning models.

Freshness monitoringCompleteness monitoringDuplicate detectionOutlier detectionDistribution shift detectionDimensional slicing and dicing

Analytics engineers

Rapidly testing the changes they’re making within the data model. Move fast and not break things—without spending hours writing tons of pipeline tests.

Lineage monitoringETL blue/green testing

Business intelligence analysts

The business impact of data. Understand where they should spend their time digging in, and when they have a red herring caused by a data pipeline problem.

Integration with analytics toolsAnomaly detectionCustom business metricsDimensional slicing and dicing

Other stakeholders

Data reliability. Customers and stakeholders don’t want data issues to bog them down, delay deadlines, or provide inaccurate information.

Integration with analytics toolsReporting and insights