October 27, 2022

Data Practitioner Spotlight: Laura Dawson of EDO on data taxonomies

Kyle Kirwan

In the Data Practitioner Spotlight series, we interview people who are directly working at the forefront of data across a range of sectors. In this edition, we sit down with Laura Dawson, who heads up data quality at marketing analytics startup EDO.

EDO is a marketing analytics startup that helps marketers determine the effectiveness of their television ad buys. It does this by cross-referencing the ads against search traffic for the brands in the ads: if an ad for Porsche is shown on TV and searches for Porsche jump at the same time, then the ad was probably effective.

The EDO product, then, is about accurately integrating two primary sources of data, TV streams and Google Trends data, and then making that merged data available to marketers so they can make better ad-buying decisions.
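
To make the cross-referencing idea concrete, here is a minimal sketch of a "search lift" calculation. It is not EDO's actual method: the function name, the five-minute window, and the per-minute search counts are all assumptions made up for illustration.

```python
from statistics import mean

def search_lift(searches_per_minute, ad_minute, window=5):
    """Ratio of mean search volume in the `window` minutes starting when the
    ad airs to the mean volume in the `window` minutes before it."""
    before = searches_per_minute[max(0, ad_minute - window):ad_minute]
    after = searches_per_minute[ad_minute:ad_minute + window]
    if not before or not after or mean(before) == 0:
        return None  # not enough data to compare
    return mean(after) / mean(before)

# Hypothetical per-minute search counts for a brand; the ad airs at minute 5.
searches = [100, 98, 102, 101, 99, 180, 220, 190, 150, 160]
lift = search_lift(searches, ad_minute=5)
print(lift)  # a value well above 1.0 suggests searches rose after the ad
```

A real system would also need to account for baseline trends and ads that overlap in time, which this sketch ignores.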

What EDO's data team does

EDO has collection infrastructure that pulls in TV streams. The individual ads in the streams are then tagged with:

Brand
Product
Product Variation
Language
Is a promotion being run for the product

So, for example, an Amazon ad might be tagged with:

Brand: Amazon
Product: Alexa
Product Variation: Echo
Language: English
Promotion: 20% off on Amazon.com
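
The five tags above could be represented as a small record type. This is only a sketch: the class and field names are hypothetical, and nothing here reflects EDO's real schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class AdTag:
    """One tagged ad, using the five fields listed above (names are assumed)."""
    brand: str
    product: str
    product_variation: str
    language: str
    promotion: Optional[str] = None  # None when no promotion is being run

    @property
    def has_promotion(self) -> bool:
        return self.promotion is not None

tag = AdTag(
    brand="Amazon",
    product="Alexa",
    product_variation="Echo",
    language="English",
    promotion="20% off on Amazon.com",
)
print(tag.has_promotion)
```

Making the record immutable (frozen=True) is a design choice for the sketch: a tag that has been reviewed should only change through an explicit correction, not by accident.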

The data is the ultimate product and has to be accurate. Tagging is done manually, through an outsourced team aided by automated tools. (This is a similar model to startups like Scale.ai in the self-driving car data generation space.) The annotation tool has a dropdown from which taggers select a tag. If they believe the correct tag is not already in the dropdown, they add a new one, which is reviewed the next day.

Initial tagging occurs during the US nighttime. When the US team wakes up, Dawson’s team normalizes those tags into a data taxonomy and deals with governance issues. This process is often very subjective.

“We’re dealing with questions like, is Amazon an umbrella brand, or are we going to break out Amazon Prime Video as a separate brand?” says Dawson. Sometimes, these decisions are guided by the brand (if the brand is a client). Other times, the data quality team makes a judgment call. “We’ll go to their website and see how they’re organizing their products and look at how they’ve discussed their brand over time historically,” Dawson says.

In addition to maintaining the data taxonomy and the correct labels, Dawson’s team is also in charge of tasks like:

Normalizing video quality: For example, two pieces of video might actually be semantically the same even if there are small differences in audio or video quality.
Responding to changes in Google Trends data

The final output of the data quality team at EDO is a dashboard used by the analytics team, coupled with custom reports that they run. The company also licenses the taxonomy it has developed and distributes it via an S3 bucket.

What is a data taxonomy?

Dawson’s work on building a robust data taxonomy for EDO’s data brings up a question. What is a data taxonomy, and why is it important?

A data taxonomy refers to how tracked events and properties are named and categorized. Data taxonomy ensures that data is consistently categorized across multiple sources and channels, so that consumers of the data can derive meaningful insights.

Let’s say that you’re merging two tables: one where the purchase event is denoted as “Checkout Submitted Order” and another where it’s denoted “Checkout submitted order.” These events will be considered two separate events and will not automatically merge. Therefore, if you query for submitted orders, you’ll probably get an inaccurate result.
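
The casing problem above can be sketched in a few lines. The normalization rule here (collapse whitespace, ignore case) is an assumption chosen for illustration; a real taxonomy would define its own canonical naming rules.

```python
from collections import Counter

def normalize_event(name: str) -> str:
    """Collapse whitespace and case so spelling variants map to one key."""
    return " ".join(name.split()).casefold()

# Hypothetical event names coming from two tables.
table_a = ["Checkout Submitted Order", "Checkout Submitted Order"]
table_b = ["Checkout submitted order"]

# Without normalization, the two spellings are counted as separate events.
raw_counts = Counter(table_a + table_b)

# With normalization, all three rows land under a single event key.
normalized_counts = Counter(normalize_event(e) for e in table_a + table_b)

print(len(raw_counts), len(normalized_counts))
```

Normalizing at query time is a patch; the point of a taxonomy is to agree on one name up front so that this step is unnecessary.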

Data taxonomy originated as a subject in the library sciences, used to figure out how to best categorize and name books. It eventually broadened beyond the library sciences into data at large.

Data taxonomy in e-commerce

The earliest applications of internet data taxonomies happened in the e-commerce space. Online marketplaces like Amazon had to organize their product catalog in such a way that consumers would actually find the stuff they wanted.

In a recent blog post, for example, Etsy outlined their product taxonomy: a collection of hierarchies “comprising of 6,000+ categories (ex. Boots), 400+ attributes (ex. Women’s shoe size), 3,500+ values (ex. 7.5), and 90+ scales (ex. US/Canada).” These hierarchies form the foundation for the various filters and category-specific shopping experiences that make up the buyer experience.

Video and content taxonomy

Prior to EDO, Dawson spent time as a taxonomy analyst at HBO. There, along with her boss, she pioneered a new standard for language metadata built on IETF BCP 47 (Internet Engineering Task Force Best Current Practice 47).

Previously, different departments coded the Spanish language differently, including uppercase Spanish, lowercase Spanish, and other variations to represent specific dialects.

By creating the language metadata standards, Dawson created a source of truth across the company. This streamlined the language metadata terminology for audio, subtitles, closed captions, rights, and licensing.
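
As a rough illustration of what such a source of truth does, the sketch below maps inconsistent department labels to BCP 47 language tags. The legacy labels and the function name are hypothetical, not HBO's actual data; the tags themselves (es, es-419, es-MX, es-ES) are standard BCP 47 tags.

```python
# Hypothetical legacy labels mapped to BCP 47 language tags.
LEGACY_TO_BCP47 = {
    "spanish": "es",
    "spanish (latin america)": "es-419",
    "spanish (mexico)": "es-MX",
    "spanish (spain)": "es-ES",
}

def to_bcp47(label: str) -> str:
    """Return the BCP 47 tag for a legacy label, ignoring case and spacing."""
    key = " ".join(label.split()).casefold()
    try:
        return LEGACY_TO_BCP47[key]
    except KeyError:
        # Unknown labels are surfaced for human review rather than guessed.
        raise ValueError(f"No BCP 47 mapping for {label!r}; needs review") from None

print(to_bcp47("SPANISH"), to_bcp47("Spanish (Latin America)"))
```

Raising an error on unmapped labels, rather than passing them through, keeps new variants from quietly re-entering the metadata.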

Principles of building a data taxonomy

Building and maintaining data taxonomies is probably one of the most labor-intensive approaches to high-quality data, and EDO does it because data is the product. Below, Dawson shared some of her hard-won principles of exporting data to its end users.

1. Think about who the data is for

Is it for a manufacturer, a distributor, or a consumer? “Back when Amazon first started up, they were using backend data from publishing warehouses that was really junky,” says Dawson. But consumers have different expectations. “There was this whole educational effort back in the late 90s and early 2000s to make that data more palatable for consumers.”

2. Understand the constraints of your system

You might have certain engineering constraints or database constraints. How will you bend the taxonomy to make that work? EDO uses a three-tier taxonomy: brands, products, and product variations. When a product variation has spinoffs of its own, what does the team do? Says Dawson, “We don’t have a fourth level. We have to figure out a way to set up the product variation field to concatenate all of these different spinoffs.”
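
One way to picture the concatenation Dawson describes is the sketch below. The separator, the function name, and the example path are assumptions for illustration; EDO's actual convention is not described in the interview.

```python
SEPARATOR = " / "  # assumed delimiter for folded levels

def fit_to_three_tiers(path):
    """Fit a category path into (brand, product, product_variation).

    Any levels beyond the third are concatenated into the product
    variation field, since the taxonomy has no fourth level.
    """
    if len(path) < 2:
        raise ValueError("A path needs at least a brand and a product")
    brand, product = path[0], path[1]
    variation = SEPARATOR.join(path[2:])  # empty string if no variation
    return brand, product, variation

# Hypothetical five-level path squeezed into three tiers.
row = fit_to_three_tiers(["Amazon", "Alexa", "Echo", "Echo Dot", "Kids Edition"])
print(row)
```

The trade-off is that the folded levels can no longer be filtered independently, which is part of why the depth of the taxonomy has to be chosen deliberately.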

3. Don’t make your taxonomy too deep

“If your taxonomy is too layered, if it goes too deep, you’re going to have a nightmare in terms of organization and monitoring that data,” says Dawson. “For us, three layers just seemed to be the level at which our clients responded well to it AND our reviewers were able to work with it.”

4. Engage a customer success team around the most-used data

Dawson tells us, “The more eyes are on the data, the more you need a dedicated person or team to react quickly to the inevitable incoming feedback, like when a brand should be capitalized and it’s not.”

5. If you’re managing a taxonomy, you will always be grooming it

A data taxonomy is not a set-it-once kind of thing. It’s a constant, iterative process. Dawson says, “You’re always looking for ‘can we collapse these?’, ‘do we have to expand these?’” It’s a living set of rules that is subject to change as you acquire more information.

Conclusion

As a marketing analytics company, EDO’s product is data. The standards it sets for its data quality, and the processes it has for maintaining that quality, offer valuable lessons even for data teams that mostly export data for internal use.
