Thought leadership
October 27, 2022

Data Practitioner Spotlight: Laura Dawson of EDO on data taxonomies

In the "Data Practitioner Spotlight" series, we interview data practitioners from the field. In this conversation, we dive deep into data taxonomies and using ad data to make better decisions with Laura Dawson of EDO.

Kyle Kirwan

In the Data Practitioner Spotlight series, we interview people who are directly working at the forefront of data across a range of sectors. In this edition, we sit down with Laura Dawson, who heads up data quality at marketing analytics startup EDO.

EDO helps marketers determine the effectiveness of their television ad buys. It does this by cross-referencing ads against search traffic for the brands they feature: if an ad for Porsche airs on TV and searches for Porsche spike at the same time, then the ad was probably effective.

The EDO product, then, is about accurately integrating two primary sources of data, TV streams and Google Trends data, and then making that merged data available to marketers so they can make better ad-buying decisions.

What EDO's data team does

EDO has collection infrastructure that pulls in TV streams. The individual ads in the streams are then tagged with:

  • Brand
  • Product
  • Product Variation
  • Language
  • Promotion (whether one is being run for the product)

So, for example, an ad for the Amazon Echo might be tagged with:

  • Brand: Amazon
  • Product: Alexa
  • Product Variation: Echo
  • Language: English
  • Promotion: 20% off on Amazon.com

The data is the ultimate product and has to be accurate. Tagging is done manually by an outsourced team aided by automated tools. (This is a similar model to startups like Scale AI in the self-driving-car data labeling space.) The annotation tool presents a dropdown of existing tags for taggers to select from. If a tagger believes the correct tag is not already in the dropdown, they add a new one, which is reviewed the next day.
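
To make the shape of this data concrete, here is a minimal sketch of what a single tagged-ad record might look like. The field names are our assumptions based on the tag categories described above, not EDO's actual schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AdTag:
    """One tagged TV ad. Illustrative fields based on the tag
    categories described above, not EDO's actual schema."""
    brand: str                       # e.g. "Amazon"
    product: str                     # e.g. "Alexa"
    product_variation: str           # e.g. "Echo"
    language: str                    # e.g. "English"
    promotion: Optional[str] = None  # e.g. "20% off on Amazon.com"
    reviewed: bool = True            # new tags start False until review

# A tagger who can't find the right tag in the dropdown adds a new
# one, which stays unreviewed until the US team checks it next day.
pending = AdTag("Amazon", "Alexa", "Echo", "English",
                promotion="20% off on Amazon.com", reviewed=False)
```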

Initial tagging happens overnight, US time. When the US team wakes up, Dawson’s team normalizes those tags into a data taxonomy and deals with governance issues. This process is often very subjective.

“We’re dealing with questions like, is Amazon an umbrella brand, or are we going to break out Amazon Prime Video as a separate brand?” says Dawson. Sometimes these decisions are guided by the brand (if the brand is a client). Other times, the data quality team makes a judgment call. “We’ll go to their website and see how they’re organizing their products, and look at how they’ve discussed their brand over time,” Dawson says.

In addition to maintaining the data taxonomy and the correct labels, Dawson’s team is also in charge of tasks like:

  • Normalizing video quality: two pieces of video might be semantically the same ad even if there are small differences in audio or video quality.
  • Responding to changes in Google Trends data.

The final output of the data quality team at EDO is a dashboard used by the analytics team, coupled with custom reports that they run. The company also licenses the taxonomy it has developed and distributes it via an S3 bucket.

What is a data taxonomy?

Dawson’s work building a robust data taxonomy for EDO raises a question: what is a data taxonomy, and why is it important?

A data taxonomy defines how tracked events and properties are named and categorized. It ensures that data is categorized consistently across multiple sources and channels, so that consumers of the data can derive meaningful insights.

Let’s say that you’re merging two tables: one where the purchase event is denoted “Checkout Submitted Order” and another where it’s denoted “Checkout submitted order.” Because the names differ, these will be treated as two separate events and will not automatically merge, so a query for submitted orders will undercount them.
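
One common defense is to normalize event names into a canonical form before merging. Here is a minimal sketch of that idea; the normalization convention shown is just one reasonable choice:

```python
def normalize_event(name: str) -> str:
    """Canonicalize an event name so that casing and whitespace
    differences don't split one logical event into two."""
    return "_".join(name.strip().lower().split())

# Both variants from the example above collapse to one canonical event.
events = ["Checkout Submitted Order", "Checkout submitted order"]
assert {normalize_event(e) for e in events} == {"checkout_submitted_order"}
```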

Data taxonomy originated as a subject in library science, where it was used to figure out how best to categorize and name books. It eventually broadened beyond libraries into data at large.

Data taxonomy in e-commerce

The earliest applications of internet data taxonomies were in e-commerce. Online marketplaces like Amazon had to organize their product catalogs so that consumers could actually find what they wanted.

In a recent blog post, for example, Etsy outlined their product taxonomy: a collection of hierarchies “comprising of 6,000+ categories (ex. Boots), 400+ attributes (ex. Women’s shoe size), 3,500+ values (ex. 7.5), and 90+ scales (ex. US/Canada).” These hierarchies form the foundation for the various filters and category-specific shopping experiences that make up the buyer experience.
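
Modeled in code, that kind of hierarchy might look roughly like the sketch below. This is our illustration of the category/attribute/value/scale relationship the post describes, not Etsy's actual implementation (requires Python 3.10+):

```python
from dataclasses import dataclass, field

@dataclass
class Attribute:
    """An attribute like "Women's shoe size", whose values
    (e.g. "7.5") may be interpreted on a scale (e.g. "US/Canada")."""
    name: str
    scale: str | None = None
    values: list[str] = field(default_factory=list)

@dataclass
class Category:
    """A node in the category hierarchy, e.g. "Boots"."""
    name: str
    attributes: list[Attribute] = field(default_factory=list)
    children: list["Category"] = field(default_factory=list)

boots = Category("Boots", attributes=[
    Attribute("Women's shoe size", scale="US/Canada", values=["7.5"]),
])
shoes = Category("Shoes", children=[boots])  # hierarchy via nesting
```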

Video and content taxonomy

Prior to EDO, Dawson spent time as a taxonomy analyst at HBO. There, along with her boss, she pioneered the company’s adoption of IETF BCP 47 (Best Current Practice 47), the Internet Engineering Task Force’s standard for language tags, for language metadata.

Previously, different departments coded the Spanish language differently, including “Spanish,” “spanish,” and other variations used to represent specific dialects.

By creating language metadata standards, Dawson established a single source of truth across the company, streamlining the language metadata terminology for audio, subtitles, closed captions, and rights and licensing.
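
As an illustration, normalizing to BCP 47 might look like the sketch below. The BCP 47 tags are real (“es” is Spanish, “es-MX” Mexican Spanish, “es-419” Latin American Spanish); the legacy department codes are invented examples of the kind of inconsistency described above:

```python
# Map legacy, inconsistent department codes onto BCP 47 language tags.
LEGACY_TO_BCP47 = {
    "Spanish": "es",
    "spanish": "es",
    "Spanish (Mexico)": "es-MX",
    "LatAm Spanish": "es-419",
}

def to_bcp47(legacy_code: str) -> str:
    """Return the canonical BCP 47 tag for a legacy code."""
    try:
        return LEGACY_TO_BCP47[legacy_code]
    except KeyError:
        raise ValueError(f"no BCP 47 mapping for {legacy_code!r}")

# Casing variants of the same language resolve to one canonical tag.
assert to_bcp47("Spanish") == to_bcp47("spanish") == "es"
```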

Principles of building a data taxonomy

Building and maintaining data taxonomies is one of the most labor-intensive approaches to high-quality data, and EDO does it because data is the product. Below, Dawson shares some of her hard-won principles for building taxonomies and delivering data to end users.

1. Think about who the data is for

Is it for a manufacturer, a distributor, or a consumer? “Back when Amazon first started up, they were using backend data from publishing warehouses that was really junky,” says Dawson. But consumers have different expectations. “There was this whole educational effort back in the late 90s and early 2000s to make that data more palatable for consumers.”

2. Understand the constraints of your system

You might have certain engineering constraints or database constraints. How will you bend the taxonomy to make that work? EDO uses a three-tier taxonomy: brands, products, and product variations. So what happens when a product variation has its own spinoffs? Says Dawson, “We don’t have a fourth level. We have to figure out a way to set up the product variation field to concatenate all of these different spinoffs.”
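
In code, that concatenation workaround might look something like the sketch below. The separator and field names are our assumptions, not EDO's implementation:

```python
def product_variation_field(variations: list[str], sep: str = " - ") -> str:
    """Flatten any deeper variation levels into the single
    third-tier field, since the taxonomy stops at three levels."""
    return sep.join(variations)

# Brand -> Product -> Product Variation; spinoffs below the third
# level are concatenated rather than given a fourth tier.
tag = {
    "brand": "Amazon",
    "product": "Alexa",
    "product_variation": product_variation_field(["Echo", "Kids Edition"]),
}
assert tag["product_variation"] == "Echo - Kids Edition"
```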

3. Don’t make your taxonomy too deep

“If your taxonomy is too layered, if it goes too deep, you're going to have a nightmare in terms of organization and monitoring that data," says Dawson. "For us, three layers just seemed to be the level at which our clients responded well to it AND our reviewers were able to work with it.”

4. Engage a customer success team around the most-used data

Dawson tells us, “The more eyes are on the data, the more you need a dedicated person or team to react quickly to the inevitable incoming feedback, like when a brand should be capitalized and it’s not.”

5. If you’re managing a taxonomy, you will always be grooming it

A data taxonomy is not a set-it-once kind of thing. It’s a constant, iterative process. Dawson says, “You're always looking for ‘can we collapse these?’, ‘do we have to expand these?'” It's a living set of rules that is subject to change as you acquire more information.

Conclusion

EDO is a marketing analytics company whose product is data. The standards it sets for data quality, and the processes it has for maintaining that quality, offer valuable lessons even for data teams whose data is consumed mostly internally.

