Creating trust in data: 10 examples from real-life data teams
Here are ten rules for creating trust in data, taken from real-life case studies, straight from the mouths of battle-tested data practitioners.
Trustworthy data is the ultimate litmus test for organizations investing in data infrastructure. It doesn't matter how much data is collected, how sophisticated the algorithms are, or how advanced the analytics tools might be. If people don't trust the data, it isn't worth much at all.
In this post, we explore ten examples of practices that various data teams implemented to improve trust in their data. From crowdsourcing data updates to standardizing metric definitions, these are real-world examples that are probably applicable to your organization too.
1. Crowdsource data updates from customers to keep data fresh
"One of the things that creative companies are doing to keep their data fresh is crowdsourcing and asking their customers and community to do it for them. So if you want to know if somebody's moved, the best way to do that is to ask them. There was a time in 1995 when we would rely on...Axiom and third-party sources for everything. But if you're clever, you can get better updates of that data."
– Joe Dosantos, formerly Chief Data Officer, Qlik
2. Separate data into big data and fast data
Dosantos also discussed the importance of separating data into big data vs. fast data. Big data work means looking at trends over years of data; it generally does not require low-latency data. Fast data, on the other hand, supports in-the-moment decisions that require real-time, up-to-date, and accurate data:
"I think we need to think about there being a sort of latency spectrum…So thinking critically about the use case of that data and not just jumping through hoops to make everything perfect and current and replicated. Certain data will need to be replicated in near real-time. It will. Or in real-time. But I think that kind of sifting through the use cases to figure out how it will be used will save you a lot of energy on the back half."
– Joe Dosantos, formerly Chief Data Officer, Qlik
3. Categorize data assets into gold, silver, and bronze layers
At Coca-Cola, the data teams categorize all data assets into gold, silver, and bronze layers. This categorization refers to how much validation and processing occurred on that dataset. So a bronze dataset is essentially just a raw dump into the data lake. On the other end of the spectrum, a gold dataset is one with dedicated IT resources, data validations, data checks, data load checks, and extensive documentation on the transformations:
"Upon acquisition of a new data set, we review, where does this one need to be? What's the plan for IT to get us started for the long term? That's the one we've implemented that works really well. Because that's something we can communicate to our data community and let them know so that they know when to exercise the proper amount of caution when using that data."
– Rod Bates, previously VP of data science at Coca-Cola
4. Assign data captains
Data captains sit on the business side but are responsible for a particular data asset. At Coca-Cola, data captains:
- Own the Wiki entry
- Partner with IT on update frequency
- Represent the entire business for the dataset and serve as the voice of the customer to make that dataset as useful as possible
- Are the single point of contact for the true business context of how that data is used and should be kept up to date
5. Start with manual pulls, but automate quickly
At Fidelity National Financial, teams don’t just jump to automation when presented with a new dataset or use case. Instead, they build momentum and consensus by starting with manual pulls, then making the jump to automating data pipelines when the use case is clearly defined:
"Early on, when we're doing our data science programs is balancing what's the investment toward putting some sort of an automated update or automated feed in between the systems versus just having somebody, say, run a report or dump some data once a month. When we're in the discovery phase for some of these projects, we'll go ahead and have manual data pulls done by a person. But as soon as the viability of the project starts to get proven out, we start designing for that end result state in mind immediately, as early as possible. Can we start examining automated connections, can we start trying to tool everything for the eventuality where you're going to have some sort of an automated data feed? And so then once you have that set up and people have been thinking about it for a month or two duration of the project, it's actually almost less investment because everybody's already on board and you don't have to have this huge scramble of use case analysis and business analysts and all these other things."
– Rachel Levanger, director of data science at Fidelity National Financial
6. Figure out your most frequently-accessed data
At Microsoft, 80-90% of the data that is collected is never accessed again. The bulk of the signal for machine learning models is collected from a tiny number of features. These features are critical to get right and need to be closely vetted for data quality. The 80-90% of the data that is never touched again, though, does not:
"Have clarity on what data is more relevant and what data is more important. And run statistics on data to see, well, you know what, now we haven’t accessed this whole table in four years. Do we really think that we should spend a huge amount of effort? And it's not only the storage cost - luckily, storage costs have been reduced significantly. The main cost is to have people looking at the alerts, whether the data loads or not, spending a large amount of effort on the automation. It's only going to get worse. Historically, companies collect more and more and more data and there is more process and there's cost for even maintaining the data. So having a good understanding of what data is important or not is crucial for organizations."
– Juan Lavista, general manager, Microsoft's AI for Good Research Lab
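The "run statistics on data" step in the quote can be as simple as checking when each table was last read. Here is a minimal sketch, assuming you can export a mapping of table name to last access date from your warehouse's query logs. The table names, dates, and the `stale_tables` helper are all hypothetical, not something Microsoft described:

```python
from datetime import date

def stale_tables(last_access, today, max_idle_days=365):
    """Return names of tables not read within max_idle_days, sorted by name."""
    return sorted(
        name
        for name, accessed in last_access.items()
        if (today - accessed).days > max_idle_days
    )

# Hypothetical summary of access logs: table name -> date of most recent read.
last_access = {
    "orders": date(2024, 5, 30),
    "clickstream_raw": date(2020, 1, 15),
    "legacy_exports": date(2019, 11, 2),
}

print(stale_tables(last_access, today=date(2024, 6, 1)))
# ['clickstream_raw', 'legacy_exports']
```

Tables on that list are candidates for lighter monitoring, so alerting effort goes to the data people actually use.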
7. Surface engineering dashboards to business users for basic data observability metrics like freshness
Healthcare companies often have to wrestle with vendors delivering data late, which leaves stale data in the pipelines that consume it.
At one health tech startup, the engineering team has taken the logging it already had in place, summarized it, and exposed it to business stakeholders in a more digestible, user-friendly UI:
"Purely by knowing that the bottleneck is not from errors in our codebase, but that it’s instead due to upstream data vendors not sending us the data yet -- that's saved a lot of energy and increased trust through transparency. We used to get Slack messages or emails all the time, “Hey, why is this data missing?” Now non-technical stakeholders can look at regularly updated dashboards and realize that there’s an extra day lag for this data due to XYZ data source being delivered late."
– Head of data analytics infrastructure, health tech startup
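The freshness signal behind a dashboard like this can be very small. Below is a rough sketch, assuming each source has a known expected delivery cadence and a logged timestamp of its last delivery. The source names, timestamps, and the `freshness_status` helper are illustrative assumptions, not the startup's actual implementation:

```python
from datetime import datetime, timedelta

def freshness_status(last_delivery, expected_every, now):
    """Describe how late a source is relative to its expected delivery cadence."""
    overdue = (now - last_delivery) - expected_every
    if overdue <= timedelta(0):
        return "on time"
    hours = int(overdue.total_seconds() // 3600)
    return f"late by {hours}h (waiting on upstream delivery)"

now = datetime(2024, 6, 3, 12, 0)

# Hypothetical sources: name -> timestamp of the most recent delivery.
sources = {
    "vendor_claims": datetime(2024, 6, 1, 6, 0),
    "internal_events": datetime(2024, 6, 3, 11, 0),
}

for name, last_delivery in sources.items():
    print(name, "->", freshness_status(last_delivery, timedelta(days=1), now))
# vendor_claims -> late by 30h (waiting on upstream delivery)
# internal_events -> on time
```

The point is less the arithmetic than the wording: a status that names the upstream delay answers "why is this data missing?" before anyone has to ask.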
8. Make it easy for data platform users to write their own data checks
At Stripe, one engineering manager noted that ideally, all data platform users should be able to write their own data checks, picking up from where the previous user left off. She noted:
"So how do we be correct at Stripe? We start with a lot of checks that are automated. We provide checks for our users. Our users can also write their own checks. This is basically a very simple Airflow decorator, which is the scheduler that we use, that you can call in. These checks will run automatically after the job, if you use whatever decorator you want. And these decorators are simple. For example, we want to make sure that the data is increasing, never decreasing. Or that there are unique keys, so anybody can write these checks, but we make it really easy to call them in."
– Emma Tang, former engineering manager, Stripe
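To make the idea concrete, here is a minimal sketch of check decorators in plain Python. It is not Stripe's code and does not use Airflow's API. The names `with_checks`, `unique_keys`, and `non_decreasing` are hypothetical, and the "job" is just a function returning a list of rows:

```python
import functools

def with_checks(*checks):
    """Decorator: run each check against the job's output after the job finishes."""
    def decorator(job):
        @functools.wraps(job)
        def wrapper(*args, **kwargs):
            rows = job(*args, **kwargs)
            for check in checks:
                check(rows)  # each check raises ValueError on failure
            return rows
        return wrapper
    return decorator

def unique_keys(key):
    """Build a check that fails if any value in the key column repeats."""
    def check(rows):
        keys = [row[key] for row in rows]
        if len(keys) != len(set(keys)):
            raise ValueError(f"duplicate values in key column {key!r}")
    return check

def non_decreasing(column):
    """Build a check that fails if the column ever goes down from one row to the next."""
    def check(rows):
        values = [row[column] for row in rows]
        if any(later < earlier for earlier, later in zip(values, values[1:])):
            raise ValueError(f"column {column!r} decreases")
    return check

@with_checks(unique_keys("id"), non_decreasing("total"))
def load_daily_totals():
    # Stand-in for a real job; returns rows as dictionaries.
    return [
        {"id": 1, "total": 100},
        {"id": 2, "total": 140},
        {"id": 3, "total": 140},
    ]

rows = load_daily_totals()  # passes both checks and returns the rows
```

Because each check is an ordinary function, the next user can add one without touching the job itself, which is the "picking up where the previous user left off" part.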
9. Define fallback behaviors for data jobs
Expect the best, prepare for the worst. One Stripe engineering manager noted that she's come to define complex fallback behaviors for every single data job she runs. She says:
"What do you do when something goes wrong? We have very complex fallback behaviors defined for all of our data jobs. So for example, if it's not super critical, the job will automatically fall back to using yesterday's output. So it doesn't do a rerun, it just fails. It tries three times or however many times the user defines and it will just be like, I give up, it's okay, I'm going to continue the pipeline with yesterday's output.
And then a little bit higher than that, maybe something's wonky with today's data, let's use yesterday's data, but rerun the pipeline and continue with that.
As you can imagine, the most severe case would actually block the pipeline, not move forward, and halt everything. And that is also marked with a decorator that says this is really severe and we don't want to move forward if this is the case."
– Emma Tang, former engineering manager, Stripe
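The three severity levels in the quote map naturally onto a small policy switch. The sketch below is one way to express them, not Stripe's implementation: the quote describes marking jobs with a decorator, while this version passes the policy as an argument to keep the example short. `OnFailure`, `PipelineBlocked`, and `run_with_fallback` are hypothetical names:

```python
from enum import Enum

class OnFailure(Enum):
    USE_YESTERDAY_OUTPUT = "use_yesterday_output"          # give up and reuse the previous output
    RERUN_ON_YESTERDAY_INPUT = "rerun_on_yesterday_input"  # recompute from the previous input
    BLOCK = "block"                                        # halt the pipeline

class PipelineBlocked(Exception):
    """Raised when a blocking job fails and the pipeline must not continue."""

def run_with_fallback(job, today_input, yesterday_input, yesterday_output,
                      policy, max_attempts=3):
    """Try the job on today's input; on repeated failure, apply the fallback policy."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return job(today_input)
        except Exception as exc:
            last_error = exc
    if policy is OnFailure.USE_YESTERDAY_OUTPUT:
        return yesterday_output
    if policy is OnFailure.RERUN_ON_YESTERDAY_INPUT:
        return job(yesterday_input)
    raise PipelineBlocked("job failed and is marked as blocking") from last_error

def total(rows):
    # Stand-in job that fails on malformed input.
    if any(row is None for row in rows):
        raise ValueError("malformed row")
    return sum(rows)

today, yesterday = [1, None, 3], [1, 2, 3]
print(run_with_fallback(total, today, yesterday, 5, OnFailure.USE_YESTERDAY_OUTPUT))      # 5
print(run_with_fallback(total, today, yesterday, 5, OnFailure.RERUN_ON_YESTERDAY_INPUT))  # 6
```

With `OnFailure.BLOCK`, the same call raises `PipelineBlocked` instead of returning, so downstream steps never see questionable data.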
10. Standardize metric definitions
At Vercel, one common problem the team faced in its early days was non-standard metric definitions. They noted:
"One day someone notes that our weekly metric dropped. But it's only been six days. How is that possibly a weekly metric? Another one is time frames. Like, what is a monthly metric? Is it 31 days? Is it just a month? Is it four weeks? What are we even talking about?"
The data team remedied this by enforcing standard metric definitions at the company level, which increased trust in any given metric:
"Let's start at the company level because everyone works at the company. Let's build metrics and let's cascade them down because at least the company metrics will be applicable to everyone. Then we standardize on what is monthly and what is a weekly time frame. So no more six-day weeks, no more 31-day months. It was hard to compare February to October, which people were doing. And then they wondered why they saw so much more growth in October than in February? Because there are more days in October."
– Thomas Mickley-Doyle, Analytics and data science initiatives lead, Vercel
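One way to enforce this in code is to define each time frame once and refuse to compute a metric over an incomplete window. The sketch below assumes a week is 7 days and a month is 28 days. Those numbers are an illustrative choice, since the quote does not say which definitions Vercel settled on, and `WINDOW_DAYS` and `windowed_total` are hypothetical names:

```python
from datetime import date, timedelta

# Assumed company-wide definitions (illustrative, not Vercel's actual values).
WINDOW_DAYS = {"weekly": 7, "monthly": 28}

def windowed_total(daily, period, end):
    """Sum a daily metric over the standard window ending on `end` (inclusive)."""
    days = [end - timedelta(days=i) for i in range(WINDOW_DAYS[period])]
    missing = [d for d in days if d not in daily]
    if missing:
        raise ValueError(f"{period} window incomplete: {len(missing)} day(s) missing")
    return sum(daily[d] for d in days)

# Hypothetical data: 10 signups per day for every day of October 2024.
daily = {date(2024, 10, 1) + timedelta(days=i): 10 for i in range(31)}

print(windowed_total(daily, "weekly", date(2024, 10, 7)))    # 70
print(windowed_total(daily, "monthly", date(2024, 10, 28)))  # 280
```

Asking for the weekly total ending October 6 raises an error here, because only six days of data exist in that window. That is the "six-day week" from the quote, caught before it reaches a dashboard.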
Ultimately, the success of data-driven initiatives in any organization is rooted in trust. This trust can be built and sustained by adopting the practices highlighted above. If you'd like to learn how Bigeye can help you build trust in your data, schedule a demo today.