The Observatory
July 13, 2022

Chad Sanderson, Convoy

Chad Sanderson, head of data at Convoy, and Kyle Kirwan, CEO and co-founder of Bigeye, discuss the hub-and-spoke data team structure, 1990s data initiatives, and resolving long-term technical debt.

Read on for a lightly edited version of the transcript.

Kyle: Hey, what's up? Kyle from Bigeye here. Welcome back to The Observatory. Today, we're gonna be chatting with Chad Sanderson. He’s the head of data at Convoy, which is a digital freight network that's headquartered in Seattle, Washington. Chad, thanks for being on today.

Chad: Thanks for having me.

Kyle: So Chad, maybe we could start with what it is that Convoy does, and how do they use data to do that?

Chad: So Convoy is a digital freight network. And also a digital freight broker. Convoy sits in the middle between a shipper that's trying to move some freight to a facility, and a carrier that usually owns a fleet of trucks who are trying to take that freight. We optimize using machine learning, delivering the right shipments to the right trucker at the right time. And then do a lot of other really nifty things like batching, which is a process of ensuring that we don't have empty miles on the road. So instead of a trucker taking a load from Seattle to San Francisco and having to drive back with their trucks empty, we can batch loads together so that on all their round trips they have a full truck of shipments. This is driven by machine learning. Convoy is a truly ML-first company. So for that reason, data is critically important. We have a lot of machine learning models, from our pricing model to our batching model to network efficiency models and things like that, that rely extremely heavily on the quality of our data, our ability to collect that data really quickly, for it to be accurate, for it not to change over time, and for our data science team to have a lot of flexibility when working with it. So a lot of these data quality initiatives have been critically important over the last year to two years.

Kyle: Sounds like it's probably a pretty good chunk of the company that is either a consumer of data, working on data engineering, or a data scientist. Sounds like a bigger proportion maybe than in most other companies. Is that right?

Chad: It is. Convoy has a data team of about 70 people, not including the data platform team, which is data engineers and software engineers. The data team includes data scientists, analysts, and business intelligence engineers that are working with the data directly, building models, experiments, reports, things like that, which is pretty large in proportion to our scale as a startup.

Kyle: You mentioned that there's a data platform team. And then there are also these data science teams throughout the company. That sounds to me like a model that I've been hearing about more and more often, the hub and spoke type of arrangement. How long has the company been in that type of mode when it comes to data?

Chad: We've been in the arrangement for a pretty long time now. I'd say around four years. Early in Convoy’s trajectory as a business, we started with the hub and spoke. The data platform team functions exclusively as an infrastructure organization. So we maintain the tools: our data warehouse, any technology around ETL, pipelines, orchestration. Tools like Airflow, dbt, Snowflake. And then we also maintain what we call the application layer, which is the way that data consumers leverage data to solve business problems. So this would be our experimentation platform, machine learning platform, and so on and so forth.

Kyle: Got it, okay. And you've been pretty vocal about some serious architectural changes to the way analytics data gets emitted and processed. Sometimes those cut against what we've seen in traditional architectures. You've even advocated for the data warehouse to almost be like a showcase, I think that’s the word you used, rather than the place where all the transformation is happening. Can you talk a little bit more about that?

Chad: Yeah, I can. So early last year, we started an effort to take a very first principles approach to solving complex data quality problems. As I mentioned, data quality is a massive issue at a company where the quality of your data literally determines the success or failure of the entire business. And we were facing some challenges across a variety of our models and business units. So we wanted to approach this truly from a bottom-up perspective and not take anything for granted. And that meant doing an enormous amount of customer research. And when I say customers, I mean the internal data consumers at Convoy, talking to other folks in the infrastructure space, speaking to a lot of startups to see how they were thinking about the world. And what we came to was that there are two main problems that we were running into on data quality. The first was what I've been referring to internally as “upstream data quality.” And the second was “downstream data quality.” Upstream data quality concerns the way our ETL process works today, which is the way I think a lot of ETL processes work. You are extracting data from production databases. And oftentimes that data is just the internal implementation details of a service. The software engineer never really intended or wanted anyone to take production dependencies on that data; they have to be flexible to change their service as they need. But data scientists and ML engineers need something to build on top of, so there's not really a choice. The CDC processes get built by data engineers. For that reason, we pipe the data into the data warehouse, where a lot of transformation happens. And inevitably, when the software engineer needs to change something, stuff breaks down the pipeline. And then this question arises of who owns it. Should the data engineer own it because they built CDC and they own the pipelines? Should a software engineer own it? And that question raises a lot of problems. There’s some animosity built in the organization around that. That's the upstream problem.

The downstream problem is a result of the upstream problem. So if you have all this data flowing into the data warehouse that was never intended for analytics, you know, it's sort of digital exhaust that's coming from the services, then when the data science team needs to create meaningful concepts for their models, or experiments or reports, they have to reverse engineer these concepts in SQL. And sometimes, I would say oftentimes, that's really, really complicated. So it may end up that a data scientist produces a 500-line SQL query with 20 or 30 joins. And they're under the gun to produce this data. So they're trying to do it as quickly as they can. And maybe they don't have time to build good tests, they don't have time to create good monitors, they don't have time to write really good documentation, they don't have time to make sure that their query is actually scalable. And then once they build this query, it becomes fundamentally important in the business. Other people start taking dependencies on it, even if it wasn't built that well. And as the business changes, the person who “owns” it needs to always be on top of those changes to make sure that nothing breaks. And inevitably, that's just not what happens. And these two problems, the upstream problem and the downstream data problem, combine in the data warehouse to create this sort of horrific, horrific mess. And a lot of the modern data stack, especially tools like dbt, are sort of building on top of this mess. And it's making the problem even worse.

Kyle: It almost sounds like traditional software engineers working in a microservice situation where you have a service, somebody else starts building a dependency on your service, and you may not even know that that dependency is there. And like you said, you build this cascading effect where something does break, or something wasn't super well tested. Or even somebody just goes in and they make a change to their dbt job, right? And now that can potentially cascade down and impact the business directly.

Chad: Yeah, exactly. And I think that dbt is a really amazing product. It allows data scientists to move very quickly and answer business problems quickly. But there is a cost to speed. And that's lack of governance, lack of data modeling and data architecture, lack of best practices. You might have a data science team that is increasingly becoming less attuned to the world of data architecture. A lot of hiring now is being done for data organizations based more around knowledge of Python, and statistics and machine learning and those types of things. There’s much less focus around data engineering fundamentals. Then you have people that don't really know how to best produce these queries. They don't really know how to best model the data. And that creates a lot of technical debt in the long term. And no one's really thinking about how to resolve all that debt.

Kyle: You mentioned testing and monitoring and being under the gun, and teams not having enough time to do those things when they're building out their data architecture. Do you feel that those are things that a high performance data team should be doing on a regular basis?

Chad: Yeah, I think it's critical. I think that you absolutely need some form of testing and some form of monitoring. Both in the sense of being able to automatically detect variations in the data as it comes in. But then also more of the business logic testing, where you have to understand what this data is supposed to look like. So for example, maybe some particular ID always needs to be a 13-character string. And you get 15 characters. You need that business logic encapsulated as a test. And you need a data team that's actively thinking about and evolving those types of tests.
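
The business-logic test Chad describes (an ID that must always be a 13-character string) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Convoy's implementation: the field name `carrier_ref` and the helper names `is_valid_id` and `find_invalid_rows` are hypothetical.

```python
EXPECTED_ID_LENGTH = 13  # the length used in the example above


def is_valid_id(value: object) -> bool:
    """Return True only for strings of exactly EXPECTED_ID_LENGTH characters."""
    return isinstance(value, str) and len(value) == EXPECTED_ID_LENGTH


def find_invalid_rows(rows: list[dict], id_field: str = "carrier_ref") -> list[dict]:
    """Collect rows whose ID is missing, not a string, or the wrong length.

    `carrier_ref` is a hypothetical field name chosen for illustration.
    """
    return [row for row in rows if not is_valid_id(row.get(id_field))]


if __name__ == "__main__":
    sample = [
        {"carrier_ref": "ABC1234567890"},    # 13 characters: passes
        {"carrier_ref": "ABC123456789012"},  # 15 characters: fails
        {"other_field": 1},                  # ID missing entirely: fails
    ]
    for row in find_invalid_rows(sample):
        print("invalid row:", row)
```

In practice a check like this would likely run inside whatever testing or monitoring tool the team already uses; the point is that the rule lives in code the data team owns and evolves, rather than in someone's head.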

Kyle: And Chad, besides the data platform work that you're doing today, and we're really talking about these core pipelines, this seems like a pretty evergreen topic within data. It doesn't seem like it's a solved problem, so to speak, that is something that teams don't have to worry about anymore. Why do you feel historically like experimentation and personalization have been so challenging to solve in a general way that works for everybody?

Chad: Well, I think this connects to the conversation we were having, where the fundamental problem is the data. Really what's been the legacy of experimentation up until now is people have taken the path of least resistance. So the early days of third party SaaS experimentation and personalization tools really focused on the equivalent of the CDP: the customer journey, and the front end of the application. Using a single line of JavaScript, you emit a beacon, record some clickstream events, and you do all your analysis around those. But what businesses realized very quickly was that that didn't really get to the heart of a lot of the core business questions that product teams wanted to answer. They wanted to understand, how did this particular feature affect margin? How did it affect volume growth? How did it affect potentially other service-side metrics, like latency and things like that? You just didn't have access to those metrics in these front end tools. So teams had to start thinking more about doing experimentation at the data layer. Being able to create metrics in the data warehouse, to connect those metrics to users, or other entities that were entered into experiments and then perform statistical analyses on them in a single pipeline. And that's just a really, really hard thing to do. And it's especially hard if your data warehouse is a mess. If you don't really trust the data that's coming in in the first place, then it's hard to build metrics. To your earlier point, there is a cascading effect where now you have the same data quality problem that's afflicting your experimentation framework and your personalization framework, as well as your machine learning framework. And I think that that's still kind of an unresolved problem with a lot of these tools. We've gotten better as an industry at making a lot of the steps in implementing experimentation software on top of the data warehouse easier. But the data quality issue is still a big thorn in everybody's side.

Kyle: That makes sense. So you've got to solve this lower level of the pyramid effectively before you can do the higher level work. All right. Chad, what do you see changing in the data landscape more broadly, in the next three to five years? Or do you see progress being made on these really critical architectural and quality related problems?

Chad: I do, there are a lot of interesting initiatives that are either starting to emerge for the first time or are starting to Horseshoe Theory their way back from the 90s on really great data warehouse design, potentially facilitated by new technology. Data mesh is an interesting concept. This is the idea of building teams around data products. The engineers on the team own the data. The data itself is treated as a product. It has an SLA, it has a data product manager. You're standing up a mini team to own and maintain the data domain, and then that data is shared in between teams. That's certainly one route that I've seen some organizations going. I don't know if that's the route that Convoy is going to go, but it is certainly a really interesting option. And it has a lot of cool ramifications and things to think about as an organizational model. The other thing that I think is changing, which can potentially contribute to data mesh, but also exists outside that paradigm, is the idea of data contracts. And the data contract is basically an agreement between the data organization and the software engineering organization on what the shape of the data should look like. Both for semantic entities and objects, like shippers, shipments, carriers, trucks, things like that. And then also for real world events, semantic events, like a carrier got into an accident or a shipper canceled a load. And agreeing to the schema. So what are the properties that were recorded about each entity and event? And then having the software engineer implement that in the service and own the quality, own the SLA. So we're pushing the quality more upstream, instead of having it be downstream where the data scientist doesn't really know what to do with it. I think that is an emerging trend that is going to start gaining a lot of popularity. There's a lot of things that you can do with that model. And there's a lot of ways that you can extend it inside and outside of the data warehouse.

Kyle: You mentioned SLAs and contracts. SLA is something that we're obviously huge fans of at Bigeye. I've used them in the past and in my role on a data team prior to being here. And it's something that's front and center for us with our customers as well. So I am definitely a huge fan of SLAs. Also I'd love to know a little bit more about the contracts concept. Is that different from an SLA?

Chad: It is. It's more than just the SLA. It is also the shape of the data, how it should be modeled, and the schema, basically. And also the metadata around that particular entity or event. So to give you a practical example, the way that a contract might work at Convoy is a data scientist might say: today, we are emitting an event every time we get an inbound phone call from a carrier. But what we're not doing is we're not collecting the shipment ID for every phone call event. And that means that it's really, really difficult to actually figure out if a phone call leading to a resolution or not leading to a resolution impacted any particular shipments, and by extension, impacted our relationship with that shipper. So what should we do instead? We should open up the contract for the inbound phone call event. We should modify that contract by adding in a new property, which is the shipment ID. That should have a democratized collaborative surface for review, where the software engineering team and the data team and potentially a data engineering team can review that and make sure that it makes sense. Then the engineering team can emit that property and then wrap it in an SLA and quality. And then some stage in the middle, ideally, should be able to map that to a data scientist's ideal view of the world. How should this look in the data warehouse and potentially translate that into data warehouse tables without anybody touching it? That's sort of the ideal view of the world that we're working towards. So that the data team doesn't actually have to think about doing a lot of transformation in the data warehouse anymore. They can focus exclusively on defining the data they need. And then turning that data into metrics and features and leveraging it for experiments and reports and those types of things.
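
The contract change Chad walks through (adding a shipment ID to the inbound phone call event) might look something like the following Python sketch. Everything here is an assumption made for illustration: the `Contract` class, its methods, and property names such as `call_id` and `carrier_id` are hypothetical, and the conversation does not describe how Convoy actually stores or enforces its contracts. SLA terms and the review workflow are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contract:
    """A hypothetical, minimal data contract: a named, versioned event schema."""

    name: str
    version: int
    properties: dict  # property name -> required Python type

    def with_property(self, prop: str, prop_type: type) -> "Contract":
        """Return a new contract version that adds one required property."""
        if prop in self.properties:
            raise ValueError(f"{prop!r} is already part of {self.name}")
        return Contract(
            name=self.name,
            version=self.version + 1,
            properties={**self.properties, prop: prop_type},
        )

    def violations(self, event: dict) -> list:
        """List the ways an emitted event fails to match this contract."""
        problems = []
        for prop, prop_type in self.properties.items():
            if prop not in event:
                problems.append(f"missing property: {prop}")
            elif not isinstance(event[prop], prop_type):
                problems.append(f"wrong type for property: {prop}")
        return problems


if __name__ == "__main__":
    # Version 1: the event as emitted today, without a shipment ID.
    phone_call_v1 = Contract(
        name="inbound_phone_call",
        version=1,
        properties={"call_id": str, "carrier_id": str},
    )
    # Version 2: the reviewed change that adds the shipment ID.
    phone_call_v2 = phone_call_v1.with_property("shipment_id", str)

    old_style_event = {"call_id": "c-1", "carrier_id": "k-9"}
    print(phone_call_v2.violations(old_style_event))
```

Keeping each contract version immutable and producing a new version on change is one possible design choice; it gives reviewers a concrete before-and-after to compare, and lets the emitting service be checked against the version it claims to implement.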

Kyle: That sounds like that would free up a ton of data engineering time and open up the door for a lot of automation. That sounds like an awesome concept. I hope we start to see this more broadly. Alright, Chad, so I want to move on to our rapid fire questions. I have three quick questions. Question number one, when you want to read something interesting, where do you go?

Chad: I have to say I'm a big Locally Optimistic fanboy. I love going to that Slack. There's a lot of really, really smart people that post there.

Kyle: Awesome. Question number two, what is one thing people get wrong about data that you wish they did not?

Chad: We move very, very quickly when a company is initially being built. And data is not a first class citizen from the beginning. We create a bunch of services, we do a lot of software engineering, and then we hire a data team to live in the mess and make sense of what exists. And this is a fundamentally non-scalable process that comes back to bite pretty much every single team that goes down this route. It doesn't have to be that way.

Kyle: Awesome. So think about data upfront. I love that idea. Number three, what's something that the Convoy team does differently that you think other data teams might be missing out on?

Chad: I think the data contract piece is certainly a part of it. That's more of a technology. But just from a philosophical perspective, the way that we think about the evolution of data in and out of the data warehouse is through definition upfront. It's thinking about your business. What are all the entities? What are the events? What are the relationships between them? What's the cardinality? Defining that first. And then once that gets defined, it actually makes the development of the data warehouse super, super simple and straightforward, and unlocks a large amount of opportunity for innovation. Brand new services that can be built up, brand new machine learning models that can be built up. But there is a cost to this and it’s time. It requires going back and starting to really think deeply about the use cases that customers have. And I find that teams get so wrapped up in the next feature, and the next release, that they don't think about how valuable it is to have a good data foundation that reflects the real world.

Kyle: All right, Chad, thanks for chatting with me today. If you want to learn more about Convoy and their data team (and Chad is hiring!) click the link in the description below. Chad, thanks for joining us on The Observatory, and see you next time.
