Boris Jabes, Census

Boris Jabes, founder and CEO of Census, weighs in on reverse ETL (and its derivatives) and shares his view on what people get wrong about the data field.

Read on for a lightly edited version of the transcript.

Kyle: Welcome back to The Observatory. I'm Kyle Kirwan from Bigeye. And today, we're going to be chatting with Boris Jabes, the founder and CEO of Census, which is a reverse ETL platform. Boris, great to have you on the show. Thanks for being here today.

Boris: I'm delighted to be here.

Kyle: Boris, we obviously go back a little way. So, where do we start? Maybe an easy place to start is… What is reverse ETL? In your own words?

Boris: I always like to joke that it's technically a bit of a misnomer because the concept of ETL is about moving data, extracting it, transforming it, and loading it. As people say, it has no real direction. But reverse ETL is the ability to take data from a data warehouse or database, specifically a modern kind of cloud data warehouse, and publish and federate the data into tools where people do their work.

So traditionally, BI and data teams would present their work in a dashboard, right? And with a product like ours, you can actually take the insights you're building and the metrics you've developed and push them into products like Salesforce or Marketo. And change the kind of work that people do in those tools, so you're directly impacting the sales and marketing and support teams. And that's reverse ETL.

Kyle: So, Boris, how did you realize that this was a need in the market? Did you call it reverse ETL? How did this come about?

Boris: We started Census in 2018. And the way I had experienced it was, at the company I'd worked at prior, our team was in San Francisco. And there were a bunch of sales and marketing people in Boston. And it often felt like we couldn't communicate well. They didn't seem to know what our users were doing. We had a self-serve product, right? So all of our users were self-serve, and we had all this data about them. For us on the product team, it was really easy to understand everything they were doing.

But then the sales or marketing team was sending them kind of generic messages. They didn't seem to know. They would ask my team. And sadly, I interpreted the problem to be cultural. Oh, they're from Boston, and they don't know. They're from another kind of company, and they're this acquisition. And so if you're a salesperson, you wake up in the morning, and you have a lot to do in your tool like Salesforce. Same for marketing and whatever tools that you're using. We really needed to bring the data to them that we had locked away in our product and analytics stack.

So that's where it was born out of. In 2018, we started working with a couple of customers on this. And it was obvious that there was a pattern of companies for whom this was already a problem back then: companies that ship software as their main product and generate a lot of data about their users, ideally, because they have a free tier or a kind of self-service capability. And they were just drowning in information that was not being distilled down to the teams to do better marketing, better sales, better support, etc. That's kind of how we got our first couple of customers.

Our first one was Figma, which is a pretty well-known company. Now, if I want every team and every employee to be able to use whatever app they want, the hindrance to that is data federation, right? If every app is its own little island, you'll always be stuck in this question of “Do I have the most trustworthy data? Do I have the data? How do I go get it?” It's hard to deploy that. And so that was why we called the product Census, because it was the idea of having one count of things and one version of the data that you can somehow get into the hands of every person and every application.

The product would take any query, model, or table in your warehouse and push it into Salesforce. And that was our first version. The easy shorthand our first few users used when they were deploying this was “It's like reverse Fivetran.” Which actually makes perfect sense, right? Because Fivetran was a tool that exclusively brought data down into a warehouse. And we were a tool that pushed data out of a warehouse. And over the next year, as more people discovered Census, it got morphed into reverse ETL.

Kyle: Have you ever heard of reverse ELT? Has that happened yet?

Boris: I've seen someone tweet that once, and I said, wait, that doesn't make sense. So people have joked around with that, but no, it has not taken off. I'm not even sure ELT has taken off as much as we had hoped, as a colloquial term, right? I think, in our industry, we know. But I think ETL still remains that kind of fairly generic word.

Kyle: I couldn't agree more. Egor, my co-founder at Bigeye, and I used to go for coffee pretty regularly when we worked on the data team at Uber. And I still remember him telling me, “Hey, you know that what we do here is not actually ETL, right?” And I was like, “What do you mean? It's not ETL? That's what it is.” And he said technically, we're loading the data in, and then all the transforms are happening inside the warehouse. And that was mind-blowing to me. I'd only ever heard it called ETL. I knew what we did, and I understood that the transformations were happening in the warehouse. But it had not occurred to me at that point that it could be done a different way...that the transforms could happen in flight.

Boris: If we want to get a tiny bit nerdy, I wonder if the better way to talk about Census is TEL. Because the transition from ETL to ELT in the data ingestion space is this idea that the transformation logic should not live in the wires, it should live in the hub. And the reason we could do that transition is not just because great software tools emerged to do it. It's that it became cost-effective and technologically possible to do all your transformations in a warehouse because we have this infinite storage and scale. But it used to not be the case.

So you needed to kind of transform in transit. Otherwise, you would just overwhelm your systems, right? Census is the same premise. When it comes to data integration, people have been doing that for decades, even outside the data world. People have been connecting app to app for a very long time. And my point of view was, you should not put much logic in the wires or in the connector. It should be mostly in the hubs. So you could argue that Census is really TEL, right? It's ELT that brings in the data, and then that T is shared. And this is where you transform your data and clean your data. You create models that are useful for the business and are not just the raw data. And then extract that data out of the warehouse and load it into the destination application.
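The "TEL" ordering Boris describes can be sketched in a few lines. This is only an illustration under stated assumptions, not how Census is implemented: an in-memory SQLite database stands in for the warehouse, and the table `raw_events`, the view `account_usage`, and the `load` callback are hypothetical names made up for the example.

```python
import sqlite3
from typing import Callable, Dict


def transform_extract_load(conn: sqlite3.Connection,
                           load: Callable[[Dict], None]) -> int:
    """Run T, then E, then L, keeping all modeling logic in the hub."""
    # T: the transformation lives in the warehouse as SQL, not in the connector.
    conn.execute("""
        CREATE VIEW IF NOT EXISTS account_usage AS
        SELECT account_id, COUNT(*) AS event_count
        FROM raw_events
        GROUP BY account_id
    """)
    # E: extract the already-modeled rows out of the warehouse.
    rows = conn.execute(
        "SELECT account_id, event_count FROM account_usage ORDER BY account_id"
    ).fetchall()
    # L: hand each record to the destination; the "wire" adds no logic.
    for account_id, event_count in rows:
        load({"account_id": account_id, "event_count": event_count})
    return len(rows)


if __name__ == "__main__":
    warehouse = sqlite3.connect(":memory:")
    warehouse.execute("CREATE TABLE raw_events (account_id TEXT, event TEXT)")
    warehouse.executemany(
        "INSERT INTO raw_events VALUES (?, ?)",
        [("acme", "login"), ("acme", "export"), ("globex", "login")],
    )
    transform_extract_load(warehouse, print)
```

The design point is that the connector only moves rows; anything that changes their meaning is expressed as a model inside the warehouse, where it can be shared with every other consumer.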

Kyle: I want to change the focus a little bit here. Very often, the beneficiary of reverse ETL is somebody who's trying to consume the data and do something with it in their line of work, right? How do you think reverse ETL changes things for the people that are in data engineering or who are on the data platform team? They're not necessarily the ones that are receiving the data after it gets loaded into Salesforce. They're sort of on the beginning end of that journey. How has this changed things for them?

Boris: I think everything rolls downhill, right? So if I think of traditional analytics, you still have a data refining process, right? You're loading data, and you're cleaning it. You're modeling it, and then you're presenting it in a chart in your BI tool. And potentially, someone is building that chart themselves.

Then in that workflow, your data engineering team is still affected by random charts that people make in one of two ways, right? One is the data's wrong because something broke along the way. Whether that's a cleanup process, a modeling mistake we've made, or the data is stale or whatever. Or pressure on the system has increased. Even in traditional BI, your data engineering team is in the path. So suddenly, there's a report that is just dramatically more expensive to compute. The warehouse is falling down. You're causing other queries to fall down. And you know, by democratizing the data, you've actually changed your workload. And you've got to figure out how to diagnose that and fix it.

So when you think about what happens when you use Census, you're pushing data all the way through to a system where people are taking action or operating the business. It’s what we like to call operational analytics, as opposed to traditional analytics. The effect that has on data engineers is, first, it increases the number of consumers of data. There are only so many humans who look at a dashboard. With data being pushed all the way into an operational system, there are more people who are dependent on it. So just the general pressure to get the data right goes up, and that hits everyone downstream. So the analytics people have to be better, the analytics engineers have to be better, and the data engineers have to be better at making sure data is in good shape. That's the first big difference.

Two, there's a subtle thing where you might discover that when you push data all the way through, when your output is not a picture, but it's actually the data points themselves, there are certain things you have to think about now in terms of the data types. How you're storing time may not matter if your output is a picture. But if your data doesn't have time zones, and you're pushing it all the way down to a system, like a marketing tool, it's going to cause problems because they will make wrong interpretations from that data. So it actually pushes you to be more precise in your data.

And then finally, everyone's favorite thing, the latency requirements increase over time. Not immediately. A lot of people sync data using Census on a daily basis, so it doesn't actually create new pressure on the data engineering team. Once a day, we push some data into our sales tools. But the more people who have access to this tool, and the more they think in terms of operationalizing their analytics, rather than just visualizing their analytics, the more people want lower latency. I think inevitably, data organizations are going to be pushed to go from the data is correct, on a 24-hour basis, down to 12, six, one, and less.
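Boris's point about timestamps without time zones can be made concrete. The sketch below assumes the warehouse stores naive timestamps and that the team knows the UTC offset they were recorded in; the function name `to_utc_iso` and the fixed-offset parameter are hypothetical, and a real pipeline would more likely use a named zone so that daylight saving time is handled.

```python
from datetime import datetime, timedelta, timezone


def to_utc_iso(naive: datetime, utc_offset_hours: float) -> str:
    """Attach the offset a naive timestamp was recorded in, then convert to UTC."""
    if naive.tzinfo is not None:
        raise ValueError("expected a naive timestamp")
    source = timezone(timedelta(hours=utc_offset_hours))
    # replace() labels the wall-clock time; astimezone() does the conversion.
    return naive.replace(tzinfo=source).astimezone(timezone.utc).isoformat()


if __name__ == "__main__":
    signup = datetime(2023, 1, 15, 9, 0)  # 9:00 a.m., recorded at UTC-8
    print(to_utc_iso(signup, -8))         # 2023-01-15T17:00:00+00:00
```

On a chart, 9:00 and 17:00 land on the same day and nobody notices. In a marketing tool that schedules sends from this field, the eight-hour difference changes when a message goes out.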

Kyle: I mean, I think a lot of people were probably hoping to hear that it makes life easier. It almost sounds like it's going the other way.

Boris: So I would say two things. One, it means your data team is more in demand. And so you either have to work harder or smarter, but you're going to have to do something because more people depend on the data. Which I think is the goal of any person in a company: to be more crucial, more leveraged. So I think of that as a good kind of pressure, but yes, it increases the work. The thing that it solves for you is that you don't have to figure out how to move data into Salesforce.

Kyle: If you're in data in the first place, it's because you probably see how important it is or how valuable it could be. You want to do a great job in your role so that your organization can put that data to work. And I think that that's why a lot of people got into working in data engineering in the first place.

Boris: Yeah, yeah. And, you know, there are three personas that are in this workflow. There are data engineers and infrastructure folks in the data world. Then there are the analysts, analytics engineers, etc. And then there are the operators who are receiving the data and trying to automate the business using that data in the Census kind of flow. And they had a realization that didn't hit them right away. It really threw me for a loop at first, and then I realized this is actually a good thing. They expressed fear, like legitimate fear, which is not at first what you want to engender through your product. I didn't get into the business of making software to scare people.

But the fear came from realizing the amount of power they had and the effect that they could have on downstream systems. So in this particular case, their job had shifted, thanks to our product, from providing data and dashboards to taking over the entire data that lives in their email marketing tool. So the entire email marketing tool was now driven one-to-one from queries in their warehouse. And that was super cool, right? The data got cleaned, it was great. And then this realization hit, where if they make one mistake in their query now, the effect is a million—not exaggerating—bad emails go out. And so it hit them, and this realization was that I'm playing with live ammo in a way that I've never had before.

But the downside before was that it was a bad presentation. Like in the board meeting. It's its own kind of bad, but it's not the same. And so at first, I was like, well, I didn't want my product to scare you. But in reality, this is good. It means you're taking on more responsibility, and that's a good thing. And now our job, and a lot of what we built over the last couple of years, is to try to provide as many guardrails as possible to help you not screw up. But you know, you and Egor love to talk about this idea. Anyone who's never screwed up is not doing anything interesting with data.

Kyle: Yeah, so you touched on something important there, which is the changes in your data models, in your pipelines, and your transformations. That is where the company is getting value from, but those changes also incur risk. Every time you move a piece, you potentially break something. So I wanted to unpack something slightly before we move on to the three rapid-fire questions.

You touched on being in the line of fire. And you mentioned the board meeting. So the board meeting made me think that I used to help prepare data prior to some of these big offline presentations. On a Friday morning, for example. And so you would have a few hours to sit down, review the data, and make sure everything looks good. Really thoroughly vet it before your hypothetical board meeting.

But in a case enabled by Census, that is literally taken away, right? That job is going to run in the middle of the night, those million emails are going to get sent out. So it's not just the impact of the use case. It's the fact that it's not real-time. It is automated. It's recurring. It's not sitting there. It doesn't have a human gate anymore. And I think those two pieces are both kind of happening simultaneously, but they are distinct problems that both raise the stakes for data teams.

Boris: So much of what's happening in our field has parallels in the last 50 years of software development. And the art of it is to extract what is useful and bring it in, but applications used to be tested fairly manually. And then, we started to build automated testing for software. And without it, we wouldn't have scaled to where we are today. And if anything, now, the pendulum has swung the other way, in software, where, you know, we've realized you can test it like crazy in an automated fashion.

But that doesn't really ensure that this user interaction is still good. So there's still a need for humans in the loop. We haven't really taken that away. But yeah, I think for data teams that deploy Census, you have to start investing in automated ways to catch catastrophic failures. And it's not a guarantee, of course. But that's the work I see our customers doing. And we do what we can to catch the things that our system can find for you.
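As a rough illustration of the kind of automated check a data team might run before a sync goes out, here is a minimal sketch. It is not a description of Census's guardrails; the function `check_sync`, its parameters, and the 50% threshold are all hypothetical choices for the example.

```python
from typing import Dict, List, Sequence


def check_sync(rows: Sequence[Dict],
               previous_count: int,
               required_fields: Sequence[str],
               max_change: float = 0.5) -> List[str]:
    """Return a list of problems; an empty list means the sync may proceed."""
    problems = []
    # A large swing in row count often signals a broken query or join.
    if previous_count > 0:
        change = abs(len(rows) - previous_count) / previous_count
        if change > max_change:
            problems.append(f"row count changed by {change:.0%}")
    # Fields the destination acts on must not be empty.
    for field in required_fields:
        missing = sum(1 for row in rows if row.get(field) in (None, ""))
        if missing:
            problems.append(f"{missing} rows missing {field}")
    return problems


if __name__ == "__main__":
    batch = [{"email": "a@example.com"}, {"email": None}]
    print(check_sync(batch, previous_count=2, required_fields=["email"]))
```

A check like this replaces the human who used to review the numbers on Friday morning. It will not catch every bad query, which matches Boris's caveat that it is not a guarantee, but it can stop the catastrophic cases before a million emails go out.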

Kyle: Yeah, it's always nice to have those things, as long as you can make the tool do what you need at the end of the day. I think everybody likes having a guardrail. So Boris, I really love talking in-depth with you about these things. We could go on for a while. But I do have three rapid-fire questions for you that I think should be fun. So number one, when you want to read something interesting, where do you go?

Boris: I'll give my out-of-left-field answer. I spend all day, most days reading technical websites, the same websites that most people in software go to. I think that's not an interesting answer. The place I go to for unusual reading is a website called Arts & Letters Daily—which is a kind of aggregator of interesting arts and literature articles. And so it's a total breath of fresh air from the news streams that I'd say we probably both consume.

Kyle: I was hoping you were not going to take the easy way out and say: Hacker News. Okay, number two, what is one thing, and this is not specific to reverse ETL. But let's go a little broader to data in general, what do you think is one thing that people get wrong about the data field that you really wish that they did not?

Boris: I think the biggest frustration, the misrepresentation that annoys me, is the idea that data teams are their own domain, rather than something that affects, represents, and should crosscut the entire business. Data to me is fundamental. And the data team should have a company-wide perspective. And I feel like they get shunted off into product analytics. Stuff like that really bugs me. And that's why I've always been jealous of the kinds of companies that you've worked at where it's clear that the data organization is for the whole company. And it tracks to the whole company.

Kyle: I think another good example of this may be recruiting, right? Every single person that works at “pick your company” had to get there somehow. So that's a good point. Question number three, and this one is unique to Census. What is the hardest thing about making Census work that you're able to share?

Boris: The naive answer is that there's no limit, no end in sight, for how bad APIs can be and how much you have to deal with that in our platform. So the thing that we want to completely abstract is actually unbelievably difficult. I'll give you my favorite example, so people can kind of get the sense of it.

You often think about APIs as there's a rate limit, and you just make sure you hit the right rate limit. And the goal of Census is to be as fast as possible. But actually, there's so much, let's call it "behavior," that is not specified. But you have to uncover it by experience.

And an example would be that there is an application, one of our integrations, where you have to stitch the data. So we batch the data. If you have a million rows, and the system can only receive 10,000 rows at a time, we'll package 10,000 rows, and ship them across. Ship the next 10,000 rows, and so on and so forth. And we can do that in parallel, obviously. And we uncovered that one of these applications could take only so many rows, but also only so many columns. And, you know, this is a product that could have, let's say, 1,000 columns on these rows. And so you have to stitch the data in two dimensions. We have to sync, let's say, the 10,000 rows with 500 columns, and then the next 10,000 rows with the next 500 columns. Because otherwise, the system falls down. And it's like, these are not specified in the system, right? This is not part of the API surface, you just have to kind of discover these failure modes.

So that's still the bread and butter of Census to build these kinds of connectors in a way that's completely seamless. And I would say that's still probably where a lot of the most painful problems occur, but they’re also the ones that we're most used to. So it's one of those where we've learned to pattern match really, really well. And we've built a lot of tooling to paper over those things as much as possible. And so I think over time, that will no longer be the hardest problem at Census, but it's still the fundamental piece. It's like, hey, you have no idea how hard it is to build these connectors sometimes.
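The two-dimensional batching Boris describes can be sketched as follows. This is one possible reading, not Census's implementation: each block of rows is sent once per block of columns, and every batch carries the record key so the destination can match the pieces back up. The function `tile_batches`, the `key` parameter, and the assumption that the column limit applies to non-key columns are all hypothetical.

```python
from typing import Dict, Iterator, List, Sequence


def tile_batches(rows: Sequence[Dict], key: str,
                 max_rows: int, max_cols: int) -> Iterator[List[Dict]]:
    """Yield batches of at most max_rows rows and max_cols non-key columns."""
    if not rows:
        return
    # Assumes every row has the same columns as the first one.
    columns = [c for c in rows[0] if c != key]
    for r in range(0, len(rows), max_rows):
        row_chunk = rows[r:r + max_rows]
        # max(..., 1) still emits one key-only batch when there are no other columns.
        for c in range(0, max(len(columns), 1), max_cols):
            col_chunk = columns[c:c + max_cols]
            yield [
                {key: row[key], **{col: row[col] for col in col_chunk}}
                for row in row_chunk
            ]


if __name__ == "__main__":
    data = [{"id": i, "a": i, "b": i * 2, "c": i * 3} for i in range(5)]
    for batch in tile_batches(data, key="id", max_rows=2, max_cols=2):
        print(batch)
```

With the numbers from the conversation (a million rows, 1,000 columns, limits of 10,000 rows and 500 columns), this scheme would produce 100 row blocks times 2 column blocks, or 200 requests, each of which can be sent independently.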

Kyle: That sounds like some pretty hard-fought knowledge. Well, Boris, thanks for chatting with me today.

If you want to learn more about Census, the link is going to be down below in the description. I think we'll also add a link to Arts & Letters. I think that plenty of people want to check that one out as well.

Boris, thanks for joining me at The Observatory. Thanks, everyone, for watching, and we'll see you again on the next one.
