The Observatory
January 26, 2023

Adi Polak, lakeFS

Adi Polak, VP of devex at Treeverse, and Kyle Kirwan, CEO and co-founder of Bigeye, discuss careers in developer experience, the open-source data community, the biggest "oh damn" moment of Adi's career, and more.


Read on for a lightly edited version of the transcript.

Kyle: Welcome back to The Observatory, everybody. Today we're talking with Adi Polak. She’s the VP of devex at Treeverse, the company behind the free, open-source software called lakeFS. lakeFS enables an atomic, versioned data lake on top of an object store.

For several years before that, she worked on cloud-scale analytics back at Microsoft, and is a software engineer by background. She’s also the author of a book called Scaling Machine Learning with Spark. Welcome to The Observatory!

Adi: Thank you so much, Kyle. I'm super excited to be with you here today.

Kyle: Tell us a little bit about lakeFS. I’ve heard about it and seen a couple of the blog posts from lakeFS. I talk to a lot of folks that are trying to work with data that's in object stores and want some sort of interface to it. Tell us a little bit more about lakeFS and how it's different from some other object store interfaces that people might be familiar with.

Adi: Yeah, of course. So when we think about object stores, we of course talk about the cloud. There may be some MinIO, or something related to HDFS, although that's not exactly an object store. And then usually we hit a couple of problems.

We often hit the problem of cost: the cost is very high, and it grows as we store and process more data. Then we hit a couple of other problems, like reproducibility. Sometimes we need to reprocess some of our logic against a specific version of the data.

We also hit the problem of not being able to revert. Essentially, the data in object stores like S3 or Azure Blob is mutable. Mutable means we can change it, delete it, do whatever we want with it. That becomes a problem when machines or humans make mistakes and delete crucial, critical production data.

lakeFS solves that problem. lakeFS enables us to treat our data lake the same way we treat our code, with Git-like interfaces where we can create our own repositories of datasets. Within a repository, we can have multiple datasets that make up our end result: the data product that we either expose downstream to the next system, or expose to our customers.

And because it uses Git-like concepts, we have the capability to branch out of a mainstream of data, work in a specific branch, introduce some features, do some testing, and then decide what we wanna do with this data.

It also enables us to use best practices from software development, like CI/CD. In data, that means we can branch out of the mainstream of data, run whatever logic, analytics, or processing we want on top of it, and then apply what we call quality gates: checks that must pass before we automatically merge the branch back to the mainstream.
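The branch, validate, merge flow described above can be sketched in a few lines. This is a minimal, in-memory illustration of the quality-gate idea, not the lakeFS API; every function and variable name here is hypothetical.

```python
# Toy model of quality gates: merge a branch into main only if all checks pass.

def non_empty(rows):
    """Gate 1: the branch must actually contain data."""
    return len(rows) > 0

def no_null_values(rows):
    """Gate 2: no row may contain a null value."""
    return all(None not in row for row in rows)

QUALITY_GATES = [non_empty, no_null_values]

def merge_to_main(main, branch):
    """Run every quality gate; merge only if all pass, else leave main untouched."""
    if all(gate(branch) for gate in QUALITY_GATES):
        main.clear()
        main.extend(branch)
        return True
    return False

main = [("user-1", 10), ("user-2", 20)]
good_branch = [("user-1", 10), ("user-2", 20), ("user-3", 30)]
bad_branch = [("user-1", 10), ("user-2", None)]

merge_to_main(main, good_branch)          # all gates pass; main now has 3 rows
merged = merge_to_main(main, bad_branch)  # null-value gate fails; merge is blocked
print(merged, len(main))                  # False 3
```

The point of the pattern is the same as in CI for code: the bad branch never reaches the mainstream, so downstream consumers only ever see data that passed the gates.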

Adi: So essentially it solves a lot of problems for a lot of data lake users. And the fun thing about it, it's open source. Everyone can go and take a look under the hood and see how it's built, how the model is built, all the different capabilities, try it out. We are a big believer in open source and giving back to the community.

Kyle: Very cool. So, if I'm getting that right, as a data engineer or data scientist, the superpower here is that, during ETL development, it gives me the ability to check out a version of the dataset and mess with it, and see it as if it were physical data stored in the lake. I could work with it privately before deciding whether or not I want to actually overwrite the original version when I'm done. You talked about mutability or immutability. Once I'm done and I merge that data back into whatever that main branch is, is that going to overwrite and replace that data?

Adi: Yeah, after you run all your checks, and you're happy with the result, and you want to overwrite it, you can. After you do that, you can also revert back to an older commit. So you can actually travel back in time to a specific commit of the whole data lake. Not only a specific table, but across the different tables or datasets that you keep there.

So yeah, a hundred percent, you nailed it. It's data versioning under the hood: a data versioning engine that gives us, with essentially zero copies, the ability to branch our data. It's not a real copy, it's a shallow copy. We only keep pointers to the actual data, and we use copy-on-write. So if we change some of the files and write them back to storage, that's when the write operation actually takes place. That's what enables this whole mechanism.
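The shallow-copy mechanism Adi describes can be sketched as a pointer table per branch, with copy-on-write when a file changes. This is a toy model of the idea, not the lakeFS implementation; all names in it are hypothetical.

```python
# Toy model of zero-copy branching: a branch is a mapping of logical paths
# to object pointers, so branching copies the table, never the objects.

object_store = {
    "obj-1": b"raw events, day 1",
    "obj-2": b"raw events, day 2",
}

# The "main" branch: logical paths -> object IDs.
main = {"events/day1.parquet": "obj-1", "events/day2.parquet": "obj-2"}

def create_branch(source: dict) -> dict:
    """Branching copies only the pointer table; no data objects are copied."""
    return dict(source)

def write(branch: dict, path: str, data: bytes) -> None:
    """Copy-on-write: store a new object and move this branch's pointer.
    Other branches keep pointing at the old, unchanged object."""
    new_id = f"obj-{len(object_store) + 1}"
    object_store[new_id] = data
    branch[path] = new_id

experiment = create_branch(main)
write(experiment, "events/day1.parquet", b"reprocessed events, day 1")

# main is untouched; only the experiment branch sees the new object.
print(object_store[main["events/day1.parquet"]])        # b'raw events, day 1'
print(object_store[experiment["events/day1.parquet"]])  # b'reprocessed events, day 1'
```

Unchanged files cost nothing extra on the branch, which is why the "copy" is effectively free, and why reverting is just pointing back at an older pointer table.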

Kyle: And is this most heavily used by data engineers? Or data scientists? Or a mix of both. Like who is lakeFS for?

Adi: That's a great question. So we saw maybe two or three dominant personas. And it's very high level. Like when we talk about data engineers, there are multiple personas within the data engineering world as of today. So it's very hard to drill down into the specifics. But we do see a lot of data engineers using it.

Specifically, people that come from the Hadoop world: a lot of Spark users, Airflow users, etc. We also see that machine learning engineers who build the infrastructure for data scientists are leveraging lakeFS. One company spoke about it publicly: Volvo has been a lakeFS user for the last two years, if I remember correctly.

They created a whole wrapper for their machine learning platform, that enables their data scientists to create a notebook and already have the branch inside of it for the specific repository of data that they work with.

So we see a lot of machine learning engineers leveraging that by creating a wrapper around lakeFS to enable data scientists to do their work on a zero copy version of the data, which enables them to collaborate on the same data sets and reproduce experiments.

It’s been an interesting journey. We learned that the FDA now has regulations around machine learning that ask companies to reproduce their machine learning experiments. For that, they need the exact dataset, the exact code, and the exact environment. For the code, we already solved it; for the environment, it's very easy. Data was always tricky, because a lot of data scientists were doing different permutations of the data during their runs, saving it to one place and not to another. It wasn't well organized.

Now, they're using lakeFS to organize all their data in one repository. And also leverage the branching out and the versioning mechanism to cover the whole space of reproducibility. So we see all these different personas are using it with one goal: to work better with their data, be able to recover from issues and be able to troubleshoot if they need to.

Some data engineers are using it to create automation in production that essentially creates a branch for every ETL run, at the time the ETL itself runs. So if it runs every 30 minutes, every hour, or every five minutes, there's an automation that creates the branch, runs the logic, and introduces the quality gate hooks to decide whether to automatically merge it back to the mainstream.

Data engineers created that themselves. It was really beautiful to see people actually adopting it as a tool that spans data teams and helps everyone work together and collaborate.

Kyle: A few years ago there was this great blog post by the data engineering team at Intuit describing a pipeline gating technique that they termed “circuit breakers.” What you just described about quality gates sounds very similar to circuit breakers, where you can block the pipeline from completing or promoting the data to production if it doesn't meet certain criteria.

But this takes that even further because you can actually roll something back. Even if you didn't implement the quality gate, you can basically time travel back to the previous version before that instance of the ETL job ran. Is that correct?

Adi: A hundred percent. That's a really interesting article. I'd love to read it. I always enjoy reading about concepts and best practices that the industry creates. I feel strongly that a lot of fantastic engineers have created different best practices in the data world. Slowly, the industry has finally realized that and is trying to figure out the right way to go. It’s always fascinating to see how different companies name things, and how they double down in different areas according to their requirements and needs.

Kyle: Yeah. We're seeing a ton of crossover right now between concepts that have existed for a while in software engineering and data. I just spoke on another episode of The Observatory with Andrew Jones from GoCardless about data contracts, which have been described as APIs. So this feels like one more point in that direction: data engineering adopting practices from software.

Now, I want to change gears a little bit. You're the VP of devex; that’s “developer experience.” Can you tell us more about that? What is developer experience, and what does it look like at lakeFS? What types of people do you work with?

Adi: So, we are responsible for building the ecosystem around lakeFS, and also working together with our community to understand the different use cases and what they need out of lakeFS. It spans across multiple disciplines.

I have some folks that come from marketing. I have some folks that come from engineering. I have some folks that have been doing devex or “devrel” for years. So it's definitely a combination of personalities and different skills that people bring to the table with the sole goal of enabling engineers with the best practices we wish we had.

Adi: We build the tools; if users need any adapters, we do that as well. Documentation, and tutorials to help people get started and also level up. One of my favorite tutorials that we have is actually “How do you level up your data lake?”

Adi: It starts with the very basic case: I have a CSV file. Okay, that's great. Now maybe I need a table format. Then we look at Delta and Iceberg, and ways to leverage lakeFS with those technologies in order to build my data lake to the level where I have the right architecture, one that will take me through the next 10 years of production, troubleshooting, and everything else I need.

Adi: So we are very much public-facing. We work a lot with the community to give our users what they need and also try and anticipate what the future brings. So we also work collaboratively with other companies to support them and think together about the right path. How can we enable data practitioners with better tools? So it’s a combination of a lot of different worlds and expertise.

Kyle: So that sounds like software engineering, community building, technical writing, marketing. That’s a lot! And you yourself come from a software engineering background. Is that the main entry point? If someone was interested in devex, would software engineering be the natural starting place? Do you see people come into the role from other places? If someone's interested in devex, how would you advise they get started?

Adi: That's a fantastic question. I wish I knew about it a couple of years ago. I think I have a sense, but there are always people smarter and more experienced than I am. I can share what I saw and some of the people I met along the way.

I saw people come from marketing, from theater, from software engineering, and people that came from product. People also came from being a solution architect, so the sales or customer success side of the company. It’s definitely a mix of backgrounds.

I think what brings them all together is that they care about the people and they care about the community that they work with. They want to serve them in a way that gives them the best content, the best experience, and the best tools for them to be successful.

Adi: I spoke with so many folks and each one of them brought a unique approach and unique skills. It’s really interesting. If people want to get into devex, I recommend trying to understand which technology they love, what they're excited for, and which people they wanna work with.

Adi: At the end of the day, you have your team, and that’s great. Having said that, you're probably going to work with a lot of folks in the community, often much more than with the people in the company specifically, because that's an essential part of the day-to-day work.

Adi: Like, where are people? What do they need? How can I serve them better? That’s one area I could think about. And then the second one is, you can go in two different ways. You can go into the advocacy part, which is public-facing, and creating tutorials. Then you need the more technical skills.

Adi: Here, it really depends on which product you're advocating for. If it's a data product, you should probably have some data-related experience in that field. If it's a web development product, then you should probably have some experience in the web space, understanding how things work and how to put together an architecture if you're going for community.

Adi: And then you need some people skills on your plate to be able to work with people and help people get unblocked. You need a holistic approach around product and the developer journey. How people move from one place to the other, and what can help them to take the next step.

Adi: Where do they get stuck? That requires some research as well, looking into the data to understand, because sometimes developers don't like to speak with people. It's like, “Let me just be in my corner. I'll read some Reddit threads, follow the written material, and figure it out.”

So for a lot of folks that do documentation, they need to be able to look into the data to understand where people get stuck so they could build what they need in order to serve them better.

Adi: So it's definitely a combination of skills and different areas that people come from. I think in the data specifically, most of the folks I've seen come from a background in data. So they understand how data works. Either they were a DBA or they were a data engineer, data analyst, or machine learning engineer. Some notion related to the data that brings them in.

Kyle: So it sounds like you can get into devex from a variety of backgrounds. The through-line appears to be a passion for helping people with the technology or subject matter they’re choosing to advocate for. Is that right?

Adi: Yeah, a hundred percent. Either the technology or the main product, and of course, the love for serving people and helping people.

Kyle: That's a great reason. Well Adi, it's time for rapid fire questions. I've got three for you. Are you ready?

Kyle: Number one, you get to speak with one figure from history, living or dead. But you get to have a conversation with that person. Who would it be?

Adi: Jeff Bezos.

Kyle: Okay, interesting. Tell us a little more about that.

Adi: I was always fascinated by the way they built the culture, especially when they started, and how they kept it going throughout the years. And also how they survived. I think they were part of the dot-com bubble back in the day. So how they survived and how they went through it.

Adi: I think it's really interesting to see how a company grows and survives a couple of economic crises, while maintaining a specific culture while growing.

Kyle: Okay. All right. Great. Number two. You could either be fluent in every language or you could be a master of every instrument. Which of these two superpowers would you choose?

Adi: Hmm. Every instrument. I love the piano in the orchestra. When I want to tune out at the end of the day and relax, I listen to opera without any people speaking, without the actual text. Piano is always one of my favorites. And I suck at it. Maybe if I had some more time, I could practice.

Kyle: Okay. All right. Last one. Obviously, being able to version control data helps with the “oopsies” that tend to happen when doing data engineering or working on pipelines. What is your most notable “oops” moment when it comes to data engineering or data pipelines?

Adi: It was a couple of years ago, when I worked on building infrastructure for big data analytics. Back in the day, we called it “big data.”

Adi: It wasn't data engineering, it was big data. I was a software developer, building the infrastructure and also some of the logic and the pipelines. I remember we had issues with our indexing. Our indexing was off: it returned null values, it crashed the system, and we had production issues.

Adi: So I needed to get into the production data. Back in the day, you’d run some query against the data, then download the files and look into them to see what was going on and understand where the problem was. It's a needle-in-a-haystack problem. Once you figure out the timestamp where things went wrong, you’d want to delete the bad records.

Adi: I wanted to copy the right data back. So we had a high availability architecture where we actually had three clusters. All of them were main, but only one was actually the active one. So we had active-passive-passive architecture where all of them ran the same operation calculation.

Adi: I went into production. I was sitting together with the architect and another software engineer. We were looking at the query, because it was essentially a SQL query where we just deleted a couple of things. We used the wrong timestamp, and I deleted almost half of my production data.

Adi: So it was a big “oopsie.” I remember shaking, going into my director’s office, where I told him I accidentally deleted half of our production data. And, you know, that immediately showed up in our BI analytics and in what we exposed to customers. The good thing was that we had the high-availability architecture, so we could direct all the networking and the requests to a different cluster.

Adi: But it was a very weird, scary situation. And I remember he looked at me and he was like, “You'll fix it. You have two weeks.” I was like, okay, two weeks, let's go. So I ran to the devops office, and they were like, “Yeah, of course we're saving some copies of the data. We have some tracking.”

Adi: This is when I realized that everything we had wasn't really working. So all our recovery systems for a specific cluster didn't really work. I had to come up with an idea. Basically I put together a plan with the architect and we solved it at the end of the day.

Adi: But it was a very big “oopsie,” and I learned so much from it about production system failures, recovery from failures, and all the systems we only thought worked. Everything the devops team had built couldn't actually recover the cluster and bring it back to what it was before.

Adi: So essentially, we could troubleshoot the problem knowing we were safe: we had another cluster that ran the exact same computation. Having said that, we paid a very high price maintaining two duplicates of the cluster. Back in those days, it was very common.

Adi: But it was a very big “oopsie.” And I believe today it would be an even bigger one, because a lot of the customers and infrastructure engineers I speak with don't do that anymore. It’s considered super pricey, and no one wants to do it. When people move to the cloud, they don't really keep multiple Hadoop environments that store copies of their data.

Kyle: That’s intense. It sounds like you made it out okay. But yeah, that does sound like a rough couple weeks!

Adi: Yeah. I mean, the person who used to be my director is now a VP of R&D at Akamai. And every once in a while he texts me to check in and say hi. So I guess we are okay. But it was definitely intense, and I learned a lot from that experience.

Adi: It’s okay to make mistakes as long as you can actually sit down and figure out how to recover.

Kyle: I've heard the phrase “You can pedal faster on a bicycle if you have a good brake.” So it sounds like that applied in this scenario. All right well, Adi, it's been great having you on the show.

Kyle: Adi is the author of a book, which we'll link to below if you're interested in checking it out. We'll also link to lakeFS in the description. Check it out and read some of the awesome documentation that Adi’s team produces. It’s been great having you on.

Kyle: Thanks for joining us today.

Adi: Thank you so much, Kyle. It was super lovely to be here with you today, and thank you for the great questions and conversation.

Kyle: All right, we'll see you all next time for another episode of the Observatory.
