Read on for a lightly edited version of the transcript.
Kyle: Hey, everyone. Welcome back to another episode of The Observatory. I'm your host Kyle. And today we're speaking with Andrew Jones from GoCardless.
GoCardless is a tech unicorn founded in 2011. They're in the financial and payments space. They make it easy to collect recurring and one-off payments from clients via bank debit schemes like ACH debit and others. If you've heard of Stripe, they're a little bit like that.
And today we're speaking with Andrew Jones. He's the tech lead for their data pillar, focusing on data infrastructure and machine learning enablement. Andrew has become a little bit famous recently for coining the term “data contract,” so we'll leave a link in the description below where you can read his blog post introducing that concept. He's been at GoCardless since 2017, and before that he spent most of his career working at a semiconductor manufacturer, which is pretty cool. Andrew, thanks for being on The Observatory.
Andrew: Cool. Thanks for having me.
Kyle: So let's just dive right into some questions that I have for you. The first is about the data contracts that your blog post was about. For anyone who hasn't read the post, maybe you could explain what that is, and when teams should be thinking about data contracts?
Andrew: A data contract is our approach to trying to improve the quality and reliability of data at GoCardless. It's really about trying to interface between how data is generated and how it is consumed.
The easiest way to think about a data contract is that it’s essentially like an API for data. A lot of it is inspired by APIs between services. An API tells you how to interact with a service, and we can apply the same principles to data. The same way an API is a contract between services, a data contract is a contract between data generation and data consumption. That’s where the name came from.
Kyle: This trend of battle-tested software engineering practices moving into the data engineering space — this feels like a key part of that overall macro trend. Is that literally where this came from? Were you looking at API specs and saying, “Hey, why doesn't this apply to data generation and processing?” Where did the idea start to come into focus for you?
Andrew: So a couple years ago I spent a lot of time speaking to people in BI and data science. I’d often hear about how a schema change had broken their models or their reporting, and they didn’t find out until something broke.
I thought, “what would you do if you were in the engineering space?” Which is where I’d been building my career until more recently.
What I would do is create an API between them and say, “This is the contract I’m giving you as the consumer, and you can use it knowing the properties around the data.”
We wanted to be a lot more deliberate about the data we produce: data designed for consumption, not generated as a side effect of some upstream service and its database.
Kyle: So you mentioned talking about designing the contract for the consumer. That puts some responsibility on the data generator, right? In the data team I was part of a couple of years ago, sometimes we felt like we were handed data from the data generator, and then expected to do something with it.
This seems to flip that paradigm and say, “Okay, no, you need to treat me like a customer almost, or a consumer.” Anticipate what commitments you will make to me. You mentioned schema. Are there other things that would be contained in these contracts that would help create guarantees for me as a consumer about what I should expect?
Andrew: Yeah, it is about that. It contains a schema, types, structure, basic things like that.
From the start we’ve added things like categorization of the data. And in the future, it could contain things like SLOs, maybe a quality level, that kind of thing. We’re not there yet, so it's quite early days. But all of these properties of the data, we expect to be described in the data contract.
Kyle: So this sounds like a fairly flexible spec that you can extend over time to continue to allow the contract to contain additional guarantees.
Andrew: Yeah, exactly.
Kyle: One of the things that comes to mind for me is, a lot of people are probably nodding their heads at this right now, because obviously this is super useful and needed in a lot of situations.
How is this actually expressed physically? How does one of these contracts come into existence? Are we talking about a YAML file? What is a data contract in a physical sense?
Andrew: We use something called Jsonnet, and we’ve defined our data contracts in that. The idea was, given enough information in the data contract, we can spin up the kind of resources needed to manage that data correctly.
So there might be some bespoke services built in house to manage GDPR deletions, or to do backups or things like that. It could be spun up into cloud resources, like in our case, a BigQuery table and a Pub/Sub topic.
It just needs to describe the data and describe the team’s needs. Do they need a BigQuery table? Do they need backups, and how long do they want them kept? All of that can be found in the data contract.
And behind the scenes, we spin up resources required to manage that data correctly and to make use of that data correctly.
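To make that flow concrete, here's a minimal sketch in Python: a contract document describing the data and the team's needs, and a function that derives the resources to provision from it. All field names and resource names here are hypothetical illustrations of the idea, not GoCardless's actual contract format.

```python
# Hypothetical data contract: a structured document that tooling can read
# to provision the infrastructure the data needs. Field names are invented
# for illustration only.
EXAMPLE_CONTRACT = {
    "name": "payments.payment_created",
    "version": 2,
    "owner": "payments-team",
    "schema": {
        "payment_id": {"type": "string", "category": "internal"},
        "amount_cents": {"type": "integer", "category": "financial"},
        "payer_email": {"type": "string", "category": "pii"},  # drives GDPR handling
    },
    "storage": {"bigquery_table": True, "pubsub_topic": True},
    "backups": {"enabled": True, "retention_days": 90},
}

def resources_to_provision(contract: dict) -> list[str]:
    """Derive the infrastructure implied by a contract."""
    resources = []
    if contract.get("storage", {}).get("bigquery_table"):
        resources.append("bigquery_table")
    if contract.get("storage", {}).get("pubsub_topic"):
        resources.append("pubsub_topic")
    if contract.get("backups", {}).get("enabled"):
        resources.append(f"backup_policy_{contract['backups']['retention_days']}d")
    # Any PII field means the data must be wired into deletion tooling.
    if any(f["category"] == "pii" for f in contract["schema"].values()):
        resources.append("gdpr_deletion_service")
    return resources
```

The point is that the contract is the single source of truth: the same document that tells a consumer what the data looks like also tells the platform what to build around it.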
Kyle: Maybe I'm going out of bounds here. But the word that comes to mind to describe this is a declarative data pipeline. Is that close to the idea?
Andrew: Yeah. It’s all deployed close to the people generating this data, and they maintain it: deciding who has permissions to it, how it evolves over time, when to remove data. It’s not owned by a centralized data team anymore.
So the company is more decentralized. We’ve moved the autonomy towards the teams. In future, we could use this to build out pipelines with things like validations, data tests, and data observability.
As long as we have one place where we define metadata and properties of the data, there’s no reason why we can’t use that to say something like “Integrate directly into Looker.” We’re still defining data, getting it into BigQuery, getting ownership and schema evolution worked out. But we could really build this tooling out in the future.
Kyle: It sounds like you could co-locate all these different contracts into one location and that this is a fairly un-opinionated store of these contracts. And then, I can imagine a ton of things would be able to read off of that store. And you could do a lot with your infrastructure from there.
Andrew: Yeah, exactly. Another thing we found: other engineering teams wanted to put data in BigQuery, or use Pub/Sub and things like that, but they weren’t really using the tooling we’d built in the data infrastructure team. They were rebuilding their own.
We found that people were quite opinionated about what they built and what made sense for them. And I didn’t necessarily want them to have to go through us when they wanted permissions for a BigQuery table or to do some backfilling. They should be able to manage it themselves. It's their data.
So we’ve been trying to make it a lot more decentralized. Give them that autonomy. And that fed into the design of data contracts as well.
The good thing about using BigQuery or similar services is that, although it’s decentralized, it isn’t isolated: you can still query different BigQuery tables from wherever you are. So we’re not creating silos of data. It’s all made available, and it can still be queried together.
So although it’s promoting decentralization, we’re hopefully not going to see an increase in silos being made, and we’re still putting it all together as needed.
Kyle: It sounds like part of what makes this super strong is that anybody can read from that repo of contracts, so people don't have to ask for permission. You don't get blocked; you can go out and build on top of that contract repo. But do we end up in a Wild West then, with pipelines going everywhere, everything super siloed, and nobody knowing what’s going on? It sounds like by co-locating everything in BigQuery, you have the flexibility to spin up whatever you want based on the existing set of contracts, but there's still some degree of centralization, so you don't end up with a total mess either.
Andrew: Yeah, that's what we hope. Like I said, it’s early days for us, so obviously we’ll see how this pans out. And there will be other companies doing similar initiatives. They might spend a lot of time doing data modeling, trying to make sure they’re not repeating data from different places and that there’s a good sense of what properties there are around particular entities - for us it’s payers and merchants, and for other companies it might be orders or other entities like that.
But where we are at the moment, we’re just trying to get everyone onto data contracts and decommission our CDC pipelines. And from there we can evolve our data. And if we make mistakes along the way, that’s okay, we can just evolve the data, evolve the contracts, and learn from our mistakes. But at least we're doing it from a place where our data has got ownership, it's got metadata, it's got schema, it's got version control, which is a lot better than what we had before. So it's already a step in the right direction.
Kyle: I had a conversation last year with Colin Zima, formerly of Looker. He was talking about the spectrum that teams tend to fall on: from hyper-flexible but maybe less structured, less standardized, all the way to the other end of the spectrum, very controlled, with very slow evolution, but very reliable, very stable, where everybody knows the state of things. And teams need to move along that spectrum depending on what's going on in the business or what's going on in the team. So it sounds like you've located where you want to be on that spectrum and then built these interfaces to really support that. That's super cool.
Maybe we can talk a little bit about GoCardless, cause we just spent a bunch of time on data contracts. And if you want to read more about them, we have a link in the description down below.
So Andrew - GoCardless is one of the big FinTech unicorns in London. And obviously, the company has grown a lot and you've been there for some of that. How has that sort of growth in the business changed the way that you're working with data?
Andrew: Good question. I think it really means that we want to do more with the data we've got, and do more important things with data as well. Now that we’re quite successful, we’ve got quite a lot of data, and particular markets are more mature, we now want to leverage that data to build a more defensible product.
So that might mean adding new products alongside our core offering. And these are things we can maybe build on top of the data we’ve got, so data is now being used as part of our product; part of the thing we are selling to our customers. That increases the requirements on reliability of data and quality of data.
Like we can’t have schema changes breaking all our models, all our products, that’s just not acceptable anymore, now that we’re making money from it. It might have been acceptable when we were using it for analytics only, and if reports were broken for a couple of days, it wasn’t business-critical. But as we start using our data for more business-critical activities, that’s one of the drivers for data contracts.
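One way a contract helps with exactly this is by making schema changes checkable before they ship. Here's a hedged sketch of the idea: compare a proposed schema against the one the contract currently promises, and flag backwards-incompatible changes. The field names and the simple `{field: type}` representation are illustrative, not GoCardless's actual tooling.

```python
# Hypothetical compatibility check a data contract could enable: compare a
# proposed schema against the contract's current one and flag changes that
# would break downstream consumers.

def breaking_changes(current: dict, proposed: dict) -> list[str]:
    """Return human-readable descriptions of backwards-incompatible changes."""
    problems = []
    for field, ftype in current.items():
        if field not in proposed:
            problems.append(f"field removed: {field}")
        elif proposed[field] != ftype:
            problems.append(f"type changed: {field} ({ftype} -> {proposed[field]})")
    # New fields in `proposed` are additive, so they're allowed.
    return problems

# Illustrative schemas: renaming nothing, adding `currency`, changing `amount`.
current = {"payment_id": "string", "amount": "integer"}
proposed = {"payment_id": "string", "amount": "float", "currency": "string"}
```

Run on the example above, the check would flag the `amount` type change while letting the new `currency` field through — so a breaking change becomes a failed check at change time, not a broken dashboard days later.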
Kyle: So obviously machine learning is the key topic here. A lot of what you described is about, how do we leverage ML? And obviously these are some fairly high risk scenarios, right? Like there are literal dollars on the line.
FinTech is obviously a very unique space in this regard. What is the state of machine learning generally, that you've seen within FinTech? And where do you think it's going specifically within that vertical industry?
Andrew: So in FinTech, you have a responsibility for people's data. We've always had to make sure that we're taking the right care of it.
But more generally, in future, I think ML is moving to more standard, off-the-shelf ML models that you can deploy. And that’s allowing us and other organizations to apply models more easily, maybe to areas of business where we hadn't previously. So maybe to sales pipelines, and things like that.
I think we're going to start seeing a trend of democratizing ML: making it easier for people who aren't data scientists, like BI analysts, to move a bit into the ML space and maybe start deploying simple models.
Obviously, there's still space for custom models. Our fraud model, for example, can’t be a standard model that we deploy off-the-shelf. Hopefully we’ll see more standard tooling evolve over time, because it’s still quite custom company to company: what the ML platform looks like, what kind of tools they're using.
So it feels quite immature compared to the data industry, where we've got quite mature data warehouses and quite mature tools like dbt and Airflow. Hopefully, over time, we’ll see similarly standardized ML platforms evolve.
Kyle: Over the last decade or so, as I think I've described before, data science has moved on from people who came from, let's say, econometrics or physics, literally hand-crafting a pipeline where the model is the output of that pipeline. Like super, super custom stuff.
And now, we've made a ton of progress as an industry towards, like you said, more standardized tooling, more off-the-shelf, it's a little easier. And I think I hadn't really thought about that. But the transferability of skills, if someone moves team to team or company to company, that's actually huge. Because you don't have to go and completely relearn this ultra custom setup that the new team has. You don't have to recreate your tools from scratch. You don't have to build your own hammer, so to speak. That's super exciting.
Andrew: Yeah. That’s where I’d like to see the ML platform industry moving. I think it is moving in that direction slowly, and we'll see more of it over time.
Kyle: So obviously, you're operating in the financial space at GoCardless. I know that a lot of people are thinking about the national and regional data regulations that we've started to see in various places. So how do these local or national-level regulatory environments affect the work that you do?
Andrew: Yeah, so we’re based in the UK, but we operate in a number of markets, like the EU, the US, and others as well. Generally, like most organizations, we have to comply with the restrictions of those markets.
I think with regulation in general, the days of collecting and storing huge amounts of data indefinitely are gone. Data privacy laws have gotten stricter, and tech in general is getting more and more regulated. So we need to be conscious of that when building our tech platforms; they need to move away from the idea that we can just store all the data forever.
I started thinking about building in these kinds of privacy and security considerations from the start. It’s always a lot harder to go back and apply data categories later. Particularly as you make big bets on your product being around for the foreseeable future, you want to build from the start working with security teams and compliance teams, understanding that privacy and data laws are only going one way: stricter.
For example, one way they’re getting stricter is around data localization. Countries in Asia, particularly India and China, already have quite strong data localization rules. We don’t serve those markets at the moment, but the EU and other places are looking at these kinds of laws as well, and at how to prevent their citizens’ data from moving across borders.
And that can be quite a big change.
We’re not really building data platforms that way, or any sort of software that way, for the most part. So we try to think ahead: this is probably coming, it’s likely to come in the future. We can anticipate it and be ready for it, so we don’t have a massive rush and loads of work before it gets implemented.
Kyle: And for those who aren't familiar with data localization, could you maybe explain it in practical terms? Are we talking about: if I'm located in this country, I know that my data is being written to disk and processed by rackmount servers in a data center that's physically located in my country? Is that what we mean by localization? Or could you tell us a bit more about that?
Andrew: Yeah, it depends on the country, the market, and the rules they're applying. For example, in Australia it only applies to health data at the moment. You can’t move health data around; you have to keep it within Australian borders, which means processing it on servers located physically in Australia. In Spain, it applies to things like electoral data. It’s not necessarily all data.
But in stricter places, like China, India, and Russia, they don’t want any of their citizens’ data leaving their borders. So data has to be located on servers physically within those borders.
So that means, in some cases, say the data is held in China and I’m working remotely from the UK, I shouldn’t have access to that data anymore. I shouldn’t be able to see it in a dashboard or a database somewhere.
So yeah, I’m not an expert on data localization, but it’s something I’m starting to look at more, because I’m trying to anticipate trends in regulation and understand how they're going to affect the data platform we’re building now. Like what’s it going to look like in five years if this regulation continues to get stricter, as more and more countries, and bodies like the EU, look at this and start applying the same sort of rules to their citizens and organizations?
Kyle: Yeah, and I know this is a topic that's maybe on everyone's mind to some degree. California has made some moves in the direction of privacy restrictions. So this is maybe an early topic for a lot of folks, but I think it's something that's on the horizon for practically anybody working in data right now.
Kyle: All right Andrew, so we’ve got three rapid fire questions for you today. Are you ready?
Andrew: Yeah, let’s go!
Kyle: All right, number one. So obviously, we're just talking a ton about the FinTech space and about finance. So, do you have a favorite currency? What is it?
Andrew: I guess my own - the British pound. The original.
Kyle: Okay, all right. Fair enough. Rapid Fire question number two. Would you rather communicate only using emojis or never be able to text at all ever again?
Andrew: Emojis. I do use them quite a lot.
Kyle: You think you're gonna hold a conversation just with emojis?
Andrew: Yeah, I think I’d be fine. I don’t need text.
Kyle: All right. And the last one; maybe this one's a little bit more of a brain tickler. Would you describe technology overall as a net positive or a net negative?
Andrew: Definitely net positive. One of the reasons why I joined GoCardless was, I like the way that a company like GoCardless allowed other businesses to start, and allowed other businesses to collect money from their customers. So that could be anyone from a window-cleaner collecting direct debits or regular payments every week, to someone starting the next big thing that everyone’s talking about.
So I think technology is a great leveler for people starting businesses, from wherever you are geographically located, whatever kind of background you’ve got, it’s a great leveler, so I think it's a great thing overall.
Kyle: I like that: data and technology as an equalizer and a leveler. Super cool. All right. Well, everyone, thanks for listening to this episode of The Observatory. Today we’ve had the pleasure of speaking with Andrew Jones from GoCardless. We'll leave some links down below for some additional reading material. Andrew, thanks for being on the show today.
Andrew: Cool, thank you for having me.