Engineering
-
June 8, 2023

Navigating the data universe: lakes, warehouses, and lakehouses

Understanding data lakes, data warehouses, and how they can work together may bring substantial benefits to your organization.

Egor Gryaznov

How do you store data from different parts of an organization in a reasonable way that enables company-wide analytics and data observability, even as the amount and types of data continues to grow and change?

30 years ago, all stored data was important, and therefore all data was modeled. However, the process of modeling data is difficult and expensive. Someone needs to talk to business stakeholders and understand how data is used and what it represents, which requires time and effort poured into the company’s important data sets.

Today, that’s not the case. As compute and storage got cheaper, we introduced different paradigms that incentivized storing everything, just in case it might be needed at some point.

Data storage and management has always been crucial, but as the volume of data continues to grow, it has become even more vital for businesses to implement effective strategies. Two such strategies include data lakes and data warehouses. These solutions are often viewed as an either-or choice, however, understanding each and how they can work together may bring substantial benefits to your organization.

What is a data warehouse?

Data warehousing as a concept has been around for a long time. It originated with the need for a single place for storing org-wide data, and typically each company had one and only one data warehouse. Historically, data was required to be modeled and cleaned up before it could be put in the data warehouse - refer to Ralph Kimball’s Data Warehouse Toolkit for a more in-depth overview.

As a system, a data warehouse is ultimately used for reporting and data analytics. Since the data is already modeled, data warehouses are great at running analytics queries across large time-spans of data very quickly. The underlying technology makes assumptions about what kind of data is stored in the warehouse, and thus allows for very efficient business intelligence and analytics type queries.

Data in a warehouse is typically already cleaned, processed, and structured. As mentioned, this modeling process is complex and expensive, and with larger amounts and varieties of data, there became a growing need to simply store everything and let the individual users figure out how to deal with it, rather than pre-processing it.

What is a data lake?

Enter data lakes! Data lakes are large-scale storage repositories that hold raw data in a native format until it's needed. This approach took off with Hadoop back in the early 2010s, when HDFS introduced a scalable way to store arbitrary data and MapReduce was the framework for processing that massive amounts of data.

If a data warehouse is more of a business play, data lakes are more like insurance; they allow organizations to retain all data, structured or unstructured, in one place. With a data lake, the business can postpone any sort of data processing work until the point that it wants to do something with the data. To account for this, data lake technologies prioritize cheap storage and scalable (but typically not very fast) compute. The data is always stored raw for long periods of time, and then processed as needed. With recent advances in storage and processing power, a whole new concept arose to account for the slow compute drawback of data lakes.

Data lakehouse

The “data lakehouse” is a new, hybrid concept developed by Databricks. It blends the best features of data lakes and data warehouses, suggesting that these two don't have to be different technologies (as they have historically been assumed to be).

The idea is to model your data where your lake is, effectively using it as your warehouse as well. This gives you the power to still have a raw data layer in your lake underpinning everyone, but also gives analysts access to a layer of the data lake that is already conceptually modeled like it would be in a warehouse. This is a great approach to consolidate your tech stack, while maintaining the benefits of both a data lake and a data warehouse.

Data lakes vs. data warehouses: an e-commerce use case

To explore both these technologies, let's take the example of an e-commerce website. The business generates various types of data: orders, user data, web logs, analytics, competitor pricing. In a data lake, you'd ingest and store all this data in its raw format, regardless of what format it has or where it came from. This approach allows you to have all your data in one place and to analyze it later.

So let’s say the business needs to understand its sales performance, consisting of order history and their trends. For this, an analyst would have to either find an existing table that fits their needs, or take the raw orders data, analyze it, and possibly create a reusable, well-structured warehouse table. If there are ten analysts that need to do the same sort of analysis, without a modeled table in a data warehouse, there would be not only redundant compute, but also possible drift in terms of how the data is analyzed.

On the other hand, if there is a new product line being started and our business wants to do some analytics that hasn’t been done before, there definitely would not be an existing modeled table that fits their needs. In this case, having all internal and external sources being stored as raw data in a data lake allows the new team to quickly experiment and find out what the product line should be focusing on. Eventually, once they know what their KPIs are, they can then canonicalize the computation into a modeled table and store that in their data warehouse.

Final thoughts

Both data lakes and data warehouses have unique roles and advantages in the data management landscape. While data lakes offer flexible, cost-effective storage for raw data, data warehouses spit out structured, high-quality data that’s ready for easy analysis and reporting. A successful data strategy can involve use both these systems, sometimes even within the same hybrid platform (the “data lakehouse” we talked about earlier). What’s more, data warehouses are being treated more like data lakes as storage gets cheaper. The line is blurring pretty rapidly between the different technologies. At the end of the day, pick the tech that matches your business requirements, rather than picking one because someone says you need it.

share this episode
Resource
Monthly cost ($)
Number of resources
Time (months)
Total cost ($)
Software/Data engineer
$15,000
3
12
$540,000
Data analyst
$12,000
2
6
$144,000
Business analyst
$10,000
1
3
$30,000
Data/product manager
$20,000
2
6
$240,000
Total cost
$954,000
Role
Goals
Common needs
Data engineers
Overall data flow. Data is fresh and operating at full volume. Jobs are always running, so data outages don't impact downstream systems.
Freshness + volume
Monitoring
Schema change detection
Lineage monitoring
Data scientists
Specific datasets in great detail. Looking for outliers, duplication, and other—sometimes subtle—issues that could affect their analysis or machine learning models.
Freshness monitoringCompleteness monitoringDuplicate detectionOutlier detectionDistribution shift detectionDimensional slicing and dicing
Analytics engineers
Rapidly testing the changes they’re making within the data model. Move fast and not break things—without spending hours writing tons of pipeline tests.
Lineage monitoringETL blue/green testing
Business intelligence analysts
The business impact of data. Understand where they should spend their time digging in, and when they have a red herring caused by a data pipeline problem.
Integration with analytics toolsAnomaly detectionCustom business metricsDimensional slicing and dicing
Other stakeholders
Data reliability. Customers and stakeholders don’t want data issues to bog them down, delay deadlines, or provide inaccurate information.
Integration with analytics toolsReporting and insights