A brief history of Databricks
Databricks has been a key innovator in data over the past decade. Here's a rundown of their history and impact on data engineering and ML.
If you’ve been working in or around data for any length of time, you’ve heard of Databricks.
Databricks is based in San Francisco and provides automated Spark cluster management, data science notebooks, and other tools for analytics, data science, and machine learning designed to work at “big data” scale. Some of their best-known direct competitors are Snowflake and Google’s BigQuery. Their commercially available platform has a strong connection to Apache Spark, an open-source data processing engine developed at UC Berkeley's AMPLab.
Databricks was founded by the creators of Apache Spark, and it continues to contribute significantly to the Spark project. The company builds commercial products on its deep understanding of Spark, exposing the engine's capabilities while addressing several challenges associated with using Spark directly.
Over the past decade, a number of trends and factors have helped Databricks rise to prominence:
- Growing demand for data management: The volume of data generated and collected by businesses has increased exponentially in the past decade. This explosion of data has created a pressing need for effective data processing, reliability, and analysis tools.
- The shift to the cloud: There has been a significant shift toward cloud-based solutions, driven by the need for scalability, flexibility, and cost-effectiveness. Databricks' cloud-native platform aligns with this trend, offering easy deployment, scalability, and reduced overhead costs associated with on-premise infrastructure.
- The AI/ML boom: AI and machine learning have become increasingly important for businesses seeking to gain insights from their data. Databricks offers an end-to-end platform that supports machine learning workflows, from data preparation to model training and deployment. This has made it an attractive option for companies looking to implement AI and machine learning at scale.
In this post, we'll look at the emergence of Databricks from the core work the founders did on Spark, the early growth the company achieved, its ongoing contributions to the Spark ecosystem, where it stands today in the market, and where it might be headed.
2009 to 2013: Spark
Databricks exists because of Spark, the data processing framework, so let's start with the creation story of Spark itself.
Spark originated at UC Berkeley's AMPLab, which was a collaborative effort involving students, researchers, and faculty focused on big data analytics. It was first conceived in 2009 as a part of a research project to create a cluster computing framework that could handle big data workloads faster and more efficiently than Hadoop MapReduce, the predominant big data processing framework at the time.
Matei Zaharia, then a Ph.D. student at UC Berkeley, developed Spark to overcome the limitations he observed with MapReduce, particularly its poor performance with iterative algorithms and interactive data mining tasks. Spark introduced several improvements and features that gave it an edge over other big data processing frameworks:
- In-memory processing: Spark introduced the concept of an in-memory computing engine that significantly improved the speed of data processing tasks, especially iterative algorithms, which are common in machine learning and graph processing.
- Ease of use: Spark offers high-level APIs in Java, Scala, Python, and R, and includes built-in modules for SQL, streaming, machine learning, and graph processing, which make it more accessible and versatile for developers and data scientists.
- Fault tolerance: Despite its focus on in-memory processing, Spark maintains the fault-tolerance characteristic of Hadoop, meaning it can recover quickly and completely in case of a failure.
- General purpose computing: Unlike MapReduce, which is strictly batch-oriented, Spark supports batch processing, interactive queries, streaming, and complex analytics such as machine learning and graph algorithms all within the same framework.
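To make the in-memory point concrete, here is a toy sketch in plain Python. It is not Spark code, and names like `CountingLoader` and `iterative_mean` are invented for illustration. It assumes an iterative algorithm that needs a full pass over the data on every step, and compares going back to disk each time with reading the data once and keeping it in memory.

```python
import json
import os
import tempfile


def write_dataset(path, values):
    """Write one JSON number per line, standing in for a dataset on disk."""
    with open(path, "w") as f:
        for value in values:
            f.write(json.dumps(value) + "\n")


class CountingLoader:
    """Hypothetical loader that counts how often the file is read from disk."""

    def __init__(self, path):
        self.path = path
        self.disk_reads = 0

    def load(self):
        self.disk_reads += 1
        with open(self.path) as f:
            return [json.loads(line) for line in f]


def iterative_mean(get_data, steps=20, learning_rate=0.5):
    """Gradient descent toward the mean; every step needs a full pass over the data."""
    estimate = 0.0
    for _ in range(steps):
        data = get_data()
        gradient = sum(estimate - x for x in data) / len(data)
        estimate -= learning_rate * gradient
    return estimate


def run_demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "values.jsonl")
        write_dataset(path, [1.0, 2.0, 3.0, 4.0])

        # Disk-bound style: every iteration re-reads the input from disk.
        reload_each_time = CountingLoader(path)
        mean_reloaded = iterative_mean(reload_each_time.load)

        # In-memory style: read once, then reuse the cached working set.
        cached_loader = CountingLoader(path)
        cached = cached_loader.load()
        mean_cached = iterative_mean(lambda: cached)

    return reload_each_time.disk_reads, cached_loader.disk_reads, mean_reloaded, mean_cached
```

Both runs converge to the same estimate, but the first touches the disk 20 times and the second only once. That gap is the motivation for caching working sets in memory when the same data is scanned repeatedly.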
The Spark research project was open-sourced in 2010 and later moved to the Apache Software Foundation in 2013, becoming Apache Spark. Zaharia later became the CTO of Databricks and continues to contribute to the evolution of Apache Spark.
2013: Databricks' formation
Zaharia co-founded Databricks in 2013 along with Ali Ghodsi, Ion Stoica, Patrick Wendell, Reynold Xin, Scott Shenker, and Andy Konwinski. The founding team already had strong relationships from working together during the development of Apache Spark at AMPLab.
Databricks aimed to fill gaps left by Apache Spark's community-driven model. While the open-source model allowed Spark to evolve rapidly and be widely adopted, there were some inherent limitations:
- Lack of commercial support and services: Many organizations, particularly larger ones, require robust support to fully adopt a new technology. They need assurance that there is professional assistance available to handle any issues. While a community-driven project can provide assistance to some extent, it lacks the dedicated support that many companies need. This gap became a significant factor in the creation of Databricks.
- Inconsistent quality of contributed code: With an open-source project like Spark, the quality of the contributed code will vary. Databricks was created to provide a commercial version of Spark that guaranteed quality and reliability to enterprise customers.
- Deployment and management complexity: Deploying and managing Spark, especially at scale, is complex. The original creators of Spark understood these challenges well, and thus Databricks was established to provide a platform that made Spark deployment and management easier at scale.
Databricks' first significant investment came from a Series A round in 2013. The round was led by Andreessen Horowitz, a prominent VC known for its investments in high-growth tech companies. This initial investment helped the founders get up and running, hire the initial team, and develop their Unified Analytics Platform based on Apache Spark.
They subsequently raised multiple rounds of funding, totalling $3.5B as of September 2023, from various investors including Coatue, CapitalG, NEA, BlackRock, and others:
- Series A: September 2013, $13.9 million, led by Andreessen Horowitz
- Series B: June 2014, $33 million, led by New Enterprise Associates (NEA)
- Series C: 2016, $60 million, led by New Enterprise Associates (NEA)
- Series D: December 2017, $140 million, led by Andreessen Horowitz
- Series E: February 2019, $250 million, led by Andreessen Horowitz, Coatue Management, and Microsoft
- Series F: October 2019, $400 million, led by Andreessen Horowitz, which brought its valuation to $6.2 billion
- Series G: 2021, $1 billion at a $28 billion post-money valuation, led by Franklin Templeton
2013-2017: Early years
After its founding in 2013, the company worked hard to build a commercial product around Apache Spark that was more user-friendly and easier to deploy.
In the initial stages, Databricks faced the typical challenges of any startup: proving their value proposition, attracting customers, and securing further funding. They also faced technical challenges, like improving Spark's stability and functionality and building a user-friendly, cloud-based platform for data processing and analytics.
But due to the increasing popularity of Apache Spark and the founders' deep knowledge of the technology, Databricks gained quick traction. Companies saw the potential in Spark's ability to process large volumes of data quickly, and were eager to leverage this power with the added support and convenience that Databricks offered.
A recap of some major early milestones:
- Funding: Databricks secured significant venture capital funding, including the $14 million Series A round in 2013, the $33 million Series B round in 2014, and a substantial $60 million Series C round in 2016.
- Product Launch: In 2014, Databricks launched its cloud-based platform, Databricks Cloud (now known as the Databricks Unified Analytics Platform). This platform integrated with Apache Spark, simplifying the process of building and deploying Spark applications.
- Partnerships: In 2015, Databricks partnered with Amazon Web Services (AWS), making it easier for companies to run Databricks on the cloud platform they were already using. A similar partnership with Microsoft Azure followed in 2016.
- Customer Adoption: By 2016, Databricks boasted several high-profile customers, including Shell, HP, and Salesforce, highlighting its growing acceptance in the industry.
- Innovation: In 2017, Databricks launched Databricks Delta (now known as Delta Lake), a key technological advancement designed to enhance data reliability and quality in the big data space.
These milestones illustrate the rapid progress Databricks made in its early years. The founders' deep understanding of Apache Spark, along with their vision for solving the pain points of big data processing, allowed them to deliver a powerful and valuable platform for customers.
2019: Delta Lake and the data lakehouse
Delta Lake is an open-source storage layer that brings reliability to data lakes. It was introduced publicly by Databricks in 2019 to address the challenges of data quality and reliability in big data processing. The key features that Delta Lake provides, and that traditional data lakes lack, include ACID transactions, scalable metadata handling, and simultaneous streaming and batch data processing.
Before Delta Lake, the limitations of existing data lakes presented several inherent management problems, like dealing with corrupt and inconsistent data, problems enforcing data privacy regulations, and difficulties handling both batch and real-time data simultaneously.
The idea behind Delta Lake was initially conceived by Dominique Brezinski and Michael Armbrust during the 2017 Spark Summit. Brezinski worked at Apple and needed to process petabytes of daily log data and develop machine learning models that could use streaming data to perform real-time intrusion detection. The scale of the problem required a lake architecture, but the lack of transactional consistency resulted in frequent pipeline failures.
Challenges with the lake + warehouse architecture
Data lakes emerged as a concept in response to the limitations of traditional data storage and management systems, especially with the advent of big data. Traditional systems, such as relational databases and data warehouses, were primarily designed to handle structured data and had certain limitations when it came to handling the volume, variety, and velocity of big data. Data lakes addressed these limitations in several ways:
- Diverse data type handling: Unlike traditional systems that handle structured data well, data lakes store a wide variety of data types, including structured, semi-structured, and unstructured data. Examples include everything from structured tables and CSV files to semi-structured JSON data, and unstructured data like images and text documents.
- Scalability: Data lakes, often built on distributed file systems like Hadoop HDFS or cloud-based storage, can easily scale out to store and process massive volumes of data, unlike traditional databases which could become expensive to scale.
- Cost efficiency: In a data lake, data is stored in its raw, unprocessed format, which eliminates the need for upfront data transformation, thereby saving on processing costs. Additionally, storage costs, especially in the cloud, are usually much cheaper than traditional data warehousing solutions.
- Schema-on-read: Traditional systems often employ a schema-on-write approach, where data needs to be cleaned, transformed, and fit into a schema before it is written into the database. Data lakes, on the other hand, support schema-on-read, allowing data to be stored in its raw format and only transformed when it is ready to be used. This provides enormous flexibility for data scientists and analysts who can now shape and transform the data in ways best suited to their specific use-cases.
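The difference between schema-on-write and schema-on-read can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the sample events and the function names `write_with_schema` and `read_with_schema` are hypothetical, and real warehouses and lakes do this at a very different scale.

```python
import json

# Hypothetical raw events as they might land in a lake: the shapes are inconsistent.
RAW_EVENTS = [
    '{"user_id": "42", "amount": "19.99", "coupon": "SPRING"}',
    '{"user_id": 7, "amount": 5}',
    '{"user_id": "13"}',
]


def write_with_schema(raw_lines):
    """Schema-on-write: coerce each record to a fixed shape before storing it.

    Records that do not fit are rejected, and fields outside the schema are dropped.
    """
    table, rejected = [], []
    for line in raw_lines:
        record = json.loads(line)
        try:
            table.append({
                "user_id": int(record["user_id"]),
                "amount": float(record["amount"]),
            })
        except (KeyError, ValueError, TypeError):
            rejected.append(line)
    return table, rejected


def read_with_schema(raw_lines, fields):
    """Schema-on-read: keep the raw lines as-is and apply a schema at query time.

    `fields` maps a column name to a cast function; missing values become None.
    """
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        rows.append({
            name: cast(record[name]) if name in record else None
            for name, cast in fields.items()
        })
    return rows
```

With the sample data, `write_with_schema(RAW_EVENTS)` stores two rows and rejects the third, and the `coupon` field never reaches the table. Because the raw lines are still around, `read_with_schema(RAW_EVENTS, {"user_id": int, "coupon": str})` can recover the coupon later, which is the flexibility the schema-on-read bullet describes.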
While lakes address the need for handling the scale and diversity of big data in a flexible, cost-effective, and agile manner, they lack the structured organization and schema of a data warehouse, making them less optimized for standard SQL queries and traditional business intelligence tools. As a result, semi-modeled or fully modeled data was often exported into a traditional warehouse for use by analysts.
Data warehouses provide structured, filtered data specifically prepared for analysis. They commonly utilize schema-on-write to predefine the shape of the data, and are typically well optimized for analytical SQL queries that compute summarized results from the data. They ensure data quality and consistency standards but are less flexible than data lakes in handling diverse data types. They also require significant computational resources, making them more expensive to scale.
A "lakehouse" architecture introduced by Databricks aims to combine the best features of data lakes and data warehouses into a single platform. The proposed advantages of this architecture include:
- Fewer pieces: Unlike the traditional architecture where data lakes and data warehouses exist separately, a lakehouse treats both as a unified platform, allowing for data to remain in its raw format until it's needed, and then providing the tools to transform and query the data in a structured manner.
- Diverse data types: Like a data lake, a lakehouse can handle diverse data types, including structured, semi-structured, and unstructured data. But unlike a traditional data lake, it organizes this data in a way that can be easily used by business intelligence tools.
- ACID transactions: The lakehouse architecture supports ACID (Atomicity, Consistency, Isolation, Durability) transactions, a key feature traditionally associated with data warehouses. This guarantees data reliability, particularly for concurrent reads/writes and distributed transactions, something that's typically challenging in a data lake environment.
- Schema enforcement and evolution: Like data warehouses, a lakehouse supports schema enforcement (ensuring data adheres to a predefined schema) and schema evolution (allowing the schema to change over time), ensuring the data is trustworthy and can be easily used for analysis.
- Cost and scalability: A lakehouse, like a data lake, leverages cheap and scalable storage options (e.g., object storage on the cloud) for storing data, making it more cost-effective compared to traditional data warehouses.
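The ACID and schema bullets above can be illustrated with a toy in-memory table. This is a sketch under loud assumptions: `ToyTable` and `SchemaError` are invented for this post, nothing here reflects how Delta Lake is actually implemented, and concurrency is ignored entirely. It only shows the shape of the guarantees: a batch is committed as a whole or not at all, rows must match the declared schema, and the schema can grow without rewriting old data.

```python
class SchemaError(ValueError):
    """Raised when a row does not match the table's declared schema."""


class ToyTable:
    """In-memory stand-in for a lakehouse table: an ordered log of committed batches."""

    def __init__(self, schema):
        self.schema = dict(schema)  # column name -> Python type
        self.defaults = {}          # defaults for columns added after creation
        self.log = []               # one entry per committed batch of rows

    def _validate(self, row):
        if set(row) != set(self.schema):
            raise SchemaError(f"expected columns {sorted(self.schema)}, got {sorted(row)}")
        for column, expected_type in self.schema.items():
            if not isinstance(row[column], expected_type):
                raise SchemaError(f"column {column!r} expects {expected_type.__name__}")

    def append(self, rows):
        """Atomic append: validate every row first, then commit the whole batch at once."""
        batch = [dict(row) for row in rows]
        for row in batch:
            self._validate(row)  # any failure aborts before the log is touched
        self.log.append(batch)
        return len(self.log) - 1  # version number of this commit

    def add_column(self, name, column_type, default):
        """Schema evolution: new rows must supply the column; old rows read the default."""
        if name in self.schema:
            raise SchemaError(f"column {name!r} already exists")
        self.schema[name] = column_type
        self.defaults[name] = default

    def read(self, version=None):
        """Return rows as of a version (latest by default) under the current schema."""
        upto = len(self.log) if version is None else version + 1
        return [
            {column: row.get(column, self.defaults.get(column)) for column in self.schema}
            for batch in self.log[:upto]
            for row in batch
        ]
```

For example, appending a batch in which one row has a string `id` raises `SchemaError` and leaves the table unchanged, so readers never see half of a failed write. After `add_column("country", str, "unknown")`, earlier rows are read back with `"unknown"` filled in.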
In a lakehouse architecture, data teams can perform traditional business intelligence tasks, like reporting and SQL queries, while also supporting machine learning and advanced analytics workflows. This all-in-one approach eliminates the need for additional data pipelines to move data between separate lake and warehouse systems. In 2022, Databricks fully open-sourced Delta Lake.
Databricks' contributions back to Spark
Delta Lake is one of several continued contributions back to the Spark ecosystem.
Active contributions to Apache Spark
Databricks engineers have been some of the top contributors to the Apache Spark project, making several major updates and improvements. They've consistently pushed patches, updates, and new features, ensuring the technology stays current with the industry's changing needs.
Introduction of new modules
Databricks has played a major role in the development and introduction of new Spark modules. For example, the company has been instrumental in the development of MLlib (a machine learning library), Spark Streaming (for processing live data streams), and GraphX (for graph processing).
Spark certification and training
To enhance the skills of professionals working with Apache Spark, Databricks launched the Spark certification program in 2014. This program helped increase Spark's adoption by assuring employers of the skill levels of certified individuals. Databricks also offered training programs for professionals to learn and get up to speed with Apache Spark.
Hosting Spark Summits
Databricks played a key role in organizing Spark Summits, conferences that gathered Spark users and developers from around the globe. These events provided an opportunity to learn about the latest developments, share knowledge, and strengthen the Spark community.
Spark performance improvements
The Databricks team contributed several performance optimizations to Spark over the years, resulting in significant speed improvements. For instance, Project Tungsten, initiated in 2015, was an effort led by Databricks to improve Spark's computational performance.
Databricks' contributions to Apache Spark during the 2013-2019 period were substantial and played a significant role in the evolution of Spark, maintaining its relevance and usefulness in a fast-evolving industry.
Databricks today and market position
Databricks has grown rapidly since 2013. In 2021, it achieved a valuation of $28 billion, and rumors that it is preparing for an IPO have persisted since then.
Exact figures on Databricks' market share can be difficult to obtain, as it competes in several different markets, including big data processing, data science platforms, and AI services. However, the company has a sizable customer base, with thousands of organizations across various industries using Databricks, including tech giants like Microsoft. Some notable customers include AT&T, Walgreens, Toyota, Mars, Sam’s Club, and Adobe. Some of its main competitors include:
- Amazon Web Services (AWS): AWS offers a broad range of cloud computing products and services, including big data analytics and AI services. Though AWS and Databricks have a partnership, they are also competitors, particularly with AWS's own data analytics offerings like Amazon EMR (Elastic MapReduce).
- Google Cloud Platform (GCP): Like AWS, GCP also offers a suite of cloud computing services, including BigQuery for big data analytics and AI Platform for machine learning.
- Microsoft Azure: Azure provides an array of services similar to AWS and GCP. Azure also has a partnership with Databricks, offering Azure Databricks, an Apache Spark-based analytics platform.
- Snowflake: Snowflake's cloud-based data warehousing service is often compared with Databricks' platform. While Snowflake focuses on data warehousing and Databricks on data science and machine learning, both companies compete in the big data analytics market.
- Cloudera: Cloudera provides a platform for data engineering, data warehousing, machine learning, and analytics that runs in the cloud or on premises.
Since its founding in 2013, Databricks has made a huge impact on the data engineering and data science landscape. Databricks' contributions have made big data processing, machine learning, and AI more accessible and efficient for organizations worldwide, underpinning its influence and growth in the data-centric era.