Data ingestion is a cornerstone of sophisticated data architecture. It moves data from diverse sources into one unified system. Problems in the ingestion process, such as a third-party data provider's late updates, can hold up the whole data pipeline.
This post walks through data ingestion, exploring its components, patterns, and the best practices that guide its implementation in modern data architectures.
What is data ingestion?
Data ingestion is the process by which data is imported from diverse sources into a central repository. This repository could be a traditional relational database, a data warehouse, or a streaming platform such as Kafka. The process can be real-time, where data is ingested immediately as it's produced, or batch-based, where data is ingested in chunks at scheduled intervals. Good ingestion practices help ensure long-term data reliability for your organization's data infrastructure at large.
What’s the difference between data ingestion and data integration?
Data ingestion and data integration are related but distinct processes for bringing data into a system. Data ingestion typically involves multiple sources and often includes remapping, because source and target structures usually differ. While it might involve some standardization or transformation, this is generally simpler than what data integration requires. Ingestion pipelines often depend heavily on their data sources, and there's a risk of data corruption if a source changes unexpectedly.
On the other hand, data integration involves combining data from different sources to provide a unified view. It's similar to a database join, but isn't limited to a single database and spans across multiple applications, APIs, files, etc. The data from different sources is combined or joined to produce a single dataset augmented from these sources. Data integration often requires more sophisticated tools and methods due to the need for more complex transformations and the handling of multiple data sources.
What’s the difference between data ingestion and data loading?
Data loading is a subset of data ingestion. While data ingestion is the process of gathering and preparing data from diverse sources, data loading is the act of storing that data in a specific database or system.
The methods of data loading can vary, from full loads where existing data is replaced, to incremental loads where only new data is added, or upserts which update existing records and add new ones. Database-specific tools, such as PostgreSQL's COPY command or MySQL's LOAD DATA INFILE, are commonly used to optimize this process.
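The three loading methods can be illustrated with an upsert against a local database. A minimal sketch using Python's built-in sqlite3 module (the table and rows are made up for illustration; the ON CONFLICT clause requires SQLite 3.24+):

```python
import sqlite3

# In-memory database standing in for the target system (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])

# Upsert: update the row if the key already exists, insert it otherwise.
incoming = [(2, "Grace Hopper"), (3, "Alan")]
conn.executemany(
    """
    INSERT INTO users (id, name) VALUES (?, ?)
    ON CONFLICT(id) DO UPDATE SET name = excluded.name
    """,
    incoming,
)
conn.commit()

rows = conn.execute("SELECT id, name FROM users ORDER BY id").fetchall()
print(rows)  # [(1, 'Ada'), (2, 'Grace Hopper'), (3, 'Alan')]
```

A full load would instead truncate the table and reinsert everything; an incremental load would insert only rows newer than a stored watermark.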
Three types of data ingestion
The right data ingestion approach depends on factors like data volume, processing speed, latency, and specific business requirements. Here are three types of data ingestion that suit different situations at the point where data enters a pipeline.
With batch ingestion, data is collected and processed in predefined batches. Jobs are scheduled to run at specific intervals, which can be adjusted based on data update frequency or other needs. Batch processing is well suited to scenarios where latency is not a primary concern, and it is efficient for processing large volumes of data, especially historical data.
Streaming ingestion continuously ingests and processes data as it becomes available in real time. It is well-suited for applications that require low-latency data insights, making data available for analysis almost instantly after it is generated. Streaming ingestion may require different data retention approaches due to the disparity between the rate of processing and the rate of data generation.
Real-time data pipelines combine the benefits of both batch and streaming ingestion approaches. They provide a hybrid solution that offers low-latency processing for streaming data while also handling data in batches when necessary.
Key components of data ingestion
Frameworks and technologies
Apache Airflow, Apache Kafka, and Apache Spark are widely used open-source technologies for building data ingestion architectures. Here's a rundown of each of them:
Apache Airflow is an open-source platform for orchestrating complex workflows and data pipelines. Its architecture is based on Directed Acyclic Graphs (DAGs), where each node represents a task, and the edges define the task dependencies.
For example, Airflow can be set up to extract daily sales data from a cloud-based CRM system, transform this data with Python scripts or SQL queries, and then load it into a data warehouse like BigQuery or Redshift.
Dynamic workflow definition: Airflow allows you to define workflows programmatically using Python.
Extensive operators: Airflow provides a rich set of pre-built operators to perform various tasks, including extracting data from sources, transforming data, and loading data into different destinations.
Scheduling and backfilling: Airflow enables you to schedule data ingestion workflows at specified intervals.
Parallel execution: Airflow's distributed architecture allows parallel execution of tasks, making it efficient for processing large volumes of data during data ingestion.
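The DAG model at Airflow's core can be illustrated without the framework itself: a scheduler runs each task only after all of its upstream dependencies have finished. A minimal sketch using Python's standard-library graphlib (Python 3.9+; the task names are hypothetical):

```python
from graphlib import TopologicalSorter

# Hypothetical ingestion tasks mapped to their upstream dependencies:
# transform runs after extract, load after transform, and so on --
# mirroring how Airflow resolves a DAG into an execution order.
dag = {
    "transform": {"extract"},
    "load": {"transform"},
    "notify": {"load"},
}

order = list(TopologicalSorter(dag).static_order())
print(order)  # ['extract', 'transform', 'load', 'notify']
```

In a real Airflow DAG, tasks at the same depth (with no dependency between them) would also run in parallel across workers.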
Apache Kafka is a distributed, high-throughput, and fault-tolerant messaging system that is widely used for building real-time data pipelines and streaming applications. Kafka's architecture is based on publish-subscribe messaging, where producers publish data to topics, and consumers subscribe to those topics to consume the data.
Durability and replication: Kafka provides fault tolerance by replicating data across multiple brokers in the cluster.
Consumer and producer decoupling: Kafka allows producers and consumers to work independently, decoupling the data source from the destination, which ensures more flexibility and modularity in the data ingestion pipeline.
Data history retention: By default, Kafka retains data for a configurable period, even after it has been consumed. This feature allows consumers to replay data from any point in time, making it useful for historical analysis or recovering from failures.
Integration: Kafka has a rich ecosystem and provides various connectors like Kafka Connect to facilitate integration with external systems such as databases, storage systems, and data lakes.
In a typical data ingestion scenario using Kafka, changes or updates from a database are continuously captured and streamed into Kafka topics. Leveraging Kafka Connect with source connectors designed for specific databases, real-time data changes are detected and published to Kafka, turning the database updates into a stream of events. This streaming approach not only ensures that data in Kafka remains current and mirrors the database, but it also provides a scalable and reliable mechanism to handle vast amounts of data, making it ideal for real-time analytics and data-driven applications.
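Kafka's offset-based retention and replay can be mimicked with a toy in-memory log (a conceptual sketch only, not the Kafka API): messages stay in the log after being consumed, and each consumer tracks its own offset, so history can be re-read from any point.

```python
class Topic:
    """Toy append-only log mimicking a single Kafka topic partition."""

    def __init__(self):
        self.log = []  # retained messages, addressed by offset

    def produce(self, message):
        self.log.append(message)

    def consume(self, offset):
        """Return all messages from the given offset onward."""
        return self.log[offset:]

topic = Topic()
for event in ["user_created", "order_placed", "order_shipped"]:
    topic.produce(event)

# A consumer that has already read offsets 0-1 picks up at offset 2 ...
print(topic.consume(2))  # ['order_shipped']
# ... while replaying from offset 0 re-reads the full history.
print(topic.consume(0))  # ['user_created', 'order_placed', 'order_shipped']
```

Because consuming does not delete anything, producers and consumers stay fully decoupled, which is the property the feature list above describes.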
Apache Spark is a powerful open-source distributed computing framework that provides an efficient and flexible platform for processing large-scale data. Spark supports both batch and real-time data ingestion, making it suitable for a wide range of data ingestion tasks.
Unified data processing: Apache Spark offers a unified computing engine that can process data from various sources, such as files, databases, streaming systems, and more.
Ease of use: Spark provides user-friendly APIs in different programming languages like Scala, Java, Python, and R, making it accessible to a wide range of developers and data engineers.
Scalability: Spark is designed for horizontal scalability, allowing you to process and ingest large volumes of data by adding more nodes to the cluster.
For example, Apache Spark can ingest financial transaction data in real-time. With Spark Streaming's DStream API (composed of RDDs), you can execute window operations and stateful transformations to analyze transaction patterns. This can be used to detect anomalies and trigger immediate fraud detection, enabling swift responses to potential threats.
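The windowed anomaly detection described above can be sketched in plain Python (the threshold, window size, and transaction amounts are made up for illustration; a real Spark job would express the same logic over DStreams or Structured Streaming):

```python
from collections import deque

def detect_anomalies(amounts, window_size=3, factor=3.0):
    """Flag amounts that exceed `factor` times the average of the
    preceding sliding window -- a crude stand-in for the stateful
    window operations described above."""
    window = deque(maxlen=window_size)
    flagged = []
    for amount in amounts:
        if len(window) == window_size:
            average = sum(window) / window_size
            if amount > factor * average:
                flagged.append(amount)
        window.append(amount)
    return flagged

print(detect_anomalies([100, 120, 110, 900, 105]))  # [900]
```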
Connectors and APIs for data source integration
Connectors are pre-built interfaces or plugins that enable data integration platforms to interact with specific data sources or applications. Connectors abstract the complexities of data source interactions, allowing data engineers and analysts to focus on configuring the integration rather than dealing with low-level details.
Connectors come in different types, such as database connectors (e.g., JDBC for relational databases), cloud service connectors (e.g., Amazon S3, Google Cloud Storage), application connectors (e.g., Salesforce, SAP), and more. Many platforms allow developers to create custom connectors for proprietary or niche data sources not covered by existing connectors.
Kafka is a great example of a connector-based platform with a rich ecosystem of integrations. Kafka Connect offers a collection of pre-built connectors (Source Connectors and Sink Connectors) that enable integration with various systems. Source Connectors pull data from external systems into Kafka, while Sink Connectors write data from Kafka into external storage systems. Built on top of Kafka's producer and consumer APIs, these connectors manage challenges like fault tolerance and scalability without manual intervention.
Popular source connectors include those for databases like MySQL or PostgreSQL, facilitating change data capture (CDC) techniques to ingest database changes into Kafka topics. Sink connectors, such as the ones for Elasticsearch or HDFS, consume data from Kafka topics and push it into downstream systems.
Pipelines and workflow orchestration
Workflow orchestration tools provide a way to manage and automate complex data ingestion workflows. These tools enable the scheduling, coordination, and monitoring of data ingestion tasks, ensuring that data flows efficiently through the pipeline. In addition to Airflow, there are several other tools in this space:
Dagster is an open-source workflow orchestration tool that focuses on data engineering workflows. It is built around the concept of solids, the fundamental units of computation in a data pipeline: each solid takes inputs, performs a specific data processing operation, and produces outputs. Solids can be composed into complex data pipelines with clear data dependencies, making it easier to reason about the data flow. Dagster also offers versioning, data lineage tracking, and built-in support for unit testing, making it suitable for complex, data-intensive workflows.
Luigi is a Python-based workflow orchestration framework developed by Spotify. Users define tasks as Python classes, where each task represents a unit of work. Dependencies between tasks are specified explicitly, creating a directed acyclic graph. Luigi's central scheduler ensures that tasks are executed in the correct order based on their dependencies, optimizing resource utilization and maximizing workflow efficiency.
Prefect is another open-source workflow orchestration tool that offers a Python-based API to define workflows. It supports task-level retries, failure handling, and conditional branching, enabling robust error handling and decision-making within the workflow. Prefect also offers a pluggable execution backend, allowing users to execute workflows on various platforms, including cloud services like AWS and GCP.
Argo Workflows is a Kubernetes-native workflow orchestration tool designed to run data workflows as containerized tasks. Its YAML-based workflow definitions are executed as Kubernetes custom resources, making it easy to deploy and manage workflows within Kubernetes clusters. Argo Workflows supports advanced features like DAG scheduling, cron scheduling, and time-based workflows, providing fine-grained control over workflow execution.
Data validation and error handling
Data validation involves verifying the correctness and integrity of data as it enters the system. As data is ingested from various sources, it is essential to ensure its accuracy, completeness, and integrity before further processing or storage. Failure to do so can lead to data quality issues, downstream errors, and unreliable insights. Several types of validation can be applied, including:
Schema validation: Checking data against predefined schemas to ensure the correct data types, formats, and constraints are met.
Cross-field validation: Validating relationships between fields within the dataset. For example, ensuring that a customer's age matches their date of birth.
Range and boundary validation: Ensuring numeric data falls within acceptable ranges or limits.
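The three validation types above can be expressed as simple checks. A sketch over a hypothetical customer record (the field names, age limit, and fixed reference date are all made up for illustration):

```python
from datetime import date

REFERENCE_DATE = date(2024, 1, 1)  # fixed so the example is deterministic

def validate(record):
    """Return a list of validation errors for a hypothetical customer record."""
    errors = []
    # Schema validation: required field is present and has the right type.
    if not isinstance(record.get("age"), int):
        errors.append("age must be an integer")
    # Range and boundary validation: age must fall within plausible limits.
    elif not 0 <= record["age"] <= 130:
        errors.append("age out of range")
    # Cross-field validation: age must roughly match the date of birth.
    dob = record.get("date_of_birth")
    if isinstance(record.get("age"), int) and isinstance(dob, date):
        expected = (REFERENCE_DATE - dob).days // 365
        if abs(expected - record["age"]) > 1:
            errors.append("age does not match date_of_birth")
    return errors

print(validate({"age": 34, "date_of_birth": date(1990, 1, 1)}))  # []
print(validate({"age": "34"}))  # ['age must be an integer']
```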
Error handling in data ingestion is the process of dealing with data that does not meet validation criteria or encounters issues during the ingestion process. Proper error handling ensures that the system can gracefully handle invalid data without breaking the entire data pipeline. Common error handling practices include:
Using dead-letter queues: This allows moving problematic data to a separate queue for manual inspection, ensuring it does not disrupt the main data flow.
Applying retry mechanisms: In scenarios where the failure is temporary, implementing retry mechanisms can help in successfully ingesting data once the issue is resolved.
Handling of out-of-order data: For streaming data, applying windowing techniques to handle late-arriving data can mitigate processing errors.
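Retries and dead-lettering can be combined in a small ingestion loop. A sketch where the ingestion step and record shapes are hypothetical, and the dead-letter queue is just a list:

```python
def ingest(record):
    """Stand-in for a flaky ingestion step: rejects non-dict records."""
    if not isinstance(record, dict):
        raise ValueError("malformed record")
    return record["id"]

def ingest_batch(records, max_retries=3):
    ingested, dead_letter = [], []
    for record in records:
        for attempt in range(max_retries):
            try:
                ingested.append(ingest(record))
                break
            except ValueError:
                continue  # possibly transient: retry up to max_retries times
        else:
            # All retries exhausted: park the record for manual inspection
            # instead of failing the whole batch.
            dead_letter.append(record)
    return ingested, dead_letter

ok, dlq = ingest_batch([{"id": 1}, "garbage", {"id": 2}])
print(ok, dlq)  # [1, 2] ['garbage']
```

In production the retry loop would typically add exponential backoff, and the dead-letter queue would be a durable topic or table rather than an in-memory list.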
Kafka allows defining and enforcing schemas for the data being ingested, using a Schema Registry. By using Avro or JSON schemas, Kafka ensures that the data conforms to the predefined schema, preventing incompatible or invalid data from entering the system. You can also set up a dead-letter topic to store messages that could not be processed, so engineers can inspect, correct, and reprocess the data later.
Airflow enables users to create custom operators that can perform data validation tasks before ingesting data into the pipeline. These operators can check data quality, schema conformity, or any other specific validation criteria and raise an exception if the validation fails. Airflow also allows setting up retries for failed tasks in a DAG. If a data ingestion task fails, Airflow can automatically retry it for a specified number of times before marking it as failed.
Data ingestion patterns and strategies
When it comes to streaming ingestion and real-time data pipelines, here are some strategies you can apply:
Micro-batching: Use it to ingest data in small, frequent batches, allowing for near real-time processing but with the manageable structure of batches.
Time-windowed aggregation: Use it in scenarios where rolling metrics, like moving averages, are required. Here, incoming data is grouped based on a predefined time window (e.g., every few seconds or minutes) and then processed.
State management: Remembering intermediate results is essential for use cases such as session analysis or continuous aggregates. Offload state management to distributed state stores where possible, or use solutions built into the tooling, such as Kafka Streams state stores.
At-least-once and exactly-once processing: Apply the right type based on your requirements. "At-least-once" ensures that no data is lost, though duplicates might be processed. "Exactly-once" ensures that each piece of data is processed only once, and is more complex to achieve but offers higher data accuracy.
Event time vs. processing time: Consider the difference between the time an event actually occurred (event time) and the time it's processed by your system (processing time). Handling out-of-order data, especially in systems where event time is crucial, can be addressed with techniques like watermarks.
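The first two strategies above can be sketched in plain Python: a generator that groups a stream into small fixed-size micro-batches, and a time-windowed aggregation that sums values per 5-second event-time bucket (the event timestamps, window length, and batch size are illustrative):

```python
from itertools import islice

def micro_batches(stream, size):
    """Group an incoming stream into small fixed-size batches."""
    it = iter(stream)
    while batch := list(islice(it, size)):
        yield batch

# Hypothetical (event_time_seconds, value) readings.
events = [(1, 10), (2, 20), (6, 30), (7, 40), (11, 50)]

# Time-windowed aggregation: sum values per 5-second event-time window.
windows = {}
for ts, value in events:
    windows[ts // 5] = windows.get(ts // 5, 0) + value

print(list(micro_batches(range(7), 3)))  # [[0, 1, 2], [3, 4, 5], [6]]
print(windows)  # {0: 30, 1: 70, 2: 50}
```

Note the aggregation keys on event time, not arrival time; handling late arrivals would additionally require watermarks, as described above.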
Tools and best practices
In addition to the tooling mentioned above, common best practices in optimizing batch ingestion include:
Parallel processing: By dividing the batch data into smaller chunks and processing them concurrently, you can significantly speed up the ingestion process. This approach is particularly effective in distributed computing environments.
Delta ingestion: Instead of ingesting the entire dataset in every batch, only the changes (deltas) from the previous batch are ingested. This reduces the volume of data being processed and enhances efficiency.
Optimized data formats: Using data formats like Parquet or Avro, which are columnar and allow for better compression, can lead to faster ingestion and storage efficiency.
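The parallel-processing practice above can be sketched with Python's standard-library thread pool: split the batch into chunks and ingest them concurrently (the chunk size and the summing "ingestion" step are stand-ins for real work):

```python
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    """Stand-in for ingesting one chunk (e.g. parsing and loading rows)."""
    return sum(chunk)

data = list(range(1, 101))
chunks = [data[i:i + 25] for i in range(0, len(data), 25)]

# Ingest the chunks concurrently; map() returns results in chunk order.
with ThreadPoolExecutor(max_workers=4) as pool:
    totals = list(pool.map(process_chunk, chunks))

print(totals, sum(totals))  # [325, 950, 1575, 2200] 5050
```

For CPU-bound parsing, a ProcessPoolExecutor (or a distributed engine like Spark) would be the more appropriate choice, since threads share one Python interpreter.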
An efficient data ingestion process lays the foundation for all subsequent data operations and analytics. As a baseline, look at your data needs and consider the appropriate ingestion type for that set of needs. It helps to familiarize yourself with relevant frameworks and their respective connectors for data integration. After that, ensure that you have a solid pipeline setup with clear workflow orchestration. When in doubt, prioritize data validation and error handling to maintain data quality.