Implementing Real-Time Log Ingestion with Delta Lake on EMR: Architecture, Challenges, and Solutions
This article describes how a data engineering team replaced nightly batch ETL with a Delta Lake‑based real‑time log ingestion pipeline on EMR, detailing the motivations, architecture, implementation steps, encountered issues such as data skew and schema evolution, and the practical solutions they applied to achieve low‑latency, reliable data delivery.
1. Background Introduction
Business Scenario
In a traditional offline data warehouse, ETL is the first stage of log processing. Soul's event‑tracking logs are massive and require dynamic partitioning: more than 1,200 daily partitions range from tens of thousands to billions of rows each. The previous flow ingested logs into Kafka, collected them to HDFS with Flume, then ran a day‑level Spark ETL job to write into Hive, taking 2‑3 hours in total.
Problems
1. Day‑level ETL jobs are time‑consuming, delaying downstream outputs.
2. Night‑time jobs consume large cluster resources, competing with other peak workloads.
3. ETL stability is poor; failures must be resolved at night and affect a wide scope of downstream jobs.
2. Why Choose Delta?
To address the growing issues of day‑level ETL, the team aimed to shift from T+1 batch processing to T+0 real‑time log ingestion, ensuring data consistency while making data immediately usable. Earlier attempts with a Lambda architecture (separate offline and real‑time stores) suffered from lack of transactional guarantees, small‑file pressure, and query performance degradation.
The team evaluated modern data‑lake table formats—Delta Lake (open‑source and commercial), Apache Hudi, and Apache Iceberg—each supporting ACID semantics, upserts, schema evolution, and time travel. A brief comparison:
Open‑Source Delta Advantages: supports streaming source reads, Spark 3.0 SQL operations. Disadvantages: tightly coupled to Spark, requires manual compaction, high‑cost join‑based merges.
Hudi Advantages: fast upsert/delete by primary key, offers Copy‑on‑Write and Merge‑on‑Read modes, automatic compaction. Disadvantages: write path bound to Spark/DeltaStreamer, complex APIs.
Iceberg Advantages: pluggable engine support. Disadvantages: still evolving, some features incomplete, high‑cost join‑based merges.
Alibaba Cloud provided an EMR‑optimized Delta version with enhancements such as SparkSQL/Streaming integration, automatic metadata sync to Hive Metastore, auto‑compaction, and performance optimizations (Z‑ordering, data skipping, merge improvements) for engines like Tez, Hive, and Presto.
3. Implementation Process
During testing, several EMR Delta bugs were reported and quickly resolved (e.g., automatic Hive table creation, Tez engine compatibility, data inconsistency between Presto and Tez). After adopting Delta, the real‑time log ingestion architecture is as follows:
Logs are sent from various endpoints to Kafka, then a Spark job writes them to HDFS in Delta format at minute granularity. Hive automatically creates mapping tables for Delta, enabling queries via Hive MR, Tez, Presto, etc.
The team built a generic Spark‑based ETL tool that requires no user code. Key features added:
Hidden partition capability similar to Iceberg's, allowing creation of derived partition columns (e.g., substr(date,1,4) as year).
Regular‑expression validation of dynamic partition values to filter dirty data (e.g., enforcing '\w+' to reject partitions containing non‑word or non‑ASCII characters).
Custom event‑time field selection for proper partitioning.
Configurable nested JSON parsing depth to flatten complex logs.
SQL‑driven dynamic partition configuration to mitigate data skew and improve resource utilization.
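The article does not show the ETL tool's code; as a rough plain‑Python sketch of the configured transformations listed above (hidden partition derivation, regex validation of partition values, and depth‑limited JSON flattening), with all function names hypothetical:

```python
import re

def derive_partitions(record, derived):
    # Hidden-partition style derivation: add columns computed from
    # existing fields, e.g. year = substr(date, 1, 4).
    out = dict(record)
    for name, fn in derived.items():
        out[name] = fn(record)
    return out

def valid_partition(value, pattern=r"\w+"):
    # Reject dirty partition values that don't fully match the pattern.
    # re.ASCII mimics Java-style \w (Python's \w is Unicode by default),
    # so non-ASCII partition values are filtered out.
    return re.fullmatch(pattern, value, flags=re.ASCII) is not None

def flatten(obj, max_depth, prefix=""):
    # Flatten nested JSON up to a configurable depth; structures deeper
    # than max_depth are kept intact under their dotted key.
    out = {}
    for k, v in obj.items():
        key = f"{prefix}.{k}" if prefix else k
        if isinstance(v, dict) and max_depth > 1:
            out.update(flatten(v, max_depth - 1, key))
        else:
            out[key] = v
    return out
```

In the real pipeline these transformations would run inside Spark per record batch; the sketch only illustrates the per‑record logic each configuration option controls.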
The entire workflow is embedded in Soul's data platform, allowing users to request log ingestion, obtain approval, and configure parameters through a UI, simplifying operations and reducing cost.
To combat small‑file proliferation, EMR Delta offers OPTIMIZE and VACUUM commands for file compaction and cleanup, as well as auto‑compaction policies triggered by file‑count thresholds.
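EMR Delta's auto‑compaction internals are not described in detail; a minimal sketch of the file‑count‑threshold policy it implies, assuming a hypothetical run_optimize callback that issues the OPTIMIZE command for a partition:

```python
def maybe_compact(partition_files, threshold, run_optimize):
    # File-count-threshold policy: trigger compaction (OPTIMIZE) on any
    # partition whose file count has reached the threshold.
    compacted = []
    for partition, files in partition_files.items():
        if len(files) >= threshold:
            run_optimize(partition)
            compacted.append(partition)
    return compacted
```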
4. Problems & Solutions
(1) Data Skew from Uneven Dynamic Partition Sizes
Some event types generate massive batches (up to 1 GB per 5‑minute window) while others are tiny, leading to many small files per partition. The solution was to repartition the DataFrame by the dynamic partition columns before writing, producing one file per partition. For oversized partitions, a SQL‑based "salt" technique further splits them, reducing the slowest partition's runtime from 3 minutes to 40 seconds.
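The article does not show the salt expression itself; as a rough plain‑Python sketch of the idea (hot‑partition list and bucket count are assumed configuration, names hypothetical):

```python
import random

def salted_partition_key(record, partition_cols, hot_partitions, buckets):
    # Base key from the dynamic partition columns.
    key = "/".join(f"{c}={record[c]}" for c in partition_cols)
    if key in hot_partitions:
        # Salt oversized partitions so one logical partition is spread
        # across several write tasks instead of one straggler task.
        return f"{key}#salt={random.randrange(buckets)}"
    return key
```

Records sharing a salted key land in the same task, so a hot partition is written by up to `buckets` tasks in parallel while small partitions keep a single file.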
(2) Dynamic Schema Changes at the Application Layer
Delta supports schema evolution, but Spark must know the schema when constructing a DataFrame. The team introduced a metadata layer that detects new fields, updates the metadata, and builds the DataFrame schema dynamically, ensuring new columns are automatically written to Delta and propagated to Hive.
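The metadata layer's implementation is not shown; a minimal sketch of the detect‑and‑merge step it would perform before rebuilding the DataFrame schema (function names hypothetical):

```python
def detect_new_fields(record, known_schema):
    # Compare an incoming flattened record against the metadata layer's
    # known schema; return fields that still need to be registered.
    return {k: type(v).__name__ for k, v in record.items()
            if k not in known_schema}

def evolve_schema(known_schema, new_fields):
    # New columns are appended; existing column types stay untouched.
    # Delta's schema evolution then adds the new columns on write.
    merged = dict(known_schema)
    merged.update(new_fields)
    return merged
```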
(3) Duplicate Data Due to Kafka Offset Commit Timing
With spark‑streaming‑kafka‑0‑10's commitAsync, offsets are committed only at the start of the next batch, so a job restarted mid‑batch may re‑process records. Solutions include switching to Structured Streaming, whose Delta sink supports exactly‑once semantics, or managing offsets externally.
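To illustrate the external‑offset approach, a plain‑Python sketch of a write‑then‑commit loop (OffsetStore stands in for any durable store such as a database table; names hypothetical):

```python
class OffsetStore:
    # Stand-in for an external offset store keyed by topic-partition.
    # Committing only after a batch is durably written to Delta means a
    # restart resumes from the last fully written batch.
    def __init__(self):
        self._offsets = {}

    def load(self, tp, default=0):
        return self._offsets.get(tp, default)

    def commit(self, tp, offset):
        self._offsets[tp] = offset

def process_batch(store, tp, records, write_to_delta):
    start = store.load(tp)
    batch = records[start:]
    write_to_delta(batch)                  # write first...
    store.commit(tp, start + len(batch))   # ...then commit offsets
    return len(batch)
```

If the job dies between the write and the commit, the batch is replayed, so the sink write must still be idempotent or transactional (as Delta's is) for true exactly‑once delivery.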
(4) Metadata Parsing Overhead for Queries
Delta maintains its own transaction‑log metadata; as tables grow, parsing it slows queries. Alibaba Cloud is adding metadata caching to the query path to reduce this cost.
(5) CDC Use Case Considerations
While Delta supports updates and deletes, the team has not applied it to CDC scenarios because merge‑based updates are costly for large, frequently updated tables. Ongoing merge optimizations (partition pruning, Bloom filters) may enable CDC adoption later.
5. Future Plans
1. Further optimize the real‑time data warehouse on Delta Lake to improve latency for key business metrics.
2. Integrate the internal metadata platform to achieve end‑to‑end log ingestion, real‑time storage, and unified metadata/lineage management.
3. Continue enhancing Delta query performance, exploring features like Z‑ordering for faster ad‑hoc analysis.