How ByteDance’s Data Lake Powers Near‑Real‑Time E‑Commerce Analytics
This article explains ByteDance’s data lake technology, its Apache Hudi‑based features, near‑real‑time architecture, and practical e‑commerce use cases such as marketing promotion, traffic diagnosis, logistics monitoring, risk governance, and operational monitoring, while outlining future challenges and plans.
Data Lake Technology Features
From a data development and application perspective, data lake technology has two defining characteristics. First, it stores massive volumes of lightly processed raw data at low cost and scales well. Second, it takes a schema‑on‑read approach, so downstream consumers can interpret the data flexibly without a schema being fixed at write time.
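Schema‑on‑read can be illustrated with a small sketch. The event layout and the field names (`user_id`, `amount`, `channel`, `coupon`) are hypothetical and are not taken from ByteDance's tables; the point is only that raw records are stored untouched and each consumer applies its own schema when reading.

```python
import json

# Raw events are stored as-is; nothing validates or reshapes them at write time.
# All field names here are hypothetical.
RAW_EVENTS = [
    '{"user_id": 1, "amount": "19.90", "channel": "live"}',
    '{"user_id": 2, "amount": "5.00"}',
    '{"user_id": 3, "amount": "7.50", "channel": "search", "coupon": "X1"}',
]


def read_with_schema(raw_lines, schema):
    """Project raw records onto `schema` ({field: (caster, default)}) at read time."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        row = {}
        for field, (cast, default) in schema.items():
            # Missing fields fall back to the reader's default instead of failing.
            row[field] = cast(record[field]) if field in record else default
        rows.append(row)
    return rows


# Two consumers read the same raw data through different schemas.
revenue_schema = {"user_id": (int, None), "amount": (float, 0.0)}
channel_schema = {"user_id": (int, None), "channel": (str, "unknown")}

revenue_rows = read_with_schema(RAW_EVENTS, revenue_schema)
channel_rows = read_with_schema(RAW_EVENTS, channel_schema)
```

Neither reader needed the writer to know about it in advance, which is the flexibility the schema‑on‑read model is meant to provide.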
ByteDance Data Lake (Based on Apache Hudi)
ByteDance’s data lake is a commercially‑ready solution deeply customized from Apache Hudi. Key capabilities include:
Bridging real‑time and batch computing, supporting Flink, Spark, Presto, and Hive on various storage systems (HDFS, Amazon S3, GCS, OSS).
Timeline Service for version management, enabling near‑real‑time incremental reads and writes.
Support for Merge‑on‑Read / Copy‑on‑Write table types and Read‑Optimized / Real‑Time query modes, allowing flexible trade‑offs between data visibility latency and query latency.
Rich metadata management, indexing, and row/column storage formats for high‑performance read/write.
Multi‑source stitching to simplify data integration and build data marts.
Support for both upsert (update by primary key) and append (no primary key) writes, so tables can be updated or simply extended as needed.
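The difference between upsert and append writes, and the way a commit timeline enables incremental reads, can be sketched with a toy in‑memory table. This is an illustration only: `ToyLakeTable` and its methods are hypothetical names, and the code does not use or mirror the Apache Hudi API.

```python
class ToyLakeTable:
    """In-memory stand-in for a lake table with a commit timeline (not Hudi's API)."""

    def __init__(self):
        # Each commit is (commit_time, operation, [(key_or_None, record), ...]).
        self.commits = []

    def upsert(self, commit_time, records, key="id"):
        """Write records that replace any earlier record with the same key."""
        self.commits.append((commit_time, "upsert", [(r[key], r) for r in records]))

    def append(self, commit_time, records):
        """Write records without a key; every record becomes a new row."""
        self.commits.append((commit_time, "append", [(None, r) for r in records]))

    def snapshot(self):
        """Latest view: upserts are merged by key, appends are kept in order."""
        keyed, unkeyed = {}, []
        for _, operation, records in self.commits:
            for key, record in records:
                if operation == "upsert":
                    keyed[key] = record
                else:
                    unkeyed.append(record)
        return list(keyed.values()) + unkeyed

    def incremental_read(self, after_commit_time):
        """Return only records written by commits later than `after_commit_time`."""
        return [
            record
            for commit_time, _, records in self.commits
            if commit_time > after_commit_time
            for _, record in records
        ]


# Upsert table: the second commit updates order o1 in place.
orders = ToyLakeTable()
orders.upsert(1, [{"id": "o1", "status": "paid"}, {"id": "o2", "status": "paid"}])
orders.upsert(2, [{"id": "o1", "status": "shipped"}])

# Append table: identical events are both kept because there is no key.
clicks = ToyLakeTable()
clicks.append(1, [{"page": "home"}, {"page": "home"}])

latest_orders = orders.snapshot()
changed_since_1 = orders.incremental_read(after_commit_time=1)
```

A downstream job that remembers the last commit time it processed only has to call `incremental_read` with that value, which is the basic idea behind near‑real‑time incremental consumption.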
Near‑Real‑Time Architecture
Near‑real‑time scenarios fall into two types: analysis‑oriented and operations‑oriented. Both need efficient, low‑cost storage so that teams can stay productive without driving up storage spend. The data lake’s schema‑on‑read model, unified storage, and multi‑source stitching shorten the computation chain and allow batch results to be reused in streaming jobs.
E‑Commerce Data Warehouse Practices
Marketing Promotion
For large‑scale shopping festivals (e.g., 618, Double 11), the platform needs hour‑level cumulative statistics. A near‑real‑time solution streams data into the lake, uses Spark for hourly scheduling, merges real‑time and batch data, and serves results via Presto, achieving low latency and cost‑effective development.
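As a rough sketch of the hour‑level cumulative statistic, the function below merges per‑hour totals from a batch job with raw events that arrived through the stream. The function name, the input shapes, and the rule that batch totals take precedence for hours the batch job has already covered are all assumptions made for illustration; the article does not describe the actual merge logic.

```python
from itertools import accumulate


def hourly_cumulative(batch_hourly, realtime_events, hours):
    """Merge batch totals and streamed events into cumulative per-hour totals.

    batch_hourly:    {hour: total} already computed by a batch job.
    realtime_events: iterable of (hour, amount) landed in the lake by the stream.
    hours:           ordered list of hours to report.
    """
    per_hour = {h: batch_hourly.get(h, 0) for h in hours}
    for hour, amount in realtime_events:
        # Assumption: hours covered by the batch job are authoritative, so
        # streamed events for those hours are skipped to avoid double counting.
        if hour in per_hour and hour not in batch_hourly:
            per_hour[hour] += amount
    running_totals = accumulate(per_hour[h] for h in hours)
    return dict(zip(hours, running_totals))


batch_hourly = {0: 100, 1: 250}
realtime_events = [(1, 999), (2, 40), (2, 60), (3, 25)]
cumulative = hourly_cumulative(batch_hourly, realtime_events, hours=[0, 1, 2, 3])
```

In this example hours 0 and 1 come from the batch result, hours 2 and 3 come from the stream, and the event for hour 1 is ignored because the batch job already counted that hour.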
Traffic Diagnosis
Real‑time monitoring of recommendation traffic uses the data lake to ingest massive event‑level data, append it to non‑indexed lake tables, and perform 15‑minute window aggregations with Presto, supporting both near‑real‑time analysis and offline reuse.
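The 15‑minute window aggregation can be sketched as a tumbling‑window count. In the setup described above this step would be a Presto query over the appended lake table; the Python below only illustrates the bucketing logic, and the event shape (`timestamp_seconds`, `channel`) is a hypothetical simplification.

```python
from collections import Counter

WINDOW_SECONDS = 15 * 60  # 15-minute tumbling windows


def window_start(timestamp_seconds):
    """Round an epoch timestamp down to the start of its 15-minute window."""
    return timestamp_seconds - timestamp_seconds % WINDOW_SECONDS


def aggregate_by_window(events):
    """Count events per (window start, channel); events are (timestamp, channel)."""
    counts = Counter()
    for timestamp_seconds, channel in events:
        counts[(window_start(timestamp_seconds), channel)] += 1
    return dict(counts)


events = [(0, "feed"), (899, "feed"), (900, "feed"), (1000, "search")]
window_counts = aggregate_by_window(events)
```

Because the events were only appended and never updated, the same table can later be scanned in full by an offline job without any extra copy.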
Logistics Monitoring
Logistics monitoring requires linking multiple business systems without a unified key. ByteDance’s data lake multi‑source stitching allows each source to update its fields in intermediate lake tables, which are then stitched together, eliminating costly joins and simplifying stateful computation.
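Multi‑source stitching can be pictured as partial‑column updates into one wide row per entity. The sketch below assumes the intermediate tables have already been keyed by a shared, hypothetical `order_id`; the `stitch` helper and all field names are illustrative and are not ByteDance's implementation.

```python
def stitch(wide_table, source_rows, key, owned_fields):
    """Apply one source's rows to the wide table, touching only the fields it owns.

    wide_table:   {key value: merged row}, updated in place and returned.
    source_rows:  rows from a single upstream system.
    owned_fields: the columns this source is allowed to write.
    """
    for row in source_rows:
        merged = wide_table.setdefault(row[key], {key: row[key]})
        for field in owned_fields:
            if field in row:
                merged[field] = row[field]
    return wide_table


wide = {}

# The order system owns payment fields.
stitch(wide, [{"order_id": "o1", "paid_at": "10:00"}], "order_id", ["paid_at"])

# The warehouse system owns shipping fields; its stray `paid_at` is ignored.
stitch(
    wide,
    [{"order_id": "o1", "shipped_at": "12:00", "paid_at": "bogus"}],
    "order_id",
    ["shipped_at"],
)
```

Each source writes independently and in any order, so no streaming job has to hold join state while it waits for the other side to arrive.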
Risk Governance
Risk governance analyzes sessions, reports, comments, and transactions in near‑real‑time to detect fraud and other black‑ and gray‑market activity. Because the lake keeps lightly processed data and applies schema on read, analysts can query it flexibly and reuse offline dimension tables, meeting both real‑time and batch requirements.
Operations‑Type Scenarios
Data Product Anomaly Monitoring
Monitoring data product anomalies requires visibility into intermediate results. By persisting streaming data into the lake, teams can perform multi‑source comparison, global anomaly detection, and large‑scale entity checks, improving operational efficiency and data quality.
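Once intermediate results are persisted, a multi‑source comparison becomes a straightforward check. The sketch below compares a per‑entity metric from the near‑real‑time path against the same metric from a batch path; the function name, the 1% relative tolerance, and the shop identifiers are assumptions made for illustration.

```python
def compare_sources(realtime, batch, tolerance=0.01):
    """Flag entities whose metric is missing on one side or differs too much.

    realtime, batch: {entity: metric value} from two pipelines.
    tolerance:       allowed relative difference against the batch value.
    """
    anomalies = {}
    for entity in sorted(set(realtime) | set(batch)):
        rt_value, batch_value = realtime.get(entity), batch.get(entity)
        if rt_value is None or batch_value is None:
            anomalies[entity] = "missing"
        elif abs(rt_value - batch_value) > tolerance * max(abs(batch_value), 1):
            anomalies[entity] = "mismatch"
    return anomalies


realtime_metrics = {"shop_a": 100, "shop_b": 200, "shop_c": 50}
batch_metrics = {"shop_a": 100, "shop_b": 150, "shop_d": 10}
anomalies = compare_sources(realtime_metrics, batch_metrics)
```

The same pattern extends to large‑scale entity checks, since both inputs are ordinary tables in the lake rather than transient stream state.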
Real‑Time Message Persistence Detection
By landing message‑queue data in the lake through change data capture (CDC), the full data pipeline becomes observable and testable, which helps developers debug issues and verify data quality.
Future Challenges and Plans
Higher performance to support larger data volumes while reducing both data‑visibility latency and query latency.
Deeper integration with Flink and Spark to provide stronger failover guarantees.
Transition from near‑real‑time analytical applications to near‑real‑time product‑level applications.
ByteDance Data Platform
The ByteDance Data Platform team empowers all ByteDance business lines by lowering data‑application barriers, aiming to build data‑driven intelligent enterprises, enable digital transformation across industries, and create greater social value. Internally it supports most ByteDance units; externally it delivers data‑intelligence products under the Volcano Engine brand to enterprise customers.