Near Real-Time Data Lake Practices in TikTok E-commerce: Architecture, Techniques, and Case Studies

This article gives an overview of TikTok e-commerce's near‑real‑time data lake implementation: data lake characteristics, architecture choices, practical use cases in analysis and operations, and future challenges and plans.

Big Data Technology & Architecture

Dec 19, 2022

Introduction The speaker, a big‑data engineer on TikTok e‑commerce's real‑time data warehouse team, shares how the team serves near‑real‑time scenarios with data‑lake technology.

Data Lake Characteristics Data lakes store massive volumes of raw data at low processing and storage cost, and they scale well. Traditional warehouses enforce a schema when data is written (schema‑on‑write); data lakes apply the schema when data is read (schema‑on‑read), so downstream applications can interpret the same raw data in different ways.
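
To make schema‑on‑read concrete, here is a minimal Python sketch. It is an illustration only, not how the team's lake is queried: the raw events, the column names, and the `read_with_schema` helper are all hypothetical. The raw records are stored untouched, and each consumer projects them onto its own schema at read time.

```python
import json

# Raw records are kept exactly as they arrived (hypothetical sample data).
RAW_EVENTS = [
    '{"order_id": "o1", "amount": "30.5", "channel": "live"}',
    '{"order_id": "o2", "amount": "12", "coupon": "X1"}',
]


def read_with_schema(raw_lines, schema):
    """Project each raw record onto `schema` (column name -> cast function).

    Columns missing from a record become None instead of failing the read.
    """
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        rows.append({
            col: cast(record[col]) if col in record else None
            for col, cast in schema.items()
        })
    return rows


# Two consumers read the same raw data with different schemas.
finance_view = read_with_schema(RAW_EVENTS, {"order_id": str, "amount": float})
marketing_view = read_with_schema(
    RAW_EVENTS, {"order_id": str, "channel": str, "coupon": str}
)
```

A schema‑on‑write system would have had to decide up front whether `channel` and `coupon` are columns; here that decision is deferred to each reader.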

ByteDance Data Lake (Apache Hudi) Hudi provides streaming primitives, upsert/delete, indexing, and compaction. It works with Flink, Spark, Presto, and Hive, and with various storage systems (HDFS, S3, GCS, OSS). Hudi tracks data versions through its timeline, which enables near‑real‑time incremental reads and writes. It offers two table types, Merge‑on‑Read and Copy‑on‑Write, and two query modes, Real‑Time and Read‑Optimized.
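
The difference between the two query modes on a Merge‑on‑Read table can be sketched with a toy in‑memory model. This is not Hudi's implementation or API: `ToyMorTable` and its methods are hypothetical names, and real tables keep base files and log files on storage rather than in Python objects. The sketch only shows the idea that writes land in a log, a real‑time (snapshot) read merges the log with the base, and a read‑optimized read sees the base alone.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMorTable:
    """Toy model of a Merge-on-Read table: a compacted base plus a change log."""

    base: dict = field(default_factory=dict)  # record key -> row
    log: list = field(default_factory=list)   # pending (key, row or None) changes

    def upsert(self, key, row):
        # Writes are cheap: they only append to the log.
        self.log.append((key, row))

    def delete(self, key):
        # A delete is recorded as a change whose row is None.
        self.log.append((key, None))

    def read_optimized(self):
        # Reads only the compacted base: fast, but possibly stale.
        return dict(self.base)

    def snapshot(self):
        # Real-time read: replays the log over the base at query time.
        merged = dict(self.base)
        for key, row in self.log:
            if row is None:
                merged.pop(key, None)
            else:
                merged[key] = row
        return merged

    def compact(self):
        # Folds the log into the base (what Hudi calls compaction).
        self.base = self.snapshot()
        self.log = []


table = ToyMorTable()
table.upsert("order-1", {"status": "created"})
table.upsert("order-1", {"status": "paid"})
table.upsert("order-2", {"status": "created"})
table.delete("order-2")
```

Before `compact()` runs, `read_optimized()` returns nothing while `snapshot()` already shows `order-1` as paid. A Copy‑on‑Write table, by contrast, rewrites the base on every write, so both reads would agree at the price of heavier writes.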

ByteDance’s data lake bridges real‑time and batch computing: it integrates with Flink, Spark, and Presto, so the same tables serve both streaming and batch workloads. It features robust metadata management, multi‑source stitching, and upsert/append capabilities.

Near Real‑Time Architecture Two near‑real‑time scenario types are discussed: analysis‑oriented (large‑scale, flexible, tolerant of latency) and operation‑oriented (cost‑effective, low‑latency). Both require high efficiency and low storage cost.

Applicability of Data Lake to Near Real‑Time Three reasons: (1) Reuse of stream and batch results; (2) Unified storage via HDFS, ingesting ODS/DWD layers into the lake; (3) Simplified computation chain through multi‑source stitching, reducing joins.
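
Multi‑source stitching, the third reason above, can be pictured as several writers each filling in their own columns of the same row, keyed by a shared primary key. The Python sketch below is a simplified illustration under that assumption: the `stitch` helper, the stream contents, and the column names are hypothetical, and a real lake table would do this through partial updates on storage rather than on a dict.

```python
def stitch(table, key, partial_row):
    """Merge the non-null columns of `partial_row` into the row stored under `key`."""
    row = table.setdefault(key, {})
    row.update({col: val for col, val in partial_row.items() if val is not None})
    return row


wide_table = {}

# Each source writes only the columns it owns; no join is needed downstream.
stitch(wide_table, "order-1", {"amount": 30.0, "buyer_id": "u9"})        # order stream
stitch(wide_table, "order-1", {"carrier": "SF", "shipped": True})        # logistics stream
stitch(wide_table, "order-1", {"amount": None, "refund_status": "none"})  # after-sales stream
```

Null columns are skipped so that a source which does not own `amount` cannot overwrite it. The reader then queries one wide row per order instead of joining three streams.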

Near Real‑Time Architecture Evolution The architecture avoids pure streaming or batch solutions, instead combining their strengths. Real‑time data is ingested into the lake, hourly Spark jobs merge incremental data with batch data, and results are served via Presto for analysis.
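
The hourly merge step can be sketched as overlaying incremental rows on the batch snapshot. This is a plain‑Python stand‑in for the Spark job, not the team's code: the field names `order_id`, `gmv`, and `update_ts` are hypothetical, and the "latest `update_ts` wins" rule is an assumption about how conflicts are resolved.

```python
def merge_hourly(batch_snapshot, increments):
    """Overlay incremental rows on the batch snapshot.

    `batch_snapshot` maps order_id -> row. For each key, the row with the
    latest `update_ts` wins, so late-arriving older rows are ignored.
    """
    merged = dict(batch_snapshot)
    for row in increments:
        current = merged.get(row["order_id"])
        if current is None or row["update_ts"] >= current["update_ts"]:
            merged[row["order_id"]] = row
    return merged


batch_snapshot = {
    "o1": {"order_id": "o1", "gmv": 100, "update_ts": 1},
    "o2": {"order_id": "o2", "gmv": 50, "update_ts": 1},
}
increments = [
    {"order_id": "o2", "gmv": 0, "update_ts": 5},   # refunded since the batch run
    {"order_id": "o3", "gmv": 70, "update_ts": 6},  # new order
    {"order_id": "o3", "gmv": 80, "update_ts": 4},  # older version arriving late
]

merged = merge_hourly(batch_snapshot, increments)
total_gmv = sum(row["gmv"] for row in merged.values())
```

The batch side contributes completeness and the incremental side contributes freshness; the merged result is what a query engine such as Presto would then read.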

E‑commerce Data Warehouse Practices – Analysis Scenarios

(1) Marketing Promotion – For events like 618 and Double‑11, near‑real‑time analysis is needed. A hybrid solution streams data into the lake, uses hourly Spark jobs to merge T‑1 batch data with incremental data, and serves results via Presto, achieving low latency and low cost.

(2) Traffic Diagnosis – Real‑time monitoring of recommendation traffic uses lake ingestion and 15‑minute window aggregations via Presto, supporting both near‑real‑time and offline needs.

(3) Logistics Monitoring – Multi‑source data is stitched in the lake to avoid heavy joins, enabling near‑real‑time dashboards.

(4) Risk Governance – Near‑real‑time risk analysis relies on the lake’s schema‑on‑read flexibility to combine multiple data domains for fraud detection.
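
The 15‑minute window aggregation in the traffic diagnosis scenario boils down to flooring each event timestamp to its window start and grouping on it. In the talk this runs in Presto; the Python below is only a sketch of the bucketing logic, with hypothetical event data and channel names.

```python
from collections import Counter
from datetime import datetime, timedelta

WINDOW_MINUTES = 15


def window_start(ts):
    """Floor a timestamp to the start of its 15-minute tumbling window."""
    return ts - timedelta(
        minutes=ts.minute % WINDOW_MINUTES,
        seconds=ts.second,
        microseconds=ts.microsecond,
    )


def impressions_per_window(events):
    """Count events per (window start, channel)."""
    counts = Counter()
    for ts, channel in events:
        counts[(window_start(ts), channel)] += 1
    return dict(counts)


events = [
    (datetime(2022, 12, 19, 10, 1, 30), "feed"),
    (datetime(2022, 12, 19, 10, 14, 59), "feed"),
    (datetime(2022, 12, 19, 10, 15, 0), "feed"),
    (datetime(2022, 12, 19, 10, 20, 0), "search"),
]
result = impressions_per_window(events)
```

Because the windows are computed at query time over data already in the lake, the same table can answer both the 15‑minute view and a daily offline rollup.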

Operational Scenarios

(1) Data Product Anomaly Monitoring – Incrementally sync intermediate results to the lake, enabling multi‑source comparison and global anomaly detection.

(2) Real‑time Message Persistence – CDC streams messages from queues into the lake, providing full visibility and quality checks.
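
Once intermediate results from several pipelines sit in the lake, the multi‑source comparison in the anomaly monitoring scenario becomes a simple check across sources. The sketch below is an assumption about what such a check might look like: the source names, the metrics, the 1% relative tolerance, and the `find_mismatches` helper are all hypothetical.

```python
def find_mismatches(metrics_by_source, tolerance=0.01):
    """Return metrics that are missing from a source or disagree across sources.

    Disagreement means the relative spread (max - min) / max|value|
    exceeds `tolerance`.
    """
    all_metrics = set().union(*(m.keys() for m in metrics_by_source.values()))
    mismatches = {}
    for metric in sorted(all_metrics):
        values = {src: m.get(metric) for src, m in metrics_by_source.items()}
        present = [v for v in values.values() if v is not None]
        missing = len(present) < len(values)
        spread = 0.0
        if present:
            denom = max(abs(v) for v in present) or 1.0
            spread = (max(present) - min(present)) / denom
        if missing or spread > tolerance:
            mismatches[metric] = values
    return mismatches


metrics_by_source = {
    "realtime_dashboard": {"gmv": 1000.0, "orders": 40},
    "lake_table": {"gmv": 1004.0, "orders": 45},
    "offline_report": {"gmv": 1001.0},
}
anomalies = find_mismatches(metrics_by_source)
```

Here `gmv` differs by less than 1% and passes, while `orders` is flagged because one source lacks it and the other two disagree.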

Future Challenges and Planning Upcoming goals include higher performance for larger workloads, deeper integration with Flink and Spark for stronger failover, and extending the lake from near‑real‑time analytical scenarios to product‑level applications.

Conclusion The speaker thanks the audience and ends the session.

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.