How ByteDance’s Data Lake Powers Near‑Real‑Time E‑Commerce Analytics

This article explains ByteDance’s data lake technology, its Apache Hudi‑based features, near‑real‑time architecture, and practical e‑commerce use cases such as marketing promotion, traffic diagnosis, logistics monitoring, risk governance, and operational monitoring, while outlining future challenges and plans.

ByteDance Data Platform

Data Lake Technology Features

From a data development and application perspective, data lake technology has two defining characteristics. It stores massive volumes of minimally processed raw data at low cost and scales well, and it adopts a schema‑on‑read approach, so downstream consumers can use the data flexibly without schemas being defined up front.

ByteDance Data Lake (Based on Apache Hudi)

ByteDance’s data lake is a commercially‑ready solution deeply customized from Apache Hudi. Key capabilities include:

Bridging real‑time and batch computing, supporting Flink, Spark, Presto, and Hive on various storage systems (HDFS, Amazon S3, GCS, OSS).

Timeline Service for version management, enabling near‑real‑time incremental reads and writes.

Support for Merge‑on‑Read / Copy‑on‑Write table types and Read‑Optimized / Real‑Time query modes, allowing flexible trade‑offs between data visibility latency and query latency.

Rich metadata management, indexing, and row/column storage formats for high‑performance read/write.

Multi‑source stitching to simplify data integration and build data marts.

Support for both upsert (update by primary key) and append (insert without a primary key) operations, so each table can be written in the way its workload requires.
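The trade‑off between the table types and query modes listed above can be illustrated with a small in‑memory model. This is only a sketch: `ToyMorTable` and its methods are hypothetical names invented for illustration, not the Hudi API or ByteDance's implementation, and the "timeline" here is just a list of commits.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMorTable:
    """Toy in-memory stand-in for a Merge-on-Read table (not the Hudi API)."""
    base: dict = field(default_factory=dict)  # compacted state: key -> row
    log: list = field(default_factory=list)   # timeline: (commit_ts, key, fields)
    compacted: int = 0                        # log entries already folded into base

    def upsert(self, commit_ts, key, fields):
        # A write only appends to the log, so it is cheap and quickly visible.
        self.log.append((commit_ts, key, fields))

    def read_optimized(self):
        # Reads the compacted base only: fast, but blind to uncompacted writes.
        return dict(self.base)

    def snapshot(self):
        # Merges pending log entries over the base at query time: fresh, but costlier.
        merged = {key: dict(row) for key, row in self.base.items()}
        for _, key, fields in self.log[self.compacted:]:
            merged.setdefault(key, {}).update(fields)
        return merged

    def incremental(self, since_ts):
        # Returns only the changes committed after since_ts.
        return [entry for entry in self.log if entry[0] > since_ts]

    def compact(self):
        # Folds pending log entries into the base. Copy-on-Write tables pay
        # this merge cost on every write instead of deferring it.
        self.base = self.snapshot()
        self.compacted = len(self.log)


table = ToyMorTable()
table.upsert(1, "order-1", {"status": "paid", "amount": 30})
table.compact()
table.upsert(2, "order-1", {"status": "shipped"})

print(table.read_optimized())  # still shows status "paid"
print(table.snapshot())        # shows status "shipped", amount kept
print(table.incremental(1))    # only the commit made at time 2
```

In this model the read‑optimized view lags until the next compaction, while the snapshot (real‑time) view pays a merge cost on every query. That is the visibility‑versus‑query‑latency trade‑off in miniature.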

Near‑Real‑Time Architecture

Near‑real‑time scenarios fall into two types: analysis‑oriented and operations‑oriented. Both need efficient, low‑cost storage so that development stays productive and storage spend stays low. The data lake’s schema‑on‑read model, unified storage, and multi‑source stitching shorten the computation chain and let streaming jobs reuse batch results.

E‑Commerce Data Warehouse Practices

Marketing Promotion

For large‑scale shopping festivals (e.g., 618, Double 11), the platform needs hour‑level cumulative statistics. The near‑real‑time solution streams data into the lake, runs hourly Spark jobs that merge the real‑time data with batch data, and serves the results through Presto, keeping both latency and development cost low.
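The hour‑level cumulative statistic described above can be sketched in a few lines. This is a simplified illustration, not the actual Spark job: the function name `hourly_cumulative`, the `(hour, amount)` event shape, and the idea of a single `batch_baseline` total carried over from batch results are all assumptions.

```python
from collections import defaultdict


def hourly_cumulative(events, batch_baseline=0):
    """Return {hour: running total} for (hour, amount) events.

    batch_baseline is an assumed total already computed by a batch job;
    the hourly increments from the lake are added on top of it.
    """
    per_hour = defaultdict(int)
    for hour, amount in events:
        per_hour[hour] += amount

    running = batch_baseline
    cumulative = {}
    for hour in sorted(per_hour):
        running += per_hour[hour]
        cumulative[hour] = running
    return cumulative


# Events may arrive out of order; they are grouped by hour before summing.
events = [(0, 100), (1, 50), (0, 25), (2, 10)]
print(hourly_cumulative(events, batch_baseline=1000))
```

Each scheduled run only needs the increments for the hours it covers plus the previous running total, which is what makes reuse of batch results attractive here.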

Traffic Diagnosis

Real‑time monitoring of recommendation traffic uses the data lake to ingest massive event‑level data, append it to non‑indexed lake tables, and perform 15‑minute window aggregations with Presto, supporting both near‑real‑time analysis and offline reuse.
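A 15‑minute window aggregation amounts to bucketing each event by the start of its window and counting per bucket. The sketch below shows that logic in plain Python rather than Presto SQL; the `(timestamp, channel)` event shape and the function names are hypothetical.

```python
from collections import Counter

WINDOW_SECONDS = 15 * 60


def window_start(ts):
    """Round an epoch timestamp (seconds) down to its 15-minute window."""
    return ts - ts % WINDOW_SECONDS


def aggregate_by_window(events):
    """Count events per (window start, channel) pair."""
    counts = Counter()
    for ts, channel in events:
        counts[(window_start(ts), channel)] += 1
    return dict(counts)


# Timestamps are seconds since the epoch; 900 starts the second window.
events = [(0, "feed"), (899, "feed"), (900, "feed"), (950, "search")]
print(aggregate_by_window(events))
```

Because the raw events are only appended, the same table can later be re‑aggregated offline with a different window size or grouping key.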

Logistics Monitoring

Logistics monitoring requires linking multiple business systems without a unified key. ByteDance’s data lake multi‑source stitching allows each source to update its fields in intermediate lake tables, which are then stitched together, eliminating costly joins and simplifying stateful computation.
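The stitching idea can be pictured as column‑level partial updates: each source writes only the fields it owns, and a wide row emerges without a join. The following is a minimal sketch under the assumption that every source can resolve the key of the intermediate table; `stitch`, `wide_rows`, and the field names are made up for illustration.

```python
def stitch(table, key, fields):
    """Write only the columns one source owns into the row for `key`."""
    row = table.setdefault(key, {})
    row.update(fields)
    return row


def wide_rows(table, required):
    """Return the rows in which every required column has been filled in."""
    return {key: row for key, row in table.items() if required.issubset(row)}


table = {}
# The order system contributes payment fields.
stitch(table, "order-1", {"paid_at": "2023-06-18 10:00"})
stitch(table, "order-2", {"paid_at": "2023-06-18 10:05"})
# The warehouse system contributes shipment fields for the same key.
stitch(table, "order-1", {"warehouse": "WH-7", "shipped_at": "2023-06-18 15:30"})

print(wide_rows(table, {"paid_at", "shipped_at"}))
```

No source has to wait for or buffer the others' records, so the state that a streaming join would otherwise hold lives in the table itself.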

Risk Governance

Risk governance analyzes sessions, reports, comments, and transactions in near‑real‑time to detect fraud and black‑ and gray‑market activity. Because the data lake stores minimally processed data under a schema‑on‑read model, analysts can query it flexibly and reuse offline dimension tables, meeting both real‑time and batch requirements.

Operations‑Type Scenarios

Data Product Anomaly Monitoring

Monitoring data product anomalies requires visibility into intermediate results. By persisting streaming data into the lake, teams can perform multi‑source comparison, global anomaly detection, and large‑scale entity checks, improving operational efficiency and data quality.
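As one example of multi‑source comparison, a check can compare a metric computed from the persisted streaming results against a reference source, entity by entity. This is a hedged sketch only: the function name, the dictionary inputs, and the 1% relative tolerance are assumptions, not details from the platform.

```python
def find_mismatches(streaming, reference, tolerance=0.01):
    """Return {entity: (streaming value, reference value)} for disagreements.

    An entity is flagged when it is missing from one side or when the two
    values differ by more than `tolerance` relative to the reference.
    """
    mismatches = {}
    for key in streaming.keys() | reference.keys():
        s = streaming.get(key)
        r = reference.get(key)
        if s is None or r is None:
            mismatches[key] = (s, r)
        elif abs(s - r) > tolerance * max(abs(r), 1e-9):
            mismatches[key] = (s, r)
    return mismatches


streaming = {"shop-1": 100, "shop-2": 90, "shop-3": 5}
reference = {"shop-1": 100, "shop-2": 100, "shop-4": 7}
print(find_mismatches(streaming, reference))
```

A check like this is only possible when the intermediate streaming results are queryable, which is the point of persisting them into the lake.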

Real‑Time Message Persistence Detection

By capturing message‑queue data into the lake through change data capture (CDC), the full data pipeline becomes observable and testable, which helps developers debug problems and verify data quality.
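Once the messages are in the lake, a persistence check can be as simple as reconciling message IDs between producer and lake table. The sketch below assumes each message carries a unique ID; `check_persistence` and its inputs are hypothetical names used for illustration.

```python
from collections import Counter


def check_persistence(produced_ids, persisted_ids):
    """Report message IDs that never landed or that landed more than once."""
    persisted = Counter(persisted_ids)
    missing = sorted(set(produced_ids) - set(persisted))
    duplicated = sorted(key for key, n in persisted.items() if n > 1)
    return {"missing": missing, "duplicated": duplicated}


produced = ["m1", "m2", "m3", "m4"]
persisted = ["m1", "m3", "m3", "m4"]
print(check_persistence(produced, persisted))
```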

Future Challenges and Plans

Higher performance to support larger data volumes while reducing both data‑visibility latency and query latency.

Deeper integration with Flink and Spark to provide stronger failover guarantees.

Transition from near‑real‑time analytical applications to near‑real‑time product‑level applications.

Written by

ByteDance Data Platform

The ByteDance Data Platform team empowers all ByteDance business lines by lowering data‑application barriers, aiming to build data‑driven intelligent enterprises, enable digital transformation across industries, and create greater social value. Internally it supports most ByteDance units; externally it delivers data‑intelligence products under the Volcano Engine brand to enterprise customers.
