
Real‑Time Analytics with Alibaba Cloud Serverless Spark & Paimon for Taobao Flash Sale

This article details how Alibaba Cloud EMR Serverless Spark, combined with the Paimon lakehouse framework, enables Taobao Flash Sale's retail data team to deliver low‑latency, high‑throughput real‑time analytics, batch processing, and feature generation. It outlines the architecture's evolution, the performance gains achieved, and practical Spark tuning techniques.

Alibaba Cloud Big Data AI Platform

Background and Architecture Evolution

Taobao Flash Sale processes billions of retail events daily (orders, traffic, coupons, etc.). The retail data team needed a low‑latency, multi‑dimensional analytics platform for dashboards, traffic allocation, and AB‑experiment result retrieval. Initial "chimney" pipelines required heavy development and maintenance effort.

Lakehouse Stack

The team adopted a lakehouse architecture built on Flink + Paimon + StarRocks. Flink provides real‑time stream processing, Paimon offers an open‑format table store, and StarRocks serves as the analytical engine. To close gaps in batch‑oriented workloads, Spark was later introduced, forming a unified stream‑batch layer.

Challenges after Lakehouse Adoption

Low dimension‑table ingestion efficiency: Many dimension tables lived in ODPS and could not be materialized directly in StarRocks.

Fast requirement iteration vs. modest latency: The business demanded rapid feature rollout while real‑time latency requirements were comparatively relaxed, so development speed mattered more than aggressive performance tuning.

Huge data volume and cube explosion: Retail traffic data reached billions of rows, causing cube sizes to inflate by tens of thousands of times and leading to >3‑hour data delays during peak periods.

Introducing Spark for Stream‑Batch Unification

Partnering with the Alibaba Cloud EMR Serverless Spark and ALake Spark teams, the retail data team integrated Spark as a batch engine to materialize near‑real‑time physical tables. Over six months, Spark delivered:

30‑40% reduction in development effort.

>90% data stability improvement.

Up to 92% performance boost for critical workloads.

Key Spark + Paimon Features

Delete Vector (DV)

Paimon originally supported Copy‑On‑Write (COW) and Merge‑On‑Read (MOR). COW rewrites whole files on update (high write amplification), while MOR writes quickly but requires costly read‑time merges. DV records deletions at write time and filters the deleted rows at read time, preserving MOR‑level write speed while avoiding MOR's read‑time merge overhead. In a benchmark with 500 M rows (20% duplicate primary keys), enabling DV accelerated queries by 3‑5×.
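As a minimal sketch, DV is switched on through a Paimon table property at table creation time. The table and column names below are illustrative, not from the Taobao workload:

```sql
-- Illustrative Paimon primary-key table with deletion vectors enabled
-- ('orders' and its columns are hypothetical)
CREATE TABLE orders (
  order_id BIGINT,
  user_id  BIGINT,
  amount   DECIMAL(18, 2),
  ds       STRING
) TBLPROPERTIES (
  'primary-key' = 'ds,order_id',
  'deletion-vectors.enabled' = 'true'
);
```

With the property set, upserts on the primary key write deletion bitmaps instead of forcing whole-file rewrites, and readers skip the marked rows without a merge pass.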

Variant Type for JSON

Variant stores JSON schema and data in a self‑describing columnar format. Writes incur a modest slowdown, but reads achieve up to 12× speedup when Shredding is enabled (1.7× without Shredding). This dramatically reduces JSON parsing bottlenecks in retail scenarios.
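The sketch below assumes a Spark version that ships the VARIANT type with the parse_json and variant_get built‑ins; the table and JSON fields are hypothetical:

```sql
-- Illustrative use of the Variant type for semi-structured retail events
CREATE TABLE events (
  event_id BIGINT,
  payload  VARIANT
);

INSERT INTO events
SELECT 1, parse_json('{"item_id": "A100", "price": 9.9}');

-- Extract one field without re-parsing the whole JSON document
SELECT variant_get(payload, '$.item_id', 'string') AS item_id
FROM events;
```

Because the Variant encoding is self‑describing and columnar, field extraction touches only the requested paths rather than re‑parsing every JSON string per row.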

Fusion + Celeborn

Fusion is a C++‑based vectorized SQL engine that replaces row‑wise execution with columnar execution, enabling SIMD acceleration and better CPU‑cache utilization. Celeborn, a remote shuffle service contributed to the Apache Software Foundation, eliminates fetch‑failure errors and improves shuffle stability and performance for large‑scale jobs.
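For reference, wiring Spark to Celeborn is a matter of cluster configuration. This spark-defaults.conf fragment is a sketch; the master endpoint is a placeholder for a real Celeborn deployment:

```properties
# Illustrative spark-defaults.conf fragment for Apache Celeborn
# as the remote shuffle service (endpoint value is a placeholder)
spark.shuffle.manager              org.apache.spark.shuffle.celeborn.SparkShuffleManager
spark.celeborn.master.endpoints    celeborn-master-1:9097
```

Routing shuffle data through the Celeborn service, rather than executor-local disks, is what removes the fetch‑failure class of errors when executors are preempted or lost.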

Real‑Time Feature Production

Before lakehouse adoption, generating a single real‑time feature could take >2 weeks. After integrating Paimon and Spark, the team adopted a staged pipeline: first produce minute‑level features, evaluate importance, then promote to fully real‑time production. This cut feature generation time by >3× and reduced experiment‑to‑insight latency to 10‑30 minutes, saving ~20% of resource costs.

Spark Tuning and Governance Practices

Data Skew Mitigation

Enable adaptive skew join: spark.sql.adaptive.skewJoin.enabled=true

Set the skew threshold: spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes=256MB

Force a map join for large‑small table joins: add the hint /*+ MAPJOIN(small_table) */

Choose the bucket count with bucket_num = partition_size_GB / 2 (e.g., 864 GB → 432 buckets).
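The settings and hint above can be combined in one Spark SQL session. The fact and dimension table names here are hypothetical:

```sql
-- Illustrative skew-mitigation session settings plus a map-join hint
SET spark.sql.adaptive.enabled=true;
SET spark.sql.adaptive.skewJoin.enabled=true;
SET spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes=256MB;

-- Broadcast the small dimension table so skewed join keys on the
-- fact side never funnel into a single shuffle partition
SELECT /*+ MAPJOIN(d) */
  f.user_id, d.category, SUM(f.amount) AS gmv
FROM fact_orders f
JOIN dim_item d ON f.item_id = d.item_id
GROUP BY f.user_id, d.category;
```

Adaptive skew join splits oversized partitions at runtime, while the map join removes the shuffle on the dimension side entirely.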

Execution Plan Optimization for Cube/Dimension Expansion

Insert repartition(N) before heavy dimension expansion to increase task parallelism.

Choose N based on data size (e.g., 9 M rows → N ≈ 400) and verify that stage duration drops and task‑duration variance stays below 20%.
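In Spark SQL, the repartition step can be expressed as a query hint rather than a DataFrame call. The cube query below is an illustrative sketch; table and column names are hypothetical:

```sql
-- Illustrative repartition hint before a heavy dimension-expansion
-- (cube) step; 400 follows the sizing rule above
INSERT OVERWRITE TABLE traffic_cube
SELECT /*+ REPARTITION(400) */
  ds, channel, category, COUNT(*) AS pv
FROM traffic_detail
GROUP BY GROUPING SETS ((ds), (ds, channel), (ds, channel, category));
```

The hint forces a shuffle into 400 partitions immediately before the expansion, so the multiplied output rows are produced by many parallel tasks instead of a few oversized ones.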

Storage Layer Optimization

When performance remains insufficient after parameter tuning, rescale bucket numbers or change bucket keys (primary key for PK tables, high‑frequency join/group‑by columns for non‑PK tables).

Example

TBLPROPERTIES (
  'bucket'='432',
  'primary-key'='ds,user_id,order_id',
  'deletion-vectors.enabled'='true'
)

Results

The combined Spark + Paimon solution now supports AB‑experiment result retrieval, real‑time traffic analysis, and marketing feature production across dozens of retail scenarios. Development efficiency improved by 30‑40%, data stability rose above 90%, and Spark‑driven optimizations delivered up to 92% speedup.

Tags: big data, real-time analytics, performance tuning, Paimon, Spark, lakehouse
Written by

Alibaba Cloud Big Data AI Platform

The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.
