How Alibaba’s A+ Traffic Analysis Achieved Second‑Level Log Queries with StarRocks & Paimon
This article details how Alibaba's A+ traffic analysis platform tackled trillion‑row log ingestion and high‑concurrency queries by redesigning storage with Paimon, leveraging Flink for real‑time ingestion, and using StarRocks for fast lake analytics, ultimately reducing query latency from minutes to seconds.
Background
The A+ traffic analysis platform is Alibaba Group's unified, full‑domain traffic data analysis system, ingesting logs from pages, mini‑sites, activities, and apps. Daily log volume reaches trillions of records, creating massive challenges for both write throughput (tens of millions of RPS) and low‑latency, high‑concurrency queries.
Business Challenges
High‑throughput writes require handling data skew, anti‑fraud processing, and dimension table joins.
End‑to‑end latency must stay within five minutes to enable rapid business decisions.
Concurrent queries must remain stable and fast even under heavy load.
Query efficiency is limited by data scale, index design, and compute resources; caching or pre‑aggregation is needed.
Technical Background
Log data is stored in an offline warehouse with daily partitions, but partitions can contain up to 500 billion rows, causing long scan times. The offline warehouse does not support streaming, leading to hour‑level data latency.
Solution Overview
The team evaluated three options: 15‑minute batch scheduling, StarRocks internal tables, and a combined StarRocks + Paimon approach. It chose the combined approach, with Paimon for storage and StarRocks as the compute engine.
Why StarRocks + Paimon?
Paimon provides a real‑time public layer that supports multi‑region Flink subscriptions and stores data in AliORC format.
StarRocks already reads Parquet‑formatted Paimon tables and will be able to query the AliORC data directly once that format is supported.
The combination offers vectorized execution, cost‑based optimizer (CBO) planning, native deletion‑vector reads, and strong predicate push‑down and partition pruning.
Paimon’s low storage cost and high scalability complement StarRocks’ fast lake analytics.
Technical Design
Data flows from real‑time public layers (App and Web logs) into Flink, which writes Parquet‑formatted Paimon tables partitioned by date, type, product, and event. A user‑device mapping table enables point lookups. StarRocks accesses these tables via an external catalog, performing bucket‑level scans to limit data reads.
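The effect of partitioning by date, type, product, and event can be illustrated with a minimal sketch of partition pruning. This is not StarRocks or Paimon code: the field names (dt, log_type, product, event) and the prune_partitions helper are hypothetical, chosen only to show how a filter on partition columns discards whole partitions before any file is read.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Partition:
    # Hypothetical field names; the source only names the four dimensions.
    dt: str
    log_type: str
    product: str
    event: str


def prune_partitions(
    partitions: Iterable[Partition],
    dt: Optional[str] = None,
    log_type: Optional[str] = None,
    product: Optional[str] = None,
    event: Optional[str] = None,
) -> List[Partition]:
    """Keep only the partitions that match every filter that was supplied."""
    filters = {"dt": dt, "log_type": log_type, "product": product, "event": event}
    # Filters left as None place no constraint on that dimension.
    active = {name: value for name, value in filters.items() if value is not None}
    return [
        p for p in partitions
        if all(getattr(p, name) == value for name, value in active.items())
    ]
```

A query that fixes more partition columns touches fewer partitions, which is why queries against this layout benefit from filtering on as many of the four dimensions as the business question allows.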
Key Optimizations
Bucketed log tables reduce scan range to ~20 million rows per bucket.
Bucketed primary‑key mapping tables support fast point queries.
StarRocks multi‑layer cache (memory, local disk, remote) accelerates hot‑data queries.
Checkpoint‑driven file size control (100 MB–400 MB) balances small‑file overhead and large‑file latency.
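To make the bucketing idea concrete, here is a rough sketch of how a bucket count might be sized against the ~20 million rows per bucket target and how a point filter maps to a single bucket. The 2‑billion‑row partition is a made‑up example, and zlib.crc32 is only a stand‑in hash: Paimon uses its own bucket hashing, so actual bucket assignments will differ.

```python
import zlib


def buckets_needed(partition_rows: int, target_rows_per_bucket: int) -> int:
    """Smallest bucket count that keeps each bucket at or below the target size."""
    if partition_rows < 0 or target_rows_per_bucket <= 0:
        raise ValueError("rows must be non-negative and target must be positive")
    # Ceiling division without floating point.
    return -(-partition_rows // target_rows_per_bucket)


def bucket_of(key: str, num_buckets: int) -> int:
    """Map a bucket key to a bucket id (illustrative hash, not Paimon's)."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


# Hypothetical partition of 2 billion rows with the ~20 million row target.
num_buckets = buckets_needed(2_000_000_000, 20_000_000)  # 100 buckets
# A filter on the bucket key resolves to one bucket, so only about
# 1/num_buckets of the partition needs to be scanned.
target_bucket = bucket_of("device-123", num_buckets)
```

The same mapping is what makes point lookups on the user‑device mapping table cheap: the key determines the bucket, and only that bucket's files are candidates.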
Compaction Strategy
Both streaming and periodic compaction were evaluated; checkpoint‑based file size control proved most effective for stable Flink jobs and consistent StarRocks performance.
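A back‑of‑the‑envelope sketch of checkpoint‑driven sizing follows. It rests on a simplifying assumption that is not stated in the source: each checkpoint commits roughly one file per bucket, so file size is approximately the per‑bucket write rate multiplied by the checkpoint interval. The write rate used in the example is hypothetical.

```python
from typing import Tuple


def checkpoint_interval_range(
    mb_per_second_per_bucket: float,
    min_file_mb: float = 100.0,
    max_file_mb: float = 400.0,
) -> Tuple[float, float]:
    """Return the (shortest, longest) checkpoint interval in seconds that keeps
    files within the target size range, assuming one file per bucket per checkpoint."""
    if mb_per_second_per_bucket <= 0:
        raise ValueError("write rate must be positive")
    return (
        min_file_mb / mb_per_second_per_bucket,
        max_file_mb / mb_per_second_per_bucket,
    )


# Hypothetical bucket receiving 2 MB/s: intervals of 50 to 200 seconds
# would land files in the 100 MB-400 MB window.
low_s, high_s = checkpoint_interval_range(2.0)
```

The interval also bounds freshness, so under this model the upper end of the range has to be weighed against the end‑to‑end latency target rather than chosen for file size alone.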
Results
On a 3,000 CU StarRocks instance handling 1.5 trillion rows, first‑page query latency dropped to 4–8 seconds, with subsequent pages at 2–5 seconds. Compared with the original solution (5‑minute query time, 2‑hour data freshness), the new stack achieves roughly 5‑second query latency and 5–10 minute freshness.
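The improvement factors implied by these figures can be worked out directly. The snippet below uses only the numbers quoted above and treats "roughly 5 seconds" as exactly 5 for the purpose of the calculation.

```python
# Query latency: 5 minutes before, about 5 seconds after.
old_query_seconds = 5 * 60
new_query_seconds = 5
query_speedup = old_query_seconds / new_query_seconds  # 60.0

# Data freshness: 2 hours before, 5 to 10 minutes after.
old_freshness_minutes = 2 * 60
freshness_gain_low = old_freshness_minutes / 10   # 12.0 (at 10-minute freshness)
freshness_gain_high = old_freshness_minutes / 5   # 24.0 (at 5-minute freshness)
```

In other words, queries return about 60 times faster and data arrives 12 to 24 times sooner than in the original solution.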
Future Outlook
Lower storage cost by reusing the real‑time Paimon layer.
Higher compute performance thanks to AliORC’s efficient I/O.
Expand coverage to event analysis, retention analysis, and path analysis scenarios.
Alibaba Cloud Big Data AI Platform
The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.
