Breaking the CPU Wall: BIGO’s Gluten Engine Accelerates Spark and Flink
When big-data workloads hit the CPU wall, BIGO turned to the open-source Gluten project, which brings native-engine execution to Spark and has a roadmap for Flink. The result: job runtime reduced by 30% on average (up to 50% for CPU-intensive workloads), memory consumption cut by 50%, and a scalable, cost-effective data processing platform.
Overview
Big data computation is increasingly constrained by CPU saturation as storage speeds outpace processing, and Java‑based execution cannot fully exploit modern vector instructions. BIGO, a global audio‑video service provider, joined the Gluten open‑source effort to overcome these limits.
Technical Bottlenecks and Strategic Choice
Apache Spark processes petabytes of data daily, but its JVM-based execution engine runs at 80%+ CPU utilization and makes only limited use of SIMD vectorization (e.g., AVX/AVX2). Further JVM-level optimizations (Tungsten, Adaptive Query Execution) have reached diminishing returns, whereas native C++ implementations can be 3-5× faster for string handling and UDFs.
Native Execution Wave
Gluten enables native execution by converting Spark physical plans to Substrait plans, then running them on high-performance back-ends such as ClickHouse and Velox. Key techniques include:
Vectorized execution that reduces CPU instruction count by ~90% and improves cache hit rates several‑fold.
Columnar shuffle using Arrow format, cutting aggregation latency by 2‑4×.
Hardware‑aware optimizations that bypass JVM JIT overhead.
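To make the row-versus-column distinction concrete, here is a minimal sketch in plain Python. It is not Gluten or Velox code, and the function names and the toy `price`/`qty` schema are assumptions made for illustration. Pure Python will not show the speedups quoted above; the point is only the data layout: the columnar version works on contiguous typed arrays in fixed-size batches and builds a selection vector once per batch, which is the shape of loop that a native engine can compile to SIMD instructions.

```python
from array import array


def sum_row_at_a_time(rows, threshold):
    """Row-oriented: one field lookup and one branch per record."""
    total = 0.0
    for row in rows:
        if row["price"] > threshold:
            total += row["price"] * row["qty"]
    return total


def sum_columnar(price, qty, threshold, batch_size=1024):
    """Column-oriented: process contiguous typed arrays in fixed-size batches."""
    total = 0.0
    for start in range(0, len(price), batch_size):
        p = price[start:start + batch_size]
        q = qty[start:start + batch_size]
        # Build the selection vector once, then reuse it for every column.
        selected = [i for i, value in enumerate(p) if value > threshold]
        total += sum(p[i] * q[i] for i in selected)
    return total


# Toy data set (hypothetical schema): same records in both layouts.
rows = [{"price": float(i % 100), "qty": i % 7} for i in range(10_000)]
price = array("d", (r["price"] for r in rows))
qty = array("q", (r["qty"] for r in rows))

assert sum_row_at_a_time(rows, 50.0) == sum_columnar(price, qty, 50.0)
```

Both functions return the same result; only the memory layout and loop structure differ.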
Gluten Architecture in Spark
The core components are:
Plan Conversion: Spark physical plan → Substrait → Velox/ClickHouse.
Columnar Shuffle: Distributed columnar data exchange via Celeborn.
Fallback Mechanism: A shim layer transparently falls back to vanilla (JVM) Spark execution for operators the native back-end does not support.
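The interplay between plan conversion and fallback can be sketched as a tree walk. The following is a toy model under stated assumptions, not Gluten's actual planner: `NATIVE_SUPPORTED`, `PlanNode`, `place`, and the `UnsupportedUDF` operator are hypothetical names, and the real support matrix depends on the Gluten version and back-end. The idea it illustrates is that each operator is assigned to the native engine when supported, otherwise to the JVM, with a row/columnar format transition inserted at every engine boundary.

```python
from dataclasses import dataclass, field

# Hypothetical support list; the real one depends on version and back-end.
NATIVE_SUPPORTED = {"Scan", "Filter", "Project", "HashAggregate", "Shuffle"}


@dataclass
class PlanNode:
    op: str
    children: list = field(default_factory=list)


def place(node):
    """Return (op, engine, children), adding a format transition wherever
    a parent and child are assigned to different engines."""
    engine = "native" if node.op in NATIVE_SUPPORTED else "jvm"
    placed_children = []
    for child in node.children:
        placed = place(child)
        if placed[1] != engine:
            # Native output feeding a JVM parent must be converted to rows,
            # and JVM output feeding a native parent back to columns.
            transition = "ColumnarToRow" if placed[1] == "native" else "RowToColumnar"
            placed = (transition, engine, [placed])
        placed_children.append(placed)
    return (node.op, engine, placed_children)


def flatten(placed):
    """List (op, engine) pairs top-down for easy inspection."""
    op, engine, children = placed
    result = [(op, engine)]
    for child in children:
        result.extend(flatten(child))
    return result


plan = PlanNode("HashAggregate", [
    PlanNode("UnsupportedUDF", [
        PlanNode("Filter", [PlanNode("Scan")]),
    ]),
])

for op, engine in flatten(place(plan)):
    print(f"{op:16s} {engine}")
```

Each transition copies data between formats, so a fallback in the middle of a plan costs more than the single unsupported operator suggests. That is one reason operator coverage matters for end-to-end gains.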
Scale Benefits
By mid-2025, BIGO had migrated all Spark jobs to native execution, achieving:
Average job runtime reduced by 30%, with CPU‑intensive workloads (e.g., ad attribution) improving up to 50%.
Memory consumption cut by 50%, saving over ¥1.2 million in hardware costs annually.
Over 5,000 workflows and 20,000 jobs migrated, with 600+ PRs and 9k lines of code contributed.
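A runtime reduction and a speedup factor are easy to conflate, so a small worked conversion may help when reading these figures. The helper below is a hypothetical illustration, not something from the talk: if runtime drops by a fraction r, the same work finishes 1 / (1 - r) times faster.

```python
def speedup_from_runtime_reduction(reduction):
    """Convert a fractional runtime reduction (0 <= r < 1) to a speedup factor."""
    if not 0 <= reduction < 1:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


print(round(speedup_from_runtime_reduction(0.30), 2))  # 1.43
print(speedup_from_runtime_reduction(0.50))            # 2.0
```

Read this way, the reported 30% average reduction corresponds to roughly a 1.43× speedup, and the 50% reduction on CPU-intensive workloads to 2×.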
Gluten for Flink Roadmap
In January 2025, BIGO began extending Gluten to Flink, tackling three main challenges: Flink's lack of a plugin architecture, state management, and low-latency stream processing. The plan follows a "simple to complex" approach:
2025 Q2: Support stateless operators (Filter, Map, simple Aggregate) and complete a Nexmark proof of concept (POC).
2025 Q3: Enable full-streaming benchmarks (Join, Window) on the Velox runtime.
2025 Q4: Deliver state management and checkpoint integration, and pilot real-time monitoring jobs.
2026 and beyond: Expand to Flink batch, DataStream, and full columnar execution.
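The ordering of this roadmap follows from the difference between stateless and stateful operators, which the sketch below tries to illustrate. It is plain Python with hypothetical names (`filter_batch`, `map_batch`, `KeyedCounter`) and does not use Flink's or Gluten's APIs. A stateless operator depends only on the current batch, so it can be offloaded to a native engine with no recovery concerns. A stateful operator must expose snapshot and restore hooks so that a replay after failure produces the same results, and that is where checkpoint integration comes in.

```python
import copy


def filter_batch(batch, predicate):
    """Stateless: the output depends only on the current batch."""
    return [row for row in batch if predicate(row)]


def map_batch(batch, fn):
    """Stateless: each row is transformed independently."""
    return [fn(row) for row in batch]


class KeyedCounter:
    """Stateful: a running count per key that must survive failures."""

    def __init__(self):
        self.counts = {}

    def process(self, batch):
        out = []
        for key in batch:
            self.counts[key] = self.counts.get(key, 0) + 1
            out.append((key, self.counts[key]))
        return out

    def snapshot(self):
        # Copy the state so later processing cannot mutate the checkpoint.
        return copy.deepcopy(self.counts)

    def restore(self, state):
        self.counts = copy.deepcopy(state)


counter = KeyedCounter()
counter.process(["a", "b", "a"])
checkpoint = counter.snapshot()

first_attempt = counter.process(["a", "c"])

# Simulate a failure: roll back to the checkpoint and replay the batch.
counter.restore(checkpoint)
replayed = counter.process(["a", "c"])

assert first_attempt == replayed
```

In a native engine the state lives outside the JVM, so the snapshot and restore steps have to be bridged to the host framework's checkpointing, which is consistent with the roadmap scheduling that work after the stateless operators.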
Community Involvement
BIGO invites developers to contribute code (e.g., state interfaces, vectorized operators), provide production scenarios, report bugs, improve documentation, or review pull requests. The project is open‑source under Apache 2.0, encouraging a collaborative ecosystem.
Q&A Highlights
Key questions addressed include performance gaps between lab tests and production, the focus on streaming vs. batch for Gluten‑for‑Flink, supported back‑ends (ClickHouse, Velox), deployment environments (Linux distributions, x86_64/ARM), future GPU/FPGA acceleration plans, and advantages over commercial engines such as Alibaba Flash.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
