Architecture Evolution and Capability Building of the Smart Acceleration Engine in the 58 Big Data Platform
The article details the background, architectural challenges, and comprehensive redesign of the Smart Acceleration Engine—including multi‑tenant support, cross‑datacenter scheduling, enriched engine selection, parsing and forwarding enhancements, compatibility adaptations, stability fixes, containerized deployment, and performance gains—demonstrating significant operational improvements and future directions for the platform.
The Smart Acceleration Engine is a self‑developed complex‑computation component of the 58 Big Data Platform, playing a crucial role in supporting business growth and platform stability. As big data technologies mature and AIGC develops rapidly, the engine is being iteratively upgraded to deliver notable cost reductions and efficiency gains.
Architecture Analysis
The platform’s overall architecture (see Fig. 1) includes the Smart Acceleration Engine, which provides efficient parsing, flexible forwarding, and strong execution capabilities for ad‑hoc query scenarios. However, growing data volume and business scale exposed several issues:
High code coupling across modules, leading to complex maintenance.
Strong coupling with Hive source code, limiting multi‑engine extensibility and hindering unified SQL entry.
Gateway service also handling compute tasks, causing node overload and long query times.
Resource and business isolation needs across data‑center pools, with cross‑datacenter scheduling bandwidth constraints.
To address these, the architecture is upgraded to use Apache Kyuubi as an independent gateway and introduce StarRocks as an additional compute engine (Fig. 3).
Capability Building
3.1 Multi‑tenant Architecture Refactor
Instead of proxy‑user authentication, Kyuubi’s engine startup now uses doAs to simulate real user sessions, enabling true multi‑tenant support.
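Kyuubi’s impersonation model combines standard Hadoop proxy‑user settings with Kyuubi’s engine share level. A minimal configuration sketch (the service‑user name `kyuubi` and wildcard values are illustrative; consult the Kyuubi and Hadoop docs for your versions):

```properties
# core-site.xml (Hadoop side): allow the Kyuubi service user to
# impersonate end users when launching engines via doAs
# hadoop.proxyuser.kyuubi.hosts=*
# hadoop.proxyuser.kyuubi.groups=*

# kyuubi-defaults.conf: one engine per real end user
kyuubi.engine.share.level=USER
```

With `USER`-level sharing, each account gets its own engine instance running under its own identity, which is what makes true multi‑tenant resource accounting possible.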
3.2 Group Isolation and Multi‑engine Support
Namespaces provide physical isolation for different data‑centers, while logical refactoring allows multiple nodes and services within the same namespace.
3.3 Enhanced Cross‑Datacenter Scheduling
SQL dispatch logic now considers account and data volume, routing queries to appropriate data‑centers and engines (Fig. 7).
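The dispatch idea can be sketched in a few lines. This is a hypothetical illustration, not 58’s code: the function name `pick_route`, the row‑count threshold, and the small‑query‑to‑StarRocks / large‑query‑to‑Spark split are assumptions standing in for the real account‑ and volume‑aware rules.

```python
# Hypothetical routing sketch: keep queries inside the submitting
# account's home datacenter, and send only small queries to StarRocks
# so large scans don't saturate cross-datacenter bandwidth.

LARGE_SCAN_ROWS = 100_000_000  # assumed cutoff for a "big" query


def pick_route(account_dc: str, estimated_rows: int) -> dict:
    """Return the target datacenter and engine for one SQL statement."""
    if estimated_rows < LARGE_SCAN_ROWS:
        return {"datacenter": account_dc, "engine": "starrocks"}
    return {"datacenter": account_dc, "engine": "spark"}


print(pick_route("dc-east", 5_000_000))   # small query -> starrocks
print(pick_route("dc-east", 2_000_000_000))  # large scan -> spark
```

The real dispatcher would also weigh queue depth and engine health, but the shape is the same: route on who is asking and how much data the query touches.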
3.4 Rich Engine Selection Strategies
Beyond the fixed engine choice, strategies such as RANDOM, LEASTACTIVE, and WEIGHT are added to balance load and avoid single‑node bottlenecks (Fig. 8).
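The three added strategies behave like their counterparts in common load balancers. A self‑contained Python sketch (node names, `active` counts, and `weight` values are made up for illustration):

```python
import random

# Example node pool: "active" = in-flight queries, "weight" = capacity.
nodes = [
    {"name": "sr-fe-1", "active": 12, "weight": 3},
    {"name": "sr-fe-2", "active": 4,  "weight": 1},
    {"name": "sr-fe-3", "active": 9,  "weight": 2},
]


def select(nodes, strategy="RANDOM"):
    if strategy == "RANDOM":
        # uniform pick; simple but ignores load
        return random.choice(nodes)
    if strategy == "LEASTACTIVE":
        # node with the fewest in-flight queries wins
        return min(nodes, key=lambda n: n["active"])
    if strategy == "WEIGHT":
        # weighted random pick proportional to declared capacity
        return random.choices(nodes, weights=[n["weight"] for n in nodes])[0]
    raise ValueError(f"unknown strategy: {strategy}")


print(select(nodes, "LEASTACTIVE")["name"])  # -> sr-fe-2
```

LEASTACTIVE avoids the single‑node bottleneck directly; WEIGHT lets heterogeneous nodes take load in proportion to their capacity.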
3.5 Parsing and Forwarding Improvements
SQL parsing now fully supports StarRocks syntax; SQLGlot handles dialect rewriting; history‑based optimization (HBO) leverages historical run‑time data to select the best plan; and a matrix of machine‑learning models predicts the optimal execution route.
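The HBO idea reduces to: fingerprint each recurring SQL, remember how long each engine took on it, and route to the historical winner. A minimal sketch under assumed names (`fingerprint`, `record`, `choose_engine` are all illustrative, as is the normalization scheme):

```python
import hashlib
from statistics import mean

# Hypothetical history-based-optimization (HBO) sketch:
# fingerprint -> {engine: [observed runtimes in seconds]}
history = {}


def fingerprint(sql: str) -> str:
    """Normalize whitespace/case so recurring SQL maps to one key."""
    normalized = " ".join(sql.lower().split())
    return hashlib.md5(normalized.encode()).hexdigest()


def record(sql: str, engine: str, seconds: float) -> None:
    """Store one observed runtime for this SQL on this engine."""
    history.setdefault(fingerprint(sql), {}).setdefault(engine, []).append(seconds)


def choose_engine(sql: str, default: str = "spark") -> str:
    """Prefer the engine with the best average runtime; else the default."""
    runs = history.get(fingerprint(sql))
    if not runs:
        return default
    return min(runs, key=lambda e: mean(runs[e]))


record("SELECT c FROM t", "spark", 42.0)
record("select c from t", "starrocks", 3.5)
print(choose_engine("SELECT c FROM t"))  # -> starrocks
```

The ML layer described in the article generalizes this further, predicting a route even for SQL with no exact history.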
Compatibility Adaptation
Extensive adaptations ensure Spark‑StarRocks compatibility across syntax, metadata binding, query optimization, and execution phases, including Java UDF support, Hive catalog cache disabling, and handling of StarRocks‑unsupported functions.
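One compatibility tactic is a pre-flight check: if a query calls functions the StarRocks deployment cannot run, fall back to Spark. The sketch below is illustrative only; the function names in `UNSUPPORTED_IN_STARROCKS` are assumed examples, and a real implementation would inspect the parsed AST rather than use a regex.

```python
import re

# Assumed examples of Hive functions a StarRocks deployment might lack.
UNSUPPORTED_IN_STARROCKS = {"reflect", "java_method"}


def needs_spark_fallback(sql: str) -> bool:
    """Crude check: extract called function names and test for overlap."""
    called = set(re.findall(r"\b(\w+)\s*\(", sql.lower()))
    return bool(called & UNSUPPORTED_IN_STARROCKS)


print(needs_spark_fallback("SELECT reflect('java.lang.Math','max',1,2)"))  # True
print(needs_spark_fallback("SELECT max(c) FROM t"))                        # False
```

Routing such queries back to Spark keeps the unified SQL entry point intact without waiting for every function to be adapted.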
Stability Fixes
Issues such as FE node hangs, BE crashes, and memory overflow caused by oversized CBO‑generated SQL are mitigated, and Hive statistics collection is bypassed for better stability (Fig. 11).
Usability Enhancements
Java UDFs can now be fetched from HDFS, and SQL black‑list persistence is moved to metadata for reliability.
Containerized Deployment Exploration
A hybrid cloud‑on‑premise deployment is adopted: FE remains on physical machines, BE partially in the cloud with local storage, and CN as stateless compute nodes with resource isolation and disk‑spilling to address container memory limits (Fig. 14).
Performance Improvements
Data Cache activation and slow HDFS DataNode avoidance significantly improve query latency, with measurable reductions in long‑tail queries (Fig. 15).
Landing Results
Smart Acceleration Engine processes over 100k SQLs daily.
StarRocks lake‑warehouse architecture drives ad‑hoc query growth, surpassing 60k daily SQLs.
ETL instances exceed 10k, with average efficiency gains of 41%.
HiveSQL migration reaches 92.4%, with P95 latency improved by 43.8%.
AI‑driven HBO models increase accuracy by 82% and halve failover rates.
CPU consumption reduced by over 15,000 cores through resource‑efficient architecture.
Summary and Outlook
The project benefited greatly from the Apache Kyuubi and StarRocks communities. Future work will focus on continuous Kyuubi/StarRocks iteration, exploiting Spark 3.5 capabilities, advancing vectorized processing, and exploring AI/algorithm innovations for smarter data‑driven decisions.
Authors: Ma Ruili, Wang Shifa, Liu Kai, Zhou Heming, Wu Yanxing, and the Data Architecture Computing Team.
58 Tech
Official tech channel of 58, a platform for tech innovation, sharing, and communication.