
HuoLala Big Data Infrastructure: Challenges, Practices, and Future Outlook

Senior big data engineer Zhu Yaogai from HuoLala shares the team's three-year journey: the background challenges, the construction of a multi-layer big-data infrastructure, and the solutions for cost efficiency, operational automation, and heterogeneous computing, closing with future plans. Throughout, high cost-effectiveness, high operational efficiency, and high analytical performance drive the evolution.


HuoLala, a multi-service logistics platform with over 10 million monthly active users and more than 1,000 monthly active drivers, processes over 20 PB of data on 1,000+ machines across 7 IDC sites and runs 20,000+ tasks daily.

Over three years, the big-data team built three core systems (security, cost control, and stability) on four architectural pillars: a platform service layer, a compute engine layer, a cluster management and storage layer, and a basic resource management layer.

Key challenges identified were low cost efficiency (under‑utilized resources, peak‑valley cost waste, rapid cost growth), low operational efficiency (complex SOP‑driven workflows across multiple regions, clusters, and services), and low analytical efficiency (slow ad‑hoc queries).

To improve cost efficiency, the team introduced elastic resource management using a mix of reserved, on‑demand, and spot instances, achieving 20‑30% cost savings while maintaining high‑priority job stability.
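The arithmetic behind such a mix is straightforward. The sketch below uses illustrative prices and fleet sizes, not HuoLala's actual figures, to show how a reserved/on-demand/spot split lands in the cited 20-30% savings range:

```python
# Hypothetical sketch of blending reserved, on-demand, and spot capacity.
# Prices and node counts are illustrative, not HuoLala's actual figures.

def blended_cost(hours, mix, prices):
    """Total cost for a fleet split across instance types.
    `mix` maps type -> node count, `prices` maps type -> $/node-hour."""
    return sum(prices[t] * n * hours for t, n in mix.items())

prices = {"reserved": 0.6, "on_demand": 1.0, "spot": 0.3}  # $/node-hour (illustrative)

# Same 100-node fleet for one month (720 hours), two purchasing strategies.
all_on_demand = blended_cost(720, {"on_demand": 100}, prices)
mixed = blended_cost(720, {"reserved": 50, "on_demand": 40, "spot": 10}, prices)

savings = 1 - mixed / all_on_demand
print(f"savings: {savings:.0%}")  # → savings: 27%
```

The stability constraint in the talk maps onto the mix itself: high-priority jobs stay on reserved and on-demand nodes, while preemptible spot capacity only absorbs low-priority and retryable work.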

Heterogeneous computing was explored by replacing x86 nodes with ARM, cutting hardware cost by at least 15% along with power consumption, while maintaining performance after adapting key components (Tez, Spark, YARN) and extensive testing.

Operational efficiency was boosted through automated asset management, workflow orchestration that consolidates SOPs and scripts, and targeted automation of high‑frequency scenarios, dramatically reducing manual effort.
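Consolidating an SOP into an orchestrated workflow can be as simple as turning each manual step into a named, retryable function. The step names and logic below are hypothetical, not HuoLala's actual operations code:

```python
# Illustrative sketch of an SOP turned into an orchestrated workflow:
# each step is a plain function, run in order with retries on failure.
from typing import Callable

def run_workflow(steps: list[tuple[str, Callable[[], bool]]], retries: int = 2) -> list[str]:
    """Run named steps in order; retry a failing step up to `retries` extra
    times, aborting the workflow if it still fails. Returns completed names."""
    done = []
    for name, step in steps:
        for _attempt in range(retries + 1):
            if step():
                done.append(name)
                break
        else:
            raise RuntimeError(f"workflow aborted at step: {name}")
    return done

# A toy node-decommission SOP: drain the node, stop services, update assets.
completed = run_workflow([
    ("drain_node", lambda: True),
    ("stop_services", lambda: True),
    ("update_asset_db", lambda: True),
])
print(completed)  # → ['drain_node', 'stop_services', 'update_asset_db']
```

The payoff described in the talk comes from targeting only high-frequency scenarios: once a manual SOP runs often enough, encoding it this way removes the per-run human effort entirely.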

Cloud-native advances included migrating YARN workloads to Kubernetes, implementing a Remote Shuffle framework to make compute nodes stateless, and adopting the open-source Apache Uniffle project, to which the team contributed roughly 20,000 lines of code.
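As a concrete illustration, pointing a Spark job at a remote shuffle service is largely a client-side configuration change. The keys below follow the Apache Uniffle Spark client as documented upstream; treat the exact names, hosts, and port as assumptions to verify against the Uniffle docs, not HuoLala's setup:

```properties
# spark-defaults.conf fragment (illustrative; verify keys against Uniffle docs)
spark.shuffle.manager            org.apache.spark.shuffle.RssShuffleManager
spark.rss.coordinator.quorum     coordinator-1:19999,coordinator-2:19999
```

With shuffle data held by the remote service instead of local disks, compute pods become stateless and can be preempted or rescheduled on Kubernetes without losing shuffle output.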

For analytical efficiency, a multi‑engine routing system was built to automatically select the optimal engine (Hive, Spark, Presto, etc.) based on query characteristics, with automatic fallback and a reported 84% compatibility for Presto and a 70% success rate in critical scenarios.
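The routing-plus-fallback idea can be sketched in a few lines. The heuristics and engine order below are illustrative; the talk does not publish HuoLala's actual routing rules:

```python
# Hypothetical sketch of multi-engine routing with automatic fallback.
# Heuristics are illustrative, not HuoLala's actual rules.

def pick_engines(query: str, scanned_gb: float) -> list[str]:
    """Return candidate engines in preference order: fast interactive
    engines first for small ad-hoc reads, batch engines for heavy or
    write/DDL workloads."""
    q = query.strip().lower()
    if q.startswith(("insert", "create", "drop")) or scanned_gb > 500:
        return ["spark", "hive"]          # heavy / batch: skip Presto
    return ["presto", "spark", "hive"]    # ad-hoc: try the fastest first

def run_with_fallback(query: str, scanned_gb: float, execute) -> str:
    """Try each candidate engine in order; fall back on any failure."""
    errors = {}
    for engine in pick_engines(query, scanned_gb):
        try:
            return execute(engine, query)
        except Exception as e:            # incompatible syntax, OOM, timeout...
            errors[engine] = e
    raise RuntimeError(f"all engines failed: {errors}")

# Toy executor where Presto "rejects" the query, forcing a Spark fallback.
def fake_execute(engine, query):
    if engine == "presto":
        raise ValueError("unsupported syntax")
    return f"ok via {engine}"

print(run_with_fallback("select * from t", scanned_gb=10, execute=fake_execute))
# → ok via spark
```

The reported 84% Presto compatibility is what makes this scheme pay off: most ad-hoc queries finish on the fast engine, and the remainder degrade gracefully to Spark or Hive instead of failing.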

The presentation concluded with future directions: further engine optimization, extreme elasticity, AI‑driven intelligent operations (AIOps), offline‑online mixing, and exploration of lake‑house integration.

Overall, the team’s evolution is driven by the three goals of high cost‑effectiveness, high operational efficiency, and high analytical performance, emphasizing measured adoption of new technologies and incremental, benefit‑oriented improvements.

Cloud Native · Automation · Heterogeneous Computing · Cost Efficiency · Data Infrastructure
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
