Big Data 16 min read

How Beike Scaled to 600 PB: The Evolution of a Data‑Fusion Architecture

This article details Beike's data‑fusion architecture evolution, covering industry trends, multi‑stage Hadoop upgrades, storage cost optimization with erasure coding, remote shuffle integration, GPU‑centric training stability, and future hybrid‑cloud strategies, while also sharing organizational and operational lessons learned.

DataFunSummit
DataFunSummit
DataFunSummit
How Beike Scaled to 600 PB: The Evolution of a Data‑Fusion Architecture

Introduction

Beike’s data‑fusion architecture has been continuously optimized across storage, scheduling, performance, and stability. The discussion includes small‑file acceleration, asynchronous checkpoints, GPU utilization improvements, hybrid architecture adaptation, and a vision for future intelligent data‑fusion platforms, alongside organizational and technical management practices.

Industry Trend Background

Data volumes are growing exponentially, with Beike’s Hadoop cluster reaching nearly 600 PB and 4,000 servers since 2018. AI demands higher data throughput and lower latency, pushing model parameters from millions to billions and requiring systems with high throughput, low latency, and ease of use.

Pure storage‑compute designs like Hadoop struggle with scalability, flexibility, and performance for diverse workloads, leading to challenges such as data islands and complex metadata management.

Beike Data Architecture Evolution Path

1. Necessity of Evolution – Storage costs rise with data volume despite compression or erasure coding (EC). Compute costs are tied to task count and complexity, making compute‑centric designs insufficient for data‑driven ecosystems.

2. Evolution Stages

2019‑2022: Focus on stability, reducing report failures from 8 incidents to 0 per year.

2023‑2024: Emphasize cost optimization, adopting EC to lower storage costs while maintaining a “ten‑fold growth” strategy.

From 2025: Build hybrid‑cloud data architecture and AI‑focused data platforms.

AI development progressed from traditional models (2019‑2022) to MLOps and AiStudio, then to large‑model training with mixed‑cloud resources, supporting LLMs and diffusion models.

Case Study: Hadoop Rebirth

In 2019, daily Hadoop jobs suffered delays on a 2.7.3 cluster nearing rack capacity. Three parallel improvements were applied:

Horizontal namespace splitting.

Upgrade to Hadoop 3.2.1 for version benefits.

Cross‑datacenter migration to alleviate expansion limits.

The upgrade, completed between 2020‑2021, included namespace splitting, HDFS compatibility between 2.7.3 and 3.2.1, cross‑datacenter caching, and QoS controls, reducing average job time from 900 s to 220 s and keeping namespace objects under 4.5 billion.

Storage Cost Optimization

Key measures after stabilizing the system:

Increase physical storage density while managing cold‑start and failure risks.

Reduce replica factor from 2.68 to a target of 2.0, improving resource utilization.

(1) Compute‑Storage Separation – Decouples storage and compute, allowing on‑demand resource allocation and improving CPU utilization from ~60 % to higher levels, benefiting AI workloads.

(2) Large‑Scale Erasure Coding (EC) – Implemented in 2020, addressing accuracy, performance, and maintainability issues, achieving a replica factor reduction from 3.0 to 2.68.

(3) Remote Shuffle – Integrated Apache Uniffle for true resource decoupling, evolving from local merge to full remote execution, enhancing elasticity and reducing I/O bottlenecks.

Storage Orchestration

With GPU resources spread across multiple clouds and regions (>10 PB of data), a cross‑region storage orchestration system was built to:

Replace centralized storage with logical unified orchestration.

Support multiple protocols and training frameworks (DFS, COS, QBFS).

Provide transparent access and efficient caching/replication to lower bandwidth load.

Training Stability Assurance

Large‑scale model training faces GPU ECC errors and high failure rates (up to 93.3 % for thousand‑GPU tasks). Checkpointing becomes a bottleneck when writing hundreds of GB. Beike developed KeFlashCheckpoint (2024) for Megatron‑LLM and DeepSpeed, cutting a 960‑card, 800‑billion‑parameter model’s training time from 25 days to 23 days (8.3 % speedup) and saving ~700 k GPU‑card‑hours annually.

Additional reliability mechanisms include automated task restarts and elastic training pipelines.

Core Principles and Future Direction

The two guiding principles are: bring data as close as possible to compute resources, and reduce serial load to boost parallel efficiency. Techniques such as Ring AllReduce replace linear communication growth with ring‑based scaling.

Future work focuses on expanding the AI production line, enhancing cross‑region data pipelines, and strengthening organizational collaboration, fault‑culture, and knowledge sharing.

Organizational Collaboration

The team handles big‑data kernels, AI data, heterogeneous acceleration, and search. Emphasis is placed on goal decomposition, consensus building, fault‑postmortems, and continuous debt reduction to maintain system stability.

Conclusion

Beike’s data architecture evolution demonstrates a comprehensive journey from massive Hadoop clusters to AI‑centric, hybrid‑cloud, storage‑orchestrated platforms, offering valuable insights for large‑scale data‑intensive enterprises.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

cloud computingAIStorage OptimizationHadoopData Architecture
DataFunSummit
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.