58.com Big Data Offline Computing Platform: Architecture, Scaling, Optimization, and Cross‑Data‑Center Migration
This article presents a comprehensive case study of 58.com’s massive Hadoop‑based offline computing platform, detailing its architecture, scaling challenges, performance‑tuning measures, YARN and SparkSQL upgrades, and the systematic cross‑data‑center migration of thousands of nodes and petabytes of data.
58.com operates an offline computing platform built on the Hadoop ecosystem, comprising over 4,000 servers, hundreds of petabytes of storage, and handling more than 400,000 daily compute tasks, serving core business lines such as commerce, real‑estate, and recruitment.
The Data Platform department provides five key capabilities: data ingestion via Flume and Kafka, offline computation using customized HDFS, YARN, MapReduce and Spark, real‑time computation through a Flink‑based stack called Wstream, multi‑dimensional analysis with Kylin (offline) and Druid (real‑time), and database services built on HBase, OpenTSDB, and JanusGraph.
Rapid cluster growth introduced bottlenecks, most notably the NameNode single-point bottleneck. The team adopted HDFS Federation to distribute metadata across multiple NameNodes, used ViewFileSystem to present a unified namespace to clients, and enforced directory-quota policies to keep maintenance costs low.
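A federated namespace is usually exposed to clients through ViewFileSystem mount points in core-site.xml. The fragment below is a minimal sketch; the nameservice names (`ns1`, `ns2`) and directory layout are illustrative, not 58.com's actual configuration:

```xml
<configuration>
  <!-- Clients see one unified viewfs:// namespace -->
  <property>
    <name>fs.defaultFS</name>
    <value>viewfs://cluster</value>
  </property>
  <!-- Top-level directories map to different federated NameNodes -->
  <property>
    <name>fs.viewfs.mounttable.cluster.link./user</name>
    <value>hdfs://ns1/user</value>
  </property>
  <property>
    <name>fs.viewfs.mounttable.cluster.link./hive</name>
    <value>hdfs://ns2/hive</value>
  </property>
</configuration>
```

Because clients resolve paths through the mount table, directories can later be rebalanced across NameNodes without changing application paths.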
Stability was improved by decomposing heavy RPC traffic: Hive scratch directories were balanced across NameNodes, YARN log aggregation was redirected to local storage, and resource-localization operations were distributed. The DataNode full BlockReport interval was relaxed from hourly to every ten hours, incremental BlockReports were batched, and dead-DataNode handling was added to avoid redundant block re-replication.
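The full BlockReport cadence is controlled by a standard HDFS property, so a ten-hour interval would look like the following in hdfs-site.xml (the value is in milliseconds; this shows the mechanism, not 58.com's exact production file):

```xml
<configuration>
  <!-- Full BlockReport every 10 hours instead of every hour -->
  <property>
    <name>dfs.blockreport.intervalMsec</name>
    <value>36000000</value> <!-- 10 h x 3600 s x 1000 ms -->
  </property>
</configuration>
```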
Core‑link optimizations targeted lock contention and processing overhead, refining the PermissionCheck, QuotaManager, ReplicationMonitor, and chooseTarget code paths to increase parallelism and shorten lock hold times.
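The common pattern behind these fixes is to do expensive work outside the global lock and hold it only for the final mutation. A toy Python sketch of that idea (the class and method names are illustrative, not actual NameNode code):

```python
import threading

class BlockPlacementSketch:
    """Toy model of shrinking a critical section: evaluate replica
    targets outside the global lock, mutate shared state under it briefly."""

    def __init__(self):
        self.lock = threading.Lock()   # stands in for the global namespace lock
        self.placements = {}           # shared state: block -> chosen targets

    def _choose_targets(self, nodes):
        # Expensive placement policy (rack awareness, load, ...) runs
        # WITHOUT the lock, so concurrent RPC handlers are not blocked.
        return sorted(nodes)[:3]

    def place_block(self, block, nodes):
        targets = self._choose_targets(nodes)   # lock-free, slow part
        with self.lock:                         # short critical section
            self.placements[block] = targets
        return targets
```

The same restructuring applies whether the slow part is permission checking, quota accounting, or replica selection: less time under the lock means more handler threads make progress in parallel.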
GC tuning for the NameNode (heap sizes up to 230 GB) switched the collector to CMS, reduced Young GC frequency, and eliminated concurrent mode failures and promotion failures, keeping the long‑running service stable.
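Tuning of this kind is typically expressed as JVM flags in hadoop-env.sh. The sketch below shows the shape of such a configuration; the sizes and thresholds are illustrative examples, since the article does not publish 58.com's exact values:

```shell
# hadoop-env.sh -- illustrative CMS settings for a very large NameNode heap
# (sizes and thresholds are examples, not 58.com's published values)
export HADOOP_NAMENODE_OPTS="-Xms230g -Xmx230g \
  -Xmn20g \
  -XX:+UseConcMarkSweepGC \
  -XX:CMSInitiatingOccupancyFraction=70 \
  -XX:+UseCMSInitiatingOccupancyOnly \
  ${HADOOP_NAMENODE_OPTS}"
```

A fixed heap avoids resize pauses, a larger young generation lowers Young GC frequency, and starting CMS cycles early (at 70% old-gen occupancy) leaves headroom so promotions do not fail while a concurrent collection is still running.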
YARN scheduling was enhanced by separating lock scopes, profiling sorting logic to cut CPU usage, and introducing priority‑based queue isolation, raising container throughput to roughly 3,000 containers/s.
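Priority-based queue isolation can be sketched with a small heap-ordered scheduler; this is a toy model of the ordering idea, not YARN's actual CapacityScheduler code:

```python
import heapq
import itertools

class PriorityScheduler:
    """Toy scheduler: higher-priority submissions get containers first;
    a monotonically increasing counter breaks ties in FIFO order."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()

    def submit(self, app, priority):
        # heapq is a min-heap, so negate priority: larger value = served first.
        heapq.heappush(self._heap, (-priority, next(self._seq), app))

    def next_app(self):
        return heapq.heappop(self._heap)[2] if self._heap else None
```

Keeping the ordering structure cheap (O(log n) per operation) and sorting only what the next allocation needs is exactly the kind of change that lets container throughput climb.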
The compute engine migrated from Hive to SparkSQL via SparkThriftServer, adding multi‑tenant support, fixing over 50 compatibility issues, and improving driver memory management, HA, and resource isolation.
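Multi-tenant support on a shared SparkThriftServer essentially means mapping each connecting user to that tenant's queue and resource limits. A minimal sketch of such a routing table (the tenant names, limits, and `TenantRouter` class are invented for illustration):

```python
class TenantRouter:
    """Toy model: map a connecting user to a YARN queue and a driver-memory
    cap, so one tenant cannot starve a shared SQL service."""

    def __init__(self, tenants, default_queue="default", default_mem_gb=4):
        self.tenants = tenants            # user -> (queue, driver memory in GB)
        self.default = (default_queue, default_mem_gb)

    def route(self, user):
        queue, mem_gb = self.tenants.get(user, self.default)
        # In a real deployment these would become per-session settings,
        # e.g. the target YARN queue and a driver memory limit.
        return {"yarn_queue": queue, "driver_memory_gb": mem_gb}
```

Isolation then falls out of YARN's queue capacities: each tenant's queries land in its own queue, and runaway sessions are bounded by their memory cap.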
Cross‑data‑center migration leveraged HDFS decommissioning with targeted rack placement, increased max‑stream settings, and monitoring tools to ensure nodes were removed safely. Bandwidth consumption was cut by 50% through read/write locality policies, compression, and traffic shaping. The HDFS Balancer was extended to allow explicit source/destination nodes, direct block queries from DataNodes, and rate‑limited transfers, enabling PB‑scale rebalancing of 3,000 nodes within five months.
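Rate-limited transfers of this kind typically follow the token-bucket pattern: tokens accrue at the allowed bandwidth, and each block transfer spends tokens before it may proceed. A compact sketch of the mechanism (illustrative, not the extended Balancer's actual code):

```python
import time

class TokenBucket:
    """Toy rate limiter: block transfers consume tokens; tokens refill at
    `rate_bytes_per_s`, capping sustained cross-DC bandwidth."""

    def __init__(self, rate_bytes_per_s, capacity, now=None):
        self.rate = rate_bytes_per_s
        self.capacity = capacity
        self.tokens = capacity                           # start full
        self.last = time.monotonic() if now is None else now

    def try_send(self, nbytes, now=None):
        now = time.monotonic() if now is None else now
        # Refill proportionally to elapsed time, capped at bucket capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True        # transfer may proceed now
        return False           # caller should back off and retry
```

Shaping each DataNode-to-DataNode stream this way keeps the migration from saturating the inter-data-center links that live traffic also depends on.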
Future work focuses on adopting Hadoop 3.x features such as erasure coding and Ozone object storage, and exploring cloud‑native integration to share resources between online and offline workloads.