58.com Big Data Offline Computing Platform: Architecture, Scaling, Optimization, and Cross‑Data‑Center Migration
This article presents a comprehensive case study of 58.com’s massive Hadoop‑based offline computing platform, detailing its architecture, scaling challenges, performance‑tuning measures, YARN and SparkSQL upgrades, and the systematic cross‑data‑center migration of thousands of nodes and petabytes of data.
58.com operates an offline computing platform built on the Hadoop ecosystem, comprising over 4,000 servers, hundreds of petabytes of storage, and handling more than 400,000 daily compute tasks, serving core business lines such as commerce, real‑estate, and recruitment.
The Data Platform department provides five key capabilities: data ingestion via Flume and Kafka, offline computation using customized HDFS, YARN, MapReduce and Spark, real‑time computation through a Flink‑based stack called Wstream, multi‑dimensional analysis with Kylin (offline) and Druid (real‑time), and database services built on HBase, OpenTSDB, and JanusGraph.
Rapid cluster growth introduced bottlenecks, most notably the NameNode single-point bottleneck. The team adopted HDFS Federation to distribute metadata across multiple NameNodes, used ViewFileSystem to present a unified namespace to clients, and enforced directory-quota policies to keep maintenance costs low.
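A federated namespace is usually exposed to clients through ViewFileSystem mount points in core-site.xml. The fragment below is a minimal sketch; the nameservice names (`ns1`, `ns2`) and directory layout are illustrative, not 58.com's actual configuration:

```xml
<configuration>
  <!-- Clients see one unified viewfs:// namespace -->
  <property>
    <name>fs.defaultFS</name>
    <value>viewfs://cluster</value>
  </property>
  <!-- Top-level directories map to different federated NameNodes -->
  <property>
    <name>fs.viewfs.mounttable.cluster.link./user</name>
    <value>hdfs://ns1/user</value>
  </property>
  <property>
    <name>fs.viewfs.mounttable.cluster.link./hive</name>
    <value>hdfs://ns2/hive</value>
  </property>
</configuration>
```

Because clients resolve paths through the mount table, directories can later be rebalanced across NameNodes without changing application paths.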
Stability was improved by decomposing heavy RPC traffic: Hive scratch directories were balanced across NameNodes, YARN log aggregation was redirected to local storage, and resource-localization operations were distributed. The DataNode full BlockReport interval was relaxed from hourly to every ten hours, incremental BlockReports were batched, and dead-DataNode handling was added to avoid redundant block re-replication.
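The full BlockReport cadence is controlled by a standard HDFS property, so a ten-hour interval would look like the following in hdfs-site.xml (the value is in milliseconds; this shows the mechanism, not 58.com's exact production file):

```xml
<configuration>
  <!-- Full BlockReport every 10 hours instead of every hour -->
  <property>
    <name>dfs.blockreport.intervalMsec</name>
    <value>36000000</value> <!-- 10 h x 3600 s x 1000 ms -->
  </property>
</configuration>
```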
Core‑link optimizations targeted lock contention and processing overhead, refining the PermissionCheck, QuotaManager, ReplicationMonitor, and chooseTarget code paths to increase parallelism and shorten lock hold times.
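The common pattern behind these fixes is to do expensive work outside the global lock and hold it only for the final mutation. A toy Python sketch of that idea (the class and method names are illustrative, not actual NameNode code):

```python
import threading

class BlockPlacementSketch:
    """Toy model of shrinking a critical section: evaluate replica
    targets outside the global lock, mutate shared state under it briefly."""

    def __init__(self):
        self.lock = threading.Lock()   # stands in for the global namespace lock
        self.placements = {}           # shared state: block -> chosen targets

    def _choose_targets(self, nodes):
        # Expensive placement policy (rack awareness, load, ...) runs
        # WITHOUT the lock, so concurrent RPC handlers are not blocked.
        return sorted(nodes)[:3]

    def place_block(self, block, nodes):
        targets = self._choose_targets(nodes)   # lock-free, slow part
        with self.lock:                         # short critical section
            self.placements[block] = targets
        return targets
```

The same restructuring applies whether the slow part is permission checking, quota accounting, or replica selection: less time under the lock means more handler threads make progress in parallel.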
GC tuning for the NameNode (heap sizes up to 230 GB) switched the collector to CMS, reduced Young GC frequency, and eliminated concurrent mode failures and promotion failures, keeping the long‑running service stable.
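Tuning of this kind is typically expressed as JVM flags in hadoop-env.sh. The sketch below shows the shape of such a configuration; the sizes and thresholds are illustrative examples, since the article does not publish 58.com's exact values:

```shell
# hadoop-env.sh -- illustrative CMS settings for a very large NameNode heap
# (sizes and thresholds are examples, not 58.com's published values)
export HADOOP_NAMENODE_OPTS="-Xms230g -Xmx230g \
  -Xmn20g \
  -XX:+UseConcMarkSweepGC \
  -XX:CMSInitiatingOccupancyFraction=70 \
  -XX:+UseCMSInitiatingOccupancyOnly \
  ${HADOOP_NAMENODE_OPTS}"
```

A fixed heap avoids resize pauses, a larger young generation lowers Young GC frequency, and starting CMS cycles early (at 70% old-gen occupancy) leaves headroom so promotions do not fail while a concurrent collection is still running.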
YARN scheduling was enhanced by separating lock scopes, profiling sorting logic to cut CPU usage, and introducing priority‑based queue isolation, raising container throughput to roughly 3,000 containers/s.
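Priority-based queue isolation can be sketched with a small heap-ordered scheduler; this is a toy model of the ordering idea, not YARN's actual CapacityScheduler code:

```python
import heapq
import itertools

class PriorityScheduler:
    """Toy scheduler: higher-priority submissions get containers first;
    a monotonically increasing counter breaks ties in FIFO order."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()

    def submit(self, app, priority):
        # heapq is a min-heap, so negate priority: larger value = served first.
        heapq.heappush(self._heap, (-priority, next(self._seq), app))

    def next_app(self):
        return heapq.heappop(self._heap)[2] if self._heap else None
```

Keeping the ordering structure cheap (O(log n) per operation) and sorting only what the next allocation needs is exactly the kind of change that lets container throughput climb.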
The compute engine migrated from Hive to SparkSQL via SparkThriftServer, adding multi‑tenant support, fixing over 50 compatibility issues, and improving driver memory management, HA, and resource isolation.
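Multi-tenant support on a shared SparkThriftServer essentially means mapping each connecting user to that tenant's queue and resource limits. A minimal sketch of such a routing table (the tenant names, limits, and `TenantRouter` class are invented for illustration):

```python
class TenantRouter:
    """Toy model: map a connecting user to a YARN queue and a driver-memory
    cap, so one tenant cannot starve a shared SQL service."""

    def __init__(self, tenants, default_queue="default", default_mem_gb=4):
        self.tenants = tenants            # user -> (queue, driver memory in GB)
        self.default = (default_queue, default_mem_gb)

    def route(self, user):
        queue, mem_gb = self.tenants.get(user, self.default)
        # In a real deployment these would become per-session settings,
        # e.g. the target YARN queue and a driver memory limit.
        return {"yarn_queue": queue, "driver_memory_gb": mem_gb}
```

Isolation then falls out of YARN's queue capacities: each tenant's queries land in its own queue, and runaway sessions are bounded by their memory cap.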
Cross‑data‑center migration leveraged HDFS decommissioning with targeted rack placement, increased max‑stream settings, and monitoring tools to ensure nodes were removed safely. Bandwidth consumption was cut by 50% through read/write locality policies, compression, and traffic shaping. The HDFS Balancer was extended to allow explicit source/destination nodes, direct block queries from DataNodes, and rate‑limited transfers, enabling PB‑scale rebalancing of 3,000 nodes within five months.
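Rate-limited transfers of this kind typically follow the token-bucket pattern: tokens accrue at the allowed bandwidth, and each block transfer spends tokens before it may proceed. A compact sketch of the mechanism (illustrative, not the extended Balancer's actual code):

```python
import time

class TokenBucket:
    """Toy rate limiter: block transfers consume tokens; tokens refill at
    `rate_bytes_per_s`, capping sustained cross-DC bandwidth."""

    def __init__(self, rate_bytes_per_s, capacity, now=None):
        self.rate = rate_bytes_per_s
        self.capacity = capacity
        self.tokens = capacity                           # start full
        self.last = time.monotonic() if now is None else now

    def try_send(self, nbytes, now=None):
        now = time.monotonic() if now is None else now
        # Refill proportionally to elapsed time, capped at bucket capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True        # transfer may proceed now
        return False           # caller should back off and retry
```

Shaping each DataNode-to-DataNode stream this way keeps the migration from saturating the inter-data-center links that live traffic also depends on.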
Future work focuses on adopting Hadoop 3.x features such as erasure coding and Ozone object storage, and exploring cloud‑native integration to share resources between online and offline workloads.