
Scaling and Optimizing 58.com’s Hadoop‑Based Offline Computing Platform: Architecture, Challenges, and Solutions

This article details how 58.com built a massive Hadoop‑based offline computing platform with over 4,000 servers and hundreds of petabytes of storage, addressing scaling, stability, GC, YARN scheduling, SparkSQL migration, storage operations, and a large‑scale cross‑datacenter migration.

DataFunTalk

58.com’s offline computing platform is built on the Hadoop ecosystem, operating a single cluster of more than 4,000 servers, storing hundreds of petabytes, and handling up to 400,000 daily compute tasks, which presents significant scaling and stability challenges.

The Data Platform department provides core capabilities: data ingestion via Flume and Kafka, offline computing using HDFS, YARN, MapReduce and Spark, real‑time processing with Flink (Wstream), multi‑dimensional analysis using Kylin (offline) and Druid (real‑time), and database services based on HBase, OpenTSDB, and JanusGraph.

To expand the cluster, HDFS Federation was adopted to eliminate the NameNode single‑point bottleneck, and ViewFileSystem was used to present a unified namespace while controlling directory depth to keep maintenance costs low.
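The unified namespace can be expressed as a ViewFileSystem mount table in core-site.xml. A minimal sketch follows; the mount-table name, namespace IDs (ns1, ns2), and mount points are illustrative, not 58.com's actual layout:

```xml
<!-- core-site.xml: ViewFileSystem mount table presenting two federated
     namespaces as one logical filesystem (names are illustrative) -->
<property>
  <name>fs.defaultFS</name>
  <value>viewfs://cluster58</value>
</property>
<property>
  <name>fs.viewfs.mounttable.cluster58.link./user</name>
  <value>hdfs://ns1/user</value>
</property>
<property>
  <name>fs.viewfs.mounttable.cluster58.link./warehouse</name>
  <value>hdfs://ns2/warehouse</value>
</property>
```

Clients see a single `viewfs://cluster58` root, while top-level directories resolve to different NameNodes; keeping mount points shallow, as the article notes, keeps this table small and maintainable.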

Stability issues such as RPC overload were mitigated by decomposing heavy workloads (e.g., Hive scratch directories, YARN log aggregation, resource localization) and by optimizing DataNode full and incremental block reports, as well as the handling of unresponsive ("liveless") DataNodes.
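Block-report tuning of this kind is typically done through a few hdfs-site.xml knobs. A sketch with illustrative values, not 58.com's production settings:

```xml
<!-- hdfs-site.xml: knobs commonly tuned to ease NameNode RPC pressure;
     values are illustrative -->
<property>
  <name>dfs.namenode.handler.count</name>
  <value>200</value>  <!-- more RPC handler threads on the NameNode -->
</property>
<property>
  <name>dfs.blockreport.intervalMsec</name>
  <value>21600000</value>  <!-- full block reports every 6 hours -->
</property>
<property>
  <name>dfs.blockreport.split.threshold</name>
  <value>1000000</value>  <!-- split reports per volume on dense DataNodes -->
</property>
```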

Core-path optimizations reduced lock hold times and improved efficiency in the PermissionCheck, QuotaManager, ReplicationMonitor, and chooseTarget code paths.

For inter‑NameNode load balancing, a FastCopy tool with HardLink was introduced, allowing data to be copied without full network transfer.
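The core idea behind FastCopy with HardLink can be sketched in a few lines: when the source and destination ultimately live on the same physical filesystem, a hard link makes the data appear in both places without moving any bytes. This is an illustrative local-filesystem analogy, not 58.com's actual tool (which applies the trick to DataNode block files shared across federated namespaces):

```python
# Sketch of the FastCopy-with-HardLink idea: link instead of copy when the
# source and target share a filesystem; fall back to a byte copy otherwise.
# Paths below are hypothetical demo files, not HDFS block files.
import os
import shutil
import tempfile

def fast_copy(src: str, dst: str) -> None:
    """'Copy' a file by hard link when possible, else do a full byte copy."""
    try:
        os.link(src, dst)          # same filesystem: no data transferred
    except OSError:
        shutil.copyfile(src, dst)  # cross-filesystem: full copy required

# demo
tmp = tempfile.mkdtemp()
src = os.path.join(tmp, "blk_001")
dst = os.path.join(tmp, "blk_001.copy")
with open(src, "w") as f:
    f.write("block payload")
fast_copy(src, dst)
```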

Garbage-collection tuning used the CMS collector, reduced young-generation GC frequency, and avoided concurrent-mode failures and promotion failures, keeping the NameNode heap under control.
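Such tuning is usually expressed as JVM flags in hadoop-env.sh. The sketch below shows the standard CMS flags; the heap and young-generation sizes are placeholders that must be sized to the actual NameNode workload, not values from the article:

```bash
# hadoop-env.sh: illustrative CMS settings for a large NameNode heap.
# -Xmn controls Young GC frequency; the occupancy fraction triggers CMS
# early enough to avoid concurrent-mode and promotion failures.
export HADOOP_NAMENODE_OPTS="-Xms180g -Xmx180g -Xmn20g \
  -XX:+UseConcMarkSweepGC \
  -XX:+UseCMSInitiatingOccupancyOnly \
  -XX:CMSInitiatingOccupancyFraction=70 \
  -XX:+CMSParallelRemarkEnabled \
  $HADOOP_NAMENODE_OPTS"
```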

YARN scheduling was refined for both stability (filtering NM OOM events, app state filtering, DNS caching) and compute stability (label isolation, priority queues, container cgroup isolation) as well as overload protection (limits on users, apps, containers).
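Overload-protection limits like these are normally set in capacity-scheduler.xml. A hedged sketch with illustrative values and the default queue standing in for a real queue name:

```xml
<!-- capacity-scheduler.xml: illustrative overload-protection limits -->
<property>
  <name>yarn.scheduler.capacity.maximum-applications</name>
  <value>10000</value>  <!-- cap on running + pending apps cluster-wide -->
</property>
<property>
  <name>yarn.scheduler.capacity.root.default.maximum-am-resource-percent</name>
  <value>0.2</value>  <!-- bound resources consumed by ApplicationMasters -->
</property>
<property>
  <name>yarn.scheduler.capacity.root.default.user-limit-factor</name>
  <value>1</value>  <!-- keep a single user from monopolizing the queue -->
</property>
```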

The platform migrated from Hive to SparkSQL by deploying a multi‑tenant SparkThriftServer, ensuring compatibility (UDFs, syntax, data quality, parameters) and improving query performance.
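Parameter compatibility for a shared SparkThriftServer is typically handled through spark-defaults.conf. The settings below are a plausible sketch of such tuning, not 58.com's published configuration:

```properties
# spark-defaults.conf: illustrative settings for a shared SparkThriftServer
spark.driver.memory                8g
spark.sql.shuffle.partitions       400
spark.dynamicAllocation.enabled    true
spark.shuffle.service.enabled      true
```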

SparkThriftServer stability work included driver memory management, HA deployment, retry strategies, and reducing HDFS pressure from Spark jobs.

Operational management introduced storage‑side services such as quota control, permission management, alerting, cost optimization, and large‑scale data compression (GZIP/LZO) that saved over 100 PB of space.
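The trade-off behind the GZIP/LZO choice can be seen with a small experiment: fast codecs favor hot data, while maximum-ratio codecs suit cold data. The sketch below uses gzip compression levels as a stand-in, since LZO is not in the Python standard library, and the sample log line is made up:

```python
# Illustrative comparison of fast vs. maximum-ratio compression on
# repetitive log-style data, analogous to the platform's LZO/GZIP split.
import gzip

raw = ("2023-01-01\tuser_123\tpage_view\t58.com/housing\n" * 10_000).encode()

fast = gzip.compress(raw, compresslevel=1)  # speed-oriented, LZO-like role
cold = gzip.compress(raw, compresslevel=9)  # ratio-oriented, for cold data

print(f"raw={len(raw)}B fast={len(fast)}B max={len(cold)}B "
      f"ratio={len(cold) / len(raw):.3f}")
```

On repetitive data like this the ratio is dramatic, which is how recompressing cold datasets can reclaim space on the order the article describes.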

A self‑service analysis platform built on LinkedIn’s Dr‑elephant was extended to support multiple Spark versions and to provide heuristics for MR placement, container sizing, and other performance insights.

The cross‑datacenter migration moved 3,000 machines, 130 PB of data, and 400,000 tasks to a new site within five months, using enhanced HDFS decommission that allowed target‑datacenter specification, increased replication streams, and monitoring tools to avoid hangs and block loss.
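The stock decommission workflow that the enhanced version builds on is driven by an exclude file plus a handful of replication-throughput knobs in hdfs-site.xml; raising the stream limits is how "increased replication streams" is usually achieved. Values and the exclude-file path below are illustrative:

```xml
<!-- hdfs-site.xml: decommission-related settings; values illustrative -->
<property>
  <name>dfs.hosts.exclude</name>
  <value>/etc/hadoop/conf/dfs.exclude</value>  <!-- nodes being drained -->
</property>
<property>
  <name>dfs.namenode.replication.max-streams</name>
  <value>50</value>  <!-- more concurrent re-replication streams per node -->
</property>
<property>
  <name>dfs.namenode.replication.work.multiplier.per.iteration</name>
  <value>10</value>  <!-- schedule more replication work per heartbeat -->
</property>
```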

Network optimizations reduced cross-site bandwidth by 50% through topology-aware read/write policies, compression of Flume-to-HDFS streams, and throttling of large-scale jobs, while 1 Gbps links in the new datacenter were upgraded to 10 Gbps.
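Topology-aware placement in HDFS is driven by a rack-resolution script registered via `net.topology.script.file.name`: Hadoop passes host IPs as arguments and expects one `/datacenter/rack` path per line. The subnet-to-rack mapping below is entirely hypothetical:

```python
# Hypothetical rack-awareness script: maps IPs to /datacenter/rack paths
# so HDFS can prefer same-site replicas and avoid cross-site traffic.
import sys

SUBNET_TO_RACK = {          # made-up subnet prefixes for illustration
    "10.1": "/dc-old/rack-1",
    "10.2": "/dc-old/rack-2",
    "10.9": "/dc-new/rack-1",
}

def resolve(ip: str) -> str:
    """Resolve an IP to its rack path, defaulting unknown hosts."""
    prefix = ".".join(ip.split(".")[:2])
    return SUBNET_TO_RACK.get(prefix, "/default-rack")

if __name__ == "__main__":
    for host in sys.argv[1:]:
        print(resolve(host))
```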

Disk imbalance in the new site was addressed with a customized HDFS Balance tool that supports source/destination specification, direct block queries from DataNodes, rate limiting, and selective node participation, achieving PB‑level balancing within five working days.
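The rate limiting such a balancer applies to block moves is commonly implemented as a token bucket: transfers draw "budget" that refills at the configured rate. A minimal sketch, with illustrative rates and chunk sizes rather than anything from the customized tool:

```python
# Minimal token-bucket throttle of the kind a balancer can use to cap
# block-move bandwidth; rate and burst numbers are illustrative.
import time

class TokenBucket:
    def __init__(self, rate_bytes_per_sec: float, burst: float):
        self.rate = rate_bytes_per_sec   # sustained refill rate
        self.capacity = burst            # short-term burst allowance
        self.tokens = burst
        self.last = time.monotonic()

    def acquire(self, nbytes: float) -> None:
        """Block until nbytes of transfer budget is available."""
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= nbytes:
                self.tokens -= nbytes
                return
            time.sleep((nbytes - self.tokens) / self.rate)

# demo: 16 MiB of "block moves" against a 100 MiB/s cap with an 8 MiB burst
bucket = TokenBucket(rate_bytes_per_sec=100 * 2**20, burst=8 * 2**20)
start = time.monotonic()
for _ in range(4):
    bucket.acquire(4 * 2**20)
elapsed = time.monotonic() - start
```

The first chunks ride the burst allowance; beyond that, transfers are paced at the sustained rate, which is what keeps PB-scale balancing from saturating production links.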

Compute migration involved deploying a new YARN cluster, migrating queues gradually, performing gray‑box testing, and ensuring no service disruption.

Future plans focus on upgrading to Hadoop 3.x for erasure coding and Ozone object storage, and exploring cloud‑private‑cloud integration to enable workload sharing between online and offline services.

Tags: data migration, Big Data, SparkSQL, YARN, Hadoop, Cluster Scaling, Offline Computing
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
