Why Suning.com Sticks with Hadoop: Insights into China’s Big Data Platform Choices
Amid declining Hadoop usage reports, Suning.com’s 2018‑2020 big‑data platform case study reveals why the retailer still relies on Hadoop’s mature ecosystem, how it integrates HDFS, HBase, YARN, Hive, Spark, Flink and emerging tools, and what future resource‑management plans it envisions.
Background
In 2018 the KDnuggets data‑science and machine‑learning tools survey reported a 35% drop in Hadoop usage among respondents, mainly from North America and Europe. The article examines whether this trend threatens Hadoop’s de‑facto status in China, where data volumes are larger and Hadoop adoption remains high.
Suning.com big‑data platform
Suning.com (a major B2C e‑commerce platform) built its data platform on Hadoop starting in 2013. The selection criteria were:
Maturity and stability: Hadoop had been production‑ready for years.
Cost‑effectiveness: Open‑source licensing eliminates software fees; community support (≈7.3 K GitHub stars) reduces maintenance effort.
Core Hadoop‑based architecture
The platform uses the following components, each with a specific role:
HDFS – distributed file system for petabyte‑scale data storage.
HBase – column‑family store providing real‑time read/write access to tables.
YARN – unified resource manager that schedules both batch and streaming jobs.
Hive / SparkSQL – primary engines for offline SQL analytics.
MapReduce and Spark – supplemental compute for workloads that cannot be expressed in SQL.
SparkStreaming – near‑real‑time processing (micro‑batch model).
SparkMLlib – machine‑learning library that underpins Suning’s ML platform.
Limitations of the classic Hadoop stack
While Hadoop excels at massive storage and batch analytics, it is not optimized for:
sub‑second OLAP queries (requires specialized real‑time engines),
millisecond‑level streaming (micro‑batch model introduces latency).
No single platform currently satisfies both high‑throughput batch and ultra‑low‑latency workloads.
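The latency cost of the micro-batch model can be made concrete with a small Python sketch. It assumes, for illustration only, that an event becomes visible to downstream logic when the batch containing it closes, and it ignores scheduling and processing time. The 1-second batch interval and the 100 ms arrival spacing are hypothetical values, not Suning's settings.

```python
def micro_batch_delays(arrival_times_ms, batch_interval_ms):
    """Return how long each event waits for its micro-batch to close.

    Simplified model (an assumption): batches close on fixed interval
    boundaries and processing time is treated as zero.
    """
    delays = []
    for t in arrival_times_ms:
        # The batch holding this event closes at the next interval boundary.
        batch_close = (t // batch_interval_ms + 1) * batch_interval_ms
        delays.append(batch_close - t)
    return delays


# Hypothetical workload: one event every 100 ms for 5 seconds.
arrivals = list(range(0, 5000, 100))
delays = micro_batch_delays(arrivals, batch_interval_ms=1000)

min_delay = min(delays)
max_delay = max(delays)
mean_delay = sum(delays) / len(delays)
print(min_delay, max_delay, mean_delay)
```

Under these assumptions the wait is bounded by the batch interval and averages roughly half of it, which is why a micro-batch engine struggles to reach millisecond-level latency while a record-at-a-time engine does not pay this waiting cost.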
Component‑level competition
Suning observes intense competition among ecosystem components:
Spark – in‑memory compute engine; SparkSQL has largely replaced MapReduce for most workloads.
Flink – native streaming framework with unbounded data handling, event‑time semantics, exactly‑once guarantees, and asynchronous checkpointing.
Containers – Docker Swarm and Kubernetes are emerging as alternatives to YARN/Mesos for resource orchestration.
Other storage/KV options – Redis and other in‑memory stores are used for specific caching scenarios, but HBase remains dominant for GB‑TB scale key‑value data.
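The event-time semantics mentioned for Flink in the list above can be illustrated with a plain-Python sketch. This is not Flink's API: the function name, the watermark rule (largest event time seen minus a fixed allowed lateness), and the sample events are all assumptions chosen to show the idea that records are grouped by when they happened rather than when they arrived.

```python
from collections import defaultdict


def event_time_windows(events, window_ms, max_lateness_ms):
    """Count events per tumbling event-time window (simplified model).

    events: iterable of (event_time_ms, payload) in ARRIVAL order.
    A window is emitted once the watermark passes its end; events that
    arrive for an already-closed window are recorded as dropped.
    """
    open_windows = defaultdict(int)
    emitted, dropped = [], []
    watermark = float("-inf")
    for event_time, _payload in events:
        start = (event_time // window_ms) * window_ms
        if start + window_ms <= watermark:
            dropped.append(event_time)  # too late: window already closed
        else:
            open_windows[start] += 1
        # Assumed watermark rule: trail the max event time by a fixed bound.
        watermark = max(watermark, event_time - max_lateness_ms)
        ready = sorted(w for w in open_windows if w + window_ms <= watermark)
        for w in ready:
            emitted.append((w, open_windows.pop(w)))
    return emitted, dict(open_windows), dropped


# Hypothetical out-of-order stream (timestamps in ms).
stream = [(1000, "a"), (3000, "b"), (2000, "c"),
          (12000, "d"), (4000, "e"), (16000, "f")]
emitted, still_open, dropped = event_time_windows(
    stream, window_ms=5000, max_lateness_ms=2000)
print(emitted, still_open, dropped)
```

In this run the event stamped 2000 arrives after the one stamped 3000 yet is still counted in the first window, while the event stamped 4000 arrives after that window has been closed by the watermark and is dropped. A production engine offers more options for late data than this sketch does.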
Current strategic direction
Suning plans to retain Hadoop as the foundational layer while augmenting it with specialized tools:
Continue using Spark as the primary compute engine.
Store data on HDFS, object stores such as S3, or distributed object systems like Ceph.
Launch a unified resource‑management project in the second half of the year that will abstract batch, streaming, and container workloads (YARN, Mesos, Kubernetes). The project is expected to reduce machine‑hardware costs by roughly 30%.
Adopt Flink 1.5 (≈3.7 K GitHub stars) for native stream processing, aiming to replace the legacy Storm stack.
Introduce real‑time OLAP engines such as Druid and search‑oriented stores like Elasticsearch to cover scenarios where Hadoop alone is insufficient.
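The source does not explain how unified resource management would lower hardware costs. One plausible mechanism, offered here only as an assumption, is that separately managed clusters must each be sized for their own peak, whereas a shared pool only needs to cover the peak of the combined demand. The Python sketch below uses entirely hypothetical demand profiles (machines needed per six-hour slot) to show the arithmetic; it does not reproduce Suning's figures.

```python
def machines_needed(profiles):
    """Compare sizing for isolated clusters versus one shared pool.

    profiles: dict mapping workload name -> list of machine demand per
    time slot (all lists the same length). Hypothetical inputs only.
    """
    # Isolated clusters: each one must cover its own peak.
    separate = sum(max(demand) for demand in profiles.values())
    # Shared pool: cover the peak of the summed demand per slot.
    combined = [sum(slot) for slot in zip(*profiles.values())]
    shared = max(combined)
    return separate, shared


# Hypothetical demand: batch peaks at night, the others during the day.
profiles = {
    "batch": [80, 60, 20, 30],
    "streaming": [20, 30, 50, 40],
    "containers": [10, 20, 40, 30],
}
separate, shared = machines_needed(profiles)
saving = 1 - shared / separate
print(separate, shared, round(saving, 2))
```

The saving in such a model depends entirely on how much the workloads' peaks are offset from one another; if all workloads peak at the same time, pooling yields no reduction.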
Flink adoption details
Flink 1.5 adds:
SQL and Table API enhancements for unified batch/stream programming.
Improved network stack for lower latency.
Full support for event‑time processing and exactly‑once semantics.
Flink releases follow a roughly five‑month cadence, and the project is backed by an active community (≈3.7 K stars on GitHub).
Interpretation of external reports
Suning reads Gartner’s “Hadoop is dying” statement as referring narrowly to the original HDFS + MapReduce stack. MapReduce usage is declining in favor of Spark and Flink, but the broader Hadoop ecosystem (storage, resource management, and other mature components) remains essential for large‑scale data processing in China.
Conclusion
Suning does not intend a disruptive overhaul of its data platform. Hadoop will stay at the core, complemented by Spark, Flink, Druid, Elasticsearch, and a forthcoming unified resource‑management layer. This hybrid approach balances the proven stability of Hadoop with the performance advantages of newer compute and real‑time analytics frameworks.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
