How Big Data Tech Evolved: Lessons from Alibaba, JD, and Didi

This article traces the evolution of big data technologies from early concepts and Google research papers through the rise of Hadoop, examines the platform transformations of Alibaba, JD.com, and Didi, and offers practical stack‑selection guidance for medium‑ and small‑scale enterprises.

Open Source Linux

Dec 3, 2021

01 Overview of Big Data Technology Evolution

The term "big data" dates to the 1990s. Google's three seminal papers on GFS (2003), MapReduce (2004), and Bigtable (2006) laid the technical foundation, and the emergence of Hadoop and similar systems sparked a decade-long boom. According to Gartner's Hype Cycle, big data entered the cycle in 2011 and peaked in 2014; it has since matured into a stable, widely adopted technology that delivers tangible value across industries.

Key observations from major internet companies’ big‑data stacks:

Data volume will keep expanding, and value extraction will deepen. The rise of IoT and 5G will generate massive upstream data, creating more opportunities for downstream applications.

Real‑time requirements will increase. Offline batch processing is now routine; the next challenge is real‑time ingestion, computation, visualization, and online machine learning.

Underlying technologies are consolidating while applications proliferate. Components such as Spark and Kafka have become de facto standards, while a growing range of business products embed big-data capabilities.

Public and private clouds will coexist. In China, many enterprises still prefer on‑premise deployments due to security and competition concerns, but future stacks will move toward containerization, compute‑storage separation, and multi‑region deployment.

02 Case Study: Alibaba Big Data Technology Evolution

The Feitian (Apsara) big-data platform started in 2009 with Alibaba's "Moon-landing" plan and has been running for over ten years. MaxCompute (formerly ODPS) is a core component. Today, Feitian supports 99% of Alibaba's data storage and compute, processes over 600 PB per day, and underpins Alibaba's AI services.

Key milestones:

2009: Hadoop-based Yunti 1 ("Cloud Ladder 1") and ODPS-based Yunti 2 ("Cloud Ladder 2") were developed in parallel.

2013: Both platforms reached 5,000 servers; ODPS was chosen for its deeper control and performance.

2015: ODPS + BASE launched publicly; ODPS set four SortBenchmark records, sorting 100 TB in under 7 minutes.

2016‑present: Introduction of EMR, Stream Compute, PAI, and a unified programming platform that integrates batch, streaming, and AI workloads.
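
To put the 2015 sort figure in perspective, the short calculation below derives the minimum aggregate throughput implied by "100 TB in under 7 minutes". It is only a back-of-the-envelope sketch: it assumes decimal units (1 TB = 10^12 bytes) and treats 7 minutes as an upper bound on elapsed time. The function name is illustrative and not part of any Alibaba tooling.

```python
# Back-of-the-envelope throughput implied by "100 TB in under 7 minutes".
# Assumptions: decimal units (1 TB = 10**12 bytes); 7 minutes is an upper bound.

TB = 10**12
GB = 10**9


def min_throughput_gb_per_s(data_tb: float, max_minutes: float) -> float:
    """Minimum aggregate throughput (GB/s) needed to process data_tb
    terabytes within max_minutes minutes."""
    seconds = max_minutes * 60
    return data_tb * TB / seconds / GB


sort_rate = min_throughput_gb_per_s(100, 7)
print(f"at least {sort_rate:.0f} GB/s aggregate")
```

Under these assumptions the cluster as a whole sustained roughly 238 GB/s or more for the duration of the sort.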

Core advantages of the Feitian platform:

Extreme compute‑cost optimization.

Comprehensive enterprise data governance.

Seamless integration of big data and AI.

03 Case Study: JD.com Big Data Technology Evolution

JD.com launched its big-data initiative in 2010, establishing a dedicated data department. The stack evolved from a traditional data warehouse into a Hadoop-based ecosystem that now includes Kubernetes, Spark, Hive, Alluxio, Presto, HBase, Storm, Flink, and Kafka.

Current scale: >40,000 servers, single‑cluster size >7,000 nodes, data volume >800 PB, daily growth >1 PB, >1 million jobs per day, and >900 million tables. Offline processing exceeds 30 PB per day; real‑time streams handle nearly a trillion rows daily.
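
Daily totals at this scale are easier to reason about as per-second rates. The sketch below converts the quoted figures, assuming decimal units (1 PB = 10^15 bytes), load spread evenly over 24 hours, and the quoted bounds (">" and "nearly") taken at face value. Real traffic is bursty, so peak rates would be higher than these averages.

```python
# Convert JD.com's quoted daily figures into average per-second rates.
# Assumptions: 1 PB = 10**15 bytes, load spread evenly across the day,
# and the quoted bounds taken at face value.

SECONDS_PER_DAY = 24 * 60 * 60


def per_second(daily_total: float) -> float:
    """Average per-second rate for a quantity quoted per day."""
    return daily_total / SECONDS_PER_DAY


rows_per_s = per_second(1e12)                # "nearly a trillion rows daily"
jobs_per_s = per_second(1e6)                 # ">1 million jobs per day"
offline_gb_per_s = per_second(30e15) / 1e9   # ">30 PB per day" offline
daily_growth_pct = 1 / 800 * 100             # ">1 PB" growth on ">800 PB" stored

print(f"{rows_per_s / 1e6:.1f} million rows/s")
print(f"{jobs_per_s:.1f} jobs/s")
print(f"{offline_gb_per_s:.0f} GB/s offline")
print(f"{daily_growth_pct:.3f}% growth per day")
```

On these assumptions the real-time streams average about 11.6 million rows per second, and offline processing averages about 347 GB/s.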

Key components:

Data acquisition: JD Data Express (custom framework) supports both batch and real‑time ingestion.

Storage: JDHDFS (enhanced HDFS), JDHBase (enhanced HBase), and a hot‑cold data management layer.

Offline compute: JDHive, JDSpark, and Adhoc query services (Presto/Kylin) on YARN, with Alluxio caching.

Real‑time compute: JDQ (Kafka‑based bus), JD Real‑time Compute (Storm, Spark Streaming, Flink).

Machine learning: A layered platform (infrastructure, tools, scheduling, algorithms, API).

Scheduling & monitoring: Custom distributed scheduler with high‑availability nodes; monitoring built on Prometheus with extensions.
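
The storage item above mentions a hot-cold data management layer. The source gives no implementation details, so the following is only a minimal sketch of the general idea: classify each dataset by how recently it was accessed and route it to a storage tier. The thresholds, tier names, and dataset names are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical thresholds; real systems tune these per workload.
HOT_WINDOW = timedelta(days=7)
WARM_WINDOW = timedelta(days=90)


@dataclass
class Dataset:
    name: str
    last_access: datetime


def choose_tier(ds: Dataset, now: datetime) -> str:
    """Classify a dataset as hot, warm, or cold by time since last access."""
    age = now - ds.last_access
    if age <= HOT_WINDOW:
        return "hot"    # e.g. fast storage with full replication
    if age <= WARM_WINDOW:
        return "warm"   # e.g. standard disks
    return "cold"       # e.g. dense archival storage

now = datetime(2021, 12, 3)
catalog = [  # hypothetical datasets
    Dataset("orders_daily", now - timedelta(days=1)),
    Dataset("clickstream_q2", now - timedelta(days=45)),
    Dataset("logs_2018", now - timedelta(days=1100)),
]
plan = {ds.name: choose_tier(ds, now) for ds in catalog}
print(plan)
```

A production layer would also weigh access frequency, dataset size, and migration cost; last-access age is simply the easiest signal to illustrate.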

04 Case Study: Didi Big Data Technology Evolution

Didi’s big‑data journey comprises three phases: self‑built small clusters, centralized platform clusters, and finally a SQL‑centric architecture.

The offline platform is built on Hadoop 2 (HDFS, YARN, MapReduce) plus Spark and Hive, with a custom scheduler and a visual SQL editor for developers.

The real-time platform (since 2017) uses an internally developed Spark Streaming engine on YARN and offers a StreamSQL IDE, monitoring, diagnostics, lineage, and task control.
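
A streaming SQL layer typically lets developers express windowed aggregations declaratively. The sketch below shows, in plain Python, what a tumbling-window count computes. It is not Didi's engine or StreamSQL syntax; the event keys and timestamps are hypothetical, and it ignores late data, watermarks, and fault-tolerant state, which real engines must handle.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Event = Tuple[int, str]  # (event time in seconds, key); both hypothetical


def tumbling_window_counts(
    events: Iterable[Event], window_seconds: int
) -> Dict[Tuple[int, str], int]:
    """Count events per key in fixed, non-overlapping time windows.

    Each result key is (window start time, event key).
    """
    counts: Dict[Tuple[int, str], int] = defaultdict(int)
    for ts, key in events:
        window_start = ts - ts % window_seconds  # align to window boundary
        counts[(window_start, key)] += 1
    return dict(counts)


events = [(0, "order"), (12, "order"), (59, "cancel"), (60, "order"), (125, "order")]
result = tumbling_window_counts(events, window_seconds=60)
for (start, key), n in sorted(result.items()):
    print(f"[{start}, {start + 60}) {key}: {n}")
```

In SQL terms this corresponds roughly to grouping by key and by a 60-second tumbling window, then counting rows in each group.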

Didi relies heavily on HBase (with custom extensions) and uses Phoenix as a SQL layer on top of it.

05 Big Data Platform Stack Selection for SMEs

5.1 Medium‑Scale Enterprises (≈1,000 engineers, ~100 dedicated big‑data staff)

Start with an open‑source Hadoop distribution as the base platform. After gaining operational experience, add custom enhancements to fit business needs and reduce maintenance overhead. Build a unified data platform early to avoid siloed solutions and later integration costs.

Data ingestion: Use proven open‑source tools such as Flume or StreamSets; consider Apache NiFi for experimental use.

Storage: HDFS for batch data, HBase for real‑time data; Kudu can serve as a real‑time warehouse when needed.

Offline compute: Run Hive on Spark rather than on MapReduce; use Spark SQL as the primary engine; consider Impala or Kylin for heavy BI workloads.

Real-time compute: Spark Streaming and Flink are the de facto standards; avoid Storm; treat Kafka as the required messaging layer.

Machine learning: Spark MLlib for common algorithms; custom implementations for specialized models; internal ML platforms are common.

Scheduling & orchestration: Evaluate DolphinScheduler or build a custom solution.
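
For teams weighing a custom scheduler, the core of any workflow orchestrator is running jobs in dependency order. The sketch below illustrates that idea with Python's standard-library graphlib module (Python 3.9+). It is not how DolphinScheduler works internally, and the job names and pipeline shape are hypothetical; retries, distribution, and high availability are deliberately left out.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical nightly pipeline: each job maps to the set of jobs it depends on.
dependencies = {
    "ingest_orders": set(),
    "ingest_users": set(),
    "clean_orders": {"ingest_orders"},
    "join_orders_users": {"clean_orders", "ingest_users"},
    "daily_report": {"join_orders_users"},
}


def run_pipeline(deps):
    """Run jobs in dependency order and return the execution order.

    Jobs returned together by get_ready() have no unmet dependencies and
    could be dispatched in parallel by a real scheduler.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    order = []
    while sorter.is_active():
        for job in sorted(sorter.get_ready()):  # sorted only for stable output
            order.append(job)   # a real scheduler would submit the job here
            sorter.done(job)    # mark complete so dependents become ready
    return order


order = run_pipeline(dependencies)
print(order)
```

Most of the effort in a production scheduler goes into what this sketch omits: failure handling, backfills, resource limits, and operator tooling. That gap is a reasonable argument for evaluating an existing scheduler before building one.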

5.2 Small-Scale Enterprises (≈100 engineers, a few dozen big-data staff)

Prefer stable, managed cloud big‑data services or turnkey Hadoop distributions (e.g., CDH, FusionInsight) rather than extensive custom development. Adopt the core stack of Hadoop + Hive + HBase + Spark/Flink + Kafka, and rely on cloud‑native monitoring and CI/CD pipelines.

Source: 大数据研习社
