Building JD's OLAP System: From Data Ingestion to Management and Future Plans
This article explains how JD.com designs and evolves its OLAP platform, covering data sources, ingestion, storage, real‑time and offline processing, key challenges such as timeliness, high throughput, consistency, and the solutions implemented to support massive e‑commerce analytics.
The presentation introduces JD's end‑to‑end OLAP architecture, starting from business requirements such as order data, user click/search behavior, advertising, and monitoring metrics.
Data Ingestion : JD integrates diverse sources (local files, HDFS, Kafka/MQ) and formats (CSV, JSON, AVRO, PARQUET, BINLOG). A unified import service abstracts these sources, allowing analysts to configure topics, target destinations, and schemas via a visual UI.
Timeliness : Real‑time clusters handle frequent updates, while offline clusters process batch jobs with lower latency requirements. Physical isolation of the two clusters prevents interference and optimises resource allocation.
Updates & Deletions : Updates are performed by overwriting full records (e.g., order status changes), and deletions are achieved by dropping partitions or using versioned data to mask old records.
High Throughput : JD deploys 10 GbE networks, SSDs for real‑time workloads, and HDDs for offline jobs, ensuring the system can ingest petabyte‑scale data daily.
Storage : A distributed column‑store with compression (e.g., Snappy) handles massive datasets, while multi‑replica strategies and RAID provide fault tolerance.
Consistency : Distributed coordination (Zookeeper) combined with local transaction mechanisms guarantees data consistency across nodes.
Query Performance : Data is partitioned (often by date) and sharded; pre‑aggregation, materialised views, and various indexes (hash, B‑tree, inverted) accelerate query speed.
Usability : The OLAP service supports JDBC/ODBC and a graphical query interface, enabling analysts without deep SQL knowledge to run queries directly.
QPS Optimisation : Partition caches, result caches, and increased replica counts improve query throughput, while scaling hardware further boosts performance.
Management : Monitoring, alerting, automated node black‑listing, and scripted node replacement reduce operational overhead and accelerate failure recovery.
Evolution Timeline :
1.0 – Simple order analytics using relational databases (Oracle/MySQL).
2.0 – Expansion to logistics, supply‑chain, and payment data; data volume grows to TB/PB, prompting a shift to Hive/Spark offline warehouses.
3.0 – Real‑time analytics with unified OLAP services built on Doris and ClickHouse, supporting both batch and streaming workloads.
Future Plans :
Platform optimisation: dynamic scaling for ClickHouse and intelligent operations automation.
Query speed: smarter caching and automated index creation based on query patterns.
Further integration of real‑time caching in Doris.
The Q&A section addresses engine choices (Druid, ClickHouse vs. Doris), selection criteria, and JD's current data‑ingestion pathways for ClickHouse.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
