Real‑Time Analytics with Doris for JD Customer Service: Architecture, Caching, and Optimization
This article describes how JD.com leverages the open‑source MPP analytical database Doris for real‑time and offline OLAP on customer‑service data, covering data ingestion pipelines, dual‑stream high‑availability design, dynamic partition management, multi‑level caching, monitoring with Prometheus‑Grafana, and performance optimizations applied during major sales events.
Doris is an open-source MPP analytical database designed to deliver sub-second query responses on datasets up to petabyte scale. It offers a simple distributed architecture, elastic scaling, and easy operations, and it has been adopted by large Chinese companies such as JD.com and Xiaomi.
The article focuses on JD.com’s customer‑service real‑time dashboard, explaining why traditional RDBMSs (MySQL, Oracle) and batch‑oriented systems (Hive, Kylin) cannot meet the latency requirements of massive, multi‑dimensional analytics, and how Doris, alongside Apache Druid and ClickHouse, fills this gap.
Data ingestion covers both real-time and offline paths: Kafka streams are consumed continuously via Routine Load, while offline data is imported via Broker Load (which reads files from HDFS) and Stream Load. The EasyOLAP Doris pipeline visualizes these flows.
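As a rough illustration of the real-time path, the sketch below assembles a Routine Load statement for a Kafka topic. The database, job, table, broker, and topic names are hypothetical, and the numeric values are placeholders rather than JD's settings. The property names follow the Doris Routine Load documentation as commonly described, so check them against the Doris version in use.

```python
def build_routine_load_sql(db, job, table, brokers, topic,
                           concurrency=3,
                           max_batch_interval_s=20,
                           max_batch_rows=300000,
                           max_batch_size_bytes=209715200):
    """Return a CREATE ROUTINE LOAD statement for a Kafka topic as a string."""
    # Batch thresholds: a task is committed when any one of them is reached.
    properties = {
        "desired_concurrent_number": concurrency,
        "max_batch_interval": max_batch_interval_s,
        "max_batch_rows": max_batch_rows,
        "max_batch_size": max_batch_size_bytes,
    }
    source = {"kafka_broker_list": brokers, "kafka_topic": topic}

    def render(pairs):
        # Render a dict as the quoted key/value list used in the statement.
        return ",\n".join(f'    "{k}" = "{v}"' for k, v in pairs.items())

    return (
        f"CREATE ROUTINE LOAD {db}.{job} ON {table}\n"
        f"PROPERTIES (\n{render(properties)}\n)\n"
        f"FROM KAFKA (\n{render(source)}\n);"
    )


# Hypothetical names, for illustration only.
sql = build_routine_load_sql(
    db="cs_db",
    job="chat_events_job",
    table="chat_events",
    brokers="kafka-1:9092,kafka-2:9092",
    topic="cs_chat_events",
)
print(sql)
```

Generating the statement from a function keeps the batch thresholds in one place, which makes them easier to adjust per job.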
Monitoring is built on Prometheus and Grafana; node_exporter collects host metrics, Doris exposes FE/BE metrics in Prometheus format, and a custom OLAP Exporter gathers Routine Load metrics to ensure data freshness.
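The article does not show the OLAP Exporter's code. As a minimal sketch of what such an exporter might emit, the function below renders Routine Load job status in the Prometheus text exposition format. The metric names, the `job_name` label, and the input dictionaries are assumptions made for illustration; a real exporter would obtain job state from Doris (for example from `SHOW ROUTINE LOAD`) and serve the text over HTTP.

```python
def render_routine_load_metrics(jobs):
    """Render job status as Prometheus text exposition format.

    jobs: list of dicts with keys 'name', 'state', and 'lag_rows'
    (a hypothetical shape chosen for this sketch).
    """
    lines = [
        "# HELP olap_routine_load_lag_rows Rows waiting in Kafka for this job.",
        "# TYPE olap_routine_load_lag_rows gauge",
    ]
    for job in jobs:
        lines.append(
            'olap_routine_load_lag_rows{job_name="%s"} %d'
            % (job["name"], job["lag_rows"])
        )

    lines += [
        "# HELP olap_routine_load_running 1 if the job state is RUNNING, else 0.",
        "# TYPE olap_routine_load_running gauge",
    ]
    for job in jobs:
        running = 1 if job["state"] == "RUNNING" else 0
        lines.append(
            'olap_routine_load_running{job_name="%s"} %d' % (job["name"], running)
        )
    return "\n".join(lines) + "\n"


# Hypothetical job snapshots.
text = render_routine_load_metrics([
    {"name": "chat_events_job", "state": "RUNNING", "lag_rows": 1200},
    {"name": "ticket_job", "state": "PAUSED", "lag_rows": 56000},
])
print(text)
```

With metrics in this shape, a Grafana panel or alert rule can flag a paused job or a growing backlog before dashboards go stale.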
For high‑availability during peak traffic, JD implements a dual‑cluster active‑standby design: if one cluster experiences jitter or latency, traffic can be switched to the backup cluster to maintain service stability.
Dynamic partition management was customized to retain partitions for important periods (e.g., 618, 11.11) that the community version would otherwise drop, reducing storage while preserving historical data.
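The retention rule can be sketched as a small function: a partition is dropped only if it is older than the retention window and does not fall inside a protected period. This illustrates the idea only and is not JD's implementation. The function name, the daily-partition assumption, and the dates are hypothetical.

```python
from datetime import date, timedelta


def partitions_to_drop(partition_days, today, retain_days, protected_ranges):
    """Return daily partitions that are safe to drop.

    partition_days: iterable of date objects, one per daily partition.
    protected_ranges: list of (start, end) date pairs, inclusive, that must
    be kept regardless of age (for example a promotion period).
    """
    cutoff = today - timedelta(days=retain_days)
    drop = []
    for day in sorted(partition_days):
        if day >= cutoff:
            continue  # still inside the normal retention window
        if any(start <= day <= end for start, end in protected_ranges):
            continue  # old, but part of a protected period
        drop.append(day)
    return drop


partitions = [
    date(2020, 11, 10),
    date(2020, 11, 11),
    date(2020, 11, 12),
    date(2020, 12, 20),
]
protected = [(date(2020, 11, 11), date(2020, 11, 11))]
dropped = partitions_to_drop(partitions, date(2021, 1, 10), 30, protected)
print(dropped)  # [datetime.date(2020, 11, 10), datetime.date(2020, 11, 12)]
```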
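Doris caching comprises three layers: Result Cache, SQL Cache, and Partition Cache. Result Cache stores whole query results. SQL Cache keys each entry on the SQL statement together with the partition IDs and versions of the queried table, so an entry stops matching once the underlying data changes. Partition Cache works at a finer granularity and caches results for read-only partitions, so a query that spans recent and historical partitions recomputes only the partitions still receiving data. This lets repeated dashboard queries reuse cached data and substantially lowers cluster load.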
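The keying schemes described here can be illustrated with a toy model. It is a simplified sketch and not Doris's internal implementation: `SqlCache` keys a whole result on the statement plus each partition's ID and version, while `query_by_partition` caches per-partition results for read-only partitions only. The `scan` callback and all names are hypothetical.

```python
class SqlCache:
    """Whole-result cache keyed on statement plus partition IDs and versions."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(sql, partition_versions):
        # partition_versions maps partition ID -> version; a new version
        # produces a different key, so stale entries are never returned.
        return (sql.strip(), tuple(sorted(partition_versions.items())))

    def get(self, sql, partition_versions):
        return self._store.get(self._key(sql, partition_versions))

    def put(self, sql, partition_versions, result):
        self._store[self._key(sql, partition_versions)] = result


def query_by_partition(cache, sql, partition_ids, read_only_ids, scan):
    """Answer a query per partition, reusing cached read-only partitions."""
    results = {}
    for pid in partition_ids:
        key = (sql.strip(), pid)
        if pid in read_only_ids and key in cache:
            results[pid] = cache[key]
            continue
        results[pid] = scan(pid)  # partitions still being written are rescanned
        if pid in read_only_ids:
            cache[key] = results[pid]
    return results


scanned = []


def scan(pid):
    # Stand-in for real partition computation; records which partitions ran.
    scanned.append(pid)
    return pid * 10


cache = {}
query = "SELECT COUNT(*) FROM chat_events"
query_by_partition(cache, query, [1, 2, 3], {1, 2}, scan)
query_by_partition(cache, query, [1, 2, 3], {1, 2}, scan)
print(scanned)  # [1, 2, 3, 3]
```

On the second run only partition 3, which is still being written, is rescanned. That is the source of the load reduction for queries that mostly cover historical partitions.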
Cache effectiveness is demonstrated during the 11.11 promotion: CPU usage dropped from near 100% without caching to 30‑40% with Result Cache enabled, confirming the benefit for high‑concurrency workloads.
Additional optimizations include import‑task tuning (adjusting batch size, time intervals, and data volume thresholds), enhanced monitoring dashboards that aggregate key metrics (CPU, backlog rows, TP99, QPS), and a suite of operational tools such as import sampling, big‑query analysis, downgrade/recovery, and cluster health inspection utilities.
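For readers unfamiliar with the dashboard metrics, TP99 is the latency below which 99% of requests complete, and QPS is the query count divided by the observation window. The sketch below computes both from raw samples using the nearest-rank percentile method. This is one common definition and may differ from the interpolation a given monitoring system applies; the function names are hypothetical.

```python
def tp99(latencies_ms):
    """Return the 99th-percentile latency using the nearest-rank method."""
    if not latencies_ms:
        raise ValueError("at least one latency sample is required")
    ordered = sorted(latencies_ms)
    # Integer ceiling of 0.99 * n, avoiding floating-point rounding.
    rank = -(-99 * len(ordered) // 100)
    return ordered[rank - 1]


def qps(query_count, window_seconds):
    """Return average queries per second over the window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return query_count / window_seconds


samples = list(range(1, 101))  # 1 ms .. 100 ms
print(tp99(samples))           # 99
print(qps(12000, 60))          # 200.0
```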
Planned future work includes introducing materialized views, using bitmap indexes for precise UV counting, analyzing queries from audit logs, and further improving table creation and rollup strategies to broaden OLAP adoption across JD's services.
JD Retail Technology
Official platform of JD Retail Technology, delivering insightful R&D news and a deep look into the lives and work of technologists.
