Scaling JD.com Customer Service with Doris OLAP: Architecture & Caching
JD.com's customer service team uses the open-source MPP database Doris to power real-time and offline OLAP dashboards. This article covers the data ingestion pipelines, full-link monitoring, dual-stream high-availability design, dynamic partition management, multi-layer caching strategies, and the performance optimizations applied during the 2020 11.11 shopping festival.
Introduction
Doris is an open-source MPP analytical database that delivers sub-second query responses on datasets exceeding 10 PB. Its simple distributed architecture scales elastically and is easy to operate, which has made it popular in the Chinese developer community and led to its adoption by large companies such as Meituan and Xiaomi.
JD.com Customer Service Business
The JD.com customer service platform monitors metrics like consultation volume, answer rate, and complaint count in real time. To support both high‑concurrency online queries and large‑scale offline analysis, the team needed a solution that could handle massive data volumes with low latency, which traditional RDBMSs (MySQL, Oracle) and batch‑oriented systems (Hive, Kylin) could not provide.
Easy OLAP Design
01 Data Ingestion Pipeline
Real‑time data originates from Kafka, while offline data resides in HDFS. Real‑time ingestion uses Doris’s Routine Load, and offline ingestion employs Broker Load and Stream Load.
02 Full‑Link Monitoring
The project uses Prometheus + Grafana. node_exporter collects host‑level metrics, Doris exposes FE/BE metrics in Prometheus format, and a custom OLAP Exporter gathers Routine Load metrics to detect data‑flow delays.
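The article does not show how the custom OLAP Exporter is built. As a rough sketch of the idea, the snippet below computes a Routine Load backlog from Kafka offsets and renders it in the Prometheus text exposition format. The metric name `olap_routine_load_lag_rows`, the label names, and both function names are hypothetical and chosen only for illustration.

```python
from typing import Dict, Tuple


def routine_load_lag(kafka_end: Dict[int, int], consumed: Dict[int, int]) -> int:
    """Rows not yet consumed, summed over Kafka partitions (never negative)."""
    return sum(max(end - consumed.get(p, 0), 0) for p, end in kafka_end.items())


def render_metrics(lags: Dict[Tuple[str, str], int]) -> str:
    """Render one gauge per (cluster, job) in the Prometheus text format."""
    lines = [
        "# HELP olap_routine_load_lag_rows Rows waiting to be consumed by a Routine Load job.",
        "# TYPE olap_routine_load_lag_rows gauge",
    ]
    for (cluster, job), lag in sorted(lags.items()):
        # Doubled braces are literal braces inside an f-string.
        lines.append(
            f'olap_routine_load_lag_rows{{cluster="{cluster}",job="{job}"}} {lag}'
        )
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    lag = routine_load_lag({0: 100, 1: 50}, {0: 90, 1: 50})
    print(render_metrics({("primary", "cs_events"): lag}))
```

Prometheus would scrape this text from an HTTP endpoint, and a Grafana alert on the gauge could then flag data-flow delays.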
03 Dual‑Stream High‑Availability Design
To avoid downtime during major sales events, the same data is written to a primary cluster and a backup cluster simultaneously. If one cluster experiences jitter or import lag, traffic can be switched to the other, minimizing service disruption.
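The switching rule itself is not described in the article. A minimal sketch of one possible policy is shown below, assuming each cluster reports a health flag, an import lag, and a TP99 latency. The `ClusterStatus` fields, the `pick_cluster` function, and the default thresholds are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ClusterStatus:
    name: str
    healthy: bool        # e.g. all FE/BE nodes reachable
    import_lag_s: float  # data-flow delay in seconds
    tp99_ms: float       # query TP99 latency in milliseconds


def pick_cluster(active: ClusterStatus, standby: ClusterStatus,
                 max_lag_s: float = 60.0, max_tp99_ms: float = 1000.0) -> str:
    """Return the name of the cluster that should serve traffic."""

    def ok(c: ClusterStatus) -> bool:
        return c.healthy and c.import_lag_s <= max_lag_s and c.tp99_ms <= max_tp99_ms

    if ok(active):
        return active.name
    if ok(standby):
        return standby.name
    # Both degraded: stay on the current cluster to avoid flapping.
    return active.name


if __name__ == "__main__":
    primary = ClusterStatus("primary", True, 300.0, 200.0)  # lagging imports
    backup = ClusterStatus("backup", True, 5.0, 180.0)
    print(pick_cluster(primary, backup))
```

Preferring the currently active cluster whenever it passes the checks keeps traffic from bouncing between clusters on small fluctuations.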
04 Dynamic Partition Management
JD’s OLAP team extended Doris’s partition feature to retain partitions for specific historical periods (e.g., 618, 11.11) that would otherwise be dropped by the default dynamic partition policy. This preserves critical sales‑event data without manual intervention.
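The extension's implementation is not given in the article. The sketch below only illustrates the retention rule it describes: partitions older than the rolling window are dropped unless they fall inside a protected date range. The function name `partitions_to_drop`, the day-granularity partitions, and the sample dates are assumptions.

```python
from datetime import date, timedelta
from typing import Iterable, List, Tuple

DateRange = Tuple[date, date]  # inclusive (start, end)


def partitions_to_drop(partition_days: Iterable[date], today: date,
                       keep_days: int, protected: Iterable[DateRange]) -> List[date]:
    """Return expired partitions that are not inside any protected range."""
    cutoff = today - timedelta(days=keep_days)
    ranges = list(protected)

    def is_protected(day: date) -> bool:
        return any(start <= day <= end for start, end in ranges)

    return sorted(d for d in partition_days if d < cutoff and not is_protected(d))


if __name__ == "__main__":
    days = [date(2020, 11, 10), date(2020, 11, 11), date(2020, 11, 12),
            date(2020, 11, 20), date(2020, 11, 25), date(2020, 11, 30)]
    keep_1111 = [(date(2020, 11, 1), date(2020, 11, 11))]  # 11.11 event window
    print(partitions_to_drop(days, date(2020, 12, 1), 7, keep_1111))
```

With a 7-day window on 2020-12-01, the two partitions inside the protected 11.11 range survive even though they are past the cutoff.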
Doris Cache Mechanism
01 Cache Scenarios
High‑concurrency: Doris handles many QPS, but excessive load can cause node jitter.
Complex queries: A multi-dimensional dashboard issues many queries that join across tables, so a page can take seconds to load even though each individual query returns in milliseconds.
Repeated queries: Lack of deduplication causes redundant query bursts.
02 Cache Types
Three cache layers coexist:
Result Cache : Stores complete query result sets; consulted first for a cache hit.
SQL Cache : Keys on SQL signature, partition ID, and partition version; invalidated when any of these change, suitable for T+1 update patterns.
Partition Cache : Caches read‑only partitions while leaving updating partitions uncached; splits a multi‑day query into cached and uncached sub‑queries, dramatically reducing load.
All caches are toggled via MySQL‑compatible commands on FE nodes and reside in FE memory for fast access.
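To make the Partition Cache idea concrete, here is a toy model rather than Doris's actual implementation. It keys cached values on (SQL signature, partition ID, partition version), serves read-only partitions from memory, and sends only the uncached partitions to the backend. The class `PartitionCache` and its interface are hypothetical, and merging by summation is only valid for additive aggregates such as COUNT or SUM.

```python
from typing import Callable, Dict, Tuple

CacheKey = Tuple[str, str, int]  # (sql signature, partition id, partition version)


class PartitionCache:
    def __init__(self, run_partition: Callable[[str, str], int]):
        self._run = run_partition          # executes the query on one partition
        self._store: Dict[CacheKey, int] = {}
        self.backend_calls = 0

    def query(self, sql_sig: str, partitions: Dict[str, Tuple[int, bool]]) -> int:
        """partitions maps partition id -> (version, read_only); returns the merged sum."""
        total = 0
        for pid, (version, read_only) in partitions.items():
            key = (sql_sig, pid, version)
            if read_only and key in self._store:
                total += self._store[key]  # cache hit: no backend work
                continue
            self.backend_calls += 1
            value = self._run(sql_sig, pid)
            if read_only:                  # partitions still being updated stay uncached
                self._store[key] = value
            total += value
        return total


if __name__ == "__main__":
    data = {"p20201109": 10, "p20201110": 20, "p20201111": 5}
    cache = PartitionCache(lambda sig, pid: data[pid])
    parts = {"p20201109": (1, True), "p20201110": (1, True), "p20201111": (7, False)}
    print(cache.query("count_consults", parts), cache.backend_calls)
    print(cache.query("count_consults", parts), cache.backend_calls)
```

On the second call only the partition that is still receiving data reaches the backend. Because the version is part of the key, a changed partition version simply misses the cache, which mirrors the invalidation rule described for SQL Cache.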
03 Cache Effectiveness
During the 2020 11.11 promotion, disabling caches caused CPU usage to hit 100 % on the primary Doris cluster. Enabling Result Cache reduced CPU consumption to 30‑40 %, demonstrating the cache’s role in protecting cluster resources under heavy load.
Optimizations for the 2020 11.11 Promotion
01 Import Task Optimization
The team built an “OLAP Exporter” to monitor import speed, backlog, and pause events. Import tasks are throttled by three thresholds: maximum batch processing time, maximum batch row count, and maximum batch data volume. Adjusting these thresholds (increasing batch size and data volume, fine‑tuning time intervals) kept latency within twice the maximum interval while maintaining stability.
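As an illustration, and assuming the three thresholds correspond to the Routine Load job properties `max_batch_interval`, `max_batch_rows`, and `max_batch_size`, the sketch below renders a `PROPERTIES` clause and checks the "within twice the maximum interval" latency target. The helper names and the sample values are hypothetical, not the team's actual settings.

```python
def routine_load_properties(max_interval_s: int, max_rows: int, max_bytes: int) -> str:
    """Render the batching thresholds as a PROPERTIES clause."""
    props = {
        "max_batch_interval": max_interval_s,  # seconds per batch
        "max_batch_rows": max_rows,            # rows per batch
        "max_batch_size": max_bytes,           # bytes per batch
    }
    body = ",\n".join(f'    "{key}" = "{value}"' for key, value in props.items())
    return f"PROPERTIES (\n{body}\n)"


def within_latency_budget(observed_delay_s: float, max_interval_s: int) -> bool:
    """True if the import delay stays within twice the maximum batch interval."""
    return observed_delay_s <= 2 * max_interval_s


if __name__ == "__main__":
    print(routine_load_properties(20, 300000, 209715200))
    print(within_latency_budget(35, 20))
```

Raising the row and size limits lets each batch carry more data, while the interval bounds how long a row can wait before it becomes queryable.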
02 Monitoring Metric Refinement
Metrics are split into host‑level and business‑level groups. A dedicated “11.11 Key Metrics” panel aggregates BE CPU usage, real‑time task backlog rows, TP99 latency, and QPS, allowing operators to view cluster health without frequent dashboard switching.
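For readers unfamiliar with TP99, it is the latency below which 99% of requests complete. The article does not say how the panel computes it. One common definition is the nearest-rank percentile, sketched below with a hypothetical `tp99` helper.

```python
from typing import Sequence


def tp99(latencies_ms: Sequence[float]) -> float:
    """Nearest-rank 99th percentile of the given latencies."""
    if not latencies_ms:
        raise ValueError("no latency samples")
    ordered = sorted(latencies_ms)
    # ceil(0.99 * n) using integer arithmetic to avoid float rounding.
    rank = -(-99 * len(ordered) // 100)
    return ordered[rank - 1]


if __name__ == "__main__":
    samples = list(range(1, 101))  # 1 ms .. 100 ms
    print(tp99(samples))
```

With few samples the nearest-rank value equals the maximum, so the metric is only meaningful over windows with enough queries.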
03 Supporting Tools
Import sampling tool: Captures real‑time import metrics, adjusts task parameters, and generates migration statements when tasks are paused.
Large-query analysis tool: Aggregates queries that exceed latency or scan-volume thresholds and provides per-business breakdowns, enabling rapid identification of problematic queries.
Degrade‑and‑recover tool: Automatically reduces non‑critical workloads during peak pressure and restores them afterward.
Cluster inspection tool: Checks primary‑backup consistency, replica counts, tablet health, and machine resource usage.
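None of these tools is shown in the article. As one example, the aggregation a large-query analysis tool performs might resemble the sketch below, which groups queries that cross a latency or scan-volume threshold by business line. The `QueryRecord` fields, the function name, and the thresholds are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class QueryRecord:
    business: str
    latency_ms: float
    scan_rows: int


def slow_query_report(records: Iterable[QueryRecord], latency_ms_threshold: float,
                      scan_rows_threshold: int) -> Dict[str, Dict[str, float]]:
    """Per-business count, worst latency, and total scanned rows of large queries."""
    report = defaultdict(lambda: {"count": 0, "max_latency_ms": 0.0, "scan_rows": 0})
    for r in records:
        is_large = (r.latency_ms >= latency_ms_threshold
                    or r.scan_rows >= scan_rows_threshold)
        if not is_large:
            continue
        entry = report[r.business]
        entry["count"] += 1
        entry["max_latency_ms"] = max(entry["max_latency_ms"], r.latency_ms)
        entry["scan_rows"] += r.scan_rows
    return dict(report)


if __name__ == "__main__":
    log = [
        QueryRecord("consult", 120.0, 10_000),
        QueryRecord("consult", 2500.0, 5_000_000),
        QueryRecord("complaint", 80.0, 20_000_000),
        QueryRecord("complaint", 50.0, 1_000),
    ]
    print(slow_query_report(log, 1000.0, 10_000_000))
```

In practice the records could come from query audit logs, which the article lists among its future plans.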
Conclusion & Outlook
JD.com began using Doris in early 2020 and now operates both dedicated and shared clusters as a mature OLAP user. Ongoing challenges include task scheduling, import configuration, and query optimization. Future plans involve wider adoption of materialized views, bitmap indexes for precise UV counting, audit logs for query statistics, and further automation of import scheduling to enhance stability and performance.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
