Scaling JD.com Customer Service with Doris OLAP: Architecture & Caching

JD.com's customer service team uses the open-source MPP database Doris to power real-time and offline OLAP dashboards. This article covers the data ingestion pipelines, full-link monitoring, dual-stream high-availability design, dynamic partition management, multi-layer caching strategies, and the performance optimizations applied during the 2020 11.11 shopping festival.

dbaplus Community

Oct 26, 2021

Introduction

Doris is an open-source MPP analytical database that delivers sub-second query responses on datasets exceeding 10 PB. Its simple distributed architecture offers elastic scaling and easy operations, which has made it popular in the Chinese open-source community and led to adoption by large companies such as Meituan and Xiaomi.

JD.com Customer Service Business

The JD.com customer service platform monitors metrics like consultation volume, answer rate, and complaint count in real time. To support both high‑concurrency online queries and large‑scale offline analysis, the team needed a solution that could handle massive data volumes with low latency, which traditional RDBMSs (MySQL, Oracle) and batch‑oriented systems (Hive, Kylin) could not provide.

Easy OLAP Design

01 Data Ingestion Pipeline

Real‑time data originates from Kafka, while offline data resides in HDFS. Real‑time ingestion uses Doris’s Routine Load, and offline ingestion employs Broker Load and Stream Load.
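As a rough illustration of the real-time path, the sketch below assembles a Routine Load statement that subscribes a Doris table to a Kafka topic. The database, job, table, broker, and topic names are hypothetical placeholders, and the concurrency value is an arbitrary example rather than a setting reported by the JD.com team.

```python
def build_routine_load_sql(db, job, table, brokers, topic, column_separator=","):
    """Return a CREATE ROUTINE LOAD statement for one Kafka topic.

    All identifiers passed in are illustrative; adapt them to the real cluster.
    """
    return (
        f"CREATE ROUTINE LOAD {db}.{job} ON {table}\n"
        f'COLUMNS TERMINATED BY "{column_separator}"\n'
        "PROPERTIES (\n"
        '    "desired_concurrent_number" = "3"\n'  # example value, not from the article
        ")\n"
        "FROM KAFKA (\n"
        f'    "kafka_broker_list" = "{brokers}",\n'
        f'    "kafka_topic" = "{topic}"\n'
        ");"
    )


if __name__ == "__main__":
    print(build_routine_load_sql(
        "cs_db", "consult_job", "consult_events", "broker1:9092", "consult_topic"
    ))
```

The generated statement would be submitted to an FE node over the MySQL protocol; the offline Broker Load and Stream Load paths use different statements and endpoints and are not shown here.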

02 Full‑Link Monitoring

The project uses Prometheus + Grafana. node_exporter collects host‑level metrics, Doris exposes FE/BE metrics in Prometheus format, and a custom OLAP Exporter gathers Routine Load metrics to detect data‑flow delays.
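The article does not show the exporter's code. A minimal sketch of the idea, assuming the exporter compares the latest Kafka offsets with the offsets a Routine Load job has consumed, might compute a backlog figure and render it in the Prometheus text exposition format. The metric name and labels here are hypothetical.

```python
def routine_load_lag(latest_offsets, consumed_offsets):
    """Sum of unconsumed messages across Kafka partitions (never negative)."""
    return sum(
        max(latest_offsets[p] - consumed_offsets.get(p, 0), 0)
        for p in latest_offsets
    )


def render_gauge(name, help_text, labels, value):
    """Render one gauge sample in the Prometheus text exposition format."""
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} gauge\n"
        f"{name}{{{label_str}}} {value}\n"
    )


if __name__ == "__main__":
    lag = routine_load_lag({0: 100, 1: 250}, {0: 90, 1: 250})
    # "olap_routine_load_lag_rows" is an invented metric name for illustration.
    print(render_gauge(
        "olap_routine_load_lag_rows",
        "Rows not yet consumed by the routine load job",
        {"job": "consult_job"},
        lag,
    ))
```

An alert on this gauge growing over time is one plausible way to detect the data-flow delays mentioned above.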

03 Dual‑Stream High‑Availability Design

To avoid downtime during major sales events, the same data is written to a primary cluster and a backup cluster simultaneously. If one cluster experiences jitter or ingestion lag, query traffic can be switched to the other, minimizing service disruption.
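The switching rule itself is not described in the article. One simple policy, sketched here under the assumption that each cluster reports a health flag and an ingestion lag, is to stay on the primary unless it is unusable and the backup is usable. The class, field names, and the 60-second lag limit are all hypothetical.

```python
class ClusterStatus:
    """Snapshot of one cluster's state (fields are illustrative)."""

    def __init__(self, name, healthy, ingest_lag_seconds):
        self.name = name
        self.healthy = healthy
        self.ingest_lag_seconds = ingest_lag_seconds


def choose_cluster(primary, backup, max_lag_seconds=60.0):
    """Return the name of the cluster that should serve queries."""

    def usable(cluster):
        return cluster.healthy and cluster.ingest_lag_seconds <= max_lag_seconds

    if usable(primary):
        return primary.name
    if usable(backup):
        return backup.name
    # Neither cluster is in good shape: avoid flapping and stay on the primary.
    return primary.name


if __name__ == "__main__":
    lagging = ClusterStatus("primary", True, 300.0)
    standby = ClusterStatus("backup", True, 3.0)
    print(choose_cluster(lagging, standby))
```

Because both clusters ingest the same streams, the backup should already hold current data at the moment traffic moves over.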

04 Dynamic Partition Management

JD’s OLAP team extended Doris’s partition feature to retain partitions for specific historical periods (e.g., 618, 11.11) that would otherwise be dropped by the default dynamic partition policy. This preserves critical sales‑event data without manual intervention.
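The retention rule can be pictured as a rolling window with exceptions. The following is a sketch of that logic only, not the team's actual extension to Doris: day partitions older than the window are dropped unless they fall inside a reserved period. The function name, the 7-day window, and the example dates are assumptions.

```python
from datetime import date, timedelta


def partitions_to_drop(partition_days, today, keep_days, reserved_periods):
    """Return day partitions that are outside the window and not reserved.

    reserved_periods is a list of (start, end) dates, both inclusive.
    """
    cutoff = today - timedelta(days=keep_days)
    to_drop = []
    for day in sorted(partition_days):
        if day >= cutoff:
            continue  # still inside the rolling retention window
        if any(start <= day <= end for start, end in reserved_periods):
            continue  # protected sales-event data
        to_drop.append(day)
    return to_drop


if __name__ == "__main__":
    days = [date(2020, 11, 10), date(2020, 11, 11), date(2020, 11, 20), date(2020, 11, 28)]
    reserved = [(date(2020, 11, 1), date(2020, 11, 11))]  # hypothetical 11.11 period
    print(partitions_to_drop(days, date(2020, 12, 1), 7, reserved))
```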

Doris Cache Mechanism

01 Cache Scenarios

High‑concurrency: Doris handles many QPS, but excessive load can cause node jitter.

Complex queries: A multi-dimensional dashboard issues many queries with joins across tables, so even when each query returns in milliseconds, the page as a whole can take seconds to respond.

Repeated queries: Lack of deduplication causes redundant query bursts.

02 Cache Types

Three cache layers coexist:

Result Cache: Stores complete query result sets; consulted first for a cache hit.

SQL Cache: Keys on SQL signature, partition ID, and partition version; invalidated when any of these change, suitable for T+1 update patterns.

Partition Cache: Caches read-only partitions while leaving updating partitions uncached; splits a multi-day query into cached and uncached sub-queries, dramatically reducing load.
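To make the SQL Cache and Partition Cache descriptions concrete, here is a simplified sketch of both ideas. It is not Doris's internal implementation: the hashing scheme, the whitespace-only SQL normalization, and the function names are assumptions chosen for illustration.

```python
import hashlib


def sql_cache_key(sql, partition_versions):
    """Build a key from the SQL text plus every (partition id, version) pair.

    A new version on any partition yields a different key, so stale
    entries are simply never hit again.
    """
    normalized = " ".join(sql.split())  # collapse whitespace only
    parts = ";".join(f"{pid}:{ver}" for pid, ver in sorted(partition_versions.items()))
    return hashlib.sha256(f"{normalized}|{parts}".encode("utf-8")).hexdigest()


def split_by_partition_cache(requested_days, cached_days, updating_days):
    """Split a multi-day query into days served from cache and days to scan."""
    hits = [d for d in requested_days if d in cached_days and d not in updating_days]
    misses = [d for d in requested_days if d not in hits]
    return hits, misses


if __name__ == "__main__":
    key_v1 = sql_cache_key("SELECT count(*) FROM t", {101: 1})
    key_v2 = sql_cache_key("SELECT count(*) FROM t", {101: 2})
    print(key_v1 != key_v2)  # a partition update changes the key
    print(split_by_partition_cache(
        ["2020-11-09", "2020-11-10", "2020-11-11"],
        {"2020-11-09", "2020-11-10"},
        {"2020-11-11"},
    ))
```

In the second call, only the day that is still receiving data needs a fresh scan; the results for the two closed days can be merged in from the cache.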

All caches are toggled via MySQL‑compatible commands on FE nodes and reside in FE memory for fast access.

03 Cache Effectiveness

During the 2020 11.11 promotion, CPU usage on the primary Doris cluster hit 100% with caches disabled. Enabling Result Cache reduced CPU consumption to 30-40%, demonstrating the cache's role in protecting cluster resources under heavy load.

Optimizations for the 2020 11.11 Promotion

01 Import Task Optimization

The team built an “OLAP Exporter” to monitor import speed, backlog, and pause events. Import tasks are throttled by three thresholds: maximum batch processing time, maximum batch row count, and maximum batch data volume. Adjusting these thresholds (increasing batch size and data volume, fine‑tuning time intervals) kept latency within twice the maximum interval while maintaining stability.
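The interaction of the three thresholds can be simulated in a few lines. This sketch assumes a batch is closed as soon as any one limit is reached, which is how such limits are commonly applied; it models the behavior only and does not reproduce Doris's scheduler. In Routine Load the corresponding job properties are named max_batch_interval, max_batch_rows, and max_batch_size; the values used below are arbitrary.

```python
def split_into_batches(events, max_interval_s, max_rows, max_bytes):
    """Group (arrival_time_s, size_bytes) events into import batches.

    The current batch is closed when its age, row count, or byte size
    would exceed the corresponding threshold.
    """
    batches, current, start, size = [], [], None, 0
    for ts, nbytes in events:
        if current and (
            ts - start >= max_interval_s
            or len(current) >= max_rows
            or size + nbytes > max_bytes
        ):
            batches.append(current)
            current, size = [], 0
        if not current:
            start = ts  # the first event opens a new batch
        current.append((ts, nbytes))
        size += nbytes
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    events = [(0, 10), (1, 10), (2, 10), (3, 10), (20, 10)]
    for batch in split_into_batches(events, max_interval_s=10, max_rows=3, max_bytes=1000):
        print(batch)
```

Raising the row and size limits lets each batch carry more data, which is the direction of tuning the team describes, while the interval bounds how stale the newest visible data can be.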

02 Monitoring Metric Refinement

Metrics are split into host‑level and business‑level groups. A dedicated “11.11 Key Metrics” panel aggregates BE CPU usage, real‑time task backlog rows, TP99 latency, and QPS, allowing operators to view cluster health without frequent dashboard switching.
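For readers unfamiliar with TP99, it is the latency below which 99% of requests complete. A minimal sketch using the nearest-rank method follows; production monitoring stacks usually estimate percentiles from histograms instead, so treat this as a definition rather than the panel's actual computation.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Return the pct-th percentile of samples using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


if __name__ == "__main__":
    latencies_ms = list(range(1, 101))  # synthetic data: 1 ms .. 100 ms
    print(percentile_nearest_rank(latencies_ms, 99))
```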

03 Supporting Tools

Import sampling tool: Captures real‑time import metrics, adjusts task parameters, and generates migration statements when tasks are paused.

Large-query analysis tool: Aggregates queries that exceed latency or scan-volume thresholds and provides per-business breakdowns, enabling rapid identification of problematic queries.

Degrade‑and‑recover tool: Automatically reduces non‑critical workloads during peak pressure and restores them afterward.

Cluster inspection tool: Checks primary‑backup consistency, replica counts, tablet health, and machine resource usage.
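Of these tools, the large-query analysis is the easiest to illustrate. The sketch below assumes each query record carries a business tag, a latency, and a scanned-row count; the field names and thresholds are hypothetical, and the real tool's inputs are not described in the article.

```python
from collections import defaultdict


def summarize_large_queries(records, latency_ms_threshold, scan_rows_threshold):
    """Aggregate queries that are slow or scan-heavy, grouped by business line."""
    summary = defaultdict(
        lambda: {"count": 0, "max_latency_ms": 0, "total_scan_rows": 0}
    )
    for rec in records:
        is_slow = rec["latency_ms"] >= latency_ms_threshold
        is_heavy = rec["scan_rows"] >= scan_rows_threshold
        if not (is_slow or is_heavy):
            continue  # ordinary query, not of interest
        entry = summary[rec["business"]]
        entry["count"] += 1
        entry["max_latency_ms"] = max(entry["max_latency_ms"], rec["latency_ms"])
        entry["total_scan_rows"] += rec["scan_rows"]
    return dict(summary)


if __name__ == "__main__":
    sample = [
        {"business": "consult", "latency_ms": 1200, "scan_rows": 1000},
        {"business": "consult", "latency_ms": 50, "scan_rows": 5_000_000},
        {"business": "complaint", "latency_ms": 30, "scan_rows": 100},
    ]
    print(summarize_large_queries(sample, 1000, 1_000_000))
```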

Conclusion & Outlook

JD.com began using Doris in early 2020 and now operates both dedicated and shared clusters as a mature OLAP user. Ongoing challenges include task scheduling, import configuration, and query optimization. Future plans involve wider adoption of materialized views, bitmap indexes for precise UV counting, audit logs for query statistics, and further automation of import scheduling to enhance stability and performance.
