Big Data 21 min read

Building and Optimizing JD Retail OLAP Platform: Architecture, Real‑time Updates, Materialized Views, and Join Optimization

This article presents JD Retail's OLAP platform construction and practical scenarios, covering control‑plane design, architecture, business management, operational safeguards, real‑time data updates, materialized view acceleration, join optimization techniques, high‑concurrency queries, and large‑scale write throughput for e‑commerce peak periods.

DataFunTalk

May 18, 2022

Building and Optimizing JD Retail OLAP Platform: Architecture, Real‑time Updates, Materialized Views, and Join Optimization

Guest: Li Yang, JD.com, Senior OLAP R&D Engineer Editor: Liu Han, Xueke Network Platform: DataFunTalk

Overview: Today we share the construction and practical scenarios of JD Retail's OLAP platform, divided into four main parts: control‑plane construction, optimization techniques, typical business cases, and large‑promotion preparation.

01 Control‑Plane Introduction

The control‑plane provides highly reliable, efficient, and sustainable operation support and rapid deployment capabilities, especially important for ClickHouse, a high‑performance OLAP engine with weak native operations.

Architecture Design: The workflow starts with request routing and domain resolution, reaches the adminServer where validation occurs, then tasks are queued for workers. Workers consume tasks, write results to storage, and finally persist to the data layer. This handles large‑scale cluster deployment and quota changes.

Business Management Functions:

Cluster account application for users

Business‑level registration

Quota queries (query count, concurrency, timeout)

Custom monitoring and alerting

Slow‑query statistics and alerts

04 Operations Management

Key operational capabilities:

Deploy new clusters quickly after resource allocation

Handle node failures (CPU, memory, disk) by timely up/down or replacement

Quota control during peak promotions to limit queries, concurrency, and timeouts

Cluster health inspection (table creation, deletion, insert, query) with real‑time alerts

02 Optimization Techniques

1. Scenario Challenges

Two main difficulties in JD Retail e‑commerce:

Complex transactional data requiring multi‑table joins and real‑time updates (status, amount, organization changes)

High‑volume click/behavior data that is append‑only, massive, and frequently changes in quality

2. Real‑time Data Updates

Two tables are created: a local table using ReplacingMergeTree for deduplication and a distributed table. The local table partitions by day and sorts by ID. Three deduplication methods are compared: optimize deduplication works only within a single partition final deduplication works across partitions but not across shards argMax with a version column can deduplicate across shards and is the recommended approach

3. Materialized Views

Materialized views pre‑aggregate massive tables (e.g., 1.3 billion rows) using SummingMergeTree. Query latency drops from ~2 seconds to ~0.002 seconds, a 100‑plus‑fold improvement, making them ideal for large‑scale pre‑aggregation scenarios.

4. Join Optimization

Standard distributed join expands to n × n partial queries for n shards, causing high latency. Two optimizations are presented:

Global join: Compute the right table first, materialize a temporary table, then join with the left side, reducing queries to 2 × n.

Local join: Convert the right side to a local table, allowing only n partial queries.

Recommendation: Prefer local join, then global join, and keep the smaller table on the right.

03 Typical Business Cases

1. High‑concurrency Queries

In an advertising real‑time tracking project, QPS reaches ~2000 during peak (e.g., 618 promotion). Optimizations include adding replicas, limiting max_threads, and adjusting query_thread_log storage to prevent disk exhaustion.

2. Large‑throughput Writes

JD Cloud monitoring writes up to 6000 billion rows per day (≈3 GB/s normal, 6 GB/s during promotions) on a 60‑shard, 2‑replica cluster with 120 × 64‑core, 1 TB SSD machines. Optimizations include:

Introducing chproxy for per‑SQL load balancing across nodes

Using local tables for writes, achieving 2‑3× higher throughput than distributed tables

Scaling replicas and tuning system parameters to sustain 2000 QPS

04 Promotion Preparation

During major sales events, the platform ensures stability through:

Resource collection and tier confirmation from business units

Real‑time monitoring and alert subscription (service layer, system layer, CH query/write metrics)

High‑fidelity stress testing with quota settings (e.g., 40% shared CPU, 80% dedicated)

Failure drills such as dual‑stream switchovers and rapid node replacement

Graceful degradation for lower‑priority services during peak load

05 Q&A Highlights

Q: What is the biggest challenge in this topic? A: Handling high concurrency (QPS >2000) on ClickHouse, which defaults to 100 concurrent queries. Solutions include adding replicas, tuning thread limits, expanding log storage, and adjusting quota and memory settings.

Q: What is OLAP and which engines are used? A: OLAP is online analytical processing for massive data. JD Retail primarily uses ClickHouse, supplemented by Doris.

Q: How is the dual‑write/primary‑backup switch handled? A: Both primary and backup clusters receive writes simultaneously; traffic is switched via DNS with negligible latency.

Thank you for listening.

Please like, share, and give a three‑click boost at the end of the article! 🙏

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Performance Optimization Big Data ClickHouse Data Warehouse OLAP

Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.