Big Data 12 min read

How JD Search Scaled Real‑Time Analytics with Flink and Doris

This article details JD Search's journey from a Storm‑based pipeline to a Flink‑driven architecture backed by Apache Doris, covering business requirements, technical challenges, design trade‑offs, performance optimizations for massive traffic spikes, and future plans for their real‑time OLAP data warehouse.

dbaplus Community
dbaplus Community
dbaplus Community
How JD Search Scaled Real‑Time Analytics with Flink and Doris

Background

JD Search requires real‑time analytics for its e‑commerce search platform, covering overall search metrics, AB‑experiment monitoring, and hot‑search word ranking. Daily data volume reaches billions of exposure logs and expands to hundreds of billions of SKU‑level records, with a requirement for sub‑second query responses.

Initial Architecture and Limitations

The original pipeline used Apache Storm for point‑to‑point processing. Storm became inflexible, could not guarantee exactly‑once semantics, and consumed excessive resources as traffic grew.

Migration to Flink + Doris

Storm was replaced by Apache Flink, which provides high throughput and exactly‑once guarantees. The stream is split into three detail layers—PV detail, SKU detail, and AB‑experiment detail. Each layer writes raw records directly to Apache Doris via routine load, making Doris the real‑time OLAP layer that performs aggregation, rollup materialization, and query serving.

OLAP Layer Requirements

Minute‑level ingestion latency, second‑level query response.

Standard SQL interface to lower the learning curve.

Support for JOINs to enrich dimensions.

Approximate deduplication for traffic data (PV/UV) and exact deduplication for order lines.

Throughput of tens of millions of records per minute (hundreds of billions per day).

High concurrent query load.

Engine Selection

Benchmarks of Druid, Elasticsearch, ClickHouse, and Doris showed that Doris satisfied exactly‑once ingestion, high concurrency, and full SQL support. ClickHouse lacked transaction support and had lower concurrency; Druid and Elasticsearch did not meet the deduplication or SQL requirements.

Flink Job Design

Flink jobs are lightweight: OperatorState stores only Kafka offsets; no keyed state is used. This minimizes memory usage and improves stability. Each job reads from Kafka, performs minimal transformation, and writes raw detail records to Doris using routine load.

Doris Configuration for Real‑Time Loads

Routine load interval set to 30 seconds (max interval, max data size, max rows) to increase write throughput at the cost of a small ingestion latency.

Two rollup tables are defined. Filter fields (e.g., experiment ID, date, minute, channel, platform, category) are placed first to exploit prefix indexes, reducing query latency.

PV and UV are aggregated with HyperLogLog (HLL) sketches, yielding ~0.8 % error while dramatically reducing storage and computation.

SKU counts use SUM; other metrics such as click PV/UV and derived CTR are computed on‑the‑fly.

Modeling for AB‑Experiment Monitoring

During peak promotional events the AB‑experiment exposure model ingests >100 billion rows per day. The model defines:

K‑space: date, minute, channel, platform, first‑/second‑/third‑level category, experiment bucket
V‑space: exposure PV (HLL), exposure UV (HLL), SKU exposure count (SUM), click PV, click UV, derived CTR

Experiment ID is bucketed to enable fast tablet‑level filtering.

Performance Tuning During Peak Loads

Doris runs on a cluster of 30+ backend nodes equipped with NVMe SSDs. Routine‑load parameters are tuned by sampling task statistics every three minutes (receivedBytes, loadedRows, taskExecuteTimeMs) to keep the ingestion rate around 600 million rows per 10 minutes. Rollup tables are limited to two to balance storage overhead and query speed.

Operational Results

Since May (Doris version 0.11.33), the system operates >10 routine‑load tasks, ingesting >200 billion rows daily. Replacing Flink window calculations with Doris aggregation reduced development effort, improved dimensional flexibility, and lowered compute resource consumption while preserving data consistency.

Future Work

Upgrade to Doris 0.12 to use bitmap indexes for precise UV deduplication.

Extend Doris to serve real‑time recommendation use cases.

Enrich the real‑time data lake with additional Flink‑generated streams and new windowed aggregations.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Big DataFlinkReal-time analyticsStreamingOLAPdoris
dbaplus Community
Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.