How JD Search Scaled Real‑Time Analytics with Flink and Doris
This article details JD Search's journey from a Storm‑based pipeline to a Flink‑driven architecture backed by Apache Doris, covering business requirements, technical challenges, design trade‑offs, performance optimizations for massive traffic spikes, and future plans for their real‑time OLAP data warehouse.
Background
JD Search requires real‑time analytics for its e‑commerce search platform, covering overall search metrics, AB‑experiment monitoring, and hot‑search word ranking. Daily data volume reaches billions of exposure logs and expands to hundreds of billions of SKU‑level records, with a requirement for sub‑second query responses.
Initial Architecture and Limitations
The original pipeline used Apache Storm for point‑to‑point processing. Storm became inflexible, could not guarantee exactly‑once semantics, and consumed excessive resources as traffic grew.
Migration to Flink + Doris
Storm was replaced by Apache Flink, which provides high throughput and exactly‑once guarantees. The stream is split into three detail layers—PV detail, SKU detail, and AB‑experiment detail. Each layer writes raw records directly to Apache Doris via routine load, making Doris the real‑time OLAP layer that performs aggregation, rollup materialization, and query serving.
OLAP Layer Requirements
Minute‑level ingestion latency, second‑level query response.
Standard SQL interface to lower the learning curve.
Support for JOINs to enrich dimensions.
Approximate deduplication for traffic data (PV/UV) and exact deduplication for order lines.
Throughput of tens of millions of records per minute (hundreds of billions per day).
High concurrent query load.
Engine Selection
Benchmarks of Druid, Elasticsearch, ClickHouse, and Doris showed that Doris satisfied exactly‑once ingestion, high concurrency, and full SQL support. ClickHouse lacked transaction support and had lower concurrency; Druid and Elasticsearch did not meet the deduplication or SQL requirements.
Flink Job Design
Flink jobs are lightweight: OperatorState stores only Kafka offsets; no keyed state is used. This minimizes memory usage and improves stability. Each job reads from Kafka, performs minimal transformation, and writes raw detail records to Doris using routine load.
Doris Configuration for Real‑Time Loads
Routine load interval set to 30 seconds (max interval, max data size, max rows) to increase write throughput at the cost of a small ingestion latency.
Two rollup tables are defined. Filter fields (e.g., experiment ID, date, minute, channel, platform, category) are placed first to exploit prefix indexes, reducing query latency.
PV and UV are aggregated with HyperLogLog (HLL) sketches, yielding ~0.8 % error while dramatically reducing storage and computation.
SKU counts use SUM; other metrics such as click PV/UV and derived CTR are computed on‑the‑fly.
Modeling for AB‑Experiment Monitoring
During peak promotional events the AB‑experiment exposure model ingests >100 billion rows per day. The model defines:
K‑space: date, minute, channel, platform, first‑/second‑/third‑level category, experiment bucket
V‑space: exposure PV (HLL), exposure UV (HLL), SKU exposure count (SUM), click PV, click UV, derived CTRExperiment ID is bucketed to enable fast tablet‑level filtering.
Performance Tuning During Peak Loads
Doris runs on a cluster of 30+ backend nodes equipped with NVMe SSDs. Routine‑load parameters are tuned by sampling task statistics every three minutes (receivedBytes, loadedRows, taskExecuteTimeMs) to keep the ingestion rate around 600 million rows per 10 minutes. Rollup tables are limited to two to balance storage overhead and query speed.
Operational Results
Since May (Doris version 0.11.33), the system operates >10 routine‑load tasks, ingesting >200 billion rows daily. Replacing Flink window calculations with Doris aggregation reduced development effort, improved dimensional flexibility, and lowered compute resource consumption while preserving data consistency.
Future Work
Upgrade to Doris 0.12 to use bitmap indexes for precise UV deduplication.
Extend Doris to serve real‑time recommendation use cases.
Enrich the real‑time data lake with additional Flink‑generated streams and new windowed aggregations.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
