
How JD Ads Cut Storage Costs 87% with Apache Doris Hot‑Cold Tiering

This article details JD Advertising's journey from a single-tier Apache Doris deployment holding nearly 1 PB of data to a multi-level hot-cold tiering architecture. It describes two tiering strategies, the performance and schema-change challenges encountered during the upgrade to Doris 2.0, and the optimizations that cut storage costs by about 87% while boosting query throughput.


Background

JD Advertising built an advertising data storage service on Apache Doris to provide real-time ad performance reports and multidimensional analysis. After years of growth the system stores nearly 1 PB of data and over 18 trillion rows, and handles more than 80 million queries per day. Storage became the bottleneck: data volume kept expanding, and the frequent cluster expansions it forced left compute resources increasingly underutilized.

Hot‑Cold Tiering Solutions

Two solutions were tried: a cold-data lake approach (V1) and an in-Doris hot-cold tiering approach (V2).

Cold‑Data Lake (V1)

The cold-data lake solution uses the Spark-Doris-Connector (SDC) to move cold data from Doris into an external lake such as Iceberg. Queries are rewritten: cold queries are redirected to external tables, while hot queries run directly on Doris OLAP tables. The advantage is that cold-data processing is decoupled from the online Doris cluster, preserving stability for real-time reporting. The disadvantages are the extra ETL tooling, the need to UNION hot and cold data whenever an analysis spans both tiers, and schema changes that must be kept in sync with the lake.
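A minimal sketch of the V1 query-rewrite pattern, assuming the cold data has already been offloaded to an Iceberg table the query layer can reach; all catalog, database, table, and column names here are hypothetical:

```sql
-- Hot query: recent data still lives in the Doris OLAP table.
SELECT dt, campaign_id, SUM(clicks) AS clicks
FROM ads.ad_report
WHERE dt >= '2024-06-01'
GROUP BY dt, campaign_id;

-- Cross-tier analysis: the rewrite layer must UNION hot (Doris) and
-- cold (Iceberg) data, which is the main operational drawback of V1.
SELECT dt, campaign_id, SUM(clicks) AS clicks
FROM (
    SELECT dt, campaign_id, clicks
    FROM ads.ad_report                     -- hot data in Doris
    WHERE dt >= '2024-01-01'
    UNION ALL
    SELECT dt, campaign_id, clicks
    FROM iceberg_lake.ads.ad_report_cold   -- cold data in the lake
    WHERE dt < '2024-01-01'
) t
GROUP BY dt, campaign_id;
```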

Doris Hot‑Cold Tiering (V2)

Doris 1.2 used local SSD for hot data and HDD for cold data, which works only on bare‑metal servers and requires pre‑estimated cold‑data size. Doris 2.0 introduces a new tiering model that stores cold data on distributed storage systems such as OSS or HDFS, allowing automatic migration based on TTL policies. This eliminates the complexity of the lake approach but requires careful throttling of cold queries to avoid impacting hot queries.
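As a rough illustration of the Doris 2.0 mechanism: a remote storage resource is declared once, a storage policy attaches a TTL to it, and tables reference the policy. The resource name, endpoint, and property values below are placeholders, and the exact property keys can vary by Doris version and storage backend:

```sql
-- Remote storage resource (S3-compatible endpoint; values are placeholders).
CREATE RESOURCE "remote_cfs" PROPERTIES (
    "type" = "s3",
    "s3.endpoint" = "http://cfs-gateway.example.com",
    "s3.region" = "cn-north-1",
    "s3.bucket" = "doris-cold-data",
    "s3.root.path" = "ads/",
    "s3.access_key" = "...",
    "s3.secret_key" = "..."
);

-- Storage policy: data cools down to the remote resource after the TTL
-- (cooldown_ttl is expressed in seconds).
CREATE STORAGE POLICY cool_after_30d
PROPERTIES ("storage_resource" = "remote_cfs", "cooldown_ttl" = "2592000");

-- Attach the policy to an existing table; Doris then migrates cold
-- rowsets automatically, with no external ETL required.
ALTER TABLE ads.ad_report SET ("storage_policy" = "cool_after_30d");
```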

Problem Solving

3.1 Apache Doris 2.0 Performance Optimizations & Issue Fixes

Query performance degradation – In small‑report queries the new optimizer in Doris 2.0 caused a ~50% slowdown; the team disabled the new optimizer to restore performance.
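The article does not name the exact switch; assuming the "new optimizer" is the Nereids planner that ships with Doris 2.0, falling back is a one-line variable change:

```sql
-- Fall back to the legacy planner (assumes the new optimizer is Nereids).
SET GLOBAL enable_nereids_planner = false;
```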

Bucket pruning failure – Rollup queries scanned all buckets; the issue was fixed in PR https://github.com/apache/doris/pull/38565.

Prefix index failure – Date‑to‑DateTime casting broke prefix indexes; the fix involved type alignment on the FE side (PR https://github.com/apache/doris/pull/39446).

FE CPU surge – Under the same QPS, FE CPU usage nearly doubled after the upgrade; flame-graph analysis pinpointed the hot spots and targeted optimizations were applied.

BE memory growth – SegmentCache thresholds were set too high, causing BE memory usage to climb; lowering the thresholds brought BE memory from over 60% down to under 25% and removed the OOM risk.

3.2 Cold‑Data Schema Change Optimizations

Cold-data Add-Key-Column operations previously degraded to a Direct Schema Change, which rewrites all data and led to very long schema-change (SC) times (e.g., more than 7 days for 20 TB). The team introduced a Linked Schema Change that copies cold data directly within remote storage via ChubaoFS's CopyObject API, cutting SC time by roughly 40× (PR https://github.com/apache/doris/pull/40963).
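For context, the operation in question is an ordinary key-column addition; a minimal sketch with hypothetical table and column names, where the KEY keyword marks the new column as part of the sort key and is what previously forced a full rewrite of cold data:

```sql
-- Add a key column to a table whose cold partitions live on remote storage.
ALTER TABLE ads.ad_report
ADD COLUMN campaign_group_id BIGINT KEY DEFAULT "0" AFTER campaign_id;

-- Monitor schema-change progress.
SHOW ALTER TABLE COLUMN FROM ads;
```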

They also implemented single‑leader SC for cold data, so only the elected leader replica performs the change, avoiding duplicate data copies.

Furthermore, a Light Schema Change for cold data was added, allowing metadata‑only updates for key column additions without BE task creation, achieving millisecond‑level execution.

3.3 Other Issues

To free up storage before tiering, historical data was backed up to external storage and restored after the Doris 2.0 upgrade. A data-migrator tool automated asynchronous, parallel backups, while a restore tool, narwal_cli, reconciled schema differences between the snapshots and the current cluster. During restore, real-time Flink-to-Doris loads sometimes failed with the error:

LOAD_RUN_FAIL; msg:errCode = 2, detailMessage = Table xxxxx is in restore process. Can not load into it

The fix introduced finer-grained partition-state checks (PR https://github.com/apache/doris/pull/39595).
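The migration itself was driven by the in-house tooling above; the native Doris backup/restore primitives such tooling typically wraps look roughly like the sketch below, with repository, snapshot, and table names all hypothetical and the exact repository property keys depending on the Doris version:

```sql
-- One-time repository pointing at external (S3-compatible) storage.
CREATE REPOSITORY ad_backup_repo
WITH S3 ON LOCATION "s3://ad-backup-bucket/doris"
PROPERTIES ("s3.endpoint" = "...", "s3.access_key" = "...", "s3.secret_key" = "...");

-- Back up a historical table before the upgrade...
BACKUP SNAPSHOT ads.snapshot_2022 TO ad_backup_repo ON (ad_report_2022);

-- ...and restore it into the Doris 2.0 cluster afterwards.
RESTORE SNAPSHOT ads.snapshot_2022 FROM ad_backup_repo ON (ad_report_2022)
PROPERTIES ("backup_timestamp" = "2024-01-01-00-00-00");
```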

After restoration, historical tables were assigned a storage policy with cooldown_ttl=10s so they moved to ChubaoFS almost immediately, while hot tables used cooldown_ttl=2years so that data cools down automatically over time. This unified hot-cold strategy eliminated the need for further online storage expansion.
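Concretely, this amounts to two storage policies over the same remote resource (reusing the hypothetical remote_cfs resource from the earlier sketch); the values below translate the article's 10 seconds and two years into the seconds-based cooldown_ttl property, and the table names are illustrative:

```sql
-- Cool down almost immediately: for restored historical tables.
CREATE STORAGE POLICY cool_now
PROPERTIES ("storage_resource" = "remote_cfs", "cooldown_ttl" = "10");

-- Cool down after roughly two years: for hot, actively written tables.
CREATE STORAGE POLICY cool_after_2y
PROPERTIES ("storage_resource" = "remote_cfs", "cooldown_ttl" = "63072000");

ALTER TABLE ads.ad_report_2021 SET ("storage_policy" = "cool_now");
ALTER TABLE ads.ad_report      SET ("storage_policy" = "cool_after_2y");
```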

Conclusion

By applying hot-cold tiering, JD Advertising reduced storage costs by about 87%. Compared with the cold-data-lake approach used on Doris 1.2, Doris 2.0's tiering increased concurrent query capacity more than tenfold and dramatically lowered latency. The new architecture is simpler to maintain and cheaper to run overall, and many of the fixes and optimizations described above were contributed back to the Apache Doris community.

Tags: distributed storage, tiered storage, cold data, Apache Doris, schema change, hot data
Written by

JD Cloud Developers

JD Cloud Developers (the developer account of JD Technology) is JD Technology Group's platform for technical sharing and communication among AI, cloud computing, IoT, and related developers. It publishes technical information on JD products, industry content, and tech event news. Embrace technology and partner with developers to envision the future.
