
Hot and Cold Data Tiering in Apache Doris 2.0: Architecture, Configuration, and Performance Evaluation

This article explains the hot‑cold data tiering technique in Apache Doris 2.0, covering its motivation, storage‑layer design, configuration steps (resource, storage policy, table/partition settings), cost‑saving calculations, query performance impact, cold‑data compaction, and cache mechanisms, with practical code examples.

DataFunTalk

In many real‑world analytics scenarios, hot data is queried frequently and needs low latency, while cold data is rarely read and can tolerate slower access. Keeping both on the same expensive local disks drives storage costs up without any matching performance benefit.

To address this, Apache Doris introduces hot‑cold data tiering, which stores hot data on high‑performance SSDs and cold data on cheaper HDDs or object storage, reducing overall storage expenses while meeting performance needs.

Starting with Doris 0.12, dynamic partitioning allows lifecycle management of table partitions. Users can set storage_cooldown_time or dynamic_partition.hot_partition_num to automatically move data from SSD to HDD.
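As a rough illustration of the SSD-to-HDD mechanism, the sketch below creates a dynamically partitioned table whose two newest partitions are treated as hot. The table name, columns, and numeric values are hypothetical, and it assumes the backends have SSD storage paths configured.

```sql
-- Hypothetical table: daily partitions, with the 2 most recent kept on SSD.
CREATE TABLE IF NOT EXISTS event_log
(
    event_date DATE,
    user_id    BIGINT,
    payload    VARCHAR(2048)
)
DUPLICATE KEY(event_date, user_id)
PARTITION BY RANGE(event_date) ()
DISTRIBUTED BY HASH(user_id) BUCKETS 8
PROPERTIES
(
    "dynamic_partition.enable"            = "true",
    "dynamic_partition.time_unit"         = "DAY",
    "dynamic_partition.start"             = "-30",  -- drop partitions older than 30 days
    "dynamic_partition.end"               = "3",    -- pre-create 3 future partitions
    "dynamic_partition.prefix"            = "p",
    "dynamic_partition.hot_partition_num" = "2"     -- newest 2 partitions are hot (SSD)
);
```

Partitions that fall out of the hot window are expected to cool down to HDD automatically; the exact timing depends on the partition granularity.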

In Doris 2.0, the tiering feature is extended to three‑level storage (SSD, HDD, object storage). Cold data is migrated to object storage with a single replica, cutting storage cost to about one‑third of traditional HDD storage.

Typical cost‑saving calculations show that moving 80 % of data to object storage can reduce storage cost by more than 70 % compared with using only cloud disks.
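A back-of-the-envelope version of this calculation can be written out as follows. The three-replica layout on cloud disks, the single replica on object storage, and the price ratio are assumptions made for illustration, not figures from the article: $D$ is the total data volume, and $p_d$ and $p_o$ are hypothetical per-GB prices for cloud disk and object storage.

```latex
\begin{aligned}
C_{\text{disk}}   &= 3\,p_d\,D \\
C_{\text{tiered}} &= 3\,p_d\,(0.2\,D) + p_o\,(0.8\,D) = (0.6\,p_d + 0.8\,p_o)\,D \\
\text{saving}     &= 1 - \frac{C_{\text{tiered}}}{C_{\text{disk}}}
                   = 0.8\left(1 - \frac{p_o}{3\,p_d}\right)
\end{aligned}
```

Under these assumptions the saving exceeds 70 % whenever $p_o < 0.375\,p_d$. For example, with $p_o = 0.3\,p_d$ the saving is $0.8 \times 0.9 = 72\,\%$.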

To enable tiering, users first create a RESOURCE that points to an object‑storage bucket (AWS, Azure, Alibaba Cloud, etc.), then define a STORAGE POLICY that names the target resource and the cooldown interval, for example CREATE STORAGE POLICY testPolicy PROPERTIES("storage_resource"="remote_s3", "cooldown_ttl"="1d").
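The two steps might look like the sketch below. The endpoint, region, bucket, root path, and credentials are placeholders. The `s3.*` property names are written as I recall them from the Doris 2.0 documentation; earlier releases used different key names, so check the reference for your version.

```sql
-- Step 1: a resource describing the object-storage bucket (all values are placeholders).
CREATE RESOURCE "remote_s3"
PROPERTIES
(
    "type"          = "s3",
    "s3.endpoint"   = "s3.example.com",
    "s3.region"     = "example-region",
    "s3.bucket"     = "example-bucket",
    "s3.root.path"  = "doris/cold",
    "s3.access_key" = "<access-key>",
    "s3.secret_key" = "<secret-key>"
);

-- Step 2: a policy that cools data to that resource one day after it is written.
CREATE STORAGE POLICY testPolicy
PROPERTIES
(
    "storage_resource" = "remote_s3",
    "cooldown_ttl"     = "1d"
);
```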

Tables or partitions can be bound to a storage policy via the storage_policy property in the CREATE TABLE statement, allowing fine‑grained control over which data is cooled.
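A minimal sketch of the table-level binding is shown here. The table names and columns are hypothetical, and the policy name `testPolicy` is assumed to exist already.

```sql
-- Bind the policy when the table is created.
CREATE TABLE IF NOT EXISTS access_log
(
    event_date DATE,
    user_id    BIGINT,
    url        VARCHAR(2048)
)
DUPLICATE KEY(event_date, user_id)
DISTRIBUTED BY HASH(user_id) BUCKETS 3
PROPERTIES
(
    "storage_policy" = "testPolicy"
);

-- Or attach the policy to an existing (hypothetical) table that has none yet.
ALTER TABLE access_log_old SET ("storage_policy" = "testPolicy");
```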

Once the cooldown time has elapsed, the tablet's LocalDataSize drops to 0 and RemoteDataSize reports the size now held in object storage. The SHOW TABLETS command can be used to verify the migration.
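The check could be run as follows, where `access_log` is a hypothetical table bound to a storage policy:

```sql
-- Inspect per-tablet sizes; compare the LocalDataSize and RemoteDataSize columns.
SHOW TABLETS FROM access_log;
```

A tablet that has been fully cooled should report 0 for LocalDataSize and a non-zero RemoteDataSize.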

Doris 2.0 optimizes query execution for cold data: the first query downloads the remote rowset to a local block cache, and subsequent queries hit the cache, resulting in query latency comparable to hot‑only tables (≈5.8 s in benchmark tests).

Cold‑data compaction is also supported; a single replica performs compaction and uploads compacted parts to object storage, reducing storage footprint without affecting availability.

A block‑level cache further improves cold‑data read performance by keeping frequently accessed blocks locally, using an LRU policy.

Overall, hot‑cold tiering in Apache Doris 2.0 provides significant cost savings, maintains query performance, and offers flexible management of data lifecycle across multiple storage media.
