Databases 18 min read

Sharding, Partitioning, and Auto Bucket in Apache Doris: Design, Implementation, and Best Practices

This article explains how Apache Doris implements data sharding, partitioning (both list and range, static and dynamic), batch partition creation, and the Auto Bucket feature, providing detailed syntax, algorithms, and performance benefits for large‑scale distributed databases.

DataFunTalk
DataFunTalk
DataFunTalk
Sharding, Partitioning, and Auto Bucket in Apache Doris: Design, Implementation, and Best Practices

Data sharding (Sharding) is a divide‑and‑conquer design in distributed databases, allowing unlimited table size by distributing data across nodes.

Sharding can be range or hash; implemented as partitions and buckets. Partitioning slices data by a column (often time) for pruning, while bucket (hash) groups rows with the same hash value.

In Apache Doris, tables are stored as partitions, then buckets, forming tablets, the smallest physical storage unit.

Current Situation and Problems

Example CREATE TABLE statement for a site‑PV table, showing placeholders for PARTITION_DESC and BUCKET_DESC.

Creating Partitions

Doris supports List and Range partitions.

List Partition

Enumerates values of a column (e.g., city) and creates partitions accordingly.

-- List partition example
PARTITION BY LIST(city) (
    PARTITION `p_huabei` VALUES IN ("beijing","tianjin","shijiazhuang"),
    PARTITION `p_dongbei` VALUES IN ("shenyang","dalian"),
    PARTITION `p_huazhong` VALUES IN ("wuhan","changsha"),
    PARTITION `p_xinan` VALUES IN ("chengdu","chongqing")
)

Range Partition

Usually based on a time column; can be static (explicit boundaries) or dynamic (auto‑generated).

-- Static range partition example
PARTITION BY RANGE(sdate) (
    PARTITION `p2022` VALUES LESS THAN ("2023-01-01"),
    PARTITION `p20230101` VALUES LESS THAN ("2023-01-02"),
    PARTITION `pmax` VALUES LESS THAN ("9999-12-31")
)

Dynamic range partition only needs the partition column; other parameters are set in the PARTITION clause.

PARTITION BY RANGE(sdate)()

Additional properties control dynamic partitioning (start, end, time unit, prefix, etc.).

Creating Buckets

Buckets correspond to tablets; total tablets = partitions × buckets × replication factor.

DISTRIBUTED BY HASH(site) BUCKETS 20

Bucket count should be chosen based on partition size (recommended 1‑10 GB per tablet).

Batch Partition and Auto Bucket Design

Apache Doris 1.2.1 adds batch partition creation with concise syntax; 1.2.2 adds Auto Bucket, which automatically calculates bucket count.

Batch Partition Syntax

PARTITION BY RANGE(sdate) (
    FROM ("2013-01-01") TO ("2023-01-01") INTERVAL 1 DAY
)

Multiple time granularities can be combined in a single statement.

PARTITION BY RANGE(sdate) (
    FROM ("2000-01-01") TO ("2021-01-01") INTERVAL 1 YEAR,
    FROM ("2021-01-01") TO ("2022-01-01") INTERVAL 1 MONTH,
    FROM ("2022-01-01") TO ("2023-01-01") INTERVAL 1 WEEK,
    FROM ("2023-01-01") TO ("2023-02-01") INTERVAL 1 DAY,
    FROM ("2023-02-01 00") TO ("2099-12-31 23") INTERVAL 1 HOUR
)

Auto Bucket

Old syntax:

DISTRIBUTED BY HASH(site) BUCKETS 20

New syntax:

DISTRIBUTED BY HASH(site) BUCKETS AUTO

with additional property:

properties("estimate_partition_size" = "100G")

The system computes an initial bucket count N from the estimated partition size, a disk‑capacity‑based count M, then selects the final bucket count using min(N, M, 128) and BE node count logic. Subsequent buckets are adjusted using recent partition size trends (EMA or delta‑based estimation).

Effect

Proper partitioning and bucketing reduce scan volume dramatically (e.g., 1/600 of full table in a query filtering by date and site).

Conclusion

Batch partition syntax simplifies massive historical partition creation, while Auto Bucket removes manual bucket tuning, improving usability and query performance.

ShardingDistributed DatabasepartitioningApache DorisAuto BucketBatch Partition
DataFunTalk
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.

0 followers
Reader feedback

How this landed with the community

login Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.