Understanding Prefix Index, Partitioning, Bucketing, and Flink Integration in Apache Doris
This article explains Doris’s prefix index mechanism, best practices for partitioning and bucketing, and how to correctly use Flink’s batch writes with sequence columns to ensure ordered updates, providing practical guidance for optimizing OLAP workloads in Apache Doris.
Hello everyone. Here are some experiences and lessons learned that you can apply directly.
Prefix Index
Unlike traditional databases, Doris does not support creating indexes on arbitrary columns; as an MPP OLAP database, it handles large data volumes by increasing concurrency.
Essentially, Doris stores data in an SSTable-like structure, an ordered format that enables efficient lookups on sorted columns.
In the three data models (Aggregate, Unique, and Duplicate), the underlying storage is sorted according to the columns defined in the AGGREGATE KEY, UNIQUE KEY, or DUPLICATE KEY clause. A prefix index builds on this sorting to provide fast queries based on a given prefix of the key columns.
Therefore, when a WHERE clause includes a prefix column of the key, the prefix index can be triggered to accelerate filtering.
For example, if the key order is (k1, k2, k3, v1, v2), then:
where k1 = ... and k2 ... can hit the prefix index;
where k1 = ... and k3 ... only k1 can hit the prefix index;
where k3 ... cannot hit the prefix index.
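The three cases above can be illustrated with a small Python sketch. It is a simplified model, not Doris's actual implementation: rows are kept in a list sorted by (k1, k2, k3), and a binary search on the leading columns stands in for the prefix index. The sample rows and the helper name `prefix_range` are made up for illustration.

```python
from bisect import bisect_left, bisect_right

# Rows sorted by the key columns (k1, k2, k3), mimicking sorted storage.
rows = sorted([
    (1, 10, 100), (1, 10, 200), (1, 20, 100),
    (2, 10, 100), (2, 20, 300), (3, 10, 100),
])

def prefix_range(sorted_rows, prefix):
    """Return [lo, hi) bounds of the rows whose leading columns equal `prefix`."""
    n = len(prefix)
    keys = [r[:n] for r in sorted_rows]  # leading n columns of every row
    return bisect_left(keys, prefix), bisect_right(keys, prefix)

# where k1 = 1 and k2 = 10: both columns narrow the range via binary search.
lo, hi = prefix_range(rows, (1, 10))
hit_k1_k2 = rows[lo:hi]

# where k1 = 1 and k3 = 100: only k1 narrows the range; k3 is checked row by row.
lo, hi = prefix_range(rows, (1,))
scanned_k1 = hi - lo
hit_k1_k3 = [r for r in rows[lo:hi] if r[2] == 100]

# where k3 = 100: no usable prefix, so every row has to be examined.
scanned_k3 = len(rows)
hit_k3 = [r for r in rows if r[2] == 100]

print(hit_k1_k2, scanned_k1, scanned_k3)
```

In this toy data set the k1-only lookup examines 3 rows while the k3-only filter examines all 6, which is the difference the prefix index is meant to exploit.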
Partition and Bucketing
Partition
Generally, for large online data volumes, partitioning is recommended. Doris imports data at partition granularity; a single import updates all tablets within the same partition, reducing compaction pressure after import. Partition columns are usually time columns, enabling partition pruning when the WHERE clause filters on time, which is a crucial optimization.
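As a rough illustration of partition pruning, the sketch below keeps only the partitions whose time range overlaps the range in the WHERE clause. The partition names, the daily granularity, and the `prune` helper are assumptions made for this example; the real pruning happens inside the Doris query planner.

```python
from datetime import date

# Hypothetical daily range partitions: name -> [start, end)
partitions = {
    "p20240101": (date(2024, 1, 1), date(2024, 1, 2)),
    "p20240102": (date(2024, 1, 2), date(2024, 1, 3)),
    "p20240103": (date(2024, 1, 3), date(2024, 1, 4)),
    "p20240104": (date(2024, 1, 4), date(2024, 1, 5)),
}

def prune(parts, lo, hi):
    """Keep partitions whose [start, end) range overlaps the query range [lo, hi)."""
    return [name for name, (start, end) in parts.items()
            if start < hi and lo < end]

# Equivalent of: WHERE dt >= '2024-01-02' AND dt < '2024-01-04'
selected = prune(partitions, date(2024, 1, 2), date(2024, 1, 4))
print(selected)
```

Only two of the four partitions remain to be scanned; the others are skipped without reading any data.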
Bucketing
Three rules of thumb: the number of buckets should not be excessive, and 64 is usually sufficient; a single bucket should not hold too much data, with an official recommendation of 1 GB to 10 GB (in practice, around 1 GB); and a bucket works best when it holds rows on the order of millions.
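These rules of thumb can be turned into a quick back-of-the-envelope calculation. The function below is a hypothetical helper, not a Doris API: it divides the expected partition size by a target bucket size of about 1 GB and caps the result at 64.

```python
GIB = 1024 ** 3

def suggest_buckets(partition_bytes, target_bucket_bytes=GIB, max_buckets=64):
    """Suggest a bucket count: roughly one bucket per target size, capped at max_buckets."""
    needed = -(-partition_bytes // target_bucket_bytes)  # integer ceiling division
    return max(1, min(needed, max_buckets))

small = suggest_buckets(100 * 1024 ** 2)   # ~100 MB partition
medium = suggest_buckets(20 * GIB)         # ~20 GB partition
large = suggest_buckets(500 * GIB)         # ~500 GB partition
print(small, medium, large)
```

For the 500 GB case the cap applies, which leaves each bucket at roughly 7.8 GB, still inside the 1 GB to 10 GB range quoted above. Beyond that size, splitting the data into more partitions is likely a better lever than adding buckets.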
Flink Writing to Doris
When Flink writes to Doris, note that the write is batch-based and the order of rows within a batch is not guaranteed. If the same row (the same key) is updated several times in quick succession, the updates may be applied out of order.
To address this, Doris supports a sequence column; by specifying a sequence column during import, rows with the same key and REPLACE aggregation type will be replaced according to the sequence value—larger values overwrite smaller ones, and the ordering is controlled by the user.
However, the sequence column can only be used under the Unique data model.
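The effect of a sequence column can be simulated in a few lines of Python. This is only a model of the replacement rule described above, not Doris or Flink connector code; the key `order-1`, the status values, and the choice to let an equal sequence value overwrite are assumptions of the sketch.

```python
def apply_batch(table, batch, use_sequence):
    """Apply (key, seq, value) rows to table, a dict of key -> (seq, value)."""
    for key, seq, value in batch:
        current = table.get(key)
        # With a sequence column, a row carrying a smaller sequence value
        # must not overwrite a row carrying a larger one.
        if use_sequence and current is not None and seq < current[0]:
            continue
        table[key] = (seq, value)
    return table

# The same key updated three times, arriving out of order within one batch.
batch = [
    ("order-1", 3, "PAID"),
    ("order-1", 1, "CREATED"),
    ("order-1", 2, "PENDING"),
]

without_seq = apply_batch({}, batch, use_sequence=False)
with_seq = apply_batch({}, batch, use_sequence=True)
print(without_seq["order-1"], with_seq["order-1"])
```

Without the sequence check, the last row to arrive wins and the order ends up as PENDING; with it, the row with the largest sequence value (PAID) is kept regardless of arrival order.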
OK, class dismissed.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
