Understanding Prefix Index, Partitioning, Bucketing, and Flink Integration in Apache Doris

This article explains Doris’s prefix index mechanism, best practices for partitioning and bucketing, and how to correctly use Flink’s batch writes with sequence columns to ensure ordered updates, providing practical guidance for optimizing OLAP workloads in Apache Doris.

Big Data Technology & Architecture

Dec 1, 2023

Hello everyone. Here are some lessons from practice, summarized so you can apply them directly.

Prefix Index

Unlike traditional databases, Doris does not support creating indexes on arbitrary columns. As an MPP OLAP database, it typically handles large data volumes by increasing concurrency instead.

Essentially, Doris stores data in an SSTable-like structure, an ordered format that enables efficient lookups on sorted columns.

In all three data models (Aggregate, Unique, and Duplicate), the underlying storage is sorted according to the columns defined in the AGGREGATE KEY, UNIQUE KEY, or DUPLICATE KEY clause. A prefix index builds on this sort order to provide fast queries on a given prefix of the key columns.
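The effect of sorted storage can be sketched with a toy model. This is only an illustration of why a leading key column is cheap to look up, not Doris's actual storage format; the rows and the helper names `lookup_by_k1` and `lookup_by_k3` are made up for the example.

```python
from bisect import bisect_left, bisect_right

# Toy stand-in for rows stored in key order (k1, k2, k3); not Doris's real format.
rows = sorted([
    (1, "a", 10), (1, "b", 20), (2, "a", 30),
    (2, "b", 40), (3, "a", 50), (3, "b", 60),
])


def lookup_by_k1(sorted_rows, k1):
    """Binary-search the contiguous run of rows whose leading column equals k1."""
    # Building this list is O(n) and only for illustration; a real engine
    # keeps a sparse index over the leading columns instead.
    keys = [r[0] for r in sorted_rows]
    lo = bisect_left(keys, k1)
    hi = bisect_right(keys, k1)
    return sorted_rows[lo:hi]


def lookup_by_k3(sorted_rows, k3):
    """k3 is not a leading column, so matching rows are scattered: full scan."""
    return [r for r in sorted_rows if r[2] == k3]
```

Rows with the same k1 sit next to each other, so a binary search finds them. Rows with the same k3 can be anywhere, so every row has to be checked.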

So when a WHERE clause filters on a leading (prefix) column of the key, the prefix index can be used to speed up filtering.

For example, if the table's columns are ordered (k1, k2, k3, v1, v2), with k1, k2, k3 as the key columns, then:

WHERE k1 = ... AND k2 ... can hit the prefix index.

WHERE k1 = ... AND k3 ...: only the k1 condition can hit the prefix index.

WHERE k3 ... cannot hit the prefix index.
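The three cases follow one rule: only an unbroken run of leading key columns counts. Here is a minimal sketch of that rule, assuming a simplified model that ignores predicate types and any limit on prefix length; `prefix_index_columns` is a hypothetical helper, not a Doris API.

```python
def prefix_index_columns(key_columns, filtered_columns):
    """Return the leading key columns that a filter can use, in key order."""
    usable = []
    for col in key_columns:
        if col not in filtered_columns:
            break  # the prefix is broken at the first key column without a condition
        usable.append(col)
    return usable


key = ["k1", "k2", "k3"]
print(prefix_index_columns(key, {"k1", "k2"}))  # ['k1', 'k2']
print(prefix_index_columns(key, {"k1", "k3"}))  # ['k1']
print(prefix_index_columns(key, {"k3"}))        # []
```

In practice this means the columns you filter on most often should come first in the key definition.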

Partition and Bucketing

Partition

Partitioning is generally recommended for large production data volumes. Doris imports data at partition granularity; a single import updates all tablets within the same partition, which reduces compaction pressure after the import. Partition columns are usually time columns, so a WHERE clause that filters on time lets Doris prune partitions, which is a crucial optimization.
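Partition pruning can be pictured as a range-overlap check. The sketch below assumes hypothetical monthly range partitions (the names and boundaries are invented) and shows only the idea, not how the Doris planner implements it.

```python
from datetime import date

# Hypothetical monthly range partitions: name -> [start, end)
partitions = {
    "p202310": (date(2023, 10, 1), date(2023, 11, 1)),
    "p202311": (date(2023, 11, 1), date(2023, 12, 1)),
    "p202312": (date(2023, 12, 1), date(2024, 1, 1)),
}


def prune(parts, query_start, query_end):
    """Keep only partitions whose [start, end) overlaps [query_start, query_end)."""
    return [
        name
        for name, (start, end) in parts.items()
        if start < query_end and query_start < end
    ]
```

A query for 2023-11-15 up to 2023-12-05 would then read only p202311 and p202312, and p202310 is never scanned.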

Bucketing

Three basic conclusions. First, the number of buckets should not be excessive; 64 is generally sufficient. Second, a single bucket should not hold too much data: the official recommendation is 1 GB to 10 GB, and in practice around 1 GB works well. Third, a bucket performs best when it holds on the order of millions of rows.
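These rules of thumb can be turned into a rough starting point. The sketch below assumes you know the approximate size of one partition; `suggest_bucket_count` and its defaults are hypothetical and simply encode the numbers above, so treat the result as a first guess rather than a tuned value.

```python
import math


def suggest_bucket_count(partition_size_gb, target_bucket_gb=1.0, max_buckets=64):
    """Aim for roughly target_bucket_gb per bucket, capped at max_buckets."""
    if partition_size_gb <= 0:
        return 1
    wanted = math.ceil(partition_size_gb / target_bucket_gb)
    return max(1, min(wanted, max_buckets))


print(suggest_bucket_count(0.5))  # 1
print(suggest_bucket_count(20))   # 20
print(suggest_bucket_count(500))  # 64
```

If the cap leaves each bucket well above 10 GB, that suggests using finer partitions rather than adding more buckets.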

Flink Writing to Doris

When Flink writes to Doris, the write is batch-based and the order of rows within a batch is not guaranteed. If the same key is updated several times in quick succession, the updates may be applied out of order.

To address this, Doris supports a sequence column. When a sequence column is specified during import, rows with the same key and the REPLACE aggregation type are replaced according to the sequence value: larger values overwrite smaller ones, so the user controls the ordering.
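The replacement rule can be simulated in a few lines. This is a toy model of the semantics described above, not Doris or Flink connector code. `apply_batch` is a hypothetical helper, and letting a row with an equal sequence value overwrite the stored one is an assumption made for the example.

```python
def apply_batch(table, batch, use_sequence):
    """Apply rows (key, value, seq) to a dict that maps key -> (value, seq)."""
    for key, value, seq in batch:
        current = table.get(key)
        if use_sequence and current is not None and seq < current[1]:
            continue  # an older version arrived late; keep the newer one
        table[key] = (value, seq)
    return table


# Two updates to the same key arrive out of order within one batch.
batch = [("user_1", "status=paid", 2), ("user_1", "status=created", 1)]

print(apply_batch({}, batch, use_sequence=False))  # {'user_1': ('status=created', 1)}
print(apply_batch({}, batch, use_sequence=True))   # {'user_1': ('status=paid', 2)}
```

Without the sequence check, whichever row is applied last wins. With it, the row with the larger sequence value wins regardless of arrival order. An event timestamp or a version number from the source system is a typical choice for the sequence value.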

However, the sequence column can only be used with the Unique data model.

OK, class dismissed.



Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
