Apache Doris 3.1 Unveiled: Variant, Index, and Lakehouse Boosts
The Apache Doris 3.1 release strengthens lake-house capabilities with major upgrades to the VARIANT data type, vertical compaction, inverted index storage, new tokenizers, enhanced materialized view support for Iceberg/Paimon/Hudi, and numerous query-performance optimizations such as faster partition pruning and dynamic partition pruning. Together, these changes make handling thousands of columns and large-scale semi-structured data noticeably smoother.
Hello everyone, we meet again. After almost a year, the Doris community has released version 3.1. Since there is little online analysis of the new version, we will explore the new capabilities in production environments.
We also saw that Doris is preparing version 4.0, which will add support for large‑model (LLM) functions.
Earlier we provided a detailed interpretation of version 3.0, which you can find in "Apache Doris 3.0 Core Features and Production Practices". Version 3.0 emphasized lake-house integration and compute-storage separation:
Version 3.0 is a milestone for Apache Doris on the lake-house integration path.
Starting from the 3.0 series, Apache Doris supports a compute-storage separation mode, allowing users to choose between integrated or separated deployment. Based on a cloud-native separated architecture, users can achieve physical and read/write isolation across multiple compute clusters and reduce storage costs by using shared object storage or HDFS.

In version 3.1, lake-house capabilities are further strengthened. The official positioning states:
Version 3.1 is a milestone for Apache Doris in semi-structured analysis, with significant enhancements on top of the lake-house integration.

VARIANT Semi-Structured Query Enhancements (★★★★★)
In 3.0, the community optimized the VARIANT data type to handle JSON more efficiently. In 3.1, Doris raises the column limit to tens of thousands by using sparse sub‑columns and sub‑column‑level vertical compaction.
Benefits include:
Stable support for thousands to tens of thousands of sub-columns, with smoother query and compaction latency.
Controlled metadata and index growth, avoiding exponential bloat.
Practical extraction of 10,000+ sub‑columns with efficient compaction.
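As a minimal sketch of what this looks like in practice (the table name, column names, and sample JSON below are hypothetical), a VARIANT column is declared like any other column and its sub-columns are read with bracket paths:

CREATE TABLE IF NOT EXISTS events_demo (
    event_id BIGINT,
    payload  VARIANT
)
DUPLICATE KEY(event_id)
DISTRIBUTED BY HASH(event_id) BUCKETS 10
PROPERTIES ("replication_num" = "1");

INSERT INTO events_demo VALUES
    (1, '{"user": {"id": 42, "city": "Beijing"}, "action": "click"}'),
    (2, '{"user": {"id": 43, "device": "ios"}, "action": "view"}');

-- Frequently occurring keys are extracted as real sub-columns; rarely used keys
-- fall into the sparse path, which is what keeps 10,000+ keys manageable.
SELECT CAST(payload['user']['id'] AS BIGINT) AS user_id,
       CAST(payload['action'] AS TEXT)       AS action
FROM events_demo
WHERE CAST(payload['action'] AS TEXT) = 'click';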
Inverted Index Architecture Optimization (★★★★)
The 3.1 release iterates the inverted index storage format from V2 to V3, reducing index file size by up to 20% and lowering disk I/O, ideal for large‑scale text and log analysis.
Key improvements:
Introduce ZSTD dictionary compression for the index dictionary (enabled via dict_compression).
Add compression for term position information, further shrinking index space.
Three new tokenizers are also added: ICU Tokenizer, IK Tokenizer, and Basic Tokenizer.
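For orientation, the sketch below shows how an inverted index is typically declared on a log table; the table is hypothetical, and the property name used to opt into the new storage format is an assumption based on the feature description above, so verify it against the 3.1 release notes.

CREATE TABLE IF NOT EXISTS app_logs (
    ts  DATETIME,
    msg TEXT,
    INDEX idx_msg (msg) USING INVERTED PROPERTIES ("parser" = "english")   -- tokenizer choice
)
DUPLICATE KEY(ts)
DISTRIBUTED BY HASH(ts) BUCKETS 10
PROPERTIES (
    "replication_num" = "1",
    "inverted_index_storage_format" = "V3"   -- assumed property name for the new format
);

-- Full-text retrieval served by the inverted index
SELECT count(*) FROM app_logs WHERE msg MATCH_ANY 'timeout error';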
Lake‑House Capability Enhancements (★★★)
Materialized View Support
Asynchronous materialized views now fully support partition incremental builds and transparent rewrites for Paimon, Iceberg, and Hudi.
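A hedged sketch of what a partition-incremental async materialized view over a lake table can look like; the catalog, database, table, and column names are hypothetical, and the refresh clause should be adapted to your scheduling needs.

CREATE MATERIALIZED VIEW mv_orders_daily
BUILD IMMEDIATE
REFRESH AUTO ON SCHEDULE EVERY 1 HOUR
PARTITION BY (order_date)
DISTRIBUTED BY HASH(order_date) BUCKETS 10
PROPERTIES ("replication_num" = "1")
AS
SELECT order_date, count(*) AS order_cnt, sum(amount) AS total_amount
FROM iceberg_catalog.sales_db.orders
GROUP BY order_date;

-- Queries against the base table that match this shape can be transparently rewritten to hit the MV,
-- and only changed partitions of the Iceberg table need to be refreshed.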
Lake Framework Extensions
Full lifecycle management (Branch/Tag) for Iceberg.
Batch incremental query and Branch/Tag reads for Paimon.
Data Lake Query Improvements
Dynamic partition pruning: derives partition predicates at runtime from the right (dimension) table in multi-table joins, so only the matching partitions of the fact table are scanned, reducing I/O.
Batch shard execution: produces and executes data shard information in batches, lowering FE memory usage and improving overall efficiency.
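To make the join case concrete, here is the typical query shape that benefits from dynamic partition pruning (table names are hypothetical): the fact table is partitioned on the join key, and the qualifying keys are only known after filtering the dimension table at runtime.

-- fact_sales is partitioned by date_key; dim_date is small and filtered at runtime
SELECT d.date_key, sum(f.amount) AS revenue
FROM lake_catalog.sales_db.fact_sales f
JOIN lake_catalog.sales_db.dim_date  d ON f.date_key = d.date_key
WHERE d.is_holiday = 1
GROUP BY d.date_key;
-- Partition predicates on f.date_key are built from the dim_date rows that survive the filter,
-- so only the matching partitions of fact_sales are read.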
Query Performance Boosts (★★★★)
For tables with tens of thousands of partitions, 3.1 introduces several optimizations:
Binary search‑based partition pruning for faster location of needed partitions.
Inclusion of many monotonic functions in partition pruning.
The official explanation:
In real scenarios, filter conditions on time partition columns often involve complex expressions rather than simple comparisons. For example:
to_date(time_stamp) > '2022-12-22',
date_format(timestamp, '%Y-%m-%d %H:%i:%s') > '2022-12-22 11:00:00'
Doris 3.1 introduces monotonic function descriptions; when a function is monotonic, Doris can determine if an entire partition can be pruned by evaluating boundary values.
Version 3.1 already supports CAST and 25 common time-related functions, covering most typical time-type partition filters. Although tables with extremely many partitions are uncommon, the capability is worth trying if you have them; a small sketch follows.
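The sketch below (hypothetical table) shows where this helps: a range-partitioned table whose filter wraps the partition column in a monotonic function can still be pruned at planning time by evaluating the function at each partition's boundaries.

CREATE TABLE IF NOT EXISTS metrics (
    time_stamp DATETIME,
    metric     DOUBLE
)
DUPLICATE KEY(time_stamp)
PARTITION BY RANGE(time_stamp) (
    PARTITION p20221221 VALUES LESS THAN ('2022-12-22 00:00:00'),
    PARTITION p20221222 VALUES LESS THAN ('2022-12-23 00:00:00')
)
DISTRIBUTED BY HASH(time_stamp) BUCKETS 10
PROPERTIES ("replication_num" = "1");

-- to_date() is monotonic in time_stamp, so evaluating it at the partition
-- boundaries shows p20221221 can never match and is pruned.
SELECT count(*) FROM metrics WHERE to_date(time_stamp) >= '2022-12-22';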
Other minor improvements are present but not detailed here.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.