Apache Doris 4.1: A Unified Data Store and Retrieval Engine for AI & Search

Apache Doris 4.1 introduces a systematic evolution for AI and search workloads, adding low‑cost massive vector storage, unified structured, full‑text and vector search, 100 MB JSON document support, Segment V3 metadata decoupling, sparse column optimizations, lakehouse lifecycle management, and a suite of performance‑boosting features such as aggregate push‑down, condition cache, and spill‑to‑disk, all backed by detailed benchmark results.

DataFunSummit

Unified AI & Search Platform

In the AI era, databases become the infrastructure for intelligent agents, Retrieval‑Augmented Generation (RAG) systems, large‑model applications, and AI observability platforms. Apache Doris 4.1 is positioned as a version that systematically evolves to meet these needs, offering low‑cost massive AI data storage and a unified solution for structured, vector, full‑text, trace, and event‑stream data.

Enhanced Vector Retrieval

Doris 4.1 significantly improves vector indexing and query performance. The ANN Index Only Scan optimization eliminates I/O on the original columns during vector search, delivering up to 4× faster index queries. In a typical test (1 M vectors, 16‑core CPU, 64 GB RAM), the system achieves about 900 QPS at 97% recall, which covers most production‑grade vector retrieval scenarios.

CREATE TABLE sift_1M (
  id INT NOT NULL,
  embedding ARRAY<FLOAT> NOT NULL COMMENT "",
  INDEX ann_index (embedding) USING ANN PROPERTIES(
    "index_type"="ivf",
    "metric_type"="l2_distance",
    "dim"="128",
    "nlist"="1024"
  )
) ENGINE=OLAP
DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 1
PROPERTIES ("replication_num" = "1");

Compared with the previously available HNSW index, the new IVF index uses less memory and scales to larger vector sets with minimal recall loss. In production, this means more vectors on the same hardware at a significantly lower memory cost, which makes IVF the better fit for large‑scale vector retrieval.

CREATE TABLE for_ivf_on_disk (
  id BIGINT NOT NULL,
  embedding ARRAY<FLOAT> NOT NULL,
  INDEX idx_emb (embedding) USING ANN PROPERTIES(
    "index_type"="ivf_on_disk",
    "metric_type"="l2_distance",
    "dim"="128",
    "nlist"="1024"
  )
) ENGINE=OLAP
DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 8
PROPERTIES ("replication_num" = "1");

IVF_ON_DISK combines an in‑memory cache with a local file‑system cache to achieve low‑cost, high‑performance vector pruning. Compared with DiskANN, it has lower index‑build overhead, making trillion‑scale vector search feasible.
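As a rough illustration of the IVF pruning idea described above, the following Python sketch assigns vectors to their nearest centroid and then scans only the closest lists at query time. It is a toy model, not Doris's implementation: the centroids are hand‑picked rather than trained, and the names build_ivf, ivf_search, and nprobe are hypothetical.

```python
def l2(a, b):
    """Squared L2 distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_ivf(vectors, centroids):
    """Assign each vector id to the inverted list of its nearest centroid."""
    lists = {c: [] for c in range(len(centroids))}
    for vid, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: l2(vec, centroids[c]))
        lists[nearest].append(vid)
    return lists

def ivf_search(query, vectors, centroids, lists, nprobe, k):
    """Scan only the nprobe lists whose centroids are closest to the query."""
    by_distance = sorted(range(len(centroids)), key=lambda c: l2(query, centroids[c]))
    candidates = [vid for c in by_distance[:nprobe] for vid in lists[c]]
    return sorted(candidates, key=lambda vid: l2(query, vectors[vid]))[:k]

# Six 2-d vectors in three obvious clusters; three centroids play the role of "nlist".
vectors = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9), (9.9, 0.1), (10.0, 0.0)]
centroids = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]  # hand-picked, not trained

lists = build_ivf(vectors, centroids)
top = ivf_search((5.0, 5.0), vectors, centroids, lists, nprobe=1, k=2)
print(lists)
print(top)
```

With nprobe=1 the query touches only one of the three lists, which is the source of both the speedup and the small recall loss: a true neighbor that landed in an unprobed list is never examined.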

Vector Quantization

Doris supports INT8, INT4, and Product Quantization (PQ) schemes. These compress index memory to 1/4–1/8 of the original size with only slight recall degradation. When combined with IVF_ON_DISK, they further reduce large‑scale vector retrieval costs.

CREATE TABLE product_quant (
  id BIGINT NOT NULL,
  embedding ARRAY<FLOAT> NOT NULL,
  INDEX idx_emb (embedding) USING ANN PROPERTIES(
    "index_type"="ivf_on_disk",
    "metric_type"="l2_distance",
    "dim"="128",
    "nlist"="1024",
    "quantizer"="pq",
    "pq_m"="64",
    "pq_nbits"="8"
  )
) ENGINE=OLAP
DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 8
PROPERTIES ("replication_num" = "1");
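The 1/4 to 1/8 compression range can be checked with simple arithmetic. The Python sketch below estimates the per‑vector code size for the 128‑dimensional case used in the table definitions, under the assumption that vectors are stored as 32‑bit floats before quantization. It ignores codebooks and other index overhead, and bytes_per_vector is a hypothetical helper, not a Doris function.

```python
def bytes_per_vector(dim, scheme, pq_m=None, pq_nbits=None):
    """Approximate per-vector code size in bytes (codebooks and overhead ignored)."""
    if scheme == "float32":
        return dim * 4                      # 4 bytes per dimension
    if scheme == "int8":
        return dim                          # 1 byte per dimension
    if scheme == "int4":
        return (dim + 1) // 2               # two 4-bit values per byte
    if scheme == "pq":
        return (pq_m * pq_nbits + 7) // 8   # one pq_nbits-bit code per sub-vector
    raise ValueError(f"unknown scheme: {scheme}")

dim = 128
raw = bytes_per_vector(dim, "float32")
sizes = {
    "int8": bytes_per_vector(dim, "int8"),
    "int4": bytes_per_vector(dim, "int4"),
    "pq": bytes_per_vector(dim, "pq", pq_m=64, pq_nbits=8),
}
ratios = {name: raw / size for name, size in sizes.items()}
print(raw, sizes, ratios)
```

Under these assumptions, INT8 gives a 4× reduction, while INT4 and PQ with pq_m=64 and pq_nbits=8 both give 8×, which matches the range quoted above.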

Full‑Text Search via search() Function

The new search() function embeds full‑text search directly into SQL. It supports ES‑style query_string syntax; operators such as TERM, PHRASE, WILDCARD, REGEXP, PREFIX, NOT, and NESTED; BM25 relevance scoring; and multi‑field strategies (best_fields and cross_fields). The example queries below demonstrate multi‑condition filtering, BM25 ranking, nested JSON array search, and search combined with aggregation.

-- Multi‑condition: TERM + PHRASE + NOT evaluated in a single pass
SELECT request_id, error_msg, latency_ms
FROM inference_logs
WHERE search('level:ERROR AND error_msg:"CUDA out of memory" AND NOT module:healthcheck AND model_name:gpt*')
  AND log_time > NOW() - INTERVAL 1 HOUR
ORDER BY latency_ms DESC
LIMIT 100;

-- BM25 relevance scoring
SELECT request_id, error_msg, score() AS relevance
FROM inference_logs
WHERE search('error_msg:"memory allocation failed" OR error_msg:"CUDA error"')
ORDER BY relevance DESC
LIMIT 20;

-- Nested search inside a VARIANT array
SELECT * FROM agent_logs
WHERE search('NESTED(steps, status:error AND tool:code_exec)');

-- Search + aggregation
SELECT model_name,
       COUNT(*) AS error_count,
       PERCENTILE_APPROX(latency_ms, 0.99) AS p99_latency
FROM inference_logs
WHERE search('level:ERROR AND error_msg:"CUDA out of memory"')
  AND log_time > NOW() - INTERVAL 1 HOUR
GROUP BY model_name
ORDER BY error_count DESC;
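To make the BM25 ranking used by score() more concrete, here is a minimal Python sketch of the standard Okapi BM25 formula. The parameters k1=1.2 and b=0.75 and the Lucene‑style IDF are common defaults assumed for illustration; the source does not state which variant or parameters Doris uses, and bm25_scores is a hypothetical helper.

```python
import math
from collections import Counter

def bm25_scores(docs, query_terms, k1=1.2, b=0.75):
    """Score each tokenized document against the query terms with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: how many documents contain each query term.
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
            # Term frequency saturates and is normalized by document length.
            norm = tf[t] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "cuda error out of memory".split(),
    "memory allocation failed on device".split(),
    "healthcheck ok".split(),
]
scores = bm25_scores(docs, ["cuda", "memory"])
print(scores)
```

The first document matches both terms and ranks highest, the second matches only the more common term, and the third matches nothing and scores zero.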

Native 100 MB JSON Document Support

Doris 4.1 can store a single JSON document up to 100 MB, enabling complete AI conversation histories, long documents, audio/video transcriptions, agent execution traces, tool‑call logs, and RAG contexts to be stored without splitting or external storage. These large documents are queryable with filters, conditions, aggregations, and joins, turning AI context data into a manageable, structured asset.

Keeping these documents in Doris rather than in external storage:

Eliminates dependence on separate object storage.

Removes consistency logic between metadata and raw content.

Reduces development overhead for segmenting and re‑assembling data.

Provides lower query latency, stronger transactional guarantees, and simpler operations.

Segment V3 – Decoupled Metadata for Wide Tables

For ultra‑wide, sparse, semi‑structured data, Segment V3 separates metadata from the file footer and loads it on demand. Compared with the previous Segment V2, opening a table with 7,000 columns and 10,000 segments becomes up to 16× faster and uses up to 60× less memory, dramatically improving response time and resource cost in wide‑table and high‑concurrency scenarios.

Sparse Column Optimizations

To handle JSON data with few hot paths and many cold paths, Doris 4.1 introduces:

Cold‑hot layering: hot paths stay as columnar sub‑columns; cold paths move to sparse storage, avoiding column bloat.

Sparse Sharding: the variant_sparse_hash_shard_count property distributes long‑tail paths across multiple sparse columns.

Sparse Cache: caches sparse columns, reducing repeated I/O, decode, and deserialization costs.
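The layering and sharding ideas above can be sketched in a few lines of Python. This is only a conceptual model: the CRC32 hash, the hot_limit parameter, and the function names are assumptions made for illustration, and only shard_count corresponds to a setting named in the text (variant_sparse_hash_shard_count).

```python
import zlib

def shard_for_path(path, shard_count):
    """Map a JSON path to one of shard_count sparse columns (hash choice is assumed)."""
    return zlib.crc32(path.encode("utf-8")) % shard_count

def layout(paths_by_frequency, hot_limit, shard_count):
    """Keep the most frequent paths as sub-columns; spread the rest over sparse shards."""
    hot = paths_by_frequency[:hot_limit]
    shards = {i: [] for i in range(shard_count)}
    for path in paths_by_frequency[hot_limit:]:
        shards[shard_for_path(path, shard_count)].append(path)
    return hot, shards

# Paths ordered from most to least frequent (hypothetical data).
paths = ["user.id", "event.type", "ctx.a", "ctx.b", "ctx.c", "ctx.d", "ctx.e"]
hot, shards = layout(paths, hot_limit=2, shard_count=3)
print(hot)
print(shards)
```

Because the shard is a pure function of the path, a reader looking for one cold path only needs to open the single sparse column it hashes to rather than every sparse column.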

Performance Benchmarks and Optimizations

Across multi‑table analytical workloads, Doris 4.1 shows:

SSB query throughput +14.3%.

TPC‑H +22.6%.

TPC‑DS +19.1%.

In ClickBench (100 GB, 43 complex queries) on a c7a.metal‑48xl instance, Doris ranks first in cold‑query performance and storage efficiency, and second overall, behind only ClickHouse.

Key engine‑level optimizations include:

Aggregate Push‑down Through Join: performs local aggregation before joins, reducing data volume and memory usage.

Aggregate Expansion: identifies fine‑grained groups and performs a two‑stage aggregation, yielding overall gains of more than 10%.

Nested Column Pruning: reads only the required nested fields, cutting I/O; tests show an overall speedup of more than 60%, and up to 700% in some cases.

Condition Cache: caches filter results per segment, avoiding repeated scans; complex queries speed up by more than 10%.

Query Cache: stores intermediate aggregation results for identical query contexts, reducing CPU and I/O.

CASE WHEN Optimizations: branch merging, branch elimination, common sub‑expression extraction, enum extraction, and push‑down speed up CASE WHEN execution by more than 200% on average, and by more than 50× in some cases.
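The first item in the list, aggregate push‑down through join, can be illustrated with a small Python model. The sketch below computes SUM(amount) per region in two ways and counts how many rows reach the join. The tables, column names, and functions are hypothetical, and the rewrite is only valid here because SUM can be split into partial sums per join key.

```python
from collections import defaultdict

orders = [  # (customer_id, amount)
    (1, 10.0), (1, 5.0), (2, 7.0), (2, 3.0), (2, 1.0), (3, 4.0),
]
customers = {1: "EU", 2: "US", 3: "EU"}  # customer_id -> region

def join_then_aggregate(orders, customers):
    """Baseline: join every order row to its customer, then sum per region."""
    totals = defaultdict(float)
    joined_rows = 0
    for cust_id, amount in orders:
        if cust_id in customers:
            totals[customers[cust_id]] += amount
            joined_rows += 1
    return dict(totals), joined_rows

def aggregate_then_join(orders, customers):
    """Push-down: pre-aggregate per join key so the join sees one row per customer."""
    partial = defaultdict(float)
    for cust_id, amount in orders:
        partial[cust_id] += amount
    totals = defaultdict(float)
    joined_rows = 0
    for cust_id, subtotal in partial.items():
        if cust_id in customers:
            totals[customers[cust_id]] += subtotal
            joined_rows += 1
    return dict(totals), joined_rows

baseline, baseline_rows = join_then_aggregate(orders, customers)
pushed, pushed_rows = aggregate_then_join(orders, customers)
print(baseline, baseline_rows)
print(pushed, pushed_rows)
```

Both plans return the same totals, but the pushed‑down plan feeds three rows into the join instead of six; on real fact tables with many rows per key, that reduction is where the data‑volume and memory savings come from.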

File Cache and Meta‑Service Improvements

File cache now persists metadata, avoiding heavy I/O at startup. The new system table information_schema.file_cache_info lets users inspect cache details by tablet_id, be_id, type, etc., facilitating hotspot detection and cache balancing.

Lakehouse Lifecycle Management

Doris 4.1 adds full lifecycle support for open lake formats:

Iceberg V2/V3: supports INSERT, UPDATE, DELETE, MERGE INTO, Deletion Vectors, and Row Lineage.

Paimon: enables SQL‑based catalog and table management.

Performance enhancements for lake queries include Iceberg sort‑write with partition pruning (+15% TPC‑DS), Manifest metadata cache (metadata latency reduced to sub‑second), and Parquet page cache (+20% ClickBench overall).

New SQL Syntax and Execution Features

UNNEST: native array expansion for semi‑structured data.

Recursive CTE: supports hierarchical and graph queries.

ASOF JOIN: time‑nearest join for streaming, financial, and IoT use cases.

MERGE INTO: single‑statement upsert (INSERT/UPDATE/DELETE) for CDC and incremental pipelines.

Spill‑to‑Disk Enhancements: recursive spill, coverage of Join, Aggregation, and Sort, and dynamic triggering for stable large‑scale queries.
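Of the features listed above, ASOF JOIN is the least familiar to many SQL users, so a small Python model of its semantics may help. The sketch assumes the common "latest row at or before the left timestamp" matching rule; the source does not specify which comparison operators Doris supports, and asof_join is a hypothetical helper rather than Doris syntax.

```python
from bisect import bisect_right

def asof_join(left, right):
    """For each (ts, payload) in left, attach the latest right row with ts <= left ts.

    Both inputs are lists of (timestamp, payload); right must be sorted by timestamp.
    Left rows with no earlier right row get None as the match.
    """
    right_ts = [ts for ts, _ in right]
    out = []
    for ts, payload in left:
        idx = bisect_right(right_ts, ts) - 1  # last right row at or before ts
        match = right[idx][1] if idx >= 0 else None
        out.append((ts, payload, match))
    return out

trades = [(100, "buy"), (105, "sell"), (99, "buy")]
quotes = [(100, 10.0), (103, 10.5), (106, 11.0)]  # sorted by timestamp
result = asof_join(trades, quotes)
print(result)
```

Unlike an equality join, the trade at timestamp 105 still finds a partner (the quote at 103), while the trade at 99 precedes every quote and is left unmatched.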

Streaming and Real‑Time Data Ingestion

Continuous load jobs now support:

S3 file source – automatic detection of new files.

MySQL and PostgreSQL CDC – full‑load plus incremental sync.

Routine Load gains flexible partial updates, dynamic parameter tuning, and audit logging.

Timestamp with Time Zone (TIMESTAMPTZ)

Doris 4.1 introduces native TIMESTAMPTZ (timestamp with time zone). Internally stored as UTC, it automatically converts to the session time zone on query, supports both zoned and zone‑less inputs, and is compatible with existing datetime functions.
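The store‑as‑UTC, display‑in‑session‑zone behavior described above can be modeled in Python with the standard datetime module. This sketch assumes that zone‑less inputs are interpreted in a default (session) time zone, which the source does not spell out; the fixed offsets stand in for named zones, and to_storage and to_session are hypothetical helpers.

```python
from datetime import datetime, timedelta, timezone

def to_storage(value, default_tz):
    """Normalize an input to UTC; zone-less inputs are interpreted in default_tz."""
    if value.tzinfo is None:
        value = value.replace(tzinfo=default_tz)
    return value.astimezone(timezone.utc)

def to_session(stored_utc, session_tz):
    """Render a stored UTC instant in the session time zone."""
    return stored_utc.astimezone(session_tz)

shanghai = timezone(timedelta(hours=8))   # fixed offset standing in for UTC+8
new_york = timezone(timedelta(hours=-5))  # fixed offset standing in for UTC-5

zoned = to_storage(datetime(2025, 1, 1, 12, 0, tzinfo=shanghai), default_tz=shanghai)
naive = to_storage(datetime(2025, 1, 1, 12, 0), default_tz=shanghai)
shown = to_session(zoned, new_york)
print(zoned.isoformat(), naive.isoformat(), shown.isoformat())
```

Both inputs normalize to the same stored instant, and a session in a different zone sees that instant shifted to its own local time, including the change of calendar date.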

Operational Enhancements

Elastic scaling of millions of shards in minutes under the compute‑storage separation model.

Cold‑query prefetching and remote storage bandwidth tuning.

Meta‑service caching to reduce metadata request latency.

Object‑storage request merging, cutting storage costs up to 90% in high‑frequency import scenarios.

Improved column compression (ZSTD) and binary encoding for lower storage and faster cold reads.

Overall, Apache Doris 4.1 delivers a comprehensive, AI‑ready data foundation that unifies storage, vector search, full‑text, and analytical capabilities while offering strong performance, scalability, and ease of use for modern data‑intensive applications.
