High‑Performance Inverted Index in Apache Doris for Log Data Storage and Analysis
This article explains how Apache Doris implements a high‑performance, column‑oriented inverted index to address the challenges of massive, real‑time log data storage and analysis, delivering dramatically higher write throughput, lower storage costs, and faster query performance than traditional Elasticsearch and Loki solutions.
Log data is a major component of enterprise big data, requiring high‑throughput real‑time writes, low‑cost massive storage, and fast text search. Traditional architectures like Elasticsearch and Loki cannot simultaneously meet these demands.
Apache Doris borrows techniques from information retrieval and implements a high-performance inverted index optimized for analytical processing (AP) workloads. The index accelerates full-text, equality, and range queries, and delivers more than a ten-fold cost-performance improvement over Elasticsearch.
Key challenges of log workloads include rapid data growth, petabyte‑scale storage, and minute‑level latency. Doris’s inverted index, built directly in the storage engine using C++ and vectorized execution, offers four times higher write throughput, 80% storage reduction, and up to 2.3× faster queries.
Implementation details: Doris adds a separate inverted index file for each indexed column and writes it synchronously with the segment data. A query on an indexed column looks up the matching DocIDs in the index, converts them into a RowID bitmap, and uses that bitmap to filter rows efficiently. The index builds on CLucene, columnar storage, and ZSTD compression, and uses BKD trees for range queries on numeric and date columns.
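The lookup path described above (terms to DocIDs, DocIDs to a row bitmap, bitmap to filtered rows) can be sketched in a few lines of Python. This is a toy illustration only, not Doris or CLucene code: the tokenizer, the use of a Python integer as a bitmap, and all function names are assumptions made for the example.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase and split on non-alphanumerics (stand-in for a real analyzer)."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def build_inverted_index(rows):
    """Map each term to a sorted posting list of row ids (DocIDs)."""
    postings = defaultdict(set)
    for row_id, text in enumerate(rows):
        for term in tokenize(text):
            postings[term].add(row_id)
    return {term: sorted(ids) for term, ids in postings.items()}

def to_bitmap(row_ids):
    """Pack row ids into an int used as a bitmap (bit i set means row i matches)."""
    bitmap = 0
    for rid in row_ids:
        bitmap |= 1 << rid
    return bitmap

def filter_rows(rows, bitmap):
    """Keep only the rows whose bit is set in the bitmap."""
    return [row for i, row in enumerate(rows) if (bitmap >> i) & 1]

# Hypothetical log lines used only for this example.
rows = [
    "GET /index.html 200",
    "POST /login 500 timeout",
    "GET /login 200",
    "GET /images/logo.png 404",
]
index = build_inverted_index(rows)
bitmap = to_bitmap(index.get("login", []))
matched = filter_rows(rows, bitmap)
print(matched)
```

The point of the bitmap step is that the rows are never scanned for the search term; only the rows flagged by the index are read.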
Performance tests on the Elasticsearch Rally HTTP logs benchmark and the ClickHouse Hacker News dataset show Doris achieving 4.2× the write speed of Elasticsearch with 57% lower query latency, and queries 4.7× to 18.5× faster than ClickHouse.
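The 57% latency reduction and the "up to 2.3× faster queries" figure quoted earlier appear to describe the same result in two forms. Assuming speedup is defined as the ratio of Elasticsearch latency to Doris latency, the conversion is:

```latex
\text{speedup} = \frac{t_{\text{ES}}}{t_{\text{Doris}}} = \frac{1}{1 - 0.57} \approx 2.3
```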
Example DDL to create an inverted index on a comment column:
CREATE TABLE hackernews_1m (
    `id` BIGINT,
    `comment` STRING,
    INDEX idx_comment(`comment`) USING INVERTED PROPERTIES("parser" = "english")
) DUPLICATE KEY(`id`) DISTRIBUTED BY HASH(`id`) BUCKETS 10
PROPERTIES ("replication_num" = "1");
Queries can use MATCH_ALL for fast full‑text search, delivering orders‑of‑magnitude speedups over LIKE scans.
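To make the difference between an indexed MATCH_ALL lookup and a LIKE scan concrete, here is a small Python sketch. It models MATCH_ALL as an intersection of posting sets (a row matches only if it contains every query term) and LIKE as a substring test on every row. The sample comments, the tokenizer, and the function names are hypothetical. The sketch illustrates the idea rather than Doris internals.

```python
import re

def tokenize(text):
    """Lowercase and split on non-alphanumerics (stand-in for the 'english' parser)."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def build_index(comments):
    """Map each term to the set of row ids whose comment contains it."""
    index = {}
    for row_id, text in enumerate(comments):
        for term in tokenize(text):
            index.setdefault(term, set()).add(row_id)
    return index

def match_all(index, query):
    """Rows containing every query term: intersect the posting sets."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

def like_scan(comments, needle):
    """Emulates LIKE '%needle%': a case-sensitive substring test on every row."""
    return {i for i, text in enumerate(comments) if needle in text}

# Hypothetical comments used only for this example.
comments = [
    "OLAP databases are fast",
    "Fast databases need good indexes",
    "I like OLAP",
    "olap and fast",
]
index = build_index(comments)
all_hits = match_all(index, "OLAP fast")   # touches two posting sets
scan_hits = like_scan(comments, "OLAP")    # touches every row
print(sorted(all_hits), sorted(scan_hits))
```

The indexed lookup does work proportional to the posting sets of the query terms, while the scan inspects every row. That gap is where the speedup on large tables comes from. The two also differ in semantics: the tokenized match ignores case and word order, whereas the substring scan in this sketch does not.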
The inverted index also supports other scalar types and arrays, with extensions planned for JSONB, Map, and GEO data. With this built‑in high‑performance inverted index, Doris is a cost‑effective solution for large‑scale log analysis.
DataFunTalk