High‑Performance Inverted Index in Apache Doris for Log Data Storage and Analysis

This article explains how Apache Doris implements a high‑performance, column‑oriented inverted index to address the challenges of massive, real‑time log data storage and analysis, delivering dramatically higher write throughput, lower storage costs, and faster query performance than traditional Elasticsearch and Loki solutions.

DataFunTalk

May 9, 2023

Log data makes up a large share of enterprise big data and places three demands on a storage system: high-throughput real-time writes, low-cost storage at massive scale, and fast text search. Traditional architectures such as Elasticsearch and Loki cannot meet all three at once.

Apache Doris borrows techniques from information retrieval and implements a high-performance inverted index optimized for analytical processing (AP) scenarios. The index supports efficient full-text, equality, and range queries, with a reported cost-performance improvement of more than ten-fold over Elasticsearch.

Key challenges of log workloads include rapid data growth, petabyte-scale storage, and minute-level latency. Doris's inverted index is built directly into the storage engine in C++ and uses vectorized execution. Compared with Elasticsearch, it offers roughly four times the write throughput, an 80% reduction in storage, and up to 2.3× faster queries.

Implementation details: Doris stores the inverted index of each indexed column in a separate index file, written synchronously with the segment data. A query on an indexed column looks up the matching DocIDs in the index, converts them to a RowID bitmap, and uses that bitmap to filter rows efficiently. The index builds on CLucene, columnar storage, and ZSTD compression, and uses BKD structures for numeric and date ranges.
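
The lookup path described above (term to DocIDs, DocIDs to a row bitmap, bitmap to filtered rows) can be sketched in a few lines of Python. This is an illustration only, not Doris's C++/CLucene code: the function names (`tokenize`, `build_inverted_index`, `to_bitmap`, `filter_rows`) are hypothetical, the tokenizer is a simplified stand-in for a real English parser, and a plain Python integer stands in for the compressed bitmap a real engine would use.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Simplified tokenizer (assumption): lowercase, split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(column):
    """Map each term to the sorted list of DocIDs (row positions) that contain it."""
    postings = defaultdict(set)
    for doc_id, value in enumerate(column):
        for term in tokenize(value):
            postings[term].add(doc_id)
    return {term: sorted(doc_ids) for term, doc_ids in postings.items()}


def to_bitmap(doc_ids):
    """Pack DocIDs into an integer bitmap: bit i is set when row i matches."""
    bitmap = 0
    for doc_id in doc_ids:
        bitmap |= 1 << doc_id
    return bitmap


def filter_rows(rows, bitmap):
    """Keep only the rows whose bit is set in the bitmap."""
    return [row for i, row in enumerate(rows) if (bitmap >> i) & 1]


# A tiny two-column table; the index is built on the comment column only.
ids = [101, 102, 103]
comments = [
    "Doris supports OLAP workloads",
    "Elasticsearch is a search engine",
    "OLAP and OLTP are different",
]
rows = list(zip(ids, comments))

index = build_inverted_index(comments)
bitmap = to_bitmap(index.get("olap", []))
matched = filter_rows(rows, bitmap)
print(matched)
```

The point of the bitmap step is that the filter is computed once from the index and then applied to every other column of the same rows, so the text column itself never has to be scanned.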

Performance tests on the Elasticsearch Rally (esrally) HTTP logs dataset and the ClickHouse Hacker News dataset show Doris reaching 4.2× the write speed of Elasticsearch with 57% lower query latency, and running queries 4.7× to 18.5× faster than ClickHouse.
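
A latency reduction and a speedup factor are two views of the same measurement. Assuming the 57% latency reduction and the "up to 2.3× faster" figure quoted earlier both refer to the same Elasticsearch comparison (an assumption, since the article does not say so explicitly), the conversion is a one-line calculation. The helper name below is hypothetical.

```python
def speedup_from_latency_reduction(reduction):
    """Convert a fractional latency reduction (0 <= r < 1) into a speedup factor."""
    if not 0 <= reduction < 1:
        raise ValueError("reduction must be in the range [0, 1)")
    # New latency is (1 - reduction) of the old one, so speedup is its inverse.
    return 1.0 / (1.0 - reduction)


speedup = speedup_from_latency_reduction(0.57)
print(f"{speedup:.2f}x")  # prints "2.33x"
```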

Example DDL to create an inverted index on a comment column:

CREATE TABLE hackernews_1m (
  `id` BIGINT,
  `comment` STRING,
  INDEX idx_comment(`comment`) USING INVERTED PROPERTIES("parser" = "english")
) DUPLICATE KEY(`id`) DISTRIBUTED BY HASH(`id`) BUCKETS 10
PROPERTIES ("replication_num" = "1");

Queries can use MATCH_ALL for fast full‑text search, delivering orders‑of‑magnitude speedups over LIKE scans.
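
In Doris SQL such a predicate takes the form `comment MATCH_ALL 'OLAP OLTP'`, which keeps rows whose comment contains every listed term. The sketch below illustrates why that can be much cheaper than a LIKE scan: the indexed path intersects a few posting sets, while the scan has to inspect every row. It is a conceptual model, not Doris code. The function names are hypothetical, the tokenizer is a simplified assumption (lowercase, split on non-alphanumerics), and `like_scan` models a case-sensitive `LIKE '%a%' AND LIKE '%b%'` filter.

```python
import re


def tokenize(text):
    """Simplified tokenizer (assumption): lowercase, split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(column):
    """Map each term to the set of row IDs whose text contains it."""
    index = {}
    for row_id, value in enumerate(column):
        for term in set(tokenize(value)):
            index.setdefault(term, set()).add(row_id)
    return index


def match_all(index, query):
    """Rows containing every query term: intersect posting sets, smallest first."""
    terms = tokenize(query)
    if not terms:
        return []
    postings = sorted((index.get(term, set()) for term in terms), key=len)
    result = set(postings[0])
    for posting in postings[1:]:
        result &= posting
        if not result:
            break  # no row can match once the intersection is empty
    return sorted(result)


def like_scan(column, needles):
    """Substring filter that must look at every row, like chained LIKE predicates."""
    return [i for i, value in enumerate(column) if all(n in value for n in needles)]


comments = [
    "OLAP and OLTP serve different workloads",
    "OLAP engines scan columns",
    "Nothing relevant here",
    "oltp systems favor row stores; olap systems do not",
]

index = build_index(comments)
print(match_all(index, "OLAP OLTP"))
print(like_scan(comments, ["OLAP", "OLTP"]))
```

In this toy data the token match returns rows 0 and 3, while the case-sensitive substring scan returns only row 0. That difference comes from the lowercasing tokenizer assumed here, so the two predicates are not drop-in replacements for each other.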

The inverted index also supports other scalar types and arrays, with JSONB, Map, and GEO data planned as future extensions. With this high-performance index built in, Doris is a cost-effective solution for large-scale log analysis.
