Archer Engine: Integrating Inverted Index with Iceberg for Scalable Big Data Log Analytics
The article introduces Archer, a new big‑data warehouse engine built on Iceberg that adds an inverted‑index mechanism using Tantivy to provide full‑text and JSON search, storage‑compute separation, and substantial performance gains over both Elasticsearch and plain Iceberg.
1. Background
In the big‑data analytics field, Elasticsearch is often used for log storage and analysis, but its high memory and disk requirements, operational complexity, limited SQL support, and lack of storage‑compute separation become problematic as log volume grows.
To address these issues, the Archer engine was built on top of Iceberg, introducing an inverted‑index mechanism that gives the Qilin data‑warehouse full‑text search capability. Archer stores both forward data and inverted index files on HDFS/S3, uses a local cache, runs the compute engine in containers for elastic scaling, provides complete SQL support, and enables federated queries with Hive/MySQL.
2. Design
Archer stores data files in Parquet format and builds inverted indexes using Tantivy, a Rust library inspired by Lucene that reduces I/O requests and is suitable for HDFS/S3.
2.1 Index Types
Six index types are supported: text, json, json_text, ip, datetime, and raw, each handling different field formats and query capabilities.
2.2 Architecture
Archer follows a storage‑compute separation model. By implementing a custom TantivyDirectory, it can read/write index files directly on HDFS/S3, with a LocalCache to improve I/O performance. Query processing first retrieves matching doc‑ids from the inverted index, then uses Parquet page offsets to trim pages, reducing data reads. Each Parquet page defaults to 1 MB.
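The page‑trimming step can be sketched as follows. This is an illustrative model, not Archer's actual code: `PAGE_ROWS` is an assumed rows‑per‑page figure standing in for the real page boundaries recorded in Parquet metadata, and `pages_to_read` is a hypothetical helper name.

```python
# Hypothetical sketch of Archer's page trimming: given the doc ids
# returned by the inverted index, keep only the Parquet pages that
# contain at least one hit. PAGE_ROWS is an illustrative assumption;
# the real engine uses per-page row ranges from Parquet metadata.

PAGE_ROWS = 10_000  # assume each ~1 MB page holds this many rows

def pages_to_read(doc_ids, page_rows=PAGE_ROWS):
    """Map matching row ids to the sorted set of page indexes to fetch."""
    return sorted({doc_id // page_rows for doc_id in doc_ids})

hits = [3, 17, 10_001, 95_000]
print(pages_to_read(hits))  # pages 0, 1, and 9 are read; the rest are skipped
```

Only the pages containing hits are fetched from HDFS/S3, which is where the reduction in scanned data comes from.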
Illustrations of the framework and page‑cutting are shown in the original figures.
2.3 Query Integration
Archer integrates with Trino as the query engine. A hidden varchar column “$inverted_index_query” enables predicate push‑down to the inverted index without modifying Trino core. OR semantics work because the hidden column evaluates to the query string for rows that match the index and to NULL otherwise, so an equality predicate against it composes correctly with other conditions under standard SQL NULL handling.
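The behavior of the hidden column can be modeled in a few lines. This is a sketch of the semantics described above, not Archer's implementation; the row data and query string are invented for illustration.

```python
# Illustrative model of the hidden "$inverted_index_query" column:
# it evaluates to the query string if the row matches the inverted
# index, and to NULL (None) otherwise. A predicate such as
#   "$inverted_index_query" = 'error AND timeout'
# therefore selects exactly the matching rows.

def inverted_index_column(row_matches: bool, query: str):
    """Value of the hidden column for one row."""
    return query if row_matches else None

query = "error AND timeout"
rows = [("r1", True), ("r2", False), ("r3", True)]  # (row id, index match?)
selected = [rid for rid, m in rows if inverted_index_column(m, query) == query]
print(selected)  # ['r1', 'r3']
```

Because non‑matching rows yield NULL, the predicate is false (not an error) for them, which is what lets the pushed‑down condition be OR‑combined with ordinary column predicates.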
Example DDL and SQL statements are provided in the source images.
2.4 Optimizations
2.4.1 File Trimming
Tantivy creates several auxiliary files that are unnecessary for our use case; Archer’s custom directory ignores write requests for these files and returns static data on read, reducing small‑file overhead on HDFS/S3.
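A minimal sketch of such a trimming directory is below. The class and the specific file names are assumptions for illustration (Tantivy does maintain lock/metadata side files, but the exact set Archer suppresses is not given in the source).

```python
# Sketch (not Archer's actual code) of a directory wrapper that drops
# writes to auxiliary files and serves static bytes on read, so these
# files never land as small objects on HDFS/S3. File names are examples.

IGNORED_FILES = {".tantivy-meta.lock", ".tantivy-writer.lock"}  # assumed set
STATIC_CONTENT = {name: b"" for name in IGNORED_FILES}

class TrimmingDirectory:
    def __init__(self):
        self.remote_writes = {}  # stand-in for files actually sent to HDFS/S3

    def write(self, name: str, data: bytes):
        if name in IGNORED_FILES:
            return  # silently dropped: no small file created remotely
        self.remote_writes[name] = data

    def read(self, name: str) -> bytes:
        if name in STATIC_CONTENT:
            return STATIC_CONTENT[name]  # static answer, no remote I/O
        return self.remote_writes[name]
```

The index library sees a normal directory, while the remote store only ever receives the files that matter.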
2.4.2 Memory Management
To avoid excessive memory consumption when building inverted indexes, Archer limits the memory per segment, spilling to external storage when the limit is reached and later merging segments to improve query performance.
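The spill‑and‑merge pattern can be sketched as follows; the budget value and class are illustrative, and a plain list stands in for external storage.

```python
# Minimal sketch of bounded-memory index building: buffer documents in
# an in-memory segment, spill the segment when the budget is exceeded,
# and merge spilled segments afterwards. Budget is a made-up number.

SEGMENT_BUDGET = 64  # bytes; tiny on purpose for illustration

class SegmentBuilder:
    def __init__(self, budget: int = SEGMENT_BUDGET):
        self.budget = budget
        self.buffer, self.buffered_bytes = [], 0
        self.spilled_segments = []  # stand-in for segments on external storage

    def add(self, doc: bytes):
        self.buffer.append(doc)
        self.buffered_bytes += len(doc)
        if self.buffered_bytes >= self.budget:
            self.spill()

    def spill(self):
        if self.buffer:
            self.spilled_segments.append(self.buffer)
            self.buffer, self.buffered_bytes = [], 0

    def merge(self):
        """Combine small spilled segments into one larger segment."""
        self.spill()
        merged = [d for seg in self.spilled_segments for d in seg]
        self.spilled_segments = [merged]  # fewer, larger segments => faster queries
        return merged
```

Capping per‑segment memory keeps the build from ballooning, and the later merge pass restores query efficiency by reducing the number of segments each search must consult.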
2.4.3 Pre‑loading
Archer adds a preload‑plus‑cache layer that reads larger contiguous blocks of index data, cutting remote I/O requests by 100‑1000×.
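The preload idea can be sketched as a block‑aligned read cache; the 4 MB block size and the class are assumptions for illustration.

```python
# Sketch of preload-plus-cache: small point reads against the index are
# served from large block-aligned remote fetches, so many small requests
# collapse into a few big ones. Block size is an assumed value.

BLOCK = 4 * 1024 * 1024  # preload 4 MB at a time (illustrative)

class PreloadCache:
    def __init__(self, remote_read, block: int = BLOCK):
        self.remote_read = remote_read  # fn(offset, length) -> bytes
        self.block = block
        self.cache = {}
        self.remote_requests = 0  # counts actual HDFS/S3 round trips

    def read(self, offset: int, length: int) -> bytes:
        out, pos = b"", offset
        while pos < offset + length:
            blk = pos // self.block
            if blk not in self.cache:
                self.cache[blk] = self.remote_read(blk * self.block, self.block)
                self.remote_requests += 1
            start = pos - blk * self.block
            take = min(self.block - start, offset + length - pos)
            out += self.cache[blk][start:start + take]
            pos += take
        return out
```

Hundreds of small index reads that land in the same preloaded block cost a single remote request, which is the source of the 100‑1000× reduction the article cites.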
2.4.4 IO Merging
After page trimming, small, close‑by page requests are merged to balance read amplification and latency.
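This is the classic interval‑coalescing trade‑off, sketched below; the gap threshold is a made‑up tuning value, not one given in the source.

```python
# Sketch of request merging after page trimming: page reads whose gaps
# fall under a threshold are coalesced into one larger read, trading a
# little read amplification for fewer round trips. Threshold is assumed.

MERGE_GAP = 256 * 1024  # merge reads separated by <= 256 KB (illustrative)

def merge_requests(requests, gap: int = MERGE_GAP):
    """requests: list of (offset, length) pairs; returns a merged list."""
    merged = []
    for off, ln in sorted(requests):
        if merged and off - (merged[-1][0] + merged[-1][1]) <= gap:
            last_off, last_ln = merged[-1]
            merged[-1] = (last_off, max(last_ln, off + ln - last_off))
        else:
            merged.append((off, ln))
    return merged
```

A larger gap threshold means fewer requests but more wasted bytes read; a smaller one means the opposite, which is exactly the amplification‑versus‑latency balance described above.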
2.4.5 Dynamic Split Allocation
When an inverted‑index query is detected, Archer disables Trino’s default 128 MB split‑generation and delivers the whole file as a single split, reducing redundant loading.
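The split decision can be sketched like this; the function name is hypothetical, but the 128 MB default split target comes from the text.

```python
# Sketch of split planning: a plain scan cuts the file into fixed-size
# splits (default target 128 MB), while an inverted-index query turns
# the whole file into one split so the index is loaded only once.

SPLIT_SIZE = 128 * 1024 * 1024

def plan_splits(file_size: int, has_index_query: bool,
                split_size: int = SPLIT_SIZE):
    if has_index_query:
        return [(0, file_size)]  # single split: one index load per file
    return [(off, min(split_size, file_size - off))
            for off in range(0, file_size, split_size)]
```

With many splits, each worker would otherwise open and load the same index file, so collapsing to one split per file removes that redundant loading.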
2.4.6 Serialization
Archer uses Arrow as the in‑memory data format to exchange data between Rust‑based Tantivy and Java‑based Trino via JNI, avoiding repeated serialization/deserialization.
3. Test Results
Using the amazon_reviews dataset, Archer’s full‑text search reduced query time from 15.64 s (Iceberg) to 1.52 s and scanned data from 7.02 GB to 122 KB. For JSON search, Archer completed the query in 1.36 s scanning 14.6 MB, compared with Iceberg’s 14.37 s and 454 MB.
360 Smart Cloud
Official service account of 360 Smart Cloud, dedicated to building a high-quality, secure, highly available, convenient, and stable one‑stop cloud service platform.