Archer Engine: Integrating Inverted Index with Iceberg for Scalable Big Data Log Analytics

The article introduces Archer, a new big‑data warehouse engine built on Iceberg that adds an inverted‑index mechanism using Tantivy to provide full‑text and JSON search, storage‑compute separation, and significant performance gains over traditional Elasticsearch and Iceberg connectors.

360 Smart Cloud

1. Background

In big-data analytics, Elasticsearch is often used for log storage and analysis. As log volume grows, however, its high memory and disk requirements, operational complexity, limited SQL support, and lack of storage-compute separation become problematic.

To address these issues, the Archer engine was built on top of Iceberg, introducing an inverted‑index mechanism that gives the Qilin data‑warehouse full‑text search capability. Archer stores both forward data and inverted index files on HDFS/S3, uses a local cache, runs the compute engine in containers for elastic scaling, provides complete SQL support, and enables federated queries with Hive/MySQL.

2. Design

Archer stores data files in Parquet format and builds inverted indexes using Tantivy, a Rust full-text search library inspired by Lucene. Tantivy issues relatively few I/O requests, which makes it a good fit for remote storage such as HDFS/S3.

2.1 Index Types

Six index types are supported: text, json, json_text, ip, datetime, and raw, each handling different field formats and query capabilities.

2.2 Architecture

Archer follows a storage-compute separation model. By implementing a custom TantivyDirectory, it can read and write index files directly on HDFS/S3, with a LocalCache to improve I/O performance. Query processing first retrieves the matching doc-ids from the inverted index, then uses Parquet page offsets to skip pages that contain no matching rows, reducing the amount of data read. Each Parquet page defaults to 1 MB.
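The page-trimming step can be illustrated with a minimal sketch. It assumes that a doc-id equals the row position inside the Parquet file and that the row range and byte offset of each page are already known; the Page class and pages_to_read function are hypothetical names, not Archer's API.

```python
from bisect import bisect_left
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    first_row: int  # position of the first row stored in this page
    num_rows: int   # number of rows in the page
    offset: int     # byte offset of the page in the file
    size: int       # page size in bytes


def pages_to_read(pages, doc_ids):
    """Keep only pages that hold at least one matching row.

    doc_ids must be sorted ascending (assumption: doc-id == row position).
    """
    selected = []
    for page in pages:
        # First matching doc-id that is not before the start of this page.
        i = bisect_left(doc_ids, page.first_row)
        if i < len(doc_ids) and doc_ids[i] < page.first_row + page.num_rows:
            selected.append(page)
    return selected


pages = [
    Page(first_row=0, num_rows=100, offset=0, size=1000),
    Page(first_row=100, num_rows=100, offset=1000, size=1000),
    Page(first_row=200, num_rows=100, offset=2000, size=1000),
]
kept = pages_to_read(pages, doc_ids=[5, 250])
bytes_to_read = sum(p.size for p in kept)
```

In this toy layout only the first and third pages are read, so the middle page is never fetched from remote storage.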

Illustrations of the framework and page‑cutting are shown in the original figures.

2.3 Query Integration

Archer integrates with Trino as the query engine. A hidden varchar column “$inverted_index_query” enables predicate push‑down to the inverted index without modifying Trino core. OR logic is supported by returning the query string for matching rows and NULL otherwise.
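The reason the NULL convention keeps OR predicates correct can be shown with a small simulation. This is only a sketch of the semantics described above, not connector code: the query string, the status column, and all function names are hypothetical.

```python
def hidden_column_value(row_id, matching_ids, query):
    """Value exposed in the hidden column for one row:
    the query string if the index matched the row, otherwise NULL (None)."""
    return query if row_id in matching_ids else None


def sql_equals(left, right):
    """SQL-style equality: any comparison with NULL is unknown (None)."""
    if left is None or right is None:
        return None
    return left == right


def where_clause(row, matching_ids, query):
    """Models: WHERE "$inverted_index_query" = <query> OR status = 500.

    The engine evaluates the predicate as ordinary column equality,
    so no change to the engine itself is needed.
    """
    hidden = hidden_column_value(row["id"], matching_ids, query)
    index_hit = sql_equals(hidden, query)
    other_hit = sql_equals(row["status"], 500)
    # A WHERE clause keeps a row only when the predicate is true.
    return index_hit is True or other_hit is True


rows = [
    {"id": 0, "status": 200},  # matched by the inverted index
    {"id": 1, "status": 500},  # matched only by the ordinary predicate
    {"id": 2, "status": 200},  # matched by neither
]
query = "message:timeout"  # hypothetical query syntax
result_ids = [r["id"] for r in rows if where_clause(r, {0}, query)]
```

Rows that the index did not match still reach the engine, so the other branch of the OR can accept them.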

Example DDL and SQL statements are provided in the source images.

2.4 Optimizations

2.4.1 File Trimming

Tantivy creates several auxiliary files that are unnecessary for our use case; Archer’s custom directory ignores write requests for these files and returns static data on read, reducing small‑file overhead on HDFS/S3.
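A minimal sketch of this idea, assuming a dict stands in for HDFS/S3 and using made-up file names (the source does not list which auxiliary files Archer skips); TrimmingDirectory is a hypothetical class, not Tantivy's Directory trait.

```python
class TrimmingDirectory:
    """Directory wrapper that never persists selected auxiliary files."""

    def __init__(self, backend, skipped):
        self._backend = backend        # dict-like store: path -> bytes
        self._skipped = dict(skipped)  # path -> static bytes served on read

    def write(self, path, data):
        if path in self._skipped:
            return  # drop the write: no small file is created remotely
        self._backend[path] = data

    def read(self, path):
        if path in self._skipped:
            return self._skipped[path]  # canned content, no remote request
        return self._backend[path]


remote = {}
directory = TrimmingDirectory(
    remote,
    skipped={"example.lock": b"", "example.aux": b"{}"},  # hypothetical names
)
directory.write("segment.idx", b"postings")
directory.write("example.lock", b"locked")
```

After these calls only segment.idx exists in the backing store, while reads of the skipped files still succeed with static content.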

2.4.2 Memory Management

To avoid excessive memory consumption when building inverted indexes, Archer limits the memory per segment, spilling to external storage when the limit is reached and later merging segments to improve query performance.
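The spill-then-merge pattern might look like the following sketch. The per-posting size estimate and the in-memory list standing in for external storage are simplifying assumptions, and SegmentedIndexBuilder is a hypothetical name.

```python
class SegmentedIndexBuilder:
    """Builds a toy inverted index under a per-segment memory budget."""

    def __init__(self, memory_limit_bytes):
        self.memory_limit = memory_limit_bytes
        self.current = {}        # term -> list of doc ids (in-memory segment)
        self.current_bytes = 0   # rough size estimate of the current segment
        self.spilled = []        # flushed segments (stand-in for external storage)

    def add(self, doc_id, terms):
        for term in dict.fromkeys(terms):  # de-duplicate, keep order
            self.current.setdefault(term, []).append(doc_id)
            # Assumption: term bytes plus 8 bytes per posting.
            self.current_bytes += len(term.encode("utf-8")) + 8
        if self.current_bytes >= self.memory_limit:
            self.flush()

    def flush(self):
        """Spill the in-memory segment once the budget is reached."""
        if self.current:
            self.spilled.append(self.current)
            self.current = {}
            self.current_bytes = 0

    def merge(self):
        """Combine all spilled segments into one, so a query touches one segment."""
        self.flush()
        merged = {}
        for segment in self.spilled:
            for term, ids in segment.items():
                merged.setdefault(term, []).extend(ids)
        for ids in merged.values():
            ids.sort()
        self.spilled = [merged]
        return merged


builder = SegmentedIndexBuilder(memory_limit_bytes=20)
builder.add(0, ["error", "disk"])   # exceeds the budget and triggers a spill
segments_after_first_doc = len(builder.spilled)
builder.add(1, ["error"])
merged_index = builder.merge()
```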

2.4.3 Pre‑loading

Archer adds a preload-plus-cache layer that reads larger contiguous blocks of index data, cutting the number of remote I/O requests by a factor of 100 to 1000.
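One way to picture the effect is a block-aligned read cache: many small index reads are served from a few large remote fetches. The sketch below assumes a fixed block size and an unbounded cache; BlockCache and the fetch callback are hypothetical and do not reflect Archer's actual preload policy.

```python
class BlockCache:
    """Serves small reads from large, block-aligned remote fetches."""

    def __init__(self, fetch, block_size):
        self._fetch = fetch            # fetch(offset, length) -> bytes, one remote request
        self._block_size = block_size
        self._blocks = {}              # block number -> cached bytes
        self.remote_requests = 0

    def read(self, offset, length):
        if length <= 0:
            return b""
        first = offset // self._block_size
        last = (offset + length - 1) // self._block_size
        chunks = []
        for block in range(first, last + 1):
            if block not in self._blocks:
                start = block * self._block_size
                self._blocks[block] = self._fetch(start, self._block_size)
                self.remote_requests += 1
            chunks.append(self._blocks[block])
        data = b"".join(chunks)
        begin = offset - first * self._block_size
        return data[begin:begin + length]


blob = bytes(range(256)) * 16  # 4096 bytes standing in for a remote index file
cache = BlockCache(lambda off, n: blob[off:off + n], block_size=1024)
small_reads = [cache.read(i * 8, 8) for i in range(100)]
requests_after_small_reads = cache.remote_requests
```

Here 100 eight-byte reads fall inside one 1024-byte block, so they cost a single remote request instead of 100.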

2.4.4 IO Merging

After page trimming, small page requests that lie close together are merged into larger reads, accepting some read amplification in exchange for fewer remote round trips.
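Range coalescing of this kind can be sketched as follows. The max_gap threshold is an assumed tuning knob; the source does not say how Archer decides when two requests are close enough to merge.

```python
def merge_ranges(ranges, max_gap):
    """Merge (offset, length) byte ranges whose gap is at most max_gap bytes.

    A larger max_gap means fewer requests but more unneeded bytes read.
    """
    merged = []  # list of [start, end) pairs
    for offset, length in sorted(ranges):
        end = offset + length
        if merged and offset - merged[-1][1] <= max_gap:
            # Close enough: extend the previous request to cover this range.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([offset, end])
    return [(start, end - start) for start, end in merged]


requests = merge_ranges([(0, 100), (150, 50), (10_000, 100)], max_gap=64)
```

The first two ranges are 50 bytes apart and become one 200-byte read, while the distant third range stays a separate request.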

2.4.5 Dynamic Split Allocation

When an inverted‑index query is detected, Archer disables Trino’s default 128 MB split‑generation and delivers the whole file as a single split, reducing redundant loading.
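The split-planning decision can be sketched as below, assuming splits are simple (offset, length) byte ranges; plan_splits and its parameters are hypothetical names, not Trino's or Archer's API.

```python
def plan_splits(file_size, uses_inverted_index, target_split_size=128 * 1024 * 1024):
    """Return (offset, length) splits for one data file.

    With an inverted-index query the whole file is one split, so the index
    for that file is loaded once rather than once per split.
    """
    if uses_inverted_index or file_size <= target_split_size:
        return [(0, file_size)]
    splits = []
    offset = 0
    while offset < file_size:
        length = min(target_split_size, file_size - offset)
        splits.append((offset, length))
        offset += length
    return splits


MB = 1024 * 1024
default_splits = plan_splits(300 * MB, uses_inverted_index=False)
index_splits = plan_splits(300 * MB, uses_inverted_index=True)
```

For a 300 MB file the default path yields three splits, whereas the inverted-index path yields one.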

2.4.6 Serialization

Archer uses Arrow as the in‑memory data format to exchange data between Rust‑based Tantivy and Java‑based Trino via JNI, avoiding repeated serialization/deserialization.

3. Test Results

Using the amazon_reviews dataset, Archer’s full‑text search reduced query time from 15.64 s (Iceberg) to 1.52 s and scanned data from 7.02 GB to 122 KB. For JSON search, Archer completed the query in 1.36 s scanning 14.6 MB, compared with Iceberg’s 14.37 s and 454 MB.
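The relative gains implied by these figures can be computed directly; the snippet below only divides the numbers quoted above and adds no new measurements.

```python
# Query-time ratios (Iceberg time / Archer time), from the figures above.
full_text_speedup = 15.64 / 1.52
json_speedup = 14.37 / 1.36

# Scanned-data ratio for the JSON query (both values in MB).
json_scan_reduction = 454 / 14.6

summary = {
    "full_text_speedup": round(full_text_speedup, 1),
    "json_speedup": round(json_speedup, 1),
    "json_scan_reduction": round(json_scan_reduction, 1),
}
```

Both queries run roughly ten times faster, and the JSON query scans about one thirtieth of the data.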
