Archer Engine: Integrating Inverted Index with Iceberg for Scalable Big Data Log Analytics
The article introduces Archer, a new big‑data warehouse engine built on Iceberg that adds an inverted‑index mechanism using Tantivy to provide full‑text and JSON search, storage‑compute separation, and substantial performance gains over both Elasticsearch and plain Iceberg.
1. Background
In the big‑data analytics field, Elasticsearch is often used for log storage and analysis, but its high memory and disk requirements, operational complexity, limited SQL support, and lack of storage‑compute separation become problematic as log volume grows.
To address these issues, the Archer engine was built on top of Iceberg, introducing an inverted‑index mechanism that gives the Qilin data‑warehouse full‑text search capability. Archer stores both forward data and inverted index files on HDFS/S3, uses a local cache, runs the compute engine in containers for elastic scaling, provides complete SQL support, and enables federated queries with Hive/MySQL.
2. Design
Archer stores data files in Parquet format and builds inverted indexes using Tantivy, a Rust library inspired by Lucene that reduces I/O requests and is suitable for HDFS/S3.
2.1 Index Types
Six index types are supported: text, json, json_text, ip, datetime, and raw, each handling different field formats and query capabilities.
2.2 Architecture
Archer follows a storage‑compute separation model. By implementing a custom TantivyDirectory, it can read/write index files directly on HDFS/S3, with a LocalCache to improve I/O performance. Query processing first retrieves matching doc‑ids from the inverted index, then uses Parquet page offsets to trim pages, reducing data reads. Each Parquet page defaults to 1 MB.
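The page‑trimming step can be sketched as follows. This is an illustrative model, not Archer's actual code: `PAGE_ROWS` is an assumed rows‑per‑page figure standing in for the real page boundaries recorded in Parquet metadata, and `pages_to_read` is a hypothetical helper name.

```python
# Hypothetical sketch of Archer's page trimming: given the doc ids
# returned by the inverted index, keep only the Parquet pages that
# contain at least one hit. PAGE_ROWS is an illustrative assumption;
# the real engine uses per-page row ranges from Parquet metadata.

PAGE_ROWS = 10_000  # assume each ~1 MB page holds this many rows

def pages_to_read(doc_ids, page_rows=PAGE_ROWS):
    """Map matching row ids to the sorted set of page indexes to fetch."""
    return sorted({doc_id // page_rows for doc_id in doc_ids})

hits = [3, 17, 10_001, 95_000]
print(pages_to_read(hits))  # pages 0, 1, and 9 are read; the rest are skipped
```

Only the pages containing hits are fetched from HDFS/S3, which is where the reduction in scanned data comes from.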
Illustrations of the framework and page‑cutting are shown in the original figures.
2.3 Query Integration
Archer integrates with Trino as the query engine. A hidden varchar column “$inverted_index_query” enables predicate push‑down to the inverted index without modifying Trino core. OR semantics work because the hidden column evaluates to the query string for rows that match the index and to NULL otherwise, so an equality predicate against it composes correctly with other conditions under standard SQL NULL handling.
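The behavior of the hidden column can be modeled in a few lines. This is a sketch of the semantics described above, not Archer's implementation; the row data and query string are invented for illustration.

```python
# Illustrative model of the hidden "$inverted_index_query" column:
# it evaluates to the query string if the row matches the inverted
# index, and to NULL (None) otherwise. A predicate such as
#   "$inverted_index_query" = 'error AND timeout'
# therefore selects exactly the matching rows.

def inverted_index_column(row_matches: bool, query: str):
    """Value of the hidden column for one row."""
    return query if row_matches else None

query = "error AND timeout"
rows = [("r1", True), ("r2", False), ("r3", True)]  # (row id, index match?)
selected = [rid for rid, m in rows if inverted_index_column(m, query) == query]
print(selected)  # ['r1', 'r3']
```

Because non‑matching rows yield NULL, the predicate is false (not an error) for them, which is what lets the pushed‑down condition be OR‑combined with ordinary column predicates.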
Example DDL and SQL statements are provided in the source images.
2.4 Optimizations
2.4.1 File Trimming
Tantivy creates several auxiliary files that are unnecessary for our use case; Archer’s custom directory ignores write requests for these files and returns static data on read, reducing small‑file overhead on HDFS/S3.
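A minimal sketch of such a trimming directory is below. The class and the specific file names are assumptions for illustration (Tantivy does maintain lock/metadata side files, but the exact set Archer suppresses is not given in the source).

```python
# Sketch (not Archer's actual code) of a directory wrapper that drops
# writes to auxiliary files and serves static bytes on read, so these
# files never land as small objects on HDFS/S3. File names are examples.

IGNORED_FILES = {".tantivy-meta.lock", ".tantivy-writer.lock"}  # assumed set
STATIC_CONTENT = {name: b"" for name in IGNORED_FILES}

class TrimmingDirectory:
    def __init__(self):
        self.remote_writes = {}  # stand-in for files actually sent to HDFS/S3

    def write(self, name: str, data: bytes):
        if name in IGNORED_FILES:
            return  # silently dropped: no small file created remotely
        self.remote_writes[name] = data

    def read(self, name: str) -> bytes:
        if name in STATIC_CONTENT:
            return STATIC_CONTENT[name]  # static answer, no remote I/O
        return self.remote_writes[name]
```

The index library sees a normal directory, while the remote store only ever receives the files that matter.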
2.4.2 Memory Management
To avoid excessive memory consumption when building inverted indexes, Archer limits the memory per segment, spilling to external storage when the limit is reached and later merging segments to improve query performance.
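The spill‑and‑merge pattern can be sketched as follows; the budget value and class are illustrative, and a plain list stands in for external storage.

```python
# Minimal sketch of bounded-memory index building: buffer documents in
# an in-memory segment, spill the segment when the budget is exceeded,
# and merge spilled segments afterwards. Budget is a made-up number.

SEGMENT_BUDGET = 64  # bytes; tiny on purpose for illustration

class SegmentBuilder:
    def __init__(self, budget: int = SEGMENT_BUDGET):
        self.budget = budget
        self.buffer, self.buffered_bytes = [], 0
        self.spilled_segments = []  # stand-in for segments on external storage

    def add(self, doc: bytes):
        self.buffer.append(doc)
        self.buffered_bytes += len(doc)
        if self.buffered_bytes >= self.budget:
            self.spill()

    def spill(self):
        if self.buffer:
            self.spilled_segments.append(self.buffer)
            self.buffer, self.buffered_bytes = [], 0

    def merge(self):
        """Combine small spilled segments into one larger segment."""
        self.spill()
        merged = [d for seg in self.spilled_segments for d in seg]
        self.spilled_segments = [merged]  # fewer, larger segments => faster queries
        return merged
```

Capping per‑segment memory keeps the build from ballooning, and the later merge pass restores query efficiency by reducing the number of segments each search must consult.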
2.4.3 Pre‑loading
Archer adds a preload‑plus‑cache layer that reads larger contiguous blocks of index data, cutting remote I/O requests by 100‑1000×.
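The preload idea can be sketched as a block‑aligned read cache; the 4 MB block size and the class are assumptions for illustration.

```python
# Sketch of preload-plus-cache: small point reads against the index are
# served from large block-aligned remote fetches, so many small requests
# collapse into a few big ones. Block size is an assumed value.

BLOCK = 4 * 1024 * 1024  # preload 4 MB at a time (illustrative)

class PreloadCache:
    def __init__(self, remote_read, block: int = BLOCK):
        self.remote_read = remote_read  # fn(offset, length) -> bytes
        self.block = block
        self.cache = {}
        self.remote_requests = 0  # counts actual HDFS/S3 round trips

    def read(self, offset: int, length: int) -> bytes:
        out, pos = b"", offset
        while pos < offset + length:
            blk = pos // self.block
            if blk not in self.cache:
                self.cache[blk] = self.remote_read(blk * self.block, self.block)
                self.remote_requests += 1
            start = pos - blk * self.block
            take = min(self.block - start, offset + length - pos)
            out += self.cache[blk][start:start + take]
            pos += take
        return out
```

Hundreds of small index reads that land in the same preloaded block cost a single remote request, which is the source of the 100‑1000× reduction the article cites.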
2.4.4 IO Merging
After page trimming, small, close‑by page requests are merged to balance read amplification and latency.
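This is the classic interval‑coalescing trade‑off, sketched below; the gap threshold is a made‑up tuning value, not one given in the source.

```python
# Sketch of request merging after page trimming: page reads whose gaps
# fall under a threshold are coalesced into one larger read, trading a
# little read amplification for fewer round trips. Threshold is assumed.

MERGE_GAP = 256 * 1024  # merge reads separated by <= 256 KB (illustrative)

def merge_requests(requests, gap: int = MERGE_GAP):
    """requests: list of (offset, length) pairs; returns a merged list."""
    merged = []
    for off, ln in sorted(requests):
        if merged and off - (merged[-1][0] + merged[-1][1]) <= gap:
            last_off, last_ln = merged[-1]
            merged[-1] = (last_off, max(last_ln, off + ln - last_off))
        else:
            merged.append((off, ln))
    return merged
```

A larger gap threshold means fewer requests but more wasted bytes read; a smaller one means the opposite, which is exactly the amplification‑versus‑latency balance described above.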
2.4.5 Dynamic Split Allocation
When an inverted‑index query is detected, Archer disables Trino’s default 128 MB split‑generation and delivers the whole file as a single split, reducing redundant loading.
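The split decision can be sketched like this; the function name is hypothetical, but the 128 MB default split target comes from the text.

```python
# Sketch of split planning: a plain scan cuts the file into fixed-size
# splits (default target 128 MB), while an inverted-index query turns
# the whole file into one split so the index is loaded only once.

SPLIT_SIZE = 128 * 1024 * 1024

def plan_splits(file_size: int, has_index_query: bool,
                split_size: int = SPLIT_SIZE):
    if has_index_query:
        return [(0, file_size)]  # single split: one index load per file
    return [(off, min(split_size, file_size - off))
            for off in range(0, file_size, split_size)]
```

With many splits, each worker would otherwise open and load the same index file, so collapsing to one split per file removes that redundant loading.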
2.4.6 Serialization
Archer uses Arrow as the in‑memory data format to exchange data between Rust‑based Tantivy and Java‑based Trino via JNI, avoiding repeated serialization/deserialization.
3. Test Results
Using the amazon_reviews dataset, Archer’s full‑text search reduced query time from 15.64 s (Iceberg) to 1.52 s and scanned data from 7.02 GB to 122 KB. For JSON search, Archer completed the query in 1.36 s scanning 14.6 MB, compared with Iceberg’s 14.37 s and 454 MB.
360 Smart Cloud
Official service account of 360 Smart Cloud, dedicated to building a high-quality, secure, highly available, convenient, and stable one‑stop cloud service platform.