
Elasticsearch Overview: Architecture, Lucene Foundations, Application Scenarios, and Optimizations

Elasticsearch, built on Apache Lucene, provides a distributed, near‑real‑time search platform that scales to billions of documents across thousands of nodes, supporting use cases such as log analytics, time‑series monitoring, and product search, while Tencent’s CES adds advanced availability, performance, and cost‑optimizing features.

Tencent Cloud Developer

Aug 27, 2020


With the rapid development of mobile Internet, IoT, and cloud computing, data volumes have exploded. Search engine technology is essential for extracting useful information from massive datasets.

Elasticsearch, the leading open‑source search engine, enables full‑text search without deep knowledge of underlying information retrieval principles. It can return results in seconds even when handling billions of documents, and addresses concerns such as disaster recovery, data safety, scalability, and maintainability.

Elasticsearch Introduction

Elasticsearch (ES) is built on Apache Lucene and provides a distributed, near‑real‑time indexing and search platform. It is highly reliable, easy to use, and has an active community. Typical features include RESTful APIs for indexing, querying, and cluster management, high scalability to hundreds of nodes, and support for PB‑level data.

1. Elasticsearch Architecture and Principles

Key concepts:

Cluster – a group of ES nodes across multiple machines.

Node – an ES process that can assume different roles.

Master node – manages cluster metadata (index creation, node join/leave, etc.).

Data node – stores index data.

Index – logical collection of documents, similar to a database.

Shard – a subset of an index that enables horizontal scaling.

Primary shard – receives write operations.

Replica shard – copy of a primary shard for high availability and query throughput.

Data is routed to the appropriate primary shard by hashing a routing value, which defaults to the document ID. Each write is appended to a transaction log (Translog) for durability and added to an in‑memory buffer; the buffer is periodically written out as immutable segment files. Queries are distributed to a primary or replica copy of each relevant shard, and the partial results are merged before returning to the client.
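The routing step can be illustrated with a minimal Python sketch. It is not Elasticsearch code: the function name pick_shard is hypothetical, and CRC32 is used only as a stand-in hash because it ships with the standard library (Elasticsearch itself uses a Murmur3 hash of the routing value).

```python
import zlib


def pick_shard(routing: str, num_primary_shards: int) -> int:
    """Map a routing value (by default the document ID) to a primary shard number."""
    if num_primary_shards <= 0:
        raise ValueError("num_primary_shards must be positive")
    # Stand-in hash (assumption): CRC32 is deterministic and in the stdlib.
    # Elasticsearch uses Murmur3 here, so real shard numbers will differ.
    hashed = zlib.crc32(routing.encode("utf-8"))
    return hashed % num_primary_shards


if __name__ == "__main__":
    for doc_id in ["order-1", "order-2", "order-3"]:
        print(doc_id, "-> shard", pick_shard(doc_id, 5))
```

Because the shard number depends on the modulus, changing the number of primary shards would send existing IDs to different shards, which is why that number is fixed when an index is created.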

2. Lucene Fundamentals

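Lucene provides high‑performance indexing and retrieval. Documents are tokenized into terms, building a term dictionary and an inverted index that maps each term to the documents containing it. Lucene's storage resembles an LSM tree: in‑memory buffers are flushed to disk as immutable segment files containing dictionaries, postings, and stored fields, and segments are merged in the background. Elasticsearch's Translog ensures durability of operations that have not yet been committed to segments if a node crashes.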
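To make the inverted index and the buffer‑to‑segment flush concrete, here is a toy Python sketch. It is an illustration under simplifying assumptions, not Lucene's implementation: the names TinyIndex, Segment, and buffer_limit are hypothetical, tokenization is a plain lowercase whitespace split, and segments live in memory rather than on disk.

```python
from collections import defaultdict


def tokenize(text):
    """Very naive analyzer: lowercase and split on whitespace."""
    return text.lower().split()


class Segment:
    """An immutable mini-index: term -> sorted tuple of document IDs (postings)."""

    def __init__(self, docs):
        postings = defaultdict(set)
        for doc_id, text in docs.items():
            for term in tokenize(text):
                postings[term].add(doc_id)
        # Freeze the postings so the segment is never modified after creation.
        self.postings = {term: tuple(sorted(ids)) for term, ids in postings.items()}
        self.stored = dict(docs)  # stored fields: the original text per document


class TinyIndex:
    """Buffers documents in memory and flushes them into immutable segments."""

    def __init__(self, buffer_limit=2):
        self.buffer = {}
        self.segments = []
        self.buffer_limit = buffer_limit

    def add(self, doc_id, text):
        self.buffer[doc_id] = text
        if len(self.buffer) >= self.buffer_limit:
            self.flush()

    def flush(self):
        """Turn the current buffer into a new segment and start a fresh buffer."""
        if self.buffer:
            self.segments.append(Segment(self.buffer))
            self.buffer = {}

    def search(self, term):
        """Look the term up in every segment; buffered documents are not visible yet."""
        hits = []
        for segment in self.segments:
            hits.extend(segment.postings.get(term.lower(), ()))
        return sorted(hits)


if __name__ == "__main__":
    index = TinyIndex(buffer_limit=2)
    index.add(1, "Elasticsearch is built on Lucene")
    index.add(2, "Lucene builds an inverted index")  # buffer is full, so it is flushed
    index.add(3, "Segments are immutable")           # stays in the buffer for now
    print(index.search("lucene"))
    print(index.search("immutable"))
    index.flush()
    print(index.search("immutable"))
```

In this sketch a document only becomes searchable once its buffer has been flushed into a segment, which mirrors why search is described as near‑real‑time rather than real‑time.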

3. Application Scenarios

• Real‑time log analysis – ingest logs with Filebeat, process them with Logstash, store them in ES, and visualize them with Kibana. Queries return within seconds even over trillions of log entries.

• Time‑series analysis – high‑throughput writes (on the order of 10 million per second) and low‑latency queries (≈10 ms) for monitoring metrics, IoT sensor data, etc.

• Search services – product search, app store search, site‑wide search. Requirements include >100 k QPS, <20 ms latency, and four‑nines (99.99%) availability.
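The search‑service targets above can be turned into rough capacity numbers with a little arithmetic. The sketch below is a back‑of‑the‑envelope aid, not a sizing method from the original article; the function names are hypothetical. It uses the definition of availability for the downtime budget and Little's law (requests in flight = arrival rate × latency) for concurrency.

```python
def downtime_budget_minutes(availability: float, days: int = 365) -> float:
    """Minutes of allowed unavailability over the given number of days."""
    return (1.0 - availability) * days * 24 * 60


def requests_in_flight(qps: float, latency_seconds: float) -> float:
    """Little's law: average number of concurrent requests = rate x latency."""
    return qps * latency_seconds


if __name__ == "__main__":
    # 99.99% availability over one year.
    print(round(downtime_budget_minutes(0.9999), 2), "minutes of downtime per year")
    # 100 k QPS at the 20 ms latency bound.
    print(round(requests_in_flight(100_000, 0.020)), "requests in flight on average")
```

Under these assumptions, 99.99% availability leaves roughly 52.6 minutes of downtime per year, and 100 k QPS at 20 ms corresponds to about 2,000 requests being processed concurrently across the cluster.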

4. Tencent Elasticsearch Service (CES)

Tencent provides an enhanced ES offering (CES) that integrates X‑Pack features, kernel optimizations, and multi‑AZ disaster recovery. The service addresses three main optimization dimensions:

Availability: Robust master election, improved shard balancing, automatic backup to COS, and recycle‑bin recovery for accidental deletions.

Performance: Segment merge tuning, query pruning using min/max values, CBO‑based cache avoidance, primary‑key deduplication via segment statistics (+45% write speed), Translog lock‑granularity reduction (+20% CPU utilization), and vectorized execution trials.

Cost: Hot‑cold storage separation, roll‑up aggregation for historical data, off‑heap FST storage, and memory‑efficient data structures enabling ~32 GB heap to manage ~50 TB of disk.
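The idea behind query pruning with min/max values, listed under Performance, can be sketched in a few lines of Python. This is an illustration of the general technique only, not Tencent's kernel change: SegmentStats, segments_to_search, and the timestamp field are hypothetical names, and the statistics are assumed to be recorded per segment when it is written.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SegmentStats:
    """Per-segment summary of a numeric field (here, a timestamp)."""
    name: str
    min_ts: int
    max_ts: int


def segments_to_search(segments: List[SegmentStats],
                       query_from: int,
                       query_to: int) -> List[SegmentStats]:
    """Keep only segments whose [min_ts, max_ts] range overlaps the query range."""
    return [
        seg for seg in segments
        if seg.max_ts >= query_from and seg.min_ts <= query_to
    ]


if __name__ == "__main__":
    segments = [
        SegmentStats("seg_0", min_ts=0, max_ts=99),
        SegmentStats("seg_1", min_ts=100, max_ts=199),
        SegmentStats("seg_2", min_ts=200, max_ts=299),
    ]
    # A range query for timestamps 150..220 can skip seg_0 without reading it.
    for seg in segments_to_search(segments, 150, 220):
        print(seg.name)
```

The pruning pays off most for time‑ordered data such as logs and metrics, where each segment tends to cover a narrow time window and most segments fall entirely outside a typical query range.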

5. Conclusion

Elasticsearch is widely used within Tencent for log analysis, time‑series monitoring, and full‑text search. Clusters now reach thousands of nodes and trillions of documents. Ongoing work focuses on further scaling (million‑level shards), storage‑compute separation, and automated diagnostics to reduce operational overhead.

References:

Lucene‑8980: https://github.com/apache/lucene-solr/pull/884

ES‑45765/47790: https://github.com/elastic/elasticsearch/pull/45765

Lucene‑9002: https://github.com/apache/lucene-solr/pull/940
