
Search Engine Architecture: Indexing, Querying, and Elasticsearch Basics

This article explains what a search engine is, describes its core components—indexing and search modules—detailing the workflow from content acquisition to result rendering, and provides an in‑depth overview of Elasticsearch, including its architecture, clusters, shards, replicas, mappings, and basic configuration.

MaGe Linux Operations

May 29, 2020


Most people use search engines every day: Baidu, Google, Bing, 360 Search, Taobao Search, JD Search, and many others all retrieve information based on user queries.

A search engine is a system that, according to specific strategies and programs, collects information, organizes and processes it, and then provides retrieval services to users, displaying the relevant results.

Components of a Search Engine

A search engine generally consists of an index component and a search component. The diagram below shows the workflow of these two components.

The red box in the diagram represents the index component, and the green box represents the search component.

Index component: Acquire Content → Build Document → Analyze Document → Index Document

Search component: Search User Interface → Build Query (convert user input into a query object) → Run Query → Render Results

Why does the index component workflow go from bottom to top? For example, on Taobao, when a product is listed, the raw content is first stored, then the system acquires the content (images, titles), builds a document, analyzes it (tokenization), and finally adds it to the index so it can be searched.

The user’s search box is the Search User Interface. When a query is entered, the browser converts it to a request (Build Query), the Elasticsearch server runs the query against the index (Run Query), and the results are rendered back to the user (Render Results).
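The search-side steps can be sketched in Python as follows. This is a minimal illustration, not the article's implementation: the function names, the `title` field, the `movies` index, and the `http://localhost:9200` address are assumptions made for the example, and the request is only constructed, never sent.

```python
import json
import urllib.request


def build_query(user_input: str, field: str = "title", size: int = 10) -> dict:
    """Build Query: turn raw user input into a query object (a match query)."""
    return {"query": {"match": {field: user_input.strip()}}, "size": size}


def build_search_request(base_url: str, index: str, query: dict) -> urllib.request.Request:
    """Wrap the query object in an HTTP request for the index's _search endpoint."""
    return urllib.request.Request(
        url=f"{base_url}/{index}/_search",
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def render_results(response: dict) -> list:
    """Render Results: pull the title out of each hit in a search response."""
    return [hit["_source"]["title"] for hit in response["hits"]["hits"]]


# Hypothetical user input and server address, for illustration only.
query = build_query("  star wars ")
request = build_search_request("http://localhost:9200", "movies", query)
print(request.get_method(), request.full_url)
print(json.dumps(query))
```

Run Query happens on the server once the request is sent, for example with `urllib.request.urlopen(request)` against a running cluster.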

An index is a data structure that enables fast random access to stored terms. To retrieve information quickly from large text collections, the text must first be transformed into an indexable format.
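One common structure for this purpose is an inverted index, which maps each term to the documents that contain it. The sketch below is a toy version written for illustration: the `analyze` function and `InvertedIndex` class are hypothetical names, and the analyzer is far simpler than the ones Lucene provides.

```python
import re
from collections import defaultdict


def analyze(text: str) -> list:
    """Toy analyzer: lowercase the text and split on non-alphanumeric characters."""
    return [token for token in re.split(r"[^a-z0-9]+", text.lower()) if token]


class InvertedIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # term -> ids of documents containing it
        self.documents = {}               # doc id -> original text

    def add(self, doc_id, text):
        """Analyze a document and record each of its terms in the postings."""
        self.documents[doc_id] = text
        for term in analyze(text):
            self.postings[term].add(doc_id)

    def search(self, query):
        """Return the sorted ids of documents that contain every query term."""
        terms = analyze(query)
        if not terms:
            return []
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return sorted(result)


index = InvertedIndex()
index.add(1, "Red cotton T-shirt")
index.add(2, "Blue cotton jeans")
index.add(3, "Red leather jacket")

print(index.search("red"))         # [1, 3]
print(index.search("cotton red"))  # [1]
```

A lookup touches only the postings of the query terms, which is why it stays fast as the number of documents grows.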

Elasticsearch Introduction

Elasticsearch is an open‑source, distributed, RESTful search and analytics engine written in Java. It uses Apache Lucene for indexing and serves as the core of the Elastic Stack, storing data centrally. Besides full‑text search, it supports structured search, analytics, and their combinations.

Full‑text search: each field can be indexed and searched.

Scalability: scales to hundreds of nodes and handles petabytes of structured and unstructured data.

Analytics: provides distributed real‑time analytical capabilities.

Basic Concepts of Elasticsearch

Key concepts include Cluster, Node, Index, Shards, Replicas, and Document.

Cluster

A cluster is a collection of one or more nodes that together store all data and provide unified indexing and search across all nodes. Each cluster has a unique name (default "elasticsearch").

Node

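A node is a single server that belongs to a cluster, stores data, and participates in indexing and search. Each node is identified by a name, which in Elasticsearch 7 defaults to the machine's hostname; each node is also assigned a randomly generated UUID as its ID.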

Index

An index is a collection of documents with similar characteristics (e.g., customers, products, orders). Index names must be lowercase. Each index has a mapping that defines its field names and types.

Mapping and Settings

Mapping defines field types.

Settings define how the index's data is distributed, such as the number of shards and replicas.

Index metadata contains both mapping and settings information.

Document

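A document is the smallest searchable unit that can be indexed. Documents are stored as JSON objects and can contain fields of types such as text, long, boolean, date, binary, and range. Each document has an ID that is unique within its index.

The JSON below is the metadata of an index named movies, showing the mappings and settings for the documents it holds.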

{
  "movies": {
    "aliases": {},
    "mappings": {
      "properties": {
        "@version": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } },
        "genre": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } },
        "id": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } },
        "title": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } },
        "year": { "type": "long" }
      }
    },
    "settings": {
      "index": {
        "creation_date": "1575518947108",
        "number_of_shards": "1",
        "number_of_replicas": "1",
        "uuid": "r4qFegpeSmCnxmJzYVPO6w",
        "version": { "created": "7040099" },
        "provided_name": "movies"
      }
    }
  }
}
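As a rough sketch of how such an index might be defined, the Python below builds a create-index request with explicit settings and mappings. The `http://localhost:9200` address and the helper name `text_with_keyword` are assumptions for illustration, only a few of the fields above are included, and the request is constructed but not sent.

```python
import json
import urllib.request


def text_with_keyword() -> dict:
    """A text field with a keyword sub-field, matching the mapping shown above."""
    return {"type": "text", "fields": {"keyword": {"type": "keyword", "ignore_above": 256}}}


# Request body: settings control distribution, mappings control field types.
index_body = {
    "settings": {"index": {"number_of_shards": 1, "number_of_replicas": 1}},
    "mappings": {
        "properties": {
            "title": text_with_keyword(),
            "genre": text_with_keyword(),
            "year": {"type": "long"},
        }
    },
}

# PUT /movies creates the index; the server address is a placeholder.
create_request = urllib.request.Request(
    url="http://localhost:9200/movies",
    data=json.dumps(index_body).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="PUT",
)
print(create_request.get_method(), create_request.full_url)
```

Values such as `creation_date`, `uuid`, and `version` in the metadata above are filled in by Elasticsearch and are not part of the request.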

Shards & Replicas

An index can store massive data that exceeds a single node’s capacity. Elasticsearch allows an index to be split into multiple shards, each a fully functional Lucene index that can reside on any node.

Shards enable horizontal scaling and parallel processing, improving performance and throughput. Replicas provide high availability and allow read operations to be distributed across nodes.

Since Elasticsearch 7, a newly created index has one primary shard and one replica shard by default (earlier versions used five primary shards and one replica).

Each shard runs a Lucene instance. The number of primary shards is set at index creation and cannot be changed later without reindexing; replicas can be added or removed dynamically.
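The reason the primary shard count is fixed becomes clearer from how documents are routed. In simplified form, a document's shard is chosen as hash(routing value) modulo the number of primary shards, where the routing value defaults to the document ID. The sketch below assumes that simplified rule and uses `zlib.crc32` as a stand-in hash, which is not the hash function Elasticsearch itself uses.

```python
import zlib


def route_to_shard(doc_id: str, number_of_primary_shards: int) -> int:
    """Pick a shard as hash(doc_id) % number_of_primary_shards.

    zlib.crc32 is a stand-in hash chosen for illustration only.
    """
    return zlib.crc32(doc_id.encode("utf-8")) % number_of_primary_shards


doc_ids = [f"doc-{n}" for n in range(1000)]

# Route the same documents with 3 primary shards, then with 4.
with_3 = {doc_id: route_to_shard(doc_id, 3) for doc_id in doc_ids}
with_4 = {doc_id: route_to_shard(doc_id, 4) for doc_id in doc_ids}

moved = sum(1 for doc_id in doc_ids if with_3[doc_id] != with_4[doc_id])
print(f"{moved} of {len(doc_ids)} documents would map to a different shard")
```

Because changing the divisor changes where most documents map, existing documents could no longer be found on the shard where they were written, which is why a different primary shard count requires reindexing. Replicas are copies of existing shards, so adding or removing them does not affect routing.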

Shard Settings

Setting too few shards limits horizontal scaling and can cause large shard sizes, making reallocation costly. Setting too many shards can affect relevance scoring, waste resources, and degrade performance.

Elasticsearch Principle Overview

A single search request proceeds as follows. The node that receives the user's query coordinates the request. It may handle part of the request locally, using the shard copies it holds (typically replicas for reads), and forwards the rest to other nodes that hold copies of the remaining shards. The partial results from all nodes are then merged and returned to the client.

Read requests can be served by replica shards, while write requests must be handled by primary shards (though primary shards can also serve reads).
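The read path can be sketched as a scatter-gather: each shard copy returns its own top results, and the receiving node merges them. The Python below is a toy model under stated assumptions: shards are plain dictionaries, the function names are hypothetical, and the score is a raw term count standing in for Lucene's relevance scoring.

```python
import heapq


def search_shard(shard_docs: dict, term: str, size: int) -> list:
    """Search one shard copy and return its local top `size` (score, doc_id) pairs."""
    scored = [(text.lower().split().count(term), doc_id) for doc_id, text in shard_docs.items()]
    matches = [(score, doc_id) for score, doc_id in scored if score > 0]
    return heapq.nlargest(size, matches)


def coordinate_search(shards: list, term: str, size: int = 2) -> list:
    """Query every shard, then merge the partial results into a global top `size`."""
    partial = []
    for shard_docs in shards:
        partial.extend(search_shard(shard_docs, term, size))
    return [doc_id for score, doc_id in heapq.nlargest(size, partial)]


# Two shards of one index, each holding part of the documents.
shards = [
    {"a": "red shirt", "b": "red red scarf"},
    {"c": "blue jeans", "d": "red hat red coat red bag"},
]

print(coordinate_search(shards, "red"))  # ['d', 'b']
```

Each shard only needs to return its own best `size` hits, so the merging node handles a small amount of data no matter how large the shards are.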

Origin of Elasticsearch

Several years ago, Shay Banon, a newly married unemployed developer, built a recipe search engine for his wife. He abstracted Lucene into a library called “Compass”. Later, while working on a high‑performance distributed data grid, he rewrote Compass into a standalone service named Elasticsearch. The first public version was released in February 2010, and the project quickly became one of the most popular on GitHub.
