Understanding Full‑Text Search and Indexing with Lucene: Core Concepts and Processes
This article explains the fundamentals of full‑text search, describing how Lucene builds and uses inverted indexes, the steps of tokenization, linguistic processing, term weighting, and relevance scoring, and illustrates these concepts with examples, tables, and diagrams.
1. Overview
Lucene is a high-performance, full-text search library written in Java. Before using Lucene, it is essential to understand the basics of full-text search, starting with the distinction between structured and unstructured data.
Data can be classified as structured (e.g., databases, metadata) or unstructured (e.g., emails, Word documents). Unstructured data is also called full‑text data.
Search on structured data typically uses SQL, while search on unstructured data relies on full‑text techniques such as serial scanning or inverted indexes.
2. What an Inverted Index Stores
To avoid slow serial scanning, an inverted index stores a mapping from terms to the documents that contain them. This mapping consists of a dictionary of terms and a posting list for each term.
Example: to find documents containing both "lucene" and "solr", retrieve the posting lists for each term and intersect them.
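Because posting lists are typically kept sorted by document ID, this intersection can be done in a single linear merge pass. The sketch below illustrates the idea with made-up posting lists for "lucene" and "solr"; it is a minimal illustration, not Lucene's actual posting-list code.

```java
import java.util.ArrayList;
import java.util.List;

// Intersect two posting lists sorted by ascending document ID.
public class PostingIntersect {
    public static List<Integer> intersect(List<Integer> a, List<Integer> b) {
        List<Integer> result = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int da = a.get(i), db = b.get(j);
            if (da == db) {          // doc contains both terms
                result.add(da);
                i++;
                j++;
            } else if (da < db) {
                i++;                 // advance the list with the smaller ID
            } else {
                j++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // "lucene" appears in docs 1, 3, 5; "solr" in docs 3, 5, 8
        System.out.println(intersect(List.of(1, 3, 5), List.of(3, 5, 8))); // [3, 5]
    }
}
```

Keeping posting lists sorted is what makes the merge O(m + n) rather than quadratic, which matters when lists contain millions of documents.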
3. How to Create an Index
Step 1 – Original Documents
Two example documents are used to illustrate the process.
Step 2 – Tokenizer
The tokenizer splits text into tokens, removes punctuation, and filters out stop words (e.g., "the", "a").
Step 3 – Linguistic Processor
Tokens are normalized: lower-casing, stemming (e.g., "cars" → "car"), and lemmatization (e.g., "drove" → "drive"). The resulting units are called terms.
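Steps 2 and 3 can be sketched together as a toy analyzer. The stop-word list, the single-entry lemma table, and the "strip a trailing s" stemming rule below are illustrative stand-ins only; real analyzers (such as Lucene's) use far more sophisticated rules.

```java
import java.util.*;

// Toy analyzer: tokenize, drop stop words, lower-case, lemmatize, stem.
public class SimpleAnalyzer {
    static final Set<String> STOP_WORDS = Set.of("the", "a", "an", "and", "of");
    static final Map<String, String> LEMMAS = Map.of("drove", "drive"); // tiny lemma table

    public static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        // splitting on non-letters also strips punctuation
        for (String token : text.split("[^A-Za-z]+")) {
            if (token.isEmpty()) continue;
            String t = token.toLowerCase();           // lower-casing
            if (STOP_WORDS.contains(t)) continue;     // stop-word filter
            t = LEMMAS.getOrDefault(t, t);            // lemmatization by lookup
            if (t.endsWith("s") && t.length() > 3) {  // crude plural stemming
                t = t.substring(0, t.length() - 1);   // "cars" -> "car"
            }
            terms.add(t);
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The students drove the cars."));
        // -> [student, drive, car]
    }
}
```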
Step 4 – Indexer
The indexer builds a dictionary of terms and creates posting lists (inverted index). The dictionary is then sorted alphabetically, and identical terms are merged into posting lists.
Term      Document ID
student   1
(additional rows omitted for brevity)
After merging, the posting lists contain term frequencies (tf) and document frequencies (df).
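A minimal indexer can be sketched as a sorted map from each term to its posting list, accumulating tf per document; df then falls out as the length of a term's posting list. The example documents below are made up for illustration.

```java
import java.util.*;

// Minimal inverted-index builder: term -> (docId -> tf), terms sorted.
public class Indexer {
    final TreeMap<String, TreeMap<Integer, Integer>> index = new TreeMap<>();

    public void addDocument(int docId, List<String> terms) {
        for (String term : terms) {
            index.computeIfAbsent(term, t -> new TreeMap<>())
                 .merge(docId, 1, Integer::sum); // bump tf for this doc
        }
    }

    // document frequency: number of documents containing the term
    public int df(String term) {
        return index.getOrDefault(term, new TreeMap<>()).size();
    }

    // term frequency of the term within one document
    public int tf(String term, int docId) {
        return index.getOrDefault(term, new TreeMap<>()).getOrDefault(docId, 0);
    }

    public static void main(String[] args) {
        Indexer idx = new Indexer();
        idx.addDocument(1, List.of("student", "lucene", "student"));
        idx.addDocument(2, List.of("lucene", "solr"));
        System.out.println(idx.tf("student", 1)); // 2
        System.out.println(idx.df("lucene"));     // 2
    }
}
```

Using a TreeMap keeps the dictionary alphabetically sorted as terms are added, mirroring the sort-then-merge step described above.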
4. Searching the Index
Search involves parsing the user query, performing lexical and syntactic analysis, applying the same linguistic processing, and building a query tree (e.g., "lucene AND learned NOT hadoop").
The query tree is used to retrieve posting lists, apply Boolean operations (intersection, union, difference), and obtain a set of matching documents.
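Evaluating the example query tree "lucene AND learned NOT hadoop" then reduces to set operations over the three posting lists. The posting lists below are invented example data.

```java
import java.util.*;

// Evaluate "lucene AND learned NOT hadoop" via set intersection/difference.
public class BooleanQueryDemo {
    public static Set<Integer> evaluate(Set<Integer> lucene,
                                        Set<Integer> learned,
                                        Set<Integer> hadoop) {
        Set<Integer> result = new TreeSet<>(lucene);
        result.retainAll(learned);  // AND: keep docs in both lists
        result.removeAll(hadoop);   // NOT: drop docs containing "hadoop"
        return result;
    }

    public static void main(String[] args) {
        Set<Integer> lucene  = Set.of(1, 2, 3);
        Set<Integer> learned = Set.of(2, 3, 4);
        Set<Integer> hadoop  = Set.of(3);
        System.out.println(evaluate(lucene, learned, hadoop)); // [2]
    }
}
```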
Step 4 – Ranking Results
Documents are ranked by relevance using the Vector Space Model. Each term receives a weight based on tf‑idf, and both documents and the query are represented as vectors. Relevance is computed as the cosine similarity between the document vector and the query vector.
Example calculations show how three documents receive different scores, with the highest‑scoring document returned first.
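The scoring step can be sketched as follows, using the common tf-idf weighting tf × log(N/df) and plain cosine similarity. This is one textbook variant of the Vector Space Model; Lucene's production scoring formula differs in its details (normalization, smoothing, boosts).

```java
// Vector Space Model sketch: tf-idf term weights plus cosine similarity.
public class CosineScore {
    // one common tf-idf variant: tf * log(N / df)
    public static double weight(int tf, int df, int numDocs) {
        return tf * Math.log((double) numDocs / df);
    }

    // cosine similarity between two term-weight vectors of equal length
    public static double cosine(double[] d, double[] q) {
        double dot = 0, normD = 0, normQ = 0;
        for (int i = 0; i < d.length; i++) {
            dot   += d[i] * q[i];
            normD += d[i] * d[i];
            normQ += q[i] * q[i];
        }
        if (normD == 0 || normQ == 0) return 0; // no shared terms
        return dot / (Math.sqrt(normD) * Math.sqrt(normQ));
    }

    public static void main(String[] args) {
        // weights for the same two terms in a document and in the query
        double[] doc   = {1.0, 2.0, 0.0};
        double[] query = {1.0, 1.0, 0.0};
        System.out.printf("%.4f%n", cosine(doc, query)); // ~0.9487
    }
}
```

Because cosine similarity measures the angle between vectors rather than their length, a long document is not automatically favored over a short one with the same term proportions.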
5. Summary of Indexing and Searching Process
Indexing: (1) Documents → (2) Tokenization & linguistic processing → (3) Build dictionary and inverted index → (4) Store index on disk.
Searching: (a) User enters query → (b) Query is tokenized and processed → (c) Query tree is built → (d) Index is loaded into memory → (e) Posting lists are retrieved and combined → (f) Results are ranked by relevance → (g) Results are returned to the user.
Java Captain
Focused on Java technologies: SSM, the Spring ecosystem, microservices, MySQL, MyCat, clustering, distributed systems, middleware, Linux, networking, multithreading; occasionally covers DevOps tools like Jenkins, Nexus, Docker, ELK; shares practical tech insights and is dedicated to full‑stack Java development.