Understanding Full‑Text Search and Indexing with Lucene: Core Concepts and Processes
This article explains the fundamentals of full‑text search, describing how Lucene builds and uses inverted indexes, the steps of tokenization, linguistic processing, term weighting, and relevance scoring, and illustrates these concepts with examples, tables, and diagrams.
1. Overview
Lucene is a high-performance, full-text search library written in Java. Before using Lucene, it is essential to understand the basics of full-text search, starting with the distinction between structured and unstructured data.
Data can be classified as structured (e.g., databases, metadata) or unstructured (e.g., emails, Word documents). Unstructured data is also called full‑text data.
Search on structured data typically uses SQL, while search on unstructured data relies on full‑text techniques such as serial scanning or inverted indexes.
2. What an Inverted Index Stores
To avoid slow serial scanning, an inverted index stores a mapping from terms to the documents that contain them. This mapping consists of a dictionary of terms and a posting list for each term.
Example: to find documents containing both "lucene" and "solr", retrieve the posting lists for each term and intersect them.
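Because posting lists are typically kept sorted by document ID, this intersection can be done in a single linear merge pass. The sketch below illustrates the idea with made-up posting lists for "lucene" and "solr"; it is a minimal illustration, not Lucene's actual posting-list code.

```java
import java.util.ArrayList;
import java.util.List;

// Intersect two posting lists sorted by ascending document ID.
public class PostingIntersect {
    public static List<Integer> intersect(List<Integer> a, List<Integer> b) {
        List<Integer> result = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int da = a.get(i), db = b.get(j);
            if (da == db) {          // doc contains both terms
                result.add(da);
                i++;
                j++;
            } else if (da < db) {
                i++;                 // advance the list with the smaller ID
            } else {
                j++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // "lucene" appears in docs 1, 3, 5; "solr" in docs 3, 5, 8
        System.out.println(intersect(List.of(1, 3, 5), List.of(3, 5, 8))); // [3, 5]
    }
}
```

Keeping posting lists sorted is what makes the merge O(m + n) rather than quadratic, which matters when lists contain millions of documents.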
3. How to Create an Index
Step 1 – Original Documents
Two example documents are used to illustrate the process.
Step 2 – Tokenizer
The tokenizer splits text into tokens, removes punctuation, and filters out stop words (e.g., "the", "a").
Step 3 – Linguistic Processor
Tokens are normalized: lower-casing, stemming (e.g., "cars" → "car"), and lemmatization (e.g., "drove" → "drive"). The resulting units are called terms.
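Steps 2 and 3 can be sketched together as a toy analyzer. The stop-word list, the single-entry lemma table, and the "strip a trailing s" stemming rule below are illustrative stand-ins only; real analyzers (such as Lucene's) use far more sophisticated rules.

```java
import java.util.*;

// Toy analyzer: tokenize, drop stop words, lower-case, lemmatize, stem.
public class SimpleAnalyzer {
    static final Set<String> STOP_WORDS = Set.of("the", "a", "an", "and", "of");
    static final Map<String, String> LEMMAS = Map.of("drove", "drive"); // tiny lemma table

    public static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        // splitting on non-letters also strips punctuation
        for (String token : text.split("[^A-Za-z]+")) {
            if (token.isEmpty()) continue;
            String t = token.toLowerCase();           // lower-casing
            if (STOP_WORDS.contains(t)) continue;     // stop-word filter
            t = LEMMAS.getOrDefault(t, t);            // lemmatization by lookup
            if (t.endsWith("s") && t.length() > 3) {  // crude plural stemming
                t = t.substring(0, t.length() - 1);   // "cars" -> "car"
            }
            terms.add(t);
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The students drove the cars."));
        // -> [student, drive, car]
    }
}
```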
Step 4 – Indexer
The indexer builds a dictionary of terms and creates posting lists (inverted index). The dictionary is then sorted alphabetically, and identical terms are merged into posting lists.
Term      Document ID
student   1
(additional rows omitted for brevity)
After merging, the posting lists contain term frequencies (tf) and document frequencies (df).
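A minimal indexer can be sketched as a sorted map from each term to its posting list, accumulating tf per document; df then falls out as the length of a term's posting list. The example documents below are made up for illustration.

```java
import java.util.*;

// Minimal inverted-index builder: term -> (docId -> tf), terms sorted.
public class Indexer {
    final TreeMap<String, TreeMap<Integer, Integer>> index = new TreeMap<>();

    public void addDocument(int docId, List<String> terms) {
        for (String term : terms) {
            index.computeIfAbsent(term, t -> new TreeMap<>())
                 .merge(docId, 1, Integer::sum); // bump tf for this doc
        }
    }

    // document frequency: number of documents containing the term
    public int df(String term) {
        return index.getOrDefault(term, new TreeMap<>()).size();
    }

    // term frequency of the term within one document
    public int tf(String term, int docId) {
        return index.getOrDefault(term, new TreeMap<>()).getOrDefault(docId, 0);
    }

    public static void main(String[] args) {
        Indexer idx = new Indexer();
        idx.addDocument(1, List.of("student", "lucene", "student"));
        idx.addDocument(2, List.of("lucene", "solr"));
        System.out.println(idx.tf("student", 1)); // 2
        System.out.println(idx.df("lucene"));     // 2
    }
}
```

Using a TreeMap keeps the dictionary alphabetically sorted as terms are added, mirroring the sort-then-merge step described above.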
4. Searching the Index
Search involves parsing the user query, performing lexical and syntactic analysis, applying the same linguistic processing, and building a query tree (e.g., "lucene AND learned NOT hadoop").
The query tree is used to retrieve posting lists, apply Boolean operations (intersection, union, difference), and obtain a set of matching documents.
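Evaluating the example query tree "lucene AND learned NOT hadoop" then reduces to set operations over the three posting lists. The posting lists below are invented example data.

```java
import java.util.*;

// Evaluate "lucene AND learned NOT hadoop" via set intersection/difference.
public class BooleanQueryDemo {
    public static Set<Integer> evaluate(Set<Integer> lucene,
                                        Set<Integer> learned,
                                        Set<Integer> hadoop) {
        Set<Integer> result = new TreeSet<>(lucene);
        result.retainAll(learned);  // AND: keep docs in both lists
        result.removeAll(hadoop);   // NOT: drop docs containing "hadoop"
        return result;
    }

    public static void main(String[] args) {
        Set<Integer> lucene  = Set.of(1, 2, 3);
        Set<Integer> learned = Set.of(2, 3, 4);
        Set<Integer> hadoop  = Set.of(3);
        System.out.println(evaluate(lucene, learned, hadoop)); // [2]
    }
}
```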
Step 4 – Ranking Results
Documents are ranked by relevance using the Vector Space Model. Each term receives a weight based on tf‑idf, and both documents and the query are represented as vectors. Relevance is computed as the cosine similarity between the document vector and the query vector.
Example calculations show how three documents receive different scores, with the highest‑scoring document returned first.
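The scoring step can be sketched as follows, using the common tf-idf weighting tf × log(N/df) and plain cosine similarity. This is one textbook variant of the Vector Space Model; Lucene's production scoring formula differs in its details (normalization, smoothing, boosts).

```java
// Vector Space Model sketch: tf-idf term weights plus cosine similarity.
public class CosineScore {
    // one common tf-idf variant: tf * log(N / df)
    public static double weight(int tf, int df, int numDocs) {
        return tf * Math.log((double) numDocs / df);
    }

    // cosine similarity between two term-weight vectors of equal length
    public static double cosine(double[] d, double[] q) {
        double dot = 0, normD = 0, normQ = 0;
        for (int i = 0; i < d.length; i++) {
            dot   += d[i] * q[i];
            normD += d[i] * d[i];
            normQ += q[i] * q[i];
        }
        if (normD == 0 || normQ == 0) return 0; // no shared terms
        return dot / (Math.sqrt(normD) * Math.sqrt(normQ));
    }

    public static void main(String[] args) {
        // weights for the same two terms in a document and in the query
        double[] doc   = {1.0, 2.0, 0.0};
        double[] query = {1.0, 1.0, 0.0};
        System.out.printf("%.4f%n", cosine(doc, query)); // ~0.9487
    }
}
```

Because cosine similarity measures the angle between vectors rather than their length, a long document is not automatically favored over a short one with the same term proportions.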
5. Summary of Indexing and Searching Process
Indexing: (1) Documents → (2) Tokenization & linguistic processing → (3) Build dictionary and inverted index → (4) Store index on disk.
Searching: (a) User enters query → (b) Query is tokenized and processed → (c) Query tree is built → (d) Index is loaded into memory → (e) Posting lists are retrieved and combined → (f) Results are ranked by relevance → (g) Results are returned to the user.
Java Captain
Focused on Java technologies: SSM, the Spring ecosystem, microservices, MySQL, MyCat, clustering, distributed systems, middleware, Linux, networking, multithreading; occasionally covers DevOps tools like Jenkins, Nexus, Docker, ELK; shares practical tech insights and is dedicated to full‑stack Java development.