How Inverted Indexes Power Fast Full-Text Search
This article explains what an inverted index is, why it’s essential for full‑text search, how it is built and queried, and common token transformations such as stop‑word removal, lemmatization, and stemming.
Content Overview
What is an inverted index and why is it needed?
How does an inverted index work?
1. What is an inverted index?
Imagine a dating website with a user table. A user might say, “I want to find a PHP developer in Shanghai,” or “I want a Java developer in Beijing who loves travel and food.” Matching on gender, city, and language quickly becomes a combinatorial problem.
Relational‑database indexes struggle with such multi‑field full‑text queries, so a full‑text search engine uses an inverted index.
An inverted index stores a "content → document" mapping, enabling rapid full‑text lookup.
2. How does an inverted index work?
The process consists of two main steps:
Creating the inverted index
Searching the inverted index
2.1 Creating the inverted index
Consider two documents:
"Recipe of pasta with sauce pesto"
"Recipe of delicious carbonara pasta"
Each document is tokenized into individual words (tokens) and the token‑to‑document relationships are stored.
The resulting index looks like:
2.2 Inverted index search
Search example: "pasta recipe"
Tokenization yields two tokens: pasta and recipe . Both tokens match in both documents, so both documents are returned with equal scores.
Another search: "carbonara pasta"
Both documents match, but Document #2 receives a higher score because it matches both tokens ( carbonara and pasta ), whereas Document #1 matches only pasta .
2.3 Token transformations
Before storing or searching, tokens can be transformed to improve relevance:
Stop‑word removal : discard high‑frequency, low‑information words such as "of", "the", "for".
Lemmatization (normalization) : map inflected forms to their dictionary base, e.g., "running" → "run", "walks" → "walk", "thought" → "think".
Stemming : truncate word endings to obtain a root form; useful for words not covered by a dictionary, though it cannot handle irregular verbs.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Java High-Performance Architecture
Sharing Java development articles and resources, including SSM architecture and the Spring ecosystem (Spring Boot, Spring Cloud, MyBatis, Dubbo, Docker), Zookeeper, Redis, architecture design, microservices, message queues, Git, etc.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
