How Search Engines Work: Building Inverted Indexes
This article explains the core of search engine technology: what an inverted index is, how it is built with the single‑pass in‑memory and multi‑way merge methods, how indexes can be partitioned and incrementally updated, and how Hadoop can be used for large‑scale indexing.
Information retrieval has become a mature field, and modern search engines consist of many complex modules. This article focuses on a simplified search engine that only involves building and querying an inverted index, the most essential component.
The simplified architecture includes three parts: merger (receives queries, tokenizes them, and gathers top‑K results from downstream indexers), indexer (fetches posting lists and computes cosine similarity), and index (the inverted index itself).
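As a minimal sketch of the merger's gathering step, the Python below combines per‑indexer result lists into one global top‑K. The function name merge_top_k and the (score, doc_id) result shape are assumptions made for illustration; the article does not specify the interface between merger and indexer.

```python
import heapq

def merge_top_k(shard_results, k):
    """Combine per-indexer (score, doc_id) lists into a global top-k.

    Assumes each document is scored by exactly one indexer.
    """
    all_hits = (hit for hits in shard_results for hit in hits)
    # nlargest keeps at most k candidates in memory while scanning all hits.
    return heapq.nlargest(k, all_hits, key=lambda hit: hit[0])

results_from_indexers = [
    [(0.91, "d3"), (0.40, "d7")],    # hypothetical output of indexer 0
    [(0.85, "d12"), (0.10, "d15")],  # hypothetical output of indexer 1
]
top2 = merge_top_k(results_from_indexers, 2)
print(top2)
```

Each indexer only needs to return its own best K hits, so the merger's work stays proportional to K times the number of indexers.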
The inverted index stores, for each term, a posting list: a linked list of the IDs of the documents that contain the term. This enables fast retrieval of all documents containing a given word.
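The structure can be illustrated with a small in‑memory sketch. It assumes whitespace tokenization and uses a Python list in place of a linked list; build_index and the sample documents are hypothetical.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to the sorted list of IDs of documents containing it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        # set() ensures a document is listed once per term.
        for term in set(docs[doc_id].lower().split()):
            index[term].append(doc_id)
    return dict(index)

docs = {
    1: "search engines index documents",
    2: "inverted index",
    3: "search queries",
}
index = build_index(docs)
print(index["index"])   # posting list for the term "index"
```

Looking up a term is a single dictionary access, after which the posting list gives every matching document without scanning the corpus.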
Single‑Pass In‑Memory Method: keep posting lists in memory; when a term’s buffer is full, flush it to disk as a separate file, then merge all files in term‑ID order to produce the final index. Advantage: it is fast because less sorting is needed. Disadvantages: higher memory usage and more random disk writes.
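The following is a rough sketch of the single‑pass idea, not the exact scheme described above. As a simplifying assumption it flushes the whole in‑memory block once a posting budget (max_postings, a hypothetical parameter) is reached, and it keeps blocks in a Python list where a real system would write files to disk.

```python
import heapq
from collections import defaultdict

def spimi_blocks(tokens, max_postings):
    """Yield term-sorted blocks of (term, [doc_ids]) from a (term, doc_id) stream."""
    block, count = defaultdict(list), 0
    for term, doc_id in tokens:
        postings = block[term]
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)
            count += 1
        if count >= max_postings:
            yield sorted(block.items())  # a real system writes this block to disk
            block, count = defaultdict(list), 0
    if block:
        yield sorted(block.items())

def merge_blocks(blocks):
    """Merge term-sorted blocks into one index, dropping repeated doc IDs."""
    index = {}
    # heapq.merge is stable, so postings arrive in block order for each term.
    for term, postings in heapq.merge(*blocks, key=lambda item: item[0]):
        merged = index.setdefault(term, [])
        for doc_id in postings:
            if not merged or merged[-1] != doc_id:
                merged.append(doc_id)
    return index

tokens = [("b", 1), ("a", 1), ("a", 2), ("c", 2), ("b", 3)]
blocks = list(spimi_blocks(tokens, max_postings=2))
index = merge_blocks(blocks)
print(index)
```

Only the terms inside each block are sorted, never the individual postings, which is where the method saves sorting work.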
Multi‑Way Merge Method: parse documents to emit <term, docID, tf> tuples, write them to disk, externally sort the tuples by term (lexicographically) and docID, then scan the sorted file sequentially to build the posting lists. This approach uses little memory and relies on sequential I/O, making it suitable for very large corpora, though it can be slower without parallelism.
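A compact sketch of this pipeline follows. It assumes whitespace tokenization, and the sorted runs are held in memory where a real implementation would store each run as a file; run_size and the function names are hypothetical.

```python
import heapq
from collections import Counter
from itertools import groupby

def emit_tuples(docs):
    """Parse documents into (term, doc_id, tf) tuples."""
    for doc_id, text in docs.items():
        for term, tf in Counter(text.lower().split()).items():
            yield (term, doc_id, tf)

def sorted_runs(tuples, run_size):
    """Cut the tuple stream into sorted runs (each would be a file on disk)."""
    run = []
    for item in tuples:
        run.append(item)
        if len(run) == run_size:
            yield sorted(run)
            run = []
    if run:
        yield sorted(run)

def build_index_by_sorting(docs, run_size=3):
    runs = list(sorted_runs(emit_tuples(docs), run_size))
    merged = heapq.merge(*runs)  # multi-way merge, ordered by (term, doc_id)
    # One sequential scan over the merged stream builds every posting list.
    return {
        term: [(doc_id, tf) for _, doc_id, tf in group]
        for term, group in groupby(merged, key=lambda t: t[0])
    }

docs = {1: "to be or not to be", 2: "to do"}
index = build_index_by_sorting(docs)
print(index["to"])
```

Memory use is bounded by run_size during run creation and by one buffered tuple per run during the merge, which is why the method scales to corpora far larger than RAM.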
Index Partitioning:
By document ID: split the document collection into shards and build a separate index for each shard. Queries must be sent to all shards, increasing backend load but keeping posting list lengths bounded.
By term: each index stores posting lists for a subset of terms. A query is routed only to the indexes that hold its terms, reducing backend computation.
Choosing a partitioning strategy depends on query patterns, query volume, and corpus size. For typical web search with few query terms, term‑based partitioning is beneficial; for image‑to‑text search with many terms, document‑based partitioning may be simpler.
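The routing difference between the two strategies can be sketched as below. The modulo and CRC32 assignment rules are assumptions chosen for illustration; the article does not say how documents or terms are assigned to shards.

```python
import zlib

def doc_shard(doc_id, num_shards):
    """Document partitioning: a document lives in exactly one shard."""
    return doc_id % num_shards

def term_shard(term, num_shards):
    """Term partitioning: a term's whole posting list lives in one shard."""
    # crc32 is stable across processes, unlike Python's built-in hash() for str.
    return zlib.crc32(term.encode("utf-8")) % num_shards

def shards_to_query(query_terms, num_shards, by):
    """Return the set of shard numbers a query must be sent to."""
    if by == "document":
        # Any shard may hold matching documents, so the query fans out to all.
        return set(range(num_shards))
    return {term_shard(term, num_shards) for term in query_terms}

print(shards_to_query(["inverted", "index"], 4, by="document"))
print(shards_to_query(["inverted", "index"], 4, by="term"))
```

With term partitioning the fan‑out is at most the number of query terms, which is why short queries benefit most from it.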
Incremental Indexing:
To support real‑time search, a double‑buffer design can be used. Two full indexes alternate: while one serves queries, the other is updated with new documents, then they swap. This guarantees freshness but requires twice the index size in memory.
A more cost‑effective solution combines a full‑index service (for historical data) with an incremental index service that uses double buffers for recent data (hourly or minute granularity) and periodically merges updates into the full index.
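A minimal sketch of the double‑buffer idea is shown below. The class DoubleBufferedIndex and its method names are hypothetical, and copying the new documents into the standby buffer after each swap is one possible way to keep both buffers complete, not necessarily what a production system does.

```python
import threading

class DoubleBufferedIndex:
    """Two in-memory indexes: one serves queries, the other takes updates."""

    def __init__(self):
        self._active = {}    # term -> set of doc IDs, read by search()
        self._standby = {}   # term -> set of doc IDs, written by add()
        self._lock = threading.Lock()

    def add(self, doc_id, text):
        with self._lock:
            for term in set(text.lower().split()):
                self._standby.setdefault(term, set()).add(doc_id)

    def swap(self):
        with self._lock:
            self._active, self._standby = self._standby, self._active
            # Bring the new standby up to date so no document is lost later.
            for term, ids in self._active.items():
                self._standby.setdefault(term, set()).update(ids)

    def search(self, term):
        return sorted(self._active.get(term, ()))

idx = DoubleBufferedIndex()
idx.add(1, "fresh news")
before_swap = idx.search("news")   # new document is not visible yet
idx.swap()
after_swap = idx.search("news")    # visible once the buffers are swapped
```

Queries never see a half‑updated index because updates touch only the standby buffer; the price is the doubled memory footprint noted above.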
Building Inverted Indexes with Hadoop:
For massive document collections, a MapReduce job can construct the index. Mappers parse documents and emit (term, tf+docID) tuples; reducers receive all tuples for a term, sort posting lists by docID, and write them to index files. The final index is obtained by merging all reducer outputs.
In practice, documents are often serialized so that each input split contains many documents, reducing the number of small files and improving mapper throughput.
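The mapper and reducer roles described above can be simulated in plain Python. This is not Hadoop code: shuffle stands in for the framework's group‑by‑key phase, and all function names are hypothetical.

```python
from collections import Counter, defaultdict

def map_doc(doc_id, text):
    """Mapper: emit (term, (doc_id, tf)) for each distinct term in a document."""
    for term, tf in Counter(text.lower().split()).items():
        yield term, (doc_id, tf)

def shuffle(pairs):
    """Stand-in for the framework's shuffle: group values by key."""
    groups = defaultdict(list)
    for term, value in pairs:
        groups[term].append(value)
    return groups

def reduce_term(term, values):
    """Reducer: sort one term's postings by doc_id."""
    return term, sorted(values)

def run_job(docs):
    pairs = (pair for doc_id, text in docs.items()
             for pair in map_doc(doc_id, text))
    grouped = shuffle(pairs)
    return dict(reduce_term(term, grouped[term]) for term in sorted(grouped))

docs = {2: "hadoop index", 1: "index index build"}
index = run_job(docs)
print(index["index"])
```

Because each reducer sees all tuples for a term, its output for that term is already a complete posting list, and the final merge only has to concatenate reducer outputs.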
Conclusion
The techniques described above only scratch the surface of inverted index construction and querying. For deeper study, the following books are recommended:
"Managing Gigabytes: Compressing and Indexing Documents and Images" by Ian H. Witten, Alistair Moffat, Timothy C. Bell (Chinese edition translated by Liang Bin).
"Introduction to Information Retrieval" by Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze (translated by Wang Bin).
Source: https://www.cnblogs.com/haolujun/p/8302542.html
