Design and Key Technologies of the 360 Search Engine for Billion‑Scale Web Retrieval
This article explains how 360 Search processes billions of web pages daily, detailing its backend architecture, offline indexing, online retrieval, index organization, and relevance models that enable efficient search over a hundred‑billion‑scale web corpus.
360 Search is a core product of the company, operating tens of thousands of servers to crawl up to a billion web pages each day and indexing hundreds of billions of high‑quality pages.
The presentation is divided into four main modules: how to design a search engine, key technologies for hundred‑billion‑scale web computation, web index organization patterns, and web retrieval and relevance.
How to design a search engine : A user query is first tokenized. Each token retrieves its posting list from the inverted index, and the posting lists are intersected to obtain a candidate doc list. The candidates are then ranked using positional and attribute information, and the front end extracts summaries for display.
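The flow above can be sketched end to end on a toy corpus. This is a minimal illustration, not 360 Search's implementation: the corpus, the whitespace tokenizer, the term-count ranking, and the 30-character "summary" are all assumptions made for the example.

```python
from collections import Counter

# Toy corpus; doc IDs and texts are made up for illustration.
DOCS = {
    1: "open source search engine design",
    2: "search engine index design notes",
    3: "cooking notes",
}

def tokenize(text):
    """Whitespace tokenizer; a real engine would use a proper segmenter."""
    return text.lower().split()

# Inverted index: term -> sorted list of doc IDs.
INVERTED = {}
for doc_id in sorted(DOCS):
    for term in set(tokenize(DOCS[doc_id])):
        INVERTED.setdefault(term, []).append(doc_id)

def search(query):
    """Tokenize, fetch posting lists, intersect, rank, and cut a summary."""
    terms = tokenize(query)
    if not terms:
        return []
    postings = [set(INVERTED.get(t, [])) for t in terms]
    candidates = set.intersection(*postings)

    def score(doc_id):
        # Placeholder ranking: total occurrences of the query terms.
        counts = Counter(tokenize(DOCS[doc_id]))
        return sum(counts[t] for t in terms)

    ranked = sorted(candidates, key=lambda d: (-score(d), d))
    # The "summary" here is simply the first 30 characters of the text.
    return [(d, DOCS[d][:30]) for d in ranked]
```

Calling `search("search design")` returns documents 1 and 2, while a query containing a term that no document shares with the others, such as `search("cooking search")`, returns an empty list.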
Basic index structures : The system uses both a forward (document) index and an inverted index. The forward index stores document attributes and token lists, while the inverted index maps terms to posting lists.
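A small sketch of the two structures side by side may help. The class and field names (`Index`, `ForwardEntry`, `attrs`) are hypothetical, and the sketch assumes documents are added in increasing doc ID order so that each posting list stays sorted.

```python
from dataclasses import dataclass, field

@dataclass
class ForwardEntry:
    attrs: dict    # document attributes, e.g. language or crawl time
    tokens: list   # token list of the document, in order

@dataclass
class Index:
    forward: dict = field(default_factory=dict)   # doc_id -> ForwardEntry
    inverted: dict = field(default_factory=dict)  # term -> [(doc_id, [positions])]

    def add(self, doc_id, text, **attrs):
        """Index one document. Assumes doc IDs arrive in increasing order."""
        tokens = text.lower().split()
        self.forward[doc_id] = ForwardEntry(attrs=attrs, tokens=tokens)
        positions = {}
        for pos, tok in enumerate(tokens):
            positions.setdefault(tok, []).append(pos)
        for tok, pos_list in positions.items():
            self.inverted.setdefault(tok, []).append((doc_id, pos_list))
```

Storing positions in the posting entries is one way to make positional information available at ranking time; the forward entry is what attribute filters and summary extraction would read.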
Retrieval model : The workflow includes query analysis (granularity, term weight, intent), preparation of retrieval resources, determination of the candidate web set, relevance scoring, and re‑search strategies when results are insufficient.
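The source does not say which re-search strategy is used. One plausible variant, sketched here purely as an assumption, is to start with a strict AND over all query terms and drop the lowest-weight term whenever too few documents come back. The function names and the `min_results` threshold are hypothetical.

```python
def retrieve(terms, index):
    """Strict AND retrieval: docs that contain every term."""
    lists = [set(index.get(t, ())) for t in terms]
    return set.intersection(*lists) if lists else set()

def search_with_fallback(weighted_terms, index, min_results=2):
    """Retry with fewer terms until enough candidates are found.

    weighted_terms maps each query term to a weight from query analysis.
    Returns (sorted doc IDs, terms actually used).
    """
    # Highest-weight terms first, so the least important term is dropped first.
    terms = sorted(weighted_terms, key=weighted_terms.get, reverse=True)
    while terms:
        hits = retrieve(terms, index)
        if len(hits) >= min_results or len(terms) == 1:
            return sorted(hits), terms
        terms = terms[:-1]  # relax the query and search again
    return [], []
```

For example, with weights `{"flights": 0.9, "beijing": 0.8, "cheap": 0.2}`, a query that matches only one document on all three terms is retried without `"cheap"`.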
Offline indexing : Built on HBase/HDFS, MapReduce, and Storm/Kafka. It covers index partitioning, batch index creation via MapReduce, and frequent updates to absorb newly crawled pages and changes to rank‑related features.
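As a rough illustration of partitioning plus batch creation, the sketch below mimics a map phase and a reduce phase in plain Python. It assumes document-based partitioning by a hash of the URL, which the source does not confirm, and every function name here is hypothetical rather than part of any MapReduce API.

```python
import hashlib
from collections import defaultdict

def shard_of(url, num_shards):
    """Stable shard assignment from a hash of the URL (an assumed scheme)."""
    digest = hashlib.md5(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

def map_phase(pages, num_shards):
    """Emit ((shard, term), url) for every distinct term on every page."""
    for url, text in pages:
        shard = shard_of(url, num_shards)
        for term in set(text.lower().split()):
            yield (shard, term), url

def reduce_phase(pairs):
    """Group by (shard, term) and build one inverted index per shard."""
    grouped = defaultdict(list)
    for key, url in pairs:
        grouped[key].append(url)
    shards = defaultdict(dict)
    for (shard, term), urls in grouped.items():
        shards[shard][term] = sorted(urls)
    return dict(shards)
```

Because each page lands in exactly one shard, a shard can be rebuilt or updated without touching the others, which is one reason document-based partitioning suits frequent updates.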
Online retrieval : Consists of distributed services, request broadcasting, and load balancing. Core modules are intersection computation and basic relevance calculation.
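Broadcasting and load balancing can be sketched as a broker that picks one replica per shard, sends the query to all chosen replicas in parallel, and merges their local top-k lists. The `Shard` and `Broker` classes, the precomputed scores, and the round-robin replica choice are illustrative assumptions; the sketch also assumes at least one shard group is configured.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from itertools import count

class Shard:
    """One index partition holding precomputed (score, doc_id) pairs per term."""
    def __init__(self, scored_docs):
        self.scored_docs = scored_docs

    def top_k(self, term, k):
        return heapq.nlargest(k, self.scored_docs.get(term, []))

class Broker:
    """Broadcasts a query to one replica of every shard and merges results."""
    def __init__(self, replica_groups):
        self.replica_groups = replica_groups  # one list of replicas per shard
        self._counter = count()

    def search(self, term, k):
        turn = next(self._counter)
        # Round-robin load balancing across the replicas of each shard.
        chosen = [group[turn % len(group)] for group in self.replica_groups]
        with ThreadPoolExecutor(max_workers=len(chosen)) as pool:
            partials = list(pool.map(lambda s: s.top_k(term, k), chosen))
        # Global top-k is the top-k of the union of the local top-k lists.
        return heapq.nlargest(k, (hit for part in partials for hit in part))
```

Each shard only needs to return its own top k, since no document outside a shard's local top k can appear in the global top k.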
Web index organization : The forward index supports independent updates by storing common attributes in fixed‑length blocks and sparse attributes in variable‑length blocks. The inverted index uses block‑wise compression, with per‑segment metadata that makes it possible to locate the right block quickly and decompress only that block.
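To make the block-wise idea concrete, here is a minimal sketch that splits a sorted posting list into small blocks, stores each block as delta-encoded varints, and keeps the first and last doc ID of every block as metadata. The specific encoding (delta plus varint) and the tiny block size are assumptions chosen for readability; the source does not name the codec.

```python
import bisect

BLOCK_SIZE = 4  # tiny on purpose; real blocks would be much larger

def encode_varint(n):
    """Encode a non-negative int in 7-bit groups, low bits first."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # continuation bit set
        else:
            out.append(byte)
            return bytes(out)

def decode_varints(data):
    values, cur, shift = [], 0, 0
    for byte in data:
        cur |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            values.append(cur)
            cur, shift = 0, 0
    return values

def compress(doc_ids):
    """doc_ids must be strictly increasing. Returns (metadata, blocks)."""
    meta, blocks = [], []
    for i in range(0, len(doc_ids), BLOCK_SIZE):
        chunk = doc_ids[i:i + BLOCK_SIZE]
        payload = bytearray()
        for prev, cur in zip(chunk, chunk[1:]):
            payload += encode_varint(cur - prev)  # store gaps, not raw IDs
        meta.append((chunk[0], chunk[-1]))        # first and last ID of block
        blocks.append(bytes(payload))
    return meta, blocks

def decompress_block(meta, blocks, i):
    ids = [meta[i][0]]
    for gap in decode_varints(blocks[i]):
        ids.append(ids[-1] + gap)
    return ids

def contains(meta, blocks, doc_id):
    """Use metadata to pick one block, then decompress only that block."""
    lasts = [last for _, last in meta]
    i = bisect.bisect_left(lasts, doc_id)
    if i == len(meta) or doc_id < meta[i][0]:
        return False
    return doc_id in decompress_block(meta, blocks, i)
```

A lookup touches the metadata array and a single block, so the cost of decompression stays bounded by the block size regardless of how long the posting list is.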
Intersection model : The shortest posting list is selected first and drives the intersection. For each of its doc IDs, block metadata points to the segment of the other lists that could contain it, and a binary search with step‑size optimization then locates the doc ID efficiently.
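The sketch below shows one reading of this model on uncompressed lists. "Step-size optimization" is interpreted here as galloping (exponentially growing steps followed by a binary search inside the bracketed range); that interpretation, and the function names, are assumptions rather than details from the source.

```python
from bisect import bisect_left

def gallop_to(lst, start, target):
    """Smallest index i >= start with lst[i] >= target, or len(lst)."""
    n = len(lst)
    if start >= n or lst[start] >= target:
        return start
    lo, step = start, 1
    hi = lo + step
    # Double the step until we overshoot the target or the list end.
    while hi < n and lst[hi] < target:
        lo = hi
        step *= 2
        hi = lo + step
    hi = min(hi, n)
    # lst[lo] < target, so the answer lies in (lo, hi].
    return bisect_left(lst, target, lo + 1, hi)

def intersect(posting_lists):
    """Intersect sorted doc ID lists, driven by the shortest one."""
    if not posting_lists:
        return []
    lists = sorted(posting_lists, key=len)
    shortest, others = lists[0], lists[1:]
    cursors = [0] * len(others)
    result = []
    for doc_id in shortest:
        for i, other in enumerate(others):
            cursors[i] = gallop_to(other, cursors[i], doc_id)
            if cursors[i] == len(other):
                return result  # one list is exhausted, nothing more can match
            if other[cursors[i]] != doc_id:
                break          # doc_id missing from this list
        else:
            result.append(doc_id)
    return result
```

Driving the loop from the shortest list bounds the number of probes by its length, and the cursors only move forward, so each longer list is scanned at most once.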
Basic relevance : Implements term weighting (TF, IDF, BM25) and proximity (tightness) calculations to score documents. These basic scores feed higher‑level ranking models such as learning to rank (LTR).
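For reference, a compact sketch of both signals follows. The BM25 function uses the textbook formula with a commonly used non-negative IDF variant and the conventional defaults `k1=1.2`, `b=0.75`. The proximity measure (smallest token window covering all query terms) is an assumed stand-in for the "tightness" calculation, whose exact definition the source does not give.

```python
import math

def bm25(query_terms, doc_tokens, doc_freq, num_docs, avg_len, k1=1.2, b=0.75):
    """BM25 score of one document for a bag of query terms."""
    score = 0.0
    doc_len = len(doc_tokens)
    for term in set(query_terms):
        tf = doc_tokens.count(term)
        if tf == 0:
            continue
        df = doc_freq.get(term, 0)
        idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
        length_norm = k1 * (1 - b + b * doc_len / avg_len)
        score += idf * tf * (k1 + 1) / (tf + length_norm)
    return score

def min_window(query_terms, doc_tokens):
    """Length of the smallest window containing every query term, or None."""
    needed = set(query_terms)
    if not needed:
        return None
    counts, have, left, best = {}, 0, 0, None
    for right, tok in enumerate(doc_tokens):
        if tok in needed:
            counts[tok] = counts.get(tok, 0) + 1
            if counts[tok] == 1:
                have += 1
        # Shrink from the left while the window still covers all terms.
        while have == len(needed):
            width = right - left + 1
            if best is None or width < best:
                best = width
            left_tok = doc_tokens[left]
            if left_tok in needed:
                counts[left_tok] -= 1
                if counts[left_tok] == 0:
                    have -= 1
            left += 1
    return best
```

A smaller window means the query terms sit closer together; a ranking layer could turn that into a bonus on top of the BM25 score or pass both values as features to an LTR model.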
Additional techniques include handling timeliness, cluster resource optimization, retrieval performance, caching, system stability, and real‑time big‑data computation.
360 Tech Engineering
Official tech channel of 360, building the most professional technology aggregation platform for the brand.
