Boolean Algebra and Search Engine Technology
The article outlines how search engines combine the Tao of underlying principles—crawling, binary‑based Boolean indexing, PageRank matrix calculations, and TF‑IDF weighting—with specific Shu implementations to efficiently retrieve, rank, and present relevant web pages using Boolean logic, link analysis, and term relevance metrics.
This article explains the fundamental principles behind search engine technology, distinguishing between the "Tao" (underlying principles) and "Shu" (specific implementation methods).
How Search Engines Work
A search engine essentially performs three tasks: automatically downloading as many web pages as possible, building fast and effective indexes, and ranking web pages fairly and accurately based on relevance.
Boolean Algebra in Search
Boolean algebra originated from binary systems. The Chinese Yin-Yang theory was an early form of binary, while the binary counting system was完善ed by Indian scholars in the 2nd-5th century BC, and later by Leibniz in the 17th century who used 0 and 1. In 1854, George Boole's book "The Laws of Thought" demonstrated how mathematics could solve logical problems.
Boolean algebra has only two elements: 1 (true) and 0 (false), with three basic operations: AND, OR, and NOT. Both Google and Baidu's search fundamentals are based on Boolean algebra. For example, searching for documents about atomic energy applications (but not how to make atomic bombs) would use the query: "atomic energy AND applications AND NOT atomic bombs".
Index Structure
To find thousands of results in milliseconds, search engines use indexes rather than text scanning. A simple index uses a long binary number where each bit represents whether a keyword appears in a document. For example, if there are N documents, the index has N bits. The keyword "atomic energy" might have binary representation 01001..., and "applications" might be 00101.... Performing Boolean AND operation yields the matching documents.
PageRank Algorithm
PageRank (PR) was invented by Larry Page and Sergey Brin, Google's founders. The core idea is that if a web page is linked to by many other pages, it indicates widespread recognition and trust, resulting in a higher ranking. Pages with higher PR contribute more weight to their outgoing links. For instance, if four pages with weights 0.001, 0.01, 0.02, and 0.05 all link to page Y, then PR(Y) = 0.001 + 0.01 + 0.02 + 0.05 = 0.081. The PR calculation essentially involves matrix multiplication from linear algebra.
TF-IDF for Relevance
To measure relevance between a query and web pages, a simple method uses term frequency (TF). However, stop words like "the," "is," "of" have no value in determining page topic. The most widely used weighting in information retrieval is Inverse Document Frequency (IDF), with formula log(D/Dw), where D is total web pages and Dw is pages containing the keyword.
For example, if there are 1 billion Chinese web pages, the stop word "的" appears in all 1 billion, so IDF = log(1 billion/1 billion) = log(1) = 0. If "atomic energy" appears in 2 million pages, IDF = log(500) = 8.96. If "applications" appears in 500 million pages, IDF = log(2) = 1.
The final relevance formula becomes: TF1×IDF1 + TF2×IDF2 + TF3×IDF3 + ...
Baidu Tech Salon
Baidu Tech Salon, organized by Baidu's Technology Management Department, is a monthly offline event that shares cutting‑edge tech trends from Baidu and the industry, providing a free platform for mid‑to‑senior engineers to exchange ideas.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.