Principles and Simple Implementation of a Search Engine in Go
This article explains the fundamental concepts of search engine technology—including forward and inverted indexes, tokenizers, stop words, synonym handling, ranking algorithms, and NLP integration—and provides a concise Go implementation with code examples and performance testing.
Search engines are ubiquitous in the information age, serving as the primary gateway for users to retrieve data across personal blogs, e‑commerce platforms, and media services. This article first introduces the core principles behind search engines, focusing on indexing mechanisms.
1. Search Engine Principles
When a user enters a query into Google, the engine must quickly locate the relevant content among billions of indexed pages. This is achieved through the use of indexes.
1.1 Forward Index
A forward index maps a unique identifier (e.g., a person's name) directly to its associated data, similar to how the human brain retrieves information. Example data structure:
{
"Zhang San": {"sex": "F", "height": 175},
"Li Si": {"sex": "M", "height": 165}
}
While lookup is O(1) using a hash table, forward indexes are inefficient for attribute-based searches over massive datasets.
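In Go, a forward index like the one above is naturally a map keyed by the identifier. The field names below mirror the example; the `tallerThan` helper is illustrative, showing why attribute searches are the weak point:

```go
package main

import "fmt"

// Person is the record stored under each key of the forward index.
type Person struct {
	Sex    string
	Height int
}

// forward is the forward index: identifier → record, O(1) lookup by key.
var forward = map[string]Person{
	"Zhang San": {Sex: "F", Height: 175},
	"Li Si":     {Sex: "M", Height: 165},
}

// tallerThan shows the weakness: an attribute search must scan
// every entry, i.e. O(n) over the whole dataset.
func tallerThan(cm int) []string {
	var names []string
	for name, p := range forward {
		if p.Height > cm {
			names = append(names, name)
		}
	}
	return names
}

func main() {
	fmt.Println(forward["Zhang San"].Height) // prints 175 (O(1) key lookup)
	fmt.Println(tallerThan(170))             // prints [Zhang San]
}
```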
1.2 Inverted Index
Inverted indexes map terms (keywords) to the list of document IDs containing those terms, enabling fast retrieval even in large corpora. Example mapping for two documents:
{
"政采云": [1, 2],
"原创干货": [1],
"技术氛围": [2],
"互联网公司": [2]
}
Building an inverted index requires a tokenizer to extract terms from raw text.
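Building such a mapping is straightforward once documents are tokenized. A minimal sketch, indexing pre-split terms directly (the `addDoc` helper is illustrative, not from the article's code):

```go
package main

import "fmt"

// addDoc records each term of a document in the inverted index,
// appending the document ID to every term's posting list.
func addDoc(inverted map[string][]int, docId int, terms []string) {
	for _, t := range terms {
		inverted[t] = append(inverted[t], docId)
	}
}

func main() {
	inverted := make(map[string][]int) // term → IDs of documents containing it
	addDoc(inverted, 1, []string{"政采云", "原创干货"})
	addDoc(inverted, 2, []string{"政采云", "技术氛围", "互联网公司"})
	fmt.Println(inverted["政采云"]) // prints [1 2]
}
```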
1.3 Tokenizer
A tokenizer splits text into individual tokens for indexing and query processing. For the sentence "我好想去政采云有限公司工作" ("I would really like to work at Zhengcaiyun Co., Ltd."), the tokenizer might produce:
["我", "好", "想去", "政采", "云", "有限公司", "工作"]
Effective tokenization is crucial for languages like Chinese, where word boundaries are not explicit.
1.4 Stop Words
Stop words (e.g., "the", "is" in English or "的", "在" in Chinese) are filtered out because they add little semantic value and increase index size.
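Filtering is a simple set-membership check before terms reach the index. A minimal sketch (the stop-word list here is tiny and illustrative; production lists run to hundreds of entries):

```go
package main

import "fmt"

// stopWords is an illustrative list of low-value terms.
var stopWords = map[string]struct{}{
	"的": {}, "在": {}, "the": {}, "is": {},
}

// filterStopWords drops tokens that carry little semantic value,
// shrinking the index without hurting relevance.
func filterStopWords(tokens []string) []string {
	kept := make([]string, 0, len(tokens))
	for _, t := range tokens {
		if _, stop := stopWords[t]; !stop {
			kept = append(kept, t)
		}
	}
	return kept
}

func main() {
	fmt.Println(filterStopWords([]string{"我", "的", "工作"})) // prints [我 工作]
}
```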
1.5 Synonyms
Synonym handling expands query terms (e.g., mapping "k8s" to "kubernetes" or "开发" to "研发") to improve recall. This often relies on NLP techniques or curated synonym dictionaries.
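With a curated dictionary, query expansion is a lookup per term. A minimal sketch (the `synonyms` map and `expandQuery` helper are illustrative; in practice the dictionary would be loaded from data or produced by an NLP model):

```go
package main

import "fmt"

// synonyms maps a term to its alternate forms.
var synonyms = map[string][]string{
	"k8s": {"kubernetes"},
	"开发":  {"研发"},
}

// expandQuery returns the original terms plus any synonyms,
// trading a larger candidate set for better recall.
func expandQuery(terms []string) []string {
	out := make([]string, 0, len(terms))
	for _, t := range terms {
		out = append(out, t)
		out = append(out, synonyms[t]...)
	}
	return out
}

func main() {
	fmt.Println(expandQuery([]string{"k8s", "部署"})) // prints [k8s kubernetes 部署]
}
```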
1.6 Document Ranking Algorithms
After retrieval, results are ranked using heuristics such as term frequency, title relevance, recency, and more sophisticated machine‑learning models.
1.7 Associative Search (NLP)
Modern engines provide query suggestions based on user history, location, and language context, a complex task that still suffers from occasional errors.
1.8 Large Models and Search Engines
Large language models enhance understanding of ambiguous queries but cannot replace traditional search engines due to higher latency, cost, and potential hallucinations. Search engines remain essential for real‑time, trustworthy results.
2. Implementing a Simple Search Engine
The minimal engine supports two core functions: index creation and query execution. The implementation stores indexes in memory for demonstration purposes.
2.1 Data Types
type Doc struct {
	Title       string    `json:"title,omitempty"`
	Type        string    `json:"type,omitempty"`
	Source      string    `json:"source,omitempty"`
	Url         string    `json:"url,omitempty"`
	Description string    `json:"description,omitempty"`
	CreatedAt   time.Time `json:"createdAt,omitempty"`
	UpdatedAt   time.Time `json:"updatedAt,omitempty"`
}
2.2 Interface Design
Index interface for storing term‑to‑doc mappings.
Metadata interface for persisting document metadata.
Count interface for tracking term frequencies used in ranking.
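The article does not show these interface bodies. A plausible minimal shape, inferred from how `CreateIndex` and `Search` use them below (the signatures are assumptions, and the original definitions may differ), together with a throwaway in-memory implementation to show the contract:

```go
package main

import "fmt"

// Doc is abbreviated from section 2.1.
type Doc struct {
	Title string
}

// Index stores the term-to-document mapping.
type Index interface {
	Set(docId, word string)           // add docId to word's posting list
	Get(word string) ([]string, bool) // docIds whose text contains word
}

// Metadata persists document metadata by ID.
type Metadata interface {
	Set(docId string, doc *Doc) error
	Get(docId string) (*Doc, error)
}

// Count tracks term occurrences for ranking.
type Count interface {
	Set(docId, word string)
}

// memIndex is a minimal in-memory Index implementation.
type memIndex map[string][]string

func (m memIndex) Set(docId, word string) { m[word] = append(m[word], docId) }
func (m memIndex) Get(word string) ([]string, bool) {
	docs, ok := m[word]
	return docs, ok
}

func main() {
	var idx Index = memIndex{}
	idx.Set("doc1", "政采云")
	docs, ok := idx.Get("政采云")
	fmt.Println(docs, ok) // prints [doc1] true
}
```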
2.3 Set Data Structure
type Empty struct{}
type Set struct {
sync.RWMutex
Store map[string]Empty
}
func NewSet() *Set { return &Set{Store: make(map[string]Empty)} }
func (s Set) List() []string { /* ... */ }
func (s Set) Set(key string) { /* ... */ }
func (s Set) Del(key string) { /* ... */ }2.4 Index Creation
func (s Store) CreateIndex(text, docId string, doc *Doc) {
	// Tokenize, de-duplicate, then record each term for this document.
	seg := removeDuplicate(s.Seg.Cut(text, true))
	for _, v := range seg {
		word := utils.ToLower(v)
		s.Index.Set(docId, word)
		s.Count.Set(docId, word)
	}
	if err := s.Metadata.Set(docId, doc); err != nil {
		log.Println("[ERROR] Update metadata failed:", err.Error())
	}
}
2.5 Search and Ranking
func (s Store) Search(text string) []string {
	var result []string
	seg := s.Seg.Cut(text, true)
	for _, v := range seg {
		key := utils.ToLower(v)
		if docs, ok := s.Index.Get(key); ok {
			result = append(result, docs...)
		}
	}
	return result
}

// SortSearch ranks the raw hits: a document matched by more query terms
// gets a higher weight. (The original listing elides the body; this is
// a minimal sketch.)
func (s Store) SortSearch(text string) []string {
	weight := make(map[string]int)
	for _, docId := range s.Search(text) {
		weight[docId]++
	}
	result := make([]string, 0, len(weight))
	for docId := range weight {
		result = append(result, docId)
	}
	sort.Slice(result, func(i, j int) bool {
		return weight[result[i]] > weight[result[j]]
	})
	return result
}
2.6 Performance Test
func main() {
	s := store.NewStoreByMEM()
	s.CreateIndex("政采云技术团队, 是一个富有活力、创造力的团队, 持续保持稳定的原创更新, 欢迎点赞关注!", "111111", &pkg.Doc{Title: "政采云技术团队", Description: "政采云技术团队, 是一个富有活力、创造力的团队", Url: "https://juejin.cn/user/703323667190104", Source: "原创文章"})
	now := time.Now()
	fmt.Println("keyword: 政采云 matched docs:", s.Search("政采云"))
	fmt.Println("keyword: 原创 matched docs:", s.Search("原创"))
	fmt.Println("keyword: 创造力 matched docs:", s.Search("创造力"))
	fmt.Printf("search took: %s\n", time.Since(now))
}
// Output shows sub-millisecond query latency.
The example demonstrates a single-node, in-memory search engine with basic indexing, tokenization, and ranking capabilities. While it lacks distributed storage and fault tolerance, it illustrates core concepts that can be extended to production-grade systems.
3. Conclusion
Although the basic principles of a search engine appear simple, real‑world deployments must handle massive data volumes, high concurrency, and stringent latency and relevance requirements, making search a complex engineering challenge.
About the ZCY Technology Team
ZCY Technology Team (Zero), based in Hangzhou, is a growth-oriented team passionate about technology and craftsmanship. With around 500 members, we are building comprehensive engineering, project management, and talent development systems. We are committed to innovation and creating a cloud service ecosystem for government and enterprise procurement. We look forward to your joining us.