Principles and Simple Implementation of a Search Engine in Go
This article explains the fundamental concepts of search engine technology—including forward and inverted indexes, tokenizers, stop words, synonym handling, ranking algorithms, and NLP integration—and provides a concise Go implementation with code examples and performance testing.
Search engines are ubiquitous in the information age, serving as the primary gateway for users to retrieve data across personal blogs, e‑commerce platforms, and media services. This article first introduces the core principles behind search engines, focusing on indexing mechanisms.
1. Search Engine Principles
When a user enters a query into Google, the engine must quickly locate the relevant content among billions of indexed pages. This is achieved through the use of indexes.
1.1 Forward Index
A forward index maps a unique identifier (e.g., a person's name) directly to its associated data, similar to how the human brain retrieves information. Example data structure:
{
"Zhang San": {"sex": "F", "height": 175},
"Li Si": {"sex": "M", "height": 165}
}
While lookup is O(1) using a hash table, forward indexes are inefficient for attribute-based searches over massive datasets.
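In Go, a forward index like the one above is naturally a map keyed by the identifier. The field names below mirror the example; the `tallerThan` helper is illustrative, showing why attribute searches are the weak point:

```go
package main

import "fmt"

// Person is the record stored under each key of the forward index.
type Person struct {
	Sex    string
	Height int
}

// forward is the forward index: identifier → record, O(1) lookup by key.
var forward = map[string]Person{
	"Zhang San": {Sex: "F", Height: 175},
	"Li Si":     {Sex: "M", Height: 165},
}

// tallerThan shows the weakness: an attribute search must scan
// every entry, i.e. O(n) over the whole dataset.
func tallerThan(cm int) []string {
	var names []string
	for name, p := range forward {
		if p.Height > cm {
			names = append(names, name)
		}
	}
	return names
}

func main() {
	fmt.Println(forward["Zhang San"].Height) // prints 175 (O(1) key lookup)
	fmt.Println(tallerThan(170))             // prints [Zhang San]
}
```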
1.2 Inverted Index
Inverted indexes map terms (keywords) to the list of document IDs containing those terms, enabling fast retrieval even in large corpora. Example mapping for two documents:
{
"政采云": [1, 2],
"原创干货": [1],
"技术氛围": [2],
"互联网公司": [2]
}
Building an inverted index requires a tokenizer to extract terms from raw text.
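Building such a mapping is straightforward once documents are tokenized. A minimal sketch, indexing pre-split terms directly (the `addDoc` helper is illustrative, not from the article's code):

```go
package main

import "fmt"

// addDoc records each term of a document in the inverted index,
// appending the document ID to every term's posting list.
func addDoc(inverted map[string][]int, docId int, terms []string) {
	for _, t := range terms {
		inverted[t] = append(inverted[t], docId)
	}
}

func main() {
	inverted := make(map[string][]int) // term → IDs of documents containing it
	addDoc(inverted, 1, []string{"政采云", "原创干货"})
	addDoc(inverted, 2, []string{"政采云", "技术氛围", "互联网公司"})
	fmt.Println(inverted["政采云"]) // prints [1 2]
}
```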
1.3 Tokenizer
A tokenizer splits text into individual tokens for indexing and query processing. For the sentence "我好想去政采云有限公司工作" ("I would really like to work at Zhengcaiyun Co., Ltd."), the tokenizer might produce:
["我", "好", "想去", "政采", "云", "有限公司", "工作"]
Effective tokenization is crucial for languages like Chinese, where word boundaries are not explicit.
1.4 Stop Words
Stop words (e.g., "the", "is" in English or "的", "在" in Chinese) are filtered out because they add little semantic value and increase index size.
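Filtering is a simple set-membership check before terms reach the index. A minimal sketch (the stop-word list here is tiny and illustrative; production lists run to hundreds of entries):

```go
package main

import "fmt"

// stopWords is an illustrative list of low-value terms.
var stopWords = map[string]struct{}{
	"的": {}, "在": {}, "the": {}, "is": {},
}

// filterStopWords drops tokens that carry little semantic value,
// shrinking the index without hurting relevance.
func filterStopWords(tokens []string) []string {
	kept := make([]string, 0, len(tokens))
	for _, t := range tokens {
		if _, stop := stopWords[t]; !stop {
			kept = append(kept, t)
		}
	}
	return kept
}

func main() {
	fmt.Println(filterStopWords([]string{"我", "的", "工作"})) // prints [我 工作]
}
```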
1.5 Synonyms
Synonym handling expands query terms (e.g., mapping "k8s" to "kubernetes" or "开发" to "研发") to improve recall. This often relies on NLP techniques or curated synonym dictionaries.
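With a curated dictionary, query expansion is a lookup per term. A minimal sketch (the `synonyms` map and `expandQuery` helper are illustrative; in practice the dictionary would be loaded from data or produced by an NLP model):

```go
package main

import "fmt"

// synonyms maps a term to its alternate forms.
var synonyms = map[string][]string{
	"k8s": {"kubernetes"},
	"开发":  {"研发"},
}

// expandQuery returns the original terms plus any synonyms,
// trading a larger candidate set for better recall.
func expandQuery(terms []string) []string {
	out := make([]string, 0, len(terms))
	for _, t := range terms {
		out = append(out, t)
		out = append(out, synonyms[t]...)
	}
	return out
}

func main() {
	fmt.Println(expandQuery([]string{"k8s", "部署"})) // prints [k8s kubernetes 部署]
}
```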
1.6 Document Ranking Algorithms
After retrieval, results are ranked using heuristics such as term frequency, title relevance, recency, and more sophisticated machine‑learning models.
1.7 Associative Search (NLP)
Modern engines provide query suggestions based on user history, location, and language context, a complex task that still suffers from occasional errors.
1.8 Large Models and Search Engines
Large language models enhance understanding of ambiguous queries but cannot replace traditional search engines due to higher latency, cost, and potential hallucinations. Search engines remain essential for real‑time, trustworthy results.
2. Implementing a Simple Search Engine
The minimal engine supports two core functions: index creation and query execution. The implementation stores indexes in memory for demonstration purposes.
2.1 Data Types
type Doc struct {
	Title       string    `json:"title,omitempty"`
	Type        string    `json:"type,omitempty"`
	Source      string    `json:"source,omitempty"`
	Url         string    `json:"url,omitempty"`
	Description string    `json:"description,omitempty"`
	CreatedAt   time.Time `json:"createdAt,omitempty"`
	UpdatedAt   time.Time `json:"updatedAt,omitempty"`
}
2.2 Interface Design
Index interface for storing term‑to‑doc mappings.
Metadata interface for persisting document metadata.
Count interface for tracking term frequencies used in ranking.
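The article does not show these interface bodies. A plausible minimal shape, inferred from how `CreateIndex` and `Search` use them below (the signatures are assumptions, and the original definitions may differ), together with a throwaway in-memory implementation to show the contract:

```go
package main

import "fmt"

// Doc is abbreviated from section 2.1.
type Doc struct {
	Title string
}

// Index stores the term-to-document mapping.
type Index interface {
	Set(docId, word string)           // add docId to word's posting list
	Get(word string) ([]string, bool) // docIds whose text contains word
}

// Metadata persists document metadata by ID.
type Metadata interface {
	Set(docId string, doc *Doc) error
	Get(docId string) (*Doc, error)
}

// Count tracks term occurrences for ranking.
type Count interface {
	Set(docId, word string)
}

// memIndex is a minimal in-memory Index implementation.
type memIndex map[string][]string

func (m memIndex) Set(docId, word string) { m[word] = append(m[word], docId) }
func (m memIndex) Get(word string) ([]string, bool) {
	docs, ok := m[word]
	return docs, ok
}

func main() {
	var idx Index = memIndex{}
	idx.Set("doc1", "政采云")
	docs, ok := idx.Get("政采云")
	fmt.Println(docs, ok) // prints [doc1] true
}
```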
2.3 Set Data Structure
type Empty struct{}
type Set struct {
sync.RWMutex
Store map[string]Empty
}
func NewSet() *Set { return &Set{Store: make(map[string]Empty)} }
func (s Set) List() []string { /* ... */ }
func (s Set) Set(key string) { /* ... */ }
func (s Set) Del(key string) { /* ... */ }2.4 Index Creation
func (s Store) CreateIndex(text, docId string, doc *Doc) {
	// Tokenize, de-duplicate, then record each term for this document.
	seg := removeDuplicate(s.Seg.Cut(text, true))
	for _, v := range seg {
		word := utils.ToLower(v)
		s.Index.Set(docId, word)
		s.Count.Set(docId, word)
	}
	if err := s.Metadata.Set(docId, doc); err != nil {
		log.Println("[ERROR] Update metadata failed:", err.Error())
	}
}
2.5 Search and Ranking
func (s Store) Search(text string) []string {
	var result []string
	seg := s.Seg.Cut(text, true)
	for _, v := range seg {
		key := utils.ToLower(v)
		if docs, ok := s.Index.Get(key); ok {
			result = append(result, docs...)
		}
	}
	return result
}

// SortSearch ranks the raw hits: a document matched by more query terms
// gets a higher weight. (The original listing elides the body; this is
// a minimal sketch.)
func (s Store) SortSearch(text string) []string {
	weight := make(map[string]int)
	for _, docId := range s.Search(text) {
		weight[docId]++
	}
	result := make([]string, 0, len(weight))
	for docId := range weight {
		result = append(result, docId)
	}
	sort.Slice(result, func(i, j int) bool {
		return weight[result[i]] > weight[result[j]]
	})
	return result
}
2.6 Performance Test
func main() {
	s := store.NewStoreByMEM()
	s.CreateIndex("政采云技术团队, 是一个富有活力、创造力的团队, 持续保持稳定的原创更新, 欢迎点赞关注!", "111111", &pkg.Doc{Title: "政采云技术团队", Description: "政采云技术团队, 是一个富有活力、创造力的团队", Url: "https://juejin.cn/user/703323667190104", Source: "原创文章"})
	now := time.Now()
	fmt.Println("keyword: 政采云 matched docs:", s.Search("政采云"))
	fmt.Println("keyword: 原创 matched docs:", s.Search("原创"))
	fmt.Println("keyword: 创造力 matched docs:", s.Search("创造力"))
	fmt.Printf("search took: %s\n", time.Since(now))
}
// Output shows sub-millisecond query latency.
The example demonstrates a single-node, in-memory search engine with basic indexing, tokenization, and ranking capabilities. While it lacks distributed storage and fault tolerance, it illustrates core concepts that can be extended to production-grade systems.
3. Conclusion
Although the basic principles of a search engine appear simple, real‑world deployments must handle massive data volumes, high concurrency, and stringent latency and relevance requirements, making search a complex engineering challenge.
About the ZCY Technology Team
ZCY Technology Team (Zero), based in Hangzhou, is a growth-oriented team passionate about technology and craftsmanship. With around 500 members, we are building comprehensive engineering, project management, and talent development systems. We are committed to innovation and creating a cloud service ecosystem for government and enterprise procurement. We look forward to your joining us.