Unlocking High‑Performance Chinese Segmentation: Inside Go’s gse Library

This article deeply examines the source code of Go’s high‑performance segmentation library gse, revealing its Double‑Array Trie, shortest‑path dynamic programming, and HMM‑Viterbi implementations, and demonstrates practical usage for Chinese tokenization, part‑of‑speech tagging, keyword extraction, and custom dictionary management.

Code Wrench

Core Algorithm Architecture

The gse library achieves high performance through three key components:

Double‑Array Trie (DAT) – used for compact dictionary storage and fast lookup.

Shortest‑Path Algorithm based on word frequency – resolves ambiguous segmentation by finding the minimum‑cost path.

Hidden Markov Model (HMM) with Viterbi decoding – discovers out‑of‑vocabulary words.

1. Dictionary Storage: Double‑Array Trie

In dictionary.go, gse builds a DAT using the Go implementation of the C++ cedar library:

// Dictionary struct implements a string double array trie.
type Dictionary struct {
    trie *cedar.Cedar // Cedar double array trie
    // ...
}

Compared with a traditional map[string]int, a DAT dramatically reduces memory usage and keeps lookup O(k) in the length k of the query token, independent of dictionary size.
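To make the lookup pattern concrete, here is a toy, map-based stand-in for the prefix query the trie answers (the `lookupTokens` helper and the tiny dictionary are invented for this sketch; gse's real lookup walks the cedar trie instead of hashing every candidate prefix):

```go
package main

import "fmt"

// lookupTokens returns every dictionary word that starts at position
// pos in text — the same query the DAT answers in time proportional
// to the token length. A map is used here only for clarity.
func lookupTokens(text []rune, pos int, dict map[string]int) []string {
	var found []string
	for end := pos + 1; end <= len(text); end++ {
		if _, ok := dict[string(text[pos:end])]; ok {
			found = append(found, string(text[pos:end]))
		}
	}
	return found
}

func main() {
	dict := map[string]int{"西雅图": 100, "西": 10, "地标": 50, "建筑": 50, "地标建筑": 20}
	text := []rune("西雅图地标建筑")
	fmt.Println(lookupTokens(text, 0, dict)) // prints [西 西雅图]
	fmt.Println(lookupTokens(text, 3, dict)) // prints [地标 地标建筑]
}
```

Every word found this way becomes one edge of the sentence DAG described in the next section.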

2. Segmentation Core: Shortest‑Path & Dynamic Programming

The main segmentation logic resides in segmenter.go. The sentence is modeled as a Directed Acyclic Graph (DAG) where characters are nodes and possible words are edges. The algorithm searches for the shortest path, where edge weight is derived from the inverse of word frequency (or negative log probability).

func (seg *Segmenter) segmentWords(text []Text, searchMode bool) []Segment {
    // jumpers records the shortest‑path info for each position
    jumpers := make([]jumper, len(text))
    for current := 0; current < len(text); current++ {
        // ... (code omitted)
        // Find every dictionary token that starts at the current position
        numTokens := seg.Dict.LookupTokens(tx, tokens) // tx is the remaining text from current
        // Dynamic programming: update jumpers
        for iToken := 0; iToken < numTokens; iToken++ {
            location := current + len(tokens[iToken].text) - 1
            // updateJumper keeps the minimal distance
            updateJumper(&jumpers[location], baseDistance, tokens[iToken])
        }
    }
    // ... backtrack jumpers to obtain the optimal segmentation
}

The jumpers array stores the minimal cost to reach each character position. By iterating over the text, using DAT for fast token lookup, and updating costs, the algorithm finally backtracks to produce the optimal segmentation.
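The same idea can be shown as a self-contained sketch (not gse's actual code: the `segment` function, its fallback cost for unknown characters, and the toy frequencies are invented for illustration). Each candidate word is weighted by the negative log of its relative frequency, and the DP keeps the cheapest way to reach each position:

```go
package main

import (
	"fmt"
	"math"
)

// segment finds the minimum-cost split of text, where each word costs
// -log(frequency/total) — rare words cost more, so paths built from
// frequent words win. Unknown single characters get a large fallback
// cost so a valid path always exists.
func segment(text []rune, freq map[string]int) []string {
	total := 0
	for _, f := range freq {
		total += f
	}
	n := len(text)
	best := make([]float64, n+1) // best[i]: min cost to segment text[:i]
	prev := make([]int, n+1)     // prev[i]: start of the last word on that path
	for i := 1; i <= n; i++ {
		best[i] = math.Inf(1)
		for j := 0; j < i; j++ {
			w := string(text[j:i])
			cost := math.Log(float64(total)) + 1e3 // unknown-rune fallback
			if f, ok := freq[w]; ok {
				cost = -math.Log(float64(f) / float64(total))
			} else if i-j > 1 {
				continue // only single unknown characters are allowed
			}
			if best[j]+cost < best[i] {
				best[i] = best[j] + cost
				prev[i] = j
			}
		}
	}
	var words []string
	for i := n; i > 0; i = prev[i] { // backtrack from the end
		words = append([]string{string(text[prev[i]:i])}, words...)
	}
	return words
}

func main() {
	freq := map[string]int{"西雅图": 100, "地标": 80, "建筑": 80, "地标建筑": 5}
	fmt.Println(segment([]rune("西雅图地标建筑"), freq)) // prints [西雅图 地标 建筑]
}
```

Note how the frequent pair 地标 + 建筑 beats the rare compound 地标建筑, mirroring how gse resolves ambiguity by frequency.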

3. New‑Word Discovery: HMM & Viterbi

For words not present in the dictionary, gse falls back to an HMM implemented in the hmm package. The Viterbi algorithm computes the most probable state sequence using pre‑trained emission (probEmit) and transition (probTrans) probabilities.

func Viterbi(obs []rune, states []byte) (float64, []byte) {
    // ... initialization
    for t := 1; t < len(obs); t++ {
        // compute max probability for each state
        for _, y := range states {
            // ... prob0 = vtb[t-1][y0] + transP + emP
        }
    }
    // ... backtrack the optimal path
}

When the DAG search leaves runs of unmatched single characters, or when higher recall is needed, the HMM module is invoked to recognize person names, place names, and other out‑of‑vocabulary terms.

Practical Application Guide

Scenario 1: Basic Segmentation & Search‑Engine Mode

package main

import (
    "fmt"
    "github.com/go-ego/gse"
)

func main() {
    var seg gse.Segmenter
    seg.LoadDict()
    text := "《复仇者联盟3:无限战争》是全片使用IMAX摄影机拍摄制作的科幻片"
    // Precise mode (suitable for text analysis)
    precise := seg.Cut(text, true)
    fmt.Println("Precise mode:", precise)
    // Search mode (higher recall for indexing)
    search := seg.CutSearch(text, true)
    fmt.Println("Search mode:", search)
}

Scenario 2: Dynamically Maintaining the Dictionary

func addCustomToken(seg *gse.Segmenter) {
    // Dynamically add a new term with frequency and POS tag
    seg.AddToken("西雅图太空针", 100, "n")
    text := "西雅图太空针是地标建筑"
    fmt.Println(seg.Cut(text, true)) // The new term is recognized as a single token
}

Scenario 3: Part‑of‑Speech Tagging & Keyword Extraction

func posTagging(seg *gse.Segmenter) {
    text := "西雅图地标建筑"
    pos := seg.Pos(text, true) // Enable POS tagging
    fmt.Println("POS tagging:", pos) // Example output: [{西雅图 ns} {地标 n} {建筑 n}]
}

Scenario 4: E‑Commerce Search for Electronics

Goal: extract brand, category, and specification terms from queries such as “华为5G手机8GB内存”.

Build industry‑specific dictionaries (e.g., dict_brand.txt, dict_category.txt, dict_spec.txt).

Load them with seg.AddToken and assign custom POS tags.

package main

import (
    "fmt"
    "github.com/go-ego/gse"
    "strings"
)

func main() {
    var seg gse.Segmenter
    seg.LoadDict()
    // Simulate loading industry dictionaries
    seg.AddToken("华为", 1000, "brand")
    seg.AddToken("5G", 1000, "spec")
    seg.AddToken("手机", 1000, "category")
    seg.AddToken("8GB", 1000, "spec")
    seg.AddToken("内存", 1000, "n")

    text := "华为5G手机8GB内存"
    pos := seg.Pos(text, true)
    var brand, category string
    var specs []string
    for _, p := range pos {
        word := p.Text
        tag := p.Pos
        switch tag {
        case "brand":
            brand = word
        case "category":
            category = word
        case "spec":
            specs = append(specs, word)
        }
    }
    fmt.Printf("Parse result:\nBrand: %s\nCategory: %s\nSpecs: %s\n",
        brand, category, strings.Join(specs, ", "))
}
}

This approach quickly builds a vertical‑domain query parser, greatly improving search accuracy and user experience.

Performance and Optimization Recommendations

Official benchmarks show impressive speeds:

Single‑threaded: 9.2 MB/s

Concurrent Goroutines: 26.8 MB/s

Singleton Pattern – Segmenter is large (contains the Trie) and costly to initialize. In web services, create a global singleton or inject it into the request context; do not reload the dictionary on every request.

Concurrency Safety – The Segment method is read‑only and thread‑safe, so it can be called from multiple Goroutines without extra locking.

Dictionary Pruning – For memory‑constrained environments, trim the dictionary to the needed domain or use LoadDictEmbed to load a compact embedded dictionary.
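The singleton advice can be sketched with sync.Once. The `segmenter` stand-in type below is invented so the sketch stays dependency-free; in a real service you would wrap gse.Segmenter and its LoadDict call the same way:

```go
package main

import (
	"fmt"
	"sync"
)

// segmenter stands in for gse.Segmenter; loading the real dictionary
// is the expensive step this pattern amortizes.
type segmenter struct{ loaded bool }

func (s *segmenter) loadDict() { s.loaded = true } // expensive in real life

var (
	once     sync.Once
	instance *segmenter
)

// sharedSegmenter returns the process-wide segmenter, loading the
// dictionary exactly once even under concurrent first use.
func sharedSegmenter() *segmenter {
	once.Do(func() {
		instance = &segmenter{}
		instance.loadDict()
	})
	return instance
}

func main() {
	a, b := sharedSegmenter(), sharedSegmenter()
	fmt.Println(a == b, a.loaded) // prints: true true
}
```

Because segmentation itself is read-only, every request handler can safely call methods on the shared instance without additional locking.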

Conclusion and Outlook

The gse library, with its efficient DAT, Viterbi, and HMM implementations, stands out as a leading Go‑based NLP solution. Whether building search engines, recommendation systems, or performing text mining, gse provides a reliable foundation. By dissecting its source code, developers gain both practical usage skills and deeper insight into high‑performance text‑processing system design.
