Unlocking High‑Performance Chinese Segmentation: Inside Go’s gse Library
This article examines the source code of gse, a high‑performance Go segmentation library, covering its Double‑Array Trie dictionary, frequency‑based shortest‑path dynamic programming, and HMM‑Viterbi new‑word discovery, and demonstrates practical usage for Chinese tokenization, part‑of‑speech tagging, keyword extraction, and custom dictionary management.
Core Algorithm Architecture
The gse library achieves high performance through three key components:
Double‑Array Trie (DAT) – used for compact dictionary storage and fast lookup.
Shortest‑Path Algorithm based on word frequency – resolves ambiguous segmentation by finding the minimum‑cost path.
Hidden Markov Model (HMM) with Viterbi decoding – discovers out‑of‑vocabulary words.
1. Dictionary Storage: Double‑Array Trie
In dictionary.go, gse builds a DAT using the Go implementation of the C++ cedar library:
// Dictionary struct implements a string double array trie.
type Dictionary struct {
trie *cedar.Cedar // Cedar double array trie
// ...
}
Compared with a traditional map[string]int, a DAT dramatically reduces memory usage and keeps lookup time O(n) in the token length, independent of dictionary size.
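To make the prefix‑lookup idea concrete, here is a toy pointer‑based trie. This is not gse's cedar implementation (cedar packs the same structure into two flat integer arrays, base and check, for cache efficiency); it is only a sketch of the query a segmenter issues at each character position.

```go
package main

import "fmt"

// Toy trie node; a real DAT replaces the pointer-chasing children map
// with two flat arrays (base/check), which is what makes it compact.
type node struct {
	children map[rune]*node
	isWord   bool
}

func newNode() *node { return &node{children: map[rune]*node{}} }

// insert adds one dictionary word to the trie.
func insert(root *node, word string) {
	n := root
	for _, r := range word {
		child, ok := n.children[r]
		if !ok {
			child = newNode()
			n.children[r] = child
		}
		n = child
	}
	n.isWord = true
}

// prefixMatches walks the trie once over text and collects every
// dictionary word that starts at position 0 — the core lookup a
// segmenter performs at each character position, in O(len(text)).
func prefixMatches(root *node, text []rune) []string {
	var out []string
	n := root
	for i, r := range text {
		child, ok := n.children[r]
		if !ok {
			break
		}
		n = child
		if n.isWord {
			out = append(out, string(text[:i+1]))
		}
	}
	return out
}

func main() {
	root := newNode()
	for _, w := range []string{"中", "中国", "中国人"} {
		insert(root, w)
	}
	fmt.Println(prefixMatches(root, []rune("中国人民"))) // [中 中国 中国人]
}
```

A single left‑to‑right walk yields all candidate words at a position, which is exactly what the dynamic program in the next section consumes.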
2. Segmentation Core: Shortest‑Path & Dynamic Programming
The main segmentation logic resides in segmenter.go. The sentence is modeled as a Directed Acyclic Graph (DAG) where characters are nodes and possible words are edges. The algorithm searches for the shortest path, where edge weight is derived from the inverse of word frequency (or negative log probability).
func (seg *Segmenter) segmentWords(text []Text, searchMode bool) []Segment {
// jumpers records the shortest‑path info for each position
jumpers := make([]jumper, len(text))
for current := 0; current < len(text); current++ {
// ... (code omitted)
// Find all possible tokens starting at current position
numTokens := seg.Dict.LookupTokens(tx, tokens)
// Dynamic programming: update jumpers
for iToken := 0; iToken < numTokens; iToken++ {
location := current + len(tokens[iToken].text) - 1
// updateJumper keeps the minimal distance
updateJumper(&jumpers[location], baseDistance, tokens[iToken])
}
}
// ... backtrack jumpers to obtain the optimal segmentation
}
The jumpers array stores the minimal cost to reach each character position. The algorithm iterates over the text, uses the DAT for fast token lookup, updates the costs, and finally backtracks to produce the optimal segmentation.
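The same idea can be sketched as a self‑contained dynamic program over a toy frequency dictionary. This is an illustration of the technique, not gse's actual code: edge weights are taken as negative log relative frequency, and the inner loop probes every span directly (O(n²)) where gse would enumerate prefixes via the DAT.

```go
package main

import (
	"fmt"
	"math"
)

// segmentDP sketches the jumper-style dynamic program: dist[j] is the
// minimal cost of segmenting text[:j]; each dictionary word spanning
// text[i:j] is an edge with weight -log(freq/total).
func segmentDP(text []rune, freq map[string]float64) []string {
	total := 0.0
	for _, f := range freq {
		total += f
	}
	n := len(text)
	const inf = math.MaxFloat64
	dist := make([]float64, n+1)
	prev := make([]int, n+1) // prev[j] = start index of the word ending at j
	for i := 1; i <= n; i++ {
		dist[i] = inf
	}
	for i := 0; i < n; i++ {
		if dist[i] == inf {
			continue
		}
		for j := i + 1; j <= n; j++ {
			w := string(text[i:j])
			f, ok := freq[w]
			if !ok {
				if j != i+1 {
					continue // unknown multi-char span: not an edge
				}
				f = 1 // a single character always forms a (costly) edge
			}
			if cost := dist[i] - math.Log(f/total); cost < dist[j] {
				dist[j] = cost
				prev[j] = i
			}
		}
	}
	// backtrack from the end of the text to recover the word sequence
	var out []string
	for j := n; j > 0; j = prev[j] {
		out = append([]string{string(text[prev[j]:j])}, out...)
	}
	return out
}

func main() {
	freq := map[string]float64{
		"研究": 50, "生命": 40, "研究生": 20, "命": 5, "起源": 30, "的": 100,
	}
	fmt.Println(segmentDP([]rune("研究生命的起源"), freq)) // [研究 生命 的 起源]
}
```

The classic ambiguity 研究生/命 vs 研究/生命 is resolved by cost: the frequency product 50×40 beats 20×5, so the lower‑cost path wins.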
3. New‑Word Discovery: HMM & Viterbi
For words not present in the dictionary, gse falls back to an HMM model implemented in the hmm package. The Viterbi algorithm computes the most probable hidden‑state sequence from pre‑trained emission (probEmit) and transition (probTrans) probabilities.
func Viterbi(obs []rune, states []byte) (float64, []byte) {
// ... initialization
for t := 1; t < len(obs); t++ {
// compute max probability for each state
for _, y := range states {
// ... prob0 = vtb[t-1][y0] + transP + emP
}
}
// ... backtrack the optimal path
}
When the DAG segmentation leaves isolated single characters unjoined, or when higher recall is needed, the HMM module is invoked to recognize personal names, place names, and other out‑of‑vocabulary terms.
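A minimal self‑contained Viterbi over the B/M/E/S tag set (word Begin / Middle / End / Single, the jieba‑style convention) illustrates the mechanics. All probabilities below are invented for the example, not taken from gse's trained model; like the real probEmit/probTrans tables, they are stored as log values.

```go
package main

import "fmt"

const minLog = -3.14e100 // stand-in for log(0)

var states = []byte{'B', 'M', 'E', 'S'}

// Invented log probabilities, for illustration only.
var startP = map[byte]float64{'B': -0.7, 'S': -0.7}
var transP = map[byte]map[byte]float64{
	'B': {'M': -1.8, 'E': -0.2},
	'M': {'M': -1.2, 'E': -0.3},
	'E': {'B': -0.7, 'S': -0.7},
	'S': {'B': -0.7, 'S': -0.7},
}
var emitP = map[byte]map[rune]float64{
	'B': {'我': -3, '爱': -3, '北': -0.2},
	'M': {'京': -3},
	'E': {'京': -0.2},
	'S': {'我': -0.2, '爱': -0.3, '北': -4},
}

func get(m map[byte]float64, k byte) float64 {
	if v, ok := m[k]; ok {
		return v
	}
	return minLog
}

func getEmit(s byte, r rune) float64 {
	if v, ok := emitP[s][r]; ok {
		return v
	}
	return -20 // smooth unseen emissions instead of log(0)
}

// viterbi returns the most probable tag sequence for obs.
func viterbi(obs []rune) []byte {
	type cell struct {
		score float64
		prev  byte
	}
	v := make([]map[byte]cell, len(obs))
	v[0] = map[byte]cell{}
	for _, s := range states {
		v[0][s] = cell{get(startP, s) + getEmit(s, obs[0]), 0}
	}
	for t := 1; t < len(obs); t++ {
		v[t] = map[byte]cell{}
		for _, s := range states {
			best := cell{2 * minLog, states[0]}
			for _, p := range states {
				if sc := v[t-1][p].score + get(transP[p], s); sc > best.score {
					best = cell{sc, p}
				}
			}
			best.score += getEmit(s, obs[t])
			v[t][s] = best
		}
	}
	// pick the best final state, then follow the back-pointers
	last, bestScore := states[0], 2*minLog
	for _, s := range states {
		if v[len(obs)-1][s].score > bestScore {
			last, bestScore = s, v[len(obs)-1][s].score
		}
	}
	tags := make([]byte, len(obs))
	for t := len(obs) - 1; t >= 0; t-- {
		tags[t] = last
		last = v[t][last].prev
	}
	return tags
}

// cut converts tags back into words: each E or S closes a word.
func cut(text []rune, tags []byte) []string {
	var out []string
	start := 0
	for i, t := range tags {
		if t == 'E' || t == 'S' {
			out = append(out, string(text[start:i+1]))
			start = i + 1
		}
	}
	return out
}

func main() {
	text := []rune("我爱北京")
	tags := viterbi(text)
	fmt.Println(string(tags), cut(text, tags)) // SSBE [我 爱 北京]
}
```

The decoded tag sequence SSBE directly yields the word boundaries, which is how an HMM can propose words the dictionary has never seen.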
Practical Application Guide
Scenario 1: Basic Segmentation & Search‑Engine Mode
package main
import (
"fmt"
"github.com/go-ego/gse"
)
func main() {
var seg gse.Segmenter
seg.LoadDict()
text := "《复仇者联盟3:无限战争》是全片使用IMAX摄影机拍摄制作的科幻片"
// Precise mode (suitable for text analysis)
precise := seg.Cut(text, true)
fmt.Println("Precise mode:", precise)
// Search mode (higher recall for indexing)
search := seg.CutSearch(text, true)
fmt.Println("Search mode:", search)
}
Scenario 2: Dynamically Maintaining the Dictionary
func addCustomToken(seg *gse.Segmenter) {
// Dynamically add a new term with frequency and POS tag
seg.AddToken("西雅图太空针", 100, "n")
text := "西雅图太空针是地标建筑"
fmt.Println(seg.Cut(text, true)) // The new term is recognized as a single token
}
Scenario 3: Part‑of‑Speech Tagging & Keyword Extraction
func posTagging(seg *gse.Segmenter) {
text := "西雅图地标建筑"
pos := seg.Pos(text, true) // Enable POS tagging
fmt.Println("POS tagging:", pos) // Example output: [{西雅图 ns} {地标 n} {建筑 n}]
}
Scenario 4: E‑Commerce Search for Electronics
Goal: extract brand, feature, category, and specifications from queries such as “华为5G手机8GB内存”.
Build industry‑specific dictionaries (e.g., dict_brand.txt, dict_category.txt, dict_spec.txt).
Load them with seg.AddToken and assign custom POS tags.
package main
import (
"fmt"
"github.com/go-ego/gse"
"strings"
)
func main() {
var seg gse.Segmenter
seg.LoadDict()
// Simulate loading industry dictionaries
seg.AddToken("华为", 1000, "brand")
seg.AddToken("5G", 1000, "spec")
seg.AddToken("手机", 1000, "category")
seg.AddToken("8GB", 1000, "spec")
seg.AddToken("内存", 1000, "n")
text := "华为5G手机8GB内存"
pos := seg.Pos(text, true)
var brand, category string
var specs []string
for _, p := range pos {
word := p.Text
tag := p.Pos
switch tag {
case "brand":
brand = word
case "category":
category = word
case "spec":
specs = append(specs, word)
}
}
fmt.Printf("Parsed result:\nbrand: %s\ncategory: %s\nspecs: %s\n",
brand, category, strings.Join(specs, ", "))
}
This approach quickly builds a vertical‑domain query parser, greatly improving search accuracy and user experience.
Performance and Optimization Recommendations
Official benchmarks show impressive speeds:
Single‑threaded: 9.2 MB/s
Concurrent Goroutines: 26.8 MB/s
Singleton Pattern – A Segmenter instance is large (it holds the entire trie) and costly to initialize. In web services, create one global instance at startup (or inject it via the request context) and reuse it; do not reload the dictionary on every request.
Concurrency Safety – Segmentation calls only read the loaded dictionary, so a fully initialized Segmenter can serve multiple goroutines without extra locking; avoid mutating the dictionary (e.g. via AddToken) while other goroutines are segmenting.
Dictionary Pruning – For memory‑constrained environments, trim the dictionary to the needed domain or use LoadDictEmbed to load a compact embedded dictionary.
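The singleton recommendation can be sketched with sync.Once. A placeholder segmenter type stands in for *gse.Segmenter so the example stays self‑contained; in a real service the once.Do body would construct a gse.Segmenter and call seg.LoadDict().

```go
package main

import (
	"fmt"
	"sync"
)

// Stand-in for a loaded segmenter; in a real service this would be a
// *gse.Segmenter whose dictionary is loaded once with seg.LoadDict().
type segmenter struct{ ready bool }

var (
	once     sync.Once
	instance *segmenter
)

// sharedSegmenter builds the singleton exactly once, no matter how
// many goroutines race to obtain it; later callers get the same pointer.
func sharedSegmenter() *segmenter {
	once.Do(func() {
		// The expensive dictionary load happens here, exactly once.
		instance = &segmenter{ready: true}
	})
	return instance
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = sharedSegmenter() // all goroutines share one instance
		}()
	}
	wg.Wait()
	fmt.Println(sharedSegmenter().ready) // true
}
```

Because read‑only segmentation is safe to share, this one lazily initialized instance can serve every request handler in the process.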
Conclusion and Outlook
The gse library, with its efficient DAT, Viterbi, and HMM implementations, stands out as a leading Go‑based NLP solution. Whether building search engines, recommendation systems, or performing text mining, gse provides a reliable foundation. By dissecting its source code, developers gain both practical usage skills and deeper insight into high‑performance text‑processing system design.
Code Wrench
Focuses on code debugging, performance optimization, and real-world engineering, sharing efficient development tips and pitfall guides. We break down technical challenges in a down-to-earth style, helping you craft handy tools so every line of code becomes a problem‑solving weapon. 🔧💻