Spelling Correction System for 58.com Search Engine: Rule‑Based and Statistical Methods
This article describes the design and implementation of a spelling‑correction module for 58.com’s search engine, covering common query errors, rule‑based and statistical language‑model approaches, offline dictionary generation, n‑gram and Viterbi decoding, online workflow, and practical examples.
In search engines, users expect high‑quality results related to their query terms, but low‑quality or erroneous queries often lead to poor recall or no results. The spelling‑correction module in 58.com’s main site search addresses this by correcting queries for categories such as real estate, recruitment, classifieds, and used cars, thereby recovering lost traffic and improving user experience.
Common Query Errors
Typical errors include pure pinyin input, misspelled or missing letters in English, and mixed pinyin‑Chinese strings. Errors arise from regional pronunciation differences (e.g., zh→z, ang→an, fei→hui), small mobile screens causing mis‑selection, and input methods like handwriting or voice.
Correction Schemes
Rule‑Based Method
The offline process builds two indexes using the full vocabulary of 58.com’s services and search‑log hot terms: a pinyin index and an edit‑distance index. The pinyin index handles full pinyin, abbreviated pinyin, incomplete pinyin, homophonic and fuzzy‑sound errors; the edit‑distance index handles character‑level misspellings and deletions. During online correction, the query (or its processed form) is looked up in these indexes; if a match is found, the query is corrected accordingly.
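As a minimal sketch of this rule-based lookup (the index names, the toy two-term vocabulary, and the per-term pinyin below are illustrative assumptions, not 58.com's actual data structures), the two indexes can be built offline and queried online like this:

```python
def deletion_variants(term):
    """All strings obtained by deleting exactly one character from term."""
    return {term[:i] + term[i + 1:] for i in range(len(term))}

def build_indexes(vocab):
    """vocab: iterable of (term, full_pinyin) pairs."""
    pinyin_index = {}  # full-pinyin string -> term
    edit_index = {}    # pinyin string or one-deletion variant -> candidate terms
    for term, pinyin in vocab:
        pinyin_index[pinyin] = term
        for variant in deletion_variants(pinyin) | {pinyin}:
            edit_index.setdefault(variant, set()).add(term)
    return pinyin_index, edit_index

pinyin_index, edit_index = build_indexes(
    [("二手电脑", "ershoudiannao"), ("租房", "zufang")])

def correct(query):
    """Exact pinyin hit first, then one-edit lookup via shared deletions."""
    if query in pinyin_index:
        return pinyin_index[query]
    hits = set()
    for variant in deletion_variants(query) | {query}:
        hits |= edit_index.get(variant, set())
    return min(hits) if hits else None  # deterministic pick for the sketch
```

Storing deletion variants rather than all possible insertions keeps the index small while still matching queries that are one edit away from a known term.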
Statistical Language‑Model Method
For long queries, an n‑gram language model (3‑gram) is trained with SRILM on the same corpus. The model evaluates the probability of candidate word sequences, and the Viterbi algorithm selects the highest‑probability path. The workflow includes tokenizing the query, generating candidate corrections via pinyin or edit‑distance methods, segmenting candidates, and scoring them with the n‑gram model.
Offline Dictionary Generation
Two dictionaries are created:
Pinyin Dictionary: Generates four types of pinyin-based index keys for each term (full pinyin, initials, fuzzy-sound variants, and last-character initials).
Edit-Distance Dictionary: Stores the substrings obtained by deleting each character (or each pinyin letter) from a term, together with the deletion position, to support character-level corrections.
Example entries illustrate how variations such as "ershoudiannao" (二手电脑) map to multiple misspelled forms.
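As a small illustration of the four pinyin index-key types for 二手电脑 (the syllable segmentation is hardcoded here, and the FUZZY table is an invented subset of real fuzzy-sound rules; a production system would use a pinyin dictionary):

```python
# A small illustrative subset of fuzzy-sound rules (regional variants).
FUZZY = {"zh": "z", "ch": "c", "sh": "s", "ang": "an", "eng": "en"}

def fuzzy_variants(syllables):
    """Apply one fuzzy-sound substitution at a time, e.g. shou -> sou."""
    variants = set()
    for i, syl in enumerate(syllables):
        for wrong, right in FUZZY.items():
            if wrong in syl:
                changed = syllables[:i] + [syl.replace(wrong, right, 1)] + syllables[i + 1:]
                variants.add("".join(changed))
    return variants

def pinyin_keys(syllables):
    """The four index-key types built for each term's pinyin."""
    return {
        "full": "".join(syllables),                               # full pinyin
        "initials": "".join(s[0] for s in syllables),             # abbreviated
        "fuzzy": fuzzy_variants(syllables),                       # fuzzy sound
        "last_initial": "".join(syllables[:-1]) + syllables[-1][0],  # last char abbreviated
    }

keys = pinyin_keys(["er", "shou", "dian", "nao"])  # 二手电脑
```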
n‑gram Model
The 3-gram model assumes each word depends only on the two preceding words, drastically reducing computation. The probability of a candidate word sequence is the product of the conditional probabilities P(word_i | word_{i-2}, word_{i-1}); in practice the log-probabilities are summed. The model is used to rank candidate corrections generated from segmented queries.
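A toy scorer along these lines (the trigram table and its probabilities are invented for illustration; a real system would load the SRILM-trained model, with smoothing and backoff, instead):

```python
import math

# Invented trigram probabilities for two candidate segmentations.
TRIGRAM = {
    ("<s>", "<s>", "复式"): 0.4, ("<s>", "复式", "办公"): 0.5,
    ("复式", "办公", "公寓"): 0.6,
    ("<s>", "<s>", "复试"): 0.1, ("<s>", "复试", "办公"): 0.05,
    ("复试", "办公", "公寓"): 0.05,
}

def score(words, floor=1e-6):
    """Sum of log P(w_i | w_{i-2}, w_{i-1}) with sentence-start padding."""
    padded = ["<s>", "<s>"] + list(words)
    return sum(math.log(TRIGRAM.get(tuple(padded[i - 2:i + 1]), floor))
               for i in range(2, len(padded)))

# Rank two candidate corrections for the query 复试办公公寓.
candidates = [["复式", "办公", "公寓"], ["复试", "办公", "公寓"]]
best = max(candidates, key=score)
```

Unseen trigrams fall back to a small floor probability here, a crude stand-in for the backoff smoothing a trained model would provide.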
Viterbi Decoding
In the 2‑gram case, each candidate’s best path depends on the previous position’s best candidates. In the 3‑gram case, each candidate must consider the two previous positions, storing multiple paths and their probabilities. The final correction is the path with the maximum total probability.
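A sketch of the 2-gram case (the 3-gram version keeps the best path per pair of previous candidates instead of per single candidate); the candidate lattice and bigram probabilities here are invented for illustration:

```python
import math

def viterbi(lattice, bigram, floor=1e-6):
    """lattice: per-position candidate lists; returns the max-probability path."""
    def lp(prev, cur):
        return math.log(bigram.get((prev, cur), floor))
    # best[c] = (log-prob of the best path ending in candidate c, that path)
    best = {c: (lp("<s>", c), [c]) for c in lattice[0]}
    for position in lattice[1:]:
        new_best = {}
        for cand in position:
            # Keep only the single best predecessor per candidate.
            prev = max(best, key=lambda p: best[p][0] + lp(p, cand))
            new_best[cand] = (best[prev][0] + lp(prev, cand),
                              best[prev][1] + [cand])
        best = new_best
    return max(best.values(), key=lambda entry: entry[0])[1]

# Two homophone candidates for "linshi", one for "gong".
BIGRAM = {("<s>", "临时"): 0.3, ("临时", "工"): 0.5,
          ("<s>", "零食"): 0.2, ("零食", "工"): 0.01}
path = viterbi([["临时", "零食"], ["工"]], BIGRAM)
```

Pruning to the best path per candidate at each position is what keeps decoding linear in query length rather than exponential in the number of candidates.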
Online Correction Flow
The online process proceeds as follows:
1. Parse the query to determine whether it is pure pinyin, mixed, or Chinese.
2. If pure pinyin, apply pinyin correction; if corrected, return the result.
3. If not corrected, apply edit-distance correction; if corrected, return the result.
4. If still uncorrected, perform long-pinyin correction and return any result.
5. For queries containing Chinese characters, first apply pinyin correction on the whole string; if unsuccessful, segment the query and apply the language-model method.
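The dispatch above can be sketched as control flow, with stub callables standing in for the real pinyin, edit-distance, long-pinyin, and n-gram modules (all function names here are illustrative):

```python
import re

def is_pure_latin(query):
    """Pure pinyin/English query (digits allowed, e.g. iphoni4)."""
    return bool(re.fullmatch(r"[A-Za-z0-9]+", query)) and not query.isdigit()

def has_chinese(query):
    return bool(re.search(r"[\u4e00-\u9fff]", query))

def correct_online(query, fix_pinyin, fix_edit, fix_long_pinyin, fix_lm):
    if is_pure_latin(query):
        # Pure pinyin: pinyin -> edit-distance -> long-pinyin, first hit wins.
        for fix in (fix_pinyin, fix_edit, fix_long_pinyin):
            result = fix(query)
            if result is not None:
                return result
        return None
    if has_chinese(query):
        # Contains Chinese: whole-string pinyin first, then segment + n-gram.
        return fix_pinyin(query) or fix_lm(query)
    return None
```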
Examples of Correction Scenarios
| Erroneous Query | Corrected Query | Method | Level |
| --- | --- | --- | --- |
| Shuianhuating | 水岸华庭 | Pinyin correction (no Chinese characters) | Forced correction without prompt |
| ATLS | 奥特莱斯 | Pinyin correction (no Chinese characters) | Forced correction with prompt |
| linshiG | 临时工 | Pinyin correction (no Chinese characters) | Forced correction with prompt |
| iphoni4 | iphone4 | Edit-distance correction after failed pinyin correction | Forced correction with prompt |
| shijingshanxiaochaoshi | 石景山小超市 | Long-pinyin correction after failed pinyin and edit-distance correction | Forced correction with prompt |
| 途an | 途安 | Pinyin correction (contains Chinese) | Forced correction with prompt |
| 天津医科大学总医院zufang | 天津医科大学总医院租房 | Mixed pinyin + n-gram correction | Forced correction without prompt |
| 复试办公公寓 | 复式办公公寓 | n-gram correction after failed pinyin correction | Suggested correction with prompt |
Conclusion
The module demonstrates how rule‑based and statistical methods can correct most user input errors in the 58.com search scenario, though there remains room for improvement in dictionary construction, candidate generation, and evaluation. Emerging techniques such as CRF‑based error detection, deep‑wide hybrid models, and semantic‑based correction (e.g., Tencent’s approach) represent future directions.
58 Tech
Official tech channel of 58, a platform for tech innovation, sharing, and communication.