Build a Simple Bayesian Spell Checker in 21 Lines of Python
This article explains how to create a functional spell‑checking tool using a 21‑line Python script that leverages Bayesian probability and a large word‑frequency corpus to suggest correct spellings for misspelled inputs.
Introduction
When you type a misspelled word into Google or Baidu, the engine instantly suggests the correct spelling. This article shows how to implement a similar spell‑checker in Python with only 21 lines of code.
Code
import re, collections
def words(text): return re.findall('[a-z]+', text.lower())
def train(features):
model = collections.defaultdict(lambda: 1)
for f in features:
model[f] += 1
return model
NWORDS = train(words(open('big.txt').read()))
alphabet = 'abcdefghijklmnopqrstuvwxyz'
def edits1(word):
splits = [(word[:i], word[i:]) for i in range(len(word)+1)]
deletes = [a + b[1:] for a,b in splits if b]
transposes = [a + b[1] + b[0] + b[2:] for a,b in splits if len(b)>1]
replaces = [a + c + b[1:] for a,b in splits for c in alphabet if b]
inserts = [a + c + b for a,b in splits for c in alphabet]
return set(deletes + transposes + replaces + inserts)
def known_edits2(word):
return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)
def known(words): return set(w for w in words if w in NWORDS)
def correct(word):
candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
return max(candidates, key=NWORDS.get)Underlying Theory
The algorithm is based on a Bayesian model: we choose the candidate c that maximizes P(c|w) ∝ P(w|c)·P(c). P(c) is estimated from word frequencies in a large corpus (big.txt). P(w|c) is approximated by the edit distance between c and the misspelled word w; edits of distance 1 are far more likely than distance 2, and correct words (distance 0) are most probable.
Code Analysis
The words function extracts lowercase alphabetic tokens from the corpus. train builds a frequency dictionary, defaulting unseen words to a count of 1 using collections.defaultdict. edits1 generates all strings one edit away (deletions, transpositions, replacements, insertions). known_edits2 extends this to two‑edit candidates that exist in the dictionary. Finally, correct selects the candidate with the highest probability according to the Bayesian model.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
MaGe Linux Operations
Founded in 2009, MaGe Education is a top Chinese high‑end IT training brand. Its graduates earn 12K+ RMB salaries, and the school has trained tens of thousands of students. It offers high‑pay courses in Linux cloud operations, Python full‑stack, automation, data analysis, AI, and Go high‑concurrency architecture. Thanks to quality courses and a solid reputation, it has talent partnerships with numerous internet firms.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
