Build a Simple Bayesian Spell Checker in 21 Lines of Python

This article explains how to create a functional spell‑checking tool using a 21‑line Python script that leverages Bayesian probability and a large word‑frequency corpus to suggest correct spellings for misspelled inputs.

MaGe Linux Operations
MaGe Linux Operations
MaGe Linux Operations
Build a Simple Bayesian Spell Checker in 21 Lines of Python

Introduction

When you type a misspelled word into Google or Baidu, the engine instantly suggests the correct spelling. This article shows how to implement a similar spell‑checker in Python with only 21 lines of code.

Code

import re, collections

def words(text): return re.findall('[a-z]+', text.lower())

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

NWORDS = train(words(open('big.txt').read()))
alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    splits = [(word[:i], word[i:]) for i in range(len(word)+1)]
    deletes = [a + b[1:] for a,b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a,b in splits if len(b)>1]
    replaces = [a + c + b[1:] for a,b in splits for c in alphabet if b]
    inserts = [a + c + b for a,b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

def known(words): return set(w for w in words if w in NWORDS)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)

Underlying Theory

The algorithm is based on a Bayesian model: we choose the candidate c that maximizes P(c|w) ∝ P(w|c)·P(c). P(c) is estimated from word frequencies in a large corpus (big.txt). P(w|c) is approximated by the edit distance between c and the misspelled word w; edits of distance 1 are far more likely than distance 2, and correct words (distance 0) are most probable.

Code Analysis

The words function extracts lowercase alphabetic tokens from the corpus. train builds a frequency dictionary, defaulting unseen words to a count of 1 using collections.defaultdict. edits1 generates all strings one edit away (deletions, transpositions, replacements, insertions). known_edits2 extends this to two‑edit candidates that exist in the dictionary. Finally, correct selects the candidate with the highest probability according to the Bayesian model.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

NLPBayesiantext processingspell checking
MaGe Linux Operations
Written by

MaGe Linux Operations

Founded in 2009, MaGe Education is a top Chinese high‑end IT training brand. Its graduates earn 12K+ RMB salaries, and the school has trained tens of thousands of students. It offers high‑pay courses in Linux cloud operations, Python full‑stack, automation, data analysis, AI, and Go high‑concurrency architecture. Thanks to quality courses and a solid reputation, it has talent partnerships with numerous internet firms.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.