How Gecko Detects Web Page Encodings: Inside the State Machine and Distribution Algorithms
This article explains Gecko's multi‑layer encoding detection, covering its state‑machine approach, character‑distribution analysis, 2‑byte sequence modeling, the overall detection workflow, and the table‑based conversion process that maps all encodings to Unicode.
1. Introduction
Gecko is the layout engine behind applications such as the Netscape and Firefox browsers and the Thunderbird mail client. This article focuses on how Gecko recognises the character encoding of web documents from around the world and converts them to Unicode.
2. Encoding Detection Algorithms
When a page lacks a declared charset, Gecko guesses the most likely encoding by analysing the byte stream. Three main techniques are used:
Coding scheme (state‑machine) method – Gecko runs one state machine per candidate encoding, all in parallel over the same bytes. Each machine reports one of three key states: eStart (a complete, legal character has just been read), eItsMe (a byte sequence unique to this encoding was seen), or eError (an illegal sequence rules this encoding out).
Character distribution method – in East Asian multibyte encodings, a small set of characters accounts for most running text. Gecko counts mTotalChars (the characters analysed) and mFreqChars (those among the 512 most frequent) and compares the resulting distribution ratio with the typical ratio for each encoding.
2‑byte sequence method – for single‑byte encodings, Gecko uses a 256×256 matrix of 2‑character sequences, classifies each sequence into one of four categories (Negative, Low, Medium, Positive), and tallies the results in PRUint32 mSeqCounters[NUMBER_OF_SEQ_CAT] to compute confidence.
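To make the coding scheme method concrete, here is a minimal sketch of one such state machine. It assumes a toy 7-bit scheme in which the escape sequence ESC $ B acts as the unique signature. The class name ToyEscapeSchemeSM, the helper RunMachine, and the two intermediate states are hypothetical and are not taken from Gecko's source.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// eStart, eItsMe and eError are the states named in the article. The two
// intermediate states are hypothetical additions for this toy scheme.
enum SMState { eStart, eItsMe, eError, eSeenEsc, eSeenEscDollar };

// Toy 7-bit scheme (an assumption for illustration): any byte >= 0x80 is
// illegal, and the sequence ESC '$' 'B' identifies the encoding for certain.
class ToyEscapeSchemeSM {
 public:
  SMState NextState(std::uint8_t byte) {
    // eItsMe and eError are terminal: further input cannot change them.
    if (mState == eItsMe || mState == eError) return mState;
    if (byte >= 0x80) {
      mState = eError;
      return mState;
    }
    switch (mState) {
      case eStart:
        mState = (byte == 0x1B) ? eSeenEsc : eStart;
        break;
      case eSeenEsc:
        mState = (byte == '$') ? eSeenEscDollar : eError;
        break;
      case eSeenEscDollar:
        mState = (byte == 'B') ? eItsMe : eError;
        break;
      default:
        break;
    }
    return mState;
  }

  SMState State() const { return mState; }

 private:
  SMState mState = eStart;
};

// Feeds a buffer byte by byte and stops at the first terminal state.
inline SMState RunMachine(ToyEscapeSchemeSM& sm, const std::uint8_t* buf,
                          std::size_t len) {
  assert(buf != nullptr || len == 0);
  SMState state = sm.State();
  for (std::size_t i = 0; i < len; ++i) {
    state = sm.NextState(buf[i]);
    if (state == eItsMe || state == eError) break;
  }
  return state;
}
```

In a full detector, one machine of this kind would exist per candidate encoding, and every incoming byte would be offered to all machines that have not yet reached eError.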
Confidence for multibyte detection:
float confidence = mFreqChars / ((mTotalChars - mFreqChars) * mTypicalDistributionRatio);
Confidence for single‑byte detection (when NEGATIVE_APPROACH is defined):
((float)(mTotalSeqs - mSeqCounters[NEGATIVE_CAT]*10))/mTotalSeqs * mFreqChar / mTotalChar;
Otherwise:
((float)1.0) * mSeqCounters[POSITIVE_CAT] / mTotalSeqs / mModel->mTypicalPositiveRatio * mFreqChar / mTotalChar;
3. Detection Flow
nsUniversalDetector first checks for a BOM. If there is none, it feeds the data to every active detector. As soon as any detector reaches the eItsMe state, detection stops and that detector's encoding is returned. If no detector exits early, DataEnd compares the detectors' confidence values and selects the encoding with the highest one.
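The flow described above can be sketched as follows. The Prober interface, the FixedProber stub, and the Detect function are illustrative assumptions rather than Gecko's actual classes, and the BOM check is abbreviated to UTF-8 and UTF-16 only.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// State names come from the article; everything else here is hypothetical.
enum SMState { eStart, eItsMe, eError };

// Hypothetical prober interface. Gecko's real classes differ in detail.
class Prober {
 public:
  virtual ~Prober() = default;
  virtual std::string Name() const = 0;
  virtual SMState HandleData(const unsigned char* buf, std::size_t len) = 0;
  virtual float Confidence() const = 0;
};

// Stub prober with a preset verdict, standing in for real byte analysis.
class FixedProber : public Prober {
 public:
  FixedProber(std::string name, SMState state, float confidence)
      : mName(std::move(name)), mState(state), mConfidence(confidence) {}
  std::string Name() const override { return mName; }
  SMState HandleData(const unsigned char*, std::size_t) override {
    return mState;
  }
  float Confidence() const override { return mConfidence; }

 private:
  std::string mName;
  SMState mState;
  float mConfidence;
};

// Returns the detected encoding name, or an empty string if nothing fits.
std::string Detect(const unsigned char* buf, std::size_t len,
                   const std::vector<Prober*>& probers) {
  // Step 1: a BOM settles the question immediately (UTF-32 BOMs omitted).
  if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
    return "UTF-8";
  if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF) return "UTF-16BE";
  if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE) return "UTF-16LE";

  // Step 2: feed every prober; eItsMe ends detection early.
  const Prober* best = nullptr;
  for (Prober* p : probers) {
    assert(p != nullptr);
    const SMState state = p->HandleData(buf, len);
    if (state == eItsMe) return p->Name();
    if (state == eError) continue;  // this encoding is ruled out
    // Step 3: otherwise remember the most confident surviving prober.
    if (best == nullptr || p->Confidence() > best->Confidence()) best = p;
  }
  return best != nullptr ? best->Name() : std::string();
}
```

With three stub probers reporting (eStart, 0.3), (eStart, 0.8) and (eError, 0.99), Detect picks the second one, because the third is excluded despite its higher preset confidence.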
4. Encoding Conversion
Every conversion ultimately targets Unicode, and Gecko performs it with table‑based lookups. The byte range of the input selects a specific lookup table; the converter then builds a 16‑bit value called med, finds the matching uMapCell and its format, and maps med to the target code point.
The typical conversion steps are:
Determine which lookup table to use based on the first byte.
Construct med from the byte(s) and locate the appropriate uMapCell.
Use the format to invoke the mapping routine and obtain the Unicode code point.
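The three steps above might look like this in a simplified form. The sketch uses a tiny subset of Shift_JIS as sample data. MapCell, MapTable, CellFormat, MapOne and DecodeOne are hypothetical names that only loosely mirror the uMapCell and format idea described here; they are not Gecko's real structures.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical cell formats: a range mapped by constant offset, a range
// mapped through a side array, and a single code.
enum class CellFormat { RangeOffset, RangeTable, Single };

struct MapCell {            // simplified stand-in for uMapCell
  CellFormat format;
  std::uint16_t srcBegin;
  std::uint16_t srcEnd;     // inclusive
  std::uint16_t dest;       // code point base, or index into mapping array
};

struct MapTable {
  const MapCell* cells;
  std::size_t cellCount;
  const std::uint16_t* mapping;  // used only by RangeTable cells
};

constexpr std::uint16_t kNoMapping = 0xFFFD;  // U+FFFD REPLACEMENT CHARACTER

// Sample data: ASCII (treated as identity here) and half-width katakana.
const MapCell kSingleCells[] = {
    {CellFormat::RangeOffset, 0x0000, 0x007F, 0x0000},
    {CellFormat::RangeOffset, 0x00A1, 0x00DF, 0xFF61},
};
// Sample data: a few punctuation marks and the hiragana block.
const std::uint16_t kPunctuation[] = {0x3000, 0x3001, 0x3002, 0xFF0C, 0xFF0E};
const MapCell kDoubleCells[] = {
    {CellFormat::RangeTable, 0x8140, 0x8144, 0},
    {CellFormat::Single, 0x8145, 0x8145, 0x30FB},
    {CellFormat::RangeOffset, 0x829F, 0x82F1, 0x3041},
};
const MapTable kSingleByte = {kSingleCells, 2, nullptr};
const MapTable kDoubleByte = {kDoubleCells, 3, kPunctuation};

// Steps 2 and 3: locate the cell containing med, then apply its format.
std::uint16_t MapOne(const MapTable& table, std::uint16_t med) {
  for (std::size_t i = 0; i < table.cellCount; ++i) {
    const MapCell& cell = table.cells[i];
    if (med < cell.srcBegin || med > cell.srcEnd) continue;
    const int offset = med - cell.srcBegin;
    switch (cell.format) {
      case CellFormat::RangeOffset:
        return static_cast<std::uint16_t>(cell.dest + offset);
      case CellFormat::RangeTable:
        assert(table.mapping != nullptr);
        return table.mapping[cell.dest + offset];
      case CellFormat::Single:
        return cell.dest;
    }
  }
  return kNoMapping;
}

// Step 1: the first byte decides which table applies and how many bytes
// form med. Writes the number of bytes used to *consumed.
std::uint16_t DecodeOne(const std::uint8_t* buf, std::size_t len,
                        std::size_t* consumed) {
  assert(buf != nullptr && len > 0 && consumed != nullptr);
  const std::uint8_t first = buf[0];
  const bool isLead = (first >= 0x81 && first <= 0x9F) ||
                      (first >= 0xE0 && first <= 0xFC);
  if (!isLead) {
    *consumed = 1;
    return MapOne(kSingleByte, first);
  }
  if (len < 2) {  // truncated double-byte character
    *consumed = 1;
    return kNoMapping;
  }
  *consumed = 2;
  const std::uint16_t med =
      static_cast<std::uint16_t>((first << 8) | buf[1]);
  return MapOne(kDoubleByte, med);
}
```

Storing contiguous ranges as a single cell with an offset keeps the tables small, which is the usual motivation for mixing several cell formats in one table.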
5. Summary
The combination of state‑machine analysis, character‑distribution ratios, and 2‑byte sequence modeling lets Gecko detect encodings reliably when no charset is declared. Confidence rises with a higher share of positive sequences and with lower repetition in the input, while the table‑driven conversion keeps the mapping to Unicode efficient.
Baidu Tech Salon
Baidu Tech Salon, organized by Baidu's Technology Management Department, is a monthly offline event that shares cutting‑edge tech trends from Baidu and the industry, providing a free platform for mid‑to‑senior engineers to exchange ideas.
