How Gecko Detects Web Page Encodings: Inside the State Machine and Distribution Algorithms

This article explains Gecko's multi‑layer encoding detection, covering its state‑machine approach, character‑distribution analysis, 2‑byte sequence modeling, the overall detection workflow, and the table‑based conversion process that maps all encodings to Unicode.

Baidu Tech Salon

Mar 7, 2014

1. Introduction

Gecko is the layout engine behind applications such as Netscape, Firefox, and Thunderbird. This article focuses on how Gecko recognises the character encoding of web documents from around the world and converts them to Unicode.

2. Encoding Detection Algorithms

When a page lacks a declared charset, Gecko guesses the most likely encoding by analysing the byte stream. Three main techniques are used:

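Coding Scheme (state‑machine) method – each candidate encoding gets its own state machine, and all of the machines run in parallel over the input. Every byte moves a machine to a new state according to that encoding's rules. Three states matter to the detector: eStart (a complete, legal character has just ended), eItsMe (a byte sequence unique to this encoding was seen, so detection can stop), and eError (an illegal byte sequence was seen, so this encoding is ruled out).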

Character Distribution Method – in East Asian multibyte text, a small set of characters accounts for most occurrences. Gecko counts mTotalChars (all characters seen) and mFreqChars (those within the 512 most frequent characters for the candidate encoding), then computes a Distribution Ratio from the two counts.

2‑Byte Sequence Method – for single‑byte encodings, Gecko models pairs of adjacent characters. A 256×256 matrix assigns each 2‑character sequence to one of four categories (Negative, Low, Medium, Positive), and the counters PRUint32 mSeqCounters[NUMBER_OF_SEQ_CAT] record how many sequences of each category occur. The confidence is computed from these counts.
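As a rough illustration of the coding‑scheme method, the sketch below runs two simplified state machines in parallel over the same bytes. It is not Gecko's code: the transition rules, the intermediate eMid1 to eMid3 states, and the names Machine, NextUtf8, NextIso2022Jp, and RunParallel are assumptions made for this example. The UTF‑8 rules ignore overlong forms and surrogates, and the ISO‑2022‑JP machine only recognises the ESC $ B and ESC $ @ escape sequences.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// eStart, eError and eItsMe are the states named in the text. The eMid*
// states ("more bytes expected") are an assumption of this sketch.
enum SMState { eStart = 0, eError = 1, eItsMe = 2, eMid1 = 3, eMid2 = 4, eMid3 = 5 };

static bool IsTrail(uint8_t b) { return b >= 0x80 && b <= 0xBF; }

// Simplified UTF-8 machine: never reaches eItsMe, only eStart or eError.
SMState NextUtf8(SMState s, uint8_t b) {
    switch (s) {
    case eStart:
        if (b < 0x80) return eStart;                 // ASCII
        if (b >= 0xC2 && b <= 0xDF) return eMid1;    // 2-byte lead
        if (b >= 0xE0 && b <= 0xEF) return eMid2;    // 3-byte lead
        if (b >= 0xF0 && b <= 0xF4) return eMid3;    // 4-byte lead
        return eError;
    case eMid1: return IsTrail(b) ? eStart : eError;
    case eMid2: return IsTrail(b) ? eMid1 : eError;
    case eMid3: return IsTrail(b) ? eMid2 : eError;
    default: break;                                  // eError and eItsMe are sticky
    }
    return s;
}

// Simplified ISO-2022-JP machine: a 7-bit encoding identified by an
// escape sequence that no other candidate here uses.
SMState NextIso2022Jp(SMState s, uint8_t b) {
    if (s == eError || s == eItsMe) return s;
    if (b >= 0x80) return eError;                    // 8-bit bytes are illegal
    if (s == eStart) return b == 0x1B ? eMid1 : eStart;
    if (s == eMid1) return b == '$' ? eMid2 : eStart;
    if (s == eMid2) return (b == 'B' || b == '@') ? eItsMe : eStart;
    return eStart;
}

struct Machine {
    std::string name;
    SMState (*next)(SMState, uint8_t);
    SMState state;
};

// Feeds every byte to every machine that has not failed. Returns the name
// of the first machine to reach eItsMe, or an empty string if none does.
std::string RunParallel(std::vector<Machine>& machines, const std::string& data) {
    for (unsigned char c : data) {
        for (Machine& m : machines) {
            assert(m.next != nullptr);
            if (m.state == eError) continue;         // already ruled out
            m.state = m.next(m.state, c);
            if (m.state == eItsMe) return m.name;    // early exit
        }
    }
    return "";
}
```

With both machines active, the bytes a, b, c, ESC, $, B drive the ISO‑2022‑JP machine to eItsMe, so RunParallel returns its name at once. The bytes 0xC3 0xA9 instead put that machine in eError while the UTF‑8 machine returns to eStart, which leaves the decision to the statistical methods.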

Confidence for multibyte detection:

float confidence = mFreqChars / ((mTotalChars - mFreqChars) * mTypicalDistributionRatio);
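The multibyte formula can be wrapped in a small function to show how its edge cases might be handled. This is a sketch under stated assumptions: the struct and function names are hypothetical, and the 0.99 cap, the 0.01 floor, and the guard against division by zero are choices made for this example rather than details given in the article.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical container for the counters named in the text.
struct DistributionStats {
    uint32_t totalChars;               // mTotalChars: all characters counted
    uint32_t freqChars;                // mFreqChars: those in the frequent set
    float typicalDistributionRatio;    // mTypicalDistributionRatio for the encoding
};

// Returns a confidence in [0.01, 0.99]. The cap and floor are assumptions.
float DistributionConfidence(const DistributionStats& s) {
    const float kSureYes = 0.99f;
    const float kSureNo = 0.01f;
    assert(s.freqChars <= s.totalChars);

    // No frequent characters seen: nothing supports this encoding.
    if (s.totalChars == 0 || s.freqChars == 0) return kSureNo;

    // Every character was frequent: the denominator would be zero.
    if (s.totalChars == s.freqChars) return kSureYes;

    // The formula from the text: observed ratio of frequent to
    // non-frequent characters, scaled by the typical ratio.
    float r = s.freqChars /
              ((s.totalChars - s.freqChars) * s.typicalDistributionRatio);
    return r < kSureYes ? r : kSureYes;
}
```

For example, with 100 characters of which 60 are frequent and a made-up typical ratio of 3.0, the function returns 60 / (40 × 3.0) = 0.5.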

Confidence for single‑byte detection (when NEGATIVE_APPROACH is defined):

((float)(mTotalSeqs - mSeqCounters[NEGATIVE_CAT]*10))/mTotalSeqs * mFreqChar / mTotalChar;

Otherwise:

((float)1.0) * mSeqCounters[POSITIVE_CAT] / mTotalSeqs / mModel->mTypicalPositiveRatio * mFreqChar / mTotalChar;
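Both single‑byte expressions can be placed side by side in one function to make the difference between them easier to see. The sketch below is an illustration only: SequenceStats and SequenceConfidence are hypothetical names, the category indices assume the order listed earlier (Negative first, Positive last), a runtime flag stands in for the NEGATIVE_APPROACH compile‑time switch, and the 0.01 floor and 0.99 cap are assumptions.

```cpp
#include <cassert>
#include <cstdint>

// Assumed category layout: index 0 is Negative, index 3 is Positive.
constexpr int kNumberOfSeqCat = 4;
constexpr int kNegativeCat = 0;
constexpr int kPositiveCat = kNumberOfSeqCat - 1;

// Hypothetical container for the counters named in the text.
struct SequenceStats {
    uint32_t seqCounters[kNumberOfSeqCat];  // mSeqCounters
    uint32_t totalSeqs;                     // mTotalSeqs
    uint32_t totalChar;                     // mTotalChar
    uint32_t freqChar;                      // mFreqChar
    float typicalPositiveRatio;             // mModel->mTypicalPositiveRatio
};

float SequenceConfidence(const SequenceStats& s, bool negativeApproach) {
    const float kFloor = 0.01f;
    const float kCap = 0.99f;
    assert(s.freqChar <= s.totalChar);
    if (s.totalSeqs == 0 || s.totalChar == 0) return kFloor;

    float r;
    if (negativeApproach) {
        // Each negative sequence costs ten sequences' worth of confidence.
        uint32_t penalty = s.seqCounters[kNegativeCat] * 10;
        if (s.totalSeqs <= penalty) return kFloor;
        r = static_cast<float>(s.totalSeqs - penalty) / s.totalSeqs
            * s.freqChar / s.totalChar;
    } else {
        // Share of positive sequences, relative to the typical share.
        r = 1.0f * s.seqCounters[kPositiveCat] / s.totalSeqs
            / s.typicalPositiveRatio * s.freqChar / s.totalChar;
    }
    return r < kCap ? r : kCap;
}
```

In both branches the final factor, freqChar / totalChar, scales the result by how much of the input consisted of characters the model considers frequent.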

3. Detection Flow

nsUniversalDetector first checks for a BOM. If none is present, it feeds the data to all active detectors. As soon as any detector reaches the eItsMe state, detection stops and that detector's encoding is returned. If there is no early exit, DataEnd compares the confidence values reported by the detectors and selects the encoding with the highest one.
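The order of these decisions can be sketched as follows. This is not the nsUniversalDetector implementation: ProberResult, DetectBom, and Detect are hypothetical names, the probers' output is passed in as ready‑made values instead of being computed from the data, only the UTF‑8 and UTF‑16 BOMs are checked, and the minimum‑confidence threshold is an assumption of this example.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical summary of what one detector reports for the input.
struct ProberResult {
    std::string charset;
    bool itsMe;          // its state machine reached eItsMe
    float confidence;    // statistical confidence, 0.0 to 1.0
};

// Step 1: a byte order mark settles the question immediately.
std::string DetectBom(const std::string& d) {
    if (d.size() >= 3 && d.compare(0, 3, "\xEF\xBB\xBF") == 0) return "UTF-8";
    if (d.size() >= 2 && d.compare(0, 2, "\xFE\xFF") == 0) return "UTF-16BE";
    if (d.size() >= 2 && d.compare(0, 2, "\xFF\xFE") == 0) return "UTF-16LE";
    return "";
}

// Returns the chosen charset name, or an empty string if nothing qualifies.
std::string Detect(const std::string& data,
                   const std::vector<ProberResult>& results,
                   float minThreshold) {
    assert(minThreshold >= 0.0f && minThreshold < 1.0f);

    std::string bom = DetectBom(data);
    if (!bom.empty()) return bom;

    // Step 2: any detector that is certain wins immediately.
    for (const ProberResult& r : results) {
        if (r.itsMe) return r.charset;
    }

    // Step 3: otherwise pick the highest confidence (the DataEnd step).
    const ProberResult* best = nullptr;
    for (const ProberResult& r : results) {
        if (best == nullptr || r.confidence > best->confidence) best = &r;
    }
    if (best != nullptr && best->confidence > minThreshold) return best->charset;
    return "";
}
```

A detector that reports eItsMe wins even when another detector has a higher statistical confidence, which mirrors the early exit described above.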

4. Encoding Conversion

Gecko converts every supported encoding to Unicode, using table‑based look‑ups. The range of the first byte selects a lookup table. The converter then builds a 16‑bit value, med, from the input byte(s), finds the uMapCell whose range contains med, and uses that cell's format to map med to the target code point.

The typical conversion steps are:

Determine which lookup table to use based on the first byte.

Construct med from the byte(s) and locate the appropriate uMapCell.

Use the format to invoke the mapping routine and obtain the Unicode code point.
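The three steps above can be sketched with a toy converter. Everything here is illustrative: MapCell only loosely mirrors the idea of a uMapCell, and the two formats (Range and Table), the lead‑byte threshold of 0x81, and the names MapTable, Converter, MapToUnicode, and ConvertOne are assumptions made for this example, not Gecko's actual data layout.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical cell formats: a contiguous range, or an explicit table.
enum class CellFormat { Range, Table };

struct MapCell {
    CellFormat format;
    uint16_t srcBegin;   // first med value covered by this cell
    uint16_t srcEnd;     // last med value covered by this cell
    uint16_t destBegin;  // Range: first code point; Table: index into mapping
};

struct MapTable {
    std::vector<MapCell> cells;
    std::vector<uint16_t> mapping;   // used by Table cells only
};

struct Converter {
    MapTable singleByte;
    MapTable doubleByte;
};

// Steps 2 and 3: find the cell containing med and apply its format.
bool MapToUnicode(const MapTable& t, uint16_t med, uint16_t* out) {
    for (const MapCell& c : t.cells) {
        if (med < c.srcBegin || med > c.srcEnd) continue;
        uint16_t offset = static_cast<uint16_t>(med - c.srcBegin);
        if (c.format == CellFormat::Range) {
            *out = static_cast<uint16_t>(c.destBegin + offset);
            return true;
        }
        std::size_t idx = static_cast<std::size_t>(c.destBegin) + offset;
        if (idx >= t.mapping.size()) return false;
        *out = t.mapping[idx];
        return true;
    }
    return false;   // no cell covers this value
}

// Converts one character. Returns the number of bytes consumed, 0 on failure.
int ConvertOne(const Converter& cv, const uint8_t* in, int avail, uint16_t* out) {
    assert(in != nullptr && out != nullptr);
    if (avail < 1) return 0;

    // Step 1: the first byte decides which table applies (assumed threshold).
    bool isLead = in[0] >= 0x81;
    int len = isLead ? 2 : 1;
    if (avail < len) return 0;

    // Step 2: build the 16-bit med value from the byte(s).
    uint16_t med = isLead ? static_cast<uint16_t>((in[0] << 8) | in[1])
                          : static_cast<uint16_t>(in[0]);
    const MapTable& table = isLead ? cv.doubleByte : cv.singleByte;
    return MapToUnicode(table, med, out) ? len : 0;
}
```

A Range cell suits runs where source and target values advance together, such as ASCII, while a Table cell handles runs whose targets are not contiguous. Keeping both formats lets a table stay compact without giving up arbitrary mappings.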

5. Summary

The combination of state‑machine analysis, character‑distribution ratios, and 2‑byte sequence modeling lets Gecko detect encodings robustly across large, heterogeneous data sets. Detection is more reliable when the input contains a higher share of positive sequences and less repeated text, and the table‑driven conversion then maps the detected encoding to Unicode efficiently.
