
Decoding Randomized Custom Fonts with Python: Glyph Matching and OCR Techniques

This article explains how to handle custom web fonts whose glyph order or shapes are randomized by extracting glyph metadata with FontTools, creating binary signatures for reliable matching, and applying image‑recognition OCR to decode characters when glyph contours also change, complete with code examples and step‑by‑step instructions.

Python Crawling & Data Mining

Dec 17, 2021


In the previous article we analyzed custom web fonts using image recognition, but the approach struggled when the glyph order or shapes were randomized. This guide expands on that by showing how to reliably decode such fonts using Python.

Handling Random Glyph Order

The core idea is to extract each glyph's raw data (control points and flags), convert it to a binary signature, and map that signature to the known character. By building a glyphBytes2char dictionary from a sample font with a known glyph order, we can later match any font whose glyph order has been shuffled.

from fontTools.ttLib import TTFont
import numpy as np

def get_glyphBytes(glyph):
    # Signature = int16 coordinates followed by the per-point flags.
    coordinates = np.array(glyph.coordinates).astype("int16")
    return coordinates.tobytes() + bytes(glyph.flags)

font = TTFont("address.woff")
glyf = font["glyf"]
chars = " ... "  # the known character list
glyphBytes2char = {}
# Pair each glyph name with its known character, in glyph order.
for code, char in zip(glyf.glyphOrder, chars):
    glyph = glyf[code]
    # Empty and composite glyphs carry no coordinates of their own; skip them.
    if not hasattr(glyph, "coordinates"):
        continue
    glyphBytes2char[get_glyphBytes(glyph)] = char
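
To see in isolation why such a signature works as a dictionary key, here is a minimal sketch that uses only the standard library. The outlines are made up for illustration (plain coordinate lists and flag values rather than fontTools glyph objects), and make_signature is a hypothetical stand-in for get_glyphBytes. It packs coordinates as little-endian int16, whereas the NumPy version above uses the machine's native byte order.

```python
import struct

def make_signature(coordinates, flags):
    """Pack (x, y) pairs as little-endian int16 and append the flag bytes."""
    flat = [value for point in coordinates for value in point]
    return struct.pack(f"<{len(flat)}h", *flat) + bytes(flags)

# Hypothetical outlines: same coordinates, different on-curve flags.
square = ([(0, 0), (100, 0), (100, 100), (0, 100)], [1, 1, 1, 1])
rounded = ([(0, 0), (100, 0), (100, 100), (0, 100)], [1, 0, 1, 0])

sig_square = make_signature(*square)
sig_rounded = make_signature(*rounded)

# bytes objects are hashable, so they can key a dictionary directly.
signature_to_char = {sig_square: "A", sig_rounded: "B"}

# Recomputing the signature from the same outline finds the same entry,
# no matter where the glyph sits in the font's glyph order.
lookup = signature_to_char[make_signature(*square)]
print(lookup)
```

Including the flags matters: the two outlines above share every coordinate and would collide if only the coordinates were hashed.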

To decode a target font, we read each glyph, compute its binary signature, and look it up in the dictionary:

font = TTFont("random.woff")
glyf = font["glyf"]
code2char = {}
for code in glyf.glyphOrder:
    glyph = glyf[code]
    # Skip glyphs without an outline, as when building the dictionary.
    if not hasattr(glyph, "coordinates"):
        continue
    glyphBytes = get_glyphBytes(glyph)
    # Only byte-identical outlines match; anything else stays unmapped.
    if glyphBytes in glyphBytes2char:
        code2char[code] = glyphBytes2char[glyphBytes]

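The result maps each glyph name in the target font to its real character, even when the glyph order is completely shuffled. Because the lookup requires an exact byte match, it works only while the outlines themselves are unchanged. To translate the code points that appear in a page, combine this mapping with the font's cmap table, which maps Unicode code points to glyph names.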
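
The loop above keys its result by glyph name, while the text in a page consists of Unicode code points. A minimal sketch of bridging the two through cmap follows. The helper names build_codepoint_map and decode_text are hypothetical, and the sample dictionaries are invented stand-ins for font["cmap"].getBestCmap() and code2char.

```python
def build_codepoint_map(cmap, glyph_name_to_char):
    """Map Unicode code points to real characters via glyph names."""
    return {
        codepoint: glyph_name_to_char[name]
        for codepoint, name in cmap.items()
        if name in glyph_name_to_char
    }

def decode_text(text, codepoint_to_char):
    """Replace obfuscated characters, leaving unknown ones untouched."""
    return "".join(codepoint_to_char.get(ord(ch), ch) for ch in text)

# Invented values standing in for getBestCmap() and code2char.
cmap = {0xE001: "uniE001", 0xE002: "uniE002", 0xE003: "uniE003"}
code2char = {"uniE001": "7", "uniE002": "3", "uniE003": "9"}

codepoint_to_char = build_codepoint_map(cmap, code2char)
decoded = decode_text("\ue002\ue001-\ue003", codepoint_to_char)
print(decoded)  # 37-9
```

Characters with no entry in the map, such as the hyphen here, pass through unchanged, so ordinary text mixed with obfuscated glyphs survives decoding.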

Generating Custom Fonts

To produce a test font, we first convert the system arial.ttf to SVG using fontsquirrel.com, then select the characters we need in icomoon.io and export them as a .woff file.

Understanding Font Tables with FontTools

Using fontTools we can inspect the main tables of a TrueType/WOFF font:

head: global font information (units per EM, bounding box, timestamps).

cmap: maps Unicode code points to glyph names.

glyf: contains the actual glyph outlines.

loca: offsets to each glyph in the glyf table.

maxp: memory requirements of the font (number of glyphs, maximum points and contours per glyph, etc.).

name: human‑readable font metadata.

hmtx: horizontal metrics for each glyph.

Sample code to read these tables:

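from fontTools.ttLib import TTFont

font = TTFont("sample.woff")
head = font["head"]
print(f"Units per EM: {head.unitsPerEm}, bbox: ({head.xMin},{head.yMin})-({head.xMax},{head.yMax})")

# getBestCmap() returns a dict of {code point (int): glyph name (str)}.
cmap = font["cmap"].getBestCmap()
print(cmap)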

When Glyph Shapes Are Also Randomized

If the glyph contours themselves are altered (e.g., using multiple base glyphs to generate a custom font), binary matching fails. In this case we fall back to image‑recognition OCR. We wrap the ddddocr model in a FontOCR class that renders each glyph to a 64×64 bitmap and feeds it to the neural network.

import numpy as np
from ddddocr import DdddOcr
from PIL import Image, ImageDraw, ImageFont


class FontOCR(DdddOcr):
    def __init__(self, font_path, size=40):
        super().__init__()
        self.font = ImageFont.truetype(font_path, size)
        self.cache = {}
        self.im_cache = {}

    def ocr(self, image):
        # Scale the grayscale image to [-1, 1] and add a channel axis.
        img = np.array(image).astype(np.float32)
        img = np.expand_dims(img, 0) / 255.
        img = (img - 0.5) / 0.5
        ort_inputs = {"input1": np.array([img])}
        # These name-mangled private attributes depend on the ddddocr version.
        out = self._DdddOcr__ort_session.run(None, ort_inputs)
        # Skip index 0 (treated as blank here) and return the first real prediction.
        for item in out[0][0]:
            if item == 0:
                continue
            return self._DdddOcr__charset[item]
        return None

    def getCharImage(self, char):
        if char in self.im_cache:
            return self.im_cache[char]
        im = Image.new('L', (64, 64), 255)
        draw = ImageDraw.Draw(im)
        # textbbox replaces textsize/getoffset, which Pillow 10 removed.
        left, top, right, bottom = draw.textbbox((0, 0), char, font=self.font)
        # Center the glyph's ink box on the 64x64 canvas.
        x = (64 - (right - left)) / 2 - left
        y = (64 - (bottom - top)) / 2 - top
        draw.text((x, y), char, fill=0, font=self.font)
        self.im_cache[char] = im
        return im

Testing on the system msyh.ttc font yields only 6 misrecognitions out of 601 characters (≈99% accuracy). Testing on a custom shopNum.woff font results in 3 errors, demonstrating the robustness of the OCR fallback.
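
A figure like this can be reproduced with a small evaluation loop. The sketch below is an assumption about how such a test might be organized, not the author's harness: evaluate is a hypothetical helper, and fake_recognize stands in for a call such as ocr.ocr(ocr.getCharImage(ch)) so that the example runs without a model.

```python
def evaluate(expected, recognize):
    """Return (errors, accuracy) for a recognizer over known characters."""
    errors = [ch for ch in expected if recognize(ch) != ch]
    accuracy = 1 - len(errors) / len(expected)
    return errors, accuracy

# Stand-in recognizer that confuses two look-alike characters (invented).
confusions = {"0": "O", "l": "1"}

def fake_recognize(ch):
    return confusions.get(ch, ch)

errors, accuracy = evaluate("0123456789l", fake_recognize)
print(errors, round(accuracy, 3))

# The figure quoted above: 6 errors out of 601 characters.
reported = 1 - 6 / 601
print(round(reported, 3))
```

Collecting the misrecognized characters, rather than only counting them, shows which glyphs the model confuses.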

Putting It All Together

By first attempting binary matching and then falling back to OCR when necessary, we can decode custom web fonts whether the glyph order, the glyph shapes, or both are randomized. Binary matching is exact whenever the outlines are unchanged, and in the 601‑character test above the OCR stage recognized about 99% of characters correctly.
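
One way to wire the two stages together is sketched below. It assumes the signatures have already been computed per glyph name and takes the OCR step as a callable, so the structure can be shown without loading a font or a model. The function decode_glyphs and all sample data are hypothetical.

```python
def decode_glyphs(signatures, signature_to_char, ocr_fallback):
    """Resolve each glyph by signature lookup, using OCR only on a miss."""
    result, via_ocr = {}, []
    for name, signature in signatures.items():
        char = signature_to_char.get(signature)
        if char is None:
            # No byte-identical outline is known: ask the OCR stage.
            char = ocr_fallback(name)
            via_ocr.append(name)
        if char is not None:
            result[name] = char
    return result, via_ocr

# Invented data: one glyph matches a known signature, one does not.
known = {b"\x01\x02": "1", b"\x03\x04": "2"}
target = {"uniE001": b"\x03\x04", "uniE002": b"\xff\xff"}

result, via_ocr = decode_glyphs(target, known, lambda name: "8")
print(result, via_ocr)
```

Returning the list of glyphs that needed OCR makes it easy to review the less certain results separately from the exact matches.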

Conclusion

We demonstrated how to generate custom fonts, inspect their internal tables with fontTools, create binary signatures for reliable glyph matching, and employ deep‑learning OCR for cases where glyph shapes are altered. This pipeline enables automated extraction of obfuscated text from web pages that use random font techniques.
