Decoding Randomized Custom Fonts with Python: Glyph Matching and OCR Techniques
This article explains how to decode custom web fonts whose glyph order or shapes are randomized. It extracts glyph data with fontTools, builds binary signatures for exact matching, and falls back to image-recognition OCR when the glyph contours also change, with code examples for each step.
In the previous article we analyzed custom web fonts using image recognition, but the approach struggled when the glyph order or shapes were randomized. This guide expands on that by showing how to reliably decode such fonts using Python.
Handling Random Glyph Order
The core idea is to extract each glyph's raw data (control points and flags), convert it to a binary signature, and map that signature to the known character. By building a glyphBytes2char dictionary from a sample font with a known glyph order, we can later match any font whose glyph order has been shuffled.
from fontTools.ttLib import TTFont
import numpy as np

def get_glyphBytes(glyph):
    # Signature = outline coordinates (as int16) followed by the point flags.
    coordinates = np.array(glyph.coordinates).astype("int16")
    return coordinates.tobytes() + bytes(glyph.flags)

font = TTFont("address.woff")
glyf = font["glyf"]
chars = " ... "  # the known character list
glyphBytes2char = {}
for code, char in zip(glyf.glyphOrder, chars):
    glyph = glyf[code]
    # Glyphs without their own outline (e.g. composite or empty glyphs) are skipped.
    if not hasattr(glyph, "coordinates"):
        continue
    glyphBytes2char[get_glyphBytes(glyph)] = char
To decode a target font, we read each glyph, compute its binary signature, and look it up in the dictionary:
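font = TTFont("random.woff")
glyf = font["glyf"]
code2char = {}
for code in glyf.glyphOrder:
    glyph = glyf[code]
    if not hasattr(glyph, "coordinates"):
        continue
    glyphBytes = get_glyphBytes(glyph)
    # Only glyphs whose outline is byte-identical to a known glyph are mapped.
    if glyphBytes in glyphBytes2char:
        code2char[code] = glyphBytes2char[glyphBytes]
The result maps each glyph name in the target font to the correct character, even when the glyph order is completely shuffled. Combined with the font's cmap table, this gives the real character for every Unicode code point used on the page. The match is exact, so it works only as long as the outlines themselves are unchanged.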
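Pages that use such fonts put the obfuscated code points directly in the text, so the glyph-name mapping still has to be joined with the cmap before anything can be read. The following is a minimal sketch of that last step, not the original author's code. It assumes `cmap` is a dict shaped like the result of `font["cmap"].getBestCmap()` (code point to glyph name) and that `code2char` is the dictionary built above; the helper name `decode_text` and the sample values are hypothetical.

```python
def decode_text(text, cmap, code2char):
    """Replace each obfuscated character with the character its glyph draws.

    cmap:      {code point (int): glyph name}, e.g. from getBestCmap()
    code2char: {glyph name: real character}, built by signature matching
    Characters the font does not cover are passed through unchanged.
    """
    decoded = []
    for ch in text:
        glyph_name = cmap.get(ord(ch))
        decoded.append(code2char.get(glyph_name, ch))
    return "".join(decoded)


# Hypothetical sample data: two private-use code points that draw digits.
sample_cmap = {0xE001: "uniE001", 0xE002: "uniE002"}
sample_code2char = {"uniE001": "7", "uniE002": "3"}
print(decode_text("\ue001\ue002-\ue001", sample_cmap, sample_code2char))
```

Passing unknown characters through unchanged keeps ordinary text (punctuation, spaces) intact when only part of a string uses the custom font.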
Generating Custom Fonts
To generate a test font, we first convert the system arial.ttf to SVG using fontsquirrel.com, then select the characters we need in icomoon.io and export a .woff file.
Understanding Font Tables with FontTools
Using fontTools we can inspect the main tables of a TrueType/WOFF font:
head: global font information (units per EM, bounding box, timestamps).
cmap: maps Unicode code points to glyph names.
glyf: contains the actual glyph outlines.
loca: offsets to each glyph in the glyf table.
maxp: maximum requirements (number of glyphs, points, contours, etc.).
name: human-readable font metadata.
hmtx: horizontal metrics for each glyph.
Sample code to read these tables:
from fontTools.ttLib import TTFont

font = TTFont("sample.woff")
head = font["head"]
print(f"Units per EM: {head.unitsPerEm}, bbox: ({head.xMin},{head.yMin})-({head.xMax},{head.yMax})")
# getBestCmap() returns a dict of {code point: glyph name}.
cmap = font["cmap"].getBestCmap()
print(cmap)
When Glyph Shapes Are Also Randomized
If the glyph contours themselves are altered (e.g., using multiple base glyphs to generate a custom font), binary matching fails. In this case we fall back to image‑recognition OCR. We wrap the ddddocr model in a FontOCR class that renders each glyph to a 64×64 bitmap and feeds it to the neural network.
import numpy as np
from ddddocr import DdddOcr
from PIL import ImageFont, Image, ImageDraw

class FontOCR(DdddOcr):
    def __init__(self, font_path, size=40):
        super().__init__()
        self.font = ImageFont.truetype(font_path, size)
        self.cache = {}
        self.im_cache = {}

    def ocr(self, image):
        # Scale pixels to [-1, 1] and add channel and batch dimensions.
        img = np.array(image).astype(np.float32)
        img = np.expand_dims(img, 0) / 255.
        img = (img - 0.5) / 0.5
        ort_inputs = {"input1": np.array([img])}
        # These are name-mangled private attributes of DdddOcr,
        # so they depend on the installed ddddocr version.
        out = self._DdddOcr__ort_session.run(None, ort_inputs)
        # Return the first prediction whose index is not 0 (treated as blank).
        for item in out[0][0]:
            if item == 0:
                continue
            return self._DdddOcr__charset[item]
    def getCharImage(self, char):
        if char in self.im_cache:
            return self.im_cache[char]
        im = Image.new('L', (64, 64), 255)
        draw = ImageDraw.Draw(im)
        # textbbox replaces textsize/getoffset, which were removed in Pillow 10.
        left, top, right, bottom = draw.textbbox((0, 0), char, font=self.font)
        # Center the inked area of the glyph on the 64x64 canvas.
        x, y = (64 - right - left) / 2, (64 - bottom - top) / 2
        draw.text((x, y), char, 0, self.font)
        self.im_cache[char] = im
        return im
Testing on the system msyh.ttc font yields only 6 misrecognitions out of 601 characters (≈99% accuracy). Testing on a custom shopNum.woff font results in 3 errors, demonstrating the robustness of the OCR fallback.
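Error counts like these come from comparing the recognized characters against a known character list. A minimal sketch of such a check follows. It is an illustration rather than the script used for the figures above; the function name `score_predictions` and the toy data are hypothetical.

```python
def score_predictions(expected, predicted):
    """Return (accuracy, errors) for two equal-length character sequences.

    errors is a list of (index, expected_char, predicted_char) tuples.
    """
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    errors = [
        (i, e, p)
        for i, (e, p) in enumerate(zip(expected, predicted))
        if e != p
    ]
    accuracy = 1 - len(errors) / len(expected) if expected else 0.0
    return accuracy, errors


# Hypothetical toy data; a real run would pass the known character list
# and the characters returned by the OCR wrapper.
accuracy, errors = score_predictions("0123456789", "0123456780")
print(f"{accuracy:.1%}", errors)
```

With the numbers reported above, 6 errors in 601 characters gives 1 - 6/601, or roughly 0.990. Keeping the error list as well as the ratio makes it easy to see which characters the model confuses.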
Putting It All Together
By first attempting binary matching and falling back to OCR only when no signature matches, we can decode custom web fonts whether the glyph order, the glyph shapes, or both are randomized. Binary matching is exact when the outlines are unchanged, and in the tests above the OCR step reached about 99% accuracy (6 errors in 601 characters).
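The control flow of that combination can be sketched as follows. This is a simplified illustration under stated assumptions: glyph signatures are taken as already computed, and the OCR step is injected as a plain callable so the sketch runs without a font file or a model. The names `build_code2char`, `ocr_fallback`, and the sample data are hypothetical.

```python
def build_code2char(glyph_signatures, signature2char, ocr_fallback):
    """Map glyph names to characters, preferring exact signature matches.

    glyph_signatures: {glyph name: signature bytes} for the target font
    signature2char:   {signature bytes: character} from the reference font
    ocr_fallback:     callable(glyph name) -> character, or None if unsure
    Returns (code2char, unresolved glyph names).
    """
    code2char = {}
    unresolved = []
    for name, signature in glyph_signatures.items():
        char = signature2char.get(signature)
        if char is None:
            # Outline differs from every known glyph: try image recognition.
            char = ocr_fallback(name)
        if char is None:
            unresolved.append(name)
        else:
            code2char[name] = char
    return code2char, unresolved


# Hypothetical stand-ins for real signatures and a real OCR model.
known = {b"\x01\x02": "A", b"\x03\x04": "B"}
target = {"uniE001": b"\x03\x04", "uniE002": b"\xff\xff", "uniE003": b"\x00"}
fake_ocr = {"uniE002": "C"}.get
mapping, missing = build_code2char(target, known, fake_ocr)
print(mapping, missing)
```

Returning the unresolved glyph names separately, instead of guessing, leaves room to inspect the few glyphs that neither method could identify.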
Conclusion
We demonstrated how to generate custom fonts, inspect their internal tables with fontTools, create binary signatures for reliable glyph matching, and employ deep‑learning OCR for cases where glyph shapes are altered. This pipeline enables automated extraction of obfuscated text from web pages that use random font techniques.
