Why Unicode Lets Characters Wear Hats and Shoes: The Weird World of Combining Marks
This article explores how Unicode combining marks let characters such as Thai letters stack multiple diacritics, creating “hats” and “shoes”, and examines the storage, rendering, and font challenges these stacked glyphs pose for developers and users.
Encoding is an unavoidable topic for programmers, and for front‑end engineers the characters they write appear directly on the screen. Unicode, often called the "universal code," sometimes produces bizarre visual effects when rendering certain scripts.
1. Characters Can Wear Hats and Shoes
In Thai, the greeting "สวัสดี" ends with a polite particle that depends on the speaker's gender: male speakers say สวัสดีครับ and female speakers say สวัสดีค่ะ. Beyond that, Thai consonants can carry combining marks that look like hats or shoes. For example, ผ is a bare consonant, ผู adds a vowel sign below it (a shoe), and ผู้ adds a tone mark above as well (a hat).
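The stacked form can be inspected code point by code point. The following is a minimal sketch using Python's standard `unicodedata` module; the helper name `describe` is made up for illustration and is not part of any library.

```python
import unicodedata

def describe(text: str) -> list[tuple[str, str, str]]:
    """Return (code point, general category, name) for each code point."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch, "UNKNOWN"))
        for ch in text
    ]

# Base consonant, then a vowel sign below (the "shoe"),
# then a tone mark above (the "hat"), written as escapes.
stacked = "\u0e1c\u0e39\u0e49"

for code_point, category, name in describe(stacked):
    print(code_point, category, name)

# What looks like one character on screen is stored as three code points.
print(len(stacked))
```

The category column separates the base letter (`Lo`, "other letter") from the two marks (`Mn`, "nonspacing mark"), which is the distinction the hat-and-shoe metaphor describes.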
Thai even allows a character to have two hats and a pair of shoes, creating a vertically extensible glyph. The following image illustrates such a stacked character.
2. The Conflict Between Humans and Machines
Combining marks resolve a conflict between storage and display. Encoding every possible pre-composed Thai glyph would take thousands of code points (for example, 44 × 21 × 4 = 3696 combinations), which is wasteful when the script needs only about 69 characters in total. Instead, the base consonant and each mark are stored as separate code points, and Complex Text Layout (CTL) combines them into a single cluster at render time. No terminator is stored: a cluster simply ends where the next base character begins. This saves storage space.
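The arithmetic behind that trade-off is easy to check. In the sketch below the three counts come from the figures in the text; which category each number stands for (consonants, vowels, tone marks) is an assumption made for illustration.

```python
# Counts taken from the figures in the text; the category each number
# stands for is an assumption made for this sketch.
CONSONANTS = 44
VOWELS = 21
TONE_MARKS = 4

# One code point per pre-composed glyph: every combination needs its own slot.
precomposed = CONSONANTS * VOWELS * TONE_MARKS

# One code point per building block: the renderer does the combining.
separate = CONSONANTS + VOWELS + TONE_MARKS

print(precomposed)
print(separate)
print(round(precomposed / separate, 1))
```

The product grows multiplicatively with every extra mark position, while the sum grows only by the number of new marks, which is why the decomposed design scales.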
However, this introduces a recognition problem: a human reader can tell at a glance whether a combined Thai glyph is well formed, but software has a harder time validating and rendering arbitrary mark sequences efficiently. Input methods mitigate this by refusing further marks after a tone mark, yet users can still copy and paste or manually rearrange characters to create novel forms.
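One way software can guard against over-stacked input is to cap how many marks may follow a base character. The sketch below is an illustrative filter, not how any particular input method or browser works; the function names and the default limit of 3 (two hats and a pair of shoes) are assumptions.

```python
import unicodedata

def is_mark(ch: str) -> bool:
    # Mn = nonspacing mark, Me = enclosing mark.
    return unicodedata.category(ch) in ("Mn", "Me")

def longest_mark_run(text: str) -> int:
    """Length of the longest run of consecutive combining marks."""
    longest = run = 0
    for ch in text:
        run = run + 1 if is_mark(ch) else 0
        longest = max(longest, run)
    return longest

def cap_marks(text: str, limit: int = 3) -> str:
    """Drop any mark beyond `limit` consecutive marks (hypothetical policy)."""
    kept = []
    run = 0
    for ch in text:
        if is_mark(ch):
            run += 1
            if run > limit:
                continue  # too many marks on this base: skip the extra one
        else:
            run = 0
        kept.append(ch)
    return "".join(kept)

normal = "\u0e1c\u0e39\u0e49"            # consonant + vowel sign + tone mark
abusive = "\u0e1c" + "\u0e49" * 6        # consonant + six stacked tone marks

print(longest_mark_run(normal), longest_mark_run(abusive))
print(len(cap_marks(abusive)))
```

A simple count like this cannot tell whether a sequence is linguistically valid; it only bounds how tall a cluster can grow, which is usually what a layout needs.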
Modern WebKit versions suppress vertically stacked characters to preserve layout, so the phenomenon may not appear in all browsers.
3. Emoticons Made from Mixed Scripts
Some emoticons combine characters from different scripts. The crying-eye symbol ༎ຶ pairs a Tibetan punctuation mark (U+0F0E, the tear) with a Lao vowel sign (U+0EB6, the eye drawn above it). In escape notation the sequence is \u0f0e followed by \u0eb6. Similar mash-ups include:
▷ˋε´◁ – ε is a Greek letter
ʕ-'ᴥ'-ʔ – uses International Phonetic Alphabet symbols
(·ཀ·」∠) – ཀ is Tibetan
(ง •̀_•́ )ง – ง is Thai
罒 д 罒 – 罒 is a Chinese character, д is Cyrillic
Using such mixed‑script characters makes a user appear fluent in many languages.
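The scripts inside an emoticon can be listed programmatically. Python's standard library does not expose the Unicode Script property, so the sketch below falls back on a heuristic: the first word of each character's official name. The function names are made up for illustration.

```python
import unicodedata

def script_guess(ch: str) -> str:
    """Heuristic only: the first word of the Unicode character name."""
    return unicodedata.name(ch, "UNKNOWN").split()[0]

def scripts_in(text: str) -> list[str]:
    """Distinct first-word labels in order of first appearance."""
    seen = []
    for ch in text:
        if ch.isspace():
            continue
        label = script_guess(ch)
        if label not in seen:
            seen.append(label)
    return seen

crying_eye = "\u0f0e\u0eb6"   # the two code points given in the text
print(scripts_in(crying_eye))

# Letters picked from the list above, written as escapes.
for ch in ("\u03b5", "\u0434", "\u0e07", "\u7f52"):
    print(f"U+{ord(ch):04X}", script_guess(ch))
```

The heuristic is rough: IPA symbols are named as Latin letters and combining accents are labelled `COMBINING`, so a real script check would need a library that exposes the Script property.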
4. Font Misalignment
The appearance of a character also depends on the font. The same code point can render correctly in one font and appear misaligned or split in another. For example, the combining Cyrillic hundred thousands sign ҈ (U+0488), an enclosing mark meant to surround the preceding letter, is drawn off-center in many fonts, while Courier New displays it as a separate glyph.
Unicode specifies which characters exist and how they behave, not the exact shape or position of each glyph; that is left to fonts and text-shaping engines, and with a repertoire this large their coverage is inevitably uneven.
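What Unicode does fix is the character's identity and properties, which can be read without any font involved. A short sketch with the standard `unicodedata` module; the Cyrillic base letter is an arbitrary choice for the example.

```python
import unicodedata

mark = "\u0488"
print(unicodedata.name(mark))
print(unicodedata.category(mark))   # "Me" means enclosing mark

# The mark attaches to whatever precedes it; how the pair is drawn
# is decided by the font, not by the encoded text.
sample = "\u0430" + mark            # arbitrary Cyrillic base + the mark
print(len(sample))
```

The stored text is identical on every system, so any difference in appearance comes from the font or the shaping engine rather than from the data.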
5. Chaos and Innovation?
Unicode continues to evolve, now encompassing emojis as standard characters. Some developers experiment with creating new glyphs that intentionally cause misalignment or stacking, a practice seen in iOS’s “flower text” feature and in certain input method extensions.
These creative uses highlight both the flexibility and the potential for chaos within Unicode’s ever‑growing character set.
Tencent IMWeb Frontend Team