Converting Full‑Width and Half‑Width Characters in Python

This article explains the Unicode mapping between full‑width and half‑width characters, demonstrates simple Python functions for converting between them, and provides a flexible dictionary‑based approach for custom text conversion tasks.

Python Programming Learning Circle

Apr 28, 2022

Text that mixes full‑width and half‑width forms of the same character often breaks string comparison, searching, and parsing, so it helps to have a small routine that converts between the two forms.

The mapping is simple: the full‑width forms of the printable ASCII characters occupy 0xFF01‑0xFF5E (decimal 65281‑65374), and their half‑width counterparts occupy 0x21‑0x7E (decimal 33‑126). Adding 0xFEE0 (decimal 65248) to a half‑width code point gives its full‑width counterpart. The space is the one exception: the full‑width space is 0x3000 and the half‑width space is 0x20.
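The offset arithmetic can be checked directly. The following is a minimal Python 3 sketch; the names OFFSET and to_full are illustrative and do not come from the original article.

```python
# Distance between a half-width ASCII character and its full-width form.
OFFSET = 0xFEE0  # 65248 in decimal


def to_full(ch):
    """Return the full-width counterpart of one character in 0x21-0x7E."""
    return chr(ord(ch) + OFFSET)


print(hex(ord(to_full("A"))))   # 0xff21
print(hex(ord(to_full("!"))))   # 0xff01, the start of the full-width range
print(hex(ord(to_full("~"))))   # 0xff5e, the end of the full-width range

# The space does not follow the rule: U+3000 is not U+0020 + 0xFEE0.
print(ord("\u3000") - ord(" ") == OFFSET)   # False
```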

Two built‑in functions do the work: ord() returns the integer code point of a character, and chr() returns the character for a code point. In Python 3, chr() accepts any Unicode code point; in Python 2 it was limited to 0‑255, and unichr() was needed for everything else.

Example of printing the mapping:

# Python 3: range() replaces xrange(), and chr() replaces unichr().
for i in range(33, 127):
    print(i, chr(i), i + 65248, chr(i + 65248))

Simple conversion functions:

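def full2half(s):
    """Convert full-width characters in s to their half-width forms."""
    # Python 3 strings are already Unicode; decode only when given bytes.
    if isinstance(s, bytes):
        s = s.decode('utf-8')
    n = []
    for char in s:
        num = ord(char)
        if num == 0x3000:                  # full-width space
            num = 0x20
        elif 0xFF01 <= num <= 0xFF5E:      # full-width ASCII range
            num -= 0xFEE0                  # 65248
        n.append(chr(num))
    return ''.join(n)

def half2full(s):
    """Convert half-width characters in s to their full-width forms."""
    if isinstance(s, bytes):
        s = s.decode('utf-8')
    n = []
    for char in s:
        num = ord(char)
        if num == 0x20:                    # half-width space
            num = 0x3000
        elif 0x21 <= num <= 0x7E:          # printable ASCII range
            num += 0xFEE0                  # 65248
        n.append(chr(num))
    return ''.join(n)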
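The same conversion can also be written with a lookup table and str.translate, which avoids the explicit loop. This is a sketch of an alternative technique, not the article's own method; the names FULL_TO_HALF, HALF_TO_FULL, full2half_table, and half2full_table are hypothetical, and Python 3 is assumed.

```python
OFFSET = 0xFEE0  # 65248

# Code point -> code point for U+FF01..U+FF5E, plus the full-width space.
FULL_TO_HALF = {code: code - OFFSET for code in range(0xFF01, 0xFF5F)}
FULL_TO_HALF[0x3000] = 0x20

# Inverse table for the opposite direction.
HALF_TO_FULL = {half: full for full, half in FULL_TO_HALF.items()}


def full2half_table(s):
    """Full-width to half-width; characters not in the table pass through."""
    return s.translate(FULL_TO_HALF)


def half2full_table(s):
    """Half-width to full-width; characters not in the table pass through."""
    return s.translate(HALF_TO_FULL)


# Full-width "ABC", a full-width space, then full-width "123".
print(full2half_table("\uff21\uff22\uff23\u3000\uff11\uff12\uff13"))  # ABC 123
```

Because the tables are built once at import time, each call is a single translate() over the string.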

In real scenarios you may need selective conversion, e.g., converting letters and numbers to half‑width while keeping punctuation full‑width. This can be achieved with custom mapping dictionaries:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

FH_SPACE = ((u"　", u" "),)
FH_NUM = ((u"０", u"0"), (u"１", u"1"), (u"２", u"2"), (u"３", u"3"), (u"４", u"4"),
          (u"５", u"5"), (u"６", u"6"), (u"７", u"7"), (u"８", u"8"), (u"９", u"9"))
FH_ALPHA = ((u"ａ", u"a"), (u"ｂ", u"b"), (u"ｃ", u"c"), (u"ｄ", u"d"), (u"ｅ", u"e"),
            (u"ｆ", u"f"), (u"ｇ", u"g"), (u"ｈ", u"h"), (u"ｉ", u"i"), (u"ｊ", u"j"),
            (u"ｋ", u"k"), (u"ｌ", u"l"), (u"ｍ", u"m"), (u"ｎ", u"n"), (u"ｏ", u"o"),
            (u"ｐ", u"p"), (u"ｑ", u"q"), (u"ｒ", u"r"), (u"ｓ", u"s"), (u"ｔ", u"t"),
            (u"ｕ", u"u"), (u"ｖ", u"v"), (u"ｗ", u"w"), (u"ｘ", u"x"), (u"ｙ", u"y"), (u"ｚ", u"z"),
            (u"Ａ", u"A"), (u"Ｂ", u"B"), (u"Ｃ", u"C"), (u"Ｄ", u"D"), (u"Ｅ", u"E"),
            (u"Ｆ", u"F"), (u"Ｇ", u"G"), (u"Ｈ", u"H"), (u"Ｉ", u"I"), (u"Ｊ", u"J"),
            (u"Ｋ", u"K"), (u"Ｌ", u"L"), (u"Ｍ", u"M"), (u"Ｎ", u"N"), (u"Ｏ", u"O"),
            (u"Ｐ", u"P"), (u"Ｑ", u"Q"), (u"Ｒ", u"R"), (u"Ｓ", u"S"), (u"Ｔ", u"T"),
            (u"Ｕ", u"U"), (u"Ｖ", u"V"), (u"Ｗ", u"W"), (u"Ｘ", u"X"), (u"Ｙ", u"Y"), (u"Ｚ", u"Z"))
FH_PUNCTUATION = ((u"．", u"."), (u"，", u","), (u"！", u"!"), (u"？", u"?"), (u"”", u'"'),
                  (u"’", u"'"), (u"‘", u"`"), (u"＠", u"@"), (u"＿", u"_"), (u"：", u":"),
                  (u"；", u";"), (u"＃", u"#"), (u"＄", u"$"), (u"％", u"%"), (u"＆", u"&"),
                  (u"（", u"("), (u"）", u")"), (u"‐", u"-"), (u"＝", u"="), (u"＊", u"*"),
                  (u"＋", u"+"), (u"－", u"-"), (u"／", u"/"), (u"＜", u"<"), (u"＞", u">"),
                  (u"［", u"["), (u"￥", u"\\"), (u"］", u"]"), (u"＾", u"^"), (u"｛", u"{"),
                  (u"｜", u"|"), (u"｝", u"}"), (u"～", u"~"))

FH_ASCII = lambda: ((fr, to) for m in (FH_ALPHA, FH_NUM, FH_PUNCTUATION) for fr, to in m)

def convert(text, *maps, **ops):
    """Full‑width / half‑width conversion.
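    Args:
        text: string to convert
        maps: conversion maps, applied in order; each may be an iterable of
            (from, to) pairs, a dict, or a callable returning such pairs
        skip: keyword argument; source characters that must not be replaced
            (tuple or string)
    Returns:
        the converted string
    """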
    if "skip" in ops:
        skip = ops["skip"]
        if isinstance(skip, str):  # Python 3; Python 2 used basestring here
            skip = tuple(skip)
        def replace(t, fr, to):
            return t if fr in skip else t.replace(fr, to)
    else:
        def replace(t, fr, to):
            return t.replace(fr, to)
    for m in maps:
        if callable(m):
            m = m()
        elif isinstance(m, dict):
            m = m.items()
        for fr, to in m:
            text = replace(text, fr, to)
    return text

if __name__ == '__main__':
    text = u"成田空港—【ＪＲ特急成田エクスプレス号・横浜行，2站】—東京—【ＪＲ新幹線はやぶさ号・新青森行,6站 】—新青森—【ＪＲ特急スーパー白鳥号・函館行，4站 】—函館"
    print(convert(text, FH_ASCII, {u"【": u"[", u"】": u"]", u",": u"，", u".": u"。", u"?": u"？", u"!": u"！"}, skip="，。？！“”"))
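When the characters to convert are plain letters and digits, the table does not have to be typed out by hand: it can be generated from the 0xFEE0 offset. The sketch below assumes Python 3 and uses hypothetical names (make_table, ALNUM_TABLE, alnum_to_half). It covers only offset‑based pairs, so mappings such as curly quotes or 【 and 】 still need explicit entries like those in the script above.

```python
import string

OFFSET = 0xFEE0  # 65248


def make_table(chars):
    """Map the full-width form of each character in chars to that character."""
    return {ord(c) + OFFSET: ord(c) for c in chars}


# Letters and digits only; punctuation is deliberately left out,
# so full-width punctuation passes through unchanged.
ALNUM_TABLE = make_table(string.ascii_letters + string.digits)


def alnum_to_half(text):
    """Convert full-width letters and digits, leaving everything else as is."""
    return text.translate(ALNUM_TABLE)


# Full-width "JR123" followed by a full-width comma and exclamation mark.
sample = "\uff2a\uff32\uff11\uff12\uff13\uff0c\uff01"
result = alnum_to_half(sample)
print(result.startswith("JR123"))          # True
print(result.endswith("\uff0c\uff01"))     # True: punctuation stays full-width
```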

Note: The half‑width (ASCII) quotation marks have no separate opening and closing forms, so converting curly quotes to half‑width discards that distinction and cannot be reversed automatically.
