Understanding Character Encoding: From ASCII to Unicode and UTF‑8

This article explains the fundamentals of character encoding, covering the evolution from the 7‑bit ASCII standard to Chinese GB2312, the development of Unicode and UTF‑8, and provides practical guidance for handling these encodings in Windows and Linux C programs, including a sample UTF‑8 detection function.

ITPUB

Sep 19, 2016

A byte (8 bits) is the basic unit of computer storage, while a character is the basic unit of text; for example, 'A' and '汉' are both characters.

0. Concept – Early computers needed a way to represent characters as byte values. The American Standards Association (the predecessor of ANSI) published ASCII, a 7‑bit code that maps 128 values (0 to 127) to English letters, digits, punctuation, and control characters. When an ASCII character is stored in an 8‑bit byte, the highest bit is always 0.

1. Chinese character encoding – Chinese needs far more symbols than a single byte can distinguish, so each character takes at least two bytes. GB2312 encodes each Chinese character as two bytes, both with the highest bit set to 1 (each byte lies in the range 0xA1 to 0xFE). This keeps it compatible with ASCII, whose bytes always have the highest bit clear, and it never produces the zero byte that C uses as a string terminator. GB2312 covers 6,763 Chinese characters, enough for common use.

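Later encodings such as GBK and GB18030 (which extend GB2312) and Taiwan's BIG5 (a separate standard for traditional characters) enlarged the repertoire. All of them still store ASCII characters as single bytes and avoid zero bytes. Unlike GB2312, however, the second byte of a character may fall in the ASCII range (for example 0x5C, the backslash), so in these encodings a byte below 0x80 is not always an ASCII character.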
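The high-bit rule is what lets a program walk a GB2312 buffer without a lookup table. The following is a minimal sketch, assuming well-formed GB2312 input; the function name gb2312_char_count() is hypothetical and not part of any library, and the approach does not carry over unchanged to GBK or BIG5.

```c
#include <stddef.h>

/* Count characters in a NUL-terminated GB2312 string.
 * Assumption: a byte with the high bit set is the first half of a
 * two-byte character; any other byte is a single ASCII character. */
size_t gb2312_char_count(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t count = 0;

    while (*p != 0) {
        if (*p & 0x80) {
            /* Lead byte: stop if the second byte is missing or does not
             * have its high bit set (truncated or malformed input). */
            if ((p[1] & 0x80) == 0)
                break;
            p += 2;
        } else {
            p += 1;
        }
        count++;
    }
    return count;
}
```

For the bytes 41 D6 D0 CE C4 ("A中文" in GB2312) this counts three characters, whereas strlen() reports five bytes.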

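2. Unicode – To enable universal text exchange, Unicode assigns a unique code point to every character. Its fixed‑width forms are UCS‑2 (2 bytes per character) and UCS‑4 (4 bytes per character); UCS‑2 can represent 65,536 code points, which covers most characters in common use. The first 128 Unicode code points are identical to ASCII, so an ASCII value becomes a UCS‑2 or UCS‑4 value by padding it with zero bytes. Those zero bytes are exactly what C strings cannot hold, which is one reason UTF‑8 exists: it is a variable‑length encoding (1 to 4 bytes per code point) for storage and transmission that leaves ASCII text byte‑for‑byte unchanged and never produces a zero byte except for the code point U+0000 itself.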
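To make the variable-length scheme concrete, here is a short sketch of how one code point is turned into UTF‑8 bytes. The function name utf8_encode() is hypothetical, chosen for illustration; the bit layout follows the standard UTF‑8 definition.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-8.
 * Writes 1 to 4 bytes into out and returns the number written,
 * or 0 if cp is a surrogate or lies above U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                        /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                       /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                     /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                       /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* 11110xxx + three 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For example, 'A' (U+0041) stays the single byte 41, while '汉' (U+6C49) becomes the three bytes E6 B1 89. Every byte of a multibyte sequence has its high bit set, so none of them can be mistaken for ASCII or for a terminating zero.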

3. Programming considerations

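Windows NT and later kernels use Unicode (UTF‑16) internally. Source files written in Visual Studio on a Chinese‑locale system are often saved in the system code page, so their narrow string literals hold GB2312/GBK bytes; the "A" (ANSI) variants of WinAPI functions convert such strings from the active code page to Unicode before handing them to the system, for example when printing.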

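Linux distributions typically use UTF‑8. A C string literal such as const char* pszText = "中文" contains UTF‑8 bytes when the source file itself is saved as UTF‑8, and the terminal expects UTF‑8 streams. The environment variable LANG (for example zh_CN.UTF-8) indicates the active locale and its encoding, unless it is overridden by LC_ALL or one of the other LC_* variables.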
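One way to see which bytes a literal really contains is to dump them in hex. The sketch below uses a hypothetical helper, hex_dump(), and spells "中文" with hex escapes so that the result does not depend on how the example's own source file is saved.

```c
#include <stddef.h>
#include <stdio.h>

/* Write the bytes of s into out as space-separated uppercase hex.
 * Returns the number of bytes dumped; stops early if out is too small. */
size_t hex_dump(const char *s, char *out, size_t out_size)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t used = 0;  /* characters written to out so far */
    size_t n = 0;     /* bytes of s dumped so far */

    if (out_size == 0)
        return 0;
    out[0] = '\0';

    /* Each step writes at most 3 characters plus the terminator. */
    while (*p != 0 && used + 4 <= out_size) {
        used += (size_t)snprintf(out + used, out_size - used,
                                 n ? " %02X" : "%02X", (unsigned)*p);
        p++;
        n++;
    }
    return n;
}

/* "中文" as UTF-8: U+4E2D -> E4 B8 AD, U+6587 -> E6 96 87 */
static const char kUtf8Sample[] = "\xE4\xB8\xAD\xE6\x96\x87";
```

Calling hex_dump(kUtf8Sample, buf, sizeof buf) fills buf with "E4 B8 AD E6 96 87": two characters, six bytes. Passing a plain "中文" literal instead should give the same bytes if the source file is saved as UTF‑8, and different ones (D6 D0 CE C4) if it is saved as GB2312.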

Relevant conversion functions:

Windows: MultiByteToWideChar() and WideCharToMultiByte()

Linux: iconv() family in glibc

Standard C: mbstowcs(), wcstombs() (locale‑dependent)

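Note that the size of wchar_t varies: 2 bytes with MSVC on Windows (where it holds UTF‑16 code units) and 4 bytes with GCC on Linux. Code that assumes one size will not be portable to the other platform.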
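As an illustration of the iconv() route, the following sketch converts a UTF‑8 string to UTF‑16LE. The wrapper name utf8_to_utf16le() is hypothetical; iconv_open(), iconv(), and iconv_close() are the standard POSIX calls. It assumes the platform's iconv supports the names "UTF-8" and "UTF-16LE", as glibc does, and on some systems the program must be linked with -liconv.

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Convert a NUL-terminated UTF-8 string to UTF-16LE.
 * Returns the number of bytes written to out, or (size_t)-1 on failure
 * (conversion not supported, invalid input, or out too small). */
size_t utf8_to_utf16le(const char *src, char *out, size_t out_size)
{
    /* Note the argument order: target encoding first, source second. */
    iconv_t cd = iconv_open("UTF-16LE", "UTF-8");
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    /* iconv() takes char ** for the input but does not modify the bytes. */
    char *in_ptr = (char *)src;
    size_t in_left = strlen(src);
    char *out_ptr = out;
    size_t out_left = out_size;

    size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(cd);

    if (rc == (size_t)-1)
        return (size_t)-1;
    return out_size - out_left;
}
```

Converting the UTF‑8 bytes 41 E4 B8 AD ("A中") should produce the four bytes 41 00 2D 4E. The result contains a zero byte, so it must be handled by length rather than as a C string.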

4. Encoding detection – There is no fully reliable way to determine the encoding of an unknown byte sequence. A byte order mark (FF FE for UTF‑16 little‑endian, FE FF for UTF‑16 big‑endian, EF BB BF for UTF‑8) is only a hint, since many files carry none. A practical heuristic is to check whether the bytes follow valid UTF‑8 patterns.
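The BOM check on its own is only a few comparisons. Here is a minimal sketch; the enum and the function name detect_bom() are hypothetical, and a result of BOM_NONE says nothing about the actual encoding.

```c
#include <stddef.h>

enum bom_kind { BOM_NONE = 0, BOM_UTF8, BOM_UTF16_LE, BOM_UTF16_BE };

/* Inspect the first bytes of a buffer for a byte order mark.
 * Limitation: FF FE 00 00 (the UTF-32 little-endian BOM) is reported
 * as UTF-16 little-endian, because this sketch does not look further. */
enum bom_kind detect_bom(const unsigned char *buf, size_t len)
{
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return BOM_UTF8;
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return BOM_UTF16_LE;
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return BOM_UTF16_BE;
    return BOM_NONE;
}
```

Taking an explicit length rather than relying on a terminating zero matters here, because UTF‑16 text normally contains zero bytes.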

Below is a C function that guesses whether a NUL‑terminated text buffer is UTF‑8. It returns 0 if the data conforms to UTF‑8 byte patterns, -1 for an illegal leading byte, and -2 for an illegal continuation byte.

// Return values:
//  0 -> input string follows UTF-8 rules, may be UTF-8
// -1 -> illegal UTF-8 leading byte detected
// -2 -> illegal or missing UTF-8 continuation byte detected

int IsTextMaybeUTF8(const char* pszSrc)
{
    // Must be unsigned so that the shifts below see values 0..255
    const unsigned char* pCur = (const unsigned char*)pszSrc;
    int nBytes = 0; // continuation bytes still expected for the current character

    // A UTF-8 BOM (EF BB BF) is taken as sufficient evidence. The && chain
    // short-circuits, so it never reads past the terminating 0.
    if (pCur[0] == 0xEF && pCur[1] == 0xBB && pCur[2] == 0xBF)
    {
        return 0;
    }

    while (pCur[0] != 0)
    {
        if (nBytes == 0)
        {
            // Leading byte: its high bits give the length of the sequence
            if ((pCur[0] >> 7) == 0) { nBytes = 0; }          // 0xxxxxxx: ASCII
            else if ((pCur[0] >> 5) == 0x06) { nBytes = 1; }  // 110xxxxx: 2-byte sequence
            else if ((pCur[0] >> 4) == 0x0E) { nBytes = 2; }  // 1110xxxx: 3-byte sequence
            else if ((pCur[0] >> 3) == 0x1E) { nBytes = 3; }  // 11110xxx: 4-byte sequence
            else { return -1; }
        }
        else
        {
            // continuation byte must start with bits 10xxxxxx
            if ((pCur[0] >> 6) != 0x02) { return -2; }
            nBytes--;
        }
        pCur++;
    }
    // If the text ended in the middle of a multibyte sequence, report it
    return (nBytes == 0) ? 0 : -2;
}
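A check based only on leading-bit patterns accepts some sequences that the UTF‑8 definition forbids: overlong forms such as C0 80, surrogate code points (U+D800 to U+DFFF), and values above U+10FFFF. The following is a sketch of a stricter variant that decodes each code point and rejects those cases. The name utf8_is_strictly_valid() is hypothetical, and the function is offered as one possible refinement rather than as part of the original routine.

```c
#include <stdint.h>

/* Returns 1 if s is well-formed UTF-8 under the stricter rules, else 0. */
int utf8_is_strictly_valid(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;

    while (*p != 0) {
        uint32_t cp;    /* code point being decoded */
        uint32_t min;   /* smallest code point allowed for this length */
        int extra;      /* continuation bytes still to read */

        if (*p < 0x80)                { cp = *p;        extra = 0; min = 0; }
        else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; min = 0x80; }
        else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; min = 0x800; }
        else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; min = 0x10000; }
        else return 0;  /* stray continuation byte or invalid leading byte */
        p++;

        while (extra-- > 0) {
            /* This test also fails on the terminating 0, so a truncated
             * sequence is rejected without reading past the end. */
            if ((*p & 0xC0) != 0x80)
                return 0;
            cp = (cp << 6) | (uint32_t)(*p & 0x3F);
            p++;
        }

        if (cp < min) return 0;                      /* overlong form */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* surrogate */
        if (cp > 0x10FFFF) return 0;                 /* beyond Unicode */
    }
    return 1;
}
```

Even a strict pass only shows that the bytes could be UTF‑8. Pure ASCII text passes too, and it is equally valid in GB2312 and most other encodings discussed here.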
