Why Java’s char Can’t Represent All Unicode Characters – Understanding UTF‑16 and Code Points

This article explains how Java's char type stores UTF-16 code units, why its range of \u0000 to \uffff cannot directly represent newer Unicode characters, and how String.length, getBytes, and the code-point APIs behave with supplementary characters such as emojis and rare Chinese glyphs.

Programmer DD

Jul 22, 2020

According to the Java documentation, a char holds a single UTF-16 code unit: its minimum value is \u0000 (0) and its maximum value is \uffff (65,535), so each char occupies two bytes. A Unicode character whose code point lies above \uffff therefore cannot fit in a single char.

char : The char data type is a single 16‑bit Unicode character. It has a minimum value of '\u0000' (or 0) and a maximum value of '\uffff' (or 65,535 inclusive). from The Java™ Tutorials
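To make the 16-bit limit concrete, here is a minimal sketch. The class name CharRangeDemo and the helper charsNeeded are made up for illustration; the calls themselves are standard java.lang.Character members. It shows that a code point above \uffff needs two char values, and that a plain cast to char silently drops the upper bits.

```java
public class CharRangeDemo {
    // Hypothetical helper: how many 16-bit char values one code point needs.
    public static int charsNeeded(int codePoint) {
        return Character.charCount(codePoint);
    }

    public static void main(String[] args) {
        System.out.println((int) Character.MIN_VALUE); // 0
        System.out.println((int) Character.MAX_VALUE); // 65535
        System.out.println(Character.BYTES);           // 2 (bytes per char)

        int boy = 0x1F466; // code point of the BOY emoji, outside the char range

        // A cast keeps only the low 16 bits, so the value is corrupted.
        char truncated = (char) boy;
        System.out.println(Integer.toHexString(truncated)); // f466

        // Character.toChars produces the correct UTF-16 code units instead.
        char[] units = Character.toChars(boy);
        System.out.println(units.length);        // 2
        System.out.println(charsNeeded(0x4F60)); // 1 (a BMP character)
    }
}
```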

Example code demonstrates the differences between String.length(), String.getBytes(), and String.toCharArray() for Chinese characters, emojis, and rare Chinese glyphs:

public class Main {
    public static void main(String[] args) {
        // Chinese common characters
        String s = "你好";
        System.out.println("1. string length =" + s.length());
        System.out.println("1. string bytes length =" + s.getBytes().length);
        System.out.println("1. string char length =" + s.toCharArray().length);
        System.out.println();
        // emojis
        s = "👦👩";
        System.out.println("2. string length =" + s.length());
        System.out.println("2. string bytes length =" + s.getBytes().length);
        System.out.println("2. string char length =" + s.toCharArray().length);
        System.out.println();
        // Rare Chinese character
        s = "𡃁妹";
        System.out.println("3. string length =" + s.length());
        System.out.println("3. string bytes length =" + s.getBytes().length);
        System.out.println("3. string char length =" + s.toCharArray().length);
        System.out.println();
    }
}

1. string length =2
1. string bytes length =6
1. string char length =2
2. string length =4
2. string bytes length =8
2. string char length =4
3. string length =3
3. string bytes length =7
3. string char length =3

String.getBytes() with no argument encodes the string using the platform default charset. On macOS that default is UTF-8 (and since JDK 18, UTF-8 is the default on every platform), so the byte lengths above reflect UTF-8 encoding: each common Chinese character takes 3 bytes, each emoji takes 4 bytes, and the third string takes 4 bytes for the supplementary character plus 3 bytes for the BMP character.

String.length() returns the number of UTF-16 code units, not the number of perceived characters. For characters within the Basic Multilingual Plane (BMP) the two counts match, but each supplementary character (many emojis, for example) is stored as a surrogate pair and therefore counts as two. String.toCharArray() returns one char per code unit, which is why its length always equals String.length().
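Because the no-argument getBytes() depends on the runtime's default charset, the byte counts can differ between machines. The sketch below passes the charset explicitly so the result is the same everywhere; the class name ByteLengthDemo and its two helpers are illustrative names, not part of the original example. UTF_16BE is used rather than UTF_16 because the latter prepends a two-byte byte-order mark when encoding.

```java
import java.nio.charset.StandardCharsets;

public class ByteLengthDemo {
    // Hypothetical helper: encoded size in UTF-8, independent of the platform default.
    public static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Hypothetical helper: encoded size in UTF-16 (big-endian, no byte-order mark).
    public static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String chinese = "\u4F60\u597D";           // two BMP characters (U+4F60, U+597D)
        String emojis = "\uD83D\uDC66\uD83D\uDC69"; // BOY (U+1F466) and WOMAN (U+1F469)

        System.out.println(utf8Length(chinese));  // 6: three bytes per character
        System.out.println(utf16Length(chinese)); // 4: one code unit per character
        System.out.println(utf8Length(emojis));   // 8: four bytes per emoji
        System.out.println(utf16Length(emojis));  // 8: one surrogate pair per emoji
    }
}
```

Note that the UTF-16 byte length is always twice String.length(), since each code unit is two bytes.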

Historically, char was designed when every Unicode character fit within 16 bits (Unicode 1.1.5 through 3.0). Once Unicode grew beyond 65,536 code points, new characters were assigned to the supplementary planes (U+10000 to U+10FFFF), which Java has supported since it adopted Unicode 4.0 in Java 5. UTF-16 encodes each supplementary character in four bytes, stored as two 16-bit code units called a surrogate pair. The range U+D800–U+DFFF in the BMP is reserved for surrogates (U+D800–U+DBFF for the high half, U+DC00–U+DFFF for the low half), so a surrogate code unit can never be confused with an ordinary BMP character.
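The surrogate-pair encoding is simple arithmetic: subtract 0x10000 from the code point, put the upper 10 bits of the result into the high surrogate and the lower 10 bits into the low surrogate. The sketch below spells this out by hand for illustration only; the class name SurrogateMath and its encode and decode methods are hypothetical, and in real code the built-in Character.toChars and Character.toCodePoint do the same job.

```java
public class SurrogateMath {
    // Hypothetical helper: split a supplementary code point into {high, low}.
    public static char[] encode(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;              // 20-bit value
        char high = (char) (0xD800 + (offset >> 10));  // upper 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF)); // lower 10 bits
        return new char[] { high, low };
    }

    // Hypothetical helper: the inverse of encode.
    public static int decode(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        char[] pair = encode(0x1F466); // BOY emoji
        System.out.println(String.format("%04X %04X", (int) pair[0], (int) pair[1])); // D83D DC66

        // The hand-written arithmetic agrees with the standard library.
        System.out.println(pair[0] == Character.highSurrogate(0x1F466));               // true
        System.out.println(decode(pair[0], pair[1]) == Character.toCodePoint(pair[0], pair[1])); // true
    }
}
```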

Therefore, String.length() may return a value larger than the visual character count for strings containing characters outside the BMP. To work with actual Unicode code points, Java 1.5 introduced methods such as codePointAt(int index), codePointBefore(int index), and codePointCount(int beginIndex, int endIndex), which operate on code points rather than code units.
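A short sketch of these methods on the emoji string from the earlier example follows. The class name CodePointDemo and the helper codePointLength are illustrative; codePoints() is a later addition (Java 8) that is assumed to be available here.

```java
public class CodePointDemo {
    // Hypothetical helper: number of code points in the whole string.
    public static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String s = "\uD83D\uDC66\uD83D\uDC69"; // BOY (U+1F466) and WOMAN (U+1F469)

        System.out.println(s.length());         // 4 code units
        System.out.println(codePointLength(s)); // 2 code points

        // Index 0 is a high surrogate followed by a low one: the full code point is returned.
        System.out.println(Integer.toHexString(s.codePointAt(0)));     // 1f466
        // Index 1 lands on the low surrogate alone, so only that code unit is returned.
        System.out.println(Integer.toHexString(s.codePointAt(1)));     // dc66
        // codePointBefore(2) looks backwards and reassembles the pair at indexes 0 and 1.
        System.out.println(Integer.toHexString(s.codePointBefore(2))); // 1f466

        // Iterate by code point instead of by char (Java 8 and later).
        s.codePoints().forEach(cp -> System.out.println(String.format("U+%X", cp)));
    }
}
```

The indexes passed to codePointAt and codePointBefore are still code-unit indexes; only the returned value is a full code point.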

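When using String.substring, be aware that the index parameters are code-unit positions. If a boundary falls between the two halves of a surrogate pair, the result contains a lone surrogate, which is not a valid character on its own and typically renders as "?" or a replacement glyph.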
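One way to avoid splitting a pair is to translate code-point positions into code-unit positions with String.offsetByCodePoints before calling substring. The following is a minimal sketch of that idea; the class SafeSubstring and the method substringByCodePoints are hypothetical names, not a JDK API.

```java
public class SafeSubstring {
    // Hypothetical helper: substring whose bounds are counted in code points.
    public static String substringByCodePoints(String s, int beginCp, int endCp) {
        int begin = s.offsetByCodePoints(0, beginCp);             // code-unit index of the start
        int end = s.offsetByCodePoints(begin, endCp - beginCp);   // code-unit index of the end
        return s.substring(begin, end);
    }

    public static void main(String[] args) {
        String s = "\uD83D\uDC66\uD83D\uDC69"; // BOY (U+1F466) and WOMAN (U+1F469)

        // Cutting at code-unit index 1 splits the first surrogate pair.
        String broken = s.substring(0, 1);
        System.out.println(Character.isHighSurrogate(broken.charAt(0))); // true: a lone surrogate

        // Counting in code points keeps the pair intact.
        String first = substringByCodePoints(s, 0, 1);
        System.out.println(first.length());                            // 2 code units
        System.out.println(Integer.toHexString(first.codePointAt(0))); // 1f466
    }
}
```

Like substring itself, offsetByCodePoints throws IndexOutOfBoundsException when the requested position lies outside the string, so the helper inherits that behavior.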
