
Why Java’s char Can’t Represent All Unicode Characters – Code Units vs. Code Points

This article explains how Java stores characters as UTF‑16 code units, why the char type cannot cover the entire Unicode range, how surrogate pairs work, and demonstrates the differences in length, byte length, and char array size for regular Chinese characters, emojis, and rare Chinese glyphs.

Senior Brother's Insights

Java’s char type is defined as a 16‑bit UTF‑16 code unit. That was historically sufficient, because early Unicode versions were designed to fit within 65,536 (2^16) values. Unicode has since expanded beyond the Basic Multilingual Plane (BMP, U+0000 to U+FFFF) to over 140,000 characters. Every character outside the BMP, including most emojis and many rare Chinese glyphs, takes two code units (four bytes) in UTF‑16, so a single char cannot hold it.
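As a quick check of those limits, the sketch below compares the largest char value with the largest code point and asks the JDK how many char values a given code point needs. The class name CharRangeDemo and the helper unitsFor are made up for this illustration; the Character constants and methods are standard JDK APIs.

```java
class CharRangeDemo {
    // Number of char values (UTF-16 code units) needed to store one code point.
    static int unitsFor(int codePoint) {
        return Character.charCount(codePoint);
    }

    public static void main(String[] args) {
        System.out.println((int) Character.MAX_VALUE);   // 65535, the largest char
        System.out.println(Character.MAX_CODE_POINT);    // 1114111, i.e. U+10FFFF
        System.out.println(unitsFor(0x4F60));            // 1: U+4F60 is in the BMP
        System.out.println(unitsFor(0x1F466));           // 2: U+1F466 (boy emoji) is outside the BMP
        System.out.println(Character.isSupplementaryCodePoint(0x1F466)); // true
    }
}
```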

Example Program

public class Main {
    public static void main(String[] args) {
        // Common Chinese characters (both in the BMP)
        String s = "你好";
        System.out.println("1. string length = " + s.length());
        // getBytes() with no argument uses the platform default charset
        System.out.println("1. string bytes length = " + s.getBytes().length);
        System.out.println("1. string char length = " + s.toCharArray().length);
        System.out.println();
        // Emojis (both outside the BMP)
        s = "👦👩";
        System.out.println("2. string length = " + s.length());
        System.out.println("2. string bytes length = " + s.getBytes().length);
        System.out.println("2. string char length = " + s.toCharArray().length);
        System.out.println();
        // A rare Chinese character (outside the BMP) followed by a common one
        s = "𡃁妹";
        System.out.println("3. string length = " + s.length());
        System.out.println("3. string bytes length = " + s.getBytes().length);
        System.out.println("3. string char length = " + s.toCharArray().length);
    }
}

Running the program on macOS (default UTF‑8) produces the following results:

1. string length = 2
1. string bytes length = 6
1. string char length = 2

2. string length = 4
2. string bytes length = 8
2. string char length = 4

3. string length = 3
3. string bytes length = 7
3. string char length = 3

Why the Lengths Differ

String.length() returns the number of UTF‑16 code units, not the number of Unicode code points. For characters in the BMP (e.g., most Chinese characters), one code unit equals one code point, so the length matches the visual character count. For characters outside the BMP, such as emojis or rare Chinese glyphs, Java uses a surrogate pair: two 16‑bit code units represent a single code point. Consequently, length() reports twice the actual character count for those symbols. The same applies to toCharArray(), which returns one array element per code unit.
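To get the number of code points rather than code units, codePointCount can be used over the whole string. The following is a minimal sketch; the class CodePointCountDemo and the helper countCodePoints are names invented for this example, and the emoji are built from their code points so the result does not depend on source file encoding.

```java
class CodePointCountDemo {
    // Counts Unicode code points, treating each surrogate pair as one.
    static int countCodePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+1F466 (boy) followed by U+1F469 (woman)
        String emoji = new String(Character.toChars(0x1F466))
                     + new String(Character.toChars(0x1F469));
        System.out.println(emoji.length());          // 4 code units
        System.out.println(countCodePoints(emoji));  // 2 code points
    }
}
```

Note that a code point count is still not always the number of characters a user perceives, since some symbols are composed of several code points.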

Unicode Basics

Code Point: The abstract numeric value assigned to a Unicode character (U+0000 … U+10FFFF).

Code Unit: The fixed‑size unit an encoding uses to store code points. UTF‑8 uses 8‑bit units and UTF‑16 uses 16‑bit units; a single code point is stored as one or more code units.
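The relationship between the two terms can be sketched by counting how many code units one code point takes in each encoding. The class CodeUnitDemo and its two helpers are hypothetical names for this illustration; the charset is passed explicitly so the result does not depend on the platform default.

```java
import java.nio.charset.StandardCharsets;

class CodeUnitDemo {
    // UTF-8 code units are bytes, so the encoded byte count is the unit count.
    static int utf8Units(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Java strings are UTF-16 internally, so length() is the UTF-16 unit count.
    static int utf16Units(int codePoint) {
        return new String(Character.toChars(codePoint)).length();
    }

    public static void main(String[] args) {
        int[] samples = {0x41, 0x4F60, 0x1F466}; // 'A', a BMP Chinese character, an emoji
        for (int cp : samples) {
            System.out.printf("U+%04X: UTF-8 units = %d, UTF-16 units = %d%n",
                    cp, utf8Units(cp), utf16Units(cp));
        }
    }
}
```

This matches the byte lengths in the program output above: 3 bytes per BMP Chinese character and 4 bytes per emoji or rare glyph in UTF‑8.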

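In UTF‑16, the range U+D800–U+DFFF is reserved for surrogates, and no characters are assigned there. To encode a code point above U+FFFF, 0x10000 is first subtracted from it, leaving a 20‑bit value. The leading (high) surrogate, in the range 0xD800–0xDBFF, holds the upper ten bits of that value, and the trailing (low) surrogate, in the range 0xDC00–0xDFFF, holds the lower ten bits. Because the two ranges do not overlap with each other or with any assigned BMP character, a surrogate can never be mistaken for an ordinary character.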
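The surrogate arithmetic can be written out by hand, as a sketch of standard UTF‑16 encoding for code points above U+FFFF. The class SurrogateMath and its methods are illustrative names only; in real code the JDK's Character.highSurrogate, Character.lowSurrogate, and Character.toCodePoint do the same job.

```java
class SurrogateMath {
    // Splits a supplementary code point (U+10000..U+10FFFF) into a surrogate pair.
    static char[] toSurrogates(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;                 // 20-bit value
        char high = (char) (0xD800 + (offset >> 10));     // upper ten bits
        char low  = (char) (0xDC00 + (offset & 0x3FF));   // lower ten bits
        return new char[] {high, low};
    }

    // Recombines a surrogate pair into the original code point.
    static int fromSurrogates(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x1F466);
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]); // D83D DC66
        System.out.printf("U+%X%n", fromSurrogates(pair[0], pair[1]));  // U+1F466
    }
}
```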

Practical Implications

Because char and String.length() operate on code units, index‑based operations such as substring and charAt can cut a surrogate pair in half, leaving an unpaired surrogate and therefore a malformed string. Java 5 (1.5) introduced code‑point‑aware methods such as codePointAt(int), codePointBefore(int), and codePointCount(int, int). They still take code‑unit indices as arguments, but they treat a surrogate pair as a single code point.
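One way to avoid cutting a pair in half is to convert a code point count into a code‑unit index with offsetByCodePoints before calling substring. This is only a sketch of that idea; the class SafeTruncate and the helper firstCodePoints are hypothetical names, not part of the JDK.

```java
class SafeTruncate {
    // Returns the first n code points of s without splitting a surrogate pair.
    static String firstCodePoints(String s, int n) {
        int total = s.codePointCount(0, s.length());
        int end = s.offsetByCodePoints(0, Math.min(n, total));
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        String boy = new String(Character.toChars(0x1F466));
        String s = boy + new String(Character.toChars(0x1F469));

        // Naive truncation by code unit leaves a lone high surrogate.
        String broken = s.substring(0, 1);
        System.out.println(Character.isHighSurrogate(broken.charAt(0))); // true

        // Truncation by code point keeps the whole first emoji.
        System.out.println(firstCodePoints(s, 1).equals(boy));           // true
    }
}
```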

When processing text that may contain characters outside the BMP (e.g., emojis, historic scripts, rare Chinese characters), developers should prefer these code‑point methods or use String.codePoints() streams to avoid truncation bugs.
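For iteration, a codePoints() stream visits each code point once, regardless of how many code units it occupies. The sketch below formats each one in U+XXXX notation; the class CodePointStreamDemo and the helper hexCodePoints are names chosen for this example.

```java
import java.util.List;
import java.util.stream.Collectors;

class CodePointStreamDemo {
    // Lists every code point in s in U+XXXX notation.
    static List<String> hexCodePoints(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String s = "A" + new String(Character.toChars(0x1F466));
        System.out.println(s.length());        // 3 code units
        System.out.println(hexCodePoints(s));  // [U+0041, U+1F466]
    }
}
```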

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Written by

Senior Brother's Insights

A public account focused on workplace, career growth, team management, and self-improvement. The author is the writer of books including 'SpringBoot Technology Insider' and 'Drools 8 Rule Engine: Core Technology and Practice'.
