Understanding Unicode Encoding (UTF-8, UTF-16, UTF-32) and Emoji Detection in Java

This article explains the Unicode standard, its code planes and ranges, the three UTF encoding forms (UTF-8, UTF-16, UTF-32), compares their storage characteristics, discusses byte order marks, and provides Java code for detecting emoji characters in strings.

Huajiao Technology

Apr 21, 2020

Unicode is an industry standard that assigns a unique number, called a code point, to every character it encodes. Code points span the range 0x0000‑0x10FFFF, which is divided into 17 planes of 65,536 code points each. Plane 0 is the Basic Multilingual Plane (BMP), containing most commonly used characters, while the other planes hold supplementary characters, including most emoji, and private‑use areas.
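Because every plane covers exactly 0x10000 code points, the plane number can be read directly from the bits above the low 16. The following is a minimal sketch of that arithmetic; the class and method names (UnicodePlanes, planeOf) are illustrative and not part of any library.

```java
final class UnicodePlanes {
    /** Returns the plane index (0 to 16) of a valid Unicode code point. */
    static int planeOf(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("Not a Unicode code point: " + codePoint);
        }
        // Each plane spans 0x10000 code points, so the plane is whatever sits above the low 16 bits.
        return codePoint >>> 16;
    }

    public static void main(String[] args) {
        System.out.println(planeOf(0x41));     // 'A' is in plane 0, the BMP
        System.out.println(planeOf(0x1F600));  // an emoji code point in plane 1
        System.out.println(planeOf(0x10FFFF)); // the last code point, in plane 16
    }
}
```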

The Unicode Transformation Formats (UTF) are three encoding forms for code points: UTF‑8, UTF‑16, and UTF‑32. UTF‑8 uses one to four bytes per code point, UTF‑16 uses one or two 16‑bit code units, and UTF‑32 uses a single fixed 4‑byte unit for every code point.

UTF‑8 encoding patterns:

Unicode range (hex)    UTF‑8 bytes (binary)
000000‑00007F          0xxxxxxx
000080‑0007FF          110xxxxx 10xxxxxx
000800‑00FFFF          1110xxxx 10xxxxxx 10xxxxxx
010000‑10FFFF          11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
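The bit patterns above translate almost line for line into shifts and masks. The sketch below encodes a single code point by hand so each row of the table is visible in code. In practice String.getBytes(StandardCharsets.UTF_8) does this for you; the class name Utf8Sketch is made up for illustration.

```java
final class Utf8Sketch {
    /** Encodes one code point to UTF-8, one branch per row of the pattern table. */
    static byte[] encode(int cp) {
        if (!Character.isValidCodePoint(cp) || (cp >= 0xD800 && cp <= 0xDFFF)) {
            // Surrogates are reserved for UTF-16 and have no UTF-8 form.
            throw new IllegalArgumentException("Cannot encode: " + Integer.toHexString(cp));
        }
        if (cp <= 0x7F) {            // 0xxxxxxx
            return new byte[] {(byte) cp};
        }
        if (cp <= 0x7FF) {           // 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F))};
        }
        if (cp <= 0xFFFF) {          // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))};
        }
        return new byte[] {          // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F))};
    }
}
```

For example, U+4E2D (a common Chinese character) falls in the third row and encodes to the three bytes E4 B8 AD.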

UTF‑16 uses one 16‑bit unit for code points below 0x10000 and a surrogate pair (two units) for higher code points. The surrogate range D800‑DFFF is reserved for this purpose, with D800‑DBFF as high surrogates and DC00‑DFFF as low surrogates.
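The surrogate pair for a supplementary code point is derived by subtracting 0x10000, which leaves a 20‑bit value, and splitting that value into two 10‑bit halves. Here is a small sketch of the calculation, assuming the input is above the BMP. The name SurrogateSketch is illustrative; the JDK already provides Character.toChars, Character.highSurrogate, and Character.lowSurrogate for real use.

```java
final class SurrogateSketch {
    /** Splits a supplementary code point (0x10000 to 0x10FFFF) into its UTF-16 surrogate pair. */
    static char[] toSurrogates(int cp) {
        if (cp < 0x10000 || cp > 0x10FFFF) {
            throw new IllegalArgumentException("Not a supplementary code point: " + cp);
        }
        int offset = cp - 0x10000;                      // 20-bit value
        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits, lands in D800-DBFF
        char low  = (char) (0xDC00 + (offset & 0x3FF)); // bottom 10 bits, lands in DC00-DFFF
        return new char[] {high, low};
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x1F600);
        // U+1F600 becomes the pair D83D DE00.
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]);
    }
}
```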

UTF‑32 stores each code point directly in four bytes, which is simple but wasteful for most text.

A practical comparison: 1,000 common Chinese characters occupy about 3 KB in UTF‑8 (3 bytes each) but only 2 KB in UTF‑16 (2 bytes each), while 1,000 ASCII characters occupy 1 KB in UTF‑8 and still 2 KB in UTF‑16. UTF‑8 is therefore more compact for ASCII‑heavy text, and UTF‑16 for text dominated by BMP characters above U+07FF.
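These figures are easy to check from Java. The sketch below measures the encoded size of a string in each form. It uses the UTF_16BE charset, which writes no byte order mark, so the count reflects only the text. The UTF‑32 size is computed as four bytes per code point rather than through a charset. The class and helper names are illustrative.

```java
import java.nio.charset.StandardCharsets;

final class EncodingSizes {
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    static int utf16Bytes(String s) {
        // The explicit big-endian variant does not prepend a BOM.
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    static int utf32Bytes(String s) {
        // UTF-32 is fixed width: four bytes for every code point.
        return 4 * s.codePointCount(0, s.length());
    }

    /** Repeats a string n times (a stand-in for String.repeat on older JDKs). */
    static String repeat(String unit, int n) {
        StringBuilder sb = new StringBuilder(unit.length() * n);
        for (int i = 0; i < n; i++) {
            sb.append(unit);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String chinese = repeat("\u4E2D", 1000); // U+4E2D, a BMP Chinese character
        String ascii = repeat("a", 1000);
        System.out.println(utf8Bytes(chinese) + " / " + utf16Bytes(chinese) + " / " + utf32Bytes(chinese));
        System.out.println(utf8Bytes(ascii) + " / " + utf16Bytes(ascii) + " / " + utf32Bytes(ascii));
    }
}
```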

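Byte order matters for UTF‑16 and UTF‑32 because their code units span more than one byte. Files may begin with a Byte Order Mark (BOM) that signals the order: FF FE for UTF‑16LE, FE FF for UTF‑16BE, etc. UTF‑8 has no byte‑order ambiguity, but some files still start with EF BB BF ("UTF‑8 with BOM") as an encoding signature.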
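A reader can sniff the leading bytes of a file to guess its encoding. The following is a rough sketch of that check. The UTF‑32 signatures (FF FE 00 00 and 00 00 FE FF) are not listed above and are added here from the Unicode standard. The class name BomSniffer and the returned labels are illustrative, and a null result only means that no BOM was found.

```java
final class BomSniffer {
    /** Returns the encoding implied by a leading BOM, or null if none is present. */
    static String detect(byte[] data) {
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) {
            return "UTF-8";
        }
        // UTF-32LE must be tested before UTF-16LE because both begin with FF FE.
        // (A UTF-16LE file whose first character is U+0000 would be misread here.)
        if (startsWith(data, 0xFF, 0xFE, 0x00, 0x00)) {
            return "UTF-32LE";
        }
        if (startsWith(data, 0x00, 0x00, 0xFE, 0xFF)) {
            return "UTF-32BE";
        }
        if (startsWith(data, 0xFF, 0xFE)) {
            return "UTF-16LE";
        }
        if (startsWith(data, 0xFE, 0xFF)) {
            return "UTF-16BE";
        }
        return null;
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            // Mask to compare as unsigned values, since Java bytes are signed.
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```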

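Below is Java code that checks whether a string contains emoji by testing each code point against a set of Unicode ranges. The ranges are coarse approximations: they also cover some non‑emoji characters, such as arrows, mathematical operators, and Japanese kana, so false positives are possible.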

public static boolean containsEmoji(String str) {
    int len = str.length();
    // Advance by code point rather than by char, so a surrogate pair is examined once as a whole.
    for (int i = 0; i < len; ) {
        int codePoint = str.codePointAt(i);
        if (isEmojiCharacterByWiki(codePoint)) {
            return true;
        }
        i += Character.charCount(codePoint);
    }
    return false;
}

/**
 * Determines whether a code point falls in one of the coarse ranges treated as emoji here.
 * The ranges are broad and also match non-emoji characters (for example, 0x3000-0x30FF
 * includes Hiragana and Katakana).
 */
private static boolean isEmojiCharacterByWiki(int codePoint) {
    return (codePoint >= 0x2070 && codePoint <= 0x2BFF)
        || (codePoint >= 0x3000 && codePoint <= 0x30FF)
        || (codePoint >= 0x3200 && codePoint <= 0x32FF)
        || (codePoint >= 0x1F000 && codePoint <= 0x1FA6F);
}
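As an alternative formulation, the same check can be written with String.codePoints() (Java 8 and later), which yields whole code points and so handles surrogate pairs without index arithmetic. This sketch reuses the four ranges from the article unchanged. The class name EmojiCheckSketch and the method inArticleRanges are made up for illustration. The last call in main shows the cost of the broad ranges: a Katakana letter is reported as emoji.

```java
final class EmojiCheckSketch {
    /** Returns true if any code point of the string falls in the article's ranges. */
    static boolean containsEmoji(String str) {
        return str.codePoints().anyMatch(EmojiCheckSketch::inArticleRanges);
    }

    /** Same four ranges as isEmojiCharacterByWiki; coarse, not an authoritative emoji list. */
    static boolean inArticleRanges(int cp) {
        return (cp >= 0x2070 && cp <= 0x2BFF)
            || (cp >= 0x3000 && cp <= 0x30FF)
            || (cp >= 0x3200 && cp <= 0x32FF)
            || (cp >= 0x1F000 && cp <= 0x1FA6F);
    }

    public static void main(String[] args) {
        System.out.println(containsEmoji("hello"));           // false
        System.out.println(containsEmoji("hi \uD83D\uDE00")); // true: U+1F600 is in the last range
        System.out.println(containsEmoji("\u30AB"));          // true: Katakana KA, a false positive
    }
}
```

If false positives like this matter, a narrower range list or a lookup driven by the Unicode emoji data files would be a more precise choice.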

References include the Unicode specification, Java language documentation, and online emoji tables.
