
Understanding Character Sets, Encodings, BOM, and Their Interaction in Java, Tomcat, and MySQL

This article explains the fundamentals of characters, character sets and encodings such as ASCII, GB2312, Unicode, UTF‑8/16/32, BOM, and demonstrates how Java, browsers, Tomcat and MySQL handle encoding conversion, endianness and string length issues with practical code examples.

Qunar Tech Salon

Basic Concepts

A character is any textual symbol; a character set is a collection of characters (e.g., ASCII, GB2312, BIG5, GB18030, Unicode). To store characters, computers need a character encoding that maps each character to a binary representation.

Character Encoding

Encoding converts characters from a character set into a sequence of bits, bytes or pulses so that text can be stored or transmitted.

ANSI

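ANSI is not a single encoding but a family of region‑specific extensions to ASCII. Windows calls the system's active legacy code page the “ANSI encoding”, for example GBK (code page 936) on Simplified Chinese systems and Windows‑1252 on Western European ones.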

Unicode

Unicode is an industry standard that defines a universal character set and several encoding schemes. It assigns a unique code point to every character. The key terms are:

Code point: the abstract numeric value assigned to a character, written as U+XXXX.

Code unit: the smallest fixed‑size unit an encoding uses to represent code points: 8 bits in UTF‑8, 16 bits in UTF‑16, and 32 bits in UTF‑32. One code point may need several code units.

Code space: the total range of possible code points, 0x0000‑0x10FFFF (1,114,112 values, about 1.1 million).

Code plane: Unicode divides the code space into 17 planes of 65,536 code points each; plane 0 is the Basic Multilingual Plane (BMP).

Surrogate pair: a pair of 16‑bit values (high surrogate 0xD800‑0xDBFF and low surrogate 0xDC00‑0xDFFF) used in UTF‑16 to encode code points above 0xFFFF.
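As a rough illustration of these terms, the sketch below uses the standard java.lang.Character helpers to report which plane a code point falls in and how many UTF‑16 code units it needs. The class name CodePointInfo and the plane() helper are hypothetical names chosen for this example.

```java
class CodePointInfo {
    /** Plane number (0..16) of a code point; each plane holds 0x10000 code points. */
    static int plane(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("outside the Unicode code space: " + codePoint);
        }
        return codePoint >>> 16;
    }

    public static void main(String[] args) {
        int han = 0x6C49;   // a CJK ideograph in the BMP
        int clef = 0x1D11E; // MUSICAL SYMBOL G CLEF, in plane 1

        // plane, then number of UTF-16 code units needed
        System.out.println(plane(han) + " " + Character.charCount(han));   // 0 1
        System.out.println(plane(clef) + " " + Character.charCount(clef)); // 1 2

        // The code space ends at 0x10FFFF, the last code point of plane 16.
        System.out.println(plane(Character.MAX_CODE_POINT));               // 16
    }
}
```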

UTF‑8

UTF‑8 uses a variable‑length encoding: 1 byte for 0x00‑0x7F, 2 bytes for 0x80‑0x7FF, 3 bytes for 0x800‑0xFFFF, and 4 bytes for 0x10000‑0x10FFFF. The maximum length is 4 bytes.

Unicode range (hex)    UTF‑8 bit pattern
000000‑00007F          0xxxxxxx
000080‑0007FF          110xxxxx 10xxxxxx
000800‑00FFFF          1110xxxx 10xxxxxx 10xxxxxx
010000‑10FFFF          11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Example: the Chinese character “汉” (U+6C49) is encoded as E6 B1 89 in UTF‑8.
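The bit patterns can be checked by hand. The following is a minimal sketch, not a production encoder: it packs a single code point into UTF‑8 bytes by following the four ranges above and compares the result with the JDK's own encoder. Utf8Encoder, encode() and hex() are hypothetical names for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8Encoder {
    /** Encodes one code point into UTF-8 bytes using the range-to-pattern table. */
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("not an encodable code point: " + cp);
        }
        if (cp <= 0x7F) {
            return new byte[] { (byte) cp };
        } else if (cp <= 0x7FF) {
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F)) };
        } else if (cp <= 0xFFFF) {
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        } else {
            return new byte[] {
                (byte) (0xF0 | (cp >> 18)),
                (byte) (0x80 | ((cp >> 12) & 0x3F)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
    }

    /** Formats bytes as space-separated uppercase hex. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] manual = encode(0x6C49);
        byte[] library = "\u6C49".getBytes(StandardCharsets.UTF_8);
        System.out.println(hex(manual));                    // E6 B1 89
        System.out.println(Arrays.equals(manual, library)); // true
    }
}
```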

UTF‑16

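UTF‑16 uses one 16‑bit code unit (2 bytes) for BMP code points and a surrogate pair (4 bytes) for code points ≥ 0x10000. To build the pair, subtract 0x10000 from the code point to get a 20‑bit value, then add the upper 10 bits to 0xD800 for the high surrogate and the lower 10 bits to 0xDC00 for the low surrogate.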
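The surrogate arithmetic is short enough to write out. The sketch below (the class and method names are hypothetical) subtracts 0x10000 from a supplementary code point, splits the remaining 20 bits into two 10‑bit halves, and adds them to 0xD800 and 0xDC00. It then checks the result against Character.toChars and Character.toCodePoint from the JDK.

```java
import java.util.Arrays;

class SurrogatePairs {
    /** Splits a supplementary code point (0x10000..0x10FFFF) into high and low surrogates. */
    static char[] toSurrogates(int cp) {
        if (cp < 0x10000 || cp > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point: " + cp);
        }
        int offset = cp - 0x10000;                      // 20-bit value
        char high = (char) (0xD800 + (offset >> 10));   // upper 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // lower 10 bits
        return new char[] { high, low };
    }

    /** Reverses toSurrogates: rebuilds the code point from a surrogate pair. */
    static int fromSurrogates(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        int clef = 0x1D11E; // MUSICAL SYMBOL G CLEF
        char[] pair = toSurrogates(clef);
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]);     // D834 DD1E
        System.out.println(Arrays.equals(pair, Character.toChars(clef)));   // true
        System.out.println(fromSurrogates(pair[0], pair[1]) == clef);       // true
    }
}
```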

UTF‑32

UTF‑32 stores each code point as a 32‑bit unsigned integer.

Endianness

Big‑Endian stores the most‑significant byte first; Little‑Endian stores the least‑significant byte first. Byte order does not affect the logical encoding, only the physical storage of multi‑byte units.
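A quick way to see byte order in practice is to encode the same character with the explicit big‑endian and little‑endian UTF‑16 charsets. This is an illustrative sketch; EndiannessDemo and hex() are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

class EndiannessDemo {
    /** Formats bytes as space-separated uppercase hex. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u6C49"; // one BMP character, code unit 0x6C49

        // Same code unit, two physical layouts.
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16BE))); // 6C 49
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16LE))); // 49 6C

        // UTF-8 code units are single bytes, so there is nothing to swap.
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8)));    // E6 B1 89
    }
}
```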

BOM (Byte Order Mark)

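The BOM is the Unicode character U+FEFF placed at the start of a byte stream to indicate the encoding and, for UTF‑16/UTF‑32, the byte order: in UTF‑16 it appears as FE FF for big‑endian and FF FE for little‑endian. The UTF‑8 BOM is the byte sequence EF BB BF; because UTF‑8 has no byte‑order ambiguity, it serves there only as an encoding signature.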
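Java's UTF‑8 decoder does not strip a leading BOM, so text read from a BOM‑prefixed file starts with an invisible U+FEFF unless the caller removes it. The sketch below shows one way to sniff and skip a BOM. BomSniffer and its methods are hypothetical names, the UTF‑8 fallback for BOM‑less input is an assumption of this example, and UTF‑32 BOMs are deliberately left out.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class BomSniffer {
    /** Returns the charset implied by a leading BOM, or null if none is recognized. */
    static Charset detect(byte[] data) {
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return StandardCharsets.UTF_8;
        // Note: FF FE is also the start of the UTF-32LE BOM; this sketch ignores UTF-32.
        if (startsWith(data, 0xFE, 0xFF)) return StandardCharsets.UTF_16BE;
        if (startsWith(data, 0xFF, 0xFE)) return StandardCharsets.UTF_16LE;
        return null;
    }

    /** Decodes data, skipping the BOM if present; assumes UTF-8 when there is no BOM. */
    static String decode(byte[] data) {
        Charset cs = detect(data);
        if (cs == null) {
            return new String(data, StandardCharsets.UTF_8);
        }
        int bomLength = cs.equals(StandardCharsets.UTF_8) ? 3 : 2;
        return new String(data, bomLength, data.length - bomLength, cs);
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] withBom = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i' };

        // Plain decoding keeps the BOM as a leading U+FEFF character.
        String naive = new String(withBom, StandardCharsets.UTF_8);
        System.out.println(naive.length());           // 3
        System.out.println(decode(withBom).length()); // 2

        // The JDK's generic "UTF-16" encoder writes a big-endian BOM (FE FF) first.
        byte[] utf16 = "A".getBytes(StandardCharsets.UTF_16);
        System.out.println(detect(utf16));            // UTF-16BE
    }
}
```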

Java Character Encoding

Java uses Unicode as its character set and UTF‑16 as its internal encoding. char is a 16‑bit code unit.

String Length vs. Code Points

String.length() returns the number of char code units, not the number of Unicode code points. Use String.codePointCount(0, s.length()) to count code points.

Why Java Does Not Use Fixed‑Length Encoding

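Java originally treated char as a fixed‑length 16‑bit encoding (UCS‑2), which was sufficient while Unicode fit within 65,536 code points. Once Unicode grew beyond 0xFFFF, surrogate pairs were required, which made UTF‑16 a variable‑length encoding.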

Code Point / Code Unit Analysis

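public static void main(String[] args) throws Exception {
    String s = "a\uD834\uDD1E"; // 'a' followed by U+1D11E, stored as a surrogate pair
    System.out.println("code units:  " + s.length());                      // 3
    System.out.println("code points: " + s.codePointCount(0, s.length())); // 2
}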

The example demonstrates how a surrogate pair occupies two char units but represents a single code point.
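To go one step further than counting, a string can be walked code point by code point rather than char by char. The following sketch uses String.codePointAt and Character.charCount for that; CodePointWalk and describe() are hypothetical names for this example.

```java
class CodePointWalk {
    /** Lists each code point in s as "U+XXXX(n)", where n is its number of char units. */
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);           // reads a full surrogate pair if present
            int units = Character.charCount(cp); // 1 for BMP, 2 for supplementary
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("U+%04X(%d)", cp, units));
            i += units;                          // advance by code units, not by 1
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "a\u6C49\uD834\uDD1E"; // 'a', a BMP ideograph, and U+1D11E
        System.out.println(s.length());                      // 4
        System.out.println(s.codePointCount(0, s.length())); // 3
        System.out.println(describe(s));                     // U+0061(1) U+6C49(1) U+1D11E(2)
    }
}
```

Indexing with charAt(i) in such a loop would split the surrogate pair and report two meaningless halves instead of one character.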

Encoding / Decoding in Java

Use String.getBytes(Charset) to encode a string to a byte array and new String(byte[], Charset) to decode.

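import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

byte[] utf8Bytes = "编码转换".getBytes(StandardCharsets.UTF_8); // encode as UTF-8: 12 bytes
String utf8 = new String(utf8Bytes, StandardCharsets.UTF_8);    // decode with the same charset
byte[] gbkBytes = utf8.getBytes(Charset.forName("GBK"));        // re-encode as GBK: 8 bytes
String gbk = new String(gbkBytes, Charset.forName("GBK"));      // decode with GBK; equals the original text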
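Garbled text appears when bytes are decoded with a different charset than the one that produced them. The sketch below reproduces that with UTF‑8 and ISO‑8859‑1, both guaranteed to exist on every JVM, and shows that the damage is reversible only while no byte has been lost. MojibakeDemo and its methods are hypothetical names for this example.

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    /** Decodes UTF-8 bytes with the wrong charset, producing garbled text. */
    static String garble(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    /** Undoes garble(): ISO-8859-1 maps every byte to exactly one char, so no byte is lost. */
    static String repair(String garbled) {
        byte[] original = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(original, StandardCharsets.UTF_8);
    }

    /** Encodes with a charset that lacks the characters; they become '?' and are gone. */
    static String lossy(String text) {
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String original = "\u6C49\u5B57"; // two CJK characters, 3 UTF-8 bytes each

        String garbled = garble(original);
        System.out.println(garbled.length());                  // 6
        System.out.println(repair(garbled).equals(original));  // true
        System.out.println(lossy(original));                   // ??
    }
}
```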

Browser / Tomcat / MySQL Encoding

Browsers use UTF‑8 for URIs, ISO‑8859‑1 for HTTP headers, and the charset from Content‑Type or meta for the body.

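Tomcat decodes the request URI with the charset given by the Connector's URIEncoding attribute, which defaults to ISO‑8859‑1 in Tomcat 7 and UTF‑8 in Tomcat 8. Setting useBodyEncodingForURI="true" makes Tomcat decode query‑string parameters with the request body encoding (from Content‑Type or request.setCharacterEncoding) instead of URIEncoding.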
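The effect of a mismatched URI encoding can be reproduced without a server. The sketch below percent‑encodes a character as a browser would (UTF‑8), decodes it as ISO‑8859‑1 the way a Tomcat 7 default configuration would, and then applies the common workaround of re‑encoding to ISO‑8859‑1 and decoding as UTF‑8. It uses only java.net.URLEncoder and URLDecoder, not Tomcat itself; UriDecodingDemo and its methods are hypothetical names.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class UriDecodingDemo {
    /** Percent-encodes text using the given charset name. */
    static String encode(String text, String charset) {
        try {
            return URLEncoder.encode(text, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    /** Percent-decodes text using the given charset name. */
    static String decode(String text, String charset) {
        try {
            return URLDecoder.decode(text, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    /** Repairs a parameter that was decoded as ISO-8859-1 although it was sent as UTF-8. */
    static String repair(String misdecoded) {
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String sent = encode("\u6C49", "UTF-8");
        System.out.println(sent);                                  // %E6%B1%89

        String wrong = decode(sent, "ISO-8859-1");                 // U+00E6 U+00B1 U+0089
        System.out.println(wrong.length());                        // 3
        System.out.println(repair(wrong).equals("\u6C49"));        // true
        System.out.println(decode(sent, "UTF-8").equals("\u6C49")); // true
    }
}
```

Configuring the container to decode with UTF‑8 in the first place is preferable; the repair step only works because ISO‑8859‑1 preserves every byte.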

In a Spring MVC application, register a CharacterEncodingFilter so that request and response bodies use a known charset; Spring Boot auto‑configures an OrderedCharacterEncodingFilter with UTF‑8 by default.

MySQL Character Sets

MySQL uses several variables: character_set_client, character_set_connection, character_set_database, character_set_server, character_set_results, and character_set_system. The SET NAMES utf8mb4 statement sets client, connection, and results to utf8mb4.

Consistent settings across client, server and connection are required to avoid garbled text.

Connector/J

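When using Connector/J, the JDBC URL should contain useUnicode=true&characterEncoding=utf8. The characterEncoding value is a Java charset name, so the MySQL charset name utf8mb4 is not accepted there. Newer Connector/J versions map characterEncoding=utf8 to MySQL’s utf8mb4 charset, which covers the full Unicode range; with older versions, utf8mb4 has to be configured on the server side (for example character_set_server=utf8mb4).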
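The reason utf8mb4 matters is that MySQL's legacy utf8 charset (utf8mb3) stores at most 3 bytes per character, so it cannot hold code points above U+FFFF such as most emoji. A hedged sketch of an application‑side check is shown below; Utf8mb4Check and needsUtf8mb4() are hypothetical names, and the check looks only at the string, not at the actual column definition.

```java
import java.nio.charset.StandardCharsets;

class Utf8mb4Check {
    /** True if s contains a code point above U+FFFF, which takes 4 bytes in UTF-8. */
    static boolean needsUtf8mb4(String s) {
        return s.codePoints().anyMatch(Character::isSupplementaryCodePoint);
    }

    public static void main(String[] args) {
        String han = "\u6C49";          // BMP character: 3 bytes in UTF-8
        String emoji = "\uD83D\uDE00";  // U+1F600: 4 bytes in UTF-8

        System.out.println(han.getBytes(StandardCharsets.UTF_8).length);   // 3
        System.out.println(emoji.getBytes(StandardCharsets.UTF_8).length); // 4
        System.out.println(needsUtf8mb4(han));   // false
        System.out.println(needsUtf8mb4(emoji)); // true
    }
}
```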

