Understanding Character Sets, Encodings, BOM, and Their Interaction in Java, Tomcat, and MySQL
This article explains the fundamentals of characters, character sets and encodings such as ASCII, GB2312, Unicode, UTF‑8/16/32, BOM, and demonstrates how Java, browsers, Tomcat and MySQL handle encoding conversion, endianness and string length issues with practical code examples.
Basic Concepts
A character is any textual symbol; a character set is a collection of characters (e.g., ASCII, GB2312, BIG5, GB18030, Unicode). To store characters, a computer needs a character encoding that maps each character to a binary representation.
Character Encoding
Encoding converts characters from a character set into a sequence of bits, bytes or pulses so that text can be stored or transmitted.
ANSI
ANSI is not a single encoding but a family of region‑specific extensions to ASCII; Windows often refers to these as “ANSI encoding”.
Unicode
Unicode is an industry standard that defines a universal character set and several encoding schemes. It assigns a unique code point to every character. The key terms are:
Code point: the abstract numeric value assigned to a character, written as U+ followed by hex digits (e.g., U+6C49).
Code unit: the smallest fixed-size unit an encoding uses to represent code points: 8 bits in UTF‑8, 16 bits in UTF‑16, 32 bits in UTF‑32. A single code point may need more than one code unit.
Code space: the total range of possible code points, 0x0000‑0x10FFFF (1,114,112 values, about 1.1 million).
Code plane: Unicode divides the code space into 17 planes of 65,536 code points each; plane 0 is the Basic Multilingual Plane (BMP).
Surrogate pair: a pair of 16‑bit values (high surrogate 0xD800‑0xDBFF and low surrogate 0xDC00‑0xDFFF) used in UTF‑16 to encode code points above 0xFFFF.
UTF‑8
UTF‑8 uses a variable‑length encoding: 1 byte for 0x00‑0x7F, 2 bytes for 0x80‑0x7FF, 3 bytes for 0x800‑0xFFFF, and 4 bytes for 0x10000‑0x10FFFF. The maximum length is 4 bytes.
Unicode range (hex)    UTF‑8 bit pattern
000000‑00007F          0xxxxxxx
000080‑0007FF          110xxxxx 10xxxxxx
000800‑00FFFF          1110xxxx 10xxxxxx 10xxxxxx
010000‑10FFFF          11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
Example: the Chinese character “汉” (U+6C49) is encoded as E6 B1 89 in UTF‑8.
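As a minimal sketch of the bit patterns above, the following Java class encodes a single code point into UTF‑8 by hand and cross-checks the result against the JDK's own encoder. The class name Utf8Sketch and the helpers encode and hex are illustrative names chosen for this example, not part of any library.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8Sketch {
    // Hypothetical helper: encodes one code point by following the UTF-8 bit patterns.
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("not an encodable code point: " + cp);
        }
        if (cp <= 0x7F) {
            return new byte[] { (byte) cp };                       // 0xxxxxxx
        } else if (cp <= 0x7FF) {
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),                         // 110xxxxx
                (byte) (0x80 | (cp & 0x3F)) };                     // 10xxxxxx
        } else if (cp <= 0xFFFF) {
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),                        // 1110xxxx
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        } else {
            return new byte[] {
                (byte) (0xF0 | (cp >> 18)),                        // 11110xxx
                (byte) (0x80 | ((cp >> 12) & 0x3F)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
    }

    // Formats bytes as space-separated upper-case hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(encode(0x6C49)));                   // E6 B1 89
        byte[] jdk = "\u6C49".getBytes(StandardCharsets.UTF_8);    // JDK encoder for the same character
        System.out.println(Arrays.equals(jdk, encode(0x6C49)));    // true
    }
}
```

In real code String.getBytes(StandardCharsets.UTF_8) does this work; the manual version is only meant to make the table concrete.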
UTF‑16
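UTF‑16 uses one 16‑bit code unit (2 bytes) for BMP code points and a surrogate pair (4 bytes) for code points ≥ 0x10000. To build the pair, subtract 0x10000 from the code point to get a 20‑bit value, add its upper 10 bits to 0xD800 to form the high surrogate, and add its lower 10 bits to 0xDC00 to form the low surrogate.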
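The surrogate-pair calculation can be sketched in a few lines of Java: subtract 0x10000, then split the remaining 20 bits into two 10‑bit halves. The class SurrogateSketch and its toSurrogates helper are hypothetical names for this illustration; the JDK's Character.toChars is used only as a cross-check.

```java
import java.util.Arrays;

public class SurrogateSketch {
    // Hypothetical helper: splits a supplementary code point into a UTF-16 surrogate pair.
    static char[] toSurrogates(int cp) {
        if (cp < 0x10000 || cp > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point: " + cp);
        }
        int v = cp - 0x10000;                        // 20-bit value
        char high = (char) (0xD800 + (v >> 10));     // upper 10 bits
        char low = (char) (0xDC00 + (v & 0x3FF));    // lower 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x1F600);
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]);        // D83D DE00
        System.out.println(Arrays.equals(pair, Character.toChars(0x1F600)));   // true
    }
}
```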
UTF‑32
UTF‑32 stores each code point as a 32‑bit unsigned integer.
Endianness
Big‑Endian stores the most‑significant byte first; Little‑Endian stores the least‑significant byte first. Byte order does not affect the logical encoding, only the physical storage of multi‑byte units.
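A short Java sketch makes the difference visible: the same character, encoded with the big-endian and little-endian variants of UTF‑16, yields the same two bytes in opposite order. The class name EndianSketch and the hex helper are illustrative.

```java
import java.nio.charset.StandardCharsets;

public class EndianSketch {
    // Formats bytes as space-separated upper-case hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u6C49"; // the character U+6C49
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16BE))); // 6C 49
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16LE))); // 49 6C
    }
}
```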
BOM (Byte Order Mark)
BOM is the Unicode character U+FEFF placed at the start of a byte stream to indicate the encoding and, for UTF‑16/UTF‑32, the byte order. The UTF‑8 BOM is the byte sequence EF BB BF.
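The following Java sketch shows one way to recognise a BOM at the start of a byte array and to drop it after decoding. It handles only UTF‑8 and UTF‑16 to stay short; BomSketch, detectBom and stripBom are hypothetical names. Note that Java's UTF‑8 decoder does not discard a leading BOM, so the decoded string begins with U+FEFF unless it is removed by hand.

```java
import java.nio.charset.StandardCharsets;

public class BomSketch {
    // Hypothetical helper: names the encoding implied by a leading BOM, or returns null.
    // Only UTF-8 and UTF-16 are checked; UTF-32 BOMs are left out of this sketch.
    static String detectBom(byte[] b) {
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";
        }
        return null;
    }

    // Removes a decoded BOM (U+FEFF) from the start of a string, if present.
    static String stripBom(String s) {
        return s.startsWith("\uFEFF") ? s.substring(1) : s;
    }

    public static void main(String[] args) {
        // The JDK's "UTF-16" encoder writes a big-endian BOM (FE FF) first.
        byte[] utf16 = "A".getBytes(StandardCharsets.UTF_16);
        System.out.println(detectBom(utf16));             // UTF-16BE

        byte[] utf8WithBom = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 0x41 };
        String decoded = new String(utf8WithBom, StandardCharsets.UTF_8);
        System.out.println(decoded.length());             // 2: U+FEFF followed by 'A'
        System.out.println(stripBom(decoded));            // A
    }
}
```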
Java Character Encoding
Java uses Unicode as its character set and UTF‑16 as its internal encoding. char is a 16‑bit code unit.
String Length vs. Code Points
String.length() returns the number of char code units, not the number of Unicode code points. To count code points, use String.codePointCount(0, s.length()).
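The gap between the two counts matters most when cutting strings. The sketch below, using the hypothetical names LengthSketch, codePointLength and truncate, compares both counts and shows one way to truncate by code points so that a surrogate pair is never split.

```java
public class LengthSketch {
    // Number of Unicode code points in s.
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    // Hypothetical helper: keeps at most maxCodePoints code points without splitting a surrogate pair.
    static String truncate(String s, int maxCodePoints) {
        if (codePointLength(s) <= maxCodePoints) {
            return s;
        }
        return s.substring(0, s.offsetByCodePoints(0, maxCodePoints));
    }

    public static void main(String[] args) {
        String s = "A\u6C49\uD83D\uDE00"; // 'A', U+6C49, and U+1F600 (a surrogate pair)
        System.out.println(s.length());              // 4 code units
        System.out.println(codePointLength(s));      // 3 code points
        System.out.println(truncate(s, 2).length()); // 2: 'A' and U+6C49
        // A naive substring(0, 3) would end with a lone high surrogate.
        System.out.println(Character.isHighSurrogate(s.substring(0, 3).charAt(2))); // true
    }
}
```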
Why Java Does Not Use Fixed‑Length Encoding
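When Java was designed, every Unicode character fit in 16 bits, so a 16‑bit char worked as a fixed‑length unit. Once Unicode grew beyond 0xFFFF, surrogate pairs were introduced and UTF‑16 became a variable‑length encoding: characters outside the BMP now take two char values.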
Code Point / Code Unit Analysis
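public static void main(String[] args) throws Exception {
    String s = "A\uD83D\uDE00"; // 'A' plus U+1F600, which lies outside the BMP
    s.chars().forEach(u -> System.out.printf("%04X ", u));      // code units: 0041 D83D DE00
    System.out.println();
    s.codePoints().forEach(p -> System.out.printf("%04X ", p)); // code points: 0041 1F600
}
The example demonstrates how a surrogate pair occupies two char units but represents a single code point.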
Encoding / Decoding in Java
Use String.getBytes(Charset) to encode a string to a byte array and new String(byte[], Charset) to decode.
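import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

byte[] utf8Bytes = "编码转换".getBytes(StandardCharsets.UTF_8); // 12 bytes, 3 per character
String utf8 = new String(utf8Bytes, StandardCharsets.UTF_8);
byte[] gbkBytes = utf8.getBytes(Charset.forName("GBK"));        // 8 bytes, 2 per character
String gbk = new String(gbkBytes, Charset.forName("GBK"));
Browser / Tomcat / MySQL Encoding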
Browsers use UTF‑8 for URIs, ISO‑8859‑1 for HTTP headers, and the charset from Content‑Type or meta for the body.
Tomcat decodes the request URI with the encoding set by the Connector's URIEncoding attribute, which defaults to ISO‑8859‑1 in Tomcat 7 and UTF‑8 in Tomcat 8. Setting useBodyEncodingForURI to true makes Tomcat decode query-string parameters with the request body encoding instead.
In Spring MVC, register a CharacterEncodingFilter explicitly; Spring Boot configures an OrderedCharacterEncodingFilter with UTF‑8 by default.
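To see why a wrong URI encoding produces garbled text, the sketch below simulates a container that decodes UTF‑8 bytes as ISO‑8859‑1, then applies the familiar workaround of re-encoding as ISO‑8859‑1 and decoding as UTF‑8. It is a plain-Java simulation with hypothetical names (MojibakeSketch, decodeAsLatin1, repair) and does not call any Tomcat API. The repair works only because ISO‑8859‑1 maps every byte value to exactly one character, so no bytes are lost.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeSketch {
    // Simulates a container decoding UTF-8 bytes with the wrong charset.
    static String decodeAsLatin1(byte[] utf8Bytes) {
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    // Workaround: recover the original bytes, then decode them as UTF-8.
    static String repair(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u7F16\u7801";                         // two Chinese characters
        byte[] wire = original.getBytes(StandardCharsets.UTF_8);  // bytes as sent by the client
        String garbled = decodeAsLatin1(wire);
        System.out.println(garbled.length());                     // 6: one char per byte
        System.out.println(repair(garbled).equals(original));     // true
    }
}
```

Configuring the correct encoding on the container is the cleaner fix; the repair step is a fallback for cases where the configuration cannot be changed.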
MySQL Character Sets
MySQL uses several variables: character_set_client, character_set_connection, character_set_database, character_set_server, character_set_results, and character_set_system. The SET NAMES utf8mb4 statement sets client, connection, and results to utf8mb4.
Consistent settings across client, server and connection are required to avoid garbled text.
Connector/J
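When using Connector/J, add useUnicode=true&characterEncoding=utf8 to the JDBC URL. The characterEncoding property expects a Java charset name, so the MySQL charset name utf8mb4 should not be used there. Newer Connector/J versions map characterEncoding=utf8 to MySQL’s utf8mb4 charset; with older versions, full Unicode support depends on the server being configured with character_set_server=utf8mb4.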
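The reason utf8mb4 matters is that MySQL's legacy utf8 charset (utf8mb3) stores at most 3 bytes per character, while code points above U+FFFF need 4 bytes in UTF‑8. The sketch below shows one way an application might check whether a string contains such characters before writing it to a 3‑byte utf8 column. Utf8mb4Sketch and needsUtf8mb4 are hypothetical names; nothing here calls Connector/J.

```java
import java.nio.charset.StandardCharsets;

public class Utf8mb4Sketch {
    // Hypothetical helper: true if the text contains a code point above U+FFFF,
    // which takes 4 bytes in UTF-8 and therefore does not fit MySQL's 3-byte utf8.
    static boolean needsUtf8mb4(String text) {
        return text.codePoints().anyMatch(Character::isSupplementaryCodePoint);
    }

    public static void main(String[] args) {
        String bmpOnly = "\u6C49";               // U+6C49, 3 bytes in UTF-8
        String withEmoji = "\u6C49\uD83D\uDE00"; // adds U+1F600, 4 bytes in UTF-8
        System.out.println(needsUtf8mb4(bmpOnly));                              // false
        System.out.println(needsUtf8mb4(withEmoji));                            // true
        System.out.println(withEmoji.getBytes(StandardCharsets.UTF_8).length);  // 7
    }
}
```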
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.