Why Java 9 Switched String Storage to byte[] and How It Saves Memory
The article explains how Java 9 changed the internal representation of String from a char[] to a byte[] to reduce memory consumption, the role of Latin‑1 encoding, the impact on garbage collection, and why UTF‑16 remains the practical choice for Java strings.
Why Optimize String to Save Memory
If you are not a Java 8 holdout, you may have noticed that the String class source code switched from using char[] to byte[] to store string content. The primary reason is to reduce the memory occupied by strings, which also lowers the frequency of garbage‑collection cycles.
Using the command jmap -histo:live pid | head -n 10, you can view heap object statistics. In a running Java 8 project, the snapshot shows that String objects (17,638 instances) occupy 423,312 bytes, ranking third in memory usage.
Because Java 8’s String implementation still relies on char[], the top memory consumer is the char array itself: 17,673 instances using 1,621,352 bytes.
Strings and their backing arrays therefore dominate the heap, which is why optimizing String memory pays off; optimizing a class that is rarely instantiated would bring little benefit.
Why Does byte[] Save Memory?
In the JVM, a char occupies two bytes and uses UTF‑16 encoding, covering the range '\u0000' to '\uffff'. Consequently, representing a String with char[] always consumes two bytes per character, even when the character could be represented with a single byte.
In practice, single‑byte characters appear more frequently than double‑byte ones. Simply converting char[] to byte[] is insufficient; it must be combined with the Latin‑1 encoding, which stores each character in one byte, yielding greater space savings than UTF‑8.
Example: String name = "jack"; With Latin‑1 encoding, this string occupies only 4 bytes.
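The size difference between the encodings can be checked with a minimal sketch like the one below. The class and method names are hypothetical, and the numbers are encoded byte counts produced by String.getBytes, not a measurement of the JVM's internal heap layout.

```java
import java.nio.charset.StandardCharsets;

class EncodingSizes {
    // Bytes needed to encode s in Latin-1 (ISO-8859-1): one per character.
    static int latin1Bytes(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1).length;
    }

    // Bytes needed to encode s in UTF-8: one to four per code point.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // UTF_16BE is used so that no byte-order mark is added to the count.
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String ascii = "jack";
        String accented = "caf\u00e9"; // "cafe" with an acute accent on the e

        // jack: 4 bytes in Latin-1, 4 in UTF-8, 8 in UTF-16
        System.out.printf("Latin-1=%d, UTF-8=%d, UTF-16=%d%n",
                latin1Bytes(ascii), utf8Bytes(ascii), utf16Bytes(ascii));

        // U+00E9 still fits in one Latin-1 byte but needs two bytes in UTF-8:
        // 4 bytes in Latin-1, 5 in UTF-8, 8 in UTF-16
        System.out.printf("Latin-1=%d, UTF-8=%d, UTF-16=%d%n",
                latin1Bytes(accented), utf8Bytes(accented), utf16Bytes(accented));
    }
}
```

The second string shows where Latin‑1 beats UTF‑8: characters in the range 128 to 255 take one byte in Latin‑1 but two in UTF‑8.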
For a string like "小明", Latin‑1 cannot represent the characters, so UTF‑16 is used: String name = "小明";
Starting with JDK 9, the String source adds a coder field to distinguish the encoding used:
/**
* The identifier of the encoding used to encode the bytes in {@code value}.
* The supported values in this implementation are:
*
* LATIN1
* UTF16
*/
private final byte coder;
Java automatically selects the appropriate encoding (Latin‑1 or UTF‑16) based on the string’s content.
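Therefore, after the change from char[] to byte[], a string that contains only Latin‑1 characters (such as English text) uses one byte per character, while a string that contains any character outside Latin‑1 (such as Chinese) still uses two bytes per character; previously every string used two bytes per character.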
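The selection rule can be illustrated with a rough sketch. This is not the JDK's actual code: chooseCoder and contentBytes are hypothetical helpers, and the constants only mirror the LATIN1 and UTF16 names from the Javadoc above. The sketch assumes the coder is chosen once for the whole string, so a single character outside Latin‑1 moves every character to two bytes.

```java
class CoderSketch {
    // Illustrative constants named after the values listed in the Javadoc.
    static final byte LATIN1 = 0;
    static final byte UTF16 = 1;

    // Hypothetical helper: LATIN1 only if every char fits in a single byte.
    static byte chooseCoder(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) {
                return UTF16;
            }
        }
        return LATIN1;
    }

    // Bytes the content array would need under the chosen coder.
    static int contentBytes(String s) {
        return chooseCoder(s) == LATIN1 ? s.length() : s.length() * 2;
    }

    public static void main(String[] args) {
        String english = "jack";
        String chinese = "\u5c0f\u660e";   // the two-character name used above
        String mixed = english + chinese;  // six chars, one coder for all of them

        System.out.println(contentBytes(english)); // 4
        System.out.println(contentBytes(chinese)); // 4
        System.out.println(contentBytes(mixed));   // 12
    }
}
```

The mixed case is the one worth noticing: the four English letters no longer get the one-byte treatment once the string also holds Chinese characters.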
Why Use UTF‑16 Instead of UTF‑8?
In UTF‑8, characters 0–127 are encoded with one byte (identical to ASCII); characters 128 and above use two, three, or four bytes.
If a character fits in one byte, the highest bit is 0.
If it requires multiple bytes, the first byte starts with a sequence of 1 bits equal to the number of bytes, followed by a 0, and continuation bytes start with 10.
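The bit patterns described above can be read directly off the first byte of a sequence. The following is a minimal sketch with hypothetical names; it only classifies the leading bits and does not validate the rest of the sequence (overlong forms and out-of-range values are not rejected).

```java
import java.nio.charset.StandardCharsets;

class Utf8LeadByte {
    // Length of the UTF-8 sequence that starts with this byte, or -1 if the
    // byte is a continuation byte or matches no lead-byte pattern.
    static int sequenceLength(byte first) {
        int b = first & 0xFF;                 // treat the byte as unsigned
        if ((b & 0x80) == 0x00) return 1;     // 0xxxxxxx
        if ((b & 0xE0) == 0xC0) return 2;     // 110xxxxx
        if ((b & 0xF0) == 0xE0) return 3;     // 1110xxxx
        if ((b & 0xF8) == 0xF0) return 4;     // 11110xxx
        return -1;
    }

    // Continuation bytes always look like 10xxxxxx.
    static boolean isContinuation(byte b) {
        return (b & 0xC0) == 0x80;
    }

    public static void main(String[] args) {
        String[] samples = {
            "A",                                   // U+0041
            "\u00e9",                              // U+00E9
            "\u5c0f",                              // U+5C0F
            new String(Character.toChars(0x1F600)) // U+1F600
        };
        for (String s : samples) {
            byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
            // The length announced by the lead byte matches the encoded length.
            System.out.println(sequenceLength(bytes[0]) + " == " + bytes.length);
        }
    }
}
```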
Because UTF‑8 is variable‑length, random‑access operations such as charAt or substring become inefficient: the JVM would need to scan from the start to locate the nth character.
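To make the cost concrete, here is a sketch of what locating the nth character would look like if the content were stored as UTF‑8. The helper offsetOfCodePoint is hypothetical and assumes well-formed input; it illustrates the linear scan and is not something the JDK does.

```java
import java.nio.charset.StandardCharsets;

class Utf8Scan {
    // Byte offset of the n-th code point (0-based) in well-formed UTF-8.
    // Every earlier byte has to be inspected, so the cost grows with n.
    static int offsetOfCodePoint(byte[] utf8, int n) {
        int seen = 0;
        for (int i = 0; i < utf8.length; i++) {
            // Continuation bytes look like 10xxxxxx; any other byte starts a code point.
            if ((utf8[i] & 0xC0) != 0x80) {
                if (seen == n) {
                    return i;
                }
                seen++;
            }
        }
        throw new IndexOutOfBoundsException("code point index " + n);
    }

    public static void main(String[] args) {
        // 'a' takes 1 byte, U+5C0F takes 3 bytes, 'b' takes 1 byte.
        byte[] utf8 = "a\u5c0fb".getBytes(StandardCharsets.UTF_8);
        System.out.println(offsetOfCodePoint(utf8, 0)); // 0
        System.out.println(offsetOfCodePoint(utf8, 1)); // 1
        System.out.println(offsetOfCodePoint(utf8, 2)); // 4
    }
}
```

With fixed-width units the same lookup is plain index arithmetic: the nth unit starts at n times the unit size.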
UTF‑16 also uses a variable length (2 or 4 bytes). Characters in the Unicode range U+0000–U+FFFF are stored in two bytes; characters in the range U+10000–U+10FFFF are stored as surrogate pairs (four bytes).
In Java, however, a char is always two bytes, and a four‑byte Unicode character is represented by two char values. All String operations work on these fixed‑size char units, making UTF‑16 effectively a fixed‑length encoding within the Java runtime.
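A short sketch shows what "fixed-size char units" means for a character outside the two-byte range. The class name and the choice of U+1F600 are illustrative only; the methods called are standard String and Character APIs.

```java
class SurrogateDemo {
    // U+1F600 lies above U+FFFF, so it needs a surrogate pair in UTF-16.
    static final String GRIN = new String(Character.toChars(0x1F600));

    public static void main(String[] args) {
        // length() counts char units, not Unicode characters.
        System.out.println(GRIN.length());                          // 2
        System.out.println(GRIN.codePointCount(0, GRIN.length()));  // 1

        // charAt(i) is a constant-time index and returns one half of the pair.
        System.out.println(Character.isHighSurrogate(GRIN.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(GRIN.charAt(1)));  // true

        // codePointAt(0) combines both halves back into the full code point.
        System.out.println(Integer.toHexString(GRIN.codePointAt(0)));  // 1f600
    }
}
```

The trade-off is that charAt and length stay cheap, while code that cares about whole Unicode characters has to use the code point methods.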
Source: https://www.zhihu.com/question/447224628
Programmer DD
A tinkering programmer and author of "Spring Cloud Microservices in Action"
