
How to Accurately Substring Chinese Text in PHP Across GBK and UTF-8

This guide explains how to correctly truncate strings containing Chinese characters in PHP by detecting character byte length for GBK and UTF‑8 encodings, and provides a reusable my_substr function with example usage.

Python Programming Learning Circle

Key points to know:

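In GBK, a Chinese character occupies 2 bytes; in UTF‑8, most Chinese characters occupy 3 bytes (rare characters outside the Basic Multilingual Plane take 4).

The ord() function returns the value (0 to 255) of the first byte of a string, not of its first character.

In both encodings the first byte of a common Chinese character is greater than 0xA0: GB2312 lead bytes run from 0xA1 to 0xF7, and 3‑byte UTF‑8 lead bytes run from 0xE0 to 0xEF. GBK's extension area also uses lead bytes from 0x81 to 0xA0, which this test does not catch.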
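As a quick check of the points above, the following sketch inspects the character 中 in both encodings. The bytes are written as explicit escapes (an assumption made here so the result does not depend on how the script file itself is saved); the variable names are illustrative only.

```php
<?php
// "中" as explicit bytes, independent of the encoding of this source file.
$utf8 = "\xE4\xB8\xAD"; // UTF-8 encoding of 中 (U+4E2D)
$gbk  = "\xD6\xD0";     // GB2312/GBK encoding of 中

echo strlen($utf8), "\n";        // 3 (bytes, not characters)
echo strlen($gbk), "\n";         // 2
printf("0x%02X\n", ord($utf8));  // 0xE4, first byte only
printf("0x%02X\n", ord($gbk));   // 0xD6, first byte only
printf("0x%02X\n", ord("a"));    // 0x61, below 0xA0
```

Both lead bytes are above 0xA0, while the ASCII letter is below it.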

The essential technique is to test the byte at the current position: if ord(substr($str, $pos, 1)) > 0xA0, that byte starts a Chinese character and the following bytes (2 in total for GBK, 3 for UTF‑8) belong to it; otherwise it is a single-byte English (ASCII) character.
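The byte test can be tried in isolation with a small character counter. This is a minimal sketch: my_strlen is a hypothetical helper that does not appear in the original article, and the sample strings are written as byte escapes so the encoding is unambiguous.

```php
<?php
// Hypothetical helper: counts characters by walking the bytes and
// applying the ord(...) > 0xA0 test at each character boundary.
function my_strlen($str, $bite = 2) {
    $pos = 0;              // current byte offset
    $count = 0;            // characters seen so far
    $bytes = strlen($str); // total length in bytes
    while ($pos < $bytes) {
        // A lead byte above 0xA0 starts a $bite-byte character.
        $pos += (ord($str[$pos]) > 0xA0) ? $bite : 1;
        $count++;
    }
    return $count;
}

$utf8 = "a\xE4\xB8\xAD\xE6\x96\x87"; // "a中文" in UTF-8 (7 bytes)
$gbk  = "a\xD6\xD0\xCE\xC4";         // "a中文" in GBK (5 bytes)
echo my_strlen($utf8, 3), "\n"; // 3
echo my_strlen($gbk, 2), "\n";  // 3
```

The same three characters are counted in both encodings, provided the matching byte width is passed in.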

The following PHP function my_substr implements this logic, allowing you to specify the start position, length, and byte size (2 for GBK, 3 for UTF‑8):

<code><?php
/* param $str   The string to be truncated.
 * param $start Starting position, 0 for the first character.
 * param $length Number of characters to extract; if empty, extract to the end.
 * param $bite   Byte length of a Chinese character, default 2 for GBK, 3 for UTF‑8.
 */
function my_substr($str, $start, $length = "", $bite = 2) {
    $pos = 0; // byte position in the string
    // Calculate byte offset for the start position
    for ($i = 0; $i < $start && $pos < strlen($str); $i++) {
        if (ord(substr($str, $pos, 1)) > 0xA0) { // test the byte at $pos, not at $i
            $pos += $bite; // Chinese character
        } else {
            $pos += 1; // English character
        }
    }

    if ($length === "" || $length === null) { // strict check, so a length of 0 is not treated as "no length"
        return substr($str, $pos); // to the end
    } else {
        if ($length < 0) {
            $length = 0;
        }
        $string = "";
        for ($i = 1; $i <= $length && $pos < strlen($str); $i++) { // stop at the end of the string
            if (ord(substr($str, $pos, 1)) > 0xA0) {
                $string .= substr($str, $pos, $bite);
                $pos += $bite;
            } else {
                $string .= substr($str, $pos, 1);
                $pos += 1;
            }
        }
        return $string;
    }
}
$str = "a这是一段中文";
echo my_substr($str, 0); // output whole string
echo "\n";
echo my_substr($str, 0, 1); // output 'a'
echo "\n";
echo my_substr($str, 1, 2); // output '这是' (with this file saved as GBK; pass 3 as the fourth argument if it is UTF-8)
?>
</code>

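Set the $bite parameter to 3 when working with UTF‑8 strings, for example my_substr($str, 1, 2, 3). The default of 2 only gives correct results when the string is GBK-encoded.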
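A fixed width of 3 covers most Chinese text, but UTF‑8 characters can be 1 to 4 bytes long. As a possible refinement, the width can be read from the lead byte instead of being passed in. The sketch below assumes well-formed UTF‑8 input; utf8_char_len and utf8_substr_sketch are hypothetical names introduced here, not part of the original function.

```php
<?php
// Hypothetical helper: byte length of the UTF-8 character that starts with $byte.
function utf8_char_len($byte) {
    $o = ord($byte);
    if ($o < 0x80)  return 1; // ASCII
    if ($o >= 0xF0) return 4; // 4-byte sequence (characters outside the BMP)
    if ($o >= 0xE0) return 3; // 3-byte sequence (most Chinese characters)
    if ($o >= 0xC0) return 2; // 2-byte sequence (e.g. accented Latin letters)
    return 1;                 // stray continuation byte, treated as one byte here
}

// Hypothetical UTF-8-only variant of my_substr built on the helper above.
function utf8_substr_sketch($str, $start, $length) {
    $pos = 0;
    $bytes = strlen($str);
    // Skip $start characters to find the starting byte offset.
    for ($i = 0; $i < $start && $pos < $bytes; $i++) {
        $pos += utf8_char_len($str[$pos]);
    }
    // Copy up to $length whole characters.
    $out = "";
    for ($i = 0; $i < $length && $pos < $bytes; $i++) {
        $len = utf8_char_len($str[$pos]);
        $out .= substr($str, $pos, $len);
        $pos += $len;
    }
    return $out;
}

$str = "a\xE8\xBF\x99\xE6\x98\xAF\xE4\xB8\x80"; // "a这是一" in UTF-8
echo utf8_substr_sketch($str, 1, 2), "\n"; // the bytes for "这是"
```

Where the mbstring extension is available, PHP's built-in mb_substr with an explicit encoding argument does the same job for either encoding.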

Written by

Python Programming Learning Circle

A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.
