Backend Development 6 min read

Sensitive Word Detection Algorithm in PHP Using Multibyte String Traversal

This article explains how to build a tree‑based sensitive‑word detection algorithm in PHP, discusses the challenges of correctly iterating over multibyte strings, and provides a complete implementation with code examples that handle Unicode characters efficiently.

php中文网 Courses
php中文网 Courses
php中文网 Courses
Sensitive Word Detection Algorithm in PHP Using Multibyte String Traversal

The article introduces a simple algorithm that constructs a tree where each character of a keyword is a node, then traverses input strings to check for the presence of those keywords.

Implementation challenges : While building the tree is straightforward, correctly iterating over a string in PHP requires handling multibyte characters, because functions like mb_substr and mb_strlen must be used to obtain each character’s length.

A basic string traversal method is shown:

<code>$strLen = mb_strlen($str);
for ($i = 0; $i < $strLen; $i++) {
    echo mb_substr($str, $i, 1, "utf8"), PHP_EOL;
}</code>

This method relies on the mb_* family of functions, which can be slow for large strings because mb_substr recalculates the character position each time.

The correct approach is to follow the UTF‑8 encoding rules when extracting characters.

Algorithm implementation (full PHP code):

<code>&lt;?php
/**
 * Illegal keyword check
 */
class SensitiveWords {
    protected $tree = null;
    protected $callIsNumeric = true;
    /**
     * Load illegal words list, one word per line
     */
    public function __construct($path = __DIR__ . '/sensitiveWords.txt') {
        $this->tree = new WordNode();
        $file = fopen($path, "r");
        while (!feof($file)) {
            $words = trim(fgets($file));
            if ($words == '') { continue; }
            // Detect pure numeric words
            if (is_numeric($words)) { $this->callIsNumeric = false; }
            $this->setTree($words);
        }
        fclose($file);
    }

    protected function setTree($words) {
        $array = $this->strToArr($words);
        $tree = $this->tree;
        $l = count($array) - 1;
        foreach ($array as $k => $item) {
            $tree = $tree->getChildAlways($item);
            if ($l == $k) { $tree->end = true; }
        }
    }

    /**
     * Return found illegal words
     * @param string $str
     * @return array
     */
    public function check($str) {
        // compress string
        $str = trim(str_replace([' ', "\n", "\r"], ['', '', ''], $str));
        $ret = [];
        loop:
        $strLen = strlen($str);
        if ($strLen === 0) { return array_unique($ret); }
        if ($this->callIsNumeric && is_numeric($str)) { return array_unique($ret); }
        $tree = $this->tree;
        $words = '';
        for ($i = 0; $i < $strLen; $i++) {
            // Determine UTF‑8 byte length based on leading byte
            $ord = ord($str[$i]);
            if ($ord <= 127) {
                $word = $str[$i];
            } elseif ($ord <= 223) {
                $word = $str[$i] . $str[$i + 1];
                $i += 1;
            } elseif ($ord <= 239) {
                $word = $str[$i] . $str[$i + 1] . $str[$i + 2];
                $i += 2;
            } elseif ($ord <= 244) {
                $word = $str[$i] . $str[$i + 1] . $str[$i + 2] . $str[$i + 3];
                $i += 3;
            } else {
                // PHP cannot handle >4‑byte UTF‑8 sequences
                continue;
            }
            $tree = $tree->getChild($word);
            if (is_null($tree)) {
                $str = substr($str, $i + 1);
                goto loop;
            } else {
                $words .= $word;
                if ($tree->end) { $ret[] = $words; }
            }
        }
        return array_unique($ret);
    }

    protected function strToArr($str) {
        $array = [];
        $strLen = mb_strlen($str);
        for ($i = 0; $i < $strLen; $i++) {
            $array[] = mb_substr($str, $i, 1, "utf8");
        }
        return $array;
    }
}

/**
 * Single character node
 */
class WordNode {
    public $end = false;
    protected $child = [];

    public function getChildAlways($word) {
        if (!isset($this->child[$word])) {
            $this->child[$word] = new self();
        }
        return $this->child[$word];
    }

    public function getChild($word) {
        if ($word === '') { return null; }
        return $this->child[$word] ?? null;
    }
}
?>
</code>

The proper way to traverse a string is to follow the UTF‑8 encoding pattern (using utf8 as the character set) when extracting characters, as demonstrated in the code above.

AlgorithmPHPMultibyteString TraversalSensitive Words
php中文网 Courses
Written by

php中文网 Courses

php中文网's platform for the latest courses and technical articles, helping PHP learners advance quickly.

0 followers
Reader feedback

How this landed with the community

login Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.