Backend Development 6 min read

Sensitive Word Detection Algorithm in PHP Using Multibyte String Traversal

This article explains how to build a tree‑based sensitive‑word detection algorithm in PHP, discusses the challenges of correctly iterating over multibyte strings, and provides a complete implementation with code examples that handle Unicode characters efficiently.

php Courses

Oct 11, 2022

Sensitive Word Detection Algorithm in PHP Using Multibyte String Traversal

The article introduces a simple algorithm that constructs a tree where each character of a keyword is a node, then traverses input strings to check for the presence of those keywords.

Implementation challenges : While building the tree is straightforward, correctly iterating over a string in PHP requires handling multibyte characters, because functions like mb_substr and mb_strlen must be used to obtain each character’s length.

A basic string traversal method is shown:

$strLen = mb_strlen($str);
for ($i = 0; $i < $strLen; $i++) {
    echo mb_substr($str, $i, 1, "utf8"), PHP_EOL;
}

This method relies on the mb_* family of functions, which can be slow for large strings because mb_substr recalculates the character position each time.

The correct approach is to follow the UTF‑8 encoding rules when extracting characters.

Algorithm implementation (full PHP code):

<?php
/**
 * Illegal keyword check
 */
class SensitiveWords {
    protected $tree = null;
    protected $callIsNumeric = true;
    /**
     * Load illegal words list, one word per line
     */
    public function __construct($path = __DIR__ . '/sensitiveWords.txt') {
        $this->tree = new WordNode();
        $file = fopen($path, "r");
        while (!feof($file)) {
            $words = trim(fgets($file));
            if ($words == '') { continue; }
            // Detect pure numeric words
            if (is_numeric($words)) { $this->callIsNumeric = false; }
            $this->setTree($words);
        }
        fclose($file);
    }

    protected function setTree($words) {
        $array = $this->strToArr($words);
        $tree = $this->tree;
        $l = count($array) - 1;
        foreach ($array as $k => $item) {
            $tree = $tree->getChildAlways($item);
            if ($l == $k) { $tree->end = true; }
        }
    }

    /**
     * Return found illegal words
     * @param string $str
     * @return array
     */
    public function check($str) {
        // compress string
        $str = trim(str_replace([' ', "
", "\r"], ['', '', ''], $str));
        $ret = [];
        loop:
        $strLen = strlen($str);
        if ($strLen === 0) { return array_unique($ret); }
        if ($this->callIsNumeric && is_numeric($str)) { return array_unique($ret); }
        $tree = $this->tree;
        $words = '';
        for ($i = 0; $i < $strLen; $i++) {
            // Determine UTF‑8 byte length based on leading byte
            $ord = ord($str[$i]);
            if ($ord <= 127) {
                $word = $str[$i];
            } elseif ($ord <= 223) {
                $word = $str[$i] . $str[$i + 1];
                $i += 1;
            } elseif ($ord <= 239) {
                $word = $str[$i] . $str[$i + 1] . $str[$i + 2];
                $i += 2;
            } elseif ($ord <= 244) {
                $word = $str[$i] . $str[$i + 1] . $str[$i + 2] . $str[$i + 3];
                $i += 3;
            } else {
                // PHP cannot handle >4‑byte UTF‑8 sequences
                continue;
            }
            $tree = $tree->getChild($word);
            if (is_null($tree)) {
                $str = substr($str, $i + 1);
                goto loop;
            } else {
                $words .= $word;
                if ($tree->end) { $ret[] = $words; }
            }
        }
        return array_unique($ret);
    }

    protected function strToArr($str) {
        $array = [];
        $strLen = mb_strlen($str);
        for ($i = 0; $i < $strLen; $i++) {
            $array[] = mb_substr($str, $i, 1, "utf8");
        }
        return $array;
    }
}

/**
 * Single character node
 */
class WordNode {
    public $end = false;
    protected $child = [];

    public function getChildAlways($word) {
        if (!isset($this->child[$word])) {
            $this->child[$word] = new self();
        }
        return $this->child[$word];
    }

    public function getChild($word) {
        if ($word === '') { return null; }
        return $this->child[$word] ?? null;
    }
}
?>

The proper way to traverse a string is to follow the UTF‑8 encoding pattern (using utf8 as the character set) when extracting characters, as demonstrated in the code above.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

algorithm PHP multibyte String Traversal sensitive words

Written by

php Courses

php中文网's platform for the latest courses and technical articles, helping PHP learners advance quickly.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.