Sensitive Word Detection Algorithm in PHP Using Multibyte String Traversal
This article explains how to build a tree‑based sensitive‑word detection algorithm in PHP, discusses the challenges of correctly iterating over multibyte strings, and provides a complete implementation with code examples that handle Unicode characters efficiently.
The article introduces a simple algorithm that constructs a tree where each character of a keyword is a node, then traverses input strings to check for the presence of those keywords.
Implementation challenges : While building the tree is straightforward, correctly iterating over a string in PHP requires handling multibyte characters, because functions like mb_substr and mb_strlen must be used to obtain each character’s length.
A basic string traversal method is shown:
<code>$strLen = mb_strlen($str);
for ($i = 0; $i < $strLen; $i++) {
echo mb_substr($str, $i, 1, "utf8"), PHP_EOL;
}</code>This method relies on the mb_* family of functions, which can be slow for large strings because mb_substr recalculates the character position each time.
The correct approach is to follow the UTF‑8 encoding rules when extracting characters.
Algorithm implementation (full PHP code):
<code><?php
/**
* Illegal keyword check
*/
class SensitiveWords {
protected $tree = null;
protected $callIsNumeric = true;
/**
* Load illegal words list, one word per line
*/
public function __construct($path = __DIR__ . '/sensitiveWords.txt') {
$this->tree = new WordNode();
$file = fopen($path, "r");
while (!feof($file)) {
$words = trim(fgets($file));
if ($words == '') { continue; }
// Detect pure numeric words
if (is_numeric($words)) { $this->callIsNumeric = false; }
$this->setTree($words);
}
fclose($file);
}
protected function setTree($words) {
$array = $this->strToArr($words);
$tree = $this->tree;
$l = count($array) - 1;
foreach ($array as $k => $item) {
$tree = $tree->getChildAlways($item);
if ($l == $k) { $tree->end = true; }
}
}
/**
* Return found illegal words
* @param string $str
* @return array
*/
public function check($str) {
// compress string
$str = trim(str_replace([' ', "\n", "\r"], ['', '', ''], $str));
$ret = [];
loop:
$strLen = strlen($str);
if ($strLen === 0) { return array_unique($ret); }
if ($this->callIsNumeric && is_numeric($str)) { return array_unique($ret); }
$tree = $this->tree;
$words = '';
for ($i = 0; $i < $strLen; $i++) {
// Determine UTF‑8 byte length based on leading byte
$ord = ord($str[$i]);
if ($ord <= 127) {
$word = $str[$i];
} elseif ($ord <= 223) {
$word = $str[$i] . $str[$i + 1];
$i += 1;
} elseif ($ord <= 239) {
$word = $str[$i] . $str[$i + 1] . $str[$i + 2];
$i += 2;
} elseif ($ord <= 244) {
$word = $str[$i] . $str[$i + 1] . $str[$i + 2] . $str[$i + 3];
$i += 3;
} else {
// PHP cannot handle >4‑byte UTF‑8 sequences
continue;
}
$tree = $tree->getChild($word);
if (is_null($tree)) {
$str = substr($str, $i + 1);
goto loop;
} else {
$words .= $word;
if ($tree->end) { $ret[] = $words; }
}
}
return array_unique($ret);
}
protected function strToArr($str) {
$array = [];
$strLen = mb_strlen($str);
for ($i = 0; $i < $strLen; $i++) {
$array[] = mb_substr($str, $i, 1, "utf8");
}
return $array;
}
}
/**
* Single character node
*/
class WordNode {
public $end = false;
protected $child = [];
public function getChildAlways($word) {
if (!isset($this->child[$word])) {
$this->child[$word] = new self();
}
return $this->child[$word];
}
public function getChild($word) {
if ($word === '') { return null; }
return $this->child[$word] ?? null;
}
}
?>
</code>The proper way to traverse a string is to follow the UTF‑8 encoding pattern (using utf8 as the character set) when extracting characters, as demonstrated in the code above.
php中文网 Courses
php中文网's platform for the latest courses and technical articles, helping PHP learners advance quickly.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.