How I Reduced Log Keyword Counting from Hours to Minutes Using PHP, Grep, Regex & Trie
This article walks through solving a massive log‑keyword counting task—600,000 short messages and 50,000 keywords—by evolving from a simple grep‑based approach to regex optimizations, word‑splitting, a trie data structure, and finally a multi‑process Redis queue, achieving a performance boost from hours to under ten minutes.
Original - grep
Design
Initially I thought of using the Linux grep command combined with PHP's exec() to count keyword occurrences, e.g., grep 'keyword' | wc -l.
Code
foreach ($word_list as $keyword) {
$count = intval(exec("grep '{$keyword}' file.log | wc -l"));
record($keyword, $count);
}The naive implementation ran for six hours on an old machine.
Evolution - Regex
Design
When the product demanded real‑time streaming data, I switched to regular expressions, building a pattern like /keyword1|keyword2|.../ using the | operator.
Regex pitfalls
Long patterns exceed PHP's backtrack limit ( pcre.backtrack_limit), causing crashes; increase it via ini_set('pcre.backtrack_limit', n) or split keywords.
Special characters in keywords (e.g., /) generate warnings; escape them with preg_quote().
Code
$end = 0;
$step = 1500;
$pattern = array();
while ($end < count($word_list)) {
$tmp_arr = array_slice($word_list, $end, $step);
$end += $step;
$item = implode('|', $tmp_arr);
$pattern[] = preg_quote($item);
}
$content = file_get_contents($log_file);
$lines = explode("
", $content);
foreach ($lines as $line) {
for ($i = 0; $i < count($pattern); $i++) {
preg_match_all("/{$pattern[$i]}/", $line, $match);
}
$match = array_unique(array_filter($match));
dealResult($match);
}Awakening - Word Splitting
Design
I considered hashing each keyword and checking each message word against the hash for O(1) lookups, but needed a way to split messages into all possible substrings of length 2‑8.
Code
$str_list = getStrList($msg);
foreach ($str_list as $str) {
$keywords = getKeywords($str);
foreach ($keywords as $keyword) {
if (isset($word_list[$keyword])) {
record($keyword);
}
}
}
function getStrList($msg) {
$separators = array(',', '。', '的');
$words = preg_split('/(?<!^)(?!$)/u', $msg);
$str = array();
$str_list = array();
foreach ($words as $word) {
if (in_array($word, $separators)) {
$str_list[] = $str;
$str = array();
} else {
$str[] = $word;
}
}
return array_filter($str_list);
}
function getKeywords($str) {
if (count($str) < 2) return array();
$keywords = array();
for ($i = 0; $i < count($str); $i++) {
for ($j = 2; $j < 9; $j++) {
$keywords[] = array_slice($str, $i, $j);
}
}
return $keywords;
}Final - Trie Tree
Trie tree
A trie (prefix tree) stores keywords character by character, marking the end of each keyword with a special terminator (e.g., `).
Design
Split each keyword into characters using preg_split(), e.g., "科学家" → "科","学","家".
Add a terminator after the last character.
Insert characters into the tree, creating nodes as needed.
During matching, traverse the tree; when a terminator node is reached, a keyword is found.
Construction code
$node = array(
'depth' => $depth,
'next' => array(
$val => $node,
// ...
),
); private function insert(&$node, $words) {
if (empty($words)) return;
$word = array_shift($words);
if (isset($node['next'][$word])) {
$this->insert($node['next'][$word], $words);
} else {
$tmp_node = array(
'depth' => $node['depth'] + 1,
'next' => array(),
);
$node['next'][$word] = $tmp_node;
$this->insert($node['next'][$word], $words);
}
}Query code
private function query($node, $words, &$matched) {
$word = array_shift($words);
if (isset($node['next'][$word])) {
$matched[] = $word;
if ($node['next'] > 1 && isset($node['next'][$word]['next']['`'])) {
return true;
}
return $this->query($node['next'][$word], $words, $matched);
} else {
$matched = array();
return false;
}
}Processing 1,000 messages takes about 3 seconds in PHP; Java achieves ~1 second.
Multiprocessing
Design
To further cut runtime, I used multiple processes via a Redis list queue. Each worker pops log lines, processes them with the trie, and records matches.
Running 10 workers completes the 600k‑message job in under ten minutes.
Conclusion
Multiple tools—grep, regex, word splitting, trie trees, and concurrent processing—can be combined to dramatically improve log‑keyword counting performance, turning an hour‑long task into a ten‑minute one.
Author: 枕边书
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
21CTO
21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
