How I Reduced Log Keyword Counting from Hours to Minutes Using PHP, Grep, Regex & Trie

This article walks through solving a massive log‑keyword counting task—600,000 short messages and 50,000 keywords—by evolving from a simple grep‑based approach to regex optimizations, word‑splitting, a trie data structure, and finally a multi‑process Redis queue, achieving a performance boost from hours to under ten minutes.

21CTO
21CTO
21CTO
How I Reduced Log Keyword Counting from Hours to Minutes Using PHP, Grep, Regex & Trie

Original - grep

Design

Initially I thought of using the Linux grep command combined with PHP's exec() to count keyword occurrences, e.g., grep 'keyword' | wc -l.

Code

foreach ($word_list as $keyword) {
    $count = intval(exec("grep '{$keyword}' file.log | wc -l"));
    record($keyword, $count);
}

The naive implementation ran for six hours on an old machine.

Evolution - Regex

Design

When the product demanded real‑time streaming data, I switched to regular expressions, building a pattern like /keyword1|keyword2|.../ using the | operator.

Regex pitfalls

Long patterns exceed PHP's backtrack limit ( pcre.backtrack_limit), causing crashes; increase it via ini_set('pcre.backtrack_limit', n) or split keywords.

Special characters in keywords (e.g., /) generate warnings; escape them with preg_quote().

Code

$end = 0;
$step = 1500;
$pattern = array();
while ($end < count($word_list)) {
    $tmp_arr = array_slice($word_list, $end, $step);
    $end += $step;
    $item = implode('|', $tmp_arr);
    $pattern[] = preg_quote($item);
}

$content = file_get_contents($log_file);
$lines = explode("
", $content);
foreach ($lines as $line) {
    for ($i = 0; $i < count($pattern); $i++) {
        preg_match_all("/{$pattern[$i]}/", $line, $match);
    }
    $match = array_unique(array_filter($match));
    dealResult($match);
}

Awakening - Word Splitting

Design

I considered hashing each keyword and checking each message word against the hash for O(1) lookups, but needed a way to split messages into all possible substrings of length 2‑8.

Code

$str_list = getStrList($msg);
foreach ($str_list as $str) {
    $keywords = getKeywords($str);
    foreach ($keywords as $keyword) {
        if (isset($word_list[$keyword])) {
            record($keyword);
        }
    }
}

function getStrList($msg) {
    $separators = array(',', '。', '的');
    $words = preg_split('/(?<!^)(?!$)/u', $msg);
    $str = array();
    $str_list = array();
    foreach ($words as $word) {
        if (in_array($word, $separators)) {
            $str_list[] = $str;
            $str = array();
        } else {
            $str[] = $word;
        }
    }
    return array_filter($str_list);
}

function getKeywords($str) {
    if (count($str) < 2) return array();
    $keywords = array();
    for ($i = 0; $i < count($str); $i++) {
        for ($j = 2; $j < 9; $j++) {
            $keywords[] = array_slice($str, $i, $j);
        }
    }
    return $keywords;
}

Final - Trie Tree

Trie tree

A trie (prefix tree) stores keywords character by character, marking the end of each keyword with a special terminator (e.g., `).

Design

Split each keyword into characters using preg_split(), e.g., "科学家" → "科","学","家".

Add a terminator after the last character.

Insert characters into the tree, creating nodes as needed.

During matching, traverse the tree; when a terminator node is reached, a keyword is found.

Construction code

$node = array(
    'depth' => $depth,
    'next' => array(
        $val => $node,
        // ...
    ),
);
private function insert(&$node, $words) {
    if (empty($words)) return;
    $word = array_shift($words);
    if (isset($node['next'][$word])) {
        $this->insert($node['next'][$word], $words);
    } else {
        $tmp_node = array(
            'depth' => $node['depth'] + 1,
            'next' => array(),
        );
        $node['next'][$word] = $tmp_node;
        $this->insert($node['next'][$word], $words);
    }
}

Query code

private function query($node, $words, &$matched) {
    $word = array_shift($words);
    if (isset($node['next'][$word])) {
        $matched[] = $word;
        if ($node['next'] > 1 && isset($node['next'][$word]['next']['`'])) {
            return true;
        }
        return $this->query($node['next'][$word], $words, $matched);
    } else {
        $matched = array();
        return false;
    }
}

Processing 1,000 messages takes about 3 seconds in PHP; Java achieves ~1 second.

Multiprocessing

Design

To further cut runtime, I used multiple processes via a Redis list queue. Each worker pops log lines, processes them with the trie, and records matches.

Running 10 workers completes the 600k‑message job in under ten minutes.

Conclusion

Multiple tools—grep, regex, word splitting, trie trees, and concurrent processing—can be combined to dramatically improve log‑keyword counting performance, turning an hour‑long task into a ten‑minute one.

Author: 枕边书
Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

regexLog ProcessingGrepTriemultiprocessing
21CTO
Written by

21CTO

21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.