Fundamentals 10 min read

Aho-Corasick Automaton: Efficient Multi‑Pattern Text Search and Real‑Time Highlighting

This article explains the Aho‑Corasick automaton, a classic multi‑pattern matching algorithm that builds a Trie with fail pointers to achieve linear‑time search over massive keyword sets, and demonstrates a Java implementation for highlighting keywords in HTML documents.

JD Tech Talk
JD Tech Talk
JD Tech Talk
Aho-Corasick Automaton: Efficient Multi‑Pattern Text Search and Real‑Time Highlighting

Introduction

Most free‑text search systems use Lucene‑style tokenisation, which works well for a small number of keywords but degrades sharply when the keyword set reaches tens of thousands. This article introduces the Aho‑Corasick (AC) automaton, a classic multi‑pattern matching algorithm that handles massive text data with real‑time performance and accuracy.

AC Automaton: A Revolutionary Tool for Text Search

The AC automaton can be thought of as a super‑search machine. By building a prefix tree (Trie) from a list of keywords, it can locate all occurrences of those keywords in a large document far more efficiently than scanning each word individually.

Construction and Search Mechanism

Building the Trie

The automaton first constructs a Trie where each node represents a character and paths represent prefixes of one or more keywords.

图片
图片

Fail (mismatch) pointers

During search, if the current path cannot be extended, the automaton follows a pre‑computed fail pointer to the longest possible suffix that may still lead to a match, avoiding a full restart.

图片
图片

Real‑time Search and Reporting

As the text is streamed, the AC automaton traverses the Trie and instantly reports each matched keyword together with its position, enabling a single pass over the input to find millions of patterns.

图片
图片

Algorithm Core Components and Complexity

Core components: goto: transition function that moves to the next state for each input character. fail: failure transition used when no direct match exists, jumping to a shallower state. output: set of matched patterns emitted when a terminal state is reached.

Time complexity: The Aho‑Corasick algorithm runs in O(n) time where n is the length of the text, independent of the number of keywords.

Applications of the AC Automaton

The automaton can be used to:

Find, link, or highlight keywords in plain text.

Enrich raw text with semantic markup.

Detect and correct spelling errors by comparing against a dictionary.

Example: Highlighting Keywords in HTML with Java

The following Java program demonstrates how to use the Aho‑Corasick library to search an HTML document and wrap matched keywords with <b> tags for bold highlighting.

public class HighlightKeywordsInHtml {
   public static void main(String[] args) {
        // Define HTML content string
        String htmlContent = createHtmlContentForNanjingUniversity();
        // Build Aho‑Corasick Trie
        Trie trie = buildAhoCorasickTrie();
        // Tokenize HTML
        Collection<Token> tokens = trie.tokenize(htmlContent);
        StringBuilder html = new StringBuilder();
        html.append("<html><body><p>");
        for (Token token : tokens) {
            if (token.isMatch()) {
                html.append("<b>");
            }
            html.append(token.getFragment());
            if (token.isMatch()) {
                html.append("</b>");
            }
        }
        html.append("</p></body></html>");
        System.out.println(html);
   }
   private static String createHtmlContentForNanjingUniversity() {
        String speech = """
        <!DOCTYPE html>
        <html lang=\"zh-CN\">
        <head>...</head>
        <body>...</body>
        </html>
        """;
        return speech;
   }
   private static Trie buildAhoCorasickTrie() {
        return Trie.builder()
            .ignoreOverlaps()
            .onlyWholeWords()
            .ignoreCase()
            .addKeywords(Arrays.asList("南京大学", "南大", "地球科学"))
            .build();
   }
}

Running the program produces the expected highlighted HTML output.

Conclusion

This article provides an introductory overview of the Aho‑Corasick automaton, its linear‑time performance, core components, and a practical Java implementation for keyword highlighting in HTML documents.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

JavaalgorithmCode ExampleAho-CorasickTriemulti-pattern matchingtext search
JD Tech Talk
Written by

JD Tech Talk

Official JD Tech public account delivering best practices and technology innovation.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.