Fundamentals 10 min read

How the Aho‑Corasick Automaton Supercharges Large‑Scale Text Search

This article explains the Aho‑Corasick automaton, detailing its construction, fail‑pointer mechanism, linear‑time complexity, and practical Java implementation for highlighting keywords in HTML, demonstrating its power for massive real‑time text search.

JD Cloud Developers
JD Cloud Developers
JD Cloud Developers
How the Aho‑Corasick Automaton Supercharges Large‑Scale Text Search

Introduction

Currently most free‑text search techniques follow a Lucene‑like strategy, parsing the query into components to locate keywords. This works well for a small number of keywords, but when the number reaches 100 000 or more, efficiency drops sharply, especially when exhaustive dictionary comparison is required. This article introduces the Aho‑Corasick (AC) automaton, a classic multi‑pattern matching algorithm that can handle massive text data while guaranteeing real‑time and accurate search.

AC Automaton: A Revolutionary Tool for Text Search

The AC automaton can be likened to a super‑search machine. Imagine you have a large book and a list of many words; the task is to quickly find all occurrences of those words in the book. Traditional per‑word search would be huge work, whereas the AC automaton builds a special tree structure— a prefix tree or Trie— to dramatically improve search efficiency.

Construction and Search Mechanism

Building the Prefix Tree (Trie)

The AC automaton first constructs a prefix tree from all keywords. Each node represents a character and points to possible next characters, forming paths that represent one or more keyword prefixes.

During search, if the current path cannot match a keyword, the AC automaton uses fail pointers to quickly backtrack. These pointers are pre‑set on each node and point to alternative matching paths, avoiding the inefficiency of restarting from the root.

Real‑time Search and Efficient Reporting

The AC automaton reads the text while traversing the prefix tree. As soon as a keyword is found, it immediately reports the word and its position, enabling one‑pass, high‑efficiency processing of massive keyword sets.

Algorithm Core Components and Complexity

Core components: goto (transition): each encountered character is fed to the goto structure to determine the new state. fail (failure transition): if no matching state is found, the algorithm triggers fail and backtracks to a shallower state. output (output): when a state matches an entire keyword, it is added to the output set.

Time complexity: The Aho‑Corasick algorithm runs in O(n) time, where n is the length of the text, meaning performance is linear and independent of the number of keywords.

Applications of the AC Automaton

The AC automaton can be used in many scenarios, such as:

Finding, linking, or highlighting keywords in text to improve retrievability.

Adding semantics to plain text, enriching content.

Checking grammar errors by comparing with a dictionary.

Case Study: Highlighting Keywords in HTML with Aho‑Corasick

This Java program demonstrates how to use the Aho‑Corasick library to search and highlight keywords in HTML. It builds a trie, tokenizes the HTML, and wraps matched tokens with <b> tags for bold display.

Step 1: Maven dependency configuration.

<dependency>
    <groupId>org.ahocorasick</groupId>
    <artifactId>ahocorasick</artifactId>
    <version>0.6.3</version>
</dependency>

Step 2: Code implementation.

public class HighlightKeywordsInHtml {
   public static void main(String[] args) {
       // Define HTML content string, e.g., Nanjing University description
       String htmlContent = createHtmlContentForNanjingUniversity();
       Trie trie = buildAhoCorasickTrie();
       Collection<Token> tokens = trie.tokenize(htmlContent);
       StringBuilder html = new StringBuilder();
       html.append("<html><body><p>");
       for (Token token : tokens) {
           if (token.isMatch()) {
               html.append("<b>");
           }
           html.append(token.getFragment());
           if (token.isMatch()) {
               html.append("</b>");
           }
       }
       html.append("</p></body></html>");
       System.out.println(html);
   }
   // Helper methods omitted for brevity
}

Step 3: Run the program – the output matches expectations.

In summary, the Aho‑Corasick automaton offers excellent performance for large‑scale keyword search and can be applied to various text‑processing tasks.

References

1. http://cr.yp.to/bib/1975/aho.pdf

2. https://github.com/robert-bor/aho-corasick

JavaalgorithmAho-CorasickTrietext searchkeyword highlighting
JD Cloud Developers
Written by

JD Cloud Developers

JD Cloud Developers (Developer of JD Technology) is a JD Technology Group platform offering technical sharing and communication for AI, cloud computing, IoT and related developers. It publishes JD product technical information, industry content, and tech event news. Embrace technology and partner with developers to envision the future.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.