Aho-Corasick Automaton: Efficient Multi‑Pattern Text Search and Real‑Time Highlighting
This article explains the Aho‑Corasick automaton, a classic multi‑pattern matching algorithm that builds a Trie with fail pointers to achieve linear‑time search over massive keyword sets, and demonstrates a Java implementation for highlighting keywords in HTML documents.
Introduction
Most free‑text search systems use Lucene‑style tokenisation, which works well for a small number of keywords but degrades sharply when the keyword set reaches tens of thousands. This article introduces the Aho‑Corasick (AC) automaton, a classic multi‑pattern matching algorithm that handles massive text data with real‑time performance and accuracy.
AC Automaton: A Revolutionary Tool for Text Search
The AC automaton can be thought of as a super‑search machine. By building a prefix tree (Trie) from a list of keywords, it can locate all occurrences of those keywords in a large document far more efficiently than scanning each word individually.
Construction and Search Mechanism
Building the Trie
The automaton first constructs a Trie where each node represents a character and paths represent prefixes of one or more keywords.
Fail (mismatch) pointers
During search, if the current path cannot be extended, the automaton follows a pre‑computed fail pointer to the longest possible suffix that may still lead to a match, avoiding a full restart.
Real‑time Search and Reporting
As the text is streamed, the AC automaton traverses the Trie and instantly reports each matched keyword together with its position, enabling a single pass over the input to find millions of patterns.
Algorithm Core Components and Complexity
Core components: goto: transition function that moves to the next state for each input character. fail: failure transition used when no direct match exists, jumping to a shallower state. output: set of matched patterns emitted when a terminal state is reached.
Time complexity: The Aho‑Corasick algorithm runs in O(n) time where n is the length of the text, independent of the number of keywords.
Applications of the AC Automaton
The automaton can be used to:
Find, link, or highlight keywords in plain text.
Enrich raw text with semantic markup.
Detect and correct spelling errors by comparing against a dictionary.
Example: Highlighting Keywords in HTML with Java
The following Java program demonstrates how to use the Aho‑Corasick library to search an HTML document and wrap matched keywords with <b> tags for bold highlighting.
public class HighlightKeywordsInHtml {
public static void main(String[] args) {
// Define HTML content string
String htmlContent = createHtmlContentForNanjingUniversity();
// Build Aho‑Corasick Trie
Trie trie = buildAhoCorasickTrie();
// Tokenize HTML
Collection<Token> tokens = trie.tokenize(htmlContent);
StringBuilder html = new StringBuilder();
html.append("<html><body><p>");
for (Token token : tokens) {
if (token.isMatch()) {
html.append("<b>");
}
html.append(token.getFragment());
if (token.isMatch()) {
html.append("</b>");
}
}
html.append("</p></body></html>");
System.out.println(html);
}
private static String createHtmlContentForNanjingUniversity() {
String speech = """
<!DOCTYPE html>
<html lang=\"zh-CN\">
<head>...</head>
<body>...</body>
</html>
""";
return speech;
}
private static Trie buildAhoCorasickTrie() {
return Trie.builder()
.ignoreOverlaps()
.onlyWholeWords()
.ignoreCase()
.addKeywords(Arrays.asList("南京大学", "南大", "地球科学"))
.build();
}
}Running the program produces the expected highlighted HTML output.
Conclusion
This article provides an introductory overview of the Aho‑Corasick automaton, its linear‑time performance, core components, and a practical Java implementation for keyword highlighting in HTML documents.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
JD Tech Talk
Official JD Tech public account delivering best practices and technology innovation.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
