How to Build a MySQL‑Based Fuzzy Company Name Matcher with Regex and IKAnalyzer
This article walks through a real‑world approval workflow where a business user adds a company, explains why MySQL REGEXP is chosen over Elasticsearch for fuzzy matching, and details the step‑by‑step implementation—including name preprocessing, tokenization with IKAnalyzer, and a custom SQL query that ranks results by match degree.
Problem Context
A company‑registration workflow requires the administrator to verify that a newly submitted full company name does not already exist in the database. The core challenge is to implement fuzzy matching and relevance ranking for Chinese company names.
Technology Selection
Option 1: introduce Elasticsearch – rejected because the system is small‑scale and the added infrastructure and data‑sync cost outweigh the benefits.
Option 2: use MySQL built‑in fuzzy search – selected.
MySQL provides three fuzzy‑search mechanisms: LIKE – requires full‑field match, unsuitable. REGEXP – arbitrary pattern match, chosen despite being slower than full‑text.
Full‑text index – limited customizability, cannot handle arbitrary patterns, rejected.
Given the modest record count, the performance penalty of REGEXP is acceptable.
Pre‑processing: Strip Administrative Regions
The method formatCompanyName removes province, city, county and village names from the input using a regular expression with named groups and a static ADDRESS array that enumerates Chinese provinces and their subordinate cities.
private String formatCompanyName(String targetCompanyName) {
String regex = "(?<province>[^省]+自治区|.*?省|.*?行政区|.*?市)"
+ "?(?<city>[^市]+自治州|.*?地区|.*?行政单位|.+盟|市辖区|.*?市|.*?县)"
+ "?(?<county>[^(区|市|县|旗|岛)]+区|.*?市|.*?县|.*?旗|.*?岛)"
+ "?(?<village>.*)";
Matcher matcher = Pattern.compile(regex).matcher(targetCompanyName);
while (matcher.find()) {
String province = matcher.group("province");
if (StringUtils.isNotBlank(province) && targetCompanyName.contains(province)) {
targetCompanyName = targetCompanyName.replace(province, "");
}
String city = matcher.group("city");
if (StringUtils.isNotBlank(city) && targetCompanyName.contains(city)) {
targetCompanyName = targetCompanyName.replace(city, "");
}
String county = matcher.group("county");
if (StringUtils.isNotBlank(county) && targetCompanyName.contains(county)) {
targetCompanyName = targetCompanyName.replace(county, "");
}
}
// Remove any remaining city‑level keywords from ADDRESS array
for (String[] city : ADDRESS) {
for (String b : city) {
if (targetCompanyName.contains(b)) {
targetCompanyName = targetCompanyName.replace(b, "");
}
}
}
return targetCompanyName;
}The AddressUtil class defines ADDRESS as a two‑dimensional String array containing provinces and their major cities.
Tokenization with IKAnalyzer
Dependencies added in pom.xml:
<dependency>
<groupId>com.janeluo</groupId>
<artifactId>ikanalyzer</artifactId>
<version>2012_u6</version>
<exclusions>
<exclusion>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-core</artifactId>
</exclusion>
<exclusion>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-queryparser</artifactId>
</exclusion>
<exclusion>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-analyzers-common</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-queryparser</artifactId>
<version>7.3.0</version>
</dependency>Utility class converts a string into a list of tokens using IKAnalyzer in smart mode:
public static List<String> iKSegmenterToList(String target) throws Exception {
if (StringUtils.isEmpty(target)) return new ArrayList<>();
List<String> result = new ArrayList<>();
StringReader sr = new StringReader(target);
IKSegmenter ik = new IKSegmenter(sr, true); // true = enable smart mode
Lexeme lex;
while ((lex = ik.next()) != null) {
result.add(lex.getLexemeText());
}
return result;
}Matching Workflow
public JsonResult matchCompanyName(CompanyDTO companyDTO, String accessToken, String localIp) {
String sourceCompanyName = companyDTO.getCompanyName();
String targetCompanyName = sourceCompanyName;
// Remove parentheses and generic keywords
targetCompanyName = targetCompanyName.replaceAll("[(]|[)]|[(]|[)]", "");
targetCompanyName = targetCompanyName.replaceAll("[(集团|股份|有限|责任|分公司)]", "");
// Skip region stripping for banks
if (!targetCompanyName.contains("银行")) {
targetCompanyName = formatCompanyName(targetCompanyName);
}
// Tokenize
String splitCompanyName = splitWord(targetCompanyName);
// Query MySQL using REGEXP
List<Company> matchedCompany = companyRepository.queryMatchCompanyName(splitCompanyName, targetCompanyName);
List<String> result = new ArrayList<>();
for (Company companyInfo : matchedCompany) {
result.add(companyInfo.getCompanyName());
if (companyDTO.getCompanyId().equals(companyInfo.getCompanyId())) {
result.remove(companyInfo.getCompanyName()); // exclude self
}
}
return JsonResult.successResult(result);
}The repository defines a native SQL query that uses REGEXP for fuzzy matching and orders results by the proportion of matched characters:
@Query(value = "SELECT * FROM company WHERE isDeleted = '0' " +
"AND companyName REGEXP ?1 " +
"ORDER BY length(REPLACE(companyName, ?2, '')) / length(companyName)", nativeQuery = true)
List<Company> queryMatchCompanyName(String companyNameRegex, String companyName);Explanation of the ranking expression: REPLACE(companyName, ?2, '') removes the matched token string from the original name. length(...) computes the number of characters removed; dividing by the total length yields a relevance score (higher score = more characters matched).
Key Takeaways
For low‑traffic systems, MySQL REGEXP provides adequate fuzzy‑search without the overhead of a dedicated search engine.
Pre‑processing to strip province/city/county/village names dramatically improves match precision.
IKAnalyzer produces meaningful token sets; the pipe‑separated token string becomes the REGEXP pattern.
Sorting by length(REPLACE(...)) / length(...) effectively ranks matches by relevance.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Architect
Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies. Open to ideas‑driven architects who enjoy sharing and learning.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
