Backend Development 13 min read

How to Build a MySQL‑Based Fuzzy Company Name Matcher with Regex and IKAnalyzer

This article walks through a real‑world approval workflow where a business user adds a company, explains why MySQL REGEXP is chosen over Elasticsearch for fuzzy matching, and details the step‑by‑step implementation—including name preprocessing, tokenization with IKAnalyzer, and a custom SQL query that ranks results by match degree.

Architect

May 24, 2024

How to Build a MySQL‑Based Fuzzy Company Name Matcher with Regex and IKAnalyzer

Problem Context

A company‑registration workflow requires the administrator to verify that a newly submitted full company name does not already exist in the database. The core challenge is to implement fuzzy matching and relevance ranking for Chinese company names.

Technology Selection

Option 1: introduce Elasticsearch – rejected because the system is small‑scale and the added infrastructure and data‑sync cost outweigh the benefits.

Option 2: use MySQL built‑in fuzzy search – selected.

MySQL provides three fuzzy‑search mechanisms: LIKE – requires full‑field match, unsuitable. REGEXP – arbitrary pattern match, chosen despite being slower than full‑text.

Full‑text index – limited customizability, cannot handle arbitrary patterns, rejected.

Given the modest record count, the performance penalty of REGEXP is acceptable.

Pre‑processing: Strip Administrative Regions

The method formatCompanyName removes province, city, county and village names from the input using a regular expression with named groups and a static ADDRESS array that enumerates Chinese provinces and their subordinate cities.

private String formatCompanyName(String targetCompanyName) {
    String regex = "(?<province>[^省]+自治区|.*?省|.*?行政区|.*?市)"
        + "?(?<city>[^市]+自治州|.*?地区|.*?行政单位|.+盟|市辖区|.*?市|.*?县)"
        + "?(?<county>[^(区|市|县|旗|岛)]+区|.*?市|.*?县|.*?旗|.*?岛)"
        + "?(?<village>.*)";
    Matcher matcher = Pattern.compile(regex).matcher(targetCompanyName);
    while (matcher.find()) {
        String province = matcher.group("province");
        if (StringUtils.isNotBlank(province) && targetCompanyName.contains(province)) {
            targetCompanyName = targetCompanyName.replace(province, "");
        }
        String city = matcher.group("city");
        if (StringUtils.isNotBlank(city) && targetCompanyName.contains(city)) {
            targetCompanyName = targetCompanyName.replace(city, "");
        }
        String county = matcher.group("county");
        if (StringUtils.isNotBlank(county) && targetCompanyName.contains(county)) {
            targetCompanyName = targetCompanyName.replace(county, "");
        }
    }
    // Remove any remaining city‑level keywords from ADDRESS array
    for (String[] city : ADDRESS) {
        for (String b : city) {
            if (targetCompanyName.contains(b)) {
                targetCompanyName = targetCompanyName.replace(b, "");
            }
        }
    }
    return targetCompanyName;
}

The AddressUtil class defines ADDRESS as a two‑dimensional String array containing provinces and their major cities.

Tokenization with IKAnalyzer

Dependencies added in pom.xml:

<dependency>
    <groupId>com.janeluo</groupId>
    <artifactId>ikanalyzer</artifactId>
    <version>2012_u6</version>
    <exclusions>
        <exclusion>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-core</artifactId>
        </exclusion>
        <exclusion>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-queryparser</artifactId>
        </exclusion>
        <exclusion>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-analyzers-common</artifactId>
        </exclusion>
    </exclusions>
</dependency>

<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-queryparser</artifactId>
    <version>7.3.0</version>
</dependency>

Utility class converts a string into a list of tokens using IKAnalyzer in smart mode:

public static List<String> iKSegmenterToList(String target) throws Exception {
    if (StringUtils.isEmpty(target)) return new ArrayList<>();
    List<String> result = new ArrayList<>();
    StringReader sr = new StringReader(target);
    IKSegmenter ik = new IKSegmenter(sr, true); // true = enable smart mode
    Lexeme lex;
    while ((lex = ik.next()) != null) {
        result.add(lex.getLexemeText());
    }
    return result;
}

Matching Workflow

public JsonResult matchCompanyName(CompanyDTO companyDTO, String accessToken, String localIp) {
    String sourceCompanyName = companyDTO.getCompanyName();
    String targetCompanyName = sourceCompanyName;
    // Remove parentheses and generic keywords
    targetCompanyName = targetCompanyName.replaceAll("[（]|[）]|[(]|[)]", "");
    targetCompanyName = targetCompanyName.replaceAll("[(集团|股份|有限|责任|分公司)]", "");
    // Skip region stripping for banks
    if (!targetCompanyName.contains("银行")) {
        targetCompanyName = formatCompanyName(targetCompanyName);
    }
    // Tokenize
    String splitCompanyName = splitWord(targetCompanyName);
    // Query MySQL using REGEXP
    List<Company> matchedCompany = companyRepository.queryMatchCompanyName(splitCompanyName, targetCompanyName);
    List<String> result = new ArrayList<>();
    for (Company companyInfo : matchedCompany) {
        result.add(companyInfo.getCompanyName());
        if (companyDTO.getCompanyId().equals(companyInfo.getCompanyId())) {
            result.remove(companyInfo.getCompanyName()); // exclude self
        }
    }
    return JsonResult.successResult(result);
}

The repository defines a native SQL query that uses REGEXP for fuzzy matching and orders results by the proportion of matched characters:

@Query(value = "SELECT * FROM company WHERE isDeleted = '0' " +
    "AND companyName REGEXP ?1 " +
    "ORDER BY length(REPLACE(companyName, ?2, '')) / length(companyName)", nativeQuery = true)
List<Company> queryMatchCompanyName(String companyNameRegex, String companyName);

Explanation of the ranking expression: REPLACE(companyName, ?2, '') removes the matched token string from the original name. length(...) computes the number of characters removed; dividing by the total length yields a relevance score (higher score = more characters matched).

Key Takeaways

For low‑traffic systems, MySQL REGEXP provides adequate fuzzy‑search without the overhead of a dedicated search engine.

Pre‑processing to strip province/city/county/village names dramatically improves match precision.

IKAnalyzer produces meaningful token sets; the pipe‑separated token string becomes the REGEXP pattern.

Sorting by length(REPLACE(...)) / length(...) effectively ranks matches by relevance.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Java Database Backend Development MySQL fuzzy-matching regex IKAnalyzer

Written by

Architect

Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies. Open to ideas‑driven architects who enjoy sharing and learning.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.