Backend Development 17 min read

Implementing Fuzzy Company Name Matching with MySQL REGEXP and Java

This article describes a backend solution for a company approval workflow that extracts key information from company names, tokenizes them using IKAnalyzer, applies MySQL REGEXP for fuzzy matching, and sorts results by match degree, complete with code examples and deployment tips.

Top Architect

Jul 11, 2024

Implementing Fuzzy Company Name Matching with MySQL REGEXP and Java

1. Business Scenario Overview

The goal is to implement a company application approval process involving two roles: a business role that submits a company addition request and an administrator role that approves it. The core workflow can be summarized as: the business role applies to add a company, and the administrator reviews and approves the request.

When a business adds a company, it may only provide a short name; the full name might already exist in the system. To prevent duplicate entries, the administrator must check existing records for the same company before approving.

2. Implementation Idea

From a technical perspective, the required functional points are:

Tokenization (word segmentation)

Matching with existing data in the database

Sorting results according to match degree

The tokenization function is readily available via existing analyzers, so the main focus of the requirement is how to match the data in the database and sort by relevance.

3. Fuzzy Matching Technology Selection

Option 1: Introduce Elasticsearch

Option 2: Use MySQL

Because the system scale is small, introducing Elasticsearch would add unnecessary cost and complexity (data synchronization, increased system complexity). Therefore, using MySQL's built‑in capabilities is preferred.

MySQL provides three fuzzy search methods:

LIKE matching – requires the pattern to match the entire target field.

RegExp – regular expression matching; the target field only needs to contain the pattern.

Full‑text index – creates a full‑text index on CHAR, VARCHAR, TEXT columns and queries via SQL.

Analysis of the three methods for the current scenario:

LIKE cannot satisfy the requirement, so it is discarded.

Full‑text index has poor customizability and does not support arbitrary pattern queries, so it is discarded.

RegExp can achieve arbitrary pattern matching; its efficiency is slightly lower than full‑text but acceptable for the modest data volume.

Therefore, RegExp is chosen for fuzzy matching.

4. Implementation Effect Display

5. Core Code

5.1 Extract Company Key Information

This step removes useless information such as location names, parentheses, and keywords like "Group", "Ltd", "Co" from the input company name, leaving only the essential part.

/**
 * Remove meaningless information from the company name before matching
 * @param targetCompanyName
 * @return cleaned company name
 */
private String formatCompanyName(String targetCompanyName) {
    String regex = "(?<province>[^省]+自治区|.*?省|.*?行政区|.*?市)" +
                   "?(?<city>[^市]+自治州|.*?地区|.*?行政单位|.+盟|市辖区|.*?市|.*?县)" +
                   "?(?<county>[^(区|市|县|旗|岛)]+区|.*?市|.*?县|.*?旗|.*?岛)" +
                   "?(?<village>.*)";
    Matcher matcher = Pattern.compile(regex).matcher(targetCompanyName);
    while (matcher.find()) {
        String province = matcher.group("province");
        log.info("province:{}", province);
        if (StringUtils.isNotBlank(province) && targetCompanyName.contains(province)) {
            targetCompanyName = targetCompanyName.replace(province, "");
        }
        log.info("处理完省份的公司名称:{}", targetCompanyName);
        String city = matcher.group("city");
        log.info("city:{}", city);
        if (StringUtils.isNotBlank(city) && targetCompanyName.contains(city)) {
            targetCompanyName = targetCompanyName.replace(city, "");
        }
        log.info("处理完城市的公司名称:{}", targetCompanyName);
        String county = matcher.group("county");
        log.info("county:{}", county);
        if (StringUtils.isNotBlank(county) && targetCompanyName.contains(county)) {
            targetCompanyName = targetCompanyName.replace(county, "");
        }
        log.info("处理完区县级的公司名称:{}", targetCompanyName);
    }
    String[][] address = AddressUtil.ADDRESS;
    for (String[] cityArr : address) {
        for (String b : cityArr) {
            if (targetCompanyName.contains(b)) {
                targetCompanyName = targetCompanyName.replace(b, "");
            }
        }
    }
    log.info("处理后的公司名称:{}", targetCompanyName);
    return targetCompanyName;
}

Address utility class provides a two‑dimensional array of Chinese provinces, cities and districts for further cleaning.

public class AddressUtil {
    public static final String[][] ADDRESS = {
        {"北京"},
        {"天津"},
        {"安徽","安庆","蚌埠","亳州","巢湖","池州","滁州","阜阳","合肥","淮北","淮南","黄山","六安","马鞍山","宿州","铜陵","芜湖","宣城"},
        {"澳门"},
        {"香港"},
        {"福建","福州","龙岩","南平","宁德","莆田","泉州","厦门","漳州"},
        // ... (other provinces omitted for brevity) ...
    };
}

5.2 Tokenization Related Code

POM dependencies to introduce the IKAnalyzer Chinese tokenizer and the Lucene query parser.

<!-- ikAnalyzer Chinese tokenizer -->
<dependency>
    <groupId>com.janeluo</groupId>
    <artifactId>ikanalyzer</artifactId>
    <version>2012_u6</version>
    <exclusions>
        <exclusion>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-core</artifactId>
        </exclusion>
        <exclusion>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-queryparser</artifactId>
        </exclusion>
        <exclusion>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-analyzers-common</artifactId>
        </exclusion>
    </exclusions>
</dependency>

<!-- lucene-queryParser query analysis module -->
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-queryparser</artifactId>
    <version>7.3.0</version>
</dependency>

IKAnalyzerSupport class configures the tokenizer.

@Slf4j
public class IKAnalyzerSupport {
    /**
     * IK segmentation
     * @param target
     * @return list of tokens
     */
    public static List<String> iKSegmenterToList(String target) throws Exception {
        if (StringUtils.isEmpty(target)) {
            return new ArrayList();
        }
        List<String> result = new ArrayList<>();
        StringReader sr = new StringReader(target);
        // false: disable smart segmentation (affects precision)
        IKSegmenter ik = new IKSegmenter(sr, true);
        Lexeme lex;
        while ((lex = ik.next()) != null) {
            result.add(lex.getLexemeText());
        }
        return result;
    }
}

ServiceImpl method that performs tokenization on the cleaned company name.

/**
 * Tokenize the target company name
 * @param targetCompanyName
 * @return tokenized string joined by '|'
 */
private String splitWord(String targetCompanyName) {
    log.info("对处理后端公司名称进行分词");
    List<String> splitWord = new ArrayList<>();
    String result = targetCompanyName;
    try {
        splitWord = iKSegmenterToList(targetCompanyName);
        result = splitWord.stream().distinct().collect(Collectors.joining("|"));
        log.info("分词结果:{}", result);
    } catch (Exception e) {
        log.error("分词报错:{}", e.getMessage());
    }
    return result;
}

5.3 Matching

ServiceImpl method that combines preprocessing, tokenization, and database matching.

public JsonResult matchCompanyName(CompanyDTO companyDTO, String accessToken, String localIp) {
    // Preprocess company name
    String sourceCompanyName = companyDTO.getCompanyName();
    String targetCompanyName = sourceCompanyName;
    log.info("处理前公司名称:{}", targetCompanyName);
    // Remove parentheses
    targetCompanyName = targetCompanyName.replaceAll("[（]|[）]|[(]|[)]", "");
    // Remove keywords like Group, Ltd, etc.
    targetCompanyName = targetCompanyName.replaceAll("[(集团|股份|有限|责任|分公司)]", "");
    if (!targetCompanyName.contains("银行")) {
        // Remove administrative region info
        targetCompanyName = formatCompanyName(targetCompanyName);
    }
    // Tokenization
    String splitCompanyName = splitWord(targetCompanyName);
    // Matching via repository
    List<Company> matchedCompany = companyRepository.queryMatchCompanyName(splitCompanyName, targetCompanyName);
    List<String> result = new ArrayList();
    for (Company companyInfo : matchedCompany) {
        result.add(companyInfo.getCompanyName());
        if (companyDTO.getCompanyId().equals(companyInfo.getCompanyId())) {
            result.remove(companyInfo.getCompanyName());
        }
    }
    return JsonResult.successResult(result);
}

Repository interface defines a native SQL query that uses REGEXP for fuzzy matching and orders results by match degree.

/**
 * Fuzzy match company name
 * @param companyNameRegex tokenized company name
 * @param companyName original company name
 */
@Query(value = "SELECT * FROM company WHERE isDeleted = '0' and companyName REGEXP ?1 ORDER BY length(REPLACE(companyName,?2,''))/length(companyName)", nativeQuery = true)
List<Company> queryMatchCompanyName(String companyNameRegex, String companyName);

The ordering expression calculates the number of occurrences of the keyword (the second parameter) in the company name; the more matches, the higher the rank.

In addition to the technical tutorial, the article contains promotional material for a ChatGPT service, a knowledge‑sharing community, and related offers.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Java fuzzy-matching regex

Written by

Top Architect

Top Architect focuses on sharing practical architecture knowledge, covering enterprise, system, website, large‑scale distributed, and high‑availability architectures, plus architecture adjustments using internet technologies. We welcome idea‑driven, sharing‑oriented architects to exchange and learn together.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.