Using Regular Expressions in Java and Groovy for Web Crawling

This article explains how regular expressions can be applied in Java and Groovy to extract information from JSON and HTML responses during web crawling, provides a reusable Java Regex utility class, demonstrates Groovy-specific regex operators, and shows sample code and console output.

FunTester

Regular expressions are powerful and ubiquitous, especially in web crawling where data may be returned as JSON or HTML.

For JSON responses a JSON parser can be used directly, while HTML responses are often treated as strings and regexes are applied to extract the required information.
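As a rough illustration of treating an HTML response as a plain string, the sketch below pulls every href value out of a snippet with a capturing group. The class name, the pattern, and the sample markup are assumptions made up for this example and do not come from the original article; real pages with single-quoted or unquoted attributes would need a looser pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HtmlLinkExtractor {

    // Hypothetical pattern: captures the value of a double-quoted href attribute.
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]*)\"");

    /** Returns every href value found in the given HTML text, in order of appearance. */
    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher matcher = HREF.matcher(html);
        while (matcher.find()) {
            links.add(matcher.group(1)); // group(1) is the text between the quotes
        }
        return links;
    }

    public static void main(String[] args) {
        // Made-up response body, for illustration only
        String html = "<ul><li><a href=\"/post/1\">First</a></li>"
                + "<li><a href=\"/post/2\">Second</a></li></ul>";
        System.out.println(extractLinks(html)); // prints [/post/1, /post/2]
    }
}
```

Compiling the pattern once into a constant avoids recompiling it for every page, which matters when a crawler processes many responses.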

The article presents several case studies and then introduces a Java utility class Regex that offers common regex operations such as isRegex (find), isMatch (full match), regexAll (return all matches) and getRegex (extract and clean a match).

package com.fun.utils;

import com.fun.frame.SourceCode;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Wrapper for regular-expression checks
 */
public class Regex extends SourceCode {

    private static final Logger logger = LoggerFactory.getLogger(Regex.class);

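    /**
     * Checks whether the text contains a match for the regular expression
     *
     * @param text  the text to search
     * @param regex the regular expression
     * @return true if a match is found anywhere in the text
     */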
    public static boolean isRegex(String text, String regex) {
        return Pattern.compile(regex).matcher(text).find();
    }

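    /**
     * Checks whether the text matches the regular expression completely, with nothing
     * else around the match; equivalent to adding ^ and $ to the pattern
     *
     * @param text  the text to match
     * @param regex the regular expression
     * @return true if the whole text matches
     */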
    public static boolean isMatch(String text, String regex) {
        return Pattern.compile(regex).matcher(text).matches();
    }

    /**
     * Returns all matches
     *
     * @param text  the text to search
     * @param regex the regular expression
     * @return every match, in order of appearance
     */
    public static List<String> regexAll(String text, String regex) {
        List<String> result = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

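    /**
     * Gets the first match without its literal text: the literal content of the regex is removed
     * <p>Not guaranteed to be fully correct</p>
     *
     * @param text  the text to search
     * @param regex the regular expression
     * @return the cleaned match, or an empty string if extraction fails
     */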
    public static String getRegex(String text, String regex) {
        String result = EMPTY;
        try {
            result = regexAll(text, regex).get(0);
            // Split the regex on ., +, * and ? to get its remaining fragments,
            // then strip each fragment from the match
            String[] split = regex.split("(\\.|\\+|\\*|\\?)");
            for (String s1 : split) {
                if (!s1.isEmpty())
                    result = result.replaceAll(s1, EMPTY);
            }
        } catch (Exception e) {
            logger.warn("Failed to get the matched object!", e);
        }
        // Returning after the try block instead of from finally keeps errors from being silently discarded
        return result;
    }

}
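The difference between the find-style and full-match checks is easy to miss, so here is a minimal standalone sketch that uses only java.util.regex rather than the author's SourceCode base class. The class name RegexSemanticsDemo, the helper findAll, and the sample query string are assumptions for illustration. The last part shows a capturing group as one common way to extract a value without stripping literal text afterwards; it is an alternative technique, not the approach taken by getRegex above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexSemanticsDemo {

    /** Collects every non-overlapping match, in the same spirit as regexAll. */
    public static List<String> findAll(String text, String regex) {
        List<String> result = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "id=42&page=7"; // made-up query string

        // find(): true if the pattern occurs anywhere in the text
        System.out.println(Pattern.compile("\\d+").matcher(text).find());    // true
        // matches(): true only if the whole text matches the pattern
        System.out.println(Pattern.compile("\\d+").matcher(text).matches()); // false
        System.out.println(Pattern.compile("id=\\d+&page=\\d+").matcher(text).matches()); // true

        // All matches, in order of appearance
        System.out.println(findAll(text, "\\d+")); // [42, 7]

        // Capturing group: take only the part inside the parentheses
        Matcher m = Pattern.compile("page=(\\d+)").matcher(text);
        if (m.find()) {
            System.out.println(m.group(1)); // 7
        }
    }
}
```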

Groovy can reuse the same Java regex syntax; the article explains the Groovy operators =~ (equivalent to Pattern.compile(regex).matcher(text)) and ==~ (equivalent to Pattern.compile(regex).matcher(text).matches()), and shows how to define regex literals with slashes.

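public static void main(String[] args) {
    def str = "fantester"
    // =~ creates a java.util.regex.Matcher for the pattern against str
    def matcher = str =~ "\\wt"
    println matcher.find()
    println matcher[0]       // first match
    println matcher.size()   // number of matches
    matcher.each { println it }
    // ==~ is true only when the whole string matches the pattern
    def b = str ==~ ".*er"
    println b

    // Slashy string: backslashes need no extra escaping
    def stra = /.*test\w+/

    println str ==~ stra

    ("fanfanfanfan" =~ "\\wf").each { println it }

    "fanfanfanfan".eachMatch(/\wa/) { println it }
}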
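For readers more familiar with Java, the two Groovy operators can be mirrored roughly as follows. The class and method names (GroovyOperatorEquivalents, findOperator, matchOperator) are hypothetical and exist only for this sketch; the bodies are the plain java.util.regex calls that the operators are described as being equivalent to.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroovyOperatorEquivalents {

    /** Rough Java counterpart of Groovy's `text =~ regex`: builds a Matcher. */
    public static Matcher findOperator(String text, String regex) {
        return Pattern.compile(regex).matcher(text);
    }

    /** Rough Java counterpart of Groovy's `text ==~ regex`: whole-string match. */
    public static boolean matchOperator(String text, String regex) {
        return Pattern.compile(regex).matcher(text).matches();
    }

    public static void main(String[] args) {
        String str = "fantester";

        // Iterate over every match, similar to calling each on a Groovy matcher
        Matcher matcher = findOperator(str, "\\wt");
        while (matcher.find()) {
            System.out.println(matcher.group()); // prints nt, then st
        }

        System.out.println(matchOperator(str, ".*er"));       // true
        System.out.println(matchOperator(str, ".*test\\w+")); // true
        System.out.println(matchOperator(str, "test"));       // false: only part of the string matches
    }
}
```

Note that Java string literals need the doubled backslash, whereas a Groovy slashy string such as /.*test\w+/ does not.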

The console output of the Groovy demo is displayed, illustrating true/false results and the matched groups.

The author notes that Groovy syntax is highly expressive, can produce concise code comparable to Python, and includes a disclaimer that the article was originally published on the "FunTester" public account.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
