Building a Site Search Engine with Java Indexing and File Parsing

This article explains how to build a site‑wide search engine using Java, covering crawling concepts, forward and inverted indexing, module design, tokenization methods, and detailed code examples for file enumeration, HTML parsing, and index generation.

Java Captain

Apr 2, 2022

Preface

We cannot use a small server to create a full‑scale search engine like Baidu or Sogou; instead we implement a site‑wide search that indexes resources within a website.

1. How Search Engines Work

A search engine works like a bee that constantly crawls web pages, extracts content, and builds indexes for later queries.

To obtain documents to index, we can either write a crawler (for example in Python) or download a ready-made documentation package. (The author originally wanted to index League of Legends data.)

Avoid crawling arbitrary sites, since that can cause legal problems; the author practiced on their own school's website, which could be crawled freely.

Why use indexes? Crawling produces a large amount of data, and scanning every document for every query would take time proportional to the size of the whole collection, which is far too slow.

We need two kinds of indexes: a forward index and an inverted index.

Example with League of Legends: the forward index maps a hero to its skills. For Master Yi:

Q skill – Alpha Strike

W skill – Meditate

E skill – Wuju Style

R skill – Highlander

The inverted index goes the other way: it maps a skill to the heroes that have it. A single entry might list:

Tryndamere (the Barbarian King)

Master Yi (the Wuju Bladesman)

Fiora (the Grand Duelist)
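As a rough sketch of the two structures, the forward index can be modeled as a map from a key to its attributes, and the inverted index as the same data turned around. The class name IndexSketch, the helper names, and the tags ("sword", "assassin", and so on) are made up for this illustration; they are not part of the original project.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class IndexSketch {
    // Hypothetical sample data: hero -> descriptive tags (the tags are invented for this example).
    static Map<String, List<String>> sampleForward() {
        Map<String, List<String>> forward = new LinkedHashMap<>();
        forward.put("Master Yi", Arrays.asList("sword", "assassin"));
        forward.put("Tryndamere", Arrays.asList("sword", "fighter"));
        forward.put("Fiora", Arrays.asList("sword", "duelist"));
        return forward;
    }

    // Turn a forward index (key -> values) into an inverted index (value -> keys).
    static Map<String, List<String>> invert(Map<String, List<String>> forward) {
        Map<String, List<String>> inverted = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : forward.entrySet()) {
            for (String tag : entry.getValue()) {
                inverted.computeIfAbsent(tag, k -> new ArrayList<>()).add(entry.getKey());
            }
        }
        return inverted;
    }

    public static void main(String[] args) {
        Map<String, List<String>> forward = sampleForward();
        Map<String, List<String>> inverted = invert(forward);
        System.out.println(forward.get("Master Yi")); // forward lookup: hero -> tags
        System.out.println(inverted.get("sword"));    // inverted lookup: tag -> heroes
    }
}
```

A query for a word then becomes a single map lookup in the inverted index instead of a scan over every entry.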

2. Module Division

1. Index Module

1) Scan downloaded documents, analyze content, and build forward and inverted indexes, then save the index data to files.

2) Load the prepared indexes and provide APIs for forward‑index and inverted‑index lookups.

2. Search Module

1) Call the index module to perform a complete search process.

Input: user query string

Output: complete search results

3. Web Module

Implement a simple web application that interacts with users through a browser, containing both front‑end and back‑end components.
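To make the division of work between the index module and the search module more concrete, here is a minimal in-memory sketch. Everything in it is an assumption for illustration: the names SearchSketch, Index, addDoc, getPostings and search are hypothetical, the tokenizer is a simple split on non-alphanumeric characters rather than a real tokenization library, and ranking is just the number of distinct query terms a document matches.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;

public class SearchSketch {
    // Hypothetical stand-in for the index module, kept entirely in memory.
    static class Index {
        private final List<String> forward = new ArrayList<>();             // docId -> title
        private final Map<String, List<Integer>> inverted = new HashMap<>(); // term -> docIds

        void addDoc(String title, String content) {
            int docId = forward.size();
            forward.add(title);
            for (String term : tokenize(title + " " + content)) {
                List<Integer> postings = inverted.computeIfAbsent(term, k -> new ArrayList<>());
                // Documents are added in order, so checking the last posting avoids duplicates.
                if (postings.isEmpty() || postings.get(postings.size() - 1) != docId) {
                    postings.add(docId);
                }
            }
        }

        String getTitle(int docId) {
            return forward.get(docId);
        }

        List<Integer> getPostings(String term) {
            return inverted.getOrDefault(term, new ArrayList<>());
        }
    }

    // Placeholder tokenizer (an assumption): lower-case, split on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    // Search module: tokenize the query, look up each term, rank documents by matched term count.
    static List<String> search(Index index, String query) {
        Map<Integer, Integer> hits = new HashMap<>();
        for (String term : new LinkedHashSet<>(tokenize(query))) {
            for (int docId : index.getPostings(term)) {
                hits.merge(docId, 1, Integer::sum);
            }
        }
        List<Integer> docIds = new ArrayList<>(hits.keySet());
        docIds.sort((a, b) -> {
            int byHits = Integer.compare(hits.get(b), hits.get(a));
            return byHits != 0 ? byHits : Integer.compare(a, b);
        });
        List<String> titles = new ArrayList<>();
        for (int docId : docIds) {
            titles.add(index.getTitle(docId));
        }
        return titles;
    }

    public static void main(String[] args) {
        Index index = new Index();
        index.addDoc("ArrayList", "Resizable array implementation of the List interface");
        index.addDoc("HashMap", "Hash table based implementation of the Map interface");
        System.out.println(search(index, "list interface"));
    }
}
```

The web module would then only need to pass the user's query string to a method like search and render the returned results.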

3. How to Perform Tokenization

Tokenization Principles

1. Dictionary‑based: enumerate all possible words and store them in a dictionary file.

2. Statistics‑based: collect a large corpus, manually annotate it, and use word co‑occurrence probabilities.
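The dictionary-based approach can be illustrated with forward maximum matching: at each position, take the longest dictionary word that starts there, and fall back to a single character when nothing matches. This is only a sketch of one common dictionary strategy, not how ansj or any particular library works internally; the class name DictTokenizerSketch, the tiny dictionary, and the maxWordLen parameter are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DictTokenizerSketch {
    // Forward maximum matching over a dictionary of known words.
    static List<String> tokenize(String text, Set<String> dict, int maxWordLen) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(text.length(), i + maxWordLen);
            String match = null;
            // Try the longest candidate first, then shrink it one character at a time.
            for (int j = end; j > i; j--) {
                String candidate = text.substring(i, j);
                if (dict.contains(candidate)) {
                    match = candidate;
                    break;
                }
            }
            // No dictionary word starts here: emit a single character.
            if (match == null) {
                match = text.substring(i, i + 1);
            }
            tokens.add(match);
            i += match.length();
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Tiny hypothetical dictionary; a real one would be loaded from a dictionary file.
        Set<String> dict = new HashSet<>(Arrays.asList("易大师", "无极剑道", "最后", "传人"));
        System.out.println(tokenize("易大师是无极剑道的最后传人", dict, 4));
    }
}
```

The quality of the result depends entirely on the dictionary, which is one reason a maintained third-party library is usually the more practical choice.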

Java offers many third‑party tokenization libraries, such as ansj, which can be added via Maven.

Look up the latest version in the Maven repository and add the dependency to pom.xml; Maven then downloads the library.

Test code using the library:

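import org.ansj.domain.Term;
import org.ansj.splitWord.analysis.ToAnalysis;
import java.util.List;

// Quick check that ansj is on the classpath and can segment Chinese text
public class TestAnsj {
    public static void main(String[] args) {
        // Sample input: a Chinese description of the hero Master Yi
        String str = "易大师是一个有超高机动性的刺客、战士型英雄，擅长利用快速的打击迅速击溃对手，易大师一般打野和走单人路，作为无极剑道的最后传人，易可以迅速砍出大量伤害，同时还能利用技能躲避猛烈的攻击，避开敌人的集火。";
        // parse() segments the string; getTerms() returns the resulting tokens
        List<Term> terms = ToAnalysis.parse(str).getTerms();
        for (Term term : terms) {
            // getName() is the text of the token
            System.out.println(term.getName());
        }
    }
}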

4. File Reading

Copy the path of the downloaded documents into a constant string.

Use recursion to enumerate all HTML files: if an entry is a directory, recurse into it; if it is a regular file, add it to the file list.

import java.io.File;
import java.util.ArrayList;

// Read the downloaded documents
public class Parser {
    private static final String INPUT_PATH = "D:/test/docs/api";
    public void run() {
        // Entry point of Parser
        // 1. Enumerate all files under the path (HTML)
        ArrayList<File> fileList = new ArrayList<>();
        enumFile(INPUT_PATH, fileList);
        System.out.println(fileList);
        System.out.println(fileList.size());
        // 2. For each file, open and parse its content
        // 3. Save the constructed index data structure to a file
    }
    // First parameter: start path; second: result list
    private void enumFile(String inputPath, ArrayList<File> fileList) {
        File rootPath = new File(inputPath);
        // listFiles() returns null if the path is not a directory or cannot be read
        File[] files = rootPath.listFiles();
        if (files == null) {
            return;
        }
        for (File f : files) {
            // If f is a directory, recurse; if a regular file, add to list
            if (f.isDirectory()) {
                enumFile(f.getAbsolutePath(), fileList);
            } else {
                if (f.getAbsolutePath().endsWith(".html"))
                    fileList.add(f);
            }
        }
    }
    public static void main(String[] args) {
        Parser parser = new Parser();
        parser.run();
    }
}

Without a filter, the run prints many files that are not pages; the next step is to keep only the useful ones, that is, files whose names end in ".html":

else {
    if (f.getAbsolutePath().endsWith(".html"))
        fileList.add(f);
}
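As an alternative to hand-written recursion, the same enumeration can be sketched with java.nio.file.Files.walk, which traverses the directory tree and lets the ".html" filter be expressed as a stream operation. This is not the author's code: the class name WalkSketch and the helpers listHtmlFiles and createDemoTree are hypothetical, and the demo tree is created in a temporary directory only so the example can run anywhere.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class WalkSketch {
    // Collect every regular file under root whose name ends with ".html".
    static List<File> listHtmlFiles(String root) {
        // The stream holds open directory handles, so close it with try-with-resources.
        try (Stream<Path> paths = Files.walk(Paths.get(root))) {
            return paths
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().endsWith(".html"))
                    .map(Path::toFile)
                    .sorted()
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Build a small throwaway tree: two HTML files (one nested) and one text file.
    static String createDemoTree() {
        try {
            Path root = Files.createTempDirectory("walk-demo");
            Files.createDirectories(root.resolve("sub"));
            Files.write(root.resolve("index.html"), new byte[0]);
            Files.write(root.resolve("notes.txt"), new byte[0]);
            Files.write(root.resolve("sub").resolve("page.html"), new byte[0]);
            return root.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<File> files = listHtmlFiles(createDemoTree());
        System.out.println(files);
        System.out.println(files.size());
    }
}
```

In the article's setup, listHtmlFiles would be called with the documentation path stored in INPUT_PATH instead of the demo tree.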

4.1 Open File and Parse Content

Parsing is divided into three parts: Title, URL, and Content.

4.1.1 Parse Title

f.getName() returns the file name directly. The code removes the trailing ".html" (5 characters) to obtain the title.

private String parseTitle(File f) {
    String name = f.getName();
    // Drop the ".html" suffix; enumFile only collects files with this extension
    return name.substring(0, name.length() - ".html".length());
}

4.1.2 Parse URL

The URL is constructed by combining a base URL with the relative path derived from the file’s absolute path.

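private String parseUrl(File f) {
    // Base URL of the online documentation (no trailing slash, because part2 starts with a separator)
    String part1 = "https://docs.oracle.com/javase/8/docs/api";
    // Path relative to INPUT_PATH; Windows uses '\' as separator, so normalize it to '/'
    String part2 = f.getAbsolutePath().substring(INPUT_PATH.length()).replace('\\', '/');
    return part1 + part2;
}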

4.1.3 Parse Content

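Read the file character by character, using '<' and '>' as switches: '<' turns copying off and '>' turns it back on, so only the text outside tags is kept. read() returns an int, and a value of -1 indicates end of file.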

private String parseContent(File f) throws IOException {
    // try-with-resources closes the reader automatically
    try (FileReader fileReader = new FileReader(f)) {
        boolean isCopy = true; // true while we are outside a tag
        StringBuilder content = new StringBuilder();
        while (true) {
            // read() returns one character as an int, or -1 at end of file;
            // an IOException propagates to the caller instead of being swallowed
            int ret = fileReader.read();
            if (ret == -1) {
                break;
            }
            char c = (char) ret;
            if (isCopy) {
                if (c == '<') {
                    isCopy = false;
                    continue;
                }
                // Replace line breaks with spaces so the content stays on one line
                if (c == '\n' || c == '\r') {
                    c = ' ';
                }
                content.append(c);
            } else {
                if (c == '>') {
                    isCopy = true;
                }
            }
        }
        return content.toString();
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    }
    return "";
}
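To see the '<' and '>' switch in isolation, the same logic can be applied to an in-memory string instead of a file. This is a small sketch for experimentation; the class name StripTagsSketch and the method stripTags are hypothetical and not part of the Parser class.

```java
public class StripTagsSketch {
    // Keep only the characters outside of <...> and turn line breaks into spaces.
    static String stripTags(String html) {
        StringBuilder content = new StringBuilder();
        boolean isCopy = true; // true while we are outside a tag
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (isCopy) {
                if (c == '<') {
                    isCopy = false; // a tag starts: stop copying
                    continue;
                }
                if (c == '\n' || c == '\r') {
                    c = ' ';
                }
                content.append(c);
            } else if (c == '>') {
                isCopy = true; // the tag ends: resume copying
            }
        }
        return content.toString();
    }

    public static void main(String[] args) {
        String html = "<html><body><h1>Class ArrayList</h1>\n<p>Resizable array.</p></body></html>";
        System.out.println(stripTags(html)); // prints "Class ArrayList Resizable array."
    }
}
```

The approach is deliberately simple: it does not decode entities such as &lt;, and it keeps the text inside script and style elements, so a real HTML parser would be the safer choice for arbitrary pages.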

The complete module code is as follows:

import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;

public class Parser {
    private static final String INPUT_PATH = "D:/test/docs/api";
    public void run() {
        ArrayList<File> fileList = new ArrayList<>();
        enumFile(INPUT_PATH, fileList);
        System.out.println(fileList);
        System.out.println(fileList.size());
        for (File f : fileList) {
            System.out.println("Start parsing: " + f.getAbsolutePath());
            parseHTML(f);
        }
    }
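    private String parseTitle(File f) {
        String name = f.getName();
        // Drop the ".html" suffix; enumFile only collects files with this extension
        return name.substring(0, name.length() - ".html".length());
    }
    private String parseUrl(File f) {
        // Base URL of the online documentation (no trailing slash, because part2 starts with a separator)
        String part1 = "https://docs.oracle.com/javase/8/docs/api";
        // Path relative to INPUT_PATH; Windows uses '\' as separator, so normalize it to '/'
        String part2 = f.getAbsolutePath().substring(INPUT_PATH.length()).replace('\\', '/');
        return part1 + part2;
    }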
    private String parseContent(File f) throws IOException {
        // try-with-resources closes the reader automatically
        try (FileReader fileReader = new FileReader(f)) {
            boolean isCopy = true; // true while we are outside a tag
            StringBuilder content = new StringBuilder();
            while (true) {
                // read() returns one character as an int, or -1 at end of file;
                // an IOException propagates to the caller instead of being swallowed
                int ret = fileReader.read();
                if (ret == -1) {
                    break;
                }
                char c = (char) ret;
                if (isCopy) {
                    if (c == '<') {
                        isCopy = false;
                        continue;
                    }
                    // Replace line breaks with spaces so the content stays on one line
                    if (c == '\n' || c == '\r') {
                        c = ' ';
                    }
                    content.append(c);
                } else {
                    if (c == '>') {
                        isCopy = true;
                    }
                }
            }
            return content.toString();
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }
        return "";
    }
    private void parseHTML(File f) {
        String title = parseTitle(f);
        String url = parseUrl(f);
        try {
            String content = parseContent(f);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
    private void enumFile(String inputPath, ArrayList<File> fileList) {
        File rootPath = new File(inputPath);
        // listFiles() returns null if the path is not a directory or cannot be read
        File[] files = rootPath.listFiles();
        if (files == null) {
            return;
        }
        for (File f : files) {
            if (f.isDirectory()) {
                enumFile(f.getAbsolutePath(), fileList);
            } else {
                if (f.getAbsolutePath().endsWith(".html"))
                    fileList.add(f);
            }
        }
    }
    public static void main(String[] args) {
        Parser parser = new Parser();
        parser.run();
    }
}
