Scrape Web Pages with Jsoup and Export to Excel Using EasyExcel in Java
This article demonstrates how to use Jsoup to fetch and parse HTML content from a web page, extract specific table data via CSS selectors, map the data to Java objects, and efficiently write the results to an Excel file with EasyExcel, highlighting memory advantages over POI.
Jsoup is a Java HTML parser that loads a URL or HTML string into a DOM tree and offers jQuery‑like selector APIs for extracting elements.
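As a minimal sketch of the selector API described above, the following parses an in-memory HTML string instead of a live URL. The sample markup, the JsoupSelectorSketch class, and the extractRows helper are illustrative assumptions and do not come from the original article.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

class JsoupSelectorSketch {

    // Hypothetical sample markup; not taken from the target page.
    static final String HTML =
            "<table id='ranking'>"
          + "<tr><th>Rank</th><th>Company</th></tr>"
          + "<tr><td>1</td><td>Alpha Corp</td></tr>"
          + "<tr><td>2</td><td>Beta Ltd</td></tr>"
          + "</table>";

    /** Returns "rank:company" for every data row of the #ranking table. */
    static List<String> extractRows(String html) {
        // Parse the string into a DOM tree.
        Document doc = Jsoup.parse(html);
        // "tr:has(td)" keeps only rows with data cells, so the header row is skipped.
        Elements rows = doc.select("#ranking tr:has(td)");
        List<String> result = new ArrayList<>();
        for (Element row : rows) {
            Elements cells = row.select("td");
            result.add(cells.get(0).text() + ":" + cells.get(1).text());
        }
        return result;
    }

    public static void main(String[] args) {
        extractRows(HTML).forEach(System.out::println);
    }
}
```

The same select calls work on a Document obtained from Jsoup.connect(url).get(), so a selector can be tried against a saved copy of the page before any network request is made.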
Traditional Excel libraries such as Apache POI can consume large amounts of memory when handling big workbooks. EasyExcel rewrites POI's parsing of Excel 2007 (.xlsx) files; according to the EasyExcel project, this reduces memory usage from around 100 MB to a few megabytes, which makes it suitable for large files.
Maven dependencies required are:
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.11.3</version>
</dependency>
<dependency>
    <groupId>com.alibaba</groupId>
    <artifactId>easyexcel</artifactId>
    <version>3.0.5</version>
</dependency>
The target page (e.g., https://www.maigoo.com/news/3jcNODk3.html) is inspected with browser dev tools to locate the data: the #t_container div, its 22nd child (div:eq(21)), and the table rows inside.
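The 22nd child maps to div:eq(21) because Jsoup's :eq(n) pseudo-selector matches on a zero-based sibling index. The sketch below shows this on a miniature stand-in for the page; the markup, the EqIndexSketch class, and the childText helper are assumptions made for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

class EqIndexSketch {

    // Hypothetical markup that mimics the "#t_container > div" layout in miniature.
    static final String HTML =
            "<div id='t_container'>"
          + "<div>intro</div>"
          + "<div>notes</div>"
          + "<div><table><tr><td>1</td><td>Alpha Corp</td></tr></table></div>"
          + "</div>";

    /** Text of the direct child div at the given zero-based index, or "" if there is none. */
    static String childText(String html, int index) {
        Document doc = Jsoup.parse(html);
        // ">" restricts the match to direct children of #t_container.
        Elements match = doc.select("#t_container > div:eq(" + index + ")");
        return match.isEmpty() ? "" : match.first().text();
    }

    public static void main(String[] args) {
        System.out.println(childText(HTML, 1)); // prints "notes" (the second child)
        System.out.println(childText(HTML, 2)); // the third child, which holds the table
    }
}
```

A positional selector like this breaks as soon as the site inserts or removes a sibling block, so it is worth checking that the selection is non-empty before processing rows.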
Java code connects to the URL with a custom User‑Agent, timeout, and referer, then selects rows using:
Document document = Jsoup.connect(url)
        .ignoreContentType(true)
        .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3")
        .timeout(30000)
        .header("referer", "https://www.maigoo.com")
        .get();
Elements select = document.select("#t_container > div:eq(21) table tr");
Each row is processed to extract td values, convert them to appropriate types, and build a WealthEntity object annotated with @ExcelProperty:
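// Result list; declared here so the snippet is complete.
List<WealthEntity> list = new ArrayList<>();
// Start at index 1, presumably to skip the table's header row.
for (int i = 1; i < select.size(); i++) {
    Element tr = select.get(i);
    Elements tds = tr.select("td");
    // Column order: rank, company name, income, profit.
    // Integer.valueOf throws NumberFormatException if the rank cell is not a plain integer.
    Integer index = Integer.valueOf(tds.get(0).text());
    String companyName = tds.get(1).text();
    String income = tds.get(2).text();
    String profit = tds.get(3).text();
    WealthEntity wealthEntity = WealthEntity.builder()
            .index(index)
            .companyName(companyName)
            .income(income)
            .profit(profit)
            .build();
    list.add(wealthEntity);
}
The collected list is written to an Excel file with EasyExcel: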
// Output path; the file name reads roughly "2023 Fortune World Top 100".
String fileName = "D:/2023财富世界100强.xlsx";
EasyExcel.write(fileName, WealthEntity.class)
        .sheet("100强") // sheet name, "Top 100"
        .doWrite(list);
A screenshot of the resulting Excel file shows the extracted ranking, company name, income, and profit columns.
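The column headers in the output come from the @ExcelProperty annotations on WealthEntity, which the article uses but does not show. The following is one possible sketch of that class. The header texts are assumptions, and the hand-written builder stands in for what the original most likely generates with Lombok's @Builder, which is also an assumption.

```java
import com.alibaba.excel.annotation.ExcelProperty;

// Sketch only: in a real project this would be a public class in WealthEntity.java.
class WealthEntity {

    // Header texts are assumptions chosen to match the four columns described above.
    @ExcelProperty("Rank")
    private Integer index;

    @ExcelProperty("Company")
    private String companyName;

    @ExcelProperty("Income")
    private String income;

    @ExcelProperty("Profit")
    private String profit;

    // No-arg constructor keeps the class a plain JavaBean.
    public WealthEntity() {
    }

    public Integer getIndex() { return index; }
    public void setIndex(Integer index) { this.index = index; }
    public String getCompanyName() { return companyName; }
    public void setCompanyName(String companyName) { this.companyName = companyName; }
    public String getIncome() { return income; }
    public void setIncome(String income) { this.income = income; }
    public String getProfit() { return profit; }
    public void setProfit(String profit) { this.profit = profit; }

    public static Builder builder() {
        return new Builder();
    }

    /** Hand-written stand-in for the builder that Lombok's @Builder would generate. */
    public static class Builder {
        private Integer index;
        private String companyName;
        private String income;
        private String profit;

        public Builder index(Integer index) { this.index = index; return this; }
        public Builder companyName(String companyName) { this.companyName = companyName; return this; }
        public Builder income(String income) { this.income = income; return this; }
        public Builder profit(String profit) { this.profit = profit; return this; }

        /** Creates a fresh entity on every call, so a builder can be reused safely. */
        public WealthEntity build() {
            WealthEntity entity = new WealthEntity();
            entity.setIndex(index);
            entity.setCompanyName(companyName);
            entity.setIncome(income);
            entity.setProfit(profit);
            return entity;
        }
    }
}
```

Income and profit stay as String here because the article's loop stores the cell text unchanged; switching them to a numeric type would require parsing the cell text first.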
The Dominant Programmer
Resources and tutorials for programmers' advanced learning journey. Advanced tracks in Java, Python, and C#. Blog: https://blog.csdn.net/badao_liumang_qizhi
