How to Scrape China’s County GDP Rankings with Java Jsoup and EasyExcel
This tutorial explains how to collect 2022 county-level GDP and general public budget revenue data for China with Java's Jsoup library and convert the HTML tables into structured Excel files with EasyExcel. The figures are those published by the Chinese National Bureau of Statistics; the example code fetches a ranking page hosted on maigoo.com. Source code and a step-by-step analysis are included.
Background
In the post-pandemic era, 2022 economic data for Chinese counties are publicly available on the National Bureau of Statistics website. A total of 1,279 county-level units have disclosed their GDP and general public budget revenue, which makes it possible to build top-100 county rankings for both GDP and budget revenue, with Kunshan leading.
The article uses Java as the programming language and demonstrates how to crawl web pages with Jsoup, providing detailed example code.
1. Getting Started with Jsoup
1.1 Page Structure Analysis
When using Jsoup, first open the target page in a browser, press F12, and locate the elements that contain the table data.
Open the div that contains the table and identify the rows.
The same approach applies to the general public budget revenue table.
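As a rough illustration of what locating the rows amounts to in code, the sketch below parses a small inline HTML fragment with Jsoup and collects the county names from the table rows. The fragment, the class name SelectorSketch, and every value in it are placeholders made up for this example, not real data; only the selector syntax mirrors the kind of query used later in this article. It assumes the jsoup dependency is on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: the HTML and all values are placeholders, not real data.
class SelectorSketch {

    // A made-up page fragment with three unrelated divs followed by the table div.
    static final String SAMPLE =
            "<div id=\"t_container\">"
            + "<div>intro</div><div>banner</div><div>note</div>"
            + "<div><table>"
            + "<tr><th>No.</th><th>County</th><th>Province</th><th>GDP</th></tr>"
            + "<tr><td>1</td><td>County A</td><td>Province X</td><td>100.0</td></tr>"
            + "<tr><td>2</td><td>County B</td><td>Province Y</td><td>90.5</td></tr>"
            + "</table></div>"
            + "</div>";

    // Returns the text of the second cell (the county name) of every data row.
    static List<String> countyNames(String html) {
        Document doc = Jsoup.parse(html);
        // :eq(3) matches a div whose zero-based position among its siblings is 3.
        Elements rows = doc.select("#t_container > div:eq(3) table tr");
        List<String> names = new ArrayList<>();
        for (Element row : rows) {
            Elements cells = row.select("td");
            if (cells.size() < 2) {
                continue; // the header row uses <th>, so it has no <td> cells
            }
            names.add(cells.get(1).text());
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(countyNames(SAMPLE));
    }
}
```

Because the index in :eq(n) depends on how many sibling elements precede the table container, the selector has to be re-checked in the browser's developer tools for each table, including the budget revenue one.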
2. Java Implementation for Jsoup Scraping
2.1 Adding Jsoup Dependency
Use Maven to manage dependencies. The essential pom.xml snippet is:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.yelang</groupId>
    <artifactId>jsoupdemo</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <dependencies>
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.11.3</version>
        </dependency>
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>easyexcel</artifactId>
            <version>3.0.5</version>
        </dependency>
    </dependencies>
</project>
2.2 Data Entity Classes
Define a base class for common fields (index, county name, province) and extend it for GDP and budget entities.
// CountyBase.java: fields shared by both rankings
package com.yelang.entity;
import java.io.Serializable;
import com.alibaba.excel.annotation.ExcelProperty;
public class CountyBase implements Serializable {
    private static final long serialVersionUID = -1760099890427975758L;
    // index is the zero-based column position in EasyExcel, so 1 is the second column.
    // Header "序号" means "No." (rank)
    @ExcelProperty(value = {"序号"}, index = 1)
    private Integer index;
    // Header "县级地区" means "County-level region"
    @ExcelProperty(value = {"县级地区"}, index = 2)
    private String name;
    // Header "所属省" means "Province"
    @ExcelProperty(value = {"所属省"}, index = 3)
    private String province;
    // getters, setters, constructors omitted for brevity
}
// Gdp.java: one row of the GDP ranking
package com.yelang.entity;
import java.io.Serializable;
import com.alibaba.excel.annotation.ExcelProperty;
public class Gdp extends CountyBase implements Serializable {
    private static final long serialVersionUID = 5265057372502768147L;
    // Header "GDP(亿元)" means "GDP (100 million yuan)"
    @ExcelProperty(value = {"GDP(亿元)"}, index = 4)
    private String gdp;
    // getters, setters, constructors omitted for brevity
}
// Gpbr.java: one row of the general public budget revenue ranking
package com.yelang.entity;
import java.io.Serializable;
import com.alibaba.excel.annotation.ExcelProperty;
public class Gpbr extends CountyBase implements Serializable {
    private static final long serialVersionUID = 8612514686737317620L;
    // Header "一般公共预算收入(亿元)" means "General public budget revenue (100 million yuan)"
    @ExcelProperty(value = {"一般公共预算收入(亿元)"}, index = 4)
    private String gpbr;
    // getters, setters, constructors omitted for brevity
}
2.3 Actual Crawling Code
The method grabGdp fetches the GDP table, parses rows, creates Gdp objects and writes them to an Excel file with EasyExcel.
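// Imports needed by this method:
// java.util.ArrayList, java.util.List, com.alibaba.excel.EasyExcel, com.yelang.entity.Gdp,
// org.jsoup.Jsoup, org.jsoup.nodes.Document, org.jsoup.nodes.Element, org.jsoup.select.Elements
static void grabGdp() {
    String target = "https://www.maigoo.com/news/665462.html";
    try {
        Document doc = Jsoup.connect(target)
                .ignoreContentType(true)
                // FetchCsdnCookie.ua is presumably a user-agent string array defined
                // elsewhere in the author's project; it is not shown in this article.
                .userAgent(FetchCsdnCookie.ua[1])
                .timeout(300000) // milliseconds
                .header("referer", "https://www.maigoo.com")
                .get();
        // All rows of the GDP table
        Elements elements = doc.select("#t_container > div:eq(3) table tr");
        List<Gdp> list = new ArrayList<>();
        // Start at 1 to skip the header row
        for (int i = 1; i < elements.size(); i++) {
            Element tr = elements.get(i);
            Elements tds = tr.select("td");
            Integer index = Integer.valueOf(tds.get(0).text());
            String name = tds.get(1).text();
            String province = tds.get(2).text();
            String gdp = tds.get(3).text();
            Gdp county = new Gdp(index, name, province, gdp);
            list.add(county);
        }
        // File name: "2023 national top-100 counties GDP ranking"; sheet name: "GDP top-100 list"
        String fileName = "E:/gdptest/2023全国百强县GDP排行榜.xlsx";
        EasyExcel.write(fileName, Gdp.class).sheet("GDP百强榜").doWrite(list);
        System.out.println("Done.");
    } catch (Exception e) {
        System.out.println(e.getMessage());
        System.out.println("An exception occurred while scraping or writing.");
    }
}
Element selection uses a CSS-like query similar to jQuery:
Elements elements = doc.select("#t_container > div:eq(3) table tr");
3. Process Analysis and Results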
3.1 Crawling Process
Debugging the source page reveals the DOM hierarchy; Jsoup’s select method extracts the required cells.
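The loop in grabGdp assumes that every row after the first has four cells and a numeric first cell; because the whole loop sits inside one try block, a single irregular row would abort the run. One possible way to make the cell extraction more tolerant is sketched below. RowParserSketch and its Row type are hypothetical helpers that are not part of the original project, and the sample values are placeholders. The sketch works on plain cell texts, so it needs only the JDK.

```java
import java.math.BigDecimal;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical helper, not part of the original project.
class RowParserSketch {

    // Simple value holder for one parsed table row.
    static final class Row {
        final int index;
        final String name;
        final String province;
        final BigDecimal amount;

        Row(int index, String name, String province, BigDecimal amount) {
            this.index = index;
            this.name = name;
            this.province = province;
            this.amount = amount;
        }
    }

    // Converts the cell texts of one table row into a Row.
    // Returns an empty Optional for header rows and rows with missing or non-numeric cells.
    static Optional<Row> parse(List<String> cells) {
        if (cells == null || cells.size() < 4) {
            return Optional.empty();
        }
        try {
            int index = Integer.parseInt(cells.get(0).trim());
            // Strip thousands separators before converting to a number.
            BigDecimal amount = new BigDecimal(cells.get(3).trim().replace(",", ""));
            return Optional.of(new Row(index, cells.get(1).trim(), cells.get(2).trim(), amount));
        } catch (NumberFormatException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // Placeholder values, not real data.
        Optional<Row> row = parse(Arrays.asList("1", "County A", "Province X", "1,234.5"));
        System.out.println(row.isPresent());
    }
}
```

Keeping the amount as a BigDecimal rather than a String also allows the rows to be sorted numerically before export, should that be needed.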
3.2 Output Files
After execution, two Excel files are generated on the target disk, containing the scraped GDP and budget data in the same order as the web tables.
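The crawling code writes to a fixed path under E:/gdptest, so the write step is likely to fail if that folder does not exist yet. A minimal sketch of preparing the output location first is shown below; OutputPathSketch, its prepare method, and the file name gdp.xlsx are hypothetical names introduced for this example only.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of the original project.
class OutputPathSketch {

    // Creates the directory if it is missing and returns the absolute path of the target file.
    static String prepare(String dir, String fileName) {
        Path directory = Paths.get(dir);
        try {
            // Creates all missing parent directories; does nothing if the directory exists.
            Files.createDirectories(directory);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return directory.resolve(fileName).toAbsolutePath().toString();
    }

    public static void main(String[] args) {
        // Uses the system temp directory so the sketch runs on any machine.
        Path base = Paths.get(System.getProperty("java.io.tmpdir"), "gdp-sketch-" + System.nanoTime());
        System.out.println(prepare(base.resolve("out").toString(), "gdp.xlsx"));
    }
}
```

The returned string can then be passed as the fileName argument of EasyExcel.write, in place of the hard-coded path.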
Conclusion
The article demonstrates a complete Java solution for web scraping county‑level economic data using Jsoup, converting HTML tables into structured Excel files via EasyExcel, and provides full source code for reference.
