Master jsoup: Real‑World Spring Boot 3 Examples for HTML Parsing
This tutorial walks through practical jsoup usage within Spring Boot 3, covering dependency setup, parsing HTML from strings, fragments, URLs or files, extracting titles, links, images, applying CSS selectors, modifying elements, and sanitizing content to prevent XSS attacks.
This article is part of a Spring Boot 3 practical case collection of 118 examples. It introduces jsoup, a Java library that simplifies HTML and XML processing.
1. Introduction
jsoup provides an easy‑to‑use API for fetching URLs, parsing documents, and extracting and modifying content with DOM methods, CSS selectors and XPath selectors. It implements the WHATWG HTML5 specification and parses HTML into the same DOM that modern browsers build.
WHATWG HTML5 specification: https://html.spec.whatwg.org/multipage/syntax.html
Fetch and parse HTML from a URL, file or string.
Find and extract data with DOM traversal or CSS selectors.
Manipulate HTML elements, attributes and text.
Clean user‑submitted content against a safelist to prevent XSS attacks.
Output tidy HTML.
jsoup can handle malformed “tag soup” HTML and still produce a reasonable parse tree.
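The three lookup styles mentioned above (DOM methods, CSS selectors, XPath) can be compared on one small document. The following is a minimal sketch: the class name `SelectorStyles` and the sample markup are invented for illustration, and `selectXpath` assumes jsoup 1.14.3 or later.

```java
import java.util.List;
import org.jsoup.Jsoup;

public class SelectorStyles {
    // Illustrative markup, not taken from the article.
    static final String HTML =
        "<ul><li class='item'>One</li><li class='item'>Two</li><li>Other</li></ul>";

    // DOM-style lookup by class name.
    static List<String> viaDom() {
        return Jsoup.parse(HTML).getElementsByClass("item").eachText();
    }

    // CSS selector lookup.
    static List<String> viaCss() {
        return Jsoup.parse(HTML).select("li.item").eachText();
    }

    // XPath lookup; jsoup evaluates the expression without namespaces,
    // so plain local element names can be used.
    static List<String> viaXpath() {
        return Jsoup.parse(HTML).selectXpath("//li[@class='item']").eachText();
    }

    public static void main(String[] args) {
        // Each call prints [One, Two]
        System.out.println(viaDom());
        System.out.println(viaCss());
        System.out.println(viaXpath());
    }
}
```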
2. Practical Cases
2.1 Dependency Management
<dependency>
  <groupId>org.jsoup</groupId>
  <artifactId>jsoup</artifactId>
  <version>1.18.3</version>
</dependency>
2.2 Parse HTML from a String
String html = """
<html>
<head><title>Parse String HTML Document</title></head>
<body><p>Parsed HTML into a doc.</p></body>
</html>
""";
Document doc = Jsoup.parse(html);
Elements titleElement = doc.getElementsByTag("title");
System.err.printf("title: %s%n", titleElement);
Output:
title: <title>Parse String HTML Document</title>
2.3 Parse HTML Fragment
String html = "<div><p>Lorem ipsum.</p>";
Document doc = Jsoup.parseBodyFragment(html);
Element body = doc.body();
System.err.printf("body:%n%s%n", body);
Output (whitespace condensed):
<body><div><p>Lorem ipsum.</p></div></body>
2.4 Load HTML Document
From URL:
Document document = Jsoup.connect("http://www.baidu.com").get();
System.err.println(document);
From File:
ClassPathResource resource = new ClassPathResource("templates/invoice.html");
Document document = Jsoup.parse(resource.getFile(), "utf-8");
System.err.println(document);
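One caveat with the file example: `ClassPathResource.getFile()` only works when the resource exists as a real file on disk, so it typically fails once the application is packaged as an executable jar. A possible alternative is to parse from an `InputStream` instead. The sketch below assumes a hypothetical helper class `ResourceHtmlLoader`; in a Spring application the stream would come from `resource.getInputStream()`.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ResourceHtmlLoader {
    // Hypothetical helper: parses UTF-8 HTML from any stream and closes it afterwards.
    // baseUri is used to resolve relative links; pass "" if none is known.
    static Document load(InputStream in, String baseUri) {
        try (in) {
            return Jsoup.parse(in, "UTF-8", baseUri);
        } catch (IOException e) {
            // Rethrown unchecked so callers are not forced to handle IOException.
            throw new UncheckedIOException(e);
        }
    }
}
```

With Spring, a call could look like `ResourceHtmlLoader.load(new ClassPathResource("templates/invoice.html").getInputStream(), "")`.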
2.5 Retrieve Element Content
Get page title:
Document document = Jsoup.connect("http://www.baidu.com").get();
System.err.println(document.title());
Output: 百度一下,你就知道
Get favicon:
Document document = Jsoup.connect("http://www.baidu.com").get();
Element element = document.head().select("link[href~=.*\\.(ico|png)]").first();
String favImage = null;
if (element == null) {
element = document.head().select("meta[itemprop=image]").first();
if (element != null) {
favImage = element.attr("content");
}
} else {
favImage = element.attr("href");
}
System.err.println(favImage);
Output: https://www.baidu.com/favicon.ico
Get all links:
Document document = Jsoup.connect("http://www.baidu.com").get();
Elements links = document.select("a[href]");
for (Element link : links) {
System.out.printf("text: %s, link : %s%n", link.text(), link.attr("href"));
}
Get all images:
Document document = Jsoup.connect("http://www.baidu.com").get();
Elements images = document.select("img[src~=(?i)\\.(png|jpe?g|gif)]");
for (Element image : images) {
System.out.printf("src : %s, width: %s, height: %s%n", image.attr("src"), image.attr("width"), image.attr("height"));
}
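Links and image sources extracted this way are often relative. When a document has a base URI (set automatically by `Jsoup.connect(...)`, or passed to `Jsoup.parse`), `absUrl` resolves them to absolute URLs. A minimal sketch, with the class name `LinkExtractor` and the `example.com` URLs chosen purely for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LinkExtractor {
    // Returns every link in the HTML as an absolute URL, resolved against baseUri.
    static List<String> absoluteLinks(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        List<String> urls = new ArrayList<>();
        for (Element link : doc.select("a[href]")) {
            String url = link.absUrl("href"); // empty string if it cannot be resolved
            if (!url.isEmpty()) {
                urls.add(url);
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<a href='/about'>About</a><a href='guide.html'>Guide</a>";
        // Prints [https://example.com/about, https://example.com/docs/guide.html]
        System.out.println(absoluteLinks(html, "https://example.com/docs/"));
    }
}
```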
2.6 Use CSS Selectors
Document doc = Jsoup.connect("http://www.baidu.com").get();
Elements links = doc.select("a[href]");
Elements pngs = doc.select("img[src$=.png]");
Element masthead = doc.select("div.masthead").first();
Elements resultDivs = doc.select("h3.r > div");
Elements resultAs = resultDivs.select("a");
Most CSS selectors are supported.
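Because the selectors above depend on the live page's markup, it can help to try the same selector forms against a small in-memory document. The sketch below uses invented markup and a hypothetical class `SelectorDemo`; it shows an attribute-suffix selector, a class selector, and a child combinator.

```java
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SelectorDemo {
    // Illustrative markup, not taken from any real page.
    static final String HTML = """
        <div class="masthead"><img src="logo.png"><img src="photo.jpg"></div>
        <h3 class="r"><a href="/first">First</a></h3>
        <h3><a href="/second">Second</a></h3>
        """;

    // [attr$=value] matches attributes whose value ends with the given suffix.
    static List<String> pngSources() {
        Document doc = Jsoup.parse(HTML);
        return doc.select("img[src$=.png]").eachAttr("src");
    }

    // "h3.r > a" matches <a> elements that are direct children of <h3 class="r">.
    static List<String> resultLinkTexts() {
        Document doc = Jsoup.parse(HTML);
        return doc.select("h3.r > a").eachText();
    }

    // selectFirst returns the first match, or null if nothing matches.
    static int mastheadImageCount() {
        Document doc = Jsoup.parse(HTML);
        return doc.selectFirst("div.masthead").select("img").size();
    }
}
```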
2.7 Modify Elements
String html = """
<html>
<head><title>Parse String HTML Document</title></head>
<body><p>Parsed HTML into a doc.</p></body>
</html>
""";
Document doc = Jsoup.parse(html);
Element body = doc.body();
body.prepend("<p>First</p>");
body.append("<p>Last</p>");
System.err.println(doc);
Modify specific element content:
String html = """
<html>
<head><title>Parse String HTML Document</title></head>
<body><p class="xxxooo">Parsed HTML into a doc.</p></body>
</html>
""";
Document doc = Jsoup.parse(html);
Element paragraph = doc.select("p.xxxooo").first();
paragraph.text("xxxooo pack...");
System.err.println(doc);
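Beyond text and child nodes, attributes and classes can be changed and whole elements removed. The following is a sketch under assumed requirements (mark every link as `rel="nofollow"`, tag it with an `external` class, and drop `<script>` elements); the class name `ElementEditor` is hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ElementEditor {
    static Document edit(String html) {
        Document doc = Jsoup.parse(html);
        for (Element link : doc.select("a[href]")) {
            // attr(...) and addClass(...) return the element, so calls can be chained.
            link.attr("rel", "nofollow").addClass("external");
        }
        // Remove every <script> element from the document tree.
        doc.select("script").remove();
        return doc;
    }

    public static void main(String[] args) {
        System.out.println(edit(
            "<p><a href='https://example.com'>Go</a></p><script>alert(1)</script>"));
    }
}
```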
2.8 Prevent XSS Attacks
String unsafe = "<p><a href='http://www.pack.com/' onclick='getCookies()'>Surprise</a></p>";
String safe = Jsoup.clean(unsafe, Safelist.basic());
System.err.println(safe);
Output:
<p><a href="http://www.pack.com/" rel="nofollow">Surprise</a></p>
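Cleaning is not the only option: input can also be validated and rejected, or reduced to plain text. The sketch below assumes a hypothetical `CommentSanitizer` class. `Jsoup.isValid` reports whether the HTML already conforms to a safelist, and `Safelist.none()` strips all tags and keeps only text.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class CommentSanitizer {
    // True only if the input contains nothing that Safelist.basic() would remove.
    static boolean isSafe(String html) {
        return Jsoup.isValid(html, Safelist.basic());
    }

    // Drops all markup (and script contents), keeping only text nodes.
    static String toPlainText(String html) {
        return Jsoup.clean(html, Safelist.none());
    }

    public static void main(String[] args) {
        System.out.println(isSafe("<p>Hello</p>"));                      // true
        System.out.println(isSafe("<p onclick='steal()'>Hello</p>"));    // false
        System.out.println(toPlainText("<b>Hi</b><script>alert(1)</script>")); // Hi
    }
}
```

Whether to reject invalid input or silently clean it is a design choice that depends on the application; validation gives users feedback, while cleaning never blocks a submission.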
