
MarkupLM-based Detection of Malicious Content Scraping

This article presents a MarkupLM-based approach that enriches BERT with XPath embeddings to jointly model webpage text and structure. The approach enables site-level detection of malicious content-scraping pages that bypass traditional rule-based filters, and demonstrates the critical role of structural cues in improving spam classification accuracy.

Baidu Geek Talk

Search engines such as Baidu receive billions of queries daily, making them a target for cheating groups that try to harvest traffic through large‑scale content scraping. The anti‑spam team maintains the search ecosystem by filtering low‑quality spam pages using advanced techniques. This article introduces a webpage modeling approach based on MarkupLM, which incorporates XPath embeddings to automatically extract structural features of spam pages and combines them with textual information for site‑level detection.

1. Business Background

Content scraping refers to the practice where site administrators copy content from other websites, either manually or programmatically, and republish it on their own sites. While legitimate aggregation can add value for users, malicious scrapers ignore user experience and rely on spam techniques to boost rankings, diverting traffic and revenue away from genuine content creators.

2. Traditional Solutions

Content duplication detection using text fingerprints (e.g., SimHash, MD5) and semantic similarity.

Webpage structure analysis via DOM-tree comparison and HTML tag density/nesting analysis (e.g., <div>, <p>).

Machine‑learning models that combine HTML structure features, content similarity, and user‑behavior signals for classification.

Auxiliary methods such as user‑report mechanisms.
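To make the first item concrete, here is a minimal SimHash sketch of the kind used for text-fingerprint duplication detection. The tokenization (whitespace split), uniform token weights, and choice of MD5 as the per-token hash are simplifying assumptions for illustration, not the production configuration.

```python
import hashlib

def simhash(text: str, bits: int = 64) -> int:
    """Compute a SimHash fingerprint: each token votes on every bit."""
    weights = [0] * bits
    for token in text.split():
        h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            # Bit set -> vote +1, bit clear -> vote -1.
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i in range(bits):
        if weights[i] > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits; small distance means near-duplicate text."""
    return bin(a ^ b).count("1")
```

Near-duplicate documents share most token hashes, so their fingerprints differ in only a few bits; a small Hamming-distance threshold then flags likely scraped copies.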

3. MarkupLM Model

MarkupLM extends BERT by adding an XPath embedding module on top of the standard embedding layer. Each token is associated with an XPath expression that is split into hierarchical units; each unit receives a tag‑name embedding and an index embedding, which are summed to form the unit representation. All unit vectors are concatenated, projected through a feed‑forward network, and combined with the original token embedding.
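The construction above can be sketched as a small PyTorch module. The vocabulary sizes, maximum XPath depth, and layer dimensions below are illustrative assumptions, not MarkupLM's published hyperparameters; the point is the data flow: sum tag and index embeddings per unit, concatenate the units, and project through a feed-forward network.

```python
import torch
import torch.nn as nn

class XPathEmbedding(nn.Module):
    """Sketch of an XPath embedding layer (sizes are hypothetical)."""

    def __init__(self, n_tags=216, max_index=1001, max_depth=50,
                 unit_dim=32, hidden_dim=768):
        super().__init__()
        self.tag_emb = nn.Embedding(n_tags, unit_dim)    # one vector per tag name
        self.idx_emb = nn.Embedding(max_index, unit_dim) # one vector per subscript
        self.proj = nn.Sequential(                       # FFN over concatenated units
            nn.Linear(max_depth * unit_dim, 4 * hidden_dim),
            nn.ReLU(),
            nn.Linear(4 * hidden_dim, hidden_dim),
        )

    def forward(self, tag_ids, idx_ids):
        # tag_ids, idx_ids: (batch, seq_len, max_depth), padded to max_depth.
        units = self.tag_emb(tag_ids) + self.idx_emb(idx_ids)  # sum per unit
        flat = units.flatten(start_dim=-2)                     # concatenate units
        return self.proj(flat)  # added to the token embedding downstream
```

The output has the same dimensionality as the token embedding, so the two can simply be summed before entering the transformer layers.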

Figure 3 (not shown) illustrates the architecture. The model is pre‑trained on three tasks: Masked Markup Language Modeling (MMLM), Node Relation Prediction (NRP), and Title‑Page Matching (TPM), enabling it to capture both textual and structural cues.

4. Structure Modeling

XPath provides a path from the root node to a text token (e.g., /html/body/div/li[1]/div/span[2]). By embedding each XPath unit and preserving hierarchical order, the model learns the positional information of text within the markup document.
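Splitting such a path into (tag, subscript) units is straightforward; a minimal parser might look like the following, where a step without an explicit subscript defaults to 0 (a convention assumed here for illustration).

```python
import re

def parse_xpath(xpath: str) -> list[tuple[str, int]]:
    """Split an XPath into (tag, subscript) units; missing subscripts become 0."""
    units = []
    for step in xpath.strip("/").split("/"):
        m = re.match(r"([a-zA-Z0-9]+)(?:\[(\d+)\])?$", step)
        tag, idx = m.group(1), int(m.group(2) or 0)
        units.append((tag, idx))
    return units

parse_xpath("/html/body/div/li[1]/div/span[2]")
# [('html', 0), ('body', 0), ('div', 0), ('li', 1), ('div', 0), ('span', 2)]
```

Each (tag, subscript) pair then indexes the tag-name and index embedding tables described in the previous section.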

5. Effect Verification

Two experiments were conducted:

Experiment 1 shuffled text and XPath pairs from black‑list and white‑list samples, creating mixed pairs and measuring the proportion classified as spam.

Experiment 2 masked XPath embeddings (replaced with <pad> and index 0) and evaluated recall, precision, and accuracy (recall = 0.121, precision = 0.829, accuracy = 0.548).

Results show that pairing a black-list XPath with any text yields a high spam-classification rate, while masking XPath dramatically reduces recall, confirming the importance of structural information.
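For reference, the metrics reported in Experiment 2 relate to the confusion-matrix counts as follows; the counts in the usage below are hypothetical toy numbers, not the experiment's actual data.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int):
    """Recall, precision, and accuracy from confusion-matrix counts."""
    recall = tp / (tp + fn)               # share of spam pages caught
    precision = tp / (tp + fp)            # share of flagged pages that are spam
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return recall, precision, accuracy

# Toy example: 8 true positives, 2 false positives, 2 false negatives, 8 true negatives.
r, p, a = classification_metrics(8, 2, 2, 8)
# r == 0.8, p == 0.8, a == 0.8
```

Under this reading, the masked model's recall of 0.121 means it caught only about one in eight spam pages once structural input was removed, even though precision stayed relatively high.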

6. Application to Malicious Scraping

The model can identify spam pages that evade rule-based detectors, especially when malicious content is hidden behind unconventional tags (e.g., <h5>, <li>) or when titles do not match body content. By leveraging MarkupLM's XPath embeddings, the system captures both tag relationships and semantic inconsistencies.

7. Conclusion and Outlook

This article discussed the challenges of detecting malicious scraping sites and demonstrated how MarkupLM, with its combined textual and layout modeling, improves detection performance. Future work may extend this approach to other web-based fraud detection tasks or use the learned webpage features as a foundation for multi-class spam classification.

Tags: machine learning, Web Security, content scraping detection, Document Understanding, MarkupLM, XPath embedding