How Multi-Agent VLMs and PNU Loss Achieve High‑Accuracy Harmful Content Detection with Only 50 Labels
This article presents a low‑resource offensive content detection framework that leverages multi‑agent visual‑language models (MA‑VLMs) for self‑training and a novel Positive‑Negative‑Unlabeled (PNU) loss, enabling accurate classification with as few as 50 annotated samples across multimodal datasets.
Problem Statement
High‑quality annotated data for harmful‑content detection is extremely scarce because (1) harmful posts are filtered by platform safeguards, (2) expert annotators are required to interpret sarcasm, metaphor, and cultural nuance, and (3) subjective judgments lead to low inter‑annotator agreement. This creates a bottleneck for low‑resource languages and multimodal memes.
Method Overview
The authors propose a Multi‑Agent Visual‑Language Model guided Self‑Training framework (MA‑VLMs) combined with a custom Positive‑Negative‑Unlabeled (PNU) loss. The approach trains a lightweight classifier (e.g., CLIP‑Large) with only 50–100 labeled examples and iteratively expands the training set using pseudo‑labels verified by two complementary VLM agents.
MA‑VLMs Agents
Reviewer agent: safety‑oriented, focuses on content compliance.
User agent: expression‑oriented, emphasizes legitimate user expression.
Both agents generate an initial judgment with a rationale, then cross‑review each other’s reasoning. A sample is accepted as a consistent unlabeled instance only when the classifier’s prediction matches the judgments of both agents; otherwise it is placed in a divergent unlabeled set to preserve uncertainty.
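As a minimal illustration (not the paper's code), this acceptance rule can be sketched as a routing function; the function name and the binary label convention are our own:

```python
# Hypothetical sketch of the acceptance rule: a sample is "consistent" only
# when the classifier's prediction matches BOTH agents' judgments
# (1 = harmful, 0 = benign); any disagreement preserves uncertainty by
# routing the sample to the divergent set instead.
def route_sample(classifier_label: int, reviewer_label: int, user_label: int) -> str:
    if classifier_label == reviewer_label == user_label:
        return "consistent"
    return "divergent"
```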
Self‑Training Pipeline
Initial small‑sample training: Train a CLIP‑Large backbone with an added MLP head on n labeled samples (n = 50–100). Select the best epoch on a validation split.
Predict & rank unlabeled data: Run the classifier on a large pool of unlabeled items, obtain class probabilities, and select the top‑k (k = 500) high‑confidence samples.
MA‑VLMs verification: Feed the top‑k samples to the Reviewer and User agents. If the classifier's label agrees with both agents, move the sample to the consistent set and assign a pseudo‑label; otherwise place it in the divergent set.
PNU loss re‑training: Merge the consistent set (with soft pseudo‑labels) and the divergent set (treated as unlabeled) with the original labeled data, then fine‑tune the classifier using the following loss:

L = L_{PN}(y_{lab}, \hat{y}) + \lambda_{soft}\, L_{PN}^{soft}(y_{consist}, \hat{y}) + \lambda_{PU}\, L_{PU/NU}(y_{div}, \hat{y}; P)

where L_{PN} is standard cross‑entropy on labeled data, L_{PN}^{soft} applies soft labels to the consistent set to avoid over‑fitting, and L_{PU/NU} handles the divergent set as positive‑unlabeled (PU) or negative‑unlabeled (NU) depending on a tunable parameter P ∈ [-0.1, 0.2]. Setting P = 0 reduces the loss to ordinary PN.
Validation & rollback: Evaluate on a held‑out validation set after each re‑training step. If performance improves, continue; otherwise revert to the previous model and discard the latest pseudo‑labels.
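The five steps above can be sketched as a single loop. Everything below (`predict_proba`, `agents_agree`, `train`, `evaluate`) is a hypothetical hook standing in for the paper's actual components, not their implementation:

```python
def self_train(model, labeled, unlabeled, val, *, k, rounds,
               predict_proba, agents_agree, train, evaluate):
    """Sketch of the predict -> verify -> retrain -> validate/rollback loop.

    Assumed hook signatures:
      predict_proba(model, x) -> [p_benign, p_harmful]
      agents_agree(model, x)  -> True iff both agents match the classifier
      train(model, labeled, consistent, divergent) -> fine-tuned model (PNU loss)
      evaluate(model, val)    -> validation score (e.g., macro-F1)
    """
    best_model, best_score = model, evaluate(model, val)
    consistent, divergent = [], []
    for _ in range(rounds):
        if not unlabeled:
            break
        # Step 2: rank the unlabeled pool by classifier confidence, take top-k.
        unlabeled.sort(key=lambda x: max(predict_proba(best_model, x)), reverse=True)
        batch, unlabeled = unlabeled[:k], unlabeled[k:]
        # Step 3: dual-agent verification splits the batch into two sets.
        new_c = [x for x in batch if agents_agree(best_model, x)]
        new_d = [x for x in batch if not agents_agree(best_model, x)]
        # Step 4: fine-tune with the PNU loss over all three data sources.
        candidate = train(best_model, labeled, consistent + new_c, divergent + new_d)
        # Step 5: keep the candidate only if validation improves; otherwise
        # roll back and discard the latest pseudo-labels.
        score = evaluate(candidate, val)
        if score > best_score:
            best_model, best_score = candidate, score
            consistent += new_c
            divergent += new_d
    return best_model
```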
Dual‑Agent Prompting
The Reviewer first outputs a judgment with reasoning; the User then reviews that reasoning and produces a final decision. Only when both agents’ decisions align with the classifier is a sample labeled as consistent. This dual‑perspective prompting captures hidden hateful intent in multimodal memes that single‑agent prompts miss.
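A minimal sketch of this sequential prompting follows. The prompt wording is hypothetical (the paper's exact templates are not reproduced here), and `vlm` is an assumed callable returning a (label, rationale) pair:

```python
# Hypothetical prompt templates for the two perspectives.
REVIEWER_PROMPT = (
    "You are a content-safety reviewer focused on compliance. Judge whether "
    "the meme below is harmful. Answer HARMFUL or SAFE and give your reasoning.\n"
    "Caption: {caption}"
)
USER_PROMPT = (
    "You represent ordinary users and value legitimate expression. A reviewer "
    "judged this meme {reviewer_label} with reasoning: {rationale}\n"
    "Cross-review that reasoning, then give your own final decision "
    "(HARMFUL or SAFE).\nCaption: {caption}"
)

def dual_agent_judge(vlm, image, caption):
    """Reviewer judges first; the User agent then reviews that rationale."""
    reviewer_label, rationale = vlm(image, REVIEWER_PROMPT.format(caption=caption))
    user_label, _ = vlm(image, USER_PROMPT.format(
        caption=caption, reviewer_label=reviewer_label, rationale=rationale))
    return reviewer_label, user_label
```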
Customized PNU Loss Details
Positive‑Negative (PN) loss: Standard cross‑entropy on the manually labeled data.
Soft PN loss: Applies soft pseudo‑labels (e.g., 0.9/0.1) to the consistent set to mitigate over‑fitting.
PU/NU loss: Treats divergent samples as unlabeled; the sign of P decides whether they are modeled as PU (positive‑unlabeled) or NU (negative‑unlabeled), and its magnitude controls their weight. When P = 0, the loss reduces to pure PN; non‑zero values exploit the information in divergent samples.
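The exact risk estimator is not spelled out above, so the following is a simplified binary sketch under our own assumptions: sigmoid probabilities, binary cross‑entropy terms, soft pseudo‑labels of 0.9/0.1, and the convention that the sign of P selects PU vs. NU while its magnitude sets the weight:

```python
import math

def bce(p, y):
    """Binary cross-entropy for one predicted probability p against target y."""
    eps = 1e-7
    p = min(max(p, eps), 1 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def pnu_loss(labeled, consistent, divergent, P, lam_soft=0.5, lam_pu=0.5, soft=0.9):
    """Simplified PNU sketch (not the paper's exact estimator).

    labeled:    [(p, y)] with hard labels y in {0, 1}
    consistent: [(p, y)] pseudo-labels, softened to soft / (1 - soft)
    divergent:  [p] probabilities for unlabeled samples
    P > 0 -> treat divergent samples as positive-unlabeled (PU),
    P < 0 -> negative-unlabeled (NU), P = 0 -> ignore them (pure PN).
    """
    l_pn = sum(bce(p, y) for p, y in labeled) / max(len(labeled), 1)
    l_soft = sum(bce(p, soft if y == 1 else 1 - soft)
                 for p, y in consistent) / max(len(consistent), 1)
    if P > 0:    # PU: pull divergent samples toward the positive class
        l_u = sum(bce(p, 1) for p in divergent) / max(len(divergent), 1)
    elif P < 0:  # NU: pull divergent samples toward the negative class
        l_u = sum(bce(p, 0) for p in divergent) / max(len(divergent), 1)
    else:
        l_u = 0.0
    return l_pn + lam_soft * l_soft + lam_pu * abs(P) * l_u
```

With P = 0 the unlabeled term vanishes and the loss degenerates to PN plus the soft‑label term, matching the reduction described above.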
Experimental Setup
Datasets
Facebook Hateful Memes (FHM) – 10 k multimodal memes.
Multimedia Automatic Misogyny Identification (MAMI) – 11 k Instagram memes.
Hate Speech and Offensive Language (HSOL) – 24,783 tweets (balanced 10 k subset used).
Sentiment140 (Sent140) – 1.6 M tweets (balanced 10 k subset used).
Training Details
Train/val/test split 8:1:1; only n ∈ {50, 100, 250, full} labeled samples retained.
Baseline models: Qwen2.5‑VL‑7B (7‑billion‑parameter multimodal LLM), RGCL, CLIP‑Large.
Self‑training uses CLIP as the classifier and a frozen Qwen2.5‑VL‑72B as the two MA‑VLMs agents.
All models trained for 10 epochs; best checkpoint chosen by macro‑F1 on validation.
Macro‑F1 is the primary metric; accuracy is secondary.
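For reference, macro‑F1 is the unweighted mean of per‑class F1 scores, so minority classes count as much as majority ones. A dependency‑free sketch (the function name is ours):

```python
def macro_f1(y_true, y_pred, classes=(0, 1)):
    """Unweighted mean of per-class F1 scores (macro-F1)."""
    f1s = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```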
Results
With n=100 labeled samples, the self‑trained CLIP+Qwen72B achieves 74.22 % accuracy and 72.68 % macro‑F1 on FHM, surpassing the supervised CLIP baseline (64.11 % Acc / 59.24 % M‑F1) and approaching the large Qwen7B model (70.78 % Acc / 70.41 % M‑F1) while using only 1/20 of its inference cost. On HSOL, the method reaches 86.69 % macro‑F1, outperforming all supervised baselines.
In the extreme low‑resource setting (n = 50), the self‑trained model still attains 71.27 % macro‑F1 on FHM, far above supervised CLIP (48.76 %) and Qwen7B (39.11 %).
Ablation studies show:
Increasing labeled data from 50 to 250 yields only modest gains for the self‑trained model, confirming low dependence on annotation volume.
Dual‑agent prompting outperforms zero‑shot, few‑shot, and chain‑of‑thought prompt formats, especially on gender‑discrimination tasks.
The PNU parameter P influences performance: values around 0.0–0.2 give the best trade‑off, while setting P = 0 reverts to pure PN loss and discards the divergent‑set signal.
Conclusion
The MA‑VLMs guided self‑training framework with PNU loss breaks the annotation bottleneck for low‑resource harmful‑content detection. It requires as few as 50 labeled examples, balances safety and expression through dual‑agent negotiation, and achieves 7 B‑parameter model performance with a lightweight CLIP‑Large classifier at only 1/20 of the inference cost. The paradigm—multi‑agent verification plus a custom loss—generalizes to other low‑resource classification tasks such as sentiment analysis.
Tencent Advertising Technology
Official hub of Tencent Advertising Technology, sharing the team's latest cutting-edge achievements and advertising technology applications.