How Baidu’s Ernie‑SimCSE Uses Contrastive Learning to Crush Spam Promotion
This article explains how Baidu's anti‑spam team tackled large‑scale promotional spam on Baidu Zhidao by combining the Ernie pretrained model with SimCSE contrastive learning, detailing the problem background, traditional methods, text‑representation stages, the SimCSE approach, training pipeline, optimizations, and experimental results.
Background
Baidu Zhidao, a large‑scale Q&A community, is heavily targeted by spammers who post identical promotional questions across many user accounts (group‑promotion spam). These posts cluster in categories such as training and medical‑cosmetic services, and their rapid, coordinated evolution degrades user experience and harms brand reputation.
Limitations of Rule‑Based Anti‑Spam
Traditional anti‑spam pipelines rely on two independent dimensions:
Question‑level detection: handcrafted patterns or semantic classifiers trained on individual questions.
User‑level detection: identification of single cheaters or coordinated groups.
Both approaches depend on manually engineered features, which generalize poorly and can be evaded by slight content variations.
Text Representation Evolution
Stage 1 – Statistical: TF‑IDF keyword vectors.
Stage 2 – Deep Models: Word‑level embeddings (GloVe, word2vec) combined with averaging, SIF, or CNN/LSTM encoders (e.g., DSSM, Skip‑Thought, Siam‑CNN, InferSent).
Stage 3 – Pre‑trained Large Models: BERT, ERNIE and variants; Sentence‑BERT encodes each sentence independently to improve efficiency.
Stage 4 – Contrastive Learning: Borrowed from computer vision, contrastive objectives pull semantically similar sentences together and push dissimilar ones apart.
Contrastive Learning for Sentence Embeddings
Contrastive learning builds positive pairs (similar samples, usually produced by data augmentation) and negative pairs (dissimilar samples). The model minimizes the distance between positives while maximizing the distance to negatives, typically using the InfoNCE loss.
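For reference, one common way to write the InfoNCE loss for a batch of N sentences is sketched below. The notation is an assumption of this sketch rather than something taken from the original article: h_i is the embedding of sentence i, h_i^+ is the embedding of its positive, sim is cosine similarity, and τ is a temperature hyperparameter.

$$
\ell_i = -\log \frac{\exp\left(\mathrm{sim}(h_i, h_i^{+})/\tau\right)}{\sum_{j=1}^{N} \exp\left(\mathrm{sim}(h_i, h_j^{+})/\tau\right)}
$$

The numerator rewards a high similarity between a sentence and its positive; the denominator penalizes high similarity to the positives of the other sentences in the batch, which act as negatives.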
SimCSE
SimCSE (Simple Contrastive Learning of Sentence Embeddings) simplifies the augmentation step by exploiting dropout noise. For a given sentence, two forward passes through the same encoder produce two stochastic embeddings; these form a positive pair, while all other embeddings in the batch serve as negatives. The unsupervised variant follows three steps:
Feed each sentence twice through the encoder (dropout creates different representations).
Treat the two outputs as a positive pair; treat all other batch elements as negatives.
Optimize the InfoNCE loss to update model parameters.
SimCSE comes in unsupervised and supervised variants; the unsupervised one is used in production.
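The three unsupervised steps can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the production implementation: `toy_encode` is a hypothetical stand-in that only applies dropout to fixed feature vectors (a real system would run the ERNIE encoder twice), and the temperature of 0.05 is an illustrative value. No parameters are updated here; the sketch only computes the loss.

```python
import numpy as np

def toy_encode(x: np.ndarray, rng: np.random.Generator, drop_p: float = 0.1) -> np.ndarray:
    """Hypothetical encoder stand-in: inverted dropout on fixed features."""
    mask = rng.random(x.shape) >= drop_p
    return x * mask / (1.0 - drop_p)

def simcse_unsup_loss(z1: np.ndarray, z2: np.ndarray, temperature: float = 0.05) -> float:
    """InfoNCE over two dropout views; z1[i] and z2[i] encode the same sentence."""
    # L2-normalise so that dot products are cosine similarities.
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = (z1 @ z2.T) / temperature  # shape (N, N)
    # Row-wise log-softmax, shifted by the row maximum for numerical stability.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The positive for row i sits on the diagonal; other columns are in-batch negatives.
    return float(-np.mean(np.diag(log_prob)))

rng = np.random.default_rng(0)
features = rng.normal(size=(8, 32))   # synthetic "sentences"
view1 = toy_encode(features, rng)     # first forward pass
view2 = toy_encode(features, rng)     # second forward pass, different dropout mask
aligned = simcse_unsup_loss(view1, view2)
shuffled = simcse_unsup_loss(view1, view2[::-1])  # deliberately mismatched pairs
```

With correctly paired views the loss is small; reversing the second view breaks the pairing and the loss rises sharply, which is the signal gradient descent would use to pull the two views of each sentence together.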
Ernie‑SimCSE Model Architecture
The final anti‑spam solution integrates a domain‑adapted ERNIE encoder with SimCSE contrastive learning, forming the Ernie‑SimCSE model. Training proceeds in three phases:
Pre‑training on click‑log Q‑T graph (Ernie‑Search): Masked language modeling on sampled query‑title‑query sequences extracted from large‑scale click logs.
Post‑training on Baidu Zhidao questions (Ernie‑Search‑ZD): Further fine‑tuning the encoder with the platform’s question corpus to capture domain‑specific semantics.
Contrastive fine‑tuning (Ernie‑Search‑CSE): Apply SimCSE’s dropout‑based contrastive loss on top of Ernie‑Search‑ZD. The loss is extended to also treat the dropout‑augmented samples as additional negatives, reducing bias from in‑batch negatives.
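The article does not give the exact form of the extended loss. One possible reading, written here purely as an assumption, adds the other sentences' embeddings from the same forward pass to the denominator alongside the usual in‑batch negatives. Here h_i and h_i^+ are the two dropout views of sentence i, sim is cosine similarity, τ is a temperature, and N is the batch size:

$$
\ell_i = -\log \frac{\exp\left(\mathrm{sim}(h_i, h_i^{+})/\tau\right)}{\sum_{j=1}^{N} \exp\left(\mathrm{sim}(h_i, h_j^{+})/\tau\right) + \sum_{j \neq i} \exp\left(\mathrm{sim}(h_i, h_j)/\tau\right)}
$$

Under this reading each anchor sees 2N − 2 negatives instead of N − 1, while the positive term is unchanged.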
Anti‑Spam Deployment Pipeline
The production workflow consists of three logical steps:
Encoder training: Train the Ernie‑SimCSE model to obtain a sentence encoder that maps each question to a dense semantic vector.
Semantic index construction: Encode the entire historical question corpus and store the vectors in an approximate nearest‑neighbor (ANN) index (e.g., Faiss, HNSW) for fast similarity search.
Detection: When a new question arrives, encode it, retrieve the top‑k most similar historical questions, and flag the new question if the similarity scores exceed a calibrated threshold (indicating possible group‑promotion spam).
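The index and detection steps can be sketched as follows. This is a rough illustration under several assumptions: exact brute‑force cosine search stands in for an ANN library, the vectors are synthetic rather than real encoder outputs, and the names `build_index` and `detect` and the values of `k`, `threshold`, and `min_hits` are hypothetical choices for this sketch, not settings reported in the article.

```python
import numpy as np

def build_index(vectors: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for an ANN index: stores L2-normalised vectors."""
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

def detect(query_vec, index, k=10, threshold=0.9, min_hits=3):
    """Return (flag, hits): hits are (row, cosine) pairs among the top-k above threshold."""
    q = query_vec / np.linalg.norm(query_vec)
    sims = index @ q                      # cosine similarity to every stored question
    k = min(k, sims.shape[0])
    top = np.argsort(-sims)[:k]           # exact top-k; an ANN index would approximate this
    hits = [(int(i), float(sims[i])) for i in top if sims[i] >= threshold]
    # Assumption: require several close neighbours before flagging a question.
    return len(hits) >= min_hits, hits

rng = np.random.default_rng(0)
campaign = rng.normal(size=64)            # synthetic "promotional" direction
history = np.vstack([
    campaign + 0.05 * rng.normal(size=(5, 64)),  # near-duplicate promotional posts
    rng.normal(size=(20, 64)),                   # unrelated questions
])
index = build_index(history)

spam_flag, spam_hits = detect(campaign + 0.05 * rng.normal(size=64), index)
clean_flag, clean_hits = detect(rng.normal(size=64), index)
```

Requiring several neighbours above the threshold, rather than one, is a design choice made for this sketch: group promotion is defined by many similar posts, so a single close match is weaker evidence than a cluster.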
Experimental Evaluation
Model performance was validated on a manually curated set of 17 known spam questions and 10 random non‑spam sentences. The Ernie‑SimCSE embeddings produced significantly higher cosine similarity scores within the spam set, confirming effective semantic clustering of promotional content. Heatmaps and example retrievals demonstrate that the model can surface near‑duplicate spam queries such as “北京肋软骨隆鼻刘彦军做的怎么样?” (roughly, “How good is Liu Yanjun's rib‑cartilage rhinoplasty in Beijing?”) among the top‑10 most similar historical questions.
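A comparison of this kind could be computed roughly as shown below. The vectors are synthetic placeholders, not the article's data: the 17 "spam‑like" rows are small perturbations of one shared vector and the 10 "random‑like" rows are independent draws, so the numbers only illustrate the computation behind a similarity heatmap.

```python
import numpy as np

def cosine_matrix(vectors: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarities; this matrix is what a heatmap would display."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return unit @ unit.T

def mean_off_diagonal(matrix: np.ndarray) -> float:
    """Average similarity between distinct items (self-similarity excluded)."""
    n = matrix.shape[0]
    return float((matrix.sum() - np.trace(matrix)) / (n * (n - 1)))

rng = np.random.default_rng(1)
spam_like = rng.normal(size=64) + 0.1 * rng.normal(size=(17, 64))  # synthetic cluster
random_like = rng.normal(size=(10, 64))                            # synthetic unrelated set

spam_mean = mean_off_diagonal(cosine_matrix(spam_like))
random_mean = mean_off_diagonal(cosine_matrix(random_like))
```

A tight cluster yields a mean off‑diagonal similarity close to 1, while unrelated vectors average near 0.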
Conclusion
Semantic indexing with contrastive‑learned sentence embeddings provides superior generalization over rule‑based anti‑spam methods, especially for rapidly evolving group‑promotion campaigns. Retrieval cost grows with the size of the index, so large corpora call for approximate nearest‑neighbor search to keep lookups fast. The success of large‑scale pretrained models such as ERNIE demonstrates the maturity of NLP for industrial security tasks, and ongoing research in contrastive learning and domain adaptation will further improve detection robustness.
References
[1] Arora, S., Liang, Y., & Ma, T. (2017). A simple but tough-to-beat baseline for sentence embeddings. Paper presented at 5th International Conference on Learning Representations, ICLR 2017, Toulon, France.
[2] Huang P S, He X, Gao J, et al. Learning deep structured semantic models for web search using clickthrough data[C]//Proceedings of the 22nd ACM international conference on Information & Knowledge Management. 2013: 2333-2338.
[3] Kiros R, Zhu Y, Salakhutdinov R R, et al. Skip-thought vectors[J]. Advances in neural information processing systems, 2015, 28.
[4] Feng M, Xiang B, Glass M R, et al. Applying deep learning to answer selection: A study and an open task[C]//2015 IEEE workshop on automatic speech recognition and understanding (ASRU). IEEE, 2015: 813-820.
[5] Conneau A, Kiela D, Schwenk H, et al. Supervised learning of universal sentence representations from natural language inference data[J]. arXiv preprint arXiv:1705.02364, 2017.
[6] Reimers N, Gurevych I. Sentence-bert: Sentence embeddings using siamese bert-networks[J]. arXiv preprint arXiv:1908.10084, 2019.
[7] Tianyu Gao, Xingcheng Yao, and Danqi Chen. 2021. SimCSE: Simple Contrastive Learning of Sentence Embeddings. In Proceedings of the 2021 Conference on Empirical
[8] van den Oord, A., Li, Y., and Vinyals, O., “Representation Learning with Contrastive Predictive Coding”, arXiv e-prints, 2018.
