Low-Quality Text Detection Using Unsupervised Language Model Perplexity
This article presents a method for identifying low-quality text in business data: train a large-scale language model on unlabeled corpora, compute sentence perplexity, and reduce detection to a simple threshold decision. It also details the model design, the challenges encountered, performance optimizations, and online results.
Introduction – To detect random characters, incoherent semantics, and non‑standard language in business data, the authors present a method that leverages a massive unsupervised corpus to train a language model and compute sentence perplexity, turning low‑quality text identification into a simple threshold decision. The approach requires no manual labeling, achieves high accuracy, and transfers well across domains.
Background – With the rapid growth of the mobile internet and rising service‑level requirements, 58.com processes billions of text posts daily. User‑generated content often contains spam, illegal, or nonsensical posts (e.g., obscure characters, random strings, incoherent sentences), which degrade data quality and user experience. Manual, rule‑based filtering lags behind evolving spam patterns, covers only a fraction of cases, and is costly.
Characteristics of Low‑Quality Text
Strong adversarial nature and fast evolution – spammers constantly change patterns to evade rules.
Data sparsity – rare problematic patterns are expensive to collect manually.
Large volume – manual review is inefficient.
Detection Scheme Design
Perplexity Definition – Perplexity (PPL) measures how well a language model predicts a sentence: the better the model's probability estimates for the sentence's tokens, the lower the perplexity. A high perplexity therefore indicates the sentence is unlikely under the model and may be low‑quality.
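The definition above can be made concrete with a minimal numeric sketch (not the production model); it assumes per-token log-probabilities are already available from some language model, and the threshold value is illustrative:

```python
import math

def perplexity(token_logprobs):
    """Sentence perplexity from per-token log-probabilities:
    PPL = exp(-(1/N) * sum_i log P(w_i | context)).
    Fluent text yields high token probabilities and low PPL;
    garbled text yields low probabilities and high PPL."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def is_low_quality(token_logprobs, threshold=50.0):
    """Threshold decision: flag sentences whose PPL exceeds a preset value."""
    return perplexity(token_logprobs) > threshold

# Toy numbers: a "fluent" sentence vs. a "garbled" one.
fluent  = [math.log(p) for p in (0.4, 0.5, 0.3)]
garbled = [math.log(p) for p in (0.01, 0.005, 0.02)]
assert perplexity(garbled) > perplexity(fluent)
assert is_low_quality(garbled) and not is_low_quality(fluent)
```

Note that perplexity is the exponentiated average negative log-likelihood, so four tokens each with probability 0.5 give a PPL of exactly 2.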
Model v1 – Built on a Transformer block (see Figure 2). The model predicts the probability of each token given its left context, computes the sentence perplexity, and flags sentences whose score exceeds a preset threshold. Issues identified:
Averaging over all tokens smooths away short low‑probability spans in long sentences.
Fixed‑length inputs force truncation and padding; the padding introduces noise and harms convergence.
Insufficient corpus size limits generalization.
To mitigate the averaging problem, a moving‑window approach is applied: compute perplexity over sliding windows and use the maximum window score as the final sentence score.
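The windowed scoring might look like the following sketch; the window size and stride here are illustrative assumptions, not the production settings:

```python
import math

def windowed_perplexity(token_logprobs, window=10, stride=1):
    """Score a sentence by the *maximum* perplexity over sliding windows,
    so a short garbled span inside a long, otherwise fluent sentence is
    not averaged away by the surrounding tokens."""
    n = len(token_logprobs)
    if n <= window:  # short sentence: fall back to whole-sentence PPL
        return math.exp(-sum(token_logprobs) / n)
    return max(
        math.exp(-sum(token_logprobs[i:i + window]) / window)
        for i in range(0, n - window + 1, stride)
    )
```

With a window of 10, even a 5-token garbled run dominates at least one window, so the sentence-level score stays high where whole-sentence averaging would dilute it.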
Model v2 – Adopts Google’s BERT (masked language modeling) to predict token probabilities using full bidirectional context (see Figure 5). Advantages:
Large pre‑training corpus with cross‑domain coverage.
Deeper architecture captures richer semantic features.
Masking eliminates padding noise, removing the need for fixed‑length truncation.
Context‑aware probability estimation is more accurate.
The perplexity formula is updated to use the masked token probability (see Figure 6).
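In code, the masked variant scores each position using the full bidirectional context. The sketch below hides the model behind an assumed `masked_logprob(tokens, i)` callable (in practice, a BERT forward pass with position i replaced by [MASK]); the toy stand-in model is purely illustrative:

```python
import math

def masked_pseudo_perplexity(tokens, masked_logprob):
    """Pseudo-perplexity under a masked language model: for each position i,
    mask it, ask the model for log P(token_i | all other tokens), then
    exponentiate the averaged negative log-likelihood. Every sentence is
    scored at its natural length, so no padding noise is introduced."""
    n = len(tokens)
    total = sum(masked_logprob(tokens, i) for i in range(n))
    return math.exp(-total / n)

# Toy stand-in model: tokens from a tiny "corpus" are likely, others are not.
KNOWN = {"the", "cat", "sat"}
def toy_masked_logprob(tokens, i):
    return math.log(0.5 if tokens[i] in KNOWN else 0.01)

assert (masked_pseudo_perplexity(["the", "cat", "sat"], toy_masked_logprob)
        < masked_pseudo_perplexity(["xq", "zzk", "wv9"], toy_masked_logprob))
```

The naive version requires one forward pass per masked position, which motivates the inference optimizations described next.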
Performance Optimizations
Adaptive number of MASK positions based on sentence length to reduce inference calls.
Uniform sampling of MASK positions to preserve N‑gram context.
Caching model outputs in Redis to avoid redundant computation.
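These optimizations can be sketched as follows; the cap of 8 MASK positions and the in-process dict (standing in for Redis) are illustrative assumptions:

```python
import hashlib

def sample_mask_positions(n_tokens, max_masks=8):
    """Adaptive, uniformly spread MASK positions: short sentences score
    every token, while long ones score only ~max_masks evenly spaced
    positions, bounding the number of model calls per sentence while
    preserving the local N-gram context around each mask."""
    if n_tokens <= max_masks:
        return list(range(n_tokens))
    step = n_tokens / max_masks
    return [int(i * step) for i in range(max_masks)]

_cache = {}  # in-process stand-in for Redis

def cached_score(sentence, compute_score):
    """Memoize per-sentence scores keyed by a content hash, so repeated
    posts skip the model forward pass entirely."""
    key = hashlib.md5(sentence.encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = compute_score(sentence)
    return _cache[key]
```

In a production deployment the dict would be replaced by a shared Redis instance with an expiry, so identical posts arriving at different workers still hit the cache.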
Online Evaluation
After deployment, the service processes tens of millions of requests daily. Table 1 shows daily call volume, hit count, and accuracy (≈ 96.5‑98.5%). Latency analysis (Figure 7) indicates 90% of calls finish within 31.1 ms and 99% within 91.4 ms.
| Date      | Calls      | Hits   | Accuracy |
|-----------|------------|--------|----------|
| 2019/4/26 | 11,990,226 | 37,445 | 98.53%   |
| 2019/5/17 | 20,705,285 | 26,153 | 98.21%   |
| 2019/5/24 | 33,048,004 | 23,874 | 97.50%   |
| 2019/8/9  | 39,929,396 | 21,569 | 96.65%   |
| 2019/8/28 | 41,672,416 | 26,415 | 97.10%   |
Conclusion and Future Work
The unsupervised pre‑trained language model approach provides fast, accurate, and domain‑adaptable low‑quality text detection. Remaining challenges include sensitivity to sentence length, abbreviation mis‑detections, and multi‑semantic sentences. Planned improvements:
Adopt newer pre‑training models for better contextual probability estimation.
Use dynamic window sizes for short vs. long texts.
Fine‑tune on domain‑specific data.
Incorporate syntactic analysis to refine perplexity calculations.
Author – Zhao Zhongxin, responsible for opinion clustering and low‑quality text detection algorithms, researching Neural Topic Model, Memory Network, and other NLP techniques.
58 Tech
Official tech channel of 58, a platform for tech innovation, sharing, and communication.