
Low-Quality Text Detection Using Unsupervised Language Model Perplexity

This article proposes a method to identify low-quality text in business data by training a large-scale unsupervised language model to compute sentence perplexity, converting the detection problem into a threshold decision, and details model design, challenges, optimizations, and online performance results.


Introduction – To detect random characters, incoherent semantics, and non‑standard language in business data, the authors present a method that leverages a massive unsupervised corpus to train a language model and compute sentence perplexity, turning low‑quality text identification into a simple threshold decision. The approach requires no manual labeling, achieves high accuracy, and transfers well across domains.

Background – With the rapid growth of mobile internet and increasing service‑level requirements, 58.com processes billions of text posts daily. User‑generated content often contains spam, illegal, or nonsensical posts (e.g., obscure characters, random strings, incoherent sentences), which degrade data quality and user experience. Manual rule‑based filtering reacts slowly to new patterns, covers few cases, and is costly to maintain.

Characteristics of Low‑Quality Text

Strong adversarial nature and fast evolution – spammers constantly change patterns to evade rules.

Data sparsity – rare problematic patterns are expensive to collect manually.

Large volume – manual review is inefficient.

Detection Scheme Design

Perplexity Definition – Perplexity (PPL) measures how well a language model predicts a sentence: the lower the perplexity, the more probable the sentence is under the model. A high perplexity therefore indicates the sentence is unlikely under the model and may be low‑quality.
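Formally, for a sentence S = w₁ … w_N scored left to right, perplexity is the inverse geometric mean of the token probabilities:

```latex
\mathrm{PPL}(S) = P(w_1, \ldots, w_N)^{-\frac{1}{N}}
               = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log P\bigl(w_i \mid w_1, \ldots, w_{i-1}\bigr)\right)
```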

Model v1 – Built on a Transformer block (see Figure 2). The model predicts the probability of each token given its left context, computes the sentence perplexity, and flags sentences whose score exceeds a preset threshold. Issues identified:

Average‑value smoothing hides local low‑probability spikes in long sentences.

Fixed‑length truncation introduces padding noise and harms convergence.

Insufficient corpus size limits generalization.

To mitigate the averaging problem, a moving‑window approach is applied: compute perplexity over sliding windows and use the maximum window score as the final sentence score.
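The sliding‑window idea can be sketched as follows. This is an illustrative implementation, not the article's code: the window size of 5 is an assumed parameter, and the per‑token log‑probabilities are taken as already produced by the language model.

```python
import math

def window_perplexity(token_logprobs, window=5):
    """Score a sentence by its worst (highest-perplexity) sliding window.

    Averaging log-probabilities over a whole long sentence can hide a short
    garbled span, so instead we compute perplexity over each window and
    return the maximum window score as the sentence score.
    """
    n = len(token_logprobs)
    if n <= window:
        # Sentence shorter than one window: plain sentence-level perplexity.
        return math.exp(-sum(token_logprobs) / max(n, 1))
    scores = []
    for i in range(n - window + 1):
        chunk = token_logprobs[i:i + window]
        scores.append(math.exp(-sum(chunk) / window))
    return max(scores)

# A short low-probability run dominates the score even in a long sentence:
good = [-0.1] * 20                     # uniformly fluent tokens
bad = good[:10] + [-5.0] * 3 + good[10:]  # three garbled tokens in the middle
assert window_perplexity(bad) > window_perplexity(good)
```

The final flagging step is then the same threshold decision as before, applied to the max‑window score.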

Model v2 – Adopts Google’s BERT (masked language modeling) to predict token probabilities using full bidirectional context (see Figure 5). Advantages:

Large pre‑training corpus with cross‑domain coverage.

Deeper architecture captures richer semantic features.

Masking eliminates padding noise, removing the need for fixed‑length truncation.

Context‑aware probability estimation is more accurate.

The perplexity formula is updated to use the masked token probability (see Figure 6).
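With masked language modeling, each token's probability is conditioned on its full bidirectional context rather than only the left context, giving the masked ("pseudo") perplexity:

```latex
\mathrm{PPL}_{\text{masked}}(S) = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log P\bigl(w_i \mid w_1, \ldots, w_{i-1}, w_{i+1}, \ldots, w_N\bigr)\right)
```

Note that computing this exactly requires one forward pass per masked position, which motivates the optimizations below.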

Performance Optimizations

Adaptive number of MASK positions based on sentence length to reduce inference calls.

Uniform sampling of MASK positions to preserve N‑gram context.

Caching model outputs in Redis to avoid redundant computation.
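The first two optimizations can be sketched together: pick a number of MASK positions that scales with sentence length (capped to bound latency), and spread them evenly so each masked position keeps its local N‑gram context. This is a minimal sketch; the ratio and caps are assumptions, not values from the article.

```python
import random

def sample_mask_positions(n_tokens, ratio=0.25, min_masks=1, max_masks=8, seed=None):
    """Pick MASK positions for masked-LM perplexity scoring.

    Masking every token costs one inference call per token. Instead, mask a
    number of positions proportional to sentence length (capped), and sample
    them uniformly across the sentence via stratified sampling so the masked
    positions are roughly evenly spaced.
    (Illustrative sketch; ratio=0.25 and max_masks=8 are assumed parameters.)
    """
    k = max(min_masks, min(max_masks, round(n_tokens * ratio)))
    k = min(k, n_tokens)
    rng = random.Random(seed)
    # Split the token range into k equal strata and draw one position from each.
    positions = []
    for j in range(k):
        lo = j * n_tokens // k
        hi = (j + 1) * n_tokens // k
        positions.append(rng.randrange(lo, hi))
    return positions
```

Model outputs for the sampled positions can then be keyed by (sentence hash, position) in Redis so repeated posts skip inference entirely.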

Online Evaluation

After deployment, the service processes tens of millions of requests daily. Table 1 shows daily call volume, hit count, and accuracy (≈ 96.6–98.5%). Latency analysis (Figure 7) indicates 90% of calls finish within 31.1 ms and 99% within 91.4 ms.

| Date      | Calls      | Hits   | Accuracy |
|-----------|------------|--------|----------|
| 2019/4/26 | 11,990,226 | 37,445 | 98.53%   |
| 2019/5/17 | 20,705,285 | 26,153 | 98.21%   |
| 2019/5/24 | 33,048,004 | 23,874 | 97.50%   |
| 2019/8/9  | 39,929,396 | 21,569 | 96.65%   |
| 2019/8/28 | 41,672,416 | 26,415 | 97.10%   |

Conclusion and Future Work

The unsupervised pre‑trained language model approach provides fast, accurate, and domain‑adaptable low‑quality text detection. Remaining challenges include sensitivity to sentence length, abbreviation mis‑detections, and multi‑semantic sentences. Planned improvements:

Adopt newer pre‑training models for better contextual probability estimation.

Use dynamic window sizes for short vs. long texts.

Fine‑tune on domain‑specific data.

Incorporate syntactic analysis to refine perplexity calculations.

Author – Zhao Zhongxin, responsible for opinion clustering and low‑quality text detection algorithms, researching Neural Topic Model, Memory Network, and other NLP techniques.

Tags: transformer, NLP, BERT, language model, low-quality text detection, perplexity
Written by 58 Tech – Official tech channel of 58, a platform for tech innovation, sharing, and communication.