How Cleanlab Cut Data Review by 34×: A Real‑World Text Classification Case Study

This article walks through a real text‑classification project where noisy labels inflated the review workload to over 15,000 samples, and shows how cleanlab's confident‑learning framework reduced the manual audit set to 438 items, a roughly 34× efficiency gain, while improving model performance.

Sohu Tech Products

Abstract

Data quality limits model performance. In a Chinese social‑media spam detection dataset of 39,116 records, 15,192 appeared suspicious. Using cleanlab and Confident Learning, the review set was reduced to 438 samples, achieving a 34× efficiency gain.

Problem Statement

Goal: binary text classification (normal vs spam advertising). After labeling ~40k examples and training an ERNIE model, the naive strategy of treating every model‑label disagreement as suspicious flagged ~40% of the data (15,192 of 39,116 samples), far too many to audit by hand.

Environment Setup

Install required packages:

pip install onnxruntime paddlenlp "cleanlab[datalab]" pandas numpy tqdm scipy

Data file all.txt contains lines text<TAB>label. Example:

怎么说,你这边有空了解吗	正常
我关注你了,互关一下,详细介绍	灌水广告

Generating Prediction Probabilities

Load the tokenizer and the ONNX‑exported ERNIE model, then run batched inference to obtain a probability matrix pred_probs of shape (n_samples, 2). A batch size of 64 keeps memory usage bounded.

import onnxruntime as ort
from paddlenlp.transformers import ErnieTokenizer
import numpy as np, math
from tqdm import tqdm
from scipy.special import softmax

tokenizer = ErnieTokenizer.from_pretrained('ernie-3.0-tiny-medium-v2-zh')
session = ort.InferenceSession('model-ad.onnx')
class_labels = ['正常', '灌水广告']
label_to_id = {name: i for i, name in enumerate(class_labels)}

def load_data(filepath, label_map):
    texts, labels = [], []
    with open(filepath, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if '\t' not in line:
                continue
            text, lbl = line.split('\t', 1)
            if lbl in label_map:
                texts.append(text)
                labels.append(label_map[lbl])
    return texts, np.array(labels)

texts, given_labels = load_data('all.txt', label_to_id)

def get_pred_probs_batched(texts, tokenizer, session, batch_size=64):
    num_batches = math.ceil(len(texts) / batch_size)
    all_probs = []
    for i in tqdm(range(num_batches), desc='Inference'):
        batch = texts[i*batch_size:(i+1)*batch_size]
        inputs = tokenizer(batch, return_tensors='np', max_length=128,
                           padding='max_length', truncation=True)
        input_ids = inputs['input_ids'].astype(np.int64)
        token_type_ids = inputs['token_type_ids'].astype(np.int64)
        logits = session.run(None, {'input_ids': input_ids,
                                    'token_type_ids': token_type_ids})[0]
        probs = softmax(logits, axis=1)
        all_probs.append(probs)
    return np.vstack(all_probs)

pred_probs = get_pred_probs_batched(texts, tokenizer, session, batch_size=64)
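
Before handing pred_probs to cleanlab, it is worth a quick sanity check that every row is a valid probability distribution (the softmax guarantees this, but shape and dtype bugs are easy to introduce). A minimal check, shown here on a toy matrix standing in for the real one:

```python
import numpy as np

# Toy stand-in; in this project pred_probs has shape (39116, 2)
# after the batched ONNX inference above.
pred_probs_demo = np.array([[0.97, 0.03],
                            [0.10, 0.90],
                            [0.55, 0.45]])

# Each row must be a distribution over the two classes,
# otherwise cleanlab's quality scores are meaningless.
assert pred_probs_demo.shape[1] == 2
assert np.allclose(pred_probs_demo.sum(axis=1), 1.0)
assert ((pred_probs_demo >= 0) & (pred_probs_demo <= 1)).all()
```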

Detecting Label Issues with cleanlab

Wrap the texts and original labels in a Datalab object, run find_issues with the probability matrix, and pull the flagged rows out with get_issues.

from cleanlab import Datalab
from cleanlab.count import get_confident_thresholds
import pandas as pd
import numpy as np

# Datalab expects the labels as a named column of the dataset.
data_df = pd.DataFrame({'text': texts, 'label': given_labels})
lab = Datalab(data=data_df, label_name='label')
lab.find_issues(pred_probs=pred_probs, issue_types={'label': {}})

issues = lab.get_issues('label')  # columns: is_label_issue, label_score
flagged = issues[issues['is_label_issue']]
print(f'Found {len(flagged)} potential label issues.')

if not flagged.empty:
    review = []
    for idx, row in flagged.iterrows():
        pred_idx = int(np.argmax(pred_probs[idx]))
        review.append({
            'index': idx,
            'text': texts[idx],
            'given_label': class_labels[given_labels[idx]],
            'model_suggestion': class_labels[pred_idx],
            'label_quality_score': row['label_score'],
        })
    df = pd.DataFrame(review).sort_values('label_quality_score')
    df.to_csv('label_issues_to_review.csv', index=False, encoding='utf-8-sig')
    print('Exported review file with', len(df), 'rows.')

    # Per-class confident thresholds t_j (see the theory section below).
    thresholds = get_confident_thresholds(given_labels, pred_probs)
    for i, name in enumerate(class_labels):
        print(f' - Class {name}: confidence threshold = {thresholds[i]:.4f}')
else:
    print('No obvious label errors detected.')

The resulting CSV contains 438 rows; reviewing those instead of the 15,192 originally suspicious samples cuts manual effort by roughly 34×.

Confident Learning Theory

Confident Learning treats label noise as a statistical problem. It relies on out‑of‑sample prediction probabilities (usually obtained via cross‑validation) to estimate a noise matrix Q, identify mislabeled samples, and optionally correct them during training.

Step 1 – Estimate the noise matrix Q

For each class j, the threshold t_j is the model's average self‑confidence over the samples whose given label is j: t_j = E[p_i[j] | y_i = j]. In this project the thresholds were approximately t_正常 = 0.92 and t_灌水广告 = 0.88. Using these thresholds, a confident‑count matrix C is built: a sample with given label i contributes to C[i][j] when its predicted probability for class j exceeds t_j. Normalising C by the total number of confident counts yields the joint distribution matrix Q, whose entry Q[i][j] estimates the probability that a sample with latent true label j was observed with noisy label i.
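
A minimal numpy sketch of Step 1 on toy data (simplified: cleanlab additionally calibrates C so that its row sums match the observed label counts):

```python
import numpy as np

# Toy run of Step 1: 8 samples, 2 classes (0 = 正常, 1 = 灌水广告).
labels = np.array([0, 0, 0, 0, 1, 1, 1, 0])  # given (possibly noisy) labels
pred_probs = np.array([
    [0.95, 0.05], [0.90, 0.10], [0.85, 0.15], [0.30, 0.70],
    [0.10, 0.90], [0.20, 0.80], [0.40, 0.60], [0.15, 0.85],
])

# t_j: mean predicted probability of class j over samples *given* label j.
thresholds = np.array([pred_probs[labels == j, j].mean() for j in range(2)])

# Confident counts: a sample with given label i is counted in C[i, j] when
# its probability for class j clears t_j; ties go to the highest probability.
C = np.zeros((2, 2), dtype=int)
for k in range(len(labels)):
    confident = np.where(pred_probs[k] >= thresholds)[0]
    if len(confident) > 0:
        j = confident[np.argmax(pred_probs[k, confident])]
        C[labels[k], j] += 1

Q = C / C.sum()  # joint distribution of (given label, latent true label)
# Off-diagonal mass such as Q[0, 1] is the estimated rate at which
# 正常-labeled samples are actually 灌水广告 (sample 7 here).
```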

Step 2 – Identify label errors (Prune by Confidence)

For a sample x_i with given label y_i and model prediction ŷ_i, if the model’s probability for y_i is below the corresponding threshold t_{y_i}, the sample is flagged as a potential label error. cleanlab implements this logic internally and returns a ranked list based on label_quality_score (the model’s confidence in the given label).
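
The prune‑by‑confidence rule can be sketched in a few lines of numpy (simplified relative to cleanlab's implementation, which also requires some other class to clear its own threshold):

```python
import numpy as np

labels = np.array([0, 0, 1, 1, 0])  # given labels
pred_probs = np.array([
    [0.95, 0.05], [0.20, 0.80], [0.10, 0.90],
    [0.85, 0.15], [0.90, 0.10],
])

# Per-class thresholds t_j, as in Step 1.
thresholds = np.array([pred_probs[labels == j, j].mean() for j in range(2)])

# self_confidence is the model's probability for each sample's given label;
# this is what cleanlab ranks by as the label quality score.
self_confidence = pred_probs[np.arange(len(labels)), labels]

# Flag samples whose confidence in the given label falls below t_{y_i}.
issue_mask = self_confidence < thresholds[labels]
review_order = np.argsort(self_confidence)  # most suspicious first
# issue_mask -> [False, True, False, True, False]: samples 1 and 3 flagged
```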

Step 3 – Correct and learn

cleanlab provides a CleanLearning wrapper that can re‑weight or correct noisy samples during training, thereby improving downstream model robustness.

Practical Considerations

Model quality : The base classifier must be better than random; otherwise confidence scores are unreliable.

Out‑of‑sample probabilities : Use cross‑validation or a held‑out set to generate pred_probs. If the data distribution or model changes, repeat the entire pipeline.

Threshold interpretation : A threshold represents the average confidence the model has when it predicts a class correctly. Samples with lower confidence for their given label are likely mislabeled.
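
When the classifier can be retrained (unlike the fixed ONNX model used above), scikit‑learn's cross_val_predict is a standard way to obtain out‑of‑sample probabilities; the features and estimator below are hypothetical stand‑ins:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 8))  # stand-in for text features/embeddings
labels = (X[:, 0] + 0.3 * rng.normal(size=300) > 0).astype(int)

# Each sample's probabilities come from a fold that never trained on it,
# which is exactly what Confident Learning assumes.
pred_probs = cross_val_predict(LogisticRegression(), X, labels,
                               cv=5, method='predict_proba')
assert pred_probs.shape == (300, 2)
```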

Conclusion

Applying cleanlab reduced the manual review set from 15,192 to 438 samples (≈34× efficiency gain), lowered labeling costs, and produced a reproducible data‑quality benchmark consistent with Data‑Centric AI principles.

References

cleanlab GitHub: https://github.com/cleanlab/cleanlab

Confident Learning paper: Northcutt, C. G., Jiang, L., & Chuang, I. L. "Confident Learning: Estimating Uncertainty in Dataset Labels." Journal of Artificial Intelligence Research, 2021.
