How Baidu Scales Real-Time Content Safety for Millions of Mini‑Programs

This article explains Baidu's evolving inspection scheduling system for its smart mini‑programs, detailing the challenges of massive page volumes, the V1.0 offline architecture, the V2.0 real‑time enhancements, resource constraints, deduplication logic, and the measurable improvements in risk detection and ecosystem health.

Baidu Geek Talk

May 23, 2022

Background and Goals

Baidu Smart Mini‑Programs serve billions of page views daily. To protect the user experience, the inspection system must handle this traffic under several constraints: program categories carry different risk scores; each program has a crawl quota that caps how many pages can be fetched; content‑safety evaluation consumes significant compute for spidering, rendering and analysis; and risk must be weighted by traffic.

V1.0 Offline Inspection Scheduling

Data sources: SDK logs (page visits, performance, exceptions) are collected in Baidu’s log platform.

Page discovery: High‑volume pages are selected based on traffic distribution, industry category, release cycles and historical violations.

Inspection platform: Tasks generate crawl jobs and feed pages to risk, experience and red‑line detection services via Kafka.

Low‑quality signal distribution: Machine‑reviewed risky pages are manually verified before downstream propagation.

Online intervention: Detected low‑quality pages trigger actions ranging from page blocking to program shutdown or entity blacklisting.

The V1.0 pipeline ran on a daily cadence, resulting in a T+1 delay for risk exposure.

V2.0 Real‑Time + Offline Inspection Scheduling

Design Principles

Real‑time detection is prioritized; offline data supplements gaps.

Maintain high coverage while respecting per‑program crawl limits.

Distribute crawl load uniformly across programs.

Architecture Overview

Real‑time and offline discovery pipelines feed a unified scheduling engine. The engine de‑duplicates candidates and dispatches them for crawling and machine review.

Offline Page Discovery

Offline discovery aggregates page PVs from the previous day’s logs, filters out entries in the “mis‑hit pool” and applies per‑program quota limits. The candidate pages and their PVs are stored in Doris tables for the scheduler to read.
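
The offline steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the (program_id, url) record shape, the default quota, and the choice to keep each program's highest‑PV pages when the quota is exceeded are all hypothetical, and a plain dictionary stands in for the Doris tables.

```python
from collections import Counter, defaultdict


def discover_offline_candidates(log_rows, mis_hit_pool, quota_by_program,
                                default_quota=100):
    """Aggregate PVs, drop mis-hit pool pages, and cap pages per program.

    log_rows: iterable of (program_id, url) visit records (assumed shape).
    mis_hit_pool: set of URLs to exclude (assumed representation).
    quota_by_program: dict of program_id -> max pages (hypothetical).
    """
    # Count page views (PV) per (program, url) pair.
    pv = Counter((program_id, url) for program_id, url in log_rows)

    # Group surviving pages by program, skipping mis-hit pool entries.
    by_program = defaultdict(list)
    for (program_id, url), count in pv.items():
        if url in mis_hit_pool:
            continue
        by_program[program_id].append((url, count))

    # Assumption: when a program exceeds its quota, keep its highest-PV pages.
    candidates = {}
    for program_id, pages in by_program.items():
        pages.sort(key=lambda page: (-page[1], page[0]))
        limit = quota_by_program.get(program_id, default_quota)
        candidates[program_id] = pages[:limit]
    return candidates
```

Sorting by descending PV before truncating means the pages handed to the scheduler are already in the order it would want when supplementing real‑time data.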

Real‑Time Page Discovery

Real‑time discovery consumes data from Baidu’s BigPipe service. Structured Streaming processes the stream in 5‑minute windows with a 15‑minute watermark to discard stale data. High‑PV pages are always sent for inspection; low‑PV pages are sampled at 1% using a random‑number filter. Additional filters remove exempt programs and mis‑hit pool entries, and enforce per‑program crawl quotas.
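
To make the windowing and sampling rules concrete, here is a minimal pure‑Python sketch. It is not the Structured Streaming job itself: the staleness check is a simplified model of a watermark, and HIGH_PV_THRESHOLD is a hypothetical cut‑off, since the article does not say where "high PV" begins. The window size, watermark delay and sampling rate are the figures quoted above.

```python
import random

WINDOW_SECONDS = 5 * 60      # 5-minute tumbling window (from the article)
WATERMARK_SECONDS = 15 * 60  # 15-minute watermark delay (from the article)
LOW_PV_SAMPLE_RATE = 0.01    # 1% sampling of low-PV pages (from the article)
HIGH_PV_THRESHOLD = 1000     # hypothetical; the article gives no number


def window_start(event_ts: int) -> int:
    """Return the start (epoch seconds) of the window containing event_ts."""
    return event_ts - event_ts % WINDOW_SECONDS


def is_stale(event_ts: int, max_event_ts_seen: int) -> bool:
    """Simplified watermark: an event is stale once it trails the newest
    event time seen so far by more than the watermark delay."""
    return event_ts < max_event_ts_seen - WATERMARK_SECONDS


def should_inspect(pv: int, rng: random.Random) -> bool:
    """High-PV pages always pass; low-PV pages pass with 1% probability."""
    if pv >= HIGH_PV_THRESHOLD:
        return True
    return rng.random() < LOW_PV_SAMPLE_RATE
```

Passing the random generator in explicitly keeps the sampling decision reproducible in tests while leaving production code free to use an unseeded generator.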

Scheduling Strategy

Divide the day into bn batches (bn being the number of batches) and compute the current batch from the minute of the day.

Process real‑time windows first; if the crawl quota for a program remains, supplement with offline pages ordered by PV.

Deduplicate pages: high‑PV pages are de‑duplicated daily, while medium/low‑PV pages are de‑duplicated over longer intervals using Redis sets keyed by MD5‑hashed URLs, sharded across 100 partitions.
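
The first two rules of the strategy above can be sketched as follows. The exact batch formula is not given in the article, so equal‑length batches are an assumption here, as are the function names and the idea of passing in a program's remaining quota as a plain integer.

```python
from datetime import datetime

MINUTES_PER_DAY = 24 * 60


def current_batch(now: datetime, bn: int) -> int:
    """Map the minute of the day onto one of bn equal-length batches
    (assumed formula; batches are numbered 0 .. bn - 1)."""
    minute_of_day = now.hour * 60 + now.minute
    return minute_of_day * bn // MINUTES_PER_DAY


def plan_batch(realtime_pages, offline_pages, remaining_quota):
    """Pick pages for one program: real-time pages first, then offline
    pages by descending PV until the remaining quota is used up.

    realtime_pages: list of URLs from the current real-time window.
    offline_pages: list of (url, pv) pairs from offline discovery.
    """
    # dict.fromkeys drops repeated URLs while preserving arrival order.
    planned = list(dict.fromkeys(realtime_pages))[:remaining_quota]
    seen = set(planned)
    for url, _pv in sorted(offline_pages, key=lambda page: -page[1]):
        if len(planned) >= remaining_quota:
            break
        if url not in seen:
            planned.append(url)
            seen.add(url)
    return planned
```

With bn = 24, for example, 12:30 falls in batch 12 and 23:59 in batch 23, so each batch covers one hour.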

Result Storage & Index Management

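Inspection results are written to Elasticsearch. An alias‑based rollover switches writes to a new index at least daily: the index rolls over as soon as it reaches 1 day of age, 10 million documents or 2 GB in size, whichever comes first. Stale data can then be removed efficiently by deleting whole indices.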

PUT /%3Conline-realtime-risk-page-%7Bnow%2FH%7BYYYY.MM.dd%7C%2B08%3A00%7D%7D-1%3E
{
  "aliases": {
    "online-realtime-risk-page-index": { "is_write_index": true }
  }
}
POST /online-realtime-risk-page-index/_rollover
{
  "conditions": {
    "max_age": "1d",
    "max_docs": 10000000,
    "max_size": "2gb"
  }
}
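
Elasticsearch requires a date‑math index name to be wrapped in angle brackets and percent‑encoded when it appears in a request path. The sketch below shows one way to produce the encoded form, assuming the decoded name in the request above is <online-realtime-risk-page-{now/H{YYYY.MM.dd|+08:00}}-1>. The should_roll_over helper is a hypothetical local illustration of the "any condition" rule using the thresholds above; it is not Elasticsearch's own implementation.

```python
from urllib.parse import quote

# Assumed decoded form of the date-math index name used in the request.
DATE_MATH_NAME = "<online-realtime-risk-page-{now/H{YYYY.MM.dd|+08:00}}-1>"


def encode_index_name(date_math_name: str) -> str:
    """Percent-encode every reserved character, including '/'."""
    return quote(date_math_name, safe="")


def should_roll_over(age_days: float, doc_count: int, size_bytes: int) -> bool:
    """Illustrative check: roll over when ANY threshold is reached."""
    max_size_bytes = 2 * 1024 ** 3  # "2gb" read as binary gigabytes
    return (
        age_days >= 1
        or doc_count >= 10_000_000
        or size_bytes >= max_size_bytes
    )


if __name__ == "__main__":
    print("PUT /" + encode_index_name(DATE_MATH_NAME))
```

Because each rolled‑over index holds a bounded slice of the data, expiring old results becomes a matter of deleting indices instead of running delete‑by‑query over a single large one.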

Key Benefits

Daily inspected pages grew to tens of millions, dramatically improving coverage.

The time needed to expose a risk fell from days to minutes, and intervention latency fell from days to hours.

Combining real‑time and offline data maximized page coverage while using fewer resources.

Technical Details of URL Deduplication

Redis stores MD5 hashes of page URLs in 100 sharded sets. The shard key is computed by converting each character of the URL to an integer, summing them to obtain x, and taking x mod 100. This design enables fast membership checks and scalable de‑duplication.
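
A minimal sketch of the shard‑key and membership logic follows. In‑memory Python sets stand in for the Redis sets so the example runs on its own, and the class and function names are hypothetical. The character‑to‑integer step is read here as taking each character's code point, which is an assumption about the original implementation.

```python
import hashlib

SHARD_COUNT = 100  # number of sharded sets (from the article)


def shard_index(url: str) -> int:
    """Sum the code points of the URL's characters, then take mod 100."""
    return sum(ord(ch) for ch in url) % SHARD_COUNT


def url_digest(url: str) -> str:
    """MD5 hex digest of the URL, used as the stored set member."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


class ShardedSeenSet:
    """In-memory stand-in for 100 Redis sets holding URL digests."""

    def __init__(self) -> None:
        self._shards = [set() for _ in range(SHARD_COUNT)]

    def add_if_new(self, url: str) -> bool:
        """Record the URL; return True only if it was not seen before."""
        shard = self._shards[shard_index(url)]
        digest = url_digest(url)
        if digest in shard:
            return False
        shard.add(digest)
        return True
```

Against a real Redis instance the same check maps naturally onto SADD, whose return value reports how many members were newly added. Storing fixed‑length digests instead of raw URLs keeps each member at 32 hex characters regardless of URL length.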

Future Directions

As page volume and content diversity continue to rise, the scheduling algorithms will be further refined, risk signals enriched, and resources scaled to maintain high‑quality user experiences.
