From Manual Ops to AI‑Powered Monitoring: Scaling Weibo Ads Infrastructure
This article outlines how the Weibo advertising team evolved its operations from hand‑crafted scripts to a fully automated, AI‑enhanced platform, covering service governance, multi‑datacenter deployment, a custom automation system (Kunkka), effective alerting, full‑link tracing, and a massive metric monitoring solution built on big‑data technologies.
Value of Operations in the Advertising System
Operations (SRE) directly affect service uptime, latency, and resource utilization of the advertising platform, which are core KPIs for business continuity.
Evolution of Operations Practices
The team progressed through four stages:
Manual stage: Direct command‑line interventions on single‑instance services.
Tool stage: Introduction of Puppet, Shell and Python scripts to automate repetitive tasks.
DevOps stage: Standardized, platform‑based CI/CD pipelines that bridge development and operations.
AI‑Ops stage: Use of AI and big‑data techniques for predictive anomaly detection and automated remediation.
Service Governance for Complex Business Scenarios
To mitigate sudden traffic spikes (e.g., viral posts) and avoid single‑datacenter failures, more than 100 services were refactored for multi‑datacenter, multi‑carrier deployment:
Balanced deployment across at least two data centers.
Distribution across different ISPs.
Redundant capacity per data center.
Even traffic distribution to prevent hot spots.
Upstream and downstream requests confined to the same data center to reduce cross‑datacenter latency.
Regular traffic stress tests copy production traffic into an isolated sandbox, exposing bottlenecks for targeted optimization.
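The rule that upstream and downstream requests stay inside one data center can be sketched as a small instance-selection function. This is a minimal illustration, not the team's actual routing code: the `Instance` type, the host names, and the data center labels (`dc-a`, `dc-b`) are hypothetical, and the fallback to other data centers when no local instance is healthy is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    host: str          # hypothetical host name
    datacenter: str    # hypothetical data center label
    healthy: bool = True


def pick_downstream(caller_dc: str, instances: list[Instance]) -> list[Instance]:
    """Prefer healthy instances in the caller's data center; otherwise fall back."""
    healthy = [i for i in instances if i.healthy]
    local = [i for i in healthy if i.datacenter == caller_dc]
    # Cross-datacenter calls are used only when no local instance is healthy.
    return local if local else healthy


instances = [
    Instance("ad-rank-01", "dc-a"),
    Instance("ad-rank-02", "dc-a", healthy=False),
    Instance("ad-rank-03", "dc-b"),
]
print([i.host for i in pick_downstream("dc-a", instances)])
print([i.host for i in pick_downstream("dc-c", instances)])
```

A caller in `dc-a` only sees the healthy local instance, while a caller in a data center with no local capacity falls back to every healthy instance.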
Automated Operations Platform – Kunkka
Kunkka, in development since 2017, is built on SaltStack and Jenkins. The CI/CD flow is:
Developers push code to GitLab.
Jenkins automatically builds the artifact and uploads it to Nexus.
Operators select target hosts in Kunkka and trigger deployment via SaltStack.
Kunkka integrates with the internal DCP platform to generate Docker images, push them to a private registry, and deploy to cloud hosts for rapid scaling. A multi‑level approval workflow (code review → operations review → production release) ensures safe roll‑outs.
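The multi‑level approval workflow can be modeled as an ordered list of gates that a release must pass one at a time. The sketch below is only an illustration of that idea: the `Release` class, the stage identifiers, and the artifact name are hypothetical and do not reflect Kunkka's real data model.

```python
# Ordered gates taken from the workflow described above; identifiers are hypothetical.
STAGES = ["code_review", "operations_review", "production_release"]


class Release:
    def __init__(self, artifact: str) -> None:
        self.artifact = artifact
        self.completed: list[str] = []

    def next_stage(self):
        """Return the next required stage, or None once every stage is done."""
        if len(self.completed) < len(STAGES):
            return STAGES[len(self.completed)]
        return None

    def advance(self, stage: str) -> None:
        """Complete a stage; stages cannot be skipped or reordered."""
        expected = self.next_stage()
        if expected is None:
            raise ValueError("release already completed every stage")
        if stage != expected:
            raise ValueError(f"expected {expected!r}, got {stage!r}")
        self.completed.append(stage)

    def is_released(self) -> bool:
        return self.completed == STAGES


release = Release("ads-api-1.4.2.jar")  # hypothetical artifact name
release.advance("code_review")
release.advance("operations_review")
release.advance("production_release")
print(release.is_released())
```

Rejecting out-of-order stages in one place keeps the rule "no production release without both reviews" easy to audit.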
Effective Alerting with Prometheus
To reduce alert fatigue, three practical principles are applied:
Validate necessity: Each alert request is reviewed with developers to confirm relevance.
Aggregate alerts: Identical alerts within a configurable time window are merged before notification.
Root‑cause tracing: Alerts are linked back to the originating metric, allowing downstream symptoms to be suppressed.
These measures cut alert volume dramatically while preserving critical signals.
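As a rough sketch of the aggregation and root‑cause principles, the code below merges identical alerts that arrive within a time window and then drops symptom alerts whose known cause is also firing. It is plain Python rather than Prometheus or Alertmanager configuration, and the alert tuples, alert names, and the `caused_by` mapping are assumptions made for illustration.

```python
def aggregate(alerts, window_seconds):
    """Merge alerts with the same (name, host) that fall within
    window_seconds of the first alert in their group."""
    groups = []
    open_groups = {}  # (name, host) -> index of the most recent group
    for ts, name, host in sorted(alerts):
        key = (name, host)
        idx = open_groups.get(key)
        if idx is not None and ts - groups[idx]["first_ts"] < window_seconds:
            groups[idx]["count"] += 1
        else:
            open_groups[key] = len(groups)
            groups.append({"name": name, "host": host, "first_ts": ts, "count": 1})
    return groups


def suppress_symptoms(groups, caused_by):
    """Drop alerts whose declared root cause is itself firing."""
    firing = {g["name"] for g in groups}
    return [g for g in groups if caused_by.get(g["name"]) not in firing]


# Hypothetical alerts: (timestamp in seconds, alert name, host).
alerts = [
    (0, "db_down", "db-1"),
    (5, "api_5xx", "web-1"),
    (20, "api_5xx", "web-1"),
    (90, "api_5xx", "web-1"),
]
merged = aggregate(alerts, window_seconds=60)
to_notify = suppress_symptoms(merged, caused_by={"api_5xx": "db_down"})
print(len(merged), [g["name"] for g in to_notify])
```

Here the two `api_5xx` alerts at 5 s and 20 s collapse into one group, and every `api_5xx` group is suppressed because its declared cause, `db_down`, is firing.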
Full‑Link Trace System
Every request receives a globally unique TraceId. Logs are collected, parsed, and stored as follows:
Log lines are shipped to Kafka (e.g., via Filebeat or custom agents).
Flink consumes the streams, extracts the TraceId, enriches fields, and writes structured records to ClickHouse.
ClickHouse serves real‑time queries for request‑level tracing, metric aggregation, and user‑level diagnostics.
Developers can query by dimensions such as UID to reconstruct a user’s end‑to‑end journey across the advertising system.
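To make the TraceId extraction and UID lookup concrete, here is a small in‑memory sketch. It stands in for the Flink parsing step and the ClickHouse query only conceptually: the `key=value` log layout, the field names, and the service names are all assumptions, since the original log format is not described.

```python
import re
from collections import defaultdict

# Assumed log layout: "ts=<epoch> trace_id=<id> uid=<id> service=<name> ..."
LINE_RE = re.compile(
    r"ts=(?P<ts>\d+) trace_id=(?P<trace_id>\S+) uid=(?P<uid>\S+) service=(?P<service>\S+)"
)


def parse(line):
    """Turn one raw log line into a structured record, or None if it does not match."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    record = match.groupdict()
    record["ts"] = int(record["ts"])
    return record


def journey_for_uid(lines, uid):
    """Group a user's records by TraceId and order each trace by timestamp."""
    traces = defaultdict(list)
    for line in lines:
        record = parse(line)
        if record is not None and record["uid"] == uid:
            traces[record["trace_id"]].append(record)
    return {
        trace_id: [r["service"] for r in sorted(records, key=lambda r: r["ts"])]
        for trace_id, records in traces.items()
    }


lines = [
    "ts=3 trace_id=t1 uid=42 service=billing",
    "ts=1 trace_id=t1 uid=42 service=gateway",
    "ts=2 trace_id=t1 uid=42 service=ranking",
    "ts=1 trace_id=t2 uid=7 service=gateway",
    "malformed line without fields",
]
print(journey_for_uid(lines, "42"))
```

Records that arrive out of order are sorted by timestamp within each trace, so the output lists the services in the order the request actually passed through them.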
Massive Metric Monitoring Platform – Oops
Oops follows a four‑layer architecture designed for >120 TB of real‑time metrics and peak QPS of 1.25 million:
Data collection: Filebeat agents forward logs to Kafka.
Metric cleaning: Flink parses logs, performs dimension joins (e.g., exposure ↔ interaction) via HBase, and writes cleaned metrics to ClickHouse. Parallel sinks also write raw data to Elasticsearch (search) and HDFS (offline analytics).
Metric storage: ClickHouse replaces Graphite as the primary OLAP time‑series store, offering sub‑second multi‑dimensional query performance.
Metric visualization: Grafana queries ClickHouse with simple SQL to render dashboards, line charts, and tables.
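The exposure ↔ interaction join in the metric cleaning layer can be pictured as a keyed lookup: each interaction event is enriched with the dimensions of the exposure that produced it. In the sketch below a Python dict stands in for the HBase lookup table, and the field names (`exposure_id`, `ad_id`, `action`) are hypothetical.

```python
def join_interactions(exposures, interactions):
    """Attach exposure dimensions to each interaction via exposure_id."""
    # A dict keyed by exposure_id stands in for the HBase lookup table.
    by_id = {e["exposure_id"]: e for e in exposures}
    joined = []
    for interaction in interactions:
        exposure = by_id.get(interaction["exposure_id"])
        if exposure is None:
            # No matching exposure; this sketch simply skips the event.
            continue
        joined.append({**exposure, "action": interaction["action"]})
    return joined


exposures = [
    {"exposure_id": "e1", "ad_id": "ad-9", "uid": "42"},
    {"exposure_id": "e2", "ad_id": "ad-3", "uid": "7"},
]
interactions = [
    {"exposure_id": "e1", "action": "click"},
    {"exposure_id": "e404", "action": "like"},
]
print(join_interactions(exposures, interactions))
```

Skipping unmatched interactions is a simplification; a streaming job would more likely buffer or retry them, because an interaction can arrive before its exposure record has been written.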
Materialized views and aggregation tables (e.g., per‑second, per‑minute, per‑user) enable flexible ad‑hoc analysis without sacrificing query speed.
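The idea behind an aggregation table is to precompute coarser rollups so dashboards never scan raw points. The following sketch shows a per‑second to per‑minute rollup in plain Python; it only mimics what a materialized view would maintain and is not ClickHouse code. The metric names and the `(epoch_second, metric, value)` tuple layout are assumptions.

```python
from collections import defaultdict


def rollup_per_minute(points):
    """Sum (epoch_second, metric, value) points into per-minute buckets.

    Returns a dict keyed by (minute_start_epoch, metric).
    """
    totals = defaultdict(int)
    for ts, metric, value in points:
        minute_start = ts - ts % 60  # floor the timestamp to the start of its minute
        totals[(minute_start, metric)] += value
    return dict(totals)


points = [
    (0, "impressions", 5),
    (59, "impressions", 3),
    (60, "impressions", 2),
    (61, "clicks", 1),
]
print(rollup_per_minute(points))
```

Querying the rolled-up table trades some granularity for far fewer rows per chart, which is why per‑second, per‑minute, and per‑user tables are kept side by side.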
dbaplus Community
