Sina Weibo’s AI Agent Ops: Three Steps, Five Stages, Multi‑Scenario Practice
The article details how Sina Weibo tackles rising operational complexity by evolving its traditional AIOps into a three‑step AI system—AI‑assisted coding, knowledge‑base and MCP tool foundations, and AI Agent deployment—showcasing multiple production scenarios, case studies, and lessons learned.
Macro Background: Growing Ops Complexity
Cloud-native and microservice adoption, together with a multi-cloud topology, has steadily raised operational complexity. Sina Weibo now runs more than 3,000 services, serves over 250 million daily active users, handles request volume at the hundred-billion level, and tracks tens of millions of monitoring items, all of which make traditional operations increasingly difficult.
Traditional AIOps Construction
To address the pain points, Sina Weibo first unified six major data categories (metrics, alerts, logs, etc.) and performed data cleaning (“garbage in, garbage out”). The unified data were stored in real‑time warehouses or streaming engines for downstream AI algorithms. The architecture consisted of four layers: data source (agents, APIs, queues), data processing (including SOP and experience rules), analysis (root‑cause, anomaly detection, capacity planning), and the top layer (alert center, dashboards, automation, self‑healing). This setup reduced MTTR, improved alert accuracy, and cut alarm storms, but still suffered from data silos, loss of expert knowledge, high human‑machine interaction cost, and weak model generalization.
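The data-unification step can be illustrated with a minimal sketch. It assumes every raw record carries a `service` name and a `ts` timestamp; the `OpsEvent` schema, the field names, and the cleaning rules are hypothetical and are not taken from Sina Weibo's implementation.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class OpsEvent:
    """Hypothetical unified record shared by all data categories."""
    category: str      # e.g. "metric", "alert", "log"
    service: str       # normalized service name
    timestamp: float   # Unix seconds
    payload: dict


def normalize(category: str, raw: dict[str, Any]) -> Optional[OpsEvent]:
    """Clean one raw record; return None when it is unusable."""
    service = str(raw.get("service", "")).strip().lower()
    ts = raw.get("ts")
    valid_ts = isinstance(ts, (int, float)) and not isinstance(ts, bool) and ts > 0
    if not service or not valid_ts:
        return None  # "garbage in": drop it instead of passing it downstream
    payload = {k: v for k, v in raw.items() if k not in ("service", "ts")}
    return OpsEvent(category, service, float(ts), payload)


def unify(sources: dict[str, list[dict[str, Any]]]) -> list[OpsEvent]:
    """Merge all categories into one time-ordered stream of clean events."""
    events = []
    for category, rows in sources.items():
        for row in rows:
            event = normalize(category, row)
            if event is not None:
                events.append(event)
    return sorted(events, key=lambda e: e.timestamp)
```

Downstream analysis (anomaly detection, root-cause analysis) would then read a single event type instead of one format per data source.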
AI System Construction – Three Steps, Five Key Stages
Step 1: AI‑assisted programming – Automation of ops scripts, tool building, and system development raised code coverage of core modules to nearly 100%.
Step 2: Foundational capabilities – A comprehensive ops knowledge base was built and encapsulated into MCP tools and SKILL, providing a reusable foundation for higher‑level AI Agents.
Step 3: AI Agent construction – High‑frequency operational scenarios were gradually transformed into AI Agents, forming a matrix of tools that dramatically improve efficiency.
Knowledge‑Base Foundations
The knowledge base comprises four subsystems: fault‑case repository, SOP manual library, business‑logic knowledge, and expert‑experience repository. It lowers communication and learning costs, supports cross‑coverage and rotation, and mitigates single‑point risks.
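As a rough illustration of how the four subsystems could sit behind one lookup interface, the sketch below stores entries in memory and ranks them by keyword overlap. The class names, subsystem identifiers, and scoring rule are assumptions made for this example; a production knowledge base would more likely use full-text or vector search.

```python
from dataclasses import dataclass, field

# Hypothetical identifiers for the four subsystems named above.
SUBSYSTEMS = ("fault_case", "sop", "business_logic", "expert_experience")


@dataclass
class Entry:
    subsystem: str
    title: str
    body: str


@dataclass
class KnowledgeBase:
    entries: list = field(default_factory=list)

    def add(self, subsystem: str, title: str, body: str) -> None:
        if subsystem not in SUBSYSTEMS:
            raise ValueError(f"unknown subsystem: {subsystem}")
        self.entries.append(Entry(subsystem, title, body))

    def search(self, query: str, limit: int = 3) -> list:
        """Rank entries by how many query words appear in title or body."""
        terms = set(query.lower().split())
        scored = []
        for entry in self.entries:
            words = set(f"{entry.title} {entry.body}".lower().split())
            score = len(terms & words)
            if score:
                scored.append((score, entry))
        # The sort is stable, so ties keep their insertion order.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [entry for _, entry in scored[:limit]]
```

An on-call engineer covering an unfamiliar service, or an agent acting on their behalf, could call `search("redis latency")` and receive the matching fault cases and SOPs together.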
AI Agent Representative Scenarios
Full‑site hotspot event analysis
Root‑cause analysis (including sentiment analysis)
Interface anomaly detection
Site-scraping (刷站, abusive automated traffic) analysis
Client crash analysis
Detailed Scenario Practices
Case 1: Hotspot Response AI Agent
When a news hotspot spikes, traditional handling required manual judgment, leading to delayed scaling decisions. The AI Agent automatically collects data, aggregates trends, identifies the current hotspot, predicts traffic trends, pinpoints key time nodes, and finally generates LLM‑driven operation suggestions, shortening response time dramatically.
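The "identify the hotspot, then predict the trend" part of this pipeline can be sketched as follows. The article does not describe the agent's actual algorithms, so the growth-ratio threshold and the straight-line forecast here are stand-in techniques chosen for brevity, and all function names are hypothetical.

```python
from statistics import mean


def growth_ratio(series, window=3):
    """Mean of the last `window` points divided by the mean of the earlier ones."""
    if len(series) <= window:
        return 0.0
    baseline = mean(series[:-window])
    recent = mean(series[-window:])
    return recent / baseline if baseline > 0 else 0.0


def find_hotspot(topic_series, threshold=2.0):
    """Return the topic with the highest growth ratio at or above `threshold`."""
    best_topic, best_ratio = None, threshold
    for topic, series in topic_series.items():
        ratio = growth_ratio(series)
        if ratio >= best_ratio:
            best_topic, best_ratio = topic, ratio
    return best_topic


def forecast(series, steps=3):
    """Fit a least-squares line to the series and extend it `steps` points ahead."""
    n = len(series)
    if n < 2:
        raise ValueError("need at least two points to fit a line")
    x_mean = (n - 1) / 2
    y_mean = mean(series)
    denom = sum((x - x_mean) ** 2 for x in range(n))
    slope = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(series)) / denom
    return [y_mean + slope * (n + k - x_mean) for k in range(steps)]
```

The detected topic and the forecast values would then be placed into the prompt from which the LLM drafts its scaling suggestions.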
Case 2: Sentiment‑Driven Root‑Cause Analysis
Users report issues on Weibo itself (e.g., "Weibo crashed"). Previously, this scattered feedback was collated by hand, which slowed diagnosis. The AI Agent now pulls real-time monitoring data, performs multi-dimensional feature drilling, and produces a report covering a sentiment overview, affected modules, a complaint summary, metric verification, root-cause identification, and actionable suggestions. Analysis of past incidents showed that over 90% stem from changes, which makes change records a high-value input for AI-driven triage.
Case 3: Client Crash Analysis
Client-side crashes generate massive, fragmented logs, which makes diagnosis time-consuming. With the AI Agent workflow, processing time is significantly reduced; the original article illustrates this with screenshots of concrete examples.
Construction Experience Summary
The AIOps platform, built on the internal Wegent system, enables registration of MCP tools, creation of custom AI Agent products, and access to AI Agent APIs (e.g., crash analysis, traffic surge diagnosis, code review) without altering existing ops habits. Multiple agents can be orchestrated into complex workflows to boost efficiency.
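The register-then-orchestrate pattern can be sketched in a few lines. This is plain Python, not the Wegent API or an MCP SDK, neither of which is documented in the article; `ToolRegistry`, `run_workflow`, and the tool names in the usage note are hypothetical.

```python
class ToolRegistry:
    """Hypothetical in-process registry standing in for tool registration."""

    def __init__(self):
        self._tools = {}

    def register(self, name):
        """Decorator that records a function under a tool name."""
        def decorator(fn):
            if name in self._tools:
                raise ValueError(f"tool already registered: {name}")
            self._tools[name] = fn
            return fn
        return decorator

    def call(self, name, **kwargs):
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        return self._tools[name](**kwargs)


def run_workflow(registry, steps, context):
    """Run tools in order. Each tool receives the shared context and returns
    a dict of updates that is merged into it before the next step."""
    ctx = dict(context)
    for name in steps:
        ctx.update(registry.call(name, ctx=ctx))
    return ctx
```

For example, a crash-analysis workflow might chain a hypothetical `fetch_crash_logs` tool and a `summarize` tool by passing `["fetch_crash_logs", "summarize"]` as `steps`, with each step reading what the previous one wrote into the context.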
Key Learnings
Not every scenario requires a large model; lightweight AI or rule‑based automation can be more cost‑effective.
Transmit only necessary data to AI Agents to control processing costs.
Introduce human‑AI collaboration for destructive operations, adding confirmation steps for safety.
Employ small models for small-scale data labeling tasks to improve efficiency while saving resources.
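These learnings translate into a few small guards around an agent. The sketch below is illustrative only: the task kinds, the record-count cutoff, and the action names are assumptions, and the `confirm` callback stands in for whatever approval step a team already uses.

```python
# Hypothetical names for operations that must not run without approval.
DESTRUCTIVE_ACTIONS = {"restart_service", "delete_instance", "rollback_release"}


def choose_handler(task_kind, record_count):
    """Pick the cheapest handler that fits the task (illustrative cutoffs)."""
    if task_kind == "threshold_check":
        return "rule"            # no model needed
    if task_kind == "labeling" and record_count <= 10_000:
        return "small_model"     # small-scale labeling
    return "large_model"


def trim_payload(record, allowed_fields):
    """Send the agent only the fields it needs, to control processing cost."""
    return {key: record[key] for key in allowed_fields if key in record}


def execute(action, run, confirm):
    """Run an action, asking a human first when it is destructive."""
    if action in DESTRUCTIVE_ACTIONS and not confirm(action):
        return "skipped"
    run(action)
    return "done"
```

In practice `confirm` could post an approval request to a chat channel or ticket system and block until an operator responds.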
AI + AIOps Capability Layering
The overall capability is divided into three layers: (1) AI‑assisted development to accelerate ops tool creation; (2) Knowledge base + MCP tools + APIs as the foundational AI capability stack; (3) AI Agent decision support and AI‑enhanced automation.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
