
Sina Weibo’s AI Agent Ops: Three Steps, Five Stages, Multi‑Scenario Practice

The article details how Sina Weibo tackles rising operational complexity by evolving its traditional AIOps into a three‑step AI system (AI‑assisted coding, knowledge‑base and MCP tool foundations, and AI Agent deployment), and presents multiple production scenarios, case studies, and lessons learned.

dbaplus Community

Macro Background: Growing Ops Complexity

Cloud‑native and micro‑service adoption, together with a multi‑cloud topology, have pushed Sina Weibo’s service count beyond 3,000, daily active users above 250 million, request volume to the hundred‑billion level, and monitoring items into the tens of millions, making traditional operations increasingly difficult.

Traditional AIOps Construction

To address the pain points, Sina Weibo first unified six major data categories (metrics, alerts, logs, etc.) and performed data cleaning (“garbage in, garbage out”). The unified data were stored in real‑time warehouses or streaming engines for downstream AI algorithms. The architecture consisted of four layers: data source (agents, APIs, queues), data processing (including SOP and experience rules), analysis (root‑cause, anomaly detection, capacity planning), and the top layer (alert center, dashboards, automation, self‑healing). This setup reduced MTTR, improved alert accuracy, and cut alarm storms, but still suffered from data silos, loss of expert knowledge, high human‑machine interaction cost, and weak model generalization.
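The data unification and cleaning step can be illustrated with a minimal sketch. The `OpsRecord` schema, the field names, and the validation rules below are assumptions made for illustration; the article does not describe Weibo's actual schema, and it names only three of the six data categories.

```python
from dataclasses import dataclass, field
from typing import Any, Optional

# Only three of the six categories are named in the article; the rest are unknown.
ALLOWED_CATEGORIES = {"metric", "alert", "log"}
_CORE_FIELDS = {"category", "service", "timestamp"}


@dataclass(frozen=True)
class OpsRecord:
    """Hypothetical unified record shared by all data categories."""
    category: str
    service: str
    timestamp: float  # Unix seconds
    payload: dict = field(default_factory=dict)


def clean(raw: dict) -> Optional[OpsRecord]:
    """Return a normalized record, or None when the input is unusable."""
    category = str(raw.get("category", "")).strip().lower()
    service = str(raw.get("service", "")).strip()
    ts: Any = raw.get("timestamp")
    if category not in ALLOWED_CATEGORIES or not service:
        return None
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(ts, bool) or not isinstance(ts, (int, float)) or ts <= 0:
        return None
    payload = {k: v for k, v in raw.items() if k not in _CORE_FIELDS}
    return OpsRecord(category, service, float(ts), payload)
```

Dropping unusable records at this boundary is one way to act on "garbage in, garbage out": downstream algorithms only ever see records that passed the same checks.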

AI System Construction – Three Steps, Five Key Stages

Step 1: AI‑assisted programming – Automation of ops scripts, tool building, and system development raised code coverage of core modules to nearly 100%.

Step 2: Foundational capabilities – A comprehensive ops knowledge base was built and encapsulated into MCP tools and SKILL, providing a reusable foundation for higher‑level AI Agents.
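One way to picture "encapsulating knowledge into tools" is a small registry that exposes named, described functions to an agent. The sketch below is plain Python and deliberately does not use any real MCP SDK; `ToolRegistry`, `lookup_sop`, and the sample SOP content are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Tool:
    name: str
    description: str
    handler: Callable[..., Any]


class ToolRegistry:
    """Hypothetical in-process registry; a stand-in for a real tool server."""

    def __init__(self) -> None:
        self._tools: dict = {}

    def register(self, name: str, description: str) -> Callable:
        def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
            if name in self._tools:
                raise ValueError(f"tool already registered: {name}")
            self._tools[name] = Tool(name, description, fn)
            return fn
        return decorator

    def describe(self) -> list:
        """What an agent would see when it lists available tools."""
        return [{"name": t.name, "description": t.description}
                for t in self._tools.values()]

    def call(self, name: str, **kwargs: Any) -> Any:
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        return self._tools[name].handler(**kwargs)


registry = ToolRegistry()

# Made-up SOP content, used only to give the tool something to return.
SOP_LIBRARY = {
    "disk_full": ["Check largest directories", "Rotate logs", "Expand volume if needed"],
}


@registry.register("lookup_sop", "Return the SOP steps for a known fault type.")
def lookup_sop(fault_type: str) -> list:
    return SOP_LIBRARY.get(fault_type, [])
```

An agent would first read `registry.describe()` to decide which tool fits, then invoke it with `registry.call("lookup_sop", fault_type="disk_full")`.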

Step 3: AI Agent construction – High‑frequency operational scenarios were gradually transformed into AI Agents, forming a matrix of tools that dramatically improve efficiency.

Knowledge‑Base Foundations

The knowledge base comprises four subsystems: fault‑case repository, SOP manual library, business‑logic knowledge, and expert‑experience repository. It lowers communication and learning costs, supports cross‑coverage and rotation, and mitigates single‑point risks.
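As a rough sketch of how the four subsystems could sit behind one lookup, the example below stores entries tagged by subsystem and ranks them by keyword overlap. The class names and the scoring are assumptions; the article does not say how retrieval works, and a production system would more likely use full‑text or vector search.

```python
from dataclasses import dataclass

# The four subsystems named in the article, as hypothetical identifiers.
SUBSYSTEMS = ("fault_case", "sop", "business_logic", "expert_experience")


@dataclass(frozen=True)
class Entry:
    subsystem: str
    title: str
    body: str


class KnowledgeBase:
    def __init__(self) -> None:
        self._entries: list = []

    def add(self, subsystem: str, title: str, body: str) -> None:
        if subsystem not in SUBSYSTEMS:
            raise ValueError(f"unknown subsystem: {subsystem}")
        self._entries.append(Entry(subsystem, title, body))

    def search(self, query: str, limit: int = 3) -> list:
        """Rank entries by how many query words they contain (naive overlap)."""
        terms = set(query.lower().split())
        scored = []
        for entry in self._entries:
            words = set(f"{entry.title} {entry.body}".lower().split())
            score = len(terms & words)
            if score:
                scored.append((score, entry))
        # sort() is stable, so equal scores keep insertion order.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [entry for _, entry in scored[:limit]]
```

Keeping all four subsystems searchable through one interface is what lets an on‑call engineer outside the owning team find a past fault case and its SOP in a single query.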

AI Agent Representative Scenarios

Full‑site hotspot event analysis

Root‑cause analysis (including sentiment analysis)

Interface anomaly detection

Site‑scraping traffic (刷站) analysis

Client crash analysis

Detailed Scenario Practices

Case 1: Hotspot Response AI Agent

When a news hotspot spikes, traditional handling required manual judgment, leading to delayed scaling decisions. The AI Agent automatically collects data, aggregates trends, identifies the current hotspot, predicts traffic trends, pinpoints key time nodes, and finally generates LLM‑driven operation suggestions, shortening response time dramatically.
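The pipeline described above (aggregate trends, identify the hotspot, predict traffic, find the key time node, suggest an action) can be sketched in a few lines. Everything here is an assumption for illustration: the per‑minute series format, the linear extrapolation, and the template suggestion, which stands in for the LLM‑generated advice the article mentions.

```python
from typing import Optional


def growth_per_minute(series: list) -> float:
    """Average per-minute change across the window (simple linear trend)."""
    if len(series) < 2:
        return 0.0
    return (series[-1] - series[0]) / (len(series) - 1)


def analyze_hotspot(traffic: dict, capacity: float, horizon: int = 10) -> dict:
    """traffic maps a topic to its recent per-minute request counts."""
    if not traffic:
        raise ValueError("no traffic data")
    # Identify the hotspot as the topic with the steepest growth.
    topic = max(traffic, key=lambda t: growth_per_minute(traffic[t]))
    series = traffic[topic]
    slope = growth_per_minute(series)
    # Predict the next `horizon` minutes by linear extrapolation.
    forecast = [series[-1] + slope * m for m in range(1, horizon + 1)]
    # Key time node: first forecast minute at or above capacity.
    minutes_to_capacity: Optional[int] = next(
        (m for m, v in enumerate(forecast, start=1) if v >= capacity), None
    )
    if minutes_to_capacity is None:
        suggestion = f"{topic}: no scaling needed within {horizon} minutes"
    else:
        suggestion = f"{topic}: scale out within {minutes_to_capacity} minutes"
    return {
        "topic": topic,
        "slope": slope,
        "minutes_to_capacity": minutes_to_capacity,
        "suggestion": suggestion,
    }
```

A linear trend is the simplest possible forecaster; real hotspot traffic tends to grow non‑linearly, so this only shows where a prediction step sits in the flow.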

Case 2: Sentiment‑Driven Root‑Cause Analysis

Users report issues on Weibo itself (e.g., “Weibo crashed”). Previously, this scattered feedback was collated by hand, which slowed diagnosis. The AI Agent now pulls real‑time monitoring data, performs multi‑dimensional feature drilling, and produces a report covering sentiment overview, affected modules, complaint summary, metric verification, root‑cause identification, and actionable suggestions. Analysis revealed that over 90% of incidents stem from changes, which makes recent changes a natural first place for triage to look.
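Two of the steps in this case lend themselves to a short sketch: drilling into error dimensions to find where failures concentrate, and listing changes that landed shortly before the anomaly. The record layout, the dimension names, and the 30‑minute window are assumptions, not details from the article.

```python
from collections import Counter


def drill_down(errors: list, dimensions: list) -> tuple:
    """Return (dimension, value, share) for the value holding the largest error share."""
    best = ("", "", 0.0)
    for dim in dimensions:
        counts = Counter(e[dim] for e in errors if dim in e)
        total = sum(counts.values())
        if total == 0:
            continue
        value, n = counts.most_common(1)[0]
        share = n / total
        if share > best[2]:
            best = (dim, value, share)
    return best


def recent_changes(changes: list, service: str, anomaly_start: float,
                   window: float = 1800.0) -> list:
    """Changes to `service` made at most `window` seconds before the anomaly."""
    return [
        c for c in changes
        if c["service"] == service and 0 <= anomaly_start - c["time"] <= window
    ]
```

For example, if `drill_down` reports that most errors concentrate in one module, `recent_changes` for that module narrows the candidate root causes to a handful of deployments or configuration changes.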

Case 3: Client Crash Analysis

Client‑side crashes generate massive, fragmented logs, making diagnosis time‑consuming. Using the AI Agent workflow, processing time is significantly reduced, as demonstrated by concrete examples in the accompanying screenshots.
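A common preprocessing step for fragmented crash logs is to group them by a normalized stack signature before any model looks at them. The sketch below illustrates that general technique; the article does not describe Weibo's actual workflow, and the stack format and normalization rules are assumptions.

```python
import re
from collections import defaultdict

_ADDR = re.compile(r"0x[0-9a-fA-F]+")  # memory addresses differ per crash
_LINE = re.compile(r":\d+")            # line numbers drift between builds


def signature(stack: str, top_frames: int = 3) -> str:
    """Normalize the top frames so equivalent crashes share one signature."""
    frames = [ln.strip() for ln in stack.splitlines() if ln.strip()]
    normalized = [_LINE.sub("", _ADDR.sub("<addr>", f)) for f in frames[:top_frames]]
    return " | ".join(normalized)


def cluster(crashes: list) -> list:
    """Return (signature, count) pairs, most frequent first."""
    groups = defaultdict(int)
    for stack in crashes:
        groups[signature(stack)] += 1
    return sorted(groups.items(), key=lambda kv: kv[1], reverse=True)
```

Clustering first means an agent analyzes one representative per group instead of every raw log, which also keeps the amount of data sent to a model small.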

Construction Experience Summary

The AIOps platform, built on the internal Wegent system, enables registration of MCP tools, creation of custom AI Agent products, and access to AI Agent APIs (e.g., crash analysis, traffic surge diagnosis, code review) without altering existing ops habits. Multiple agents can be orchestrated into complex workflows to boost efficiency.
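Orchestrating several agents into a workflow can be pictured as passing shared state through a sequence of steps. This is a generic sketch, not the Wegent API, which the article does not show; `run_workflow` and both stub agents are hypothetical.

```python
from typing import Callable

# An agent reads the shared state and returns the keys it wants to add or update.
Agent = Callable[[dict], dict]


def run_workflow(agents: list, context: dict) -> dict:
    """Run (name, agent) pairs in order, merging each result into the state."""
    state = dict(context)  # copy so the caller's dict is not mutated
    trace = []
    for name, agent in agents:
        state.update(agent(state))
        trace.append(name)
    state["trace"] = trace
    return state


def surge_diagnosis(state: dict) -> dict:
    """Stub agent: flags a surge when QPS exceeds twice the baseline."""
    return {"surge": state["qps"] > 2 * state["baseline_qps"]}


def scaling_advice(state: dict) -> dict:
    """Stub agent: turns the diagnosis into a recommendation."""
    return {"advice": "scale out" if state["surge"] else "no action"}
```

For instance, `run_workflow([("diagnose", surge_diagnosis), ("advise", scaling_advice)], {"qps": 900, "baseline_qps": 300})` yields a state with `surge` set to `True` and `advice` set to `"scale out"`.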

Key Learnings

Not every scenario requires a large model; lightweight AI or rule‑based automation can be more cost‑effective.

Transmit only necessary data to AI Agents to control processing costs.

Introduce human‑AI collaboration for destructive operations, adding confirmation steps for safety.

Employ small models for low‑scale data labeling tasks to improve efficiency while saving resources.
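Three of these lessons (try rules before a model, send only the necessary fields, and confirm destructive operations) can be combined in one small sketch. The action names, the disk‑usage rule, and the `ask_model` and `confirm` callbacks are all hypothetical placeholders.

```python
from typing import Callable, Optional

# Hypothetical set of actions that must never run without human approval.
DESTRUCTIVE = {"restart_service", "delete_data", "rollback_release"}


def rule_based_answer(alert: dict) -> Optional[str]:
    """Cheap rule first: some alerts need no model at all."""
    if alert.get("metric") == "disk_usage" and alert.get("value", 0) >= 90:
        return "rotate_logs"
    return None


def trim_payload(alert: dict, keep: tuple = ("service", "metric", "value")) -> dict:
    """Forward only the fields the model needs, to control processing cost."""
    return {k: alert[k] for k in keep if k in alert}


def decide(alert: dict, ask_model: Callable[[dict], str]) -> str:
    action = rule_based_answer(alert)
    if action is not None:
        return action
    return ask_model(trim_payload(alert))


def execute(action: str, confirm: Callable[[str], bool]) -> str:
    """Destructive actions run only after an explicit confirmation."""
    if action in DESTRUCTIVE and not confirm(action):
        return f"skipped {action}: not confirmed"
    return f"executed {action}"
```

In practice `confirm` would prompt an operator (for example through a chat approval), while `ask_model` would wrap whatever model endpoint the platform uses.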

AI + AIOps Capability Layering

The overall capability is divided into three layers: (1) AI‑assisted development to accelerate ops tool creation; (2) Knowledge base + MCP tools + APIs as the foundational AI capability stack; (3) AI Agent decision support and AI‑enhanced automation.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
