
Intelligent Operations (AIOps) Insights, Planning, and Large‑Model Agent Practices at ByteDance

The article summarizes ByteDance's intelligent operations (AIOps) strategy, covering frontier concepts, a five‑level automation roadmap, large‑model applications for fault diagnosis and smart Q&A, and a comprehensive AIOps platform that accelerates algorithm deployment, improves efficiency, and reduces operational costs.

DataFunSummit

This article is compiled from a presentation by ByteDance intelligent operations architect Wang Feng at the Data Intelligence Conference. It covers frontier concepts in intelligent operations, planning paths, large‑model applications, and AIOps platform practice. The talk highlighted large‑model use in intelligent Q&A and fault‑diagnosis scenarios, and showed how the AIOps platform enables rapid algorithm iteration and deployment, improving efficiency and practical capability across operations scenarios.

Author: ByteDance STE Team & DataFun community volunteer Cheng Siqi

Intelligent Operations Frontier Insights

Intelligent operations, a typical combination of AI technology and operations services, has made significant progress in recent years. The technology maturity (hype cycle) curves from 2021 to 2023 show three key technologies advancing: generative AI has moved from the innovation trigger to the peak of inflated expectations; ModelOps sits at the boundary between the peak of inflated expectations and the trough of disillusionment; and autonomous systems entered the innovation‑trigger phase in 2023.

Most intelligent‑operations work follows the application scenarios that Gartner defined for AIOps in 2021. A 2022 OpsRamp research report listed the top five prioritized AIOps scenarios: intelligent alerts, root‑cause analysis, anomaly detection, capacity optimization, and self‑healing. The rise of ChatGPT spurred industry interest in large models, and many vendors now offer "AIOps+" solutions that combine AIOps with large‑model capabilities.

Professor Joseph Sifakis, winner of the 2007 Turing Award, introduced the concept of an autonomous system built from perception, experience knowledge, decision, and action, much like autonomous driving. With large‑model support, intelligent operations can now explore "autonomous driving for software systems," in which a system heals itself after failures based on accumulated experience.

Operations maturity can be divided into five levels: basic scripting, tool‑based operations, platform‑based operations, digital operations, and fully automated intelligent operations. Each level builds on the previous one, and standardization is the cornerstone: without it, no tool, platform, or AI technique can fully resolve operational pain points. Large‑model breakthroughs provide strong momentum toward fifth‑level AIOps.

System Intelligent Operations Planning

System intelligent operations focuses on the infrastructure layer beneath applications (OS, kernel, drivers, etc.). The goal is to deliver servers quickly while keeping them stable. Challenges include diverse hardware demands, massive scale (millions of servers), and many coexisting software versions, all of which require strict operational controls and cross‑stack coordination.

Our plan delivers three kinds of business value (quality assurance, efficiency improvement, and cost optimization) across four layers, including:

Data Layer: collects monitoring metrics, alarm records, incident tickets, and change logs, and builds a knowledge base for training intelligent agents.

Platform Capability Layer: accumulates foundational algorithm capabilities and delivers scenario‑specific solutions to speed up implementation.

Algorithm Scenario Layer: implements use cases such as panic‑cause classification for OS crashes, hardware failure prediction (disks, memory, optical modules), temperature and water‑cooling monitoring, routine anomaly detection, and change‑risk interception.

Memory fault prediction targets Uncorrectable Errors (UEs), which can crash a server. By predicting UEs from the Correctable Errors (CEs) that precede them and applying soft‑repair and mitigation techniques (online page offlining, ticket‑driven hardware repair, live VM migration, or Post Package Repair (PPR)), ByteDance reduced memory‑fault frequency by about 60%.
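To make the CE‑to‑UE idea concrete, here is a minimal sketch of a rule‑based risk score. It is not ByteDance's model: the `CEEvent` fields, the saturation points (100 CEs, 10 rows), the equal weighting, and the mitigation thresholds are all hypothetical values chosen for illustration. A production predictor would more plausibly be a trained model over richer features.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CEEvent:
    timestamp: float  # seconds since epoch
    dimm: str         # DIMM identifier (hypothetical naming)
    row: int          # DRAM row where the correctable error occurred


def ue_risk_score(events: List[CEEvent], now: float,
                  window_s: float = 86400.0) -> float:
    """Toy UE risk score in [0, 1] from CE volume and row spread in a window."""
    recent = [e for e in events if now - e.timestamp <= window_s]
    if not recent:
        return 0.0
    # Assumption: risk saturates at 100 CEs and at 10 distinct rows.
    count_term = min(len(recent) / 100.0, 1.0)
    spread_term = min(len({e.row for e in recent}) / 10.0, 1.0)
    return 0.5 * count_term + 0.5 * spread_term


def choose_mitigation(score: float) -> str:
    """Map a risk score to one of the mitigations named in the text.

    The thresholds are illustrative assumptions, not published values.
    """
    if score < 0.2:
        return "monitor"
    if score < 0.5:
        return "page_offline"
    if score < 0.8:
        return "post_package_repair"
    return "migrate_vm_and_open_repair_ticket"


if __name__ == "__main__":
    events = [CEEvent(1000.0 + i, "DIMM_A1", i % 5) for i in range(50)]
    score = ue_risk_score(events, now=2000.0)
    print(score, choose_mitigation(score))
```

Escalating from cheap, non‑disruptive actions (page offlining) to disruptive ones (migration and hardware repair) as the score rises reflects the general trade‑off between repair cost and crash risk.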

Large‑Model Agent Practice

We built a large‑model intelligent‑operations solution on top of MLOps, LLMOps, and OpsPlatform, using AI agents. MLOps accelerates algorithm deployment, OpsPlatform provides operational tools, and LLMOps supplies large‑model capabilities. Plugins encapsulate platform abilities for development, deployment, and debugging, supporting both single‑agent and multi‑agent scenarios.

Key challenges for agents are role definition, multi‑agent coordination, and planning. Agents aim to achieve expert‑level performance in specific domains, avoid over‑generalization, and integrate tightly with knowledge bases and tasks.

LLM Agent use cases split into single‑agent (knowledge consulting and tool usage) and multi‑agent (fault diagnosis and operation enhancement). Single‑agent knowledge consulting replaces keyword‑based document search with semantic reasoning; tool usage enables natural‑language interaction with operational tools. Multi‑agent fault diagnosis leverages several expert agents coordinated by a master agent, while operation enhancement focuses on AI‑driven workflow orchestration and verification.

Intelligent Q&A follows six steps: user query, model‑driven query rewriting, multi‑plugin processing, prompt generation, vector‑store retrieval, and iterative refinement. Plugins such as OCR and text‑to‑SQL handle inputs that plain text retrieval cannot. Knowledge is organized from small to large units, guided by semantics, and the service is delivered through Feishu bots, product pages, and floating windows.
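The rewriting, retrieval, and prompt‑generation steps can be sketched in a few lines. This is a toy under stated assumptions: a bag‑of‑words counter stands in for a real embedding model, an in‑memory dict stands in for the vector store, the `ALIASES` table is a hypothetical stand‑in for model‑driven rewriting, and no LLM is called.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical alias table standing in for model-driven query rewriting.
ALIASES = {"oom": "out of memory"}


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def rewrite_query(query: str) -> str:
    """Normalize the query and expand known abbreviations."""
    return " ".join(ALIASES.get(w, w) for w in tokenize(query))


def embed(text: str) -> Counter:
    """Bag-of-words 'embedding'; a real system would use a learned model."""
    return Counter(tokenize(text))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query: str, docs: Dict[str, str], k: int = 2) -> List[Tuple[str, float]]:
    """Return the k most similar documents as (doc_id, score) pairs."""
    q = embed(query)
    scored = [(doc_id, cosine(q, embed(text))) for doc_id, text in docs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def build_prompt(question: str, docs: Dict[str, str],
                 hits: List[Tuple[str, float]]) -> str:
    """Assemble retrieved passages and the question into one prompt."""
    context = "\n".join(f"[{doc_id}] {docs[doc_id]}" for doc_id, _ in hits)
    return f"Answer using only the context.\n{context}\nQuestion: {question}"


if __name__ == "__main__":
    kb = {
        "kb1": "Out of memory kills are logged by the kernel",
        "kb2": "Disk replacement procedure for failed drives",
    }
    q = rewrite_query("OOM kill?")
    print(build_prompt(q, kb, retrieve(q, kb, k=1)))
```

Iterative refinement would wrap this in a loop that rewrites the query again when the retrieved context scores poorly.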

Fault diagnosis involves information extraction, planning, and execution. Planning uses experience knowledge bases and tool plugins to gather topology and configuration data, then applies RAG to leverage historical incidents. Token optimization ensures concise prompts.
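One simple form of token optimization is to pack retrieved incidents into the prompt in relevance order until a budget is reached. The sketch below assumes whitespace word counts as a rough proxy for tokens and a greedy packing policy; the talk does not specify the actual tokenizer or strategy.

```python
from typing import List, Tuple


def approx_tokens(text: str) -> int:
    """Rough token estimate; a real system would use the model's tokenizer."""
    return len(text.split())


def pack_context(incidents: List[Tuple[float, str]], budget: int) -> List[str]:
    """Greedily keep the most relevant incident summaries within the budget.

    `incidents` holds (relevance_score, summary) pairs. Summaries that would
    overflow the remaining budget are skipped so shorter ones can still fit.
    """
    chosen: List[str] = []
    used = 0
    for _, summary in sorted(incidents, key=lambda pair: pair[0], reverse=True):
        cost = approx_tokens(summary)
        if used + cost <= budget:
            chosen.append(summary)
            used += cost
    return chosen


if __name__ == "__main__":
    history = [
        (0.9, "disk full on host a"),
        (0.5, "nic flapping after kernel upgrade on rack b"),
        (0.7, "oom kill"),
    ]
    print(pack_context(history, budget=8))
```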

In multi‑agent mode, a master agent orchestrates sub‑agents for anomaly analysis, fault diagnosis, and tool handling, with a summarizer agent aggregating results. Knowledge graphs are considered to enhance agent reasoning.
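The master/sub‑agent/summarizer arrangement can be sketched with plain functions. Everything here is hypothetical: the agents are keyword rules rather than LLM calls, and the master simply fans the alert out to every registered sub‑agent in order instead of planning which ones to invoke.

```python
from typing import Callable, Dict, List

# An agent takes an alert description and returns a finding.
Agent = Callable[[str], str]


def anomaly_agent(alert: str) -> str:
    """Stand-in for an anomaly-analysis agent."""
    if "latency" in alert:
        return "anomaly: latency spike"
    return "anomaly: none found"


def diagnosis_agent(alert: str) -> str:
    """Stand-in for a fault-diagnosis agent."""
    if "deploy" in alert:
        return "diagnosis: suspect recent change"
    return "diagnosis: no suspect"


def summarizer(findings: List[str]) -> str:
    """Aggregate sub-agent findings into one report line."""
    return "; ".join(findings)


def master_agent(alert: str, sub_agents: Dict[str, Agent]) -> str:
    """Dispatch the alert to each sub-agent, then summarize the results."""
    findings = [agent(alert) for agent in sub_agents.values()]
    return summarizer(findings)


if __name__ == "__main__":
    agents = {"anomaly": anomaly_agent, "diagnosis": diagnosis_agent}
    print(master_agent("latency up after deploy", agents))
```

Keeping each sub‑agent behind the same narrow interface is what lets the master swap or add experts without changing the orchestration code.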

Using ReAct, agents invoke diagnostic tools and summarize findings; this capability will be continuously iterated.
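A ReAct loop alternates between a reasoning step (thought), a tool call (action), and the tool's result (observation) until the agent decides to finish. The sketch below replaces the LLM with a scripted policy so it runs offline; `check_dmesg`, the host name, and the log message are invented for illustration.

```python
from typing import Callable, Dict, List, Tuple

Tool = Callable[[str], str]
# A policy maps the transcript so far to (thought, action, argument).
Policy = Callable[[List[str]], Tuple[str, str, str]]


def check_dmesg(host: str) -> str:
    """Hypothetical diagnostic tool returning a canned kernel-log finding."""
    return f"{host}: EDAC reports uncorrectable memory error"


TOOLS: Dict[str, Tool] = {"check_dmesg": check_dmesg}


def scripted_policy(history: List[str]) -> Tuple[str, str, str]:
    """Stand-in for the LLM: call a tool once, then finish."""
    if not any(line.startswith("Observation:") for line in history):
        return ("I need kernel logs.", "check_dmesg", "host-1")
    return ("The logs explain the crash.", "finish", "memory UE on host-1")


def react(question: str, policy: Policy, tools: Dict[str, Tool],
          max_steps: int = 5) -> Tuple[str, List[str]]:
    """Run thought/action/observation steps until 'finish' or the step limit."""
    history = [f"Question: {question}"]
    for _ in range(max_steps):
        thought, action, arg = policy(history)
        history.append(f"Thought: {thought}")
        if action == "finish":
            return arg, history
        if action in tools:
            observation = tools[action](arg)
        else:
            observation = f"unknown tool {action}"
        history.append(f"Action: {action}[{arg}]")
        history.append(f"Observation: {observation}")
    return "no conclusion within step limit", history


if __name__ == "__main__":
    answer, trace = react("Why did host-1 crash?", scripted_policy, TOOLS)
    print("\n".join(trace))
    print("Answer:", answer)
```

The step limit matters in practice: without it, a policy that never emits `finish` would loop indefinitely and keep invoking tools.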

AIOps Platform Boosts Efficiency

An AIOps platform, as defined by Gartner in 2021, matters for several reasons:

AIOps platforms support multiple algorithm scenarios with common workflows.

Business users demand rapid rollout and real‑time effect visualization.

Deployed solutions need visualization and continuous operation.

With only a limited number of algorithm engineers, achieving ROI across many scenarios is difficult.

Algorithm scenarios must integrate with existing operations systems.

Although Gartner does not provide concrete implementation guidance, products like Dynatrace illustrate similar architectures.

The platform targets algorithm engineers, SREs, and other operations platforms, organized into three layers: data, algorithm service capability, and platform functions.

Core functionalities include data management, scenario solidification, atomic algorithm services, large‑model plugin management, agent design/debug, and algorithm monitoring.

To accelerate AI Agent development, the platform offers orchestration that combines tools and algorithms into plugins callable by agents. Users can quickly define, debug, and share agents via an Agent market, and compose multi‑agent fleets for complex scenarios.
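As a rough illustration of wrapping tools and algorithms as plugins that agents can call, the sketch below uses a decorator‑based registry. The names (`REGISTRY`, `plugin`, `AgentSpec`, `detect_spike`) are hypothetical and do not describe the platform's real API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Plugin:
    name: str
    description: str          # shown to the agent so it can choose a tool
    run: Callable[..., object]


# Hypothetical in-process registry; a platform would persist and version this.
REGISTRY: Dict[str, Plugin] = {}


def plugin(name: str, description: str):
    """Decorator that registers a function as a callable plugin."""
    def decorator(fn: Callable[..., object]) -> Callable[..., object]:
        REGISTRY[name] = Plugin(name, description, fn)
        return fn
    return decorator


@plugin("detect_spike", "Flag values greater than `factor` times the mean")
def detect_spike(values: List[float], factor: float = 2.0) -> List[int]:
    mean = sum(values) / len(values)
    return [i for i, v in enumerate(values) if v > factor * mean]


@dataclass
class AgentSpec:
    """An agent definition: a name plus the plugins it is allowed to call."""
    name: str
    plugins: List[str] = field(default_factory=list)

    def call(self, plugin_name: str, **kwargs):
        if plugin_name not in self.plugins:
            raise PermissionError(f"{self.name} may not call {plugin_name}")
        return REGISTRY[plugin_name].run(**kwargs)


if __name__ == "__main__":
    agent = AgentSpec("capacity-helper", ["detect_spike"])
    print(agent.call("detect_spike", values=[1, 1, 1, 9]))
```

Declaring each agent's allowed plugins explicitly is one way to make agents shareable in a market while keeping tool access auditable.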

With the AIOps platform, algorithm deployment cycles have shrunk from about a month to a week or even the same day, and some models support configuration‑as‑deployment, where changing configuration is enough to ship.

Additionally, the platform promotes cross‑department algorithm sharing, enabling multiple teams to co‑build and share an algorithm marketplace, thereby increasing resource utilization and trust in the platform.
