
How Large‑Model AI Is Transforming Intelligent Operations (AIOps)

This article explores the latest concepts, planning roadmap, and practical applications of large‑model AI in intelligent operations, detailing AIOps use cases, system‑level automation, multi‑agent architectures, and how a dedicated platform accelerates deployment and efficiency across data‑center environments.

ByteDance SYS Tech

Intelligent Operations Frontier Insights

ByteDance’s STE architect Wang Feng presented at the Data Intelligence Conference, tracing how AI technologies matured from 2021 to 2023 and highlighting the rapid progress of three technologies key to intelligent operations: generative AI, which moved from the innovation‑trigger stage to the peak of inflated expectations; ModelOps, now at the boundary between the peak of inflated expectations and the trough of disillusionment; and autonomous systems, which entered the innovation‑trigger stage in 2023.

According to Gartner’s 2021 AIOps scenario definition and OpsRamp’s 2022 research, the top five prioritized AIOps scenarios are intelligent alerts, root‑cause analysis, anomaly detection, capacity optimization, and self‑healing. The rise of large models like ChatGPT has spurred the industry to combine AIOps with these models, a trend we refer to as “AIOps+”.

System Intelligent Operations Planning

System intelligent operations focuses on the infrastructure layer beneath applications, including the OS, kernel, drivers, and related software stacks. The goal is to deliver servers quickly while keeping them stable, which becomes harder as hardware requirements diversify and data‑center scale reaches millions of servers.

We propose a four‑layer roadmap: a data layer (monitoring metrics, alerts, events, change records, and knowledge bases), a platform capability layer (accumulating generic algorithm capabilities and scenario‑specific implementations), an algorithm‑scenario layer (e.g., panic cause classification, hardware fault prediction, temperature and power monitoring, anomaly detection, and change‑risk mitigation), and a quality‑efficiency‑cost value layer.

Large Model Agent Practice

We built a solution that integrates large‑model AI into intelligent operations via MLOps, LLMOps, and OpsPlatform, encapsulating platform capabilities as plugins for AI agents. Challenges include defining agent roles, coordinating multiple agents, and planning execution.
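One way to picture "platform capabilities as plugins" is a small registry that pairs each capability with a name and a natural‑language description an agent can read when choosing a tool. The sketch below is illustrative only: `PluginRegistry`, `query_metric`, and the stubbed metric data are hypothetical names invented for this example, not the platform's actual API.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Plugin:
    name: str
    description: str  # shown to the model so it can pick a tool
    handler: Callable[[dict], dict]


class PluginRegistry:
    """Hypothetical registry that exposes platform capabilities to agents."""

    def __init__(self) -> None:
        self._plugins: Dict[str, Plugin] = {}

    def register(self, name: str, description: str):
        """Decorator that wraps a plain function as a named plugin."""
        def decorator(fn: Callable[[dict], dict]):
            self._plugins[name] = Plugin(name, description, fn)
            return fn
        return decorator

    def describe(self) -> str:
        """Render the tool list that would be placed in an agent prompt."""
        return "\n".join(
            f"{p.name}: {p.description}" for p in self._plugins.values()
        )

    def call(self, name: str, args: dict) -> dict:
        """Dispatch a tool call requested by an agent."""
        if name not in self._plugins:
            raise KeyError(f"unknown plugin: {name}")
        return self._plugins[name].handler(args)


registry = PluginRegistry()


@registry.register("query_metric", "Return the latest value of a host metric.")
def query_metric(args: dict) -> dict:
    # Stubbed data standing in for a real monitoring backend (assumption).
    stub = {("host-1", "cpu_temp_c"): 71.5}
    value = stub.get((args["host"], args["metric"]))
    return {"host": args["host"], "metric": args["metric"], "value": value}


result = registry.call("query_metric", {"host": "host-1", "metric": "cpu_temp_c"})
print(registry.describe())
print(result)
```

Keeping the description next to the handler means the same registration serves both the prompt (what tools exist) and the runtime (how to call them).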

Single‑agent scenarios cover knowledge Q&A and tool usage, where large models bridge the gap between natural‑language queries and technical documentation, and enable OCR and text‑to‑SQL for database queries. Multi‑agent scenarios focus on fault diagnosis and operation enhancement, requiring expert‑level agents to collaborate, with a central coordinator assigning tasks such as anomaly analysis, fault isolation, and tool interaction.
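The coordinator pattern described above can be sketched as a dispatcher that runs specialist agents in order and passes a shared context between them. This is a minimal sketch under stated assumptions: the three agents are plain functions rather than LLM‑backed experts, the plan is a fixed list rather than something a model produced, and all names (`Coordinator`, `anomaly_analysis`, `fault_isolation`, `tool_interaction`, the `tolerance` field) are hypothetical.

```python
from statistics import median
from typing import Callable, Dict, List

# An agent takes the shared context and returns the fields it contributes.
Agent = Callable[[dict], dict]


def anomaly_analysis(ctx: dict) -> dict:
    """Flag samples that deviate from the median by more than a tolerance."""
    mid = median(ctx["samples"])
    tolerance = ctx.get("tolerance", 10)
    return {"anomalies": [v for v in ctx["samples"] if abs(v - mid) > tolerance]}


def fault_isolation(ctx: dict) -> dict:
    """Name the host as the suspect if any anomaly was found."""
    return {"suspect": ctx["host"] if ctx.get("anomalies") else None}


def tool_interaction(ctx: dict) -> dict:
    """Decide on a follow-up action (a string here, a tool call in practice)."""
    if ctx.get("suspect"):
        return {"action": f"open ticket for {ctx['suspect']}"}
    return {"action": "no action"}


class Coordinator:
    """Hypothetical central coordinator that assigns tasks to agents."""

    def __init__(self, agents: Dict[str, Agent]) -> None:
        self.agents = agents

    def run(self, plan: List[str], ctx: dict) -> dict:
        ctx = dict(ctx)  # do not mutate the caller's context
        for step in plan:
            ctx.update(self.agents[step](ctx))
        return ctx


coordinator = Coordinator({
    "anomaly_analysis": anomaly_analysis,
    "fault_isolation": fault_isolation,
    "tool_interaction": tool_interaction,
})

outcome = coordinator.run(
    ["anomaly_analysis", "fault_isolation", "tool_interaction"],
    {"host": "host-1", "samples": [50, 52, 51, 95]},
)
print(outcome["action"])
```

In a real deployment each agent would carry its own prompt and tools; the shared context is what lets a later agent build on an earlier agent's findings.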

The intelligent Q&A workflow runs through six stages: the user query, LLM query rewriting, plugin processing, vector retrieval, prompt generation, and iterative refinement. It also supports multimodal inputs such as screenshots. Fault diagnosis proceeds through information extraction, planning, and execution, and uses Retrieval‑Augmented Generation (RAG) to draw on historical failure knowledge while keeping token usage down.
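The retrieval and prompt‑generation steps, together with the token‑saving idea behind RAG, can be sketched as follows. This is a simplified illustration, not the system's implementation: bag‑of‑words cosine similarity stands in for a real embedding model and vector store, whitespace word counts stand in for a real tokenizer, and the history entries and function names (`embed`, `retrieve`, `build_prompt`) are made up for the example.

```python
import math
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy embedding: word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, cases: List[str], token_budget: int) -> List[str]:
    """Return the most similar cases that fit within the token budget."""
    q = embed(query)
    scored = sorted(
        ((cosine(q, embed(c)), c) for c in cases),
        key=lambda pair: pair[0],
        reverse=True,
    )
    selected, used = [], 0
    for score, case in scored:
        if score == 0.0:
            break  # remaining cases share no terms with the query
        cost = len(case.split())  # crude proxy for a token count
        if used + cost > token_budget:
            continue  # skip cases that would exceed the budget
        selected.append(case)
        used += cost
    return selected


def build_prompt(query: str, context: List[str]) -> str:
    """Assemble the prompt from retrieved cases and the user's question."""
    lines = ["Historical cases:"] + [f"- {c}" for c in context]
    lines += ["", f"Question: {query}"]
    return "\n".join(lines)


HISTORY = [
    "kernel panic caused by faulty memory module on host",
    "disk latency spike traced to failing ssd firmware",
    "fan failure led to cpu temperature alarm",
]

question = "kernel panic on host after memory errors"
context = retrieve(question, HISTORY, token_budget=12)
print(build_prompt(question, context))
```

Only the cases relevant to the question reach the prompt, so the model sees historical knowledge without paying tokens for the whole knowledge base.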

AIOps Platform Boosts Efficiency

The platform provides six core capabilities: data management, scenario cataloging, atomic algorithm services, large‑model plugin management, agent design/debug, and algorithm service monitoring. It enables rapid model deployment—reducing release cycles from weeks to days—and supports configuration‑as‑code, cross‑department algorithm sharing, and visualized, continuously operated services.
