How ByteBrain’s AI‑Powered Infra is Redefining Cloud and Database Performance
ByteDance’s ByteBrain team showcases how large‑model AI, operations research, and system‑level innovations have produced award‑winning papers and billions of yuan in cost savings while improving on‑call efficiency, database estimation, and cloud infrastructure reliability.
In recent years, ByteDance’s infrastructure team has invested continuously in AI for Infra/Systems, using AI techniques to optimize cloud computing systems and achieving significant results.
Within the first four months of 2025, the ByteBrain team had 11 AI-for-Infra papers published or accepted at top conferences, including 10 CCF‑A papers (SIGMOD ×3, VLDB ×4, EuroSys, FSE, WWW) and one ICLR paper.
While academic papers are a byproduct, the primary industrial benefit is business impact. ByteBrain uses large language models (LLMs) to improve Volcano Engine stability, achieving a 26% gain in on‑call efficiency, and applies operations‑research algorithms that have cut system costs by over 1 billion CNY over three years. The team has also made progress in anomaly detection, root‑cause analysis, AI for DB, DB for AI, Text2SQL, and LLM multi‑agent applications. One example is applying pre‑trained language models to NDV (Number of Distinct Values) estimation, a first‑of‑its‑kind technique published at SIGMOD 2025 and now deployed in production.
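To make the NDV problem concrete: given only a small sample of a column, one must estimate how many distinct values the full column contains. The sketch below uses the classical sample-based GEE estimator as a baseline illustration, not ByteBrain's PLM4NDV method; the toy column and function names are assumptions for the example.

```python
import math
import random
from collections import Counter

def gee_ndv_estimate(sample, table_size):
    """Classical GEE estimator: scale the count of sample singletons
    by sqrt(N/n) and keep values seen more than once as-is."""
    n = len(sample)
    freqs = Counter(sample)                              # value -> frequency in sample
    f1 = sum(1 for c in freqs.values() if c == 1)        # values seen exactly once
    repeated = sum(1 for c in freqs.values() if c > 1)   # values seen 2+ times
    return math.sqrt(table_size / n) * f1 + repeated

# Toy column: 1,000 rows drawn from 100 possible distinct values
random.seed(0)
column = [random.randrange(100) for _ in range(1000)]
sample = random.sample(column, 100)                      # 10% sample
est = gee_ndv_estimate(sample, len(column))
exact = len(set(column))
print(f"exact NDV = {exact}, GEE estimate = {est:.1f}")
```

Sample-based estimators like this require scanning part of the data and degrade on skewed columns; PLM4NDV's motivation, per the title, is minimizing that data access by leveraging pre-trained language models.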
ByteDance is scaling large‑model technologies across cloud computing and IT infrastructure, actively sharing research outcomes with the open‑source community and top academic venues, and strengthening its position in the AI for Infra field.
About the ByteBrain team
ByteBrain is ByteDance’s AI for Infra service platform that uses AI—especially machine learning, large models, and operations‑research techniques—to automatically optimize the full lifecycle of infrastructure and systems. Optimization targets include databases, storage, big‑data systems, VMs, containers, networks, operations, and stability. The main research directions are AIOps, AI4DB, operations‑research optimization, and LLM‑for‑Infra, with functional modules such as capacity planning, resource scheduling, system tuning, anomaly detection, root‑cause analysis, slow‑SQL optimization, Text2SQL, and LLM‑agents.
The team is recruiting researchers and interns; contact: [email protected].
List of ByteBrain’s academic papers through April 2025 (* denotes corresponding author):
PLM4NDV: Minimizing Data Access for Number of Distinct Values Estimation with Pre‑trained Language Models. SIGMOD 2025. Xianghong Xu, Xiao He, Tieying Zhang*, Rui Shi, Lei Zhang, Jianjun Chen. https://arxiv.org/pdf/2504.00608
AdaNDV: Adaptive Number of Distinct Value Estimation via Learning to Select and Fuse Estimators. VLDB 2025. Xianghong Xu, Tieying Zhang*, Xiao He, Haoyang Li, Rong Kang, Shuai Wang, Linhui Xu, Zhimin Liang, Shangyu Luo, Lei Zhang, Jianjun Chen. https://arxiv.org/pdf/2502.16190
Adaptive and Efficient Log Parsing as a Cloud Service. SIGMOD 2025. Zeyan Li, Jie Song, Tieying Zhang*, Tao Yang, Yingjie Ye, Pengfei Duan, Jianjun Chen. https://www.arxiv.org/pdf/2504.09113
Data‑Agnostic Cardinality Learning from Imperfect Workloads. VLDB 2025. Peizhi Wu, Rong Kang, Tieying Zhang*, Jianjun Chen, Ryan Marcus, Zachary G. Ives
TickIt: Leveraging Large Language Models for Automated Ticket Escalation. FSE 2025. Fengrui Liu, Xiao He, Tieying Zhang*, Jianjun Chen, Yi Li, Lihua Yi, Haipeng Zhang, Gang Wu, Rui Shi. https://arxiv.org/pdf/2504.08475
ChatTS: Aligning Time Series with LLMs via Synthetic Data for Enhanced Understanding and Reasoning. VLDB 2025. Zhe Xie, Zeyan Li, Xiao He, Longlong Xu, Xidao Wen, Tieying Zhang*, Jianjun Chen, Rui Shi, Dan Pei*. https://arxiv.org/pdf/2412.03104
Flow‑of‑Action: SOP Enhanced LLM‑Based Multi‑Agent System for Root Cause Analysis. WWW 2025. Changhua Pei, Zexin Wang, Fengrui Liu, Zeyan Li, Yang Liu, Xiao He, Rong Kang, Tieying Zhang*, Jianjun Chen, Jianhui Li*, Gaogang Xie, Dan Pei. https://www.arxiv.org/pdf/2502.08224
E2ETune: End‑to‑End Knob Tuning via Fine‑tuned Generative Language Model. VLDB 2025. Xinmei Huang, Haoyang Li, Jing Zhang*, Xinxin Zhao, Zhiming Yao, Yiyan Li, Tieying Zhang*, Jianjun Chen, Hong Chen, Cuiping Li. https://arxiv.org/pdf/2404.11581
Learning to Communicate Through Implicit Communication Channels. ICLR 2025. Han Wang, Binbin Chen, Tieying Zhang, Baoxiang Wang. https://arxiv.org/pdf/2411.01553
ABase: The Multi‑Tenant NoSQL Serverless Database for Diverse and Dynamic Workloads in Large‑scale Cloud Environments. SIGMOD 2025. Rong Kang, Yanbin Chen, Ye Liu, Fuxin Jiang, Qingshuo Li, Miao Ma, Jian Liu, Guangling Zhao, Tieying Zhang, Jianjun Chen, Lei Zhang
Towards VM Rescheduling Optimization Through Deep Reinforcement Learning. EuroSys 2025. Xianzhong Ding, Yunkai Zhang, Binbin Chen, Donghao Ying, Tieying Zhang*, Jianjun Chen, Lei Zhang, Alberto Cerpa, Wan Du*. https://drive.google.com/file/d/1mKMh0HUMSu1JsUhtbck4pnZgO11VBopJ/view
Volcano Engine Developer Services
The Volcano Engine Developer Community, Volcano Engine's TOD community, connects the platform with developers by offering cutting-edge technical content and diverse events, nurturing a vibrant developer culture, and co-building an open-source ecosystem.
