How ByteBrain’s AI‑Powered Infra is Redefining Cloud and Database Performance
ByteDance’s ByteBrain team showcases how large‑model AI, operations research, and system‑level innovations have produced award‑winning papers and billions of yuan in cost savings while improving on‑call efficiency, database estimation, and cloud infrastructure reliability.
In recent years, ByteDance’s infrastructure team has invested continuously in AI for Infra/Systems, using AI techniques to optimize cloud computing systems and achieving significant results.
Within the first four months of 2025, the ByteBrain team had 11 AI-for-Infra papers published or accepted at top conferences, including 10 CCF‑A papers (SIGMOD ×3, VLDB ×4, EuroSys, FSE, WWW) and one ICLR paper.
While academic papers are a byproduct, the primary industrial benefit is business impact. ByteBrain uses large language models (LLMs) to improve Volcano Engine stability, achieving a 26% gain in on‑call efficiency, and applies operations‑research algorithms that have cut system costs by over 1 billion CNY over three years. The team has also made progress in anomaly detection, root‑cause analysis, AI for DB, DB for AI, Text2SQL, and LLM multi‑agent applications. One example is applying pre‑trained language models to NDV (Number of Distinct Values) estimation, a first‑of‑its‑kind technique published at SIGMOD 2025 and now deployed in production.
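To make the NDV problem concrete: given only a small sample of a column, one must estimate how many distinct values the full column contains. The sketch below uses the classical sample-based GEE estimator as a baseline illustration, not ByteBrain's PLM4NDV method; the toy column and function names are assumptions for the example.

```python
import math
import random
from collections import Counter

def gee_ndv_estimate(sample, table_size):
    """Classical GEE estimator: scale the count of sample singletons
    by sqrt(N/n) and keep values seen more than once as-is."""
    n = len(sample)
    freqs = Counter(sample)                              # value -> frequency in sample
    f1 = sum(1 for c in freqs.values() if c == 1)        # values seen exactly once
    repeated = sum(1 for c in freqs.values() if c > 1)   # values seen 2+ times
    return math.sqrt(table_size / n) * f1 + repeated

# Toy column: 1,000 rows drawn from 100 possible distinct values
random.seed(0)
column = [random.randrange(100) for _ in range(1000)]
sample = random.sample(column, 100)                      # 10% sample
est = gee_ndv_estimate(sample, len(column))
exact = len(set(column))
print(f"exact NDV = {exact}, GEE estimate = {est:.1f}")
```

Sample-based estimators like this require scanning part of the data and degrade on skewed columns; PLM4NDV's motivation, per the title, is minimizing that data access by leveraging pre-trained language models.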
ByteDance is scaling large‑model technologies across cloud computing and IT infrastructure, actively sharing research outcomes with the open‑source community and top academic venues, and strengthening its position in the AI for Infra field.
About the ByteBrain team
ByteBrain is ByteDance’s AI for Infra service platform that uses AI—especially machine learning, large models, and operations‑research techniques—to automatically optimize the full lifecycle of infrastructure and systems. Optimization targets include databases, storage, big‑data systems, VMs, containers, networks, operations, and stability. The main research directions are AIOps, AI4DB, operations‑research optimization, and LLM‑for‑Infra, with functional modules such as capacity planning, resource scheduling, system tuning, anomaly detection, root‑cause analysis, slow‑SQL optimization, Text2SQL, and LLM‑agents.
The team is recruiting researchers and interns; contact: [email protected].
List of ByteBrain’s academic papers through April 2025 (* denotes corresponding author):
PLM4NDV: Minimizing Data Access for Number of Distinct Values Estimation with Pre‑trained Language Models. SIGMOD 2025. Xianghong Xu, Xiao He, Tieying Zhang*, Rui Shi, Lei Zhang, Jianjun Chen. https://arxiv.org/pdf/2504.00608
AdaNDV: Adaptive Number of Distinct Value Estimation via Learning to Select and Fuse Estimators. VLDB 2025. Xianghong Xu, Tieying Zhang*, Xiao He, Haoyang Li, Rong Kang, Shuai Wang, Linhui Xu, Zhimin Liang, Shangyu Luo, Lei Zhang, Jianjun Chen. https://arxiv.org/pdf/2502.16190
Adaptive and Efficient Log Parsing as a Cloud Service. SIGMOD 2025. Zeyan Li, Jie Song, Tieying Zhang*, Tao Yang, Yingjie Ye, Pengfei Duan, Jianjun Chen. https://www.arxiv.org/pdf/2504.09113
Data‑Agnostic Cardinality Learning from Imperfect Workloads. VLDB 2025. Peizhi Wu, Rong Kang, Tieying Zhang*, Jianjun Chen, Ryan Marcus, Zachary G. Ives
TickIt: Leveraging Large Language Models for Automated Ticket Escalation. FSE 2025. Fengrui Liu, Xiao He, Tieying Zhang*, Jianjun Chen, Yi Li, Lihua Yi, Haipeng Zhang, Gang Wu, Rui Shi. https://arxiv.org/pdf/2504.08475
ChatTS: Aligning Time Series with LLMs via Synthetic Data for Enhanced Understanding and Reasoning. VLDB 2025. Zhe Xie, Zeyan Li, Xiao He, Longlong Xu, Xidao Wen, Tieying Zhang*, Jianjun Chen, Rui Shi, Dan Pei*. https://arxiv.org/pdf/2412.03104
Flow‑of‑Action: SOP Enhanced LLM‑Based Multi‑Agent System for Root Cause Analysis. WWW 2025. Changhua Pei, Zexin Wang, Fengrui Liu, Zeyan Li, Yang Liu, Xiao He, Rong Kang, Tieying Zhang*, Jianjun Chen, Jianhui Li*, Gaogang Xie, Dan Pei. https://www.arxiv.org/pdf/2502.08224
E2ETune: End‑to‑End Knob Tuning via Fine‑tuned Generative Language Model. VLDB 2025. Xinmei Huang, Haoyang Li, Jing Zhang*, Xinxin Zhao, Zhiming Yao, Yiyan Li, Tieying Zhang*, Jianjun Chen, Hong Chen, Cuiping Li. https://arxiv.org/pdf/2404.11581
Learning to Communicate Through Implicit Communication Channels. ICLR 2025. Han Wang, Binbin Chen, Tieying Zhang, Baoxiang Wang. https://arxiv.org/pdf/2411.01553
ABase: The Multi‑Tenant NoSQL Serverless Database for Diverse and Dynamic Workloads in Large‑scale Cloud Environments. SIGMOD 2025. Rong Kang, Yanbin Chen, Ye Liu, Fuxin Jiang, Qingshuo Li, Miao Ma, Jian Liu, Guangling Zhao, Tieying Zhang, Jianjun Chen, Lei Zhang
Towards VM Rescheduling Optimization Through Deep Reinforcement Learning. EuroSys 2025. Xianzhong Ding, Yunkai Zhang, Binbin Chen, Donghao Ying, Tieying Zhang*, Jianjun Chen, Lei Zhang, Alberto Cerpa, Wan Du*. https://drive.google.com/file/d/1mKMh0HUMSu1JsUhtbck4pnZgO11VBopJ/view
Volcano Engine Developer Services
The Volcano Engine Developer Community, Volcano Engine's TOD community, connects the platform with developers by offering cutting-edge technical content and diverse events, nurturing a vibrant developer culture, and co-building an open-source ecosystem.
