
Tackling QQ’s Legacy Ops: Automation, Capacity Management & Fault Analysis

In this article, Tencent’s QQ operations team shares how it handles legacy issues, standardizes package and configuration management, uses the ZhiYun automation platform, and applies capacity management and fault root‑cause analysis to improve efficiency and reduce costs.

Efficient Ops

Introduction

“坐而论道” (roughly, “Sit and Discuss”) is a technical exchange format created by the Efficient Operations community. Each week, experts ask and answer questions on one specific topic.

Highlights

Legacy problems such as lack of standardized package management, hard‑coded IPs, and incomplete monitoring are common challenges.

ZhiYun addresses the difficulties of continuous deployment: it automates bringing services online and taking them offline quickly, and covers about 80% of operational efficiency needs.

ZhiYun is currently open externally only to Tencent‑related enterprises.

QQ and WeChat, both major IM products, share similar operational challenges.

Q1: How to address legacy issues and integrate historical tools?

Answer: Since 2009 we have pursued standardization, collaborating closely with development and testing teams to embed DevOps culture early. We introduced standards for package management, configuration management, CMDB, and quality monitoring, which gradually eliminated legacy pain points. For example, the ops team once spent a week repackaging non‑standard packages. Efforts like this cleared many legacy issues, and the resulting practices were later built into the ZhiYun automation platform.

Some services fall outside the standardization system, often because they came through acquisitions or run on legacy architectures. For those that are still growing, we require packaging and ZhiYun integration. Stable services receive basic routing and fault tolerance without heavy automation.

Q2: Future technical roadmap for ZhiYun?

Answer: ZhiYun continues to refine its system for smoother operation and broader applicability of standardized processes. New design ideas focus on service scheduling, cross‑IDC migration, SET replication, and cost control, all built around the core automatic deployment capability.

Q3: Plans for ZhiYun’s external expansion?

Answer: A simplified version of the package management module (based on TARS) is available in Tencent Cloud Marketplace and is used by several Tencent‑affiliated enterprises such as Webank, Futu, and Didi. Full external release is currently limited to Tencent‑related companies due to suitability and internal focus.

Q4: Differences between QQ and WeChat operations?

Answer: Both face similar IM operational challenges, but differ in three key areas:

Internationalization: WeChat has a global user base, while QQ is primarily domestic.

Feature set: QQ is a larger platform with many downstream applications, which adds operational complexity.

PC focus: QQ continues to expand on PC platforms (e.g., online education), unlike WeChat.

Q5: Experience with capacity management and fault root‑cause analysis

Answer: Capacity management drives cost control, business trend forecasting, and scheduling decisions. For example, analyzing average load, per‑machine QPS, and capacity consistency helped save 300 million RMB in 2014.

Fault root‑cause analysis relies on the ROOT project, which aggregates alerts across monitoring systems to pinpoint likely failure points. For basic alerts (e.g., crashes, disk full), automated self‑healing strategies apply standardized ops procedures.
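To make the capacity metrics named in this answer concrete, here is a minimal Python sketch of how average load, per‑machine QPS, and capacity consistency might be computed for one service module. It is not ZhiYun's implementation. The `HostSample` type, the `capacity_report` function, the `qps_capacity` parameter, the 0.3 low‑load threshold, and the use of the coefficient of variation as a consistency measure are all assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class HostSample:
    """One machine's averaged metrics over an observation window (hypothetical shape)."""
    host: str
    avg_load: float  # average load as a fraction of capacity, 0.0-1.0 (assumed unit)
    qps: float       # requests per second served by this machine


def capacity_report(samples, qps_capacity, low_load=0.3):
    """Summarize capacity for one module.

    qps_capacity is the assumed maximum QPS a single machine can sustain.
    low_load is an assumed threshold below which a host counts as underused.
    """
    loads = [s.avg_load for s in samples]
    qps_values = [s.qps for s in samples]
    mean_qps = mean(qps_values)
    # Coefficient of variation of QPS: 0 means every machine carries the
    # same traffic; larger values mean the load is spread unevenly.
    qps_cv = pstdev(qps_values) / mean_qps if mean_qps else 0.0
    return {
        "avg_load": mean(loads),
        "qps_per_machine": mean_qps,
        "utilization": mean_qps / qps_capacity,
        "qps_cv": qps_cv,
        "underused_hosts": [s.host for s in samples if s.avg_load < low_load],
    }


samples = [HostSample("host-a", 0.2, 100.0), HostSample("host-b", 0.6, 300.0)]
report = capacity_report(samples, qps_capacity=400.0)
```

In a report like this, low utilization combined with a list of underused hosts would suggest machines that could be reclaimed, while a high `qps_cv` would suggest a traffic distribution problem rather than a shortage of capacity.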

Additionally, the “ZhiZi” APM project injects timing logic into each app method, enabling mobile app performance monitoring similar to solutions like TingYun and OneAPM.
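The idea of injecting timing logic into each method can be sketched in Python with a class decorator. This only illustrates the general technique. ZhiZi targets mobile apps and its actual injection mechanism is not described here, so the names `timed`, `instrument`, `timings`, and the `ChatScreen` class are hypothetical.

```python
import functools
import inspect
import time
from collections import defaultdict

# Method name -> list of observed durations in seconds (in-memory store for the sketch).
timings = defaultdict(list)


def timed(func):
    """Wrap func so that every call records its wall-clock duration."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Record the duration even when the method raises.
            timings[func.__qualname__].append(time.perf_counter() - start)
    return wrapper


def instrument(cls):
    """Class decorator: add timing to every public plain method defined on cls."""
    for name, attr in list(vars(cls).items()):
        # inspect.isfunction skips staticmethod/classmethod objects and properties.
        if inspect.isfunction(attr) and not name.startswith("_"):
            setattr(cls, name, timed(attr))
    return cls


@instrument
class ChatScreen:
    """Hypothetical app component used only to demonstrate the wrapper."""

    def load_messages(self):
        return ["hi"]


result = ChatScreen().load_messages()
```

After the call, `timings` holds one duration for `load_messages`. A real APM agent would ship such samples to a backend instead of keeping them in memory.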

Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
