How Tencent’s PCG Achieves Full‑Link Observability and AI‑Powered SRE
The talk details Tencent PCG’s end‑to‑end observability platform, its data‑standardization pipeline, client‑backend session linking, AI‑enhanced SRE Agent with large language models, and the roadmap toward a SaaS offering, illustrating how modern operations integrate AI for rapid fault localization.
GOPS2023 Review
In 2023 the PCG observability platform defined a framework consisting of data coverage, data standardization, storage engines, and productization capabilities, aiming to provide standardized data that helps developers and operators locate and resolve faults efficiently. The essence of observability was described as data integration, understanding, linking, experience optimization, and interaction efficiency.
Full‑Link Observability
With the rise of large models, standardized data is now consumed not only by humans but also by AI, improving efficiency and richness of insights. The alert‑location and proactive inspection scenarios have been rebuilt using an Agent approach, representing the biggest change for 2025.
Data infrastructure has been extended from backend‑only to client‑side, achieving true full‑link correlation. Good data quality is crucial for AI and Agent effectiveness.
Example: Tencent Video’s playback flow passes through client rendering, domain/access layer, logic services, component layer, and even AI‑driven recommendation. By aggregating all client and backend data into the observability platform, any user‑reported fault can be quickly traced to its source.
Client‑backend linking is realized through a Session concept: a user’s interaction chain on the app is abstracted as a Session, generating Session ID, View ID, and network spans. The Session ID is correlated with the backend Trace ID, enabling end‑to‑end tracing of alerts back to user actions.
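As a rough illustration of the linking described above, the sketch below builds a lookup from backend trace IDs to the client Session and View that issued the request. The class names, field names, and the assumption that each client network span records the Trace ID it propagated to the backend are hypothetical; the talk does not describe the actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class NetworkSpan:
    """One client-side network request (hypothetical shape)."""
    view_id: str
    url: str
    trace_id: str  # assumed: the ID sent to the backend with this request


@dataclass
class Session:
    """A user's interaction chain on the app."""
    session_id: str
    spans: List[NetworkSpan] = field(default_factory=list)


def build_trace_index(sessions: List[Session]) -> Dict[str, Tuple[str, str]]:
    """Map each backend trace ID to the (session ID, view ID) that issued it."""
    index = {}
    for session in sessions:
        for span in session.spans:
            index[span.trace_id] = (session.session_id, span.view_id)
    return index


def locate_user_action(index: Dict[str, Tuple[str, str]],
                       trace_id: str) -> Optional[Tuple[str, str]]:
    """Given a trace ID from a backend alert, return its session and view, or None."""
    return index.get(trace_id)
```

With such an index, an alert that carries a Trace ID can be resolved to the screen the user was on when the failing request was sent.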
Frontend/Client Metrics System
Data collection now covers mini‑programs, web, cross‑platform frameworks, iOS, and Android, with full coverage from collection to standardized processing, including extraction of compressed fields and error parsing. This enables a unified metric system covering API monitoring, custom speed tests, white‑screen detection, fault discovery, alarm configuration, root‑cause analysis, and user‑side fault reproduction.
SRE Agent
Two years ago Tencent achieved partial automated fault localization with small models, but those models lacked interpretability and flexibility. The focus is now on large‑model‑driven agents: an 8B expert model (soon to be upgraded to 14B) improves accuracy by ~12% and runs inference three times faster than the earlier DeepSeek‑based solutions.
The agent follows a “Debug & Search” pattern: it receives a problem description, iteratively calls tools, searches for relevant information, and refines its answer across multiple rounds. Reinforcement learning (GRPO) and a sandbox of historical fault data are used to train the model.
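The iterative pattern above can be sketched as a simple loop. This is only an illustration: `decide` stands in for the model, `tools` is a plain dictionary of callables, and the action format is invented here, not taken from the talk.

```python
def run_agent(problem, decide, tools, max_rounds=5):
    """Call tools round by round until `decide` returns an answer.

    `decide(problem, evidence)` is a hypothetical stand-in for the model. It
    returns either {"type": "tool", "tool": name, "arg": value} or
    {"type": "answer", "text": ...}.
    """
    evidence = []  # (tool name, argument, result) tuples gathered so far
    for _ in range(max_rounds):
        action = decide(problem, evidence)
        if action["type"] == "answer":
            return action["text"], evidence
        result = tools[action["tool"]](action["arg"])
        evidence.append((action["tool"], action["arg"], result))
    # Round budget exhausted without a final answer.
    return None, evidence
```

Capping the number of rounds keeps a confused model from looping indefinitely, and returning the collected evidence lets a human review what the agent looked at.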
Time‑series data, which is token‑inefficient, is handled by a dedicated tokenizer and embedding strategy, similar to image encoding, to prevent context overflow.
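The talk does not say how the tokenizer works. One common way to treat a series "like an image" is to cut it into fixed-length patches and emit one vector per patch instead of one token per digit. The sketch below assumes that approach and uses simple statistics as a stand-in for a learned embedding; the function names are hypothetical.

```python
def patchify(series, patch_len):
    """Split a series into consecutive patches; the last one may be shorter."""
    return [series[i:i + patch_len] for i in range(0, len(series), patch_len)]


def embed_patch(patch):
    """Stand-in for a learned projection: (mean, min, max) of the patch."""
    return (sum(patch) / len(patch), min(patch), max(patch))


def encode_series(series, patch_len=4):
    """Turn a series of N points into about N / patch_len patch vectors."""
    return [embed_patch(p) for p in patchify(series, patch_len)]
```

Under this assumption, a metric with 1,440 one-minute points and a patch length of 60 would occupy 24 positions in the model's context rather than thousands of digit tokens.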
MCP Tools re‑engineer trace data into human‑readable formats and convert Unix timestamps to RFC1123Z strings (RFC 1123 dates with a numeric time zone, such as "Mon, 02 Jan 2006 15:04:05 -0700"), reducing token usage and improving model comprehension.
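The timestamp conversion can be illustrated with the Python standard library. This is a sketch of the format only, not the platform's tool code; the function name and the `utc_offset_hours` parameter are assumptions.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime


def unix_to_rfc1123z(ts, utc_offset_hours=0):
    """Render a Unix timestamp as an RFC 1123 date with a numeric zone offset."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    # format_datetime emits English day and month names regardless of locale.
    return format_datetime(datetime.fromtimestamp(ts, tz))


print(unix_to_rfc1123z(1700000000))     # Tue, 14 Nov 2023 22:13:20 +0000
print(unix_to_rfc1123z(1700000000, 8))  # Wed, 15 Nov 2023 06:13:20 +0800
```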
Context and memory handling involve compressing past dialogues and summarizing long logs, enabling multi‑turn interactions without exceeding token limits.
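A minimal sketch of the compression idea, under stated assumptions: tokens are approximated by whitespace-separated words, and `summarize` is a hypothetical callable (in practice probably a model call) that condenses a list of older turns into one string.

```python
def compress_history(turns, summarize, budget, keep_recent=2):
    """Summarize older turns when the history exceeds `budget` tokens.

    The most recent `keep_recent` turns are kept verbatim. The result is
    smaller than the input but is not guaranteed to fit the budget.
    """
    def count(texts):
        # Crude token estimate: whitespace-separated words.
        return sum(len(t.split()) for t in texts)

    if count(turns) <= budget:
        return list(turns)
    split = max(len(turns) - keep_recent, 0)
    older, recent = turns[:split], turns[split:]
    if not older:
        return list(turns)  # nothing old enough to summarize
    return [summarize(older)] + recent
```

Keeping the latest turns untouched preserves the details the model is most likely to need next, while older material survives only as a summary.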
The agent is a hybrid system: (1) training specialized models for niche scenarios where prompts alone are insufficient; (2) leveraging rich, standardized fault data; (3) integrating RAG, tools, and rigorous evaluation pipelines.
AI Team Collaborative Localization
Multiple agents may cooperate when alerts span different domains (e.g., monitoring vs. business configuration) or when context exceeds a single model’s capacity. Coordination strategies mirror traditional software engineering practices for managing complexity.
Galileo SaaS
After building the complex Galileo platform, Tencent is exploring SaaS and open‑source delivery models (including Docker images) to provide the observability solution to external industries, despite the platform’s many dependencies and AI components.
The SaaS aims to support 50+ monitoring object types and adapt to diverse business environments and frameworks.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
