
How Baidu Built a Robust Microservice Monitoring System for Game Services

This article details Baidu's comprehensive microservice monitoring practice for its game platform, covering the initial fragmented setup, systematic redesign across risk control, intelligent monitoring, smart alerting, and rapid fault localization, and presents the resulting monitoring architecture, visualizations, and future improvement goals.

Baidu Geek Talk

Jul 14, 2021


Background

Game services at Baidu grew rapidly, with each developer maintaining 2–3 microservices and more expected. Early monitoring relied on three platforms: Argus (log and server monitoring), Monitor (business metrics) and SIA (visualization). The initial setup lacked systematic design, had incomplete coverage, and produced slow, noisy alerts, so risks were exposed late and faults were hard to isolate.

Initial Exploration

Log and Server Monitoring

Baidu Argus collected machine status and business logs. The implementation was single‑dimensional, without per‑instance thresholds or multi‑dimensional alerting.

Service Polling Monitoring

The Monitor platform offered visual configuration for periodic polling of core APIs. As services iterated quickly, custom per‑scenario configurations became inefficient.

Service Visualization

SIA visualized traffic, availability and performance metrics, helping developers observe service health, but its advanced analytics were under‑utilized.

Problems Identified

Risks surfaced only after they had already caused impact.

Monitoring items were chaotic and coverage was incomplete.

Anomaly details were insufficient for rapid diagnosis.

Developers were bombarded with noisy alerts.

Systematic Monitoring Design

Risk Control

Automated test cases and release gating were introduced, reducing online issues by more than 95%. Key risk‑control measures include pre‑deployment sanity checks, canary releases, automated rollback and strict version control.
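As an illustration of how a canary release can feed an automated rollback decision, here is a minimal sketch. The article does not describe Baidu's actual gating logic, so the names `ReleaseStats` and `canary_decision`, the 2x error-rate ratio, the 0.1% floor and the 500-request minimum are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class ReleaseStats:
    """Request counters for one release (hypothetical structure)."""
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_decision(baseline: ReleaseStats, canary: ReleaseStats,
                    max_ratio: float = 2.0, min_requests: int = 500) -> str:
    """Return 'wait', 'promote' or 'rollback' for a canary release.

    The thresholds are illustrative assumptions, not values from the article.
    """
    # Too little canary traffic to judge: keep observing.
    if canary.requests < min_requests:
        return "wait"
    # A small absolute floor keeps a near-zero baseline from blocking every release.
    allowed = max(baseline.error_rate * max_ratio, 0.001)
    return "promote" if canary.error_rate <= allowed else "rollback"


baseline = ReleaseStats(requests=100_000, errors=100)
print(canary_decision(baseline, ReleaseStats(requests=1_000, errors=1)))
print(canary_decision(baseline, ReleaseStats(requests=1_000, errors=10)))
```

Comparing the canary against the live baseline, rather than against a fixed threshold, lets the same gate work for services with very different normal error rates.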

Intelligent Monitoring

Three core challenges were addressed:

Detect global anomalies across multiple dimensions.

Detect instance‑level spikes in request loss quickly.

Identify whether availability issues stem from a data‑center, an interface or downstream services.

Smart anomaly detection algorithms in SIA combine latency, traffic, SLA and revenue metrics. Both periodic (moving‑average, STL decomposition) and non‑periodic (change‑point detection) methods are used to capture system fluctuations.
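The moving‑average approach mentioned above can be sketched as follows. This is not SIA's algorithm, which the article does not detail; it is a simple trailing‑window check, and the function name `rolling_anomalies`, the window size and the 3‑sigma threshold are assumptions for illustration.

```python
from collections import deque
from statistics import mean, pstdev


def rolling_anomalies(values, window=5, threshold=3.0):
    """Return indices whose value deviates from the trailing moving average
    by more than `threshold` standard deviations of that window."""
    history = deque(maxlen=window)
    flagged = []
    for i, v in enumerate(values):
        # Only judge a point once a full window of history is available.
        if len(history) == window:
            mu = mean(history)
            sigma = pstdev(history)
            # Guard against a flat window, where sigma is zero.
            if sigma > 0 and abs(v - mu) > threshold * sigma:
                flagged.append(i)
        history.append(v)
    return flagged


latency_ms = [100, 102, 98, 101, 99, 100, 250, 101, 100]
print(rolling_anomalies(latency_ms))  # -> [6]
```

Note that the spike itself enters the window afterwards and inflates the standard deviation for the next few points; a production detector would typically exclude flagged points or use a robust statistic such as the median.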

Full‑scene coverage is achieved by dividing monitoring into four quadrants—service, interface, error‑code and data‑center—ensuring no blind spots. Fine‑grained filters allow slicing by service, interface, error code, data‑center and instance.
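Slicing by dimension is what allows an availability drop to be attributed to a data center, an interface or something else. A rough sketch, assuming each request is available as a record with hypothetical fields `service`, `interface`, `data_center`, `instance` and a boolean `ok` (the article does not specify a schema):

```python
from collections import defaultdict

# Dimensions a record can be grouped by (field names are assumptions).
DIMENSIONS = ("service", "interface", "data_center", "instance")


def failure_rate_by(records, dimension):
    """Group request records by one dimension and return {value: failure_rate}."""
    if dimension not in DIMENSIONS:
        raise ValueError(f"unknown dimension: {dimension}")
    totals = defaultdict(int)
    failures = defaultdict(int)
    for record in records:
        key = record[dimension]
        totals[key] += 1
        if not record["ok"]:
            failures[key] += 1
    return {key: failures[key] / totals[key] for key in totals}


records = [
    {"service": "pay", "interface": "/order", "data_center": "bj", "instance": "bj-1", "ok": True},
    {"service": "pay", "interface": "/order", "data_center": "bj", "instance": "bj-2", "ok": True},
    {"service": "pay", "interface": "/order", "data_center": "gz", "instance": "gz-1", "ok": False},
    {"service": "pay", "interface": "/query", "data_center": "gz", "instance": "gz-1", "ok": True},
]
print(failure_rate_by(records, "data_center"))  # -> {'bj': 0.0, 'gz': 0.5}
```

If failures concentrate in one value of a dimension (here the `gz` data center) while other dimensions look uniform, that dimension is the likely source of the problem.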

Smart Alerting

Alerts are tiered by severity and scenario to reduce noise. Key features:

Intelligent merging of duplicate alerts.

Relevance‑based filtering.

Automatic escalation chain: email → instant‑messaging (Flow) → SMS → phone call, with repeated calls until acknowledgment.
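The merging and escalation behaviour listed above might look roughly like this. The channel names, the `(service, rule)` merge key and both function names are assumptions made for the sketch; the article does not say how the platform implements either feature.

```python
# Escalation order from the article: email, instant messaging, SMS, phone call.
ESCALATION_CHAIN = ("email", "im", "sms", "phone")


def merge_duplicates(alerts):
    """Collapse alerts that share (service, rule) into one entry with a count."""
    merged = {}
    for alert in alerts:
        key = (alert["service"], alert["rule"])
        if key in merged:
            merged[key]["count"] += 1
        else:
            merged[key] = {**alert, "count": 1}
    return list(merged.values())


def channel_for_attempt(attempt):
    """Channel for the n-th unacknowledged notification attempt (0-based).

    Once the chain is exhausted, the last channel (phone) is repeated
    until someone acknowledges the alert.
    """
    return ESCALATION_CHAIN[min(attempt, len(ESCALATION_CHAIN) - 1)]


alerts = [
    {"service": "pay", "rule": "latency_p99"},
    {"service": "pay", "rule": "latency_p99"},
    {"service": "login", "rule": "error_rate"},
]
print(merge_duplicates(alerts))
print([channel_for_attempt(i) for i in range(6)])
```

Merging before escalating means a burst of identical alerts climbs the chain once rather than paging the on-call developer once per duplicate.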

Alert templates support rich‑text placeholders for error codes, timestamps and remediation steps, providing actionable context.
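A placeholder-based template of this kind can be sketched with Python's standard `string.Template`. The template text and the field names (`severity`, `service`, `error_code`, `timestamp`, `remediation`) are invented for the example and are not the platform's actual template syntax.

```python
from string import Template

# Hypothetical alert template; placeholders are filled in per alert.
ALERT_TEMPLATE = Template(
    "[$severity] $service returned error $error_code at $timestamp\n"
    "Suggested action: $remediation"
)


def render_alert(fields):
    """Fill the template; unknown placeholders are left as-is rather than raising."""
    return ALERT_TEMPLATE.safe_substitute(fields)


print(render_alert({
    "severity": "P1",
    "service": "pay",
    "error_code": 5003,
    "timestamp": "2021-07-14 10:32:05",
    "remediation": "check downstream order service",
}))
```

Using `safe_substitute` rather than `substitute` means an alert with a missing field is still delivered, with the raw placeholder visible, instead of being dropped by a rendering error.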

Efficient Fault Localization

Alerts on critical logic embed trace links and are pushed through robot notifications, so they arrive within minutes and automatically point to the problematic line of code.

Real‑time trace integration leverages Baidu Trace and DataHub messaging; traces are indexed within 5 minutes, enabling rapid search of logs and request context.
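To show why indexing by trace makes log search fast, here is a toy in-memory index. It does not use Baidu Trace or DataHub, whose APIs the article does not describe; the `TraceIndex` class and the message fields `trace_id`, `ts`, `service` and `line` are assumptions for illustration.

```python
from collections import defaultdict


class TraceIndex:
    """Toy index mapping a trace ID to the log messages of that request."""

    def __init__(self):
        self._by_trace = defaultdict(list)

    def ingest(self, message):
        """Index one log message consumed from the message stream."""
        self._by_trace[message["trace_id"]].append(message)

    def lookup(self, trace_id):
        """Return the request's log messages in timestamp order."""
        return sorted(self._by_trace.get(trace_id, []), key=lambda m: m["ts"])


index = TraceIndex()
# Messages may arrive out of order from different services.
index.ingest({"trace_id": "t1", "ts": 3, "service": "order", "line": "db timeout"})
index.ingest({"trace_id": "t1", "ts": 1, "service": "gateway", "line": "request received"})
index.ingest({"trace_id": "t2", "ts": 2, "service": "gateway", "line": "request received"})

for message in index.lookup("t1"):
    print(message["ts"], message["service"], message["line"])
```

With the trace ID embedded in an alert, one lookup returns the full request context across services, which is the step that shortens fault localization.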

Monitoring Panorama

The final architecture provides comprehensive coverage of servers, logs, service status, business data and core scenarios. Developers can assess system health through visual dashboards, intelligent alerts and detailed reports.

Conclusion and Outlook

The systematic monitoring build achieved significant improvements in timeliness, fault‑localization efficiency and coverage. Future work aims to automate failure handling, implement intelligent resource scaling and further enhance system maintainability and availability.

