Tencent Classroom Monitoring Practices: Challenges, Strategies, and Future Directions
During the pandemic’s “停课不停学” (“suspending classes without stopping learning”) surge, Tencent Classroom handled a 120‑fold jump in peak concurrent users by rapidly deploying Grafana dashboards, Kibana‑based log search, the internal Moniter system, and Tencent Cloud monitoring tools, and by establishing a three‑part model of feedback, alerting, and on‑call duty. The team now plans automation, unified visualization, and chaos engineering to further improve observability and service reliability.
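During the COVID‑19 pandemic, Tencent Classroom launched the “停课不停学” (“suspending classes without stopping learning”) initiative, which caused a massive traffic surge: peak concurrent users (PCU) grew from 5 × 10⁴ to 6 × 10⁶, and queries per second (QPS) grew from 1.4 × 10⁴ to 6.5 × 10⁵. The article summarizes the monitoring challenges this created and the solutions the team adopted.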
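To put the quoted figures in proportion, the growth factors can be worked out directly. This is only arithmetic on the numbers stated above; the variable names are illustrative.

```python
# Figures quoted in the article (before and after the surge).
pcu_before, pcu_after = 50_000, 6_000_000   # peak concurrent users
qps_before, qps_after = 14_000, 650_000     # queries per second

pcu_growth = pcu_after / pcu_before
qps_growth = qps_after / qps_before

print(f"PCU grew {pcu_growth:.0f}x")   # 120x
print(f"QPS grew {qps_growth:.1f}x")   # about 46.4x
```

The 120‑fold figure therefore describes concurrent users; request throughput grew by a smaller factor of roughly 46.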
Challenges: detecting potential issues quickly despite the short preparation time, handling the massive request volume, deciding which metrics to monitor, setting appropriate alert thresholds, and handling alerts efficiently once they fire.
Response strategy includes:
Get basic monitoring in place quickly, then optimize it iteratively.
Use Grafana‑based quality dashboards for visualizing key metrics (success rate, request volume, error details, connection count, slow queries, replication lag, CPU load, etc.).
Deploy a full‑link log system built on Kibana for centralized log collection and query.
Leverage the internal “Moniter” system for CPU, memory, disk, network and custom business metrics.
Employ a network‑management system with agents on each server to collect and report performance data.
Utilize Tencent Cloud Monitoring (host, service, log, custom metrics, cloud probing) for both application and data‑layer services such as TDSQL and Redis.
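As a rough illustration of the kind of key metrics such a quality dashboard displays, the sketch below aggregates one time window of requests into request volume, success rate, and a slow‑request count. It is not the article's implementation: `RequestRecord`, `summarize`, and the 500 ms slow threshold are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    ok: bool           # whether the request succeeded
    latency_ms: float  # end-to-end latency in milliseconds


def summarize(records, slow_ms=500.0):
    """Aggregate one time window of requests into dashboard-style metrics.

    slow_ms is an assumed threshold, not a value taken from the article.
    """
    total = len(records)
    succeeded = sum(1 for r in records if r.ok)
    slow = sum(1 for r in records if r.latency_ms > slow_ms)
    return {
        "request_volume": total,
        # None means "no data" so an empty window is not reported as healthy.
        "success_rate": succeeded / total if total else None,
        "slow_requests": slow,
    }


window = [
    RequestRecord(True, 120.0),
    RequestRecord(True, 640.0),
    RequestRecord(False, 80.0),
    RequestRecord(True, 95.0),
]
metrics = summarize(window)
print(metrics)
```

In a real deployment these values would be emitted to a time‑series store and charted, rather than computed in process as shown here.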
Three‑pronged operational model:
Feedback : automatic feedback via the iFeedback platform and manual feedback through QQ/WeChat groups.
Alerting : multiple alert types (Moniter alerts, business alerts, baseline metric alerts, probing alerts, model‑driven alerts) delivered via enterprise‑WeChat bots, SMS, email, etc.
On‑call : daily patrols, pre‑class inspections, hourly checks, and 24/7 on‑site duty covering all service domains.
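One way to picture the alerting prong is a small routing table that maps alert severity to delivery channels. The sketch below is an assumption for illustration only: the severity levels, the channel names (`wecom_bot`, `sms`, `email`), and the `Alert`, `route`, and `dispatch` helpers are hypothetical and do not describe Tencent's actual alerting API.

```python
from dataclasses import dataclass

# Hypothetical routing table: severity -> delivery channels.
ROUTES = {
    "critical": ["wecom_bot", "sms", "email"],
    "warning": ["wecom_bot", "email"],
    "info": ["email"],
}


@dataclass
class Alert:
    source: str    # e.g. "business", "baseline", "probing"
    severity: str  # one of the keys in ROUTES
    message: str


def route(alert):
    """Return the channels an alert should be delivered to."""
    # Unknown severities fall back to the widest fan-out so that
    # a mislabelled alert is never silently dropped.
    return ROUTES.get(alert.severity, ROUTES["critical"])


def dispatch(alert, send):
    """Deliver the alert on every routed channel via the injected `send` callable."""
    channels = route(alert)
    for channel in channels:
        send(channel, f"[{alert.severity}] {alert.source}: {alert.message}")
    return channels


# Usage: collect deliveries in a list instead of calling real messaging services.
sent = []
dispatch(
    Alert("probing", "critical", "login endpoint unreachable"),
    lambda channel, text: sent.append((channel, text)),
)
print(sent)
```

Injecting `send` keeps the routing logic testable without touching any real bot, SMS, or email gateway.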
Future plans focus on:
Tool optimization – unified quality panorama dashboard, expanded use of Cloud Monitoring (basic monitoring, custom monitoring, log monitoring, cloud probing).
Automation – auto‑ticket generation and notification for alerts, automated analysis to reduce manual effort.
Advanced visualization – richer data‑visualization to better illustrate business impact.
Architecture improvements – chaos engineering and fault‑injection, building observability pillars (metrics, logs, tracing) to understand why failures occur.
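The fault‑injection idea in the last item can be sketched as a wrapper that makes a configurable fraction of calls fail, so that dashboards and alerts can be checked against known failures. This is a minimal sketch under stated assumptions, not the team's tooling: `with_fault_injection`, `InjectedFault`, and `fetch_course` are hypothetical names, and real chaos experiments usually inject faults at the network or infrastructure level rather than in process.

```python
import random


class InjectedFault(Exception):
    """Raised in place of a real failure during a chaos experiment."""


def with_fault_injection(func, failure_rate, rng=None):
    """Wrap `func` so that roughly `failure_rate` of calls raise InjectedFault."""
    if rng is None:
        rng = random.Random()

    def wrapper(*args, **kwargs):
        # rng.random() is in [0.0, 1.0), so a rate of 0 never fails
        # and a rate of 1 always fails.
        if rng.random() < failure_rate:
            raise InjectedFault(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)

    return wrapper


def fetch_course(course_id):
    # Stand-in for a real downstream call (hypothetical).
    return {"id": course_id, "title": "demo"}


# A seeded generator makes the experiment repeatable.
flaky_fetch = with_fault_injection(fetch_course, failure_rate=0.3, rng=random.Random(42))

failures = 0
for i in range(1000):
    try:
        flaky_fetch(i)
    except InjectedFault:
        failures += 1
print(f"{failures} of 1000 calls failed")
```

If the success‑rate dashboard and the alert rules do not react to a run like this, the gap is in the monitoring rather than in the service.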
The article concludes that continuous monitoring evolution has improved service quality and user experience, and invites collaboration on Golang, cloud‑native, DevOps, and backend projects.
Tencent Cloud Developer
Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.