Building a Full‑Stack Operations Monitoring System: Strategies, Implementation & Lessons
This article details the end‑to‑end process of designing and deploying an operations‑focused monitoring framework for large‑scale online services. It covers semantic, log, and performance monitoring, protobuf integration, implementation challenges, observed results, and planned optimizations.
Overview
The article presents a complete workflow for constructing a monitoring system tailored to the operations side of an online service, aiming to achieve alerting for entry points, performance reporting, and graceful degradation.
Why Add Monitoring
Before monitoring, failures were discovered passively through user reports, leading to delayed fixes and reduced service availability. Introducing monitoring addresses three key problems: detecting faulty entry points early via semantic checks, identifying runtime errors through log analysis, and tracking performance metrics for proactive optimization.
Monitoring Scheme and Implementation Details
The solution is divided into three sub‑domains:
Semantic Monitoring – validates the completeness and timeliness of API responses. It includes field‑level (presence) checks and content‑level (value) checks, with alerts sent via email or SMS. Implementation involves configuring entry‑specific monitoring cases on the Numen platform and defining URL/parameter mappings.
Log Monitoring – leverages the Noah monitoring platform to watch critical nodes such as database operations or third‑party calls. Rules are defined by log file paths, match strings, and frequency thresholds; alerts trigger when patterns exceed defined limits.
Performance Monitoring – generates daily reports containing request counts, average/max/min latency, and percentile‑based slow‑request statistics. Data is sourced from ad‑hoc queries on protobuf (pb) logs, parsed via proto definitions, stored in MySQL tables, and accessed through API endpoints for report generation.
Figures illustrating the monitoring flow, semantic monitoring lists, field‑level and content‑level examples, log monitoring configuration, and performance dashboards are included.
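The three sub‑domains above can be illustrated with a minimal Python sketch. It does not use the Numen or Noah platforms, whose configuration formats are not described here. All function names, field names, and thresholds are hypothetical, and the nearest‑rank percentile is one common choice rather than the method the team necessarily used.

```python
def semantic_check(response, required_fields, content_rules):
    """Return a list of problems found in one API response (a dict)."""
    problems = []
    # Field-level check: every required field must be present.
    for field in required_fields:
        if field not in response:
            problems.append(f"missing field: {field}")
    # Content-level check: fields that are present must hold a valid value.
    for field, is_valid in content_rules.items():
        if field in response and not is_valid(response[field]):
            problems.append(f"bad value for {field}: {response[field]!r}")
    return problems


def log_rule_fires(lines, match_string, threshold):
    """True when match_string appears in more than `threshold` log lines."""
    hits = sum(1 for line in lines if match_string in line)
    return hits > threshold


def latency_report(latencies_ms, slow_percentile=95):
    """Summarize a non-empty list of latencies (nearest-rank percentile)."""
    ordered = sorted(latencies_ms)
    n = len(ordered)
    # Integer ceiling of slow_percentile * n / 100 gives the 1-based rank.
    rank = max(1, (slow_percentile * n + 99) // 100)
    return {
        "count": n,
        "avg": sum(ordered) / n,
        "min": ordered[0],
        "max": ordered[-1],
        f"p{slow_percentile}": ordered[rank - 1],
    }
```

In a real deployment, the semantic check would run against live entry‑point responses on a schedule, the log rule would be evaluated per time window, and the report would be computed from the latencies stored in MySQL. A non‑empty result from `semantic_check`, or `True` from `log_rule_fires`, would be the trigger for an email or SMS alert.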
Protobuf Usage
Protobuf serves as the serialization protocol for operation logs. Developers define .proto files describing messages (e.g., Order), compile them to generate language‑specific classes, and use these classes to parse binary logs. This enables efficient storage and retrieval of structured data for monitoring.
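In practice the classes generated by the protobuf compiler do the parsing. To show what that parsing involves, the sketch below hand‑decodes the protobuf wire format for a hypothetical message, Order { uint64 order_id = 1; string every_time = 2; }. The field layout is invented for illustration; the actual Order definition is not given here, and neither is the message that every_time belongs to.

```python
def read_varint(buf, pos):
    """Decode one base-128 varint at buf[pos]; return (value, next_pos)."""
    value, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift  # low 7 bits carry the payload
        if not byte & 0x80:              # high bit clear: last byte
            return value, pos
        shift += 7


def parse_order(buf):
    """Parse the hypothetical Order message from its binary encoding."""
    order, pos = {}, 0
    while pos < len(buf):
        # Each field starts with a tag: (field_number << 3) | wire_type.
        tag, pos = read_varint(buf, pos)
        field_number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:    # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:  # length-delimited (strings, bytes, sub-messages)
            length, pos = read_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        if (field_number, wire_type) == (1, 0):
            order["order_id"] = value
        elif (field_number, wire_type) == (2, 2):
            order["every_time"] = value.decode("utf-8")
        # Any other field is ignored, as unknown fields are in protobuf.
    return order


# Field 1 = 150 (bytes 08 96 01), field 2 = "10:30" (tag 0x12, length 5).
sample = bytes([0x08, 0x96, 0x01, 0x12, 0x05]) + b"10:30"
order = parse_order(sample)
```

A production parser should use the generated classes instead, since they handle every wire type, nested messages, and schema evolution.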
Challenges and Solutions
One implementation issue was handling string‑type fields (e.g., every_time) in ad‑hoc queries. Hive, Impala, MySQL, and PostgreSQL differ in SQL syntax, so the same string operation had to be written differently for each engine. The fix was to consult each engine's built‑in string functions and normalize the processing accordingly.
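One way to contain such dialect differences is a small per‑engine lookup of SQL expressions. The sketch below assumes, purely for illustration, that every_time holds a timestamp string such as 2016-01-01 10:30:00 and that the goal is to convert it to epoch seconds. Neither assumption is stated in the original material, and the helper name is hypothetical.

```python
# Per-engine SQL expression that turns a timestamp string into epoch seconds.
EPOCH_EXPR = {
    "hive": "unix_timestamp({col}, 'yyyy-MM-dd HH:mm:ss')",
    "impala": "unix_timestamp({col}, 'yyyy-MM-dd HH:mm:ss')",
    "mysql": "UNIX_TIMESTAMP(STR_TO_DATE({col}, '%Y-%m-%d %H:%i:%s'))",
    "postgresql": "EXTRACT(EPOCH FROM to_timestamp({col}, 'YYYY-MM-DD HH24:MI:SS'))",
}


def epoch_expr(engine, column):
    """Return the dialect-specific expression for a trusted column name."""
    try:
        template = EPOCH_EXPR[engine.lower()]
    except KeyError:
        raise ValueError(f"unsupported engine: {engine}") from None
    return template.format(col=column)


query = f"SELECT {epoch_expr('postgresql', 'every_time')} AS ts FROM order_log"
```

The table name order_log is also hypothetical. The column argument is interpolated directly, so it should come from code and never from user input. Time zone handling differs between engines and is left out of this sketch.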
Results and Future Work
Monitoring has moved from non‑existent to a functional baseline, providing alerts, performance insights, and a framework for future enhancements. Planned improvements include automated degradation strategies, richer visualizations (charts, graphs), and packaging the monitoring capabilities as a reusable service for other product lines.
Baidu Waimai Technology Team
The Baidu Waimai Technology Team supports and drives the company's business growth. This account provides a platform for engineers to communicate, share, and learn. Follow us for team updates, top technical articles, and internal/external open courses.
