Building a Full‑Stack Operations Monitoring System: Strategies, Implementation & Lessons

This article describes the end-to-end design and deployment of an operations-focused monitoring framework for large-scale online services. It covers semantic, log, and performance monitoring, the use of protobuf for log data, the challenges met during implementation, the results observed, and directions for future optimization.

Baidu Waimai Technology Team

Jul 4, 2017

Overview

The article presents a complete workflow for constructing a monitoring system tailored to the operations side of an online service, aiming to achieve alerting for entry points, performance reporting, and graceful degradation.

Why Add Monitoring

Before monitoring, failures were discovered passively through user reports, leading to delayed fixes and reduced service availability. Introducing monitoring addresses three key problems: detecting faulty entry points early via semantic checks, identifying runtime errors through log analysis, and tracking performance metrics for proactive optimization.

Monitoring Scheme and Implementation Details

The solution is divided into three sub‑domains:

Semantic Monitoring – validates the completeness and timeliness of API responses. It includes field‑level (presence) checks and content‑level (value) checks, with alerts sent via email or SMS. Implementation involves configuring entry‑specific monitoring cases on the Numen platform and defining URL/parameter mappings.

Log Monitoring – leverages the Noah monitoring platform to watch critical nodes such as database operations or third‑party calls. Rules are defined by log file paths, match strings, and frequency thresholds; alerts trigger when patterns exceed defined limits.

Performance Monitoring – generates daily reports containing request counts, average/max/min latency, and percentile‑based slow‑request statistics. Data is sourced from ad‑hoc queries on protobuf (pb) logs, parsed via proto definitions, stored in MySQL tables, and accessed through API endpoints for report generation.
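As a rough illustration of the statistics such a daily report contains, the sketch below computes request count, average/max/min latency, and percentile latencies from a list of per-request latencies. The function names, the millisecond unit, the nearest-rank percentile method, and the choice of the 90th and 99th percentiles are assumptions made for this example; the source does not say how the team computed them.

```python
import math
from typing import Dict, Sequence


def percentile(sorted_values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile of an already sorted, non-empty sequence."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def daily_report(latencies_ms: Sequence[float], slow_pcts=(90, 99)) -> Dict[str, float]:
    """Summarize one day's request latencies (hypothetical report layout)."""
    if not latencies_ms:
        raise ValueError("no requests recorded for this day")
    ordered = sorted(latencies_ms)
    report = {
        "count": len(ordered),
        "avg_ms": sum(ordered) / len(ordered),
        "max_ms": ordered[-1],
        "min_ms": ordered[0],
    }
    # Percentiles describe the slow tail better than the average does.
    for pct in slow_pcts:
        report[f"p{pct}_ms"] = percentile(ordered, pct)
    return report


if __name__ == "__main__":
    print(daily_report([120, 80, 100, 400, 300]))
```

In a pipeline like the one described, the input list would come from the parsed pb logs and the resulting dictionary would be written to a MySQL row per entry point per day.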

Figures illustrating the monitoring flow, semantic monitoring lists, field‑level and content‑level examples, log monitoring configuration, and performance dashboards are included.
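The two alerting sub-domains, semantic monitoring and log monitoring, both come down to evaluating simple rules. The sketch below shows one possible shape for a field-level check, a content-level check, and a log rule made of a path, a match string, and a frequency threshold. It does not use the Numen or Noah platforms' actual configuration formats, which the source does not show; every name here (`semantic_alerts`, `LogRule`, `log_alert`, the sample fields and path) is hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable, List


def semantic_alerts(
    payload: Dict[str, Any],
    required_fields: Iterable[str],
    content_rules: Dict[str, Callable[[Any], bool]],
) -> List[str]:
    """Return alert messages for one API response."""
    # Field-level check: every required field must be present.
    alerts = [f"missing field: {name}" for name in required_fields if name not in payload]
    # Content-level check: fields that are present must hold acceptable values.
    for name, is_valid in content_rules.items():
        if name in payload and not is_valid(payload[name]):
            alerts.append(f"invalid content: {name}")
    return alerts


@dataclass
class LogRule:
    path: str        # log file the rule watches (hypothetical)
    match: str       # substring that marks a problem line
    threshold: int   # alert when matches in one window exceed this count


def log_alert(rule: LogRule, window_lines: Iterable[str]) -> bool:
    """True when the match string appears more often than the threshold allows."""
    hits = sum(1 for line in window_lines if rule.match in line)
    return hits > rule.threshold


if __name__ == "__main__":
    response = {"shop_list": [], "status": 0}
    rules = {"shop_list": lambda v: len(v) > 0, "status": lambda v: v == 0}
    print(semantic_alerts(response, ["shop_list", "status", "banner"], rules))
    rule = LogRule(path="/hypothetical/app.log", match="DB_ERROR", threshold=2)
    print(log_alert(rule, ["DB_ERROR a", "DB_ERROR b", "DB_ERROR c"]))
```

A real deployment would send the returned alerts by email or SMS, as the article describes, rather than printing them.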

Protobuf Usage

Protobuf serves as the serialization protocol for operation logs. Developers define .proto files describing messages (e.g., Order), compile them to generate language‑specific classes, and use these classes to parse binary logs. This enables efficient storage and retrieval of structured data for monitoring.
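To make the parsing step more concrete, here is a minimal pure-Python sketch of what generated protobuf classes do when they read a binary record: walk the wire format, split each tag into a field number and wire type, and map field numbers back to names. In practice the team would use the classes produced by the protobuf compiler, not a hand-written decoder. The `Order` layout assumed here (field 1 `order_id` as a varint, field 2 `status` as a string) is invented for illustration, and only the varint and length-delimited wire types are handled.

```python
from typing import Dict, Tuple, Union

# Hypothetical schema: field number -> field name for an Order message.
ORDER_FIELDS = {1: "order_id", 2: "status"}


def read_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode a base-128 varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear marks the last byte
            return result, pos
        shift += 7


def decode_message(buf: bytes) -> Dict[int, Union[int, bytes]]:
    """Decode one message into {field_number: raw value}."""
    fields: Dict[int, Union[int, bytes]] = {}
    pos = 0
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        field_number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:  # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:  # length-delimited (strings, bytes, nested messages)
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length]
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        fields[field_number] = value
    return fields


def parse_order(buf: bytes) -> Dict[str, Union[int, str]]:
    """Map raw fields onto the hypothetical Order schema."""
    parsed = {}
    for number, value in decode_message(buf).items():
        if number in ORDER_FIELDS:
            parsed[ORDER_FIELDS[number]] = value.decode("utf-8") if isinstance(value, bytes) else value
    return parsed


if __name__ == "__main__":
    record = bytes.fromhex("089601120470616964")  # order_id=150, status="paid"
    print(parse_order(record))
```

Because field numbers rather than names are stored on the wire, records stay compact, which fits the article's point about efficient storage of structured log data.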

Challenges and Solutions

One issue during implementation was handling string-typed fields (e.g., every_time) in ad-hoc queries. The queries run across Hive, Impala, MySQL, and PostgreSQL, and these engines differ in their SQL syntax for string processing. The team resolved this by consulting each engine's built-in string functions and normalizing the processing for each one.

Results and Future Work

Monitoring has moved from non‑existent to a functional baseline, providing alerts, performance insights, and a framework for future enhancements. Planned improvements include automated degradation strategies, richer visualizations (charts, graphs), and packaging the monitoring capabilities as a reusable service for other product lines.
