Building an Observability System for Cloud Authentication: Practices, Metrics, and Lessons Learned
This article details how 58 Group’s cloud authentication service introduced an observability framework—optimizing logs, employing distributed tracing, defining SLO/SLA metrics, and implementing burn‑rate alerts—to improve fault detection, reduce false alarms, and achieve faster root‑cause analysis across the system.
Observability originated in control theory as the ability to infer a system's internal state from its external outputs; in computing, it describes how well a system's health and faults can be understood from the telemetry it emits. Its three pillars—logs, tracing, and metrics—are interrelated yet distinct, and together form the foundation of a robust observability system.
In the context of 58 Group’s cloud authentication service, several challenges were identified: difficulty in troubleshooting due to long, multi‑service flows; high false‑alarm rates caused by traffic spikes or user retries; and a lack of a global view of system health.
To address these, the team introduced a comprehensive observability architecture, focusing on three main areas:
3.1 Optimizing Logs
The existing logging was fragmented with multiple formats. A new unified logging format was designed, categorizing logs into monitoring logs, Hive logs, and cluster logs, each with standardized fields such as request IDs and parameters. Result codes were split into res_code (overall request result) and biz_code (business‑level result) to distinguish internal failures from external input errors.
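As a sketch of what one such unified monitoring-log record might look like (the field names and codes here are illustrative, not 58 Group's actual schema):

```python
import json
import time
import uuid

def make_monitor_log(service: str, params: dict, res_code: int, biz_code: int) -> str:
    """Build one monitoring-log line in a standardized JSON format.

    res_code carries the overall request result (0 = success), while biz_code
    carries the business-level result, so failures caused by bad external
    input are not counted as internal faults of the service.
    """
    record = {
        "request_id": uuid.uuid4().hex,   # correlates all log lines of one request
        "ts": int(time.time() * 1000),    # epoch milliseconds
        "service": service,
        "params": params,
        "res_code": res_code,
        "biz_code": biz_code,
    }
    return json.dumps(record, ensure_ascii=False)

# A request that succeeded internally but was rejected for a business reason:
print(make_monitor_log("auth-gateway", {"user": "u123"}, res_code=0, biz_code=1001))
```

Splitting the result into two codes is what lets the success-rate SLI exclude caller errors while still recording them for diagnosis.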
3.2 Using Distributed Tracing
Log data alone could not provide end‑to‑end request visibility, so the internal tracing tool wtrace was adopted. By correlating logs with trace IDs, the full request flow—including input parameters, latency, and exception stacks—could be reconstructed, enabling precise debugging across service boundaries.
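wtrace itself is an internal tool, but the correlation idea can be sketched with standard-library pieces: carry a trace id in per-request context and stamp it onto every log line, so logs from different services can be joined on that id (the names below are illustrative, not wtrace's real API):

```python
import contextvars
import logging
import uuid

# Stand-in for the propagated trace context (wtrace's actual API is internal).
trace_id_var = contextvars.ContextVar("trace_id", default="-")

class TraceIdFilter(logging.Filter):
    """Inject the current request's trace id into every log record."""
    def filter(self, record):
        record.trace_id = trace_id_var.get()
        return True

logger = logging.getLogger("auth")
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(trace_id)s %(levelname)s %(message)s"))
handler.addFilter(TraceIdFilter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

def handle_request():
    # In practice the trace id arrives from the upstream caller's headers;
    # here we mint one to show the correlation.
    trace_id_var.set(uuid.uuid4().hex)
    logger.info("token validated")   # this line now carries the trace id

handle_request()
```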
3.3 Optimizing Monitoring Metrics
Metrics were divided into core SLI/SLO indicators and auxiliary metrics. Core SLOs (e.g., authentication success rate) trigger alerts, while auxiliary metrics help diagnose the root cause when an SLO breach occurs. This two‑tier approach allows rapid pinpointing of issues.
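The two-tier split can be sketched as follows; only a core SLO breach produces an alert, and the auxiliary metrics are attached as diagnostic leads (the metric names and the 99.9% target are illustrative, not the service's real configuration):

```python
# Core SLOs page on breach; auxiliary metrics are only consulted afterwards
# to localize the root cause. All names here are hypothetical examples.
CORE_SLOS = {"auth_success_rate": 0.999}
AUXILIARY = ["downstream_latency_p99", "token_cache_hit_rate", "upstream_5xx_rate"]

def check(metrics: dict) -> list:
    """Return an alert record for each breached core SLO, with the list of
    auxiliary metrics to inspect during root-cause analysis."""
    alerts = []
    for name, target in CORE_SLOS.items():
        if metrics[name] < target:
            alerts.append({
                "slo": name,
                "observed": metrics[name],
                "target": target,
                "inspect": AUXILIARY,
            })
    return alerts

print(check({"auth_success_rate": 0.995}))
```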
4 Solving False Alarms
False alarms stemmed from traffic jitter and low-traffic scenarios. Inspired by Google SRE practice, burn-rate alerts were introduced, linking alert thresholds to the error budget. The burn rate is calculated as error_rate / (1 - SLO), indicating how quickly the error budget is being consumed.
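In code the ratio is straightforward (the 99.9% SLO below is just an example figure):

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than 'sustainable' the error budget is spent.

    A burn rate of 1 consumes the budget exactly over the SLO period;
    14.4 would exhaust a 30-day budget in about two days.
    """
    return error_rate / (1 - slo)

# With a 99.9% SLO the error budget is 0.1%, so a 1.44% observed error rate
# burns the budget 14.4 times too fast.
print(round(burn_rate(0.0144, 0.999), 1))  # 14.4
```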
Formulas used:
detection_time = ((1 - SLO) / error_ratio) * alert_window * burn_rate
error_budget_consumed = (burn_rate * alert_window) / time_window

Alert expressions combine short- and long-window burn-rate checks, e.g.:

( error_ratio_rate1h > 14.4 and error_ratio_rate5m > 14.4 ) or ( error_ratio_rate6h > 6 and error_ratio_rate30m > 6 ) or ...

Additional strategies include higher burn-rate thresholds for low-traffic services and distinguishing internal from external exceptions to reduce mis-alerts.
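Putting the formulas and the multiwindow expression together in a small sketch (window sizes in hours; the shape of the `rates` input is an assumption for illustration, not wtrace's or Algalon's real API):

```python
def detection_time(slo: float, error_ratio: float, alert_window_h: float,
                   burn_rate: float) -> float:
    """Hours until an alert at the given burn-rate threshold would fire."""
    return ((1 - slo) / error_ratio) * alert_window_h * burn_rate

def budget_consumed(burn_rate: float, alert_window_h: float,
                    period_h: float) -> float:
    """Fraction of the period's error budget spent by the time the alert fires."""
    return (burn_rate * alert_window_h) / period_h

def should_alert(rates: dict) -> bool:
    """Multiwindow, multi-burn-rate check: both the long and the short window
    must exceed the threshold, so a spike that has already subsided does not
    page anyone."""
    return ((rates["1h"] > 14.4 and rates["5m"] > 14.4)
            or (rates["6h"] > 6 and rates["30m"] > 6))

# At burn rate 14.4 over a 1 h window, ~2% of a 30-day (720 h) budget is gone.
print(round(budget_consumed(14.4, 1, 720), 3))  # 0.02

# A spike that already ended: the short windows are quiet, so no page.
print(should_alert({"1h": 15.0, "5m": 0.2, "6h": 2.0, "30m": 0.1}))  # False
```

Requiring both windows to breach is what suppresses the traffic-jitter false alarms described above, while the long window still guarantees sustained burns are caught.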
4.3 Alert Management Platform
The internal platform Algalon integrates SLO management, burn‑rate alert configuration, and end‑to‑end incident handling. After implementation, alarm volume dropped from 445 to 3 in two weeks, improving precision from <1% to over 95%.
5 Global Control and Dashboards
Comprehensive dashboards now display SLO fulfillment rates, error‑budget consumption, and daily trends, providing a clear overview of system health and facilitating proactive management.
6 Overall Summary
The observability system standardized logging with a custom client, leveraged wtrace for distributed tracing, and built SLO‑centric monitoring with auxiliary metrics and burn‑rate alerts. While the solution greatly improved fault detection and reduced false alarms, challenges remain in log collection intrusiveness, configuration complexity, and low‑traffic alert tuning.
References and author bios are listed at the end of the original article.
58 Tech
Official tech channel of 58, a platform for tech innovation, sharing, and communication.