Building an Observability System for Cloud Authentication: Practices, Metrics, and Lessons Learned
This article details how 58 Group’s cloud authentication service introduced an observability framework—optimizing logs, employing distributed tracing, defining SLO/SLA metrics, and implementing burn‑rate alerts—to improve fault detection, reduce false alarms, and achieve faster root‑cause analysis across the system.
Observability originated in control theory as the ability to infer a system's internal state from its external outputs; in computing, it describes how well a system's health and faults can be understood from the telemetry it emits. Its three pillars—logs, tracing, and metrics—are interrelated yet distinct, and together form the foundation of a robust observability system.
In the context of 58 Group’s cloud authentication service, several challenges were identified: difficulty in troubleshooting due to long, multi‑service flows; high false‑alarm rates caused by traffic spikes or user retries; and a lack of a global view of system health.
To address these, the team introduced a comprehensive observability architecture, focusing on three main areas:
3.1 Optimizing Logs
The existing logging was fragmented with multiple formats. A new unified logging format was designed, categorizing logs into monitoring logs, Hive logs, and cluster logs, each with standardized fields such as request IDs and parameters. Result codes were split into res_code (overall request result) and biz_code (business‑level result) to distinguish internal failures from external input errors.
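As a sketch of what one such unified monitoring-log record might look like (the field names and codes here are illustrative, not 58 Group's actual schema):

```python
import json
import time
import uuid

def make_monitor_log(service: str, params: dict, res_code: int, biz_code: int) -> str:
    """Build one monitoring-log line in a standardized JSON format.

    res_code carries the overall request result (0 = success), while biz_code
    carries the business-level result, so failures caused by bad external
    input are not counted as internal faults of the service.
    """
    record = {
        "request_id": uuid.uuid4().hex,   # correlates all log lines of one request
        "ts": int(time.time() * 1000),    # epoch milliseconds
        "service": service,
        "params": params,
        "res_code": res_code,
        "biz_code": biz_code,
    }
    return json.dumps(record, ensure_ascii=False)

# A request that succeeded internally but was rejected for a business reason:
print(make_monitor_log("auth-gateway", {"user": "u123"}, res_code=0, biz_code=1001))
```

Splitting the result into two codes is what lets the success-rate SLI exclude caller errors while still recording them for diagnosis.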
3.2 Using Distributed Tracing
Log data alone could not provide end‑to‑end request visibility, so the internal tracing tool wtrace was adopted. By correlating logs with trace IDs, the full request flow—including input parameters, latency, and exception stacks—could be reconstructed, enabling precise debugging across service boundaries.
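wtrace itself is an internal tool, but the correlation idea can be sketched with standard-library pieces: carry a trace id in per-request context and stamp it onto every log line, so logs from different services can be joined on that id (the names below are illustrative, not wtrace's real API):

```python
import contextvars
import logging
import uuid

# Stand-in for the propagated trace context (wtrace's actual API is internal).
trace_id_var = contextvars.ContextVar("trace_id", default="-")

class TraceIdFilter(logging.Filter):
    """Inject the current request's trace id into every log record."""
    def filter(self, record):
        record.trace_id = trace_id_var.get()
        return True

logger = logging.getLogger("auth")
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(trace_id)s %(levelname)s %(message)s"))
handler.addFilter(TraceIdFilter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

def handle_request():
    # In practice the trace id arrives from the upstream caller's headers;
    # here we mint one to show the correlation.
    trace_id_var.set(uuid.uuid4().hex)
    logger.info("token validated")   # this line now carries the trace id

handle_request()
```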
3.3 Optimizing Monitoring Metrics
Metrics were divided into core SLI/SLO indicators and auxiliary metrics. Core SLOs (e.g., authentication success rate) trigger alerts, while auxiliary metrics help diagnose the root cause when an SLO breach occurs. This two‑tier approach allows rapid pinpointing of issues.
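The two-tier split can be sketched as follows; only a core SLO breach produces an alert, and the auxiliary metrics are attached as diagnostic leads (the metric names and the 99.9% target are illustrative, not the service's real configuration):

```python
# Core SLOs page on breach; auxiliary metrics are only consulted afterwards
# to localize the root cause. All names here are hypothetical examples.
CORE_SLOS = {"auth_success_rate": 0.999}
AUXILIARY = ["downstream_latency_p99", "token_cache_hit_rate", "upstream_5xx_rate"]

def check(metrics: dict) -> list:
    """Return an alert record for each breached core SLO, with the list of
    auxiliary metrics to inspect during root-cause analysis."""
    alerts = []
    for name, target in CORE_SLOS.items():
        if metrics[name] < target:
            alerts.append({
                "slo": name,
                "observed": metrics[name],
                "target": target,
                "inspect": AUXILIARY,
            })
    return alerts

print(check({"auth_success_rate": 0.995}))
```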
4 Solving False Alarms
False alarms stemmed from traffic jitter and low-traffic scenarios. Inspired by Google SRE practice, burn-rate alerts were introduced, linking alert thresholds to the error budget. The burn rate is calculated as error_rate / (1 - SLO), indicating how quickly the error budget is being consumed.
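In code the ratio is straightforward (the 99.9% SLO below is just an example figure):

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than 'sustainable' the error budget is spent.

    A burn rate of 1 consumes the budget exactly over the SLO period;
    14.4 would exhaust a 30-day budget in about two days.
    """
    return error_rate / (1 - slo)

# With a 99.9% SLO the error budget is 0.1%, so a 1.44% observed error rate
# burns the budget 14.4 times too fast.
print(round(burn_rate(0.0144, 0.999), 1))  # 14.4
```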
Formulas used:
detection_time = ((1 - SLO) / error_ratio) * alert_window * burn_rate
error_budget_consumed = (burn_rate * alert_window) / time_window

Alert expressions combine short- and long-window burn-rate checks, e.g.:

( error_ratio_rate1h > 14.4 and error_ratio_rate5m > 14.4 ) or ( error_ratio_rate6h > 6 and error_ratio_rate30m > 6 ) or ...

Additional strategies include higher burn-rate thresholds for low-traffic services and distinguishing internal from external exceptions to reduce mis-alerts.
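Putting the formulas and the multiwindow expression together in a small sketch (window sizes in hours; the shape of the `rates` input is an assumption for illustration, not wtrace's or Algalon's real API):

```python
def detection_time(slo: float, error_ratio: float, alert_window_h: float,
                   burn_rate: float) -> float:
    """Hours until an alert at the given burn-rate threshold would fire."""
    return ((1 - slo) / error_ratio) * alert_window_h * burn_rate

def budget_consumed(burn_rate: float, alert_window_h: float,
                    period_h: float) -> float:
    """Fraction of the period's error budget spent by the time the alert fires."""
    return (burn_rate * alert_window_h) / period_h

def should_alert(rates: dict) -> bool:
    """Multiwindow, multi-burn-rate check: both the long and the short window
    must exceed the threshold, so a spike that has already subsided does not
    page anyone."""
    return ((rates["1h"] > 14.4 and rates["5m"] > 14.4)
            or (rates["6h"] > 6 and rates["30m"] > 6))

# At burn rate 14.4 over a 1 h window, ~2% of a 30-day (720 h) budget is gone.
print(round(budget_consumed(14.4, 1, 720), 3))  # 0.02

# A spike that already ended: the short windows are quiet, so no page.
print(should_alert({"1h": 15.0, "5m": 0.2, "6h": 2.0, "30m": 0.1}))  # False
```

Requiring both windows to breach is what suppresses the traffic-jitter false alarms described above, while the long window still guarantees sustained burns are caught.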
4.3 Alert Management Platform
The internal platform Algalon integrates SLO management, burn‑rate alert configuration, and end‑to‑end incident handling. After implementation, alarm volume dropped from 445 to 3 in two weeks, improving precision from <1% to over 95%.
5 Global Control and Dashboards
Comprehensive dashboards now display SLO fulfillment rates, error‑budget consumption, and daily trends, providing a clear overview of system health and facilitating proactive management.
6 Overall Summary
The observability system standardized logging with a custom client, leveraged wtrace for distributed tracing, and built SLO‑centric monitoring with auxiliary metrics and burn‑rate alerts. While the solution greatly improved fault detection and reduced false alarms, challenges remain in log collection intrusiveness, configuration complexity, and low‑traffic alert tuning.
References and author bios are listed at the end of the original article.
58 Tech
Official tech channel of 58, a platform for tech innovation, sharing, and communication.