How to Build a Highly Available, Stable, and Observable SMS Service
This article explains how to design a highly available SMS system: identifying stability bottlenecks, defining reliability goals, implementing failover strategies for Redis, MySQL, and external services, establishing a comprehensive observability framework, and measuring key quality metrics to sustain 99.99% uptime.
Background
SMS is widely used in user registration, password recovery, account changes, payment confirmation, activity verification, and marketing. The article focuses on improving SMS system high availability and observability.
Current Issues
The core SMS workflow depends heavily on external resources such as downstream services and MySQL. A failure in any of these causes a complete service outage, severely affecting stability. The system also relies on multiple third-party providers; without quality evaluation, channel anomalies can go undetected for as long as a day, leading to business loss.
Improvement Goals
Increase SMS service stability to detect faults quickly and maintain interface availability above 99.99%.
Enhance observability by establishing quality monitoring and evaluation for multiple providers to detect channel anomalies promptly.
Overall Solution
The SMS system consists of two parts: the SMS service and the SMS metrics observation module.
SMS Service
Provides core capabilities such as sending verification codes, code validation, delivery receipts, and upstream message handling. The stability risk points are strong dependencies on the downstream encryption service, Redis, and MySQL.
Service dependency: Downstream encryption service failure leads to phone number encryption failure and terminates the verification flow.
Redis dependency: Redis failure prevents requestID generation, ending the verification flow.
MySQL dependency: MySQL failure blocks query, update, and record storage, ending the verification flow.
Metrics Observation Module
Offers metric calculation, visualization, and alerting. Current observability gaps:
Missing core metrics such as SMS fill‑rate and delivery‑rate, making it impossible to evaluate third‑party provider quality.
Visualization is not user‑friendly and lacks sufficient dimensions.
No alert mechanism for metric anomalies, preventing timely quality awareness.
Optimization Ideas
Replace Redis‑based unique ID generation with a UUID algorithm.
Decouple services using a message queue.
Introduce redundant storage (e.g., Redis) for MySQL disaster recovery.
Refine quality monitoring by defining new metrics and improving data collection.
Design Practices
Eliminate Redis Strong Dependency
Redis is only used to generate a globally unique ID; replace it with UUID.
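Generating the ID locally removes the Redis round-trip entirely. A minimal sketch of this replacement, using only the standard library to build an RFC 4122 version-4 UUID (the function name `newRequestID` is illustrative, not from the original system):

```go
package main

import (
	"crypto/rand"
	"fmt"
)

// newRequestID generates a random version-4 UUID locally, so unique
// request IDs no longer require a call to Redis.
func newRequestID() (string, error) {
	var b [16]byte
	if _, err := rand.Read(b[:]); err != nil {
		return "", err
	}
	b[6] = (b[6] & 0x0f) | 0x40 // set version 4
	b[8] = (b[8] & 0x3f) | 0x80 // set RFC 4122 variant
	return fmt.Sprintf("%x-%x-%x-%x-%x", b[0:4], b[4:6], b[6:8], b[8:10], b[10:16]), nil
}

func main() {
	id, err := newRequestID()
	if err != nil {
		panic(err)
	}
	fmt.Println(id) // e.g. 9f1c2d3e-4a5b-4c6d-8e7f-0a1b2c3d4e5f
}
```

The trade-off is that UUIDs are random rather than sequential, which is acceptable here because the ID only needs to be unique, not ordered.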
Eliminate Service Strong Dependency
Use a message queue to decouple the SMS service from downstream encryption services.
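The shape of this decoupling can be sketched with a buffered channel standing in for the real message queue (a production system would publish to Kafka, RocketMQ, or similar); the task type and function names here are illustrative:

```go
package main

import "fmt"

// encryptTask carries a phone number to be encrypted asynchronously.
type encryptTask struct{ phone string }

// enqueueEncrypt publishes the task and returns immediately, so a slow
// or failed encryption service no longer blocks the SMS send path.
func enqueueEncrypt(q chan<- encryptTask, phone string) {
	q <- encryptTask{phone: phone}
}

// consume drains the queue, applying encrypt to each task; in the real
// system this loop would live in a separate consumer process.
func consume(q <-chan encryptTask, encrypt func(string) string) []string {
	var out []string
	for t := range q {
		out = append(out, encrypt(t.phone))
	}
	return out
}

func main() {
	q := make(chan encryptTask, 1024)
	done := make(chan []string)
	go func() { done <- consume(q, func(p string) string { return "enc(" + p + ")" }) }()
	enqueueEncrypt(q, "13800000000")
	close(q)
	fmt.Println(<-done)
}
```

The key property is that `enqueueEncrypt` returns as soon as the message is accepted by the queue, converting the strong synchronous dependency into an eventual one.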
Eliminate MySQL Strong Dependency
Introduce Redis as a redundant storage layer; during MySQL failures, execute equivalent Redis commands for disaster recovery.
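A minimal sketch of that fallback write path, with in-memory stubs in place of the real MySQL and Redis clients (the `Store` interface and function names are assumptions for illustration):

```go
package main

import (
	"errors"
	"fmt"
)

// Store abstracts a verification-record store; in production the two
// implementations would wrap the MySQL and Redis clients.
type Store interface {
	Save(key, val string) error
}

// mapStore is an in-memory stand-in used here for illustration.
type mapStore struct {
	data map[string]string
	down bool // simulate a fatal failure
}

func (m *mapStore) Save(key, val string) error {
	if m.down {
		return errors.New("store unavailable")
	}
	m.data[key] = val
	return nil
}

// saveWithFallback tries MySQL first; when it fails, it executes the
// equivalent write against the Redis layer so the flow can continue.
func saveWithFallback(mysql, redis Store, key, val string) error {
	if err := mysql.Save(key, val); err != nil {
		return redis.Save(key, val)
	}
	return nil
}

func main() {
	mysql := &mapStore{data: map[string]string{}, down: true}
	redis := &mapStore{data: map[string]string{}}
	if err := saveWithFallback(mysql, redis, "req-1", "code=1234"); err != nil {
		panic(err)
	}
	fmt.Println(redis.data["req-1"]) // the record survived the MySQL outage
}
```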
MySQL Failure Detection & Recovery
Detect MySQL status via error codes and frequency analysis; recovery relies on manual intervention and alerting.
Redis Failure Detection & Recovery
Detect Redis status by parsing error messages and tracking their frequency; recovery likewise depends on manual intervention, driven by alerts.
Failover State Object
<code>// State is the failover object: it selects behavior based on the
// current health of MySQL and Redis (state pattern).
type State struct {
	acquireStatus     func() Status               // Get current failover status
	setMySQLFatalFlag func(context.Context) error // Enable the MySQL fatal flag
	setRedisFatalFlag func(context.Context) error // Enable the Redis fatal flag
	runMaster         func(context.Context) error // Execute when MySQL is healthy
	runBackup         func(context.Context) error // Execute against Redis when MySQL is down
	recordSQL         func()                      // Log SQL during a MySQL failure for later replay
}

// Run selects the processing flow based on the current status
func (s *State) Run(ctx context.Context) error {
	var fn func(context.Context) error
	switch s.acquireStatus() {
	case StatusHealthy:
		fn = s.runMaster
	case StatusMysqlFatal:
		fn = s.runBackup
	case StatusRedisFatal:
		fn = s.runMaster // Redis is down, but the MySQL path still works
	default:
		fn = s.runBackup
	}
	return fn(ctx)
}
</code>SQL Disaster Recovery
Use a state‑pattern failover object to trigger different behaviors based on MySQL/Redis health, avoiding repetitive conditional code.
Failure Record Recovery
During a failure, the recordSQL function writes the affected SQL statements to disk; after MySQL recovers, they can be replayed to restore the data.
Quality Observation System
Added ten new metrics, including third‑party success rate, receipt rate, delivery rate, and fill‑rate. Their formulas and meanings are defined to evaluate provider stability and user verification effectiveness.
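As a sketch of how such metrics reduce to counter ratios, the snippet below uses conventional definitions (submitted/requested, delivered/submitted, verified/delivered); the article defines its own formulas, which may differ in detail, and the struct and field names here are illustrative:

```go
package main

import "fmt"

// smsStats holds per-provider counters collected over a reporting window.
type smsStats struct {
	requested int // send requests accepted by the SMS service
	submitted int // successfully submitted to the third-party provider
	delivered int // delivery receipts received from the carrier
	verified  int // codes the user actually entered and validated
}

// ratio guards against division by zero for empty windows.
func ratio(num, den int) float64 {
	if den == 0 {
		return 0
	}
	return float64(num) / float64(den)
}

func (s smsStats) successRate() float64  { return ratio(s.submitted, s.requested) }
func (s smsStats) deliveryRate() float64 { return ratio(s.delivered, s.submitted) }
func (s smsStats) fillRate() float64     { return ratio(s.verified, s.delivered) }

func main() {
	st := smsStats{requested: 1000, submitted: 990, delivered: 950, verified: 900}
	fmt.Printf("success=%.3f delivery=%.3f fill=%.3f\n",
		st.successRate(), st.deliveryRate(), st.fillRate())
}
```

Computing each rate per provider and per carrier is what makes channel-level comparison, and therefore anomaly alerting, possible.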
Metrics Collection & Visualization Architecture
The collected data feeds daily quality reports, recent trend charts per carrier, and monitoring alerts, providing comprehensive visibility into SMS service quality.
Benefits
The SMS service now sustains 99.99% interface availability, with no major incidents in the past year.
Established a quality observation system that detects and resolves channel anomalies within 20 minutes.
Future Outlook
Fine‑grained regional SMS operations: coverage is global, but some regions still need quality improvements.
Automation of analysis tools: current post‑degradation analysis is time‑consuming and requires manual effort.
Inke Technology
Official account of Inke Technology