How LApiGateway Achieves 99.999% Uptime: Architecture, SLA & Risk Mitigation
LApiGateway, Huolala's internal micro‑service gateway, achieves five‑nines availability through a dual‑plane architecture, comprehensive monitoring, a clear SLA definition, risk classification, heartbeat health checks, traffic‑migration strategies, strict change governance, and regular fault drills, all of which are detailed in this technical overview.
LApiGateway Overview
LApiGateway is the internal micro‑service gateway of Huolala, responsible for traffic forwarding and providing features such as authentication, rate‑limiting, parameter modification and validation to improve developer efficiency.
Architecture
The gateway consists of a control plane and a data plane.
Control Plane
Service configuration is managed by the LApi Management Platform and the Apollo configuration center.
Service discovery runs on Consul, which provides node registration info (IP, group, gray version, etc.).
Monitoring combines the Trace service and HLL Monitor for request monitoring and alerting.
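As an illustration of the service-discovery step, the sketch below extracts node registration info (IP, group, gray version) from the payload shape returned by Consul's `/v1/health/service/<name>?passing` HTTP endpoint. The `group` and `gray_version` `Meta` keys are assumptions for illustration, not LApi's actual registration schema.

```python
# Sketch: turning a Consul health-service payload into a node list.
# The Meta keys "group" and "gray_version" are hypothetical examples.
def parse_nodes(payload):
    nodes = []
    for entry in payload:
        svc = entry["Service"]
        nodes.append({
            "ip": svc["Address"],
            "port": svc["Port"],
            "group": svc.get("Meta", {}).get("group"),
            "gray_version": svc.get("Meta", {}).get("gray_version"),
        })
    return nodes

# Minimal example payload in Consul's response shape.
sample = [{"Service": {"Address": "10.0.0.1", "Port": 8080,
                       "Meta": {"group": "g1", "gray_version": "v2"}}}]
```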
Data Plane
Requests enter through load balancers (KONG, SLB) and pass through LApi nodes, where a series of plugins process them before they are forwarded to downstream services. Plugins include account authentication (which depends on the Account Service) and SSO authentication (which depends on the SSO Service).
During request processing, LApi may rely on:
Account Service – user authentication.
Kafka – persisting request‑generated messages.
Lone – publishing windows and service permission management.
SSO Service – employee authentication.
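The data-plane flow above can be sketched as a plugin chain that each request passes through before forwarding. The plugin and type names below are illustrative, not LApi's real API; the authentication plugin stands in for a call to the Account Service.

```python
# A minimal sketch of a gateway plugin chain: each plugin either
# transforms the request or rejects it before forwarding.
from dataclasses import dataclass, field

@dataclass
class Request:
    path: str
    headers: dict = field(default_factory=dict)

class AccountAuthPlugin:
    """Stand-in for account authentication via the Account Service."""
    def process(self, req):
        if "X-User-Token" not in req.headers:
            raise PermissionError("unauthenticated")
        return req

class RateLimitPlugin:
    """Rejects requests beyond a fixed per-instance budget."""
    def __init__(self, limit):
        self.limit = limit
        self.count = 0
    def process(self, req):
        self.count += 1
        if self.count > self.limit:
            raise RuntimeError("rate limited")
        return req

def forward(req, plugins):
    for p in plugins:          # run the plugin chain in order
        req = p.process(req)
    return f"forwarded {req.path} downstream"
```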
SLA Definition
The SLA is defined by the “availability percentage”: the success rate of proxied requests within a calculation period, excluding failures not caused by LApi itself (for example, downstream service errors). A calculation period is 5 minutes, so there are 105,120 periods per year.
Achieving five‑nines (99.999 %) means the total unavailable time in a year must stay below roughly 5 minutes, i.e., about one full 5‑minute calculation period out of 105,120.
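The arithmetic behind these figures is straightforward to check:

```python
# Number of 5-minute calculation periods per (non-leap) year,
# and the downtime budget implied by 99.999% availability.
PERIOD_MIN = 5
minutes_per_year = 365 * 24 * 60            # 525,600 minutes
periods_per_year = minutes_per_year // PERIOD_MIN   # 105,120 periods
downtime_budget_min = minutes_per_year * (1 - 0.99999)  # ~5.26 minutes
```

So the annual downtime budget is just over one calculation period, which is why the article states the target as "less than 5 minutes".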
Challenges and Solutions
External Risks
Uncontrollable factors such as ECS instance failures, network jitter, or traffic attacks. Mitigation relies on rapid recovery and health‑check mechanisms.
Node heartbeat checks:
KONG TCP connection heartbeat (~9 s detection).
Consul heartbeat (~6 s detection).
In a 4‑node cluster, a single failed node would otherwise fail roughly 25 % of requests for an entire 5‑minute period, giving 75 % availability; heartbeat detection shrinks the failure window to seconds, raising availability to 99.25 % for KONG traffic (~9 s detection) and 99.50 % for SOA traffic (~6 s detection).
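These percentages follow from a simple model: with one of four nodes down, a quarter of requests fail until the failure is detected and the node is removed.

```python
# Availability of a 5-minute (300 s) window with 1 of 4 nodes down,
# as a function of how long the failure goes undetected.
def availability(detect_s, failed_nodes=1, total_nodes=4, window_s=300):
    # Until detection, failed_nodes/total_nodes of requests fail.
    failure_fraction = (detect_s / window_s) * (failed_nodes / total_nodes)
    return 1 - failure_fraction

assert round(availability(300), 4) == 0.75    # no detection: whole window
assert round(availability(9), 4) == 0.9925    # KONG TCP heartbeat, ~9 s
assert round(availability(6), 4) == 0.9950    # Consul heartbeat, ~6 s
```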
Cluster Faults
If more than half of the nodes fail, simply removing the faulty nodes concentrates the load on the survivors and can cause a total outage. Instead, traffic must be migrated to a healthy cluster within minutes.
Migration steps:
Detect fault via LApi Management Platform and Consul service registry.
Shift traffic to a reserve cluster group with spare capacity.
Complete migration within 2–3 minutes (goal: <30 s with full automation).
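The decision rule behind the steps above can be sketched as follows. The function and its inputs are hypothetical simplifications of what the LApi Management Platform does with Consul health data.

```python
# Sketch: decide between removing faulty nodes and migrating the whole
# cluster's traffic to a reserve group, per the rule described above.
def choose_routing(health):
    """health maps each node of the active cluster to True (healthy) or False."""
    unhealthy = sum(1 for ok in health.values() if not ok)
    if unhealthy * 2 > len(health):
        return "reserve"   # more than half failed: migrate all traffic
    return "active"        # minority failed: just remove unhealthy nodes

# Half the nodes down is still handled by node removal; a majority
# down triggers migration to the reserve cluster group.
assert choose_routing({"a": True, "b": True, "c": False, "d": False}) == "active"
assert choose_routing({"a": True, "b": False, "c": False, "d": False}) == "reserve"
```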
Internal Risks
Mitigated through three measures:
Exception case protection – cataloguing system, application and third‑party component failure cases and their solutions.
Change governance – strict code‑review, regression testing, staged gray releases, and service‑integration procedures.
Daily operations – continuous health‑status monitoring, routing change notifications, and post‑change load verification.
Fault Drills
Regular drills simulate potential failures to uncover hidden issues. Past drill records are shown below.
Conclusion
Through continuous investment in stability, LApiGateway has maintained five‑nines availability for over two years. Ongoing optimization will keep the platform reliable and continue to provide users with a high‑quality service experience.