How LApiGateway Achieves 99.999% Uptime: Architecture, SLA & Risk Mitigation
LApiGateway, Huolala's internal micro‑service gateway, achieves five‑nines availability through a dual‑plane architecture, comprehensive monitoring, a clear SLA definition, risk classification, heartbeat health checks, traffic‑migration strategies, strict change governance, and regular fault drills, all of which are detailed in this technical overview.
LApiGateway Overview
LApiGateway is the internal micro‑service gateway of Huolala, responsible for traffic forwarding and providing features such as authentication, rate‑limiting, parameter modification and validation to improve developer efficiency.
Architecture
The gateway consists of a control plane and a data plane.
Control Plane
Service configuration is managed by the LApi Management Platform and the Apollo configuration center.
Service discovery runs on Consul, which provides node registration info (IP, group, gray version, etc.).
Monitoring combines the Trace service and HLL Monitor for request monitoring and alerting.
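As an illustration of the service-discovery step, the sketch below extracts node registration info (IP, group, gray version) from the payload shape returned by Consul's `/v1/health/service/<name>?passing` HTTP endpoint. The `group` and `gray_version` `Meta` keys are assumptions for illustration, not LApi's actual registration schema.

```python
# Sketch: turning a Consul health-service payload into a node list.
# The Meta keys "group" and "gray_version" are hypothetical examples.
def parse_nodes(payload):
    nodes = []
    for entry in payload:
        svc = entry["Service"]
        nodes.append({
            "ip": svc["Address"],
            "port": svc["Port"],
            "group": svc.get("Meta", {}).get("group"),
            "gray_version": svc.get("Meta", {}).get("gray_version"),
        })
    return nodes

# Minimal example payload in Consul's response shape.
sample = [{"Service": {"Address": "10.0.0.1", "Port": 8080,
                       "Meta": {"group": "g1", "gray_version": "v2"}}}]
```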
Data Plane
Requests enter through load balancers (KONG, SLB) and pass through LApi nodes, where a series of plugins process them before they are forwarded to downstream services. Plugins include account authentication (which depends on the Account Service) and SSO authentication (which depends on the SSO Service).
During request processing, LApi may rely on:
Account Service – user authentication.
Kafka – persisting request‑generated messages.
Lone – publishing windows and service permission management.
SSO Service – employee authentication.
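The data-plane flow above can be sketched as a plugin chain that each request passes through before forwarding. The plugin and type names below are illustrative, not LApi's real API; the authentication plugin stands in for a call to the Account Service.

```python
# A minimal sketch of a gateway plugin chain: each plugin either
# transforms the request or rejects it before forwarding.
from dataclasses import dataclass, field

@dataclass
class Request:
    path: str
    headers: dict = field(default_factory=dict)

class AccountAuthPlugin:
    """Stand-in for account authentication via the Account Service."""
    def process(self, req):
        if "X-User-Token" not in req.headers:
            raise PermissionError("unauthenticated")
        return req

class RateLimitPlugin:
    """Rejects requests beyond a fixed per-instance budget."""
    def __init__(self, limit):
        self.limit = limit
        self.count = 0
    def process(self, req):
        self.count += 1
        if self.count > self.limit:
            raise RuntimeError("rate limited")
        return req

def forward(req, plugins):
    for p in plugins:          # run the plugin chain in order
        req = p.process(req)
    return f"forwarded {req.path} downstream"
```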
SLA Definition
The SLA is defined by the “availability percentage”: the success rate of proxied requests within a calculation period, excluding failures not caused by LApi itself (for example, downstream service errors). A calculation period is 5 minutes, so there are 105,120 periods per year.
Achieving five‑nines (99.999 %) means the total unavailable time in a year must stay below roughly 5 minutes, i.e., about one full 5‑minute calculation period out of 105,120.
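The arithmetic behind these figures is straightforward to check:

```python
# Number of 5-minute calculation periods per (non-leap) year,
# and the downtime budget implied by 99.999% availability.
PERIOD_MIN = 5
minutes_per_year = 365 * 24 * 60            # 525,600 minutes
periods_per_year = minutes_per_year // PERIOD_MIN   # 105,120 periods
downtime_budget_min = minutes_per_year * (1 - 0.99999)  # ~5.26 minutes
```

So the annual downtime budget is just over one calculation period, which is why the article states the target as "less than 5 minutes".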
Challenges and Solutions
External Risks
Uncontrollable factors such as ECS instance failures, network jitter, or traffic attacks. Mitigation relies on rapid recovery and health‑check mechanisms.
Node heartbeat checks:
KONG TCP connection heartbeat (~9 s detection).
Consul heartbeat (~6 s detection).
In a 4‑node cluster, a single failed node would otherwise fail roughly 25 % of requests for an entire 5‑minute period, giving 75 % availability; heartbeat detection shrinks the failure window to seconds, raising availability to 99.25 % for KONG traffic (~9 s detection) and 99.50 % for SOA traffic (~6 s detection).
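These percentages follow from a simple model: with one of four nodes down, a quarter of requests fail until the failure is detected and the node is removed.

```python
# Availability of a 5-minute (300 s) window with 1 of 4 nodes down,
# as a function of how long the failure goes undetected.
def availability(detect_s, failed_nodes=1, total_nodes=4, window_s=300):
    # Until detection, failed_nodes/total_nodes of requests fail.
    failure_fraction = (detect_s / window_s) * (failed_nodes / total_nodes)
    return 1 - failure_fraction

assert round(availability(300), 4) == 0.75    # no detection: whole window
assert round(availability(9), 4) == 0.9925    # KONG TCP heartbeat, ~9 s
assert round(availability(6), 4) == 0.9950    # Consul heartbeat, ~6 s
```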
Cluster Faults
If more than half of the nodes fail, simply removing the faulty nodes concentrates the load on the survivors and can cause a total outage. Instead, traffic must be migrated to a healthy cluster within minutes.
Migration steps:
Detect fault via LApi Management Platform and Consul service registry.
Shift traffic to a reserve cluster group with spare capacity.
Complete migration within 2–3 minutes (goal: <30 s with full automation).
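The decision rule behind the steps above can be sketched as follows. The function and its inputs are hypothetical simplifications of what the LApi Management Platform does with Consul health data.

```python
# Sketch: decide between removing faulty nodes and migrating the whole
# cluster's traffic to a reserve group, per the rule described above.
def choose_routing(health):
    """health maps each node of the active cluster to True (healthy) or False."""
    unhealthy = sum(1 for ok in health.values() if not ok)
    if unhealthy * 2 > len(health):
        return "reserve"   # more than half failed: migrate all traffic
    return "active"        # minority failed: just remove unhealthy nodes

# Half the nodes down is still handled by node removal; a majority
# down triggers migration to the reserve cluster group.
assert choose_routing({"a": True, "b": True, "c": False, "d": False}) == "active"
assert choose_routing({"a": True, "b": False, "c": False, "d": False}) == "reserve"
```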
Internal Risks
Mitigated through three measures:
Exception case protection – cataloguing system, application and third‑party component failure cases and their solutions.
Change governance – strict code‑review, regression testing, staged gray releases, and service‑integration procedures.
Daily operations – continuous health‑status monitoring, routing change notifications, and post‑change load verification.
Fault Drills
Regular drills simulate potential failures to uncover hidden issues. Past drill records are shown below.
Conclusion
Through continuous investment in stability, LApiGateway has maintained five‑nines availability for over two years. Ongoing optimization will keep the platform reliable and continue to provide users with a high‑quality service experience.