
Zero‑Downtime Deployments with Alibaba Cloud Lightweight Message Queue

This article explains how Alibaba Cloud Lightweight Message Queue (formerly MNS) enables lossless, zero‑downtime service releases by redesigning the network entry layer, using load‑balancer draining, injecting HTTP close frames, and providing CI/CD scripts that work across ECS and Kubernetes environments.

Alibaba Cloud Native

Alibaba Cloud Lightweight Message Queue (formerly MNS) is a high‑concurrency, elastic message‑queue service used in retail, finance, automotive, gaming and AI scenarios. The article examines its "lossless release" capability from a developer’s perspective, detailing the technical advantages, architecture, implementation steps, and practical validation.

1. Core Advantages and Business Value

Million‑TPS, zero‑perceived errors: Unlike many lossless solutions that still cause brief traffic interruptions, this design has been proven in production to avoid any client‑side errors during release.

Compatibility with existing users: No client upgrades are required, eliminating a major migration barrier.

High robustness, low maintenance: The solution is simple, robust, and requires no architectural changes.

Strong universality: Works with any stateless HTTP‑based application.

2. Architecture Overview

The lossless release focuses on the network entry layer. For stateless services, only the TCP connections to the instance being upgraded need to be gracefully removed. The simplified model includes:

Focus on the network entry (load balancer → backend).

Maintain a generic architecture similar to typical HTTP services.

Ensure compatibility with various deployment forms (ACK, ECS, different LB versions).

Decouple from the application so no client changes are needed.

Architecture diagram

3. Core Implementation Process

The implementation consists of two main phases: removing connections from the instance to be released, then publishing the new version.

Phase 1 – Remove Connections

Step 1: Stop new TCP connection requests from reaching the instance, while existing connections continue to be forwarded and the application remains responsive.

Step 2: Gracefully close residual connections.

Phase 2 – Publish Application

Step 3: Verify that no connections or pending requests remain, then perform the release.

Step 4: Re‑introduce traffic to the newly released instance.
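The check in Step 3 can be sketched as a small polling loop. This is an illustration rather than the product's actual script: the function names, the use of the iproute2 `ss` tool, and the one-second polling interval are all assumptions.

```shell
#!/bin/sh
# Hypothetical helper: count established TCP connections whose local port is $1.
# Assumes the iproute2 `ss` tool is installed on the host.
count_connections() {
    ss -Htn state established "( sport = :$1 )" | wc -l | tr -d ' '
}

# Hypothetical helper: poll until no established connections remain on port $1,
# or give up after $2 seconds. Returns 0 when drained, 1 on timeout.
wait_for_drain() {
    port=$1
    timeout=$2
    waited=0
    while [ "$(count_connections "$port")" -gt 0 ]; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "port $port still has connections after ${timeout}s" >&2
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Example (port and timeout are placeholders):
#   wait_for_drain 8080 60 && echo "safe to release"
```

Returning a non-zero status on timeout lets the calling release script decide whether to abort or proceed anyway.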

Key technical challenges addressed:

How to stop new TCP connections without breaking existing ones.

Ensuring graceful shutdown of residual connections.

Compatibility with Kubernetes (ACK) where kube‑proxy adds an extra network layer.

4. Graceful Connection Termination

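If the server closes a connection unilaterally, the close can race with a request the client has already sent on it, so the solution prompts the client to close instead: the server’s response carries an HTTP Connection: close header (or an HTTP close frame). Idle connections receive no response that could carry this signal, so for them the instance is drained from the load balancer and the script waits out the socket timeout, after which the client discards the connection on its own.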

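# Excerpt from the release script's case statement: the full publish sequence
pubstart)
    offline
    stop_http
    stopjava
    startjava
    start_http
    online
;;

# Take the instance out of rotation before the application is stopped
offline_http() {
    echo "[ 1/10] -- offline http from load balance server"
    # Delete status flag to trigger LB health‑check failure and enable Nginx close‑frame response
    rm -f "$STATUSROOT_HOME/status"
    # Ask the application to shut down gracefully
    curl localhost:7001/shutDownGracefully
    # Wait for the health check to fail and idle client sockets to time out
    # (both values must be whole seconds for the arithmetic expansion)
    sleep $((SOCKET_TIMEOUT + HEALTH_CHECK))
}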
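To confirm from the outside that a draining instance is actually telling clients to close, one option is to inspect its response headers. The helper below is a sketch and not part of the original script: the function name is hypothetical, and it assumes HTTP/1.1, where the signal is a `Connection: close` response header.

```shell
#!/bin/sh
# Hypothetical helper: read raw HTTP response headers on stdin and succeed
# (exit status 0) only if a "Connection: close" header is present.
has_connection_close() {
    # Drop carriage returns so the match works on CRLF-terminated header lines,
    # then match the header name and value case-insensitively.
    tr -d '\r' | grep -qiE '^connection:[[:space:]]*close[[:space:]]*$'
}

# Example (the request path "/" is an assumption):
#   curl -s -o /dev/null -D - http://localhost:7001/ | has_connection_close \
#       && echo "instance is asking clients to close"
```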

5. CI/CD Integration

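For ECS, the release script modifies the offline and online phases: it removes the status flag, waits for the socket timeout, and restores the flag once the new version is up. For Kubernetes (ACK), the same script runs as a preStop hook in the pod definition. The hook counts against the pod’s termination grace period (terminationGracePeriodSeconds, 30 seconds by default), so that value must exceed the script’s total wait or the container is killed before draining completes. The manifest below shows only the hook: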

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: main-container
        image: my-image:latest
        lifecycle:
          preStop:
            exec:
              command:
                - sh
                - /home/admin/offline.sh
        ports:
          - containerPort: 8080
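The article shows only the offline half of the ECS script. A minimal sketch of what the matching online step might look like follows, assuming it simply restores the status flag and waits one health-check interval so the load balancer marks the instance healthy again. The function name `online_http` and the length of the wait are assumptions, not taken from the original script.

```shell
#!/bin/sh
# Hypothetical counterpart to offline_http: put the instance back into rotation.
# Reuses the STATUSROOT_HOME and HEALTH_CHECK variables from the offline step.
online_http() {
    echo "-- online http to load balance server"
    # Recreate the status flag so the load balancer health check passes again
    touch "${STATUSROOT_HOME:?STATUSROOT_HOME must be set}/status"
    # Assumed wait: give the health check time to observe the restored flag
    sleep "$HEALTH_CHECK"
}
```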

6. Validation

Testing in a simulated environment shows that after applying the lossless release, error rates during deployment drop to zero even under million‑TPS load, confirming the effectiveness of the approach.
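A check of this kind can be approximated with a simple client-side probe that sends requests throughout a release and counts the failures. The sketch below is not the test harness used for the results above: the function name and the `$URL` placeholder are hypothetical, and a failure is defined here as any status outside the 2xx range.

```shell
#!/bin/sh
# Hypothetical helper: read one HTTP status code per line on stdin and print
# how many of them are not 2xx. curl reports 000 when no response was
# received at all, so connection-level failures are counted as well.
count_failures() {
    # grep -c exits non-zero when the count is 0, so mask that status
    grep -cv '^2[0-9][0-9]$' || true
}

# Example, run while a release is in progress ($URL is a placeholder):
#   for i in $(seq 1 1000); do
#       curl -s -o /dev/null -w '%{http_code}\n' "$URL"
#   done | count_failures
```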

Test results

7. Conclusion

The lossless release technique for Alibaba Cloud Lightweight Message Queue combines load‑balancer draining, Nginx‑level HTTP close‑frame injection, and timeout handling to achieve true zero‑downtime deployments. It works across ECS and Kubernetes, scales to million‑TPS workloads, and requires no client modifications, embodying the product’s “customer‑first” philosophy.

