
How to Achieve Service Discovery High Availability with Push‑Empty Protection in MSE

This article walks through a real‑world Kubernetes outage caused by DNS and Nacos client bugs, explains the chain of failures, and presents a failure‑oriented design that adds push‑empty protection and outlier removal using Alibaba Cloud MSE to keep microservices highly available.

Alibaba Cloud Native

Background

The service registry is a core component for service registration and discovery in microservice architectures. In the CAP model, a registry can sacrifice a tiny amount of consistency (C) to guarantee availability (A), because an unavailable registry can cause catastrophic system failures.

Real‑world Incident

A customer deployed many microservices on an Alibaba Cloud Kubernetes cluster. A network-card fault on one ECS instance recovered after a short time, yet it triggered a large-scale, sustained service outage that hurt the business.

Failure Chain Analysis

The failed ECS node hosted all of the cluster's CoreDNS pods, and the cluster ran an old Kubernetes version without NodeLocal DNSCache, so the node failure caused DNS resolution problems.

The applications used Nacos client 1.4.1, which has a defect: when a DNS lookup fails, the heartbeat stops renewing, and only a restart recovers it.

Alibaba Cloud announced the severe bug in May, but the customer did not receive the notice and continued using the buggy version in production.

Design for Failure

When network jitter or registry unavailability causes batch service flapping, the microservice should treat this as an abnormal condition and adopt a conservative strategy to avoid "no provider" errors that could take the entire system offline.

High‑Availability Mechanisms for Service Discovery

Push‑Empty Protection : If the registry pushes an empty address list, the client ignores the update to prevent "no provider" errors. This works without upgrading the client and is independent of the registry implementation (supports Nacos, Eureka, Zookeeper, etc.).

Outlier Instance Removal : Detects abnormal instances based on network errors or HTTP 5xx responses, applies thresholds (QPS lower bound, removal ratio), and sends alerts (e.g., DingTalk).
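
The push-empty rule itself is small enough to sketch in a few lines. The following is only an illustration of the idea, not MSE's implementation: the function name `apply_push`, the variable `current_addresses`, and the space-separated address format are all hypothetical. It keeps the last known non-empty list whenever an empty list arrives.

```shell
#!/bin/sh
# Hypothetical sketch of push-empty protection (not MSE's actual code).
# current_addresses holds the provider list the client currently routes to.
current_addresses=""

# apply_push <pushed-list>
# Accepts a non-empty pushed list; ignores an empty one and keeps the old list.
apply_push() {
    pushed="$1"
    if [ -z "$pushed" ]; then
        echo "empty push rejected, keeping: $current_addresses" >&2
        return 0
    fi
    current_addresses="$pushed"
}

apply_push "10.0.0.1:18084 10.0.0.2:18084"   # normal push, accepted
apply_push ""                                 # empty push, ignored
echo "$current_addresses"
```

The trade-off is that the retained list may contain addresses that are no longer valid, which is why this mechanism pairs well with outlier instance removal.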

Practical Implementation

Prerequisites: a Kubernetes cluster (see Alibaba Cloud docs) and MSE Microservices Governance Professional Edition activated for your account.

Steps:

Enable MSE governance in the console.

Install the MSE governance component from the Marketplace.

Enable governance for the target namespace.

Deploy demo applications (sc‑consumer, sc‑consumer‑empty, sc‑provider, Nacos server) using the following YAML:

# Enable push‑empty protection for sc‑consumer
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sc-consumer
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sc-consumer
  template:
    metadata:
      annotations:
        msePilotCreateAppName: sc-consumer
      labels:
        app: sc-consumer
    spec:
      containers:
      - env:
        - name: JAVA_HOME
          value: /usr/lib/jvm/java-1.8-openjdk/jre
        - name: spring.cloud.nacos.discovery.server-addr
          value: nacos-server:8848
        - name: profiler.micro.service.registry.empty.push.reject.enable
          value: "true"
        image: registry.cn-hangzhou.aliyuncs.com/mse-demo-hz/demo:sc-consumer-0.1
        imagePullPolicy: Always
        name: sc-consumer
        ports:
        - containerPort: 18091
        livenessProbe:
          tcpSocket:
            port: 18091
          initialDelaySeconds: 10
          periodSeconds: 30
---
# sc‑consumer‑empty (no push‑empty protection)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sc-consumer-empty
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sc-consumer-empty
  template:
    metadata:
      annotations:
        msePilotCreateAppName: sc-consumer-empty
      labels:
        app: sc-consumer-empty
    spec:
      containers:
      - env:
        - name: JAVA_HOME
          value: /usr/lib/jvm/java-1.8-openjdk/jre
        - name: spring.cloud.nacos.discovery.server-addr
          value: nacos-server:8848
        image: registry.cn-hangzhou.aliyuncs.com/mse-demo-hz/demo:sc-consumer-0.1
        imagePullPolicy: Always
        name: sc-consumer-empty
        ports:
        - containerPort: 18091
        livenessProbe:
          tcpSocket:
            port: 18091
          initialDelaySeconds: 10
          periodSeconds: 30
---
# sc‑provider
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sc-provider
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sc-provider
  template:
    metadata:
      annotations:
        msePilotCreateAppName: sc-provider
      labels:
        app: sc-provider
    spec:
      containers:
      - env:
        - name: JAVA_HOME
          value: /usr/lib/jvm/java-1.8-openjdk/jre
        - name: spring.cloud.nacos.discovery.server-addr
          value: nacos-server:8848
        image: registry.cn-hangzhou.aliyuncs.com/mse-demo-hz/demo:sc-provider-0.3
        imagePullPolicy: Always
        name: sc-provider
        ports:
        - containerPort: 18084
        livenessProbe:
          tcpSocket:
            port: 18084
          initialDelaySeconds: 10
          periodSeconds: 30
---
# Nacos server
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nacos-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nacos-server
  template:
    metadata:
      labels:
        app: nacos-server
    spec:
      containers:
      - env:
        - name: MODE
          value: standalone
        image: nacos/nacos-server:latest
        imagePullPolicy: Always
        name: nacos-server
      dnsPolicy: ClusterFirst
      restartPolicy: Always
---
apiVersion: v1
kind: Service
metadata:
  name: nacos-server
spec:
  ports:
  - port: 8848
    protocol: TCP
    targetPort: 8848
  selector:
    app: nacos-server
  type: ClusterIP

To enable push‑empty protection, add the environment variable

profiler.micro.service.registry.empty.push.reject.enable=true

to the consumer.
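
Assuming the manifests above are saved as `demo.yaml` (a hypothetical file name) and deployed into the namespace where governance was enabled (assumed here to be `default`), they could be applied with standard kubectl commands as sketched below. The `run` helper and the `DRY_RUN` switch are illustrative additions that let you preview the commands first.

```shell
#!/bin/sh
# Sketch: apply the demo manifests and wait for the deployments to roll out.
# demo.yaml and the namespace are assumptions; adjust them to your environment.
MANIFEST="${MANIFEST:-demo.yaml}"
NAMESPACE="${NAMESPACE:-default}"

# run <command...>: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Apply all manifests, then block until each deployment reports ready.
deploy_demo() {
    run kubectl apply -n "$NAMESPACE" -f "$MANIFEST"
    for name in nacos-server sc-provider sc-consumer sc-consumer-empty; do
        run kubectl rollout status -n "$NAMESPACE" "deployment/$name" --timeout=120s
    done
}
```

Setting `DRY_RUN=1` before calling `deploy_demo` prints the five kubectl commands without touching the cluster.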

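Test script (curl.sh) that calls the service continuously and prints each response with a timestamp, flagging responses that contain 500:

#!/bin/bash
# Usage: ./curl.sh <url>
# Calls the URL every 0.1 s and prints each response with a timestamp.
while :
do
    result=$(curl -s "$1")
    if [[ "$result" == *"500"* ]]; then
        # Flag error responses so they stand out in the log
        echo "$(date +%F-%T) ERROR $result"
    else
        echo "$(date +%F-%T) $result"
    fi
    sleep 0.1
done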

Result Verification

During the simulated DNS outage (CoreDNS scaled to 0), the sc-consumer-empty instance, which has no push-empty protection, repeatedly returned 500 errors and logged "Load balancer does not have available server for client: mse-service-provider". After CoreDNS was restored, that instance stayed disconnected until the provider was restarted. The sc-consumer instance, which has push-empty protection enabled, never reported errors.
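
The outage in this test could be reproduced roughly as follows. This is a sketch under assumptions: it presumes CoreDNS runs as a deployment named `coredns` in the `kube-system` namespace and that two replicas is the normal count, so check both in your cluster before running it. The `run` helper and `DRY_RUN` switch are illustrative.

```shell
#!/bin/sh
# Sketch: simulate the DNS outage by scaling CoreDNS down, then restore it.
# Deployment name, namespace, and replica count are assumptions.
COREDNS_NS="${COREDNS_NS:-kube-system}"
COREDNS_DEPLOY="${COREDNS_DEPLOY:-coredns}"

# run <command...>: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# scale_coredns <replicas>: set the CoreDNS replica count.
scale_coredns() {
    run kubectl -n "$COREDNS_NS" scale "deployment/$COREDNS_DEPLOY" --replicas="$1"
}
```

Calling `scale_coredns 0` starts the simulated outage and `scale_coredns 2` ends it; run curl.sh against both consumers in between to compare their behavior.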

After‑action

When push‑empty protection triggers, MSE reports events and alerts to DingTalk. It is recommended to combine it with outlier instance removal to isolate invalid provider addresses and maintain continuous business flow.
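
To make the outlier-removal thresholds mentioned earlier (QPS lower bound, removal ratio) more concrete, here is a hedged sketch of how such a decision might be evaluated. It is not MSE's algorithm: the function name `should_remove`, the error-rate threshold, and every numeric value are assumptions chosen for illustration.

```shell
#!/bin/sh
# Hypothetical sketch of an outlier-removal decision (not MSE's actual code).
# All three thresholds are illustrative values, not MSE defaults.
MIN_QPS=1               # QPS lower bound: below this there is too little data
ERROR_RATE_PERCENT=50   # assumed error-rate threshold (network errors + HTTP 5xx)
MAX_REMOVAL_PERCENT=50  # removal ratio: upper bound on the share of removed instances

# should_remove <qps> <errors> <requests> <already_removed> <total_instances>
# Exit status 0 means "remove this instance", 1 means "keep it".
should_remove() {
    qps=$1; errors=$2; requests=$3; removed=$4; total=$5
    # Too little traffic, or no data at all: do not judge the instance.
    [ "$qps" -ge "$MIN_QPS" ] || return 1
    [ "$requests" -gt 0 ] && [ "$total" -gt 0 ] || return 1
    # Error rate below the threshold: the instance is healthy enough.
    [ $((errors * 100 / requests)) -ge "$ERROR_RATE_PERCENT" ] || return 1
    # Removing one more instance must not exceed the removal ratio.
    [ $(((removed + 1) * 100 / total)) -le "$MAX_REMOVAL_PERCENT" ] || return 1
    return 0
}

# 15 of 20 requests failed, nothing removed yet, 4 instances in total.
if should_remove 20 15 20 0 4; then echo "remove"; else echo "keep"; fi
```

The removal-ratio check matters most during an incident like the one described here: without it, a burst of errors across all providers could remove every instance and recreate the "no provider" condition.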

Technical references:

https://help.aliyun.com/document_detail/95108.htm#task-skz-qwk-qfb

https://help.aliyun.com/document_detail/347625.htm#task-2140253

https://common-buy.aliyun.com/?commodityCode=mse_basic_public_cn

https://help.aliyun.com/document_detail/170443.htm#concept-2519524

https://cs.console.aliyun.com

https://mse.console.aliyun.com
