How to Achieve Service Discovery High Availability with Push‑Empty Protection in MSE
This article walks through a real‑world Kubernetes outage caused by a DNS failure and a Nacos client bug, explains the chain of failures, and presents a failure‑oriented design that uses push‑empty protection and outlier instance removal in Alibaba Cloud MSE to keep microservices highly available.
Background
The service registry is a core component for service registration and discovery in microservice architectures. In the CAP model, a registry can sacrifice a tiny amount of consistency (C) to guarantee availability (A), because an unavailable registry can cause catastrophic system failures.
Real‑world Incident
A customer deployed many microservices on an Alibaba Cloud Kubernetes cluster. A network‑card fault on one ECS instance recovered quickly, but it still triggered a large‑scale, sustained service outage that hurt the business.
Failure Chain Analysis
The failed ECS node hosted every CoreDNS pod, and the cluster ran an old Kubernetes version without NodeLocal DNSCache, so DNS resolution failed.
The applications used nacos‑client 1.4.1, which is defective: once a DNS lookup fails, the heartbeat stops renewing, and only a restart recovers it.
Alibaba Cloud announced this severe bug in May, but the customer did not receive the notice and kept running the affected version in production.
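The first link in this chain can be checked ahead of time by counting how many distinct nodes host the CoreDNS pods. The following is a minimal sketch, not part of the original incident analysis. It assumes the CoreDNS pods carry the usual k8s-app=kube-dns label in the kube-system namespace, and the helper name count_distinct_nodes is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads one node name per line on stdin and prints
# how many distinct nodes appear. A result of 1 means every CoreDNS pod
# shares a single node, which is the situation described above.
count_distinct_nodes() {
  awk 'NF && !seen[$0]++ { n++ } END { print n + 0 }'
}

# Usage against a live cluster (assumes the k8s-app=kube-dns label):
#   kubectl -n kube-system get pods -l k8s-app=kube-dns \
#     -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' \
#     | count_distinct_nodes
```

If the count is 1, spreading the CoreDNS replicas across nodes removes that single point of failure.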
Design for Failure
When network jitter or registry unavailability causes batch service flapping, the microservice should treat this as an abnormal condition and adopt a conservative strategy to avoid "no provider" errors that could take the entire system offline.
High‑Availability Mechanisms for Service Discovery
Push‑Empty Protection: If the registry pushes an empty address list, the client ignores the update to prevent "no provider" errors. This works without upgrading the client and is independent of the registry implementation (supports Nacos, Eureka, Zookeeper, etc.).
Outlier Instance Removal: Detects abnormal instances based on network errors or HTTP 5xx responses, applies thresholds (QPS lower bound, removal ratio), and sends alerts (e.g., DingTalk).
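The idea behind push‑empty protection can be sketched in a few lines. This is an illustration of the rule "ignore an empty push", not MSE's actual implementation. The variable cached_addresses and the function apply_push are hypothetical names, and the addresses are made up (only the provider port 18084 comes from the demo).

```bash
#!/usr/bin/env bash
# Illustrative only: this is not MSE's implementation.
# cached_addresses holds the last non-empty provider list the consumer saw.
cached_addresses=""

# apply_push NEW_LIST
# Takes a pushed address list (space-separated). An empty push is rejected
# and the cached list is kept; any non-empty push replaces the cache.
apply_push() {
  local pushed="$1"
  if [[ -z "${pushed// /}" ]]; then
    echo "rejected empty push, keeping: ${cached_addresses}" >&2
    return 1
  fi
  cached_addresses="$pushed"
}

# Example:
#   apply_push "10.0.0.1:18084 10.0.0.2:18084"   # cache updated
#   apply_push ""                                 # rejected, cache unchanged
```

The trade‑off is that the consumer may keep calling addresses that are really gone, which is why the article pairs this mechanism with outlier instance removal.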
Practical Implementation
Prerequisites: a Kubernetes cluster (see the Alibaba Cloud docs) and MSE microservice governance Professional Edition, already activated.
Steps:
Enable MSE governance in the console.
Install the MSE governance component from the Marketplace.
Enable governance for the target namespace.
Deploy demo applications (sc‑consumer, sc‑consumer‑empty, sc‑provider, Nacos server) using the following YAML:
# Enable push‑empty protection for sc‑consumer
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sc-consumer
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sc-consumer
  template:
    metadata:
      annotations:
        msePilotCreateAppName: sc-consumer
      labels:
        app: sc-consumer
    spec:
      containers:
      - env:
        - name: JAVA_HOME
          value: /usr/lib/jvm/java-1.8-openjdk/jre
        - name: spring.cloud.nacos.discovery.server-addr
          value: nacos-server:8848
        - name: profiler.micro.service.registry.empty.push.reject.enable
          value: "true"
        image: registry.cn-hangzhou.aliyuncs.com/mse-demo-hz/demo:sc-consumer-0.1
        imagePullPolicy: Always
        name: sc-consumer
        ports:
        - containerPort: 18091
        livenessProbe:
          tcpSocket:
            port: 18091
          initialDelaySeconds: 10
          periodSeconds: 30
---
# sc‑consumer‑empty (no push‑empty protection)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sc-consumer-empty
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sc-consumer-empty
  template:
    metadata:
      annotations:
        msePilotCreateAppName: sc-consumer-empty
      labels:
        app: sc-consumer-empty
    spec:
      containers:
      - env:
        - name: JAVA_HOME
          value: /usr/lib/jvm/java-1.8-openjdk/jre
        - name: spring.cloud.nacos.discovery.server-addr
          value: nacos-server:8848
        image: registry.cn-hangzhou.aliyuncs.com/mse-demo-hz/demo:sc-consumer-0.1
        imagePullPolicy: Always
        name: sc-consumer-empty
        ports:
        - containerPort: 18091
        livenessProbe:
          tcpSocket:
            port: 18091
          initialDelaySeconds: 10
          periodSeconds: 30
---
# sc‑provider
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sc-provider
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sc-provider
  template:
    metadata:
      annotations:
        msePilotCreateAppName: sc-provider
      labels:
        app: sc-provider
    spec:
      containers:
      - env:
        - name: JAVA_HOME
          value: /usr/lib/jvm/java-1.8-openjdk/jre
        - name: spring.cloud.nacos.discovery.server-addr
          value: nacos-server:8848
        image: registry.cn-hangzhou.aliyuncs.com/mse-demo-hz/demo:sc-provider-0.3
        imagePullPolicy: Always
        name: sc-provider
        ports:
        - containerPort: 18084
        livenessProbe:
          tcpSocket:
            port: 18084
          initialDelaySeconds: 10
          periodSeconds: 30
---
# Nacos server
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nacos-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nacos-server
  template:
    metadata:
      labels:
        app: nacos-server
    spec:
      containers:
      - env:
        - name: MODE
          value: standalone
        image: nacos/nacos-server:latest
        imagePullPolicy: Always
        name: nacos-server
      dnsPolicy: ClusterFirst
      restartPolicy: Always
---
apiVersion: v1
kind: Service
metadata:
  name: nacos-server
spec:
  ports:
  - port: 8848
    protocol: TCP
    targetPort: 8848
  selector:
    app: nacos-server
  type: ClusterIP
To enable push‑empty protection, add the environment variable profiler.micro.service.registry.empty.push.reject.enable=true to the consumer.
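Before applying a manifest, it can help to confirm that the switch is really present and set to "true". The check below is a rough sketch that relies on plain text matching rather than YAML parsing. The function name has_push_empty_protection and the file name sc-consumer.yaml are hypothetical.

```bash
#!/usr/bin/env bash
# has_push_empty_protection MANIFEST
# Succeeds if the manifest contains the push-empty switch followed on the
# next line by value: "true". Text matching only; in a multi-document file
# it passes as soon as any one document sets the switch.
has_push_empty_protection() {
  grep -F -A1 'name: profiler.micro.service.registry.empty.push.reject.enable' "$1" \
    | grep -q -F 'value: "true"'
}

# Usage (file name is an assumption):
#   has_push_empty_protection sc-consumer.yaml && kubectl apply -f sc-consumer.yaml
```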
Test script (curl.sh) that calls the service continuously and prints each response with a timestamp, so that 500 responses are visible in the log:
#!/bin/bash
# Usage: ./curl.sh <url>
while :
do
  # Call the service quietly and keep the response body.
  result=$(curl -s "$1")
  echo "$(date +%F-%T) $result"
  sleep 0.1
done
Result Verification
During the simulated DNS outage (CoreDNS scaled to 0 replicas), the sc‑consumer‑empty instance repeatedly returned 500 errors and logged "Load balancer does not have available server for client: mse-service-provider". After CoreDNS was restored, the instance stayed disconnected until the provider was restarted. The sc‑consumer instance, with push‑empty protection enabled, never reported errors.
After‑action
When push‑empty protection triggers, MSE reports events and alerts to DingTalk. It is recommended to combine it with outlier instance removal to isolate invalid provider addresses and maintain continuous business flow.
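To make the outlier instance removal thresholds concrete, here is a minimal decision sketch. It is not MSE's algorithm. The three threshold values and the function name should_eject are hypothetical, chosen only to show how a traffic lower bound, an error rate, and a removal ratio could combine.

```bash
#!/usr/bin/env bash
# Hypothetical thresholds, for illustration only.
MIN_REQUESTS=10          # lower bound on traffic before judging an instance
ERROR_RATE_PERCENT=50    # error rate at or above which an instance is an outlier
MAX_EJECT_PERCENT=50     # never remove more than this share of all instances

# should_eject REQUESTS ERRORS TOTAL_INSTANCES ALREADY_EJECTED
# Succeeds only if the instance saw enough traffic, its error rate reaches
# the threshold, and removing it keeps the removal ratio within bounds.
should_eject() {
  local requests="$1" errors="$2" total="$3" ejected="$4"
  # Too little traffic to judge.
  (( requests >= MIN_REQUESTS )) || return 1
  # Error rate below the threshold.
  (( errors * 100 >= requests * ERROR_RATE_PERCENT )) || return 1
  # Removing one more instance would exceed the removal ratio.
  (( (ejected + 1) * 100 <= total * MAX_EJECT_PERCENT )) || return 1
  return 0
}

# Example:
#   should_eject 100 80 4 0 && echo "eject"    # 80% errors, 0 of 4 removed
#   should_eject 100 80 4 2 || echo "keep"     # removal ratio already reached
```

The removal ratio is the safety valve: without it, a widespread fault could remove every provider and recreate the "no provider" problem that push‑empty protection avoids.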
Technical references:
https://help.aliyun.com/document_detail/95108.htm#task-skz-qwk-qfb
https://help.aliyun.com/document_detail/347625.htm#task-2140253
https://common-buy.aliyun.com/?commodityCode=mse_basic_public_cn
https://help.aliyun.com/document_detail/170443.htm#concept-2519524
https://cs.console.aliyun.com
https://mse.console.aliyun.com
Alibaba Cloud Native
