Prevent Service Outages with Nacos: Custom Heartbeat, Protection Threshold & Retry

This article examines Nacos’s default health‑check timing, explains the risks of a 15‑second blind spot when a service crashes, and shows how to configure custom heartbeat intervals, protection thresholds, and Spring Cloud LoadBalancer retry settings to minimize downtime and avoid cascade failures.

Senior Brother's Insights

Jul 1, 2021

Introduction

Microservice architectures often rely on service registries such as Nacos for discovery and governance. However, the default health‑check intervals can leave a window where a crashed instance is still considered healthy, leading to request failures.

Nacos Health‑Check Mechanism

Nacos maintains temporary instances via heartbeat reports. The client sends a heartbeat every 5 seconds. If the server does not receive a heartbeat for 15 seconds, the instance is marked unhealthy; after 30 seconds without a heartbeat, the instance is removed.
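To make the two default timeouts concrete, here is a minimal sketch of how an instance's state could be derived from the time since its last heartbeat. The class HeartbeatStatus and its method are hypothetical names used for illustration, not Nacos code, and treating the limits as strict ("more than 15 s") is an assumption of this sketch.

```java
/** Simplified model of the server-side decision; not actual Nacos code. */
final class HeartbeatStatus {
    enum State { HEALTHY, UNHEALTHY, REMOVED }

    /**
     * Classifies an instance by the time elapsed since its last heartbeat.
     * Assumption: both limits are exclusive (strictly "more than").
     */
    static State classify(long sinceLastBeatMs, long timeoutMs, long deleteTimeoutMs) {
        if (sinceLastBeatMs > deleteTimeoutMs) {
            return State.REMOVED;
        }
        if (sinceLastBeatMs > timeoutMs) {
            return State.UNHEALTHY;
        }
        return State.HEALTHY;
    }

    public static void main(String[] args) {
        long timeoutMs = 15_000;       // default: marked unhealthy after 15 s
        long deleteTimeoutMs = 30_000; // default: removed after 30 s
        for (long elapsed : new long[] {4_000, 16_000, 31_000}) {
            System.out.println(elapsed + " ms -> " + classify(elapsed, timeoutMs, deleteTimeoutMs));
        }
    }
}
```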

Problem When a Service Crashes

If a service is killed abruptly (e.g., kill -9), it cannot call the deregistration API, so Nacos continues to treat the instance as alive for up to 15 seconds. During this gap, some requests are routed to the dead instance, causing errors.
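The length of this gap depends on when the crash happens relative to the last heartbeat. The sketch below works through the arithmetic under a simplifying assumption: the server marks the instance unhealthy exactly when the timeout elapses after the last heartbeat, with no extra delay from the server's check schedule or from consumer-side caching. BlindSpot is a hypothetical helper name.

```java
/** Back-of-the-envelope estimate of the window in which a dead instance still looks healthy. */
final class BlindSpot {
    /** Shortest window: the process dies just before its next heartbeat was due. */
    static long minMillis(long intervalMs, long timeoutMs) {
        return Math.max(0, timeoutMs - intervalMs);
    }

    /** Longest window: the process dies right after sending a heartbeat. */
    static long maxMillis(long timeoutMs) {
        return timeoutMs;
    }

    public static void main(String[] args) {
        long intervalMs = 5_000;  // default heartbeat interval
        long timeoutMs = 15_000;  // default heartbeat timeout
        System.out.println("window: " + minMillis(intervalMs, timeoutMs)
                + " to " + maxMillis(timeoutMs) + " ms");
    }
}
```

With the defaults, the dead instance keeps receiving traffic for roughly 10 to 15 seconds under this model.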

Customizing the Heartbeat Interval

Reducing the heartbeat interval shortens the detection window. Since Nacos 1.1.0, you can configure heartbeat interval, timeout, and instance‑deletion timeout via instance metadata.

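import com.alibaba.nacos.api.naming.NamingFactory;
import com.alibaba.nacos.api.naming.NamingService;
import com.alibaba.nacos.api.naming.PreservedMetadataKeys;
import com.alibaba.nacos.api.naming.pojo.Instance;
import java.util.HashMap;
import java.util.Map;

// Server address and service name are placeholders; adjust them to your environment
NamingService naming = NamingFactory.createNamingService("127.0.0.1:8848");
String serviceName = "user-service-provider";

Instance instance = new Instance();
instance.setIp("1.1.1.1");
instance.setPort(9999);
Map<String, String> metadata = new HashMap<>();
// heartbeat interval in milliseconds (client reports every 3 s)
metadata.put(PreservedMetadataKeys.HEART_BEAT_INTERVAL, "3000");
// heartbeat timeout (server marks unhealthy after 6 s)
metadata.put(PreservedMetadataKeys.HEART_BEAT_TIMEOUT, "6000");
// instance deletion timeout (server removes after 9 s)
metadata.put(PreservedMetadataKeys.IP_DELETE_TIMEOUT, "9000");
instance.setMetadata(metadata);

// Register the instance together with its custom heartbeat metadata
naming.registerInstance(serviceName, instance);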

For Spring Cloud Alibaba projects, the same settings can be placed in application.yml:

spring:
  application:
    name: user-service-provider
  cloud:
    nacos:
      discovery:
        server-addr: 127.0.0.1:8848
        heart-beat-interval: 1000 # ms
        heart-beat-timeout: 3000   # ms
        ip-delete-timeout: 6000    # ms

If your version of Spring Cloud Alibaba does not honor these properties, set the values directly in the metadata section:

spring:
  application:
    name: user-service-provider
  cloud:
    nacos:
      discovery:
        server-addr: 127.0.0.1:8848
        metadata:
          preserved.heart.beat.interval: 1000
          preserved.heart.beat.timeout: 3000
          preserved.ip.delete.timeout: 6000

Protection Threshold

Nacos provides a protection threshold, a float between 0 and 1 that is compared with the ratio of healthy instances to total instances. When the ratio falls below the threshold, Nacos returns all instances (healthy and unhealthy) to the consumer. Traffic is then spread across the whole set instead of piling onto the few healthy instances that remain, which prevents a complete traffic blackout in high-concurrency scenarios.

For example, if a service has 100 instances and 98 become unhealthy, returning only the 2 healthy ones could overload them. By exposing all instances, some requests may fail, but the system avoids a cascade failure.
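The decision described above can be sketched in a few lines. This is an illustrative model, not the Nacos implementation: ProtectionThresholdSketch, Instance, and cluster are hypothetical names, and treating the comparison as strictly "below" follows the wording of this article rather than a verified boundary rule.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified model of protection-threshold instance selection. */
final class ProtectionThresholdSketch {
    static final class Instance {
        final String address;
        final boolean healthy;

        Instance(String address, boolean healthy) {
            this.address = address;
            this.healthy = healthy;
        }
    }

    /** Returns the healthy instances, or all of them when the healthy ratio is below the threshold. */
    static List<Instance> select(List<Instance> all, float threshold) {
        if (all.isEmpty()) {
            return all;
        }
        List<Instance> healthy = new ArrayList<>();
        for (Instance instance : all) {
            if (instance.healthy) {
                healthy.add(instance);
            }
        }
        float ratio = (float) healthy.size() / all.size();
        // Assumption: strictly below the threshold triggers protection
        return ratio < threshold ? all : healthy;
    }

    /** Builds a test cluster in which the first healthyCount instances are healthy. */
    static List<Instance> cluster(int total, int healthyCount) {
        List<Instance> instances = new ArrayList<>();
        for (int i = 0; i < total; i++) {
            instances.add(new Instance("10.0.0." + i + ":8080", i < healthyCount));
        }
        return instances;
    }

    public static void main(String[] args) {
        // 2 of 100 healthy with a threshold of 0.5: protection is active, all 100 are returned
        System.out.println(select(cluster(100, 2), 0.5f).size());
        // 80 of 100 healthy: only the 80 healthy instances are returned
        System.out.println(select(cluster(100, 80), 0.5f).size());
    }
}
```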

Spring Cloud LoadBalancer Retry

Even with a shortened heartbeat, there is still a brief period where a failed instance remains in Nacos. To handle request failures gracefully, configure retry behavior in Spring Cloud LoadBalancer.

Example application.yml for the consumer:

spring:
  application:
    name: user-service-consumer
  cloud:
    nacos:
      discovery:
        server-addr: 127.0.0.1:8848
    loadbalancer:
      retry:
        enabled: true
        max-retries-on-same-service-instance: 1
        max-retries-on-next-service-instance: 2
        retry-on-all-operations: true

max-retries-on-same-service-instance controls how many attempts are made on the same instance before switching; max-retries-on-next-service-instance limits attempts on other instances; retry-on-all-operations enables retries for non‑GET methods (requires idempotent operations).
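To see how the two counters interact, the following sketch models the order of attempts in plain Java. It is a simplified model, not Spring Cloud LoadBalancer's actual code: RetrySketch is a hypothetical name, instances are picked in simple round-robin order, and the isUp predicate stands in for a real HTTP call.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

/** Simplified model of same-instance and next-instance retries. */
final class RetrySketch {
    /** Returns the instances that were called, in order, until one succeeds or retries run out. */
    static List<String> attempts(List<String> instances, Predicate<String> isUp,
                                 int maxSame, int maxNext) {
        List<String> tried = new ArrayList<>();
        if (instances.isEmpty()) {
            return tried;
        }
        // Pass 0 is the original call; each later pass moves on to another instance
        for (int next = 0; next <= maxNext; next++) {
            String instance = instances.get(next % instances.size());
            // One initial attempt plus up to maxSame retries on the same instance
            for (int same = 0; same <= maxSame; same++) {
                tried.add(instance);
                if (isUp.test(instance)) {
                    return tried;
                }
            }
        }
        return tried;
    }

    public static void main(String[] args) {
        List<String> instances = Arrays.asList("a", "b", "c");
        // Instance "a" has crashed but is still listed in the registry
        System.out.println(attempts(instances, s -> !s.equals("a"), 1, 2));
    }
}
```

Under this model, the settings from the example (1 and 2) allow at most (1 + 1) × (1 + 2) = 6 attempts per request, which is worth keeping in mind when estimating the extra load retries can generate.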

Common Pitfall

The retry mechanism relies on the spring-retry library. If the dependency is missing, the configuration has no effect. Add the following Maven dependency:

<dependency>
    <groupId>org.springframework.retry</groupId>
    <artifactId>spring-retry</artifactId>
</dependency>

Property names can also differ between Spring Cloud versions, so check the reference documentation for the release you are using.

Conclusion

Integrating Nacos alone does not guarantee seamless microservice operation. Adjusting heartbeat intervals, configuring protection thresholds, and enabling load‑balancer retries are essential steps to reduce downtime and prevent system‑wide avalanches.

