Graceful Shutdown in Kubernetes: Concepts, Case Studies, and Optimizations
This article explains the concept of graceful shutdown, outlines the standard steps, and presents detailed Kubernetes, Spring Boot, and Nacos case studies, followed by optimization techniques, code examples, and practical recommendations for handling MQ, scheduled tasks, and traffic control during service termination.
1. Concept
Graceful shutdown refers to the process of stopping a system, service, or application in a controlled manner to ensure data safety, prevent errors, and maintain overall stability.
Typical steps include:
Backup data: Persist any in-memory modifications or caches to the database or disk.
Stop receiving new requests.
Process unfinished requests.
Notify dependent components.
Wait for all elements to exit safely, then shut down the system.
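The steps above can be sketched as an ordered sequence of actions. This is only a minimal illustration: ShutdownSequence is a hypothetical helper, not part of any library, and the step bodies are placeholders that a real service would fill with its own logic.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not from any library): runs shutdown steps in a fixed order. */
final class ShutdownSequence {
    private final List<String> names = new ArrayList<>();
    private final List<Runnable> actions = new ArrayList<>();

    /** Registers a named step; steps run in the order they were added. */
    ShutdownSequence step(String name, Runnable action) {
        names.add(name);
        actions.add(action);
        return this;
    }

    /** Runs every step and returns the names of those that completed. */
    List<String> run() {
        List<String> completed = new ArrayList<>();
        for (int i = 0; i < actions.size(); i++) {
            try {
                actions.get(i).run();
                completed.add(names.get(i));
            } catch (RuntimeException e) {
                // A failed step must not prevent the remaining steps from running.
                System.err.println("shutdown step failed: " + names.get(i) + ": " + e);
            }
        }
        return completed;
    }

    /** The five steps from the list above, with placeholder bodies. */
    static ShutdownSequence typical() {
        return new ShutdownSequence()
            .step("backup data", () -> { /* flush caches to disk or database */ })
            .step("stop receiving new requests", () -> { /* e.g. deregister, close listeners */ })
            .step("process unfinished requests", () -> { /* drain in-flight work */ })
            .step("notify dependent components", () -> { /* publish a shutdown notice */ })
            .step("exit", () -> { /* release remaining resources */ });
    }
}
```

One way to trigger such a sequence is a JVM shutdown hook, which runs when the process receives SIGTERM: Runtime.getRuntime().addShutdownHook(new Thread(() -> ShutdownSequence.typical().run())).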
2. Case Studies
2.1 Kubernetes shutdown process
When kubectl delete pod is executed, two parallel processes start:
Network rule update: kube-apiserver marks the pod as Terminating in etcd, the endpoint controller removes the pod IP from the service endpoints, and kube-proxy updates iptables so traffic no longer routes to the pod.
Container deletion: kube-apiserver marks the pod as Terminating, kubelet invokes the PreStop hook and then sends SIGTERM to the container. If the container has not exited when the grace period runs out (terminationGracePeriodSeconds, 30 s by default, counted from the moment termination starts and therefore including the PreStop time), kubelet sends SIGKILL. Once the containers have stopped, kubelet cleans up storage and network resources.
2.2 k8s + Spring Boot + Nacos case
The PreStop hook performs two actions: Nacos deregistration and a 35‑second sleep. The pod’s terminationGracePeriodSeconds is also set to 35 s.
Problem
The Spring Boot application gets only about 2 s to shut down, which is not enough to finish pending thread-pool tasks, asynchronous messages, or scheduled jobs. The cause is that the grace period also covers the time spent in the PreStop hook. With terminationGracePeriodSeconds set to 35 s, the PreStop hook (the deregistration request plus the 35 s sleep) uses up the whole grace period before SIGTERM is even sent, so kubelet grants only a one-off extra 2 s before issuing SIGKILL.
Why is a 35 s sleep needed after Nacos deregistration? Nacos service‑change propagation via HTTP can take up to 10 s, and Ribbon’s default cache refresh interval is 30 s, so 35 s was chosen to cover both.
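The timing problem can be made concrete with a small calculation. The sketch below is a simplified model of the behavior described above, not kubelet code; the class name TerminationWindow and its methods are hypothetical, and the 2 s floor is the one-off extension mentioned in the problem description.

```java
/** Simplified, hypothetical model of the termination timing; not kubelet code. */
final class TerminationWindow {
    /** One-off extension granted when PreStop has used up the grace period. */
    static final int MIN_WINDOW_SECONDS = 2;

    /** Seconds the application has between SIGTERM and SIGKILL. */
    static int secondsAfterSigterm(int gracePeriodSeconds, int preStopSeconds) {
        int remaining = gracePeriodSeconds - preStopSeconds;
        return Math.max(remaining, MIN_WINDOW_SECONDS);
    }

    /** True if the application needs longer to shut down than the window allows. */
    static boolean killedBeforeDone(int gracePeriodSeconds, int preStopSeconds, int appShutdownSeconds) {
        return appShutdownSeconds > secondsAfterSigterm(gracePeriodSeconds, preStopSeconds);
    }
}
```

With the values from this case, secondsAfterSigterm(35, 35) yields 2, so any shutdown work that needs more than 2 s is cut off by SIGKILL.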
Code example – Nacos instance change listener
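import com.alibaba.nacos.client.naming.event.InstancesChangeEvent;
import com.alibaba.nacos.common.notify.Event;
import com.alibaba.nacos.common.notify.NotifyCenter;
import com.alibaba.nacos.common.notify.listener.Subscriber;
import com.netflix.loadbalancer.ILoadBalancer;
import com.netflix.loadbalancer.ZoneAwareLoadBalancer;
import javax.annotation.PostConstruct;
import javax.annotation.Resource;
import lombok.extern.slf4j.Slf4j;
import org.springframework.cloud.netflix.ribbon.SpringClientFactory;
import org.springframework.stereotype.Component;

/**
 * Subscribe to Nacos instance change notifications
 * Manually refresh Ribbon service instance cache
 * Nacos client 1.4.6 (1.4.1 has a critical bug)
 */
@Component
@Slf4j
public class NacosInstancesChangeEventListener extends Subscriber<InstancesChangeEvent> {

    @Resource
    private SpringClientFactory springClientFactory;

    @PostConstruct
    public void registerToNotifyCenter() {
        NotifyCenter.registerSubscriber(this);
    }

    @Override
    public void onEvent(InstancesChangeEvent event) {
        String service = event.getServiceName();
        // service: DEFAULT_GROUP@@demo ribbonService: demo
        // Strip the group prefix only if it is present
        int separator = service.indexOf("@@");
        String ribbonService = separator >= 0 ? service.substring(separator + 2) : service;
        log.info("#### Received Nacos instance change event:{} ribbonServiceName: {}", event.getServiceName(), ribbonService);
        ILoadBalancer loadBalancer = springClientFactory.getLoadBalancer(ribbonService);
        // Check the type before casting so other load balancer implementations do not cause a ClassCastException
        if (loadBalancer instanceof ZoneAwareLoadBalancer) {
            ((ZoneAwareLoadBalancer<?>) loadBalancer).updateListOfServers();
            log.info("Refresh ribbon service instance cache: {} success", ribbonService);
        }
    }

    @Override
    public Class<? extends Event> subscribeType() {
        return InstancesChangeEvent.class;
    }

    /**
     * Nacos 1.4.4~1.4.6 requires this method; versions >=2.1.2 fixed it.
     * When multiple registries exist, change events are not isolated, so we need to decide whether to handle the event.
     */
    @Override
    public boolean scopeMatches(InstancesChangeEvent event) {
        return true;
    }
}
2.3 Optimization points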
Reduce the 35 s sleep after Nacos deregistration if possible.
Determine a reasonable value for terminationGracePeriodSeconds based on PreStop duration and Spring Boot shutdown time.
Optimization 1
The 35 s sleep accounts for Nacos service discovery time plus Ribbon cache refresh (≈40 s in worst case). To shorten it:
Enable UDP for Nacos (requires coordination with operations).
Listen to Nacos change notifications and refresh Ribbon cache immediately when a service goes offline.
Optimization 2 – Adjust terminationGracePeriodSeconds
The value should be slightly larger than the total time spent in PreStop plus Spring Boot shutdown (which depends on business logic such as MQ messages, scheduled tasks, and thread-pool tasks). Spring Boot's graceful shutdown waits up to 30 s by default (spring.lifecycle.timeout-per-shutdown-phase), so with a PreStop phase of about 10 s a practical setting is 10 + 30 = 40 s.
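The sizing rule can be written down as a small helper. This is a sketch under stated assumptions: GracePeriodSizing is a hypothetical name, and the optional margin parameter is an addition for illustration that the article's 40 s figure does not include.

```java
import java.time.Duration;

/** Hypothetical helper for choosing terminationGracePeriodSeconds. */
final class GracePeriodSizing {
    /**
     * Adds up the PreStop time, the application's shutdown timeout,
     * and an optional safety margin (assumed, not from the article).
     */
    static long recommendedSeconds(Duration preStop, Duration appShutdownTimeout, Duration margin) {
        if (preStop.isNegative() || appShutdownTimeout.isNegative() || margin.isNegative()) {
            throw new IllegalArgumentException("durations must not be negative");
        }
        return preStop.plus(appShutdownTimeout).plus(margin).getSeconds();
    }
}
```

For the values used here, recommendedSeconds(Duration.ofSeconds(10), Duration.ofSeconds(30), Duration.ZERO) gives 40.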
Thread‑pool configuration example
// Without these settings, when kill -15 occurs, unfinished thread-pool tasks are forced to close
threadPoolTaskExecutor.setWaitForTasksToCompleteOnShutdown(true);
threadPoolTaskExecutor.setAwaitTerminationSeconds(30);
3. Further Optimizations
MQ and scheduled tasks
When a service deregisters from Nacos, it can also listen to its own deregistration event and stop consuming MQ messages and scheduled jobs, achieving a cleaner shutdown.
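A rough sketch of this idea using only JDK executors follows. BackgroundWork is a hypothetical class that stands in for an MQ consumer and a scheduler; no real MQ client or Nacos API is used, and the stop method is assumed to be called from whatever code detects the deregistration.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for an MQ consumer plus scheduled jobs. */
final class BackgroundWork {
    private final AtomicBoolean accepting = new AtomicBoolean(true);
    private final AtomicInteger processed = new AtomicInteger();
    private final ExecutorService workers = Executors.newFixedThreadPool(2);
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    /** Called per incoming message; returns false once shutdown has begun. */
    boolean onMessage(Runnable handler) {
        if (!accepting.get()) {
            return false;
        }
        try {
            workers.execute(() -> {
                handler.run();
                processed.incrementAndGet();
            });
            return true;
        } catch (RejectedExecutionException e) {
            return false; // lost the race with stop()
        }
    }

    /** Registers a periodic job. */
    void schedule(Runnable job, long periodMillis) {
        scheduler.scheduleAtFixedRate(job, periodMillis, periodMillis, TimeUnit.MILLISECONDS);
    }

    /** Stops intake, then waits for submitted work. Returns true if it drained in time. */
    boolean stop(long timeoutSeconds) {
        accepting.set(false);
        scheduler.shutdown(); // periodic jobs are cancelled by default on shutdown
        workers.shutdown();   // already submitted tasks still run to completion
        try {
            boolean drained = workers.awaitTermination(timeoutSeconds, TimeUnit.SECONDS);
            if (!drained) {
                workers.shutdownNow(); // interrupt whatever is still running
            }
            return drained;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            workers.shutdownNow();
            return false;
        }
    }

    int processedCount() {
        return processed.get();
    }
}
```

The order matters: intake is switched off first so that no new work arrives while the pool drains, which mirrors the wait-for-tasks settings shown for the Spring thread pool.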
Traffic control
If a gateway (e.g., Spring Cloud Gateway) is used instead of k8s traffic control, the gateway should also listen to Nacos deregistration events to refresh its Ribbon cache and stop routing traffic to the shutting‑down service.
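The gateway side can be illustrated with a push-updated routing table. This is a minimal sketch: RoutingTable and its methods are hypothetical, it stands in for a Ribbon server list rather than using Ribbon itself, and onInstancesChanged is assumed to be wired to the Nacos change notification.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical gateway routing table, updated by change notifications instead of polling. */
final class RoutingTable {
    private final Map<String, List<String>> instancesByService = new ConcurrentHashMap<>();

    /** Strips a group prefix such as "DEFAULT_GROUP@@" if one is present. */
    static String stripGroup(String serviceName) {
        int separator = serviceName.indexOf("@@");
        return separator >= 0 ? serviceName.substring(separator + 2) : serviceName;
    }

    /** Replaces the cached instance list as soon as a change notification arrives. */
    void onInstancesChanged(String serviceName, List<String> healthyInstances) {
        List<String> snapshot = Collections.unmodifiableList(new ArrayList<>(healthyInstances));
        instancesByService.put(stripGroup(serviceName), snapshot);
    }

    /** Instances that may currently receive traffic for the given service. */
    List<String> targets(String serviceName) {
        return instancesByService.getOrDefault(stripGroup(serviceName), Collections.emptyList());
    }
}
```

Because the list is replaced on every notification, an instance that has deregistered stops receiving traffic as soon as the event arrives rather than after the next cache refresh.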
4. Conclusion
The article presents a comprehensive graceful shutdown solution for microservices running on Kubernetes, covering basic concepts, detailed case studies, and practical optimizations such as handling MQ, scheduled tasks, and traffic control. Success depends on both the mechanical shutdown steps and the business‑specific logic that must be addressed during service termination.
Architect
Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies. Open to ideas‑driven architects who enjoy sharing and learning.