Inside Nacos Dynamic Service Discovery: Architecture, Protocols, and Code
This article explains the fundamentals of Nacos dynamic service discovery: its purpose, communication protocols, registration, heartbeat, subscription, push mechanisms, and client querying. Detailed code examples and diagrams illustrate the internal processes and the performance improvements.
01 What Is Dynamic Service Discovery?
Service discovery uses a registry to record information about all services in a distributed system so that other services can quickly locate the registered services.
In monolithic applications, DNS+Nginx can satisfy service discovery by configuring IP lists in Nginx. In microservice architectures, services are finer‑grained and frequently go online/offline, requiring a registry that can dynamically detect these changes and push updated IP lists to consumers.
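To make the idea of a registry concrete, here is a minimal in-memory sketch in Java. It is not Nacos code: the class `SimpleRegistry` and its method names are hypothetical, and it only illustrates the three operations a registry needs (register, deregister, lookup).

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

/** Hypothetical in-memory registry; an illustration, not the Nacos implementation. */
class SimpleRegistry {
    // service name -> "ip:port" addresses that are currently online
    private final Map<String, CopyOnWriteArrayList<String>> instances = new ConcurrentHashMap<>();

    /** A provider reports its address when it comes online. */
    void register(String service, String address) {
        instances.computeIfAbsent(service, k -> new CopyOnWriteArrayList<>()).addIfAbsent(address);
    }

    /** A provider is removed when it goes offline. */
    void deregister(String service, String address) {
        CopyOnWriteArrayList<String> list = instances.get(service);
        if (list != null) {
            list.remove(address);
        }
    }

    /** A consumer asks for the current address list instead of hard-coding it. */
    List<String> lookup(String service) {
        CopyOnWriteArrayList<String> list = instances.get(service);
        return list == null ? List.of() : List.copyOf(list);
    }
}
```

A consumer would call `lookup("order")` before each invocation (or cache the result), so an instance that deregisters stops receiving traffic without any configuration change.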
02 How Nacos Implements Dynamic Service Discovery
The core principle of Nacos dynamic service discovery is illustrated in the following diagram.
2.1 Communication Protocol
The registration and discovery process relies on a communication protocol. Nacos 1.x supports only HTTP. Nacos 2.x introduced gRPC, a long‑connection protocol that avoids the overhead of repeatedly creating and destroying HTTP connections, for a reported performance improvement of more than nine times.
2.2 Nacos Service Registration
Service registration means the client reports its IP, port, and other metadata to the Nacos server.
Establish a long connection: the Nacos SDK resolves the server domain name to an IP list, selects one IP, creates a gRPC connection, and monitors its status. If the connection drops, the SDK reconnects to another IP.
Health‑check request: before registration, the SDK sends an empty request; if no response is received, the server is considered unhealthy and the SDK retries a limited number of times.
Initiate registration: the SDK inserts a temporary record into a cache before sending data to the server. After a successful registration, the cache entry is marked as successful; if it fails, a background task retries the registration.
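The "cache first, send, then mark as successful" flow in the last step can be sketched in a few lines of plain Java. This is a simplified illustration under stated assumptions, not the SDK's code: `RedoRegistrar`, `redoPending`, and the `Predicate` that stands in for the remote call are all hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Predicate;

/** Hypothetical sketch of registration with a redo cache; names are illustrative. */
class RedoRegistrar {
    // instance key -> whether the server has acknowledged the registration
    private final Map<String, Boolean> redoCache = new ConcurrentHashMap<>();
    // stand-in for the remote call; returns true when the server accepts the instance
    private final Predicate<String> sendToServer;

    RedoRegistrar(Predicate<String> sendToServer) {
        this.sendToServer = sendToServer;
    }

    /** Cache the instance first, then send it; mark it registered only on success. */
    boolean register(String instanceKey) {
        redoCache.put(instanceKey, false);
        return trySend(instanceKey);
    }

    /** Body of the background task: retry every entry that is not yet acknowledged. */
    List<String> redoPending() {
        List<String> recovered = new ArrayList<>();
        for (Map.Entry<String, Boolean> entry : redoCache.entrySet()) {
            if (!entry.getValue() && trySend(entry.getKey())) {
                recovered.add(entry.getKey());
            }
        }
        return recovered;
    }

    boolean isRegistered(String instanceKey) {
        return redoCache.getOrDefault(instanceKey, false);
    }

    private boolean trySend(String instanceKey) {
        boolean ok;
        try {
            ok = sendToServer.test(instanceKey);
        } catch (RuntimeException e) {
            ok = false; // a failed call leaves the entry pending for the next redo run
        }
        if (ok) {
            redoCache.put(instanceKey, true);
        }
        return ok;
    }
}
```

In a real client, `redoPending()` would be called periodically, for example from a `ScheduledExecutorService`. The point of writing to the cache before sending is that a failed or interrupted request still leaves a record for the background task to find.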
Nacos SDK’s automatic compensation mechanism for registration failures is shown in the following sequence diagram.
Relevant source code:
<code>@Override
public void registerService(String serviceName, String groupName, Instance instance) throws NacosException {
NAMING_LOGGER.info("[REGISTER-SERVICE] {} registering service {} with instance {}", namespaceId, serviceName, instance);
// add redo log
redoService.cacheInstanceForRedo(serviceName, groupName, instance);
doRegisterService(serviceName, groupName, instance);
}
public void doRegisterService(String serviceName, String groupName, Instance instance) throws NacosException {
// send registration to server
InstanceRequest request = new InstanceRequest(namespaceId, serviceName, groupName,
NamingRemoteConstants.REGISTER_INSTANCE, instance);
requestToServer(request, Response.class);
// mark registration success
redoService.instanceRegistered(serviceName, groupName);
}
</code>Compensation task execution:
<code>@Override
public void run() {
if (!redoService.isConnected()) {
LogUtils.NAMING_LOGGER.warn("Grpc Connection is disconnect, skip current redo task");
return;
}
try {
redoForInstances();
redoForSubscribes();
} catch (Exception e) {
LogUtils.NAMING_LOGGER.warn("Redo task run with unexpected exception: ", e);
}
}
private void redoForInstances() {
for (InstanceRedoData each : redoService.findInstanceRedoData()) {
try {
redoForInstance(each);
} catch (NacosException e) {
LogUtils.NAMING_LOGGER.error("Redo instance operation {} for {}@@{} failed. ", each.getRedoType(),
each.getGroupName(), each.getServiceName(), e);
}
}
}
</code>2.3 Nacos Heartbeat Mechanism
Most registries (Consul, Eureka, ZooKeeper, etc.) use heartbeats to detect when a service goes offline; Nacos does the same.
Nacos 1.x SDK sends heartbeats via HTTP, while Nacos 2.x SDK relies on gRPC’s built‑in heartbeat. If the server does not receive a heartbeat, it assumes the instance is offline.
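As a generic illustration of heartbeat-based offline detection (closer in spirit to the explicit 1.x heartbeats than to the connection-level check in 2.x), the sketch below tracks the last heartbeat per instance and expires those that are overdue. The class `HeartbeatTracker`, its timeout, and the explicit `nowMillis` parameter are assumptions made so that the example is deterministic; none of it is taken from the Nacos source.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical heartbeat tracker; the timeout and all names are assumptions. */
class HeartbeatTracker {
    private final long timeoutMillis;
    // instance id -> timestamp (ms) of the last heartbeat received
    private final Map<String, Long> lastBeat = new ConcurrentHashMap<>();

    HeartbeatTracker(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Called whenever a heartbeat arrives from an instance. */
    void onHeartbeat(String instanceId, long nowMillis) {
        lastBeat.put(instanceId, nowMillis);
    }

    /** Removes and returns the instances whose last heartbeat is older than the timeout. */
    List<String> expire(long nowMillis) {
        List<String> offline = new ArrayList<>();
        lastBeat.forEach((id, timestamp) -> {
            if (nowMillis - timestamp > timeoutMillis) {
                offline.add(id);
            }
        });
        offline.forEach(lastBeat::remove);
        return offline;
    }

    boolean isOnline(String instanceId) {
        return lastBeat.containsKey(instanceId);
    }
}
```

A server would run `expire(System.currentTimeMillis())` on a timer and publish an offline event for each returned instance.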
<code>public class ConnectionBasedClientManager extends ClientConnectionEventListener implements ClientManager {
// connection lost, send disconnect event
public boolean clientDisconnected(String clientId) {
Loggers.SRV_LOG.info("Client connection {} disconnect, remove instances and subscribers", clientId);
ConnectionBasedClient client = clients.remove(clientId);
if (client == null) {
return true;
}
client.release();
NotifyCenter.publishEvent(new ClientEvent.ClientDisconnectEvent(client));
return true;
}
}
</code>2.4 Nacos Service Subscription
When a client subscribes to a service, Nacos records the client as a subscriber of that service and returns the latest instance list. Clients can also cancel subscriptions, after which the server removes them from the subscriber list.
Subscribed clients receive asynchronous gRPC pushes of updated instance lists, keeping local caches up‑to‑date.
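The subscribe, unsubscribe, and push relationship described above can be sketched as follows. This is a simplified model, not the server's code: `SubscriptionHub` and its methods are hypothetical, and the callbacks run synchronously here, whereas the article describes the real pushes as asynchronous gRPC calls.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

/** Hypothetical subscription model; pushes are synchronous here for simplicity. */
class SubscriptionHub {
    // service -> (client id -> callback that receives the latest instance list)
    private final Map<String, Map<String, Consumer<List<String>>>> subscribers = new ConcurrentHashMap<>();
    // service -> current instance list
    private final Map<String, List<String>> instances = new ConcurrentHashMap<>();

    /** Records the subscriber and returns the current instance list. */
    List<String> subscribe(String service, String clientId, Consumer<List<String>> onChange) {
        subscribers.computeIfAbsent(service, k -> new ConcurrentHashMap<>()).put(clientId, onChange);
        return instances.getOrDefault(service, List.of());
    }

    /** Removes the client from the subscriber list of the service. */
    void unsubscribe(String service, String clientId) {
        Map<String, Consumer<List<String>>> subs = subscribers.get(service);
        if (subs != null) {
            subs.remove(clientId);
        }
    }

    /** Replaces the instance list and notifies every current subscriber. */
    void updateInstances(String service, List<String> latest) {
        List<String> snapshot = List.copyOf(latest);
        instances.put(service, snapshot);
        Map<String, Consumer<List<String>>> subs = subscribers.get(service);
        if (subs != null) {
            subs.values().forEach(callback -> callback.accept(snapshot));
        }
    }
}
```

A client-side callback would typically just overwrite its local cache with the list it receives.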
Server‑side subscription handling code:
<code>@Override
public void subscribeService(Service service, Subscriber subscriber, String clientId) {
Service singleton = ServiceManager.getInstance().getSingletonIfExist(service).orElse(service);
Client client = clientManager.getClient(clientId);
// verify long connection
if (!clientIsLegal(client, clientId)) {
return;
}
// save subscription data
client.addServiceSubscriber(singleton, subscriber);
client.setLastUpdatedTime();
// publish subscription event
NotifyCenter.publishEvent(new ClientOperationEvent.ClientSubscribeServiceEvent(singleton, clientId));
}
</code>2.5 Nacos Push
Push Methods
Earlier Nacos versions used UDP to push updates, which suffered packet loss. Newer versions prefer gRPC; the server selects the protocol based on the client SDK version.
Push Retry
If a push fails (e.g., client restart or unstable connection), Nacos retries by placing the push task into a queue and re‑executing it, logging tasks that exceed one second.
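A rough sketch of the re-queue-on-failure idea is shown below. It is an illustration under assumptions, not the Nacos push engine: `RetryingPusher`, the attempt limit, and the `Predicate` standing in for the actual push are hypothetical, and the delay between attempts is left out to keep the example deterministic.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Predicate;

/** Hypothetical push queue that re-queues failed tasks up to a maximum number of attempts. */
class RetryingPusher {
    /** A push task together with the number of attempts made so far. */
    private static final class PushTask {
        final String clientId;
        int attempts;

        PushTask(String clientId) {
            this.clientId = clientId;
        }
    }

    private final Deque<PushTask> queue = new ArrayDeque<>();
    private final Predicate<String> push; // stand-in for the real push; true means delivered
    private final int maxAttempts;

    RetryingPusher(Predicate<String> push, int maxAttempts) {
        this.push = push;
        this.maxAttempts = maxAttempts;
    }

    void submit(String clientId) {
        queue.addLast(new PushTask(clientId));
    }

    /** Runs until the queue is empty and returns the number of delivered pushes. */
    int drain() {
        int delivered = 0;
        while (!queue.isEmpty()) {
            PushTask task = queue.pollFirst();
            task.attempts++;
            if (push.test(task.clientId)) {
                delivered++;
            } else if (task.attempts < maxAttempts) {
                queue.addLast(task); // failed: put the task back for another attempt
            }
        }
        return delivered;
    }
}
```

Bounding the attempts is a design choice of this sketch: it keeps a permanently unreachable client from occupying the queue forever.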
Push Source Code
Adding a push task to the execution queue:
<code>private static class PushDelayTaskProcessor implements NacosTaskProcessor {
private final PushDelayTaskExecuteEngine executeEngine;
public PushDelayTaskProcessor(PushDelayTaskExecuteEngine executeEngine) {
this.executeEngine = executeEngine;
}
@Override
public boolean process(NacosTask task) {
PushDelayTask pushDelayTask = (PushDelayTask) task;
Service service = pushDelayTask.getService();
NamingExecuteTaskDispatcher.getInstance()
.dispatchAndExecuteTask(service, new PushExecuteTask(service, executeEngine, pushDelayTask));
return true;
}
}
</code>Push execution task:
<code>@Override
public void run() {
try {
// package data to push
PushDataWrapper wrapper = generatePushData();
ClientManager clientManager = delayTaskEngine.getClientManager();
for (String each : getTargetClientIds()) {
Client client = clientManager.getClient(each);
if (client == null) {
// client disconnected
continue;
}
Subscriber subscriber = clientManager.getClient(each).getSubscriber(service);
// push to subscriber
delayTaskEngine.getPushExecutor().doPushWithCallback(each, subscriber, wrapper,
new NamingPushCallback(each, subscriber, wrapper.getOriginalData(), delayTask.isPushToAll()));
}
} catch (Exception e) {
Loggers.PUSH.error("Push task for service" + service.getGroupedServiceName() + " execute failed ", e);
// re‑queue on failure
delayTaskEngine.addTask(service, new PushDelayTask(service, 1000L));
}
}
</code>2.6 Nacos SDK Query Service Instances
Consumers call the Nacos SDK to obtain the latest instance list, then select an instance (e.g., via weighted round‑robin) for invocation.
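For the selection step, one common approach is smooth weighted round-robin. The sketch below shows that technique in plain Java. It is not the SDK's own selection code: `WeightedRoundRobin` is a hypothetical class, and integer weights are assumed for simplicity.

```java
import java.util.ArrayList;
import java.util.List;

/** Smooth weighted round-robin over a fixed set of instances (illustrative sketch). */
class WeightedRoundRobin {
    private static final class Node {
        final String address;
        final int weight;
        int current; // running counter adjusted on every selection

        Node(String address, int weight) {
            this.address = address;
            this.weight = weight;
        }
    }

    private final List<Node> nodes = new ArrayList<>();
    private int totalWeight;

    synchronized void add(String address, int weight) {
        if (weight <= 0) {
            throw new IllegalArgumentException("weight must be positive");
        }
        nodes.add(new Node(address, weight));
        totalWeight += weight;
    }

    /**
     * Raises every counter by its weight, picks the node with the largest counter,
     * then lowers the winner's counter by the total weight.
     */
    synchronized String next() {
        if (nodes.isEmpty()) {
            throw new IllegalStateException("no instances available");
        }
        Node best = null;
        for (Node node : nodes) {
            node.current += node.weight;
            if (best == null || node.current > best.current) {
                best = node;
            }
        }
        best.current -= totalWeight;
        return best.address;
    }
}
```

With weights 5, 1, and 1, seven consecutive calls pick the first instance five times and each of the others once, and the picks are interleaved rather than bunched together.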
The SDK first checks its local memory cache, which is updated by pushes; if data is missing, it falls back to a subscription request or a direct query to the server. A failover mode can be enabled to read cached data from disk when the server is unavailable.
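The lookup order described above (memory first, then the server, then a disk snapshot when the server cannot be reached) can be modeled as follows. This is a simplified sketch, not the SDK's logic: `InstanceLookup` is hypothetical, a `Map` stands in for the files on disk, and treating failover as a fallback on a failed server call is an assumption of the example.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical lookup chain: memory cache, then server, then a disk snapshot. */
class InstanceLookup {
    private final Map<String, List<String>> memoryCache = new ConcurrentHashMap<>();
    private final Map<String, List<String>> diskSnapshot;     // stand-in for cached files on disk
    private final Function<String, List<String>> queryServer; // throws when the server is unavailable

    InstanceLookup(Map<String, List<String>> diskSnapshot, Function<String, List<String>> queryServer) {
        this.diskSnapshot = diskSnapshot;
        this.queryServer = queryServer;
    }

    /** Called when the server pushes an updated instance list. */
    void onPush(String service, List<String> latest) {
        memoryCache.put(service, List.copyOf(latest));
    }

    List<String> getInstances(String service) {
        List<String> cached = memoryCache.get(service);
        if (cached != null) {
            return cached; // kept fresh by pushes
        }
        try {
            List<String> fresh = queryServer.apply(service);
            if (fresh == null) {
                fresh = List.of();
            }
            memoryCache.put(service, fresh);
            return fresh;
        } catch (RuntimeException serverUnavailable) {
            // failover: serve the last snapshot that was written to disk
            return diskSnapshot.getOrDefault(service, List.of());
        }
    }
}
```

The disk snapshot may be stale, so in this model it is only consulted when neither the memory cache nor the server can answer.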
Querying service instances code snippet:
<code>private final ConcurrentMap<String, ServiceInfo> serviceInfoMap;
@Override
public List<Instance> getAllInstances(String serviceName, String groupName, List<String> clusters, boolean subscribe) throws NacosException {
ServiceInfo serviceInfo;
String clusterString = StringUtils.join(clusters, ",");
if (subscribe) {
// get from local memory, fallback to disk if needed
serviceInfo = serviceInfoHolder.getServiceInfo(serviceName, groupName, clusterString);
if (serviceInfo == null || !clientProxy.isSubscribed(serviceName, groupName, clusterString)) {
// subscribe if not present
serviceInfo = clientProxy.subscribe(serviceName, groupName, clusterString);
}
} else {
// direct query without subscription
serviceInfo = clientProxy.queryInstancesOfService(serviceName, groupName, clusterString, 0, false);
}
if (serviceInfo == null || CollectionUtils.isEmpty(serviceInfo.getHosts())) {
return new ArrayList<>();
}
return serviceInfo.getHosts();
}
</code>03 Conclusion
This article introduced the basic concepts and core capabilities of Nacos service discovery, providing a deeper understanding of its registration, discovery, heartbeat, subscription, push, and query mechanisms.
Sanyou's Java Diary
Passionate about technology, though not great at solving problems; eager to share, never tire of learning!