
Inside Nacos Dynamic Service Discovery: Architecture, Protocols, and Code

This article explains the fundamentals of Nacos dynamic service discovery: its purpose, communication protocols, registration, heartbeats, subscription, push mechanism, and client-side querying, with code examples and diagrams illustrating the internal processes and performance improvements.

Sanyou's Java Diary

01 What Is Dynamic Service Discovery?

Service discovery uses a registry to record information about all services in a distributed system so that other services can quickly locate the registered services.

In monolithic applications, DNS+Nginx can satisfy service discovery by configuring IP lists in Nginx. In microservice architectures, services are finer‑grained and frequently go online/offline, requiring a registry that can dynamically detect these changes and push updated IP lists to consumers.

02 How Nacos Implements Dynamic Service Discovery

The core principle of Nacos dynamic service discovery is illustrated in the following diagram.

2.1 Communication Protocol

The registration and discovery process relies on a communication protocol. Nacos 1.x only supports HTTP, while Nacos 2.x introduced gRPC, a long‑connection protocol that reduces the overhead of creating and destroying HTTP connections, improving performance by more than nine times.

2.2 Nacos Service Registration

Service registration means the client reports its IP, port, and other metadata to the Nacos server.

1. Establish a long connection: the Nacos SDK resolves the server domain to an IP list, selects one IP, creates a gRPC connection, and monitors its status, reconnecting to another IP if the connection drops.

2. Health-check request: before registration, the SDK sends an empty request; if no response is received, the server is considered unhealthy and the SDK retries a limited number of times.

3. Initiate registration: the SDK inserts a temporary record into a cache before sending data to the server. After a successful registration the cache entry is marked as successful; if registration fails, a background task retries it.
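The cache-then-retry idea behind step 3 can be modeled in a few lines. This is a deliberately simplified sketch: the class and method names (`RedoCache`, `cacheForRedo`, `markRegistered`, `pendingRetries`) are hypothetical, not Nacos source.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;

// Minimal model of the registration redo cache (hypothetical names, not Nacos source).
class RedoCache {
    // service key -> registered flag; false means the compensation task must retry
    private final Map<String, Boolean> entries = new ConcurrentHashMap<>();

    // step 1: record the intent to register before calling the server
    void cacheForRedo(String serviceKey) {
        entries.put(serviceKey, false);
    }

    // step 2: mark the entry as successful once the server acknowledges
    void markRegistered(String serviceKey) {
        entries.put(serviceKey, true);
    }

    // step 3: the background compensation task retries everything still pending
    List<String> pendingRetries() {
        return entries.entrySet().stream()
                .filter(e -> !e.getValue())
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```

A scheduled task would periodically call `pendingRetries()` and re-issue the registration request for each pending entry, which mirrors what the `redoForInstances()` loop in the real source does.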

Nacos SDK’s automatic compensation mechanism for registration failures is shown in the following sequence diagram.

Relevant source code:

<code>@Override
public void registerService(String serviceName, String groupName, Instance instance) throws NacosException {
    NAMING_LOGGER.info("[REGISTER-SERVICE] {} registering service {} with instance {}", namespaceId, serviceName, instance);
    // add redo log
    redoService.cacheInstanceForRedo(serviceName, groupName, instance);
    doRegisterService(serviceName, groupName, instance);
}

public void doRegisterService(String serviceName, String groupName, Instance instance) throws NacosException {
    // send registration to server
    InstanceRequest request = new InstanceRequest(namespaceId, serviceName, groupName,
        NamingRemoteConstants.REGISTER_INSTANCE, instance);
    requestToServer(request, Response.class);
    // mark registration success
    redoService.instanceRegistered(serviceName, groupName);
}
</code>

Compensation task execution:

<code>@Override
public void run() {
    if (!redoService.isConnected()) {
        LogUtils.NAMING_LOGGER.warn("Grpc Connection is disconnect, skip current redo task");
        return;
    }
    try {
        redoForInstances();
        redoForSubscribes();
    } catch (Exception e) {
        LogUtils.NAMING_LOGGER.warn("Redo task run with unexpected exception: ", e);
    }
}

private void redoForInstances() {
    for (InstanceRedoData each : redoService.findInstanceRedoData()) {
        try {
            redoForInstance(each);
        } catch (NacosException e) {
            LogUtils.NAMING_LOGGER.error("Redo instance operation {} for {}@@{} failed. ", each.getRedoType(),
                each.getGroupName(), each.getServiceName(), e);
        }
    }
}
</code>

2.3 Nacos Heartbeat Mechanism

Most registries (Consul, Eureka, ZooKeeper, etc.) use heartbeats to detect when a service goes offline; Nacos does the same.

Nacos 1.x SDK sends heartbeats via HTTP, while Nacos 2.x SDK relies on gRPC’s built‑in heartbeat. If the server does not receive a heartbeat, it assumes the instance is offline.

<code>public class ConnectionBasedClientManager extends ClientConnectionEventListener implements ClientManager {
    // connection lost, send disconnect event
    public boolean clientDisconnected(String clientId) {
        Loggers.SRV_LOG.info("Client connection {} disconnect, remove instances and subscribers", clientId);
        ConnectionBasedClient client = clients.remove(clientId);
        if (client == null) {
            return true;
        }
        client.release();
        NotifyCenter.publishEvent(new ClientEvent.ClientDisconnectEvent(client));
        return true;
    }
}
</code>
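The timeout rule described above ("no heartbeat received, instance assumed offline") can be modeled in miniature. This is an illustrative sketch with hypothetical names; in Nacos 2.x the actual liveness signal comes from the gRPC long connection, and disconnects are handled by the `ConnectionBasedClientManager` shown above.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;

// Simplified heartbeat bookkeeping (hypothetical names, not Nacos source).
class HeartbeatTracker {
    private final Map<String, Long> lastBeat = new ConcurrentHashMap<>();
    private final long timeoutMillis;

    HeartbeatTracker(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // record a heartbeat from a client at the given timestamp
    void onHeartbeat(String clientId, long atMillis) {
        lastBeat.put(clientId, atMillis);
    }

    // clients whose last heartbeat is older than the timeout are treated as offline
    List<String> expiredClients(long nowMillis) {
        return lastBeat.entrySet().stream()
                .filter(e -> nowMillis - e.getValue() > timeoutMillis)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```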

2.4 Nacos Service Subscription

When a client subscribes to a service, the Nacos server records it as a subscriber and returns the latest instance list; whenever instances of that service go online or offline, the server notifies every subscriber. Clients can also cancel subscriptions, after which the server removes them from the subscriber list.

Subscribed clients receive asynchronous gRPC pushes of updated instance lists, keeping local caches up‑to‑date.
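The bookkeeping involved can be sketched as a subscriber registry. The names here (`SubscriberRegistry`, `pushTargets`) are hypothetical simplifications, not Nacos source.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Minimal subscriber registry (hypothetical names, not Nacos source).
class SubscriberRegistry {
    private final Map<String, Set<String>> subscribers = new ConcurrentHashMap<>();

    // record a client as a subscriber of a service
    void subscribe(String service, String clientId) {
        subscribers.computeIfAbsent(service, s -> ConcurrentHashMap.newKeySet()).add(clientId);
    }

    // cancel a subscription; the client stops receiving pushes
    void unsubscribe(String service, String clientId) {
        subscribers.getOrDefault(service, Collections.emptySet()).remove(clientId);
    }

    // when a service's instance list changes, every remaining subscriber gets a push
    Set<String> pushTargets(String service) {
        return subscribers.getOrDefault(service, Collections.emptySet());
    }
}
```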

Server‑side subscription handling code:

<code>@Override
public void subscribeService(Service service, Subscriber subscriber, String clientId) {
    Service singleton = ServiceManager.getInstance().getSingletonIfExist(service).orElse(service);
    Client client = clientManager.getClient(clientId);
    // verify long connection
    if (!clientIsLegal(client, clientId)) {
        return;
    }
    // save subscription data
    client.addServiceSubscriber(singleton, subscriber);
    client.setLastUpdatedTime();
    // publish subscription event
    NotifyCenter.publishEvent(new ClientOperationEvent.ClientSubscribeServiceEvent(singleton, clientId));
}
</code>

2.5 Nacos Push

Push Methods

Earlier Nacos versions used UDP to push updates, which suffered packet loss. Newer versions prefer gRPC; the server selects the protocol based on the client SDK version.

Push Retry

If a push fails (e.g., client restart or unstable connection), Nacos retries by placing the push task into a queue and re‑executing it, logging tasks that exceed one second.

Push Source Code

Adding a push task to the execution queue:

<code>private static class PushDelayTaskProcessor implements NacosTaskProcessor {
    private final PushDelayTaskExecuteEngine executeEngine;
    public PushDelayTaskProcessor(PushDelayTaskExecuteEngine executeEngine) {
        this.executeEngine = executeEngine;
    }
    @Override
    public boolean process(NacosTask task) {
        PushDelayTask pushDelayTask = (PushDelayTask) task;
        Service service = pushDelayTask.getService();
        NamingExecuteTaskDispatcher.getInstance()
            .dispatchAndExecuteTask(service, new PushExecuteTask(service, executeEngine, pushDelayTask));
        return true;
    }
}
</code>

Push execution task:

<code>@Override
public void run() {
    try {
        // package data to push
        PushDataWrapper wrapper = generatePushData();
        ClientManager clientManager = delayTaskEngine.getClientManager();
        for (String each : getTargetClientIds()) {
            Client client = clientManager.getClient(each);
            if (client == null) {
                // client disconnected
                continue;
            }
            Subscriber subscriber = client.getSubscriber(service);
            // push to subscriber
            delayTaskEngine.getPushExecutor().doPushWithCallback(each, subscriber, wrapper,
                new NamingPushCallback(each, subscriber, wrapper.getOriginalData(), delayTask.isPushToAll()));
        }
    } catch (Exception e) {
        Loggers.PUSH.error("Push task for service " + service.getGroupedServiceName() + " execute failed ", e);
        // re‑queue on failure
        delayTaskEngine.addTask(service, new PushDelayTask(service, 1000L));
    }
}
</code>

2.6 Nacos SDK Query Service Instances

Consumers call the Nacos SDK to obtain the latest instance list, then select one instance for invocation, typically weighting the choice by each instance's configured weight.
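Weighted selection can be sketched as a generic weighted-random pick. This is an illustration of the idea, not the exact Nacos balancer, which additionally accounts for instance health when weighting.

```java
import java.util.List;
import java.util.Random;

// Generic weighted random selection over instances (illustrative, not Nacos source).
class WeightedChooser {
    // pick an item with probability proportional to its weight
    static <T> T choose(List<T> items, List<Double> weights, Random rnd) {
        double total = 0;
        for (double w : weights) {
            total += w;
        }
        double point = rnd.nextDouble() * total;
        double acc = 0;
        for (int i = 0; i < items.size(); i++) {
            acc += weights.get(i);
            if (point < acc) {
                return items.get(i);
            }
        }
        return items.get(items.size() - 1); // guard against floating-point rounding
    }
}
```

An instance with weight 0 is never chosen, and doubling an instance's weight doubles its share of traffic on average.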

The SDK first checks its local memory cache, which is updated by pushes; if data is missing, it falls back to a subscription request or a direct query to the server. A failover mode can be enabled to read cached data from disk when the server is unavailable.

Querying service instances code snippet:

<code>private final ConcurrentMap<String, ServiceInfo> serviceInfoMap;
@Override
public List<Instance> getAllInstances(String serviceName, String groupName, List<String> clusters, boolean subscribe) throws NacosException {
    ServiceInfo serviceInfo;
    String clusterString = StringUtils.join(clusters, ",");
    if (subscribe) {
        // get from local memory, fallback to disk if needed
        serviceInfo = serviceInfoHolder.getServiceInfo(serviceName, groupName, clusterString);
        if (serviceInfo == null || !clientProxy.isSubscribed(serviceName, groupName, clusterString)) {
            // subscribe if not present
            serviceInfo = clientProxy.subscribe(serviceName, groupName, clusterString);
        }
    } else {
        // direct query without subscription
        serviceInfo = clientProxy.queryInstancesOfService(serviceName, groupName, clusterString, 0, false);
    }
    if (serviceInfo == null || CollectionUtils.isEmpty(serviceInfo.getHosts())) {
        return new ArrayList<>();
    }
    return serviceInfo.getHosts();
}
</code>

03 Conclusion

This article introduced the basic concepts and core capabilities of Nacos service discovery, providing a deeper understanding of its registration, discovery, heartbeat, subscription, push, and query mechanisms.

Tags: Microservices, Service Discovery, gRPC, Nacos, Dynamic Registration
Written by Sanyou's Java Diary

Passionate about technology, though not great at solving problems; eager to share, never tire of learning!