Boosting a Dubbo Registry from 40 to 1100 QPS: Practical Performance Hacks

Through systematic measurement, bottleneck identification, and targeted optimizations—including lock redesign, Redis caching, and URL parsing improvements—the author transformed a Go‑based Dubbo registry’s registration throughput from 40 QPS to over 1100 QPS, demonstrating practical performance‑tuning techniques for high‑scale backend services.


Hello everyone, I'm Xiao Lou. This article shares a macro‑level performance optimization of a self‑built Dubbo registry written in Go.

Background

The project is a custom Dubbo registration center. Consumers and providers send their requests to a local Agent, which proxies them to the Registry. Each Agent holds a long-lived gRPC connection to the Registry, over which the Registry pushes updates; the Agent also periodically pulls the subscription list. The Agent runs on the same machine as the business services, much like a Service Mesh sidecar.

The Registry plays a role similar to ZooKeeper, implemented as a simple web service offering register, deregister, and subscribe APIs.
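For orientation, here is a minimal sketch of what such an API surface might look like in Go; the type and method names are illustrative, not the project's actual code:

// Hypothetical shapes for the three core APIs described above.
type Endpoint struct {
    App     string // application name
    Cluster string // cluster the instance belongs to
    Addr    string // host:port of the provider instance
}

type Registry interface {
    Register(ep Endpoint) error               // provider announces itself
    Deregister(ep Endpoint) error             // provider leaves
    Subscribe(app string) ([]Endpoint, error) // consumer fetches providers
}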

Do We Really Need Performance Optimization?

Optimization aims to reduce resource consumption (lower cost) and improve system stability. Without measurable benefits, optimization can be wasteful.

In this registry, registration failures block Dubbo application startup. As more applications were onboarded and the company moved from VMs to containers, rapid scaling required the registry to handle high QPS during peak expansions. A corporate drill set a performance target of 1000 QPS, prompting this effort.

Metric Collection

Core interfaces were instrumented with metrics for request count, latency (p99/p95/p90), and error count. Load testing revealed an initial registration throughput of about 40 QPS, with a p99 latency under 1 second as the success criterion.
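The original post doesn't name the metrics library; as one hedged sketch, instrumenting the register path with Prometheus's Go client could look like this (metric names are illustrative):

import (
    "time"

    "github.com/prometheus/client_golang/prometheus"
)

var (
    registerLatency = prometheus.NewHistogram(prometheus.HistogramOpts{
        Name:    "registry_register_seconds",
        Help:    "Latency of register requests.",
        Buckets: prometheus.DefBuckets,
    })
    registerErrors = prometheus.NewCounter(prometheus.CounterOpts{
        Name: "registry_register_errors_total",
        Help: "Failed register requests.",
    })
)

func init() {
    prometheus.MustRegister(registerLatency, registerErrors)
}

func instrumentedRegister(ep Endpoint) error {
    start := time.Now()
    err := register(ep) // the real handler, not shown in the article
    registerLatency.Observe(time.Since(start).Seconds())
    if err != nil {
        registerErrors.Inc()
    }
    return err
}

p99/p95/p90 can then be derived from the histogram at query time.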

Where Was the Bottleneck?

The registration flow takes a MySQL-based pessimistic lock, then creates the app and cluster records if they don't exist and inserts the endpoint. The lock is granular to the app: only one registration per app can proceed at a time, which severely limits throughput.
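The post doesn't show the locking code, but the classic way to hold such an app-level pessimistic lock on MySQL is SELECT ... FOR UPDATE inside a transaction; a rough sketch, with table and column names as assumptions:

import "database/sql"

func registerWithPessimisticLock(db *sql.DB, appName string, ep Endpoint) error {
    tx, err := db.Begin()
    if err != nil {
        return err
    }
    defer tx.Rollback() // no-op once Commit succeeds

    // Row lock on the app record: this is the app-level lock, so every
    // concurrent registration for the same app blocks here.
    var appID int64
    if err := tx.QueryRow(
        "SELECT id FROM app WHERE name = ? FOR UPDATE", appName,
    ).Scan(&appID); err != nil {
        return err
    }

    if _, err := tx.Exec(
        "INSERT INTO endpoint (app_id, addr) VALUES (?, ?)", appID, ep.Addr,
    ); err != nil {
        return err
    }
    return tx.Commit()
}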

Logging timestamps at each critical step confirmed that acquiring the lock was the slowest part.
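A minimal version of that kind of step timing in Go:

import (
    "log"
    "time"
)

// timedStep runs one stage of the flow and logs how long it took.
func timedStep(name string, step func() error) error {
    start := time.Now()
    err := step()
    log.Printf("step=%s cost=%s err=%v", name, time.Since(start), err)
    return err
}

Wrapping each stage (acquire lock, create app, insert endpoint) this way makes the slow stage obvious in the logs.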

Lock Optimization

Two strategies were applied:

1. Add unique indexes to tables where possible, letting the database reject duplicate inserts without an explicit lock.

2. Replace MySQL pessimistic locks with Redis optimistic locks (a sketch follows below).
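The lock implementation isn't shown in the original; a minimal sketch of a Redis lock built on SET NX with a TTL, assuming go-redis v9 (the key naming is illustrative):

import (
    "context"
    "time"

    "github.com/redis/go-redis/v9"
)

// tryLock takes a per-app lock. SetNX succeeds only if the key does not
// already exist; the TTL keeps a crashed holder from leaking the lock.
func tryLock(ctx context.Context, rdb *redis.Client, app string) (bool, error) {
    return rdb.SetNX(ctx, "lock:app:"+app, "1", 5*time.Second).Result()
}

func unlock(ctx context.Context, rdb *redis.Client, app string) error {
    return rdb.Del(ctx, "lock:app:"+app).Err()
}

A production version would also store a unique token and release via a compare-and-delete script, so one caller cannot remove another's lock.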

For tables that cannot have unique indexes (e.g., App and Cluster), a double‑checked locking pattern similar to the Java singleton was introduced:

public class Singleton {
    // volatile prevents reordering, so other threads never observe a
    // half-constructed instance
    private static volatile Singleton instance = null;

    private Singleton() {}

    public static Singleton getInstance() {
        if (instance == null) {                 // first check, lock-free
            synchronized (Singleton.class) {
                if (instance == null) {         // second check, under lock
                    instance = new Singleton();
                }
            }
        }
        return instance;
    }
}

Applied to the registry, the pseudo‑code becomes:

app = DB.get("app_name")           // first check, lock-free
if app == null {
    redis.lock("app_name")         // per-app lock, only on the miss path
    app = DB.get("app_name")       // second check, under the lock
    if app == null {
        app = DB.insert("app_name")
    }
    redis.unlock("app_name")
}

This reduces contention dramatically: the lock is only taken the first time an app or cluster is created, and every subsequent registration follows the lock-free fast path.

Result: registration throughput rose from 40 QPS to 430 QPS (≈10×).

Read‑Through Cache

Since basic app/cluster information rarely changes, a Redis read-through cache was added for these reads, nudging QPS from 430 to 440. The modest gain suggested that database reads were no longer the main cost.
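A hedged sketch of that read-through pattern, reusing the go-redis client from above (query and key names are assumptions):

// getApp reads through the cache: Redis first, MySQL only on a miss.
func getApp(ctx context.Context, rdb *redis.Client, db *sql.DB, name string) (string, error) {
    key := "app:" + name
    v, err := rdb.Get(ctx, key).Result()
    if err == nil {
        return v, nil // cache hit
    }
    if err != redis.Nil {
        return "", err // real Redis failure, not just a miss
    }

    if err := db.QueryRow("SELECT name FROM app WHERE name = ?", name).Scan(&v); err != nil {
        return "", err
    }
    // App/cluster data rarely changes, so a long TTL is acceptable.
    _ = rdb.Set(ctx, key, v, time.Hour).Err()
    return v, nil
}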

CPU Optimization

Profiling with Go’s pprof showed ParseUrl consuming excessive CPU. The URL parsing was performed multiple times per request.
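The profiling setup isn't shown; the standard way to expose pprof in a long-running Go service is the net/http/pprof handler on a side port:

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/* handlers
)

func main() {
    go func() {
        // Then, for a 30-second CPU profile:
        //   go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()
    // ... start the registry as usual ...
}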

Two possible fixes were considered:

1. Refactor the call chain so each URL is parsed once and the parsed object is passed along.

2. Introduce a per‑session cache for parsed URLs.

The second approach was chosen to minimize code changes. Inspired by Dubbo’s own object caching in org.apache.dubbo.common.utils.PojoUtils#generalize, a cache wrapper was added:

func parseUrl(url, cache) {
    u = cache.get(url)
    if u != null {          // cache hit: skip the expensive parse
        return u
    }
    u = parseUrl0(url)      // slow path: parse once and remember
    cache.put(url, u)
    return u
}
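In real Go code the cache must be safe for concurrent use; a minimal sketch with sync.Map, scoped to a session so entries are dropped with the connection (ParsedURL and parseUrl0 stand in for the project's actual types):

import "sync"

// Session carries a per-connection cache of parsed URLs, so repeated
// parsing of the same URL within one session hits the cache.
type Session struct {
    urlCache sync.Map // raw URL string -> *ParsedURL
}

func (s *Session) parseUrl(raw string) (*ParsedURL, error) {
    if v, ok := s.urlCache.Load(raw); ok {
        return v.(*ParsedURL), nil // fast path: already parsed
    }
    u, err := parseUrl0(raw) // the original, expensive parser
    if err != nil {
        return nil, err
    }
    s.urlCache.Store(raw, u)
    return u, nil
}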

After this change, QPS jumped to 1100, meeting the target.

Final Thoughts

The optimizations were incremental and low‑cost, suitable when a full rewrite is impossible. The key lessons are: define clear metrics, locate bottlenecks, prioritize high‑frequency paths, and validate each change with data.

Follow the WeChat public account "捉虫大师" for more backend technical sharing.