
Design and Performance Optimization of a Custom API Gateway

To overcome Spring Cloud Gateway’s memory leaks, slow O(N) route matching, and complex reactive code, the team built a custom thread‑per‑core Netty gateway with in‑memory O(1) route lookup, DAG‑based filter chains, and an asynchronous client framework, delivering roughly four times the throughput (≈45k QPS), ~19 ms average response time, no memory leaks, and far lower latency.

DeWu Technology

Business Background

The original gateway was built with Spring Cloud Gateway (SCG) on top of WebFlux, which offers low entry cost and rapid development. As traffic grew, the gateway faced security, performance, and stability issues due to generic routing, massive route registration, and complex filter logic.

Technical Pain Points

SCG suffers from memory leaks, complex reactive programming, multi‑layer abstraction overhead, inefficient O(N) route matching, long cold‑start latency, and tight coupling of routing and business logic.

Ideal Gateway Requirements

Support massive interface registration with dynamic routing and high‑performance matching.

Simplify the programming model.

Match or exceed SCG performance (low 99th‑percentile response time, low average response time).

Remain stable, with no memory leaks and seamless upgrades.

Strong extensibility (timeouts, multi‑protocol support, ecosystem).

Clear architecture separating data flow from control flow with UI control plane.

Solution Research

After evaluating open‑source gateways (e.g., Zuul2) and finding them insufficient, a custom solution was pursued.

Core Architecture

The new gateway adopts a thread‑per‑core model similar to Nginx, keeping the entire request lifecycle on a single Netty EventLoop thread to avoid context switches and reduce latency.
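As an illustration only (not the gateway's actual Netty code), the thread‑per‑core idea can be sketched with plain JDK executors: one single‑threaded executor per core, with each connection pinned to one executor, so every event for that connection runs on the same thread.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch of the thread-per-core model: N single-threaded
// event loops (one per core); every connection is pinned to exactly one
// loop, so all of its events run on the same thread.
final class EventLoopGroupSketch {
    private final ExecutorService[] loops;

    EventLoopGroupSketch(int cores) {
        loops = new ExecutorService[cores];
        for (int i = 0; i < cores; i++) {
            loops[i] = Executors.newSingleThreadExecutor();
        }
    }

    // Deterministic pinning: the same connection always gets the same loop,
    // which is what removes cross-thread handoffs from the request path.
    ExecutorService loopFor(int connectionId) {
        return loops[Math.floorMod(connectionId, loops.length)];
    }

    void shutdown() {
        for (ExecutorService loop : loops) loop.shutdown();
    }
}
```

In the real gateway this role is played by Netty's `EventLoop`, which additionally owns the socket I/O for its pinned channels.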

Request Context Wrapper

public class DefaultServerWebExchange implements ServerWebExchange {
    private final Channel client2ProxyChannel;
    private final Channel proxy2ClientChannel;
    private final EventLoop executor;
    private ServerHttpRequest request;
    private ServerHttpResponse response;
    private final Map<String, Object> attributes;
}

Client2ProxyHttpHandler

@Override
protected void channelRead0(ChannelHandlerContext ctx, FullHttpRequest fullHttpRequest) {
    try {
        Channel client2ProxyChannel = ctx.channel();
        DefaultServerHttpRequest serverHttpRequest = new DefaultServerHttpRequest(fullHttpRequest, client2ProxyChannel);
        ServerWebExchange serverWebExchange = new DefaultServerWebExchange(client2ProxyChannel, (EventLoop) ctx.executor(), serverHttpRequest, null);
        // request filter chain
        this.requestFilterChain.filter(serverWebExchange);
    } catch (Throwable t) {
        log.error("Exception caused before filters!\n {}", ExceptionUtils.getStackTrace(t));
        ByteBufHelper.safeRelease(fullHttpRequest);
        throw t;
    }
}

FilterChain Design

public void filter(ServerWebExchange exchange) {
    if (this.index < filters.size()) {
        GatewayFilter filter = filters.get(this.index);
        DefaultGatewayFilterChain chain = new DefaultGatewayFilterChain(this, this.index + 1);
        try {
            filter.filter(exchange, chain);
        } catch (Throwable e) {
            log.error("Unhandled exception in filter chain! Request path: {}, FilterClass: {}, exception: {}",
                exchange.getRequest().getPath(), filter.getClass(), ExceptionUtils.getFullStackTrace(e));
            ResponseDecorator.failResponse(exchange, 500, "Gateway internal error! filter chain exception!");
        }
        }
    }
}

Filters are split into RequestFilter and ResponseFilter with strictly increasing, non‑duplicate order values to form a DAG.
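The ordering constraint can be sketched as a small validation step (hypothetical names, not the gateway's code): because order values must be strictly increasing and duplicate‑free, the assembled chain is a fixed linear path per route with no cycles.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: filters carry an integer order value; the chain
// requires the sorted orders to be duplicate-free, so execution forms a
// DAG (a fixed linear path per route, never a cycle).
final class FilterOrdering {
    static List<Integer> validate(List<Integer> orders) {
        List<Integer> sorted = new ArrayList<>(orders);
        Collections.sort(sorted);
        for (int i = 1; i < sorted.size(); i++) {
            if (sorted.get(i).equals(sorted.get(i - 1))) {
                throw new IllegalStateException("duplicate filter order: " + sorted.get(i));
            }
        }
        return sorted; // execution order of the chain
    }
}
```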

Routing and Matching

Routes are stored in memory (hash map) for O(1) lookup. The Route class holds id, skipCount, and target URI.

public class Route implements Ordered {
    private final String id;
    private final int skipCount;
    private final URI uri;
}

Lookup logic:

private Route lookupRoute(ServerWebExchange exchange) {
    String path = exchange.getRequest().getPath();
    Route exactRoute = pathRouteMap.getOrDefault(path, null);
    if (exactRoute != null) {
        exchange.getAttributes().put(DAGApplicationConfig.GATEWAY_ROUTE_CACHE, exactRoute);
        return exactRoute;
    }
    // additional matching omitted for brevity
    return null;
}
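A minimal usage sketch of the idea (class and path names hypothetical): keying routes by exact path in a HashMap keeps lookup cost constant no matter how many routes are registered, in contrast to an O(N) scan over route predicates.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical route-table sketch: exact-path keys in a HashMap give
// average O(1) lookup regardless of the number of registered routes.
final class RouteTable {
    private final Map<String, String> pathToUpstream = new HashMap<>();

    void register(String path, String upstreamUri) {
        pathToUpstream.put(path, upstreamUri);
    }

    String lookup(String path) {
        return pathToUpstream.get(path); // constant-time exact match
    }
}
```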

Async Client Framework

Provides protocol‑agnostic non‑blocking RPC with timeout management and callback handling.

public class AsyncContext<Req, Resp> implements Cloneable {
    STATE state = STATE.INIT;
    final Channel usedChannel;
    final ChannelPool usedChannelPool;
    final EventExecutor executor;
    final AsyncClient<Req, Resp> agent;
    Req request;
    Resp response;
    ResponseCallback<Resp> responseCallback;
    ExceptionCallback exceptionCallback;
    int timeout;
    long deadline;
    long sendTimestamp;
    Promise<Resp> responsePromise;
}

AsyncClient submission flow:

public void submitSend(AsyncContext<Req, Resp> asyncContext) {
    asyncContext.state = AsyncContext.STATE.SENDING;
    asyncContext.deadline = asyncContext.timeout + System.currentTimeMillis();
    ReferenceCountUtil.retain(asyncContext.request);
    Future<Resp> responseFuture = trySend(asyncContext);
    responseFuture.addListener((GenericFutureListener<Future<Resp>>) future -> {
        if (future.isSuccess()) {
            ReferenceCountUtil.release(asyncContext.request);
            Resp response = future.getNow();
            asyncContext.responseCallback.callback(response);
        }
    });
}

Channel write and timeout scheduling are performed inside trySend and sendNow methods (code omitted for brevity).
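The deadline mechanism set up in submitSend can be sketched with the JDK scheduler (an assumed structure, not the gateway's omitted trySend/sendNow code): a timer task armed at submission and the real response race to complete the result, and whichever finishes first disarms the other.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch of deadline-based timeout handling: a timeout task
// and the response race to complete the result future; the loser is dropped.
final class TimeoutSketch {
    private final ScheduledExecutorService timer =
        Executors.newSingleThreadScheduledExecutor();

    CompletableFuture<String> send(CompletableFuture<String> response, long timeoutMillis) {
        CompletableFuture<String> result = new CompletableFuture<>();
        // Arm the deadline at submission time, mirroring submitSend above.
        ScheduledFuture<?> deadline = timer.schedule(
            () -> result.completeExceptionally(new TimeoutException("deadline exceeded")),
            timeoutMillis, TimeUnit.MILLISECONDS);
        response.whenComplete((resp, err) -> {
            deadline.cancel(false); // response arrived first: disarm the timer
            if (err != null) result.completeExceptionally(err);
            else result.complete(resp);
        });
        return result;
    }

    void shutdown() { timer.shutdown(); }
}
```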

Sentinel Integration

Cluster flow control is performed asynchronously using the async client to fetch tokens from Redis, avoiding blocking calls in the Netty worker thread.

public void filter(ServerWebExchange exchange, GatewayFilterChain chain) {
    String resource = exchange.getRequest().getPath();
    ClusterFlowRule rule = ClusterFlowManager.getClusterFlowRule(resource);
    if (rule != null) {
        tokenService.asyncRequestToken(rule, exchange.getExecutor())
            .addListener(future -> {
                TokenResult tokenResult = future.isSuccess() ? (TokenResult) future.getNow() : RedisTokenService.FAIL;
                ClusterFlowManager.setTokenResult(rule.getRuleId(), tokenResult);
                doSentinelFlowControl(exchange, chain, resource);
            });
    } else {
        doSentinelFlowControl(exchange, chain, resource);
    }
}
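Stripped of the Sentinel and Redis specifics, the non‑blocking pattern above reduces to: request a token asynchronously, then decide pass/block in the callback rather than waiting on the worker thread. A hypothetical pure‑JDK sketch (an atomic counter stands in for the remote token store):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical cluster-flow-control sketch: an async token source stands in
// for Redis; the caller never blocks the event-loop thread waiting for it.
final class FlowControlSketch {
    private final AtomicLong tokens;

    FlowControlSketch(long initialTokens) {
        this.tokens = new AtomicLong(initialTokens);
    }

    // Simulates the async token fetch; the pass/block decision is made
    // in a callback chained onto the returned future.
    CompletableFuture<Boolean> tryAcquire() {
        return CompletableFuture.supplyAsync(() -> tokens.getAndDecrement() > 0);
    }
}
```

Usage mirrors the filter above: `tryAcquire().thenAccept(passed -> ...)` forwards the request when a token was granted and fails fast otherwise.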

Performance Evaluation

Load tests show the custom gateway (DAG) achieves up to 45k QPS at 80% CPU with ART ~19 ms, while SCG caps at ~11k QPS at 95% CPU with ART ~54 ms. Memory leak issues are eliminated, cold‑start latency spikes are reduced by >99%, and thread count is cut from hundreds to under 100 after optimizations.

Conclusion

The self‑developed API gateway delivers 4× higher throughput, lower latency, and better stability compared to the widely used SCG, validating the architectural choices of thread‑per‑core processing, in‑memory routing, and asynchronous client design.

Java · Performance Optimization · Microservices · API Gateway · Netty · Reactive Programming
Written by DeWu Technology

A platform for sharing and discussing tech knowledge, guiding you toward the cloud of technology.
