
Design and Architecture of Ctrip’s High‑Performance API Gateway Handling 20 Billion Daily Requests

This article details Ctrip’s API gateway architecture, describing its evolution to handle 20 billion daily requests through fully asynchronous processing, streaming forwarding, single‑threaded event‑loop design, and various performance and governance optimizations, while also covering multi‑protocol compatibility, routing, and module orchestration.

Code Ape Tech Column

Sep 8, 2023


Introduction: Ctrip’s API gateway was introduced alongside the company’s microservice architecture in 2014. As of July 2021 it served more than 3,000 services and processed about 20 billion requests per day.

The early design was based on Netflix OSS Zuul 1.0. It used Tomcat NIO + AsyncServlet on the server side, a dedicated thread pool with a responsibility‑chain pattern for the business flow, and Apache HttpClient on the client side. Core components included Archaius for dynamic configuration, Hystrix for circuit breaking, and Groovy for hot updates.

As traffic grew and the business expanded overseas, this largely synchronous design showed its limits, which prompted a fully asynchronous refactor.

2.1 Asynchronous Flow Design

The gateway now adopts a fully asynchronous model across the server, business, and client layers, built on Netty’s NIO/Epoll event loop. Business logic is broken into asynchronous stages that cover IO‑bound steps such as request validation, authentication, and remote calls, as well as the forwarding of the request itself.

Key challenges include flow and state management, exception handling, context propagation, thread scheduling, and flow control. To hide the complexity from developers, a wrapper framework based on RxJava’s Maybe type is used, providing unified synchronous/asynchronous interfaces, built‑in timeout, and error handling.

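// Assumes RxJava 2.x (ScalarCallable is an internal RxJava 2 interface).
import io.reactivex.Maybe;
import io.reactivex.internal.fuseable.ScalarCallable;
import io.reactivex.schedulers.Schedulers;
import java.util.concurrent.TimeUnit;

public interface Processor<T> {
    ProcessorType getType();
    int getOrder();
    boolean shouldProcess(RequestContext context);
    // Every processor, sync or async, is exposed to the engine as a Maybe
    Maybe<T> process(RequestContext context) throws Exception;
}

public abstract class AbstractProcessor<T> implements Processor<T> {
    // Override for synchronous work that produces no response
    protected void processSync(RequestContext context) throws Exception {}
    // Override for synchronous work that may produce a response
    protected T processSyncAndGetResponse(RequestContext context) throws Exception {
        processSync(context);
        return null;
    }
    // Override for asynchronous work. The body is elided in the source; the default
    // below is an assumption: wrap the synchronous result in a scalar Maybe.
    protected Maybe<T> processAsync(RequestContext context) throws Exception {
        T response = processSyncAndGetResponse(context);
        return response == null ? Maybe.empty() : Maybe.just(response);
    }
    @Override
    public Maybe<T> process(RequestContext context) throws Exception {
        Maybe<T> maybe = processAsync(context);
        if (maybe instanceof ScalarCallable) {
            // already resolved (synchronous path), no extra wrapping
            return maybe;
        } else {
            // asynchronous path: add a timeout, checked on the request's own event loop;
            // on timeout, switch to the fallback (empty by default)
            return maybe.timeout(getAsyncTimeout(context), TimeUnit.MILLISECONDS,
                Schedulers.from(context.getEventloop()), timeoutFallback(context));
        }
    }
    protected long getAsyncTimeout(RequestContext context) { return 2000; }
    protected Maybe<T> timeoutFallback(RequestContext context) { return Maybe.empty(); }
}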
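
The wrapping rule in `process` can be tried without RxJava. The sketch below is not the gateway's code: it swaps in the JDK's `CompletableFuture` for `Maybe`, and `TimeoutWrapper` and `withTimeout` are hypothetical names. An already completed future stands in for the synchronous case and is returned untouched; anything still pending gets a timeout that resolves to a fallback value. It assumes JDK 9 or later for `copy()` and `completeOnTimeout`.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of the gateway: CompletableFuture stands in for RxJava's Maybe.
final class TimeoutWrapper {
    private TimeoutWrapper() {}

    /**
     * Returns the stage unchanged when it is already complete (the synchronous case).
     * Otherwise returns a copy that completes with the fallback value if the stage
     * has not finished within timeoutMillis.
     */
    static <T> CompletableFuture<T> withTimeout(CompletableFuture<T> stage, long timeoutMillis, T fallback) {
        if (stage.isDone()) {
            return stage; // synchronous result, no extra wrapping
        }
        // copy() keeps the timeout from completing the caller's own future
        return stage.copy().completeOnTimeout(fallback, timeoutMillis, TimeUnit.MILLISECONDS);
    }
}
```

Calling `TimeoutWrapper.withTimeout(pending, 2000, null)` roughly mirrors the defaults of `getAsyncTimeout` and `timeoutFallback` above, although an empty `Maybe` and a `null` future result are not the same thing.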

The processing pipeline follows a responsibility‑chain divided into inbound, outbound, error and log stages, each composed of one or more filters that execute sequentially and can short‑circuit on response or exception.

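// Assumes RxJava 2.x. ConcattedMaybe is the gateway's own class and is not shown here.
import io.reactivex.Maybe;
import io.reactivex.internal.fuseable.ScalarCallable;
import java.util.Iterator;
import java.util.concurrent.Callable;

public class RxUtil {
    // Runs the sources in order and stops at the first one that yields a response.
    public static <T> Maybe<T> concat(Iterable<? extends Callable<Maybe<T>>> iterable) {
        Iterator<? extends Callable<Maybe<T>>> sources = iterable.iterator();
        while (sources.hasNext()) {
            Maybe<T> maybe;
            try {
                maybe = sources.next().call();
            } catch (Exception e) {
                // a filter that throws ends the chain with an error
                return Maybe.error(e);
            }
            if (maybe != null) {
                if (maybe instanceof ScalarCallable) {
                    // synchronous filter: its value is available immediately
                    T response = ((ScalarCallable<T>) maybe).call();
                    if (response != null) {
                        // has a response, short-circuit the remaining filters
                        return maybe;
                    }
                    // no response: fall through to the next filter in this loop
                } else {
                    if (sources.hasNext()) {
                        // asynchronous filter: hand the remaining sources to the
                        // callback so they run after this one completes
                        return new ConcattedMaybe(maybe, sources);
                    } else {
                        return maybe;
                    }
                }
            }
        }
        return Maybe.empty();
    }
}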

public class ProcessEngine {
    // each stage adds default timeout and error handling
    private void process(RequestContext context) {
        List<Callable<Maybe<Response>>> inboundTask = get(ProcessorType.INBOUND, context);
        List<Callable<Maybe<Void>>> outboundTask = get(ProcessorType.OUTBOUND, context);
        List<Callable<Maybe<Response>>> errorTask = get(ProcessorType.ERROR, context);
        List<Callable<Maybe<Void>>> logTask = get(ProcessorType.LOG, context);

        RxUtil.concat(inboundTask) // inbound stage
            .toSingle() // get response
            .flatMapMaybe(response -> {
                context.setOriginResponse(response);
                return RxUtil.concat(outboundTask);
            }) // enter outbound
            .onErrorResumeNext(e -> {
                context.setThrowable(e);
                return RxUtil.concat(errorTask).flatMap(response -> {
                    context.resetResponse(response);
                    return RxUtil.concat(outboundTask);
                });
            }) // error handling then back to outbound
            .flatMap(response -> RxUtil.concat(logTask)) // logging stage
            .timeout(asyncTimeout.get(), TimeUnit.MILLISECONDS, Schedulers.from(context.getEventloop()),
                Maybe.error(new ServerException(500, "Async-Timeout-Processing"))) // global timeout
            .subscribe(
                unused -> {
                    logger.error("this should not happen, " + context);
                    context.release();
                },
                e -> {
                    logger.error("this should not happen, " + context, e);
                    context.release();
                },
                () -> context.release()
            );
    }
}

2.2 Streaming Forwarding & Single‑Threaded Execution

By parsing only the HTTP header and forwarding the body stream directly to the upstream service, the gateway reduces latency and memory footprint. The entire request lifecycle runs on a single Netty event‑loop, eliminating thread‑safety issues and simplifying multi‑stage coordination, while still allowing isolated thread‑pools for blocking IO filters.

Challenges such as thread safety, multi‑stage linkage, and edge‑case handling (e.g., upstream 404/413 responses) are addressed through context‑bound event‑loops and careful resource isolation.
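
To make the header-only parsing idea concrete, here is a minimal sketch using blocking JDK streams. It is an illustration under stated assumptions, not the gateway's implementation, which does this on Netty event loops with off‑heap buffers. `StreamingRelay`, `readHeaderBlock`, and `relayBody` are hypothetical names. The header block is read up to the terminating blank line so routing decisions can be made, and the body is then copied through one small fixed buffer without ever being held in memory as a whole.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical blocking-IO illustration of "parse the header, stream the body".
final class StreamingRelay {
    private StreamingRelay() {}

    /**
     * Reads bytes up to and including the "\r\n\r\n" that ends the HTTP header block.
     * Reads one byte at a time for clarity. If the stream ends first, returns what was read.
     */
    static String readHeaderBlock(InputStream in) throws IOException {
        ByteArrayOutputStream header = new ByteArrayOutputStream();
        int matched = 0; // how many bytes of "\r\n\r\n" have been seen in a row
        int b;
        while (matched < 4 && (b = in.read()) != -1) {
            header.write(b);
            boolean expectCr = (matched % 2 == 0);
            if (b == (expectCr ? '\r' : '\n')) {
                matched++;
            } else {
                // a stray '\r' can still be the start of a new terminator
                matched = (b == '\r') ? 1 : 0;
            }
        }
        return new String(header.toByteArray(), StandardCharsets.ISO_8859_1);
    }

    /** Copies the remaining bytes through one fixed-size buffer and returns the byte count. */
    static long relayBody(InputStream in, OutputStream out, int bufferSize) throws IOException {
        byte[] buffer = new byte[bufferSize];
        long total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
            total += n;
        }
        out.flush();
        return total;
    }
}
```

Memory use in `relayBody` is bounded by `bufferSize` regardless of body size, which is the property the streaming design relies on.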

2.3 Additional Optimizations

Lazy loading of internal variables and deferred parsing of cookies/query strings.

Off‑heap memory and zero‑copy techniques combined with streaming forwarding.

Adoption of JDK 11 with ZGC to improve GC pause times.

Custom HTTP codec to mitigate legacy “bad practices” and improve request success rates.

Traffic governance for oversized requests, long URIs, and non‑ASCII characters, allowing routing of malformed traffic for better observability.

Request filtering to handle security issues such as request smuggling.
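
The first item in this list, deferred parsing, can be sketched as follows. This is a minimal illustration, not the gateway's code: `LazyQueryParams` is a hypothetical class, and the `parseCount` field exists only to make the laziness observable. The raw query string is stored as received and parsed the first time a parameter is requested, so requests whose filters never look at the query pay nothing. The class is deliberately unsynchronized, on the assumption that each request stays on one event loop as described in section 2.2. It needs JDK 10 or later for the `Charset` overload of `URLDecoder.decode`.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical class illustrating deferred parsing of a query string.
final class LazyQueryParams {
    private final String rawQuery;        // kept as received, untouched until needed
    private Map<String, String> parsed;   // null until the first lookup
    private int parseCount;               // only here to make the laziness observable

    LazyQueryParams(String rawQuery) {
        this.rawQuery = rawQuery == null ? "" : rawQuery;
    }

    /** Parses on first use. Not synchronized: assumes one thread per request. */
    String get(String name) {
        if (parsed == null) {
            parsed = parse(rawQuery);
            parseCount++;
        }
        return parsed.get(name);
    }

    int parseCount() {
        return parseCount;
    }

    private static Map<String, String> parse(String query) {
        if (query.isEmpty()) {
            return Collections.emptyMap();
        }
        Map<String, String> result = new LinkedHashMap<>();
        for (String pair : query.split("&")) {
            if (pair.isEmpty()) {
                continue;
            }
            int eq = pair.indexOf('=');
            String key = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            // the first occurrence of a key wins in this sketch
            result.putIfAbsent(decode(key), decode(value));
        }
        return result;
    }

    private static String decode(String s) {
        return URLDecoder.decode(s, StandardCharsets.UTF_8);
    }
}
```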

3 Gateway Business Forms

The gateway acts as a unified ingress point, decoupling internal and external networks, providing common cross‑cutting concerns (security, authentication, rate‑limiting, monitoring), and enabling efficient traffic control for private protocols, link optimization, and multi‑region active‑active deployments.

4 Gateway Governance

Multi‑protocol compatibility is achieved by abstracting protocol‑specific encoding/decoding and exposing a common intermediate model. The routing module defines matching rules, tags, properties, and target endpoints with weight‑based gray releases.

{
    "type": "uri",
    "value": "/hotel/order",
    "matcherType": "prefix",
    "tags": ["owner_admin","org_framework","appId_123456"],
    "properties": {"core":"true"},
    "routes": [{
        "condition": "true",
        "zone": "PRO",
        "targets": [{"url": "http://test.ctrip.com/hotel", "weight": 100}]
    }]
}
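
How a rule like this might be evaluated can be sketched in a few lines. The code below is an assumption-laden illustration, not the gateway's routing module: `WeightedRouter`, `Target`, `prefixMatches`, `pick`, and `pickAt` are hypothetical names, weights are assumed to be non-negative integers, and the `condition` and `zone` fields are ignored. It shows a `prefix` matcher and a weight-proportional choice among targets, which is one common way to implement a gray release.

```java
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical types; field names mirror the JSON above.
final class WeightedRouter {
    static final class Target {
        final String url;
        final int weight;

        Target(String url, int weight) {
            this.url = url;
            this.weight = weight;
        }
    }

    private WeightedRouter() {}

    /** "prefix" matcher: the rule applies when the request path starts with the rule value. */
    static boolean prefixMatches(String ruleValue, String requestPath) {
        return requestPath.startsWith(ruleValue);
    }

    /** Picks a target with probability proportional to its weight. */
    static Target pick(List<Target> targets) {
        int total = totalWeight(targets);
        return pickAt(targets, ThreadLocalRandom.current().nextInt(total));
    }

    /** Deterministic core: maps a point in [0, totalWeight) to the target owning that slice. */
    static Target pickAt(List<Target> targets, int point) {
        int cumulative = 0;
        for (Target t : targets) {
            cumulative += t.weight;
            if (point < cumulative) {
                return t;
            }
        }
        throw new IllegalArgumentException("point " + point + " is outside the total weight");
    }

    static int totalWeight(List<Target> targets) {
        int total = 0;
        for (Target t : targets) {
            total += t.weight;
        }
        if (total <= 0) {
            throw new IllegalArgumentException("total weight must be positive");
        }
        return total;
    }
}
```

With two targets weighted 90 and 10, roughly one request in ten goes to the second target. A plain `startsWith` also matches `/hotel/orders`, so a production matcher would likely compare whole path segments.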

Module orchestration schedules various filters (e.g., circuit‑breaker, rate‑limiter, logging) across predefined stages, allowing per‑gateway customization while maintaining decoupling.

{
    "name": "addResponseHeader",
    "stage": "PRE_RESPONSE",
    "ruleOrder": 0,
    "grayRatio": 100,
    "condition": "true",
    "actionParam": {
        "connection": "keep-alive",
        "x-service-call": "${request.func.remoteCost}",
        "Access-Control-Expose-Headers": "x-service-call",
        "x-gate-root-id": "${func.catRootMessageId}"
    },
    "exceptionHandle": "return"
}
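
One plausible reading of the `stage`, `ruleOrder`, and `grayRatio` fields is sketched below. The semantics are assumptions, not confirmed by the source: `grayRatio` is treated as a percentage of traffic, lower `ruleOrder` values run first, and `condition`, `actionParam`, and `exceptionHandle` are left out. `ModuleScheduler`, `Rule`, and `rulesFor` are hypothetical names, as is the per-request `bucket` value.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical model of the rule fields shown above; the semantics are assumptions.
final class ModuleScheduler {
    static final class Rule {
        final String name;
        final String stage;
        final int ruleOrder;
        final int grayRatio; // assumed to be a percentage of traffic, 0..100

        Rule(String name, String stage, int ruleOrder, int grayRatio) {
            this.name = name;
            this.stage = stage;
            this.ruleOrder = ruleOrder;
            this.grayRatio = grayRatio;
        }
    }

    private ModuleScheduler() {}

    /**
     * Returns the rules to run for one stage, lowest ruleOrder first.
     * The bucket is a per-request number in [0, 100); a rule is active when bucket < grayRatio.
     */
    static List<Rule> rulesFor(List<Rule> all, String stage, int bucket) {
        List<Rule> selected = new ArrayList<>();
        for (Rule r : all) {
            if (r.stage.equals(stage) && bucket < r.grayRatio) {
                selected.add(r);
            }
        }
        // List.sort is stable, so rules with equal ruleOrder keep their configured order
        selected.sort(Comparator.comparingInt((Rule r) -> r.ruleOrder));
        return selected;
    }
}
```

Deriving `bucket` from a stable request attribute, for example a hash of a user id modulo 100, would keep a given user consistently inside or outside a gray rollout.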

5 Summary

The article compares Ctrip’s self‑built gateway with alternatives such as Zuul 1.0, Nginx, Spring Cloud Gateway, and Istio, concluding that the choice depends on business context. Ctrip continues to evolve the platform, exploring public vs. private gateways, HTTP/3, and Service Mesh integration.


