Taobao’s Secret to Billions of API Calls: High‑Performance Gateway & Reliable Messaging
Taobao’s Open Platform sustains hundreds of billions of daily API calls and messages by employing a pipeline‑based high‑performance API gateway with multi‑level caching, asynchronous processing, granular traffic control, a highly reliable push‑pull messaging system, and a zero‑loss data‑sync service that dynamically balances resources during massive traffic spikes.
1. High‑Performance API Gateway
Taobao’s Open Platform uses a pipeline design for its API gateway, with stages for business, security, routing, and invocation logic. To meet a Double‑11 peak of nearly one million QPS, it uses multi‑level rich‑client caching with asynchronous refresh; the caching layer sustains tens of millions of QPS while keeping network traffic under control.
The gateway processes requests fully asynchronously: after initial validation, it releases the servlet thread and forwards the request through an HSF or HTTP NIO client. Responses are handled by an event‑driven TOP worker pool, so the entire request path is asynchronous.
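The hand‑off described above can be sketched in a few lines. This is a minimal illustration, not the gateway's actual code: two thread pools stand in for the non‑blocking HSF/HTTP client and the TOP worker pool, and `call_backend`, `handle_request`, `demo`, and the API name `demo.api` are all hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Assumption: thread pools stand in for the NIO client and the response workers.
io_pool = ThreadPoolExecutor(max_workers=4)
worker_pool = ThreadPoolExecutor(max_workers=2)


def call_backend(request):
    """Hypothetical backend call; a real gateway would go over HSF or HTTP."""
    return {"api": request["api"], "data": "ok"}


def handle_request(request, write_response):
    """Validate on the caller's thread, then return without waiting for the backend."""
    if "api" not in request:
        write_response({"error": "missing api"})
        return
    future = io_pool.submit(call_backend, request)

    def on_done(done):
        # Runs when the backend call finishes; the response is written by a worker.
        error = done.exception()
        body = {"error": str(error)} if error else done.result()
        worker_pool.submit(write_response, body)

    future.add_done_callback(on_done)


def demo():
    """Send one request and wait for its response (the wait is for the demo only)."""
    finished = threading.Event()
    responses = []

    def write_response(body):
        responses.append(body)
        finished.set()

    handle_request({"api": "demo.api"}, write_response)
    finished.wait(timeout=5)
    return responses
```

The point of the structure is that `handle_request` returns as soon as the call is submitted, so the thread that accepted the request is free to take the next one.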
Metadata reads go through a multi‑tier cache: an LRU local cache in front of a distributed cache, plus a Bloom filter that guards against cache penetration (repeated lookups for keys that do not exist). Cache rules are pushed to the gateway dynamically, and expired entries may be served temporarily while an asynchronous update refreshes them.
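A rough sketch of this read path follows, under several assumptions: a plain dict stands in for the distributed cache, the Bloom filter is a toy implementation, and the class and method names (`BloomFilter`, `TieredMetadataCache`, `refresh_pending`) are invented for illustration. The refresh step is called explicitly here; a real gateway would run it in the background.

```python
import hashlib
import time
from collections import OrderedDict


class BloomFilter:
    """Toy Bloom filter: may give false positives, never false negatives."""

    def __init__(self, size_bits=8192, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


class TieredMetadataCache:
    def __init__(self, remote, known_keys, capacity=1024, ttl=60.0,
                 clock=time.monotonic):
        self.remote = remote            # stand-in for the distributed cache
        self.local = OrderedDict()      # key -> (value, expires_at), LRU order
        self.capacity = capacity
        self.ttl = ttl
        self.clock = clock
        self.pending_refresh = []       # expired keys awaiting a refresh
        self.bloom = BloomFilter()
        for key in known_keys:
            self.bloom.add(key)

    def get(self, key):
        if not self.bloom.might_contain(key):
            return None                 # definitely unknown: skip both tiers
        entry = self.local.get(key)
        if entry is not None:
            value, expires_at = entry
            self.local.move_to_end(key)
            if self.clock() >= expires_at and key not in self.pending_refresh:
                self.pending_refresh.append(key)  # serve stale, refresh later
            return value
        value = self.remote.get(key)
        if value is not None:
            self._store(key, value)
        return value

    def _store(self, key, value):
        self.local[key] = (value, self.clock() + self.ttl)
        self.local.move_to_end(key)
        while len(self.local) > self.capacity:
            self.local.popitem(last=False)  # evict the least recently used

    def refresh_pending(self):
        """Reload expired entries; would run asynchronously in practice."""
        while self.pending_refresh:
            key = self.pending_refresh.pop()
            value = self.remote.get(key)
            if value is not None:
                self._store(key, value)
```

Serving the stale value keeps request latency flat during a refresh, at the cost of briefly returning outdated metadata.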
Batch API calls let the TOP SDK combine multiple requests into a single call. The gateway splits the batch into asynchronous remote calls and merges the results, reducing round‑trip latency and network usage.
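The split‑and‑merge step might look like the following sketch, which uses `asyncio` as a stand‑in for the gateway's asynchronous clients. The request shape and the names `call_backend` and `handle_batch` are assumptions made for the example.

```python
import asyncio


async def call_backend(request):
    """Hypothetical remote call for one request in the batch."""
    await asyncio.sleep(0.01)  # simulated network latency
    if request["method"] == "fail":
        raise RuntimeError("backend error")
    return {"method": request["method"], "result": request.get("params")}


async def handle_batch(requests):
    """Fan the batch out concurrently, then merge results in request order."""
    outcomes = await asyncio.gather(
        *(call_backend(r) for r in requests), return_exceptions=True
    )
    merged = []
    for request, outcome in zip(requests, outcomes):
        if isinstance(outcome, Exception):
            # One failed call does not fail the whole batch.
            merged.append({"method": request["method"], "error": str(outcome)})
        else:
            merged.append(outcome)
    return merged
```

Because the calls run concurrently, the batch takes roughly as long as its slowest call rather than the sum of all calls.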
Multi‑dimensional traffic control provides per‑API QPS limits, daily quotas, and grouping strategies to prioritize critical traffic during spikes, with both cluster‑wide and single‑node control mechanisms.
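As an illustration of the single‑node side of this control, the sketch below combines a token bucket (for the per‑API QPS limit) with a daily counter (for the quota). The article does not say which algorithm Taobao uses, so the token bucket and the `ApiLimiter` class are assumptions; cluster‑wide control would additionally need a counter shared across nodes.

```python
import time


class ApiLimiter:
    """Per-API limiter: token bucket for QPS plus a daily call quota."""

    def __init__(self, qps, daily_quota, clock=time.monotonic):
        self.rate = float(qps)        # tokens added per second
        self.capacity = float(qps)    # allow bursts of up to one second
        self.tokens = float(qps)
        self.daily_quota = daily_quota
        self.used_today = 0
        self.clock = clock
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Refill in proportion to the time elapsed since the last check.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.used_today >= self.daily_quota or self.tokens < 1.0:
            return False
        self.tokens -= 1.0
        self.used_today += 1
        return True

    def reset_day(self):
        """Would be triggered by a scheduler at the start of each day."""
        self.used_today = 0
```

Grouping strategies could be layered on top by giving each traffic group its own limiter and checking the critical group first.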
2. High‑Reliability Messaging Service
The messaging system separates routing, storage, and push subsystems, ensuring at‑least‑once delivery and decoupling producers from consumers.
The routing subsystem filters, authenticates, and logs events, then forwards them to a Bitcask‑based storage layer that uses memory‑mapped files for high‑throughput writes.
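The core idea of a Bitcask‑style store is an append‑only log plus an in‑memory index that maps each key to the position of its latest value. The sketch below is a simplified single‑segment version written over a memory‑mapped file; the `MiniBitcask` class, the record layout, and the fixed segment size are assumptions, and real implementations add checksums, multiple segments, and compaction.

```python
import mmap
import os
import struct
import tempfile


class MiniBitcask:
    HEADER = struct.Struct(">II")  # key length, value length

    def __init__(self, path, size=1 << 20):
        self.file = open(path, "w+b")
        self.file.truncate(size)          # preallocate the segment
        self.map = mmap.mmap(self.file.fileno(), size)
        self.offset = 0                   # next append position
        self.keydir = {}                  # key -> (value offset, value length)

    def put(self, key, value):
        raw_key = key.encode()
        record = self.HEADER.pack(len(raw_key), len(value)) + raw_key + value
        end = self.offset + len(record)
        if end > len(self.map):
            raise ValueError("segment full")
        self.map[self.offset:end] = record     # sequential write into the map
        self.keydir[key] = (end - len(value), len(value))
        self.offset = end

    def get(self, key):
        entry = self.keydir.get(key)
        if entry is None:
            return None
        pos, length = entry
        return bytes(self.map[pos:pos + length])

    def close(self):
        self.map.flush()
        self.map.close()
        self.file.close()


# Usage: a later write for the same key shadows the earlier record.
with tempfile.TemporaryDirectory() as tmp:
    store = MiniBitcask(os.path.join(tmp, "segment.log"), size=4096)
    store.put("msg-1", b"created")
    store.put("msg-1", b"paid")
    latest = store.get("msg-1")
    missing = store.get("msg-2")
    store.close()
```

Writes are always sequential appends, which is what makes this layout suited to high write throughput; reads cost one index lookup and one slice of the mapped file.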
Push uses a Disruptor‑based event model with Netty and WebSocket long connections, achieving an average latency of about 100 ms and a maximum of 200 ms.
Both push and pull modes are supported; push is the default for real‑time delivery, while pull can be used when needed.
Message confirmation treats each message as its own transaction. Most transactions are handled entirely in memory; overflow spills through tiered storage (HeapMemory → DirectMemory → FileSystem), so that more than 95% of transactions complete without disk I/O.
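To make the per‑message confirmation concrete, here is a small sketch of how at‑least‑once delivery is commonly tracked: each pushed message stays pending until the consumer confirms it, and unconfirmed messages become due for redelivery after a timeout. The `PendingConfirmations` class and the timeout value are assumptions, and the sketch keeps everything in one in‑memory dict rather than modelling the tiered storage described above.

```python
import time


class PendingConfirmations:
    """Tracks pushed messages until the consumer confirms them."""

    def __init__(self, ack_timeout=5.0, clock=time.monotonic):
        self.ack_timeout = ack_timeout
        self.clock = clock
        self.pending = {}  # message id -> (payload, redelivery deadline)

    def sent(self, msg_id, payload):
        self.pending[msg_id] = (payload, self.clock() + self.ack_timeout)

    def confirm(self, msg_id):
        """Return True if the message was pending and is now completed."""
        return self.pending.pop(msg_id, None) is not None

    def due_for_redelivery(self):
        """Return unconfirmed messages past their deadline and re-arm them."""
        now = self.clock()
        due = [(msg_id, payload)
               for msg_id, (payload, deadline) in self.pending.items()
               if deadline <= now]
        for msg_id, payload in due:
            self.pending[msg_id] = (payload, now + self.ack_timeout)
        return due
```

Because a message can be redelivered when its confirmation is merely late, consumers of an at‑least‑once system need to tolerate duplicates.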
3. Zero‑Loss Data Synchronization
Data sync combines message‑driven real‑time updates with periodic reconciliation tasks to guarantee consistency and low latency for order data.
Messages carry order IDs; the sync client merges rapid updates and writes to the target DB. Reconciliation tasks run every 30 seconds, comparing source and target orders and correcting mismatches.
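The two mechanisms can be sketched together as follows. Dicts stand in for the source and target databases, and the function names are hypothetical; a real reconciliation task would compare only orders modified within a recent window rather than scanning everything.

```python
def merge_pending(messages):
    """Collapse a burst of messages into unique order IDs, first-seen order."""
    return list(dict.fromkeys(m["order_id"] for m in messages))


def sync_orders(order_ids, source, target):
    """Message-driven path: copy the current state of each order once."""
    for order_id in order_ids:
        target[order_id] = dict(source[order_id])


def reconcile(source, target):
    """Periodic path: repair any order the message path missed or got wrong."""
    fixed = []
    for order_id, order in source.items():
        if target.get(order_id) != order:
            target[order_id] = dict(order)
            fixed.append(order_id)
    return fixed
```

Since messages carry only the order ID, several rapid updates to one order collapse into a single read of its latest state, and reconciliation covers any message that was lost along the way.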
Dynamic resource isolation groups machines and users into logical clusters, isolating hot‑spot sellers and ensuring stable DB connection usage. Clusters span multiple data centers for fault tolerance, with automatic re‑allocation on failures.
A flexible storage model stores frequently queried fields separately while keeping the full order JSON in a large column. Hashcodes and modification timestamps enable fast change detection, reducing DB reads by 90%.
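One way to picture this model: each row holds a few indexed fields, a hash of the order, and the full JSON body, and a write is skipped when the incoming order is older than or identical to what is stored. The field names, the choice of SHA‑256, and the helper functions below are assumptions for illustration only.

```python
import hashlib
import json


def order_hash(order):
    """Hash a canonical JSON form so equal orders always hash equally."""
    canonical = json.dumps(order, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()


def to_row(order):
    """Split an order into queryable columns plus the full JSON body."""
    return {
        "order_id": order["order_id"],
        "status": order["status"],
        "modified": order["modified"],
        "hashcode": order_hash(order),
        "body": json.dumps(order),
    }


def needs_write(incoming, stored_row):
    """Decide from the small columns alone whether the body must be rewritten."""
    if stored_row is None:
        return True
    if incoming["modified"] < stored_row["modified"]:
        return False  # stale update: the stored row is newer
    return order_hash(incoming) != stored_row["hashcode"]
```

The comparison touches only the timestamp and hash columns, so the large JSON column does not have to be read to detect a change.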
During write‑heavy periods, the sync service issues a single UPDATE guarded by a modification‑timestamp check instead of a SELECT followed by an UPDATE, and logical deletions are batched during low‑traffic windows.
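The conditional update can be shown with SQLite as a stand‑in for the target database. The table layout and the `apply_update` helper are assumptions; the point is that the timestamp comparison lives in the WHERE clause, so a stale update simply matches no rows.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory table with one order (illustrative schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders ("
        "order_id INTEGER PRIMARY KEY, status TEXT, modified INTEGER)"
    )
    conn.execute("INSERT INTO orders VALUES (1, 'CREATED', 100)")
    return conn


def apply_update(conn, order_id, status, modified):
    """Apply the update only if it is newer than the stored row.

    Returns True when a row was changed, False when the update was stale
    or the order does not exist.
    """
    cursor = conn.execute(
        "UPDATE orders SET status = ?, modified = ? "
        "WHERE order_id = ? AND modified < ?",
        (status, modified, order_id, modified),
    )
    return cursor.rowcount == 1
```

This halves the number of statements per write and removes the race between the read and the write, because the check and the update happen in one statement.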
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
