How Alibaba’s Notify and MetaQ Power Massive E‑Commerce Messaging

This article explains the design principles, architecture, and performance optimizations of Alibaba's Notify and MetaQ message middleware, illustrating how they achieve reliable asynchronous communication, high scalability, and low latency for billions of messages during peak e‑commerce events like Double 11.

21CTO
21CTO
21CTO
How Alibaba’s Notify and MetaQ Power Massive E‑Commerce Messaging

Message Middleware – The Distributed Messaging Broadcaster

Overview

Message middleware is a typical middleware technology composed of message transmission mechanisms or queue patterns. It enables reliable asynchronous communication between applications or components, reducing coupling and improving system scalability and availability.

3.1 Notify

Notify is a self‑developed messaging engine at Taobao, a core system for Double 11. Its three main functions are decoupling, asynchrony, and parallelism. For example, a user registration may involve writing to a user database, sending a welcome red‑packet, provisioning an Alipay account, and notifying other services. Executing these steps serially causes high latency; parallel execution reduces delay but introduces the "final consistency" problem, where the overall process must wait for the slowest step.

To achieve final consistency, a message queue (MQ) system is introduced.

Core Principles

Designed for message accumulation rather than pure point‑to‑point transmission.

No single point of failure; freely scalable architecture.

The architecture includes:

Message accumulation design: Persistent disk storage is used; messages are written to disk before asynchronous delivery, avoiding aggressive in‑memory approaches.

No single point of failure: Senders are stateless and can be added or removed dynamically.

Config server cluster: Detects online/offline status of message servers and broadcasts changes.

Notify server cluster: Handles message sending and receiving; any server can fail without affecting the system.

Storage cluster: Supports both multi‑copy disk storage for high reliability and memory‑based storage for high throughput, all stateless.

Message receiver cluster: Dynamically scales with config server notifications.

3.2 Notify Double 11 Preparation and Optimization

During Double 11, Notify handled massive load, assuming backend failures and massive message accumulation. In stress tests, it sustained 600 k writes per second, accumulating 450 million messages, and performed reliably during the actual event.

3.3 MetaQ

MetaQ is a pure queue‑model middleware written in Java, open‑sourced in 2012. It evolved through three versions:

2011‑01: MetaQ 1.0, derived from Apache Kafka, used internally for log transport.

2012‑09: MetaQ 2.0, solved partition limits and expanded binlog synchronization.

2013‑07: MetaQ 3.0, widely applied to order processing, cache sync, stream computing, real‑time IM, binlog sync, etc.

MetaQ provides a persistent disk queue with high reliability, leveraging OS cache for performance. Features include distributed producers/consumers, strict ordering, rich pull modes, horizontal subscriber scaling, real‑time subscription, and billion‑level message accumulation.

Storage Structure

All messages are written sequentially to a single dedicated queue.

Message location metadata is written to a separate file in serial order, ensuring reliability while supporting many queues.

MetaQ vs. Notify

Notify focuses on transaction messages and distributed transaction scenarios.

MetaQ focuses on ordered messages such as binlog sync and pull‑based scenarios like stream computing.

3.4 MetaQ Double 11 Preparation and Optimization

Migration from Notify transaction messages to MetaQ reduces pressure on Notify servers and improves throughput via binlog batch processing.

Low‑latency optimizations for transaction clusters include MySQL instance scaling, separating binlog storage from data storage, and configuring MySQL to generate binlog without locking.

Cluster parameter tuning (batch size, flush mode, data TTL, I/O scheduling, virtual memory) enhances accumulation efficiency.

Monitoring and real‑time alerting were implemented; during Double 11, total message writes reached 112 billion, deliveries 220 billion, with peak live‑stream write rate of 13.1 k and delivery rate of 27.8 k per second.

Conclusion

Alibaba’s distributed message middleware now serves over 500 business systems across the group, processing more than 350 billion messages daily with high reliability, performance, and distributed transaction support, making it one of the most established middleware products in the company.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Distributed SystemsBackend Architecturehigh availabilitymiddlewareMessage Queue
21CTO
Written by

21CTO

21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.