How Xiaomi Built Talos: A Scalable, Stateless Message Queue for Billions of Events
This article traces Xiaomi's move from Kafka 0.8 to its home‑grown Talos system. It covers the business motivations, the storage‑compute separation architecture, key challenges such as HDFS tail reads and write consistency, and the performance, resource, and platform optimizations behind a high‑throughput, multi‑tenant messaging service.
Business Background
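Before 2015, Xiaomi ran Kafka 0.8, which couples storage and compute: each broker stores the partitions it serves on its own local disks. This led to uneven data distribution across brokers, painful cluster expansion (existing data has to be copied to new brokers), complex failover, and a consumer rebalance algorithm that degraded the consumption experience.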
Desired Capabilities
Xiaomi needed a queue that scales quickly, keeps its servers stateless, tolerates faults with minimal disruption to consumers, and offers strong multi‑tenant support and cross‑datacenter replication.
Talos Overview
Talos is a self‑developed distributed message queue that serves internal Xiaomi departments and external ecosystem partners. It is positioned as a counterpart to AWS Kinesis and Apache Kafka.
Architecture & Key Issues
Talos adopts a storage‑compute separation design: messages are persisted in HDFS, while stateless Talos Servers handle partition scheduling and load balancing, assigning partitions to servers with consistent hashing. Metadata and control‑plane coordination (e.g., server registration and broadcasting topic DDL changes) go through ZooKeeper.
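The partition‑to‑server mapping can be illustrated with a generic consistent‑hash ring. This is a minimal sketch, not Talos code: the `HashRing` class, the MD5‑based hash, the `"{server}#vnode{i}"` naming, and the default of 100 virtual nodes per server are all illustrative assumptions. It shows the property that suits stateless servers: removing one server moves only the partitions that server owned.

```python
import bisect
import hashlib


def ring_hash(key: str) -> int:
    """Stable 64-bit hash (Python's built-in hash() is randomized per process)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Hypothetical consistent-hash ring mapping partition keys to server names."""

    def __init__(self, servers, vnodes=100):
        self._points = {}  # ring position -> server name
        self._keys = []    # sorted ring positions, for binary search
        for server in servers:
            self.add(server, vnodes)

    def add(self, server, vnodes=100):
        # Each physical server occupies several positions (virtual nodes) on the ring.
        for i in range(vnodes):
            self._points[ring_hash(f"{server}#vnode{i}")] = server
        self._keys = sorted(self._points)

    def remove(self, server):
        self._points = {p: s for p, s in self._points.items() if s != server}
        self._keys = sorted(self._points)

    def owner(self, partition_key: str) -> str:
        # The first ring position clockwise from the key's hash owns the partition.
        idx = bisect.bisect_right(self._keys, ring_hash(partition_key)) % len(self._keys)
        return self._points[self._keys[idx]]
```

For example, `HashRing(["server-a", "server-b", "server-c"]).owner("topicA-7")` names the server responsible for partition 7 of a topic; the `"topic-partition"` key format is also an assumption.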
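HDFS client tail‑read: Data in an HDFS block is not visible to ordinary readers while the block is still being written. Xiaomi modified the HDFS client so it can tail‑read the latest, still‑open block, giving true write‑then‑read semantics: consumers can read messages as soon as they are written.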
Consistency model: To avoid split‑brain, Talos combines HDFS lease recovery (RecoverLease), which revokes the previous writer's lease on a file, with a custom fencing mechanism, so that only one Talos Server writes to a partition at any time.
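The fencing idea can be sketched with per‑partition epochs (often called fencing tokens). This is an illustrative assumption about what a custom fencing mechanism might look like, not Talos's actual design: each new owner of a partition obtains a higher epoch, and the log rejects appends carrying an older one, so a stale server that still believes it owns the partition cannot write. `FencedPartitionLog` and `StaleEpochError` are hypothetical names, and the in‑memory list stands in for an HDFS file; in the real system this takeover step would accompany HDFS lease recovery.

```python
import threading


class StaleEpochError(Exception):
    """Raised when a writer whose epoch has been superseded tries to append."""


class FencedPartitionLog:
    """Hypothetical partition log that accepts appends only from the current epoch."""

    def __init__(self):
        self._lock = threading.Lock()
        self._epoch = 0
        self._records = []  # stands in for the partition's HDFS file

    def acquire(self) -> int:
        # A server taking over the partition bumps the epoch, fencing off the previous owner.
        with self._lock:
            self._epoch += 1
            return self._epoch

    def append(self, epoch: int, record: bytes) -> None:
        with self._lock:
            if epoch != self._epoch:
                raise StaleEpochError(f"epoch {epoch} is fenced off; current epoch is {self._epoch}")
            self._records.append(record)

    @property
    def records(self) -> list:
        with self._lock:
            return list(self._records)
```

A server that was paused (for example, by a long GC) and resumes with its old epoch gets `StaleEpochError` instead of silently interleaving writes with the new owner.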
Partition delayed allocation: With plain consistent hashing, a rolling upgrade triggers frequent partition migrations, because each restarting server leaves the ring and then rejoins it. Talos introduces a delayed‑allocation strategy that postpones migration until a node's state is stable, reducing communication overhead and keeping consumption smooth.
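A minimal sketch of delayed allocation, assuming it works as a grace period on node departures: a server that drops out of membership keeps its partitions until it has been gone for `grace_seconds`, so a server that restarts within that window never triggers a migration. The `DelayedAllocator` class, its method names, and the grace‑period mechanism are hypothetical; the source does not describe how Talos decides that a node is stable, and this sketch covers only the departure side.

```python
class DelayedAllocator:
    """Hypothetical membership view that delays reassignment after a server leaves."""

    def __init__(self, grace_seconds: float):
        self.grace = grace_seconds
        self.live = set()
        self._down_since = {}  # server -> time it was first seen as down

    def on_server_up(self, server: str, now: float) -> None:
        self.live.add(server)
        self._down_since.pop(server, None)  # came back in time: nothing to migrate

    def on_server_down(self, server: str, now: float) -> None:
        self.live.discard(server)
        self._down_since.setdefault(server, now)

    def servers_for_placement(self, now: float) -> set:
        # Servers that left less than `grace` seconds ago still count as owners,
        # so the hash ring (and hence partition placement) does not change yet.
        recently_down = {s for s, t in self._down_since.items() if now - t < self.grace}
        return self.live | recently_down
```

The set returned by `servers_for_placement` is what would feed the consistent‑hash ring; partitions move only when that set actually changes.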
Performance & Resource Optimizations
Thread‑model redesign: Replaced a simple thread pool with a memory‑aware min‑heap thread pool. Requests for the same Topic‑Partition are routed to the same thread; otherwise the least‑loaded thread is selected, preventing a slow Topic from blocking others.
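The routing rule can be sketched as a dispatcher that only chooses a worker index (no real threads). As illustrative assumptions, "memory‑aware" is modeled as the bytes of pending requests per worker, and a partition stays pinned to its worker only while it has requests in flight. The min‑heap uses lazy deletion: outdated `(load, worker)` entries are discarded when they surface at the top. `PartitionDispatcher` and its methods are hypothetical names.

```python
import heapq


class PartitionDispatcher:
    """Hypothetical router: picks a worker index for each request."""

    def __init__(self, num_workers: int):
        self._load = [0] * num_workers                     # pending bytes per worker
        self._heap = [(0, w) for w in range(num_workers)]  # (load, worker); sorted, so a valid heap
        self._owner = {}     # partition -> worker, while the partition has in-flight requests
        self._inflight = {}  # partition -> number of in-flight requests

    def _least_loaded(self) -> int:
        # Lazy deletion: drop heap entries whose recorded load is out of date.
        # (Stale entries linger until they reach the top; a real pool would compact them.)
        while self._heap[0][0] != self._load[self._heap[0][1]]:
            heapq.heappop(self._heap)
        return self._heap[0][1]

    def _set_load(self, worker: int, load: int) -> None:
        self._load[worker] = load
        heapq.heappush(self._heap, (load, worker))

    def submit(self, partition, size: int) -> int:
        worker = self._owner.get(partition)
        if worker is None:
            # No pinned worker: pick the one with the least pending memory.
            worker = self._least_loaded()
            self._owner[partition] = worker
        self._inflight[partition] = self._inflight.get(partition, 0) + 1
        self._set_load(worker, self._load[worker] + size)
        return worker

    def complete(self, partition, size: int) -> None:
        worker = self._owner[partition]
        self._set_load(worker, self._load[worker] - size)
        self._inflight[partition] -= 1
        if self._inflight[partition] == 0:
            # Partition drained: unpin it so its next request can go anywhere.
            del self._inflight[partition]
            del self._owner[partition]
```

Pinning keeps a partition's requests on one thread, while new partitions avoid a worker that is already holding a lot of pending data, so one slow topic does not stall the rest.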
Write optimization: Merged many small I/O operations into larger batches. Instead of flushing to HDFS on every write, Talos aggregates writes per partition and performs a single flush per batch. This raised single‑node QPS from ~1K to >10K and reduced P99 latency from 70 ms to 5 ms at 5K QPS.
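The per‑partition batching can be sketched as follows. `CountingSink` is a hypothetical stand‑in for an HDFS output stream that simply counts flushes, and `PartitionWriteBatcher` is an illustrative name; the point is that the expensive flush happens once per partition per batch instead of once per message. The source does not say what triggers a batch (size or time), so that is left to the caller here.

```python
from collections import defaultdict


class CountingSink:
    """Hypothetical stand-in for an HDFS output stream; counts expensive flushes."""

    def __init__(self):
        self.data = bytearray()
        self.flushes = 0

    def write(self, buf: bytes) -> None:
        self.data += buf

    def flush(self) -> None:
        self.flushes += 1


class PartitionWriteBatcher:
    """Groups queued messages by partition and flushes each partition once per batch."""

    def __init__(self, sinks):
        self.sinks = sinks                # partition -> sink
        self.pending = defaultdict(list)  # partition -> queued messages

    def enqueue(self, partition, message: bytes) -> None:
        self.pending[partition].append(message)

    def flush_all(self) -> None:
        for partition, messages in self.pending.items():
            sink = self.sinks[partition]
            sink.write(b"".join(messages))  # one large write instead of many small ones
            sink.flush()                    # one flush per partition per batch
        self.pending.clear()
```

In a server built this way, the requests in a batch would presumably be acknowledged only after `flush_all` returns, trading a small per‑request wait for far fewer flushes.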
GC tuning: Switched from CMS to G1, tuned heap size, Young‑GC pause time, and Mixed‑GC intervals. Most GC pauses dropped from >100 ms to <70 ms.
Bandwidth reduction: Implemented client‑side addressing, in which clients send each request directly to the server that owns the partition, eliminating unnecessary request forwarding between servers. This saved ~40% of bandwidth and cut P95 latency by 50%.
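Client‑side addressing can be sketched as a client that caches a partition‑to‑server routing table, sends requests straight to the cached owner, and refreshes the table only when a server replies that it no longer owns the partition. `RoutingClient`, `NotOwnerError`, and the injected `fetch_routes`/`send` callables are hypothetical; the source does not describe Talos's client protocol.

```python
class NotOwnerError(Exception):
    """Hypothetical server reply: this server does not own the partition."""


class RoutingClient:
    """Hypothetical client that addresses partition owners directly."""

    def __init__(self, fetch_routes, send):
        self._fetch_routes = fetch_routes  # () -> {partition: server}
        self._send = send                  # (server, partition, payload) -> result
        self._routes = fetch_routes()      # cached routing table

    def put(self, partition, payload, max_retries: int = 2):
        for _ in range(max_retries + 1):
            server = self._routes[partition]
            try:
                # Direct request to the cached owner: no server-side forwarding hop.
                return self._send(server, partition, payload)
            except NotOwnerError:
                # Ownership moved (e.g. after a rebalance): refresh the table and retry.
                self._routes = self._fetch_routes()
        raise RuntimeError(f"could not reach the owner of partition {partition}")
```

In the common case every request costs one network hop; the routing table is fetched again only after ownership changes.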
Load‑balancing improvement: Enhanced consistent hashing by adjusting the number of virtual nodes per physical node based on historical daily traffic, which reduced traffic variance across nodes by more than 50%.
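One plausible way to derive virtual‑node counts from history is to scale each node's count by how far its traffic sits from the cluster mean, clamped to a safe range. This is an assumption, not the published Talos formula: it presumes a node's traffic is roughly proportional to its number of virtual nodes. `rebalance_vnodes` and the `lo`/`hi` bounds are hypothetical.

```python
def rebalance_vnodes(vnodes: dict, daily_traffic: dict, lo: int = 10, hi: int = 400) -> dict:
    """Hypothetical: scale each node's virtual-node count toward equal daily traffic."""
    mean = sum(daily_traffic.values()) / len(daily_traffic)
    new_counts = {}
    for node, count in vnodes.items():
        traffic = daily_traffic[node]
        if traffic <= 0:
            new_counts[node] = hi  # idle node: give it as much of the ring as allowed
            continue
        # Hot nodes (traffic above the mean) shrink; cold nodes grow.
        scaled = round(count * mean / traffic)
        new_counts[node] = max(lo, min(hi, scaled))
    return new_counts
```

Clamping keeps a single unusual day from shrinking a node to nothing or letting it swallow most of the ring.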
Platform Monitoring
Talos provides a unified monitoring framework: agents on each service collect metrics, high‑availability streaming jobs ingest them into Druid, and dashboards and Falcon (Xiaomi's open‑source monitoring system, Open‑Falcon) visualize multi‑dimensional views by service, cluster, and machine. The same framework also powers real‑time metering and billing.
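The multi‑dimensional views amount to rolling raw metric points up along whichever dimensions a dashboard or bill needs, which Druid performs at query time. A minimal in‑memory sketch, where the point layout (`service`, `cluster`, `machine`, `value` keys) and the `rollup` function are illustrative assumptions:

```python
from collections import defaultdict


def rollup(points, dims):
    """Hypothetical: sum metric values grouped by the chosen dimension names."""
    totals = defaultdict(float)
    for point in points:
        key = tuple(point[d] for d in dims)  # e.g. ("c1",) for dims=["cluster"]
        totals[key] += point["value"]
    return dict(totals)
```

The same raw points serve a per‑cluster dashboard (`dims=["cluster"]`), a per‑machine drill‑down (`dims=["cluster", "machine"]`), or a per‑tenant bill if a tenant dimension is recorded.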
Automated Resource Management
Partition quota requests are approved automatically when projected traffic stays below a calculated threshold, which covers about 70% of requests. Requests that overstate their needs are rejected automatically with an explanatory message, removing the manual‑approval bottleneck.
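The auto‑approval rule can be sketched as a pure function. Every number here is a hypothetical placeholder (1 MB/s of capacity per partition, 2× headroom, a 50 MB/s auto‑approval limit), since the source does not publish how Talos computes its threshold; the structure shows the three outcomes: approve, reject with a reason, or fall back to manual review.

```python
import math
from dataclasses import dataclass


@dataclass
class Decision:
    status: str  # "approved", "rejected", or "manual_review"
    reason: str


def review_partition_request(requested: int, projected_mb_per_s: float,
                             mb_per_s_per_partition: float = 1.0,  # hypothetical capacity
                             headroom: float = 2.0,                 # hypothetical safety factor
                             auto_limit_mb_per_s: float = 50.0) -> Decision:  # hypothetical threshold
    # Partitions the projected traffic can justify, including headroom.
    justified = max(1, math.ceil(projected_mb_per_s * headroom / mb_per_s_per_partition))
    if requested > justified:
        return Decision("rejected",
                        f"requested {requested} partitions, but {projected_mb_per_s} MB/s "
                        f"of projected traffic justifies at most {justified}")
    if projected_mb_per_s <= auto_limit_mb_per_s:
        return Decision("approved", "projected traffic is within the auto-approval threshold")
    return Decision("manual_review", "projected traffic exceeds the auto-approval threshold")
```

Because the rejection reason states the justified partition count, a requester can resubmit a corrected request without a round trip to an administrator.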
Future Vision
Talos aims to add transaction support and cross‑region replication to meet financial‑grade reliability, explore compute capabilities within the messaging layer, and investigate ServiceMesh, Serverless, and a next‑generation Message Mesh for decoupled, intelligent data transport.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
