How iQIYI Cut Stream Data Costs by 70%: From Private‑Cloud Kafka to AutoMQ
This article details iQIYI's evolution from a tightly coupled private‑cloud Kafka setup to a cloud‑native AutoMQ architecture, describing the challenges of scaling, the development of the Stream platform and Stream‑SDK, the migration to hybrid and public‑cloud Kafka, and the resulting cost and elasticity improvements.
Background and Motivation
iQIYI processes massive real‑time stream data for recommendation, search, advertising and reporting. The original architecture used Kafka as the storage bus and Flink for computation. As business grew, the private‑cloud Kafka service exhibited three critical shortcomings:
Strong business‑cluster coupling: application code hard‑coded broker addresses, making migrations painful and preventing unified monitoring.
Missing unified schema management: no central metadata for schemas or data ownership, hindering discovery and governance.
Absent master‑slave management: backup links were configured per business without a platform‑level view, complicating consistency and failover.
Stream Platform and Stream‑SDK Architecture
The solution was refactored into three layers: the Stream platform, the Stream‑SDK client, and the underlying storage component.
Stream Platform Core Modules
Logical queues: replace the traditional "cluster+Topic" model with a "project+queue" naming scheme. The queue is bound to one or two clusters (primary/backup), eliminating direct cluster dependencies for applications (see the binding sketch after this list).
Schema management: queues can attach Avro/Protobuf schemas that are automatically synchronized to the metadata center, enabling SQL‑based stream processing and data‑lineage tracking.
Data map: provides multi‑dimensional search, authorization and usage statistics for queues, simplifying cross‑team data reuse.
Data lineage: Stream‑SDK reports read/write endpoints, allowing the platform to construct application‑level lineage graphs.
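To make the model concrete, a logical queue binding can be pictured roughly as the metadata below. This is an illustration only: the QueueBinding and ClusterRef names and fields are assumptions made for this article, not iQIYI's actual platform model.

```java
// Hypothetical sketch of the metadata behind a "project + queue" binding.
// QueueBinding and ClusterRef are illustrative names, not the real platform model.
record ClusterRef(String clusterId, String bootstrapServers, String physicalTopic) {}

record QueueBinding(
        String project,        // e.g. "recommendation"
        String queue,          // e.g. "user-click-events"
        ClusterRef primary,    // primary cluster the queue is currently bound to
        ClusterRef backup,     // optional backup cluster, null if none is configured
        String schemaSubject,  // Avro/Protobuf schema registered in the metadata center
        String owner           // ownership info surfaced in the data map and lineage views
) {}
```

Because applications only ever name the project and queue, the platform is free to re‑point the binding at a different physical cluster without any application change.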
Stream‑SDK Unified Client
The SDK abstracts the Kafka, RocketMQ and, later, AutoMQ protocols. Clients supply a project, queue and token; the SDK fetches the cluster address, topic name and authentication parameters from the platform, then uses the native client for reads and writes.
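A producer‑side call might look roughly like the sketch below. StreamClient and its builder methods are hypothetical stand‑ins for the pattern just described, not the real Stream‑SDK API.

```java
// Hypothetical usage sketch: the application names only project, queue and token;
// the SDK resolves the cluster address, physical topic and auth from the Stream platform.
StreamClient client = StreamClient.builder()
        .project("recommendation")            // logical project, never a broker address
        .queue("user-click-events")           // logical queue bound to primary/backup clusters
        .token(System.getenv("STREAM_TOKEN")) // platform-issued access token
        .build();                             // connection info is fetched via the configuration API

client.send("user-123", "{\"action\":\"play\"}"); // delegates to the native Kafka/RocketMQ/AutoMQ client
client.close();
```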
Two runtime mechanisms are critical:
Configuration acquisition & reporting: the SDK calls the platform's configuration API, obtains the necessary connection info, and reports the client IP, consumer group and application name for lineage.
Heartbeat‑driven cluster change detection: every minute the SDK sends a heartbeat; if the platform indicates a cluster change, the SDK automatically switches traffic to the new cluster without service interruption (a sketch of this loop follows the list).
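A minimal sketch of the heartbeat loop is shown below, reusing the hypothetical QueueBinding type from the earlier sketch; PlatformApi and the switch‑over callback are likewise illustrative assumptions, not the SDK's actual implementation.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

// Illustrative only: poll the platform once a minute; when the queue's primary cluster
// changes, hand the new binding to a callback that rebuilds the underlying client.
interface PlatformApi {
    QueueBinding heartbeat(String project, String queue); // hypothetical heartbeat endpoint
}

class HeartbeatWatcher {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final AtomicReference<QueueBinding> current;
    private final PlatformApi platform;
    private final Consumer<QueueBinding> switchCluster; // e.g. recreate the Kafka producer/consumer

    HeartbeatWatcher(PlatformApi platform, QueueBinding initial, Consumer<QueueBinding> switchCluster) {
        this.platform = platform;
        this.current = new AtomicReference<>(initial);
        this.switchCluster = switchCluster;
    }

    void start() {
        scheduler.scheduleAtFixedRate(() -> {
            QueueBinding latest = platform.heartbeat(current.get().project(), current.get().queue());
            if (!latest.primary().clusterId().equals(current.get().primary().clusterId())) {
                switchCluster.accept(latest); // traffic moves to the new cluster without a restart
                current.set(latest);
            }
        }, 1, 1, TimeUnit.MINUTES);
    }
}
```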
Multi‑Cloud Kafka Construction
Private‑cloud Kafka clusters suffered from poor elasticity and low resource utilization. Starting in 2023, public‑cloud Kafka services from multiple cloud providers were introduced, forming a hybrid architecture. Benefits included on‑demand scaling, higher utilization and >20% cost reduction compared with the pure private‑cloud deployment.
Transition from Kafka to AutoMQ
Although public‑cloud Kafka improved resource elasticity, scaling Kafka clusters remained cumbersome because each broker stores data locally. AutoMQ, a storage‑compute‑separated solution, was adopted to achieve second‑level elasticity.
Key Architectural Features of AutoMQ
Shared storage: all stream data is written to object storage. A Write‑Ahead Log (WAL) on block storage mitigates object‑storage latency and IOPS limits; data is first persisted to the WAL and then flushed to object storage.
Single‑replica topics: the underlying cloud storage already provides multi‑replica durability, allowing AutoMQ topics to use a single logical replica, eliminating intra‑cluster replication traffic and reducing cost.
Kafka protocol compatibility: AutoMQ retains the open‑source Kafka compute layer and fully supports the Kafka wire protocol, so existing producers/consumers and the Stream‑SDK work unchanged (see the producer example after this list).
Rapid elasticity: because brokers no longer hold data, they can be started or terminated in minutes, enabling pay‑as‑you‑go storage costs and matching capacity to traffic spikes.
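Because AutoMQ keeps full Kafka wire‑protocol compatibility, a plain Apache Kafka producer needs nothing beyond a different bootstrap address. The address and topic below are placeholders, not real endpoints.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class AutoMqProducerDemo {
    public static void main(String[] args) {
        // Standard Kafka client configuration; only the bootstrap address points at AutoMQ.
        // "automq.example.internal:9092" and "user-click-events" are placeholder values.
        Properties props = new Properties();
        props.put("bootstrap.servers", "automq.example.internal:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("user-click-events", "user-123", "{\"action\":\"play\"}"));
        } // close() flushes any buffered records before the producer shuts down
    }
}
```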
After performance and stability validation, AutoMQ was deployed in public‑cloud environments and integrated into the Stream platform. Migrating private‑cloud Kafka → public‑cloud Kafka → AutoMQ reduced operational costs by >70%.
Summary of Outcomes and Future Plans
Stream data has become a critical low‑latency conduit at iQIYI. The shift from a “cluster‑centric” to a “data‑centric” architecture—realized through logical queues, unified schema management, data maps, lineage, and the Stream‑SDK—enabled seamless migration across clouds and storage backends. Currently about 40% of stream traffic runs on public‑cloud Kafka or AutoMQ, with roughly half of that traffic already on AutoMQ. The roadmap focuses on expanding AutoMQ adoption and exploring its self‑adaptive elasticity mechanisms to achieve further cost savings.