How Tencent Cloud Implements Tiered Storage for Kafka: Architecture, Challenges, and Evolution
This article examines the challenges of Kafka's traditional architecture, explains why heavy reliance on local state causes operational difficulty and resource waste, and details Tencent Cloud's elastic, storage‑compute‑separated designs—including tiered storage, segment state machines, offset constraints, and performance optimizations—while sharing practical implementation insights and future directions.
Introduction
Lu Shilin, the core maintainer of Tencent Cloud Message Queue Kafka, presented the practice and evolution of tiered storage in Tencent Cloud, covering four aspects: problems and challenges of Kafka architecture, elastic architecture comparison, tiered storage architecture and principles, and Tencent Cloud's concrete implementation.
Problems and Challenges of Kafka Architecture
Typical Kafka deployments use Zookeeper or KRaft for metadata, physical machines or VMs for compute, and local disks for storage. This model suffers from three main issues:
Heavy local state: data resides locally, making any operation require data migration and increasing operational complexity.
Resource waste: scaling is performed at the broker level, but resources (CPU, bandwidth, disk) are allocated per node, leading to inefficient utilization.
Historical data processing: large‑scale data back‑fill pollutes the page cache and degrades overall read/write SLA.
Operational Difficulty
Data migration is required in scenarios such as uneven data distribution among nodes, node‑level resource bottlenecks (CPU, disk, bandwidth), and large volumes of hot data that slow down reads/writes during migration.
Resource Waste
Resource bottlenecks stem from the CPU overhead of compression codecs (e.g., Gzip, Snappy, Zstd), message format conversion (V0–V2), disk space consumed by massive cold data, disk I/O saturation during catch‑up reads of historical data, and HDD throughput constraints. Which resource comes under pressure first depends on the workload.
Elastic Architecture Comparison
Storage‑Compute Separation Architecture
This design, similar to Pulsar’s architecture, uses a storage layer (e.g., HDFS, COS, S3) decoupled from compute nodes. A Proxy handles unified access, service discovery, and rate limiting; Brokers hold partitions and balance load across multiple storage back‑ends; the storage layer can be multi‑modal.
Advantages:
Node expansion does not require data migration.
Compute and storage can be scaled independently.
Drawbacks:
File switching in backends such as HDFS or BookKeeper can introduce latency glitches during lease recovery.
A strong dependency on the external storage system means a failure there can cause a service outage.
Elastic Local Storage Architecture
Combines cloud disks with cloud hosts: automated operations monitor disk usage and expand volumes (e.g., via LVM), and compute resources (CVM or containers) are hot‑migrated when CPU or memory becomes a bottleneck. This supports vertical scaling but does not address Kafka's need for horizontal scaling.
Elastic Remote Storage Architecture
Local storage handles hot data with low latency, while remote storage (e.g., COS) stores cold data. Benefits include consistent write latency, fallback to local storage during remote failures, and cost reduction through cheaper remote storage.
Tiered Storage Architecture
Read/Write Flow
Producers write to cloud disks; data is asynchronously synced to remote storage. Consumers first read from local storage; if data is absent, they fetch from remote storage based on offset and read strategy.
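The routing decision in the read flow above can be sketched as follows. This is a minimal illustration, not Tencent Cloud's actual code; the function name and parameters are hypothetical, standing in for the local and remote offset ranges a broker tracks per partition.

```python
def route_read(offset, local_start, local_end, remote_start):
    """Decide where a fetch at `offset` is served from: local cloud disk
    for hot data, remote tiered storage for cold data (illustrative only)."""
    if local_start <= offset <= local_end:
        return "local"   # hot data: low-latency cloud-disk read
    if remote_start <= offset < local_start:
        return "remote"  # cold data: fetch from remote storage (e.g., COS)
    raise ValueError(f"offset {offset} is outside the retained range")
```

For example, with local data covering offsets 100–200 and remote data starting at 0, a fetch at offset 50 is routed to remote storage while a fetch at 150 is served locally.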
Data Lifecycle
Only inactive segments are uploaded to remote storage. A LocalRetention parameter controls how long uploaded segments are kept locally, while the standard Retention parameter governs remote data expiration.
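The interplay of the two retention parameters can be sketched as below. This is a hedged illustration assuming segments are plain dicts with hypothetical `uploaded` and `last_modified_ms` fields; it is not the production data model.

```python
def expire_local(segment, now_ms, local_retention_ms):
    """A local copy may be deleted only after the segment has been uploaded
    to remote storage and has aged past LocalRetention (sketch)."""
    return segment["uploaded"] and now_ms - segment["last_modified_ms"] > local_retention_ms

def expire_remote(segment, now_ms, retention_ms):
    """Remote copies expire under the standard Retention parameter (sketch)."""
    return now_ms - segment["last_modified_ms"] > retention_ms
```

Note the guard on `uploaded`: a segment that has not yet been copied to remote storage is never deleted locally, regardless of age.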
Offset Constraints
Key offset definitions:
Lz – Local log end offset (latest local offset).
Ly – Last stable offset (consumer‑visible offset).
Ry – Remote log end offset.
Lx – Local log start offset.
Rx – Remote log start offset.
Constraints: Lz ≥ Ly ≥ Lx and Ly ≥ Ry ≥ Rx.
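These invariants are easy to express as a single check, sketched below with hypothetical lowercase parameter names mirroring the offsets defined above.

```python
def offsets_valid(lz, ly, lx, ry, rx):
    """Check the tiered-storage offset invariants:
    Lz >= Ly >= Lx (local range) and Ly >= Ry >= Rx (remote range)."""
    return lz >= ly >= lx and ly >= ry >= rx
```

For instance, a remote log end offset that runs ahead of the last stable offset (Ry > Ly) violates the second constraint, since only consumer‑visible data may be uploaded.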
Segment State Machine
Segments transition through three dimensions:
CopySegment: CopySegmentStarted → CopySegmentFinished
DeleteSegment: DeleteSegmentStarted → DeleteSegmentFinished
DeletePartition: DeletePartitionMarked → DeletePartitionStarted → DeletePartitionFinished
State transitions must follow the defined order; for example, CopySegmentStarted → DeleteSegmentStarted is illegal without completing the copy phase first.
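The segment-level rules above can be sketched as a tiny state machine that rejects out-of-order transitions. This is a simplified illustration covering only the copy/delete segment states (partition deletion follows analogous rules); the class and transition table are not the production implementation.

```python
# Legal segment transitions: copy must finish before deletion may start.
VALID_TRANSITIONS = {
    None: {"CopySegmentStarted"},
    "CopySegmentStarted": {"CopySegmentFinished"},
    "CopySegmentFinished": {"DeleteSegmentStarted"},
    "DeleteSegmentStarted": {"DeleteSegmentFinished"},
}

class SegmentStateMachine:
    def __init__(self):
        self.state = None  # segment not yet copied

    def transition(self, target):
        if target not in VALID_TRANSITIONS.get(self.state, set()):
            raise ValueError(f"illegal transition {self.state} -> {target}")
        self.state = target
```

Attempting CopySegmentStarted → DeleteSegmentStarted raises an error, matching the rule stated above.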
Implementation in Tencent Cloud
Segment Metadata Management
Metadata is synchronized via an internal inner‑topic (WAL). Brokers consume the WAL to build in‑memory state machines, periodically persist snapshots and offsets locally, and recover state using snapshots plus incremental WAL.
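The snapshot-plus-incremental-WAL recovery described above can be sketched like this. It is an illustrative simplification, not Tencent Cloud's implementation: the record shape `(offset, segment_id, new_state)` and the function name are assumptions.

```python
def recover_state(snapshot, wal_records):
    """Rebuild the in-memory segment-state map from a persisted snapshot
    plus WAL records consumed from the metadata inner topic (sketch).
    snapshot: (state_dict, snapshot_offset)."""
    state, applied_offset = dict(snapshot[0]), snapshot[1]
    for offset, segment_id, new_state in wal_records:
        if offset <= applied_offset:
            continue  # already reflected in the snapshot
        state[segment_id] = new_state
        applied_offset = offset
    return state, applied_offset
```

Records at or below the snapshot offset are skipped, so replay is idempotent even when the WAL overlaps the snapshot.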
Consumption Performance
While write performance matches native Kafka, reading from remote COS can become a bottleneck. Tencent Cloud improves read throughput by pre‑loading messages into memory pools and proactively downloading hot data during idle periods.
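A read-ahead scheme of the kind described can be sketched as below. This is a hypothetical illustration, not the production design: segments are identified by plain integers, the memory pool is a simple LRU, and the fetch callable stands in for a COS download.

```python
from collections import OrderedDict

class PrefetchingReader:
    """Serve cold reads from a memory pool, prefetching the next few
    segments so sequential catch-up reads avoid repeated remote fetches."""

    def __init__(self, fetch_remote, pool_size=4, read_ahead=2):
        self.fetch_remote = fetch_remote  # callable: segment_id -> bytes
        self.pool = OrderedDict()         # insertion-ordered LRU pool
        self.pool_size = pool_size
        self.read_ahead = read_ahead

    def read(self, segment_id):
        # Load the requested segment plus `read_ahead` successors.
        for sid in range(segment_id, segment_id + self.read_ahead + 1):
            if sid not in self.pool:
                self.pool[sid] = self.fetch_remote(sid)
                if len(self.pool) > self.pool_size:
                    self.pool.popitem(last=False)  # evict oldest entry
        self.pool.move_to_end(segment_id)  # mark as most recently used
        return self.pool[segment_id]
```

With `read_ahead=2`, the first read of segment 0 downloads segments 0–2; the subsequent read of segment 1 only needs to fetch segment 3.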
Isolation and Resource Controls
To ensure stability when third‑party storage is involved, the system isolates resources:
Dedicated I/O disks to reduce interference.
CPU core pinning and thread isolation.
Bandwidth throttling and parallelism control for upload/download tasks.
Off‑heap memory usage and ByteBuffer reuse.
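The bandwidth-throttling item above is commonly implemented with a token bucket; a minimal sketch follows. The class and rates are illustrative assumptions, not the production rate limiter.

```python
class TokenBucket:
    """Token-bucket throttle for upload/download tasks (sketch):
    tokens accrue at `rate_bytes_per_sec` up to a burst capacity."""

    def __init__(self, rate_bytes_per_sec, burst_bytes):
        self.rate = rate_bytes_per_sec
        self.capacity = burst_bytes
        self.tokens = burst_bytes
        self.last = 0.0

    def throttle(self, nbytes, now):
        """Return seconds the caller must wait before sending `nbytes`."""
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return 0.0
        deficit = nbytes - self.tokens
        self.tokens = 0.0
        return deficit / self.rate
```

A caller sleeps for the returned duration before transferring, which caps the steady-state transfer rate without blocking small bursts.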
Rollback Mechanisms
Features include pausing tiered uploads, pausing remote downloads on demand, automatic cloud‑disk expansion, and topic/cluster‑level rollback support.
Future Outlook
Plans involve richer schema support (Protobuf, JSON), enhanced ingestion layer with stateless horizontal scaling, compute engine for format conversion (e.g., Parquet, Hudi, Delta Lake), and multi‑modal storage exploration.
Conclusion
Tencent Cloud’s tiered storage solution decouples compute and storage, introduces flexible data lifecycle management, and provides elastic scaling while addressing the operational and resource challenges of traditional Kafka deployments.
Tencent Cloud Middleware
Official account of Tencent Cloud Middleware. Focuses on microservices, messaging middleware and other cloud‑native technology trends, publishing product updates, case studies, and technical insights. Regularly hosts tech salons to share effective solutions.