Technical Overview of Tencent Cloud CBS Data Scheduling System
The Tencent Cloud CBS data scheduling system has evolved from a simple snapshot service into a highly concurrent, low‑latency platform. It combines COW/ROW snapshot mechanisms, multi‑version snapshots, rapid rollback, hot‑data caching, horizontal scaling, fault‑tolerant task switching, cross‑region replication, and seamless disk migration to deliver reliable, fast storage for backups, image creation, and cloud‑disk migration, with AI‑driven scheduling and ultra‑low‑latency features planned for the future.
This document summarizes the technical sharing of Yang Guangchao, a Tencent Cloud storage expert, on the CBS (Cloud Block Storage) data scheduling system.
1. Evolution of CBS Data Scheduling System – Initially a simple snapshot service in 2015, CBS evolved to support data protection, cloud‑server image production, and online cloud‑disk migration as business scale grew and latency requirements became stricter.
2. Typical Business Scenarios and Challenges
Data protection – daily backups, manual and periodic snapshots.
Image production for cloud servers – creating images from snapshots and batch deploying servers.
Cloud‑disk migration – moving disks between storage warehouses without affecting user workloads.
Each scenario faces challenges such as latency sensitivity, high concurrency, and fault tolerance.
3. Key Technologies
COW and ROW – Two snapshot mechanisms. COW (Copy‑On‑Write) copies original blocks before writing, leading to write amplification; ROW (Redirect‑On‑Write) updates pointers, reducing write amplification and favoring write‑intensive workloads.
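The contrast between the two write paths can be sketched as follows. This is an illustrative model, not CBS internals; the block store, snapshot store, and location parameters are hypothetical.

```python
# Illustrative sketch (not CBS internals) contrasting COW and ROW write paths.
# All structures here are hypothetical in-memory stand-ins for block storage.

def cow_write(volume, snapshot_store, block_id, new_data):
    """Copy-On-Write: before the first overwrite after a snapshot, copy the
    original block into the snapshot store -- one extra read+write, i.e.
    write amplification."""
    if block_id not in snapshot_store:               # first write since snapshot
        snapshot_store[block_id] = volume[block_id]  # copy old data first
    volume[block_id] = new_data                      # then write in place

def row_write(volume_map, data_store, block_id, new_data, new_loc):
    """Redirect-On-Write: write new data to a fresh location and update only
    the pointer; the old block stays put for the snapshot -- no extra copy."""
    data_store[new_loc] = new_data
    volume_map[block_id] = new_loc                   # pointer update only
```

The pointer-only update is why ROW favors write-intensive workloads: a COW write costs a read plus two writes on the first touch of each block, while a ROW write costs a single write plus a metadata update.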
Multi‑Version ROW – Assigns version numbers to snapshots, enabling incremental backup and precise data reconstruction.
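A minimal sketch of the version bookkeeping, under the assumption (not stated in the talk) that each write is tagged with the snapshot version active at the time; an incremental backup then reduces to "blocks written since the previous version". Class and method names are hypothetical.

```python
# Hypothetical sketch of multi-version ROW bookkeeping: writes are tagged
# with the current version, snapshots bump the version counter, and an
# incremental backup is the set of blocks written between two versions.

class VersionedVolume:
    def __init__(self):
        self.version = 0          # bumped on each snapshot
        self.writes = {}          # block_id -> list of (version, data), in order

    def snapshot(self):
        """Freeze the current state; returns the new snapshot's version."""
        self.version += 1
        return self.version

    def write(self, block_id, data):
        self.writes.setdefault(block_id, []).append((self.version, data))

    def read_at(self, block_id, snap_version):
        """Reconstruct a block as of a given snapshot version."""
        candidates = [(v, d) for v, d in self.writes.get(block_id, [])
                      if v < snap_version]
        return candidates[-1][1] if candidates else None

    def incremental(self, prev_version, version):
        """Blocks changed between snapshot prev_version and snapshot version."""
        return {b for b, ws in self.writes.items()
                if any(prev_version <= v < version for v, _ in ws)}
```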
Snapshot Overview – Combines full and incremental backup; snapshot creation is designed to be completed in seconds.
Rollback Process – Uses bitmap metadata to identify which blocks need restoration, merging data from the target snapshot and its dependencies.
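The merge step described above can be sketched like this, assuming (hypothetically) that each snapshot in the dependency chain stores only the blocks it owns, newest first:

```python
# Hypothetical sketch of bitmap-driven rollback: the bitmap marks blocks that
# differ from the target snapshot; each marked block is restored by walking
# the snapshot chain from the target back to its oldest dependency.

def rollback(volume, snapshot_chain, dirty_bitmap):
    """snapshot_chain: list of dicts (block_id -> data), target snapshot
    first, each holding only the blocks it owns."""
    for block_id, dirty in enumerate(dirty_bitmap):
        if not dirty:
            continue                    # block already matches the snapshot
        for snap in snapshot_chain:     # merge: newer snapshots override older
            if block_id in snap:
                volume[block_id] = snap[block_id]
                break
```

The bitmap keeps rollback proportional to the number of changed blocks rather than the full disk size.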
Image Production via Rollback – Leverages snapshot rollback to create images quickly without full download, achieving second‑level server boot.
Hot Data Access Strategy – Caches frequently accessed image blocks in the transmission layer to reduce latency for batch server provisioning.
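One plausible shape for such a cache is a simple LRU keyed by block, sketched below; the capacity and backend-fetch callback are hypothetical, not CBS parameters.

```python
from collections import OrderedDict

# Minimal LRU sketch of a hot-block cache in a transmission node: repeated
# reads of the same image block hit the cache instead of the storage pool.

class HotBlockCache:
    def __init__(self, capacity, fetch_from_backend):
        self.capacity = capacity
        self.fetch = fetch_from_backend
        self.cache = OrderedDict()             # block_id -> data, LRU order

    def get(self, block_id):
        if block_id in self.cache:
            self.cache.move_to_end(block_id)   # mark as recently used
            return self.cache[block_id]
        data = self.fetch(block_id)            # cold: read from storage pool
        self.cache[block_id] = data
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)     # evict least recently used
        return data
```

During batch provisioning, hundreds of servers read the same image blocks, so even a small per-node cache absorbs most backend reads.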
Horizontal Scaling – Splits large regional deployments into smaller zones, uses static (heartbeat) and dynamic (load‑aware) balancing for both control and transmission layers, and replicates data to avoid hotspots.
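The dynamic, load-aware half of that balancing can be sketched as "pick the least-loaded node among those with a fresh heartbeat". The node schema and timeout value below are assumptions for illustration.

```python
import time

# Hypothetical sketch of load-aware placement: filter nodes by heartbeat
# freshness (the static signal), then pick the lowest load (the dynamic one).

def pick_node(nodes, heartbeat_timeout=5.0, now=None):
    """nodes: list of dicts {'id', 'last_heartbeat', 'load'}."""
    now = time.time() if now is None else now
    alive = [n for n in nodes if now - n['last_heartbeat'] < heartbeat_timeout]
    if not alive:
        raise RuntimeError("no healthy nodes available")
    return min(alive, key=lambda n: n['load'])['id']
```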
Task Smooth Switching – Detects node failures via heartbeat, failure rate, and monitoring; switches tasks to healthy nodes to maintain I/O continuity.
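A sketch of the switchover logic, combining the two detection signals mentioned (stale heartbeat and failure rate) with reassignment to the least-loaded healthy node; thresholds and data shapes are hypothetical.

```python
# Hypothetical sketch of task failover: a node is unhealthy when its
# heartbeat is stale or its recent failure rate crosses a threshold, and
# its tasks are reassigned to the healthy node carrying the fewest tasks.

def failover(tasks, nodes, now, hb_timeout=5.0, max_fail_rate=0.2):
    """tasks: task_id -> node_id;
    nodes: node_id -> {'last_heartbeat', 'fail_rate'}."""
    def healthy(nid):
        n = nodes[nid]
        return (now - n['last_heartbeat'] < hb_timeout
                and n['fail_rate'] <= max_fail_rate)

    good = [nid for nid in nodes if healthy(nid)]
    for task_id, nid in list(tasks.items()):
        if not healthy(nid):
            # move to the healthy node currently holding the fewest tasks
            tasks[task_id] = min(
                good, key=lambda g: sum(1 for t in tasks.values() if t == g))
    return tasks
```

Because in-flight I/O continues against the new node, the switch is invisible to the user workload.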
Cross‑Region Image Replication – Uses separate control planes in each region to transfer image metadata, ensuring data safety by separating transfer and verification.
Seamless Disk Migration – Introduces an I/O access layer that isolates user I/O from background migration I/O, supports three block states (unmigrated, migrating, migrated) and writes to both source and destination during migration to guarantee data integrity.
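The per-block state machine and dual-write rule can be sketched as below; the state names match the three states in the text, while the function signatures are hypothetical.

```python
from enum import Enum

# Hypothetical sketch of the I/O access layer's per-block routing during
# migration: reads go to whichever side is authoritative, and writes to a
# block that is mid-copy land on both source and destination.

class BlockState(Enum):
    UNMIGRATED = 0
    MIGRATING = 1
    MIGRATED = 2

def handle_write(state, src, dst, block_id, data):
    if state[block_id] is BlockState.UNMIGRATED:
        src[block_id] = data          # still owned by the source warehouse
    elif state[block_id] is BlockState.MIGRATING:
        src[block_id] = data          # dual write: keep both sides
        dst[block_id] = data          # consistent while the copy is in flight
    else:                             # MIGRATED
        dst[block_id] = data

def handle_read(state, src, dst, block_id):
    return dst[block_id] if state[block_id] is BlockState.MIGRATED else src[block_id]
```

Dual-writing the MIGRATING blocks is what guarantees integrity: whichever copy of the block the background migrator finishes with, it already contains the latest user write.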
Data Reliability – MD5 checksums, cross‑region backup, version‑based write protection, and careful reclamation of old snapshot data ensure data correctness.
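The checksum step can be sketched per block as compute-before, verify-after; the transport callback is a hypothetical stand-in for the transmission layer.

```python
import hashlib

# Sketch of end-to-end checksum verification (the talk names MD5): compute
# a digest before transfer, re-verify after, and fail loudly on mismatch.

def md5_of(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()

def transfer_block(block: bytes, send) -> bytes:
    """send() is a hypothetical transport callback returning the bytes as
    received on the far side; raises if the copy is corrupt."""
    expected = md5_of(block)
    received = send(block)
    if md5_of(received) != expected:
        raise IOError("checksum mismatch, block must be retransmitted")
    return received
```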
Future Directions
Support for block, file, and database storage scenarios.
Ultra‑low‑latency online migration for high‑performance disks.
Second‑level RPO for finer‑grained data protection.
AI‑driven intelligent scheduling for resource risk detection and balanced storage pool utilization.
Q&A Highlights
Empty blocks are skipped (not copied) during a full snapshot.
Deployment has shifted from large‑region to small‑region models.
Migration speed is dynamically controlled based on available bandwidth.
Cross‑region image transfer has no special network requirements beyond internal bandwidth limits.
The control plane has active‑standby nodes; the scheduling and transmission clusters are stateless.