Ceph Performance Optimization: Lock-Related Issues and Solutions
The article details how Didi's large-scale Ceph deployment suffered from high tail latency due to long-held and coarse-grained locks, and describes a series of fixes (asynchronous read threads, fine-grained object caches, per-thread lock-free logging, and lock-free filestore apply) that cut latency by up to 90% and more than doubled read throughput.
This article discusses Ceph performance optimization, focusing on lock-related issues discovered during large-scale deployment at Didi. Ceph is a widely used open-source distributed storage system providing block, file, and object storage in cloud computing environments. The article assumes readers are familiar with Ceph's basic read/write code flow; the analysis is based on the Luminous release.
The authors identified that Ceph's tail latency is poor, especially under high concurrent loads, causing latency spikes that can lead to timeouts or crashes in latency-sensitive applications. They conducted detailed analysis and optimization of Ceph's tail latency issues, with lock usage being a major contributing factor.
2. Long Lock Holding Time
The article describes how Ceph's OSD (Object Storage Daemon) processes client requests using a thread pool (osd_op_tp). When handling an object read request, a thread takes the corresponding PG (Placement Group) lock, then performs the read synchronously, holding the lock until the data is sent back to the client. This synchronous approach becomes a problem when the data is not in the page cache and must be read from disk: the time-consuming disk access blocks all other operations on the same PG, increasing latency and reducing throughput.
The authors implemented asynchronous read optimization by creating dedicated read threads. The OSD threads only need to submit read requests to the read thread's queue and can immediately unlock, significantly reducing PG lock holding time. The read thread performs disk reads and then passes results to a finisher thread, which re-acquires the PG lock for subsequent processing. This moves time-consuming disk access outside the locked section and enables unified traffic control.
Testing with fio showed that after the asynchronous read optimization, average random write latency decreased by 53%, since writes no longer queue behind disk reads performed under the same PG lock. For a specific filestore cluster, read throughput increased by 120% after the optimization was deployed.
3. Coarse Lock Granularity
The article discusses Ceph's client-side object cache, which uses a single large mutex lock for all cache operations. This causes contention when multiple operations need to access the cache simultaneously. The authors implemented fine-grained object-level locks, allowing concurrent operations on different objects while only requiring the global lock for shared data structure access. This increased throughput by over 20% under high concurrency.
4. Unnecessary Lock Contention
The article describes several optimizations to reduce lock contention:
4.1 Reducing PG Lock Competition: The authors modified the request processing pipeline to ensure only one thread processes requests from each PG slot queue at a time, preventing multiple threads from blocking on the same PG lock.
4.2 Log Lock Optimization: Instead of using a global log queue with a single lock, each logging thread now has its own thread-local log queue implemented as a lock-free single-producer single-consumer queue. This reduced log submission latency by nearly 90% under high concurrency.
4.3 Filestore Apply Lock Optimization: For the filestore storage engine, the authors eliminated the apply lock by using atomic operations and ensuring each OpSequencer (osr) is only added to the apply queue once. This optimization reduced total apply time by 89.6% in testing.
The article concludes with information about the Didi Cloud Storage team, their responsibilities, and technical expertise in distributed storage, internet service architecture, and Linux storage stack.
Didi Tech
Official Didi technology account