How DisCoGC Cuts Storage Costs by 20%: A Deep Dive into ByteStore’s New GC Paradigm
This article analyzes DisCoGC, the garbage-collection algorithm introduced by ByteDance for its ByteStore system. It explains how discard‑centric garbage collection breaks the trade‑off between write amplification and space amplification in log‑structured storage, details the engineering challenges of deploying it across multiple storage layers, and presents production results showing up to 20% TCO reduction without impacting latency.
Background and Motivation
FAST 2026 accepted a paper titled “Discard‑Based Garbage Collection for Distributed Log‑Structured Storage Systems in ByteDance,” describing a new GC approach for ByteStore, ByteDance’s self‑developed append‑only distributed storage that serves petabyte‑scale workloads such as TikTok video storage, recommendation index, large‑model data, and Feishu collaborative documents.
The Classic Compaction Dilemma
Log‑structured storage offers high write throughput and simple consistency, but its append‑only nature means that updates and deletes leave behind large amounts of invalid data. Traditional compaction scans old logs, rewrites valid data to new logs, and deletes the old files. This incurs two opposing costs:
Frequent compaction reduces space amplification but dramatically increases write amplification, consuming I/O, CPU, and accelerating SSD wear.
Infrequent compaction lowers write amplification but leaves large amounts of invalid data, inflating space usage and total‑cost‑of‑ownership (TCO).
In ByteStore, this trade‑off generated millions of dollars of extra monthly TCO.
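The cost structure of traditional compaction can be made concrete with a small sketch. The segment model and names below are illustrative, not ByteStore's actual structures; the point is that reclaiming space forces every surviving valid byte to be rewritten, which is exactly the write amplification the article describes.

```python
# A minimal sketch of traditional log compaction. A segment is a list of
# (key, size_bytes, is_valid) records; compaction copies the valid records
# into a fresh segment so the old one can be deleted.

def compact(segment):
    """Rewrite valid records to a new segment; return the new segment plus
    bytes rewritten (write amplification) and bytes reclaimed (space won)."""
    new_segment = [rec for rec in segment if rec[2]]
    bytes_rewritten = sum(size for _, size, valid in segment if valid)
    bytes_reclaimed = sum(size for _, size, valid in segment if not valid)
    return new_segment, bytes_rewritten, bytes_reclaimed

# A segment that is half invalid: reclaiming 2 MiB of space costs rewriting
# 2 MiB of valid data -- every byte of space saved is paid for in write I/O.
seg = [("a", 1 << 20, True), ("b", 1 << 20, False),
       ("c", 1 << 20, True), ("d", 1 << 20, False)]
new_seg, rewritten, reclaimed = compact(seg)
```

Compacting more often shrinks the invalid fraction per pass but multiplies how often the same valid bytes get copied; compacting rarely lets `bytes_reclaimed` pile up as wasted space. That is the dilemma.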
Data‑Driven Insight from Production Traces
ByteStore, ByteDrive, and Tsinghua University collected full‑scale I/O traces from three core scenarios—online services, search‑ad‑recommendation (SAR), and offline distributed computation—covering petabytes of writes and over a billion I/O requests across several days. The analysis revealed:
Online services exhibit highly fragmented 4 KiB writes; invalid data appears as scattered fragments.
SAR and offline jobs produce mostly large, sequential writes (>256 KiB) with frequent full‑range overwrites, generating long contiguous invalid regions.
These contiguous invalid regions account for >70% of total write traffic and are the primary source of GC cost.
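The kind of measurement behind these findings can be sketched as follows. The range representation and the coalescing step are assumptions for illustration; the 128 KiB threshold comes from the paper's analysis.

```python
# A sketch of the trace analysis: coalesce adjacent invalid byte ranges,
# then measure what share of invalid data sits in large contiguous regions.

def contiguous_share(invalid_ranges, threshold=128 * 1024):
    """invalid_ranges: list of (offset, length) pairs. Returns the fraction
    of invalid bytes that belong to coalesced regions >= threshold."""
    merged = []
    for off, length in sorted(invalid_ranges):
        if merged and off <= merged[-1][0] + merged[-1][1]:
            prev_off, prev_len = merged[-1]  # overlaps/abuts previous range
            merged[-1] = (prev_off, max(prev_len, off + length - prev_off))
        else:
            merged.append((off, length))
    total = sum(l for _, l in merged)
    large = sum(l for _, l in merged if l >= threshold)
    return large / total if total else 0.0

# Two abutting 512 KiB overwrites coalesce into one 1 MiB region, dwarfing
# an isolated 4 KiB fragment -- the pattern SAR/offline workloads showed.
share = contiguous_share([(0, 512 * 1024), (512 * 1024, 512 * 1024),
                          (10 * 1024 * 1024, 4096)])
```

A high `share` on real traces is what justified targeting contiguous regions first.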
Core Idea: Discard‑Centric Garbage Collection (DisCoGC)
Because >90% of invalid data forms contiguous ranges larger than 128 KiB (most >1 MiB), the team proposed abandoning pure compaction and introducing a discard‑first GC paradigm:
Discard Phase: Directly mark and reclaim contiguous invalid ranges without moving any valid data, eliminating write amplification.
Compaction Phase: Run low‑frequency compaction only to clean up fragmented small invalid pieces left by discard.
This complementary approach removes the write‑amplification penalty while still handling fragmentation.
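The two-phase split can be sketched as a simple planning step. The 128 KiB cutoff follows the paper's observation about contiguous invalid ranges; the function shape and names are illustrative.

```python
# A minimal sketch of discard-first GC planning: large contiguous invalid
# ranges are reclaimed in place (no data movement, so no write
# amplification), while small fragments are deferred to a low-frequency
# compaction pass.

DISCARD_THRESHOLD = 128 * 1024  # ranges at least this large go to discard

def plan_gc(invalid_ranges):
    """Split invalid (offset, length) ranges into immediate discard work
    and deferred compaction work; also report bytes freed with zero rewrite."""
    to_discard = [r for r in invalid_ranges if r[1] >= DISCARD_THRESHOLD]
    to_compact = [r for r in invalid_ranges if r[1] < DISCARD_THRESHOLD]
    freed_without_rewrite = sum(l for _, l in to_discard)
    return to_discard, to_compact, freed_without_rewrite
```

Because most invalid bytes land in `to_discard`, the expensive copy-based path only ever touches the small residue, which is why compaction can run at low frequency without space blowing up.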
Engineering Challenges and Solutions
Cross‑layer Alignment: ByteStore spans ByteDrive (block storage), the ByteStore engine (EC stripe units), and the UFS user‑space file system (4 KiB clusters). Misaligned allocation units caused up to 50% boundary loss when discarding. The team introduced:
Boundary‑extension: Expand discard ranges to absorb adjacent leftover garbage.
EC‑stripe redesign: Align stripe size with UFS clusters to eliminate boundary loss.
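Boundary-extension can be sketched under an assumed 4 KiB cluster model. A discard can only reclaim whole clusters, so a naive implementation shrinks a range inward to cluster boundaries and loses the partial edges; extending outward when the edge bytes are also garbage recovers them. The `is_garbage` callback is a stand-in for the real validity metadata.

```python
# A sketch of boundary extension for cluster-aligned discards.

CLUSTER = 4 * 1024  # UFS cluster size mentioned in the article

def align_discard(start, end, is_garbage):
    """Cluster-align a raw discard over bytes [start, end).

    If the bytes between a range edge and the nearest cluster boundary are
    also garbage, extend the range outward to absorb them; otherwise shrink
    inward so no live data is discarded. Returns an aligned (start, end),
    or None if no whole cluster remains.
    """
    left = start - (start % CLUSTER)
    if all(is_garbage(o) for o in range(left, start)):
        start = left                 # absorb adjacent garbage on the left
    else:
        start = left + CLUSTER       # live bytes at the edge: shrink inward
    right = end + (-end % CLUSTER)
    if all(is_garbage(o) for o in range(end, right)):
        end = right                  # absorb adjacent garbage on the right
    else:
        end = right - CLUSTER
    return (start, end) if start < end else None
```

The EC-stripe redesign attacks the same problem from the other side: if stripe units and clusters share boundaries by construction, the shrink-inward case stops occurring.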
Impact on Foreground Workloads: Discard metadata updates could compete with latency‑sensitive services. A three‑layer control system was built:
Batch‑merge discard requests per file to reduce I/O.
Concurrency‑aware scheduling that limits parallel discards and prioritizes the largest reclaimable segments.
IOPS throttling to guarantee foreground latency.
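The first two controls can be sketched together. The queue model, class name, and per-tick budget are illustrative assumptions; the two ideas taken from the article are batching all pending discards for a file into one request, and spending a bounded I/O budget on the largest reclaimable work first.

```python
# A sketch of batch-merged, throttled discard scheduling.

from collections import defaultdict
import heapq

class DiscardScheduler:
    def __init__(self, iops_budget):
        self.iops_budget = iops_budget    # max batched discards per tick
        self.pending = defaultdict(list)  # file_id -> [(offset, length)]

    def submit(self, file_id, offset, length):
        self.pending[file_id].append((offset, length))

    def tick(self):
        """Issue up to iops_budget discards, one merged batch per file,
        largest reclaimable batch first; the rest wait for the next tick."""
        batches = []
        for file_id, ranges in self.pending.items():
            total = sum(l for _, l in ranges)
            heapq.heappush(batches, (-total, file_id, sorted(ranges)))
        issued = []
        while batches and len(issued) < self.iops_budget:
            _, file_id, ranges = heapq.heappop(batches)
            issued.append((file_id, ranges))  # one batched discard per file
            del self.pending[file_id]
        return issued
```

Leftover work simply stays queued, so foreground latency is protected by construction: GC can never exceed its per-tick budget.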
Fragmentation & Metadata Overhead: Frequent discards sparsify log files, increasing metadata size. The solution pairs high‑frequency discard with occasional compaction that merges fragments and trims metadata growth.
SSD Trim Adaptation: Different SSD models expose vastly different Trim IOPS. The team added a Trim Filter to drop tiny Trim ranges and a Trim Merger to coalesce adjacent ranges, allowing the system to adapt Trim thresholds per hardware.
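The filter-plus-merger pipeline can be sketched as a single preparation pass. The function shape is illustrative; the key tunable, `min_trim_bytes`, is a per-device threshold rather than a fixed constant from the paper.

```python
# A sketch of Trim preparation: coalesce adjacent ranges (Trim Merger),
# then drop ranges too small to be worth a Trim command on this SSD
# (Trim Filter).

def prepare_trims(ranges, min_trim_bytes):
    """Coalesce adjacent/overlapping (offset, length) ranges, then keep
    only those at least min_trim_bytes long."""
    merged = []
    for off, length in sorted(ranges):
        if merged and off <= merged[-1][0] + merged[-1][1]:
            prev_off, prev_len = merged[-1]  # abuts or overlaps previous
            merged[-1] = (prev_off, max(prev_len, off + length - prev_off))
        else:
            merged.append((off, length))
    return [r for r in merged if r[1] >= min_trim_bytes]
```

A drive with cheap Trim gets a small threshold and sees nearly everything; a drive with slow Trim gets a large threshold, and the filtered-out fragments are left for the occasional compaction pass instead.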
Production Results
After full deployment across ByteDance’s production clusters, DisCoGC achieved:
Space amplification reduced from 1.37× to 1.23× and write amplification cut by 32%, yielding ~20% overall storage TCO reduction.
Scenario‑specific TCO savings: >25% for high‑sequential write workloads, 2‑5% for fragmented random‑write services.
No measurable impact on latency, P99/P999 tail latency, or bandwidth.
CPU usage for GC dropped to 82.9% of the pure‑compaction baseline; metadata memory overhead increased only 2.9%.
These results confirm that the discard‑centric design fully resolves the long‑standing write‑vs‑space amplification dilemma.
Conclusion
DisCoGC demonstrates that a data‑driven, discard‑first GC strategy can dramatically lower storage costs and improve efficiency in petabyte‑scale log‑structured systems without sacrificing performance. The success stems from extensive production trace analysis, careful cross‑layer engineering, and adaptive hardware‑aware mechanisms.
ByteDance SE Lab
Official account of ByteDance SE Lab, sharing research and practical experience in software engineering. Our lab unites researchers and engineers from various domains to accelerate the fusion of software engineering and AI, driving technological progress in every phase of software development.