Design and Implementation of a Scalable, High‑Availability Object Storage Service (OSS) Based on SeaweedFS
This article describes the design goals, technology selection, architecture, high‑availability mechanisms, performance testing, cost optimization, and seamless migration strategy of a new object storage service built on SeaweedFS to support billions of files with low latency and high reliability.
Amazon S3 (Simple Storage Service) has been a widely used object storage platform from AWS since 2006, and its API has become the de facto standard for object storage. Our company needs massive storage for images, videos, files, QR codes, and ML training data, already exceeding 1 billion objects, so storage reliability is critical.
Design Goals: The service must be scalable (support >1 billion objects with horizontal expansion), highly available (multi-tenant isolation, rate limiting, disaster recovery), high performance (comparable to Ceph RADOS), low cost (cheaper than Ceph), easy to integrate (S3-compatible), and able to be upgraded seamlessly without service disruption.
Technology Selection: After evaluating open-source options, we focused on MinIO and SeaweedFS. MinIO offers a friendly UI and simple deployment, but it cannot handle the required >1 billion objects and struggles with small-file workloads. SeaweedFS, inspired by Facebook's Haystack design, provides strong performance, a flexible architecture, full S3 compatibility, and supports both erasure coding (EC) and replication.
System Architecture: The solution consists of a stateless Proxy layer for API adaptation, a Storage layer (SeaweedFS, Ceph S3, public-cloud S3), and a Management platform for UI, permissions, quotas, and data listing. Metadata is stored in a distributed database (DCDB) co-built with BaiKalDB, which delivers sub-millisecond read/write latency.
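To make the stateless routing idea concrete, here is a minimal sketch in Go. The bucket-to-backend table and the endpoints are invented for the example; a real proxy would load the mapping from the management platform and would also have to re-sign S3 requests for each backend and handle virtual-hosted-style addressing.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httputil"
	"net/url"
	"strings"
)

// backends maps a bucket to its storage backend. In the real service this
// table would live in the management platform / metadata DB; here it is a
// hardcoded sketch with hypothetical endpoints.
var backends = map[string]string{
	"images":  "http://seaweedfs-s3.internal:8333",
	"archive": "http://ceph-rgw.internal:7480",
}

// proxy forwards a path-style S3 request (/<bucket>/<key>) to the backend
// chosen for that bucket, keeping the proxy layer itself stateless.
func proxy(w http.ResponseWriter, r *http.Request) {
	bucket := strings.SplitN(strings.TrimPrefix(r.URL.Path, "/"), "/", 2)[0]
	backend, ok := backends[bucket]
	if !ok {
		http.Error(w, fmt.Sprintf("unknown bucket %q", bucket), http.StatusNotFound)
		return
	}
	target, _ := url.Parse(backend)
	httputil.NewSingleHostReverseProxy(target).ServeHTTP(w, r)
}

func main() {
	http.HandleFunc("/", proxy)
	http.ListenAndServe(":9000", nil)
}
```

Because the proxy holds no state of its own, instances can be added or drained freely behind a load balancer, which is what makes per-bucket routing and later migration transparent to clients.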
High Availability: We achieve isolation via per-bucket proxy routing, tenant-level rate limiting, and dual-storage disaster recovery (replication and EC). On failure, the proxy automatically switches to the backup storage, and a compensation service backfills missing data from the backup.
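The failover-plus-compensation pattern can be sketched as follows. The Store interface, the FailoverStore type, and the MQ hook are illustrative names rather than the actual implementation, and tenant-level rate limiting (for example, a token bucket in front of this wrapper) is omitted for brevity.

```go
package storage

import (
	"context"
	"errors"
	"io"
)

// Store is the minimal surface the proxy needs from any S3-compatible backend.
type Store interface {
	Get(ctx context.Context, bucket, key string) (io.ReadCloser, error)
	Put(ctx context.Context, bucket, key string, body io.Reader) error
}

// FailoverStore reads from the primary cluster and falls back to the backup;
// every successful fallback is reported to a compensation hook so the missing
// object can be copied back to the primary asynchronously.
type FailoverStore struct {
	Primary, Backup Store
	Compensate      func(bucket, key string) // e.g. publish to MQ (assumed)
}

func (f *FailoverStore) Get(ctx context.Context, bucket, key string) (io.ReadCloser, error) {
	rc, err := f.Primary.Get(ctx, bucket, key)
	if err == nil {
		return rc, nil
	}
	rc, berr := f.Backup.Get(ctx, bucket, key)
	if berr != nil {
		return nil, errors.Join(err, berr)
	}
	if f.Compensate != nil {
		f.Compensate(bucket, key) // schedule backfill of the primary
	}
	return rc, nil
}
```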
Performance: Benchmarks on six 256 GB SATA servers show write throughput of 3,000–4,000 TPS for 1 MB objects, and mixed read/write/delete workloads sustain over 3,500 QPS, meeting our latency targets of under 3 ms for both reads and writes.
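For readers who want to run a comparable measurement, below is a rough sketch of a concurrent 1 MB-object write benchmark using the AWS SDK for Go v1 against an S3-compatible endpoint. This is not the tool behind the numbers above; the endpoint, bucket name, and worker counts are placeholders, and credentials are taken from the usual AWS environment variables.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
	"sync/atomic"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func main() {
	sess := session.Must(session.NewSession(&aws.Config{
		Endpoint:         aws.String("http://oss-proxy.internal:9000"), // placeholder
		Region:           aws.String("us-east-1"),
		S3ForcePathStyle: aws.Bool(true), // path-style addressing for private S3 endpoints
	}))
	client := s3.New(sess)

	payload := bytes.Repeat([]byte("x"), 1<<20) // 1 MB object
	const workers, perWorker = 64, 100
	var ok int64
	start := time.Now()

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			for i := 0; i < perWorker; i++ {
				_, err := client.PutObject(&s3.PutObjectInput{
					Bucket: aws.String("bench"),
					Key:    aws.String(fmt.Sprintf("w%d/obj%d", w, i)),
					Body:   bytes.NewReader(payload),
				})
				if err == nil {
					atomic.AddInt64(&ok, 1)
				}
			}
		}(w)
	}
	wg.Wait()
	fmt.Printf("wrote %d objects in %s (%.0f TPS)\n",
		ok, time.Since(start), float64(ok)/time.Since(start).Seconds())
}
```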
Cost Optimization: By tiering storage (NVMe SSD → SATA SSD → HDD) and using EC for cold data, we significantly reduce storage expenses while maintaining the required performance.
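The exact tiering rules are driven by access statistics and cost models; the sketch below only illustrates the shape of an age-based placement policy, with the coldest tier being the natural candidate for erasure coding. The thresholds are invented for the example.

```go
package tiering

import "time"

// Tier is the storage class an object (or volume) should live on.
type Tier int

const (
	NVMe Tier = iota // hot: recent, latency-sensitive
	SATA             // warm
	HDD              // cold: EC-encoded, capacity-optimized
)

// TierFor is an illustrative policy: the age cutoffs here are made up for
// the sketch; real cutoffs would come from measured access patterns.
func TierFor(lastAccess time.Time) Tier {
	age := time.Since(lastAccess)
	switch {
	case age < 7*24*time.Hour:
		return NVMe
	case age < 90*24*time.Hour:
		return SATA
	default:
		return HDD // candidate for erasure coding
	}
}
```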
Seamless Migration: The proxy layer abstracts all S3-compatible APIs, allowing transparent migration from Ceph S3 or public-cloud S3 to the new OSS. Data is gradually backfilled via MQ-driven compensation, ensuring no service interruption.
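A minimal sketch of the read-through backfill idea, assuming the same illustrative Store interface as in the failover sketch. Unlike this inline version, the production path enqueues the copy on MQ and streams data rather than buffering whole objects in memory.

```go
package migrate

import (
	"bytes"
	"context"
	"io"
)

// Store is a minimal S3-like surface (same shape as in the failover sketch).
type Store interface {
	Get(ctx context.Context, bucket, key string) (io.ReadCloser, error)
	Put(ctx context.Context, bucket, key string, body io.Reader) error
}

// ReadThrough serves reads from the new OSS and lazily backfills anything
// that only exists in the legacy cluster, so clients never see the cutover.
type ReadThrough struct {
	New, Legacy Store
}

func (m *ReadThrough) Get(ctx context.Context, bucket, key string) (io.ReadCloser, error) {
	if rc, err := m.New.Get(ctx, bucket, key); err == nil {
		return rc, nil
	}
	rc, err := m.Legacy.Get(ctx, bucket, key)
	if err != nil {
		return nil, err
	}
	defer rc.Close()
	data, err := io.ReadAll(rc)
	if err != nil {
		return nil, err
	}
	// Backfill the new cluster, then serve the object. A production version
	// would publish a compensation message to MQ instead of writing inline.
	if err := m.New.Put(ctx, bucket, key, bytes.NewReader(data)); err != nil {
		return nil, err
	}
	return io.NopCloser(bytes.NewReader(data)), nil
}
```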
Deployment Benefits: With only two engineers, the project was delivered in three months and now serves ~20 million objects (≈60 TB). Compared with the previous public-cloud S3 setup, response latency dropped from 150 ms to 3 ms, stability improved, and costs decreased.
Operational Tips: Tune volumeGrowthCount so that every volume server stays writable, configure the bucket-to-collection mapping carefully, and use filer.sync for dual-cluster backup when no proxy sits in front of the clusters.
Future Outlook: Plans include regular incremental backups, continued open-source contributions, S3-based storage-compute separation for analytics workloads, and building a distributed file system on top of the OSS for high-IO scenarios.
Tongcheng Travel Technology Center
