How a Concurrent Append‑Only Architecture Doubles Storage Performance
This article examines the design and implementation of a proprietary concurrent append‑only distributed object storage system, covering its unified persistence layer, heavyweight client, hardware‑level optimizations, slim metadata, flexible redundancy, high availability, and the real‑world performance gains it delivers across big‑data, AI, and log‑archiving workloads.
Introduction
In the wave of enterprise digital transformation, commercial object storage systems struggle to meet dynamic workload demands, exposing performance and architectural limitations in scenarios such as log archiving, AI training data lakes, and big‑data compute‑storage separation.
The "Concurrent Append‑Only" architecture developed by the team overcomes closed‑source constraints, enabling distributed concurrent read/write I/O with dual‑mode multi‑replica and erasure‑coding (EC) redundancy.
With horizontal scalability for exabyte‑scale unstructured data, the system achieves a 2× write throughput increase, high availability, and 50% cost reduction, fully supporting log archiving with a 100% performance boost.
Architecture Analysis
1. Comparison of Mainstream Open‑Source Object Storage
2. Architecture Details
2.1 Design Philosophy
Standardized Interface for Seamless Migration – Supports the AWS S3 API with a tenant/bucket/object model, so existing S3 applications only need to repoint their endpoint (see the sketch after this list).
Unified Append‑Only Persistence Layer – Provides a common data persistence layer for object, file, and block protocols, using multi‑replica and EC redundancy while allowing only append‑only writes to reduce write amplification and simplify consistency.
Heavyweight Client – Handles multi‑replica and EC redundancy logic, performs concurrent fan‑out writes, caches hot metadata, and reduces I/O latency.
Hardware Performance Maximization – Utilizes SPDK in user space to manage raw NVMe disks, eliminating kernel‑space transitions.
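Because the gateway speaks the standard S3 API, an existing application migrates by pointing its SDK at the OGW endpoint. The minimal sketch below uses the AWS SDK for Go v2 against a hypothetical endpoint; the endpoint, bucket, and key are placeholders for illustration, not values from the system, and credentials are assumed to come from the environment as usual.

```go
package main

import (
	"bytes"
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

func main() {
	// Hypothetical OGW endpoint; an S3 application only needs to be
	// repointed here, with no application-code changes.
	const endpoint = "http://ogw.internal:9000"

	// Credentials and region are resolved from the environment as usual.
	cfg, err := config.LoadDefaultConfig(context.TODO(),
		config.WithRegion("us-east-1"))
	if err != nil {
		log.Fatal(err)
	}

	client := s3.NewFromConfig(cfg, func(o *s3.Options) {
		o.BaseEndpoint = aws.String(endpoint)
		o.UsePathStyle = true // object gateways are commonly addressed path-style
	})

	// Standard bucket/object semantics; tenant isolation is enforced by the
	// gateway based on the request credentials.
	_, err = client.PutObject(context.TODO(), &s3.PutObjectInput{
		Bucket: aws.String("training-data"),
		Key:    aws.String("datasets/sample.parquet"),
		Body:   bytes.NewReader([]byte("example payload")),
	})
	if err != nil {
		log.Fatal(err)
	}
	log.Println("object written through the S3-compatible gateway")
}
```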
2.2 Key Features
LogStream – A generic append‑only log stream exposing write permissions to clients; supports multi‑replica and EC modes, with Extents as storage allocation units distributed across DataNodes.
Concurrent Consistency Access Model – Write permissions are assigned directly to clients, avoiding write conflicts and eliminating frequent distributed lock coordination.
Concurrent Append‑Write and Random Read I/O – Multiple LogStreams can be opened per client; writable Extents are rotated to new ones upon seal, with MetaNode assigning new Extents based on weighted random selection.
Extremely Slim Metadata – Only maps LogStream/Extent to DataNode devices, allowing on‑node caching and avoiding namespace bottlenecks.
Client‑Side Fan‑Out Write – Writes directly to all replicas in parallel, bypassing master‑slave replication chains to lower latency (see the sketch after this list).
Flexible Redundancy with Strong Consistency – Both multi‑replica and EC modes enforce strong consistency; writes succeed only when all replicas acknowledge.
High Availability and Elastic Expansion – Stateless object gateway (OGW) and Raft‑based MetaNode cluster enable auto‑failover and scaling.
Minimized Data Migration – Balanced data redistribution during node expansion and optimized EC repair reduce migration bandwidth.
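The Go sketch below illustrates how a heavyweight client might implement the fan‑out append described above: the same slice is written to every replica concurrently, and the write counts as committed only when all replicas acknowledge, matching the all‑replica‑ack consistency rule. The Replica interface and function names are illustrative assumptions, not the system's actual SDK.

```go
// Hypothetical sketch of client-side fan-out append.
package client

import (
	"context"
	"fmt"

	"golang.org/x/sync/errgroup"
)

// Replica abstracts one DataNode extent replica; Append is assumed to be an
// append at a client-tracked offset within the writable Extent.
type Replica interface {
	Append(ctx context.Context, offset int64, data []byte) error
}

// fanOutAppend writes the same data to every replica in parallel and
// succeeds only when all replicas acknowledge.
func fanOutAppend(ctx context.Context, replicas []Replica, offset int64, data []byte) error {
	g, ctx := errgroup.WithContext(ctx)
	for _, r := range replicas {
		r := r // capture loop variable (needed on Go < 1.22)
		g.Go(func() error {
			return r.Append(ctx, offset, data)
		})
	}
	// Wait returns the first error; any unacknowledged replica fails the write,
	// after which the client can retry or seal the Extent and rotate to a new one.
	if err := g.Wait(); err != nil {
		return fmt.Errorf("fan-out append failed: %w", err)
	}
	return nil
}
```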
2.3 Design Summary
2.3.1 Hardware Design
Compatible with generic x86/ARM servers, using standard NVMe SSDs, SATA HDDs, and Ethernet/InfiniBand NICs.
2.3.2 Software Design
The system comprises an Object Gateway (OGW), an object metadata database (TiDB), a distributed storage cluster (Cluster), and a garbage collection service (GC). The Cluster includes MetaNode, DataNode, and Client components.
Object Gateway (OGW) – Provides S3‑compatible APIs for bucket/object lifecycle management and access control.
Object Metadata Database (TiDB) – Stores metadata for tenants, buckets, and objects.
Cluster –
MetaNode – Raft‑based metadata service handling LogStream lifecycle, node health monitoring, metadata synchronization (RocksDB), and automatic failover.
DataNode – Uses SPDK BlobStore for raw block allocation; supports Extent Create/Append/Read/Seal/Delete and replication.
Client – Exposes LogStream Create/Open/Append/Close APIs, implements fan‑out append‑write, caches metadata, and reports heartbeats (see the sketch after this list).
GC – Reclaims space from deleted or failed uploads.
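A rough illustration of the Client surface listed above (Create/Open/Append/Close, plus the random reads described in 2.2) might look like the Go interfaces below. The names and signatures are assumptions for explanation only, not the system's real API.

```go
// Illustrative shape of the LogStream client API; names and signatures are assumed.
package client

import "context"

// StreamID identifies a LogStream managed by the MetaNode cluster.
type StreamID string

// LogStream is an append-only stream: writes go to the currently writable
// Extent, and the client rotates to a new Extent when the current one seals.
type LogStream interface {
	// Append writes data at the tail and returns the logical offset at which
	// it was persisted once all replicas (or EC stripes) acknowledged.
	Append(ctx context.Context, data []byte) (offset int64, err error)
	// Read serves random reads anywhere in the stream.
	Read(ctx context.Context, offset int64, length int) ([]byte, error)
	// Close releases the writable Extent and flushes cached metadata.
	Close(ctx context.Context) error
}

// Client owns the redundancy logic (multi-replica or EC), fan-out writes,
// and the hot metadata cache.
type Client interface {
	Create(ctx context.Context, replicas int) (StreamID, error)
	Open(ctx context.Context, id StreamID) (LogStream, error)
}
```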
2.4 Data Layout
Objects are split into slices that are appended, log‑style, to multiple writable Extents on different DataNodes, balancing load and capacity across the cluster; a layout sketch follows.
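The sketch below illustrates this layout under simplified assumptions: an object is cut into fixed‑size slices, a writable Extent is picked by weighted random selection (weighted here by free capacity; in the real system the MetaNode performs this assignment, per 2.2), and each slice is recorded as an (extent, offset, length) reference. All names and the slice size are hypothetical.

```go
// Hypothetical data-layout sketch: slice an object and record where each
// slice would be appended across writable Extents.
package layout

import "math/rand"

type Extent struct {
	ID      uint64
	FreeGiB float64 // weight used when picking a writable Extent
	tailOff int64   // next append offset, tracked by the owning client
}

type SliceRef struct {
	ExtentID uint64
	Offset   int64
	Length   int
}

// pickExtent performs weighted random selection proportional to free capacity.
func pickExtent(extents []*Extent, rng *rand.Rand) *Extent {
	var total float64
	for _, e := range extents {
		total += e.FreeGiB
	}
	r := rng.Float64() * total
	for _, e := range extents {
		if r < e.FreeGiB {
			return e
		}
		r -= e.FreeGiB
	}
	return extents[len(extents)-1]
}

// layoutObject splits data into sliceSize chunks and records where each chunk
// would land; the actual write is the fan-out append sketched earlier.
func layoutObject(data []byte, sliceSize int, extents []*Extent, rng *rand.Rand) []SliceRef {
	var refs []SliceRef
	for off := 0; off < len(data); off += sliceSize {
		end := off + sliceSize
		if end > len(data) {
			end = len(data)
		}
		e := pickExtent(extents, rng)
		refs = append(refs, SliceRef{ExtentID: e.ID, Offset: e.tailOff, Length: end - off})
		e.tailOff += int64(end - off)
	}
	return refs
}
```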
Applicable Scenarios
Big‑Data Compute‑Storage Separation – Supports StarRocks, Hive, Flink with high‑throughput storage.
Large‑Model Storage – Stores AI training datasets and models, leveraging flat namespace and unlimited scalability.
Low‑Cost Log Archiving – Provides flexible EC policies and S3 compatibility, saving commercial licensing costs for hundreds of PB‑years of logs.
Massive Unstructured Data – Serves financial files, images, audio/video with high concurrency and no extra gateway overhead.
Implementation Results
1. StarRocks Resource Utilization
The proprietary storage decouples compute and storage, allowing independent scaling, doubling CPU utilization, cutting compute cost by 50%, and reducing storage cost by over 60%.
2. Log Archiving Cost Reduction
Flexible EC and S3 compatibility enable seamless log ingestion and archiving, saving commercial licensing fees for hundreds of PB of logs annually.
Future Plans
Unified Multi‑Protocol Storage Platform
Protocol Layer Expansion – Add POSIX, NFS, and block protocols on top of the existing S3/Swift interfaces.
Unified Access Gateway – Provide a single gateway supporting NFS/POSIX, HDFS, and S3/Swift for transparent multi‑protocol data access.
Global Namespace & Intelligent Tiering – Create a cross‑cluster logical namespace with smart caching and prefetching to eliminate data silos.
Extreme Performance & Cost Optimization
Introduce RDMA and DPDK for zero‑copy, ultra‑low‑latency I/O.
Implement automated hot‑cold tiering, cross‑cluster replication, compression, and deduplication to further lower storage costs.
Overall, the self‑developed distributed storage system serves as a cost‑effective, high‑performance foundation that drives AI and data‑intelligence initiatives while supporting the company’s digital transformation.