Consistency Models in Distributed Storage: Cosmos DB, Cassandra, OceanBase
This article explains the fundamentals of consistency in distributed storage systems, contrasts it with database transaction consistency, and details the various consistency levels offered by Azure Cosmos DB, Apache Cassandra, and OceanBase, highlighting their guarantees, configurations, and the performance‑availability trade‑offs involved.
Distributed Storage System Model
In distributed storage systems (including distributed databases like OceanBase), the term consistency is often used, but its meaning varies across systems and people, leading to confusion.
Consider a simple storage system with a single client (single process) and a single server (single‑process service). The client issues read/write operations sequentially, and the server processes each request in order, so later operations always see the results of earlier ones.
When the system becomes more complex with a single service process (single replica) but multiple concurrent clients, operations can interfere: a client may read data written by another client. This is similar to multi‑threaded programs sharing memory.
Adding another service process on a different machine to store a copy of the data creates multiple replicas . A client may send each operation to either replica. If replica synchronization is delayed, a read may miss a recent write, or a later read may miss an earlier write.
In real systems, many clients concurrently read and write multiple replicas. If their operations target the same data item (e.g., the same file range or the same database row), they affect each other. For example, should client B see the latest write from client A immediately?
Further, each service process may manage only a subset of the total data (partitioned tables). A write that touches multiple partitions may be applied to different replicas, leading to complex read/write anomalies.
From the above, a generic distributed storage system has the following characteristics:
Data is split into multiple shards stored on many service nodes.
Each shard has multiple replicas on different nodes.
Many clients concurrently perform read/write operations, each taking variable time.
Unless otherwise noted, operations are atomic.
Difference from Database Transaction Consistency
The consistency in the ACID definition of a database transaction is different. In ACID, consistency means that a transaction preserves global constraints (e.g., uniqueness, foreign keys). It has nothing to do with whether the data is replicated.
In distributed storage, consistency refers to the agreement among multiple replicas. Transaction isolation, which limits the visibility of concurrent operations, is also unrelated to replica consistency.
Client‑View Consistency Model
To keep replicas consistent, a system must ensure:
All replicas eventually receive every write operation (no matter how long it takes).
All replicas execute writes in a deterministic order.
From the client’s perspective, four guarantees are defined:
Monotonic reads : once a client has read version n, any later read must return version ≥ n.
Read your writes : after a client writes version n, subsequent reads must see at least version n.
Monotonic writes : a client’s writes are observed by all replicas in the same order they were issued.
Read‑after‑write : after reading version n, any subsequent write must be performed on a replica that is at least at version n.
Different consistency levels offered by a system provide some subset of these guarantees, trading off latency, availability, and scalability.
Cosmos DB Consistency Levels
Azure Cosmos DB, a globally distributed NoSQL service, offers five configurable consistency levels (from strongest to weakest):
Strong consistency : reads always return the latest version (linearizable). Writes must be committed to a majority of replicas before succeeding; reads require majority acknowledgments.
Bounded staleness : reads are at most K versions behind the latest write, using a sliding window to guarantee global order.
Session consistency : within a session, guarantees monotonic reads, monotonic writes, and read‑your‑writes; no guarantees across sessions.
Prefix consistency : if no new writes occur, all replicas eventually converge; reads never see out‑of‑order writes (e.g., they may see A, A‑B, or A‑B‑C but never A‑C).
Eventual consistency : the weakest guarantee; replicas eventually converge, but reads may return older data.
Approximately 73% of Cosmos DB users choose session consistency, while 20% choose bounded staleness.
Cassandra Consistency Levels
Cassandra uses a quorum‑based protocol. By configuring the number of replicas that must acknowledge reads and writes, different consistency levels are achieved. Writes are performed via a PUT (upsert) operation, which is idempotent and has overwrite semantics.
Write configurations
ALL: write must be persisted to all replicas.
EACH_QUORUM: write must reach a quorum in each data‑center.
QUORUM: write must reach a majority of replicas.
LOCAL_QUORUM: write must reach a majority within the local data‑center.
ONE: write must reach at least one replica.
TWO: write must reach at least two replicas.
THREE: write must reach at least three replicas.
LOCAL_ONE: write must reach at least one replica in the local data‑center.
Read configurations
ALL: read returns only after all replicas respond.
QUORUM: read returns after a majority respond.
LOCAL_QUORUM: read returns after a majority in the local data‑center respond.
ONE: read returns after the nearest replica responds (may be stale).
TWO, THREE, LOCAL_ONE: similar to the write variants, requiring two, three, or one local replica responses.
When the sum of write and read quorum sizes exceeds the total replica count, Cassandra provides strong consistency; otherwise it provides eventual consistency.
OceanBase Consistency Levels
OceanBase uses a Multi‑Paxos consensus algorithm. Transactions are committed only after a majority of replicas acknowledge, and a leader replica executes all reads and writes to provide strong consistency. If the leader fails, another majority replica becomes the new leader, ensuring no data loss.
OceanBase offers four consistency levels:
Strong consistency : reads and writes go to the leader; a majority of replicas must confirm writes.
Bounded staleness : reads may be served by any replica as long as the data is no older than a configured time window (e.g., 30 seconds).
Prefix consistency : within a client session, reads are monotonic, guaranteeing that a client never sees out‑of‑order writes.
Eventual consistency : reads may be served by any replica without waiting for log replay; data may be stale but availability and latency are maximized.
Weak consistency levels relax read semantics while still requiring writes to go through the leader, thus preserving monotonic writes and read‑after‑write guarantees, but not read‑your‑writes.
References
Consistency model – Wikipedia: https://en.wikipedia.org/wiki/Consistency_model
Tunable data consistency levels in Azure Cosmos DB: https://docs.microsoft.com/en-us/azure/cosmos-db/consistency-levels
Configuring data consistency in Apache Cassandra: https://docs.datastax.com/en/archived/cassandra/2.0/cassandra/dml/dml_config_consistency_c.html
Cassandra lightweight transactions documentation.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
