Inside Alibaba's Doris KV Store: Architecture, Routing & Failover Secrets

This article examines Alibaba's internal Doris KV storage system, detailing why large companies build proprietary data products, the project's kickoff criteria, the two‑layer architecture, virtual‑node routing, failover mechanisms, and cluster scaling strategies for massive KV workloads.

JavaEdge

Feb 23, 2024

1. How Large Companies View Internal Technical Products

Internet giants like Alibaba run massive e-commerce platforms that generate huge user traffic and data volumes. They could adopt open-source solutions, but they often build proprietary infrastructure such as distributed caches, message queues, service frameworks, and databases in order to fit specific business scenarios, strengthen engineering capabilities, and create competitive barriers.

However, resources are limited and must be balanced between core business development and foundational technology projects. Engineers seeking technical growth prefer challenging foundational projects, creating a negotiation between engineers and management: engineers must demonstrate business value, innovation, and low risk to obtain support.

2. Project Kickoff – Key Technical Metrics

Current Situation

Numerous business services require massive KV storage and access.

Existing solutions (e.g., UDAS, multiple KV implementations) suffer from scaling difficulty, low write performance, and high operational cost.

Complex KV engines (e.g., Aspara) are hard to use and have poor performance.

These problems lead to development and operational difficulties, prompting the need for a unified, maintainable, and scalable KV system.

Product Positioning

The goal is a transparent, distributed KV storage engine that replaces fragmented solutions, supports Alibaba’s international and domestic sites, and delivers high availability, linear scalability, and low latency.

3. Overall Architecture

Two-layer design: the Client layer, and the DataServer + Store layer.

Four core components: Client, DataServer, Store, Administration.

Doris focuses on the distributed routing and cluster management layers, delegating actual data persistence to Berkeley DB.

4. Main Access Model

KV Client contacts Administration to obtain cluster topology and routing algorithm.

Client hashes the key to compute target servers.

Client sends data via a custom protocol to the chosen DataServer, which writes to the local Berkeley DB.
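
The three steps above can be sketched as follows. This is only an illustration: the class names are hypothetical, a Python dict stands in for the local Berkeley DB, direct method calls stand in for the custom protocol, and a plain modulo over the server list stands in for the real routing algorithm described in the next section.

```python
import hashlib


class DataServer:
    """Hypothetical stand-in for a DataServer; a dict replaces Berkeley DB."""

    def __init__(self, name):
        self.name = name
        self.store = {}

    def put(self, key, value):
        self.store[key] = value

    def get(self, key):
        return self.store.get(key)


class Administration:
    """Hypothetical stand-in for the component that hands out the topology."""

    def __init__(self, servers):
        self.servers = servers

    def fetch_topology(self):
        return list(self.servers)


class KVClient:
    def __init__(self, admin):
        # Step 1: obtain the cluster topology from Administration.
        self.servers = admin.fetch_topology()

    def _route(self, key):
        # Step 2: hash the key to pick a target server.
        # (Placeholder routing, not the virtual-node algorithm.)
        digest = hashlib.md5(key.encode("utf-8")).digest()
        return self.servers[int.from_bytes(digest[:8], "big") % len(self.servers)]

    def put(self, key, value):
        # Step 3: send the data to the chosen DataServer, which stores it locally.
        self._route(key).put(key, value)

    def get(self, key):
        return self._route(key).get(key)
```

A client built this way never talks to Administration on the data path; it only needs the topology up front, which is what keeps routing cheap.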

5. Partition Routing Algorithm

Doris uses a virtual‑node based routing algorithm that improves balance and reduces data movement during scaling.

Balance: even data distribution.

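Fluctuation: when X servers are added to a cluster of M servers, only about X/(M+X) of the data has to move, compared with the X/M quoted for traditional consistent hashing.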
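
To make the two fluctuation figures concrete, here is a small worked calculation. It assumes M is the number of servers before expansion and X the number of servers added, which the text does not spell out; the function names are made up for this example, and X/M is simply the figure quoted above, not something derived here.

```python
from fractions import Fraction


def fluctuation_virtual_nodes(m, x):
    """Fraction of data moved with virtual-node routing: X / (M + X)."""
    return Fraction(x, m + x)


def fluctuation_quoted_for_consistent_hashing(m, x):
    """Fraction quoted in the text for traditional consistent hashing: X / M."""
    return Fraction(x, m)


# Growing a 4-server cluster by 1 server:
print(fluctuation_virtual_nodes(4, 1))                  # 1/5
print(fluctuation_quoted_for_consistent_hashing(4, 1))  # 1/4
```

Under these assumptions the gap is widest for small clusters and shrinks as M grows relative to X.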

5.1 Example

Two physical nodes vs. three physical nodes illustrate how virtual nodes map to servers.

The virtual‑node index is calculated as:

virtual_node_index = hash(md5(key)) mod virtual_node_count

During cluster expansion, the virtual‑node count remains constant; only the mapping between virtual nodes and physical servers is updated, minimizing data migration.
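
A minimal sketch of this lookup follows. It treats the MD5 digest itself as the integer hash, which is one possible reading of hash(md5(key)); the virtual-node count of 12, the server names, and the round-robin table builder are illustrative assumptions rather than Doris's actual values.

```python
import hashlib

VIRTUAL_NODE_COUNT = 12  # illustrative; stays fixed when servers are added


def virtual_node_index(key, virtual_node_count=VIRTUAL_NODE_COUNT):
    """Map a key to a virtual node: integer value of md5(key) mod the count."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % virtual_node_count


def build_table(virtual_node_count, servers):
    """Assign virtual nodes to servers round-robin (for illustration only)."""
    return {vn: servers[vn % len(servers)] for vn in range(virtual_node_count)}


def route(key, table):
    """Look up the physical server that currently owns the key's virtual node."""
    return table[virtual_node_index(key, len(table))]


two_servers = build_table(VIRTUAL_NODE_COUNT, ["pn1", "pn2"])
three_servers = build_table(VIRTUAL_NODE_COUNT, ["pn1", "pn2", "pn3"])
```

The key's virtual-node index is the same under both tables; only the table lookup changes. A real expansion would update the existing table in place so that as few virtual nodes as possible change owner, rather than rebuilding it round-robin as done here.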

6. Failover Strategy

Doris targets 99.997% availability through data redundancy and failover mechanisms.
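
As a quick sanity check on what that target means in practice, the calculation below converts an availability percentage into a yearly downtime budget. It assumes a 365-day year; the function name is made up for this example.

```python
from decimal import Decimal

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a 365-day year


def downtime_minutes_per_year(availability_percent):
    """Allowed downtime per year for an availability given as a string, e.g. '99.997'."""
    unavailable = (Decimal(100) - Decimal(availability_percent)) / Decimal(100)
    return unavailable * MINUTES_PER_YEAR


print(downtime_minutes_per_year("99.997"))  # about 15.8 minutes per year
```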

6.1 Redundant Storage

Dual-write (W=2, R=1): every write goes to two nodes, one in each group, and a read needs only one copy.

Copy‑on‑write and update logs ensure consistency.

Data is written to both groups concurrently, providing multiple copies.
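
The W=2, R=1 rule can be sketched like this. The Node class and function names are hypothetical, and the writes are issued one after another for simplicity even though the text describes them as concurrent.

```python
class Node:
    """Hypothetical replica; raises ConnectionError while it is down."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.up = True

    def put(self, key, value):
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        self.data[key] = value

    def get(self, key):
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        return self.data.get(key)


def dual_write(replicas, key, value, w=2):
    """Write to every replica; report success only if at least w acknowledge."""
    acks = 0
    for node in replicas:
        try:
            node.put(key, value)
            acks += 1
        except ConnectionError:
            pass  # a failed replica simply does not count as an ack
    return acks >= w


def read_one(replicas, key):
    """R=1: return the value from the first replica that answers."""
    for node in replicas:
        try:
            return node.get(key)
        except ConnectionError:
            continue
    raise ConnectionError("no replica reachable")
```

With R=1 a read survives the loss of either copy, which is where the availability benefit comes from; the cost is that a write counts as fully successful only when both copies acknowledge it.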

6.2 Failure Scenarios

Transient failure – short‑term unavailability.

Temporary failure – e.g., server upgrade, network glitch, with recovery within hours.

Permanent failure – server goes offline.

Different strategies are applied based on the failure type, ranging from retry and temporary logging to adding new servers for permanent loss.
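
Two of these strategies, retrying on a transient failure and logging missed writes during a temporary failure, might look roughly like the sketch below. The article does not describe where the log lives or how it is replayed, so the TemporaryFailureLog class, the retry count, and the replay order are all assumptions made for illustration.

```python
def write_with_retry(send, key, value, attempts=3):
    """Transient failure: retry the same write a few times before giving up."""
    for _ in range(attempts):
        try:
            send(key, value)
            return True
        except ConnectionError:
            continue
    return False


class TemporaryFailureLog:
    """Hypothetical log of the writes a temporarily failed server missed."""

    def __init__(self):
        self.entries = []  # (key, value) pairs in arrival order

    def record(self, key, value):
        self.entries.append((key, value))

    def replay_into(self, store):
        """Apply the missed writes in order once the server is back."""
        for key, value in self.entries:
            store[key] = value
        replayed = len(self.entries)
        self.entries.clear()
        return replayed
```

Replaying in arrival order means the last write to a key wins, matching what the server would have seen had it stayed up. A permanent failure is not covered here, since it is handled by adding a replacement server.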

6.3 Health Check & Configuration Pull

ConfigServer sends heartbeat checks to each DataServer.

Clients report failures.

Clients periodically pull configuration.

If a server becomes unreachable, the control center marks it as temporarily failed and notifies all clients.
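
The interplay of heartbeats, client reports, and configuration pulls can be sketched as below. The class names, the status strings, and the version counter are hypothetical, and the sketch shows only the pull path; how the control center pushes notifications to clients is not described in the article.

```python
class ConfigServer:
    """Hypothetical control center tracking the state of each DataServer."""

    def __init__(self, servers):
        self.status = {name: "ok" for name in servers}
        self.version = 1  # bumped on every state change

    def _mark(self, name, state):
        if self.status[name] != state:
            self.status[name] = state
            self.version += 1

    def heartbeat(self, name, probe):
        """probe(name) returns True if the server answered the heartbeat."""
        self._mark(name, "ok" if probe(name) else "temp_failed")

    def report_failure(self, name):
        """Called by a client that failed to reach the server."""
        self._mark(name, "temp_failed")

    def snapshot(self):
        return self.version, dict(self.status)


class Client:
    def __init__(self, config_server):
        self.config_server = config_server
        self.version, self.status = config_server.snapshot()

    def pull_config(self):
        """Periodic pull; returns True if the configuration changed."""
        version, status = self.config_server.snapshot()
        changed = version != self.version
        self.version, self.status = version, status
        return changed

    def usable_servers(self):
        return sorted(n for n, s in self.status.items() if s == "ok")
```

Until its next pull, a client keeps routing with a stale view, which is why clients also report failures they observe directly.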

7. Cluster Scaling Design

Elastic scaling is essential for handling increased load. Adding a new node requires migrating only the data belonging to the newly assigned virtual nodes.

7.1 Data Migration During Expansion

Old routing: Route1(key) = {pn1, pn2}.

New routing after adding node X: Route2(key) = {pn1, pnx}.

Only the data on pn2 needs to move to pnx.
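
The same idea at the level of the virtual-node table can be sketched as follows: hand the new server its fair share of virtual nodes, taking them from the most loaded servers, then diff the old and new tables to get the migration plan. The balancing rule, the function names, and the 12-virtual-node table are assumptions for illustration; the article does not say how Doris picks which virtual nodes to reassign.

```python
from collections import Counter


def add_server(table, new_server):
    """Return a new vnode -> server table that includes new_server,
    reassigning only as many virtual nodes as the new server needs."""
    new_table = dict(table)
    server_count = len(set(table.values()) | {new_server})
    target = len(table) // server_count  # fair share for the new server
    load = Counter(table.values())
    for _ in range(target):
        # Take from the currently most loaded server (ties: alphabetical).
        donor = max(sorted(load), key=lambda s: load[s])
        vnode = max(v for v, s in new_table.items() if s == donor)
        new_table[vnode] = new_server
        load[donor] -= 1
    return new_table


def migration_plan(old_table, new_table):
    """Virtual nodes whose owner changed, as vnode -> (source, destination)."""
    return {
        vn: (old_table[vn], new_table[vn])
        for vn in old_table
        if old_table[vn] != new_table[vn]
    }


old = {vn: ["pn1", "pn2"][vn % 2] for vn in range(12)}
new = add_server(old, "pnx")
plan = migration_plan(old, new)
```

In this example 4 of the 12 virtual nodes move, all of them to pnx, and nothing moves between the existing servers.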

7.2 Expansion Procedure

Add a physical server to a group and start the Doris process.

Temporarily mark the entire group as failed.

Recompute virtual‑node distribution and copy the affected files to the new server.

Restore the group, allowing normal reads/writes to resume.
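
The procedure can be sketched as a small orchestration function. Group, rebalance, and copy_vnode are hypothetical names: the caller supplies the function that recomputes the table and the function that copies a virtual node's files, since the article does not describe either in detail. Presumably the other group keeps serving requests while this one is marked as failed.

```python
class Group:
    """Hypothetical group of servers sharing one vnode -> server table."""

    def __init__(self, name, table):
        self.name = name
        self.table = table
        self.available = True


def expand_group(group, new_server, rebalance, copy_vnode):
    """Run steps 2 to 4 for a server that has already been started (step 1)."""
    group.available = False  # step 2: mark the whole group as failed
    try:
        # Step 3: recompute the distribution and copy the affected data.
        new_table = rebalance(group.table, new_server)
        for vnode, old_owner in group.table.items():
            if new_table[vnode] != old_owner:
                copy_vnode(vnode, old_owner, new_table[vnode])
        group.table = new_table
    finally:
        # Step 4: restore the group; on error the old table is still in place.
        group.available = True
    return group.table
```

Swapping in the new table only after all copies finish means a failed copy leaves the group on its previous, still consistent routing.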

Doris took about six months to develop. It has since powered multiple Alibaba services for years with minimal operational incidents, scaling to hundreds of nodes.

8. Patent Considerations

Large enterprises often encourage engineers to file patents to strengthen their IP portfolio and protect against “patent trolls.” Companies typically provide legal support and modest financial incentives for successful filings.

9. Conclusion

Designing a distributed KV storage system like Doris means addressing data consistency, high availability, and seamless scaling, challenges that surpass those of typical distributed systems. Engaging in such technically demanding projects not only yields valuable products for the company but also accelerates personal skill growth and career prospects.
