
Chaos Engineering Practices for Bilibili Distributed KV Storage

Peng Liangyou describes how Bilibili's large-scale distributed KV storage adopts Netflix-style chaos engineering: defining steady-state hypotheses, replicating production environments, injecting CPU, memory, network, and replica faults via automated "monkey" experiments, and monitoring latency and durability with Prometheus/Grafana. Over 1.5 years the practice has prevented critical incidents while cutting testing costs and enabling incremental, standards-based reliability improvements.

Bilibili Tech

This article, authored by Peng Liangyou, a senior test development engineer at Bilibili, introduces the reliability assurance and chaos engineering implementation for Bilibili's large‑scale distributed KV storage system.

Background

The author references a previous article on Bilibili's distributed KV storage and explains that the system must meet high reliability, availability, performance, and scalability requirements. Traditional reliability testing relies on exhaustive test cases, but it often misses rare or multi‑factor fault scenarios.

Challenges of Large Distributed Storage

Designing, developing, and testing large distributed systems is difficult due to unexpected node failures, asynchronous interactions, network faults, multi‑core concurrency, and diverse client logic. These factors can cause severe production incidents that are hard to detect and reproduce.

Existing open-source reliability frameworks such as P# and Jepsen require specialized programming languages and have steep learning curves, making them costly for non-commercial teams to adopt.

Significance of Chaos Engineering

Chaos engineering, derived from chaos theory, provides a systematic way to inject faults into a distributed system to evaluate its overall resilience. It is especially suitable for large distributed environments where component dependencies are complex and unpredictable.

The Netflix chaos engineering principles (four steps) are outlined:

Define a stable state using measurable system behavior.

Assume the control and experiment groups remain in that state.

Inject realistic fault events into the experiment group.

Compare the control and experiment groups to assess impact.

Five advanced principles are also discussed, emphasizing business‑level stability metrics, realistic fault scenarios, production‑like environments, continuous automated experiments, and minimizing blast radius.

Chaos Engineering Practice

4.1 Establishing a Steady‑State Hypothesis

The KV storage system evolves continuously; therefore, the definition of a steady state must be regularly updated based on business metrics such as request latency, success rate, error rate, and data durability.

4.2 Real‑User Scenarios

The test environment replicates production hardware, deployment topology, multi‑region clusters, raft groups, and diverse storage engines to mimic real user workloads.

4.3 Designing and Running Continuous Experiments

Using a data‑migration scenario, the author combines functional test cases with randomly selected fault “monkeys” to create chaos experiments that run continuously in a realistic test environment.

Key code snippets used in the experiments:

// Wrap business requests PUT/GET/DEL and continuously check data state
go func() {
    common.PutGetDelLoop(t, true, b.Client, 1000000, 300)
    close(done1)
}()

// Wrap batched PUT/GET/DEL requests and continuously check data state
go func() {
    common.PutGetDelBatch(t, true, b.Client)
    close(done2)
}()

// Wrap the user scenario: migrate data and check task status
resp := common.RebalanceTable(base.RemoteServer, common.REBALANCE_TABLE, Table, "0:50%,9:50")
assert.Contains(t, resp, "OK")
log.Info("Rebalance plan: %s", "0:50%,9:50")

For fault injection, a C++ class Monkeys defines target hosts and tables, and provides methods to inject CPU, memory, and replica faults:

class Monkeys {
public:
    Monkeys() {
        // Define the target nodes
        m_hosts.push_back("172.22.12.25:8000");
        m_hosts.push_back("172.22.12.31:8000");
        m_hosts.push_back("172.22.12.37:8000");
        srand(time(0));
        // Define the tables used as experiment targets
        m_tables.push_back("test_granite");
        m_tables.push_back("test_quartz");
        m_tables.push_back("test_pebble");
        m_tables.push_back("test_marble_k16");
    }
    // CPU monkey: inject CPU-load faults on a random node
    void cpu_monkey() {
        std::string host = m_hosts[rand() % m_hosts.size()];
        cpu_load(host);
        LOG_INFO << "CPU MONKEY:" << host;
    }
    // Memory monkey: inject memory-pressure faults on a random node
    void mem_monkey() {
        std::string host = m_hosts[rand() % m_hosts.size()];
        mem_load(host);
        LOG_INFO << "MEM MONKEY:" << host;
    }
    // Replica monkey: inject replica-loss faults on a random table
    void replica_monkey() {
        std::string table = m_tables[rand() % m_tables.size()];
        drop_replica(table);
        LOG_INFO << "Replica MONKEY:" << table;
    }
    // ... other monkey implementations

private:
    std::vector<std::string> m_hosts;
    std::vector<std::string> m_tables;
};

The ChaosTest class binds these monkey functions into a vector and runs them repeatedly:

class ChaosTest : public ::testing::Test {
protected:
    ChaosTest() {
        m_monkeyVec.push_back(std::bind(&Monkeys::cpu_monkey, &m_monkeys));
        m_monkeyVec.push_back(std::bind(&Monkeys::mem_monkey, &m_monkeys));
        m_monkeyVec.push_back(std::bind(&Monkeys::disk_monkey, &m_monkeys));
        m_monkeyVec.push_back(std::bind(&Monkeys::network_monkey, &m_monkeys));
        m_monkeyVec.push_back(std::bind(&Monkeys::kill_node_monkey, &m_monkeys));
        m_monkeyVec.push_back(std::bind(&Monkeys::stop_node_monkey, &m_monkeys));
        m_monkeyVec.push_back(std::bind(&Monkeys::stop_meta_monkey, &m_monkeys));
        m_monkeyVec.push_back(std::bind(&Monkeys::replica_monkey, &m_monkeys));
        // ...
    }
    // ... test execution logic

    Monkeys m_monkeys;
    std::vector<std::function<void()>> m_monkeyVec;
};

4.4 Recording and Analyzing Results

Metrics are collected before, during, and after experiments using Prometheus and visualized with Grafana. Continuous, unattended experiments help uncover probabilistic issues and drive stability improvements.

5 Results and Benefits

Over 1.5 years of continuous operation, the chaos engineering practice intercepted multiple severe incidents that traditional testing missed, reduced maintenance costs compared to conventional storage testing frameworks, and allowed incremental development while adhering to the open‑closed principle.

6 Standardization and Serviceization

The 2021 CAICT Chaos Engineering Practice Guidelines are referenced for standardizing chaos experiments. The article also notes the need to integrate various open‑source fault‑injection tools (e.g., Chaos Blade, Chaos Mesh) into a company‑specific service platform.

Tags: Chaos Engineering, distributed-storage, Fault Injection, KV Store, Bilibili, reliability testing
Written by Bilibili Tech

Provides introductions and tutorials on Bilibili-related technologies.