Tag

Failure Recovery


政采云技术
Aug 2, 2022 · Fundamentals

Understanding the Chandy‑Lamport Distributed Snapshot Algorithm

This article explains the Chandy‑Lamport algorithm for capturing consistent global snapshots in distributed systems, describes its assumptions and message‑marker rules, walks through a detailed example with three processes and channels, and relates it to Apache Flink's asynchronous checkpoint mechanism.

Apache Flink · Chandy-Lamport · Distributed Systems
0 likes · 14 min read
Bilibili Tech
Mar 11, 2022 · Databases

Design and Architecture of Bilibili's High‑Performance Distributed KV Store

Bilibili’s high‑performance distributed KV store combines hash and range partitioning with Raft‑based multi‑replica consistency. A Metaserver manages the topology of pools, zones, nodes, tables, shards and replicas. Features include partition splitting, binlog streaming, multi‑active replication, bulk loading, KV‑storage separation, and automated load, leader and health balancing.

Bulk Load · Distributed Storage · Failure Recovery
0 likes · 22 min read
NetEase Game Operations Platform
Jan 4, 2020 · Operations

Ceph Storage Failure Recovery: Analysis and Step‑by‑Step Procedures

This article describes a real‑world Ceph storage incident caused by bad disk sectors and analyzes its impact. It presents two practical recovery methods: a full disk copy with dd+nc, and skipping the faulty sector during service start. Detailed commands and post‑recovery steps are included.

Ceph · Failure Recovery · Linux
0 likes · 11 min read