High‑Availability Architecture and Reliability Practices from a Former Google SRE

The article shares a former Google SRE’s insights on building high‑availability systems, explaining key factors such as MTBF and MTTR, redundancy strategies like N+2, change‑management practices, and practical tips for reliability engineering and operations.

Reliability · SRE · System Design

0 likes · 16 min read

Efficient Ops

Jul 27, 2015 · Operations

What Google SREs Do: Inside the Role that Powers Reliable Services

This article explains the responsibilities, requirements, and daily work of Google Site Reliability Engineers, contrasts them with Software Engineers, outlines key internal infrastructure components, and discusses the future direction of operations engineering in the cloud era.

Google · Infrastructure · Operations

0 likes · 11 min read

MaGe Linux Operations

Apr 28, 2015 · Operations

How Yelp Achieved Zero‑Downtime HAProxy Reloads Using Linux qdisc

Yelp’s infrastructure team tackled the packet loss caused by HAProxy reloads by using iptables and the Linux plug qdisc to delay SYN packets while a reload is in progress. This works around the brief socket‑binding window in the kernel and allows zero‑downtime service updates.

HAProxy · Linux qdisc · Network Traffic Control

0 likes · 7 min read
