Tagged articles

403 articles

Mar 1, 2021 · Operations

Google’s Project Health Metrics and Practices for Pre‑Release Code Quality

The article explains how Google measures and maintains software quality before release by dividing responsibilities between product teams and SRE, using monorepo, trunk‑based development, daily release candidates, automated testing, performance monitoring, and a Project Health (pH) metric system that tracks productivity, release velocity, reliability, and quality.

Google · Metrics · Project Health

0 likes · 12 min read

DevOps Coach

Feb 9, 2021 · Operations

Master Elastic Observability: Build a Full‑Stack Monitoring Platform in Half a Day

This workshop guides participants from installing a single‑node Elastic Stack to deploying a cloud‑native observability platform for a multi‑tier pet‑store application, covering health checks, metrics, logs, APM tracing, SLO/SLI setup, and custom dashboards across local, AWS, and Tencent Cloud environments.

Cloud Native · Elastic Stack · Observability

0 likes · 7 min read

21CTO

Feb 3, 2021 · Operations

Bridging Product Development and SRE: How to Ensure Stability Across the Software Lifecycle

This article explains the role of Site Reliability Engineering (SRE) in bridging product and foundational technology development, outlines the software lifecycle, describes how SRE ensures system stability through controllability, observability, and protection, and provides practical best‑practice checklists and maturity levels for evaluating and improving reliability.

Observability · Operations · SRE

0 likes · 13 min read

Liangxu Linux

Jan 24, 2021 · Operations

Master iTerm2 on macOS: Custom Shortcuts, Minimalist Prompt, and Secure Dotfile Management

This guide shows how to configure iTerm2 on macOS with custom Ctrl‑arrow shortcuts, set up a minimalist Bash prompt that displays Kubernetes context via kubectx, and manage secure dotfiles using Makefiles, Ansible Vault, and a Docker‑based decryption container for reproducible environments.

Ansible Vault · SRE · dotfiles

0 likes · 5 min read

Efficient Ops

Jan 19, 2021 · Operations

How SRE Bridges Development and Operations to Boost System Reliability

This article explores the role of Site Reliability Engineering (SRE) as a bridge between product development and operations, detailing its responsibilities, core principles, lifecycle perspective, stability value, and practical frameworks for controllability, observability, and best‑practice implementation to enhance system reliability.

Observability · SRE · reliability engineering

0 likes · 13 min read

Continuous Delivery 2.0

Jan 19, 2021 · Operations

Understanding MTTR, MTBF, and MTTF: Key Reliability Metrics for SRE

This article explains the definitions, calculations, and practical importance of MTTR, MTBF, and MTTF for reliability engineering, showing how accurate data and proper metric use enable SRE teams to improve system availability, plan maintenance, and reduce downtime.

MTBF · MTTF · MTTR

0 likes · 13 min read

Efficient Ops

Jan 5, 2021 · Operations

Master Site Reliability Engineering: Inside the SRE Foundation Course

The SRE Foundation course introduces site reliability engineering principles, practices, and tools, explaining why perfect reliability is impractical, outlining SRE responsibilities, detailing the curriculum across eight modules, and identifying the diverse professionals—from engineers to managers—who can benefit from mastering reliability, scalability, and automation.

Course · Reliability · SRE

0 likes · 7 min read

21CTO

Jan 2, 2021 · Operations

Designing & Operating Highly Available Scalable Systems: Google’s SRE Secrets

This article presents a comprehensive overview of Site Reliability Engineering (SRE) as shared by Google SRE expert Ramón Medrano Llamas, covering SRE fundamentals, a typical day’s workflow, design principles for massive scale, fault‑tolerant architecture, monitoring, SLI/SLO metrics, redundancy strategies, disaster recovery, and operational best practices.

Operations · SRE · Scalable Systems

0 likes · 13 min read

dbaplus Community

Dec 28, 2020 · Operations

From Zero to Scalable Monitoring: Lessons Learned Building a Prometheus‑Based SRE Platform

This article recounts a two‑year journey of building a company‑wide monitoring system, detailing early challenges with openfalcon, the shift to a Prometheus‑Grafana stack, architectural decisions across service, business, and product dimensions, and practical solutions to alert fatigue, threshold setting, and fault isolation.

Grafana · SRE

0 likes · 8 min read

Continuous Delivery 2.0

Dec 18, 2020 · Operations

Applying the VALET Model for SRE Transformation at Home Depot (THD)

The article explains how Home Depot (THD) adopted the VALET model—a five‑dimensional SLO language covering Volume, Availability, Latency, Error, and Ticket—to unify communication, automate data collection, and improve reliability across its massive retail and e‑commerce infrastructure.

Operations · Reliability · SLO

0 likes · 9 min read

dbaplus Community

Nov 23, 2020 · Operations

Mastering Fault Management: Building a Robust SRE Stability Framework

This article outlines a comprehensive SRE fault‑management framework, covering core responsibilities, stability metrics such as MTBF and MTTR, detailed pre‑, during‑, and post‑incident processes, monitoring, capacity planning, disaster‑recovery, error budgeting, organizational support, and future trends like AIOps and chaos engineering.

Error Budget · MTBF · MTTR

0 likes · 30 min read

Efficient Ops

Nov 4, 2020 · Operations

Unlocking SRE: Foundations, Principles, and Career Paths Explained

This article clarifies common misconceptions about Site Reliability Engineering, outlines the role’s responsibilities, presents the SRE Foundation course syllabus and target audience, and highlights the GOPS 2020 Global Operations Conference where the training is offered.

DevOps · Reliability · SRE

0 likes · 7 min read

ByteFE

Oct 28, 2020 · Frontend Development

Engineering Practices and Platform Evolution for Frontend Development at ByteDance

This article details ByteDance's journey in front‑end engineering, describing the evolution from manual deployments to a fully automated CI/CD pipeline, the creation of a dedicated front‑end deployment platform, and the ongoing development of a comprehensive front‑end R&D platform that integrates DevOps and SRE principles.

Automation · Deployment · DevOps

0 likes · 15 min read

Alibaba Cloud Developer

Oct 27, 2020 · Operations

Mastering SRE: Mindset, Monitoring, and Incident Response Strategies

This article shares practical SRE insights from years of experience at Alibaba, covering the right mindset, team responsibilities, systematic monitoring, alert management, fault‑handling processes, and resource control to build resilient, high‑availability systems.

Resource Management · SRE · reliability engineering

0 likes · 34 min read

Efficient Ops

Oct 19, 2020 · Operations

Designing an Effective DevOps Operations System: Principles and Practices

This article outlines a comprehensive DevOps operations framework, tracing its evolution from traditional ops to modern automation, detailing business standards, work policies, system integration, and best‑practice norms to achieve high SLA, low cost, and a one‑stop operational platform.

Automation · DevOps · Infrastructure

0 likes · 13 min read

ITPUB

Oct 15, 2020 · Operations

How a Huawei Maintenance Engineer Turned Painful On‑Call Duty into Efficient Knowledge Management

A Huawei maintenance engineer shares a decade‑long journey of turning 24/7 on‑call pain into systematic knowledge management, building comprehensive fault‑handling documentation, automating tools, and guiding the team’s evolution toward SRE practices that dramatically reduce manual effort and improve reliability.

Automation · Documentation · Huawei

0 likes · 14 min read

Efficient Ops

Sep 14, 2020 · Operations

Top 10 Must‑Read Books for Mastering SRE, DevOps, and Cloud Operations

Discover a curated list of ten essential books covering Site Reliability Engineering, performance tuning, AI‑ops, security, DevOps practices, Jenkins pipelines, and the evolution of modern operations, each offering practical insights and real‑world examples to elevate your technical expertise.

Book Recommendations · DevOps · SRE

0 likes · 9 min read

HaoDF Tech Team

Sep 7, 2020 · Operations

Analyzing Latency and Slow Interface Detection in a Full‑Chain Monitoring System

This article explains how latency is used as a key indicator for application risk identification, defines slow interfaces, describes why percentile‑based thresholds are preferred over averages, and outlines the architecture, task workflow, and practical optimization strategies for a full‑chain monitoring system in a microservice environment.

Latency · Microservices · SRE

0 likes · 14 min read

Efficient Ops

Aug 25, 2020 · Operations

How to Build an Enterprise‑Grade Observability System and Master Incident Response

This article explains how enterprises adopting SRE can design a comprehensive observability platform—covering metrics, logs, and tracing—while also detailing effective incident response, post‑mortem practices, testing, capacity planning, automation tool development, and user‑experience focus to improve overall operational reliability.

Observability · Operations · SRE

0 likes · 17 min read

Efficient Ops

Aug 23, 2020 · Operations

Unlock Reliable Services: SRE Foundation Course Highlights at GOPS 2020

The SRE Foundation course presented at the GOPS 2020 Global Operations Conference in Shenzhen introduces core Site Reliability Engineering principles, practical tools, and certification preparation through eight detailed modules, targeting a wide range of IT professionals and business stakeholders.

DevOps · SRE · Site Reliability Engineering

0 likes · 6 min read

Efficient Ops

Jul 28, 2020 · Operations

How Zhejiang Mobile Transformed SRE for Telecom: A Practical Operations Blueprint

This article details Zhejiang Mobile's adaptation of Google‑originated Site Reliability Engineering to a telecom environment, outlining a three‑layer capability framework, standardized processes, integrated platforms, and measurable outcomes that demonstrate how agile SRE practices can boost reliability and scalability in traditional industries.

Infrastructure · SRE · Site Reliability Engineering

0 likes · 11 min read

StarRing Big Data Open Lab

Jul 28, 2020 · Operations

How DevOps and SRE Transform Modern Software Delivery and Operations

This article explains the evolution from traditional C/S to B/S architectures, compares DevOps and SRE principles, discusses their roles in the container and cloud eras, and showcases StarRing's TDC platform that integrates automated pipelines, monitoring, and deployment for efficient software delivery.

DevOps · Operations · SRE

0 likes · 14 min read

dbaplus Community

Jul 13, 2020 · Operations

14 Expert Q&A on Building an Effective SRE System for Fault Management

In this detailed Q&A, a Meitu SRE leader explains the relationship between DevOps and SRE, shares practical advice on team composition, monitoring, alerting, fault‑prevention design, and provides step‑by‑step guidance using Grafana, draw.io, and other tools to help organizations build reliable services.

DevOps · Grafana · SRE

0 likes · 10 min read

AntTech

Jul 8, 2020 · Cloud Native

From Double Eleven to Cloud‑Native Capacity: Zheng Yangfei’s Journey and Ant Group’s Autoscaling Innovation

The article chronicles Zheng Yangfei’s rise from a double‑eleven intern to leader of Ant Group’s cloud‑native capacity team, detailing the evolution of large‑scale load‑testing, the challenges of autoscaling in financial‑grade systems, and the team’s shift toward platform‑driven, risk‑aware engineering.

SRE

0 likes · 11 min read

Tencent Cloud Developer

May 14, 2020 · Operations

Tencent Classroom Monitoring Practices: Challenges, Strategies, and Future Directions

During the pandemic’s “classes suspended, learning continues” surge, Tencent Classroom tackled a 120‑fold traffic jump by rapidly deploying Grafana dashboards, Kibana logs, internal Moniter and cloud monitoring tools, establishing a three‑layer feedback‑alert‑on‑call model, and now plans automation, unified visualizations, and chaos‑engineering to further boost observability and service reliability.

DevOps · SRE · Tencent Classroom

0 likes · 14 min read

Tencent Cloud Developer

Apr 22, 2020 · Cloud Native

Designing High‑Quality Service Architecture Under Traffic Peaks: Load Balancing, Rate Limiting, Retries, Timeouts, and Failure Mitigation

Drawing on Google SRE principles, Bilibili’s technical director outlines a systematic, cloud‑native framework for high‑quality service architecture during traffic peaks, covering frontend and internal load balancing, distributed rate limiting, controlled retries, fail‑fast timeouts, and comprehensive failure‑mitigation strategies.

SRE · cloud-native · load balancing

0 likes · 13 min read

Efficient Ops

Mar 26, 2020 · Operations

Why SRE Exists and How It Solves Reliability Challenges

This article explains why Site Reliability Engineering (SRE) emerged, outlines its core responsibilities, required skill set, and how it addresses reliability challenges through decoupling, SLO‑driven monitoring, and scenario‑based drills, while highlighting key observations and focus areas for modern operations teams.

SLO · SRE · monitoring

0 likes · 13 min read

NetEase Game Operations Platform

Mar 21, 2020 · Operations

Understanding Linux Kernel Packet Reception Path and NAPI

This article explains how Linux kernels receive network packets—from NIC hardware interrupts through NAPI polling, kernel TCP/IP processing, and finally to user‑space sockets—while also covering interrupt handling, buffer tuning, and performance‑optimizing techniques for SREs.

Kernel · Linux · NAPI

0 likes · 12 min read

Efficient Ops

Mar 20, 2020 · Operations

How Zhejiang Mobile Revamped IT Operations with AIOpsDev and SRE

Zhejiang Mobile’s IT Operations team announced a strategic shift from reactive ticket‑driven maintenance to a proactive, AI‑powered AIOpsDev model, establishing new departments, adopting SRE practices, and leveraging cloud‑native technologies to dramatically improve efficiency, reliability, and digital transformation.

DevOps · ITIL · Operations

0 likes · 7 min read

Dada Group Technology

Mar 5, 2020 · Operations

Building a High-Throughput Kubernetes-Based Log Processing System at Dada

The article describes how Dada rebuilt its log processing pipeline using Kubernetes mixed deployment, Filebeat for automated collection, Storm for efficient parsing, and Elasticsearch cold/hot nodes to handle over 130 billion daily log entries and 300TB storage.

Elasticsearch · Filebeat · Kubernetes

0 likes · 9 min read

NetEase Game Operations Platform

Feb 15, 2020 · Databases

Using Flyway for Database Version Management: Principles, Configuration, and Best Practices

This article introduces Flyway as a database migration tool, explains its working principle, directory and naming conventions, supported databases, and provides detailed step‑by‑step instructions, best‑practice guidelines, and troubleshooting tips for safely managing MySQL schema changes in production environments.

DevOps · Flyway · SQL

0 likes · 13 min read

Efficient Ops

Feb 5, 2020 · Operations

Balancing Stability and Speed: Google SRE Lessons for Modern Ops Teams

This article examines the inherent tension between operations and development, explains Google’s error‑budget and SLO approach, and shares practical DevOps, on‑call, automation, and talent strategies that help ops teams improve efficiency while maintaining product reliability.

Automation · Error Budget · On-Call

0 likes · 9 min read

MaGe Linux Operations

Jan 31, 2020 · Operations

Balancing Stability and Speed: Lessons from Google SRE for Modern Ops

This article examines the tension between operations and development teams, explains Google's SRE error‑budget model, and shares practical reflections on engineering ops, on‑call rotation, automation, and talent development to achieve a sustainable balance between product stability and rapid innovation.

Automation · DevOps · Error Budget

0 likes · 8 min read

dbaplus Community

Dec 30, 2019 · Operations

How Alibaba’s ECS Team Built a Scalable SRE System for Massive Cloud Services

This article explains the origins of Site Reliability Engineering (SRE), outlines the responsibilities of SRE teams, and details Alibaba Cloud’s ECS SRE practices—including capacity planning, performance optimization, full‑stack stability governance, automated release pipelines, on‑call processes, and the core principles and mindset that guide modern SRE work.

Automation · Operations · SRE

0 likes · 28 min read

Efficient Ops

Dec 3, 2019 · Operations

How E‑commerce SRE Teams Tackle Scale, Cost, and Speed Challenges

The talk outlines the unique operational challenges of a fast‑growing e‑commerce platform—including massive scale, frequent changes, cost pressures, and the trade‑off between speed and stability—and describes how the SRE team uses automation, capacity planning, and process engineering to deliver reliable, efficient services.

SRE · e‑commerce

0 likes · 29 min read

Alibaba Cloud Developer

Nov 29, 2019 · Operations

How Alibaba’s ECS SRE Team Built a Rock‑Solid Cloud Infrastructure for 100% Cloud Migration

This article explains how Alibaba's Elastic Compute Service (ECS) SRE team tackled massive traffic, database bottlenecks, alert overload, and resource inconsistencies by establishing a full‑stack reliability organization, upgrading core components, automating pipelines, and instituting rigorous monitoring, incident response, and change‑management processes.

Operations · SRE · Site Reliability Engineering

0 likes · 27 min read

21CTO

Nov 15, 2019 · Operations

How SRE Designs Highly Available Software Systems at Scale

This article presents Google SRE expert Ramón Medrano Llamas’s comprehensive guide on designing, operating, and maintaining large‑scale, highly available software systems, covering SRE fundamentals, daily workflows, scalability strategies, fault‑tolerant architecture, monitoring, and operational best practices.

SRE · Scalable Systems · fault tolerance

0 likes · 13 min read

Efficient Ops

Oct 29, 2019 · Operations

How Xiami’s SRE Team Revamped Monitoring to Cut Alert Noise by 90%

Xiami’s SRE team overhauled its monitoring system by categorizing alerts, introducing fault, generic, and basic monitoring, optimizing alert paths with stream processing, and leveraging Alibaba’s traffic scheduling platform, dramatically reducing daily noise from thousands of alerts to a manageable few hundred critical notifications.

Alibaba · SRE · Traffic Scheduling

0 likes · 9 min read

Sohu Tech Products

Oct 23, 2019 · Operations

Google SRE Weekly Alert Limits and Practical Strategies for Reducing Alert Fatigue

This article examines how Google SRE limits weekly alerts to ten, compares it with typical Chinese internet operations teams, and provides practical strategies—including on‑call scheduling, alert escalation, automation, dashboard optimization, and team management—to dramatically reduce alert volume and improve incident response.

Alert Management · On-Call · Operations

0 likes · 15 min read

Efficient Ops

Oct 23, 2019 · Operations

Building Scalable Operations: From SRE to AIOps and DevOps

This article explores how to construct a scalable operations framework by integrating concepts such as SRE, DevOps, AIOps, and continuous improvement, addressing organizational challenges, process standardization, tool automation, and the shift from reactive firefighting to proactive, value‑driven management.

IT Management · SRE · Scalability

0 likes · 30 min read

dbaplus Community

Oct 16, 2019 · Operations

How to Cut Alert Noise: Practical SRE Strategies for Ops Teams

This article shares concrete SRE‑inspired techniques—duty‑roster scheduling, tiered alert handling, automation safeguards, dashboard focus on top‑3 alerts, time‑based filtering, and systematic code review—to dramatically reduce daily alarm volume while keeping on‑call teams motivated and effective.

On-Call · SRE · alert optimization

0 likes · 15 min read

dbaplus Community

Aug 26, 2019 · Operations

Boost Network Transparency: Automated Monitoring and Ops Tools for SREs

Network engineers often go unnoticed until outages, so this guide explains how to make network status transparent through device availability checks, log and traffic monitoring, SNMP error tracking, and automation scripts—leveraging Python, syslog servers, and northbound APIs—to reduce troubleshooting time and prevent incidents.

Network Monitoring · Python · SNMP

0 likes · 11 min read

DevOps

Aug 24, 2019 · Operations

DevOps Engineers: The Highest‑Paid IT Role, Their Value, and How to Build a Career

The article explains why DevOps and SRE engineers top the 2019 Stack Overflow IT job popularity list, outlines their responsibilities, career prospects, required skills, and provides practical advice for aspiring professionals.

Automation · DevOps · IT careers

0 likes · 8 min read

DevOps

Jul 29, 2019 · Operations

Google’s Continuous Delivery Practices and SRE Culture: A DevOps Case Study

This article examines Google’s corporate values, development history, culture, and detailed DevOps and Site Reliability Engineering practices—including continuous delivery, SRE responsibilities, and Google Cloud Platform CI/CD tools—to illustrate how the company achieves 24/7 reliable service deployment at massive scale.

Continuous Delivery · DevOps · Google

0 likes · 15 min read

AntTech

Jun 12, 2019 · Operations

Alipay’s Technical Risk System: Building SRE, TRaaS, and AIOps for High Availability

The article details how Alipay’s technical risk team, led by researcher Chen Liang, evolved from early scalability work to a full‑stack SRE organization, created the TRaaS risk‑defense platform and integrated AIOps to achieve near‑five‑nine availability and automated self‑healing for its financial services.

SRE · TRaaS · aiops

0 likes · 12 min read

Efficient Ops

Feb 26, 2019 · Operations

How NetEase Solved 10 Years of Ops Challenges: From Scripts to Cloud-native Platforms

In this talk, senior NetEase operations engineer Gu Xianjie shares a decade‑long journey tackling technical debt, rapid product growth, and on‑call pain points, describing the evolution from manual scripts to automated platforms, service‑oriented tools, DevOps/SRE practices, and cloud‑native strategies that boosted efficiency and reliability.

SRE · platform engineering

0 likes · 17 min read

ITPUB

Feb 11, 2019 · Operations

How to Make Enterprise Networks Transparent and Efficient with Simple Monitoring Tools

This article explains how network engineers can use lightweight monitoring solutions, log analysis, traffic and error tracking, and custom automation scripts to gain visibility, reduce troubleshooting time, and safely automate routine network tasks in enterprise environments.

Automation · Network Monitoring · Operations

0 likes · 10 min read

ITPUB

Jan 31, 2019 · Operations

Master Monitoring: Collect Metrics for New Systems Using White‑Box Techniques & the Four Golden SRE Indicators

This article explains how to approach monitoring for a newly introduced system by focusing on white‑box metric collection, distinguishing basic and business metrics, outlining common collection methods, and detailing Google SRE's four golden indicators—error, latency, traffic, and saturation—to guide effective observability.

Metrics · Observability · Operations

0 likes · 10 min read

JD Tech

Jan 31, 2019 · Operations

Understanding White‑Box and Black‑Box Monitoring: Data Collection Methods and the Four Golden Metrics

This article explains the differences between white‑box and black‑box monitoring, outlines common data‑collection techniques for both basic and business metrics, and details Google SRE’s four golden indicators—error, latency, traffic, and saturation—to help engineers design effective monitoring solutions.

Metrics · Operations · SRE

0 likes · 9 min read

Didi Tech

Jan 7, 2019 · Operations

Data‑Driven Risk Quantification Platform for SRE at Didi

Didi’s data‑driven Risk Quantification Platform assigns numeric Change Credit and Monitoring Health scores to deployments, alerts and core services, turning operational best‑practice adoption into a competitive game that has raised scores, cut incident rates despite higher change volume, and paves the way for broader risk‑management across the organization.

Risk Quantification · SRE · data-driven operations

0 likes · 9 min read

JD Tech

Jan 3, 2019 · Operations

Comprehensive Monitoring Strategies for E‑commerce Platforms: Black‑Box and White‑Box Approaches

This article systematically explains how to enhance e‑commerce platform availability by implementing both black‑box monitoring to detect functional failures and white‑box monitoring to pinpoint root causes, detailing core order‑process metrics, common issues, mitigation strategies, and illustrative Grafana dashboards.

Grafana · Operations · SRE

0 likes · 9 min read

AntTech

Dec 19, 2018 · Information Security

Red‑Blue Technical Attack‑Defense Exercises and SRE Practices at Ant Financial

Ant Financial’s internal red‑blue technical attack‑defense program, driven by a dedicated blue team and SRE‑based red team, continuously probes system weaknesses, refines fault‑injection tools like Awatch, and evolves high‑availability and self‑healing mechanisms to strengthen risk control and operational reliability.

Fault Injection · Operations · SRE

0 likes · 10 min read

Continuous Delivery 2.0

Nov 19, 2018 · Operations

Google's Software Testing Transformation: Crisis, Leadership, and Organizational Mechanisms

The article analyzes how Google responded to a testing crisis by empowering a visionary leader, establishing supportive structures, encouraging innovation, and persisting over years to embed a quality‑centric culture that eventually led to decentralized testing, SRE adoption, and a shift toward test‑design engineers.

Google · SRE · Software Testing

0 likes · 7 min read

JD Tech

Nov 5, 2018 · Operations

Practical Guide to Elasticsearch Monitoring and Operations

This article provides a comprehensive, operations‑focused overview of Elasticsearch monitoring, covering tool selection, key metrics for black‑box and white‑box monitoring, common issues discovered through alerts, and practical optimization recommendations to ensure high availability of ES clusters.

Elasticsearch · SRE · tools

0 likes · 8 min read

21CTO

Aug 30, 2018 · Operations

Inside Google’s Production: How Requests Travel Through Its Massive Infrastructure

Google’s production environment spans a global edge network, massive data centers, sophisticated job scheduling with Borg, distributed storage systems like Bigtable and Spanner, and comprehensive monitoring, illustrating how user requests traverse multiple layers—from ISP to edge, GFE, load balancers, and finally to services.

Deployment · Google · Infrastructure

0 likes · 9 min read

Architecture Digest

Aug 29, 2018 · Operations

Google Production Environment: Network, Data Center, Cluster Management, Storage, Monitoring, and Deployment Workflow

The article explains Google’s end‑to‑end production infrastructure—including the edge network, data‑center hierarchy, Borg‑based cluster management, storage systems like Colossus and Spanner, monitoring with Borgmon, inter‑task RPC via Stubby, and the code‑to‑production pipeline using Piper, Blaze, Rapid, and Sisyphus—illustrating how requests travel from users to services in milliseconds.

Data center · Deployment · Google

0 likes · 10 min read

Efficient Ops

Jul 3, 2018 · Operations

What Did China’s DevOps International Summit Reveal About the Future of Operations?

The DOIS DevOps International Summit in Beijing gathered over a hundred experts from finance, internet and telecom sectors to share DevOps, AIOps, SRE and automation practices, launch a new DevOps maturity model, and showcase how banks like Minsheng are applying these methods to modernize their operations.

DevOps · SRE · aiops

0 likes · 18 min read

dbaplus Community

Jun 7, 2018 · Operations

Why Ceph’s Unlimited Scalability Isn’t As Simple As It Looks

The article examines Ceph’s claimed infinite scalability, cost advantages, and operational stability from an SRE perspective, comparing it with centralized systems like HDFS, and reveals practical challenges such as expansion granularity, crushmap rebalancing, utilization limits, and maintenance overhead.

Ceph · HDFS · Operations

0 likes · 15 min read

vivo Internet Technology

Jun 5, 2018 · Operations

DevOps International Summit: Latest Practices and Technologies

The DevOps International Summit in Beijing, the sole China‑based global DevOps conference, brings together over 80 leading experts to showcase end‑to‑end practices, ranging from Lean‑Agile, Continuous Delivery, SRE, and microservices to DevSecOps, AI‑driven tooling, and the new Research and Operations Integration Capability Maturity Model. The program includes industry‑focused tracks, hands‑on training, and real‑world case studies across finance, telecom, retail and more.

Continuous Delivery · DevOps · DevOps Summit

0 likes · 3 min read

dbaplus Community

May 8, 2018 · Operations

How to Build Reliable Operations: From BCM to Google SRE Practices

This article examines the growing challenges of system availability in modern operations, explains the concept of availability and the N‑nine metric, introduces Business Continuity Management and Google SRE approaches, and provides concrete technical and managerial methods—including architecture standardization, scaling strategies, tooling, emergency drills, and incident‑centralized management—to improve operational reliability.

Availability · BCM · Operations

0 likes · 30 min read

360 Tech Engineering

May 2, 2018 · Operations

Applying Mesos and Docker Containerization in 360 Commercial Advertising System

This article details how 360's commercial advertising platform leverages Mesos and Docker containerization to solve data‑center migration, fault recovery, OS inconsistencies, and resource‑utilization challenges, describing the architecture, standardization, networking, storage, service discovery, and future plans.

Cloud Native · Docker · Mesos

0 likes · 22 min read

Snowball Engineer Team

Jan 12, 2018 · Operations

RDR: An Open-Source Tool for Visualizing and Analyzing Redis Memory Usage

This article introduces RDR, an open-source visualization platform developed by Xueqiu's SRE team to safely and efficiently analyze Redis memory consumption by parsing RDB files, estimating key-level memory usage based on internal data structures, and generating intuitive statistical reports for operational optimization.

Memory analysis · Operations · RDB Parsing

0 likes · 9 min read

Efficient Ops

Jan 11, 2018 · Operations

Mastering Incident Troubleshooting: Proven SRE Strategies for Operations

This article shares practical SRE‑based principles for diagnosing and resolving online incidents, emphasizing systematic investigation, gathering clues, and prioritizing service restoration over immediate root‑cause identification to make troubleshooting less mystical and more effective.

Operations · SRE · incident management

0 likes · 7 min read

Meituan Technology Team

Dec 1, 2017 · Operations

Cloud SRE Development and Practice

The Meituan‑Dianping Technology Salon Online offers a recurring live‑streamed course where SRE experts, led by Zuo Pucun, discuss the challenges of high growth and concurrency, the evolution from firefighting to proactive stability, service availability, user‑experience optimization, and future automation in cloud SRE practice.

Meituan-Dianping · SRE · Stability Assurance

0 likes · 3 min read

MaGe Linux Operations

Nov 26, 2017 · Operations

What Google’s SRE Reveals About Modern Operations and SLO Design

This article shares key insights from the book “SRE Google Operations Unveiled,” explaining Google’s infrastructure, the role of SRE, and how Service Level Objectives (SLOs) help balance reliability, cost, and innovation in modern operations.

Google · SLO · SRE

0 likes · 9 min read

MaGe Linux Operations

Nov 19, 2017 · Operations

Which DevOps Team Topology Fits Your Organization? A Practical Guide

This article examines common DevOps team structures and anti‑patterns, explains how product portfolio, leadership, and organizational readiness influence the choice of topology, and presents nine practical models—from collaborative teams to SRE and container‑driven approaches—to help you select the most effective structure for your business.

Collaboration · SRE · Team Topology

0 likes · 19 min read

Qunar Tech Salon

Oct 26, 2017 · Operations

Evolution of Pinterest's Monitoring System: From Time-Series Metrics to Distributed Tracing

Over seven years, Pinterest’s monitoring team built and refined a three‑pronged observability platform—time‑series metrics, log search, and distributed tracing—scaling from a single‑machine system to handling millions of data points per second across tens of thousands of AWS VMs, while addressing reliability, cost, and usability challenges.

Distributed Tracing · Observability · SRE

0 likes · 19 min read

Efficient Ops

Oct 24, 2017 · Operations

How Pinterest Scaled Its Monitoring, Logging, and Tracing Over Seven Years

This article chronicles Pinterest's seven‑year evolution from a single‑machine time‑series monitor to a multi‑component system that integrates metrics, log search, and distributed tracing, sharing architectural choices, scaling challenges, and lessons learned for building reliable, high‑performance operations platforms.

Distributed Tracing · Operations · SRE

0 likes · 24 min read

Architects Research Society

Oct 20, 2017 · Operations

Understanding Site Reliability Engineering (SRE): Definitions, Tools, Roles, and Evolution

The article explains Site Reliability Engineering (SRE) as a discipline that blends software engineering with operations, detailing its origins, key responsibilities, required skill sets, tools, impact on reliability and downtime costs, and how the role has evolved with modern cloud and DevOps practices.

DevOps · Reliability · SRE

0 likes · 9 min read

Qunar Tech Salon

Sep 28, 2017 · Operations

The Evolution of DevOps: From Foundations to AIOps, Containers, SRE, and ChatOps

This article examines the lifecycle of DevOps, its core principles, the rise of AIOps, the pivotal role of container technology, the emergence of SRE as a best‑practice, and how ChatOps represents the ultimate goal of simplifying complex operations through conversational interfaces.

ChatOps · DevOps · SRE

0 likes · 12 min read

Efficient Ops

Sep 27, 2017 · Operations

From Ops to SRE: What Google’s Site Reliability Model Means for Your Team

The article reflects on the shift from traditional operations to Site Reliability Engineering (SRE), comparing Google’s SRE practices with those of a Chinese cloud provider, and explores infrastructure, tooling, team structure, and cultural challenges while drawing practical lessons for engineers.

DevOps · Google · SRE

0 likes · 19 min read

MaGe Linux Operations

Aug 30, 2017 · Operations

How Traditional Banks Are Overcoming IT Ops Challenges in the Digital Age

The article examines how traditional banks are reshaping their IT architecture and operations to meet soaring online transaction volumes, tighter regulatory demands, and the need for seamless DevOps and SRE practices, highlighting automation, self‑developed tools, and future data‑driven priorities.

Automation · Banking · DevOps

0 likes · 7 min read

Efficient Ops

Aug 29, 2017 · Operations

From ITIL to SRE: How Vipshop Transformed Its Operations

This article recounts Vipshop’s journey from a traditional ITIL‑based operations model to an SRE‑inspired, automated workflow, detailing the construction of ITIL processes, the challenges faced, the shift toward automation, and personal insights on managing people, quality, and change.

DevOps · ITIL · SRE

0 likes · 20 min read

Efficient Ops

Aug 28, 2017 · Operations

Can Ops Teams Become Agile? A Practical Kanban Journey

This article explores how operations teams can adopt agile principles—especially Kanban—to address common challenges such as delayed feedback, task overload, and hidden risks, demonstrating a step‑by‑step transformation within the DevOps lifecycle.

DevOps · Kanban · Lean

0 likes · 28 min read

360 Zhihui Cloud Developer

Jul 27, 2017 · Operations

Mastering Incident Troubleshooting: Proven SRE Strategies for Ops Teams

This article shares practical SRE‑based principles and step‑by‑step methods for diagnosing and resolving online incidents, emphasizing mindset, systematic information gathering, and structured analysis to turn mysterious outages into solvable problems.

Operations · Root Cause Analysis · SRE

0 likes · 7 min read

Efficient Ops

Jul 25, 2017 · Operations

Why Google’s SRE Model Matters: Lessons for Modern Ops Teams

This article explains the origins, responsibilities, and team structures of Google Site Reliability Engineering (SRE), compares it with traditional operations roles in companies like Yahoo, Alibaba, and Facebook, and offers practical guidance for building effective SRE or application‑operations teams today.

DevOps · SRE · Site Reliability Engineering

0 likes · 25 min read

360 Zhihui Cloud Developer

Jul 11, 2017 · Operations

Mastering On-Call: Practical Lessons from Google SRE for Effective Ops

This article shares practical insights from Google SRE on on‑call duty, covering why on‑call is needed, common challenges, effective scheduling, evaluation methods, and actionable tips to improve team resilience and reduce stress for operations engineers.

On-Call · Operations · SRE

0 likes · 9 min read

Efficient Ops

Jun 10, 2017 · Operations

What Google’s SRE Book Reveals About Modern Operations

This article introduces the Chinese translation of Google’s SRE book, shares behind‑the‑scenes stories of its creation, and distills key concepts such as the AAA model, Borg architecture, SLOs, toil reduction, and the cultural shift required for reliable large‑scale services.

DevOps · Google · Infrastructure

0 likes · 20 min read

ITPUB

Jun 9, 2017 · Operations

Mastering Effective Monitoring: From Basics to the USE Method

This article explains the fundamentals of monitoring, distinguishes traditional OPS from SRE perspectives, defines monitoring objects and metrics, introduces quantitative thinking with SLI/SLO, and presents the USE method with a MySQL example to help engineers detect and prevent failures efficiently.

Metrics · Operations · SLI

0 likes · 10 min read

Efficient Ops

May 25, 2017 · Operations

How a Bank Transformed IT Ops with Automated DevOps and SRE Practices

This article outlines how China Merchants Bank’s data‑center application management team identified traditional financial IT operational pain points, introduced DevOps and SRE concepts, built non‑functional management frameworks, and implemented automated tooling, monitoring, and capacity‑scaling to achieve fully automated operations.

DevOps · IT Operations · Performance Scaling

0 likes · 24 min read

Efficient Ops

May 20, 2017 · Operations

How SREcon Asia Selected Its Top 35 Talks: Inside the Rigorous Review Process

The article explains how the SREcon Asia conference in Singapore collected, evaluated, and selected 35 out of 108 submitted topics through a blind scoring system, highlighting the timeline, review criteria, and the final three‑day agenda focused on automation and monitoring.

DevOps · SRE · SREcon

0 likes · 5 min read

ITPUB

May 15, 2017 · Operations

Mastering Online Incident Management: From Detection to Prevention

This article outlines a comprehensive methodology for handling large‑scale online service incidents, covering goals, the "jump‑fill‑avoid" framework, step‑by‑step processes for detection, diagnosis, remediation, and post‑mortem analysis, as well as essential monitoring, logging, and escalation infrastructure.

Operations · SRE · incident management

0 likes · 18 min read

DevOps

Apr 18, 2017 · Operations

Understanding Site Reliability Engineering (SRE): Roles, Responsibilities, Skills, and Differences from DevOps

This article explains the concept of Site Reliability Engineering (SRE), its origins at Google, core responsibilities such as IT operations and availability improvement, required skill sets, how it differs from DevOps, and guidance on adopting SRE practices within organizations.

DevOps · IT Service Management · Operations

0 likes · 12 min read

360 Zhihui Cloud Developer

Apr 6, 2017 · Operations

How SRE’s Dialectical Thinking Redefines Modern Operations

An insightful reflection on Google’s SRE philosophy shows how dialectical thinking—questioning absolute stability, embracing limited toil, prioritizing simple monitoring, recognizing automation’s hidden risks, and practicing real‑world failure drills—can reshape operations, encouraging smarter, more resilient system design.

Automation · Reliability · SRE

0 likes · 7 min read

360 Zhihui Cloud Developer

Mar 30, 2017 · Operations

What Google’s SRE Secrets Reveal About Modern Operations and SLOs

The article shares personal insights from reading Google’s SRE book, explaining core SRE concepts, Google’s robust infrastructure, the role of SLOs, and how they help balance cost, reliability, and innovation in modern operations.

Google · Reliability · SLO

0 likes · 8 min read

Efficient Ops

Mar 26, 2017 · Operations

How Google Scales App Engine: Lessons in Cloud Scalability and SRE

The article shares Google SRE veteran Minghua Ye’s insights on App Engine’s evolution, emphasizing the critical role of automatic scalability, distributed locks, service discovery, load balancing, and open‑source tools like gRPC, Protobuf, gflags, glog, and Googletest in building reliable, high‑traffic cloud services.

Distributed Systems · Google App Engine · Protobuf

0 likes · 12 min read

Efficient Ops

Mar 21, 2017 · Operations

Rethinking Operations: The “Third Kind” of SRE at Lianjia

The article shares the author’s experience transitioning from private to public and hybrid clouds at Lianjia, introduces a “third kind” of operations that blends traditional and internet‑based practices, and discusses containers, DNS‑based naming, and automation tools to build adaptable, cost‑effective infrastructure.

Infrastructure · Naming Service · SRE

0 likes · 21 min read

High Availability Architecture

Mar 15, 2017 · Operations

Highlights from SRECon17 Americas in San Francisco

The article reports on the SRECon17 Americas conference in San Francisco, summarizing keynote talks, panel sessions, and practical insights from industry leaders such as Stripe, Netflix, Google, and IBM on topics ranging from traffic control and container management to on‑call practices and cost considerations for Site Reliability Engineering.

DevOps · Google · Netflix

0 likes · 6 min read

Ctrip Technology

Dec 9, 2016 · Operations

Design and Implementation of Ctrip Call Center's Active‑Active Architecture and Unified Login

The article details Ctrip's call‑center architecture evolution, describing the multi‑layer active‑active design, public access, application and client layers, unified login mechanisms, operational challenges, disaster‑recovery drills, and future plans for software‑only and mobile agents, illustrating practical SRE principles in a large‑scale telephony system.

Active-Active · IP phone · SRE

0 likes · 22 min read

Efficient Ops

Dec 4, 2016 · Operations

How Ctrip Built a Seamless Multi‑Region Dual‑Active Call Center

This article details Ctrip's evolution from a single‑site call‑center to a fully dual‑active, multi‑region architecture, covering the overall system design, public network, application, and client layers, unified login mechanisms, heartbeat monitoring, and future software‑only and mobile‑first directions.

Dual-Active · Operations · SRE

0 likes · 27 min read

Efficient Ops

Nov 7, 2016 · Operations

How to Train New SREs Effectively: Proven Practices and Playbooks

This article outlines a systematic approach to onboarding and training new Site Reliability Engineers, covering trust building, readiness assessment, diverse learning methods, structured curricula, on‑call milestones, project‑focused work, reverse‑engineering skills, statistical thinking, and improvisation techniques to develop high‑performing SRE teams.

On-Call · Operations · SRE

0 likes · 17 min read

Efficient Ops

Oct 30, 2016 · Operations

How Google Music Recovered 1.5 PB of Lost Data After a Massive Deletion Bug

In March 2012, a privacy‑driven deletion pipeline mistakenly erased hundreds of thousands of Google Music files, prompting SREs to launch a massive data‑recovery effort that involved MapReduce impact analysis, tape‑based backups, and a complete redesign of the deletion system.

Data Recovery · Google Music · Large-Scale Deletion

0 likes · 14 min read

Efficient Ops

Oct 26, 2016 · Operations

From Sysadmin to Google SRE: How Modern Ops Teams Can Thrive

This article compares traditional system administration with Google’s Site Reliability Engineering, explaining why enterprises are shifting from cost‑center SLA focus to data‑driven, user‑experience‑oriented operations, and offers practical steps for teams to adopt automation, cloud platforms, and risk‑aware practices.

SRE

0 likes · 14 min read

Efficient Ops

Oct 23, 2016 · Operations

How Google’s SRE Postmortems Drive System Reliability

This article explains Google’s SRE postmortem philosophy, the criteria for writing postmortems, best practices for a blame‑free culture, and how collaborative knowledge‑sharing and incentives improve incident handling and overall system reliability.

Operations · SRE · incident management

0 likes · 14 min read

Efficient Ops

Oct 16, 2016 · Operations

Balancing Reliability and Innovation: Google’s SRE Risk Management Explained

This article explores how Google Site Reliability Engineers manage service reliability by balancing risk, cost, and business goals, using metrics like unplanned downtime, availability formulas, and risk tolerance to set realistic SLOs for both consumer and infrastructure services.

Availability · Google · Operations

0 likes · 21 min read

Efficient Ops

Oct 5, 2016 · Operations

5 Must‑Read Ops ‘Prescriptions’ to Boost Your Infrastructure Skills

This article curates five technical reads—covering network operations, Google’s production environment, massive cost‑saving strategies, IDC automation, and Docker‑based RDS—each presented as a “medicine” with a brief description and a link for deeper insight.

Cost Optimization · Docker · Operations

0 likes · 5 min read

Efficient Ops

Sep 18, 2016 · Operations

Who Was the World’s First SRE? Uncovering Margaret Hamilton’s Legacy

This article explores the origins of Site Reliability Engineering, highlights Margaret Hamilton as the likely first SRE through her work on NASA’s Apollo program, and draws lessons on reliability, disaster prevention, and the evolution of modern SRE practices.

Apollo program · Margaret Hamilton · SRE

0 likes · 10 min read

Efficient Ops

Sep 13, 2016 · Operations

How Google SRE Principles Compare Across Industries

This article, excerpted from the upcoming Chinese edition of “SRE: Google Site Reliability Engineering”, examines how Google’s SRE guiding philosophies—disaster planning, post‑mortem culture, automation, and data‑driven decision‑making—are adopted, adapted, or contrasted in sectors such as manufacturing, aerospace, nuclear, telecommunications, healthcare, and finance, highlighting key similarities, differences, and lessons for Google and the broader tech industry.

Automation · Operations · SRE

0 likes · 21 min read

MaGe Linux Operations

May 27, 2016 · Operations

Why Google Relies on Software Engineers to Run Its Services: Inside SRE

The article explains Google’s Site Reliability Engineering (SRE) philosophy, how it empowers software engineers to automate operations, the balance between development and reliability, the concept of error budgets, and the cultural shift that turned DevOps into a core practice for large‑scale services.

DevOps · Error Budget · Operations Automation

0 likes · 10 min read

21CTO

Apr 21, 2016 · Operations

Why Google Lets Software Engineers Run Its Services: Inside Site Reliability Engineering

Google’s near‑perfect uptime is achieved by Site Reliability Engineering, a philosophy that empowers software engineers to automate operations, balance development with reliability, and treat system availability as a core product feature.

DevOps · Google · SRE

0 likes · 10 min read
