Tagged articles
403 articles
Continuous Delivery 2.0
Mar 1, 2021 · Operations

Google’s Project Health Metrics and Practices for Pre‑Release Code Quality

The article explains how Google measures and maintains software quality before release by dividing responsibilities between product teams and SRE, using monorepo, trunk‑based development, daily release candidates, automated testing, performance monitoring, and a Project Health (pH) metric system that tracks productivity, release velocity, reliability, and quality.

Google · Metrics · Project Health
12 min read
DevOps Coach
Feb 9, 2021 · Operations

Master Elastic Observability: Build a Full‑Stack Monitoring Platform in Half a Day

This workshop guides participants from installing a single‑node Elastic Stack to deploying a cloud‑native observability platform for a multi‑tier pet‑store application, covering health checks, metrics, logs, APM tracing, SLO/SLI setup, and custom dashboards across local, AWS, and Tencent Cloud environments.

Cloud Native · Elastic Stack · Observability
7 min read
21CTO
Feb 3, 2021 · Operations

Bridging Product Development and SRE: How to Ensure Stability Across the Software Lifecycle

This article explains the role of Site Reliability Engineering (SRE) in bridging product and foundational technology development, outlines the software lifecycle, describes how SRE ensures system stability through controllability, observability, and protection, and provides practical best‑practice checklists and maturity levels for evaluating and improving reliability.

Observability · Operations · SRE
13 min read
Efficient Ops
Jan 19, 2021 · Operations

How SRE Bridges Development and Operations to Boost System Reliability

This article explores the role of Site Reliability Engineering (SRE) as a bridge between product development and operations, detailing its responsibilities, core principles, lifecycle perspective, stability value, and practical frameworks for controllability, observability, and best‑practice implementation to enhance system reliability.

Observability · SRE · reliability engineering
13 min read
Efficient Ops
Jan 5, 2021 · Operations

Master Site Reliability Engineering: Inside the SRE Foundation Course

The SRE Foundation course introduces site reliability engineering principles, practices, and tools, explaining why perfect reliability is impractical, outlining SRE responsibilities, detailing the curriculum across eight modules, and identifying the diverse professionals—from engineers to managers—who can benefit from mastering reliability, scalability, and automation.

Course · Reliability · SRE
7 min read
21CTO
Jan 2, 2021 · Operations

Designing & Operating Highly Available Scalable Systems: Google’s SRE Secrets

This article presents a comprehensive overview of Site Reliability Engineering (SRE) as shared by Google SRE expert Ramón Medrano Llamas, covering SRE fundamentals, a typical day’s workflow, design principles for massive scale, fault‑tolerant architecture, monitoring, SLI/SLO metrics, redundancy strategies, disaster recovery, and operational best practices.

Operations · SRE · Scalable Systems
13 min read
Continuous Delivery 2.0
Dec 18, 2020 · Operations

Applying the VALET Model for SRE Transformation at Home Depot (THD)

The article explains how Home Depot (THD) adopted the VALET model, a five‑dimensional SLO language covering Volume, Availability, Latency, Errors, and Tickets, to unify communication, automate data collection, and improve reliability across its massive retail and e‑commerce infrastructure.

Operations · Reliability · SLO
9 min read
dbaplus Community
Nov 23, 2020 · Operations

Mastering Fault Management: Building a Robust SRE Stability Framework

This article outlines a comprehensive SRE fault‑management framework, covering core responsibilities, stability metrics such as MTBF and MTTR, detailed pre‑, during‑, and post‑incident processes, monitoring, capacity planning, disaster‑recovery, error budgeting, organizational support, and future trends like AIOps and chaos engineering.

Error Budget · MTBF · MTTR
30 min read
Efficient Ops
Nov 4, 2020 · Operations

Unlocking SRE: Foundations, Principles, and Career Paths Explained

This article clarifies common misconceptions about Site Reliability Engineering, outlines the role’s responsibilities, presents the SRE Foundation course syllabus and target audience, and highlights the GOPS 2020 Global Operations Conference where the training is offered.

DevOps · Reliability · SRE
7 min read
ByteFE
Oct 28, 2020 · Frontend Development

Engineering Practices and Platform Evolution for Frontend Development at ByteDance

This article details ByteDance's journey in front‑end engineering, describing the evolution from manual deployments to a fully automated CI/CD pipeline, the creation of a dedicated front‑end deployment platform, and the ongoing development of a comprehensive front‑end R&D platform that integrates DevOps and SRE principles.

Automation · Deployment · DevOps
15 min read
Efficient Ops
Oct 19, 2020 · Operations

Designing an Effective DevOps Operations System: Principles and Practices

This article outlines a comprehensive DevOps operations framework, tracing its evolution from traditional ops to modern automation, detailing business standards, work policies, system integration, and best‑practice norms to achieve high SLA, low cost, and a one‑stop operational platform.

Automation · DevOps · Infrastructure
13 min read
ITPUB
Oct 15, 2020 · Operations

How a Huawei Maintenance Engineer Turned Painful On‑Call Duty into Efficient Knowledge Management

A Huawei maintenance engineer shares a decade‑long journey of turning 24/7 on‑call pain into systematic knowledge management, building comprehensive fault‑handling documentation, automating tools, and guiding the team’s evolution toward SRE practices that dramatically reduce manual effort and improve reliability.

Automation · Documentation · Huawei
14 min read
Efficient Ops
Sep 14, 2020 · Operations

Top 10 Must‑Read Books for Mastering SRE, DevOps, and Cloud Operations

Discover a curated list of ten essential books covering Site Reliability Engineering, performance tuning, AI‑ops, security, DevOps practices, Jenkins pipelines, and the evolution of modern operations, each offering practical insights and real‑world examples to elevate your technical expertise.

Book Recommendations · DevOps · SRE
9 min read
HaoDF Tech Team
Sep 7, 2020 · Operations

Analyzing Latency and Slow Interface Detection in a Full‑Chain Monitoring System

This article explains how latency is used as a key indicator for application risk identification, defines slow interfaces, describes why percentile‑based thresholds are preferred over averages, and outlines the architecture, task workflow, and practical optimization strategies for a full‑chain monitoring system in a microservice environment.
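
As a rough illustration of why a percentile threshold can flag slow interfaces that an average hides, here is a minimal sketch. The sample latencies, the nearest-rank percentile method, and the function name are assumptions made for illustration; they are not taken from the article.

```python
import math
import statistics


def nearest_rank_percentile(samples, pct):
    """Return the pct-th percentile of samples using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Nearest rank: the smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds: 95 fast requests and 5 very slow ones.
latencies_ms = [20] * 95 + [2000] * 5

mean_ms = statistics.mean(latencies_ms)
p50_ms = nearest_rank_percentile(latencies_ms, 50)
p99_ms = nearest_rank_percentile(latencies_ms, 99)

print(f"mean={mean_ms} ms, p50={p50_ms} ms, p99={p99_ms} ms")
```

In this made-up sample the mean sits well below the slow requests and well above the typical one, so it describes neither group, while the 99th percentile lands directly on the slow tail.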

Latency · Microservices · SRE
14 min read
Efficient Ops
Aug 25, 2020 · Operations

How to Build an Enterprise‑Grade Observability System and Master Incident Response

This article explains how enterprises adopting SRE can design a comprehensive observability platform—covering metrics, logs, and tracing—while also detailing effective incident response, post‑mortem practices, testing, capacity planning, automation tool development, and user‑experience focus to improve overall operational reliability.

Observability · Operations · SRE
17 min read
Efficient Ops
Aug 23, 2020 · Operations

Unlock Reliable Services: SRE Foundation Course Highlights at GOPS 2020

The SRE Foundation course presented at the GOPS 2020 Global Operations Conference in Shenzhen introduces core Site Reliability Engineering principles, practical tools, and certification preparation through eight detailed modules, targeting a wide range of IT professionals and business stakeholders.

DevOps · SRE · Site Reliability Engineering
6 min read
Efficient Ops
Jul 28, 2020 · Operations

How Zhejiang Mobile Transformed SRE for Telecom: A Practical Operations Blueprint

This article details Zhejiang Mobile's adaptation of Google‑originated Site Reliability Engineering to a telecom environment, outlining a three‑layer capability framework, standardized processes, integrated platforms, and measurable outcomes that demonstrate how agile SRE practices can boost reliability and scalability in traditional industries.

Infrastructure · SRE · Site Reliability Engineering
11 min read
dbaplus Community
Jul 13, 2020 · Operations

14 Expert Q&A on Building an Effective SRE System for Fault Management

In this detailed Q&A, a Meitu SRE leader explains the relationship between DevOps and SRE, shares practical advice on team composition, monitoring, alerting, fault‑prevention design, and provides step‑by‑step guidance using Grafana, draw.io, and other tools to help organizations build reliable services.

DevOps · Grafana · SRE
10 min read
Tencent Cloud Developer
May 14, 2020 · Operations

Tencent Classroom Monitoring Practices: Challenges, Strategies, and Future Directions

During the pandemic’s “classes suspended, learning continues” surge, Tencent Classroom handled a 120‑fold traffic jump by rapidly deploying Grafana dashboards, Kibana log views, the internal Moniter tool, and cloud monitoring tools, and by establishing a three‑layer feedback‑alert‑on‑call model. It now plans automation, unified visualizations, and chaos engineering to further improve observability and service reliability.

DevOps · SRE · Tencent Classroom
14 min read
Tencent Cloud Developer
Apr 22, 2020 · Cloud Native

Designing High‑Quality Service Architecture Under Traffic Peaks: Load Balancing, Rate Limiting, Retries, Timeouts, and Failure Mitigation

Drawing on Google SRE principles, Bilibili’s technical director outlines a systematic, cloud‑native framework for high‑quality service architecture during traffic peaks, covering frontend and internal load balancing, distributed rate limiting, controlled retries, fail‑fast timeouts, and comprehensive failure‑mitigation strategies.

SRE · cloud-native · load balancing
13 min read
Efficient Ops
Mar 26, 2020 · Operations

Why SRE Exists and How It Solves Reliability Challenges

This article explains why Site Reliability Engineering (SRE) emerged, outlines its core responsibilities, required skill set, and how it addresses reliability challenges through decoupling, SLO‑driven monitoring, and scenario‑based drills, while highlighting key observations and focus areas for modern operations teams.

SLO · SRE · monitoring
13 min read
Efficient Ops
Mar 20, 2020 · Operations

How Zhejiang Mobile Revamped IT Operations with AIOpsDev and SRE

Zhejiang Mobile’s IT Operations team announced a strategic shift from reactive ticket‑driven maintenance to a proactive, AI‑powered AIOpsDev model, establishing new departments, adopting SRE practices, and leveraging cloud‑native technologies to dramatically improve efficiency, reliability, and digital transformation.

DevOps · ITIL · Operations
7 min read
NetEase Game Operations Platform
Feb 15, 2020 · Databases

Using Flyway for Database Version Management: Principles, Configuration, and Best Practices

This article introduces Flyway as a database migration tool, explains its working principle, directory and naming conventions, supported databases, and provides detailed step‑by‑step instructions, best‑practice guidelines, and troubleshooting tips for safely managing MySQL schema changes in production environments.

DevOps · Flyway · SQL
13 min read
Efficient Ops
Feb 5, 2020 · Operations

Balancing Stability and Speed: Google SRE Lessons for Modern Ops Teams

This article examines the inherent tension between operations and development, explains Google’s error‑budget and SLO approach, and shares practical DevOps, on‑call, automation, and talent strategies that help ops teams improve efficiency while maintaining product reliability.
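
A minimal sketch of the error-budget arithmetic this summary refers to, assuming a request-based SLO. The function name and all numbers are hypothetical and do not come from the article.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Return (allowed_failures, remaining_failures) for a request-based SLO.

    slo_target is the fraction of requests that must succeed, e.g. 0.999.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1")
    # The budget is the share of requests the SLO permits to fail.
    allowed = (1 - slo_target) * total_requests
    return allowed, allowed - failed_requests


# Hypothetical month: a 99.9% SLO, one million requests, 400 failures so far.
allowed, remaining = error_budget(0.999, 1_000_000, 400)
print(f"allowed failures: {allowed:.0f}, remaining budget: {remaining:.0f}")
```

In Google's published SRE practice, an exhausted budget is the signal to prioritize stability work over new releases, which is how the metric mediates between development speed and operational stability.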

Automation · Error Budget · On-Call
9 min read
MaGe Linux Operations
Jan 31, 2020 · Operations

Balancing Stability and Speed: Lessons from Google SRE for Modern Ops

This article examines the tension between operations and development teams, explains Google's SRE error‑budget model, and shares practical reflections on engineering ops, on‑call rotation, automation, and talent development to achieve a sustainable balance between product stability and rapid innovation.

Automation · DevOps · Error Budget
8 min read
dbaplus Community
Dec 30, 2019 · Operations

How Alibaba’s ECS Team Built a Scalable SRE System for Massive Cloud Services

This article explains the origins of Site Reliability Engineering (SRE), outlines the responsibilities of SRE teams, and details Alibaba Cloud’s ECS SRE practices—including capacity planning, performance optimization, full‑stack stability governance, automated release pipelines, on‑call processes, and the core principles and mindset that guide modern SRE work.

Automation · Operations · SRE
28 min read
Efficient Ops
Dec 3, 2019 · Operations

How E‑commerce SRE Teams Tackle Scale, Cost, and Speed Challenges

The talk outlines the unique operational challenges of a fast‑growing e‑commerce platform—including massive scale, frequent changes, cost pressures, and the trade‑off between speed and stability—and describes how the SRE team uses automation, capacity planning, and process engineering to deliver reliable, efficient services.

SRE · e‑commerce
29 min read
Alibaba Cloud Developer
Nov 29, 2019 · Operations

How Alibaba’s ECS SRE Team Built a Rock‑Solid Cloud Infrastructure for 100% Cloud Migration

This article explains how Alibaba's Elastic Compute Service (ECS) SRE team tackled massive traffic, database bottlenecks, alert overload, and resource inconsistencies by establishing a full‑stack reliability organization, upgrading core components, automating pipelines, and instituting rigorous monitoring, incident response, and change‑management processes.

Operations · SRE · Site Reliability Engineering
27 min read
21CTO
Nov 15, 2019 · Operations

How SRE Designs Highly Available Software Systems at Scale

This article presents Google SRE expert Ramón Medrano Llamas’s comprehensive guide on designing, operating, and maintaining large‑scale, highly available software systems, covering SRE fundamentals, daily workflows, scalability strategies, fault‑tolerant architecture, monitoring, and operational best practices.

SRE · Scalable Systems · fault tolerance
13 min read
Efficient Ops
Oct 29, 2019 · Operations

How Xiami’s SRE Team Revamped Monitoring to Cut Alert Noise by 90%

Xiami’s SRE team overhauled its monitoring system by categorizing alerts, introducing fault, generic, and basic monitoring, optimizing alert paths with stream processing, and leveraging Alibaba’s traffic scheduling platform, dramatically reducing daily noise from thousands of alerts to a manageable few hundred critical notifications.

Alibaba · SRE · Traffic Scheduling
9 min read
Sohu Tech Products
Oct 23, 2019 · Operations

Google SRE Weekly Alert Limits and Practical Strategies for Reducing Alert Fatigue

This article examines how Google SRE limits weekly alerts to ten, compares it with typical Chinese internet operations teams, and provides practical strategies—including on‑call scheduling, alert escalation, automation, dashboard optimization, and team management—to dramatically reduce alert volume and improve incident response.

Alert Management · On-Call · Operations
15 min read
Efficient Ops
Oct 23, 2019 · Operations

Building Scalable Operations: From SRE to AIOps and DevOps

This article explores how to construct a scalable operations framework by integrating concepts such as SRE, DevOps, AIOps, and continuous improvement, addressing organizational challenges, process standardization, tool automation, and the shift from reactive firefighting to proactive, value‑driven management.

IT Management · SRE · Scalability
30 min read
dbaplus Community
Oct 16, 2019 · Operations

How to Cut Alert Noise: Practical SRE Strategies for Ops Teams

This article shares concrete SRE‑inspired techniques—duty‑roster scheduling, tiered alert handling, automation safeguards, dashboard focus on top‑3 alerts, time‑based filtering, and systematic code review—to dramatically reduce daily alarm volume while keeping on‑call teams motivated and effective.

On-Call · SRE · alert optimization
15 min read
dbaplus Community
Aug 26, 2019 · Operations

Boost Network Transparency: Automated Monitoring and Ops Tools for SREs

Network engineers often go unnoticed until outages, so this guide explains how to make network status transparent through device availability checks, log and traffic monitoring, SNMP error tracking, and automation scripts—leveraging Python, syslog servers, and northbound APIs—to reduce troubleshooting time and prevent incidents.

Network Monitoring · Python · SNMP
11 min read
DevOps
Jul 29, 2019 · Operations

Google’s Continuous Delivery Practices and SRE Culture: A DevOps Case Study

This article examines Google’s corporate values, development history, culture, and detailed DevOps and Site Reliability Engineering practices—including continuous delivery, SRE responsibilities, and Google Cloud Platform CI/CD tools—to illustrate how the company achieves 24/7 reliable service deployment at massive scale.

Continuous Delivery · DevOps · Google
15 min read
AntTech
Jun 12, 2019 · Operations

Alipay’s Technical Risk System: Building SRE, TRaaS, and AIOps for High Availability

The article details how Alipay’s technical risk team, led by researcher Chen Liang, evolved from early scalability work to a full‑stack SRE organization, created the TRaaS risk‑defense platform and integrated AIOps to achieve near‑five‑nine availability and automated self‑healing for its financial services.

SRE · TRaaS · aiops
12 min read
Efficient Ops
Feb 26, 2019 · Operations

How NetEase Solved 10 Years of Ops Challenges: From Scripts to Cloud-native Platforms

In this talk, senior NetEase operations engineer Gu Xianjie shares a decade‑long journey tackling technical debt, rapid product growth, and on‑call pain points, describing the evolution from manual scripts to automated platforms, service‑oriented tools, DevOps/SRE practices, and cloud‑native strategies that boosted efficiency and reliability.

SRE · platform engineering
17 min read
ITPUB
Jan 31, 2019 · Operations

Master Monitoring: Collect Metrics for New Systems Using White‑Box Techniques & the Four Golden SRE Indicators

This article explains how to approach monitoring for a newly introduced system by focusing on white‑box metric collection, distinguishing basic and business metrics, outlining common collection methods, and detailing Google SRE's four golden indicators (errors, latency, traffic, and saturation) to guide effective observability.

Metrics · Observability · Operations
10 min read
Didi Tech
Jan 7, 2019 · Operations

Data‑Driven Risk Quantification Platform for SRE at Didi

Didi’s data‑driven Risk Quantification Platform assigns numeric Change Credit and Monitoring Health scores to deployments, alerts and core services, turning operational best‑practice adoption into a competitive game that has raised scores, cut incident rates despite higher change volume, and paves the way for broader risk‑management across the organization.

Risk Quantification · SRE · data-driven operations
9 min read
JD Tech
Jan 3, 2019 · Operations

Comprehensive Monitoring Strategies for E‑commerce Platforms: Black‑Box and White‑Box Approaches

This article systematically explains how to enhance e‑commerce platform availability by implementing both black‑box monitoring to detect functional failures and white‑box monitoring to pinpoint root causes, detailing core order‑process metrics, common issues, mitigation strategies, and illustrative Grafana dashboards.

Grafana · Operations · SRE
9 min read
AntTech
Dec 19, 2018 · Information Security

Red‑Blue Technical Attack‑Defense Exercises and SRE Practices at Ant Financial

Ant Financial’s internal red‑blue technical attack‑defense program, driven by a dedicated blue team and SRE‑based red team, continuously probes system weaknesses, refines fault‑injection tools like Awatch, and evolves high‑availability and self‑healing mechanisms to strengthen risk control and operational reliability.

Fault Injection · Operations · SRE
10 min read
Continuous Delivery 2.0
Nov 19, 2018 · Operations

Google's Software Testing Transformation: Crisis, Leadership, and Organizational Mechanisms

The article analyzes how Google responded to a testing crisis by empowering a visionary leader, establishing supportive structures, encouraging innovation, and persisting over years to embed a quality‑centric culture that eventually led to decentralized testing, SRE adoption, and a shift toward test‑design engineers.

Google · SRE · Software Testing
7 min read
JD Tech
Nov 5, 2018 · Operations

Practical Guide to Elasticsearch Monitoring and Operations

This article provides a comprehensive, operations‑focused overview of Elasticsearch monitoring, covering tool selection, key metrics for black‑box and white‑box monitoring, common issues discovered through alerts, and practical optimization recommendations to ensure high availability of ES clusters.

Elasticsearch · SRE · tools
8 min read
21CTO
Aug 30, 2018 · Operations

Inside Google’s Production: How Requests Travel Through Its Massive Infrastructure

Google’s production environment spans a global edge network, massive data centers, sophisticated job scheduling with Borg, distributed storage systems like Bigtable and Spanner, and comprehensive monitoring, illustrating how user requests traverse multiple layers—from ISP to edge, GFE, load balancers, and finally to services.

Deployment · Google · Infrastructure
9 min read
Architecture Digest
Aug 29, 2018 · Operations

Google Production Environment: Network, Data Center, Cluster Management, Storage, Monitoring, and Deployment Workflow

The article explains Google’s end‑to‑end production infrastructure—including the edge network, data‑center hierarchy, Borg‑based cluster management, storage systems like Colossus and Spanner, monitoring with Borgmon, inter‑task RPC via Stubby, and the code‑to‑production pipeline using Piper, Blaze, Rapid, and Sisyphus—illustrating how requests travel from users to services in milliseconds.

Data center · Deployment · Google
10 min read
dbaplus Community
Jun 7, 2018 · Operations

Why Ceph’s Unlimited Scalability Isn’t As Simple As It Looks

The article examines Ceph’s claimed infinite scalability, cost advantages, and operational stability from an SRE perspective, comparing it with centralized systems like HDFS, and reveals practical challenges such as expansion granularity, crushmap rebalancing, utilization limits, and maintenance overhead.

Ceph · HDFS · Operations
15 min read
vivo Internet Technology
Jun 5, 2018 · Operations

DevOps International Summit 2024: Latest Practices and Technologies

The DevOps International Summit 2024 in Beijing, the sole China‑based global DevOps conference, brings together over 80 leading experts to showcase end‑to‑end practices—from Lean‑Agile, Continuous Delivery, SRE, and microservices to DevSecOps, AI‑driven tooling, and the new Research and Operations Integration Capability Maturity Model—through industry‑focused tracks, hands‑on training, and real‑world case studies across finance, telecom, retail and more.

Continuous Delivery · DevOps · DevOps Summit
3 min read
dbaplus Community
May 8, 2018 · Operations

How to Build Reliable Operations: From BCM to Google SRE Practices

This article examines the growing challenges of system availability in modern operations, explains the concept of availability and the “N nines” metric, introduces Business Continuity Management and Google SRE approaches, and provides concrete technical and managerial methods, including architecture standardization, scaling strategies, tooling, emergency drills, and centralized incident management, to improve operational reliability.
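
To make the availability metric mentioned in this summary concrete, the sketch below converts a number of nines into the downtime it allows per year. The function name and the 365-day year are assumptions for illustration, not details from the article.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a non-leap year


def downtime_minutes_per_year(nines):
    """Allowed downtime per year for an availability of N nines (3 means 99.9%)."""
    if nines < 1:
        raise ValueError("nines must be at least 1")
    # N nines of availability leaves a 10^-N fraction of the year as downtime.
    return MINUTES_PER_YEAR * 10 ** (-nines)


for n in (2, 3, 4, 5):
    print(f"{n} nines: {downtime_minutes_per_year(n):.1f} minutes/year")
```

Each additional nine cuts the permitted downtime by a factor of ten, which is why every extra nine is progressively harder to achieve.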

Availability · BCM · Operations
30 min read
Snowball Engineer Team
Jan 12, 2018 · Operations

RDR: An Open-Source Tool for Visualizing and Analyzing Redis Memory Usage

This article introduces RDR, an open-source visualization platform developed by Xueqiu's SRE team to safely and efficiently analyze Redis memory consumption by parsing RDB files, estimating key-level memory usage based on internal data structures, and generating intuitive statistical reports for operational optimization.

Memory analysis · Operations · RDB Parsing
9 min read
Efficient Ops
Jan 11, 2018 · Operations

Mastering Incident Troubleshooting: Proven SRE Strategies for Operations

This article shares practical SRE‑based principles for diagnosing and resolving online incidents, emphasizing systematic investigation, gathering clues, and prioritizing service restoration over immediate root‑cause identification to make troubleshooting less mystical and more effective.

Operations · SRE · incident management
7 min read
Meituan Technology Team
Dec 1, 2017 · Operations

Cloud SRE Development and Practice

The Meituan‑Dianping Technology Salon Online offers a recurring live‑streamed course where SRE experts, led by Zuo Pucun, discuss the challenges of high growth and concurrency, the evolution from firefighting to proactive stability, service availability, user‑experience optimization, and future automation in cloud SRE practice.

Meituan-Dianping · SRE · Stability Assurance
3 min read
MaGe Linux Operations
Nov 19, 2017 · Operations

Which DevOps Team Topology Fits Your Organization? A Practical Guide

This article examines common DevOps team structures and anti‑patterns, explains how product portfolio, leadership, and organizational readiness influence the choice of topology, and presents nine practical models—from collaborative teams to SRE and container‑driven approaches—to help you select the most effective structure for your business.

Collaboration · SRE · Team Topology
19 min read
Qunar Tech Salon
Oct 26, 2017 · Operations

Evolution of Pinterest's Monitoring System: From Time-Series Metrics to Distributed Tracing

Over seven years, Pinterest’s monitoring team built and refined a three‑pronged observability platform—time‑series metrics, log search, and distributed tracing—scaling from a single‑machine system to handling millions of data points per second across tens of thousands of AWS VMs, while addressing reliability, cost, and usability challenges.

Distributed Tracing · Observability · SRE
19 min read
Efficient Ops
Oct 24, 2017 · Operations

How Pinterest Scaled Its Monitoring, Logging, and Tracing Over Seven Years

This article chronicles Pinterest's seven‑year evolution from a single‑machine time‑series monitor to a multi‑component system that integrates metrics, log search, and distributed tracing, sharing architectural choices, scaling challenges, and lessons learned for building reliable, high‑performance operations platforms.

Distributed Tracing · Operations · SRE
0 likes · 24 min read
Efficient Ops
Sep 27, 2017 · Operations

From Ops to SRE: What Google’s Site Reliability Model Means for Your Team

The article reflects on the shift from traditional operations to Site Reliability Engineering (SRE), comparing Google’s SRE practices with those of a Chinese cloud provider, and explores infrastructure, tooling, team structure, and cultural challenges while drawing practical lessons for engineers.

DevOps · Google · SRE
0 likes · 19 min read
Efficient Ops
Aug 29, 2017 · Operations

From ITIL to SRE: How Vipshop Transformed Its Operations

This article recounts Vipshop’s journey from a traditional ITIL‑based operations model to an SRE‑inspired, automated workflow, detailing the construction of ITIL processes, the challenges faced, the shift toward automation, and personal insights on managing people, quality, and change.

DevOps · ITIL · SRE
0 likes · 20 min read
Efficient Ops
Aug 28, 2017 · Operations

Can Ops Teams Become Agile? A Practical Kanban Journey

This article explores how operations teams can adopt agile principles—especially Kanban—to address common challenges such as delayed feedback, task overload, and hidden risks, demonstrating a step‑by‑step transformation within the DevOps lifecycle.

DevOps · Kanban · Lean
0 likes · 28 min read
Efficient Ops
Jul 25, 2017 · Operations

Why Google’s SRE Model Matters: Lessons for Modern Ops Teams

This article explains the origins, responsibilities, and team structures of Google Site Reliability Engineering (SRE), compares it with traditional operations roles in companies like Yahoo, Alibaba, and Facebook, and offers practical guidance for building effective SRE or application‑operations teams today.

DevOps · SRE · Site Reliability Engineering
0 likes · 25 min read
Efficient Ops
Jun 10, 2017 · Operations

What Google’s SRE Book Reveals About Modern Operations

This article introduces the Chinese translation of Google’s SRE book, shares behind‑the‑scenes stories of its creation, and distills key concepts such as the AAA model, Borg architecture, SLOs, toil reduction, and the cultural shift required for reliable large‑scale services.

DevOps · Google · Infrastructure
0 likes · 20 min read
ITPUB
Jun 9, 2017 · Operations

Mastering Effective Monitoring: From Basics to the USE Method

This article explains the fundamentals of monitoring, distinguishes traditional OPS from SRE perspectives, defines monitoring objects and metrics, introduces quantitative thinking with SLI/SLO, and presents the USE method with a MySQL example to help engineers detect and prevent failures efficiently.

Metrics · Operations · SLI
0 likes · 10 min read
Efficient Ops
May 25, 2017 · Operations

How a Bank Transformed IT Ops with Automated DevOps and SRE Practices

This article outlines how China Merchants Bank’s data‑center application management team identified traditional financial IT operational pain points, introduced DevOps and SRE concepts, built non‑functional management frameworks, and implemented automated tooling, monitoring, and capacity‑scaling to achieve fully automated operations.

DevOps · IT Operations · Performance Scaling
0 likes · 24 min read
ITPUB
May 15, 2017 · Operations

Mastering Online Incident Management: From Detection to Prevention

This article outlines a comprehensive methodology for handling large‑scale online service incidents, covering goals, the "jump‑fill‑avoid" framework, step‑by‑step processes for detection, diagnosis, remediation, and post‑mortem analysis, as well as essential monitoring, logging, and escalation infrastructure.

Operations · SRE · incident management
0 likes · 18 min read
360 Zhihui Cloud Developer
Apr 6, 2017 · Operations

How SRE’s Dialectical Thinking Redefines Modern Operations

An insightful reflection on Google’s SRE philosophy shows how dialectical thinking—questioning absolute stability, embracing limited toil, prioritizing simple monitoring, recognizing automation’s hidden risks, and practicing real‑world failure drills—can reshape operations, encouraging smarter, more resilient system design.

Automation · Reliability · SRE
0 likes · 7 min read
Efficient Ops
Mar 26, 2017 · Operations

How Google Scales App Engine: Lessons in Cloud Scalability and SRE

The article shares Google SRE veteran Minghua Ye’s insights on App Engine’s evolution, emphasizing the critical role of automatic scalability, distributed locks, service discovery, load balancing, and open‑source tools like gRPC, Protobuf, gflags, glog, and Googletest in building reliable, high‑traffic cloud services.

Distributed Systems · Google App Engine · Protobuf
0 likes · 12 min read
Efficient Ops
Mar 21, 2017 · Operations

Rethinking Operations: The “Third Kind” of SRE at Lianjia

The article shares the author’s experience transitioning from private to public and hybrid clouds at Lianjia, introduces a “third kind” of operations that blends traditional and internet‑based practices, and discusses containers, DNS‑based naming, and automation tools to build adaptable, cost‑effective infrastructure.

Infrastructure · Naming Service · SRE
0 likes · 21 min read
High Availability Architecture
Mar 15, 2017 · Operations

Highlights from SRECon17 Americas in San Francisco

The article reports on the SRECon17 Americas conference in San Francisco, summarizing keynote talks, panel sessions, and practical insights from industry leaders such as Stripe, Netflix, Google, and IBM on topics ranging from traffic control and container management to on‑call practices and cost considerations for Site Reliability Engineering.

DevOps · Google · Netflix
0 likes · 6 min read
Ctrip Technology
Dec 9, 2016 · Operations

Design and Implementation of Ctrip Call Center's Active‑Active Architecture and Unified Login

The article details Ctrip's call‑center architecture evolution, describing the multi‑layer active‑active design, public access, application and client layers, unified login mechanisms, operational challenges, disaster‑recovery drills, and future plans for software‑only and mobile agents, illustrating practical SRE principles in a large‑scale telephony system.

Active-Active · IP phone · SRE
0 likes · 22 min read
Efficient Ops
Dec 4, 2016 · Operations

How Ctrip Built a Seamless Multi‑Region Dual‑Active Call Center

This article details Ctrip's evolution from a single‑site call‑center to a fully dual‑active, multi‑region architecture, covering the overall system design, public network, application, and client layers, unified login mechanisms, heartbeat monitoring, and future software‑only and mobile‑first directions.

Dual-Active · Operations · SRE
0 likes · 27 min read
Efficient Ops
Nov 7, 2016 · Operations

How to Train New SREs Effectively: Proven Practices and Playbooks

This article outlines a systematic approach to onboarding and training new Site Reliability Engineers, covering trust building, readiness assessment, diverse learning methods, structured curricula, on‑call milestones, project‑focused work, reverse‑engineering skills, statistical thinking, and improvisation techniques to develop high‑performing SRE teams.

On-Call · Operations · SRE
0 likes · 17 min read
Efficient Ops
Oct 30, 2016 · Operations

How Google Music Recovered 1.5 PB of Lost Data After a Massive Deletion Bug

In March 2012, a privacy‑driven deletion pipeline mistakenly erased hundreds of thousands of Google Music files, prompting SREs to launch a massive data‑recovery effort that involved MapReduce impact analysis, tape‑based backups, and a complete redesign of the deletion system.

Data Recovery · Google Music · Large-Scale Deletion
0 likes · 14 min read
Efficient Ops
Oct 26, 2016 · Operations

From Sysadmin to Google SRE: How Modern Ops Teams Can Thrive

This article compares traditional system administration with Google’s Site Reliability Engineering, explaining why enterprises are shifting from cost‑center SLA focus to data‑driven, user‑experience‑oriented operations, and offers practical steps for teams to adopt automation, cloud platforms, and risk‑aware practices.

SRE
0 likes · 14 min read
Efficient Ops
Oct 23, 2016 · Operations

How Google’s SRE Postmortems Drive System Reliability

This article explains Google’s SRE postmortem philosophy, the criteria for writing postmortems, best practices for a blame‑free culture, and how collaborative knowledge‑sharing and incentives improve incident handling and overall system reliability.

Operations · SRE · incident management
0 likes · 14 min read
Efficient Ops
Oct 5, 2016 · Operations

5 Must‑Read Ops ‘Prescriptions’ to Boost Your Infrastructure Skills

This article curates five technical reads—covering network operations, Google’s production environment, massive cost‑saving strategies, IDC automation, and Docker‑based RDS—each presented as a “medicine” with a brief description and a link for deeper insight.

Cost Optimization · Docker · Operations
0 likes · 5 min read
Efficient Ops
Sep 18, 2016 · Operations

Who Was the World’s First SRE? Uncovering Margaret Hamilton’s Legacy

This article explores the origins of Site Reliability Engineering, highlights Margaret Hamilton as the likely first SRE through her work on NASA’s Apollo program, and draws lessons on reliability, disaster prevention, and the evolution of modern SRE practices.

Apollo program · Margaret Hamilton · SRE
0 likes · 10 min read
Efficient Ops
Sep 13, 2016 · Operations

How Google SRE Principles Compare Across Industries

This article, excerpted from the upcoming Chinese edition of “SRE: Google Site Reliability Engineering”, examines how Google’s SRE guiding philosophies (disaster planning, post‑mortem culture, automation, and data‑driven decision‑making) are adopted, adapted, or contrasted in manufacturing, aerospace, nuclear, telecommunications, healthcare, and finance. It highlights key similarities and differences, and draws lessons for Google and the broader tech industry.

Automation · Operations · SRE
0 likes · 21 min read
MaGe Linux Operations
May 27, 2016 · Operations

Why Google Relies on Software Engineers to Run Its Services: Inside SRE

The article explains Google’s Site Reliability Engineering (SRE) philosophy, how it empowers software engineers to automate operations, the balance between development and reliability, the concept of error budgets, and the cultural shift that turned DevOps into a core practice for large‑scale services.

DevOps · Error Budget · Operations Automation
0 likes · 10 min read