Tagged articles
403 articles
DevOps Cloud Academy
Dec 31, 2022 · Operations

Google Site Reliability Engineering (SRE) Principles and Engagement Model

The article explains Google’s Site Reliability Engineering (SRE) team, its mission to balance reliability and velocity through automation, the engagement model with development teams, funding principles, and a set of guiding principles that shape how SRE collaborates, scopes, and delivers value across services.

Engagement Model · Google · Reliability
0 likes · 29 min read
Bilibili Tech
Dec 30, 2022 · Operations

Self-Developed HTTPDNS Service: Cost Estimation, Architecture, Optimization, and Lessons Learned

To cut the hundreds‑of‑thousands‑yuan monthly bill of a commercial HTTPDNS service, the team built a multi‑region, self‑hosted HTTPDNS platform, estimated to slash costs by up to 90%, then resolved unexpected TLS bandwidth waste by improving connection reuse, ultimately achieving over 80% savings and planning a hybrid‑cloud deployment.

BGP · Cost Optimization · Domain Hijacking
0 likes · 12 min read
Efficient Ops
Dec 19, 2022 · Operations

How Tencent CDN Achieves Seamless Business Continuity with AI‑Powered SRE

This article details Tencent CDN's challenges and solutions for business continuity, covering bandwidth and device resource constraints, massive request handling, fault‑management lifecycle, automation bottlenecks, and the implementation of AIOps, intelligent alerts, capacity planning, and root‑cause analysis to ensure reliable service.

Automation · CDN · Operations
0 likes · 21 min read
Efficient Ops
Dec 12, 2022 · Operations

How Bilibili Built a 5‑Year SRE Journey: High‑Availability, Multi‑Active, and Capacity Management

This article chronicles Bilibili's five‑year evolution of Site Reliability Engineering, detailing the introduction of SRE culture, the construction of high‑availability and multi‑active architectures, capacity management with Kubernetes, VPA/HPA, incident case studies, and the ongoing transformation of SRE practices across the organization.

Kubernetes · Operations · SRE
0 likes · 24 min read
dbaplus Community
Nov 28, 2022 · Operations

How Bilibili Guaranteed Seamless Live Streaming for the League of Legends S12 Finals

Bilibili’s S12 technical guarantee team coordinated dozens of engineering groups, performed resource estimation, built a shared resource pool, applied chaos engineering, high‑availability architecture, and systematic performance testing to ensure the League of Legends World Championship livestream remained stable and responsive under peak traffic.

Performance Testing · Resource Management · SRE
0 likes · 19 min read
Tencent Architect
Nov 28, 2022 · Operations

How Tencent CDN Achieves Business Continuity with Intelligent Operations

This article details Tencent CDN's extensive business continuity challenges—including bandwidth, device resources, and massive request volumes—and explains how a fault‑management lifecycle, AIOps components, intelligent alerting, and automated capacity planning together enable resilient, automated operations.

CDN · Intelligent Operations · SRE
0 likes · 17 min read
Bilibili Tech
Nov 19, 2022 · Operations

Technical Assurance for High‑Write Live‑Streaming Gift Scenarios

The technical‑assurance team secured Bilibili’s high‑write live‑stream gift system by expanding capacity, isolating hot keys, refactoring pipelines, adding asynchronous writes, employing horizontal scaling and full‑link load testing, converting uncertain dependencies into graceful fallbacks, and deploying dual‑active, chaos‑engineered disaster‑resilience architecture aligned with business usage patterns.

SRE · capacity planning · database scaling
0 likes · 16 min read
21CTO
Nov 15, 2022 · Operations

Mastering SRE: How to Define SLIs, SLOs, SLAs and Build Reliable Systems

This article explains how SRE teams should define Service Level Indicators, Objectives and Agreements, manage reliability, performance, saturation and observability, use proper metrics and tracing, handle error budgets, assess risks, and implement effective incident and project management to create robust, cloud‑native services.

Error Budget · Observability · Reliability
0 likes · 14 min read
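
As a rough illustration of the error-budget idea named in the summary above, the sketch below converts an availability SLO into allowed downtime and reports how much budget a request-based SLI has left. It is not taken from the article: the function names `error_budget_minutes` and `budget_remaining`, the 30-day window, and the sample numbers are assumptions chosen for illustration.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Downtime (in minutes) that an availability SLO tolerates over the window."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in (0, 1]")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


def budget_remaining(slo: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent for a request-based SLI.

    1.0 means nothing has been spent; 0.0 or less means the budget is exhausted.
    """
    if total_events == 0:
        return 1.0  # no traffic, so no budget consumed
    allowed_failures = (1.0 - slo) * total_events
    actual_failures = total_events - good_events
    if allowed_failures == 0:
        # A 100% SLO leaves no budget: any failure exhausts it.
        return 0.0 if actual_failures else 1.0
    return 1.0 - actual_failures / allowed_failures


if __name__ == "__main__":
    # Hypothetical numbers: a 99.9% SLO and one million requests, 500 of them failed.
    print(round(error_budget_minutes(0.999), 2))
    print(round(budget_remaining(0.999, 999_500, 1_000_000), 3))
```

A team following this pattern would typically slow down risky releases once the remaining fraction approaches zero; the exact policy is a local decision.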
Bilibili Tech
Nov 15, 2022 · Operations

Technical Assurance for Bilibili S12 Live Streaming Event: Architecture, Resource Management, and High Availability

To ensure “tea‑time” reliability for Bilibili’s 2022 S12 League of Legends championship, a cross‑functional technical‑assurance project introduced shared resource pools, CPUSET removal, multi‑instance HA architecture, adaptive throttling, chaos‑engineered fault injection, a new Golang gateway, extensive load testing, and coordinated on‑site duty, delivering uninterrupted live streaming without forced throttling.

SRE · chaos engineering · high availability
0 likes · 20 min read
NetEase Yanxuan Technology Product Team
Nov 14, 2022 · Operations

Quantifying Internet Service Availability: Classic Metrics and the New User‑Uptime Indicator

The article reviews classic availability metrics such as Success‑Ratio, Incident‑Ratio, MTTR/MTTF, Error‑Budget, and SLA/SLO, then introduces User‑Uptime—a per‑user success time proportion that ignores long idle periods—and its windowed variant, showing how it complements existing indicators for more user‑centric reliability insight.

Availability · Reliability · SRE
0 likes · 27 min read
Ops Development Stories
Oct 26, 2022 · Operations

Is SRE a Team Mindset? Unlocking Stable Services Beyond the Title

The article explains that SRE, introduced by Google, is not a single specialist but a collaborative mindset requiring product, development, testing, operations, and architecture skills, and argues that even small‑scale teams can achieve stability by embracing these principles despite common misconceptions.

Cloud Native · SRE · stability
0 likes · 4 min read
Cloud Native Technology Community
Oct 19, 2022 · Industry Insights

What Sets Platform Engineering Apart from DevOps and SRE?

The article clarifies the distinctions between platform engineering, DevOps, and SRE, explaining their origins, common misconceptions, challenges such as shadow operations and developer cognitive load, and how platform engineering builds on these practices to deliver self‑service internal developer platforms that improve productivity and reliability.

DevOps · Internal Developer Platform · Operations
0 likes · 10 min read
MaGe Linux Operations
Oct 5, 2022 · Operations

Why Clear Responsibility Boundaries Are Crucial for DevOps and SRE Success

Clear responsibility boundaries are essential for effective DevOps and SRE workflows: they prevent endless fire‑fighting, align software and organizational structures, and ensure sustainable delivery across business stages. The article also highlights how team size and expertise affect process clarity and when to choose a DevOps‑first or an SRE‑guided approach.

SRE · responsibility boundaries · workflow
0 likes · 6 min read
NetEase Yanxuan Technology Product Team
Sep 26, 2022 · Operations

How to Tame Alert Storms: Building a Systematic Monitoring and Alerting Framework for Microservices

This article analyzes the challenges of alert overload in large‑scale microservice environments and presents a systematic approach—including timeliness metrics, a maturity model, lifecycle tracking, feedback loops, downgrade mechanisms, and cross‑service aggregation—to improve alert effectiveness and reduce noise.

Alert Management · MTTR · Microservices
0 likes · 16 min read
Architects Research Society
Sep 16, 2022 · Operations

Building a Reliability Culture: Practices, Benefits, and Implementation

This article explains what a reliability culture is, why it matters, how to cultivate it through mission statements, early‑stage reliability testing, chaos‑engineering practices like GameDays and FireDrills, and how organizations can continuously learn from incidents to improve system availability and customer trust.

Culture · Operations · Reliability
0 likes · 18 min read
Bilibili Tech
Sep 9, 2022 · Operations

Bilibili SRE's Stability Practices and Reflections

At the 2022 GOPS Global Operations Conference in Shenzhen, Bilibili’s infrastructure SRE lead Wu Anchuang unveiled the company’s comprehensive stability framework—detailing its SRE transformation, high‑availability architecture, active‑active disaster‑recovery, capacity planning, and event‑support strategies—marking the first public disclosure of these practices.

Bilibili · SRE · activity assurance
0 likes · 1 min read
Zuoyebang Tech Team
Aug 26, 2022 · Operations

How We Built a Three‑Layer Stability System for Massive Scale Operations

This article details the operational mindset, stability framework, and transformation journey of the Zuoyebang infrastructure team, covering service lifecycle management, standardization, cloud‑native architecture, multi‑active deployment, incident pre‑plan platforms, traffic scheduling, monitoring, capacity planning, and future directions for SRE service‑orientation.

Automation · Infrastructure · Operations
0 likes · 20 min read
Cloud Native Technology Community
Aug 18, 2022 · Operations

Understanding DevOps: Integrating Development and Operations Beyond the ‘Who Develops Who Operates’ Myth

The article clarifies common misconceptions about DevOps, explains that true development‑operations integration relies on dedicated ops teams, automation tools, standardized delivery artifacts, and unified permission management rather than developers performing ops tasks, and highlights Google SRE practices as a practical guide.

Automation · DevOps · Infrastructure
0 likes · 10 min read
ITPUB
Aug 5, 2022 · Operations

How a Lua Bug Crashed Bilibili’s Load Balancer and What We Learned

On July 13, 2021, Bilibili’s OpenResty‑based SLB suffered a CPU‑100% outage caused by a Lua _gcd function bug triggered when a service’s weight was set to the string “0”, leading to a multi‑hour incident that was resolved by rebuilding SLB clusters and disabling JIT compilation.

Load Balancer · OpenResty · SRE
0 likes · 17 min read
Bilibili Tech
Aug 2, 2022 · Operations

Lessons Learned from Implementing SLOs at Bilibili: Practices, Pitfalls, and Reflections

Bilibili adopted Google‑SRE SLO practices—selecting SLIs, defining availability and latency targets, grading services, and tracking error budgets—but encountered costly grading inconsistencies, hidden error detection, and inaccurate business‑level metrics, leading them to realize SLOs are chiefly valuable for early alerting rather than exhaustive reporting.

Cloud Native · Error Budget · Operations
0 likes · 21 min read
DevOps
Jul 25, 2022 · Operations

Understanding the Role and Responsibilities of Site Reliability Engineering (SRE)

This article provides a comprehensive overview of Site Reliability Engineering, explaining its origins, core responsibilities across infrastructure, platform, and business layers, daily tasks such as deployment, on‑call duties, SLI/SLO management, incident post‑mortems, capacity planning, and user support, as well as career advice for aspiring SREs.

Infrastructure · Oncall · Reliability
0 likes · 21 min read
Big Data Technology Architecture
Jul 14, 2022 · Operations

Postmortem of Bilibili SLB Outage on July 13, 2021

This postmortem details the July 13, 2021 Bilibili outage caused by a Lua‑induced CPU 100% bug in the OpenResty‑based SLB, describing the incident timeline, root‑cause analysis, mitigation steps, and the subsequent technical and process improvements to enhance reliability and multi‑active deployment.

Incident · Load Balancer · Lua
0 likes · 16 min read
dbaplus Community
Jul 12, 2022 · Operations

How a Lua Bug Crashed Bilibili’s Load Balancer and What We Learned

In July 2021 a sudden CPU‑100% spike in Bilibili's OpenResty‑based SLB caused widespread service outages, prompting an emergency response that rebuilt load‑balancer clusters, traced a Lua _gcd function bug triggered by a zero weight string, and led to extensive operational and architectural improvements.

Cloud Native · Lua · OpenResty
0 likes · 17 min read
Bilibili Tech
Jul 12, 2022 · Operations

Bilibili SLB Outage Postmortem (July 13, 2021): Timeline, Root Cause, and Improvements

On July 13, 2021, Bilibili’s L7 SLB crashed when a recent Lua deployment set a balancer weight to the string “0”, producing a NaN value that triggered an infinite loop and 100% CPU usage. The response involved emergency restarts and a fresh cluster rollout, followed by long‑term safeguards such as automated provisioning, stricter Lua validation, and enhanced multi‑active disaster‑recovery processes.

Load Balancer · Root Cause Analysis · SLB
0 likes · 17 min read
DevOps
Jul 8, 2022 · Operations

Nine Essential Skills Every Modern Site Reliability Engineer Should Master

The article outlines the nine core competencies—network expertise, Linux/Unix knowledge, cloud computing, CI/CD pipelines, QA automation, security engineering, DevOps, incident management, and post‑incident review—that enable SREs to ensure the availability, performance, and reliability of complex distributed systems.

DevOps · SRE · Site Reliability Engineering
0 likes · 6 min read
dbaplus Community
Jul 4, 2022 · Operations

Why Most Monitoring Systems Fail: Lessons from a Veteran Ops Engineer

A seasoned operations professional shares personal experiences and hard‑earned insights on why traditional monitoring often becomes ineffective, how over‑automation and noisy dashboards hurt teams, and what a capability‑focused, user‑centric approach to observability should look like.

Observability · Operations · SRE
0 likes · 12 min read
Architecture Talk
Jun 27, 2022 · Operations

Why Build an SRE System? A Complete Guide to Site Reliability Engineering

This article explains the motivations behind Site Reliability Engineering (SRE), outlines its strategic goals, defines key concepts such as SLI, SLO, SLA and error budget, introduces the four golden metrics for monitoring distributed systems, and provides practical guidance on building, operating, and continuously improving an SRE practice.

Error Budget · SLI · SLO
0 likes · 14 min read
HaoDF Tech Team
Jun 21, 2022 · Operations

Evolution and High‑Availability Construction of the Haodafu Offline Message Push System

This article describes how the Haodafu offline push service grew from a simple PHP notification tool into a robust, highly‑available micro‑service platform by redesigning architecture, adopting vendor push channels, adding message‑queue reliability, implementing comprehensive monitoring, observability, and a fault‑diagnosis platform to ensure delivery rates and operational stability.

Mobile Backend · Observability · SRE
0 likes · 21 min read
Bilibili Tech
Jun 21, 2022 · Cloud Native

Evolution of SRE in the Cloud‑Native Era – Insights from Industry Experts

Industry experts from Zhejiang Mobile, Bilibili, and Xiaomi discuss how SRE has evolved in the cloud‑native era, sharing concrete frameworks, observability practices, and cost‑focused platforms while emphasizing stability, metrics, on‑call processes, and the need to adapt Google’s model to real‑world product and operational contexts.

Cloud Native · DevOps · SRE
0 likes · 31 min read
Efficient Ops
Jun 19, 2022 · Operations

How Bilibili SRE Guarantees Million‑User Live Events: Strategies, Tools, and Lessons

This article details Bilibili's SRE approach to large‑scale live events, covering background, activity scenarios, resource planning, performance testing, chaos‑engineering drills, technical safeguards such as DCDN, SLB, WAF, PaaS, cache and DB, pre‑plan capabilities, post‑mortem analysis, and future outlook, illustrating how systematic capacity management and automated resilience practices enable stable operation for events with tens of millions of concurrent users.

Performance Testing · SRE · capacity planning
0 likes · 22 min read
ITPUB
Jun 18, 2022 · Operations

How MDD and SRE Cut Mini‑Program Image‑Upload Failures from Days to Minutes

This article recounts a three‑day image‑upload outage in a mini‑program, analyzes the multi‑layer causes, and shows how combining Metrics‑Driven Development with SRE and a custom observability platform dramatically reduces diagnosis time and improves reliability.

Metrics-Driven Development · Mini Program · Observability
0 likes · 20 min read
Continuous Delivery 2.0
Jun 17, 2022 · Operations

Addressing SRE Overload: Causes and Mitigation Strategies

The article examines why SRE teams experience overload due to high incident response demands, analyzes contributing factors such as production issues, alert volume, and manual processes, and proposes comprehensive mitigation steps including better testing, load management, and proactive error detection to reduce on‑call burden.

SRE · incident response · overload
0 likes · 5 min read
Bilibili Tech
Jun 14, 2022 · Operations

SRE Practices for Large‑Scale Event Assurance at Bilibili

Bilibili’s SRE team ensures flawless large‑scale online events by meticulously gathering activity details, provisioning DNS, CDN, networking and compute resources, conducting multi‑stage performance tests and chaos‑engineering drills, applying layered traffic controls, maintaining historical checklists, executing predefined contingency responses, and iterating post‑mortems to drive continuous automation and reliability.

Cloud Native · Event Reliability · Operations
0 likes · 20 min read
dbaplus Community
Jun 13, 2022 · Operations

How We Built a Mini‑Program Observability Platform to Slash Incident Resolution Time

After a three‑day, ten‑person investigation into a mini‑program image‑upload failure, we designed and implemented an end‑to‑end observability platform using MDD and SRE principles, defining SLI/SLO, instrumenting client, network, gateway and backend layers, and visualizing metrics with Grafana, ClickHouse and Prometheus.

Grafana · MDD · Metrics
0 likes · 18 min read
DevOps
May 18, 2022 · Operations

Understanding and Preventing Cascading Failures in Distributed Systems

The article explains how cascading failures arise from positive feedback loops in distributed systems, illustrates real‑world incidents such as the 2015 DynamoDB outage, outlines anti‑patterns like unlimited retries and unchecked load, and presents practical mitigation techniques including load‑shedding, circuit breakers, exponential back‑off, and controlled replication to improve system resilience.

Distributed Systems · Resilience · SRE
0 likes · 19 min read
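
The summary above lists unlimited retries as an anti-pattern and exponential back-off as a mitigation. A minimal sketch of capped retries with exponential back-off and "full jitter" might look like the following. It is an illustration only, not the article's code: `backoff_delay`, `call_with_retries`, and all default values are hypothetical.

```python
import random
import time


def backoff_delay(attempt, base=0.1, cap=10.0, rng=random.random):
    """Delay before retry number `attempt` (0-based).

    The upper bound doubles on each attempt and is capped; the actual delay is
    drawn uniformly below that bound so that clients do not retry in lockstep.
    """
    return rng() * min(cap, base * (2 ** attempt))


def call_with_retries(operation, max_attempts=5, base=0.1, cap=10.0,
                      sleep=time.sleep, rng=random.random):
    """Call `operation` until it succeeds or the attempt limit is reached."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # retry limit reached: surface the failure to the caller
            sleep(backoff_delay(attempt, base, cap, rng))


if __name__ == "__main__":
    calls = {"n": 0}

    def flaky():
        # Hypothetical dependency that fails twice, then recovers.
        calls["n"] += 1
        if calls["n"] < 3:
            raise ConnectionError("temporary failure")
        return "ok"

    print(call_with_retries(flaky, base=0.01))
```

Bounding the number of attempts matters as much as the delay itself: without a limit, retries keep adding load to a dependency that is already struggling.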
Bilibili Tech
Apr 26, 2022 · Operations

Bilibili's SRE Practice for Business Stability: Theory, Metrics, and Operational Implementation

Bilibili’s SRE team combines stability theory, detailed fault‑stage and operational metrics, and a unified emergency‑response platform—including on‑call scheduling, fault‑command incident commanders, automated fault portraits, and rapid post‑mortems—to transform frequent incidents into data‑driven, collaborative recoveries and lay groundwork for AI‑assisted self‑healing.

Business Stability · Metrics · Oncall
0 likes · 23 min read
IT Architects Alliance
Apr 17, 2022 · Operations

Understanding the SRE Role: Responsibilities, Types, and Practices

This article explains what Site Reliability Engineering (SRE) is, why it was created, the challenges in hiring SREs, and breaks the role into three layers—Infrastructure, Platform, and Business—detailing their duties, deployment processes, on‑call practices, SLI/SLO management, incident post‑mortems, capacity planning, user support, and career advice.

Infrastructure · Oncall · Operations
0 likes · 21 min read
Architect
Apr 16, 2022 · Operations

A Comprehensive Overview of Site Reliability Engineering (SRE) Roles and Practices

This article explains what SRE is, why it was created, how its responsibilities differ across companies, and breaks the work into Infrastructure, Platform, and Business SRE while covering deployment, on‑call, SLI/SLO, incident post‑mortems, capacity planning, user support, and career advice.

Oncall · Operations · SLI/SLO
0 likes · 22 min read
DevOps
Apr 12, 2022 · Operations

Understanding Observability: Core Concepts, SRE Methodology, AIOps, and Business Architecture

The article explains the rising importance of observability in modern operations, defines its control‑theory roots, breaks it down into metrics, traces and logs, and argues that successful implementation requires three pillars—SRE practices, AIOps algorithms, and deep business‑architecture knowledge—together with well‑designed SLOs and critical‑path mapping.

Observability · SRE · aiops
0 likes · 10 min read
dbaplus Community
Apr 10, 2022 · Operations

How to Build a Practical SRE Operations Framework for Large‑Scale Systems

This article presents a hands‑on SRE framework covering the full product lifecycle—code development, resource planning, deployment, operational reliability, and decommissioning—derived from real‑world practices at Xiaomi and Sina to help teams manage massive internet services efficiently and cost‑effectively.

Resource Management · SRE · System Lifecycle
0 likes · 16 min read
HaoDF Tech Team
Mar 29, 2022 · Operations

Building an Observability Platform for Mini‑Program Image Uploads Using SRE and Metrics‑Driven Development

The article describes how a three‑day, cross‑team investigation of a mini‑program image‑upload failure led to the design and implementation of an SRE‑driven, metrics‑driven observability platform that quantifies SLIs, automates tracing, and provides dashboards for real‑time and long‑term analysis, ultimately reducing MTTR.

Backend · Metrics-Driven Development · Mini-Program
0 likes · 17 min read
Selected Java Interview Questions
Feb 25, 2022 · Operations

How Every Line of Code Impacts Performance: Lessons from Optimizing Legacy Web Applications

After a sudden traffic surge exposed severe latency in a 15‑year‑old multi‑service web system, the team identified a DB connection leak in a health‑check probe, and through load testing, monitoring, logging, and dedicated SRE effort, they derived key operational lessons on performance optimization and maintenance of legacy applications.

Load Testing · SRE · database connections
0 likes · 9 min read
IT Architects Alliance
Feb 15, 2022 · Operations

What Real-World Performance Tuning Taught Us About Legacy Web Apps

After a traffic surge exposed severe latency in a 15-year-old multi-service web platform, we used monitoring to discover a DB-connection leak caused by a liveness probe, corrected it, and distilled four practical lessons on latency metrics, tooling, legacy maintenance, and code vigilance.

APM · Load Testing · Operations
0 likes · 9 min read
Top Architect
Feb 13, 2022 · Operations

Performance Optimization Lessons from a Legacy Web Application: Monitoring, Load Testing, and Maintaining Old Systems

The article shares a real‑world case study of a legacy multi‑service web platform where traffic spikes exposed DB connection leaks, leading to a 90% response‑time bottleneck, and outlines four key takeaways about tail‑latency metrics, investing in tools and people, actively maintaining legacy systems, and treating every line of code as critical for performance.

Backend · Load Testing · SRE
0 likes · 9 min read
IT Services Circle
Feb 2, 2022 · Operations

Huawei Cloud’s New Year Defense: How SRE Teams Counter Massive Attacks

Huawei Cloud’s internal “blue‑team” launched over twenty coordinated attacks around Chinese New Year, but the company’s SRE “red‑team” and a dedicated 24/7 “special forces” unit detected, isolated, and resolved incidents within minutes, keeping failure rates below 0.01% and demonstrating advanced cloud operations and security practices.

SRE · incident response
0 likes · 9 min read
DevOps
Jan 28, 2022 · Operations

Continuous Operations: Definition, Stages, and Practices

This article presents a comprehensive study of continuous operations, defining its meaning, outlining the three key stages of continuous deployment, operation, and feedback, reviewing ITIL and DevOps practices, and sharing real-world case studies from major tech companies to illustrate effective implementation.

Continuous Operations · DevOps · ITIL
0 likes · 46 min read
Meituan Technology Team
Jan 13, 2022 · Operations

Phoenix: Client‑Side CDN Disaster Recovery Solution at Meituan

Phoenix is Meituan’s client‑side CDN disaster‑recovery system. It uses a Webpack‑based SDK, a dynamic calculation service, and a monitoring platform to automatically detect load failures, switch domains, isolate problems, and keep resources on hot standby, boosting resource success rates from 99.7% to 99.9% across hundreds of projects.

CDN · Meituan · SRE
0 likes · 16 min read
dbaplus Community
Jan 9, 2022 · Databases

How ICBC Tames MySQL: Real‑World Governance, Risk Mitigation, and SRE Practices

This article details Industrial and Commercial Bank of China's comprehensive MySQL governance framework, covering risk identification, prevention strategies, a four‑step methodology, automated quality gates, production‑level monitoring, SRE management, and future visions for rapid incident detection and self‑healing.

Database Governance · Performance Optimization · Production Monitoring
0 likes · 22 min read
Continuous Delivery 2.0
Dec 31, 2021 · Operations

Curated Reading List on DevOps, Software Delivery Performance, and Engineering Productivity

This article presents a concise collection of ten Chinese-language resources that summarize the 2021 DORA DevOps report, the importance of consistency in R&D, fundamental efficiency principles, Microsoft’s testing shift, Google’s release and productivity metrics, and SRE health measurements, offering valuable insights for modern software engineering teams.

Engineering Productivity · Operations · SRE
0 likes · 5 min read
Alibaba Cloud Native
Dec 22, 2021 · Operations

How Alibaba’s ASI Powers Massive Serverless Kubernetes at Scale

This article details Alibaba's Serverless Infrastructure (ASI) built on ACK, explaining its large‑scale Kubernetes architecture, fully managed operations, change‑risk controls, gray‑release pipelines, web‑shell access, taskflow orchestration, node lifecycle management, elasticity, risk mitigation, probing, and self‑healing capabilities that enable reliable cloud‑native services.

Cloud Native · Infrastructure · Kubernetes
0 likes · 32 min read
IT Architects Alliance
Dec 1, 2021 · Operations

What Does an SRE Actually Do? A Deep Dive into Roles and Practices

This article explains the origins of Site Reliability Engineering, breaks down its three main layers—Infrastructure, Platform, and Business SRE—covers day‑1 and day‑2 deployment, on‑call processes, SLI/SLO design, post‑mortems, capacity planning, user support, and offers practical advice for aspiring SREs.

Infrastructure · Oncall · Operations
0 likes · 24 min read
Programmer DD
Nov 16, 2021 · Operations

What Does an SRE Do? A Practical Guide to Site Reliability Engineering

This article explains the role of Site Reliability Engineering (SRE), its origins at Google, the challenges of hiring, the three-layer model of infrastructure, platform, and business SRE, and provides detailed responsibilities, on‑call practices, SLI/SLO management, capacity planning, and career advice for aspiring SREs.

Infrastructure · Oncall · SLI
0 likes · 23 min read
dbaplus Community
Nov 14, 2021 · Operations

How to Boost Service Reliability: SRE Basics and Tackling Technical Debt

This article explains the fundamentals of Site Reliability Engineering, outlines a complete SRE workflow from prevention to post‑mortem, details key availability metrics and golden indicators, examines how technical debt arises and can be mitigated, and describes the tooling and practices needed to keep large‑scale services healthy.

SRE · Technical Debt · monitoring
0 likes · 18 min read
HaoDF Tech Team
Nov 8, 2021 · Operations

Service Risk Governance: Exploration, Mitigation, and Hands‑On Workshop

This talk recounts how the HaoDF (Good Doctor) platform tackled severe online incidents, first by launching the DOA project and then through a service risk governance initiative that identifies, quantifies, and mitigates latency‑related risks via metrics‑driven development, dependency analysis, middleware reliability, and a dedicated risk‑management platform.

Microservices · SRE · latency optimization
0 likes · 16 min read
HaoDF Tech Team
Oct 8, 2021 · Operations

Understanding SRE: Foundations, Metrics, and Tackling Technical Debt

This article introduces the fundamentals of Site Reliability Engineering (SRE), explains how to measure service stability with metrics like MTTR, MTBF, and availability, outlines the SRE workflow from prevention to post‑mortem, and discusses how to identify and reduce technical debt to improve system health.

Operations · Reliability · SRE
0 likes · 18 min read
Continuous Delivery 2.0
Sep 30, 2021 · Operations

Key Findings from the 2021 DORA DevOps Report: SRE Practices, Documentation, Security, and Culture

The 2021 DORA DevOps Report reveals that elite teams outperform low‑performing teams by adopting SRE principles, high‑quality documentation, integrated security, modern technical practices such as loose coupling, continuous testing, CI/CD, and a performance‑driven culture that fosters belonging and inclusion.

Culture · Operations · SRE
0 likes · 19 min read
Continuous Delivery 2.0
Sep 29, 2021 · Operations

Key Findings and Recommendations from the 2021 DORA DevOps Report (Chapters 1‑3)

The 2021 DORA DevOps Report, based on a seven‑year study of over 32,000 professionals, reveals how elite software delivery and technical‑operations practices—such as reliability goals, secure supply‑chain integration, high‑quality documentation, cloud adoption, and positive team culture—drive superior organizational performance and provides data‑driven guidance for improvement.

SRE · organizational culture · performance metrics
0 likes · 18 min read
DevOps Cloud Academy
Sep 27, 2021 · Operations

Key Findings from Google DORA 2021 Accelerate State of DevOps Report

The 2021 DORA Accelerate State of DevOps report, based on responses from over 32,000 professionals, reveals new performance metrics, the impact of SRE and security supply‑chain practices, cultural factors affecting burnout, and how cloud adoption continues to drive higher software delivery and organizational performance.

DevOps · SRE · Security
0 likes · 8 min read
DevOps Cloud Academy
Sep 26, 2021 · Operations

Key Findings from Google DORA 2021 Accelerate State of DevOps Report

Google’s 2021 DORA Accelerate State of DevOps report, based on over 32,000 professionals, reveals that elite teams dramatically outperform low‑performing teams across deployment frequency, lead time, recovery time and failure rates, while highlighting new reliability metrics, the importance of team culture, SRE, cloud adoption, secure software supply chains and documentation.

DevOps · SRE · Security
0 likes · 7 min read
Continuous Delivery 2.0
Sep 26, 2021 · Operations

Key Findings from Google DORA’s 2021 Accelerate State of DevOps Report

The 2021 Accelerate State of DevOps report by Google DORA, based on over 32,000 professionals, reveals that elite teams dramatically outperform low‑performing teams across four classic delivery metrics, introduces a new reliability metric, and highlights the impact of team culture, SRE practices, cloud adoption, secure software supply chains, and high‑quality documentation on software delivery and organizational performance.

Reliability · SRE · cloud
0 likes · 7 min read
TAL Education Technology
Aug 19, 2021 · Operations

Comprehensive SRE Guide for Summer and Winter High‑Load Periods in an Online Education Platform

This document outlines a comprehensive SRE‑driven operational framework for ensuring stable, high‑availability online education services during peak summer and winter periods, detailing the phases before, during, and after each peak, architectural principles, load testing, monitoring, capacity management, security hardening, chaos engineering, incident response, and post‑mortem practices.

Load Testing · SRE · capacity planning
0 likes · 17 min read
ByteDance ADFE Team
Jul 9, 2021 · Operations

From Ad‑hoc Deployment to Standardized SRE Practices: Definitions, Responsibilities, Metrics and Alerting

The article traces the evolution from a rudimentary deployment workflow in a small startup to a mature, Google‑inspired Site Reliability Engineering (SRE) approach, explaining SRE definitions, team duties, error‑budget concepts, key reliability metrics (SLI/SLO/SLA), monitoring implementation with OpenTSDB, and best‑practice alerting rules.

Alerting · Error Budget · SLI
0 likes · 7 min read
DataFunTalk
Jun 27, 2021 · Big Data

Practical Experience in Operating NetEase's Big Data Platform: Architecture, EasyOps, Monitoring, and Optimization

This presentation by NetEase senior SRE Jin Chuan details the current state of NetEase's big data platform, introduces the internally built EasyOps management system, explains a generic Ansible‑based operation framework, describes Prometheus/Grafana monitoring and alerting, and shares practical lessons on network, storage, and cloud migration for large‑scale Hadoop services.

Ansible · Prometheus · SRE
0 likes · 10 min read
DevOps
Jun 10, 2021 · Operations

Operations Is Not Simple: Challenges, Methodologies, and Paths to Sustainable Improvement

This article explores the complexity of IT operations, outlining common misconceptions, essential capabilities, organizational and individual pain points, and presents self‑help strategies such as SRE, DevOps, automation, and AIOps to achieve sustainable, scalable, and intelligent operations within enterprises.

Automation · DevOps · SRE
0 likes · 28 min read
Efficient Ops
Jun 7, 2021 · Operations

How Alibaba’s ECS Team Built a Scalable SRE System: Lessons for Large R&D Teams

This article summarizes Alibaba Cloud Elastic Compute Service's four‑year SRE journey, covering why ECS created its own SRE organization, the five‑layer SRE framework, standards, automation platforms, empowerment practices, and team‑building insights that can guide large development teams toward reliable, high‑availability operations.

SRE · reliability engineering
0 likes · 24 min read
Big Data Technology Architecture
Jun 2, 2021 · Big Data

Practical Operations of NetEase Big Data Platform: Architecture, EasyOps, Monitoring, and Experience Sharing

The presentation details NetEase's big data platform operations, covering current usage, the internally built EasyOps control system, a generic service‑operation framework based on Ansible, Prometheus‑Grafana monitoring, configuration management, network and storage optimizations, and lessons learned from cloud migration.

Ansible · Big Data · EasyOps
0 likes · 9 min read
TAL Education Technology
May 27, 2021 · Big Data

Big Data Monitoring System: Architecture, Basic and Advanced Monitoring, and Alert Convergence & Grading

This article outlines the challenges of operating petabyte‑scale big‑data clusters and presents a comprehensive monitoring framework—including basic and upgraded monitoring layers, metric collection, alerting pipelines, and strategies for alarm convergence and grading—to ensure reliable, proactive SRE operations.

Alerting · Grafana · Operations
0 likes · 12 min read
Alibaba Cloud Developer
May 18, 2021 · Operations

Mastering Incident Response: Structured Problem Solving and Key Roles

This guide outlines a structured approach to incident response, detailing problem definition, temporary fixes, root‑cause analysis, solution design, implementation, and standardization, while highlighting four critical roles—commander, communicator, rapid‑recovery lead, and diagnosis lead—to ensure swift, coordinated recovery of production services.

Operations · SRE · Team Roles
0 likes · 10 min read
Programmer DD
Apr 27, 2021 · Cloud Native

Top Open‑Source Tools Every SRE Should Master for Scalable, Reliable Systems

This article surveys the most popular open‑source projects for Site Reliability Engineering and DevOps, covering monitoring, deployment, chaos testing, and observability tools such as Cloudprober, Istio, Prometheus, Litmus, and more, highlighting their key features and how they help build scalable, high‑reliability cloud‑native systems.

DevOps · Kubernetes · SRE
0 likes · 11 min read
DevOps
Apr 21, 2021 · Operations

Xiaomi's Practice of Chaos Engineering and Fault Injection Platform

This article details Xiaomi's implementation of chaos engineering, describing the principles, platform construction using ChaosBlade, a comprehensive fault‑injection workflow, case study results, operational insights, and future plans to enhance system reliability and observability.

SRE
0 likes · 10 min read
Dada Group Technology
Apr 19, 2021 · Operations

Exploring Elastic Capacity and Automated Scaling Architecture at Dada Group

This article presents Dada Group's comprehensive approach to elastic capacity management and automated scaling, detailing the challenges faced during traffic spikes, the design of a cloud‑native auto‑scaler, multi‑metric observability, decision‑making logic, execution mechanisms, extreme scaling practices, and future optimization directions.

Auto Scaling · Cloud Native · SRE
0 likes · 15 min read
Efficient Ops
Mar 31, 2021 · Operations

Top 7 SRE Interview Questions Every Candidate Should Master

This article outlines the seven most important Site Reliability Engineering interview questions, explains why they matter, and provides an overview of the upcoming SRE Foundation course that equips professionals with the principles, practices, and tools needed for reliable, scalable systems.

SRE · SRE Foundation · Site Reliability Engineering
0 likes · 9 min read
Top Architect
Mar 26, 2021 · Operations

Top Open‑Source Projects for SREs and DevOps

This article presents a curated list of popular open‑source tools for monitoring, deployment, chaos testing, and reliability engineering, explaining their main features and how they help SREs and DevOps engineers build scalable, highly available cloud‑native systems.

Cloud Native · DevOps · SRE
0 likes · 10 min read
dbaplus Community
Mar 25, 2021 · Operations

Mastering High‑Quality Service Architecture: Load Balancing, Rate Limiting, Retries & Timeouts

This article distills Bilibili's technical director insights on building high‑service‑quality architectures, covering systematic load‑balancing strategies, sophisticated rate‑limiting mechanisms, robust retry policies, precise timeout controls, and comprehensive approaches to prevent cascading failures in large‑scale systems.

Backend Architecture · SRE · load balancing
0 likes · 14 min read
DevOps
Mar 18, 2021 · Operations

Understanding Site Reliability Engineering (SRE) and Its Role in Software Stability

Site Reliability Engineering (SRE) combines software engineering with operations to ensure scalable, highly reliable systems, outlining the collaboration between product development and SRE roles, the software lifecycle, stability value, and practical frameworks for observability, controllability, and best‑practice implementation.

SRE · Site Reliability Engineering · software lifecycle
0 likes · 12 min read
Alibaba Cloud Developer
Mar 8, 2021 · Operations

How to Ensure System Stability During Massive Sales Events: A Complete SRE Playbook

This article presents a comprehensive, step‑by‑step framework for guaranteeing system reliability during high‑traffic promotional periods, covering SRE hierarchy, stability criteria, profiling, monitoring, capacity planning, incident response, and post‑event analysis to help teams build resilient services.

SRE · capacity planning · incident response
0 likes · 21 min read