Tagged articles

403 articles

Dec 31, 2022 · Operations

Google Site Reliability Engineering (SRE) Principles and Engagement Model

The article explains Google’s Site Reliability Engineering (SRE) team, its mission to balance reliability and velocity through automation, the engagement model with development teams, funding principles, and a set of guiding principles that shape how SRE collaborates, scopes, and delivers value across services.

Engagement Model · Google · Reliability

29 min read

Bilibili Tech

Dec 30, 2022 · Operations

Self-Developed HTTPDNS Service: Cost Estimation, Architecture, Optimization, and Lessons Learned

To cut the hundreds‑of‑thousands‑yuan monthly bill of a commercial HTTPDNS service, the team built a multi‑region, self‑hosted HTTPDNS platform, estimated to slash costs by up to 90%, then resolved unexpected TLS bandwidth waste by improving connection reuse, ultimately achieving over 80% savings and planning a hybrid‑cloud deployment.

BGP · Cost Optimization · Domain Hijacking

12 min read

MaGe Linux Operations

Dec 23, 2022 · Operations

How to Build an Enterprise‑Grade Observability System for Reliable SRE

This article explains how enterprises can design and implement a comprehensive observability platform—covering metrics, logs, tracing, fault response, post‑mortems, testing, capacity planning, and automation—to improve system reliability and user experience.

Automation · Observability · SRE

16 min read

Efficient Ops

Dec 19, 2022 · Operations

How Tencent CDN Achieves Seamless Business Continuity with AI‑Powered SRE

This article details Tencent CDN's challenges and solutions for business continuity, covering bandwidth and device resource constraints, massive request handling, fault‑management lifecycle, automation bottlenecks, and the implementation of AIOps, intelligent alerts, capacity planning, and root‑cause analysis to ensure reliable service.

Automation · CDN · Operations

21 min read

Efficient Ops

Dec 12, 2022 · Operations

How Bilibili Built a 5‑Year SRE Journey: High‑Availability, Multi‑Active, and Capacity Management

This article chronicles Bilibili's five‑year evolution of Site Reliability Engineering, detailing the introduction of SRE culture, the construction of high‑availability and multi‑active architectures, capacity management with Kubernetes, VPA/HPA, incident case studies, and the ongoing transformation of SRE practices across the organization.

Kubernetes · Operations · SRE

24 min read

dbaplus Community

Nov 28, 2022 · Operations

How Bilibili Guaranteed Seamless Live Streaming for the League of Legends S12 Finals

Bilibili’s S12 technical guarantee team coordinated dozens of engineering groups, performed resource estimation, built a shared resource pool, applied chaos engineering, high‑availability architecture, and systematic performance testing to ensure the League of Legends World Championship livestream remained stable and responsive under peak traffic.

Performance Testing · Resource Management · SRE

19 min read

Tencent Architect

Nov 28, 2022 · Operations

How Tencent CDN Achieves Business Continuity with Intelligent Operations

This article details Tencent CDN's extensive business continuity challenges—including bandwidth, device resources, and massive request volumes—and explains how a fault‑management lifecycle, AIOps components, intelligent alerting, and automated capacity planning together enable resilient, automated operations.

CDN · Intelligent Operations · SRE

17 min read

Bilibili Tech

Nov 19, 2022 · Operations

Technical Assurance for High‑Write Live‑Streaming Gift Scenarios

The technical‑assurance team secured Bilibili’s high‑write live‑stream gift system by expanding capacity, isolating hot keys, refactoring pipelines, adding asynchronous writes, employing horizontal scaling and full‑link load testing, converting uncertain dependencies into graceful fallbacks, and deploying dual‑active, chaos‑engineered disaster‑resilience architecture aligned with business usage patterns.

SRE · capacity planning · database scaling

16 min read

21CTO

Nov 15, 2022 · Operations

Mastering SRE: How to Define SLIs, SLOs, SLAs and Build Reliable Systems

This article explains how SRE teams should define Service Level Indicators, Objectives and Agreements, manage reliability, performance, saturation and observability, use proper metrics and tracing, handle error budgets, assess risks, and implement effective incident and project management to create robust, cloud‑native services.

Error Budget · Observability · Reliability

14 min read

Xiaohe Frontend Team

Nov 15, 2022 · Operations

Mastering Incident Postmortems: Turn Failures into Learning Opportunities

This article explains why thorough, blameless incident postmortems are essential, outlines when to initiate them, describes the key components of an effective review, and offers practical steps to transform each outage into a continuous‑improvement opportunity for engineering teams.

Blameless Culture · Root Cause Analysis · SRE

6 min read

Bilibili Tech

Nov 15, 2022 · Operations

Technical Assurance for Bilibili S12 Live Streaming Event: Architecture, Resource Management, and High Availability

To ensure “tea‑time” reliability for Bilibili’s 2022 S12 League of Legends championship, a cross‑functional technical‑assurance project introduced shared resource pools, CPUSET removal, multi‑instance HA architecture, adaptive throttling, chaos‑engineered fault injection, a new Golang gateway, extensive load testing, and coordinated on‑site duty, delivering uninterrupted live streaming without forced throttling.

SRE · chaos engineering · high availability

20 min read

NetEase Yanxuan Technology Product Team

Nov 14, 2022 · Operations

Quantifying Internet Service Availability: Classic Metrics and the New User‑Uptime Indicator

The article reviews classic availability metrics such as Success‑Ratio, Incident‑Ratio, MTTR/MTTF, Error‑Budget, and SLA/SLO, then introduces User‑Uptime—a per‑user success time proportion that ignores long idle periods—and its windowed variant, showing how it complements existing indicators for more user‑centric reliability insight.

Availability · Reliability · SRE

27 min read

Ops Development Stories

Oct 26, 2022 · Operations

Is SRE a Team Mindset? Unlocking Stable Services Beyond the Title

The article explains that SRE, introduced by Google, is not a single specialist but a collaborative mindset requiring product, development, testing, operations, and architecture skills, and argues that even small‑scale teams can achieve stability by embracing these principles despite common misconceptions.

Cloud Native · SRE · stability

4 min read

Cloud Native Technology Community

Oct 19, 2022 · Industry Insights

What Sets Platform Engineering Apart from DevOps and SRE?

The article clarifies the distinctions between platform engineering, DevOps, and SRE, explaining their origins, common misconceptions, challenges such as shadow operations and developer cognitive load, and how platform engineering builds on these practices to deliver self‑service internal developer platforms that improve productivity and reliability.

DevOps · Internal Developer Platform · Operations

10 min read

MaGe Linux Operations

Oct 15, 2022 · Operations

Why Developers Hate Ops: Is DevOps Dead and Is Platform Engineering the Future?

The article examines growing developer frustration with operational responsibilities, the perceived decline of DevOps, and how platform engineering and Site Reliability Engineering are emerging as new approaches to balance development speed with reliable operations in cloud‑native environments.

Cloud Native · Operations · SRE

10 min read

MaGe Linux Operations

Oct 5, 2022 · Operations

Why Clear Responsibility Boundaries Are Crucial for DevOps and SRE Success

Clear responsibility boundaries are essential for effective DevOps and SRE workflows, preventing endless fire‑fighting, aligning software and organizational structures, and ensuring sustainable delivery across business stages. The article also highlights how team size and expertise affect process clarity and when to choose DevOps‑first or SRE‑guided approaches.

SRE · responsibility boundaries · workflow

6 min read

NetEase Yanxuan Technology Product Team

Sep 26, 2022 · Operations

How to Tame Alert Storms: Building a Systematic Monitoring and Alerting Framework for Microservices

This article analyzes the challenges of alert overload in large‑scale microservice environments and presents a systematic approach—including timeliness metrics, a maturity model, lifecycle tracking, feedback loops, downgrade mechanisms, and cross‑service aggregation—to improve alert effectiveness and reduce noise.

Alert Management · MTTR · Microservices

16 min read

Architects Research Society

Sep 16, 2022 · Operations

Building a Reliability Culture: Practices, Benefits, and Implementation

This article explains what a reliability culture is, why it matters, how to cultivate it through mission statements, early‑stage reliability testing, chaos‑engineering practices like GameDays and FireDrills, and how organizations can continuously learn from incidents to improve system availability and customer trust.

Culture · Operations · Reliability

18 min read

Bilibili Tech

Sep 9, 2022 · Operations

Bilibili SRE's Stability Practices and Reflections

At the 2022 GOPS Global Operations Conference in Shenzhen, Bilibili’s infrastructure SRE lead Wu Anchuang unveiled the company’s comprehensive stability framework—detailing its SRE transformation, high‑availability architecture, active‑active disaster‑recovery, capacity planning, and event‑support strategies—marking the first public disclosure of these practices.

Bilibili · SRE · activity assurance

1 min read

Zuoyebang Tech Team

Aug 26, 2022 · Operations

How We Built a Three‑Layer Stability System for Massive Scale Operations

This article details the operational mindset, stability framework, and transformation journey of the Zuoyebang infrastructure team, covering service lifecycle management, standardization, cloud‑native architecture, multi‑active deployment, incident pre‑plan platforms, traffic scheduling, monitoring, capacity planning, and future directions for SRE service‑orientation.

Automation · Infrastructure · Operations

20 min read

Cloud Native Technology Community

Aug 18, 2022 · Operations

Understanding DevOps: Integrating Development and Operations Beyond the ‘Who Develops Who Operates’ Myth

The article clarifies common misconceptions about DevOps, explains that true development‑operations integration relies on dedicated ops teams, automation tools, standardized delivery artifacts, and unified permission management rather than developers performing ops tasks, and highlights Google SRE practices as a practical guide.

Automation · DevOps · Infrastructure

10 min read

Top Architect

Aug 17, 2022 · Operations

Every Line of Code Matters: How We Boosted Application Performance by 3000% Through System Optimization

The article shares a senior architect's experience improving a legacy multi‑service web system's performance by 3000% by fixing a DB connection leak, adopting tail‑latency metrics, and investing in load testing, monitoring, logging, and dedicated SRE resources.

Backend · Load Testing · SRE

9 min read

Java Architect Essentials

Aug 14, 2022 · Operations

Every Line of Code Matters: Lessons from a 3000% Performance Improvement

This article shares a real‑world case study of how a hidden database‑connection leak in a pod’s health‑check caused severe latency, and outlines four key lessons on performance metrics, testing, legacy system maintenance, and the critical impact of every line of code.

Load Testing · Operations · SRE

9 min read

ITPUB

Aug 5, 2022 · Operations

How a Lua Bug Crashed Bilibili’s Load Balancer and What We Learned

On July 13, 2021, Bilibili’s OpenResty‑based SLB suffered a CPU‑100% outage caused by a Lua _gcd function bug triggered when a service’s weight was set to the string “0”, leading to a multi‑hour incident that was resolved by rebuilding SLB clusters and disabling JIT compilation.

Load Balancer · OpenResty · SRE

17 min read

Bilibili Tech

Aug 2, 2022 · Operations

Lessons Learned from Implementing SLOs at Bilibili: Practices, Pitfalls, and Reflections

Bilibili adopted Google‑SRE SLO practices—selecting SLIs, defining availability and latency targets, grading services, and tracking error budgets—but encountered costly grading inconsistencies, hidden error detection, and inaccurate business‑level metrics, leading them to realize SLOs are chiefly valuable for early alerting rather than exhaustive reporting.

Cloud Native · Error Budget · Operations

21 min read

DevOps

Jul 25, 2022 · Operations

Understanding the Role and Responsibilities of Site Reliability Engineering (SRE)

This article provides a comprehensive overview of Site Reliability Engineering, explaining its origins, core responsibilities across infrastructure, platform, and business layers, daily tasks such as deployment, on‑call duties, SLI/SLO management, incident post‑mortems, capacity planning, and user support, as well as career advice for aspiring SREs.

Infrastructure · Oncall · Reliability

21 min read

Big Data Technology Architecture

Jul 14, 2022 · Operations

Postmortem of Bilibili SLB Outage on July 13, 2021

This postmortem details the July 13, 2021 Bilibili outage, in which a Lua bug drove CPU usage to 100% on the OpenResty‑based SLB. It covers the incident timeline, root‑cause analysis, mitigation steps, and the subsequent technical and process improvements to enhance reliability and multi‑active deployment.

Incident · Load Balancer · Lua

16 min read

Su San Talks Tech

Jul 13, 2022 · Operations

How a Lua Bug Crashed Bilibili’s Load Balancer and What We Learned

This post‑mortem details the July 2021 Bilibili outage caused by a Lua bug in the OpenResty‑based SLB, describing the timeline, root‑cause analysis, mitigation steps, and the technical and organizational improvements implemented to prevent similar incidents.

Load Balancer · Lua · SRE

18 min read

dbaplus Community

Jul 12, 2022 · Operations

How a Lua Bug Crashed Bilibili’s Load Balancer and What We Learned

In July 2021, a sudden spike to 100% CPU in Bilibili's OpenResty‑based SLB caused widespread service outages, prompting an emergency response that rebuilt the load‑balancer clusters, traced the fault to a bug in the Lua _gcd function triggered by a weight set to the string “0”, and led to extensive operational and architectural improvements.

Cloud Native · Lua · OpenResty

17 min read

Bilibili Tech

Jul 12, 2022 · Operations

Bilibili SLB Outage Postmortem (July 13, 2021): Timeline, Root Cause, and Improvements

On July 13, 2021, Bilibili’s L7 SLB crashed when a recent Lua deployment set a balancer weight to the string “0”, producing a NaN value that triggered an infinite loop and 100% CPU usage, prompting emergency restarts, a fresh cluster rollout, and long‑term safeguards such as automated provisioning, stricter Lua validation, and enhanced multi‑active disaster‑recovery processes.

Load Balancer · Root Cause Analysis · SLB

17 min read

DevOps

Jul 8, 2022 · Operations

Nine Essential Skills Every Modern Site Reliability Engineer Should Master

The article outlines the nine core competencies—network expertise, Linux/Unix knowledge, cloud computing, CI/CD pipelines, QA automation, security engineering, DevOps, incident management, and post‑incident review—that enable SREs to ensure the availability, performance, and reliability of complex distributed systems.

DevOps · SRE · Site Reliability Engineering

6 min read

dbaplus Community

Jul 4, 2022 · Operations

Why Most Monitoring Systems Fail: Lessons from a Veteran Ops Engineer

A seasoned operations professional shares personal experiences and hard‑earned insights on why traditional monitoring often becomes ineffective, how over‑automation and noisy dashboards hurt teams, and what a capability‑focused, user‑centric approach to observability should look like.

Observability · Operations · SRE

12 min read

Efficient Ops

Jun 27, 2022 · Operations

How Huya Reaches 98% Containerization & 80% AI Elasticity for Ultra‑Reliable Live Streaming

This article details Huya's SRE-driven architecture that combines center‑edge deployment, high containerization, AI‑powered elasticity, fault avoidance, and fast recovery mechanisms to achieve deterministic, highly available live‑streaming services.

SRE · elastic computing · stability

16 min read

Architecture Talk

Jun 27, 2022 · Operations

Why Build an SRE System? A Complete Guide to Site Reliability Engineering

This article explains the motivations behind Site Reliability Engineering (SRE), outlines its strategic goals, defines key concepts such as SLI, SLO, SLA and error budget, introduces the four golden metrics for monitoring distributed systems, and provides practical guidance on building, operating, and continuously improving an SRE practice.

Error Budget · SLI · SLO

14 min read

HaoDF Tech Team

Jun 21, 2022 · Operations

Evolution and High‑Availability Construction of the HaoDF Offline Message Push System

This article describes how the HaoDF offline push service grew from a simple PHP notification tool into a robust, highly available micro‑service platform by redesigning its architecture, adopting vendor push channels, adding message‑queue reliability, and implementing comprehensive monitoring, observability, and a fault‑diagnosis platform to ensure delivery rates and operational stability.

Mobile Backend · Observability · SRE

21 min read

Bilibili Tech

Jun 21, 2022 · Cloud Native

Evolution of SRE in the Cloud‑Native Era – Insights from Industry Experts

Industry experts from Zhejiang Mobile, Bilibili, and Xiaomi discuss how SRE has evolved in the cloud‑native era, sharing concrete frameworks, observability practices, and cost‑focused platforms while emphasizing stability, metrics, on‑call processes, and the need to adapt Google’s model to real‑world product and operational contexts.

Cloud Native · DevOps · SRE

31 min read

Efficient Ops

Jun 19, 2022 · Operations

How Bilibili SRE Guarantees Million‑User Live Events: Strategies, Tools, and Lessons

This article details Bilibili's SRE approach to large‑scale live events, covering background, activity scenarios, resource planning, performance testing, chaos‑engineering drills, technical safeguards such as DCDN, SLB, WAF, PaaS, cache and DB, pre‑plan capabilities, post‑mortem analysis, and future outlook, illustrating how systematic capacity management and automated resilience practices enable stable operation for events with tens of millions of concurrent users.

Performance Testing · SRE · capacity planning

22 min read

ITPUB

Jun 18, 2022 · Operations

How MDD and SRE Cut Mini‑Program Image‑Upload Failures from Days to Minutes

This article recounts a three‑day image‑upload outage in a mini‑program, analyzes the multi‑layer causes, and shows how combining Metrics‑Driven Development with SRE and a custom observability platform dramatically reduces diagnosis time and improves reliability.

Metrics-Driven Development · Mini Program · Observability

20 min read

Continuous Delivery 2.0

Jun 17, 2022 · Operations

Addressing SRE Overload: Causes and Mitigation Strategies

The article examines why SRE teams experience overload due to high incident response demands, analyzes contributing factors such as production issues, alert volume, and manual processes, and proposes comprehensive mitigation steps including better testing, load management, and proactive error detection to reduce on‑call burden.

SRE · incident response · overload

5 min read

Bilibili Tech

Jun 14, 2022 · Operations

SRE Practices for Large‑Scale Event Assurance at Bilibili

Bilibili’s SRE team ensures flawless large‑scale online events by meticulously gathering activity details, provisioning DNS, CDN, networking and compute resources, conducting multi‑stage performance tests and chaos‑engineering drills, applying layered traffic controls, maintaining historical checklists, executing predefined contingency responses, and iterating post‑mortems to drive continuous automation and reliability.

Cloud Native · Event Reliability · Operations

20 min read

dbaplus Community

Jun 13, 2022 · Operations

How We Built a Mini‑Program Observability Platform to Slash Incident Resolution Time

After a three‑day, ten‑person investigation into a mini‑program image‑upload failure, we designed and implemented an end‑to‑end observability platform using MDD and SRE principles, defining SLI/SLO, instrumenting client, network, gateway and backend layers, and visualizing metrics with Grafana, ClickHouse and Prometheus.

Grafana · MDD · Metrics

18 min read

dbaplus Community

Jun 9, 2022 · Operations

Building an Effective SRE System: Key Principles, Metrics, and Practices

This article explains Site Reliability Engineering (SRE), its core concepts such as SLI, SLO, SLA, error budgets, risk analysis, the four golden metrics, and practical steps for developing, piloting, and operating reliable services with monitoring, automation, and post‑mortem practices.

Error Budget · SLI · SLO

15 min read

Bilibili Tech

May 20, 2022 · Operations

Bilibili SRE Practices: Stability Operations, Incident Management, and Platform Enablement

Bilibili’s SRE team, confronting rapid growth and complex systems, built a systematic stability operation that includes emergency response, incident handling, on‑call scheduling, and an Event Operations Center platform, using metrics like MTTR, MTTI and AI‑assisted automation to reduce downtime and improve reliability.

Bilibili · Metrics · Oncall

27 min read

DevOps

May 18, 2022 · Operations

Understanding and Preventing Cascading Failures in Distributed Systems

The article explains how cascading failures arise from positive feedback loops in distributed systems, illustrates real‑world incidents such as the 2015 DynamoDB outage, outlines anti‑patterns like unlimited retries and unchecked load, and presents practical mitigation techniques including load‑shedding, circuit breakers, exponential back‑off, and controlled replication to improve system resilience.

Distributed Systems · Resilience · SRE

19 min read

Bilibili Tech

Apr 26, 2022 · Operations

Bilibili's SRE Practice for Business Stability: Theory, Metrics, and Operational Implementation

Bilibili’s SRE team combines stability theory, detailed fault‑stage and operational metrics, and a unified emergency‑response platform—including on‑call scheduling, fault‑command incident commanders, automated fault portraits, and rapid post‑mortems—to transform frequent incidents into data‑driven, collaborative recoveries and lay groundwork for AI‑assisted self‑healing.

Business Stability · Metrics · Oncall

23 min read

IT Architects Alliance

Apr 17, 2022 · Operations

Understanding the SRE Role: Responsibilities, Types, and Practices

This article explains what Site Reliability Engineering (SRE) is, why it was created, the challenges in hiring SREs, and breaks the role into three layers—Infrastructure, Platform, and Business—detailing their duties, deployment processes, on‑call practices, SLI/SLO management, incident post‑mortems, capacity planning, user support, and career advice.

Infrastructure · Oncall · Operations

21 min read

Architect

Apr 16, 2022 · Operations

A Comprehensive Overview of Site Reliability Engineering (SRE) Roles and Practices

This article explains what SRE is, why it was created, how its responsibilities differ across companies, and breaks the work into Infrastructure, Platform, and Business SRE while covering deployment, on‑call, SLI/SLO, incident post‑mortems, capacity planning, user support, and career advice.

Oncall · Operations · SLI/SLO

22 min read

IT Architects Alliance

Apr 12, 2022 · Operations

Understanding Site Reliability Engineering (SRE): Concepts, Metrics, and Practices

This article explains Site Reliability Engineering (SRE), covering its origins, core responsibilities, key concepts such as SLI/SLO/SLA and error budgets, the four golden monitoring metrics, risk analysis, and practical guidance on building reliable services using tools like Prometheus and Grafana.

Error Budget · Operations · SLI

15 min read

DevOps

Apr 12, 2022 · Operations

Understanding Observability: Core Concepts, SRE Methodology, AIOps, and Business Architecture

The article explains the rising importance of observability in modern operations, defines its control‑theory roots, breaks it down into metrics, traces and logs, and argues that successful implementation requires three pillars—SRE practices, AIOps algorithms, and deep business‑architecture knowledge—together with well‑designed SLOs and critical‑path mapping.

Observability · SRE · aiops

10 min read

dbaplus Community

Apr 10, 2022 · Operations

How to Build a Practical SRE Operations Framework for Large‑Scale Systems

This article presents a hands‑on SRE framework covering the full product lifecycle—code development, resource planning, deployment, operational reliability, and decommissioning—derived from real‑world practices at Xiaomi and Sina to help teams manage massive internet services efficiently and cost‑effectively.

Resource Management · SRE · System Lifecycle

16 min read

HaoDF Tech Team

Mar 29, 2022 · Operations

Building an Observability Platform for Mini‑Program Image Uploads Using SRE and Metrics‑Driven Development

The article describes how a three‑day, cross‑team investigation of a mini‑program image‑upload failure led to the design and implementation of an SRE‑driven, metrics‑driven observability platform that quantifies SLIs, automates tracing, and provides dashboards for real‑time and long‑term analysis, ultimately reducing MTTR.

Backend · Metrics-Driven Development · Mini-Program

17 min read

dbaplus Community

Mar 25, 2022 · Operations

Beyond Metrics, Traces, Logs: The SRE, AIOps, and Business Architecture Secrets of Observability

Observability is more than just combining metrics, traces, and logs; successful implementation requires the disciplined SRE methodology, AI‑driven AIOps capabilities, and a deep understanding of business architecture to define critical paths and layered SLOs for real‑world systems.

SRE · aiops · business architecture

11 min read

Efficient Ops

Mar 16, 2022 · Operations

Why Traditional Monitoring Fails and Observability Is the Future for Ops Teams

Drawing from years of ops experience, the author recounts the decline of traditional monitoring, the rise of automated dashboards, the challenges of AIOps and observability, and proposes a shift toward data‑driven, business‑focused capability building to make alerts truly useful.

Observability · SRE · aiops

13 min read

Java High-Performance Architecture

Mar 16, 2022 · Operations

Why Every Line of Code Matters: Lessons from Boosting Web App Performance 3000×

This article shares practical lessons from optimizing the performance of legacy web applications, covering how a hidden DB‑connection leak caused severe latency, why average latency is misleading, and the essential tools, testing, and team practices needed to keep services fast and reliable.

Load Testing · Performance Optimization · SRE

9 min read

Ops Development Stories

Mar 3, 2022 · Operations

What Exactly Does an SRE Do? Unpacking Roles, Skills, and Practices

This article explains the SRE role originated by Google, outlines its core responsibilities such as automation, observability, incident response, testing, capacity planning, and SLI/SLO/SLA management, and highlights the skills and cultural practices needed for reliable service operations.

Observability · SLA · SLI

0 likes · 29 min read

Selected Java Interview Questions

Feb 25, 2022 · Operations

How Every Line of Code Impacts Performance: Lessons from Optimizing Legacy Web Applications

After a sudden traffic surge exposed severe latency in a 15‑year‑old multi‑service web system, the team identified a DB connection leak in a health‑check probe, and through load testing, monitoring, logging, and dedicated SRE effort, they derived key operational lessons on performance optimization and maintenance of legacy applications.

Load Testing · SRE · database connections

0 likes · 9 min read

IT Architects Alliance

Feb 15, 2022 · Operations

What Real-World Performance Tuning Taught Us About Legacy Web Apps

After a traffic surge exposed severe latency in a 15-year-old multi-service web platform, we used monitoring to discover a DB-connection leak caused by a liveness probe, corrected it, and distilled four practical lessons on latency metrics, tooling, legacy maintenance, and code vigilance.

APM · Load Testing · Operations

0 likes · 9 min read

Top Architect

Feb 13, 2022 · Operations

Performance Optimization Lessons from a Legacy Web Application: Monitoring, Load Testing, and Maintaining Old Systems

The article shares a real‑world case study of a legacy multi‑service web platform where traffic spikes exposed DB connection leaks, leading to a 90% response‑time bottleneck, and outlines four key takeaways about tail‑latency metrics, investing in tools and people, actively maintaining legacy systems, and treating every line of code as critical for performance.

Backend · Load Testing · SRE

0 likes · 9 min read

IT Services Circle

Feb 2, 2022 · Operations

Huawei Cloud’s New Year Defense: How SRE Teams Counter Massive Attacks

Huawei Cloud’s internal “blue‑team” launched over twenty coordinated attacks around Chinese New Year, but the company’s SRE “red‑team” and a dedicated 24/7 “special forces” unit detected, isolated, and resolved incidents within minutes, keeping failure rates below 0.01% and demonstrating advanced cloud operations and security practices.

SRE · incident response

0 likes · 9 min read

DevOps

Jan 28, 2022 · Operations

Continuous Operations: Definition, Stages, and Practices

This article presents a comprehensive study of continuous operations, defining its meaning, outlining the three key stages of continuous deployment, operation, and feedback, reviewing ITIL and DevOps practices, and sharing real-world case studies from major tech companies to illustrate effective implementation.

Continuous Operations · DevOps · ITIL

0 likes · 46 min read

Efficient Ops

Jan 25, 2022 · Operations

From Zero to Scalable Monitoring: Lessons from Building a 200‑Service Platform

Over two years, we built a monitoring system covering 200+ services and 700+ instances, evolving from ad‑hoc Nginx logs to a Prometheus‑based observability platform with unified dashboards, automated alerts, and lessons on metric selection, alert fatigue, and fault isolation.

Alerting · SRE

0 likes · 9 min read

Meituan Technology Team

Jan 13, 2022 · Operations

Phoenix: Client‑Side CDN Disaster Recovery Solution at Meituan

Phoenix is Meituan’s client‑side CDN disaster‑recovery system that uses a Webpack‑based SDK, dynamic calculation service, and monitoring platform to automatically detect load failures, switch domains, isolate problems, and continuously hot‑standby resources, boosting resource success rates from 99.7% to 99.9% across hundreds of projects.

CDN · Meituan · SRE

0 likes · 16 min read

dbaplus Community

Jan 9, 2022 · Databases

How ICBC Tames MySQL: Real‑World Governance, Risk Mitigation, and SRE Practices

This article details Industrial and Commercial Bank of China's comprehensive MySQL governance framework, covering risk identification, prevention strategies, a four‑step methodology, automated quality gates, production‑level monitoring, SRE management, and future visions for rapid incident detection and self‑healing.

Database Governance · Performance Optimization · Production Monitoring

0 likes · 22 min read

Top Architect

Jan 8, 2022 · Operations

High‑Availability Architecture Practices from Bilibili: Load Balancing, Rate Limiting, Retries, and Timeout Strategies

This article presents Bilibili’s high‑availability design, covering load‑balancing decisions, subset selection, multi‑cluster deployment, adaptive rate limiting, retry policies, timeout propagation, and cascading‑failure mitigation, all illustrated with diagrams and practical SRE insights.

Backend · Retry · SRE

0 likes · 15 min read

Continuous Delivery 2.0

Dec 31, 2021 · Operations

Curated Reading List on DevOps, Software Delivery Performance, and Engineering Productivity

This article presents a concise collection of ten Chinese-language resources that summarize the 2021 DORA DevOps report, the importance of consistency in R&D, fundamental efficiency principles, Microsoft’s testing shift, Google’s release and productivity metrics, and SRE health measurements, offering valuable insights for modern software engineering teams.

Engineering Productivity · Operations · SRE

0 likes · 5 min read

Alibaba Cloud Native

Dec 22, 2021 · Operations

How Alibaba’s ASI Powers Massive Serverless Kubernetes at Scale

This article details Alibaba's Serverless Infrastructure (ASI) built on ACK, explaining its large‑scale Kubernetes architecture, fully managed operations, change‑risk controls, gray‑release pipelines, web‑shell access, taskflow orchestration, node lifecycle management, elasticity, risk mitigation, probing, and self‑healing capabilities that enable reliable cloud‑native services.

Cloud Native · Infrastructure · Kubernetes

0 likes · 32 min read

IT Architects Alliance

Dec 1, 2021 · Operations

What Does an SRE Actually Do? A Deep Dive into Roles and Practices

This article explains the origins of Site Reliability Engineering, breaks down its three main layers—Infrastructure, Platform, and Business SRE—covers day‑1 and day‑2 deployment, on‑call processes, SLI/SLO design, post‑mortems, capacity planning, user support, and offers practical advice for aspiring SREs.

Infrastructure · Oncall · Operations

0 likes · 24 min read

Programmer DD

Nov 16, 2021 · Operations

What Does an SRE Do? A Practical Guide to Site Reliability Engineering

This article explains the role of Site Reliability Engineering (SRE), its origins at Google, the challenges of hiring, the three-layer model of infrastructure, platform, and business SRE, and provides detailed responsibilities, on‑call practices, SLI/SLO management, capacity planning, and career advice for aspiring SREs.

Infrastructure · Oncall · SLI

0 likes · 23 min read

dbaplus Community

Nov 14, 2021 · Operations

How to Boost Service Reliability: SRE Basics and Tackling Technical Debt

This article explains the fundamentals of Site Reliability Engineering, outlines a complete SRE workflow from prevention to post‑mortem, details key availability metrics and golden indicators, examines how technical debt arises and can be mitigated, and describes the tooling and practices needed to keep large‑scale services healthy.

SRE · Technical Debt · monitoring

0 likes · 18 min read

HaoDF Tech Team

Nov 8, 2021 · Operations

Service Risk Governance: Exploration, Mitigation, and Hands‑On Workshop

This talk recounts how the Good Doctor platform tackled severe online incidents by launching the DOA project, then a service risk governance initiative that identifies, quantifies, and mitigates latency‑related risks through metrics‑driven development, dependency analysis, middleware reliability, and a dedicated risk‑management platform.

Microservices · SRE · latency optimization

0 likes · 16 min read

HaoDF Tech Team

Oct 8, 2021 · Operations

Understanding SRE: Foundations, Metrics, and Tackling Technical Debt

This article introduces the fundamentals of Site Reliability Engineering (SRE), explains how to measure service stability with metrics like MTTR, MTBF, and availability, outlines the SRE workflow from prevention to post‑mortem, and discusses how to identify and reduce technical debt to improve system health.

Operations · Reliability · SRE

0 likes · 18 min read

Continuous Delivery 2.0

Sep 30, 2021 · Operations

Key Findings from the 2021 DORA DevOps Report: SRE Practices, Documentation, Security, and Culture

The 2021 DORA DevOps Report reveals that elite teams outperform low‑performing teams by adopting SRE principles, high‑quality documentation, integrated security, modern technical practices such as loose coupling, continuous testing, CI/CD, and a performance‑driven culture that fosters belonging and inclusion.

Culture · Operations · SRE

0 likes · 19 min read

Continuous Delivery 2.0

Sep 29, 2021 · Operations

Key Findings and Recommendations from the 2021 DORA DevOps Report (Chapters 1‑3)

The 2021 DORA DevOps Report, based on a seven‑year study of over 32,000 professionals, reveals how elite software delivery and technical‑operations practices—such as reliability goals, secure supply‑chain integration, high‑quality documentation, cloud adoption, and positive team culture—drive superior organizational performance and provides data‑driven guidance for improvement.

SRE · organizational culture · performance metrics

0 likes · 18 min read

DevOps Cloud Academy

Sep 27, 2021 · Operations

Key Findings from Google DORA 2021 Accelerate State of DevOps Report

The 2021 DORA Accelerate State of DevOps report, based on responses from over 32,000 professionals, reveals new performance metrics, the impact of SRE and supply‑chain security practices, cultural factors affecting burnout, and how cloud adoption continues to drive higher software delivery and organizational performance.

DevOps · SRE · Security

0 likes · 8 min read

DevOps Cloud Academy

Sep 26, 2021 · Operations

Key Findings from Google DORA 2021 Accelerate State of DevOps Report

Google’s 2021 DORA Accelerate State of DevOps report, based on over 32,000 professionals, reveals that elite teams dramatically outperform low‑performing teams across deployment frequency, lead time, recovery time and failure rates, while highlighting new reliability metrics, the importance of team culture, SRE, cloud adoption, secure software supply chains and documentation.

DevOps · SRE · Security

0 likes · 7 min read

Continuous Delivery 2.0

Sep 26, 2021 · Operations

Key Findings from Google DORA’s 2021 Accelerate State of DevOps Report

The 2021 Accelerate State of DevOps report by Google DORA, based on over 32,000 professionals, reveals that elite teams dramatically outperform low‑performing teams across four classic delivery metrics, introduces a new reliability metric, and highlights the impact of team culture, SRE practices, cloud adoption, secure software supply chains, and high‑quality documentation on software delivery and organizational performance.

Reliability · SRE · cloud

0 likes · 7 min read

Alibaba Cloud Native

Aug 27, 2021 · Operations

How Chaos Engineering Strengthens System Resilience: Building a Fault‑Injection Platform

This article explains why modern agile and DevOps environments need chaos engineering, describes the design and goals of a fault‑injection platform, outlines tool selection, details a five‑step exercise workflow, and shares a real‑world case study that demonstrates improved stability and SRE capabilities.

Resilience · SRE · platform

0 likes · 10 min read

TAL Education Technology

Aug 19, 2021 · Operations

Comprehensive SRE Guide for Summer and Winter High‑Load Periods in an Online Education Platform

This document outlines a comprehensive SRE‑driven operational framework for ensuring stable, high‑availability online education services during peak summer and winter periods, detailing pre‑, during‑, and post‑maintenance phases, architectural principles, load testing, monitoring, capacity management, security hardening, chaos engineering, incident response, and post‑mortem practices.

Load Testing · SRE · capacity planning

0 likes · 17 min read

Efficient Ops

Aug 10, 2021 · Operations

From Zero to Scalable Monitoring: Lessons from Building a 200‑Service Platform

Alerting · SRE

0 likes · 9 min read

dbaplus Community

Jul 26, 2021 · Operations

Top Open‑Source Tools Every SRE Should Know for Monitoring, Chaos Engineering, and Reliability

This article introduces a curated list of popular open‑source projects for SRE and DevOps, covering monitoring, deployment, chaos engineering, and reliability tools such as Cloudprober, Istio, Checkov, Litmus, Locust, Prometheus, and more, highlighting their key features and practical use cases.

Kubernetes · SRE · monitoring

0 likes · 10 min read

ByteDance ADFE Team

Jul 9, 2021 · Operations

From Ad‑hoc Deployment to Standardized SRE Practices: Definitions, Responsibilities, Metrics and Alerting

The article traces the evolution from a rudimentary deployment workflow in a small startup to a mature, Google‑inspired Site Reliability Engineering (SRE) approach, explaining SRE definitions, team duties, error‑budget concepts, key reliability metrics (SLI/SLO/SLA), monitoring implementation with OpenTSDB, and best‑practice alerting rules.

Alerting · Error Budget · SLI

0 likes · 7 min read

DataFunTalk

Jun 27, 2021 · Big Data

Practical Experience in Operating NetEase's Big Data Platform: Architecture, EasyOps, Monitoring, and Optimization

This presentation by NetEase senior SRE Jin Chuan details the current state of NetEase's big data platform, introduces the internally built EasyOps management system, explains a generic Ansible‑based operation framework, describes Prometheus/Grafana monitoring and alerting, and shares practical lessons on network, storage, and cloud migration for large‑scale Hadoop services.

Ansible · Prometheus · SRE

0 likes · 10 min read

Alibaba Cloud Developer

Jun 24, 2021 · R&D Management

Master Your First Technical Talk: A Six‑Step Blueprint for New Speakers

This article shares a practical six‑step framework—from understanding nervousness and defining purpose to designing simple slides, crafting a script, deliberate rehearsal, and handling Q&A—to help technical newcomers deliver confident, audience‑focused presentations at conferences.

R&D · SRE · communication

0 likes · 26 min read

DevOps

Jun 10, 2021 · Operations

Operations Is Not Simple: Challenges, Methodologies, and Paths to Sustainable Improvement

This article explores the complexity of IT operations, outlining common misconceptions, essential capabilities, organizational and individual pain points, and presents self‑help strategies such as SRE, DevOps, automation, and AIOps to achieve sustainable, scalable, and intelligent operations within enterprises.

Automation · DevOps · SRE

0 likes · 28 min read

Efficient Ops

Jun 7, 2021 · Operations

How Alibaba’s ECS Team Built a Scalable SRE System: Lessons for Large R&D Teams

This article summarizes Alibaba Cloud Elastic Compute Service's four‑year SRE journey, covering why ECS created its own SRE organization, the five‑layer SRE framework, standards, automation platforms, empowerment practices, and team‑building insights that can guide large development teams toward reliable, high‑availability operations.

SRE · reliability engineering

0 likes · 24 min read

Big Data Technology Architecture

Jun 2, 2021 · Big Data

Practical Operations of NetEase Big Data Platform: Architecture, EasyOps, Monitoring, and Experience Sharing

The presentation details NetEase's big data platform operations, covering current usage, the internally built EasyOps control system, a generic service‑operation framework based on Ansible, Prometheus‑Grafana monitoring, configuration management, network and storage optimizations, and lessons learned from cloud migration.

Ansible · Big Data · EasyOps

0 likes · 9 min read

TAL Education Technology

May 27, 2021 · Big Data

Big Data Monitoring System: Architecture, Basic and Advanced Monitoring, and Alert Convergence & Grading

This article outlines the challenges of operating petabyte‑scale big‑data clusters and presents a comprehensive monitoring framework—including basic and upgraded monitoring layers, metric collection, alerting pipelines, and strategies for alert convergence and grading—to ensure reliable, proactive SRE operations.

Alerting · Grafana · Operations

0 likes · 12 min read

ByteDance ADFE Team

May 25, 2021 · Operations

Service Governance and SRE: Ensuring 24/7 Service Reliability

The article explains service governance and SRE practices, detailing goals, components, overload handling, capacity planning, and strategies to maintain continuous 24‑hour service reliability while reducing manual toil.

Operations · Reliability · SRE

0 likes · 12 min read

Alibaba Cloud Developer

May 20, 2021 · Operations

Mastering Production Incident Response: Structured Problem Solving and Key Roles

This guide explains how to design and practice a structured incident‑response process—defining problems, applying quick‑recovery steps, analyzing root causes, standardizing solutions, and assigning critical roles—to dramatically reduce production outage duration.

Operations · SRE · Team Roles

0 likes · 11 min read

Alibaba Cloud Developer

May 18, 2021 · Operations

Mastering Incident Response: Structured Problem Solving and Key Roles

This guide outlines a structured approach to incident response, detailing problem definition, temporary fixes, root‑cause analysis, solution design, implementation, and standardization, while highlighting four critical roles—commander, communicator, rapid‑recovery lead, and diagnosis lead—to ensure swift, coordinated recovery of production services.

Operations · SRE · Team Roles

0 likes · 10 min read

MaGe Linux Operations

May 16, 2021 · Operations

Top 10 Open‑Source Tools Every SRE Should Use for Reliable Cloud Operations

This article introduces ten popular open‑source projects for monitoring, deployment, and reliability engineering, detailing each tool's purpose, key features, and how they help Site Reliability Engineers build scalable, highly reliable cloud‑native systems.

DevOps · SRE · monitoring

0 likes · 10 min read

Programmer DD

Apr 27, 2021 · Cloud Native

Top Open‑Source Tools Every SRE Should Master for Scalable, Reliable Systems

This article surveys the most popular open‑source projects for Site Reliability Engineering and DevOps, covering monitoring, deployment, chaos testing, and observability tools such as Cloudprober, Istio, Prometheus, Litmus, and more, highlighting their key features and how they help build scalable, high‑reliability cloud‑native systems.

DevOps · Kubernetes · SRE

0 likes · 11 min read

DevOps

Apr 21, 2021 · Operations

Xiaomi's Practice of Chaos Engineering and Fault Injection Platform

This article details Xiaomi's implementation of chaos engineering, describing the principles, platform construction using ChaosBlade, a comprehensive fault‑injection workflow, case study results, operational insights, and future plans to enhance system reliability and observability.

SRE

0 likes · 10 min read

Dada Group Technology

Apr 19, 2021 · Operations

Exploring Elastic Capacity and Automated Scaling Architecture at Dada Group

This article presents Dada Group's comprehensive approach to elastic capacity management and automated scaling, detailing the challenges faced during traffic spikes, the design of a cloud‑native auto‑scaler, multi‑metric observability, decision‑making logic, execution mechanisms, extreme scaling practices, and future optimization directions.

Auto Scaling · Cloud Native · SRE

0 likes · 15 min read

Efficient Ops

Mar 31, 2021 · Operations

Understanding Site Reliability Engineering (SRE): Roles, Tools, and Practices

The article provides a comprehensive overview of Site Reliability Engineering (SRE), explaining its origins, definition by Google, required skill sets, typical responsibilities, tools used, and how the role has evolved within DevOps and modern cloud‑native environments.

DevOps · Operations · Reliability

0 likes · 9 min read

Top Architect

Mar 26, 2021 · Operations

Top Open‑Source Projects for SREs and DevOps

This article presents a curated list of popular open‑source tools for monitoring, deployment, chaos testing, and reliability engineering, explaining their main features and how they help SREs and DevOps engineers build scalable, highly available cloud‑native systems.

Cloud Native · DevOps · SRE

0 likes · 10 min read

dbaplus Community

Mar 25, 2021 · Operations

Mastering High‑Quality Service Architecture: Load Balancing, Rate Limiting, Retries & Timeouts

This article distills Bilibili's technical director insights on building high‑service‑quality architectures, covering systematic load‑balancing strategies, sophisticated rate‑limiting mechanisms, robust retry policies, precise timeout controls, and comprehensive approaches to prevent cascading failures in large‑scale systems.

Backend Architecture · SRE · load balancing

0 likes · 14 min read

DevOps

Mar 18, 2021 · Operations

Understanding Site Reliability Engineering (SRE) and Its Role in Software Stability

Site Reliability Engineering (SRE) combines software engineering with operations to ensure scalable, highly reliable systems. The article outlines the collaboration between product development and SRE roles, the software lifecycle, the value of stability, and practical frameworks for observability, controllability, and best‑practice implementation.

SRE · Site Reliability Engineering · software lifecycle

0 likes · 12 min read

Alibaba Cloud Developer

Mar 8, 2021 · Operations

How to Ensure System Stability During Massive Sales Events: A Complete SRE Playbook

This article presents a comprehensive, step‑by‑step framework for guaranteeing system reliability during high‑traffic promotional periods, covering SRE hierarchy, stability criteria, profiling, monitoring, capacity planning, incident response, and post‑event analysis to help teams build resilient services.

SRE · capacity planning · incident response

0 likes · 21 min read
