Tagged articles

service reliability

57 articles · Page 1 of 1

Apr 21, 2026 · Operations

Can AI Be Blamed for a 9‑Hour Travel App Outage? Lessons on Software Engineering Discipline

A nine‑hour outage of a popular travel app exposed how reliance on AI‑generated code can mask deeper failures in disaster‑recovery planning, incident response, and engineering rigor, reminding developers that high availability depends on disciplined practices rather than tools.

AI code · High Availability · incident response

0 likes · 5 min read

Airbnb Technology Team

Mar 24, 2026 · Cloud Native

How Airbnb Ensures Safe, Reliable Dynamic Configuration Changes

Airbnb’s Sitar platform demonstrates how a modern dynamic configuration system can provide safe, reliable, and flexible runtime changes through a Git‑centric workflow, multi‑tenant control and data planes, staged rollouts, rapid rollback, and local caching, balancing developer agility with operational stability.

Git Workflow · Microservices · dynamic-configuration

0 likes · 13 min read

Xiaolei Talks DB

Oct 22, 2025 · Databases

How to Evaluate a Database’s Long‑Term Service Capability

In a landscape crowded with OLTP, OLAP, HTAP, NewSQL and cloud‑native options, this article explains why enterprises must look beyond performance and assess a database’s five‑dimensional long‑term service capability to ensure sustainable growth and low migration risk.

Databases · Enterprise Architecture · long-term evaluation

0 likes · 9 min read

dbaplus Community

Jun 23, 2025 · Operations

How to Tame Alert Fatigue: Practical Strategies for Backend Alert Governance

This article shares a year‑long, hands‑on experience of improving backend alert governance at Tencent Meeting, covering why alerts are hard, designing segmented error codes, building unified alert policies, getting teams to cut alert noise, measuring progress, and the tools that make the process sustainable.

Alert Management · Monitoring · backend operations

0 likes · 42 min read

Sohu Tech Products

Jun 18, 2025 · Backend Development

How LLMs Transform Traffic Replay Testing for Backend Services

This article walks through the challenges of traditional traffic replay, explains the design and benefits of a conventional replay system, and then details how integrating large language models can automate data preparation, script generation, and validation to make backend testing more accurate, scalable, and efficient.

Backend testing · LLM · service reliability

0 likes · 18 min read

Cognitive Technology Team

Jun 17, 2025 · Cloud Computing

What a Single NullPointerException Taught Us About Cloud Reliability

The June 2025 Google Cloud outage, caused by an untested code change that triggered a NullPointerException, crippled over 70 core services worldwide, prompting a rapid technical fix, public apology, and industry‑wide reflections on cloud stability, fault tolerance, and deployment practices.

Google Cloud · NullPointerException · cloud outage

0 likes · 7 min read

Baidu Tech Salon

Feb 20, 2025 · Backend Development

Avalanche Prevention Architecture in Baidu Netdisk: Practices and Solutions

Baidu Netdisk engineers protect the billion‑user service from cascading failures by deploying dynamic circuit‑breaker overload control, priority‑based traffic isolation, request‑validity filtering, socket‑level disconnect detection, and unified timestamp handling, a combination that dramatically reduces avalanche incidents and boosts overall availability.

avalanche prevention · backend-architecture · circuit breaker

0 likes · 17 min read

Baidu Geek Talk

Feb 17, 2025 · Operations

How Baidu Netdisk Prevents Service Avalanches: Dynamic Circuit Breaking & Queue Control

This article analyzes Baidu Netdisk's anti‑avalanche architecture, explaining how avalanche cascades occur in high‑concurrency services and detailing practical prevention, blocking, and mitigation techniques such as dynamic circuit breaking, traffic isolation, request‑validity checks, and socket‑level detection to maintain system reliability.

Dynamic Throttling · Operations · avalanche mitigation

0 likes · 18 min read

JD Cloud Developers

Dec 19, 2024 · Backend Development

How Discard Policy and Error Threshold Rescue Java Services During Log Overload

This article analyzes a severe service‑availability drop caused by Log4j2 asynchronous logging bottlenecks, explains how configuring log4j2.asyncQueueFullPolicy=Discard and log4j2.discardThreshold=ERROR mitigates the issue, details the investigation steps and performance tests, and provides practical recommendations for robust backend logging.

Asynchronous Logging · Java backend · log4j2

0 likes · 15 min read

JavaEdge

Dec 8, 2024 · Backend Development

Netflix’s Service‑Level Priority Load Shedding: Protecting User‑Initiated Requests

This article explains how Netflix extended its priority load‑shedding strategy from the API gateway to individual services, detailing the classification of user‑initiated versus pre‑fetch requests, the implementation of partitioned concurrency limiters, CPU‑ and I/O‑based shedding, test results, and real‑world impact on availability.

Netflix · backend-architecture · concurrency limits

0 likes · 18 min read

JD Tech Talk

Nov 27, 2024 · Operations

Understanding SLA, SLO, and SLI: Concepts, Practices, and Alert Governance for High‑Traffic Events

This article explains the definitions and relationships of SLA, SLO, and SLI, shows how to set realistic targets, presents service‑level grading, alert‑noise reduction techniques, and practical examples to help teams prepare for large‑scale events such as the 11.11 promotion.

Alert Management · SLA · SLI

0 likes · 20 min read

JD Cloud Developers

Nov 27, 2024 · Operations

Mastering SLA, SLO, and SLI: Practical Strategies for Reliable Services

This article explains the core concepts of SLA, SLO, and SLI, demonstrates how to set realistic service level objectives, manage alert noise, and apply practical examples—including API, MQ, and scheduled task monitoring—to improve system reliability and performance during high‑traffic events like 11.11 promotions.

Monitoring · SLA · SLI

0 likes · 23 min read

DevOps

Aug 22, 2024 · Operations

Synthetic Monitoring and Fault Drills: Practices for Ensuring Service Stability

This article explains why service stability is critical, outlines the importance and key factors of synthetic monitoring, provides practical guidelines for implementing it, and then describes fault‑drill concepts, benefits, processes, and common cloud‑native tools to proactively discover and mitigate failures in micro‑service environments.

Fault Injection · Operations · Synthetic Monitoring

0 likes · 11 min read

Rare Earth Juejin Tech Community

Jun 18, 2024 · Backend Development

Graceful Shutdown in Go: Designing Robust Service Termination with the GS Library

This article describes a real‑world incident where rapid pod scaling caused order‑submission failures in a serverless e‑commerce platform, analyzes the root causes, and presents a Go‑based graceful‑shutdown solution—including ASyncClose, SyncClose, and ForceSyncClose modes—implemented in the open‑source GS library to help developers reliably terminate services.

Backend Development · Go · Graceful Shutdown

0 likes · 21 min read

Efficient Ops

Apr 23, 2024 · Cloud Computing

Why Do Cloud Outages Keep Happening? Governance Lessons and Strategies

The article examines the rapid growth of China's cloud market, the frequent "cloud collapse" incidents, their root causes in governance failures, and presents practical cloud governance measures along with an overview of the new industry standard for enterprise cloud governance capability maturity.

Industry standards · cloud governance · service reliability

0 likes · 8 min read

Ops Development Stories

Apr 16, 2024 · Cloud Computing

What Tencent Cloud’s Outage Reveals About IaaS vs PaaS Reliability

The article analyzes a recent Tencent Cloud outage, detailing the specific API failures, contrasting the limited impact on IaaS services with widespread PaaS disruptions, and argues for multi‑cloud redundancy while critiquing sensationalist news and outdated status‑page expectations.

IaaS · PaaS · Tencent Cloud

0 likes · 12 min read

Wukong Talks Architecture

Apr 15, 2024 · Operations

Post‑mortem of the April 8 Tencent Cloud API Outage and Improvement Measures

On April 8, a Tencent Cloud API outage caused console login failures for nearly 2,000 customers and affected several dependent services for 87 minutes. This post‑mortem presents the detailed root‑cause analysis and the subsequent improvement actions intended to enhance system resilience and change management.

API · Cloud · Operations

0 likes · 8 min read

NetEase LeiHuo Testing Center

Nov 3, 2023 · Operations

Best Practices for Third‑Party Interface Collaboration: Concepts, Rate Limiting, Monitoring, and Incident Handling

The article outlines how game QA and third‑party providers can improve cooperation by aligning basic performance concepts such as TPS, QPS and concurrency, selecting appropriate rate‑limiting strategies, establishing precise monitoring and alerting, and preparing clear incident‑response and delivery standards.

Monitoring · Operations · performance testing

0 likes · 15 min read

Didi Tech

Jul 25, 2023 · Backend Development

Separating Test Traffic Trigger and Result Verification for Didi Ride‑Hailing Backend

By separating test‑traffic triggering from result verification, Didi’s ride‑hailing backend uses live‑traffic inspection and replayed offline tests with bucketed validation rules to achieve near‑zero‑cost, full‑coverage QA, catching hundreds of bugs annually and dramatically improving service reliability for drivers and passengers.

Backend testing · Ride Hailing · quality assurance

0 likes · 18 min read

Test Development Learning Exchange

May 25, 2023 · Operations

Online Incident Severity Level Definition Rules

This document defines the online incident severity grading system, outlining fault categories, influencing factors such as business metrics, capital loss, user impact, and public opinion, and presents detailed P0‑P3 grading rules with tables for capital‑based, C‑end, and B‑end user classifications.

Incident Management · fault classification · service reliability

0 likes · 8 min read

DaTaobao Tech

May 12, 2023 · Backend Development

Backend Development Journey and Lessons from Alibaba Taobao

Through a five‑year backend journey—from building a solo startup site and mastering Java, to handling high‑traffic services at Sina Weibo, and now developing B2B merchant tools at Alibaba Taobao—the author shares lessons on scalable architecture, automated deployment, aligning tech with business, proactive problem‑solving, code quality, teamwork, and career health.

career growth · service reliability · technical leadership

0 likes · 9 min read

Java High-Performance Architecture

Jan 24, 2023 · Backend Development

How to Build Highly Available Backend APIs: 10 Essential Design Principles

This article explains why high availability is crucial for backend services and outlines ten practical design principles—including dependency control, avoiding single points, load balancing, isolation, rate limiting, circuit breaking, async processing, degradation, gray release, and chaos engineering—to help developers create resilient APIs.

API design · High Availability · backend

0 likes · 10 min read

vivo Internet Technology

Jan 4, 2023 · Artificial Intelligence

Root Cause Localization Algorithm and Its Implementation for Service Fault Diagnosis

The article describes a root‑cause localization algorithm implemented in vivo’s monitoring platform that automatically analyzes latency spikes by splitting service timelines, computing variance, clustering results with K‑means, and recursively tracing downstream services, achieving over 85 % accuracy for dependency failures while still requiring human verification and outlining future AI‑driven enhancements.

AIOps · Fault Localization · K-Means

0 likes · 13 min read

Zuoyebang Tech Team

Aug 26, 2022 · Operations

How We Built a Three‑Layer Stability System for Massive Scale Operations

This article details the operational mindset, stability framework, and transformation journey of the Zuoyebang infrastructure team, covering service lifecycle management, standardization, cloud‑native architecture, multi‑active deployment, incident pre‑plan platforms, traffic scheduling, monitoring, capacity planning, and future directions for SRE service‑orientation.

Automation · Monitoring · Operations

0 likes · 20 min read

ITPUB

Aug 18, 2022 · Operations

How WeChat Keeps Billions of Requests Stable: Overload Control Strategies for Massive Microservices

This article breaks down WeChat’s 2018 overload control system for massive microservices, explaining the problem of service overload, detection via average waiting time, and a multi‑level priority‑based mitigation strategy that dynamically adjusts admission thresholds to keep billions of daily requests stable.

Microservices · Priority Scheduling · WeChat

0 likes · 12 min read

ITPUB

Jul 31, 2022 · Operations

How Bilibili Scaled Live Gift Revenue with Full‑Link Automated Load Testing

This article details Bilibili's end‑to‑end full‑link load‑testing solution for its live‑stream gifting service, covering industry alternatives, the chosen architecture, a three‑stage automated testing framework, link analysis, configuration, validation, and practical case studies to ensure system stability under massive traffic spikes.

Bilibili · full‑link · load testing

0 likes · 16 min read

dbaplus Community

Nov 14, 2021 · Operations

How to Boost Service Reliability: SRE Basics and Tackling Technical Debt

This article explains the fundamentals of Site Reliability Engineering, outlines a complete SRE workflow from prevention to post‑mortem, details key availability metrics and golden indicators, examines how technical debt arises and can be mitigated, and describes the tooling and practices needed to keep large‑scale services healthy.

Monitoring · SRE · service reliability

0 likes · 18 min read

HaoDF Tech Team

Nov 8, 2021 · Operations

Service Risk Governance: Exploration, Mitigation, and Hands‑On Workshop

This talk recounts how the Good Doctor platform tackled severe online incidents by launching the DOA project, then a service risk governance initiative that identifies, quantifies, and mitigates latency‑related risks through metrics‑driven development, dependency analysis, middleware reliability, and a dedicated risk‑management platform.

Microservices · Risk Governance · SRE

0 likes · 16 min read

Efficient Ops

Jul 11, 2021 · Operations

Mastering Incident Management: Principles and Methods for Effective Fault Handling

This guide outlines essential incident management principles—prioritizing business recovery and timely escalation—and presents practical methodologies such as restart, isolation, and downgrade, while detailing user impact handling, organizational roles, and post‑mortem best practices.

Incident Management · Operations · escalation

0 likes · 10 min read

Youku Technology

Mar 5, 2021 · Industry Insights

How Youku Built a Service‑Side Quality Assurance System to Boost Release Quality

This article outlines Youku's end‑to‑end service‑side quality assurance framework, detailing the factors that affect quality across the development lifecycle, the automated testing practices integrated into the release pipeline, the platform capabilities built for data collection and replay, and the metrics used to measure improvements in reliability and development efficiency.

Automation · Backend testing · Continuous Integration

0 likes · 12 min read

NetEase Yanxuan Technology Product Team

Dec 11, 2020 · Operations

How to Build Effective Stability Governance for E‑commerce Logistics Services

This article analyzes the concept of stability governance, outlines its five fault‑management sub‑domains, examines the pain points of an electronic waybill service, and presents a comprehensive three‑phase strategy spanning prevention, perception, reach, mitigation, and post‑mortem, backed by concrete implementation steps in availability, monitoring, and online emergency handling.

Monitoring · Operations · incident response

0 likes · 12 min read

Open Source Linux

Sep 12, 2020 · Operations

Mastering Incident Response: Core Principles and Practical Methods

This guide outlines essential incident‑response principles—prioritizing business restoration and timely escalation—while detailing practical methods such as restart, isolation, and degradation, and explains how to organize response teams and conduct thorough post‑incident reviews.

Incident Management · Restart · degradation

0 likes · 11 min read

Efficient Ops

Sep 9, 2020 · Operations

Mastering Incident Management: Core Principles and Practical Methods

This guide outlines essential incident management principles—prioritizing business restoration and timely escalation—followed by detailed methodologies such as restart, isolation, and degradation, and explains role responsibilities, user impact handling, and post‑incident summarization for continuous improvement.

Incident Management · Operations · fault handling

0 likes · 10 min read

21CTO

Jul 13, 2020 · Operations

Why Did GitHub Crash? Inside the July 2020 Outage and Its Root Causes

The July 13, 2020 GitHub outage, triggered by load‑balancer misconfiguration, a database connection error during partitioning, and a network‑config mistake, sparked worldwide developer panic, highlighted reliability concerns, and revealed challenges in scaling cloud infrastructure amid the pandemic.

Cloud Computing · GitHub · Outage

0 likes · 6 min read

Didi Tech

Jun 3, 2020 · Backend Development

Stability Guidelines and Anti‑Patterns for Backend Services

Drawing on five years of incident reviews, the article defines a comprehensive stability framework for backend services—mandating timeout hierarchies, weak dependencies, service-discovery integration, staged gray releases, robust monitoring, capacity planning, and strict change management—while cataloguing common anti-patterns such as over-aggressive circuit breaking, static retries, improper timeouts, tight coupling, and insufficient isolation, and urging regular rehearsal of these practices.

Incident Management · backend stability · deployment best practices

0 likes · 21 min read

Programmer DD

Apr 29, 2020 · Operations

How to Keep Your Distributed System Running Even When Upstream Services Fail

The article explains why distributed systems must stay alive despite upstream or downstream failures, emphasizing rate limiting and circuit breaking as essential practices to prevent fault propagation and ensure service reliability, and it invites developers to assess their own safeguards.

circuit breaking · distributed systems · rate limiting

0 likes · 3 min read

Qunar Tech Salon

Jan 7, 2020 · Operations

Comprehensive Dependency Governance for High‑Availability Backend Systems

This article outlines a systematic approach to dependency governance in high‑traffic backend services, covering service classification, rate limiting, Dubbo, HTTP, database, and message‑queue management to enhance availability, reduce failure impact, and improve overall system stability.

Dubbo · Operations · dependency management

0 likes · 10 min read

Didi Tech

Dec 2, 2019 · Operations

Capacity Estimation Methodology for Growing Services

The article presents a systematic capacity‑estimation methodology that links service traffic to order volume, uses CPU‑Idle as a primary metric, predicts traffic growth and upper‑bound limits, validates predictions with load‑testing, and provides scaling recommendations while noting limitations of the CPU‑Idle baseline.

Traffic Prediction · capacity planning · resource utilization

0 likes · 9 min read

JD Retail Technology

Oct 15, 2019 · Operations

Traffic Replication and Replay Platform for JD APP: Design, Features, and Operational Impact

The article describes JD's traffic replication and replay platform, explaining its background, the concepts of traffic copying and replay, detailed platform architecture and features, the routine load‑testing workflow, dynamic regression testing, operational results, current limitations, and future improvement directions.

Automation · JD platform · load testing

0 likes · 11 min read

Ctrip Technology

Apr 18, 2019 · Operations

Application Monitoring Systems: Necessity, Components, Distributed Tracing, and Design for Developers, Testers, and Operations

The article explains why enterprise application monitoring systems are essential, outlines their core components such as Trace, Log, Metric, and Report, discusses distributed tracing techniques, and describes how these insights are designed to aid developers, testers, and operations engineers in performance tuning and fault diagnosis.

Distributed Tracing · Observability · application monitoring

0 likes · 12 min read

ITPUB

Mar 26, 2019 · Operations

How to Build a 99.99% High‑Availability Service: Practices and Architecture Evolution

This article explains the essential requirements for achieving 99.99% service availability—consistency, eliminating single points, placement groups, traffic isolation, same‑city active‑active, N+1 redundancy, and multi‑region active‑active—illustrated with a step‑by‑step Yum repository service case study and evolving architecture diagrams.

architecture · cloud operations · deployment

0 likes · 9 min read

Architect's Tech Stack

Dec 5, 2018 · Operations

Practical Fault‑Tolerance Practices in a Large‑Scale Activity Operations Platform

The article shares a comprehensive, experience‑driven guide on building fault‑tolerant systems—covering retry mechanisms, dynamic node removal, timeout settings, service degradation, decoupling, and business‑level safeguards—to enable a platform that scales from millions to billions of daily requests without relying on manual fire‑fighting.

Operations · System Design · fault tolerance

0 likes · 21 min read

Programmer DD

Oct 31, 2018 · Operations

Prevent Service Failures: Question Third Parties, Guard Users, Perfect Your Code

This article shares practical strategies for avoiding system failures by doubting third‑party services, protecting against misuse by consumers, and strengthening internal design through solid API practices, resource limits, and disciplined coding principles.

API design · Operations · Resource Management

0 likes · 16 min read

Efficient Ops

Oct 29, 2018 · Operations

How Youzan Manages Online Incidents: A Step‑by‑Step Guide

This article outlines Youzan's end‑to‑end online incident management process—from fault detection and coordination through root‑cause analysis, recovery, review, and actionable JIRA tracking—highlighting practical workflows, data analysis, and continuous improvement practices for reliable service delivery.

Incident Management · JIRA workflow · Operations

0 likes · 10 min read

ITPUB

Jun 5, 2018 · Operations

How Meituan Achieved Near‑Zero Downtime for Its Account Service

This article details Meituan's practical approaches to boosting account service reliability, covering MTBF/MTTR metrics, business‑level monitoring, flexible availability with circuit‑breaker patterns, cross‑region active‑active deployment, data synchronization techniques, and the measurable performance gains achieved.

Active-Active · Data synchronization · High Availability

0 likes · 13 min read

Efficient Ops

May 2, 2018 · Operations

How Tencent Scales 20,000+ Servers: Lessons from SNG Operations

This talk outlines the five major challenges faced by Tencent's SNG component operations—geographic distribution, HTTPS certificate management, massive device failures, long‑term maintenance, and large‑scale scaling—and describes the underlying architecture, operational principles, and practical techniques used to automate and reliably support millions of users during peak events.

Automation · Operations · Tencent

0 likes · 20 min read

ITFLY8 Architecture Home

Mar 22, 2018 · Operations

How Simple Retry Can Crash Your System and Smarter Alternatives

This article examines the pitfalls of naive retry mechanisms, explores active‑standby service switching, dynamic removal of unhealthy nodes, proper timeout configuration, and anti‑reentrancy strategies to improve system reliability and prevent cascading failures in large‑scale backend operations.

fault tolerance · retry · service reliability

0 likes · 14 min read

Efficient Ops

Mar 6, 2018 · Operations

How Tencent’s SNG Ops Team Overcame Five Massive Service Challenges

The SNG Operations team shares the five critical challenges of managing tens of thousands of domains, certificates, server failures, automation, and rapid scaling during peak events, and outlines the practical strategies they used to ensure reliable, near‑real‑time service delivery.

Automation · Operations · certificate-management

0 likes · 6 min read

dbaplus Community

Oct 16, 2017 · Operations

How Ele.me Scaled Operations: Key Lessons from Incident Management and Capacity Planning

This article details Ele.me's rapid expansion challenges and shares a three‑stage technical operations journey—fine‑grained division, stability maintenance, and efficiency gains—highlighting real incidents, monitoring upgrades, capacity testing, and practical insights for reliable large‑scale delivery platforms.

Incident Management · Monitoring · Operations

0 likes · 14 min read

Meituan Technology Team

Aug 10, 2017 · Frontend Development

Front-End Service Availability: Definition, Measurement, and Assurance Practices at Meituan-Dianping Checkout

The article outlines Meituan‑Dianping’s approach to front‑end service availability for its checkout system, defining availability across code, static resources, and network links, measuring failure duration, identifying typical bugs, and implementing a three‑stage assurance strategy using people processes, engineering tools, lightweight technology choices, and concrete practices such as TypeScript adoption, automated testing, health‑checks, DNS protection, and post‑incident monitoring.

Monitoring · SSR · availability

0 likes · 15 min read

High Availability Architecture

Feb 22, 2017 · Operations

LinkedIn Redliner: Automated Capacity Planning and Performance Testing in Production

The article explains how LinkedIn’s Redliner system automatically measures service capacity and performs low‑impact, production‑traffic stress tests to identify bottlenecks, guide resource allocation, and support proactive capacity planning and performance regression testing.

LinkedIn · capacity planning · performance testing

0 likes · 11 min read

Efficient Ops

Feb 6, 2017 · Operations

Building Billion‑Scale Web Systems That Auto‑Extinguish Failures

The article shares Tencent’s practical fault‑tolerance journey for a billion‑scale activity platform, covering retry strategies, automatic removal of faulty nodes, timeout tuning, business‑level safeguards, service degradation, and decoupling techniques that together reduce manual firefighting and improve system resilience.

Operations · fault tolerance · large-scale systems

0 likes · 25 min read

Efficient Ops

Nov 9, 2016 · Operations

How to Design Effective SLOs and SLAs: A Technical Deep Dive

This article explains the definitions of service, SLI, SLO, and SLA, outlines how to choose and measure appropriate indicators, shares best practices for setting and improving SLOs, and shows how SLAs combine objectives with consequences to manage service reliability.

Cloud Computing · Operations · SLA

0 likes · 11 min read

Efficient Ops

Oct 16, 2016 · Operations

Balancing Reliability and Innovation: Google’s SRE Risk Management Explained

This article explores how Google Site Reliability Engineers manage service reliability by balancing risk, cost, and business goals, using metrics like unplanned downtime, availability formulas, and risk tolerance to set realistic SLOs for both consumer and infrastructure services.

Google · Operations · Risk Management

0 likes · 21 min read

Java High-Performance Architecture

May 23, 2016 · Cloud Native

What Uber’s Microservices Reveal About the Pros and Cons of Distributed Architecture

Uber’s adoption of microservices showcases both the flexibility of using multiple languages and independent release cycles, while also exposing challenges such as duplicated effort across teams, type‑unsafe JSON interfaces, and the need for rigorous failure testing, offering valuable lessons for large‑scale system design.

Microservices · Uber · architecture

0 likes · 5 min read

Baidu Intelligent Testing

Feb 24, 2016 · Fundamentals

Rethinking Mobile Software Quality: From Bug Reduction to End‑to‑End Service Excellence

The article examines how rapid mobile internet growth reshapes software quality expectations, urging QA engineers to balance bug prevention, end‑to‑end service reliability, and user‑centric product excellence across evolving B2C scenarios.

bug reduction · mobile QA · service reliability

0 likes · 6 min read

21CTO

Aug 29, 2015 · Backend Development

How to Prevent Service Failures: Distrust Third Parties, Guard Against Users, Master Your Own Code

An experienced backend engineer shares practical strategies to prevent service failures, covering third‑party distrust, user‑side safeguards, robust API design, traffic limiting, resource management, and architectural best practices such as single‑responsibility and avoiding single points of failure.

API design · Resource Management · fault tolerance

0 likes · 16 min read
