Tagged articles
56 articles
Airbnb Technology Team
Mar 24, 2026 · Cloud Native

How Airbnb Ensures Safe, Reliable Dynamic Configuration Changes

Airbnb’s Sitar platform demonstrates how a modern dynamic configuration system can provide safe, reliable, and flexible runtime changes through a Git‑centric workflow, multi‑tenant control and data planes, staged rollouts, rapid rollback, and local caching, balancing developer agility with operational stability.

Dynamic Configuration · Microservices · git-workflow
0 likes · 13 min read
Xiaolei Talks DB
Oct 22, 2025 · Databases

How to Evaluate a Database’s Long‑Term Service Capability

In a landscape crowded with OLTP, OLAP, HTAP, NewSQL and cloud‑native options, this article explains why enterprises must look beyond performance and assess a database’s five‑dimensional long‑term service capability to ensure sustainable growth and low migration risk.

Technology Selection · databases · enterprise architecture
0 likes · 9 min read
dbaplus Community
Jun 23, 2025 · Operations

How to Tame Alert Fatigue: Practical Strategies for Backend Alert Governance

This article shares a year‑long, hands‑on experience of improving backend alert governance at Tencent Meeting, covering why alerts are hard, designing segmented error codes, building unified alert policies, driving teams to clean up their alerts, measuring progress, and the tools that make the process sustainable.

Alert Management · backend operations · error code design
0 likes · 42 min read
Sohu Tech Products
Jun 18, 2025 · Backend Development

How LLMs Transform Traffic Replay Testing for Backend Services

This article walks through the challenges of traditional traffic replay, explains the design and benefits of a conventional replay system, and then details how integrating large language models can automate data preparation, script generation, and validation to make backend testing more accurate, scalable, and efficient.

Backend testing · LLM · service reliability
0 likes · 18 min read
Cognitive Technology Team
Jun 17, 2025 · Cloud Computing

What a Single NullPointerException Taught Us About Cloud Reliability

The June 2025 Google Cloud outage, caused by an untested code change that triggered a NullPointerException, crippled over 70 core services worldwide, prompting a rapid technical fix, public apology, and industry‑wide reflections on cloud stability, fault tolerance, and deployment practices.

Google Cloud · cloud outage · incident response
0 likes · 7 min read
Baidu Tech Salon
Feb 20, 2025 · Backend Development

Avalanche Prevention Architecture in Baidu Netdisk: Practices and Solutions

Baidu Netdisk engineers protect its billion‑user service from cascading failures by deploying dynamic circuit‑breaker overload control, priority‑based traffic isolation, request‑validity filtering, socket‑level disconnect detection, and unified timestamp handling, a combination that dramatically reduces avalanche incidents and boosts overall availability.

Backend Architecture · avalanche prevention · circuit breaker
0 likes · 17 min read
Baidu Geek Talk
Feb 17, 2025 · Operations

How Baidu Netdisk Prevents Service Avalanches: Dynamic Circuit Breaking & Queue Control

This article analyzes Baidu Netdisk's anti‑avalanche architecture, explaining how avalanche cascades occur in high‑concurrency services and detailing practical prevention, blocking, and mitigation techniques such as dynamic circuit breaking, traffic isolation, request‑validity checks, and socket‑level detection to maintain system reliability.

Backend Architecture · Circuit Breaking · Dynamic Throttling
0 likes · 18 min read
JD Cloud Developers
Dec 19, 2024 · Backend Development

How Discard Policy and Error Threshold Rescue Java Services During Log Overload

This article analyzes a severe service‑availability drop caused by Log4j2 asynchronous logging bottlenecks, explains how configuring log4j2.asyncQueueFullPolicy=Discard and log4j2.discardThreshold=ERROR mitigates the issue, details the investigation steps, performance tests, and provides practical recommendations for robust backend logging.

Java backend · Performance Testing · asynchronous logging
0 likes · 15 min read
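
The discard policy mentioned in this summary can be pictured with a small, library-independent sketch. This is not Log4j2 code: `DiscardingLogQueue`, `offer`, `poll`, and the numeric level table are hypothetical names chosen for illustration. The sketch assumes that, once a bounded queue is full, events at or below the threshold severity are dropped instead of blocking the calling thread.

```python
import queue
from dataclasses import dataclass

# Assumed severity ordering, loosely mirroring common logging levels.
LEVELS = {"DEBUG": 10, "INFO": 20, "WARN": 30, "ERROR": 40, "FATAL": 50}


@dataclass
class LogEvent:
    level: str
    message: str


class DiscardingLogQueue:
    """Bounded queue that drops low-severity events when it is full."""

    def __init__(self, capacity: int, discard_threshold: str = "ERROR"):
        self._queue = queue.Queue(maxsize=capacity)
        self._threshold = LEVELS[discard_threshold]
        self.discarded = 0  # number of events dropped so far

    def offer(self, event: LogEvent) -> bool:
        """Enqueue an event; return False if it was discarded."""
        try:
            self._queue.put_nowait(event)
            return True
        except queue.Full:
            if LEVELS[event.level] <= self._threshold:
                # Queue is full and the event is not severe enough: drop it
                # so the application thread never waits on logging.
                self.discarded += 1
                return False
            # More severe than the threshold: wait for space instead.
            self._queue.put(event)
            return True

    def poll(self):
        """Remove and return the oldest event, or None if the queue is empty."""
        try:
            return self._queue.get_nowait()
        except queue.Empty:
            return None
```

The trade-off in this sketch is explicit: under overload, request threads stay responsive at the cost of losing lower-severity log lines, and the `discarded` counter makes that loss observable.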
JavaEdge
Dec 8, 2024 · Backend Development

Netflix’s Service‑Level Priority Load Shedding: Protecting User‑Initiated Requests

This article explains how Netflix extended its priority load‑shedding strategy from the API gateway to individual services, detailing the classification of user‑initiated versus pre‑fetch requests, the implementation of partitioned concurrency limiters, CPU‑ and I/O‑based shedding, test results, and real‑world impact on availability.

Backend Architecture · Netflix · concurrency limits
0 likes · 18 min read
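
A partitioned concurrency limiter of the kind this summary describes might look roughly like the sketch below. It is an illustration under stated assumptions, not Netflix's implementation: the class name `PartitionedLimiter`, the partition names, and the 80/20 split used in the usage note are all hypothetical. The assumed rule is that a partition may always use its guaranteed share, and may borrow beyond it only while the limiter as a whole has spare capacity.

```python
import threading
from typing import Dict


class PartitionedLimiter:
    """Concurrency limiter whose capacity is split into guaranteed partitions."""

    def __init__(self, total_limit: int, shares: Dict[str, float]):
        self._lock = threading.Lock()
        self._total_limit = total_limit
        # Guaranteed in-flight slots per partition (at least one each).
        self._guaranteed = {
            name: max(1, int(total_limit * share)) for name, share in shares.items()
        }
        self._inflight = {name: 0 for name in shares}

    def try_acquire(self, partition: str) -> bool:
        """Admit the request if its partition is under its guarantee,
        or if the limiter as a whole still has spare capacity."""
        with self._lock:
            total_inflight = sum(self._inflight.values())
            under_guarantee = self._inflight[partition] < self._guaranteed[partition]
            if under_guarantee or total_inflight < self._total_limit:
                self._inflight[partition] += 1
                return True
            return False  # shed this request

    def release(self, partition: str) -> None:
        """Mark one in-flight request of the partition as finished."""
        with self._lock:
            if self._inflight[partition] > 0:
                self._inflight[partition] -= 1
```

With `PartitionedLimiter(10, {"user": 0.8, "prefetch": 0.2})`, pre-fetch traffic can fill idle capacity, but once the limiter is saturated further pre-fetch requests are shed while user-initiated requests are still admitted up to their guaranteed share.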
JD Cloud Developers
Nov 27, 2024 · Operations

Mastering SLA, SLO, and SLI: Practical Strategies for Reliable Services

This article explains the core concepts of SLA, SLO, and SLI, demonstrates how to set realistic service level objectives, manage alert noise, and apply practical examples—including API, MQ, and scheduled task monitoring—to improve system reliability and performance during high‑traffic events like 11.11 promotions.

SLA · SLI · SLO
0 likes · 23 min read
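
To make the relationship between an SLO and its error budget concrete, here is a short arithmetic sketch. The function names and the example figures (a 99.9% target, a 30-day window, one million requests) are illustrative assumptions and are not taken from the article.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full unavailability that an availability SLO tolerates
    over the given window."""
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the request-based error budget still unspent
    (negative once the SLO has been violated)."""
    allowed_failures = (1.0 - slo) * total_requests
    return 1.0 - failed_requests / allowed_failures


# A 99.9% SLO over 30 days leaves roughly 43.2 minutes of downtime.
monthly_budget = error_budget_minutes(0.999, window_days=30)

# 250 failures out of 1,000,000 requests spends about a quarter of the budget.
remaining = budget_remaining(0.999, total_requests=1_000_000, failed_requests=250)
```

Alerting on how fast the remaining budget shrinks, rather than on every individual failure, is one common way to cut alert noise while still protecting the objective.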
DevOps
Aug 22, 2024 · Operations

Synthetic Monitoring and Fault Drills: Practices for Ensuring Service Stability

This article explains why service stability is critical, outlines the importance and key factors of synthetic monitoring, provides practical guidelines for implementing it, and then describes fault‑drill concepts, benefits, processes, and common cloud‑native tools to proactively discover and mitigate failures in micro‑service environments.

Fault Injection · Operations · Synthetic Monitoring
0 likes · 11 min read
Rare Earth Juejin Tech Community
Jun 18, 2024 · Backend Development

Graceful Shutdown in Go: Designing Robust Service Termination with the GS Library

This article describes a real‑world incident where rapid pod scaling caused order‑submission failures in a serverless e‑commerce platform, analyzes the root causes, and presents a Go‑based graceful‑shutdown solution—including ASyncClose, SyncClose, and ForceSyncClose modes—implemented in the open‑source GS library to help developers reliably terminate services.

Backend Development · Go · Graceful Shutdown
0 likes · 21 min read
Efficient Ops
Apr 23, 2024 · Cloud Computing

Why Do Cloud Outages Keep Happening? Governance Lessons and Strategies

The article examines the rapid growth of China's cloud market, the frequent "cloud collapse" incidents, their root causes in governance failures, and presents practical cloud governance measures along with an overview of the new industry standard for enterprise cloud governance capability maturity.

Industry standards · cloud governance · service reliability
0 likes · 8 min read
NetEase LeiHuo Testing Center
Nov 3, 2023 · Operations

Best Practices for Third‑Party Interface Collaboration: Concepts, Rate Limiting, Monitoring, and Incident Handling

The article outlines how game QA and third‑party providers can improve cooperation by aligning basic performance concepts such as TPS, QPS and concurrency, selecting appropriate rate‑limiting strategies, establishing precise monitoring and alerting, and preparing clear incident‑response and delivery standards.

Operations · Performance Testing · monitoring
0 likes · 15 min read
Didi Tech
Jul 25, 2023 · Backend Development

Separating Test Traffic Trigger and Result Verification for Didi Ride‑Hailing Backend

By separating test‑traffic triggering from result verification, Didi’s ride‑hailing backend uses live‑traffic inspection and replayed offline tests with bucketed validation rules to achieve near‑zero‑cost, full‑coverage QA, catching hundreds of bugs annually and dramatically improving service reliability for drivers and passengers.

Backend testing · Ride Hailing · quality assurance
0 likes · 18 min read
Test Development Learning Exchange
May 25, 2023 · Operations

Online Incident Severity Level Definition Rules

This document defines the online incident severity grading system, outlining fault categories, influencing factors such as business metrics, capital loss, user impact, and public opinion, and presents detailed P0‑P3 grading rules with tables for capital‑based, C‑end, and B‑end user classifications.

fault classification · incident management · service reliability
0 likes · 8 min read
DaTaobao Tech
May 12, 2023 · Backend Development

Backend Development Journey and Lessons from Alibaba Taobao

Through a five‑year backend journey—from building a solo startup site and mastering Java, to handling high‑traffic services at Sina Weibo, and now developing B2B merchant tools at Alibaba Taobao—the author shares lessons on scalable architecture, automated deployment, aligning tech with business, proactive problem‑solving, code quality, teamwork, and career health.

Career Growth · service reliability · technical leadership
0 likes · 9 min read
Java High-Performance Architecture
Jan 24, 2023 · Backend Development

How to Build Highly Available Backend APIs: 10 Essential Design Principles

This article explains why high availability is crucial for backend services and outlines ten practical design principles—including dependency control, avoiding single points, load balancing, isolation, rate limiting, circuit breaking, async processing, degradation, gray release, and chaos engineering—to help developers create resilient APIs.

Backend · api-design · fault tolerance
0 likes · 10 min read
vivo Internet Technology
Jan 4, 2023 · Artificial Intelligence

Root Cause Localization Algorithm and Its Implementation for Service Fault Diagnosis

The article describes a root‑cause localization algorithm implemented in vivo’s monitoring platform that automatically analyzes latency spikes by splitting service timelines, computing variance, clustering results with K‑means, and recursively tracing downstream services, achieving over 85 % accuracy for dependency failures while still requiring human verification and outlining future AI‑driven enhancements.

Fault Localization · K-Means · Root Cause Analysis
0 likes · 13 min read
Zuoyebang Tech Team
Aug 26, 2022 · Operations

How We Built a Three‑Layer Stability System for Massive Scale Operations

This article details the operational mindset, stability framework, and transformation journey of the Zuoyebang infrastructure team, covering service lifecycle management, standardization, cloud‑native architecture, multi‑active deployment, incident pre‑plan platforms, traffic scheduling, monitoring, capacity planning, and future directions for SRE service‑orientation.

Automation · Infrastructure · Operations
0 likes · 20 min read
ITPUB
Aug 18, 2022 · Operations

How WeChat Keeps Billions of Requests Stable: Overload Control Strategies for Massive Microservices

This article breaks down WeChat’s 2018 overload control system for massive microservices, explaining the problem of service overload, detection via average waiting time, and a multi‑level priority‑based mitigation strategy that dynamically adjusts admission thresholds to keep billions of daily requests stable.

Microservices · Priority Scheduling · WeChat
0 likes · 12 min read
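
The idea of detecting overload from average waiting time and then tightening a priority-based admission threshold can be sketched as follows. This is a simplified illustration, not WeChat's actual system: the class `AdmissionController`, the 20 ms limit, the number of priority levels, and the one-step-per-window adjustment are all assumptions made for the example.

```python
class AdmissionController:
    """Adjusts an admission threshold from the average queuing time.

    Priority 0 is the most important. A request is admitted only if its
    priority number is less than or equal to the current threshold.
    """

    def __init__(self, max_avg_wait_ms: float = 20.0, levels: int = 8):
        self.max_avg_wait_ms = max_avg_wait_ms
        self.levels = levels
        self.threshold = levels - 1  # start by admitting every priority
        self._waits = []

    def record_wait(self, wait_ms: float) -> None:
        """Record how long one request waited in the queue before processing."""
        self._waits.append(wait_ms)

    def admit(self, priority: int) -> bool:
        return priority <= self.threshold

    def end_window(self) -> None:
        """Close the current measurement window and adapt the threshold."""
        if not self._waits:
            return
        avg_wait = sum(self._waits) / len(self._waits)
        self._waits.clear()
        if avg_wait > self.max_avg_wait_ms:
            # Overloaded: stop admitting the least important level.
            self.threshold = max(0, self.threshold - 1)
        else:
            # Healthy: cautiously re-admit one more level.
            self.threshold = min(self.levels - 1, self.threshold + 1)
```

Queuing time is used here rather than total response time because it reflects load on the local server only, independent of how slow downstream dependencies happen to be.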
ITPUB
Jul 31, 2022 · Operations

How Bilibili Scaled Live Gift Revenue with Full‑Link Automated Load Testing

This article details Bilibili's end‑to‑end full‑link load‑testing solution for its live‑stream gifting service, covering industry alternatives, the chosen architecture, a three‑stage automated testing framework, link analysis, configuration, validation, and practical case studies to ensure system stability under massive traffic spikes.

Bilibili · Load Testing · full‑link
0 likes · 16 min read
dbaplus Community
Nov 14, 2021 · Operations

How to Boost Service Reliability: SRE Basics and Tackling Technical Debt

This article explains the fundamentals of Site Reliability Engineering, outlines a complete SRE workflow from prevention to post‑mortem, details key availability metrics and golden indicators, examines how technical debt arises and can be mitigated, and describes the tooling and practices needed to keep large‑scale services healthy.

SRE · Technical Debt · monitoring
0 likes · 18 min read
HaoDF Tech Team
Nov 8, 2021 · Operations

Service Risk Governance: Exploration, Mitigation, and Hands‑On Workshop

This talk recounts how the HaoDF (Good Doctor) platform tackled severe online incidents by launching the DOA project, then a service risk governance initiative that identifies, quantifies, and mitigates latency‑related risks through metrics‑driven development, dependency analysis, middleware reliability, and a dedicated risk‑management platform.

Microservices · SRE · latency optimization
0 likes · 16 min read
Youku Technology
Mar 5, 2021 · Industry Insights

How Youku Built a Service‑Side Quality Assurance System to Boost Release Quality

This article outlines Youku's end‑to‑end service‑side quality assurance framework, detailing the factors that affect quality across the development lifecycle, the automated testing practices integrated into the release pipeline, the platform capabilities built for data collection and replay, and the metrics used to measure improvements in reliability and development efficiency.

Automation · Backend testing · Operations
0 likes · 12 min read
NetEase Yanxuan Technology Product Team
Dec 11, 2020 · Operations

How to Build Effective Stability Governance for E‑commerce Logistics Services

This article analyzes the concept of stability governance, outlines its five fault‑management sub‑domains (prevention, perception, reach, mitigation, and post‑mortem), examines the pain points of an electronic waybill service, and presents a comprehensive strategy backed by concrete implementation steps in three areas: availability, monitoring, and online emergency handling.

Logistics · Operations · incident response
0 likes · 12 min read
Open Source Linux
Sep 12, 2020 · Operations

Mastering Incident Response: Core Principles and Practical Methods

This guide outlines essential incident‑response principles—prioritizing business restoration and timely escalation—while detailing practical methods such as restart, isolation, and degradation, and explains how to organize response teams and conduct thorough post‑incident reviews.

Isolation · Restart · degradation
0 likes · 11 min read
Efficient Ops
Sep 9, 2020 · Operations

Mastering Incident Management: Core Principles and Practical Methods

This guide outlines essential incident management principles—prioritizing business restoration and timely escalation—followed by detailed methodologies such as restart, isolation, and degradation, and explains role responsibilities, user impact handling, and post‑incident summarization for continuous improvement.

Operations · fault handling · incident management
0 likes · 10 min read
21CTO
Jul 13, 2020 · Operations

Why Did GitHub Crash? Inside the July 2020 Outage and Its Root Causes

The July 13, 2020 GitHub outage, triggered by load‑balancer misconfiguration, a database connection error during partitioning, and a network‑config mistake, sparked worldwide developer panic, highlighted reliability concerns, and revealed challenges in scaling cloud infrastructure amid the pandemic.

GitHub · Infrastructure · Outage
0 likes · 6 min read
Didi Tech
Jun 3, 2020 · Backend Development

Stability Guidelines and Anti‑Patterns for Backend Services

Drawing on five years of incident reviews, the article defines a comprehensive stability framework for backend services—mandating timeout hierarchies, weak dependencies, service-discovery integration, staged gray releases, robust monitoring, capacity planning, and strict change management—while cataloguing common anti-patterns such as over-aggressive circuit breaking, static retries, improper timeouts, tight coupling, and insufficient isolation, and urging regular rehearsal of these practices.

backend stability · deployment best practices · incident management
0 likes · 21 min read
Programmer DD
Apr 29, 2020 · Operations

How to Keep Your Distributed System Running Even When Upstream Services Fail

The article explains why distributed systems must stay alive despite upstream or downstream failures, emphasizing rate limiting and circuit breaking as essential practices to prevent fault propagation and ensure service reliability, and it invites developers to assess their own safeguards.

Circuit Breaking · Distributed Systems · rate limiting
0 likes · 3 min read
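
As a rough illustration of the circuit breaking that this summary calls essential, the sketch below fails fast once a dependency has failed repeatedly, then lets a single trial call through after a cool-down. The class and parameter names (`CircuitBreaker`, `failure_threshold`, `reset_timeout`) and their default values are assumptions for the example, not taken from the article or from any particular library.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock=time.monotonic):
        self._failure_threshold = failure_threshold
        self._reset_timeout = reset_timeout
        self._clock = clock  # injectable so tests can control time
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_timeout:
                # Open: fail fast instead of piling load on a sick dependency.
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: half-open, let this one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            half_open_failure = self._opened_at is not None
            if half_open_failure or self._failures >= self._failure_threshold:
                self._opened_at = self._clock()  # (re)open the circuit
            raise
        # Success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Callers would typically catch `CircuitOpenError` and return a degraded response, which is how a failure in one dependency is kept from propagating to the services that call it.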
Qunar Tech Salon
Jan 7, 2020 · Operations

Comprehensive Dependency Governance for High‑Availability Backend Systems

This article outlines a systematic approach to dependency governance in high‑traffic backend services, covering service classification, rate limiting, Dubbo, HTTP, database, and message‑queue management to enhance availability, reduce failure impact, and improve overall system stability.

Dubbo · Operations · dependency management
0 likes · 10 min read
Didi Tech
Dec 2, 2019 · Operations

Capacity Estimation Methodology for Growing Services

The article presents a systematic capacity‑estimation methodology that links service traffic to order volume, uses CPU‑Idle as a primary metric, predicts traffic growth and upper‑bound limits, validates predictions with load‑testing, and provides scaling recommendations while noting limitations of the CPU‑Idle baseline.

Traffic Prediction · capacity planning · resource utilization
0 likes · 9 min read
JD Retail Technology
Oct 15, 2019 · Operations

Traffic Replication and Replay Platform for JD APP: Design, Features, and Operational Impact

The article describes JD's traffic replication and replay platform, explaining its background, the concepts of traffic copying and replay, detailed platform architecture and features, routine load‑testing workflow, dynamic regression testing, operational results, current limitations, and future improvement directions.

Automation · JD platform · Load Testing
0 likes · 11 min read
Ctrip Technology
Apr 18, 2019 · Operations

Application Monitoring Systems: Necessity, Components, Distributed Tracing, and Design for Developers, Testers, and Operations

The article explains why enterprise application monitoring systems are essential, outlines their core components such as Trace, Log, Metric, and Report, discusses distributed tracing techniques, and describes how these insights are designed to aid developers, testers, and operations engineers in performance tuning and fault diagnosis.

Distributed Tracing · Observability · application monitoring
0 likes · 12 min read
ITPUB
Mar 26, 2019 · Operations

How to Build a 99.99% High‑Availability Service: Practices and Architecture Evolution

This article explains the essential requirements for achieving 99.99% service availability—consistency, eliminating single points, placement groups, traffic isolation, same‑city active‑active, N+1 redundancy, and multi‑region active‑active—illustrated with a step‑by‑step Yum repository service case study and evolving architecture diagrams.

Deployment · architecture · cloud operations
0 likes · 9 min read
Architect's Tech Stack
Dec 5, 2018 · Operations

Practical Fault‑Tolerance Practices in a Large‑Scale Activity Operations Platform

The article shares a comprehensive, experience‑driven guide on building fault‑tolerant systems—covering retry mechanisms, dynamic node removal, timeout settings, service degradation, decoupling, and business‑level safeguards—to enable a platform that scales from millions to billions of daily requests without relying on manual fire‑fighting.

Operations · System Design · fault tolerance
0 likes · 21 min read
Efficient Ops
Oct 29, 2018 · Operations

How Youzan Manages Online Incidents: A Step‑by‑Step Guide

This article outlines Youzan's end‑to‑end online incident management process—from fault detection and coordination through root‑cause analysis, recovery, review, and actionable JIRA tracking—highlighting practical workflows, data analysis, and continuous improvement practices for reliable service delivery.

JIRA workflow · Operations · fault handling
0 likes · 10 min read
ITPUB
Jun 5, 2018 · Operations

How Meituan Achieved Near‑Zero Downtime for Its Account Service

This article details Meituan's practical approaches to boosting account service reliability, covering MTBF/MTTR metrics, business‑level monitoring, flexible availability with circuit‑breaker patterns, cross‑region active‑active deployment, data synchronization techniques, and the measurable performance gains achieved.

Active-Active · Distributed Systems · circuit breaker
0 likes · 13 min read
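
The MTBF/MTTR metrics mentioned in this summary relate to availability through a standard formula, availability = MTBF / (MTBF + MTTR). The sketch below only illustrates that arithmetic; the function name and the sample figures are assumptions, not numbers reported by Meituan.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time between failures (MTBF)
    and mean time to repair (MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Illustrative figures: one failure every 999 hours, one hour to recover.
baseline = availability(mtbf_hours=999.0, mttr_hours=1.0)

# Halving recovery time improves availability without any change in
# how often failures occur.
faster_recovery = availability(mtbf_hours=999.0, mttr_hours=0.5)
```

The formula shows why both levers matter: preventing failures raises MTBF, while better monitoring and faster failover reduce MTTR.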
Efficient Ops
May 2, 2018 · Operations

How Tencent Scales 20,000+ Servers: Lessons from SNG Operations

This talk outlines the five major challenges faced by Tencent's SNG component operations—geographic distribution, HTTPS certificate management, massive device failures, long‑term maintenance, and large‑scale scaling—and describes the underlying architecture, operational principles, and practical techniques used to automate and reliably support millions of users during peak events.

Automation · Operations · Tencent
0 likes · 20 min read
ITFLY8 Architecture Home
Mar 22, 2018 · Operations

How Simple Retry Can Crash Your System and Smarter Alternatives

This article examines the pitfalls of naive retry mechanisms, explores active‑standby service switching, dynamic removal of unhealthy nodes, proper timeout configuration, and anti‑reentrancy strategies to improve system reliability and prevent cascading failures in large‑scale backend operations.

Retry · Timeout · fault tolerance
0 likes · 14 min read
Efficient Ops
Mar 6, 2018 · Operations

How Tencent’s SNG Ops Team Overcame Five Massive Service Challenges

The SNG Operations team shares the five critical challenges of managing tens of thousands of domains, certificates, server failures, automation, and rapid scaling during peak events, and outlines the practical strategies they used to ensure reliable, near‑real‑time service delivery.

Automation · Operations · certificate-management
0 likes · 6 min read
dbaplus Community
Oct 16, 2017 · Operations

How Ele.me Scaled Operations: Key Lessons from Incident Management and Capacity Planning

This article details Ele.me's rapid expansion challenges and shares a three‑stage technical operations journey—fine‑grained division, stability maintenance, and efficiency gains—highlighting real incidents, monitoring upgrades, capacity testing, and practical insights for reliable large‑scale delivery platforms.

Operations · capacity planning · incident management
0 likes · 14 min read
Meituan Technology Team
Aug 10, 2017 · Frontend Development

Front-End Service Availability: Definition, Measurement, and Assurance Practices at Meituan-Dianping Checkout

The article outlines Meituan‑Dianping’s approach to front‑end service availability for its checkout system, defining availability across code, static resources, and network links, measuring failure duration, identifying typical bugs, and implementing a three‑stage assurance strategy using people processes, engineering tools, lightweight technology choices, and concrete practices such as TypeScript adoption, automated testing, health‑checks, DNS protection, and post‑incident monitoring.

Availability · SSR · frontend
0 likes · 15 min read
Efficient Ops
Feb 6, 2017 · Operations

Building Billion‑Scale Web Systems That Auto‑Extinguish Failures

The article shares Tencent’s practical fault‑tolerance journey for a billion‑scale activity platform, covering retry strategies, automatic removal of faulty nodes, timeout tuning, business‑level safeguards, service degradation, and decoupling techniques that together reduce manual firefighting and improve system resilience.

Operations · fault tolerance · large-scale systems
0 likes · 25 min read
Efficient Ops
Nov 9, 2016 · Operations

How to Design Effective SLOs and SLAs: A Technical Deep Dive

This article explains the definitions of service, SLI, SLO, and SLA, outlines how to choose and measure appropriate indicators, shares best practices for setting and improving SLOs, and shows how SLAs combine objectives with consequences to manage service reliability.

Operations · SLA · SLI
0 likes · 11 min read
Java High-Performance Architecture
May 23, 2016 · Cloud Native

What Uber’s Microservices Reveal About the Pros and Cons of Distributed Architecture

Uber’s adoption of microservices showcases both the flexibility of using multiple languages and independent release cycles, while also exposing challenges such as duplicated effort across teams, type‑unsafe JSON interfaces, and the need for rigorous failure testing, offering valuable lessons for large‑scale system design.

Distributed Systems · Microservices · Type Safety
0 likes · 5 min read
21CTO
Aug 29, 2015 · Backend Development

How to Prevent Service Failures: Trust Third‑Party, Guard Users, Master Your Own Code

An experienced backend engineer shares practical strategies to prevent service failures, covering third‑party distrust, user‑side safeguards, robust API design, traffic limiting, resource management, and architectural best practices such as single‑responsibility and avoiding single points of failure.

Resource Management · api-design · fault tolerance
0 likes · 16 min read