Tagged articles

28 articles

Jun 23, 2025 · Operations

How to Tame Alert Fatigue: Practical Strategies for Backend Alert Governance

This article shares a year‑long, hands‑on experience of improving backend alert governance at Tencent Meeting, covering why alerts are hard, designing segmented error codes, building unified alert policies, getting teams to reduce noisy alerts, measuring progress, and the tools that make the process sustainable.

Alert Management · backend operations · error code design

0 likes · 42 min read

Architecture Breakthrough

May 17, 2024 · Operations

Why Service Orchestration Fails: Risks, Ordering Rules, and Rollback Strategies

The article analyzes a real‑world financing workflow where a failed electronic contract generation exposed legal, loss, and reputation risks, and it outlines critical ordering, dependency, and rollback cost considerations to improve service orchestration reliability.

Distributed Transactions · Service Orchestration · backend operations

0 likes · 6 min read

Architecture Digest

Jan 5, 2024 · Operations

Nginx Rate Limiting: Request Rate, Connection Limits, and Bandwidth Control

This article explains how to configure Nginx for rate limiting, including request rate control, burst handling, whitelisting, connection limits, and upload/download bandwidth throttling, with detailed directives, examples, and code snippets to ensure proper service stability.

Bandwidth Control · Connection Limit · Nginx

0 likes · 14 min read

Sohu Tech Products

Dec 27, 2023 · Operations

Why Does Elasticsearch Refresh Take 1‑5 Seconds? A Deep Dive into Index Settings and Soft Delete

This article records a systematic test of Elasticsearch refresh latency, revealing that update operations, a high proportion of deleted documents, and the soft‑delete setting significantly increase refresh time, while the large‑segment strategy and disabling soft delete can reduce latency without harming overall performance.

Elasticsearch · Index Optimization · Performance Testing

0 likes · 7 min read

Baidu Geek Talk

Nov 22, 2023 · Operations

Stability Assurance for Baidu Search Aladdin during Large-Scale Events

Baidu’s Aladdin search service safeguards stability during massive traffic spikes, such as the Gaokao and the Tokyo and Beijing Olympics, by mapping dependencies, deploying multi‑dimensional monitoring, adding scaling layers like multi‑region Redis, and establishing rapid‑response on‑call teams, achieving over 99.99% uptime and near‑real‑time data updates.

backend operations · fault handling · large-scale traffic

0 likes · 9 min read

360 Quality & Efficiency

Nov 11, 2022 · Operations

Understanding TCPCopy: Architecture, Core Principles, and Performance

This article introduces the open‑source traffic‑replay tool TCPCopy, explains its 1.0 architecture—including the tcpcopy and intercept components—covers its packet‑capture and injection methods (raw socket vs pcap), TCP state handling, routing challenges, intercept role, and performance characteristics, providing practical insights for backend testing and operations.

PCAP · backend operations · network testing

0 likes · 9 min read

ITPUB

Aug 18, 2022 · Operations

How WeChat Keeps Billions of Requests Stable: Overload Control Strategies for Massive Microservices

This article breaks down WeChat’s 2018 overload control system for massive microservices, explaining the problem of service overload, detection via average waiting time, and a multi‑level priority‑based mitigation strategy that dynamically adjusts admission thresholds to keep billions of daily requests stable.

Microservices · Priority Scheduling · WeChat

0 likes · 12 min read

Sanyou's Java Diary

Aug 11, 2022 · Operations

Rapidly Diagnose Production Bugs with Linux Tools, Performance Tricks & Design Patterns

This article guides developers through classifying system‑level and business‑level bugs, using Linux utilities like perf, ps, and vmstat for quick root‑cause analysis, and outlines effective code‑design patterns and architectural strategies—caching, rate‑limiting, and high‑availability—to prevent and resolve production incidents.

Linux performance · backend operations · bug troubleshooting

0 likes · 13 min read

dbaplus Community

Aug 7, 2022 · Operations

How to Slim Down Application Logs: Practical Techniques and Real‑World Case Study

Developers often flood systems with INFO logs, causing massive files that strain operations; this article outlines practical log‑slimming strategies—printing only essential logs, merging entries, using abbreviations, and context‑aware level switches—illustrated with a concrete case that reduced daily log volume from 5 GB to under 1 GB.

Code Refactoring · backend operations · java logging

0 likes · 7 min read

Programmer DD

Aug 2, 2022 · Operations

Master JVM Debugging with Arthas: Essential Commands and Real‑World Use Cases

Arthas, Alibaba’s open‑source Java diagnostic tool, enables dynamic code tracing, real‑time JVM monitoring, and on‑the‑fly debugging without stopping applications; this guide covers installation, common scenarios, and core commands such as stack, jad, sc, watch, trace, jobs, logger, dashboard, and redefine for effective troubleshooting.

Arthas · JVM debugging · Java

0 likes · 16 min read

Laravel Tech Community

Jun 6, 2022 · Operations

Nginx Unit 1.27.0 Release: HTTPS Redirection, Configurable Filenames, and Platform Updates

The Nginx Unit 1.27.0 release introduces HTTP‑to‑HTTPS redirection using $request_uri, configurable default filenames for pure‑path URIs, numerous bug fixes, expanded Linux distribution support, and updated Docker images with the latest language runtimes.

HTTPS redirection · NGINX Unit · Server Configuration

0 likes · 4 min read

IT Architects Alliance

Jan 27, 2022 · Operations

How to Build a Highly Available Redis Service with Sentinel – Step‑by‑Step Guide

This article explains why Redis needs high availability, defines failure scenarios, compares common HA solutions, and walks through four deployment patterns—from a single instance to a three‑Sentinel architecture—highlighting their trade‑offs and practical implementation details.

Service Architecture · backend operations · high availability

0 likes · 13 min read

Architecture Digest

Oct 3, 2021 · Operations

Comparison of Distributed Scheduling Frameworks and Their Differences from Quartz

This article examines common business scenarios that require timed tasks, introduces single‑machine and distributed scheduling solutions such as Timer, ScheduledExecutorService, Spring, Quartz, TBSchedule, elastic‑job, Saturn, and XXL‑Job, and provides a detailed feature‑by‑feature comparison to help choose the most suitable framework.

Distributed Scheduling · Elastic-Job · Quartz

0 likes · 11 min read

macrozheng

Sep 6, 2021 · Operations

Choosing the Right Distributed Scheduler: Elastic‑Job vs XXL‑Job vs Quartz

This article examines common business scenarios requiring timed tasks, compares single‑machine and distributed scheduling frameworks such as Timer, Spring, Quartz, TBSchedule, Elastic‑Job, Saturn and XXL‑Job, and provides guidance on selecting the most suitable solution.

Distributed Scheduling · Elastic-Job · Quartz

0 likes · 15 min read

Tencent Cloud Middleware

Aug 19, 2021 · Backend Development

Fast Kafka Cluster Expansion: Practical Strategies to Reduce Data Migration

When a Kafka cluster reaches load limits or experiences sudden traffic spikes, urgent expansion is needed, but data migration can be time‑consuming and risky; this guide outlines several practical techniques—including adjusting retention, adding partitions, leader switching, and single‑replica operation—to quickly scale clusters while minimizing data movement and service disruption.

Data Migration · Kafka · Partition Reassignment

0 likes · 21 min read

IT Architects Alliance

Jul 8, 2021 · Operations

Mastering High Availability: From Cold Backup to Multi‑Region Active‑Active

This article analyzes various high‑availability strategies for stateful backend services—covering cold backup, dual‑machine hot standby, same‑city active‑active, remote active‑active, and multi‑region active‑active architectures—detailing their benefits, limitations, and practical implementation considerations.

Active-Active · System Design · backend operations

0 likes · 14 min read

ITFLY8 Architecture Home

Apr 23, 2021 · Operations

How JD’s Open Platform Guarantees Reliable Message Delivery with Dynamic BMQ Design

This article explains how JD’s Open Platform’s Business Message Queue (BMQ) architecture, dynamic channels, retry and downgrade mechanisms, and real‑time monitoring ensure reliable, low‑risk message delivery across thousands of merchants while simplifying integration and scaling for future growth.

Alerting · Dynamic Configuration · JD Open Platform

0 likes · 10 min read

Qunar Tech Salon

Feb 7, 2020 · Operations

Internal Resource Governance Practices for High‑Availability Systems

This article outlines comprehensive internal resource governance techniques—including degradation, circuit breaking, isolation, async conversion, thread‑pool management, JVM and hardware metric monitoring, and daily operational practices—to enhance system stability and high availability in large‑scale backend services.

backend operations · circuit breaker · degradation

0 likes · 10 min read

MaGe Linux Operations

Jan 16, 2020 · Operations

How to Quickly Diagnose and Fix High CPU Usage in a Data Platform

This guide walks through a real‑world incident where a data platform’s CPU spiked to 98.94%, showing step‑by‑step how to identify the high‑load process, pinpoint the offending Java thread, analyze the root cause in the time‑utility code, and implement a performance‑focused solution that reduced load thirtyfold.

CPU troubleshooting · Java profiling · Linux monitoring

0 likes · 7 min read

Architecture Digest

Sep 23, 2019 · Operations

Improving Application Availability: Practices, Monitoring, and Fault‑Tolerance in a Large‑Scale Payment System

The article describes how a high‑traffic payment platform achieves 99.999% availability by avoiding single points of failure, applying fail‑fast principles, implementing resource limits, building real‑time monitoring and alerting, and automating fault detection, routing, and recovery to ensure continuous 24/7 operation.

backend operations · fault tolerance · high availability

0 likes · 23 min read

Full-Stack Internet Architecture

Aug 1, 2019 · Operations

Handling GC Alerts by Splitting and Sharding Scheduled Tasks in Production

The article recounts a production incident where a GC alert was triggered by excessive object creation in a scheduled ad‑transaction sync task, and explains how the problem was diagnosed, mitigated by task splitting, and finally resolved through data sharding across multiple machines.

backend operations · gc · monitoring

0 likes · 6 min read

Efficient Ops

May 13, 2018 · Operations

Diagnosing and Fixing TCP SYN Queue Overflows that Crash E‑commerce Sites

This article walks through a real‑world incident where an e‑commerce site suffered intermittent outages due to TCP SYN and accept queue overflows, explains the underlying handshake mechanics, shows how kernel and Nginx parameters can be tuned, and provides Python scripts for testing and SYN‑flood simulation.

SYN Flood · TCP · backend operations

0 likes · 9 min read

MaGe Linux Operations

Jan 1, 2018 · Operations

How to Build a Fast, Accurate Log Analyzer with Python and MySQL

This article explains how to create a lightweight yet reliable Python‑based log analysis tool that parses nginx logs with regular expressions, stores detailed metrics in MySQL, and provides fine‑grained performance and anomaly reports for web services.

Python · backend operations · log analysis

0 likes · 8 min read

21CTO

Nov 2, 2017 · Operations

How to Diagnose and Fix Online System Issues Efficiently

This article shares practical methods for frontline engineers to quickly understand, assess, and resolve online system problems by categorizing system layers, evaluating impact, using essential Linux monitoring tools, and applying systematic troubleshooting and design‑for‑failure strategies to minimize downtime.

Linux tools · Online Debugging · Performance Monitoring

0 likes · 11 min read

ITFLY8 Architecture Home

Oct 11, 2016 · Operations

Mastering Service Degradation: Strategies to Keep High‑Concurrency Systems Running

This article explores comprehensive service degradation techniques—including automatic and manual switchovers, read/write and multi‑level fallback strategies, and practical examples like timeout, failure count, and traffic throttling—to ensure core functionality remains available during traffic spikes or component failures in high‑concurrency systems.

backend operations · fallback strategies · high concurrency

0 likes · 14 min read

21CTO

Jan 20, 2016 · Operations

From Single‑Server to Distributed CDN: Evolving Image Server Architecture

This article traces the evolution of image server architectures—from a simple single‑directory setup on Windows/.NET, through cluster‑based real‑time synchronization and shared‑storage solutions, to modern FastDFS‑backed distributed file systems with CDN acceleration—highlighting scalability, reliability, and migration challenges.

CDN · FastDFS · backend operations

0 likes · 13 min read

Java High-Performance Architecture

Oct 25, 2015 · Operations

How Load Balancing Powers Scalable Application Server Deployments

This article explains how stateless application servers combined with load balancers form scalable clusters, detailing request routing, scaling strategies, and common implementations such as Nginx for HTTP forwarding and LVS for IP‑level balancing.

Application Server · LVS · Nginx

0 likes · 3 min read

Java High-Performance Architecture

Oct 25, 2015 · Operations

How to Build a Scalable Web Architecture: From Single Server to Clustered Services

This article explains how small‑team internet products evolve from a single‑server setup to a scalable, clustered web architecture by separating services, adding dedicated servers, and employing caching, CDNs, and other components to handle growing user traffic.

Server Clustering · backend operations · scalable architecture

0 likes · 4 min read
