Tagged articles
28 articles
dbaplus Community
Jun 23, 2025 · Operations

How to Tame Alert Fatigue: Practical Strategies for Backend Alert Governance

This article shares a year of hands‑on experience improving backend alert governance at Tencent Meeting, covering why alerts are hard to manage, how to design segmented error codes, how to build unified alert policies, how to drive the team to reduce alert noise, how to measure progress, and which tools make the process sustainable.

Alert Management · backend operations · error code design
0 likes · 42 min read
Sohu Tech Products
Dec 27, 2023 · Operations

Why Does Elasticsearch Refresh Take 1‑5 Seconds? A Deep Dive into Index Settings and Soft Delete

This article records a systematic test of Elasticsearch refresh latency, revealing that update operations, a high proportion of deleted documents, and the soft‑delete setting significantly increase refresh time, while the large‑segment strategy and disabling soft delete can reduce latency without harming overall performance.

Elasticsearch · Index Optimization · Performance Testing
0 likes · 7 min read
Baidu Geek Talk
Nov 22, 2023 · Operations

Stability Assurance for Baidu Search Aladdin during Large-Scale Events

Baidu’s Aladdin search service safeguards stability during massive traffic spikes, such as the Gaokao and the Tokyo and Beijing Olympics, by mapping dependencies, deploying multi‑dimensional monitoring, adding scaling layers like multi‑region Redis, and establishing rapid‑response on‑call teams, achieving over 99.99% uptime and near‑real‑time data updates.

backend operations · fault handling · large-scale traffic
0 likes · 9 min read
360 Quality & Efficiency
Nov 11, 2022 · Operations

Understanding TCPCopy: Architecture, Core Principles, and Performance

This article introduces the open‑source traffic‑replay tool TCPCopy, explains its 1.0 architecture—including the tcpcopy and intercept components—covers its packet‑capture and injection methods (raw socket vs pcap), TCP state handling, routing challenges, intercept role, and performance characteristics, providing practical insights for backend testing and operations.

PCAP · backend operations · network testing
0 likes · 9 min read
ITPUB
Aug 18, 2022 · Operations

How WeChat Keeps Billions of Requests Stable: Overload Control Strategies for Massive Microservices

This article breaks down WeChat’s 2018 overload control system for massive microservices, explaining the problem of service overload, detection via average waiting time, and a multi‑level priority‑based mitigation strategy that dynamically adjusts admission thresholds to keep billions of daily requests stable.

Microservices · Priority Scheduling · WeChat
0 likes · 12 min read
Sanyou's Java Diary
Aug 11, 2022 · Operations

Rapidly Diagnose Production Bugs with Linux Tools, Performance Tricks & Design Patterns

This article guides developers through classifying system‑level and business‑level bugs, using Linux utilities like perf, ps, and vmstat for quick root‑cause analysis, and outlines effective code‑design patterns and architectural strategies—caching, rate‑limiting, and high‑availability—to prevent and resolve production incidents.

Linux performance · backend operations · bug troubleshooting
0 likes · 13 min read
dbaplus Community
Aug 7, 2022 · Operations

How to Slim Down Application Logs: Practical Techniques and Real‑World Case Study

Developers often flood systems with INFO logs, causing massive files that strain operations; this article outlines practical log‑slimming strategies—printing only essential logs, merging entries, using abbreviations, and context‑aware level switches—illustrated with a concrete case that reduced daily log volume from 5 GB to under 1 GB.

Code Refactoring · backend operations · java logging
0 likes · 7 min read
Programmer DD
Aug 2, 2022 · Operations

Master JVM Debugging with Arthas: Essential Commands and Real‑World Use Cases

Arthas, Alibaba’s open‑source Java diagnostic tool, enables dynamic code tracing, real‑time JVM monitoring, and on‑the‑fly debugging without stopping applications; this guide covers installation, common scenarios, and core commands such as stack, jad, sc, watch, trace, jobs, logger, dashboard, and redefine for effective troubleshooting.

Arthas · JVM debugging · Java
0 likes · 16 min read
Architecture Digest
Oct 3, 2021 · Operations

Comparison of Distributed Scheduling Frameworks and Their Differences from Quartz

This article examines common business scenarios that require timed tasks, introduces single‑machine and distributed scheduling solutions such as Timer, ScheduledExecutorService, Spring, Quartz, TBSchedule, elastic‑job, Saturn, and XXL‑Job, and provides a detailed feature‑by‑feature comparison to help choose the most suitable framework.

Distributed Scheduling · Elastic-Job · Quartz
0 likes · 11 min read
macrozheng
Sep 6, 2021 · Operations

Choosing the Right Distributed Scheduler: Elastic‑Job vs XXL‑Job vs Quartz

This article examines common business scenarios requiring timed tasks, compares single‑machine and distributed scheduling frameworks such as Timer, Spring, Quartz, TBSchedule, Elastic‑Job, Saturn and XXL‑Job, and provides guidance on selecting the most suitable solution.

Distributed Scheduling · Elastic-Job · Quartz
0 likes · 15 min read
Tencent Cloud Middleware
Aug 19, 2021 · Backend Development

Fast Kafka Cluster Expansion: Practical Strategies to Reduce Data Migration

When a Kafka cluster reaches load limits or experiences sudden traffic spikes, urgent expansion is needed, but data migration can be time‑consuming and risky; this guide outlines several practical techniques—including adjusting retention, adding partitions, leader switching, and single‑replica operation—to quickly scale clusters while minimizing data movement and service disruption.

Data Migration · Kafka · Partition Reassignment
0 likes · 21 min read
IT Architects Alliance
Jul 8, 2021 · Operations

Mastering High Availability: From Cold Backup to Multi‑Region Active‑Active

This article analyzes various high‑availability strategies for stateful backend services—covering cold backup, dual‑machine hot standby, same‑city active‑active, remote active‑active, and multi‑region active‑active architectures—detailing their benefits, limitations, and practical implementation considerations.

Active-Active · System Design · backend operations
0 likes · 14 min read
ITFLY8 Architecture Home
Apr 23, 2021 · Operations

How JD’s Open Platform Guarantees Reliable Message Delivery with Dynamic BMQ Design

This article explains how JD’s Open Platform’s Business Message Queue (BMQ) architecture, dynamic channels, retry and downgrade mechanisms, and real‑time monitoring ensure reliable, low‑risk message delivery across thousands of merchants while simplifying integration and scaling for future growth.

Alerting · Dynamic Configuration · JD Open Platform
0 likes · 10 min read
Qunar Tech Salon
Feb 7, 2020 · Operations

Internal Resource Governance Practices for High‑Availability Systems

This article outlines comprehensive internal resource governance techniques—including degradation, circuit breaking, isolation, async conversion, thread‑pool management, JVM and hardware metric monitoring, and daily operational practices—to enhance system stability and high availability in large‑scale backend services.

backend operations · circuit breaker · degradation
0 likes · 10 min read
MaGe Linux Operations
Jan 16, 2020 · Operations

How to Quickly Diagnose and Fix High CPU Usage in a Data Platform

This guide walks through a real‑world incident where a data platform’s CPU spiked to 98.94%, showing step‑by‑step how to identify the high‑load process, pinpoint the offending Java thread, analyze the root cause in the time‑utility code, and implement a performance‑focused solution that reduced load thirtyfold.

CPU troubleshooting · Java profiling · Linux monitoring
0 likes · 7 min read
Architecture Digest
Sep 23, 2019 · Operations

Improving Application Availability: Practices, Monitoring, and Fault‑Tolerance in a Large‑Scale Payment System

The article describes how a high‑traffic payment platform achieves 99.999% availability by avoiding single points of failure, applying fail‑fast principles, implementing resource limits, building real‑time monitoring and alerting, and automating fault detection, routing, and recovery to ensure continuous 24/7 operation.

backend operations · fault tolerance · high availability
0 likes · 23 min read
Efficient Ops
May 13, 2018 · Operations

Diagnosing and Fixing TCP SYN Queue Overflows that Crash E‑commerce Sites

This article walks through a real‑world incident where an e‑commerce site suffered intermittent outages due to TCP SYN and accept queue overflows, explains the underlying handshake mechanics, shows how kernel and Nginx parameters can be tuned, and provides Python scripts for testing and SYN‑flood simulation.

SYN Flood · TCP · backend operations
0 likes · 9 min read
21CTO
Nov 2, 2017 · Operations

How to Diagnose and Fix Online System Issues Efficiently

This article shares practical methods for frontline engineers to quickly understand, assess, and resolve online system problems by categorizing system layers, evaluating impact, using essential Linux monitoring tools, and applying systematic troubleshooting and design‑for‑failure strategies to minimize downtime.

Linux tools · Online Debugging · Performance Monitoring
0 likes · 11 min read
ITFLY8 Architecture Home
Oct 11, 2016 · Operations

Mastering Service Degradation: Strategies to Keep High‑Concurrency Systems Running

This article explores comprehensive service degradation techniques—including automatic and manual switchovers, read/write and multi‑level fallback strategies, and practical examples like timeout, failure count, and traffic throttling—to ensure core functionality remains available during traffic spikes or component failures in high‑concurrency systems.

backend operations · fallback strategies · high concurrency
0 likes · 14 min read
21CTO
Jan 20, 2016 · Operations

From Single‑Server to Distributed CDN: Evolving Image Server Architecture

This article traces the evolution of image server architectures—from a simple single‑directory setup on Windows/.NET, through cluster‑based real‑time synchronization and shared‑storage solutions, to modern FastDFS‑backed distributed file systems with CDN acceleration—highlighting scalability, reliability, and migration challenges.

CDN · FastDFS · backend operations
0 likes · 13 min read