Tagged articles
178 articles
21CTO
May 10, 2026 · Industry Insights

Why GitHub’s Reliability Issues Are Driving Users Away

GitHub’s uptime has fallen sharply, with hundreds of incidents—including dozens of major outages—largely fueled by AI‑driven code generation, prompting high‑profile users to migrate, leadership to prioritize availability, and a costly overhaul of capacity and architecture.

AI-driven development · GitHub · GitHub Actions
0 likes · 11 min read
ITPUB
Apr 25, 2026 · Interview Experience

How to Design a Billion‑Scale URL Shortening System for an Interview

This article walks through the complete interview‑style design of a billion‑scale URL shortener, covering requirements, capacity estimation, API definitions, database schema, short‑code generation algorithms, sharding, caching, load balancing, rate limiting, and expiration handling, while illustrating each step with concrete examples and calculations.

Distributed Systems · System Design · URL shortener
0 likes · 24 min read
ITPUB
Jan 16, 2026 · Databases

How Meituan Built a Real‑Time Database Capacity Assessment System

Meituan's database team created a sandbox‑based capacity assessment platform that replays live traffic, uses accelerated replay to discover performance bottlenecks, and closes the loop with capacity monitoring and automated operations, dramatically improving stability and resource utilization.

Database Capacity · Performance Testing · automation
0 likes · 16 min read
Ray's Galactic Tech
Jan 9, 2026 · Operations

Why Does Nginx Return 502 Bad Gateway? A Complete Log‑to‑FastCGI Timeout Diagnosis

This guide walks through diagnosing intermittent 502 Bad Gateway errors in Nginx by analyzing error logs, checking upstream and FastCGI timeout settings, reviewing PHP‑FPM configuration, performing performance tuning, and outlining advanced troubleshooting, monitoring, and capacity‑planning strategies to ensure stable high‑traffic deployments.

502 · Nginx · capacity planning
0 likes · 9 min read
Woodpecker Software Testing
Jan 5, 2026 · Operations

Three Core Dimensions of Performance Testing: Time Behavior, Resource Utilization, and Capacity

This article breaks down performance testing into three essential dimensions—time behavior, resource utilization, and capacity—explains their key metrics, demonstrates a detailed e‑commerce flash‑sale case study, and shows how systematic testing and optimization can dramatically improve response times, throughput, and scalability.

JMeter · Load Testing · Metrics
0 likes · 12 min read
ITFLY8 Architecture Home
Jan 5, 2026 · Operations

How to Define “Excellent” QPS Benchmarks for System Capacity Planning

This article provides a comprehensive framework for evaluating system support capability by defining QPS excellence thresholds across industry benchmarks, business types, response time, resource efficiency, performance metrics, architectural guidelines, optimization tactics, and real‑world case studies, culminating in a practical calculation formula.

Backend Architecture · Performance Testing · QPS
0 likes · 5 min read
ITFLY8 Architecture Home
Jan 3, 2026 · Operations

How to Accurately Estimate System QPS for Capacity Planning

This guide explains what QPS is, outlines three practical methods to estimate it—including business‑scenario modeling, historical data analysis, and industry benchmarking—covers key influencing factors, shows formulas linking QPS, concurrency and response time, and recommends tools and best‑practice tips for reliable capacity planning.

Load Testing · QPS · capacity planning
0 likes · 8 min read
IT Services Circle
Dec 29, 2025 · Backend Development

Mastering Sharding: When, How, and How Much to Split Your Database

This guide walks senior backend engineers through the strategic reasoning, capacity estimation, and step‑by‑step migration techniques for sharding databases, covering when to split, choosing between partitioning, read/write splitting, or full sharding, and how to plan safe expansions.

Data Migration · Partitioning · capacity planning
0 likes · 19 min read
DevOps Coach
Dec 25, 2025 · Cloud Native

Real-World Kubernetes Troubleshooting Skills You Won’t Learn in Interviews

The article reveals the hidden gap between textbook Kubernetes knowledge and real production failures, offering six practical skills—from interpreting pod symptoms and debugging without logs to capacity planning and treating events as first‑class signals—essential for engineers to survive on‑call crises that interview questions never cover.

Cloud Native · Kubernetes · capacity planning
0 likes · 7 min read
dbaplus Community
Nov 26, 2025 · Databases

How Meituan’s Database Capacity Assessment System Boosts Stability and Efficiency

Meituan’s Database Capacity Assessment System uses online traffic replay, accelerated load testing, and automated analysis to safely evaluate and optimize database read/write capacity, providing real‑time operation insights, flexible scaling, and reliable change‑risk mitigation for large‑scale production environments.

Meituan · automation · capacity planning
0 likes · 17 min read
Data Party THU
Nov 2, 2025 · Operations

How to Maximize vLLM Throughput: Batch Size, Quantization, and Monitoring Tips

This guide explains how to unleash vLLM’s full potential by optimizing batch size, leveraging 4‑bit quantization, tuning concurrency parameters, planning capacity with token‑per‑second metrics, and implementing robust monitoring to balance latency, cost, and scalability in production deployments.

Batching · LLM serving · capacity planning
0 likes · 10 min read
Meituan Technology Team
Oct 16, 2025 · Databases

How Meituan’s Database Capacity Evaluation System Boosts Stability and Efficiency

Meituan’s database team introduced a capacity evaluation system that uses online traffic replay in sandbox environments, accelerated replay to uncover performance bottlenecks, and a capacity‑operation loop to monitor and manage cluster resources, thereby improving database stability, safety of changes, and resource utilization.

Performance Testing · capacity planning · databases
0 likes · 19 min read
MaGe Linux Operations
Oct 16, 2025 · Operations

SRE Playbook: From Alert to Full Recovery of Service Avalanches

This comprehensive SRE guide walks through a real-world service avalanche incident, detailing alert triggering, root‑cause analysis, step‑by‑step recovery, capacity baseline creation, layered alert design, automated scripts, and post‑mortem best practices to help engineers prevent and resolve large‑scale outages.

Alerting · SRE · Service Avalanche
0 likes · 20 min read
Wukong Talks Architecture
Sep 22, 2025 · Databases

How AI‑Powered AIOps Transforms TiDB Database Operations

This article explores how integrating AI‑driven AIOps with the TiDB distributed database can automate monitoring, enable proactive anomaly detection, streamline root‑cause analysis, and optimize capacity planning, ultimately shifting database operations from manual firefighting to intelligent, data‑driven management.

Database operations · Root Cause Analysis · TiDB
0 likes · 12 min read
MaGe Linux Operations
Sep 12, 2025 · Operations

From Alert Storms to Intelligent Ops: A Practical AIOps Journey

This article explores how AIOps transforms traditional IT operations by using AI for anomaly detection, root‑cause analysis, capacity forecasting, and self‑healing, offering a step‑by‑step roadmap, real‑world code examples, toolchain recommendations, common pitfalls, and future trends for building intelligent, automated operations.

Root Cause Analysis · aiops · anomaly detection
0 likes · 24 min read
dbaplus Community
Aug 20, 2025 · Operations

How Qunar Automates Hotel Capacity Planning with Predictive Scaling

This article details Qunar's end‑to‑end solution for forecasting traffic spikes, estimating required CPU resources, and automatically scaling hotel services using a combined flow‑calendar, algorithmic prediction, and Ops‑driven auto‑scaling pipeline, improving stability and operational efficiency.

Algorithmic Forecasting · Auto Scaling · Kubernetes
0 likes · 12 min read
Tech Freedom Circle
Aug 20, 2025 · Backend Development

P0 Eureka Service Discovery Collapse Cost a Top E‑commerce $120M During Double‑11

During the Double‑11 shopping festival, a leading e‑commerce platform suffered a P0 outage when its Eureka service‑discovery cluster overloaded, triggering a full‑chain failure that lasted 2 hours 42 minutes and caused losses exceeding 1.2 billion yuan; the article dissects the timeline, root causes, capacity mis‑planning, monitoring gaps, and remediation strategies.

Microservices · capacity planning · eureka
0 likes · 34 min read
Old Zhao – Management Systems Only
Aug 18, 2025 · Operations

Why Fixing the Production Line Alone Won’t Solve Capacity Bottlenecks

Many managers blame the workshop when capacity falls short, but the real issue lies in misaligned production, planning, procurement, and logistics; this article explains how to diagnose bottlenecks, use data-driven methods, and implement coordinated supply‑chain strategies to boost throughput.

Production optimization · Supply Chain · capacity planning
0 likes · 10 min read
Tech Freedom Circle
Aug 15, 2025 · Backend Development

Calculating a 100k QPS Rate‑Limiting Threshold: Methods and Best Practices

This article explains how to determine a 100 000‑QPS rate‑limiting threshold by covering the purpose of throttling, the three core elements of limiting, common algorithms, target dimensions, capacity estimation for single‑service and full‑link scenarios, pressure‑testing techniques, monitoring data, and adaptive configuration strategies.

Performance Testing · QPS · adaptive throttling
0 likes · 18 min read
Architecture Breakthrough
Jul 28, 2025 · Operations

Turn Point Fixes into Systemic Solutions: A Practical Optimization Framework

Effective technical optimization requires moving from isolated, point‑style ideas to a comprehensive, measurable framework that quantifies goals, assesses gaps, designs capacity, monitors key services and links, and establishes clear compensation and incident‑handling procedures, ensuring a complete, closed‑loop solution.

Operations · capacity planning · incident handling
0 likes · 8 min read
Tech Freedom Circle
Jul 14, 2025 · Databases

How to Estimate Sharding Capacity: Calculating Required Databases and Tables for an Alibaba Interview

The article walks through why sharding is needed, outlines IO and CPU bottlenecks, presents two design principles, shows how to estimate capacity from existing data and growth trends, compares range, modulo, consistent‑hash and Snowflake sharding schemes, and details migration strategies for expanding nodes without downtime.

Data Migration · capacity planning · consistent hashing
0 likes · 22 min read
Efficient Ops
Jul 13, 2025 · Operations

Mastering Modern System Operations: 6 Essential Strategies for Stability and Efficiency

This comprehensive guide outlines six critical areas of modern system operations—including real‑time monitoring, security safeguards, automation, fault diagnosis, collaborative teamwork, and process optimization—offering practical strategies and tools such as Zabbix, Prometheus, ELK, Redis, Ansible, and capacity planning to ensure stable, efficient enterprise services.

automation · capacity planning · monitoring
0 likes · 10 min read
Old Zhao – Management Systems Only
May 13, 2025 · Operations

Mastering Supply Chain Planning: Balancing Demand, Capacity, and Inventory with ERP

This article explains why many companies struggle with inaccurate plans, defines supply chain planning as the dynamic coordination of demand, capacity, and inventory, and provides a step‑by‑step ERP‑based framework—including demand forecasting, capacity analysis, inventory control, and execution—to achieve reliable, data‑driven operations.

ERP · Supply Chain · capacity planning
0 likes · 10 min read
Qunar Tech Salon
Mar 27, 2025 · Operations

Automated Capacity Planning and Auto‑Scaling for Hotel Services During Traffic Peaks

This document describes a comprehensive capacity‑planning solution that predicts traffic‑peak impacts for hotel services, automatically estimates required CPU resources, creates timed scaling tasks, and evaluates performance using detailed metrics, thereby improving operational efficiency and reducing manual effort during events such as exam‑ticket printing and holiday travel surges.

Auto Scaling · Operations · Resource Management
0 likes · 12 min read
Lin is Dream
Mar 16, 2025 · Fundamentals

Mastering TPS and QPS: Simple Calculations and Real-World Examples

This article explains the key performance metrics TPS (transactions per second) and QPS (queries per second), provides formulas for calculating them, and demonstrates practical calculations for multi-node deployments, illustrating how request latency, thread pools, and instance count affect overall system concurrency and throughput.

QPS · TPS · Throughput
0 likes · 3 min read
Architect
Jan 23, 2025 · Operations

Designing High‑Availability Systems: Architecture, Capacity Planning, and Fault‑Tolerance Guide

This article presents a comprehensive guide to building high‑availability systems, covering availability metrics, fault prevention, detection and recovery, capacity evaluation, layered architecture design, service tiering, resilience mechanisms, and operational best practices for reliable service delivery.

Operations · System Architecture · capacity planning
0 likes · 34 min read
High Availability Architecture
Jan 13, 2025 · Operations

Comprehensive Guide to High‑Availability System Architecture and Practices

This article provides a systematic overview of high‑availability system design, covering availability metrics, fault prevention, detection, recovery, capacity planning, service tiering, data layer resilience, monitoring, and the responsibilities of architects, SREs, and developers to ensure reliable, scalable services.

System Architecture · capacity planning · fault tolerance
0 likes · 30 min read
Open Source Linux
Jan 13, 2025 · Operations

Key Lessons from 2024 Major Service Outages and How to Prevent Future Downtime

The article reviews major 2024 service outages—from Alibaba Cloud to OpenAI—highlights their root causes, and offers practical operations strategies such as disaster recovery, regular backups, load balancing, monitoring, performance tuning, and capacity planning to reduce future downtime.

Operations · capacity planning · disaster recovery
0 likes · 5 min read
Tencent Cloud Developer
Jan 7, 2025 · Operations

Designing High‑Availability Systems: Principles, Architecture, and Operations

This comprehensive guide explains how to design, build, and operate high‑availability systems by covering availability metrics, fault‑tolerance strategies, capacity planning, code and data layer architecture, automated testing, monitoring, and clear role responsibilities to ensure services stay reliable and resilient under load.

Cloud Native · SRE · System Design
0 likes · 32 min read
Efficient Ops
Nov 19, 2024 · Operations

Mastering System Stability: Proven SRE Practices for Reliable, High‑Availability Services

This article explains how system stability depends on architecture and code details, defines SLA and the “nines” metric, outlines Google’s SRE hierarchy, and provides practical governance steps—including development and release processes, high‑availability design, capacity planning, monitoring, incident response, and team culture—to achieve reliable, high‑availability services.

SRE · capacity planning · high availability
0 likes · 34 min read
Huolala Tech
Nov 14, 2024 · Operations

How Huolala Scaled Kafka: From Integrated Design to Cloud‑Native Elastic Architecture

This article chronicles the evolution of Huolala’s Kafka infrastructure—from an integrated compute‑storage design to a separated compute‑storage model with multi‑tenant deployment, and finally to a cloud‑native elastic architecture—detailing the challenges of capacity awareness, alarm configuration, and cost‑effective performance optimization.

Kafka · Operations · capacity planning
0 likes · 9 min read
Bilibili Tech
Oct 25, 2024 · Operations

Bilibili Data Center Migration: Planning, Execution, and Lessons Learned

Bilibili’s 18‑month, multi‑regional data‑center migration moved tens of thousands of servers using a high‑frequency rolling strategy, combining meticulous planning, cross‑team coordination, automated rack placement and rigorous checklists to achieve significant cost savings, higher utilization, improved stability and greener operations.

Data Center Migration · Project Management · automation
0 likes · 21 min read
Efficient Ops
Jul 31, 2024 · Operations

How HuoLala Achieved Zero‑Fault Peaks: A Blueprint for High‑Load System Reliability

This article details HuoLala's three‑year journey of systematic business‑peak assurance, covering goal definition, project‑management practices, technical risk mitigation, cloud‑provider coordination, and post‑event reviews that together delivered zero‑fault high‑traffic periods and continuously improving system stability.

capacity planning · peak load management · risk management
0 likes · 20 min read
Architecture and Beyond
Jul 21, 2024 · Operations

Mastering Backend Stability: 7 Essential Practices for High Availability

This comprehensive guide outlines the seven key pillars—operations, high‑availability architecture, capacity governance, change management, risk governance, fault management, and chaos engineering—that together form a systematic approach to building and maintaining a reliable, 24‑hour backend system.

Operations · backend stability · capacity planning
0 likes · 40 min read
Efficient Ops
May 21, 2024 · Operations

What Is an SRE? Roles, Skills, and Best Practices Explained

This article demystifies Site Reliability Engineering (SRE) by explaining its origins, core responsibilities, essential skill sets, and key practices such as observability, incident response, testing, capacity planning, automation, user support, on‑call duties, and the definition of SLI/SLO/SLA, providing a comprehensive guide for modern operations teams.

SRE · capacity planning · incident response
0 likes · 29 min read
iQIYI Technical Product Team
May 10, 2024 · Operations

Full‑Link Load Testing of iQIYI Playback Service: Process, Tools, and Outcomes

iQIYI implemented full‑link load testing of its playback service using LoadMaker for traffic generation and Rover for link control, mapping the topology, creating weighted user scenarios, and safely pressurizing production‑like environments, which validated multi‑times historical peak capacity, uncovered bottlenecks, and enabled several performance and disaster‑recovery improvements without impacting real users.

Load Testing · capacity planning · iQIYI
0 likes · 10 min read
Efficient Ops
Apr 14, 2024 · Operations

How to Ensure System Stability and High Availability: An SRE Playbook

This article explains the definitions of stability and high availability, distinguishes their relationship, outlines key performance indicators, and provides a comprehensive framework—including fault prevention, detection, and recovery, as well as design, coding, testing, monitoring, and emergency response practices—to help teams build reliable, highly available systems.

SRE · capacity planning · high availability
0 likes · 10 min read
Efficient Ops
Jan 31, 2024 · Operations

How ICBC Boosted System Stability with Advanced Performance Capacity Testing

This article details ICBC Software Development Center's comprehensive approach to performance capacity testing, covering background challenges, a structured quality practice plan, enhanced test scope evaluation, result analysis, tool support, implementation outcomes, and future directions for ensuring system stability and scalability.

Performance Testing · capacity planning · software-engineering
0 likes · 9 min read
DevOps
Jan 18, 2024 · R&D Management

Understanding Story Points and Agile Team Capacity Planning

This article explains the concept of story points as a relative estimation unit, why agile teams use them, how they are applied across Scrum ceremonies, and answers common questions about their relationship to effort, value, and managerial decision‑making.

R&D management · Story Points · Team Estimation
0 likes · 8 min read
Architect
Dec 13, 2023 · Industry Insights

How Bilibili Engineered a 1.2 B‑Viewer Live Stream for the LoL World Championship

This article details Bilibili's end‑to‑end technical planning, traffic‑estimation models, and concrete optimizations—including hotspot caching, traffic dispersion, long‑connection isolation, and automated fault‑injection—that enabled the S13 League of Legends finals to serve over 1.2 billion viewers with stable, low‑latency streaming.

Traffic Engineering · capacity planning · high concurrency
0 likes · 22 min read
JD Tech
Nov 16, 2023 · Operations

Preparing JD's CDP Platform for Double 11: Challenges, Capacity Planning, and Lessons Learned

This article recounts the author's experience preparing JD's Customer Data Platform (CDP) for the Double 11 shopping festival, detailing the platform's capabilities, business scenarios, capacity planning, stability and performance challenges, disaster‑recovery measures, and personal reflections on the intensive technical effort involved.

Big Data · CDP · Operations
0 likes · 12 min read
AntTech
Nov 8, 2023 · Artificial Intelligence

Kapacity V0.2 Release: AI‑Driven Traffic‑Based Replica Prediction for Cloud‑Native Autoscaling

Kapacity V0.2 introduces an AI‑powered, traffic‑driven replica prediction algorithm for cloud‑native autoscaling, featuring a Linear‑Residual model, a lightweight Swish Net time‑series forecaster, custom metric support, and open‑source tools, aiming to improve resource efficiency and reduce operational risk.

AI · Kubernetes · Predictive Autoscaling
0 likes · 9 min read
dbaplus Community
Oct 7, 2023 · Operations

How to Build a Truly High‑Availability System: 6 Essential Design Layers

This article breaks down high‑availability system design into six critical layers—architecture, development standards, application services, storage, product safeguards, and operations—offering concrete practices such as capacity planning, fault‑tolerant patterns, monitoring, and incident‑response strategies to achieve four‑nine (99.99%) uptime.

Operations · System Design · capacity planning
0 likes · 26 min read
Huolala Tech
Sep 7, 2023 · Big Data

How Huolala Ensures Doris Stability: Real-World Big Data Practices

This article details Huolala's big‑data architecture and the practical measures—ranging from background analysis and stability challenges to case studies, discovery mechanisms, capacity planning, high‑availability, and automation—that the company employs to guarantee Doris's reliability and performance across its rapidly growing logistics platform.

Big Data · OLAP · capacity planning
0 likes · 15 min read
Liangxu Linux
Sep 2, 2023 · Operations

How Many Files and TCP Connections Can a Linux Server Actually Handle?

This guide explains the Linux kernel parameters that limit the number of open files and TCP connections on a server, shows how to adjust those limits, calculates practical connection capacities for both servers and clients, and offers troubleshooting tips for the "too many open files" error.

TCP connections · capacity planning
0 likes · 15 min read
FunTester
Aug 4, 2023 · Operations

How Tencent Scales Its Services for Chinese New Year: Inside Cloud Load‑Testing Strategies

This article details Tencent's cloud load‑testing approach for handling massive traffic spikes during Chinese New Year, covering background challenges, model selection, script authoring options, data construction, report analysis, and real‑world case studies that demonstrate capacity planning and performance optimization.

Load Testing · Microservices · RPS
0 likes · 21 min read
Code Ape Tech Column
Jul 26, 2023 · Operations

Service Governance: Monitoring, Fault Management, Release and Capacity Planning

This article explains how to achieve 24/7 service availability through comprehensive monitoring, fault handling, release management, and capacity planning, covering alarm types, batch processing, traffic and resource metrics, fault causes and mitigation, deployment strategies, scaling commands, and service degradation techniques.

capacity planning · fault management · release-management
0 likes · 20 min read
FunTester
Jul 12, 2023 · Operations

Mastering Load Testing: Practical Wrk, GoReplay, and SRE Strategies

This article explains why automation testing often lags behind product changes, outlines essential load‑testing concepts such as bottleneck analysis and capacity planning, and provides hands‑on guidance for using Wrk and GoReplay tools within an SRE‑driven operations workflow.

GoReplay · Load Testing · Performance Testing
0 likes · 8 min read
Tencent Cloud Developer
Mar 13, 2023 · Cloud Computing

Design Principles for High‑Availability System Architecture

The article outlines a comprehensive high‑availability architecture framework across six layers—development standards, application services, storage, product fallback, operations deployment, and emergency response—detailing design principles such as stateless services, elastic scaling, redundant storage, robust monitoring, gray releases, and chaos engineering to ensure resilient, continuously available systems.

Deployment · Scalability · System Architecture
0 likes · 25 min read
HelloTech
Jan 31, 2023 · Operations

Stability Assurance Practices for Large‑Scale Promotional Events

The article outlines a comprehensive stability‑assurance framework for large‑scale promotional events—detailing planning, capacity and pressure‑test rehearsals, strict change‑freeze, internal gray releases, coordinated on‑call response, thorough link and capacity analysis, monitoring, emergency procedures, cross‑team collaboration, external partner coordination, and post‑event review to ensure resilient system performance.

Large-Scale Events · Performance Testing · capacity planning
0 likes · 17 min read
ITPUB
Jan 12, 2023 · Operations

How to Build a Truly High‑Availability System: 6 Essential Design Layers

This article breaks down the essential design and operational considerations for achieving high availability across six layers—development standards, application services, storage, product strategy, operations deployment, and incident response—providing concrete practices, metrics, and safeguards to reach four‑nine (99.99%) uptime.

Operations · System Design · capacity planning
0 likes · 25 min read
Alibaba Cloud Developer
Jan 10, 2023 · Operations

How to Master System Stability: A Step‑by‑Step Guide for Reliable Operations

This article explains what stability assurance is, outlines a systematic workflow—including anomaly identification, monitoring configuration, impact assessment, and solution planning—and provides practical methods such as capacity estimation, traffic limiting, load testing, scaling, and pre‑heating to ensure services remain stable during both daily operations and high‑traffic events.

Operations · capacity planning · incident response
0 likes · 25 min read
Architecture Digest
Dec 21, 2022 · Operations

Designing High‑Availability Systems: Principles and Practices Across Six Layers

This article systematically explores high‑availability system design from development standards, capacity planning, application services, storage, product strategies, operations deployment, to incident response, presenting key concepts, architectural patterns, and practical guidelines for building resilient services.

Deployment · Operations · System Design
0 likes · 27 min read
Bilibili Tech
Nov 19, 2022 · Operations

Technical Assurance for High‑Write Live‑Streaming Gift Scenarios

The technical‑assurance team secured Bilibili’s high‑write live‑stream gift system by expanding capacity, isolating hot keys, refactoring pipelines, adding asynchronous writes, employing horizontal scaling and full‑link load testing, converting uncertain dependencies into graceful fallbacks, and deploying dual‑active, chaos‑engineered disaster‑resilience architecture aligned with business usage patterns.

SRE · capacity planning · database scaling
0 likes · 16 min read
Architects' Tech Alliance
Nov 17, 2022 · Industry Insights

Mastering Compute Resource Planning and Cash Flow for IC Design Projects

This article analyzes how semiconductor design firms can balance compute capacity and cash flow by modeling monthly core‑hour demand across project phases, presenting beginner and experienced algorithms, realistic usage conversion, and strategic choices between local, cloud, and hybrid resources.

IC design · capacity planning · cash flow
0 likes · 14 min read
DeWu Technology
Oct 17, 2022 · Operations

High Availability: Principles and Practices for System Stability

High availability—measured in nines of uptime—requires partitioning systems, decoupling components, choosing robust technologies, deploying redundant instances with automatic failover, capacity planning, rapid scaling, traffic shaping, resource isolation, global protection, observability, and disciplined change management to achieve stable, resilient services.

capacity planning · change management · fault tolerance
0 likes · 10 min read
Efficient Ops
Jun 19, 2022 · Operations

How Bilibili SRE Guarantees Million‑User Live Events: Strategies, Tools, and Lessons

This article details Bilibili's SRE approach to large‑scale live events, covering background, activity scenarios, resource planning, performance testing, chaos‑engineering drills, technical safeguards such as DCDN, SLB, WAF, PaaS, cache and DB, pre‑plan capabilities, post‑mortem analysis, and future outlook, illustrating how systematic capacity management and automated resilience practices enable stable operation for events with tens of millions of concurrent users.

Performance Testing · SRE · capacity planning
0 likes · 22 min read
Bilibili Tech
Jun 14, 2022 · Operations

SRE Practices for Large‑Scale Event Assurance at Bilibili

Bilibili’s SRE team ensures flawless large‑scale online events by meticulously gathering activity details, provisioning DNS, CDN, networking and compute resources, conducting multi‑stage performance tests and chaos‑engineering drills, applying layered traffic controls, maintaining historical checklists, executing predefined contingency responses, and iterating post‑mortems to drive continuous automation and reliability.

Cloud Native · Event Reliability · Operations
0 likes · 20 min read
Alibaba Cloud Native
Jun 8, 2022 · Cloud Native

Turning 618 Sales Uncertainty into Certainty: Cloud‑Native Best‑Practice Guide

This article outlines a comprehensive, cloud‑native methodology for preparing large‑scale sales events like the 618 promotion, covering uncertainty challenges, capacity assessment, performance testing, pre‑heating strategies, flow‑control, and MSE service‑governance techniques to ensure stable, cost‑effective operation.

Cloud Native · Flow Control · MSE
0 likes · 19 min read
dbaplus Community
May 17, 2022 · Backend Development

How to Size a Kafka Cluster for Over 1 Billion Daily Requests

This article walks through a scenario‑driven capacity assessment for a production‑grade Kafka cluster, covering QPS calculations, storage needs, physical machine count, disk choices, memory, CPU, network bandwidth, deployment steps, and a final resource summary.

Cluster Sizing · Kafka · backend-development
0 likes · 13 min read
Programmer DD
May 17, 2022 · Operations

Why Full‑Link Load Testing in Production Is the Key to Business Continuity

This article explains the importance of conducting full‑link load testing in production environments, outlines the evolution and solution architecture, describes key technologies such as traffic coloring, data isolation and risk control, and shares practical implementation steps and customer case studies from Alibaba.

Performance Testing · capacity planning · full-link load testing
0 likes · 19 min read
SQB Blog
May 9, 2022 · Operations

How Havok Enables Realistic Full‑Link Load Testing for Scalable Services

This article explains how the Havok full‑link load testing platform was designed and built to replay real traffic safely, provide capacity‑assessment data, support multiple test types, and offer real‑time monitoring and circuit‑breaker protection for large‑scale online services.

Load Testing · capacity planning · full‑link testing
0 likes · 16 min read
Architect's Journey
Apr 27, 2022 · R&D Management

Essential Architecture Terms Every Architect Should Know

This article compiles core architectural concepts—including ROI, SOLID principles, system splitting, isolation, ACID, CAP/BASE, distributed transactions, and capacity estimation—explaining their definitions, practical examples, trade‑offs, and how they guide architects in making informed technical decisions.

CAP theorem · Distributed Transactions · SOLID
0 likes · 19 min read
Alibaba Cloud Infrastructure
Feb 11, 2022 · Operations

Proactive Identification of Double 11 Transaction‑Surge Traffic Risks Using an AI‑Driven Network Monitoring Solution

The article presents a case study of how Alibaba Cloud’s network operations team tackled the massive, unpredictable traffic spikes of the 2021 Double 11 shopping festival by identifying transaction‑promotion traffic risks early through AI‑powered analysis, overcoming the limitations of manual rule‑based detection, and achieving precise, automated capacity risk control.

AI · Double 11 · capacity planning
0 likes · 8 min read
Top Architect
Jan 9, 2022 · Information Security

Technical Analysis and Recent Updates of Xi'an “One Code Pass” System

The article reviews the Xi'an “One Code Pass” health‑code platform, covering its award recognition, recent service outages, capacity‑planning calculations, security‑platform procurement, Ministry engineer inspection, and the identified technical bottlenecks such as lack of CDN for static assets and insufficient outbound bandwidth.

Big Data · Information Security · One Code Pass
0 likes · 7 min read
Architecture Digest
Dec 27, 2021 · Fundamentals

System Capacity Design and Evaluation: Concepts, Metrics, and Practical Steps

This article explains how to design and evaluate system capacity by defining key metrics such as QPS, TPS and concurrency, describing when capacity assessment is needed, and outlining a step‑by‑step methodology—including traffic analysis, peak estimation, stress testing and redundancy planning—to ensure reliable performance under varying load conditions.

Performance Testing · QPS · System Design
0 likes · 11 min read
转转QA
Nov 25, 2021 · Operations

Full‑Chain Production Environment Load Testing for Double 11 Promotion: Process, Findings, and Lessons

This article details the end‑to‑end preparation, execution, reporting, and retrospective of a large‑scale production‑environment load test for the Double 11 shopping festival, covering data preparation, QPS target calculation, multi‑scenario testing, issue analysis, and continuous improvement practices.

Double11 · Load Testing · Operations
0 likes · 8 min read
Efficient Ops
Nov 24, 2021 · Operations

Practical Prometheus in Kubernetes: Tips, Limits, and Scaling

This article shares practical experiences and best‑practice guidelines for deploying and operating Prometheus in Kubernetes, covering version selection, inherent limitations, exporter choices, metric design, multi‑cluster scraping, memory and storage planning, GPU monitoring, timezone handling, and alerting considerations.

Exporters · Grafana · Prometheus
0 likes · 21 min read
IT Architects Alliance
Nov 9, 2021 · Operations

Why Scale and How: Hardware Expansion, AKF Splitting Principle, Distributed ID Generation, and Elastic Scaling

The article explains the reasons for scaling, outlines hardware and component expansion strategies, introduces the AKF splitting principle for distributed systems, discusses database clustering and distributed ID generation methods such as UUID and Snowflake, and describes elastic scaling challenges and solutions.

Distributed Systems · ID generation · capacity planning
0 likes · 14 min read
ITFLY8 Architecture Home
Nov 8, 2021 · Operations

How to Scale Your System: From Hardware Expansion to Distributed ID Strategies

This article explains why capacity expansion is necessary, outlines hardware and component scaling strategies, introduces the AKF splitting principle for Redis clusters, discusses challenges of distributed scaling such as data consistency and high concurrency, and reviews database clustering and distributed ID generation methods like UUID and Snowflake.

AKF principle · capacity planning · database clustering
0 likes · 14 min read
Java High-Performance Architecture
Nov 1, 2021 · Operations

Why Scaling Matters: Hardware Expansion, Distributed ID & Elastic Capacity Strategies

The article explains why performance optimization has limits and outlines practical scaling methods—including whole‑machine and component upgrades, AKF splitting, database clustering, distributed ID generation (UUID and Snowflake), and elastic scaling—while also discussing the challenges each approach introduces.

ID generation · capacity planning · database clustering
0 likes · 14 min read
21CTO
Oct 30, 2021 · Operations

Scaling Systems: Hardware Expansion, Distributed IDs, and Elastic Capacity

This article explains why capacity expansion is necessary, outlines hardware and component scaling strategies, introduces AKF splitting principles, discusses database clustering and distributed ID generation methods such as UUID and Snowflake, and highlights the benefits and challenges of elastic scaling.

capacity planning · distributed-id · elastic scaling
0 likes · 13 min read
ITPUB
Oct 20, 2021 · Databases

Why Is Database Capacity Planning So Hard? A Practical Guide Using ScyllaDB

This article explains why sizing a database cluster is challenging, outlines a systematic capacity‑planning process, examines workload characteristics, query‑operation mapping, consistency trade‑offs, and maintenance considerations, and demonstrates how the open‑source NoSQL database ScyllaDB can be used to model and simplify these decisions.

NoSQL · Performance Modeling · ScyllaDB
0 likes · 15 min read
Baidu Geek Talk
Oct 20, 2021 · Operations

Practical Strategies for Building High‑Availability Systems

This article presents a comprehensive, step‑by‑step guide on improving system reliability through early fault detection, scope reduction, frequency reduction, and rapid incident handling, using real‑world practices from Baidu's commercial hosting platform.

Log Standardization · Operations · capacity planning
0 likes · 20 min read
Architecture Digest
Sep 23, 2021 · Operations

High Availability Practices: From Taobao to Cloud

This talk shares practical high‑availability strategies learned from years of building Taobao’s massive e‑commerce platform and migrating to Alibaba Cloud, covering traditional IDC stability mechanisms, cache and disaster‑recovery designs, cloud‑native fault‑tolerance, capacity planning, rate‑limiting, graceful degradation, and multi‑region resilience.

Distributed Systems · caching · capacity planning
0 likes · 20 min read
Shopee Tech Team
Sep 9, 2021 · Backend Development

Technical Architecture and High‑Concurrency Solutions for Shopee Shake During Major Promotions

Shopee Shake’s architecture separates admin and user sides into three layers—access, application, and resource—and uses horizontal scaling, bucketed Redis coin pools, multi‑level caching, asynchronous message queues, precise capacity formulas, and comprehensive monitoring and chaos‑engineered runbooks to reliably handle over 300,000 QPS during major promotional events.

Distributed Systems · Shopee Shake · asynchronous processing
0 likes · 19 min read
HelloTech
Sep 2, 2021 · Operations

How Production Full‑Link Load Testing Guarantees High Availability at Scale

The article explains why large‑scale services must conduct production full‑link load testing, describes its evolution from ad‑hoc trials to standardized monthly practices, and details the technical and procedural steps—including traffic modeling, JMeter usage, middleware tagging, and responsibility mapping—that ensure reliable capacity planning and risk mitigation.

Microservices · Operations · capacity planning
0 likes · 13 min read
TAL Education Technology
Aug 19, 2021 · Operations

Comprehensive SRE Guide for Summer and Winter High‑Load Periods in an Online Education Platform

This document outlines a comprehensive SRE‑driven operational framework for ensuring stable, high‑availability online education services during peak summer and winter periods, detailing pre‑, during‑, and post‑maintenance phases, architectural principles, load testing, monitoring, capacity management, safety hardening, chaos engineering, incident response, and post‑mortem practices.

Load Testing · SRE · capacity planning
0 likes · 17 min read
ITFLY8 Architecture Home
Aug 19, 2021 · Operations

How Alibaba Conquered Double 11: Scaling to 17.5k TPS with High‑Availability Architecture

Alibaba’s eight‑year Double 11 journey illustrates how the company tackled exponential business growth by inventing high‑availability middleware, precise capacity planning, unit‑based deployment, online stress testing, hybrid‑cloud elasticity, and intelligent runtime control to balance throughput, cost, and user experience during the midnight peak.

Distributed Systems · capacity planning · cloud scaling
0 likes · 23 min read
Top Architect
Jul 25, 2021 · Operations

System Capacity Design and Evaluation: Concepts, Metrics, and Practical Steps

This article explains how to design and evaluate system capacity by defining key concepts such as design capacity, TPS, QPS, and concurrency, outlining when capacity assessment is needed, and providing a step‑by‑step methodology with real‑world examples and calculations for accurate performance planning.

Performance Testing · QPS · System Design
0 likes · 11 min read
IT Architects Alliance
Jun 28, 2021 · Industry Insights

WeChat Moments' Billion-Visit Architecture: Disaster Recovery & Flexible Scaling

The article analyzes WeChat Moments' massive image and video services, detailing its OC/IDC architecture, holiday traffic challenges, software and hardware safeguards, disaster‑recovery mechanisms, retry policies, and a series of flexible strategies—including compression format changes, bitrate reduction, buffer pools, and timeline throttling—to sustain billions of daily accesses.

Flexible Scaling · Video Bitrate · WeChat Moments
0 likes · 13 min read
Java Architect Essentials
Jun 24, 2021 · Operations

Scaling WeChat Moments: Architecture, Capacity Planning, and Flexible Strategies for High Traffic

This article analyzes the large‑scale architecture of WeChat Moments, detailing image and video traffic characteristics, hardware and software safeguards, disaster‑recovery mechanisms, capacity assessment, and a series of flexible strategies such as compression format changes, bitrate reduction, buffer pools, and timeline throttling to handle holiday spikes.

Backend Architecture · Flexible Strategies · Moments
0 likes · 10 min read
ITFLY8 Architecture Home
Jun 16, 2021 · Operations

Designing System Capacity: From Event Scenarios to Precise QPS Planning

This article explains how to assess and design system capacity by analyzing real‑world scenarios—such as a company sports event—calculating required concurrency, average and peak QPS using the 80/20 rule, performing load tests, and determining instance counts to ensure reliable performance under varying traffic spikes.

Load Testing · QPS · System Design
0 likes · 12 min read
Programmer DD
Jun 14, 2021 · Operations

Mastering QPS, TPS, PV, UV, DAU, MAU & System Throughput Explained

This article clarifies key performance metrics such as QPS, TPS, PV, UV, DAU, MAU, concurrent users, and system throughput, explains their differences, relationships, and how they impact capacity planning, while also outlining essential performance testing concepts and evaluation methods for robust system design.

Performance Testing · QPS · System Throughput
0 likes · 8 min read
Open Source Linux
Jun 3, 2021 · Operations

Master Kubernetes Capacity Planning: Detect & Optimize Unused Resources

This guide explains Kubernetes capacity planning, showing how to detect idle CPU and memory, identify wasteful namespaces, use open‑source tools like kube‑state‑metrics and cAdvisor, and apply PromQL queries to optimize resource requests and measure the impact of your improvements.

Kubernetes · PromQL · Resource Optimization
0 likes · 10 min read