Tagged articles
178 articles
Alibaba Cloud Developer
May 19, 2021 · Cloud Computing

How to Optimize Cloud Resource Scheduling After Migration

After migrating to the cloud, enterprises must evaluate resource scale, cost pressure, and staffing before deciding whether to build their own scheduling system, and can choose among ECS, Dedicated Host, or private pool solutions, each with specific advantages, drawbacks, and suitable scenarios.

Auto Scaling · capacity planning · dedicated host
15 min read
Efficient Ops
Mar 14, 2021 · Operations

Practical Prometheus on Kubernetes: Exporters, Scaling & Tips

This article shares practical experiences and best‑practice guidelines for using Prometheus in Kubernetes environments, covering version selection, inherent limitations, common exporters, Grafana dashboards, metric selection principles, multi‑cluster scraping, GPU monitoring, timezone handling, memory and storage planning, and alerting considerations.

Exporters · Grafana · Kubernetes
24 min read
Alibaba Cloud Developer
Mar 8, 2021 · Operations

How to Ensure System Stability During Massive Sales Events: A Complete SRE Playbook

This article presents a comprehensive, step‑by‑step framework for guaranteeing system reliability during high‑traffic promotional periods, covering SRE hierarchy, stability criteria, profiling, monitoring, capacity planning, incident response, and post‑event analysis to help teams build resilient services.

SRE · capacity planning · incident response
21 min read
Amap Tech
Mar 5, 2021 · Databases

Scaling and Migrating a High‑Volume Order System with Sharding, Data Synchronization and Gray‑Rollout on Alibaba Cloud

To support Gaode Taxi’s soaring order volume, the team expanded from four to eight ECS instances, re‑sharded 256 tables into 4,096, built a custom binlog‑to‑Kafka sync middleware for full‑load and incremental migration, implemented rigorous validation and repair processes, and employed a gray‑rollout with ABC verification, completing the migration without code changes or incidents.

Alibaba Cloud · capacity planning · data synchronization
16 min read
Alibaba Cloud Developer
Jan 27, 2021 · Operations

How to Build Sustainable System Stability: Architecture, Ops, and Team Practices

This article shares practical insights from a technical leader on designing robust system architecture, implementing comprehensive capacity planning, establishing reliable operations processes, strengthening security, and cultivating team awareness to achieve long‑term stability for large‑scale internet services.

Operations · architecture design · capacity planning
24 min read
转转QA
Jan 19, 2021 · Operations

Comprehensive Full-Link Performance Testing Process and Practices for E-commerce Platforms

This article details a systematic full‑link performance testing workflow—including background, timing, scenario design, data preparation, capacity planning, monitoring, issue analysis, and post‑test cleanup—aimed at reliably evaluating and scaling e‑commerce services during major promotional events.

Operations · Performance Testing · capacity planning
18 min read
21CTO
Jan 15, 2021 · Operations

How iQIYI Scaled Its Payment System with Full‑Link Load Testing

This article details iQIYI's end‑to‑end load‑testing methodology for its payment platform, covering problem identification, core‑link mapping, environment setup, realistic traffic modeling, execution safeguards, results from capacity verification and stress testing, and future plans for a unified testing solution.

Load Testing · Operations · capacity planning
12 min read
High Availability Architecture
Sep 21, 2020 · Operations

Full‑Link Load Testing Practices for iQIYI Payment System

This article describes the iQIYI payment team's approach to full‑link load testing, covering background challenges, systematic problem exploration, preparation of test environments, traffic modeling, execution safeguards, practical results, and future plans to improve capacity verification and system reliability.

Load Testing · Operations · capacity planning
10 min read
iQIYI Technical Product Team
Sep 18, 2020 · Operations

Full-Chain Load Testing Practices for iQIYI Payment System

iQIYI’s payment team built a full‑chain load‑testing framework that isolates data, mocks dependencies, constructs realistic multi‑service traffic, and executes protected tests to expose bottlenecks, guide scaling and optimizations, and ultimately ensure reliable payment services during traffic spikes, while planning a unified automation platform.

Load Testing · capacity planning · full-chain testing
13 min read
Efficient Ops
Aug 25, 2020 · Operations

How to Build an Enterprise‑Grade Observability System and Master Incident Response

This article explains how enterprises adopting SRE can design a comprehensive observability platform—covering metrics, logs, and tracing—while also detailing effective incident response, post‑mortem practices, testing, capacity planning, automation tool development, and user‑experience focus to improve overall operational reliability.

Operations · SRE · capacity planning
17 min read
Node Underground
Aug 23, 2020 · Operations

How to Accurately Benchmark API QPS with Hey: A Step‑by‑Step Guide

This article introduces the Hey load‑testing tool, explains how to install and run it with specific QPS settings, analyzes the resulting metrics and charts, and offers practical tips for identifying bottlenecks such as network bandwidth or CPU usage during capacity planning.

API performance · Golang · Load Testing
5 min read
Full-Stack Internet Architecture
Jul 22, 2020 · Operations

Building a Comprehensive High‑Availability System: Disaster Recovery, Capacity Planning, Online Protection, and Fault Drills

This article explains how to construct a truly high‑availability architecture for modern distributed, cloud‑native services by covering disaster‑recovery principles, capacity planning with realistic load testing, online traffic protection, and systematic fault‑drill practices.

Fault Injection · capacity planning · disaster recovery
13 min read
JD Retail Technology
Jun 10, 2020 · Operations

Logistics R&D Preparation for the 618 Promotion: System Readiness, Stress Testing, and Real‑Time Monitoring

The logistics R&D team spent 62 days preparing for the 618 promotion by analyzing core processes, applying stress tests, implementing fault‑tolerant architectures, planning capacity, and deploying real‑time monitoring tools to ensure system stability and performance under peak traffic.

Operations · Performance Testing · System Design
7 min read
JD Retail Technology
Jun 5, 2020 · Operations

How JD Cloud Engineered a Seamless 618 Shopping Surge: Ops Strategies & Disaster Drills

This article details JD Cloud's comprehensive operational preparation for the 618 shopping festival, covering early resource procurement, hardware fault management, network and CDN scaling, extensive capacity‑testing, disaster‑recovery drills, and cross‑departmental coordination that together ensured stable service during massive traffic spikes.

Infrastructure · capacity planning · cloud operations
8 min read
Top Architect
Apr 9, 2020 · Backend Development

Low‑Latency and High‑Availability Design of RocketMQ: Evolution, Optimizations, and Capacity Planning

This article reviews the evolution of Alibaba's Aliware message engine, analyzes the low‑latency and high‑availability challenges faced during Double 11, and details the architectural, JVM, memory, rate‑limiting, and multi‑replica solutions that enabled RocketMQ to achieve sub‑millisecond write latency and five‑nine availability.

Distributed Systems · Low latency · RocketMQ
29 min read
Didi Tech
Feb 18, 2020 · Backend Development

Didi Ride‑Sharing Dispatch Engine: Architecture, Challenges, and Stability Measures for Carpool Day

During Didi’s 2019 Carpool Day promotion, a surge of up to 6.6‑times normal matching traffic forced a redesign of its dispatch engine, introducing near‑time assignment, filtered logic moves, configurable timeouts, extensive stress testing, monitoring, and rapid on‑call procedures that cut downstream pressure by over half.

Scalability · capacity planning · carpool
11 min read
Alibaba Cloud Developer
Feb 18, 2020 · Cloud Native

Why Do Your Apps Crash? Alibaba’s High‑Availability Architecture Playbook

This article explains why online applications experience crashes during traffic spikes, outlines the complexity of modern cloud‑based service architectures, and shares Alibaba engineers’ practical notes on high‑availability design, capacity planning, full‑link stress testing, monitoring, traffic control, routine inspections, and chaos‑engineering drills using tools such as AHAS, PTS, Sentinel and Advisor.

Alibaba Cloud · capacity planning · chaos engineering
12 min read
Efficient Ops
Feb 17, 2020 · Operations

How Top IT Ops Teams Ensure Seamless Large-Scale Business Events

This article outlines how Ping An’s IT operations team systematically prepares for high‑traffic business events—detailing service assessment, architecture mapping, configuration audits, monitoring design, capacity planning, stress testing, and coordinated incident response—to guarantee reliability and performance under massive concurrent loads.

IT Operations · capacity planning · incident response
15 min read
Big Data Technology & Architecture
Jan 29, 2020 · Operations

Capacity Planning, Full‑Link Stress Testing, and Traffic Control for Alibaba's Double‑11 Mega‑Event

The article explains how Alibaba introduced systematic capacity planning, four‑stage capacity assessment, various single‑machine stress‑test techniques, and a full‑link stress‑testing platform to reliably handle the massive traffic spikes of the Double‑11 shopping festival, while also describing a flexible traffic‑control framework to prevent overload and avalanche effects.

Load Testing · Scalability · big-event
16 min read
JD Retail Technology
Jan 8, 2020 · Operations

Comprehensive Guide to E‑commerce Promotion Traffic Management and System Preparation

This article explains how e‑commerce promotions differ from offline sales by offering lower participation thresholds and flexible discount tactics, outlines methods for estimating and handling traffic spikes, and provides detailed strategies for system capacity planning, load testing, monitoring, and incident response to ensure stable large‑scale promotional events.

Load Testing · capacity planning · e‑commerce
23 min read
Didi Tech
Dec 2, 2019 · Operations

Capacity Estimation Methodology for Growing Services

The article presents a systematic capacity‑estimation methodology that links service traffic to order volume, uses CPU‑Idle as a primary metric, predicts traffic growth and upper‑bound limits, validates predictions with load‑testing, and provides scaling recommendations while noting limitations of the CPU‑Idle baseline.

Traffic Prediction · capacity planning · resource utilization
9 min read
Youku Technology
Nov 26, 2019 · Operations

Resource Assurance Strategies and Practices for Alibaba Youku Double‑11 Promotion

The article outlines Alibaba Youku’s end‑to‑end resource‑assurance platform for Double‑11 promotions, detailing automated demand collection, business‑to‑technical metric conversion, single‑machine capacity testing, rapid scaling and emergency borrowing, which together cut manual reviews by 80% and boosted delivery efficiency tenfold.

Operations · Resource Management · automation
13 min read
AntTech
Nov 11, 2019 · Operations

How Alipay Scaled Its Payment System for Double 11: Architecture, Capacity Planning, and Elastic Design

The article details how Alipay engineers tackled the massive traffic spikes of Double 11 by addressing external payment bottlenecks, implementing recharge‑based balances, building capacity‑planning platforms, adopting logical data‑center (LDC) and CRG zone architectures, deploying elastic scaling, and evolving their OceanBase database and service‑mesh infrastructure to sustain millions of transactions per second.

Alipay · Double 11 · Elastic Architecture
16 min read
JD Retail Technology
Nov 7, 2019 · Operations

7FRESH Technical Preparation for the 11.11 Shopping Festival: System Scaling, Degradation Strategies, Emergency Plans, Performance Testing, and Operational Monitoring

The article details how 7FRESH's R&D, testing, network operations, and product teams coordinated system capacity expansion, degradation mechanisms, emergency response procedures, extensive performance testing, and 24/7 monitoring to ensure stable and scalable service during the high‑traffic 11.11 shopping event.

Operations · Performance Testing · capacity planning
10 min read
Alibaba Cloud Developer
Oct 22, 2019 · Operations

How Alibaba Masters Full‑Chain Performance Testing for Double 11

Alibaba’s seven‑year journey of full‑chain performance testing for its Double 11 shopping festival reveals a comprehensive, production‑environment‑based workflow—including environment transformation, data preparation, traffic safety, test execution, and intelligent analysis—designed to ensure system stability under massive traffic spikes and guide external customers.

Alibaba · Performance Testing · capacity planning
15 min read
Architects' Tech Alliance
Aug 23, 2019 · Operations

IO Performance Evaluation, Monitoring, and Optimization Guide

This article explains how to assess, monitor, and tune system I/O performance by defining I/O models, selecting appropriate evaluation tools, tracking key metrics for disk and network I/O, and applying practical optimization strategies for both storage and network bottlenecks.

Disk I/O · IO performance · Network I/O
15 min read
Amap Tech
Aug 20, 2019 · Operations

Full‑Link Load Testing and Stability Assurance at Gaode: Architecture, Practices, and Future Directions

To guarantee stability for over 100 million daily users, Gaode combines capacity planning, traffic control, disaster recovery, monitoring, and pre‑plan drills with a self‑built full‑link load‑testing platform (TestPG) that replays realistic traffic in production‑like environments, isolates test loads, provides rapid configuration, detailed debugging, automated error capture, and comprehensive reporting, while planning future enhancements such as integrated topology monitoring, advanced pressure models, and confidence evaluation.

Distributed Systems · Load Testing · capacity planning
20 min read
JD Retail Technology
Nov 9, 2018 · Operations

JD Finance Technical Operations and System Optimization for the 11.11 Promotion

The JD Finance technical teams—including Wealth R&D, Consumer Finance, Payment, Middle‑Platform, and Crowdfunding—conducted comprehensive system reviews, performance stress tests, capacity expansions, monitoring enhancements, and emergency downgrade plans to ensure stable, high‑throughput service during the 11.11 shopping festival.

11.11 promotion · Performance Testing · System optimization
8 min read
JD Tech
Oct 29, 2018 · Operations

SGM Service Governance Monitoring Platform: Design, Features, and Use Cases

The article introduces SGM, a comprehensive service governance and monitoring solution that addresses scaling, dependency complexity, and operational challenges by providing automated topology, real‑time tracing, capacity planning, root‑cause analysis, and extensive monitoring features such as performance metrics, JVM stats, call‑chain visualization, business dashboards, and intelligent alerting.

Alerting · Operations · call chain
13 min read
Ctrip Technology
Oct 10, 2018 · Operations

Design and Implementation of Ctrip's Fourth-Generation Full-Link Performance Testing System

This article outlines the evolution of Ctrip’s performance testing approaches across three generations, analyzes their limitations, and presents the design, architecture, data construction, request tracing, monitoring, and operational considerations of the fourth-generation full‑link testing platform, including case studies and future outlook.

Load Testing · System Design · capacity planning
14 min read
Efficient Ops
Oct 9, 2018 · Operations

How Tencent Scales Automated Operations for Massive Services

Tencent’s architecture platform team explains how they monitor, automate, and secure billions of daily operations across storage, CDN, and live services, using multi‑dimensional metrics, real‑time and instant computation, AI‑driven anomaly detection, and a custom control platform for safe changes.

Operations · aiops · automation
23 min read
Architecture Digest
Sep 10, 2018 · Backend Development

Low‑Latency and High‑Availability Design of RocketMQ for Double‑11 Peak Traffic

This article reviews the evolution of Alibaba's Aliware message engine, analyzes the latency and availability challenges faced during Double‑11, and describes the low‑latency optimizations, capacity‑guarantee strategies, and multi‑replica high‑availability architecture implemented in RocketMQ to sustain trillion‑level message flows.

Distributed Systems · Low latency · Message Queue
22 min read
Qunar Tech Salon
Sep 5, 2018 · Operations

Tencent SNG Operations: Business Profiling for Capacity Planning, Activity Modeling, and Multi‑Region Deployment

The article explains how Tencent's SNG operations team uses business profiling—including capacity, activity, core‑link, and SET models—to address performance testing across device types, forecast activity‑driven resource needs, identify core versus peripheral services, and plan reliable multi‑region deployments.

Operations · business profiling · capacity planning
9 min read
Efficient Ops
Aug 21, 2018 · Operations

How Tencent SNG Uses Business Profiling to Optimize Capacity, Activity, and Multi‑Region Deployment

This article explains how Tencent's SNG operations team builds and applies business profiling models—including capacity, activity, core‑link, and SET planning—to predict performance, automate scaling, identify critical services, and efficiently distribute workloads across multiple regions.

Operations · activity modeling · capacity planning
11 min read
Efficient Ops
Apr 22, 2018 · Fundamentals

Mastering Software Performance: From Axioms to Capacity Planning

This article explains fundamental performance concepts—defining response time and throughput, using axiomatic methods, analyzing bottlenecks with sequence diagrams and profiling, applying Amdahl’s Law, and guiding capacity planning to build reliable, high‑performance applications.

Response Time · Throughput · capacity planning
44 min read
Baidu Intelligent Testing
Apr 16, 2018 · Operations

Online Load‑Testing Practices for Baidu Nuomi Marketing Activities

This article presents a comprehensive case study of Baidu Nuomi's online load‑testing methodology for high‑traffic marketing events, covering capacity estimation, test planning, execution, anti‑attack measures, platform architecture, and lessons learned to ensure system reliability and performance under peak loads.

Load Testing · capacity planning · online testing
16 min read
dbaplus Community
Jan 15, 2018 · Operations

How JD Finance Achieves Real-Time Capacity Assessment and Smart Alerting

This article explains JD Finance's operational challenges in a rapidly expanding micro‑service environment and presents a comprehensive approach that combines offline and online load testing, precise capacity calculations, and intelligent root‑cause alert analysis using both rule‑based and machine‑learning techniques.

Load Testing · Operations · Root Cause Analysis
15 min read
Efficient Ops
Jan 3, 2018 · Operations

How QQ Space Photo Album Handled a 4‑Fold Traffic Surge on New Year’s Day

On December 30, 2017, a sudden wave of users uploading and downloading photos of themselves at age 18 caused QQ Space's album service to experience a fourfold spike in download traffic and a twelvefold surge in post activity, prompting the operations and development teams to employ capacity monitoring, elastic scaling, flexible architecture, and targeted optimizations to maintain service stability and user experience.

Operations · QQ Space · capacity planning
10 min read
Dada Group Technology
Dec 22, 2017 · Operations

Performance Testing Process, Plans, and Best Practices for High‑Traffic Events

This article explains the purpose of performance (stress) testing, compares four testing approaches, details the chosen proportional‑deployment strategy, and provides comprehensive preparation steps, script guidelines, metric analysis, and practical tips for ensuring system stability during large‑scale traffic spikes.

Load Testing · Operations · capacity planning
10 min read
Alibaba Cloud Developer
Dec 11, 2017 · Operations

How Alibaba’s Full‑Link Stress Test Powers Double 11’s Record‑Breaking Traffic

Alibaba’s full‑link stress testing, which simulates real‑world traffic across the entire e‑commerce platform, enabled the 2017 Double 11 event to handle peak loads of 325,000 transactions per second, demonstrating how production‑level, data‑isolated load testing ensures stability and capacity planning for massive online sales.

capacity planning · stress testing
9 min read
Suning Technology
Nov 11, 2017 · Operations

Zero-Accident O2O Festival: Suning’s Frontend, App & Network Ops Secrets

Suning’s 2017 O2O shopping festival achieved a “zero‑incident” goal by integrating real‑time browser performance monitoring, WEEX‑based WAP acceleration, comprehensive app data collection with cloud‑based analytics, precise DNS and HTTP2 optimizations, and a multi‑layer network and service monitoring system that enabled rapid fault detection and capacity planning.

App Operations · Network Reliability · O2O
15 min read
dbaplus Community
Oct 16, 2017 · Operations

How Ele.me Scaled Operations: Key Lessons from Incident Management and Capacity Planning

This article details Ele.me's rapid expansion challenges and shares a three‑stage technical operations journey—fine‑grained division, stability maintenance, and efficiency gains—highlighting real incidents, monitoring upgrades, capacity testing, and practical insights for reliable large‑scale delivery platforms.

Operations · capacity planning · incident management
14 min read
dbaplus Community
Oct 10, 2017 · Operations

How to Build Effective Service Monitoring: Principles, Practices, and Technical Implementation

This article explains why service monitoring is essential for large‑scale microservice environments, outlines design principles, core monitoring components, dependency mapping, call‑chain analysis, capacity planning, root‑cause analysis, and presents a practical technical architecture for implementing robust monitoring solutions.

Distributed Tracing · Operations · capacity planning
12 min read
Efficient Ops
Oct 9, 2017 · Operations

How Tencent Scales Operations for Holiday Traffic Surges

This article explains how Tencent's social platform operations team prepares for massive holiday traffic spikes by following a four‑stage process—business preparation, capacity evaluation, resource provisioning, and scaling with stress testing—while detailing team structures, operational standards, and the supporting tool ecosystem that enable reliable, high‑availability services.

Operations · Tooling · capacity planning
13 min read
21CTO
Aug 11, 2017 · Operations

Alibaba’s Double 11 Playbook: Scaling Architecture and Real‑Time Fault Tolerance

Alibaba’s eight‑year evolution of Double 11 showcases how limited cost can deliver maximal user experience and massive throughput by transitioning from a centralized 3.0 distributed architecture to multi‑active zones, employing capacity planning, full‑link stress testing, fine‑grained dependency governance, and dynamic traffic scheduling to ensure high availability.

capacity planning · fault tolerance · large-scale e-commerce
12 min read
Efficient Ops
Jul 6, 2017 · Operations

36 Ops Strategies: Permissions, Documentation, and Capacity Management

The article shares practical operations lessons—from periodic permission audits and thorough documentation to capacity monitoring, log rotation, and automation—illustrating how systematic practices and tooling can standardize and streamline IT infrastructure management.

Documentation · IT Management · Operations
8 min read
Efficient Ops
Jun 20, 2017 · Operations

Unlocking Ops Value: How Tencent’s Fine‑Grained Technical Operations Drive Massive Savings

This article explores how Tencent’s operations team redefines its value by applying fine‑grained technical management to mobile internet challenges, capacity planning, bandwidth optimization, and data‑driven product decisions, ultimately delivering huge cost savings and turning operations into a core competitive advantage.

Operations · Resource Optimization · bandwidth management
22 min read
Efficient Ops
Jun 19, 2017 · Operations

How JD.com’s ForceBot Revolutionized 618 Sale Load Testing

This article examines JD.com’s 618 shopping festival performance, the deployment of unmanned delivery robots, and the design and architecture of the ForceBot full‑link load‑testing system that enabled precise capacity planning and bottleneck detection for massive e‑commerce traffic.

Load Testing · System Architecture · capacity planning
8 min read
MaGe Linux Operations
Jun 8, 2017 · Operations

From Taobao to the Cloud: Proven High‑Availability Strategies for Massive Traffic

This talk shares practical high‑availability designs learned from Alibaba's Taobao platform and Alibaba Cloud, covering traditional IDC stability mechanisms, modern cloud‑native fault‑tolerance, caching tricks, performance tuning, limit‑and‑degrade tactics, disaster‑recovery planning, and multi‑region deployment for handling billions of requests during peak events.

caching · capacity planning · cloud architecture
20 min read
Alibaba Cloud Developer
Jun 1, 2017 · Operations

How Alibaba Engineers Capacity Planning and Full‑Link Load Testing for Massive Sales Events

This article explains Alibaba's four‑step capacity‑planning methodology, the various single‑machine load‑testing techniques, the design of a full‑link load‑testing platform for Double‑11, and the dynamic flow‑control framework that together ensure system stability during extreme traffic spikes.

Alibaba · Load Testing · Operations
18 min read
dbaplus Community
Apr 13, 2017 · Backend Development

Scalable Small .NET E‑Commerce Architecture: Monitoring, DB Master‑Slave & Capacity Planning

This guide walks through the evolution of a small .NET‑based e‑commerce system, covering its initial LAMP‑style setup, detailed backend architecture, logging and monitoring solutions, master‑slave database design, shared‑storage image server, mobile M‑site construction, capacity estimation methods, and caching strategies.

architecture · capacity planning · database
0 likes · 22 min read
21CTO
Apr 13, 2017 · Operations

Mastering Internet Performance Engineering and Capacity Planning

This article presents a comprehensive methodology for internet performance engineering, covering non‑functional quality goals, detailed metrics for application servers, databases, caches and message queues, a practical technical review outline, and a real‑world capacity‑planning case study with both maximal and minimal resource solutions.

Backend Architecture · Non-functional Requirements · Operations
0 likes · 24 min read
Architecture Digest
Apr 13, 2017 · Operations

Methodology for Internet Architecture Technical Review and Capacity/Performance Evaluation

This article presents a comprehensive methodology for reviewing internet‑scale system architectures, focusing on non‑functional quality attributes such as performance, availability, scalability, security, and maintainability, and provides detailed guidelines, metrics tables, and a classic case study for capacity and performance planning.

Backend · Non-functional Requirements · Operations
0 likes · 27 min read
Alibaba Cloud Developer
Mar 29, 2017 · Operations

How Alibaba Built the ‘Nuclear Weapon’ Full‑Link Stress Test for Double 11

This article chronicles the evolution of Alibaba's full‑link stress‑testing platform, from its 2013 inception to tackle massive Double 11 traffic, through data construction, isolation, traffic generation, and platform upgrades, to a mature, automated, cloud‑native solution that safeguards large‑scale e‑commerce stability.

Alibaba · Operations · Performance Testing
0 likes · 13 min read
Efficient Ops
Feb 28, 2017 · Operations

Prepare Your E‑Commerce System for Mega‑Sales: Proactive Prevention & Rapid Response

This article outlines a comprehensive PDCA‑based methodology for e‑commerce platforms to proactively prevent issues, quickly detect anomalies, and execute rapid decisions during large‑scale promotions, covering system goal definition, performance evaluation, capacity planning, SLA management, and team/process maturity.

Operations · capacity planning · e‑commerce
0 likes · 18 min read
Architecture Digest
Dec 30, 2016 · Operations

Zero‑Point Battle: Evolution of Alibaba's Double 11 High‑Availability Architecture

The talk details how Alibaba tackled the massive technical challenges of Double 11 over eight years by evolving a highly available, scalable architecture through capacity planning, distributed middleware, hybrid‑cloud deployment, online stress testing, and fine‑grained traffic control to balance cost, performance, and user experience.

Alibaba · Distributed Systems · Double 11
0 likes · 22 min read
Alibaba Cloud Developer
Dec 23, 2016 · Operations

How Alibaba Scaled Double 11: The Evolution of Capacity Planning and Real‑Time Stress Testing

This article recounts Alibaba's 7‑year journey of capacity planning for the massive Double 11 shopping festival, detailing early guesswork, the introduction of load‑testing, online and scenario‑based testing, traffic isolation, and full automation that enabled precise resource allocation across hundreds of services.

Load Testing · capacity planning · performance optimization
0 likes · 23 min read
Alibaba Cloud Developer
Dec 12, 2016 · Cloud Native

How Alibaba Built a Decade-Long Microservice Architecture: Challenges and Lessons

This article chronicles Alibaba's ten‑year journey from monolithic Java EE deployments to a cloud‑native microservice ecosystem, detailing the technical challenges, the evolution of its EDAS RPC frameworks, comprehensive monitoring, capacity planning, and the strategies that enabled resilient large‑scale services during massive traffic events.

Cloud Native · capacity planning · monitoring
0 likes · 11 min read
Efficient Ops
Oct 6, 2016 · Operations

How Ctrip Scales Application Operations: Practices, Automation, and Reliability

This talk details Ctrip's application operations framework, covering data‑center scale, multi‑application deployment on Windows, high availability goals, capacity‑prediction models, disaster‑recovery design, incident response, and the evolution from manual tooling to automated, intelligent operations.

Operations · automation · capacity planning
0 likes · 21 min read
Meituan Technology Team
Oct 1, 2016 · Operations

How Meituan Scaled Its Mobile Deal System for Mega‑Promotions: Traffic Modeling & Capacity Planning

This article details Meituan's technical approach to handling massive traffic spikes during large‑scale promotions, covering the background of the O2O deal platform, traffic‑model construction, capacity‑budget calculations, micro‑service architecture evolution, stress‑test strategies, and the PTP performance‑testing environment used to validate system limits.

Load Testing · Microservices · Operations
0 likes · 18 min read
dbaplus Community
Aug 28, 2016 · Databases

Scaling Databases: From Baseline Metrics to Multi‑Layer Optimization

This guide walks DBAs through evaluating current database resources, establishing performance baselines, building business load models, conducting realistic stress tests, and applying a seven‑layer optimization roadmap, ranging from statement tweaks to hardware upgrades and business‑level adjustments, so that the system can handle ten‑fold or hundred‑fold growth.

Database Performance · Hardware · capacity planning
0 likes · 16 min read
Tencent Music Tech Team
Jun 17, 2016 · Backend Development

Design Considerations for a High‑Scale Messaging System: Capacity Estimation, Consistency Guarantees, and Avalanche Prevention

Designing Quanmin K‑Song’s high‑scale messaging system requires careful capacity estimation of throughput, storage and network traffic, robust consistency via unique transaction IDs and operation logs, and avalanche prevention through selective retries, scaling and priority‑based throttling to maintain reliability under load.

Consistency · Distributed Systems · avalanche prevention
0 likes · 7 min read
ITFLY8 Architecture Home
Jun 12, 2016 · Backend Development

Designing a Scalable E‑Commerce Architecture: From Simple Setup to Distributed Systems

This article walks through the functional and non‑functional requirements of a B2C e‑commerce platform, illustrates a progression from a three‑server starter architecture to a clustered high‑availability design, and details capacity‑planning calculations for supporting millions of users and peak traffic spikes.

Backend · Distributed Systems · architecture
0 likes · 9 min read
Alibaba Cloud Infrastructure
Apr 18, 2016 · Backend Development

Distributed Architecture and Scaling Techniques Behind Alibaba's Double 11 E‑commerce Platform

The talk details how Alibaba transformed its monolithic e‑commerce systems into a large‑scale distributed architecture using shared services, middleware, caching, database sharding, capacity planning, and cloud elasticity to support the massive traffic of the annual Double 11 shopping festival.

Alibaba · capacity planning · cloud computing
0 likes · 9 min read
MaGe Linux Operations
Jul 17, 2014 · Operations

Why Our First Flash‑Sale Crashed and the Operations Lessons We Learned

The first launch of Taijie Mall’s flash‑sale site crashed due to uncompressed images, a missing purchase button, and a refresh avalanche, but by isolating services and applying capacity‑planning formulas we identified key bottlenecks, implemented CDN and simplifications, and achieved a much smoother second launch.

Operations · capacity planning · e‑commerce
0 likes · 7 min read