Tagged articles
31 articles
Efficient Ops
Oct 18, 2024 · Operations

Guotai Junan’s Level‑3 FinOps Success: Inside Their Capacity Management Journey

This article explores how Guotai Junan Securities leveraged FinOps and a new IT resource maturity model to achieve Level‑3 capacity management, detailing their cultural shift, automation tools, transparency gains, challenges overcome, and future plans for finer‑grained cost control in a rapidly digitizing industry.

Digital Transformation · FinOps · IT Operations
0 likes · 12 min read
Architect
Aug 10, 2023 · Operations

Capacity Management: Goals, Stages, Optimization Techniques, and Scaling Practices

The article explains how capacity management balances cost control and service quality through defined goals, three development stages, detailed resource optimization methods, stress‑testing metrics and standards, and automated scaling to achieve significant cost reductions while maintaining system stability.

Operations · Performance Testing · Resource Optimization
0 likes · 10 min read
AntTech
Jul 14, 2023 · Cloud Native

KapacityStack: Open‑Source Cloud‑Native Intelligent Capacity Management and IHPA

KapacityStack is an open‑source, cloud‑native capacity platform from Ant Group that introduces the Intelligent Horizontal Pod Autoscaler (IHPA) to provide predictive, multi‑level, and stable autoscaling, reducing resource waste, carbon emissions, and operational costs while supporting extensible, modular integration with Kubernetes workloads.

autoscaling · capacity management · cloud-native
0 likes · 11 min read
dbaplus Community
Jun 24, 2023 · Operations

How Bilibili Scales Capacity: VPA, HPA, and Cost‑Saving Strategies

This article summarizes Zhang He’s Bilibili SRE talk on building a capacity‑management system that visualizes resource usage, reduces costs, improves stability, and leverages Kubernetes VPA, HPA, pooling, and quota management to support massive live‑stream events and rapid feature releases.

Cost Optimization · HPA · Kubernetes
0 likes · 21 min read
Efficient Ops
Apr 2, 2023 · Operations

Turning CMDB Data into Actionable Capacity Management for IT Operations

This article explores how CMDB data can be leveraged for proactive capacity assessment, outlining mechanisms, goals, metrics, evaluation types, baselines, and a tool design that integrates metric, policy, evaluation, and reporting functions to enhance IT asset efficiency and risk mitigation.

CMDB · IT asset · capacity management
0 likes · 11 min read
Bilibili Tech
Mar 28, 2023 · Operations

Bilibili's Capacity Management Platform: Design, Implementation, and S12 Event Support

Bilibili's capacity management platform integrates foundational data, VPA/HPA scaling, quota control, and visual dashboards to streamline resource usage, cut costs, and boost stability. Its dedicated support for events such as S12 cut release issues by 80% and online failures by 90%, and the team plans to add predictive scaling and risk control.

Bilibili · Resource Optimization · SRE
0 likes · 13 min read
Baidu Tech Salon
Mar 15, 2023 · Industry Insights

How Baidu Feed Scales Millions of Users with Serverless: A Multi‑Dimensional Elasticity Blueprint

This article details Baidu Feed's serverless transformation, describing how multi‑dimensional service profiling (elasticity, traffic, capacity) and three elastic strategies—predictive, load‑feedback, and timed—enable automatic scaling that reduces resource waste while maintaining 24/7 stability for billions of users.

Baidu Feed · Cloud Native · Operations
0 likes · 19 min read
Baidu Geek Talk
Mar 15, 2023 · Industry Insights

How Baidu Feed Scaled to Serverless with Multi‑Dimensional Service Profiles

This article explains how Baidu Feed’s backend services were transformed to a serverless model by building elastic, traffic, and capacity profiles for each service, enabling predictive, load‑feedback, and timed scaling strategies that automatically adjust resources with traffic fluctuations, reduce costs, and maintain stability.

Cloud Native · Serverless · Service Profiling
0 likes · 19 min read
Zhuanzhuan Tech
Feb 8, 2023 · Operations

Capacity Management: Goals, Practices, and Optimization at Zhuanzhuan

This article outlines Zhuanzhuan’s capacity management approach, covering its objectives, development stages, water‑level metrics, resource optimization techniques, cluster capacity assessment, stress‑test indicators and standards, as well as scaling strategies, demonstrating how systematic capacity management reduces costs while ensuring service stability.

Cost Optimization · Performance Monitoring · Resource Optimization
0 likes · 12 min read
Bilibili Tech
Sep 9, 2022 · Operations

Bilibili SRE's Stability Practices and Reflections

At the 2022 GOPS Global Operations Conference in Shenzhen, Bilibili’s infrastructure SRE lead Wu Anchuang unveiled the company’s comprehensive stability framework—detailing its SRE transformation, high‑availability architecture, active‑active disaster‑recovery, capacity planning, and event‑support strategies—marking the first public disclosure of these practices.

Bilibili · SRE · activity assurance
0 likes · 1 min read
AntTech
Jun 22, 2022 · Cloud Computing

Meta Reinforcement Learning Framework for Predictive Autoscaling in Cloud Environments

This article presents a cloud-native, end‑to‑end autoscaling solution that integrates traffic forecasting, CPU utilization meta‑prediction, and a reinforcement‑learning‑based scaling decision module into a fully differentiable system, achieving higher resource utilization and cost efficiency as demonstrated by ACM SIGKDD 2022 research.

Meta Learning · Predictive Modeling · autoscaling
0 likes · 10 min read
High Availability Architecture
Dec 28, 2021 · Backend Development

Design and Practice of the Nimbus Low‑Code Platform for Search Middleware

This article examines the challenges faced by Baidu's search middleware in high‑frequency iteration and complex backend development, and presents the design, implementation, and benefits of the Nimbus low‑code platform—including a graph engine, unified development environment, visual operator composition, automated testing, and intelligent capacity management—to accelerate product innovation while reducing development effort.

DevOps · capacity management · graph engine
0 likes · 16 min read
dbaplus Community
Jun 17, 2021 · Cloud Native

How Dada Achieved Seamless Elastic Scaling for Massive Delivery Peaks

Facing surges during holidays and major shopping events, Dada’s DevOps team built a cloud‑native elastic scaling system that combines fine‑grained capacity management, multi‑cloud support, metric‑driven auto‑scaling, and extreme scale‑down strategies, keeping delivery performance stable while cutting costs.

Auto Scaling · Operations · capacity management
0 likes · 17 min read
Baidu Geek Talk
May 26, 2021 · Operations

How Baidu Engineers Scalable Service Governance: Capacity, Traffic, and Stability

This interview details Baidu's practical approach to microservice governance, covering its definition, the evolution from ad‑hoc scaling to automated capacity, traffic, and stability engineering, and the challenges of data collection, standardized interfaces, and decision‑making policies for large‑scale systems.

Microservices · Service Mesh · capacity management
0 likes · 12 min read
Efficient Ops
Apr 20, 2021 · Operations

How Dada’s Intelligent Elastic Scaling Cuts Costs and Boosts Delivery Performance

This article details Dada Group’s implementation of an intelligent elastic scaling architecture that automatically adjusts capacity during peak promotions and low‑traffic periods, improving delivery reliability, reducing costs, and supporting multi‑cloud and multi‑runtime environments through sophisticated monitoring and auto‑scaler mechanisms.

Auto Scaling · Operations · capacity management
0 likes · 17 min read
Dada Group Technology
Apr 19, 2021 · Operations

Exploring Elastic Capacity and Automated Scaling Architecture at Dada Group

This article presents Dada Group's comprehensive approach to elastic capacity management and automated scaling, detailing the challenges faced during traffic spikes, the design of a cloud‑native auto‑scaler, multi‑metric observability, decision‑making logic, execution mechanisms, extreme scaling practices, and future optimization directions.

Auto Scaling · Cloud Native · SRE
0 likes · 15 min read
Youku Technology
Jul 16, 2020 · Operations

How Alibaba Entertainment Automates Capacity Management and Elastic Scaling

Alibaba Entertainment transformed its capacity management from manual, experience‑based decisions to a fully automated system that continuously evaluates single‑machine performance, identifies performance and success‑rate breakpoints, and drives elastic scaling, dramatically improving resource utilization, availability, and development efficiency across all its applications.

Automation · Operations · Performance Testing
0 likes · 10 min read
Didi Tech
Feb 18, 2020 · Operations

Didi's National Carpool Day: Technical Insights into Stability Assurance

Didi's National Carpool Day on Dec 3, 2019 attracted 3.1 million passengers. Stability was ensured through six pillars: an organized task force, capacity forecasting with rapid container scaling, comprehensive monitoring with a fire‑fighting map, a robust contingency platform, strict process standards, and coordinated third‑party preparation.

Carpool Day · Didi · Operations
0 likes · 13 min read
Architects' Tech Alliance
Jan 12, 2020 · Cloud Computing

Mitigating Hash Polarization and Elephant Flow in UCloud Physical Cloud Gateway Clusters: Multi‑Tunnel and Capacity Management Solutions

This article presents a detailed case study of how UCloud resolved hash polarization and elephant‑flow overload in physical cloud gateway clusters by deploying a multi‑tunnel traffic‑splitting strategy, expanding gateway capacity, implementing lossless isolation‑zone migration, and enhancing automation and high‑availability mechanisms, enabling the clusters to handle hundreds of gigabits of traffic during peak events.

Network Traffic · capacity management · cloud computing
0 likes · 10 min read
UCloud Tech
Jan 7, 2020 · Cloud Computing

How Multi‑Tunnel Architecture Resolved Physical Cloud Traffic Overload

This article details how UCloud tackled severe traffic overload in its physical cloud gateway caused by hash polarization, introducing a multi‑tunnel solution, capacity management, isolation‑zone migration, and automated operations to achieve high availability and support hundreds of gigabits of traffic.

Network Traffic · capacity management · hash polarization
0 likes · 10 min read
Efficient Ops
Jun 20, 2019 · Operations

How Baidu’s Noah TSDB Handles Capacity Management at Scale

This article explains how Baidu’s Noah time‑series database measures, plans, and protects capacity, detailing throughput metrics, estimation and load‑testing methods, and a water‑level model that drives reliable scaling and overload mitigation for massive monitoring workloads.

Load Testing · TSDB · capacity management
0 likes · 11 min read
Efficient Ops
Feb 14, 2019 · Operations

Scaling a 10,000‑Node Container Cloud: Ctrip’s Ops Practices and Lessons

This article details Ctrip's journey of building and operating a massive container cloud platform, covering its architectural evolution, operational challenges, tooling, capacity management, and future directions, offering practical insights for large‑scale cloud‑native environments.

Cloud Native · Kubernetes · Operations
0 likes · 17 min read
Qunar Tech Salon
Oct 13, 2017 · Operations

WeChat Operational Practices: Elastic Scaling, Cloud Management, Capacity Management, and Automated Scheduling

This article describes WeChat's operational standards, cloud‑native management, capacity planning, and automated scheduling techniques, covering configuration file conventions, name‑service design, cloud migration decisions, hardware‑metric based capacity evaluation, stress‑testing methods, and dynamic resource allocation to ensure efficient, reliable service scaling.

capacity management · cloud automation · elastic scaling
0 likes · 25 min read
Efficient Ops
Oct 10, 2017 · Operations

WeChat’s 900M MAU Scaling: Secrets of Efficient Operations

The talk outlines WeChat’s approach to handling rapid user growth through disciplined operational standards, cloud‑native management, precise capacity planning, and automated scaling, detailing configuration file conventions, name‑service design, hardware metric evaluation, stress‑testing methods, and dynamic resource allocation to maintain high efficiency and low cost.

Automation · Operations · capacity management
0 likes · 25 min read
MaGe Linux Operations
Aug 12, 2017 · Operations

How Tencent’s ZhiYun Platform Powers Massive Social Event Ops at Scale

This article explains how Tencent's SNG operations team leveraged the ZhiYun intelligent operations platform—through standardized processes, massive IaaS provisioning, CMDB management, automation workflows, and capacity monitoring—to flawlessly support the high‑traffic "military‑uniform photo" campaign across thousands of servers.

CMDB · Tencent · capacity management
0 likes · 10 min read
Ctrip Technology
Feb 16, 2017 · Operations

Application‑Based Automated Capacity Management and Utilization Evaluation

The article presents a comprehensive, application‑centric approach to automated capacity management that analyzes why server utilization is low, defines safe usage thresholds, describes a load‑balancer‑driven stress‑testing workflow with regression modeling, and explains how this practice improves resource efficiency, cost savings, and developer‑ops collaboration.

Automation · DevOps · Operations
0 likes · 14 min read
Qunar Tech Salon
Feb 14, 2017 · Operations

Application‑Based Automated Capacity Management and Utilization Evaluation

This article explains how to automate application‑centric capacity assessment, identify the safe utilization thresholds, use load‑balancer‑driven stress testing and regression modeling to pinpoint resource bottlenecks, and improve server usage while maintaining service reliability through close DevOps collaboration.

Automation · DevOps · Operations
0 likes · 15 min read
Efficient Ops
Feb 9, 2017 · Operations

Automating Application‑Based Capacity Management to Boost Resource Utilization

This article explains how to automate capacity management focused on application performance, identifies common causes of low resource utilization, proposes safe utilization thresholds, describes a testing framework that uses load‑balancer weighting and real‑time monitoring to pinpoint bottlenecks, and outlines how ops and developers can collaborate to improve efficiency.

Automation · Operations · Performance Testing
0 likes · 18 min read
Tencent Cloud Developer
Feb 7, 2017 · Cloud Computing

Six Methods for Capacity Management in Cloud Computing

Tencent’s social network division, overseeing nearly 100,000 Linux servers that power billions of daily QQ interactions, curbs rising hardware expenses by applying six capacity‑management strategies—performance balancing, memory‑density assessment, feature‑driven scaling, virtualization‑based fragmentation reduction, bottleneck‑oriented capacity planning, and selective hardware upgrades such as larger disks or GPUs—to boost utilization and lower operational costs.

Cost Optimization · Infrastructure Optimization · capacity management
0 likes · 8 min read
Efficient Ops
Aug 28, 2016 · Operations

Six Proven Methods to Optimize Server Capacity and Cut Costs in Large‑Scale Social Networks

Tencent's SNG team shares six practical capacity‑management techniques—performance, density, feature, fragmentation, barrel, and hardware selection methods—that helped reduce operational expenses by over a hundred million yuan annually while supporting hundreds of millions of daily active users.

Cost Optimization · Operations · capacity management
0 likes · 10 min read