Tagged articles
47 articles
DevOps Coach
Dec 30, 2025 · Operations

How Switching from Kubernetes to AWS ECS Saved $10K+ Monthly and Slashed Deployments to Seconds

After abandoning Kubernetes and its complex CI pipelines, the team migrated to Amazon ECS, reducing pipeline complexity by 70%, cutting monthly cloud spend by more than $10,000, shortening deployments from minutes to seconds, and eliminating the need for two DevOps engineers. The article also notes when ECS may not be suitable.

AWS ECS · Deployment Speed · DevOps
7 min read
Raymond Ops
Dec 1, 2025 · Operations

Boost Ops Efficiency 300% with Terraform + Ansible: Master the IaC Stack in One Guide

This guide explains how Terraform and Ansible complement each other in modern cloud-native environments, detailing their core features, workflow integration, practical AWS and Nginx examples, best-practice recommendations, and security considerations to dramatically improve operational efficiency.

Ansible · Configuration Management · Infrastructure Automation
17 min read
Alibaba Cloud Developer
Nov 21, 2025 · Operations

How Alibaba Cloud’s One‑Click IO Diagnosis Tackles High‑Volume Storage Bottlenecks

The article explains how Alibaba Cloud OS Console’s one‑click IO diagnosis automatically monitors key IO metrics, computes dynamic thresholds, detects anomalies such as high latency or iowait, and provides root‑cause analysis and remediation suggestions to improve cloud storage performance in multi‑tenant environments.

Alibaba Cloud · cloud operations · diagnostics
11 min read
Alibaba Cloud Infrastructure
Nov 12, 2025 · Operations

How Alibaba Cloud’s One‑Click IO Diagnosis Solves Multi‑Tenant Performance Bottlenecks

The article explains how Alibaba Cloud’s OS console implements a one‑click IO diagnostic that automatically detects, classifies, and resolves high‑latency, burst, and iowait IO issues in multi‑tenant cloud environments by using dynamic thresholds, periodic metric collection, and targeted root‑cause analysis.

Alibaba Cloud · IO diagnostics · Performance Monitoring
11 min read
Tencent Architect
Aug 5, 2025 · Fundamentals

How TencentOS Redefines Memory Unloading to Slash Costs and Boost Performance

This article explains how Tencent Cloud’s rapid growth has driven innovative memory management techniques—such as multi‑level memory offloading, hot‑cold page detection, and swap subsystem redesign—to reduce memory costs, improve performance, and enhance scalability across diverse cloud workloads.

Linux · cloud operations · tencentos
11 min read
Efficient Ops
Jul 1, 2025 · Operations

Inside Lenovo CloudOps: AI‑Driven Ops, LLMOps & FinOps Insights

The Lenovo Smart Cloud CloudOps session at the 26th GOPS Global Operations Conference showcased five deep‑dive topics—including large‑model‑powered intelligent operations, enterprise LLMOps, FinOps‑driven cost governance, cross‑region distributed ops, and SAP global ops—offering practical pathways for enterprises to accelerate their intelligent transformation.

AI Ops · Distributed Operations · FinOps
8 min read
Tech Architecture Stories
Jun 14, 2025 · Operations

What Caused Google Cloud’s Massive June 2025 Outage and What We Can Learn

On June 12, 2025, a faulty policy update in Google’s Service Control triggered null‑pointer crashes across regions, causing a global outage that also impacted Cloudflare, Twitch, Discord, and others; the incident exposed missing feature flags, inadequate error handling, and lack of exponential backoff, prompting rapid SRE remediation.

Google Cloud · SRE · cloud operations
7 min read
Volcano Engine Developer Services
May 22, 2025 · Artificial Intelligence

How LLMs Can Automate Ticket Escalation: Inside ByteBrain’s TickIt System

This article introduces TickIt, a ByteBrain system that leverages large language models to automatically identify and escalate critical Oncall tickets, detailing its multi‑class escalation, deduplication, and category‑guided fine‑tuning modules, experimental results, and the operational impact on cloud services.

LLM · Oncall analysis · Supervised Fine‑Tuning
13 min read
21CTO
Mar 2, 2025 · Operations

Why Platform Engineering Is Redefining Software Development and Threatening Traditional Roles

The article argues that platform engineering is driving an industrial revolution in software development, enabling massive speed and scale gains, consolidating many functions into platform teams, and reshaping or eliminating traditional roles such as DBAs and ops engineers, especially in large organizations.

cloud operations · platform engineering · software industrialization
8 min read
JD Cloud Developers
May 13, 2024 · Operations

Why Rust Powers oss_pipe: A High‑Performance Cloud File Migration Tool

The article introduces oss_pipe, a Rust‑based file migration utility designed for large‑scale object storage transfers, compares it with existing Java and Go tools, highlights Rust’s memory safety and performance advantages, outlines its core features, and presents benchmark results demonstrating multi‑gigabit throughput and efficient resource usage.

File Migration · Rust · cloud operations
6 min read
Alibaba Cloud Native
Mar 11, 2024 · Operations

How to Quickly Pinpoint Error and Slow Traces with Alibaba Cloud ARMS

This guide explains how Alibaba Cloud's ARMS error/slow trace analysis feature can automatically compare abnormal and normal traces to identify root causes such as host, interface, slow SQL, or message‑queue issues, providing step‑by‑step examples for real‑world e‑commerce scenarios.

ARMS · Performance Monitoring · Trace Analysis
11 min read
Efficient Ops
Dec 26, 2023 · Operations

What Is ITU’s New AIOps Standard and How It Shapes Cloud Operations?

The article explains the ITU‑T Y.3550 AIOps standard, its AI‑driven cloud service development and operation requirements, the Chinese AIOps maturity‑model series, and the latest assessment results showing dozens of enterprises adopting these intelligent‑operations capabilities.

ITU standard · ai · aiops
6 min read
Efficient Ops
Aug 16, 2023 · Operations

How to Accurately Set Service Rate‑Limiting Thresholds in Large Cloud Systems

This article examines the challenges of setting effective rate‑limiting thresholds for massive cloud‑native services, compares TPS and concurrency metrics, proposes two methods (stress testing and ARMA forecasting on historical data), and presents a practical system that delivers reliable limits for both node‑wide and per‑service protection.

ARMA forecasting · Performance Testing · Service Mesh
10 min read
Huawei Cloud Developer Alliance
Dec 20, 2022 · Operations

How Huawei Cloud SRE Scaled Monitoring with openGemini: A Real‑World Performance Case Study

Facing hundreds of terabytes of daily monitoring data, Huawei Cloud SRE replaced HBase with the open‑source time‑series database openGemini, conducting extensive write and query performance tests that demonstrated linear scaling, superior query speed, and significant reductions in storage, CPU, and memory usage.

Performance Testing · cloud operations · monitoring
8 min read
Efficient Ops
Apr 29, 2022 · Operations

How Ctrip Scaled Its Cloud Platform to 10k Nodes: Real‑World Kubernetes Ops Lessons

This article shares Ctrip's practical experiences in scaling a hybrid private‑cloud platform to over ten thousand nodes, covering Kubernetes control‑plane stability, host monitoring, network observability, image management, and capacity planning to ensure high availability for massive online services.

Kubernetes · Network Observability · cloud operations
18 min read
DevOps Cloud Academy
Sep 9, 2021 · Operations

FinOps and DevOps Best Practices for Microsoft ERP Projects

This article explains FinOps as cloud financial operations, outlines how to plan Microsoft ERP projects, and presents eight DevOps best practices—including empowered teams, version control, deployment automation, trunk‑based development, continuous testing, test automation, shift‑left security, and monitoring—while advising on selecting appropriate DevOps tools.

DevOps · FinOps · Microsoft ERP
10 min read
JD Retail Technology
Jun 5, 2020 · Operations

How JD Cloud Engineered a Seamless 618 Shopping Surge: Ops Strategies & Disaster Drills

This article details JD Cloud's comprehensive operational preparation for the 618 shopping festival, covering early resource procurement, hardware fault management, network and CDN scaling, extensive capacity‑testing, disaster‑recovery drills, and cross‑departmental coordination that together ensured stable service during massive traffic spikes.

Infrastructure · capacity planning · cloud operations
8 min read
21CTO
Apr 6, 2020 · Operations

How Alipay Achieved Near‑Zero Downtime with Multi‑Datacenter Failover Architecture

This article explains the evolution of Alipay's high‑availability and disaster‑recovery architecture—from a simple single‑datacenter design to a multi‑datacenter, unit‑based system with failover and blue‑green deployment—highlighting the challenges, solutions, and operational benefits that enable continuous service during massive traffic spikes.

Alipay architecture · Blue‑Green deployment · Distributed Systems
17 min read
Continuous Delivery 2.0
Mar 30, 2020 · Operations

Dynamic Runtime Configuration Management at Facebook: Use Cases and Tooling

The article explains how Facebook manages dynamic runtime configuration for millions of services—covering feature gating, experiments, traffic control, topology balancing, monitoring, machine‑learning model updates, and internal behavior—using a suite of tools such as Configerator, Gatekeeper, Package Vessel, Sitevars, and MobileConfig.

AB testing · cloud operations · configuration-management
8 min read
Tencent Cloud Developer
Nov 21, 2019 · Operations

Serverless Operations: Efficient and Intelligent Cloud-native Practices

The article recaps Tencent Cloud’s Serverless operational suite—covering built‑in DevOps tools, logging, monitoring, auto‑scaling, and security—demonstrating how it replaces manual IaaS provisioning, accelerates development, and enables cloud‑native management, illustrated by a WeChat Mini‑Program album that cut build time from months to two weeks.

DevOps · Infrastructure · Serverless
19 min read
Tencent Cloud Developer
Nov 13, 2019 · Operations

Recap of Cloud+ Community Tech Salon – Efficient Intelligent Operations

The Cloud+ Community’s 29th technical salon, held on November 9, 2019 in Shenzhen, gathered Tencent and Jiwei experts to showcase efficient intelligent operations through AIOps practices, massive cloud migration strategies, the Blue Whale PaaS framework, Serverless DevOps best practices, and Kubernetes resource‑utilization techniques.

DevOps · Kubernetes · PaaS
6 min read
ITPUB
Mar 26, 2019 · Operations

How to Build a 99.99% High‑Availability Service: Practices and Architecture Evolution

This article explains the essential requirements for achieving 99.99% service availability—consistency, eliminating single points, placement groups, traffic isolation, same‑city active‑active, N+1 redundancy, and multi‑region active‑active—illustrated with a step‑by‑step Yum repository service case study and evolving architecture diagrams.

Deployment · architecture · cloud operations
9 min read
JD Tech
Jan 17, 2019 · Operations

Technical Overview of JD's Archimedes Resource Scheduling System

The article presents a detailed technical analysis of JD's Archimedes project, describing its evolution from JDOS 2.0 to a large‑scale container scheduling platform that dramatically improves resource utilization, deployment speed, and cost efficiency across JD’s data centers.

Big Data · JD · Kubernetes
6 min read
Efficient Ops
Aug 16, 2018 · Operations

How Tencent Automates Massive Storage, CDN, and Network Operations at Scale

This article introduces three Tencent TEG sessions that reveal the automated operation systems behind massive storage and CDN services, billion‑level promotional event guarantees, and intelligent DCI network management, highlighting the challenges, solutions, and speaker expertise.

CDN · automation · cloud operations
7 min read
Efficient Ops
Apr 18, 2018 · Operations

Huawei’s Triple‑Play Model: Advancing AIOps for Massive K8s and Serverless

At the 9th Global Operations Conference, Huawei Cloud’s chief architect Cai Xiaogang presented a three‑pronged AIOps strategy that combines large‑scale Kubernetes management, causal tracing in Serverless environments, multi‑source RCA analysis, and clustering‑based black‑box network packet inspection, showcasing how academia‑industry collaboration accelerates cloud‑native operations.

Kubernetes · Root Cause Analysis · Serverless
8 min read
Alibaba Cloud Developer
Mar 8, 2018 · Operations

How Cainiao Ark’s Elastic Scheduling Boosts Resource Efficiency and Cuts Costs

This article explains why Cainiao needed an elastic scheduling system, how its unique business and technical characteristics make it ideal for such a solution, and details the architecture, decision‑making layers, strategies, and real‑world results that together improve resource utilization, stability, and cost efficiency.

Auto Scaling · Cainiao Ark · Resource Management
27 min read
Alibaba Cloud Developer
Dec 19, 2017 · Operations

How Alibaba’s TPP Intelligent Scheduler Boosts Resource Utilization and Handles Double‑11 Traffic

The article details Alibaba's Taobao Personalization Platform (TPP) intelligent scheduling system, explaining its architecture, optimization algorithms, convergence logic, and performance results that dramatically improve CPU utilization and automate scaling during both regular operation and high‑traffic events like Double‑11.

Alibaba · Auto Scaling · cloud operations
21 min read
MaGe Linux Operations
Apr 23, 2017 · Operations

Scaling Game Server Ops: Managing 10,000+ Cloud Instances Efficiently

This article details YOOZOO Network's evolution from physical to virtualized and clustered game server architectures, the automation of operations across three generations, the design of the UJOBS job platform, robust database backup strategies, and a step‑by‑step migration of thousands of servers to Alibaba Cloud.

Database Backup · automation · cloud operations
11 min read
Efficient Ops
Mar 7, 2017 · Big Data

How Tencent Scaled Its TDW to 8,800 Nodes and Mastered Cross-City Data Migration

Tencent’s senior engineer explains how the TDW (Tencent Distributed Data Warehouse) grew from a few hundred to thousands of nodes, the challenges of cross‑city migration, and the modeling, relationship‑chain, dual‑write tables, and platform strategies they built to ensure seamless, low‑impact data and task migration.

Big Data · Data Migration · TDW
26 min read
Tencent Cloud Developer
Feb 17, 2017 · Operations

Implementing Network Isolation with Elastic Network Interfaces on QCloud

The article explains how to achieve network isolation for a QCloud SQL cluster by creating and binding additional elastic NICs via API—assigning separate production, heartbeat, and storage interfaces to each node—while noting that true physical isolation is impossible and detailing the required configuration steps and encountered challenges.

Elastic Network Interface · QCloud · VPC
8 min read
360 Zhihui Cloud Developer
Jan 6, 2017 · Operations

How Qcmd Revolutionizes Automated Operations for 7,000+ Servers

Qcmd, the command execution system behind 360’s private HULK cloud platform, replaces SaltStack with an asynchronous, Golang‑based architecture that ensures high‑availability, encrypted messaging, and reliable mass‑host command execution across thousands of servers, dramatically reducing task timeouts and operational overhead.

Command Execution · Distributed Systems · Golang
10 min read
Efficient Ops
Nov 14, 2016 · Operations

How a Banking Card Organization Built a Scalable Cloud Operations Platform

This article details the evolution from manual, standardized operations to an automated, intelligent cloud operations platform for a banking card organization, describing its motivations, core features, key scenarios, technical architecture, scheduling algorithms, data visualization, and real‑world outcomes.

Operations Management · Service Orchestration · automation
13 min read
Architecture Digest
Jul 7, 2016 · Operations

Understanding Load Balancing and the Design of Alibaba's VIPServer

This article explains the fundamentals of load balancing, compares common techniques such as DNS round‑robin, hardware and software load balancers, discusses their advantages and drawbacks, and introduces Alibaba's VIPServer as a mid‑tier, seven‑layer load‑balancing solution with advanced health‑check and traffic‑routing features.

DNS · L4/L7 · VIPServer
19 min read
21CTO
Jun 7, 2016 · Operations

Mastering Load Balancing: Lessons from Alibaba’s VIPServer Journey

This article explores the fundamentals and advanced techniques of load balancing, compares DNS round‑robin with dedicated load balancers, discusses scaling strategies, health‑check mechanisms, and introduces Alibaba’s VIPServer as a modern mid‑tier solution addressing real‑world operational challenges.

Distributed Systems · VIPServer · cloud operations
21 min read
Alibaba Cloud Infrastructure
Jun 2, 2015 · Fundamentals

Methodology for Implementing Modular Data Centers

This article presents a methodology for modular data center implementation, emphasizing the role of standardization, distinguishing design versus prefabrication, illustrating with micro‑module and container examples, and analyzing the standardization levels of major tech companies and colocation providers.

ICT · Pod · cloud operations
8 min read