Tagged articles

cloud operations

48 articles · Page 1 of 1

Jul 4, 2026 · Operations

When a Non‑Engineer Deploys with Claude Code, a Hidden Bug Makes One Day of AI Cost a Month of Server Fees

A CFO used Claude Code to launch a SaaS product in two days, but a missing database field combined with an automatic retry mechanism caused a single day's AI API calls to generate costs equivalent to a whole month's server expenses, prompting a detailed post‑mortem on the root causes and preventive measures.

Claude Code · LLM cost · cloud operations

0 likes · 10 min read

DevOps Coach

Dec 30, 2025 · Operations

How Switching from Kubernetes to AWS ECS Saved $10K+ Monthly and Slashed Deployments to Seconds

After abandoning Kubernetes and its complex CI pipelines, the team migrated to Amazon ECS, achieving a 70% reduction in pipeline complexity, cutting monthly cloud spend by over $10,000, accelerating deployments from minutes to seconds, and eliminating the need for two DevOps engineers, while highlighting when ECS may not be suitable.

AWS ECS · Deployment Speed · Kubernetes

0 likes · 7 min read

Raymond Ops

Dec 1, 2025 · Operations

Boost Ops Efficiency 300% with Terraform + Ansible: Master the IaC Stack in One Guide

This guide explains how Terraform and Ansible complement each other in modern cloud-native environments, detailing their core features, workflow integration, practical AWS and Nginx examples, best-practice recommendations, and security considerations to dramatically improve operational efficiency.

Ansible · IaC · Infrastructure Automation

0 likes · 17 min read

Alibaba Cloud Developer

Nov 21, 2025 · Operations

How Alibaba Cloud’s One‑Click IO Diagnosis Tackles High‑Volume Storage Bottlenecks

The article explains how Alibaba Cloud OS Console’s one‑click IO diagnosis automatically monitors key IO metrics, computes dynamic thresholds, detects anomalies such as high latency or iowait, and provides root‑cause analysis and remediation suggestions to improve cloud storage performance in multi‑tenant environments.

Alibaba Cloud · IO monitoring · cloud operations

0 likes · 11 min read

Open Source Linux

Nov 17, 2025 · Operations

How a Cloud Ops Engineer Rescued a Critical Service from Disk‑Full Disaster

A senior cloud operations engineer receives a P1 alert for a web‑gateway server nearing 100% /var disk usage, then systematically logs in, diagnoses the log‑file bloat with df, du, and tail, truncates the offending debug log, and implements post‑mortem fixes to prevent recurrence.

Linux · P1 incident · SSH

0 likes · 10 min read

Alibaba Cloud Infrastructure

Nov 12, 2025 · Operations

How Alibaba Cloud’s One‑Click IO Diagnosis Solves Multi‑Tenant Performance Bottlenecks

The article explains how Alibaba Cloud’s OS console implements a one‑click IO diagnostic that automatically detects, classifies, and resolves high‑latency, burst, and iowait IO issues in multi‑tenant cloud environments by using dynamic thresholds, periodic metric collection, and targeted root‑cause analysis.

Alibaba Cloud · IO diagnostics · cloud operations

0 likes · 11 min read

Tencent Architect

Aug 5, 2025 · Fundamentals

How TencentOS Redefines Memory Unloading to Slash Costs and Boost Performance

This article explains how Tencent Cloud’s rapid growth has driven innovative memory management techniques—such as multi‑level memory offloading, hot‑cold page detection, and swap subsystem redesign—to reduce memory costs, improve performance, and enhance scalability across diverse cloud workloads.

Linux · cloud operations · tencentos

0 likes · 11 min read

Alibaba Cloud Developer

Jul 11, 2025 · Operations

How to Quickly Diagnose and Resolve Packet Loss in Alibaba Cloud Environments

This article explains how to use Alibaba Cloud's OS Console to identify, diagnose, and fix packet‑loss issues in cloud deployments, covering real‑world cases, step‑by‑step diagnostics, and practical tips for eliminating kernel, iptables, and netfilter causes.

Alibaba Cloud · Linux · Packet loss

0 likes · 9 min read

Ops Development & AI Practice

Jul 7, 2025 · Operations

Why Switching from count to for_each Makes Terraform Resource Management Safer

This article explains how Terraform's count creates indexed resource lists that are hard to delete individually, why for_each with named keys offers more flexible and safe lifecycle management, and provides step‑by‑step migration and deletion examples.

COUNT · IaC · Resource Management

0 likes · 9 min read

Alibaba Cloud Infrastructure

Jul 7, 2025 · Operations

How to Use Alibaba Cloud OOS AI Assistant for Instant Ops via DingTalk

This guide explains the challenges of traditional operations in the mobile era, introduces the Alibaba Cloud OOS AI Assistant for natural‑language, cross‑device incident response, and provides a step‑by‑step tutorial on configuring DingTalk bots, ChatOps integration, and best‑practice usage.

AI assistant · Alibaba Cloud · ChatOps

0 likes · 12 min read

Efficient Ops

Jul 1, 2025 · Operations

Inside Lenovo CloudOps: AI‑Driven Ops, LLMOps & FinOps Insights

The Lenovo Smart Cloud CloudOps session at the 26th GOPS Global Operations Conference showcased five deep‑dive topics—including large‑model‑powered intelligent operations, enterprise LLMOps, FinOps‑driven cost governance, cross‑region distributed ops, and SAP global ops—offering practical pathways for enterprises to accelerate their intelligent transformation.

AI Ops · Distributed Operations · FinOps

0 likes · 8 min read

Tech Architecture Stories

Jun 14, 2025 · Operations

What Caused Google Cloud’s Massive June 2025 Outage and What We Can Learn

On June 12, 2025, a faulty policy update in Google’s Service Control triggered null‑pointer crashes across regions, causing a global outage that also impacted Cloudflare, Twitch, Discord, and others; the incident exposed missing feature flags, inadequate error handling, and lack of exponential backoff, prompting rapid SRE remediation.

Google Cloud · SRE · cloud operations

0 likes · 7 min read

Volcano Engine Developer Services

May 22, 2025 · Artificial Intelligence

How LLMs Can Automate Ticket Escalation: Inside ByteBrain’s TickIt System

This article introduces TickIt, a ByteBrain system that leverages large language models to automatically identify and escalate critical Oncall tickets, detailing its multi‑class escalation, deduplication, and category‑guided fine‑tuning modules, experimental results, and the operational impact on cloud services.

Incident Management · LLM · Oncall analysis

0 likes · 13 min read

21CTO

Mar 2, 2025 · Operations

Why Platform Engineering Is Redefining Software Development and Threatening Traditional Roles

The article argues that platform engineering is driving an industrial revolution in software development, enabling massive speed and scale gains, consolidating many functions into platform teams, and reshaping or eliminating traditional roles such as DBAs and ops engineers, especially in large organizations.

Platform Engineering · cloud operations · software industrialization

0 likes · 8 min read

Efficient Ops

Dec 22, 2024 · Operations

What Caused OpenAI’s Massive Outage? Inside the Kubernetes Failure and Recovery

On December 11, OpenAI suffered a severe outage across ChatGPT, its API, and Sora due to a misconfigured telemetry service that overloaded Kubernetes control planes worldwide, prompting a cascade of failures and a coordinated recovery effort.

Incident Management · Kubernetes · OpenAI

0 likes · 8 min read

JD Cloud Developers

May 13, 2024 · Operations

Why Rust Powers oss_pipe: A High‑Performance Cloud File Migration Tool

The article introduces oss_pipe, a Rust‑based file migration utility designed for large‑scale object storage transfers, compares it with existing Java and Go tools, highlights Rust’s memory safety and performance advantages, outlines its core features, and presents benchmark results demonstrating multi‑gigabit throughput and efficient resource usage.

File Migration · Performance Benchmark · Rust

0 likes · 6 min read

Alibaba Cloud Native

Mar 11, 2024 · Operations

How to Quickly Pinpoint Error and Slow Traces with Alibaba Cloud ARMS

This guide explains how Alibaba Cloud's ARMS error/slow trace analysis feature can automatically compare abnormal and normal traces to identify root causes such as host, interface, slow SQL, or message‑queue issues, providing step‑by‑step examples for real‑world e‑commerce scenarios.

ARMS · cloud operations · error-detection

0 likes · 11 min read

Efficient Ops

Dec 26, 2023 · Operations

What Is ITU’s New AIOps Standard and How It Shapes Cloud Operations?

The article explains the ITU‑T Y.3550 AIOps standard, its AI‑driven cloud service development and operation requirements, the Chinese AIOps maturity‑model series, and the latest assessment results showing dozens of enterprises adopting these intelligent‑operations capabilities.

AI · AIOps · ITU standard

0 likes · 6 min read

Efficient Ops

Aug 16, 2023 · Operations

How to Accurately Set Service Rate‑Limiting Thresholds in Large Cloud Systems

This article examines the challenges of setting effective rate‑limiting thresholds for massive cloud‑native services, compares TPS and concurrency metrics, proposes stress‑testing and historical‑data‑ARMA forecasting methods, and presents a practical system that delivers reliable limits for both node‑wide and per‑service protection.

ARMA forecasting · Service Mesh · cloud operations

0 likes · 10 min read

Huawei Cloud Developer Alliance

Dec 20, 2022 · Operations

How Huawei Cloud SRE Scaled Monitoring with openGemini: A Real‑World Performance Case Study

Facing hundreds of terabytes of daily monitoring data, Huawei Cloud SRE replaced HBase with the open‑source time‑series database openGemini, conducting extensive write and query performance tests that demonstrated linear scaling, superior query speed, and significant reductions in storage, CPU, and memory usage.

cloud operations · monitoring · openGemini

0 likes · 8 min read

Efficient Ops

Apr 29, 2022 · Operations

How Ctrip Scaled Its Cloud Platform to 10k Nodes: Real‑World Kubernetes Ops Lessons

This article shares Ctrip's practical experiences in scaling a hybrid private‑cloud platform to over ten thousand nodes, covering Kubernetes control‑plane stability, host monitoring, network observability, image management, and capacity planning to ensure high availability for massive online services.

Kubernetes · Network Observability · Performance Optimization

0 likes · 18 min read

DevOps Cloud Academy

Sep 9, 2021 · Operations

FinOps and DevOps Best Practices for Microsoft ERP Projects

This article explains FinOps as cloud financial operations, outlines how to plan Microsoft ERP projects, and presents eight DevOps best practices—including empowered teams, version control, deployment automation, trunk‑based development, continuous testing, test automation, shift‑left security, and monitoring—while advising on selecting appropriate DevOps tools.

FinOps · Microsoft ERP · best practices

0 likes · 10 min read

MaGe Linux Operations

Aug 26, 2020 · Operations

Quick Guide to Secure Alibaba Cloud Server Setup with JDK, Tomcat & Docker

This step‑by‑step tutorial shows how to enable security groups, configure a BT panel or command‑line environment, install JDK, Tomcat, and Docker on an Alibaba Cloud Linux server, and verify firewall and service status for a production‑ready setup.

JDK · Linux server · Security Group

0 likes · 8 min read

Programmer DD

Aug 17, 2020 · Operations

What Docker’s New Terms of Service Mean for Export Controls and Chinese Companies

Docker's latest Terms of Service, effective August 13, 2020, introduce a strict U.S. export‑control clause that restricts usage in embargoed regions and for designated persons, explicitly applies to Docker Hub, and highlights several Chinese IT firms now listed on the U.S. Entity List.

cloud operations · export control · terms of service

0 likes · 3 min read

JD Retail Technology

Jun 5, 2020 · Operations

How JD Cloud Engineered a Seamless 618 Shopping Surge: Ops Strategies & Disaster Drills

This article details JD Cloud's comprehensive operational preparation for the 618 shopping festival, covering early resource procurement, hardware fault management, network and CDN scaling, extensive capacity‑testing, disaster‑recovery drills, and cross‑departmental coordination that together ensured stable service during massive traffic spikes.

Disaster Recovery · capacity planning · cloud operations

0 likes · 8 min read

21CTO

Apr 6, 2020 · Operations

How Alipay Achieved Near‑Zero Downtime with Multi‑Datacenter Failover Architecture

This article explains the evolution of Alipay's high‑availability and disaster‑recovery architecture—from a simple single‑datacenter design to a multi‑datacenter, unit‑based system with failover and blue‑green deployment—highlighting the challenges, solutions, and operational benefits that enable continuous service during massive traffic spikes.

Alipay architecture · Blue-Green Deployment · Disaster Recovery

0 likes · 17 min read

Continuous Delivery 2.0

Mar 30, 2020 · Operations

Dynamic Runtime Configuration Management at Facebook: Use Cases and Tooling

The article explains how Facebook manages dynamic runtime configuration for millions of services—covering feature gating, experiments, traffic control, topology balancing, monitoring, machine‑learning model updates, and internal behavior—using a suite of tools such as Configerator, Gatekeeper, Package Vessel, Sitevars, and MobileConfig.

AB testing · cloud operations · configuration-management

0 likes · 8 min read

Tencent Cloud Developer

Nov 21, 2019 · Operations

Serverless Operations: Efficient and Intelligent Cloud-native Practices

The article recaps Tencent Cloud’s Serverless operational suite—covering built‑in DevOps tools, logging, monitoring, auto‑scaling, and security—demonstrating how it replaces manual IaaS provisioning, accelerates development, and enables cloud‑native management, illustrated by a WeChat Mini‑Program album that cut build time from months to two weeks.

Automation · Serverless · Tencent Cloud

0 likes · 19 min read

Tencent Cloud Developer

Nov 13, 2019 · Operations

Recap of Cloud+ Community Tech Salon – Efficient Intelligent Operations

The Cloud+ Community’s 29th technical salon, held on November 9, 2019 in Shenzhen, gathered Tencent and Jiwei experts to showcase efficient intelligent operations through AIOps practices, massive cloud migration strategies, the Blue Whale PaaS framework, Serverless DevOps best practices, and Kubernetes resource‑utilization techniques.

AIOps · Kubernetes · PaaS

0 likes · 6 min read

ITPUB

Mar 26, 2019 · Operations

How to Build a 99.99% High‑Availability Service: Practices and Architecture Evolution

This article explains the essential requirements for achieving 99.99% service availability—consistency, eliminating single points, placement groups, traffic isolation, same‑city active‑active, N+1 redundancy, and multi‑region active‑active—illustrated with a step‑by‑step Yum repository service case study and evolving architecture diagrams.

Deployment · architecture · cloud operations

0 likes · 9 min read

Ctrip Technology

Mar 7, 2019 · Operations

Ctrip Container Cloud Operations: Practices, Challenges, and Future Outlook

This article presents Ctrip's experience in building and operating a private container cloud platform, detailing its architectural evolution, operational challenges, tooling, monitoring, capacity management, and future directions toward hybrid and cloud‑native environments.

ChatOps · Kubernetes · capacity management

0 likes · 12 min read

JD Tech

Jan 17, 2019 · Operations

Technical Overview of JD's Archimedes Resource Scheduling System

The article presents a detailed technical analysis of JD's Archimedes project, describing its evolution from JDOS 2.0 to a large‑scale container scheduling platform that dramatically improves resource utilization, deployment speed, and cost efficiency across JD’s data centers.

AI · Big Data · Container Orchestration

0 likes · 6 min read

Efficient Ops

Aug 16, 2018 · Operations

How Tencent Automates Massive Storage, CDN, and Network Operations at Scale

This article introduces three Tencent TEG sessions that reveal the automated operation systems behind massive storage and CDN services, billion‑level promotional event guarantees, and intelligent DCI network management, highlighting the challenges, solutions, and speaker expertise.

Automation · CDN · Network Management

0 likes · 7 min read

Efficient Ops

Apr 18, 2018 · Operations

Huawei’s Triple‑Play Model: Advancing AIOps for Massive K8s and Serverless

At the 9th Global Operations Conference, Huawei Cloud’s chief architect Cai Xiaogang presented a three‑pronged AIOps strategy that combines large‑scale Kubernetes management, causal tracing in Serverless environments, multi‑source RCA analysis, and clustering‑based black‑box network packet inspection, showcasing how academia‑industry collaboration accelerates cloud‑native operations.

AIOps · Clustering · Kubernetes

0 likes · 8 min read

Alibaba Cloud Developer

Mar 8, 2018 · Operations

How Cainiao Ark’s Elastic Scheduling Boosts Resource Efficiency and Cuts Costs

This article explains why Cainiao needed an elastic scheduling system, how its unique business and technical characteristics make it ideal for such a solution, and details the architecture, decision‑making layers, strategies, and real‑world results that together improve resource utilization, stability, and cost efficiency.

Auto Scaling · Cainiao Ark · Container Orchestration

0 likes · 27 min read

MaGe Linux Operations

Dec 25, 2017 · Operations

How SaltStack Automates Cloud Operations: Boost Efficiency and Reduce Workload

This article explains how the open‑source automation tool SaltStack can be deployed in a large‑scale cloud environment to centralize management, distribute files, collect server data, and streamline configuration, thereby reducing operational effort and improving efficiency for administrators.

Automation · SaltStack · ZeroMQ

0 likes · 14 min read

Alibaba Cloud Developer

Dec 19, 2017 · Operations

How Alibaba’s TPP Intelligent Scheduler Boosts Resource Utilization and Handles Double‑11 Traffic

The article details Alibaba's Taobao Personalization Platform (TPP) intelligent scheduling system, explaining its architecture, optimization algorithms, convergence logic, and performance results that dramatically improve CPU utilization and automate scaling during both regular operation and high‑traffic events like Double‑11.

Alibaba · Auto Scaling · Resource Scheduling

0 likes · 21 min read

Efficient Ops

Jun 6, 2017 · Operations

How to Deploy Reliable Overseas IT Infrastructure: Key Strategies and Tools

This guide outlines essential questions, local network insights, IDC versus network layout choices, and practical tools for companies planning to expand their IT infrastructure across international markets, helping them manage latency, cost, and deployment speed.

IDC · IT infrastructure · cloud operations

0 likes · 12 min read

MaGe Linux Operations

Apr 23, 2017 · Operations

Scaling Game Server Ops: Managing 10,000+ Cloud Instances Efficiently

This article details YOOZOO Network's evolution from physical to virtualized and clustered game server architectures, the automation of operations across three generations, the design of the UJOBS job platform, robust database backup strategies, and a step‑by‑step migration of thousands of servers to Alibaba Cloud.

Automation · cloud operations · database backup

0 likes · 11 min read

Efficient Ops

Mar 7, 2017 · Big Data

How Tencent Scaled Its TDW to 8,800 Nodes and Mastered Cross-City Data Migration

Tencent’s senior engineer explains how the TDW (Tencent Distributed Data Warehouse) grew from a few hundred to thousands of nodes, the challenges of cross‑city migration, and the modeling, relationship‑chain, dual‑write tables, and platform strategies they built to ensure seamless, low‑impact data and task migration.

Big Data · Data Migration · TDW

0 likes · 26 min read

Tencent Cloud Developer

Feb 17, 2017 · Operations

Implementing Network Isolation with Elastic Network Interfaces on QCloud

The article explains how to achieve network isolation for a QCloud SQL cluster by creating and binding additional elastic NICs via API—assigning separate production, heartbeat, and storage interfaces to each node—while noting that true physical isolation is impossible and detailing the required configuration steps and encountered challenges.

Elastic Network Interface · QCloud · VPC

0 likes · 8 min read

360 Zhihui Cloud Developer

Jan 6, 2017 · Operations

How Qcmd Revolutionizes Automated Operations for 7,000+ Servers

Qcmd, the command execution system behind 360’s private HULK cloud platform, replaces SaltStack with an asynchronous, Golang‑based architecture that ensures high‑availability, encrypted messaging, and reliable mass‑host command execution across thousands of servers, dramatically reducing task timeouts and operational overhead.

Command Execution · cloud operations · distributed systems

0 likes · 10 min read

Efficient Ops

Nov 14, 2016 · Operations

How a Banking Card Organization Built a Scalable Cloud Operations Platform

This article details the evolution from manual, standardized operations to an automated, intelligent cloud operations platform for a banking card organization, describing its motivations, core features, key scenarios, technical architecture, scheduling algorithms, data visualization, and real‑world outcomes.

Automation · Operations Management · Service Orchestration

0 likes · 13 min read

Architecture Digest

Jul 7, 2016 · Operations

Understanding Load Balancing and the Design of Alibaba's VIPServer

This article explains the fundamentals of load balancing, compares common techniques such as DNS round‑robin, hardware and software load balancers, discusses their advantages and drawbacks, and introduces Alibaba's VIPServer as a mid‑tier, seven‑layer load‑balancing solution with advanced health‑check and traffic‑routing features.

DNS · L4/L7 · VIPServer

0 likes · 19 min read

21CTO

Jun 7, 2016 · Operations

Mastering Load Balancing: Lessons from Alibaba’s VIPServer Journey

This article explores the fundamentals and advanced techniques of load balancing, compares DNS round‑robin with dedicated load balancers, discusses scaling strategies, health‑check mechanisms, and introduces Alibaba’s VIPServer as a modern mid‑tier solution addressing real‑world operational challenges.

VIPServer · cloud operations · distributed systems

0 likes · 21 min read

ITPUB

Jan 13, 2016 · Operations

How UPYUN Scaled Cloud Operations: Automation, Monitoring, and Performance Visualization

This article chronicles UPYUN’s evolution from a modest server setup in 2005 to a sophisticated cloud operations platform, detailing the challenges, automation strategies, monitoring practices, performance visualization techniques, and lessons learned for large‑scale CDN management.

Automation · CDN · Deployment

0 likes · 11 min read

Alibaba Cloud Infrastructure

Jun 2, 2015 · Fundamentals

Methodology for Implementing Modular Data Centers

This article presents a methodology for modular data center implementation, emphasizing the role of standardization, distinguishing design versus prefabrication, illustrating with micro‑module and container examples, and analyzing the standardization levels of major tech companies and colocation providers.

ICT · Standardization · cloud operations

0 likes · 8 min read

MaGe Linux Operations

May 5, 2015 · Operations

Master Linux KVM: Step-by-Step Installation, Configuration, and VM Management

This guide walks you through installing KVM on Linux, configuring libvirt tools, verifying hardware virtualization support, setting up networking, creating and managing virtual machines with virt-install and virsh, and monitoring VM performance using virt-top.

KVM · Libvirt · Linux virtualization

0 likes · 12 min read
