Tagged articles

Operations

3329 articles · Page 31 of 34

Jan 8, 2017 · Operations

Why Global Server Load Balancing (GSLB) Is Hard: Technical Challenges and Solutions

This article explains what GSLB (Global Server Load Balancing) is, why achieving high availability, low latency, and accurate traffic distribution is difficult due to DNS limitations, caching, and routing constraints, and explores architectural and network‑level techniques such as feedback loops, anycast, and BGP routing to mitigate these challenges.

Anycast · DNS · GSLB

0 likes · 16 min read

DevOps

Jan 4, 2017 · Operations

The Third Way of DevOps: Continuous Learning and Docker as Lab Equipment

The article explains the Third Way of DevOps—continuous learning through Kaizen and the PDSA cycle—showing how Docker serves as laboratory equipment that enables rapid, reproducible experiments, illustrated with examples from a financial institution and a personal baseball‑statistics project.

Docker · Lean · Operations

0 likes · 8 min read

21CTO

Jan 4, 2017 · Operations

How to Build Truly High‑Availability Systems: Principles and Practices

This article explains what high availability means for distributed systems, outlines common availability tiers, and describes how redundancy, load balancing, and automatic failover across a typical Internet architecture can achieve reliable, scalable services.

Operations · Reliability · System Design

0 likes · 6 min read

DevOps

Jan 3, 2017 · Operations

Applying the DevOps “Second Way” with Docker: Accelerating Feedback Loops

This article explains the DevOps “Second Way,” emphasizing faster, bidirectional feedback loops, and shows how Docker’s immutable containers, streamlined packaging, and embedded metadata reduce variation, accelerate defect detection, and shorten lead times in service delivery.

Docker · Operations · continuous delivery

0 likes · 7 min read

Efficient Ops

Dec 29, 2016 · Operations

Meet the Top Operations Experts of 2016: Profiles and Must‑Read Articles

This article introduces the standout operations professionals featured by the High‑Efficiency Operations community in 2016, summarizing each expert’s background, key achievements, and a curated list of their most influential technical articles for readers seeking deep insights into modern ops practices.

Automation · Cloud Computing · Operations

0 likes · 12 min read

Efficient Ops

Dec 28, 2016 · Operations

Transforming Financial Application Operations: Lessons from a European Rollout

This article shares a detailed case study of how a financial services team restructured European application operations, applied lean retrospectives, built a top‑down monitoring system, and introduced systematic stakeholder collaboration to dramatically improve incident response, system robustness, and user satisfaction.

Incident Management · Operations · application monitoring

0 likes · 14 min read

ITFLY8 Architecture Home

Dec 27, 2016 · Operations

How Dangdang Scaled Its E‑Commerce Platform for 10× Traffic Peaks

This article details Dangdang's 15‑year evolution from a monolithic system to a distributed, SOA‑based architecture, outlining the challenges of high‑traffic e‑commerce events and the strategies—system grading, decoupling, asynchronous processing, batching, and rate limiting—used to achieve reliable, scalable operations.

E‑Commerce · High concurrency · Operations

0 likes · 19 min read

Efficient Ops

Dec 26, 2016 · Operations

How Tencent Scaled Social Data Storage While Cutting Costs

Facing massive user growth, Tencent’s social network team redesigned its KV storage architecture—introducing CKV and Grocery, automating capacity planning, data migration, and backup reuse—to dramatically lower costs, improve operational efficiency, and maintain high service quality across millions of devices.

Automation · Operations · cost optimization

0 likes · 21 min read

Alibaba Cloud Developer

Dec 26, 2016 · Operations

How Alibaba’s SunFire Powers Real‑Time Monitoring for Billion‑Scale Transactions

Alibaba’s SunFire platform delivers massive‑scale, real‑time log collection, processing, and visualization for e‑commerce spikes like Double 11, using low‑overhead agents, asynchronous Map/Reduce pipelines, fault‑tolerant task scheduling, and shared inputs to ensure accurate, low‑latency monitoring across billions of transactions.

Alibaba · Operations · Real-time

0 likes · 18 min read

Efficient Ops

Dec 21, 2016 · Operations

Measure Your Continuous Delivery Maturity with a 47‑Item Checklist

Learn how to assess your Continuous Delivery maturity using a 47‑item checklist, understand its purpose for aligning goals, improving processes, and boosting value delivery, and calculate your score as a percentage to guide technical and organizational improvements.

Operations · maturity checklist · software delivery

0 likes · 2 min read

ITFLY8 Architecture Home

Dec 20, 2016 · Operations

How Leading Internet Companies Automate Operations: From Planning to Intelligent Management

This article explains how large internet firms evolve their IT operations from reactive fire‑fighting teams to standardized, model‑driven, automated platforms covering planning, building, management, monitoring, and process‑oriented operations across compute, storage, and network resources.

Automation · Cloud Computing · ITIL

0 likes · 9 min read

Efficient Ops

Dec 19, 2016 · Operations

What 16 Major 2016 Outages Teach Us About Disaster Recovery

This article reviews sixteen notable 2016 service outages across finance, cloud, and entertainment, analyzes their causes—ranging from power failures to DDoS attacks—and highlights the critical need for robust disaster‑recovery and information‑security practices.

Incident Management · Information Security · Operations

0 likes · 11 min read

DevOps

Dec 18, 2016 · Operations

Introduction to DevOps and Docker: Concepts, Components, and Implementation

This article explains the principles of DevOps, its technical, process, and organizational considerations, and introduces Docker as a key tool, detailing its architecture, components, native utilities, suitable scenarios, and how it enables continuous integration, delivery, and efficient operations.

CI/CD · Docker · Operations

0 likes · 14 min read

360 Zhihui Cloud Developer

Dec 15, 2016 · Operations

How Qcmd Revolutionizes Large‑Scale Server Automation Compared to SaltStack

This article explains how 360's Qcmd, a Golang‑based real‑time command execution system, overcomes SaltStack's limitations to reliably manage tens of thousands of servers with high success rates, flexible scripting, detailed monitoring, and efficient message handling.

Automation · Command Execution · Large Scale

0 likes · 7 min read

DevOps

Dec 13, 2016 · Operations

DevOps Is Not About Automation Tools, But They Are a Prerequisite

DevOps is a methodology that emphasizes collaboration between development and operations to accelerate software delivery, and while tools alone don’t constitute DevOps, automation and container technologies are essential prerequisites that reduce manual hand‑offs, enable self‑service, and improve feedback loops.

Automation · Operations · continuous delivery

0 likes · 7 min read

DevOps

Dec 11, 2016 · Operations

The Evolution of DevOps: From Agile Foundations to CALMS, Containerization, and Enterprise Best Practices

From its origins at the 2008 Agile conference to the modern CALMS framework, this article traces DevOps’s evolution, compares traditional, DevOps 1.0 and 2.0 approaches, and outlines key Chinese practices such as containers, continuous deployment, micro‑services, and enterprise best‑practice recommendations.

CALMS · Operations · continuous delivery

0 likes · 11 min read

ITFLY8 Architecture Home

Dec 8, 2016 · Operations

How CAT Enables Scalable Real‑Time Monitoring for Distributed Systems

This article introduces CAT, an open‑source Java‑based distributed real‑time monitoring platform, detailing its design goals, architecture, message processing pipeline, logging instrumentation, API, real‑time analysis, report modeling, storage challenges, and key takeaways for building highly available, scalable monitoring solutions.

Distributed Monitoring · Operations · log-aggregation

0 likes · 13 min read

Alibaba Cloud Developer

Dec 7, 2016 · Operations

How Alibaba Automates Its Network for Double 11 Traffic Surges

This article outlines Alibaba researcher Zhang Ming’s presentation on the network automation system that enables Alibaba’s infrastructure to handle the massive traffic and rapid fault recovery required during the Double 11 shopping festival, highlighting the challenges, detection methods, and automated tools used across routers, switches, and L4‑L7 devices.

Alibaba · Operations · fault detection

0 likes · 3 min read

ITFLY8 Architecture Home

Dec 6, 2016 · Operations

How to Build a Unified Monitoring and Alert Platform with Ganglia and Centreon

This article explains how to design and implement a comprehensive operations monitoring platform using Ganglia for data collection and Centreon for alerting, detailing a six‑layer architecture, integration steps, data flow, and practical Q&A for effective fault detection and response.

Alerting · Centreon · Ganglia

0 likes · 16 min read

Efficient Ops

Dec 5, 2016 · Operations

From PHP Monolith to Java Microservices: Mogujie's Ops Evolution and Lessons

This article recounts Mogujie's journey from a small PHP‑based LNMP stack to a Java‑driven micro‑service architecture, detailing the operational challenges, standardization efforts, continuous integration pipeline, and full‑link tracing techniques that enabled scalable, reliable e‑commerce services.

Continuous Integration · Full‑Link Tracing · Java migration

0 likes · 17 min read

Efficient Ops

Dec 4, 2016 · Operations

How Ctrip Built a Seamless Multi‑Region Dual‑Active Call Center

This article details Ctrip's evolution from a single‑site call‑center to a fully dual‑active, multi‑region architecture, covering the overall system design, public network, application, and client layers, unified login mechanisms, heartbeat monitoring, and future software‑only and mobile‑first directions.

Dual-Active · Operations · SRE

0 likes · 27 min read

Qunar Tech Salon

Dec 1, 2016 · Backend Development

How to Prevent Service Failures: Suspect Third‑Party, Guard Users, and Perfect Your Own Service

The article shares practical strategies for preventing service failures by doubting third‑party services, protecting against misuse by consumers, and improving one’s own code and architecture, covering fallback plans, timeout settings, retry policies, API design, traffic control, and resource limits.

API-design · Operations · Reliability

0 likes · 16 min read

Efficient Ops

Nov 28, 2016 · Operations

Essential Ops Insights: Tool Choices, Automation, and Best Practices from a Senior Ops Expert

This article compiles expert Q&A on operations, covering tool selection, monitoring, Linux version choices, automation platforms, security, Docker, backup strategies, and career advice, offering practical guidance for modern infrastructure management.

Automation · Linux · Operations

0 likes · 19 min read

Efficient Ops

Nov 27, 2016 · Operations

When Ops Heroes Burn Out: Tackling Personal Heroism in Operations

The article explores personal heroism in operations, defining it as reliance on individual effort to keep flawed systems appearing normal, examines its short‑term benefits and long‑term drawbacks for companies, teams, and the heroes themselves, and offers practical strategies to eliminate this risky mindset.

Incident Management · Operations · SLA

0 likes · 10 min read

dbaplus Community

Nov 23, 2016 · Operations

How to Rapidly Deploy DCOS Services with Ansible and Docker

This guide walks through an automated, fast‑track deployment of DCOS components—including service selection, Docker‑based containers, host initialization, system checks, Ansible provisioning, Consul service discovery, HAProxy load balancing, MySQL HA, and Zookeeper/Marathon integration—providing concrete commands, configuration snippets, and practical tips.

Ansible · Automation · Consul

0 likes · 12 min read

Efficient Ops

Nov 21, 2016 · Operations

7 Proven Bandwidth Optimization Strategies to Cut Social Platform Costs by 2 Billion

This article shares Tencent's seven practical bandwidth‑saving techniques—ranging from disabling auto‑play to intelligent pre‑push, file compression, on‑demand usage, segmented download, technical breakthroughs, and content compliance—to dramatically reduce operational costs while maintaining user experience.

Network Management · Operations · Performance

0 likes · 9 min read

dbaplus Community

Nov 20, 2016 · Operations

Top Insights from the 2016 Global Agile Operations Summit

The 2016 Global Agile Operations Summit in Shanghai concluded with a series of expert sessions covering agile DevOps trends, cloud‑native automation platforms, database performance tuning, container orchestration, and real‑world case studies from leading companies, followed by the award ceremony honoring ten MVPs who drove innovation across operations and infrastructure.

Automation · Cloud Computing · MVP

0 likes · 15 min read

Qunar Tech Salon

Nov 18, 2016 · Operations

Design and Implementation of Ctrip's Predictive Outbound Call Platform

This article describes Ctrip's large‑scale predictive outbound call platform, covering its underlying algorithms, SoftPBX integration, system architecture, concurrency enhancements, deployment experience, and measurable improvements in call success rates and agent efficiency.

Operations · Platform design · call center

0 likes · 8 min read

Efficient Ops

Nov 14, 2016 · Operations

What Ancient Medicine Teaches About Modern IT Risk Management

Using the classic tale of Bian Que, this article explains how proactive, mid‑stage, and reactive risk controls in IT operations prevent small issues from becoming catastrophic failures, illustrated with real‑world storage, cloud, and equipment‑selection case studies.

IT infrastructure · Operations · preventive control

0 likes · 7 min read

StarRing Big Data Open Lab

Nov 14, 2016 · Operations

Master Real-Time Hadoop Alerts with Transwarp Manager

Deploying the Transwarp Manager alert system within Hadoop clusters enables operators to monitor resource shortages, failures, and health issues in real time. It offers alert browsing, configurable thresholds, and instant email or script notifications, so problems can be identified and resolved before they impact services.

Alert Monitoring · Hadoop · Operations

0 likes · 9 min read

Architecture Digest

Nov 10, 2016 · Operations

Interview with Lu Pengcheng on Mogu Street’s Monitoring System Architecture and Evolution

In this interview, Lu Pengcheng, a platform architect at Mogu Street, discusses the company’s large‑scale e‑commerce architecture, the evolution of its monitoring platform, design choices for high‑availability distributed systems, and future open‑source plans, providing practical insights for engineers and technical managers.

C++ · High Availability · Operations

0 likes · 9 min read

Efficient Ops

Nov 9, 2016 · Operations

How to Design Effective SLOs and SLAs: A Technical Deep Dive

This article explains the definitions of service, SLI, SLO, and SLA, outlines how to choose and measure appropriate indicators, shares best practices for setting and improving SLOs, and shows how SLAs combine objectives with consequences to manage service reliability.

Cloud Computing · Operations · SLA

0 likes · 11 min read

Node Underground

Nov 9, 2016 · Operations

4 Common Node.js Ops Issues and How to Fix Them

This article outlines four frequent Node.js operational problems—memory leaks, CPU bottlenecks, back‑pressure, and security risks—and provides practical solutions such as heap‑dump analysis, CPU profiling, APM monitoring, and using private npm registries with tools like Snyk to secure dependencies.

Node.js · Operations · memory-leak

0 likes · 4 min read

ITPUB

Nov 9, 2016 · Operations

Diagnosing and Resolving High CPU Usage in a Linux Gateway Process

This article walks through a real‑world remote debugging session where a high‑CPU issue in a gateway service was reproduced, analyzed with top, gstack, gcore, strace and gdb, and traced to a buffer overflow causing an infinite loop, then fixed.

CPU · Operations · gdb

0 likes · 7 min read

Efficient Ops

Nov 7, 2016 · Operations

How to Train New SREs Effectively: Proven Practices and Playbooks

This article outlines a systematic approach to onboarding and training new Site Reliability Engineers, covering trust building, readiness assessment, diverse learning methods, structured curricula, on‑call milestones, project‑focused work, reverse‑engineering skills, statistical thinking, and improvisation techniques to develop high‑performing SRE teams.

On-Call · Operations · SRE

0 likes · 17 min read

360 Zhihui Cloud Developer

Nov 3, 2016 · Operations

Boost Web Server Performance: Proven Kernel and Application Tuning Guide

This article shares a practical template for optimizing web service performance, covering kernel sysctl tweaks, file‑handle limits, and Nginx/PHP‑FPM configuration adjustments to eliminate bottlenecks and maximize throughput under high traffic conditions.

Operations · System Optimization · kernel tuning

0 likes · 5 min read

Efficient Ops

Nov 2, 2016 · Operations

How Tencent Achieved Zero‑Impact Disaster Recovery for Hundreds of Millions of Users

This article details Tencent's multi‑region disaster‑recovery architecture and rapid, user‑transparent scheduling techniques that enable seamless service continuity for QQ and Qzone across hundreds of millions of daily users, illustrated through real‑world drills and performance metrics.

Large-Scale Scheduling · Operations · Tencent

0 likes · 16 min read

Architecture and Beyond

Nov 2, 2016 · Operations

Designing an Effective Log System for Startups: Levels, Collection, and ELK Architecture

This article explains how internet startups can build a robust logging system by defining log levels, essential log fields, best‑practice logging principles, and choosing between simple file logs or an ELK‑based collection pipeline for monitoring, troubleshooting, and analytics.

ELK · Logging · Operations

0 likes · 12 min read

ITPUB

Nov 2, 2016 · Operations

Monitor Linux System Resources with Simple Shell Scripts

This guide shows how to write Bash functions that retrieve process IDs, CPU, memory, file‑descriptor usage, port status, system load and disk space on a Linux server, and how to combine them with conditional checks to generate alerts when thresholds are exceeded.

Linux · Operations · Script

0 likes · 16 min read

Art of Distributed System Architecture Design

Nov 1, 2016 · Operations

JEN: JD Extended Nginx Platform for Scalable Management and Automation

The article introduces JEN, JD's extended Nginx platform that centralizes configuration, monitoring, traffic splitting, rate limiting and automated operations through a web console and Ansible integration, addressing the complexity, restart requirements, and scaling challenges of large‑scale Nginx deployments.

Automation · Nginx · Operations

0 likes · 14 min read

ITFLY8 Architecture Home

Oct 31, 2016 · Cloud Computing

How Taobao Scaled from LAMP to Cloud: Lessons in Cloud Migration Architecture

This article examines the evolution of Taobao's technical architecture—from a LAMP stack through Oracle‑based mainframes to a cloud‑native platform—highlighting the performance, scalability, and cost challenges of traditional IT and offering best‑practice strategies for migrating enterprise systems to the cloud.

Big Data · Cloud Computing · Databases

0 likes · 15 min read

Efficient Ops

Oct 29, 2016 · Databases

Why Your System Slows Down: Uncover Hidden Database Bottlenecks

The article explains how unnoticed database issues often cause system slowness, outlines key diagnostic questions for operations teams, and presents a three‑step approach—discover, solve, prevent—to regularly health‑check and optimize databases for reliable performance.

Databases · Operations · Performance

0 likes · 8 min read

dbaplus Community

Oct 25, 2016 · Operations

How to Build a Visual Continuous Delivery Pipeline for Faster, Safer Releases

This article outlines the challenges of modern software delivery in a VUCA environment and presents a practical, step‑by‑step approach to designing a visual continuous‑delivery pipeline that balances speed, quality, and reliability through agile, lean, and DevOps practices.

Agile · Automation · CI/CD

0 likes · 20 min read

Efficient Ops

Oct 23, 2016 · Operations

How Google’s SRE Postmortems Drive System Reliability

This article explains Google’s SRE postmortem philosophy, the criteria for writing postmortems, best practices for a blame‑free culture, and how collaborative knowledge‑sharing and incentives improve incident handling and overall system reliability.

Incident Management · Operations · SRE

0 likes · 14 min read

Architecture Digest

Oct 21, 2016 · Operations

Dynamic Configuration Management for Distributed Systems: Concepts, Challenges, and Practices

The article explains the importance of configuration in software, distinguishes static and dynamic configuration, discusses the challenges of managing configuration in large distributed systems, and describes the evolution, design principles, and practical solutions of configuration centers such as Alibaba's Diamond.

Diamond · Operations · Software Evolution

0 likes · 21 min read

360 Zhihui Cloud Developer

Oct 19, 2016 · Operations

Wonder Monitoring: Scaling Ops with Open‑Falcon‑Powered Automation

This article explains how the internally built Wonder monitoring system, based on Open‑Falcon, tackles large‑scale operational challenges by offering automated agent updates, customizable metrics, log and port monitoring, persistent alarm storage, enhanced alert content, and comprehensive dashboards for thousands of devices.

Alerting · Automation · Open-Falcon

0 likes · 7 min read

Efficient Ops

Oct 17, 2016 · Operations

How Shanda Games Built a Scalable Automated Operations System

This article details Shanda Games' journey in designing and implementing a comprehensive automated operations platform—including installation, deployment, security, client and server updates, data analysis, backup, and monitoring—to efficiently manage hundreds of games across diverse hardware and operating systems.

Automation · Deployment · Operations

0 likes · 22 min read

Efficient Ops

Oct 16, 2016 · Operations

Balancing Reliability and Innovation: Google’s SRE Risk Management Explained

This article explores how Google Site Reliability Engineers manage service reliability by balancing risk, cost, and business goals, using metrics like unplanned downtime, availability formulas, and risk tolerance to set realistic SLOs for both consumer and infrastructure services.

Google · Operations · SRE

0 likes · 21 min read

360 Zhihui Cloud Developer

Oct 14, 2016 · Operations

Mastering Rapid Incident Response: An Ops Engineer’s 4‑Step Method

This guide teaches operations engineers a four‑step “Wang‑Wen‑Wen‑Qie” methodology—observing system health, listening via alerts, questioning changes, and pulse‑checking with profiling tools—to rapidly diagnose and resolve high‑impact incidents while maintaining clear communication and post‑mortem learning.

Operations · Troubleshooting · incident response

0 likes · 5 min read

ITPUB

Oct 14, 2016 · Operations

Why Do Stale MySQL Sleep Connections Appear After HAProxy Timeout Mismatch?

A MySQL server hit its max_connections limit because of lingering sleep connections caused by mismatched timeout settings between LVS and HAProxy. Adjusting HAProxy's timeout resolved the issue and prevented further connection exhaustion.

Database Troubleshooting · HAProxy · LVS

0 likes · 5 min read

ITFLY8 Architecture Home

Oct 11, 2016 · Operations

How to Gracefully Degrade Services When Server Load Spikes

This article explains various service degradation strategies—including interface and page refusal, delayed persistence, and persistent‑layer restrictions—along with management approaches and implementation points such as middleware control, NGINX+LUA page blocking, and data‑operation rules, to keep core functions running under high server pressure.

Caching · High Availability · Operations

0 likes · 4 min read

Efficient Ops

Oct 9, 2016 · Mobile Development

How Alipay Scaled to a Super‑App: Architecture, Performance, and Ops Lessons

This article summarizes Alipay’s evolution into a super‑app, detailing its multi‑stage architecture, performance and power optimizations, stability improvements, and the comprehensive operations system that monitors and mitigates issues across millions of users.

Alipay · Mobile Development · Operations

0 likes · 11 min read

Efficient Ops

Oct 8, 2016 · Operations

How to Boost Server Resource Utilization: Strategies, Trade‑offs, and Metrics

This article explains why servers often run far below their theoretical capacity, defines the concept of highest usable resource utilization, and offers practical and advanced techniques—such as multithreading, workload consolidation, resource layering, and overselling—to improve utilization while weighing performance, cost, and reliability impacts.

Operations · Performance Optimization · Resource Efficiency

0 likes · 9 min read

DevOps

Oct 8, 2016 · Operations

What Is DevOps? Origins, Key Issues, Benefits, and Adoption

The article explains DevOps as the integration of development and operations, tracing its origins, outlining its cultural and technical challenges, detailing its benefits such as faster, more reliable releases, and reviewing the tools and global adoption trends, including a new Chinese survey initiative.

Automation · CI/CD · Culture

0 likes · 11 min read

Java Backend Technology

Oct 8, 2016 · Backend Development

Understanding Nginx: Core Concepts, Features, and Architecture

This article explains Nginx's role as a high‑performance HTTP and reverse‑proxy server, its event‑driven design, key features, internal process model, request handling flow, and real‑world deployments, providing a comprehensive overview for developers and operations engineers.

Backend Development · Operations · Reverse Proxy

0 likes · 11 min read

Efficient Ops

Oct 6, 2016 · Operations

How Ctrip Scales Application Operations: Practices, Automation, and Reliability

This talk details Ctrip's application operations framework, covering data‑center scale, multi‑application deployment on Windows, high availability goals, capacity‑prediction models, disaster‑recovery design, incident response, and the evolution from manual tooling to automated, intelligent operations.

Automation · Cloud Computing · Incident Management

0 likes · 21 min read

Efficient Ops

Oct 5, 2016 · Operations

5 Must‑Read Ops ‘Prescriptions’ to Boost Your Infrastructure Skills

This article curates five technical reads—covering network operations, Google’s production environment, massive cost‑saving strategies, IDC automation, and Docker‑based RDS—each presented as a “medicine” with a brief description and a link for deeper insight.

Cloud Computing · Docker · Operations

0 likes · 5 min read

ITPUB

Oct 4, 2016 · Operations

How to Build a Resilient High‑Traffic Website: A Complete Operations Guide

This guide outlines a step‑by‑step strategy for designing a highly available, secure, and scalable website architecture, covering domain acquisition, CDN deployment, image caching, data center selection, monitoring, DDoS mitigation, redundancy, server configuration, database replication, testing environments, and operational best practices.

High Availability · Operations · security

0 likes · 14 min read

Meituan Technology Team

Oct 1, 2016 · Operations

How Meituan Scaled Its Mobile Deal System for Mega‑Promotions: Traffic Modeling & Capacity Planning

This article details Meituan's technical approach to handling massive traffic spikes during large‑scale promotions, covering the background of the O2O deal platform, traffic‑model construction, capacity‑budget calculations, micro‑service architecture evolution, pressure‑test strategies, and the PTP performance‑testing environment used to validate system limits.

Operations · capacity planning · load testing

0 likes · 18 min read

Java High-Performance Architecture

Sep 30, 2016 · Operations

How Twitter Deploys Its High‑Traffic widgets.js with Zero‑Downtime Rollbacks

Twitter’s widgets.js, serving 300,000 requests per second, uses a carefully engineered deployment pipeline that emphasizes instant rollbacks, progressive rollouts, and real‑time visibility through DNS routing, CDN distribution, and origin management to minimize risk and ensure reliability.

DNS · Operations · Rollback

0 likes · 5 min read

360 Zhihui Cloud Developer

Sep 29, 2016 · Operations

How to Fix Yum 404 Errors Caused by Missing $releasever Variable on CentOS

This article explains why yum install commands return 404 errors on CentOS due to an undefined $releasever variable, analyzes the root cause in yum configuration, and provides a step‑by‑step solution and useful troubleshooting tips for operations engineers.

CentOS · Operations · Troubleshooting

0 likes · 5 min read

Efficient Ops

Sep 28, 2016 · Operations

Can Precise Operations Transform IT Service Delivery? A Deep Dive

This article explains the concept of precise operations, detailing how integrating business demand as a variable can make IT maintenance more proactive, value‑driven, and synchronized with business needs, and outlines a step‑by‑step framework with real‑world examples.

IT-service · Operations · business-alignment

0 likes · 14 min read

ITPUB

Sep 24, 2016 · Operations

Diagnose Linux Performance Issues in 1 Minute with 10 Essential Commands

When a Linux server’s load spikes, you can quickly pinpoint the root cause within a minute by running ten key commands that reveal CPU, memory, I/O, and kernel‑level metrics, enabling fast, data‑driven troubleshooting.

Linux · Operations · monitoring

0 likes · 12 min read

Ctrip Technology

Sep 23, 2016 · Backend Development

Reconstructing Ctrip's Payment Engine: Architecture, Services, and Operational Practices

This article presents Hong Guangming's detailed overview of Ctrip's payment engine reconstruction, covering the legacy system's challenges, the new split‑service architecture, channel‑specific services, settlement reconciliation, and the multi‑layered strategies employed to achieve high availability and operational stability.

Ctrip · Operations · backend

0 likes · 5 min read

Art of Distributed System Architecture Design

Sep 22, 2016 · Industry Insights

How WhatsApp Scaled to 450 Million Users with Erlang: Architecture and Lessons

This article dissects WhatsApp’s high‑reliability architecture that supports 450 million users, detailing its Erlang‑based backend, hardware choices, scaling techniques, monitoring tools, and the engineering lessons learned from pushing a single server to two‑million concurrent connections.

Erlang · Operations · WhatsApp

0 likes · 19 min read

dbaplus Community

Sep 22, 2016 · Operations

How Microsoft and Xiaomi Mastered DevOps: Practical Lessons for Global Scale

This article summarizes Ouyang Chen's GDevOps 2016 talk, covering the definition of DevOps, four personal viewpoints, Microsoft's three‑phase transformation, Xiaomi's rapid release pipeline, key principles, metrics such as time‑to‑detect, and essential tools for building an efficient DevOps culture.

Automation · Continuous Integration · Microsoft

0 likes · 19 min read

Art of Distributed System Architecture Design

Sep 20, 2016 · Operations

WhatsApp Scaling Architecture: Lessons from Two Years of Growth

Over the past two years WhatsApp has dramatically expanded its user base, hardware, and traffic while maintaining a tiny engineering team, highlighting the challenges of massive scalability, Erlang‑based distributed design, Mnesia database bottlenecks, decoupling strategies, and operational patches required to keep the service reliable.

Erlang · Mnesia · Operations

0 likes · 15 min read

Practical DevOps Architecture

Sep 20, 2016 · Operations

Troubleshooting PPPoE Dial‑up Failure on Huawei AR2240 Gateway

The article explains why a Huawei AR2240 gateway using PPPoE on an on‑demand dial‑bundle fails to obtain an IP address, identifies the misconfiguration of the internal Ethernet interface, and provides step‑by‑step commands to switch the PPPoE mode to permanent online for successful dialing.

Dial-up Configuration · Huawei AR2240 · Operations

0 likes · 4 min read

360 Zhihui Cloud Developer

Sep 18, 2016 · Artificial Intelligence

How Linear Regression Can Tame Your Nighttime Alert Fatigue

This article explores how historical monitoring alerts can be analyzed and predicted using linear regression, guiding operations engineers to preprocess data, build regression models, and forecast future alert trends to reduce manual alarm handling and improve system stability.

Machine Learning · Operations · alert prediction

0 likes · 8 min read

Qunar Tech Salon

Sep 18, 2016 · Operations

Analyzing Nginx Access Logs for Traffic, Performance, and Optimization

This article explains how to extract valuable performance and traffic insights from Nginx access logs using shell commands and awk, covering request volume, peak rates, bandwidth usage, slow‑query detection, URL normalization, and practical optimization recommendations for web operations.

Operations · Shell Scripting · awk

0 likes · 13 min read

Efficient Ops

Sep 13, 2016 · Operations

How Google SRE Principles Compare Across Industries

This article, excerpted from the upcoming Chinese edition of “SRE: Google Site Reliability Engineering”, examines how Google’s SRE guiding philosophies—disaster planning, post‑mortem culture, automation, and data‑driven decision‑making—are adopted, adapted, or contrasted in sectors such as manufacturing, aerospace, nuclear, telecommunications, healthcare, and finance, highlighting key similarities, differences, and lessons for Google and the broader tech industry.

Automation · Operations · SRE

0 likes · 21 min read

Architecture Digest

Sep 11, 2016 · Operations

Designing and Operating High‑Scale E‑commerce Systems: Insights from Dangdang

The article details Dangdang's 15‑year evolution from a monolithic platform to a distributed, SOA‑based architecture, describing system tiering, front‑end and back‑end scaling techniques, asynchronous processing, data‑flow optimization, and operational practices that enable stable handling of ten‑fold traffic spikes during major sales events.

E‑Commerce · High concurrency · Operations

0 likes · 17 min read

360 Zhihui Cloud Developer

Sep 8, 2016 · Operations

How to Diagnose App “Unavailable” Issues with Ping&DNS: A Step‑by‑Step Guide

This article walks readers through practical techniques for identifying and resolving common “unavailable” app problems—such as white screens, slow loading, or error codes—by using the lightweight Ping&DNS Android tool to check domain names, IP connectivity, DNS resolution, and traceroute data, empowering both novice users and professionals.

DNS · Mobile App · Operations

0 likes · 8 min read

Qunar Tech Salon

Sep 7, 2016 · Databases

Principles and Practices of MySQL Database Sharding

The article explains when MySQL sharding is needed, outlines five practical principles for deciding to split databases or tables, provides real‑world examples, and shares operational tips for implementing horizontal partitioning to improve performance, availability, and manageability.

MySQL · Operations · Partitioning

0 likes · 8 min read

Architects' Tech Alliance

Sep 7, 2016 · Operations

How Agentless Backup Works in Cloud Environments and Its Trade‑offs

The article examines agentless backup technology, comparing its implementation in virtualized and physical environments, detailing supported interfaces, evaluating a real‑world Asigra Cloud Backup case, and discussing security risks, performance impacts, and when traditional agents remain necessary.

Cloud Backup · Data Protection · Information Security

0 likes · 7 min read

Java High-Performance Architecture

Sep 4, 2016 · Operations

How to Limit Concurrent Connections from a Host Using iptables

This guide demonstrates how to simulate a high‑traffic scenario between two machines and use an iptables rule to reject connections from a specific host when its concurrent requests exceed ten, including command syntax, execution steps, and result analysis.

Linux · Operations · connection limiting

0 likes · 3 min read

Meituan Technology Team

Sep 2, 2016 · Databases

Automated Database Operations Platform Overview

An automated database operations platform shifts routine DBA tasks to developer self‑service and fully automates processes such as cluster provisioning, scaling, backup, migration, and sharding, using a stateless workflow center, job queue, and unified SqlEditor to improve efficiency, safety, and auditability.

Automation · Data Migration · Job Center

0 likes · 8 min read

Efficient Ops

Aug 28, 2016 · Operations

Six Proven Methods to Optimize Server Capacity and Cut Costs in Large‑Scale Social Networks

Tencent's SNG team shares six practical capacity‑management techniques—performance, density, feature, fragmentation, barrel, and hardware selection methods—that helped reduce operational expenses by over a hundred million yuan annually while supporting hundreds of millions of daily active users.

Operations · capacity management · cloud infrastructure

0 likes · 10 min read

Qunar Tech Salon

Aug 26, 2016 · Backend Development

Reconstructing Ctrip's Payment Engine: Architecture, Design, and Operational Practices

This presentation details Ctrip's payment engine reconstruction, describing the limitations of the legacy system, the new service‑oriented architecture separating payment planning and execution, channel‑specific services, the reconciliation system, and the operational measures taken to achieve high payment availability.

Ctrip · Operations · payment

0 likes · 5 min read

Ctrip Technology

Aug 26, 2016 · Information Security

Automated Firewall Operations and Management System at Ctrip

The article describes how Ctrip’s network security team built an automated, centralized firewall management platform that handles multi‑brand firewalls, streamlines policy queries, generation, and deployment, integrates with change‑ticket workflows, and dramatically improves operational efficiency while reducing human error.

Ctrip · Operations · firewall automation

0 likes · 14 min read

Efficient Ops

Aug 23, 2016 · Mobile Development

How Tencent Cut Mobile QQ/Qzone Lag with Network & Client Optimizations

This article details Tencent's practical approaches to reducing user‑perceived latency in mobile QQ and Qzone by analyzing server, network, and client delays, employing private protocols, multi‑path connection strategies, real‑time monitoring, and big‑data clustering to identify and fix performance bottlenecks.

Operations · big data analysis · client monitoring

0 likes · 16 min read

Architecture Digest

Aug 22, 2016 · Operations

Understanding High‑Availability Systems: Design Principles, Technical Solutions, and SLA Measurement

This article explains the comprehensive concept of high‑availability systems, covering redundancy, failover, consistency challenges, various technical solutions, SLA definitions, and the organizational and engineering practices required to achieve multiple “9s” of availability.

High Availability · Operations · SLA

0 likes · 14 min read

ITPUB

Aug 21, 2016 · Backend Development

How to Diagnose and Prevent Redis Data Loss in Production

This article examines common causes of Redis data loss, walks through a real‑world incident where 90,000 keys vanished, and provides concrete monitoring, configuration, and operational safeguards to detect and avoid such failures.

Data loss · Operations · Redis

0 likes · 11 min read

Efficient Ops

Aug 18, 2016 · Operations

How Ant Financial Scales to 86,000 TPS: Cloud‑Native Operations Lessons

This article details Ant Financial's evolution from supporting 20,000 transactions per minute in 2010 to 86,000 transactions per second in 2015, describing its multi‑active architecture, financial‑grade operation platform, and organizational mechanisms that enable high availability, automated capacity management, and fault handling in a cloud‑native environment.

Operations · financial technology

0 likes · 15 min read

Baidu Intelligent Testing

Aug 18, 2016 · Mobile Development

Establishing an Effective Mobile App Quality Monitoring System: Standards, Metrics, and Data Utilization

This article explains how to build a comprehensive mobile app quality monitoring framework by defining quality standards, setting capability indicators, and leveraging data acquisition, analysis, and visualization to continuously improve product reliability and user experience across different development stages.

Mobile App · Operations · Quality Monitoring

0 likes · 11 min read

ITPUB

Aug 16, 2016 · Databases

Achieving Seamless MySQL HA with Pacemaker and MHA: Lessons from DTCC 2016

This article details a MySQL high‑availability solution built on Pacemaker, Corosync and MHA, explains why earlier keepalived‑based designs suffered split‑brain issues, and walks through the architecture, quorum handling, resource agents, failover workflow, testing methodology, and practical lessons learned.

High Availability · MHA · MySQL

0 likes · 16 min read

MaGe Linux Operations

Aug 10, 2016 · Operations

How to Build a Docker Swarm Cluster on Three Ubuntu Nodes Step‑by‑Step

This guide walks you through setting up a Docker Swarm cluster on three Ubuntu 16.04 servers, covering environment preparation, Docker installation, configuring a Consul discovery backend, creating manager and worker nodes, and managing containers across the swarm using the Docker remote API.

Consul · Container Orchestration · Docker

0 likes · 6 min read

Efficient Ops

Aug 7, 2016 · Operations

Automated Operations Platforms: Stages, Pain Points, and Design Blueprint

This article outlines the evolution of enterprise operations through four stages, identifies seven common operational pain points, and presents a comprehensive model for building an automated operations platform that integrates design, deployment, monitoring, optimization, and troubleshooting.

Automation · CMDB · IT infrastructure

0 likes · 12 min read

Architecture Digest

Aug 7, 2016 · Operations

Designing a Three‑Dimensional High‑Availability Architecture for Alibaba's Game Integration System

The article describes how Alibaba's game integration platform achieved business‑oriented high availability by abandoning traditional system‑centric designs and implementing a three‑dimensional architecture that combines clear HA goals, multi‑active deployment, client‑side retries, functional isolation, automated monitoring, and rapid fault recovery, ultimately meeting a 3‑minute issue‑location and 5‑minute business‑recovery target.

High Availability · Operations · business‑oriented HA

0 likes · 21 min read

Efficient Ops

Aug 3, 2016 · Operations

How Harbor Enables Seamless Container Image Replication Across Registries

This article explains the design and implementation of Harbor's policy‑based Docker image replication, detailing its architecture, job service workflow, state‑machine handling, and how it reduces storage‑specific dependencies while simplifying large‑scale container registry synchronization.

Container Registry · Harbor · Operations

0 likes · 8 min read

MaGe Linux Operations

Aug 1, 2016 · Operations

Mastering MHA Core Parameters: Complete Guide to MySQL HA Configuration

This article provides a detailed walkthrough of MHA's core configuration parameters—including server scopes, connection settings, candidate master rules, failover scripts, and monitoring options—explaining where each setting belongs and how to fine‑tune MySQL high‑availability behavior.

High Availability · MHA · MySQL

0 likes · 11 min read

Efficient Ops

Aug 1, 2016 · Operations

How Tencent Shifted 70M Users During Tianjin Explosion – A Multi‑Active Ops Playbook

This article details how Tencent's operations team orchestrated a seamless, zero‑impact migration of over 70 million users across three data centers during the 2015 Tianjin explosion, highlighting the four key capabilities—distribution, scheduling, data synchronization, and automated operations—that enabled multi‑active disaster recovery at massive scale.

Data synchronization · Disaster Recovery · Operations

0 likes · 22 min read

21CTO

Jul 30, 2016 · Operations

Building a 3‑Minute Fault Detection, 5‑Minute Recovery HA System for Games

This article explains how Alibaba’s NineGame platform achieved ultra‑high availability by shifting from system‑centric to business‑centric design, defining measurable goals (3‑minute issue detection, 5‑minute recovery, bi‑monthly incidents) and implementing a layered, automated, visual monitoring, client‑side retry, HTTP‑DNS, functional isolation, and multi‑site active‑active architecture.

Operations · business‑centric design · fault tolerance

0 likes · 22 min read

Baidu Intelligent Testing

Jul 28, 2016 · Operations

Ensuring Store Data Quality in O2O Products: Processes and Rules

This article outlines the importance of store data in O2O products and presents a comprehensive workflow—including single‑attribute rules, multi‑attribute cross‑validation, and auxiliary checks—to detect and remediate low‑quality or erroneous store information, thereby improving user experience.

Data Quality · Data Validation · O2O

0 likes · 8 min read

Efficient Ops

Jul 27, 2016 · Operations

Mastering Service Degradation: Strategies to Keep High‑Traffic Systems Alive

This article explores practical service‑degradation techniques (including automatic and manual switches, read/write fallback, and multi‑level strategies) that keep core functionality available during traffic spikes, failures, or resource constraints in high‑concurrency systems.

High concurrency · Operations · backend

0 likes · 11 min read

dbaplus Community

Jul 27, 2016 · Operations

Master Java Performance Debugging: Hprof, pidstat, and Real‑World Memory Leak Fixes

This guide walks Java developers through practical performance analysis using the Hprof agent and pidstat tool, demonstrates step‑by‑step command usage with real code examples, and presents a memory‑leak case study that explains GC overhead limits, dump analysis, and concrete remediation steps.

Hprof · Java · Operations

0 likes · 9 min read

Qunar Tech Salon

Jul 26, 2016 · Operations

Qunar's Watcher Monitoring System: Design, Implementation, and Operational Practices

Zhang Yue, a Qunar operations engineer, discusses the design, selection, architecture, scalability challenges, visualization, alert strategies, and future plans of the company's in‑house monitoring platform Watcher, highlighting lessons learned from migrating from Cacti to a Graphite‑based, Grafana‑enhanced solution.

Alerting · Grafana · Graphite

0 likes · 7 min read

Efficient Ops

Jul 25, 2016 · Operations

How We Overcame Real‑World Challenges in a Large‑Scale Oracle Database Cutover

This article recounts a seven‑year‑old Oracle 10g database migration, detailing project background, team turmoil, topology redesign, security constraints, data‑sync strategies, custom tools, high‑fidelity testing, unexpected failures, and the lessons learned for reliable operations.

Data synchronization · Operations · Oracle

0 likes · 14 min read

Architects' Tech Alliance

Jul 20, 2016 · Operations

How Distributed Indexing Improves Backup Performance and Scalability

The article explains how traditional centralized backup indexes become performance bottlenecks as data grows, and details Simpana's two‑level distributed indexing architecture—primary and secondary indexes—showing how it enhances backup speed, reduces network load, and simplifies recovery across multi‑site environments.

Data Recovery · Operations · Simpana

0 likes · 7 min read

Efficient Ops

Jul 19, 2016 · Operations

How Alibaba Games Built a 4‑9 High‑Availability System: Architecture, HTTP‑DNS & Ops Practices

This article details Alibaba Games' journey to achieve four‑nine reliability through a business‑focused high‑availability architecture, including system analysis, a four‑layer design, HTTP‑DNS client retry, service decoupling, multi‑active deployment, comprehensive monitoring, and measurable operational goals.

Operations · http-dns · service decoupling

0 likes · 21 min read
