Tagged articles
3281 articles
Java Backend Technology
Mar 5, 2020 · Operations

How a Massive Delete-Database Crisis at Weimeng Reveals Key Ops Lessons

On Feb 23, Weimeng suffered a large‑scale system outage after a core operations employee mistakenly deleted production databases, prompting a multi‑day recovery effort with Tencent Cloud support. The article examines the incident’s background, historical parallels, crisis response, and broader operational insights for DevOps and reliability engineering.

Database Recovery · DevOps · Operations
0 likes · 16 min read
Youku Technology
Mar 4, 2020 · Operations

Youku Playback Testing Platform: Unified Automation Framework, Services, and System Design

Youku’s unified playback testing platform consolidates a modular automation framework, a comprehensive service chain, and a layered platform ecosystem to standardize workflows, support multiple device types, and provide transparent, real‑time monitoring, thereby reducing development complexity and paving the way for intelligent case recommendation and dynamic verification.

Operations · Software Testing · Testing Platform
0 likes · 11 min read
Top Architect
Mar 3, 2020 · Databases

MySQL Performance Tuning Tools: mysqltuner.pl, tuning-primer.sh, pt-variable-advisor, and pt-query-digest

This article introduces several MySQL performance‑tuning utilities—including mysqltuner.pl, tuning‑primer.sh, pt‑variable‑advisor, and pt‑query‑digest—explains how to download, install, run them, and interpret their reports to identify configuration issues and optimize database performance.

Database Tools · Operations · SQL Optimization
0 likes · 9 min read
Liangxu Linux
Mar 2, 2020 · Operations

Master Linux CLI with kmdr: Interactive Command Explanation Tool

This article introduces the free, open‑source kmdr CLI tool, explains how to install it with Node.js or use the web demo, and demonstrates its ability to break down complex Linux commands into readable modules, covering a wide range of common utilities.

CLI · Command-line tools · Node.js
0 likes · 8 min read
Liangxu Linux
Mar 2, 2020 · Operations

Master Linux Terminal: Fix Common Command Errors and Essential Shortcuts

This guide explains typical Linux terminal pitfalls such as incomplete commands, filename typos, and wrong directories, and provides practical shortcuts like tab completion, history navigation, and quick command substitution to boost productivity for developers and system operators.

Operations · shortcuts · terminal
0 likes · 6 min read
dbaplus Community
Mar 2, 2020 · Operations

How Jiangsu Mobile Built a Billion‑Call Real‑Time Monitoring Platform with Prometheus

Facing the explosion of 5G traffic and billions of daily call records, Jiangsu Mobile’s IT operations team adopted Prometheus as the core time‑series database, designing a high‑availability, low‑latency monitoring platform that captures, stores, visualizes and predicts performance metrics across their massive billing system.

5G · Operations · Prometheus
0 likes · 9 min read
DevOps Cloud Academy
Feb 28, 2020 · Operations

How the Mall Team Leveraged a DevOps Toolchain for Remote Development During the Pandemic

During the COVID‑19 pandemic, the mall development team adopted a comprehensive DevOps toolchain—including JIRA, GitLab, Jenkins, Sonar, Docker, and Wiki—to enable end‑to‑end remote development, automated pipelines, and continuous delivery, resulting in improved efficiency, reliable releases, and seamless collaboration.

DevOps · Docker · Jenkins
0 likes · 8 min read
DevOps Cloud Academy
Feb 27, 2020 · Operations

Jenkins Infrastructure, Project Management, and Configuration‑as‑Code Overview

This article introduces Jenkins infrastructure setup, including installation via Ansible, Puppet, Chef or Docker, outlines management tools such as CLI, REST API, python‑jenkins and Jenkins‑client, describes project creation plugins like Job DSL, Job Builder and Jenkinsfile, and explains system configuration using Groovy scripts and the Configuration‑as‑Code plugin.

DevOps · Infrastructure · Jenkins
0 likes · 3 min read
Efficient Ops
Feb 26, 2020 · Operations

What the Weimeng Delete‑Database Outage Teaches About Modern Ops

After a core operations staff member accidentally deleted Weimeng’s production database in February, the platform endured a multi‑day outage, prompting a transparent crisis response, extensive Tencent Cloud support, and a deep analysis of recovery challenges, operational best practices, and the broader lessons for modern DevOps teams.

Database Recovery · Operations · crisis management
0 likes · 15 min read
ITPUB
Feb 26, 2020 · Information Security

What We Learned from the Weimeng Data Deletion Disaster: Backup and Permission Strategies

The article analyzes the recent Weimeng database deletion incident, explains why recovery took 36 hours, and provides practical guidance on backup practices, minimal‑privilege management, and cloud‑based disaster recovery to prevent similar data loss in small and large organizations.

Backup · Database Security · Information Security
0 likes · 9 min read
21CTO
Feb 25, 2020 · Operations

Inside the Massive SaaS Data Deletion: How a Core Engineer Wiped Out Millions

A Chinese SaaS provider suffered a catastrophic data loss when a core operations employee maliciously deleted its production databases, prompting emergency repairs, police involvement, and a multi‑day recovery effort that exposed critical gaps in permission management and backup strategies.

Backup · Operations · SaaS
0 likes · 8 min read
Efficient Ops
Feb 24, 2020 · Operations

How to Build an Effective Operations Monitoring Platform: Tools, Design, and Best Practices

This article explains why monitoring is essential for operations, reviews popular monitoring tools such as Cacti, Nagios, Zabbix, Ganglia, Centreon, Prometheus and Grafana, outlines a six‑layer unified monitoring platform architecture, offers selection guidance for different enterprise sizes, and shares evolution lessons from small to large scale deployments.

DevOps · Grafana · Operations
0 likes · 20 min read
Big Data Technology Architecture
Feb 24, 2020 · Operations

Evolution and Optimization of JD.com Order Center Elasticsearch Cluster Architecture

This article details how JD.com’s order center migrated its Elasticsearch cluster through multiple architectural stages—initial deployment, isolation, replica tuning, master‑slave adjustments, and real‑time dual‑cluster backup—while addressing data synchronization, scaling, and performance pitfalls to achieve high availability and query stability.

Cluster Architecture · Elasticsearch · JD.com
0 likes · 13 min read
Qunar Tech Salon
Feb 20, 2020 · Operations

Design and Implementation of Business‑Driven Monitoring Systems at JD Cloud

This article explains why monitoring is essential for operations, outlines the four‑layer monitoring standard (infrastructure, liveness, performance, business), breaks down functional modules and data flows, and showcases JD Cloud's practical design, alarm‑convergence project, and future AI‑driven observability directions.

JD Cloud · Operations · alert convergence
0 likes · 12 min read
DevOps
Feb 19, 2020 · Operations

Single‑Point Breakthrough in Enterprise DevOps Transformation: JD.com Case Study

The article explains how focusing on a single critical point—such as the deployment stage—can dramatically accelerate an organization’s end‑to‑end DevOps transformation, illustrated with JD.com’s journey from manual releases to an automated, high‑efficiency continuous delivery platform.

Continuous Delivery · DevOps · JD.com
0 likes · 11 min read
Didi Tech
Feb 18, 2020 · Operations

Didi's National Carpool Day: Technical Insights into Stability Assurance

Didi's National Carpool Day on Dec 3, 2019 attracted 3.1M passengers. Stability was ensured through six pillars: an organized task force, capacity forecasting with rapid container scaling, comprehensive monitoring with a fire‑fighting map, a robust contingency platform, strict process standards, and coordinated third‑party preparation.

Carpool Day · Didi · Operations
0 likes · 13 min read
HomeTech
Feb 12, 2020 · Operations

Design and Architecture of an IBPM Workflow Platform

This article outlines the design, architecture, and key features of an IBPM workflow platform, detailing its background, core concepts, design principles, extensibility, and future direction for creating a configurable, integrated, and intelligent business process management solution.

BPM · Operations · platform design
0 likes · 4 min read
Mafengwo Technology
Feb 8, 2020 · Operations

How a Travel Platform Engineered a Pandemic‑Era Emergency Response: Operations Lessons

During the 2020 Chinese New Year lockdown, a travel platform mobilized its development, product, and operations teams to rapidly build refund systems, coordinate with suppliers, and ensure continuous online services, showcasing a user‑first, cross‑functional emergency strategy that balanced technical delivery with intense customer pressure.

Operations · incident response · pandemic
0 likes · 13 min read
Efficient Ops
Feb 5, 2020 · Operations

Balancing Stability and Speed: Google SRE Lessons for Modern Ops Teams

This article examines the inherent tension between operations and development, explains Google’s error‑budget and SLO approach, and shares practical DevOps, on‑call, automation, and talent strategies that help ops teams improve efficiency while maintaining product reliability.

Error Budget · On-Call · Operations
0 likes · 9 min read
Alibaba Cloud Developer
Feb 2, 2020 · Backend Development

How Chinese Developers Built a Rapid COVID-19 Travel Query Tool in One Day

In early 2020, a small team of Chinese developers swiftly created a COVID-19 travel companion query tool—designing, coding, and deploying a searchable web service within a single day, then scaling it to millions of users using CDN, static site generation, and cloud storage, while emphasizing data accuracy and rapid response.

COVID-19 · Operations · Rapid Prototyping
0 likes · 11 min read
Architects' Tech Alliance
Jan 17, 2020 · Fundamentals

Overview of Server Benchmark Standards: TPC and SPEC

The article explains the origins, metrics, and test suites of TPC and SPEC benchmarks, describes their various models for CPU, web, HPC and storage performance, shows how to query official results, and notes a promotional bundle of technical e‑books.

CPU · Operations · Performance Testing
0 likes · 9 min read
Tencent Tech
Jan 17, 2020 · Cloud Computing

How QQ Tackled Massive Cloud Migration Challenges – Tencent’s Strategy Revealed

This article describes how Tencent’s QQ service migrated over a million servers to public cloud, covering comprehensive planning, phased execution, and solutions to security, dependency, disaster recovery, and gray‑scale challenges, and highlighting the infrastructure upgrades, database migration, cloud‑native tools, and operational transformations that ensured zero user impact.

Infrastructure · Operations · QQ
0 likes · 20 min read
Efficient Ops
Jan 16, 2020 · Operations

Mastering WAS Memory Overflow: Elegant Strategies for Resolution

This article explains IBM WebSphere Application Server's memory architecture, common causes of Java OutOfMemoryError in WAS, and provides a step‑by‑step guide—including log collection, heap analysis, and preventive measures—to diagnose, resolve, and avoid memory overflow incidents in production environments.

Garbage Collection · Operations · WAS
0 likes · 16 min read
DevOps
Jan 15, 2020 · Operations

Building Trust, Respect, and Accountability: The Role of Culture in DevOps Transformation

The article explains how a strong, transparent enterprise culture—characterized by trust, respect, and accountability—is the foundational prerequisite for successful DevOps transformation, illustrating key concepts, cultural barriers, and real‑world case studies that show why cultural change must precede technical adoption.

Culture · DevOps · Enterprise
0 likes · 10 min read
Efficient Ops
Jan 8, 2020 · Operations

How a Bank Built an Automated Operations Platform and CMDB Middle‑Platform

This article details how Ping An Bank tackled rapid growth and complex regulatory demands by creating an automated operations middle‑platform, designing a CMDB with data‑closure and subscription mechanisms, and implementing orchestration, gray‑scale deployment, and high‑risk detection to achieve resilient, scalable infrastructure management.

CMDB · Infrastructure · Operations
0 likes · 21 min read
macrozheng
Jan 8, 2020 · Operations

How to Set Up Jenkins Automated Deployment for the Mall Project

This guide walks you through preparing scripts, uploading them, making them executable, and creating Jenkins jobs for each module of the multi‑module Mall project to achieve fully automated deployment using free‑style projects and SSH execution.

Deployment · DevOps · Jenkins
0 likes · 8 min read
DevOps
Jan 7, 2020 · Operations

DevOps Planning and Practice in a Large State‑Owned Commercial Bank

This article outlines how a major state‑owned commercial bank designed and implemented a DevOps framework—including goals, architecture, the three main pillars of tools, processes, and standards—and shares practical insights, maturity assessment methods, and Q&A for large‑scale financial institutions.

Banking · DevOps · Maturity Assessment
0 likes · 12 min read
Java Backend Technology
Jan 7, 2020 · Backend Development

Mastering Retry and Idempotency: Prevent Timeout Failures in High‑Concurrency Systems

This article examines a real‑world group‑buy scenario, explains why timeout‑prone interfaces need robust retry and idempotency handling, distinguishes read and write timeouts, outlines key idempotency practices for services and messages, and introduces Guava‑retrying and Spring‑retry as elegant solutions.

Distributed Systems · Operations · Retry
0 likes · 13 min read
Qunar Tech Salon
Jan 7, 2020 · Operations

Comprehensive Dependency Governance for High‑Availability Backend Systems

This article outlines a systematic approach to dependency governance in high‑traffic backend services, covering service classification, rate limiting, Dubbo, HTTP, database, and message‑queue management to enhance availability, reduce failure impact, and improve overall system stability.

Dubbo · Operations · dependency management
0 likes · 10 min read
MaGe Linux Operations
Jan 2, 2020 · Operations

Mastering RPM and YUM: Essential Commands for Linux Package Management

This guide explains Linux package naming conventions, how to inspect binary dependencies and installed libraries, and provides a comprehensive collection of RPM and YUM commands—including installation, query, verification, removal, and repository configuration—to help administrators manage software efficiently.

CLI · Linux · Operations
0 likes · 7 min read
Youku Technology
Jan 2, 2020 · Operations

Quality Assurance and Stability Strategies for Alibaba Double 11 "Cat Night" Live Streaming

The QA team delivered a seamless, globally stable Double 11 “Cat Night” live stream across three apps and dozens of devices by applying client‑ and server‑side stability measures, international latency simulation, IPv6 support, cost‑effective CDN strategies, full‑chain monitoring, and automated asset‑loss safeguards, achieving zero financial loss.

Mobile · Operations · Performance Testing
0 likes · 16 min read
dbaplus Community
Dec 30, 2019 · Operations

How Alibaba’s ECS Team Built a Scalable SRE System for Massive Cloud Services

This article explains the origins of Site Reliability Engineering (SRE), outlines the responsibilities of SRE teams, and details Alibaba Cloud’s ECS SRE practices—including capacity planning, performance optimization, full‑stack stability governance, automated release pipelines, on‑call processes, and the core principles and mindset that guide modern SRE work.

Operations · SRE · Site Reliability Engineering
0 likes · 28 min read
Youzan Coder
Dec 30, 2019 · Operations

How to Measure and Improve Project Efficiency: A Practical Guide

This article explains why measurement is essential for management, outlines a step‑by‑step process for collecting and analyzing efficiency metrics, and shows how to turn data‑driven insights into concrete conclusions and actionable improvement plans for software projects.

Continuous Improvement · Operations · Project Management
0 likes · 10 min read
DevOps Cloud Academy
Dec 30, 2019 · Operations

How to Implement an Effective CI/CD Pipeline

Implementing an effective CI/CD pipeline involves understanding continuous integration, delivery, and deployment, recognizing their benefits such as faster feedback and early error detection, and following key stages—from commit and build to testing and production deployment—while selecting appropriate tools and practices to streamline software delivery.

Continuous Delivery · DevOps · Operations
0 likes · 6 min read
Efficient Ops
Dec 28, 2019 · Operations

What the 2019 IT Operations Whitepaper Reveals About Enterprise Ops Trends

The 2019 Enterprise IT Operations Whitepaper, released at the national Operations Conference, systematically examines the definition, value, key capabilities, industry applications, challenges, and future trends of IT operations across telecom, finance, Internet, and manufacturing sectors.

Artificial Intelligence · Big Data · IT Operations
0 likes · 6 min read
Qunar Tech Salon
Dec 27, 2019 · Operations

Qunar Ticket Test‑Environment Governance and Automated Monitoring Framework

This article describes Qunar Ticket’s comprehensive test‑environment governance framework, including the “Mirror‑Inspect” monitoring service, configuration and data synchronization strategies, and automated allocation management, highlighting how these practices reduced environment‑related project delays from up to 20% to below 8%.

Configuration Management · Operations · monitoring
0 likes · 11 min read
Efficient Ops
Dec 26, 2019 · Operations

How China Telecom’s DICT Leverages DevOps for Agile Cloud‑Native Development

At the 2019 Operations Conference in Beijing, China Telecom’s R&D leader detailed the company’s transformation journey, the DICT capability center’s strategy, the Biying cloud platform, and the implementation of an integrated DevOps platform that streamlines end‑to‑end software delivery using containers, CI and security automation.

Agile Development · China Telecom · Cloud Native
0 likes · 3 min read
Efficient Ops
Dec 26, 2019 · Operations

Inside Jiangsu Telecom’s Leap to Level‑3 DevOps Continuous Delivery

The article recounts Jiangsu Telecom’s successful Level‑3 DevOps continuous delivery assessment at the 2019 Beijing Operations Conference, highlighting the role of standardization and tooling, sharing interview insights on the intelligent pre‑processing system, and outlining the broader DevOps standard ecosystem in China.

Continuous Delivery · Operations · case study
0 likes · 10 min read
Efficient Ops
Dec 26, 2019 · Operations

How CITIC Bank Achieved Level‑3 DevOps Continuous Delivery: Key Lessons

CITIC Bank’s software development center shares how three flagship projects passed the level‑3 DevOps continuous‑delivery assessment, revealing the role of standardization, tool empowerment, agile practices, and container‑based pipelines in accelerating delivery and boosting team morale.

CITIC Bank · Continuous Delivery · Operations
0 likes · 15 min read
Aikesheng Open Source Community
Dec 25, 2019 · Operations

Deploying Thanos for Unified Prometheus Monitoring and Long‑Term Storage

This guide explains the background, key features, architecture, and step‑by‑step deployment of Thanos—including Sidecar, Store, Query, Compact, Bucket, Rule, and Check components—to provide a unified, high‑availability Prometheus monitoring view with unlimited historical data storage using object storage.

Cloud Native · Deployment · Long‑term Storage
0 likes · 9 min read
FunTester
Dec 25, 2019 · Industry Insights

Why DevTestOps Is the Next Evolution in DevOps Automation

This article explains the evolution from traditional DevOps to DevTestOps, detailing continuous testing, the benefits of integrating automated testing into DevOps pipelines, practical implementation steps, and why organizations should adopt DevTestOps to enhance software quality and delivery speed.

ContinuousTesting · DevOps · DevTestOps
0 likes · 8 min read
Efficient Ops
Dec 22, 2019 · Operations

How Baidu’s Noah Monitoring System Tackles AIOps Challenges at Scale

This article examines Baidu’s Noah monitoring and alarm platform, detailing its end‑to‑end fault‑handling workflow, the three‑component architecture, and the practical challenges of deploying AIOps—such as long algorithm iteration cycles, complex alarm management, and alarm storms—while highlighting scalability and commercial considerations.

Alarm Management · Operations · aiops
0 likes · 15 min read
Ctrip Technology
Dec 19, 2019 · Cloud Native

Evolution of Ctrip Cloud Platform: From OpenStack IaaS to Cloud‑Native Kubernetes

This article chronicles Ctrip's cloud‑infrastructure journey: from the early OpenStack‑based IaaS platform, through containerization with Mesos and the migration to large‑scale Kubernetes clusters, to the adoption of cloud‑native practices that improve resource utilization, deployment speed, and application governance.

Cloud Native · DevOps · Infrastructure as Code
0 likes · 10 min read
Efficient Ops
Dec 18, 2019 · Operations

How CITIC Bank Pioneered Organizational‑Level Agile and DevOps Practices

CITIC Bank’s DevOps lead Li Hongtao explains how the bank’s new organizational‑level agile practice integrates development and data‑center operations, employs a DevOps capability maturity model, and cultivates agile coaches to overcome transformation challenges, offering a practical roadmap for peers in the banking sector.

Digital Transformation · Operations · agile
0 likes · 5 min read
Qunar Tech Salon
Dec 17, 2019 · Operations

Evolution of Call Center Technology: From Hotlines to Multimedia

This article traces the evolution of call center technology across four generations—from early hotlines using PSTN and PBX, through IVR and CTI innovations, to modern multimedia channels—highlighting key concepts, features, and their impact on operational efficiency and customer service.

CTI · IVR · Multimedia
0 likes · 10 min read
Java Architect Essentials
Dec 15, 2019 · Backend Development

Designing Ultra‑High‑Performance Flash‑Sale Systems: Architecture, Consistency, and Availability

This article analyzes the core challenges of building flash‑sale (秒杀) systems—high concurrency reads and writes, strict consistency, and ultra‑high availability—and presents a layered architectural approach covering dynamic/static separation, hotspot optimization, database tuning, and comprehensive high‑availability strategies.

Backend Architecture · Consistency · Operations
0 likes · 28 min read
360 Quality & Efficiency
Dec 13, 2019 · Operations

Using Zabbix to Monitor Service Ports and Configure Email Alerts

This article explains how to use Zabbix for simple service‑port monitoring, covering installation, host and item creation, trigger and graph setup, and email notification configuration, providing a practical guide for developers who need lightweight operational monitoring without writing custom code.

Email Notification · Operations · Service Port
0 likes · 8 min read
Architects Research Society
Dec 9, 2019 · Operations

Overview of StackStorm: An Open‑Source Automation Platform

StackStorm is an open‑source automation platform that integrates existing infrastructure and applications, enabling event‑driven workflows, troubleshooting, auto‑remediation, and continuous deployment through modular components such as sensors, triggers, actions, rules, workflows, and packs, all managed via a web UI, CLI, and REST API.

DevOps · Integration · Operations
0 likes · 7 min read
NetEase Game Operations Platform
Dec 7, 2019 · Operations

Intelligent Anomaly Detection for Operations Maintenance: Machine Learning Methods and Workflow

This article explains the importance of operations maintenance, outlines the challenges of traditional rule‑based anomaly detection, and describes how machine‑learning‑driven AIOps—including feature engineering, unsupervised and supervised models—can provide more accurate, scalable, and automated detection of server anomalies.

Operations · aiops · feature engineering
0 likes · 10 min read
MaGe Linux Operations
Dec 5, 2019 · Operations

When Alipay Crashed: Lessons on High Availability and Disaster Recovery

On December 5th Alipay experienced a brief outage that sent users into panic, prompting a humorous recount of personal losses, meme images, and a reminder of the critical importance of high‑availability architecture and disaster‑recovery planning for large‑scale financial services.

Alipay outage · Financial Services · Operations
0 likes · 3 min read
21CTO
Dec 3, 2019 · Operations

Why Most Alerts Fail and How to Build Actionable Monitoring

This article explains why many system alerts are poorly designed, describes the true purpose of alerts as actionable notifications, distinguishes business rule monitoring from reliability monitoring, and presents practical metrics, strategies, and simple anomaly‑detection algorithms to create high‑quality, actionable alerts for reliable operations.

Alerting · Metrics · Operations
0 likes · 23 min read
Youku Technology
Dec 2, 2019 · Operations

Technical Architecture and Operational Practices of Alibaba's 2019 Double‑11 "Cat Evening" Live Show

Alibaba’s 2019 Double‑11 “Cat Evening” live show combined a unified codebase across Youku, Taobao and Tmall with synchronized clocks, latency‑measurement devices and SEI‑injected messages to guarantee fair, zero‑loss interactions, while employing dynamic routing, pre‑warming, peak‑shaving, downstream protection and rehearsed contingency plans to handle massive concurrency and ensure stable, high‑quality user experience.

Alibaba · Fairness · Operations
0 likes · 11 min read
Alibaba Cloud Developer
Nov 29, 2019 · Operations

How Alibaba’s ECS SRE Team Built a Rock‑Solid Cloud Infrastructure for 100% Cloud Migration

This article explains how Alibaba's Elastic Compute Service (ECS) SRE team tackled massive traffic, database bottlenecks, alert overload, and resource inconsistencies by establishing a full‑stack reliability organization, upgrading core components, automating pipelines, and instituting rigorous monitoring, incident response, and change‑management processes.

Operations · SRE · Site Reliability Engineering
0 likes · 27 min read
Youku Technology
Nov 26, 2019 · Operations

Resource Assurance Strategies and Practices for Alibaba Youku Double‑11 Promotion

The article outlines Alibaba Youku’s end‑to‑end resource‑assurance platform for Double‑11 promotions, detailing automated demand collection, business‑to‑technical metric conversion, single‑machine capacity testing, rapid scaling and emergency borrowing, which together cut manual reviews by 80% and boosted delivery efficiency tenfold.

Operations · Resource Management · automation
0 likes · 13 min read
Architect's Tech Stack
Nov 23, 2019 · Databases

Redis Usage Guidelines and Operational Restrictions

This article provides comprehensive best‑practice guidelines for using Redis, covering data classification, key naming, size and connection limits, cache TTL, recommended client‑hash sharding, and a strict list of prohibited commands and operations to ensure performance, reliability, and maintainability.

Cache · Operations · performance
0 likes · 9 min read
Programmer DD
Nov 23, 2019 · Operations

Essential Checklist for Rapid Server Troubleshooting

This guide walks you through a systematic, step‑by‑step process for diagnosing and resolving poor‑performance or failure incidents on Linux servers, covering everything from gathering context and checking who is logged in to inspecting processes, network services, hardware, I/O, logs, cron jobs and application‑level diagnostics.

Linux · Operations · monitoring
0 likes · 11 min read
Ctrip Technology
Nov 21, 2019 · Cloud Native

Case Study: Intermittent Container Timeout Issues – Analysis and Resolution

This article presents a detailed case study of intermittent container timeout problems in a Kubernetes environment, examining kernel upgrades, NUMA configurations, CPU affinity bindings, kubelet behavior, cAdvisor overhead, and hardware faults, and outlines the investigative steps and solutions applied.

CPU affinity · Container · Hardware Fault
0 likes · 8 min read
MaGe Linux Operations
Nov 20, 2019 · Operations

Essential Shell Script Best Practices for Reliable Automation

This article outlines the evolution from manual to automated operations, then presents a comprehensive set of shell‑script guidelines—including header conventions, formatting, safety checks, variable handling, loop pitfalls, logging, concurrency locks, and risk‑avoidance techniques—to help engineers write robust, maintainable automation scripts.

Operations · Shell scripting · best practices
0 likes · 10 min read
Alibaba Cloud Developer
Nov 18, 2019 · Operations

How Alipay’s Tech Team Turned ‘Impossible’ Double‑11 Peaks into Seamless Transactions

Over eleven years, Alipay’s engineers turned the daunting Double‑11 traffic surges from a source of chaotic outages into a load handled by a smooth, scalable system, through relentless capacity planning, architectural revolutions, rigorous stress testing, and the adoption of the self‑developed OceanBase database, making “impossible” goals an everyday reality.

Alipay · Double11 · Operations
0 likes · 23 min read
Architecture Digest
Nov 16, 2019 · Operations

What Happens If Alipay’s Data Centers Are Physically Destroyed? A Deep Dive into Redundancy and Disaster Recovery

The article examines how Alipay’s financial data would survive the physical destruction of its servers, explaining multi‑site data center architectures, hot and cold backups, power redundancy, fire‑suppression systems, and the role of partner banks in data recovery, and highlighting the extensive resilience measures in modern financial infrastructures.

Alipay · Information Security · Operations
0 likes · 8 min read
Ctrip Technology
Nov 14, 2019 · Operations

Chaos Engineering: Principles, Practices, and Lessons from Ctrip

The article explains Chaos Engineering as a discipline for deliberately injecting failures into distributed systems to uncover hidden weaknesses, outlines its five core principles, describes practical implementation steps and real‑world examples from Ctrip, and discusses future directions for reliability engineering.

Distributed Systems · Fault Injection · Operations
0 likes · 9 min read
21CTO
Nov 12, 2019 · Backend Development

Why API Gateways Are Essential for Secure, Efficient Microservices

Implementing an API gateway in a microservice architecture addresses key challenges such as external‑internal isolation, backend security, operational cost reduction, streamlined change processes, and client‑service decoupling, while enabling features like service circuit breaking, gray releases, and online testing to simplify development and improve reliability.

Microservices · Operations · Service Architecture
0 likes · 3 min read
JD Retail Technology
Nov 8, 2019 · Operations

Smart Supply Chain Operations for JD.com 11.11 Promotion: Integrated Planning, AI‑Driven Forecasting, and Real‑Time Optimization

JD.com's Smart Supply Chain Y Business Management team collaborated across divisions to implement AI‑driven demand forecasting, automated replenishment, micro‑service architecture, and real‑time monitoring, enabling precise inventory control, cost reduction, and seamless 11.11 promotion fulfillment through integrated planning, pricing, and fulfillment innovations.

AI · Demand Forecasting · Microservices
0 likes · 21 min read
Liangxu Linux
Nov 7, 2019 · Operations

Monitor Linux Processes with a Simple Shell Script

This guide shows how to create a reusable shell function that retrieves a process ID for a given user and program, demonstrates its usage, and explains each command involved so you can reliably detect when a service stops running.

Operations · Shell · automation
0 likes · 5 min read
JD Retail Technology
Nov 7, 2019 · Operations

7FRESH Technical Preparation for the 11.11 Shopping Festival: System Scaling, Degradation Strategies, Emergency Plans, Performance Testing, and Operational Monitoring

The article details how 7FRESH's R&D, testing, network operations, and product teams coordinated system capacity expansion, degradation mechanisms, emergency response procedures, extensive performance testing, and 24/7 monitoring to ensure stable and scalable service during the high‑traffic 11.11 shopping event.

Operations · Performance Testing · capacity planning
0 likes · 10 min read
JD Retail Technology
Nov 6, 2019 · Artificial Intelligence

Technical Overview of JD.com Search and Recommendation Systems for the 11.11 Shopping Festival

The article details JD.com's internally developed distributed search engine and recommendation platform, their new architectures, deep‑learning‑driven ranking and recall models, component‑based deployment, extensive performance testing, and coordinated operations that powered the massive 11.11 shopping event.

Deep Learning · Operations · Performance Testing
0 likes · 5 min read
Architects Research Society
Nov 5, 2019 · Backend Development

Principled GraphQL: Ten Principles for Building, Maintaining, and Operating Data Graphs

This article presents ten GraphQL principles—grouped into integrity, agility, and operations—that guide the design, evolution, and secure large‑scale deployment of a unified data‑graph layer, emphasizing a single schema, collaborative implementation, schema registries, performance monitoring, and structured logging.

Backend · Data Graph · GraphQL
0 likes · 17 min read
360 Zhihui Cloud Developer
Nov 5, 2019 · Operations

How 360 Scaled AIOps: From Data to Self‑Healing Operations

At the 360 Internet Technology Training Camp, experts detailed how AI-driven AIOps can transform large‑scale operations, covering data collection, model‑based anomaly detection, alert correlation, self‑healing workflows, and visual dashboards, and presented a practical end‑to‑end framework that other companies can adopt quickly.

Big Data · Operations · aiops
0 likes · 15 min read
Efficient Ops
Nov 5, 2019 · Operations

From Waterfall to AIOps: How One Ops Leader Transformed Zhejiang Mobile’s IT

In an in‑depth interview, Fang Wei, former assistant general manager of Zhejiang Mobile’s network department, shares his 15‑year journey from B‑domain maintenance to leading DevOps, cloud, and AIOps initiatives, detailing the shift from waterfall processes to agile, micro‑services, containerization, and AI‑driven operations that reshaped the company’s IT landscape.

Operations · agile · aiops
0 likes · 15 min read
58 Tech
Nov 4, 2019 · Operations

Intelligent Operations Practices: Multi‑Dimensional Anomaly Detection, Alarm Merging, Knowledge‑Graph Construction, and Root‑Cause Analysis

This article summarizes the keynote on intelligent operations presented at the 13th GOPS Global Operations Conference, covering multi‑dimensional anomaly detection, smart alarm aggregation, the construction of an operations knowledge graph, and AI‑driven root‑cause analysis techniques for large‑scale server environments.

Operations · Root Cause Analysis · alarm merging
0 likes · 9 min read
Efficient Ops
Nov 3, 2019 · Operations

How Zhejiang Mobile Is Pioneering AIOps to Reach NoOps

Zhejiang Mobile’s IT department chronicles its journey from a 2015 cloud‑native initiative to a cutting‑edge AIOps transformation, detailing a six‑level NoOps roadmap, digital fault‑governance, middle‑platform consolidation, organizational agility, and measurable operational gains that position it as a telecom industry leader.

Artificial Intelligence · Big Data · Digital Transformation
0 likes · 7 min read
JD Retail Technology
Oct 31, 2019 · Operations

Collaborative Load Testing for JD.com 11.11 Event: Organizational Changes, Scale Expansion, and ForceBot Traffic Recording & Replay

The article details JD.com's coordinated effort to prepare for the 11.11 shopping festival by expanding load‑testing scale, improving cross‑team collaboration, and enhancing the ForceBot platform with traffic recording and replay capabilities to achieve more realistic and efficient full‑chain performance evaluations.

JD.com · Load Testing · Operations
0 likes · 7 min read
dbaplus Community
Oct 28, 2019 · Operations

Avoid Common Prometheus Pitfalls: Best Practices for Reliable Monitoring

This article shares practical Prometheus best‑practice tips, covering the accuracy‑reliability trade‑off, self‑monitoring setups, avoiding NFS storage, pruning high‑cardinality metrics, handling rate‑function traps, alert‑graph mismatches, group_interval effects, and the overarching goal of stable, cost‑effective observability.

Alerting · Operations · Prometheus
0 likes · 9 min read
21CTO
Oct 28, 2019 · Operations

What Keeps Aviation IT Safe? Lessons from System Design and Data‑Driven Ops

The article reflects on the challenges of modernizing aviation IT systems, highlighting safety‑first regulations, the lack of plug‑in architecture, the need for robust load‑balancing and fault‑tolerance, and how data‑driven automation can bridge the gap between strict oversight and efficient operations.

Data-driven · Operations · aviation
0 likes · 13 min read
Sohu Tech Products
Oct 23, 2019 · Operations

Google SRE Weekly Alert Limits and Practical Strategies for Reducing Alert Fatigue

This article examines how Google SRE limits weekly alerts to ten, compares it with typical Chinese internet operations teams, and provides practical strategies—including on‑call scheduling, alert escalation, automation, dashboard optimization, and team management—to dramatically reduce alert volume and improve incident response.

Alert Management · On-Call · Operations
0 likes · 15 min read
Tencent Cloud Developer
Oct 23, 2019 · Fundamentals

Building Programmer Soft Skills: Insights from a 2100-Day Technical Sharing Journey

Senior database engineer Yang Jianrong recounts his 2,100‑day daily technical‑sharing journey, emphasizing programmer soft‑skills—steady learning, clear communication, and open mindset—while offering four practical tips on planning, avoiding technical silos, using fragmented time, and engaging communities, and outlining future focus on AIOps, modern languages, and advanced database technologies.

Operations · career-development · database-engineer
0 likes · 11 min read
JD Retail Technology
Oct 22, 2019 · Industry Insights

How JD.com Prepares Its Systems for 11.11: Stress Tests, Forcebot Evolution, and Quality Controls

JD.com's Retail Technology and Data Platform orchestrated a full‑chain, four‑entry‑point stress test for the 11.11 shopping festival, introduced an upgraded Forcebot traffic‑recording tool, and implemented a "Quality Month" with ten safeguards to ensure system stability and prevent incidents during the massive sales event.

DevOps · Operations · e‑commerce
0 likes · 7 min read