Tagged articles
3281 articles
Efficient Ops
Apr 18, 2019 · Operations

Choosing the Right Monitoring Stack: From Nagios to Prometheus & Grafana

This article reviews common open‑source monitoring combinations, compares their strengths and weaknesses, and shares practical guidance on selecting collectors, storage back‑ends, and visualization and alerting tools such as Telegraf, InfluxDB, Prometheus, Grafana, and Alertmanager for large‑scale data platform operations.

Grafana · InfluxDB · Nagios
0 likes · 12 min read
58UXD
Apr 18, 2019 · Operations

How Winning Design Strategies Boosted Spring Festival Campaign Traffic

This article dissects the 2019 Spring Festival travel rush (Chunyun, 春运) campaign by 58.com, revealing how a win‑win design mindset, data‑driven insights, and integrated business collaboration transformed user experience, increased traffic, and delivered measurable results across multiple channels and game‑based interactions.

Design Thinking · Operations · User experience
0 likes · 11 min read
Architecture Digest
Apr 18, 2019 · Databases

MySQL High Performance Optimization Guidelines and Best Practices

This article presents a comprehensive set of MySQL high‑performance optimization guidelines, covering naming conventions, table design, data types, index strategies, SQL coding standards, replication, backup, and operational best practices to improve efficiency, reliability, and scalability of database systems.

Database design · Operations · SQL Best Practices
0 likes · 19 min read
Efficient Ops
Apr 17, 2019 · Fundamentals

Mastering Scalable Web Architecture: From Front‑End to Data Center

An in‑depth guide walks through the essential layers of modern website architecture—including front‑end optimization, application frameworks, service distribution, storage solutions, backend processing, monitoring, security, and data‑center design—offering practical strategies for building high‑performance, scalable web systems.

Operations · frontend · security
0 likes · 11 min read
ITPUB
Apr 15, 2019 · Operations

Essential Practices to Prevent Operational Failures and Boost System Availability

This guide outlines six practical strategies—rollback testing, cautious destructive actions, clear command prompts, verified backups, careful handovers, and proactive monitoring—to help operations teams minimize outages and maintain high system availability.

Availability · Operations · backup verification
0 likes · 6 min read
MaGe Linux Operations
Apr 14, 2019 · Operations

Mastering Load Balancing: When to Choose LVS, Nginx, or HAProxy

This article explains how modern internet systems use server clusters and load balancers, compares the three most popular software solutions—LVS, Nginx, and HAProxy—covers their architectures, NAT and DR modes, advantages, disadvantages, and provides guidance on selecting the right tool for different scale scenarios.

HAProxy · LVS · Nginx
0 likes · 13 min read
NetEase Game Operations Platform
Apr 13, 2019 · Operations

Automating Service Discovery and Load Balancing with Consul, HAProxy, and Docker in a Microservices Architecture

This article explains how to transform a traditional monolithic deployment into a fully automated micro‑services environment by containerizing services, using Consul for dynamic service discovery and configuration, and configuring HAProxy with DNS resolvers to achieve seamless load balancing and zero‑downtime updates.

Consul · Docker · HAProxy
0 likes · 12 min read
DevOps Cloud Academy
Apr 9, 2019 · Operations

Chapter 3: Managing Jenkins (Projects, Views, Plugins)

This guide explains Jenkins project management, covering naming conventions, creating new projects, configuring build history, parameterized builds, triggers, Jenkinsfile setup, as well as building, viewing logs, and debugging pipelines with illustrative screenshots.

DevOps · Jenkins · Operations
0 likes · 2 min read
Java High-Performance Architecture
Apr 9, 2019 · Operations

Mastering Load Balancing: Types, Algorithms, and Best Practices

This article outlines the three main load‑balancing methods—DNS, hardware, and software—detailing their advantages and drawbacks, then explains common algorithms such as round‑robin, weighted round‑robin, least‑connections, performance‑based, and hash, and provides guidance on combining them for optimal architecture.

Algorithms · Infrastructure · Operations
0 likes · 5 min read
360 Quality & Efficiency
Apr 4, 2019 · Operations

Understanding System Load Average and CPU Usage in Linux

This article explains the meaning of the Linux uptime/top output, defines system load average as the average number of runnable and uninterruptible processes, distinguishes it from CPU utilization, and provides guidance on interpreting load values for single‑core and multi‑core systems.

CPU usage · Linux · Load Average
0 likes · 8 min read
58 Tech
Apr 4, 2019 · Operations

Redesign of the Signal System for Task Scheduling and Dependency Management

This article explains the shortcomings of the legacy signal design in a scheduling platform, outlines four major dependency problems, and presents a newly engineered signal system with modular functions, instance ID generation, competitive priority rules, and state management to reliably support complex cross‑period and parallel job dependencies.

Operations · priority handling · signal system
0 likes · 9 min read
DevOps
Apr 3, 2019 · Operations

DevOps Transformation: Stories of Role Integration and Work Consolidation

The article examines real‑world DevOps transformation cases, illustrating how shifting operations staff into development teams can create both integration challenges and opportunities, and proposes a framework for distinguishing repeatable versus unique work to guide effective consolidation, standardization, and automation in software delivery.

DevOps · Operations · Team Integration
0 likes · 10 min read
Efficient Ops
Apr 1, 2019 · Operations

Beyond Linux: Mastering Modern Operations – From Deployment to Cloud

This article explores the full spectrum of modern operations, covering environment deployment, troubleshooting, backup, high availability, monitoring, security, automation, virtualization, and cloud services, while highlighting essential tools and best practices for both Linux and Windows environments.

Deployment · Operations · automation
0 likes · 8 min read
Efficient Ops
Mar 31, 2019 · Operations

How to Design Actionable Alerts and Effective Monitoring Strategies

This article explains why most alerts are poorly designed, defines actionable alerts, outlines monitoring objectives, discusses metric selection, and presents simple yet powerful algorithms for anomaly detection to improve system reliability and operational efficiency.

Metrics · Operations · alert design
0 likes · 21 min read
Programmer DD
Mar 31, 2019 · Cloud Computing

10 Hard‑Earned AWS Lessons That Shape Modern Cloud Architecture

Reflecting on a decade of AWS, this article shares ten hard‑earned lessons—from building evolvable systems and anticipating failures to prioritizing security, automation, and open platforms—that guide the design, operation, and scaling of cloud services for today’s enterprises.

AWS · Operations · architecture
0 likes · 13 min read
Efficient Ops
Mar 28, 2019 · Information Security

How Leading Tech Companies Audit and Control Ops Permissions

This article explains how large enterprises such as BAT and banks implement strict auditing and supervision of operational privileges, using personal accounts, command logging, OSSEC monitoring, firewall limits, and cross‑team oversight to enforce the principle of least privilege.

DevOps · Operations · Privilege Management
0 likes · 6 min read
Ctrip Technology
Mar 28, 2019 · Operations

Comprehensive Guide to Enterprise WiFi Planning, Deployment, and Operations – Practices from Ctrip

This article presents a detailed, practice‑driven guide for enterprise WiFi, covering network planning, full‑coverage design, channel optimization, security, KPI‑based monitoring, probe‑based measurement, troubleshooting techniques, and real‑world case studies from Ctrip, highlighting how systematic operations can ensure high‑quality wireless service.

Enterprise · Operations · WiFi
0 likes · 16 min read
DevOps Cloud Academy
Mar 27, 2019 · Operations

Chapter 2 – Installing Jenkins

This guide details the prerequisites, multiple deployment methods (WAR, macOS, Windows, Linux), and post‑installation configuration steps for Jenkins, including unlocking the instance, installing plugins, creating an admin user, setting an update site, and configuring a slave node.

DevOps · Installation · Jenkins
0 likes · 5 min read
Full-Stack Internet Architecture
Mar 25, 2019 · Operations

Useful Linux Command‑Line Tips to Boost Productivity

This article presents a collection of practical Linux command‑line shortcuts and techniques—including cursor navigation, history execution, disk and memory inspection, process management, multi‑command chaining, and file handling—that can significantly improve efficiency for developers and system administrators.

Bash · Operations · Shell
0 likes · 12 min read
58 Tech
Mar 25, 2019 · Artificial Intelligence

Machine Learning‑Based Threshold‑Free Monitoring for Business Metrics

This article describes a monitoring system that leverages machine learning to perform threshold‑free, real‑time anomaly detection on macro business indicators such as network traffic and access volume, detailing its architecture, sample labeling, model training, and multi‑level alarm strategies.

AI · Operations · anomaly detection
0 likes · 7 min read
58 Tech
Mar 25, 2019 · Operations

Alarm Convergence, Merging, and Self‑Healing in the 58 Monitoring Platform

The article describes how the 58 monitoring platform reduces alarm storms through alarm convergence, intelligent merging using Gini‑based decision trees, and automated self‑healing, thereby improving alert quality, cutting noise by about 70%, and helping engineers resolve incidents faster.

Operations · alarm convergence · alert merging
0 likes · 9 min read
Tencent Cloud Developer
Mar 19, 2019 · Cloud Computing

Why Cloud Computing Is the Future Path for Operations Professionals

Ops engineers who embrace the cloud—leveraging serverless, Kubernetes, AI, edge and elastic resources—gain cost‑efficient scalability, avoid on‑premise limitations, and open career paths such as cloud reliability engineer, solution architect, integration specialist or technical operations manager, ensuring relevance in the dominant, irreversible cloud‑first future.

Career Development · Operations · solution architecture
0 likes · 6 min read
JD Tech
Mar 19, 2019 · R&D Management

Challenges and Proper Practices for Measuring Software R&D Efficiency

The article examines the difficulties of quantifying software development efficiency, critiques common metric approaches, and proposes a principled framework that emphasizes global, outcome‑oriented indicators across delivery efficiency, quality, and capability to guide systematic R&D performance improvement.

Operations · R&D efficiency · delivery capability
0 likes · 9 min read
Continuous Delivery 2.0
Mar 19, 2019 · Operations

Key Metrics for Agile Teams: From Lead Time to Security Indicators

This article explains how software teams can select, combine, and interpret nine essential metrics—including lead time, cycle time, team velocity, defect rates, MTBF, MTTR, and security incident counts—to drive continuous improvement, align with business goals, and ultimately achieve successful outcomes.

Lead Time · Operations · agile
0 likes · 12 min read
Alibaba Cloud Developer
Mar 18, 2019 · Operations

Alibaba Hema’s 7‑Layer Funnel & 23 Tactics for Ultra‑Fast Delivery Stability

The article outlines Alibaba’s Hema delivery platform’s end‑to‑end stability strategy, detailing a 7‑layer funnel review process, three core norms (development, architecture, stability), and 23 practical tactics—including core‑noncore isolation, proactive monitoring, fault prevention, rapid recovery, and service‑level controls—to ensure reliable 30‑minute deliveries despite complex logistics and external disruptions.

Operations · architecture · delivery
0 likes · 13 min read
Efficient Ops
Mar 14, 2019 · Operations

9 Essential Logging Best Practices to Boost System Performance

This article presents nine practical logging best‑practice recommendations—from understanding human and machine audiences and standardizing log formats to leveraging metrics, proper alerting, severity levels, contextual information, and advanced framework features—helping operations teams improve system performance and troubleshooting efficiency.

Metrics · Operations · best practices
0 likes · 11 min read
Efficient Ops
Mar 14, 2019 · Operations

Why IT Operations Must Evolve: From Cost Center to Strategic Asset

The article examines how rapid cloud adoption, AI‑ops, and DevOps blur traditional IT operations roles, arguing that ops must shift from a low‑value cost center to a profit‑generating, efficiency‑driving function through mindset change, institutional innovation, expanded responsibilities, modern tools, and continuous skill upgrades.

Digital Transformation · IT Operations · Operations
0 likes · 15 min read
JD Tech
Mar 14, 2019 · Operations

Understanding Server Clustering and Load Balancing: LVS, Nginx, and HAProxy

This article explains server clustering and load‑balancing concepts, detailing the architecture and operation of LVS, Nginx, and HAProxy, and compares their advantages, disadvantages, and typical deployment scenarios; it also discusses NAT and DR modes, load‑balancer placement, and best‑practice recommendations for different traffic volumes.

HAProxy · LVS · Operations
0 likes · 12 min read
JD Tech
Mar 13, 2019 · Operations

Evolution of JD Digital Technology’s Host Monitoring System “DiTing”: From V1 to V3

The article chronicles the design, evolution, and lessons learned of JD Digital Technology’s self‑built host monitoring platform “DiTing”, detailing its initial requirements, V1 architecture, subsequent V2 and V3 redesigns, encountered challenges, and future directions toward intelligent operations.

Big Data · Operations · System Architecture
0 likes · 12 min read
58 Tech
Mar 12, 2019 · Operations

Overview of the Octopus Automation Platform Architecture and Core Modules

The article introduces Octopus, the core automation service of 58 Group, detailing its overall architecture, the Octopus Agent lifecycle, communication mechanisms, management center capabilities, and key functional modules such as server information collection, command execution, deployment, permission control, and file transfer.

API · Agent · Deployment
0 likes · 11 min read
21CTO
Mar 11, 2019 · Operations

Why the US Navy’s Aegis UI Chooses 2D Over 3D – Lessons for High‑Stakes Interface Design

The article dissects the Aegis combat system’s dual‑screen UI, explaining why a simple 2D top‑down map paired with a side view outperforms flashy 3D graphics, how multi‑target blocks replace tables for faster decision‑making, and how human‑factor testing, eye‑tracking and standardized symbols dramatically improve combat efficiency.

Aegis · Military · Operations
0 likes · 14 min read
Efficient Ops
Mar 10, 2019 · Operations

Why Operations Won’t Die: A Veteran’s Perspective

A seasoned operations professional argues that despite sensational claims, the ops function remains essential—driven by its core responsibilities of quality, cost, efficiency, and security, evolving with cloud computing, DevOps, and emerging IoT demands.

DevOps · IT infrastructure · Operations
0 likes · 11 min read
MaGe Linux Operations
Mar 8, 2019 · Operations

Mastering High‑Availability Clusters: Resources, Constraints, and Failure Handling

This article explains the principles and components of high‑availability (HA) clusters, covering active/standby nodes, resource stickiness and constraints, heartbeat and quorum mechanisms, split‑brain avoidance, failure detection methods, and the minimal setup required for a reliable web‑service HA deployment.

Heartbeat · Operations · Resource Management
0 likes · 14 min read
DevOps
Mar 7, 2019 · Operations

The Illusion of Tool‑Stacked DevOps and the Need for a True DevOps Culture

This article examines how DevOps has been reduced to a collection of automation tools, critiques the resulting "same‑bed‑different‑dreams" separation of development and operations, and outlines the cultural principles—shared responsibility, trust, autonomy, built‑in quality, feedback, and automation—necessary for a genuine DevOps transformation.

Culture · DevOps · Operations
0 likes · 12 min read
Efficient Ops
Mar 7, 2019 · Operations

Why Operations Won’t Die: The Real Role of Ops in the Cloud Era

The article argues that operations will not disappear, explaining its essential functions—quality, cost, efficiency, and security—how cloud computing reshapes the role, the evolution toward DevOps, and why both cloud outages and industry trends actually underscore ops’ enduring importance.

DevOps · Operations · automation
0 likes · 11 min read
Efficient Ops
Mar 6, 2019 · Databases

How NetEase Built an Automated DBA Platform with AIOps for Massive Scale

This article details NetEase's journey in designing and implementing a large‑scale database automation platform, covering its requirements, tool‑based operations, architecture, AIOps integration, and the practical lessons learned for managing thousands of database clusters efficiently.

Operations · Scalability · aiops
0 likes · 20 min read
MaGe Linux Operations
Mar 6, 2019 · Operations

Master Essential Linux Shell Scripts for System Monitoring and Automation

This guide presents practical Bash scripting techniques—including precautions, random string generation, color output functions, bulk user creation, package checks, service status verification, host liveness testing, resource monitoring, disk usage audits, and website availability checks—to help you automate Linux system administration tasks effectively.

Bash · Operations · Shell scripting
0 likes · 5 min read
DevOps
Mar 3, 2019 · Operations

The Evolution of DevOps: From Early Computing to Agile Software Development

This article traces the historical development of DevOps from the early days of self‑developed and self‑maintained computer programs, through the rise of professional developers and operations engineers, to the modern agile era where development and operations must collaborate to meet rapid market changes.

DevOps · IT Operations · Operations
0 likes · 13 min read
DevOps
Feb 27, 2019 · Operations

A Historical Overview of DevOps: From a Belgian Consultant to a Global Movement

This article traces the evolution of DevOps from Patrick Debois' 2007 frustrations as a Belgian IT consultant through key conferences, blogs, and publications that shaped the DevOps movement, highlighting its roots in Agile practices and the convergence of development and operations.

Continuous Delivery · DevOps · Operations
0 likes · 9 min read
Efficient Ops
Feb 27, 2019 · Operations

Master Linux System Monitoring: Essential Commands & Smem Tips

This guide walks you through Linux command‑line shortcuts, the five key system‑operation metrics, and powerful tools like smem, ps, and sort to efficiently monitor CPU, memory, processes, disks, and network while also handling zombie processes.

Linux · Operations · command-line
0 likes · 8 min read
DevOps
Feb 26, 2019 · Operations

Planning a DevOps Infrastructure for Traditional Enterprises: Capabilities and Tool Mapping

This article analyzes the essential capabilities required for building a DevOps infrastructure in traditional enterprises across foundation, development, testing, operations, and project management, mapping each capability to representative tools and offering guidance on flexible, evolving architecture design.

DevOps · Infrastructure · Operations
0 likes · 12 min read
AntTech
Feb 22, 2019 · Operations

Technical Risk Prevention Platform: Building Fault Immunity for Financial Transaction Systems

The article outlines Ant Financial's technical risk prevention platform, describing the challenges of financial‑grade distributed architectures, the multi‑layer risk assurance system, the TRaaS platform's risk baseline, handling, and change‑control mechanisms, and how these practices empower partners to achieve high‑availability and secure financial services.

Operations · financial technology · platform engineering
0 likes · 13 min read
360 Tech Engineering
Feb 20, 2019 · Databases

Pika Best Practices: 30 Tips for Optimizing the RocksDB‑Based Redis‑Compatible Storage

This article presents thirty practical recommendations for deploying, configuring, and maintaining Pika—a high‑capacity, RocksDB‑backed Redis‑compatible storage system—covering version selection, thread settings, hardware choices, key design, memory management, replication, backup, compaction, security, and monitoring to achieve reliable and high‑performance operation.

Database Tuning · Operations · Pika
0 likes · 16 min read
Efficient Ops
Feb 19, 2019 · Operations

Turning Middleware Pain into Power: Practical Ops Strategies for Financial Systems

This talk reveals why middleware operations in financial institutions feel especially painful, examines the specific cost, autonomy, and reliability challenges, and outlines a step‑by‑step evolution toward tool‑driven platforms, hybrid‑cloud deployment, and AIOps that reduce manual toil and improve system resilience.

Operations · aiops · cloud
0 likes · 20 min read
Qunar Tech Salon
Feb 19, 2019 · Operations

Forbidden City Night Festival Ticketing Chaos and How to Recover a Crashed Website

The article recounts the Forbidden City’s first night‑time Lantern Festival event, the overwhelming demand that caused the museum’s ticketing website to crash, and includes an interview with a senior operations engineer who explains the causes of such overloads and outlines rapid mitigation and scaling strategies.

Operations · scaling · system reliability
0 likes · 6 min read
Efficient Ops
Feb 14, 2019 · Operations

Scaling a 10,000‑Node Container Cloud: Ctrip’s Ops Practices and Lessons

This article details Ctrip's journey of building and operating a massive container cloud platform, covering its architectural evolution, operational challenges, tooling, capacity management, and future directions, offering practical insights for large‑scale cloud‑native environments.

Cloud Native · Kubernetes · Operations
0 likes · 17 min read
ITPUB
Feb 12, 2019 · Operations

Why Docker Might Be a Dangerous Gamble: Uncovering Its Design Flaws

The article presents a detailed critique of Docker, arguing that despite its marketed benefits of portability, security, and resource management, its design introduces significant complexity, hidden costs, and operational risks that many organizations overlook when adopting it for production workloads.

Docker · Operations · Software Architecture
0 likes · 29 min read
21CTO
Feb 8, 2019 · Operations

Baidu’s Secret to Handling 9 Billion Spring Festival Red Envelope Interactions

During the 2019 Chinese New Year Gala, Baidu mobilized a massive technical operation—scaling cloud resources, isolating traffic, and deploying AI‑driven security—to flawlessly process over 9 billion red‑packet interactions despite unprecedented traffic spikes and login surges.

Operations · large-scale traffic · security
0 likes · 9 min read
21CTO
Feb 1, 2019 · Cloud Native

Is Docker a Hidden Trap? Uncovering the Real Costs Behind Container Hype

The article critically examines Docker’s promised benefits—portability, security, and orchestration—highlighting its design shortcomings, hidden complexities, lock‑in risks, and the often‑overlooked alternatives that can deliver the same goals with far less overhead.

Containers · Operations · cloud-native
0 likes · 28 min read
ITPUB
Jan 31, 2019 · Operations

Master Monitoring: Collect Metrics for New Systems Using White‑Box Techniques & the Four Golden SRE Indicators

This article explains how to approach monitoring for a newly introduced system by focusing on white‑box metric collection, distinguishing basic and business metrics, outlining common collection methods, and detailing Google SRE's four golden indicators (errors, latency, traffic, and saturation) to guide effective observability.

Metrics · Operations · SRE
0 likes · 10 min read
Efficient Ops
Jan 30, 2019 · Operations

From Rookie to Ops Manager: Key Lessons on Linux, Infrastructure, and Career Growth

The author shares a journey from a college Linux basics class to becoming an operations manager, detailing early hands‑on tasks, challenges in chaotic server environments, the creation of monitoring systems, and three key career lessons about learning, deepening technical understanding, and evaluating workplace fit.

Linux · Operations · System Administration
0 likes · 6 min read
Alibaba Cloud Developer
Jan 30, 2019 · Operations

How Youku Scaled IPv6 from Zero to 500K Users in Days

This article details Youku's rapid, large‑scale IPv6 rollout—from initial pilot to half‑million users—covering the motivations, phased migration plan, technical challenges, implementation steps across client and server, gray‑release strategies, monitoring, and future outlook.

IPv6 · Large‑Scale Deployment · Network Migration
0 likes · 16 min read
Efficient Ops
Jan 24, 2019 · Information Security

How Alibaba Scales Host Security Across Its Global Economic Ecosystem

This talk outlines Alibaba’s massive global host infrastructure, the evolving security governance from manual controls to data‑driven, automated systems, the challenges of compliance and operational efficiency, and future directions such as zero‑trust and invisible security.

Host Security · Information Security · Operations
0 likes · 16 min read
UCloud Tech
Jan 24, 2019 · Operations

How UCloud Executed a Seamless Hot Migration of Its Seoul Data Center

This article details UCloud's five‑month, multi‑department hot migration of its Seoul data center, covering planning, ZooKeeper scaling, udatabase and MySQL migration strategies, deployment platforms, and the final cut‑over steps that ensured zero user impact.

Data Center Migration · Hot Migration · Operations
0 likes · 14 min read
Efficient Ops
Jan 23, 2019 · Operations

Designing an Operations Monitoring Platform: Tools & Best Practices

This article explores the essential concepts for selecting and building an operations monitoring platform, reviewing popular tools such as Cacti, Nagios, Zabbix, Ganglia, Centreon, Prometheus, and Grafana, and outlines a six‑layer architecture and practical strategies for scaling, alerting, and high‑availability in diverse environments.

Alerting · DevOps · Infrastructure
0 likes · 19 min read
Efficient Ops
Jan 10, 2019 · Operations

Essential DBA & Ops Practices to Prevent System Failures

This article outlines ten practical guidelines for DBAs and system administrators—including rollback‑ready changes, cautious use of destructive commands, prompt customization, reliable backups, production respect, thorough handovers, alerting, monitoring, careful failover, meticulous checks, and the virtue of simplicity—to minimize costly system outages.

Linux · Operations · Oracle
0 likes · 7 min read
Ctrip Technology
Jan 7, 2019 · Artificial Intelligence

AIOps Practices and Exploration at Ctrip: Challenges, Solutions, and Future Outlook

This article presents Ctrip's extensive AIOps exploration, detailing operational challenges caused by massive monitoring data, the evolution of DevOps practices, the design of intelligent anomaly detection and diagnosis systems, practical use cases, and a forward‑looking perspective on the future of AI‑driven operations.

Fourier Transform · Operations · aiops
0 likes · 20 min read
JD Tech
Jan 3, 2019 · Operations

Comprehensive Monitoring Strategies for E‑commerce Platforms: Black‑Box and White‑Box Approaches

This article systematically explains how to enhance e‑commerce platform availability by implementing both black‑box monitoring to detect functional failures and white‑box monitoring to pinpoint root causes, detailing core order‑process metrics, common issues, mitigation strategies, and illustrative Grafana dashboards.

Grafana · Operations · SRE
0 likes · 9 min read
Efficient Ops
Jan 2, 2019 · Operations

Essential Ops Practices: Prevent Disasters with Backups, Security, and Monitoring

This guide outlines critical operational practices for Linux server management, emphasizing thorough testing, cautious command execution, regular backups, strict access controls, comprehensive monitoring, performance tuning, and a disciplined mindset to avoid costly incidents and ensure system stability.

Operations · monitoring · security
0 likes · 12 min read
58 Tech
Dec 26, 2018 · Operations

Overview of the 58 Intelligent Monitoring System and Its Multi‑Dimensional Architecture

The 58 Intelligent Monitoring System provides a flexible, 24/7, multi‑dimensional monitoring solution that covers network, server, system, application and business layers, incorporates AI‑driven prediction, anomaly detection, alarm merging, root‑cause analysis and self‑healing, and offers both PC and WeChat interfaces for operators.

Alerting · Operations · System Architecture
0 likes · 16 min read
Efficient Ops
Dec 24, 2018 · Operations

How Baidu’s Noah Platform Unifies Ops Data with Pull, Push, and Lazy ETL

This article explains how Baidu Cloud's Noah intelligent operations product builds a unified operations knowledge base by categorizing metadata, status, and event data and applying three ETL approaches—Pull, Push, and Lazy—to handle offline, near‑line, and real‑time data integration.

Data Integration · ETL · Knowledge Base
0 likes · 8 min read
MaGe Linux Operations
Dec 24, 2018 · Operations

How to Quickly Diagnose and Fix High CPU Usage on a Data Platform Server

This guide walks through a step‑by‑step investigation of a sudden 98% CPU spike on a data‑platform server, showing how to pinpoint the offending process, trace the problematic Java thread, analyze the root cause in a time‑utility method, and apply an optimized solution that reduces CPU load thirtyfold.

CPU troubleshooting · Linux · Operations
0 likes · 7 min read
Programmer DD
Dec 23, 2018 · Operations

How to Implement Service Degradation for High Availability

This article explains the concept of service degradation, why it is needed to maximize limited resources during traffic spikes, outlines common degradation strategies, and provides practical steps and code examples for ranking, sequencing, and implementing degradation in both front‑end and back‑end systems.

Operations · System Design · degradation
0 likes · 11 min read
DevOps
Dec 20, 2018 · Operations

What Is Kanban? Ten Things You Need to Know

The article introduces the Kanban method as a lean approach to managing professional services, outlines ten essential principles such as focusing on flow, incremental change, risk management, and scalability, and concludes with a recruitment announcement seeking DevOps engineers in Beijing.

DevOps · Kanban · Operations
0 likes · 8 min read
Youku Technology
Dec 20, 2018 · Operations

Youku IPv6 Migration: Planning, Implementation, and Lessons Learned

Youku’s IPv6 migration, launched in early 2018 and completed by Double 11, progressed through external, dual‑stack internal, and IPv6‑only phases. The team worked through test‑environment, MTU, and library issues and relied on gray releases and monitoring, gaining a vastly larger address space, enhanced security, and faster, more scalable video delivery.

Backend · IPv6 · Network Migration
0 likes · 15 min read
Efficient Ops
Dec 19, 2018 · Cloud Computing

How to Build and Operate a National-Scale Private Cloud: Lessons and Trends

This talk outlines why organizations pursue cloud adoption, defines cloud‑native goals, reviews emerging trends such as bare‑metal and hyper‑convergence, and shares practical private‑cloud operation experiences, including ITIL processes, project management, and tooling, offering a comprehensive view of national‑level private‑cloud practice.

Bare Metal · Cloud Native · ITIL
0 likes · 12 min read
AntTech
Dec 19, 2018 · Information Security

Red‑Blue Technical Attack‑Defense Exercises and SRE Practices at Ant Financial

Ant Financial’s internal red‑blue technical attack‑defense program, driven by a dedicated blue team and SRE‑based red team, continuously probes system weaknesses, refines fault‑injection tools like Awatch, and evolves high‑availability and self‑healing mechanisms to strengthen risk control and operational reliability.

Fault Injection · Information Security · Operations
0 likes · 10 min read
JD Tech
Dec 17, 2018 · Operations

Improving JD Intelligent Supply Chain Efficiency and System Stability for Major Sales Events

The article details JD's intelligent supply chain enhancements—including machine‑learning demand forecasting, a new "explosive product warehouse" model, non‑stock fulfillment visualization, blockchain‑based product traceability, and comprehensive system‑stability measures such as data‑consistency checkpoints, throughput buffering, and 24/7 incident response—to boost efficiency and reliability during large‑scale promotions.

Big Data · Blockchain · Operations
0 likes · 7 min read
21CTO
Dec 15, 2018 · Information Security

When Deleting Databases Becomes Revenge: Real‑World Cases and What You Must Do

This article recounts several real incidents where disgruntled engineers or admins deleted critical databases as retaliation, highlighting the severe consequences and stressing that proper backups and cautious use of destructive commands are essential for any organization.

Incident · Operations · rm
0 likes · 5 min read
Java Captain
Dec 15, 2018 · Fundamentals

Understanding Distributed and Cluster Deployments: A Restaurant Analogy

The article uses a restaurant scenario to explain the differences between centralized, cluster, and distributed system deployments, illustrating how performance, security, scalability, and availability map to user requirements and why scaling from a single server to clusters and distributed architectures is essential as demand grows.

Operations · Scalability · System Architecture
0 likes · 7 min read
JD Tech
Dec 13, 2018 · Operations

Monitoring Puppet Configuration Management: Workflow, Metrics, and Troubleshooting

This article explains how to monitor the Puppet configuration management system, covering its request‑response‑execution‑report workflow, key monitoring metrics, black‑box and white‑box monitoring approaches, common issues, and practical solutions for ensuring large‑scale cluster consistency.

Configuration Management · Operations · Puppet
0 likes · 8 min read
High Availability Architecture
Dec 13, 2018 · Operations

Microservice Architecture Visualization: Practices and Benefits at Alibaba

The article explains why visualizing microservice architectures is essential for high availability, describes common and advanced visualization methods, discusses how to make visualization effective, handle architectural changes, identify key components, and leverage visual data for operations and reliability improvements.

Alibaba · Microservices · Operations
0 likes · 14 min read
Efficient Ops
Dec 11, 2018 · Operations

How Alibaba’s AI‑Powered Monitoring Tackles Complex Business Anomalies

In this talk, Alibaba senior technical expert Wang Zhaogang explains how intelligent monitoring, powered by machine‑learning algorithms and multi‑metric analysis, addresses the challenges of diverse business scenarios, enhances anomaly detection, improves root‑cause analysis, and shapes the future of smart operations.

Operations · Root Cause Analysis · anomaly detection
0 likes · 23 min read
Programmer DD
Dec 9, 2018 · Operations

What Can Nginx Do Without Third‑Party Modules? A Practical Guide

This article details the core capabilities of Nginx without third‑party modules, including reverse proxy, various load‑balancing strategies, static and dynamic HTTP serving, forward proxy setup, and hot‑reload commands, providing clear configuration examples for each feature.

Configuration · HTTP server · Nginx
0 likes · 10 min read
JD Tech
Dec 6, 2018 · Operations

Shortening Decision Chains: End-to-End Inventory Management and Intelligent Replenishment in JD's Supply Chain

JD's chief scientist Shen Zuo‑jun explains how shortening the decision chain with end‑to‑end algorithms and intelligent multi‑level replenishment dramatically improves inventory turnover, stock availability, and forecasting accuracy, showcasing a novel supply‑chain research direction that integrates AI, big data, and human expertise.

End-to-End · Operations · forecasting
0 likes · 9 min read
Architect's Tech Stack
Dec 5, 2018 · Operations

Practical Fault‑Tolerance Practices in a Large‑Scale Activity Operations Platform

The article shares a comprehensive, experience‑driven guide on building fault‑tolerant systems—covering retry mechanisms, dynamic node removal, timeout settings, service degradation, decoupling, and business‑level safeguards—to enable a platform that scales from millions to billions of daily requests without relying on manual fire‑fighting.

Operations · System Design · fault tolerance
0 likes · 21 min read
MaGe Linux Operations
Dec 4, 2018 · Operations

Essential Linux Skills Every Beginner Must Master

This guide outlines why Linux dominates the internet, recommends starting with CentOS or RHEL, suggests effective learning resources, and lists the core knowledge, tools, and advanced topics every aspiring Linux operations engineer should master.

DevOps · Linux · Operations
0 likes · 6 min read
JD Tech
Nov 28, 2018 · Operations

Technical Systems Behind JD Logistics for the 11.11 Global Shopping Festival

The article details how JD Logistics’ extensive warehouse, routing, distribution, and fulfillment systems—leveraging big data, AI, GIS, IoT, and distributed architectures—were engineered and optimized to handle the massive order surge during the 11.11 Global Shopping Festival with high throughput, low latency, and zero incidents.

AI · Big Data · GIS
0 likes · 8 min read
Efficient Ops
Nov 27, 2018 · Operations

How Alibaba Automates Server Fault Detection and Self‑Healing at Scale

Alibaba’s massive data‑center operations face growing hardware failures, so they built the DAM (Dammo) platform that integrates Tianji management, predictive fault detection, automated remediation, and self‑balancing cluster reconstruction, achieving near‑complete hardware issue coverage and reducing manual intervention across hundreds of thousands of servers.

Operations · aiops · cloud computing
0 likes · 17 min read
Efficient Ops
Nov 25, 2018 · Operations

Top 13 Essential Linux Tools for System Monitoring and Security

This article introduces thirteen practical Linux operations tools—including Nethogs, IOZone, IOTop, IPtraf, IFTop, Fail2ban, and more—providing concise descriptions, download links, and step‑by‑step installation commands to help system administrators monitor performance, network traffic, and protect against attacks.

Linux · Operations · Performance Testing
0 likes · 11 min read
Architects Research Society
Nov 25, 2018 · Operations

eBay Scalability Best Practices: Functional Partitioning, Horizontal Sharding, Asynchronous Decoupling, and More

The article outlines eBay's key scalability best practices—including functional partitioning, horizontal sharding, avoiding distributed transactions, aggressive asynchronous decoupling, moving work to async pipelines, pervasive virtualization, and intelligent caching—to achieve linear or better resource usage as load grows.

Asynchronous · Operations · caching
0 likes · 14 min read
360 Tech Engineering
Nov 22, 2018 · Artificial Intelligence

AIOps Practices at 360: Cost Reduction, Efficiency Gains, and Intelligent Operations

This article presents 360's AIOps project, detailing how AI-driven capacity forecasting, host classification, resource recycling, intelligent MySQL scheduling, anomaly detection, alarm convergence, and root‑cause analysis have saved millions, improved efficiency, and paved the way for a fully automated operations workflow.

Capacity Forecasting · Cost Optimization · Operations
0 likes · 14 min read
AntTech
Nov 21, 2018 · Operations

Building a High‑Availability Wireless Test Cluster for Mobile Apps at Ant Financial

The article details Ant Financial's development of a highly available wireless test cluster that supports automated testing for its massive mobile app ecosystem, describing its architecture, data‑driven monitoring, full integration, and the All‑in‑One solution that enables rapid, cost‑effective iteration across dozens of services and IoT scenarios.

Automated Testing · Device Farm · Operations
0 likes · 9 min read
Didi Tech
Nov 20, 2018 · Operations

Didi's Message Queue Architecture, Migration Strategies, and RocketMQ Operational Practices

At Didi, the team replaced a chaotic mix of Kafka, Redis, and other queues with a custom, RocketMQ‑based service, using dual‑write and dual‑read migration, extensive performance testing, custom failover, batch extensions, and operational tweaks to achieve stable high‑throughput, low‑latency messaging at massive scale.

Message Queue · Operations · Performance Testing
0 likes · 17 min read
Alibaba Cloud Developer
Nov 19, 2018 · Operations

How Alibaba Automates Hardware Fault Detection and Self‑Healing at Scale

This article explains how Alibaba’s massive data‑center operations detect hardware failures early, automatically isolate faulty servers, and execute self‑healing workflows through a centralized, cloud‑native platform, detailing detection methods, convergence rules, architecture evolution, and the benefits of a closed‑loop AIOps system.

Operations · aiops · cloud-native
0 likes · 15 min read