Tagged articles
3281 articles
Page 27 of 33
ITPUB
Jun 20, 2018 · Databases

How JD Logistics Scales Warehouse Databases with Automation and High‑Availability Strategies

This article details JD Logistics' warehouse management system database architecture, the shift between local and centralized deployments, and how the UDBA automation platform, performance tuning, fault‑self‑healing, data archiving, and MySQL upgrades together ensure high performance and high availability across thousands of warehouses.

Operations · automation · databases
0 likes · 13 min read
21CTO
Jun 19, 2018 · Operations

How Netflix’s Full‑Cycle Developers Eliminate the DevOps Bottleneck

Netflix’s Edge Engineering team shares how adopting a full‑cycle developer model—where engineers own design, development, testing, deployment, operations, and support—reduces hand‑off delays, improves feedback loops, and scales productivity across the entire software lifecycle.

DevOps · Full-cycle Development · Netflix
0 likes · 13 min read
Ctrip Technology
Jun 19, 2018 · Artificial Intelligence

AIOps at Ctrip: Concepts, Typical Application Scenarios, and Algorithmic Practices

This article introduces Ctrip's AIOps journey, explaining the AI‑driven operations concept, showcasing typical use cases such as anomaly detection, intelligent fault diagnosis, and resource utilization improvement, and detailing the underlying statistical and machine‑learning algorithms that enable these capabilities.

Ctrip · Operations · Resource Optimization
0 likes · 16 min read
AntTech
Jun 19, 2018 · Cloud Native

Financial‑Grade Cloud Native Architecture: Challenges, Practices, and Transformation Path

This article outlines the evolution of financial‑grade cloud native architecture, describing its origins, key principles, incremental delivery, sustainable innovation, and evolutionary planning, while addressing scalability, disaster‑recovery, distributed‑transaction, and elastic resource challenges with practical Ant Financial case studies.

Operations · databases · financial technology
0 likes · 37 min read
DevOps
Jun 14, 2018 · Operations

Understanding DevOps: Role Merging, Automation, and Organizational Impact

This article examines how DevOps emerged from the merging of development and operations roles, explores automation practices in small and large teams, outlines the three-step DevOps workflow, and discusses the cultural and organizational challenges of adopting DevOps at scale.

DevOps · Digital Transformation · Operations
0 likes · 9 min read
Tencent Cloud Developer
Jun 14, 2018 · Operations

Tencent Cloud Database Massive Operations: Team Building, Automated Operations Platform, and Intelligent Practices

Tencent Cloud Database’s massive‑operation strategy combines a dedicated architect team, a three‑layer automated platform for resource, task and health management, and AI‑driven intelligent services that customize workloads, automate tuning, and enable proactive scaling and self‑healing across hundreds of thousands of instances.

AI · Operations · automation
0 likes · 11 min read
JD Tech
Jun 14, 2018 · Operations

Design and Implementation of a Lightweight Service Monitoring and Traffic Management System

This article shares the design and implementation of a lightweight, robust, and low‑intrusion monitoring management system for microservice traffic, detailing data collection via client filters, Redis‑based structured storage, alerting, rate‑limiting, degradation, and authorization mechanisms, and discusses performance optimizations and future improvements.

Microservices · Operations · monitoring
0 likes · 11 min read
DevOps Cloud Academy
Jun 11, 2018 · Operations

Creating and Configuring Jenkins Project Views

This guide explains how to create a new view in Jenkins, configure its settings, and modify the view later through the edit interface, providing step‑by‑step instructions with illustrative screenshots for effective job organization.

Jenkins · Operations · Project View
0 likes · 1 min read
Programmer DD
Jun 7, 2018 · Operations

How to Build a High‑Availability RabbitMQ Cluster with Load Balancing

This guide explains the principles behind RabbitMQ clustering, shows how metadata synchronization works, compares design choices, and provides step‑by‑step instructions—including component installation, node configuration, HAProxy load‑balancing setup, and a sample architecture diagram—to create a reliable, scalable RabbitMQ cluster for production use.

HAProxy · Operations · clustering
0 likes · 16 min read
dbaplus Community
Jun 7, 2018 · Operations

Why Ceph’s Unlimited Scalability Isn’t As Simple As It Looks

The article examines Ceph’s claimed infinite scalability, cost advantages, and operational stability from an SRE perspective, comparing it with centralized systems like HDFS, and reveals practical challenges such as expansion granularity, crushmap rebalancing, utilization limits, and maintenance overhead.

Ceph · HDFS · Operations
0 likes · 15 min read
ITPUB
Jun 5, 2018 · Operations

How to Diagnose CPU Spikes on Linux: A Real‑World Top and Thread Dump Walkthrough

This article walks through a practical Linux performance investigation, showing how to use the top command to pinpoint high‑CPU processes, examine thread details, convert thread IDs, analyze thread dumps for lock contention, and interpret key top output fields for effective troubleshooting.

CPU · Linux · Operations
0 likes · 6 min read
Efficient Ops
May 30, 2018 · Databases

How SF Express Transformed Its Database Operations: From Legacy to Open‑Source, Distributed, and Intelligent Ops

This talk details SF Express’s journey from heterogeneous legacy databases to standardized open‑source, distributed architectures and intelligent operations, covering standardization, migration to open‑source, scaling with Mycat, automated resource pooling, and the ThinkDB platform that drives proactive, automated DBA workflows.

Distributed Systems · Mycat · Operations
0 likes · 18 min read
Tencent Cloud Developer
May 30, 2018 · Operations

Tencent Hub: DevOps Best Practices and Workflow Architecture

Zou Hui explained Tencent Hub’s end‑to‑end DevOps platform, detailing how clarified, automated workflows—spanning code development, building, release, containerized plugins, and a multi‑level artifact registry—enable balanced quality and speed while supporting flexible, parallel execution and comprehensive permission‑controlled management across diverse deployment scenarios.

DevOps · Operations · Tencent Hub
0 likes · 10 min read
Qunar Tech Salon
May 30, 2018 · Operations

Recap of the QInfrarch Session at the 2018 Qunar Technology Carnival

The QInfrarch special session of the 2018 Qunar Technology Carnival gathered a packed audience on May 27, featuring multiple technical talks on real‑time push architecture, IDC networking, ticket search, decentralization, multi‑datacenter redundancy, and fault‑injection platforms, followed by lively Q&A, networking, and enthusiastic follow‑up requests.

Infrastructure · Operations · QInfrarch
0 likes · 4 min read
Efficient Ops
May 27, 2018 · Operations

Mastering High Availability and High Concurrency: Principles and Practical Techniques

This article outlines guiding principles, high‑availability strategies, and high‑concurrency techniques—covering stateless design, resource isolation, quota management, monitoring, degradation, rollback, and scaling—to help engineers build resilient, scalable systems while balancing cost and performance.

Operations · Scalability · System Design
0 likes · 21 min read
ITFLY8 Architecture Home
May 27, 2018 · Information Security

How Google Secures Its Global Data Centers: Inside the Infrastructure

Google’s technical infrastructure—supporting services like Search, Gmail, G Suite, and GCP—employs layered physical, hardware, software, and operational security measures, including biometric access, custom secure chips, encrypted boot, service isolation, identity management, and robust DoS defenses to protect data and operations worldwide.

Data Center Security · Google · Infrastructure
0 likes · 20 min read
Efficient Ops
May 23, 2018 · Operations

How Alibaba Guarantees High‑Availability Ops for New Retail

This article explains Alibaba's GOC‑driven operation‑assurance solution for new retail, covering the sector's evolution, unique reliability challenges, a four‑pillar support framework—including high‑availability, mobile ops, emergency response, and change control—and real‑world best practices from Hema Fresh.

Alibaba · Operations · emergency response
0 likes · 19 min read
Efficient Ops
May 21, 2018 · Operations

Mastering Service Performance: CPU, Memory, JVM & Linux Monitoring Guide

This comprehensive guide explains how to monitor and tune service performance by examining CPU load, system and JVM memory usage, buffer/cache concepts, key performance metrics such as response time, throughput, QPS, and provides essential Linux tools and commands for effective operations management.

JVM · Operations · Performance Monitoring
0 likes · 21 min read
DevOpsClub
May 11, 2018 · Operations

How Anti‑Fragility and GameDays Turn System Failures into Growth

This article explores anti‑fragility theory and real‑world DevOps practices such as Phoenix Server, Chaos Monkey, GameDays, and blameless post‑mortems, showing how organizations can transform inevitable failures into opportunities for resilience and continuous improvement.

Blameless Culture · Operations · anti-fragility
0 likes · 11 min read
Efficient Ops
May 10, 2018 · Operations

How Ele.me Scaled to 10M+ Daily Orders with Multi‑Active Architecture

The talk details Ele.me’s rapid growth from 300k to over 10 million daily orders, describing the challenges of high‑concurrency, multi‑active micro‑service architecture, IDC planning, database refactoring, disaster‑recovery, NOC operations, and the systematic processes that enabled stable, scalable delivery across two data centers.

IDC planning · Operations · Scalability
0 likes · 19 min read
Efficient Ops
May 9, 2018 · Operations

How eBay Automates Cross‑Platform Patch Deployment at Scale

This article details eBay's 11‑year journey in automating system‑wide patch deployment across Windows and Linux servers, covering challenges, process evolution, security considerations, testing strategies, and future plans for kernel hot‑patching and container‑based updates.

Operations · Patch management · automation
0 likes · 17 min read
dbaplus Community
May 8, 2018 · Operations

How to Build Reliable Operations: From BCM to Google SRE Practices

This article examines the growing challenges of system availability in modern operations, explains the concept of availability and the N‑nine metric, introduces Business Continuity Management and Google SRE approaches, and provides concrete technical and managerial methods—including architecture standardization, scaling strategies, tooling, emergency drills, and incident‑centralized management—to improve operational reliability.

Availability · BCM · Operations
0 likes · 30 min read
Efficient Ops
May 8, 2018 · Operations

20 Proven Ops Automation Rules Every Team Should Follow

This article presents twenty practical principles for building and maintaining an effective, business‑oriented operations automation system, covering mindset, architecture, design, tooling, team composition, data handling, security, and implementation best practices for modern enterprises.

Infrastructure · Operations · automation
0 likes · 5 min read
MaGe Linux Operations
May 6, 2018 · Operations

6 Common Linux Ops Issues and How to Diagnose & Fix Them

Learn a systematic Linux troubleshooting workflow and detailed solutions for six typical operational problems—including filesystem corruption, disk space exhaustion, inode depletion, lingering deleted files, too‑many‑open‑files errors, and read‑only filesystem issues—complete with command‑line examples and step‑by‑step fixes.

Filesystem · Linux · Operations
0 likes · 13 min read
dbaplus Community
May 2, 2018 · Big Data

Why Big Data Clusters Need a Robust Automated Monitoring & Alerting System

The article explains the unique challenges of monitoring and alerting in large‑scale big‑data environments, outlines the evolution and architecture of such systems, and provides detailed guidance on data collection, time‑series storage, rule definition, and alert actions for reliable operations.

Operations · architecture · monitoring
0 likes · 17 min read
Efficient Ops
May 2, 2018 · Operations

How Tencent Scales 20,000+ Servers: Lessons from SNG Operations

This talk outlines the five major challenges faced by Tencent's SNG component operations—geographic distribution, HTTPS certificate management, massive device failures, long‑term maintenance, and large‑scale scaling—and describes the underlying architecture, operational principles, and practical techniques used to automate and reliably support millions of users during peak events.

Operations · Tencent · automation
0 likes · 20 min read
System Architect Go
May 1, 2018 · Operations

How to Set Up Real-Time Logging with Slack

This guide explains step‑by‑step how to configure Slack as a real‑time log channel by creating a workspace, setting up a channel, generating an incoming webhook URL, and posting JSON log messages via HTTP so you can monitor application logs instantly.

Operations · Real-time logging · Slack
0 likes · 2 min read
Efficient Ops
Apr 25, 2018 · Operations

How Tencent Cut Over $1B in Bandwidth Costs with Smart Image & Video Compression

This article shares Tencent SNG's practical experience in bandwidth cost optimization, detailing how advanced image and video compression techniques, adaptive resolution, AI‑driven super‑resolution, and efficient transcoding pipelines cut cash outlay by over a billion yuan while preserving user experience and product quality.

AI · Cost reduction · Operations
0 likes · 24 min read
Architecture and Beyond
Apr 22, 2018 · Backend Development

Comprehensive Guide to Building a Backend Technology Stack for Startup Companies

This article provides a detailed, step‑by‑step overview of how startups can design, select, and integrate languages, components, processes, and systems—including databases, RPC frameworks, monitoring, CI/CD, and cloud services—to construct a robust, scalable backend architecture that balances cost, performance, and operational maturity.

Backend · Operations · Technology Stack
0 likes · 31 min read
Efficient Ops
Apr 19, 2018 · Operations

How Alibaba Prevents Release Failures in Billion‑Dollar Transactions

Alibaba’s experts share how they boost release speed and stability for trillion‑dollar transactions by combining P2P file distribution, automated monitoring, AI‑driven anomaly detection, and an unattended release system that automatically pauses risky deployments, reducing faults while handling massive e‑commerce workloads.

AI Monitoring · Deployment · Operations
0 likes · 25 min read
Meituan Technology Team
Apr 19, 2018 · Operations

How Meituan‑Dianping Built a 100% High‑Availability Core Transaction System

This article analyzes the rapid growth challenges of Meituan‑Dianping's core payment flow, explains key availability metrics such as MTBF and MTTR, and presents a comprehensive set of architectural, operational, and tooling strategies—including dependency decoupling, timeout tuning, circuit breaking, and full‑link stress testing—to achieve stable, fault‑tolerant transactions.

Microservices · Operations · circuit breaker
0 likes · 20 min read
ITPUB
Apr 19, 2018 · Databases

How Didi Scales MySQL: From Manual Ops to Full Automation

This article outlines Didi's MySQL database architecture, the challenges of managing thousands of instances, and the step‑by‑step automation framework—including dbproxy, high‑availability, backup, monitoring, and deployment modules—that reduces manual DBA work by over 70%.

DBA · Didi · Operations
0 likes · 14 min read
DevOps
Apr 17, 2018 · Operations

Managing Shared Configuration in VSTS Using Library Variable Groups

This guide explains how to centralize duplicated VSTS deployment parameters and PowerShell scripts by creating Library variable groups, setting their security, and referencing them in Release definitions to simplify configuration management across multiple projects.

Azure DevOps · Library · Operations
0 likes · 3 min read
MaGe Linux Operations
Apr 13, 2018 · Operations

How Alibaba Built Its DevOps Automation Platform: Key Practices and Lessons

This article outlines Alibaba's DevOps transformation, describing the three operational stages, four foundations of automated operations, CI/CD implementation, essential system characteristics, development‑defined operations, config‑driven changes, and the tools that enable high‑availability, efficiency, and scalability.

Alibaba · Configuration · Infrastructure
0 likes · 10 min read
Practical DevOps Architecture
Apr 10, 2018 · Operations

Ansible Installation and Configuration Guide

This guide explains how to install Ansible via yum or pip, outlines its directory layout, describes host inventory setup, details SSH key configuration for password‑less access, and introduces common modules such as ping for basic connectivity testing.

Ansible · Installation · Operations
0 likes · 4 min read
Efficient Ops
Apr 8, 2018 · Operations

Why ELK Is the Ultimate Solution for Log Management and Monitoring

This article introduces the ELK stack—Elasticsearch, Logstash, and Kibana—explaining its core components, architecture, comparison with databases and grep, typical use cases across security, networking, and application monitoring, deployment considerations, challenges, SaaS prospects, and recommended learning resources.

ELK · Elasticsearch · Log Management
0 likes · 10 min read
Architecture Digest
Apr 7, 2018 · Operations

Comparison of Service Discovery Tools: ZooKeeper, etcd, and Consul

This article compares three popular service discovery solutions (ZooKeeper, etcd, and Consul), detailing their architectures, features, integration methods, strengths, and weaknesses, and concludes with a recommendation for using Consul in multi‑data‑center environments while noting its real‑time notification limitations.

Consul · Operations · ZooKeeper
0 likes · 8 min read
Efficient Ops
Apr 2, 2018 · Operations

How Bilibili Revamped Its Monitoring Architecture: From Zabbix to Dapper

An in‑depth look at Bilibili’s multi‑layer monitoring overhaul, detailing the shift from a monolithic Zabbix setup to micro‑service‑based ELK, Dapper, Misaka, Traceon and Lancer systems, and how layered observability improves fault detection across business, application, and infrastructure levels.

Distributed Tracing · Microservices · Operations
0 likes · 10 min read
MaGe Linux Operations
Mar 31, 2018 · Operations

Essential Linux Ops Tools: Monitoring, Performance, and Security Utilities

This article introduces a curated set of practical Linux operations tools—including Nethogs, IOZone, IOTop, IPtraf, IFTop, Fail2ban, Tmux, NMON, MultiTail, NMap, and Httperf—detailing their purpose, installation steps, key command‑line options, and usage examples to help system administrators monitor bandwidth, disk I/O, processes, logs, and security on Linux servers.

Linux · Operations · monitoring
0 likes · 11 min read
JD Tech
Mar 30, 2018 · Backend Development

Effective Logging Practices and Standards for Java Backend Systems

This article explains why proper logging is crucial for Java backend maintenance, defines useful log levels, outlines team rules and best‑practice implementations—including traceId usage, log file organization, and real‑time monitoring—to enable fast issue diagnosis and improve overall engineering quality.

Backend · Operations · java
0 likes · 10 min read
DevOps Coach
Mar 29, 2018 · Operations

7 Must-Have Skills Every DevOps Engineer Needs

The article outlines the seven essential competencies—flexibility, security, collaboration, scripting, decision‑making, infrastructure knowledge, and soft skills—that DevOps engineers must master to bridge development and operations, accelerate delivery, and maintain secure, reliable systems.

Collaboration · DevOps · Infrastructure
0 likes · 8 min read
Tencent Cloud Developer
Mar 29, 2018 · Artificial Intelligence

How AI Powers a Smart Ops Bot for Seamless Dev‑Ops Collaboration

This article explains the motivation behind the growing gap between developers and operations, introduces Tencent Cloud's AI‑driven intelligent operations robot, outlines its core features, typical use cases, and dives into the retrieval‑based dialogue system and matching models that enable natural‑language interactions.

AI Ops · Chatbot · DevOps
0 likes · 13 min read
Efficient Ops
Mar 27, 2018 · Cloud Computing

Why X86 Bare‑Metal Services Matter and How to Build Them in the Cloud

This article explains why X86 bare‑metal services are essential for high‑performance, security‑critical workloads, describes their architecture and management processes, and outlines the steps—standardization, automation, service‑orientation, and self‑service—used by Hengfeng Bank to implement and operate them.

Bare Metal · Infrastructure · Operations
0 likes · 16 min read
DevOps
Mar 22, 2018 · Operations

Leading a DevOps Transformation: Five Misconceptions, Five Practices, and Concrete Implementation Advice

This article examines why DevOps transformations often fail, outlines five common misconceptions and five proven practices, and provides concrete, data‑driven recommendations—including cultural evolution, small‑batch work, feedback loops, value‑stream collaboration, and waste elimination—to help organizations achieve faster, safer, and more reliable software delivery.

Continuous Delivery · DevOps · Operations
0 likes · 21 min read
DevOps
Mar 20, 2018 · Operations

How Large-Scale Development Teams Implement DevOps Transformation: Engineering Systems, Automated Deployment, Telemetry, and Continuous Improvement

This article describes how Microsoft’s global development platform team built a highly available, automated DevOps pipeline on Azure, detailing the engineering system, deployment process, telemetry collection, alert handling, security practices, open‑source integration, and metrics‑driven continuous improvement.

DevOps · Operations · automation
0 likes · 17 min read
DevOps Engineer
Mar 19, 2018 · Operations

DevOps Cultural Philosophy and Practical Practices

The article explains DevOps culture, the shift toward eliminating barriers between development and operations, and outlines key practices such as frequent small releases, microservices, continuous integration, continuous delivery, and infrastructure as code to accelerate innovation while maintaining reliability.

Continuous Delivery · DevOps · Infrastructure as Code
0 likes · 7 min read
DevOps Engineer
Mar 19, 2018 · Operations

Understanding DevOps: Culture, Practices, and Tools for Faster Application Delivery

The article explains how DevOps combines culture, practices, and tooling to enable organizations to deliver applications and services more rapidly by breaking down silos between development and operations, fostering cross‑functional collaboration, automation, and continuous improvement throughout the software lifecycle.

Collaboration · Continuous Delivery · DevOps
0 likes · 2 min read
21CTO
Mar 19, 2018 · Operations

How Tencent Scaled Its Network from 2004‑2013: Key Lessons in Data‑Center Evolution

This article chronicles Tencent's network journey from its modest 2004 infrastructure through rapid expansion, critical incidents, and architectural breakthroughs like SET zones, SDN, and MPLS VPN, illustrating how the company transformed its data‑center operations to support massive user growth.

Infrastructure · Operations · SDN
0 likes · 11 min read
Efficient Ops
Mar 15, 2018 · Operations

How Baidu’s CCS System Scales Command Execution Across Millions of Servers

This article examines Baidu’s Cluster Control System (CCS), detailing its two‑level data model, four‑tier scheduling architecture, and three‑layer execution agents, and explains how control and execution information, redundancy, and fault‑tolerant designs enable reliable large‑scale command execution across thousands of servers.

Command Execution · Distributed Systems · Operations
0 likes · 12 min read
Efficient Ops
Mar 15, 2018 · Operations

Mastering Large-Scale Command Execution: From Basics to Baidu’s Cluster Control System

This article explores the fundamentals of command execution, examines the challenges of scaling command delivery across hundreds of thousands of servers, and details Baidu’s Cluster Control System architecture that enables efficient, flexible, and extensible distributed command management for operations teams.

Command Execution · Deployment · Distributed Systems
0 likes · 10 min read
dbaplus Community
Mar 11, 2018 · Cloud Computing

How a Chinese Telecom Payment Platform Mastered Cloud Migration in 8 Hours

This article details the end‑to‑end cloud migration of China Telecom's payment platform, covering pre‑migration challenges, architectural redesign, data‑sync strategies, the eight‑hour cut‑over process, post‑migration performance gains, and future DBaaS plans, all based on a 2017 DBAplus conference talk.

DBaaS · Infrastructure · Operations
0 likes · 19 min read
Efficient Ops
Mar 7, 2018 · Operations

Mastering Log Collection: From Daily Ops to the ELK Stack

This article explores the everyday challenges of operations teams handling system, access, runtime, error, and business logs, outlines the pain points of log collection and standardization, and provides a comprehensive guide to implementing the ELK (Elastic) stack—including Elasticsearch, Logstash, and Kibana—for effective monitoring and analysis.

ELK · Kibana · Logstash
0 likes · 13 min read
DevOps
Mar 6, 2018 · Operations

Curated DevOps Book List Based on the DevOps Handbook

This article presents a curated list of 25 DevOps books, compiled from the DevOps Handbook and other sources, displayed with images, and invites readers to share, recommend, and comment as the list continues to be updated.

DevOps · Operations · book list
0 likes · 2 min read
Efficient Ops
Mar 6, 2018 · Operations

How Tencent’s SNG Ops Team Overcame Five Massive Service Challenges

The SNG Operations team shares the five critical challenges of managing tens of thousands of domains, certificates, server failures, automation, and rapid scaling during peak events, and outlines the practical strategies they used to ensure reliable, near‑real‑time service delivery.

Operations · automation · certificate-management
0 likes · 6 min read
Alibaba Cloud Developer
Mar 5, 2018 · Operations

Boosting Test Environment Stability: Automated Container Replacement & Buffer Pools

This article analyzes the instability of Alibaba's test environment container provisioning, identifies root causes, and presents a comprehensive solution—including automatic container replacement, a buffer pool, and resource‑pool rationalization—that raised the container success rate to 99.9% and stabilized performance.

Operations · buffer pool · container orchestration
0 likes · 9 min read
Efficient Ops
Mar 2, 2018 · Operations

Mastering System Performance Tuning: A Practical 5W+1H Guide

This article provides a comprehensive, easy‑to‑understand overview of performance tuning, covering what, why, when, where, who, and how to optimize hardware, operating systems, and applications, with practical examples, metrics, tools, and step‑by‑step procedures for both pre‑deployment and post‑deployment optimization.

Hardware · Operations · System optimization
0 likes · 21 min read
MaGe Linux Operations
Mar 1, 2018 · Operations

Top 10 Linux Ops Troubleshooting Tips Every Sysadmin Should Know

An experienced Linux sysadmin shares a curated list of common operational issues—from shell script execution failures and cron output overload to disk space leaks, MySQL storage pitfalls, and network latency—detailing root causes, step‑by‑step diagnostics, and practical solutions to keep servers running smoothly.

Linux · Operations · Shell
0 likes · 15 min read
AntTech
Mar 1, 2018 · Operations

Intelligent Scheduling in Customer Service: Architecture, Challenges, and Future Directions

The article examines how intelligent scheduling combines AI-driven bots and human agents to dynamically allocate customer service resources, addressing market slowdown, complex business structures, and operational pain points through perception, decision‑making, and execution capabilities, while outlining current implementations and future plans at Ant Financial.

AI · Intelligent Scheduling · Operations
0 likes · 14 min read
Efficient Ops
Feb 28, 2018 · Operations

How Meituan Scaled Delivery Ops with Automated Monitoring and Full‑Link Testing

This article explains how Meituan's food delivery platform built an automated operations system—covering complex workflows, traffic spikes, rapid growth, pain‑point analysis, core goals, system architecture, and automation techniques such as anomaly detection, service‑protection triggers, and full‑link testing—to improve reliability and reduce manual effort.

Meituan · Operations · automation
0 likes · 17 min read
ITFLY8 Architecture Home
Feb 24, 2018 · Operations

How Tencent’s Blue Whale Transforms Operations: From Automation to Data‑Driven Service

This article outlines the evolution of Tencent Games' Blue Whale platform, describing its background, design philosophy, six‑platform architecture, and phased approach to automating basic operations, empowering product teams, and leveraging real‑time big‑data analytics to create a data‑driven, service‑oriented operations ecosystem.

Operations · platform
0 likes · 23 min read
Efficient Ops
Feb 23, 2018 · Operations

What a Decade of Ops Taught Me: Key Strategies for Scalable Infrastructure

This article reflects on ten years of Tencent's operations experience, sharing the author's career journey, the evolution of large‑scale service management, the design of the L5 fault‑tolerant system, unified frameworks, resource packaging, CMDB virtual mirrors, and automated deployment practices that together enable reliable, efficient, and scalable infrastructure.

CMDB · Operations · automation
0 likes · 11 min read
MaGe Linux Operations
Feb 23, 2018 · Operations

Essential Linux Ops Interview Guide: From RAID Basics to Load‑Balancing Strategies

This comprehensive guide covers Linux operations interview topics, including the definition of ops, game‑ops roles, server management techniques, RAID levels, load‑balancing tools (LVS, Nginx, HAProxy), middleware, MySQL troubleshooting, backup solutions, health‑check configuration, common networking commands, virus removal, TCP/IP model, Nginx modules, log retention, system optimization, and useful command‑line shortcuts, all presented with clear explanations and practical examples.

Linux · Operations · RAID
0 likes · 38 min read
Alibaba Cloud Infrastructure
Feb 12, 2018 · Operations

Intelligent Network Practices for Alibaba's Double 11: Automation, Fault Detection, and Traffic Optimization

Alibaba senior technical expert Houyi explains how intelligent network automation, rapid fault detection, automatic isolation, and traffic‑optimizing technologies were applied during Double 11 to dramatically improve stability, reduce costs, and enhance overall network performance across millions of devices.

Alibaba · Operations · fault detection
0 likes · 16 min read
DevOps Engineer
Feb 7, 2018 · Operations

Understanding DevOps: Concepts, History, Benefits, and Adoption

This article explains the DevOps concept, its historical evolution, the advantages of faster and more reliable software delivery, the cultural and technical drivers behind its rise, and current adoption trends and tools used by enterprises worldwide.

Culture · DevOps · Operations
0 likes · 7 min read
DevOps
Feb 6, 2018 · Operations

From DevOps to Lean: A Two‑Year Reflection on Value‑Stream Delivery and Continuous Improvement

The article reflects on how DevOps, Docker, Kubernetes and lean/TOC thinking have transformed over the past two years, explains the three‑step workflow for building a value‑stream delivery pipeline, and offers practical guidance on culture, feedback loops, and handling unplanned work to achieve reliable, business‑focused IT operations.

Continuous Delivery · DevOps · IT Management
0 likes · 10 min read
Efficient Ops
Feb 6, 2018 · Operations

Hybrid Learning Beats Thresholds: Anomaly Detection for Millions of KPI Curves

The article recounts the author’s journey, beginning in 2017, of building an intelligent operations platform at Tencent. It details challenges such as legacy thresholds, a shortage of AIOps talent, and a lack of frameworks, and explains how a two‑stage hybrid unsupervised‑supervised model was devised to automatically detect anomalies across millions of KPI time series, enabling scalable root‑cause analysis and cost optimization.

Operations · Time Series · aiops
0 likes · 7 min read
Efficient Ops
Feb 5, 2018 · Operations

How WeChat Scales Massive Real-Time Monitoring: Design & Practices

This article details the architecture and practical techniques behind WeChat's large‑scale monitoring system, covering lightweight data collection, classification of real‑time, non‑real‑time and user‑specific metrics, anomaly detection algorithms, automated configuration, and high‑performance storage solutions for billions of events per minute.

Operations · Real-Time · data collection
0 likes · 14 min read
JD Retail Technology
Feb 5, 2018 · Backend Development

Design and Implementation of Footprint Platform, Mock Server Platform, and Pre‑release Gray Release Solution for Virtual Products

This article presents the challenges of virtual product development and describes three engineering solutions—a Footprint tracking system, a Mock Server platform, and a pre‑release gray‑release strategy—detailing their backgrounds, architectures, implementations, and operational benefits for improving debugging, testing, and deployment efficiency.

Backend · Message Queue · Mock Server
0 likes · 8 min read
MaGe Linux Operations
Feb 5, 2018 · Operations

6 Common Linux Ops Issues and How to Fix Them Quickly

This article presents a systematic troubleshooting workflow for Linux operations engineers, covering six typical problems—including filesystem corruption, disk‑space exhaustion, inode depletion, deleted files that still occupy space, too many open files, and read‑only filesystems—along with concrete commands and solutions to resolve each issue.

Filesystem · Linux · Operations
0 likes · 13 min read
MaGe Linux Operations
Feb 4, 2018 · Operations

Essential Operations Tools Every DevOps Engineer Should Master

This article outlines the key categories of operations tools—including process management, release automation, configuration handling, resource isolation, and comprehensive monitoring and alerting solutions—providing a practical guide for building reliable, automated infrastructure workflows.

Infrastructure · Operations · automation
0 likes · 8 min read
Efficient Ops
Jan 31, 2018 · Operations

85 Essential Ops Rules Every Engineer Should Follow

This article presents a comprehensive list of 85 practical operations rules covering capacity planning, monitoring, automation, security, documentation, budgeting, team management, and incident handling, offering actionable guidance for building reliable, scalable, and efficient IT infrastructure.

IT Management · Operations · best practices
0 likes · 20 min read
MaGe Linux Operations
Jan 31, 2018 · Operations

Essential Linux Ops Interview Q&A: TCP, HTTP, Proxy, and More

A comprehensive guide to common Linux operations interview questions, covering environment variables, TCP characteristics and handshake, proxy principles, TCP vs UDP trade‑offs, OOP vs procedural programming, HTTP request flow and status codes, deadlock concepts, TCP states, and inter‑process communication mechanisms.

HTTP · Linux · Operations
0 likes · 14 min read
dbaplus Community
Jan 29, 2018 · Operations

How Data‑Driven Monitoring Unlocks Real Value for Ops Teams

This article explains why quantifiable data is essential for evaluating the impact of operational changes, outlines common data‑collection stacks, defines core business and user‑centric metrics, and demonstrates practical monitoring techniques such as PCU analysis, simulated user flows, and intelligent scaling to turn ops work into measurable business value.

DevOps · Operations · business metrics
0 likes · 15 min read
MaGe Linux Operations
Jan 26, 2018 · Operations

Master Linux Performance Diagnosis in 60 Seconds with 10 Essential Commands

This guide shows how ten essential command invocations, built from the tools uptime, dmesg, vmstat, mpstat, pidstat, iostat, free, sar, and top, let you quickly assess CPU, memory, disk, and network health within the first sixty seconds of troubleshooting a Linux server, helping you identify saturation and bottlenecks.

Linux · Operations · Performance Monitoring
0 likes · 23 min read
Meitu Technology
Jan 24, 2018 · Operations

Meitu Monitoring Practice: Building a Holistic Monitoring System

The team behind Meitu’s Meipai service, which serves over 150 million monthly users on a hybrid private‑public cloud architecture, spent three years building a comprehensive, three‑dimensional monitoring platform that unifies client‑to‑server metrics, alerts and reporting to ensure resilient, scalable operations and rapid business growth.

Cloud Services · Meituan · Operations
0 likes · 2 min read
Alibaba Cloud Developer
Jan 19, 2018 · Operations

How Alibaba’s AI‑Powered Supply Chain Handles Double‑11’s Massive Surge

This article explains how Alibaba’s supply‑chain algorithms and data‑driven operations enable rapid order processing, accurate demand forecasting, dynamic inventory allocation, and efficient warehouse fulfillment during the massive traffic of Double 11, highlighting the challenges faced and the solutions implemented.

Alibaba · Demand Forecasting · Operations
0 likes · 11 min read
Efficient Ops
Jan 18, 2018 · Operations

Understanding Linux Load Average: Reading, Interpreting, and Using It for Troubleshooting

This article explains what Linux load average measures, how to view the 1‑, 5‑, and 15‑minute values, interprets the numbers using traffic analogies, presents stress‑test scenarios across different CPU cores, and shows how load average guides effective troubleshooting of CPU and I/O bottlenecks.

Load Average · Operations · performance troubleshooting
0 likes · 8 min read
Efficient Ops
Jan 16, 2018 · Operations

How Tencent Secures Game Operations: Real Cases, Challenges, and Data‑Driven Solutions

This article shares a comprehensive overview of game operation security at Tencent, covering personal background, real‑world incident cases, the inherent challenges of large‑scale game services, past monitoring efforts, and a new data‑driven alerting framework that dramatically reduces false alarms while protecting game economies.

Alerting · Big Data · Game Security
0 likes · 25 min read
dbaplus Community
Jan 15, 2018 · Operations

How JD Finance Achieves Real-Time Capacity Assessment and Smart Alerting

This article explains JD Finance's operational challenges in a rapidly expanding micro‑service environment and presents a comprehensive approach that combines offline and online load testing, precise capacity calculations, and intelligent root‑cause alert analysis using both rule‑based and machine‑learning techniques.

Load Testing · Operations · Root Cause Analysis
0 likes · 15 min read
Efficient Ops
Jan 15, 2018 · Operations

How to Build a Full‑Chain Load‑Testing Platform for E‑Commerce in 2 Days

This article details how Xiaohongshu tackled rapid growth challenges by designing, implementing, and operating a full‑link performance testing platform in just two days, covering system architecture, testing models, collaborative deployment, capacity planning, and practical advice for teams seeking reliable e‑commerce load testing.

Load Testing · Operations · e‑commerce
0 likes · 9 min read
Efficient Ops
Jan 14, 2018 · Operations

How We Built a Unified Network Automation Framework for Heterogeneous Devices

This article shares how a telecom operations team tackled the complexity of managing dozens of device vendors and hundreds of models by designing a Python‑based automation module called Forward, which standardizes low‑level actions, provides reusable libraries, and enables rapid script composition for diverse network scenarios.

Heterogeneous Devices · Infrastructure as Code · Operations
0 likes · 10 min read
Snowball Engineer Team
Jan 12, 2018 · Operations

RDR: An Open-Source Tool for Visualizing and Analyzing Redis Memory Usage

This article introduces RDR, an open-source visualization platform developed by Xueqiu's SRE team to safely and efficiently analyze Redis memory consumption by parsing RDB files, estimating key-level memory usage based on internal data structures, and generating intuitive statistical reports for operational optimization.

Memory analysis · Operations · RDB Parsing
0 likes · 9 min read
Efficient Ops
Jan 11, 2018 · Operations

Mastering Incident Troubleshooting: Proven SRE Strategies for Operations

This article shares practical SRE‑based principles for diagnosing and resolving online incidents, emphasizing systematic investigation, gathering clues, and prioritizing service restoration over immediate root‑cause identification to make troubleshooting less mystical and more effective.

Operations · SRE · incident management
0 likes · 7 min read
Architects Research Society
Jan 11, 2018 · Operations

Envoy Outlier Detection and Ejection Mechanism Overview

The article explains Envoy's outlier detection and ejection process, detailing how unhealthy upstream hosts are identified and temporarily removed based on consecutive 5xx errors, gateway failures, or success‑rate thresholds, and describes the logging format and configuration options for these health‑check mechanisms.

Operations · ejection · health check
0 likes · 6 min read