Tagged articles
Efficient Ops
Jan 7, 2018 · Operations

How Tencent Leverages AI to Simplify Massive-Scale Service Monitoring and Root‑Cause Analysis

Tencent's SNG social platform team tackles billion‑scale traffic by integrating AI‑driven anomaly detection, multi‑dimensional monitoring, and decision‑tree based root‑cause analysis, turning complex backend architectures and massive alert volumes into streamlined, actionable insights for faster issue resolution.

AI · Operations · anomaly detection
0 likes · 16 min read
Efficient Ops
Jan 3, 2018 · Operations

How QQ Space Photo Album Handled a 4‑Fold Traffic Surge on New Year’s Day

On December 30, 2017, a sudden wave of users uploading and downloading photos of themselves at age 18 drove a fourfold spike in download traffic and a twelvefold surge in post activity on QQ Space's album service. The operations and development teams responded with capacity monitoring, elastic scaling, flexible architecture, and targeted optimizations to maintain service stability and user experience.

Operations · QQ Space · capacity planning
0 likes · 10 min read
Efficient Ops
Jan 2, 2018 · Operations

What Triggered These Real‑World System Crashes? 13 Post‑Mortem Lessons

The article compiles thirteen post‑mortem case studies of severe system outages—from AIX NTP misconfiguration and backup appliance driver issues to PowerHA node ID conflicts and hardware failures—detailing symptoms, root‑cause analysis, and practical remediation steps for each incident.

AIX · Operations · PowerHA
0 likes · 20 min read
MaGe Linux Operations
Jan 2, 2018 · Operations

What Does Meituan Ask? 20 Must‑Know Linux Ops Interview Q&A

This article compiles Meituan's Linux operations engineer interview questions covering job requirements, core responsibilities, essential qualifications, and detailed answers on software installation, networking tools, IP configuration, scripting, iptables, MySQL security, replication, and common sysadmin commands, providing a comprehensive study guide for aspiring Linux ops candidates.

Linux · Operations · Shell
0 likes · 13 min read
dbaplus Community
Jan 1, 2018 · Big Data

How Vipshop Leverages Data Processing, Analytics, and Mining for Smarter Ops

This article summarizes Wu Xiaoguang's talk at Gdevops 2017, detailing how Vipshop integrates data processing, analysis, and mining technologies—such as Flume, Kafka, Spark, and custom scheduling—to improve operational decision‑making, performance monitoring, root‑cause analysis, and predictive modeling across its e‑commerce platform.

Big Data · Data Analytics · Operations
0 likes · 23 min read
MaGe Linux Operations
Dec 30, 2017 · Operations

Essential Linux Operations Interview Questions & Answers from Meituan

This article compiles Meituan's interview requirements for Linux operations engineers and common questions on system installation, networking, scripting, MySQL security, replication, and iptables, with detailed command-line solutions and sample scripts to help candidates prepare effectively.

Linux · Operations · Scripting
0 likes · 17 min read
dbaplus Community
Dec 28, 2017 · Operations

Designing Scalable System Architecture: From Access Chains to Cloud‑Native Infrastructure

This comprehensive guide walks through the full lifecycle of enterprise system architecture, covering access‑chain analysis, network and hardware foundations, virtualization and container strategies, layered design, load‑balancing, database high‑availability, service segmentation, and operational safeguards such as CMDB, monitoring, and disaster‑recovery.

CMDB · Operations · System Architecture
0 likes · 34 min read
Alibaba Cloud Infrastructure
Dec 27, 2017 · Operations

Efficient Ticket System Operations During Double 11 Promotion

The article describes how a ticketing system with strict SLA enforcement, automated routing, and team‑based service management enabled rapid, orderly issue handling during the high‑volume Double 11 shopping event, achieving near‑90% resolution within 30 minutes and improving overall business stability.

Double 11 · Operations · SLA
0 likes · 7 min read
DevOps Coach
Dec 27, 2017 · Operations

Essential DevOps Glossary: Key Terms Every Practitioner Should Know

This article presents a comprehensive bilingual DevOps glossary compiled from the DevOps Handbook, offering standardized English‑Chinese terminology, a change log, and open‑source contribution instructions via GitHub for continuous improvement.

Collaboration · DevOps · Glossary
0 likes · 8 min read
Efficient Ops
Dec 26, 2017 · Operations

From Oracle DBA to DevOps Leader: A 20‑Year Ops Journey and Lessons

This memoir chronicles a Chinese IT professional’s two‑decade evolution from a university student and Oracle DBA to a DevOps and cloud operations leader, sharing career milestones, technical choices, and practical insights for anyone pursuing a long‑term operations career.

Operations · database
0 likes · 14 min read
MaGe Linux Operations
Dec 23, 2017 · Operations

2017 Ops Tech Landscape: From Microservices to Intelligent Automation

This article surveys the evolution of operations technology, covering microservices, SRE, DevOps, containerization, orchestration, automation, intelligent monitoring, infrastructure, database and big‑data ops, as well as security, game and fintech operational challenges, highlighting current trends and future directions for 2017.

DevOps · Microservices · Operations
0 likes · 14 min read
Dada Group Technology
Dec 22, 2017 · Operations

Performance Testing Process, Plans, and Best Practices for High‑Traffic Events

This article explains the purpose of performance (stress) testing, compares four testing approaches, details the chosen proportional‑deployment strategy, and provides comprehensive preparation steps, script guidelines, metric analysis, and practical tips for ensuring system stability during large‑scale traffic spikes.

Load Testing · Operations · capacity planning
0 likes · 10 min read
ITPUB
Dec 21, 2017 · Operations

Master Linux Troubleshooting: 6 Common Issues and How to Fix Them

Learn a systematic approach for Linux system administrators to diagnose and resolve six typical problems: filesystem errors, 'argument list too long', inode exhaustion, disk space not released after files are deleted, too many open files, and a read-only filesystem. The fixes rely on command-line tools, log analysis, and practical remedies.

Filesystem · Linux · Operations
0 likes · 15 min read
Alibaba Cloud Infrastructure
Dec 21, 2017 · Operations

Stability Monitoring Practices for Double 11 2017

The 2017 Double 11 stability monitoring project introduced a four‑layer monitoring architecture—including customer & sentiment, business, system water‑level, and infrastructure monitoring—along with data archiving and system‑level reliability measures to detect, respond to, and mitigate issues far faster than traditional manual processes.

Operations · big-data · incident response
0 likes · 14 min read
Architecture Digest
Dec 21, 2017 · Operations

Design and Implementation of an Open‑Source Load Balancing Solution Using Nginx and LVS

The article describes how a company replaced costly commercial load balancers with an open‑source architecture based on Nginx for layer‑4 traffic and a layer‑7 cluster, detailing project background, technology selection, redundant design, network and Nginx configurations, operational scripts, performance testing, and data analysis.

Operations · automation · high availability
0 likes · 11 min read
MaGe Linux Operations
Dec 21, 2017 · Operations

Mastering High Availability Clusters: Key Concepts, Resource Management, and Failure Handling

This article explains how high‑availability (HA) clusters provide redundancy for directors, RS‑servers, databases and storage, covering active‑passive node roles, resource stickiness, constraints, quorum voting, split‑brain avoidance, failure detection methods, and essential configuration tips.

Cluster · Operations · Resource Management
0 likes · 12 min read
Meitu Technology
Dec 19, 2017 · Industry Insights

Inside Meitu’s In‑House Log Collection System Arachnia: Design, Challenges, and Core Mechanisms

This article introduces Meitu’s self‑developed log collection system Arachnia, explaining why a custom solution was needed for massive server‑side user‑behavior logs, the key requirements such as reliability and real‑time throughput, and the core architectural mechanisms that address those challenges.

Arachnia · Big Data · Meitu
0 likes · 2 min read
Efficient Ops
Dec 18, 2017 · Operations

How WiFi Key Built a Million‑User Monitoring Platform: Architecture and Best Practices

This article describes how WiFi Master Key designed and implemented the Roma monitoring platform to handle billions of daily requests, covering background challenges, architectural principles, component design, data collection, transmission, storage, alerting, and future directions for large-scale observability.

Microservices · Operations · architecture
0 likes · 16 min read
Alibaba Cloud Infrastructure
Dec 15, 2017 · Operations

Automated Fault Recovery Architecture for Alibaba's Network during Double Eleven

The article describes Alibaba's end‑to‑end automated fault recovery system for its massive network, covering extensive data collection, Spark‑based event processing, flexible alerting with Siddhi, alert convergence using PageRank, and scripted recovery actions to achieve high availability during the Double Eleven traffic surge.

Big Data · Network Monitoring · Operations
0 likes · 9 min read
DevOps
Dec 7, 2017 · Operations

Insights on DevOps: Perspectives, Principles, and Business Value

Drawing on 40 years of IT experience, the speaker explores DevOps as a transformative practice, discusses its strategic business value, outlines four key discussion areas—including principles, practices, selling to executives, and identifying weak points—and offers practical guidance for cultural and organizational change.

Continuous Delivery · DevOps · IT Management
0 likes · 11 min read
AI Cyberspace
Dec 6, 2017 · Operations

Master RabbitMQ: Message Acknowledgment, Prefetch, RPC, vhosts & Plugins

This article explores RabbitMQ’s core features—including message acknowledgment, prefetch count, RPC support, virtual hosts, and its powerful plugin system—explaining how each works, when to enable or disable them, and providing step‑by‑step command‑line examples for configuring users, permissions, and management tools.

Configuration · Message Queue · Operations
0 likes · 9 min read
Efficient Ops
Dec 3, 2017 · Operations

Why Operations Teams Get Overlooked and How to Build Real Collaboration

The article explores common conflicts between development, testing, and operations staff, explains why operations are often undervalued, and offers practical steps—such as clear documentation, defined processes, and proactive communication—to improve teamwork and reduce blame‑shifting in software projects.

Operations · communication · process
0 likes · 8 min read
Tencent Cloud Developer
Nov 28, 2017 · Operations

Award-Winning DevOps Product “Developer Lab” and Tencent Cloud Distributed Database (DCDB) – Technical Overview

At the 2017 Global Operations Conference, Tencent Cloud’s award‑winning Developer Lab—an immersive, browser‑based IDE integrating SSH, RDP and tutorial‑driven workflows with automated resource scheduling—and its Distributed Cloud Database (DCDB), a sharded, cluster‑managed MySQL‑compatible system featuring advanced scheduling, routing and configuration services, were recognized for innovation and influence.

DevOps · Innovation · Operations
0 likes · 8 min read
Efficient Ops
Nov 27, 2017 · Operations

How Facebook Scales to Billions: Disaggregated Networks, Storage, and Warm Spark

Facebook’s journey from early startup ops to supporting over 2 billion monthly users reveals how disaggregated network, storage, and warm‑storage‑enabled Spark architectures overcome scalability bottlenecks, illustrating the operational strategies and design principles that power massive, reliable data‑center services.

Big Data · Distributed Systems · Operations
0 likes · 12 min read
Efficient Ops
Nov 23, 2017 · Artificial Intelligence

How to Turn AIOps from Hype into Reality: A Practical Roadmap

In this comprehensive talk, Pei Dan outlines the technical and strategic roadmap for bringing AIOps to production, explains the challenges of anomaly detection, fault localization, root‑cause analysis and prediction, and demonstrates how to decompose complex operations problems into AI‑solvable tasks.

AI · Operations · aiops
0 likes · 21 min read
Alibaba Cloud Developer
Nov 23, 2017 · Operations

How Alibaba Is Revolutionizing Operations with Intelligent Automation and DevOps

Alibaba's R&D efficiency team explains how intelligent operations—spanning resource planning, change management, monitoring, stability, and one‑click site building—are being transformed from manual tooling to automated, AI‑driven DevOps practices to boost efficiency, cut costs, and ensure high availability at massive scale.

DevOps · Operations · Scalability
0 likes · 27 min read
Efficient Ops
Nov 20, 2017 · Operations

How JD.com Scales Network Monitoring for Massive Traffic Peaks

This article explains how JD.com’s network team continuously optimizes its large‑scale infrastructure, designs effective monitoring strategies, implements practical monitoring solutions, and outlines future directions to improve network availability, fault detection, and operational efficiency across data centers and the internet backbone.

JD.com · Network Monitoring · Operations
0 likes · 16 min read
360 Zhihui Cloud Developer
Nov 14, 2017 · Operations

Unlocking Scalable Network Automation: Lessons from 360’s Ops Strategy

This article explores how rapid growth in network devices drives the need for comprehensive automation—covering script‑based tasks, zero‑touch provisioning, orchestration with OpenStack, device selection criteria, fault diagnosis, and monitoring—to keep operations ahead of business demands.

Fault Diagnosis · Network Monitoring · OpenStack integration
0 likes · 10 min read
JD Retail Technology
Nov 14, 2017 · Operations

Design and Implementation of JD.com's Multi‑Active Distributed Architecture

This article details JD.com's multi-active distributed architecture, covering its evolution from single‑data‑center to multi‑region deployments, network design, leaf‑spine topology, data consistency mechanisms, application scheduling, monitoring, and disaster recovery strategies that enhance high availability and user experience.

Data Consistency · Distributed Systems · Operations
0 likes · 11 min read
ITPUB
Nov 14, 2017 · Operations

How Alibaba’s Dragonfly P2P System Powers 20B Transfers and Slashes Docker Image Traffic

Alibaba’s Dragonfly P2P file distribution platform, built to handle massive file and container image delivery during peak events like Double‑11, combines peer‑to‑peer networking, smart compression, flow‑control and security features to achieve billions of transfers, petabyte‑scale traffic, and up to 99.9% reduction in registry outbound bandwidth.

File Distribution · Operations · P2P
0 likes · 20 min read
Efficient Ops
Nov 12, 2017 · Operations

How 360’s LVS FULLNAT Transforms Load Balancing and Boosts Security

This article explains how 360’s Linux Virtual Server (LVS) platform evolved with the FULLNAT forwarding mode, enhancing cross‑VLAN deployment, simplifying real‑server configuration, adding SYN‑proxy protection, and improving UDP handling, while detailing the new deployment architecture and operational benefits.

Deployment · FullNAT · LVS
0 likes · 10 min read
StarRing Big Data Open Lab
Nov 10, 2017 · Operations

Top 16 Common TDH Community Edition Installation Issues and How to Fix Them

This guide compiles the most frequent problems encountered when installing the TDH Community Edition—such as hostname configuration, logical volume creation errors, service startup failures, firewall settings, and license issues—and provides clear step‑by‑step solutions to help users avoid and resolve these obstacles.

Installation · Linux · Operations
0 likes · 10 min read
Qunar Tech Salon
Nov 10, 2017 · Operations

Building a Private Cloud Elasticsearch Platform with Mesos and Docker

This article describes how the OPS team designed and implemented a private‑cloud Elasticsearch service using Mesos for resource management, Docker containers orchestrated by Marathon, and a suite of monitoring, self‑service configuration, and continuous deployment tools to improve resource utilization and operational efficiency.

Docker · Elasticsearch · Marathon
0 likes · 9 min read
dbaplus Community
Nov 9, 2017 · Operations

Mastering Log Levels: Practical Guidelines for Effective Logging

This article explains the purpose of each log level, when to write logs, performance impacts, and concrete best‑practice patterns for INFO, DEBUG, WARN and ERROR in Java applications, providing actionable templates and configuration tips to build a robust logging system.

Operations · best practices · log levels
0 likes · 19 min read
MaGe Linux Operations
Nov 8, 2017 · Operations

How to Build an Ops Engineer Skill Map to Bridge the Hiring Gap

An operations director explains why hiring skilled ops engineers is hard, identifies the technology mismatch in typical stacks, and shares a practical skill‑map approach that lets teams cover most essential tools while giving engineers a clear learning roadmap.

Infrastructure · Operations · Ops Engineering
0 likes · 3 min read
ITPUB
Nov 8, 2017 · Operations

10 Essential Linux Sysadmin Hacks to Boost Efficiency

This article presents ten practical Linux system‑administration tricks—from ejecting a stuck DVD drive and resetting a frozen console to sharing screen sessions, creating SSH tunnels for VNC, measuring network bandwidth, and gathering system diagnostics—each designed to save time and improve operational productivity.

Linux · Operations · Shell
0 likes · 20 min read
Efficient Ops
Nov 5, 2017 · Operations

Scaling Ele.me’s Infrastructure: Operations, Automation, and Private Cloud Insights

This article recounts Ele.me's rapid growth from 2014 onward, detailing the challenges of network and server management, the evolution of their operations through standardization, process automation, and platform building, and how private cloud solutions like ZStack enabled fine‑grained, data‑driven infrastructure management.

Infrastructure · Operations · automation
0 likes · 23 min read
Architecture Digest
Nov 1, 2017 · Operations

A Structured Approach to Online System Issue Diagnosis and Recovery

This article outlines a systematic methodology for understanding, evaluating, and quickly resolving production system incidents by categorizing system layers, assessing impact, employing Linux diagnostic tools, and designing fault‑tolerant mechanisms to minimize downtime and maintain core functionality.

Backend · Linux tools · Operations
0 likes · 12 min read
JD Retail Technology
Oct 30, 2017 · Operations

Ensuring High Availability and Scalability for Large‑Scale Promotions: Insights from a JD Senior Architect

The article explains how JD’s senior architect prepares for the 11.11 shopping festival by defining high‑availability goals, discussing scalability strategies, disaster‑recovery planning, performance optimization, and system resilience to ensure reliable service under massive traffic spikes.

Operations · Scalability · System Architecture
0 likes · 8 min read
Architecture Digest
Oct 27, 2017 · Operations

Key Practices and Principles of DevOps from the “Cloud Development and Operations Best Practices” Talk

The article summarizes a DevOps talk, outlining eight guiding principles—configuration over hard‑coding, redundancy over single points, restartability, whole‑stack delivery, statelessness, standardization, automation, and unattended operation—while sharing concrete tools, architectures, and real‑world experiences from a cloud provider.

Infrastructure · Operations · automation
0 likes · 16 min read
Meituan Technology Team
Oct 26, 2017 · Operations

Evolution of Payment Channel Automation Management at Meituan-Dianping

Meituan‑Dianping’s payment team progressed from manual fault alerts to a fully automated channel management system that detects failures, disables affected banks, conducts controlled ramp‑up tests, and restores service, dramatically cutting response times, manpower costs, and secondary‑failure risks while boosting overall availability.

Operations · System Design · fault management
0 likes · 14 min read
MaGe Linux Operations
Oct 26, 2017 · Operations

Essential Linux Monitoring Tools Every Sysadmin Should Know

Discover a comprehensive collection of 80 essential Linux monitoring tools, ranging from system resource visualizers like nmon and Glances to log analyzers such as GoAccess, each described with features, usage tips, and links, helping sysadmins efficiently track performance, diagnose issues, and maintain robust infrastructure.

Operations · Sysadmin · performance tools
0 likes · 18 min read
Efficient Ops
Oct 25, 2017 · Information Security

Securing Cloud‑Era Network Boundaries: Practices and Automated Operations

This article presents a comprehensive overview of cloud‑era network boundary management, detailing security challenges, unified access control concepts, endpoint protection, traffic analysis, and how automated operations and visualization platforms can reduce risk while maintaining efficient network operations.

Operations · access control · automation
0 likes · 24 min read
Efficient Ops
Oct 24, 2017 · Operations

How Pinterest Scaled Its Monitoring, Logging, and Tracing Over Seven Years

This article chronicles Pinterest's seven‑year evolution from a single‑machine time‑series monitor to a multi‑component system that integrates metrics, log search, and distributed tracing, sharing architectural choices, scaling challenges, and lessons learned for building reliable, high‑performance operations platforms.

Distributed Tracing · Operations · SRE
0 likes · 24 min read
Efficient Ops
Oct 18, 2017 · Operations

How Bilibili Scaled Its Log System to 10TB Daily with Elastic Stack

This article details Bilibili's Billions log platform, from its fragmented origins and design goals to the Elastic Stack-based architecture, shard management, log sampling, custom Go splitters, and monitoring enhancements, highlighting the challenges faced and the roadmap for future improvements.

Big Data · Elastic Stack · Log Management
0 likes · 17 min read
dbaplus Community
Oct 16, 2017 · Operations

How Ele.me Scaled Operations: Key Lessons from Incident Management and Capacity Planning

This article details Ele.me's rapid expansion challenges and shares a three‑stage technical operations journey—fine‑grained division, stability maintenance, and efficiency gains—highlighting real incidents, monitoring upgrades, capacity testing, and practical insights for reliable large‑scale delivery platforms.

Operations · capacity planning · incident management
0 likes · 14 min read
Efficient Ops
Oct 14, 2017 · Operations

Why Small Internet Companies Still Need Operations: Beyond the Myth of No Ops

The article argues that even small internet firms cannot ignore operations as a capability, explaining how testing and ops improve functionality, stability, and business value, and outlining stages for integrating ops as a central control node within fast‑moving development cycles.

IT Management · Operations · Small business
0 likes · 8 min read
MaGe Linux Operations
Oct 11, 2017 · Operations

When Celebrities Crash Weibo: Inside the Ops Battle and Hybrid Cloud Solution

A sudden surge of traffic triggered by a celebrity relationship announcement caused a Weibo outage, prompting frantic reactions from developers, operations, and management, and leading to an in‑depth analysis of high‑availability architecture, elastic scaling, hybrid‑cloud DCP platforms, and Docker‑based service deployment.

Operations · high availability · hybrid cloud
0 likes · 19 min read
dbaplus Community
Oct 10, 2017 · Operations

How to Build Effective Service Monitoring: Principles, Practices, and Technical Implementation

This article explains why service monitoring is essential for large‑scale microservice environments, outlines design principles, core monitoring components, dependency mapping, call‑chain analysis, capacity planning, root‑cause analysis, and presents a practical technical architecture for implementing robust monitoring solutions.

Distributed Tracing · Operations · capacity planning
0 likes · 12 min read
Efficient Ops
Oct 10, 2017 · Operations

WeChat’s 900M MAU Scaling: Secrets of Efficient Operations

The talk outlines WeChat’s approach to handling rapid user growth through disciplined operational standards, cloud‑native management, precise capacity planning, and automated scaling, detailing configuration file conventions, name‑service design, hardware metric evaluation, stress‑testing methods, and dynamic resource allocation to maintain high efficiency and low cost.

Operations · automation · capacity management
0 likes · 25 min read
Efficient Ops
Oct 9, 2017 · Operations

How Tencent Scales Operations for Holiday Traffic Surges

This article explains how Tencent's social platform operations team prepares for massive holiday traffic spikes by following a four‑stage process—business preparation, capacity evaluation, resource provisioning, and scaling with stress testing—while detailing team structures, operational standards, and the supporting tool ecosystem that enable reliable, high‑availability services.

Operations · Tooling · capacity planning
0 likes · 13 min read
ITPUB
Oct 5, 2017 · Operations

Essential Command-Line Tools Every DevOps Engineer Should Know

This article presents a curated collection of fast, interactive, and productivity‑boosting command‑line utilities—including ag, tig, mycli, jq, shellcheck, yapf, mosh, fzf, PathPicker, htop, axel, cloc, ccache, tmux, and many more—along with brief usage examples and screenshots to help engineers streamline development, debugging, and system monitoring tasks.

DevOps · Linux · Operations
0 likes · 10 min read
MaGe Linux Operations
Sep 29, 2017 · Operations

Why Ops Teams Feel Stuck and How to Break the Cycle

The article explores common feelings of fatigue, lack of achievement, and low morale among operations professionals, identifies six root causes such as missing systematic frameworks, unclear positioning, closed mindset, insufficient authority, stagnant improvement, and absent cultural integration, and offers actionable suggestions to transform operations into a strategic, valued function.

IT Management · Operations · Team Culture
0 likes · 8 min read
21CTO
Sep 28, 2017 · Operations

Master Real-Time Log Collection with LogHub: Strategies for E‑Commerce Platforms

This article explains how LogHub enables real-time log collection and unified management for an e‑commerce takeout platform, covering operational challenges, logstore configuration, user promotion tracking, server and client logging methods, and network access options.

Cloud Services · LogHub · Operations
0 likes · 9 min read
Meitu Technology
Sep 28, 2017 · Operations

Inside Meipai’s 3‑D Monitoring System: Scaling 150M Users with Unified Observability

This article examines how Meipai, a popular live‑streaming and short‑video platform with over 150 million monthly active users, engineered a comprehensive, three‑dimensional monitoring architecture that spans client to server, integrates unified dashboards, and leverages both private and public cloud resources to ensure reliable, scalable operations.

DevOps · Infrastructure · Meipai
0 likes · 3 min read
21CTO
Sep 26, 2017 · Operations

Why You Should Never Trust Any Component in Your System—and How to Protect It

In programming and operations, every element can fail unexpectedly, including services and dependencies, requests, machines, data centers, power, networks, and humans, so you should trust nothing by default and build in defensive measures such as monitoring, redundancy, rate limiting, fallback strategies, backups, and automated deployment.

Operations · Reliability · fault tolerance
0 likes · 9 min read
MaGe Linux Operations
Sep 25, 2017 · Operations

How to Become a Successful Operations Manager: Skills, Tools & Strategies

This article outlines the career paths, essential skills, comprehensive toolsets, infrastructure design, security measures, and management practices required to transition from a Linux engineer to an effective operations manager responsible for high‑availability, scalable, and secure IT services.

IT infrastructure · Operations · ops management
0 likes · 24 min read
DevOps
Sep 24, 2017 · Operations

DevOps as a Lean, Artifact‑Centric Process: Principles, Value‑Stream Mapping, and Measurement

The article explains how DevOps applies lean principles to integrate IT operations and software development, describes artifact‑centric workflows, introduces value‑stream mapping and flow measurements, and shows how these practices enable continuous delivery, feedback loops, and systematic improvement across the enterprise.

Continuous Delivery · DevOps · Operations
0 likes · 19 min read
21CTO
Sep 20, 2017 · R&D Management

Turning Errors into Innovation: Anti‑Fragile Systems in Digital Business

In today's fast‑moving digital landscape, companies cannot afford to wait for a perfect product, so firms like Amazon and HARTING adopt anti‑fragile, error‑embracing approaches (systematic root‑cause analysis, agile micro‑services, and tools like Chaos Monkey) to turn failures into rapid innovation and competitive advantage.

Agile Development · Digital Transformation · Error Handling
0 likes · 10 min read
Efficient Ops
Sep 18, 2017 · Operations

How Meizu Built a Continuous Delivery Platform to Boost Ops Efficiency

This article details Meizu's journey from early internet eras to a mature continuous delivery platform, outlining the operational challenges, platform components, standardization, automation, and future intelligent operations to achieve high quality, efficiency, cost control, and security.

Continuous Delivery · Operations · automation
0 likes · 19 min read
Efficient Ops
Sep 17, 2017 · Operations

How to Recover Accidentally Deleted Linux Files with extundelete

This guide walks you through preparing a Linux disk, installing extundelete, protecting the affected partition, and using extundelete commands to locate and restore both files and directories that were mistakenly removed, ensuring safe data recovery for system administrators.

File Recovery · Operations · extundelete
0 likes · 6 min read
Efficient Ops
Sep 14, 2017 · Operations

How to Slash Data Center Energy Costs: 36 Green Ops Strategies and Real-World Cases

This article examines why electricity dominates data‑center operating costs, outlines practical green‑IT measures—including enclosure design, airflow management, lighting, and renewable power—and presents three detailed case studies that illustrate how modular design, cold‑aisle containment, and innovative cooling can dramatically reduce PUE and overall energy consumption.

DevOps · Operations · case study
0 likes · 14 min read
UCloud Tech
Sep 11, 2017 · Operations

How Container and Serverless Architecture Transform Modern Operations

The CNUTCon Global Operations Conference highlighted UCloud's container and Serverless practices, explaining the evolution from virtualization waste to function‑as‑a‑service, showcasing UCloud's universal compute platform and real‑world use cases such as image processing, gene data analysis, and OCR recognition.

Containers · Function as a Service · Operations
0 likes · 5 min read
MaGe Linux Operations
Sep 11, 2017 · Big Data

How Big Data Can Revolutionize Operations Monitoring

This article explores applying big‑data thinking and platforms—such as Flume, Spark Streaming, and HBase—to operations monitoring, detailing data sources, metric categories, architecture design, implementation steps, and the benefits of a scalable, low‑code monitoring platform.

Big Data · Operations · Spark Streaming
0 likes · 10 min read
Efficient Ops
Sep 3, 2017 · Operations

How to Design an Enterprise‑Grade Monitoring & Alerting System from Scratch

This article introduces the fundamental concepts, methods, types, goals, and product attributes of enterprise monitoring and alerting, explains the perspective differences between users and builders, and outlines a comprehensive monitoring system architecture for large‑scale operations.

Alerting · Enterprise · Operations
0 likes · 14 min read
DevOps
Sep 3, 2017 · Operations

Challenges and Implementation Strategies for DevOps in Large Financial Enterprises

The article examines how large financial institutions face technical, procedural, and risk‑control challenges when adopting DevOps, and proposes a three‑stage implementation roadmap—including automation of delivery pipelines, optimization of development models, and continuous‑delivery process refinement—to achieve reliable, rapid software releases.

Continuous Delivery · DevOps · Financial Services
0 likes · 19 min read
MaGe Linux Operations
Sep 2, 2017 · Operations

From Traditional Ops to DevOps: The One Step You’re Missing

This talk walks through the transition from classic application operations to a DevOps culture, highlighting common pain points, the need for standardization and automation, and practical steps for engineers to evolve their skills and boost organizational efficiency.

DevOps · IT Culture · Linux
0 likes · 14 min read
Architecture Digest
Sep 1, 2017 · Operations

Comprehensive Guide to Scalable Website Architecture from an Operations Perspective

This article presents a step‑by‑step operations‑focused roadmap for evolving a website from a single‑server prototype to a highly available, horizontally scalable architecture using load balancing, caching, database replication, service‑oriented design, DNS round‑robin, CDN, and disaster‑recovery techniques.

Database Replication · Operations · Scalability
0 likes · 10 min read