Tagged articles

Operations

3329 articles · Page 28 of 34

Mar 20, 2018 · Operations

How Large-Scale Development Teams Implement DevOps Transformation: Engineering Systems, Automated Deployment, Telemetry, and Continuous Improvement

This article describes how Microsoft’s global development platform team built a highly available, automated DevOps pipeline on Azure, detailing the engineering system, deployment process, telemetry collection, alert handling, security practices, open‑source integration, and metrics‑driven continuous improvement.

Automation · Cloud · Operations

0 likes · 17 min read

DevOps Engineer

Mar 19, 2018 · Operations

DevOps Cultural Philosophy and Key Practices

The article explains DevOps culture, the shift toward eliminating barriers between development and operations, and outlines key practices such as frequent small releases, microservices, continuous integration, continuous delivery, and infrastructure as code to accelerate innovation while maintaining reliability.

Continuous Integration · Operations · continuous delivery

0 likes · 7 min read

DevOps Engineer

Mar 19, 2018 · Operations

Understanding DevOps: Culture, Practices, and Tools for Faster Application Delivery

The article explains how DevOps combines culture, practices, and tooling to enable organizations to deliver applications and services more rapidly by breaking down silos between development and operations, fostering cross‑functional collaboration, automation, and continuous improvement throughout the software lifecycle.

Automation · Collaboration · Operations

0 likes · 2 min read

21CTO

Mar 19, 2018 · Operations

How Tencent Scaled Its Network from 2004 to 2013: Key Lessons in Data‑Center Evolution

This article chronicles Tencent's network journey from its modest 2004 infrastructure through rapid expansion, critical incidents, and architectural breakthroughs like SET zones, SDN, and MPLS VPN, illustrating how the company transformed its data‑center operations to support massive user growth.

Data Center · Network Architecture · Operations

0 likes · 11 min read

Efficient Ops

Mar 15, 2018 · Operations

How Baidu’s CCS System Scales Command Execution Across Millions of Servers

This article examines Baidu’s Cluster Control System (CCS), detailing its two‑level data model, four‑tier scheduling architecture, and three‑layer execution agents, and explains how control and execution information, redundancy, and fault‑tolerant designs enable reliable large‑scale command execution across thousands of servers.

Command Execution · Operations · Reliability

0 likes · 12 min read

Efficient Ops

Mar 15, 2018 · Operations

Mastering Large-Scale Command Execution: From Basics to Baidu’s Cluster Control System

This article explores the fundamentals of command execution, examines the challenges of scaling command delivery across hundreds of thousands of servers, and details Baidu’s Cluster Control System architecture that enables efficient, flexible, and extensible distributed command management for operations teams.

Command Execution · Deployment · Operations

0 likes · 10 min read

Efficient Ops

Mar 12, 2018 · Operations

Why Did My Data Platform’s CPU Spike to 98%? A Step‑by‑Step Debugging Guide

This article walks through a real‑world incident where a data platform’s CPU usage surged to 98%, detailing how to pinpoint the high‑load process, trace the offending Java thread, uncover a time‑utility method bottleneck, and apply a concise fix that reduced load thirtyfold.

CPU · Java · Operations

0 likes · 7 min read

dbaplus Community

Mar 11, 2018 · Cloud Computing

How China Telecom's Payment Platform Mastered Cloud Migration in 8 Hours

This article details the end‑to‑end cloud migration of China Telecom's payment platform, covering pre‑migration challenges, architectural redesign, data‑sync strategies, the eight‑hour cut‑over process, post‑migration performance gains, and future DBaaS plans, all based on a 2017 DBAplus conference talk.

Cloud Migration · DBaaS · Operations

0 likes · 19 min read

Efficient Ops

Mar 7, 2018 · Operations

Mastering Log Collection: From Daily Ops to the ELK Stack

This article explores the everyday challenges of operations teams handling system, access, runtime, error, and business logs, outlines the pain points of log collection and standardization, and provides a comprehensive guide to implementing the ELK (Elastic) stack—including Elasticsearch, Logstash, and Kibana—for effective monitoring and analysis.

ELK · Kibana · Logstash

0 likes · 13 min read

DevOps

Mar 6, 2018 · Operations

Curated DevOps Book List Based on the DevOps Handbook

This article presents a curated list of 25 DevOps books, compiled from the DevOps Handbook and other sources, displayed with images, and invites readers to share, recommend, and comment as the list continues to be updated.

Book List · Operations · devops

0 likes · 2 min read

Efficient Ops

Mar 6, 2018 · Operations

How Tencent’s SNG Ops Team Overcame Five Massive Service Challenges

The SNG Operations team shares the five critical challenges of managing tens of thousands of domains, certificates, server failures, automation, and rapid scaling during peak events, and outlines the practical strategies they used to ensure reliable, near‑real‑time service delivery.

Automation · Operations · certificate-management

0 likes · 6 min read

Alibaba Cloud Developer

Mar 5, 2018 · Operations

Boosting Test Environment Stability: Automated Container Replacement & Buffer Pools

This article analyzes the instability of Alibaba's test environment container provisioning, identifies root causes, and presents a comprehensive solution—including automatic container replacement, a buffer pool, and resource‑pool rationalization—that raised the container success rate to 99.9% and stabilized performance.

Buffer Pool · Container Orchestration · Operations

0 likes · 9 min read

ITPUB

Mar 2, 2018 · Operations

Recover Deleted Webapp Uploads on Linux with extundelete: Step‑by‑Step Guide

When accidental deletion removes files from the /data/webapps/.../upload directory, this guide shows how to install extundelete, locate the relevant inodes, and safely recover the lost data using read‑only mounts and specific restore commands.

Data Recovery · File System · Linux

0 likes · 6 min read

MaGe Linux Operations

Mar 2, 2018 · Operations

Master Linux: A Step‑by‑Step Roadmap from Beginner to Senior Ops Engineer

This guide outlines a systematic Linux learning path—starting with basic commands and permissions, progressing through intermediate networking, services, and security, and culminating in advanced cloud, automation, and high‑availability skills—to help learners become competent operations engineers.

Linux · Operations · learning roadmap

0 likes · 5 min read

Efficient Ops

Mar 2, 2018 · Operations

Mastering System Performance Tuning: A Practical 5W+1H Guide

This article provides a comprehensive, easy‑to‑understand overview of performance tuning, covering what, why, when, where, who, and how to optimize hardware, operating systems, and applications, with practical examples, metrics, tools, and step‑by‑step procedures for both pre‑deployment and post‑deployment optimization.

Hardware · Operations · Software

0 likes · 21 min read

MaGe Linux Operations

Mar 1, 2018 · Operations

Top 10 Linux Ops Troubleshooting Tips Every Sysadmin Should Know

An experienced Linux sysadmin shares a curated list of common operational issues—from shell script execution failures and cron output overload to disk space leaks, MySQL storage pitfalls, and network latency—detailing root causes, step‑by‑step diagnostics, and practical solutions to keep servers running smoothly.

Linux · MySQL · Operations

0 likes · 15 min read

AntTech

Mar 1, 2018 · Operations

Intelligent Scheduling in Customer Service: Architecture, Challenges, and Future Directions

The article examines how intelligent scheduling combines AI-driven bots and human agents to dynamically allocate customer service resources, addressing market slowdown, complex business structures, and operational pain points through perception, decision‑making, and execution capabilities, while outlining current implementations and future plans at Ant Financial.

AI · Intelligent Scheduling · Operations

0 likes · 14 min read

Efficient Ops

Feb 28, 2018 · Operations

How Meituan Scaled Delivery Ops with Automated Monitoring and Full‑Link Testing

This article explains how Meituan's food delivery platform built an automated operations system—covering complex workflows, traffic spikes, rapid growth, pain‑point analysis, core goals, system architecture, and automation techniques such as anomaly detection, service‑protection triggers, and full‑link testing—to improve reliability and reduce manual effort.

Automation · Meituan · Operations

0 likes · 17 min read

ITFLY8 Architecture Home

Feb 24, 2018 · Operations

How Tencent’s Blue Whale Transforms Operations: From Automation to Data‑Driven Service

This article outlines the evolution of Tencent Games' Blue Whale platform, describing its background, design philosophy, six‑platform architecture, and phased approach to automating basic operations, empowering product teams, and leveraging real‑time big‑data analytics to create a data‑driven, service‑oriented operations ecosystem.

Operations · Platform

0 likes · 23 min read

Efficient Ops

Feb 23, 2018 · Operations

What a Decade of Ops Taught Me: Key Strategies for Scalable Infrastructure

This article reflects on ten years of Tencent's operations experience, sharing the author's career journey, the evolution of large‑scale service management, the design of the L5 fault‑tolerant system, unified frameworks, resource packaging, CMDB virtual mirrors, and automated deployment practices that together enable reliable, efficient, and scalable infrastructure.

Automation · CMDB · Operations

0 likes · 11 min read

MaGe Linux Operations

Feb 23, 2018 · Operations

Essential Linux Ops Interview Guide: From RAID Basics to Load‑Balancing Strategies

This comprehensive guide covers Linux operations interview topics, including the definition of ops, game‑ops roles, server management techniques, RAID levels, load‑balancing tools (LVS, Nginx, HAProxy), middleware, MySQL troubleshooting, backup solutions, health‑check configuration, common networking commands, virus removal, TCP/IP model, Nginx modules, log retention, system optimization, and useful command‑line shortcuts, all presented with clear explanations and practical examples.

Linux · MySQL · Operations

0 likes · 38 min read

Alibaba Cloud Infrastructure

Feb 12, 2018 · Operations

Intelligent Network Practices for Alibaba's Double 11: Automation, Fault Detection, and Traffic Optimization

Alibaba senior technical expert Houyi explains how intelligent network automation, rapid fault detection, automatic isolation, and traffic‑optimizing technologies were applied during Double 11 to dramatically improve stability, reduce costs, and enhance overall network performance across millions of devices.

Alibaba · Operations · fault detection

0 likes · 16 min read

dbaplus Community

Feb 11, 2018 · Operations

Scaling Ops Automation on Alibaba Cloud: From Scripts to Ansible & API Gateways

This article recounts how a fintech platform migrated from manual upgrade scripts to a fully automated operations workflow using Rundeck, Ansible, DNS, API gateways, and a custom backup tool, dramatically improving deployment speed, reducing downtime, and sharing open‑source utilities for the community.

Operations · rundeck

0 likes · 10 min read

DevOps Engineer

Feb 7, 2018 · Operations

Understanding DevOps: Concepts, History, Benefits, and Adoption

This article explains the DevOps concept, its historical evolution, the advantages of faster and more reliable software delivery, the cultural and technical drivers behind its rise, and current adoption trends and tools used by enterprises worldwide.

Automation · Continuous Integration · Culture

0 likes · 7 min read

DevOps

Feb 6, 2018 · Operations

From DevOps to Lean: A Two‑Year Reflection on Value‑Stream Delivery and Continuous Improvement

The article reflects on how DevOps, Docker, Kubernetes and lean/TOC thinking have transformed over the past two years, explains the three‑step workflow for building a value‑stream delivery pipeline, and offers practical guidance on culture, feedback loops, and handling unplanned work to achieve reliable, business‑focused IT operations.

IT Management · Operations · Theory of Constraints

0 likes · 10 min read

Efficient Ops

Feb 6, 2018 · Operations

Hybrid Learning Beats Thresholds: Anomaly Detection for Millions of KPI Curves

The article recounts the author’s 2017‑onward journey building an intelligent operations platform at Tencent, detailing challenges such as legacy thresholds, AIOps talent shortage, and lack of frameworks, and explains how a two‑stage hybrid unsupervised‑supervised model was devised to automatically detect anomalies across millions of KPI time‑series, enabling scalable root‑cause analysis and cost optimization.

AIOps · Anomaly Detection · Machine Learning

0 likes · 7 min read

360 Zhihui Cloud Developer

Feb 6, 2018 · Operations

Deep Dive into Kubernetes Scheduler: Principles, Algorithms, and Code Walkthrough

This article provides a comprehensive overview of the Kubernetes kube‑scheduler, detailing its watch‑and‑bind mechanism, two‑stage scheduling algorithm, and an in‑depth analysis of the source code structure and key functions for readers interested in mastering Kubernetes scheduling internals.

Operations · Scheduler · cloud-native

0 likes · 8 min read

Efficient Ops

Feb 5, 2018 · Operations

How WeChat Scales Massive Real-Time Monitoring: Design & Practices

This article details the architecture and practical techniques behind WeChat's large‑scale monitoring system, covering lightweight data collection, classification of real‑time, non‑real‑time and user‑specific metrics, anomaly detection algorithms, automated configuration, and high‑performance storage solutions for billions of events per minute.

Large Scale · Operations · Real-time

0 likes · 14 min read

JD Retail Technology

Feb 5, 2018 · Backend Development

Design and Implementation of Footprint Platform, Mock Server Platform, and Pre‑release Gray Release Solution for Virtual Products

This article presents the challenges of virtual product development and describes three engineering solutions—a Footprint tracking system, a Mock Server platform, and a pre‑release gray‑release strategy—detailing their backgrounds, architectures, implementations, and operational benefits for improving debugging, testing, and deployment efficiency.

Message Queue · Operations · System Design

0 likes · 8 min read

MaGe Linux Operations

Feb 5, 2018 · Operations

6 Common Linux Ops Issues and How to Fix Them Quickly

This article presents a systematic troubleshooting workflow for Linux operations engineers, covering six typical problems—including filesystem corruption, disk‑space exhaustion, inode depletion, deleted files that still occupy space, too many open files, and read‑only filesystems—along with concrete commands and solutions to resolve each issue.

Filesystem · Linux · Operations

0 likes · 13 min read

MaGe Linux Operations

Feb 4, 2018 · Operations

Essential Operations Tools Every DevOps Engineer Should Master

This article outlines the key categories of operations tools—including process management, release automation, configuration handling, resource isolation, and comprehensive monitoring and alerting solutions—providing a practical guide for building reliable, automated infrastructure workflows.

Automation · Operations · infrastructure

0 likes · 8 min read

MaGe Linux Operations

Feb 1, 2018 · Operations

Master Linux Storage: Mount, Unmount, Auto-Mount, Partition & Format Commands

This guide explains essential Linux storage commands—including fdisk, df, du, mount, and umount—covers automatic mounting via /etc/fstab, details disk partitioning with fdisk, and demonstrates how to format partitions using mkfs, providing practical examples for each step.

Commands · Filesystem · Linux

0 likes · 11 min read

Efficient Ops

Jan 31, 2018 · Operations

85 Essential Ops Rules Every Engineer Should Follow

This article presents a comprehensive list of 85 practical operations rules covering capacity planning, monitoring, automation, security, documentation, budgeting, team management, and incident handling, offering actionable guidance for building reliable, scalable, and efficient IT infrastructure.

Best Practices · IT Management · Operations

0 likes · 20 min read

MaGe Linux Operations

Jan 31, 2018 · Operations

Essential Linux Ops Interview Q&A: TCP, HTTP, Proxy, and More

A comprehensive guide to common Linux operations interview questions, covering environment variables, TCP characteristics and handshake, proxy principles, TCP vs UDP trade‑offs, OOP vs procedural programming, HTTP request flow and status codes, deadlock concepts, TCP states, and inter‑process communication mechanisms.

HTTP · Linux · Operations

0 likes · 14 min read

MaGe Linux Operations

Jan 30, 2018 · Operations

Master Nginx: Installation, Configuration, and Essential Commands

This guide walks you through installing Nginx on CentOS, compiling from source, configuring core settings, managing workers, setting up virtual hosts, and using common directives such as gzip, proxy, and access control to optimize performance and security.

Linux · Nginx · Operations

0 likes · 14 min read

dbaplus Community

Jan 29, 2018 · Operations

How Data‑Driven Monitoring Unlocks Real Value for Ops Teams

This article explains why quantifiable data is essential for evaluating the impact of operational changes, outlines common data‑collection stacks, defines core business and user‑centric metrics, and demonstrates practical monitoring techniques such as PCU analysis, simulated user flows, and intelligent scaling to turn ops work into measurable business value.

Operations · business metrics · data analysis

0 likes · 15 min read

MaGe Linux Operations

Jan 26, 2018 · Operations

Master Linux Performance Diagnosis in 60 Seconds with 10 Essential Commands

This guide presents ten essential commands for the first sixty seconds of troubleshooting a Linux server: uptime, dmesg, vmstat, mpstat, pidstat, iostat, free, sar (run twice, for network interfaces and for TCP), and top. Together they give a quick read on CPU, memory, disk, and network health and help you identify saturation and bottlenecks.

Command-line Tools · Linux · Operations

0 likes · 23 min read

Meitu Technology

Jan 24, 2018 · Operations

Meipai Monitoring Practice: Building a Holistic Monitoring System

The team behind Meitu’s Meipai service, which serves over 150 million monthly users on a hybrid private‑public cloud architecture, spent three years building a comprehensive, three‑dimensional monitoring platform that unifies client‑to‑server metrics, alerts, and reporting to keep operations resilient and scalable as the business grows.

Cloud Services · Meituan · Operations

0 likes · 2 min read

360 Zhihui Cloud Developer

Jan 23, 2018 · Operations

How to Proactively Monitor Elasticsearch Performance and Prevent Outages

This article explains how to anticipate and monitor Elasticsearch issues such as node unavailability, OOM errors, and long garbage‑collection pauses by tracking key performance metrics across query, indexing, memory, and system levels, helping prevent service disruptions.

Elasticsearch · Operations · Performance

0 likes · 12 min read

Alibaba Cloud Developer

Jan 19, 2018 · Operations

How Alibaba’s AI‑Powered Supply Chain Handles Double‑11’s Massive Surge

This article explains how Alibaba’s supply‑chain algorithms and data‑driven operations enable rapid order processing, accurate demand forecasting, dynamic inventory allocation, and efficient warehouse fulfillment during the massive traffic of Double 11, highlighting the challenges faced and the solutions implemented.

Alibaba · Operations · demand forecasting

0 likes · 11 min read

Efficient Ops

Jan 18, 2018 · Operations

Understanding Linux Load Average: Reading, Interpreting, and Using It for Troubleshooting

This article explains what Linux load average measures, how to view the 1‑, 5‑, and 15‑minute values, interprets the numbers using traffic analogies, presents stress‑test scenarios across different CPU cores, and shows how load average guides effective troubleshooting of CPU and I/O bottlenecks.

Operations · System Monitoring · load-average

0 likes · 8 min read
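
The load-average reading summarized in this entry is commonly applied by dividing each value by the number of CPU cores. The sketch below illustrates that normalization only; it is not taken from the article, and the function names and the 0.7 and 1.0 thresholds are assumptions chosen for illustration.

```python
def load_per_core(loadavg, cores):
    """Divide the 1-, 5-, and 15-minute load averages by the core count."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return tuple(value / cores for value in loadavg)


def classify(per_core, busy=0.7, saturated=1.0):
    """Label a per-core load value; both thresholds are illustrative assumptions."""
    if per_core >= saturated:
        return "saturated"
    if per_core >= busy:
        return "busy"
    return "ok"
```

On a Unix host the inputs can come from `os.getloadavg()` and `os.cpu_count()`. On Linux the load average also counts tasks in uninterruptible sleep (usually waiting on I/O), so a high per-core value alongside idle CPUs suggests an I/O bottleneck rather than a CPU one.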

Efficient Ops

Jan 16, 2018 · Operations

How Tencent Secures Game Operations: Real Cases, Challenges, and Data‑Driven Solutions

This article shares a comprehensive overview of game operation security at Tencent, covering personal background, real‑world incident cases, the inherent challenges of large‑scale game services, past monitoring efforts, and a new data‑driven alerting framework that dramatically reduces false alarms while protecting game economies.

Alerting · Big Data · Case Study

0 likes · 25 min read

dbaplus Community

Jan 15, 2018 · Operations

How JD Finance Achieves Real-Time Capacity Assessment and Smart Alerting

This article explains JD Finance's operational challenges in a rapidly expanding micro‑service environment and presents a comprehensive approach that combines offline and online load testing, precise capacity calculations, and intelligent root‑cause alert analysis using both rule‑based and machine‑learning techniques.

Machine Learning · Operations · Root Cause Analysis

0 likes · 15 min read

Efficient Ops

Jan 15, 2018 · Operations

How to Build a Full‑Chain Load‑Testing Platform for E‑Commerce in 2 Days

This article details how Xiaohongshu tackled rapid growth challenges by designing, implementing, and operating a full‑link performance testing platform in just two days, covering system architecture, testing models, collaborative deployment, capacity planning, and practical advice for teams seeking reliable e‑commerce load testing.

Operations · e-commerce · full-chain testing

0 likes · 9 min read

Efficient Ops

Jan 14, 2018 · Operations

How We Built a Unified Network Automation Framework for Heterogeneous Devices

This article shares how a telecom operations team tackled the complexity of managing dozens of device vendors and hundreds of models by designing a Python‑based automation module called Forward, which standardizes low‑level actions, provides reusable libraries, and enables rapid script composition for diverse network scenarios.

Heterogeneous Devices · Operations · infrastructure as code

0 likes · 10 min read

Snowball Engineer Team

Jan 12, 2018 · Operations

RDR: An Open-Source Tool for Visualizing and Analyzing Redis Memory Usage

This article introduces RDR, an open-source visualization platform developed by Xueqiu's SRE team to safely and efficiently analyze Redis memory consumption by parsing RDB files, estimating key-level memory usage based on internal data structures, and generating intuitive statistical reports for operational optimization.

Operations · RDB Parsing · Redis

0 likes · 9 min read

Efficient Ops

Jan 11, 2018 · Operations

Mastering Incident Troubleshooting: Proven SRE Strategies for Operations

This article shares practical SRE‑based principles for diagnosing and resolving online incidents, emphasizing systematic investigation, gathering clues, and prioritizing service restoration over immediate root‑cause identification to make troubleshooting less mystical and more effective.

Incident Management · Operations · SRE

0 likes · 7 min read

Architects Research Society

Jan 11, 2018 · Operations

Envoy Outlier Detection and Ejection Mechanism Overview

The article explains Envoy's outlier detection and ejection process, detailing how unhealthy upstream hosts are identified and temporarily removed based on consecutive 5xx errors, gateway failures, or success‑rate thresholds, and describes the logging format and configuration options for these health‑check mechanisms.

Operations · ejection · health-check

0 likes · 6 min read
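
The consecutive-5xx ejection described in this entry can be sketched as a small state machine. This is a simplified illustration, not Envoy's implementation: the class names, the `threshold` and `base_ejection_s` parameters, and the rule that each repeat ejection lasts proportionally longer are assumptions made for the example, and timestamps are passed in explicitly so the behavior is deterministic.

```python
from dataclasses import dataclass, field


@dataclass
class HostState:
    consecutive_5xx: int = 0     # current streak of 5xx responses
    ejections: int = 0           # how many times this host has been ejected
    ejected_until: float = 0.0   # time at which the host may receive traffic again


@dataclass
class OutlierDetector:
    # Hypothetical knobs for this sketch, not Envoy configuration keys.
    threshold: int = 5
    base_ejection_s: float = 30.0
    hosts: dict = field(default_factory=dict)

    def record(self, host: str, status: int, now: float) -> None:
        """Record one upstream response and eject the host if its streak hits the threshold."""
        state = self.hosts.setdefault(host, HostState())
        if 500 <= status <= 599:
            state.consecutive_5xx += 1
        else:
            state.consecutive_5xx = 0  # any non-5xx response resets the streak
        if state.consecutive_5xx >= self.threshold:
            state.ejections += 1
            # Assumed policy: repeat offenders stay out proportionally longer.
            state.ejected_until = now + self.base_ejection_s * state.ejections
            state.consecutive_5xx = 0

    def is_healthy(self, host: str, now: float) -> bool:
        """A host is routable if it was never seen or its ejection has expired."""
        state = self.hosts.get(host)
        return state is None or now >= state.ejected_until
```

A load balancer would call `record` after each upstream response and consult `is_healthy` before picking a host. Success-rate and gateway-failure detection, which the entry also mentions, are left out of this sketch.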

MaGe Linux Operations

Jan 10, 2018 · Operations

What I Learned from a 2018 Linux Ops Interview: Key Questions & Answers

In this detailed account of a 2018 Linux operations interview, the author shares the job description, required skills, practical preparation tips, and concise answers to seven common interview questions, offering valuable insights for aspiring sysadmin and DevOps professionals.

Linux · Operations · Server Administration

0 likes · 10 min read

Efficient Ops

Jan 7, 2018 · Operations

How Tencent Leverages AI to Simplify Massive-Scale Service Monitoring and Root‑Cause Analysis

Tencent's SNG social platform team tackles billion‑scale traffic by integrating AI‑driven anomaly detection, multi‑dimensional monitoring, and decision‑tree based root‑cause analysis, turning complex backend architectures and massive alert volumes into streamlined, actionable insights for faster issue resolution.

AI · Anomaly Detection · Decision Tree

0 likes · 16 min read

MaGe Linux Operations

Jan 7, 2018 · Operations

How to Recover Accidentally Deleted Linux Upload Files with extundelete

After a sudden incident deleted files from /data/webapps/xxxx/upload without backup, this guide walks through installing extundelete, locating the deleted inode data, and using extundelete commands to recover as much of the lost data as possible, including tips on mounting read‑only.

Data Recovery · Filesystem · Linux

0 likes · 6 min read

Alibaba Cloud Developer

Jan 5, 2018 · Operations

How Alibaba Scaled Double 11 with AI‑Driven Network Automation

Alibaba senior technologist Houyi explains how the company used AI‑powered network intelligence, automated fault detection, traffic scheduling, and smart routing to dramatically improve stability, reduce costs, and boost efficiency during the massive Double 11 shopping event.

AI · Alibaba · Operations

0 likes · 16 min read

Efficient Ops

Jan 3, 2018 · Operations

How QQ Space Photo Album Handled a 4‑Fold Traffic Surge on New Year’s Day

On December 30, 2017, a sudden wave of users uploading and downloading photos of themselves at age 18 drove a fourfold spike in download traffic and a twelvefold surge in post activity on QQ Space's album service. The operations and development teams relied on capacity monitoring, elastic scaling, flexible architecture, and targeted optimizations to maintain service stability and user experience.

Elastic Scaling · Operations · QQ Space

0 likes · 10 min read

Efficient Ops

Jan 2, 2018 · Operations

What Triggered These Real‑World System Crashes? 13 Post‑Mortem Lessons

The article compiles thirteen post‑mortem case studies of severe system outages—from AIX NTP misconfiguration and backup appliance driver issues to PowerHA node ID conflicts and hardware failures—detailing symptoms, root‑cause analysis, and practical remediation steps for each incident.

AIX · Case Study · Operations

0 likes · 20 min read

MaGe Linux Operations

Jan 2, 2018 · Operations

What Does Meituan Ask? 20 Must‑Know Linux Ops Interview Q&A

This article compiles Meituan's Linux operations engineer interview questions covering job requirements, core responsibilities, essential qualifications, and detailed answers on software installation, networking tools, IP configuration, scripting, iptables, MySQL security, replication, and common sysadmin commands, providing a comprehensive study guide for aspiring Linux ops candidates.

Linux · MySQL · Operations

0 likes · 13 min read

Practical DevOps Architecture

Jan 2, 2018 · Operations

Configuring Zabbix Alert Notifications via WeChat Using a Shell Script

This guide explains how to create and deploy a Bash script that retrieves a WeChat corporate token and sends Zabbix alarm messages to all users through the WeChat API, enabling automated monitoring alerts via the corporate WeChat platform.

Alerting · Operations · Shell script

0 likes · 3 min read

dbaplus Community

Jan 1, 2018 · Big Data

How Vipshop Leverages Data Processing, Analytics, and Mining for Smarter Ops

This article summarizes Wu Xiaoguang's talk at Gdevops 2017, detailing how Vipshop integrates data processing, analysis, and mining technologies—such as Flume, Kafka, Spark, and custom scheduling—to improve operational decision‑making, performance monitoring, root‑cause analysis, and predictive modeling across its e‑commerce platform.

Big Data · Operations · Predictive Modeling

0 likes · 23 min read

MaGe Linux Operations

Dec 30, 2017 · Operations

Essential Linux Operations Interview Questions & Answers from Meituan

This article compiles Meituan's Linux operations engineer interview requirements, common questions on system installation, networking, scripting, MySQL security, replication, iptables, and provides detailed command-line solutions and sample scripts to help candidates prepare effectively.

Linux · MySQL · Operations

0 likes · 17 min read

dbaplus Community

Dec 28, 2017 · Operations

Designing Scalable System Architecture: From Access Chains to Cloud‑Native Infrastructure

This comprehensive guide walks through the full lifecycle of enterprise system architecture, covering access‑chain analysis, network and hardware foundations, virtualization and container strategies, layered design, load‑balancing, database high‑availability, service segmentation, and operational safeguards such as CMDB, monitoring, and disaster‑recovery.

CMDB · Operations · database

0 likes · 34 min read

Alibaba Cloud Infrastructure

Dec 27, 2017 · Operations

Efficient Ticket System Operations During Double 11 Promotion

The article describes how a ticketing system with strict SLA enforcement, automated routing, and team‑based service management enabled rapid, orderly issue handling during the high‑volume Double 11 shopping event, resolving nearly 90% of tickets within 30 minutes and improving overall business stability.

Double 11 · Incident Management · Operations

0 likes · 7 min read

DevOps Coach

Dec 27, 2017 · Operations

Essential DevOps Glossary: Key Terms Every Practitioner Should Know

This article presents a comprehensive bilingual DevOps glossary compiled from the DevOps Handbook, offering standardized English‑Chinese terminology, a change log, and open‑source contribution instructions via GitHub for continuous improvement.

Collaboration · Operations · Terminology

0 likes · 8 min read

Efficient Ops

Dec 26, 2017 · Operations

From Oracle DBA to DevOps Leader: A 20‑Year Ops Journey and Lessons

This memoir chronicles a Chinese IT professional’s two‑decade evolution from a university student and Oracle DBA to a DevOps and cloud operations leader, sharing career milestones, technical choices, and practical insights for anyone pursuing a long‑term operations career.

Operations · database

0 likes · 14 min read

DevOps

Dec 26, 2017 · Operations

Implementing the Flow Principle: Continuous Delivery and Value‑Stream Optimization in DevOps

This article, based on the Chinese translation of the DevOps Handbook, explains the Flow Principle, continuous delivery practices, value‑stream mapping, waste elimination, and Goldratt’s five‑step method for improving DevOps pipelines and achieving low‑risk, fast releases.

Flow Principle · Operations · Value Stream

0 likes · 9 min read

MaGe Linux Operations

Dec 23, 2017 · Operations

2017 Ops Tech Landscape: From Microservices to Intelligent Automation

This article surveys the evolution of operations technology, covering microservices, SRE, DevOps, containerization, orchestration, automation, intelligent monitoring, infrastructure, database and big‑data ops, as well as security, game and fintech operational challenges, highlighting current trends and future directions for 2017.

Operations · containerization · devops

0 likes · 14 min read

Dada Group Technology

Dec 22, 2017 · Operations

Performance Testing Process, Plans, and Best Practices for High‑Traffic Events

This article explains the purpose of performance (stress) testing, compares four testing approaches, details the chosen proportional‑deployment strategy, and provides comprehensive preparation steps, script guidelines, metric analysis, and practical tips for ensuring system stability during large‑scale traffic spikes.

Operations · capacity planning · load testing

0 likes · 10 min read

ITPUB

Dec 21, 2017 · Operations

Master Linux Troubleshooting: 6 Common Issues and How to Fix Them

Learn a systematic approach for Linux system administrators to diagnose and resolve six typical problems—including filesystem errors, 'argument list too long', inode exhaustion, undeleted file space, too many open files, and read‑only filesystem—using command‑line tools, log analysis, and practical fixes.

Filesystem · Linux · Operations

0 likes · 15 min read

Alibaba Cloud Infrastructure

Dec 21, 2017 · Operations

Stability Monitoring Practices for Double 11 2017

The 2017 Double 11 stability monitoring project introduced a four‑layer monitoring architecture—including customer & sentiment, business, system water‑level, and infrastructure monitoring—along with data archiving and system‑level reliability measures to detect, respond to, and mitigate issues far faster than traditional manual processes.

Operations · Stability · big-data

0 likes · 14 min read

Architecture Digest

Dec 21, 2017 · Operations

Design and Implementation of an Open‑Source Load Balancing Solution Using Nginx and LVS

The article describes how a company replaced costly commercial load balancers with an open‑source architecture based on LVS for layer‑4 traffic and an Nginx cluster for layer‑7 traffic, detailing project background, technology selection, redundant design, network and Nginx configurations, operational scripts, performance testing, and data analysis.

Automation · High Availability · Network

0 likes · 11 min read

MaGe Linux Operations

Dec 21, 2017 · Operations

Mastering High Availability Clusters: Key Concepts, Resource Management, and Failure Handling

This article explains how high‑availability (HA) clusters provide redundancy for directors, real servers, databases, and storage, covering active‑passive node roles, resource stickiness, constraints, quorum voting, split‑brain avoidance, failure detection methods, and essential configuration tips.

High Availability · Operations · Resource Management

0 likes · 12 min read

Meitu Technology

Dec 19, 2017 · Industry Insights

Inside Meitu’s In‑House Log Collection System Arachnia: Design, Challenges, and Core Mechanisms

This article introduces Meitu’s self‑developed log collection system Arachnia, explaining why a custom solution was needed for massive server‑side user‑behavior logs, the key requirements such as reliability and real‑time throughput, and the core architectural mechanisms that address those challenges.

Arachnia · Big Data · Meitu

0 likes · 2 min read

Efficient Ops

Dec 18, 2017 · Operations

How WiFi Key Built a Million‑User Monitoring Platform: Architecture and Best Practices

This article describes how WiFi Master Key (WiFi 万能钥匙) designed and implemented the Roma monitoring platform to handle billions of daily requests, covering background challenges, architectural principles, component design, data collection, transmission, storage, alerting, and future directions for large‑scale observability.

Observability · Operations · architecture

0 likes · 16 min read

Alibaba Cloud Infrastructure

Dec 18, 2017 · Operations

Deep Hardware‑Software Integration to Eliminate NVMe IO Jitter During Double 11

This case study explains how a combination of kernel‑level NVMe driver congestion control, LVM adjustments, and SSD over‑provisioning was used to suppress severe IO bandwidth drops and jitter, ensuring smooth transaction processing for a high‑traffic Double 11 event.

IO jitter · Kernel Drivers · LVM

0 likes · 7 min read

Alibaba Cloud Infrastructure

Dec 15, 2017 · Operations

Automated Fault Recovery Architecture for Alibaba's Network during Double Eleven

The article describes Alibaba's end‑to‑end automated fault recovery system for its massive network, covering extensive data collection, Spark‑based event processing, flexible alerting with Siddhi, alert convergence using PageRank, and scripted recovery actions to achieve high availability during the Double Eleven traffic surge.

Automation · Big Data · Network Monitoring

0 likes · 9 min read

Alibaba Cloud Developer

Dec 13, 2017 · Operations

How Alibaba’s StarOps Transforms Operations with Automated DevOps Tools

This article explains how Alibaba’s StarOps platform integrates DevOps automation, CMDB, release management, monitoring, host operations, bastion security and fault handling to enable large‑scale, unmanned, data‑driven operations across hybrid cloud environments.

CMDB · Operations · cloud-native

0 likes · 12 min read

DevOps

Dec 7, 2017 · Operations

Insights on DevOps: Perspectives, Principles, and Business Value

Drawing on 40 years of IT experience, the speaker explores DevOps as a transformative practice, discusses its strategic business value, outlines four key discussion areas—including principles, practices, selling to executives, and identifying weak points—and offers practical guidance for cultural and organizational change.

IT Management · Operations · business value

0 likes · 11 min read

MaGe Linux Operations

Dec 6, 2017 · Operations

How to Trim Down Linux Startup: Disable Unnecessary Systemd Services

This guide explains why many Linux distributions start unused services at boot, shows how to list and inspect them with systemd tools, and provides step‑by‑step methods to safely mask or disable specific services to speed up system startup.

Operations · service management · startup

0 likes · 7 min read

Huawei Cloud Developer Alliance

Dec 6, 2017 · Operations

What Does an Integration Validation Engineer Do? Insights from a Huawei Veteran

The article explains the role, responsibilities, and career benefits of integration validation (solution testing) engineers at Huawei, offering practical advice and personal perspectives for newcomers transitioning from development to testing positions.

Career Advice · Huawei · Integration Testing

0 likes · 7 min read

AI Cyberspace

Dec 6, 2017 · Operations

Master RabbitMQ: Message Acknowledgment, Prefetch, RPC, vhosts & Plugins

This article explores RabbitMQ’s core features—including message acknowledgment, prefetch count, RPC support, virtual hosts, and its powerful plugin system—explaining how each works, when to enable or disable them, and providing step‑by‑step command‑line examples for configuring users, permissions, and management tools.

Message Queue · Operations · Plugins

0 likes · 9 min read

Efficient Ops

Dec 5, 2017 · Operations

How Alibaba’s Sunfire Achieves Second‑Level Monitoring at Trillion‑Transaction Scale

This article explains how Alibaba’s Sunfire monitoring platform processes terabytes of logs per minute, uses a pull‑based architecture with Brain‑Reduce‑Map roles, tackles scalability and reliability challenges, and outlines future directions such as MQL standardization and intelligent baselines.

Large Scale · Log Processing · Operations

0 likes · 17 min read

Efficient Ops

Dec 3, 2017 · Operations

Why Operations Teams Get Overlooked and How to Build Real Collaboration

The article explores common conflicts between development, testing, and operations staff, explains why operations are often undervalued, and offers practical steps—such as clear documentation, defined processes, and proactive communication—to improve teamwork and reduce blame‑shifting in software projects.

Operations · communication · process

0 likes · 8 min read

Tencent Cloud Developer

Nov 28, 2017 · Operations

Award-Winning DevOps Product “Developer Lab” and Tencent Cloud Distributed Database (DCDB) – Technical Overview

At the 2017 Global Operations Conference, Tencent Cloud’s award‑winning Developer Lab—an immersive, browser‑based IDE integrating SSH, RDP and tutorial‑driven workflows with automated resource scheduling—and its Distributed Cloud Database (DCDB), a sharded, cluster‑managed MySQL‑compatible system featuring advanced scheduling, routing and configuration services, were recognized for innovation and influence.

Operations · Tencent Cloud · devops

0 likes · 8 min read

Efficient Ops

Nov 27, 2017 · Operations

How Facebook Scales to Billions: Disaggregated Networks, Storage, and Warm Spark

Facebook’s journey from early startup ops to supporting over 2 billion monthly users reveals how disaggregated network, storage, and warm‑storage‑enabled Spark architectures overcome scalability bottlenecks, illustrating the operational strategies and design principles that power massive, reliable data‑center services.

Big Data · Operations · cloud infrastructure

0 likes · 12 min read

Efficient Ops

Nov 23, 2017 · Artificial Intelligence

How to Turn AIOps from Hype into Reality: A Practical Roadmap

In this comprehensive talk, Pei Dan outlines the technical and strategic roadmap for bringing AIOps to production, explains the challenges of anomaly detection, fault localization, root‑cause analysis and prediction, and demonstrates how to decompose complex operations problems into AI‑solvable tasks.

AI · AIOps · Anomaly Detection

0 likes · 21 min read

Alibaba Cloud Developer

Nov 23, 2017 · Operations

How Alibaba Is Revolutionizing Operations with Intelligent Automation and DevOps

Alibaba's R&D efficiency team explains how intelligent operations—spanning resource planning, change management, monitoring, stability, and one‑click site building—are being transformed from manual tooling to automated, AI‑driven DevOps practices to boost efficiency, cut costs, and ensure high availability at massive scale.

Operations · devops · intelligent-ops

0 likes · 27 min read

360 Zhihui Cloud Developer

Nov 21, 2017 · Operations

Detecting LVS Traffic Anomalies with Short‑Term and Long‑Term Ratio Algorithms

This article introduces a practical LVS traffic anomaly detection method that combines short‑term and long‑term ratio analyses, dynamic thresholds, and periodicity‑aware techniques, providing code examples and a decision flow to help ops teams identify sudden traffic spikes or drops accurately.

LVS · Operations · dynamic-threshold

0 likes · 10 min read

Efficient Ops

Nov 20, 2017 · Operations

How JD.com Scales Network Monitoring for Massive Traffic Peaks

This article explains how JD.com’s network team continuously optimizes its large‑scale infrastructure, designs effective monitoring strategies, implements practical monitoring solutions, and outlines future directions to improve network availability, fault detection, and operational efficiency across data centers and the internet backbone.

JD.com · Network Monitoring · Operations

0 likes · 16 min read

Huawei Cloud Developer Alliance

Nov 20, 2017 · Operations

How Unmanned Services Are Redefining Business Operations and Costs

The article analyzes the rise of unmanned solutions such as supermarkets, fitness pods, and toy cars, explaining how mobile payment, IoT, and AI enable them, while examining the shifting cost structure, operational benefits, and challenges of this emerging business model.

IoT · Operations · business model

0 likes · 5 min read

MaGe Linux Operations

Nov 20, 2017 · Operations

Is Operations a Promising Career? Insights from Small Firms to BAT

This article examines the future of operations (Ops) as a career in the internet industry, comparing small and large companies, highlighting why Ops is technically demanding, and offering practical advice for career growth and skill development.

Career · IT · Operations

0 likes · 10 min read

MaGe Linux Operations

Nov 18, 2017 · Operations

Automate Incident Response with BlueKing Fault Self‑Healing and Zabbix

This article shares a hands‑on guide to using BlueKing's Fault Self‑Healing (FTA) platform with Zabbix, detailing benefits, integration steps, package creation, convergence rules, job‑script linking, and real‑world case studies that dramatically reduce manual alert handling time.

BlueKing · Operations · Zabbix

0 likes · 8 min read

360 Zhihui Cloud Developer

Nov 14, 2017 · Operations

Unlocking Scalable Network Automation: Lessons from 360’s Ops Strategy

This article explores how rapid growth in network devices drives the need for comprehensive automation—covering script‑based tasks, zero‑touch provisioning, orchestration with OpenStack, device selection criteria, fault diagnosis, and monitoring—to keep operations ahead of business demands.

Fault diagnosis · Network Monitoring · OpenStack integration

0 likes · 10 min read

JD Retail Technology

Nov 14, 2017 · Operations

Design and Implementation of JD.com's Multi‑Active Distributed Architecture

This article details JD.com's multi-active distributed architecture, covering its evolution from single‑data‑center to multi‑region deployments, network design, leaf‑spine topology, data consistency mechanisms, application scheduling, monitoring, and disaster recovery strategies that enhance high availability and user experience.

Data Consistency · Network Architecture · Operations

0 likes · 11 min read

ITPUB

Nov 14, 2017 · Operations

How Alibaba’s Dragonfly P2P System Powers 20B Transfers and Slashes Docker Image Traffic

Alibaba’s Dragonfly P2P file distribution platform, built to handle massive file and container image delivery during peak events like Double‑11, combines peer‑to‑peer networking, smart compression, flow‑control and security features to achieve billions of transfers, petabyte‑scale traffic, and up to 99.9% reduction in registry outbound bandwidth.

File Distribution · Operations · P2P

0 likes · 20 min read

Efficient Ops

Nov 12, 2017 · Operations

How 360’s LVS FULLNAT Transforms Load Balancing and Boosts Security

This article explains how 360’s Linux Virtual Server (LVS) platform evolved with the FULLNAT forwarding mode, enhancing cross‑VLAN deployment, simplifying real‑server configuration, adding SYN‑proxy protection, and improving UDP handling, while detailing the new deployment architecture and operational benefits.

Deployment · FullNAT · LVS

0 likes · 10 min read

Continuous Delivery 2.0

Nov 12, 2017 · Operations

Practicing Continuous Deployment: Insights from the Cruise (now GoCD) Project

This article recounts how the Cruise team (later renamed GoCD) tackled the "last mile" problem by implementing continuous deployment through automated pipelines, extensive testing, risk mitigation, and frequent releases, ultimately reducing deployment time and improving software delivery reliability.

Automation · CI/CD · Continuous Deployment

0 likes · 10 min read

StarRing Big Data Open Lab

Nov 10, 2017 · Operations

Top 16 Common TDH Community Edition Installation Issues and How to Fix Them

This guide compiles the most frequent problems encountered when installing the TDH Community Edition—such as hostname configuration, logical volume creation errors, service startup failures, firewall settings, and license issues—and provides clear step‑by‑step solutions to help users avoid and resolve these obstacles.

Installation · Linux · Operations

0 likes · 10 min read

Qunar Tech Salon

Nov 10, 2017 · Operations

Building a Private Cloud Elasticsearch Platform with Mesos and Docker

This article describes how the OPS team designed and implemented a private‑cloud Elasticsearch service using Mesos for resource management, Docker containers orchestrated by Marathon, and a suite of monitoring, self‑service configuration, and continuous deployment tools to improve resource utilization and operational efficiency.

Docker · Elasticsearch · Marathon

0 likes · 9 min read

dbaplus Community

Nov 9, 2017 · Operations

Mastering Log Levels: Practical Guidelines for Effective Logging

This article explains the purpose of each log level, when to write logs, performance impacts, and concrete best‑practice patterns for INFO, DEBUG, WARN and ERROR in Java applications, providing actionable templates and configuration tips to build a robust logging system.

Best Practices · Logging · Operations

0 likes · 19 min read

MaGe Linux Operations

Nov 8, 2017 · Operations

How to Build an Ops Engineer Skill Map to Bridge the Hiring Gap

An operations director explains why hiring skilled ops engineers is hard, identifies the technology mismatch in typical stacks, and shares a practical skill‑map approach that lets teams cover most essential tools while giving engineers a clear learning roadmap.

Operations · Ops Engineering · Skill Map

0 likes · 3 min read

ITPUB

Nov 8, 2017 · Operations

10 Essential Linux Sysadmin Hacks to Boost Efficiency

This article presents ten practical Linux system‑administration tricks—from ejecting a stuck DVD drive and resetting a frozen console to sharing screen sessions, creating SSH tunnels for VNC, measuring network bandwidth, and gathering system diagnostics—each designed to save time and improve operational productivity.

Linux · Operations · Shell

0 likes · 20 min read

Efficient Ops

Nov 5, 2017 · Operations

Scaling Ele.me’s Infrastructure: Operations, Automation, and Private Cloud Insights

This article recounts Ele.me's rapid growth from 2014 onward, detailing the challenges of network and server management, the evolution of their operations through standardization, process automation, and platform building, and how private cloud solutions like ZStack enabled fine‑grained, data‑driven infrastructure management.

Automation · Cloud Computing · Operations

0 likes · 23 min read
