Tagged articles
3281 articles
HomeTech
Nov 16, 2018 · Operations

Open-Sourcing Windows Agent for Open-Falcon Monitoring

The article announces the open-source release of the Windows Agent component under the Apache license, its integration into the Open-Falcon community, future feature enhancements, and gratitude to contributors, while providing links to the source code and related documentation.

Apache License · Operations · Windows Agent
0 likes · 5 min read
Efficient Ops
Nov 14, 2018 · Operations

How Zabbix Tackles FinTech Monitoring Challenges in the VUCA Era

This article explores how the VUCA-driven volatility of modern FinTech demands robust, multi‑layered monitoring solutions and explains why Zabbix, with its open‑source flexibility, automated discovery, and deep integration capabilities, is a compelling choice for achieving resilient, automated operations.

FinTech · Operations · VUCA
0 likes · 19 min read
DevOps
Nov 13, 2018 · Operations

Reflections on DevOps Organizational Transformation: Lessons from Development‑Operations Integration, Product Teams, and IT Ops Decentralization

The article shares practical reflections on a two‑year DevOps transformation, examining the integration of development and operations, the shift to product‑oriented teams, and the decentralization of the IT operations department, while highlighting emerging challenges and key lessons for supporting global business.

DevOps · IT ops · Operations
0 likes · 11 min read
58 Tech
Nov 12, 2018 · Operations

Key Takeaways from the 58 Group Technical Salon on Monitoring Platforms

The article summarizes the 58 Group technical salon where experts from Momo and 58 shared practical experiences on monitoring platform architectures, coverage, alarm configurations, convergence techniques, custom dimensions, multi‑view dashboards, and future directions for intelligent and automated monitoring across the company.

Alerting · DevOps · Operations
0 likes · 9 min read
Alibaba Cloud Infrastructure
Nov 11, 2018 · Operations

A Decade of Double 11: Technical Evolution and Operational Lessons from Alibaba

Over ten years of Alibaba's Double 11, the company transformed a modest marketing event into a global e‑commerce platform by continuously improving backend architecture, scaling strategies, full‑link stress testing, multi‑active data centers, cloud migration, and real‑time incident response, offering valuable operational insights.

Alibaba · Backend · Operations
0 likes · 15 min read
MaGe Linux Operations
Nov 9, 2018 · Information Security

Essential Linux Security Practices Every Ops Engineer Should Know

This article outlines comprehensive Linux security measures—including account hardening, remote access protection, file system safeguards, rootkit detection tools, and step‑by‑step post‑attack response—to help system administrators strengthen server defenses and quickly recover from compromises.

Hardening · Linux · Operations
0 likes · 23 min read
Zhongtong Tech
Nov 9, 2018 · Operations

How ZTO Technology Scales Logistics Systems for Double 11: From Smart Sorting to Private Cloud

Marking the 10th anniversary of Double 11, ZTO Technology details how it tackles massive traffic spikes with an automatic sorting management platform, a high‑availability IDC and private cloud, smart voice and face‑recognition services, real‑time data dashboards, and extensive performance testing to ensure stable, fast, and accurate order fulfillment.

Logistics · Operations · cloud computing
0 likes · 6 min read
Alibaba Cloud Developer
Nov 6, 2018 · Operations

How Alibaba Scaled Double 11: Lessons from a Decade of E‑commerce Mega‑Events

From its humble 2009 launch to the 2018 tenth anniversary, Alibaba’s Double 11 shopping festival evolved through relentless technical challenges—system crashes, CDN bottlenecks, over‑selling bugs, and massive load‑testing innovations—offering a decade‑long case study in operations, scalability, and resilience for large‑scale e‑commerce platforms.

Load Testing · Operations · Scalability
0 likes · 16 min read
Alibaba Cloud Developer
Nov 5, 2018 · Operations

How Alibaba Conquered Double 11: A Decade of Scaling, Crises, and Lessons

From the humble 2009 launch of Double 11 to the massive, cloud-native, multi-region architecture of 2018, Alibaba’s engineers chronicle yearly technical hurdles—traffic spikes, system crashes, CDN limits, over-selling, and the evolution of stress-testing, capacity planning, and operational safeguards that turned the shopping festival into a global engineering showcase.

Operations · Performance Testing · Scalability
0 likes · 17 min read
Tencent Cloud Developer
Nov 1, 2018 · Databases

Experience and Optimization of MongoDB for Mini‑Game Operations and Cloud Integration

Li Xiaohui shares Tencent Cloud MongoDB’s real‑world mini‑game operations, detailing schema‑free design, sharding, thread‑per‑connection tuning, snapshot‑based read fixes, and table‑level rollback, then demonstrates a one‑click cloud stack that provisions MongoDB, serverless functions, storage, monitoring and security for mini‑program developers.

Cloud Services · Game Development · MongoDB
0 likes · 12 min read
Efficient Ops
Oct 31, 2018 · Operations

How to Build an Automated Operations System for Game Companies

This article examines why automated operations are essential for growing game businesses, outlines the goals of a complete, simple, efficient, and secure system, and details the architecture and individual subsystems—including installation, platform, security, client updates, backup, and monitoring—that together form a robust DevOps solution.

DevOps · Game Industry · Operations
0 likes · 19 min read
Efficient Ops
Oct 29, 2018 · Operations

How Youzan Manages Online Incidents: A Step‑by‑Step Guide

This article outlines Youzan's end‑to‑end online incident management process—from fault detection and coordination through root‑cause analysis, recovery, review, and actionable JIRA tracking—highlighting practical workflows, data analysis, and continuous improvement practices for reliable service delivery.

JIRA workflow · Operations · fault handling
0 likes · 10 min read
JD Tech
Oct 29, 2018 · Operations

SGM Service Governance Monitoring Platform: Design, Features, and Use Cases

The article introduces SGM, a comprehensive service governance and monitoring solution that addresses scaling, dependency complexity, and operational challenges by providing automated topology, real‑time tracing, capacity planning, root‑cause analysis, and extensive monitoring features such as performance metrics, JVM stats, call‑chain visualization, business dashboards, and intelligent alerting.

Alerting · Operations · call chain
0 likes · 13 min read
Architects' Tech Alliance
Oct 24, 2018 · Operations

Data Center Facility Construction Standards and Classification Guidelines

This article outlines the scope, terminology, classification levels, site selection principles, equipment layout, and subsystem requirements—including lighting, grounding, lightning protection, HVAC, monitoring, and cabling—for building and operating data center facilities in accordance with industry standards.

Operations · classification · construction standards
0 likes · 9 min read
UC Tech Team
Oct 23, 2018 · Operations

Understanding Faults and Fault Isolation Strategies in Distributed Systems

The article explains what constitutes a fault, introduces key metrics such as RPO and RTO, and describes various fault isolation principles, patterns, and practical examples—including dependency degradation, failover, dynamic adjustment, fast‑fail, caching, rate limiting, and resource isolation—to improve system reliability.

Operations · RPO · RTO
0 likes · 14 min read
Alibaba Cloud Developer
Oct 23, 2018 · Operations

Unlocking Resource Efficiency: Alibaba’s Mixed‑Deployment (Co‑location) Strategy

This article explains how Alibaba’s mixed‑deployment (co‑location) technology combines online transaction services and offline compute workloads on shared physical servers, detailing its architecture, scheduling mechanisms, resource‑concession strategies, achieved performance gains, and future directions for large‑scale e‑commerce infrastructure.

Alibaba · Co-location · Operations
0 likes · 23 min read
Efficient Ops
Oct 22, 2018 · Operations

How Ops Teams Can Find Happiness and Deliver Real Business Value

The article explores why many operations engineers feel unhappy, identifies achievement and compensation as key to happiness, explains the internal and external value of ops work, and outlines how a dedicated ops team can improve product speed, stability, cost efficiency, and overall business outcomes.

DevOps · Operations · business efficiency
0 likes · 6 min read
vivo Internet Technology
Oct 22, 2018 · Operations

Jenkins Area Meetup 2018 Shenzhen: DevOps Practices and CI/CD Solutions

The Jenkins Area Meetup 2018 in Shenzhen, co‑hosted by the DevOps Times community (DevOps时代社区) and vivo Mobile Internet, gathered experts who presented on hybrid‑cloud DevOps, large‑scale CI/CD with Jenkins at Tencent, DevOps‑based R&D and operations standards, and an automated CMDB‑driven operations platform. The event closed with strong community engagement, and the presentation materials are available.

DevOps · Jenkins · Operations
0 likes · 3 min read
dbaplus Community
Oct 21, 2018 · Artificial Intelligence

How Weibo’s Hubble Platform Uses AI for Real‑Time Monitoring and Trend Forecasting

The article details Weibo Advertising's Hubble monitoring system, describing its three‑layer architecture, metric taxonomy, AI‑driven trend prediction with LSTM models, dynamic alert thresholds, and performance testing using GoReplay, illustrating how large‑scale data and machine learning enable proactive operations.

AI · LSTM · Operations
0 likes · 22 min read
Architecture Talk
Oct 15, 2018 · Operations

Master Nginx Rate Limiting: Request & Connection Control with Practical Configs

This article explains how to use Nginx’s built‑in limit_req and limit_conn modules to implement request‑rate and connection‑based throttling, covering configuration directives, execution flow, burst handling, delay modes, whitelist setup with geo and map modules, and practical examples for IP and domain limits.

Nginx · Operations · Web server
0 likes · 9 min read
Efficient Ops
Oct 10, 2018 · Operations

How Alibaba’s Mixed‑Deployment Cuts Costs and Boosts Resource Utilization

This article explains Alibaba's mixed‑deployment (co‑location) technique, detailing its motivation, architecture, resource‑sharing mechanisms, scheduling strategies, performance results, and future directions for scaling and refining resource utilization across online and offline workloads.

Alibaba · Co-location · Operations
0 likes · 22 min read
ITFLY8 Architecture Home
Oct 10, 2018 · Operations

How to Build a Highly Available Redis Service with Sentinel and Virtual IP

This article explains why Redis is a popular in‑memory key‑value store, defines high availability, enumerates failure scenarios, and walks through four incremental architectures—single instance, master‑slave with one Sentinel, dual Sentinel, and three‑Sentinel with VIP—to achieve a robust, production‑grade Redis deployment.

Operations · redis · sentinel
0 likes · 12 min read
Java Captain
Oct 10, 2018 · Operations

Linux Command Cheatsheet and Java Diagnostic Tools for System Operations

This article compiles essential Linux commands and a suite of Java diagnostic utilities—including tail, grep, awk, find, tsar, btrace, Greys, JProfiler, and others—providing concise examples and code snippets to help engineers troubleshoot and monitor production systems efficiently.

Linux · Operations · debugging
0 likes · 13 min read
Efficient Ops
Oct 9, 2018 · Operations

How Tencent Scales Automated Operations for Massive Services

Tencent’s architecture platform team explains how they monitor, automate, and secure billions of daily operations across storage, CDN, and live services, using multi‑dimensional metrics, real‑time and instant computation, AI‑driven anomaly detection, and a custom control platform for safe changes.

Operations · aiops · automation
0 likes · 23 min read
Architects' Tech Alliance
Sep 30, 2018 · Industry Insights

What Every Data Center Engineer Must Know About Rack Cabinet Standards and Design

This article provides a comprehensive overview of data‑center rack cabinets, covering size specifications, power and cooling requirements, key industry standards such as IEC 60297‑1 and EIA‑310‑D, structural components, environmental considerations, load capacity, and practical design guidelines for safe and efficient deployment.

Infrastructure · Operations · Rack Cabinet
0 likes · 10 min read
Youzan Coder
Sep 28, 2018 · Industry Insights

How Youzan Scaled Development with Containerization: Challenges and Solutions

This article examines Youzan's journey to containerize its development and testing environments using Kubernetes and Docker, detailing the motivations, architectural decisions, network and isolation challenges, image integration, logging, load balancing, debugging, and the ongoing rollout to standard production environments.

DevOps · Docker · Environment provisioning
0 likes · 12 min read
Java Backend Technology
Sep 28, 2018 · Operations

Why Your Microservices Need a Distributed Configuration Center (and How to Build One)

This article explains the shortcomings of traditional configuration files, describes why distributed configuration centers are essential for dynamic, multi‑environment microservice deployments, outlines their evolution, presents a simple design with caching and consistency improvements, and reviews popular open‑source solutions.

Configuration Management · Microservices · Operations
0 likes · 11 min read
Efficient Ops
Sep 27, 2018 · Operations

Tencent Billing’s Secret to Managing Massive Promo Spikes

Tencent’s billing platform powers billions of daily transactions across 180+ countries, supporting both consumer and business payments, and employs sophisticated capacity testing, dynamic auto‑scaling, resource sharing, and change‑control mechanisms to ensure reliable large‑scale promotional events without service disruptions.

Auto Scaling · Operations · Tencent Billing
0 likes · 15 min read
JD Tech
Sep 27, 2018 · Operations

Overview of JD Invoice System Architecture and Business Processes

The article provides a comprehensive overview of JD's invoice system, detailing its business lines, core modules, data sources, invoicing workflows—including forward and reverse invoicing—and the system's role in automating tax management and reducing operational risk.

JD · Operations · System Architecture
0 likes · 9 min read
Architects' Tech Alliance
Sep 26, 2018 · Operations

How Goldeneye Enables Adaptive, Intelligent Business Monitoring at Scale

Goldeneye, Alimama's monitoring platform, uses big‑data pipelines, dynamic threshold prediction, mean‑shift change‑point detection, and automated metric discovery to replace manual alarm settings, reduce false alerts, and provide intelligent, scalable business monitoring across hundreds of services.

Big Data · Operations · business monitoring
0 likes · 19 min read
Efficient Ops
Sep 24, 2018 · Operations

How Checklist Thinking Fuels Ops Professionals' Lifelong Growth

This talk explores how ops engineers can achieve continuous professional development by adopting checklist thinking, covering growth drivers, error classification, practical checklist applications, cognitive models, and design principles that turn complex incidents into systematic, repeatable processes.

DevOps · Growth · Operations
0 likes · 34 min read
UCloud Tech
Sep 20, 2018 · Operations

Why CPU Monitoring Shows 0% or 100% Spikes and How Hot Patches Fixed It

The article investigates intermittent CPU usage spikes on Linux servers caused by a kernel cputime bug, explains the root‑cause analysis, describes a cold patch applied to newer kernels, and details a hot‑patch solution that safely resolves the issue across thousands of production machines.

CPU Monitoring · Linux · Operations
0 likes · 9 min read
Efficient Ops
Sep 18, 2018 · Operations

Mastering Internet Operations: Roles, Responsibilities, and Evolution

This article provides a comprehensive overview of internet operations, detailing how service‑centric stability, security, and efficiency are achieved through infrastructure management, monitoring, risk mitigation, and continuous optimization, while outlining the various operational roles, their duties, and the evolution of ops practices.

DevOps · Infrastructure · Operations
0 likes · 21 min read
Efficient Ops
Sep 17, 2018 · Operations

How Alibaba Scales Monitoring: From CMDB to AI‑Driven Full‑Link Observability

Alibaba’s monitoring evolution—from fragmented early tools to the standardized Sunfire platform and now AI‑powered full‑link observability—addresses scaling challenges, introduces business‑centric metrics, automated traceability, and intelligent anomaly detection, illustrating how massive, multi‑tenant infrastructures achieve unified, proactive operations at scale.

Alibaba · Operations · aiops
0 likes · 19 min read
DevOps
Sep 17, 2018 · Operations

Key Insights from the 2018 Global DevOps State of the World Report

The 2018 Global DevOps State of the World Report, compiled by DORA with contributions from leading experts, presents extensive data from over 30,000 professionals, highlights new trends such as accelerated practices, cloud infrastructure, elite high‑performance organizations, and offers a live online session to help practitioners quickly grasp its valuable findings.

DevOps · Operations · Report
0 likes · 6 min read
Youzan Coder
Sep 15, 2018 · Big Data

How Data Empowers Operations: Insights from Youzan & NetEase’s Big Data Summit

On September 15, Youzan’s big-data team and NetEase YouShu hosted a technical sharing titled “The Road to Data-Driven Operations,” where speakers explored the evolution of Youzan’s data warehouse metadata system, the architecture of its big-data development platform, and the application of functional programming in visual data analysis, highlighting current trends and future directions.

Data visualization · Operations · data-warehouse
0 likes · 4 min read
JD Tech
Sep 14, 2018 · Operations

Joint‑Venture Settlement Platform Overview and Billing Architecture

This document presents a comprehensive solution for merchant settlement in joint‑venture (co‑operated) offline stores, describing business models, settlement subject abstraction, billing engine components, settlement workflow, payment collection, and reconciliation architecture with detailed tables and diagrams.

Financial · Microservices · Operations
0 likes · 18 min read
Mike Chen's Internet Architecture
Sep 13, 2018 · Operations

Common Open‑Source Monitoring Systems and Zabbix Monitoring Process

The article introduces common open‑source monitoring tools such as Zabbix and Nagios, explains why distributed systems need proactive health checks, compares features, and provides a detailed Zabbix monitoring workflow including data collection, storage, visualization, alerting, and specific metrics for servers, networks, JVM and MySQL.

Distributed Systems · Nagios · Operations
0 likes · 8 min read
Alibaba Cloud Developer
Sep 12, 2018 · Artificial Intelligence

How Alibaba’s XSigma AI Engine Revolutionizes Customer Service Scheduling

The XSigma system combines AI‑driven demand forecasting, real‑time optimization, visual decision‑making and intelligent training to automatically schedule, scale, balance load and match customers with the best agents, dramatically improving resource utilization and user experience for Alibaba’s massive CCO operation.

Artificial Intelligence · Operations · Scheduling
0 likes · 19 min read
dbaplus Community
Sep 11, 2018 · Operations

How Qunar’s Fault Injection Platform Ensures High‑Availability in Complex Backend Systems

Qunar built a fault‑injection platform that dynamically injects runtime errors into its densely coupled backend services, enabling verification of degradation and circuit‑breaker strategies, with a four‑part architecture comprising a web UI, deployment system, command server, and Java agents using Instrumentation‑API for bytecode weaving.

Backend · Fault Injection · Java Instrumentation
0 likes · 13 min read
DevOps
Sep 10, 2018 · Operations

Challenges and DevOps Transformation in Traditional Financial Enterprises

This talk examines how traditional financial institutions, facing intense internet disruption, struggle with DevOps adoption, highlighting real-world case studies, the importance of granularity and decoupling, internal innovation mechanisms, and practical steps such as physical Kanban, CI/CD pipelines, and Git workflows to improve efficiency.

DevOps · Financial Services · Operations
0 likes · 14 min read
Youzan Coder
Sep 7, 2018 · Operations

How We Built a Configurable Online Test Monitoring System for Real‑Time CI/CD Alerts

This article details the design, evolution, and implementation of an online test‑monitoring platform that transforms CI/CD pipelines into proactive alerting systems, covering the initial Spring‑based prototype, its shortcomings, the 2.0 configurable and visual redesign, plugin architecture, and future distributed deployment plans.

Operations · ci/cd · online monitoring
0 likes · 15 min read
DevOps
Sep 5, 2018 · Operations

Five Essential Flow Metrics for Effective DevOps Transformations

This article explains five essential flow metrics—Flow Time, Flow Efficiency, WIP Report, Aging Report, and Flow Distribution—showing how they help technology companies measure outcomes, improve predictability, and optimize DevOps transformations through data‑driven insights.

DevOps · Operations · flow metrics
0 likes · 11 min read
Qunar Tech Salon
Sep 5, 2018 · Operations

Tencent SNG Operations: Business Profiling for Capacity Planning, Activity Modeling, and Multi‑Region Deployment

The article explains how Tencent's SNG operations team uses business profiling—including capacity, activity, core‑link, and SET models—to address performance testing across device types, forecast activity‑driven resource needs, identify core versus peripheral services, and plan reliable multi‑region deployments.

Operations · business profiling · capacity planning
0 likes · 9 min read
MaGe Linux Operations
Sep 2, 2018 · Operations

Essential Linux Ops Interview Guide: 30+ Questions & Solutions

A comprehensive collection of Linux operations interview questions and answers covering topics such as system maintenance, networking, load balancing, RAID, MySQL, scripting, security, and troubleshooting, providing practical guidance for candidates seeking high‑paying Linux sysadmin roles.

Linux · Networking · Operations
0 likes · 42 min read
360 Tech Engineering
Aug 29, 2018 · Operations

Monitoring Elasticsearch Performance: Host‑Level System and Network Metrics, Cluster Health, and Resource Saturation

This article continues the Elasticsearch performance monitoring series by detailing host‑level system and network metrics, cluster health and node availability, resource saturation, and related errors, providing practical guidance on disk space, I/O, CPU, network throughput, file descriptors, HTTP connections, thread pools, caches, pending tasks, and failed GET requests.

Elasticsearch · Operations · Performance Monitoring
0 likes · 14 min read
Efficient Ops
Aug 28, 2018 · Operations

How to Detect and Resolve Time‑Series Anomalies in Modern AIOps

This article explains practical approaches for time‑series anomaly detection, multi‑dimensional drill‑down analysis, alarm‑convergence root‑cause analysis, and future AIOps planning, combining statistical methods, unsupervised learning, and supervised models to improve monitoring accuracy and operational efficiency.

Operations · Root Cause Analysis · Unsupervised Learning
0 likes · 20 min read
Alibaba Cloud Developer
Aug 28, 2018 · Operations

How Alibaba Achieves Full‑Link Business Monitoring: A Practical Guide

Alibaba’s infrastructure team introduces a full‑link business monitoring approach that visualizes end‑to‑end health from a business perspective, unifies metrics, automates data collection, and leverages intelligent baseline alerts, enabling rapid issue detection, precise root‑cause analysis, and fine‑grained dimension monitoring across services.

Alibaba · Operations · business metrics
0 likes · 11 min read
Qunar Tech Salon
Aug 23, 2018 · Operations

Alibaba Search Middle Platform DevOps Practices: Sophon, Bahamut, and AIOps

This article details Alibaba's three‑year journey building a search middle platform, describing how DevOps, goal‑driven operations, and AI‑assisted automation (Sophon, Bahamut, and AIOps) were introduced to improve scalability, stability, and efficiency for large‑scale search services.

Bahamut · DevOps · Operations
0 likes · 16 min read
Efficient Ops
Aug 21, 2018 · Operations

How Tencent SNG Uses Business Profiling to Optimize Capacity, Activity, and Multi‑Region Deployment

This article explains how Tencent's SNG operations team builds and applies business profiling models—including capacity, activity, core‑link, and SET planning—to predict performance, automate scaling, identify critical services, and efficiently distribute workloads across multiple regions.

Operations · activity modeling · capacity planning
0 likes · 11 min read
HomeTech
Aug 21, 2018 · Operations

Automated Asset Collection for CMDB Using Puppet Facter and Assets_Report

This article explains how to build an automated CMDB asset collection system by extending Puppet's Facter with custom plugins, using a custom Report Processor to post data to an AutoBank service, and deploying a Python/Django API server for storage and retrieval.

Asset Collection · CMDB · Django
0 likes · 7 min read
MaGe Linux Operations
Aug 20, 2018 · Operations

Essential Linux Performance Tools: Quick Guide to Diagnose System Bottlenecks

This article compiles and explains a set of Linux command‑line utilities—including uptime, dmesg, vmstat, mpstat, pidstat, iostat, free, sar and top—showing how to interpret their output to quickly identify CPU, memory, I/O, and network performance issues, with practical examples and key columns to monitor.

Linux · Operations · Performance Monitoring
0 likes · 18 min read
Efficient Ops
Aug 15, 2018 · Operations

Why Multi‑Threaded Downloads Spike Bandwidth and How to Diagnose Them

This article examines a real‑world case where a client’s multi‑threaded download caused sudden internet‑outbound bandwidth congestion, details the packet‑level investigation that revealed partial HTTP requests, explains the underlying network traffic analysis architecture, and outlines how automated monitoring and alerts improve operations efficiency.

Operations · bandwidth monitoring · multi-threaded download
0 likes · 10 min read
Big Data and Microservices
Aug 15, 2018 · Operations

What Is APM? A Deep Dive into Application Performance Management and Top Open‑Source Tools

This article explains Application Performance Management (APM), its role in monitoring distributed and micro‑service systems, outlines the five‑dimensional APM model, details core monitoring functions, and reviews leading open‑source APM solutions such as PinPoint, Zipkin, SkyWalking, Prometheus, CAT and Hawkular.

APM · Distributed Tracing · Operations
0 likes · 8 min read
DevOps
Aug 10, 2018 · Operations

Effective Strategies for Promoting DevOps with Minimal Risk and Cost

This article examines how enterprises can adopt DevOps with minimal risk and cost by leveraging agile management, continuous delivery frameworks like the 100‑to‑100 model, Conway’s Law, automation, scripting, and containerization with Docker, while also presenting a recruitment call for DevOps engineers.

Continuous Delivery · Conway's law · DevOps
0 likes · 9 min read
Efficient Ops
Aug 9, 2018 · Operations

How a Bank Built an Automated Operations Platform with Ansible and Open‑Source Tools

This article outlines the motivations, design principles, system architecture, and key tools (Cobbler, Ignite‑UX, WSUS, and Ansible) behind a bank’s automated operations platform. It also details Ansible’s features, its capabilities across Linux, HP‑UX, Windows, and OpenStack, and practical application scenarios such as batch changes, software installation, and environment delivery.

Ansible · Banking · IT infrastructure
0 likes · 22 min read
58 Tech
Aug 8, 2018 · Databases

58 Cloud DB Platform: Architecture, Automation, and Intelligent Operations

The article presents a detailed case study of the 58 Cloud DB Platform, describing its architecture, automated workflow using Celery and Ansible, and intelligent features such as server selection and alarm merging powered by machine‑learning, highlighting how it streamlines MySQL, Redis, and MongoDB operations for developers and DBAs.

MongoDB · Operations · automation
0 likes · 10 min read
Qunar Tech Salon
Aug 7, 2018 · Operations

Comprehensive DevOps Glossary, Tool Periodic Table, and Skill Roadmap

This article presents an extensive DevOps glossary covering key terms and practices, a detailed periodic table of DevOps tools, and a skill roadmap outlining the essential knowledge and technologies needed to successfully implement DevOps in modern software delivery.

Continuous Delivery · DevOps · Operations
0 likes · 16 min read
Efficient Ops
Aug 6, 2018 · Cloud Native

How We Built a Hybrid Container‑VM Private Cloud: Lessons from a Large‑Scale Deployment

This article details the challenges and solutions encountered while transitioning a rapidly growing financial services platform from a VM‑centric private cloud to a hybrid environment that combines containers and virtual machines, covering network integration, IP management, container image standards, resource isolation, scheduling compatibility, and future lightweight container strategies.

Cloud Native · Macvlan · Operations
0 likes · 10 min read
ITPUB
Aug 3, 2018 · Operations

How to Monitor Log Files in Real-Time with Python: 3 Simple Methods

When high service reliability demands immediate detection of slow requests, this guide shows three Python techniques—using tail via subprocess, file.tell/seek loops, and a generator with yield—to continuously watch log files and trigger alerts as soon as specified patterns appear.
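
The third technique named above, a generator that yields lines as they are appended, could be sketched roughly as follows. This is not code from the article: the names `follow`, `matching`, `poll_interval`, and `max_idle_polls` are hypothetical, and the sketch assumes a plain text log that is only ever appended to (no rotation or truncation).

```python
import re
import time
from typing import Iterable, Iterator, Optional, TextIO


def follow(
    handle: TextIO,
    poll_interval: float = 0.5,
    max_idle_polls: Optional[int] = None,
) -> Iterator[str]:
    """Yield complete lines appended to an open file, similar to `tail -f`.

    Reading starts at the handle's current position. With max_idle_polls=None
    the generator never finishes; a number makes it stop after that many
    consecutive empty polls (useful for tests and batch runs).
    """
    idle_polls = 0
    while True:
        position = handle.tell()
        line = handle.readline()
        if line.endswith("\n"):
            idle_polls = 0
            yield line.rstrip("\n")
            continue
        # Nothing new, or only a partially written line: rewind and wait.
        handle.seek(position)
        if max_idle_polls is not None and idle_polls >= max_idle_polls:
            return
        idle_polls += 1
        time.sleep(poll_interval)


def matching(lines: Iterable[str], pattern: str) -> Iterator[str]:
    """Pass through only the lines that contain the regular expression."""
    regex = re.compile(pattern)
    return (line for line in lines if regex.search(line))
```

To watch a live log, one would open the file, move to its end with `handle.seek(0, os.SEEK_END)`, and loop over `matching(follow(handle), pattern)`, triggering an alert for each line produced. Polling keeps the sketch dependency-free at the cost of up to `poll_interval` seconds of latency.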

Log Monitoring · Operations · Python
0 likes · 4 min read
360 Zhihui Cloud Developer
Aug 2, 2018 · Operations

How to Build Systems That Run Stably for 10 Years

This article shares practical methodologies for building software systems that remain stable for a decade, covering goal setting, holistic design, network operator and data‑center choices, cross‑region active‑active challenges, server and platform selection, comprehensive monitoring, and the importance of continuous personal improvement.

Continuous Improvement · Operations · Software Architecture
0 likes · 7 min read
Efficient Ops
Aug 1, 2018 · Operations

How Tencent Revolutionized Monitoring: From IDC Crises to AI‑Driven AIOps

This talk by Tencent’s monitoring R&D lead outlines a decade of evolution in large‑scale monitoring, covering real‑world incident cases, the three drivers behind architectural upgrades, the implementation of a three‑dimensional monitoring framework, and the application of AI‑powered AIOps for precise, rapid anomaly detection.

Big Data · Operations · aiops
0 likes · 18 min read
DevOps
Aug 1, 2018 · Operations

A Simple DevOps Value System: Business, Architecture, Technology, People, Process, Tools, Principles, Methods, Practices

The article presents a straightforward DevOps value framework that links business, architecture, technology, people, process, tools, principles, methods, and practices, illustrating how each element supports the others and offering practical guidance for startups, micro‑service adoption, and economic decision‑making in software delivery.

DevOps · Operations · business
0 likes · 12 min read
MaGe Linux Operations
Jul 28, 2018 · Operations

Master the Most Common Ansible Modules: From ping to get_url

This guide introduces the most frequently used Ansible modules—including ping, setup, file, copy, service, cron, yum, user, group, synchronize, mount, and get_url—explaining their purpose, key options, and providing concrete command‑line examples to help you automate system tasks efficiently.

Ansible · DevOps · Modules
0 likes · 15 min read
Open Source Tech Hub
Jul 19, 2018 · Operations

How to Retrieve Jenkins Initial Admin Password on Windows

This guide explains what Jenkins is—a free, powerful CI/CD platform for any build or deployment—and shows the exact command to display the initial administrator password stored in the Jenkins home directory on a Windows host.

InitialAdminPassword · Jenkins · Linux
0 likes · 2 min read
UCloud Tech
Jul 18, 2018 · Operations

How to Build a Unified Monitoring System for Microservices: Key Dimensions & Scenarios

This article explains how microservice architectures require a comprehensive monitoring system, covering data, resource, and code dimensions, and describes eight atomic monitoring scenarios such as URL, host, product, component, custom, resource, APM, and event monitoring to help engineers design effective observability solutions.

APM · Operations · cloud-native
0 likes · 7 min read
转转QA
Jul 18, 2018 · Operations

Improving Test Efficiency and Continuous Integration with the Beetle Platform: An Interface Testing Case Study

The article discusses how embracing speed and flexible configuration in QA, exemplified by the Beetle platform’s interface testing workflow, can improve project efficiency, enable unified automated testing, and integrate continuous integration, while emphasizing that tools alone cannot guarantee test quality.

Operations · Software quality · TestNG
0 likes · 9 min read
Efficient Ops
Jul 11, 2018 · Operations

How Tencent Scales Automated Operations with Package Management and CMDB

This article outlines Tencent's automated operations framework, covering the evolution of its package management system, multi‑center organizational structures, CMDB resource imaging, process automation, version control, and release management, while sharing practical lessons and pitfalls from real‑world deployments.

CMDB · DevOps · Operations
0 likes · 21 min read
ITPUB
Jul 11, 2018 · Operations

Parallelizing Bash Loops Without Extra Tools: Practical Shell Techniques

This article explains how Linux administrators can replace slow serial shell loops with concurrent executions using background processes, simulated queues, and FIFO pipes, providing step‑by‑step scripts, performance comparisons, and practical guidelines to control process counts safely.

Bash · Operations · Parallel
0 likes · 10 min read
Efficient Ops
Jul 8, 2018 · Operations

How to Seamlessly Take Over New Ops Responsibilities: A Practical Checklist

This guide outlines a step‑by‑step approach for taking over new operational responsibilities, covering communication with development leaders, business overview, asset inventory, basic and business‑specific monitoring, standardization, SOP creation, failure drills, cost and capacity planning, and effective cross‑team communication.

Operations · asset management · handovers
0 likes · 10 min read
Efficient Ops
Jul 3, 2018 · Operations

From Fire‑Fighting to Proactive Delivery: How Meizu Built a Cloud‑Native CI/CD Ops Platform

Meizu’s operations team transformed reactive firefighting into proactive delivery by building a cloud‑native continuous integration platform, detailing their automation journey, challenges, platform components, release evolution, and intelligent ops that together boost quality, efficiency, cost control, and security.

Operations · automation · cloud delivery
0 likes · 16 min read
MaGe Linux Operations
Jul 1, 2018 · Operations

Essential Linux Commands and Options: A Comprehensive Guide

This article provides a detailed reference of common Linux commands—including ls, mv, cp, scp, rm, touch, cd, mkdir, find, grep, tar, chmod, and many others—explaining each option, flag, and typical usage examples to help system administrators and developers work efficiently in the shell.

Linux · Operations · Shell
0 likes · 34 min read
Efficient Ops
Jun 27, 2018 · Operations

How ZhiYun Job Platform Revolutionizes Automated Operations

The article introduces the ZhiYun Job Platform, detailing its evolution from basic tool construction to advanced orchestration and API integration, highlighting how it standardizes, automates, and secures repetitive operational tasks for enterprises across cloud environments.

Operations · Orchestration · automation
0 likes · 10 min read
DataFunTalk
Jun 24, 2018 · Big Data

OPPO Big Data Platform Operations and R&D Practices: Architecture, Scaling, and Monitoring

This article summarizes OPPO's rapid growth of its big‑data platform, detailing the three‑layer architecture, the evolution from Flume‑Kafka to NiFi for data ingestion, the upgrade of the OFlow task scheduler, comprehensive monitoring of data, resources and task SLA, and the development of a self‑service analytics tool called InnerEye to ensure stability, efficiency, and security.

Airflow · Big Data · NiFi
0 likes · 10 min read
Architecture Digest
Jun 24, 2018 · Databases

Designing a High‑Availability Redis Service with Sentinel

This article explains how to build a highly available Redis service using Sentinel, discusses failure scenarios, compares single‑instance, master‑slave, and multi‑Sentinel architectures, and provides practical guidance on deployment, VIP handling, and operational considerations.

Operations · redis · sentinel
0 likes · 11 min read
ITPUB
Jun 23, 2018 · Operations

How to Diagnose Server Failures Within the First 5 Minutes

This guide walks you through a systematic, step‑by‑step process for quickly identifying the root cause of a server outage, covering everything from gathering context and checking who is logged in to inspecting processes, network services, hardware, I/O performance, filesystem mounts, and relevant logs.

Operations · monitoring · server troubleshooting
0 likes · 8 min read