Tagged articles
DevOps Cloud Academy
Oct 19, 2019 · Operations

Resolving Common SonarQube Platform Issues: Data Instability, Rule Configuration, and Project Authorization

This article explains how to address three common SonarQube challenges—data instability across branches, difficulty assigning quality profiles, and project permission management—by creating per‑branch projects, using Jenkins pipeline scripts with Sonar REST APIs, and applying permission templates to streamline large‑scale code‑quality scanning.

DevOps, Jenkins, Operations
7 min read

Alibaba Cloud Native
Oct 16, 2019 · Cloud Native

Master the Distributed Systems Knowledge Map: From SOA to MSA and Beyond

This comprehensive guide walks you through the fundamentals, design patterns, consistency models, core components, and engineering practices of modern distributed systems, helping you understand micro‑service architecture, network protocols, data management, fault tolerance, and performance optimization in cloud‑native environments.

Cloud Native, Consistency, Microservices
32 min read

Java Captain
Oct 12, 2019 · Operations

Curated List of Free Technical Books Covering Linux, System Administration, Networking, and More

This article presents a curated collection of over a hundred free technical books—including Linux command‑line guides, system‑administration manuals, computer‑networking textbooks, and Docker tutorials—complete with brief descriptions, download links, and the impressive GitHub star and fork statistics of the source project.

Linux, Operations, System Administration
8 min read

Efficient Ops
Oct 9, 2019 · Operations

From IT Maintenance to IT Operations: Why the Shift Matters

This article explores the nuanced differences between IT maintenance (IT运维) and IT operations (IT运营), explaining how organizations transition from merely keeping systems alive to delivering high‑quality, business‑centric services that satisfy users, executives, and IT staff alike.

IT Operations, Operations, automation
19 min read

Architects' Tech Alliance
Oct 9, 2019 · Operations

Understanding Linux Virtual Server (LVS) Load Balancing: Principles, Implementation Methods, and Scheduling Algorithms

This article explains the role of load balancers in large‑scale internet applications, introduces Linux Virtual Server (LVS) as a Layer 4 (transport‑layer) software load balancer, describes its architecture and its NAT, TUN, and DR forwarding methods, and details static and dynamic scheduling algorithms such as Round Robin, Weighted Least‑Connection, and locality‑based strategies.

LVS, Linux, Operations
11 min read

Efficient Ops
Oct 8, 2019 · Operations

Build a Docker Container Monitoring Stack with cAdvisor, InfluxDB, and Grafana

To monitor Dockerized services effectively, this guide walks through selecting a monitoring solution, deploying cAdvisor, integrating it with InfluxDB for persistent storage, visualizing metrics in Grafana, and addressing common issues such as missing utilities, memory stats, and network traffic inaccuracies.

Grafana, InfluxDB, Operations
15 min read

DevOps Cloud Academy
Oct 7, 2019 · Operations

GitLab High Availability Solution with DRBD

This guide details a step‑by‑step setup of a highly available GitLab service using two virtual machines, DRBD for block‑level replication, configuration of GitLab and PostgreSQL directories, DRBD resource creation, service start‑up, and manual primary‑secondary failover procedures.

DRBD, GitLab, Linux
8 min read

37 Interactive Technology Team
Sep 27, 2019 · Operations

Centralized Management of Cron Jobs: Challenges and Solutions

The article outlines how a company built a centralized cron‑job platform—using Python’s crontab library, SaltStack deployment, ELK log aggregation, and automated email alerts—to integrate existing tasks, provide reliable CRUD operations, enable fast log querying, and detect failures, cutting operational overhead while managing thousands of scheduled jobs across multiple servers.

Log Management, Operations, Python
8 min read

Liangxu Linux
Sep 25, 2019 · Operations

Understanding Linux Load Average: What the Numbers Really Mean

This article explains what Linux load average measures, how to read the 1‑, 5‑, and 15‑minute values, what they indicate on single‑core and multi‑core systems, and which thresholds should raise alerts for system performance monitoring.

Linux, Load Average, Operations
6 min read

转转QA
Sep 25, 2019 · Operations

Comprehensive Testing Strategies for Advertising Recall Systems

The article outlines a complete testing framework for advertising recall services, analyzing three demand types, defining testing focus for each, and presenting tools for log comparison, recall result verification, result comparison, and batch regression to ensure high‑quality ad delivery and revenue stability.

Advertising, Backend, Operations
8 min read

Efficient Ops
Sep 23, 2019 · Operations

How to Build an Effective CMDB for Scalable Operations Management

This article explains the step‑by‑step process of constructing a configuration management database (CMDB) for operations, covering resource modeling, data integration, organizational structures, maintenance methods, and how a well‑designed CMDB supports higher‑level business operations such as automation, visualization, and capacity planning.

CMDB, ITIL, Operations
14 min read

Architects Research Society
Sep 23, 2019 · Operations

Curated List of Open‑Source Workflow Engines and BPM Tools

This article presents a comprehensive, categorized list of open‑source workflow engines and BPM tools—including Airflow, Argo, Cadence, Camunda, and many others—detailing their primary features and typical use cases for orchestration, data pipelines, and micro‑service coordination.

Engine, Operations, Orchestration
4 min read

Efficient Ops
Sep 22, 2019 · Operations

How Experts Refined the DevOps Technical Operations Assessment Method

The September 2019 workshop convened over 30 DevOps specialists from leading Chinese enterprises to review and improve the evaluation method for the DevOps Capability Maturity Model Part 4: Technical Operations, resulting in a more complete and standardized assessment framework.

Assessment Method, Operations, standardization
3 min read

Programmer DD
Sep 20, 2019 · Operations

Master Prometheus: Key Features, Architecture, and Query Essentials

This article introduces Prometheus, an open‑source cloud‑native monitoring and alerting system, covering its main characteristics, core components, architecture diagram, typical use cases, query language syntax, built‑in functions, time‑series types, and practical tips for reliable operation.

Alerting, Operations, PromQL
9 min read

Efficient Ops
Sep 18, 2019 · Databases

Why the DBA Role Is Becoming a Narrowed, High‑Risk Career Path

The article analyzes how the DBA job market is shrinking as traditional enterprises shift away from legacy systems, cloud adoption reshapes responsibilities, and DBAs face limited advancement unless they transition to architecture or data‑analytics roles, highlighting the growing risk and low reward of staying in pure DBA work.

Big Data, DBA, Database Administration
7 min read

DevOps Cloud Academy
Sep 8, 2019 · Operations

SSO and WebHook Integration Guide for GitLab and Jenkins

This guide details step‑by‑step configurations for integrating Single Sign‑On (SSO) and WebHook between GitLab and Jenkins, covering GitLab application setup, Jenkins backup and proxy adjustments, plugin installation, token generation, and testing the connection to ensure successful builds.

GitLab, Integration, Jenkins
2 min read

DevOps Cloud Academy
Sep 8, 2019 · Operations

Jenkins User, Credential, and Permission Management Guide

This guide explains how to configure Jenkins user management, credential storage, and permission settings, covering entry points, LDAP/GitLab integration, credential types, and role-based access control with detailed steps and visual illustrations for administrators.

Jenkins, Operations, Permissions
4 min read

DevOps Cloud Academy
Sep 8, 2019 · Operations

Project Management Guidelines and Jenkins Pipeline Setup

This guide outlines project naming conventions and step‑by‑step instructions for creating a new Jenkins project, configuring build history, parameterized builds, triggers, Jenkinsfile, and how to build, view logs, and debug the pipeline, illustrated with screenshots.

Jenkins, Naming Convention, Operations
2 min read

Efficient Ops
Sep 8, 2019 · Operations

Inside GNSEC 2023: How DevOps Leaders Accelerate Cloud and Digital Transformation

The one‑day GNSEC Global New‑Generation Software Engineering Summit gathered senior experts from major banks, tech giants and research institutes to showcase DevOps, cloud‑native, and digital‑transformation practices through a series of insightful talks, live demos, and award ceremonies, highlighting concrete case studies and emerging standards.

Digital Transformation, Operations, conference
9 min read

DevOps Cloud Academy
Sep 5, 2019 · Operations

An Overview of the Prometheus Monitoring System

Prometheus, an open‑source monitoring and alerting toolkit originally developed at SoundCloud and now a CNCF project, offers a multidimensional data model, a flexible query language, pull‑based data collection, four metric types (counter, gauge, summary, histogram), local and remote storage, and service discovery, and it integrates with Grafana for visualization.

Cloud Native, Metrics, Operations
8 min read

JD Retail Technology
Sep 2, 2019 · Operations

How a Real‑Time H5 Monitoring Platform Solves E‑Commerce Activity Issues

Facing frequent user complaints about broken, slow, or misleading H5 activity pages, JD’s massive e‑commerce operations categorize issues into four types and deploy the Woodpecker platform—a scalable, real‑time monitoring and analysis system that pre‑detects configuration errors, server faults, development bugs, and minor UX flaws, while offering extensible, configurable alerts and historical scans.

AI, H5 monitoring, Operations
15 min read

DevOps Coach
Aug 29, 2019 · Operations

Benchmark Your DevOps Performance with the 2019 Accelerate Report

This article walks you through the key findings of the 2019 Accelerate State of DevOps report, explains the four golden metrics, shows how to use Google’s minimalist benchmark tool to compare your organization against industry baselines, and discusses the emerging service‑operations efficiency metric.

Accelerate Report, Benchmarking, DevOps
11 min read

Efficient Ops
Aug 28, 2019 · Operations

How to Harden Linux Server Security: Account, Login, and Boot Controls

This guide details practical Linux server hardening techniques—including account cleanup, password policies, su/sudo restrictions, login controls, and BIOS/GRUB protection—while providing exact command examples for operations teams to quickly improve system security.

Account Management, Linux, Operations
12 min read

Cloud Native Technology Community
Aug 21, 2019 · Industry Insights

What Does a DevOps Consultant Actually Do? A Real‑World Walkthrough

This article shares a DevOps consultant’s personal journey, detailing the diverse responsibilities, tools, and mindset required—from early full‑stack experience and virtualization research to CI/CD pipelines, infrastructure‑as‑code, security, load balancing, and fostering a DevOps culture across teams.

Consulting, DevOps, Infrastructure as Code
9 min read

Youzan Coder
Aug 21, 2019 · Operations

How Opsflow Revolutionized Youzan's DevOps Workflow Management

This article examines the evolution of Youzan's Opsflow workflow engine, detailing its architecture, components, and how it solved numerous operational challenges such as low customizability, lack of progress visibility, and fragmented approval processes, while outlining its current status and future roadmap.

DevOps, Finite State Machine, Operations
13 min read

dbaplus Community
Aug 12, 2019 · Operations

Why DevOps Matters and How to Implement It: Practical Lessons from Vipshop

This article explains the need for DevOps, contrasts it with ITIL, outlines practical steps for implementation, and shares Vipshop’s component‑centric DevOps practice, including configuration platforms, risk‑matrix control, and continuous improvement metrics, offering engineers actionable insights for real‑world deployment.

Configuration Management, DevOps, ITIL
12 min read

DevOps Cloud Academy
Aug 12, 2019 · Operations

Ansible Installation and Basic Usage Guide

This guide walks through setting up a two‑node Linux environment, installing Ansible, configuring its inventory and SSH keys, and demonstrates common Ansible commands for managing hosts, checking connectivity, and executing remote tasks.

Ansible, Configuration Management, Linux
5 min read

Efficient Ops
Aug 8, 2019 · Operations

10 Ops Murphy’s Laws Every Engineer Should Read Daily

This article shares a set of operational Murphy’s laws, practical process‑management tips, and automation strategies to help ops engineers reduce human error, improve safety, stability, efficiency, and cost‑saving in daily work.

Operations, automation, incident response
9 min read

58 Tech
Aug 7, 2019 · Operations

An Overview of the USP Deployment System: Architecture, Models, and Key Features

This article presents a detailed overview of the 58 Deployment System (USP), covering its evolution, Java‑based architecture, communication and deployment models, traffic management, one‑stop and parallel deployments, gray‑scale rollout, fast rollback, task‑driven workflow, and future direction within private‑cloud environments.

Deployment, Operations, automation
8 min read

ITPUB
Aug 5, 2019 · Operations

Mastering SSH Public‑Key Login for Batch Server Operations

This guide explains how SSH public‑key authentication works, walks through generating key pairs, shows the connection handshake, and demonstrates practical batch command execution and file collection across multiple Linux servers using ssh, scp, and nc.

Linux, Operations, Public Key Authentication
9 min read

ITPUB
Aug 5, 2019 · Operations

How a Midnight Migration Saved Millions: Lessons in Problem‑Solving for Developers

A senior engineer recounts a high‑pressure, overnight data‑migration from an overloaded legacy platform to a new micro‑service system, detailing the technical challenges, rapid troubleshooting, multithreaded workarounds, and the broader lessons on what truly makes a programmer great.

Backend, Operations, multithreading
16 min read

Java Captain
Aug 3, 2019 · Operations

Practical Guide to Viewing Logs, Processes, Ports, and System Status on Linux

This article provides a comprehensive, step‑by‑step tutorial on using Linux command‑line tools such as cat, tail, vim, grep, sed, ps, netstat, lsof, and free to efficiently view large log files, locate specific entries, monitor processes and ports, and assess overall system health.

Linux, Log Management, Operations
8 min read

iQIYI Technical Product Team
Aug 2, 2019 · Operations

iQIYI CDN IPv6 Deployment Architecture and Implementation

iQIYI’s CDN scheduling system was redesigned for dual‑stack IPv4/IPv6, adding Anycast DNS, IPv6‑aware probes, and hybrid CDN integration, while upgrading data‑center, backbone, and server configurations through automated SDN and management platforms, enabling over 100 million IPv6 users and gigabit‑scale traffic.

CDN, IPv6, Operations
18 min read

Ops Development Stories
Jul 29, 2019 · Operations

Mastering Nginx Reverse Proxy, Load Balancing, and Caching

This article explains how to configure Nginx as a reverse proxy, implement load‑balancing strategies, separate static and dynamic content, set up proxy caching with various directives, purge caches, and enable gzip compression, providing complete code examples and practical testing results.

Gzip, Nginx, Operations
17 min read

DevOps
Jul 29, 2019 · Operations

Google’s Continuous Delivery Practices and SRE Culture: A DevOps Case Study

This article examines Google’s corporate values, development history, culture, and detailed DevOps and Site Reliability Engineering practices—including continuous delivery, SRE responsibilities, and Google Cloud Platform CI/CD tools—to illustrate how the company achieves 24/7 reliable service deployment at massive scale.

Continuous Delivery, DevOps, Google
15 min read

DevOps
Jul 26, 2019 · Operations

Amazon’s DevOps Journey: From Customer Obsession to Continuous Delivery

This article examines Amazon’s evolution—from its early focus on books and relentless customer obsession to the adoption of micro‑service architecture, two‑pizza teams, and a high‑velocity continuous delivery pipeline—illustrating how strategic cultural and technical choices drive massive operational efficiency.

Amazon, Continuous Delivery, Customer Obsession
9 min read

Efficient Ops
Jul 25, 2019 · Operations

How Tencent’s Ops Teams Move Massive Workloads to the Cloud and Boost Efficiency

Tencent’s recent Operations Open Day showcased how its engineers migrated billions of users to public cloud, leveraged cloud‑native DevOps, serverless functions, and intelligent data‑center management to dramatically improve efficiency, scalability, and reliability across its massive infrastructure.

Cloud Native, Operations, Serverless
9 min read

DevOps
Jul 25, 2019 · Operations

Why DevOps Teams Often Turn Into Tool Chains and What an Ideal DevOps Team Structure Looks Like

The article analyzes why many DevOps teams devolve into tool‑chain or pipeline roles, examines executor and organizational factors, presents a six‑role DevOps team model linked to the Six Thinking Hats, shares community viewpoints on role prioritization, and concludes that DevOps structures must be tailored to solve concrete business problems rather than follow a fixed standard.

DevOps, Operations, Team Structure
12 min read

58 Tech
Jul 23, 2019 · Operations

Design and Implementation of an Open Alarm Platform for Monitoring Systems

The Open Alarm Platform provides a flexible data model, modular architecture, and robust stability features to enable various business lines to integrate their custom monitoring systems via APIs, offering alert convergence, merging, multi‑channel delivery, and comprehensive management while reducing development and maintenance costs.

Alerting, Operations, Scalability
9 min read

Xianyu Technology
Jul 23, 2019 · Operations

Automated Service Fault Localization System Architecture

The automated service fault localization system ingests massive real‑time instrumentation data, builds call‑chain graphs, and instantly pinpoints the exact component causing timeouts or other errors, achieving developer‑level accuracy within seconds instead of minutes while remaining simple, fast, and fully automated.

Big Data, Fault Localization, Operations
8 min read

DevOps
Jul 22, 2019 · Operations

DevOps Team Topologies: Anti‑Types, Types, and Choosing the Right Structure

This article explains the various DevOps team topologies—including anti‑patterns A‑G and nine positive types—detailing their characteristics, applicability, and potential effectiveness so organizations can select the most suitable structure for their value‑stream delivery goals.

Anti‑Pattern, DevOps, Operations
14 min read

360 Zhihui Cloud Developer
Jul 18, 2019 · Operations

Why Bosun Beats Alertmanager and Kapacitor for Container Alerting

This article compares three container alerting frameworks—Alertmanager, Kapacitor, and Bosun—explains why Bosun was chosen for its flexible HTTP API rule deployment and low learning curve, and provides step‑by‑step configuration, rule definition, notification, and templating examples for integrating Bosun with Prometheus.

Alerting, Bosun, Configuration
9 min read

MaGe Linux Operations
Jul 12, 2019 · Operations

Essential Linux Commands Every Engineer Should Master

This guide compiles the most indispensable Linux commands—from directory and file manipulation, navigation, and text processing to compression, daily system administration, status monitoring, networking, and database access—providing concise examples and practical tips for both beginners and seasoned users.

Linux, Operations, command-line
14 min read

Ctrip Technology
Jul 11, 2019 · Cloud Native

Ctrip’s Continuous Delivery Practices and Unified Build Platform with Jenkins on Kubernetes

This article describes Ctrip’s large‑scale continuous delivery system, its benefits for efficiency, quality, reliability and team collaboration, the evolution of its deployment models, the design of a unified Jenkins‑based build platform, and practical experiences running Jenkins on Kubernetes with elastic scheduling and workspace management.

Cloud Native, Continuous Delivery, DevOps
19 min read

360 Tech Engineering
Jul 8, 2019 · Operations

Common ETCD Issues and Recovery Procedures

This guide explains ETCD’s high‑availability architecture and provides detailed step‑by‑step recovery procedures for single‑node failures, majority‑node outages, and database‑space‑exceeded errors, including status checks, member removal and addition, snapshot restoration, compaction, defragmentation, and alarm clearing.

Backup, Operations, Recovery
7 min read

Architects' Tech Alliance
Jun 30, 2019 · Operations

How DNS and GSLB Enable Multi-Active Data Center Load Balancing

This article explains DNS fundamentals, the step‑by‑step resolution process, TTL caching, and how DNS‑based Global Server Load Balancing (GSLB) can direct traffic to the nearest active data‑center, providing a practical guide for building multi‑active, high‑availability infrastructures.

DNS, GSLB, Operations
10 min read

MaGe Linux Operations
Jun 30, 2019 · Operations

Mastering Load Balancing: LVS, Nginx, and HAProxy Explained

This article introduces server clustering and load‑balancing concepts, compares popular software such as LVS, Nginx, and HAProxy, explains their architectures, NAT and DR modes, and outlines each solution's strengths and weaknesses for building high‑performance web services.

HAProxy, LVS, Operations
14 min read

21CTO
Jun 27, 2019 · Operations

From Hundreds to Thousands: Scaling Operations and Building a Custom Monitoring System

This article recounts AdMaster's five‑year journey from a few dozen servers to thousands, detailing the evolution of their monitoring infrastructure, the challenges faced at each scale stage, and the design of a self‑built, distributed monitoring platform that delivers real‑time alerts, visualized data, and business‑level insights.

Infrastructure, Operations, scaling
14 min read

ITPUB
Jun 26, 2019 · Operations

How to Prevent Catastrophic rm -rf Mistakes in Linux Shell Scripts

This article explains common scenarios where empty variables, spaces, special characters, or failed directory changes cause accidental deletions in Linux, and provides practical shell techniques—such as quoting, parameter expansion, set -u, and logical checks—to safeguard against disastrous rm -rf commands.

Linux, Operations, Safety
8 min read

Efficient Ops
Jun 23, 2019 · Operations

How to Diagnose and Fix Java Application Slowdowns: CPU, GC, and Thread Issues

This guide explains how to identify and resolve common Java production problems such as sudden CPU spikes, excessive Full GC, thread blocking, waiting states, and deadlocks by using tools like top, jstack, jstat, and memory‑dump analysis to pinpoint the root cause and apply appropriate fixes.

Operations, Thread Dump, gc
18 min read

DevOps Cloud Academy
Jun 20, 2019 · Operations

Step-by-Step Installation and Configuration of Node Exporter, Alertmanager, Prometheus, and Grafana for Monitoring and Alerting

This guide walks through downloading, extracting, and setting up Node Exporter, Alertmanager, Prometheus, and Grafana on a Linux server, configuring their systemd services, customizing alert rules, and verifying the monitoring and alerting pipeline with screenshots of each verification step.

Alertmanager, Grafana, Operations
7 min read

ITPUB
Jun 20, 2019 · Operations

Essential Ops Lessons: Avoid Disasters with Backups, Permissions, and Monitoring

This article shares hard‑earned operational guidelines for Linux servers, covering safe testing, cautious use of rm ‑rf, the importance of backups, strict access control, SSH hardening, firewall rules, intrusion detection, systematic monitoring, performance tuning, and maintaining a calm mindset to prevent costly incidents.

Operations, Server Administration, monitoring
12 min read

Efficient Ops
Jun 11, 2019 · Operations

What Powers WeChat’s Billion‑User Scale? Inside Its DevOps Journey

WeChat, China’s top social app with over a billion users, has applied DevOps practices to dramatically improve development efficiency and code quality and to shorten the feedback cycle from requirements to delivery, while confronting real‑world challenges in tooling, processes, reliability, and automation costs.

Continuous Delivery, Operations, WeChat
3 min read

DevOps Cloud Academy
Jun 9, 2019 · Operations

Prometheus Metric Definitions, Types, and Data Samples

This article explains Prometheus metric naming conventions, label usage, metric types such as Counter, Gauge, Summary, and Histogram, and describes the structure of data samples, providing examples and best‑practice guidelines for defining and classifying metrics in monitoring systems.

Metrics, Operations, Prometheus
5 min read

Programmer DD
Jun 7, 2019 · Operations

Why Most Alerts Fail and How to Build Actionable Monitoring

This article explains the fundamental flaws of typical alert systems, distinguishes between business rule and reliability monitoring, outlines essential metrics and strategies for effective alerts, and presents simple yet powerful anomaly‑detection algorithms to ensure alerts are actionable and reduce noise.

Alerting, Operations, Reliability
21 min read

MaGe Linux Operations
Jun 3, 2019 · Operations

How to Safely Prevent Accidental rm -rf Deletions in Linux Shell

This article explains common scenarios that lead to accidental directory or file deletions in Linux shell scripts—such as empty variables, spaces in paths, special characters, and failed cd commands—and provides practical Bash techniques like variable expansion checks, quoting, set -u, logical short‑circuiting, and safer prompts to avoid catastrophic rm -rf mistakes.

Bash, Linux, Operations
8 min read

Efficient Ops
May 30, 2019 · Operations

Enterprise‑Scale DevOps Secrets from China’s Top Banks Revealed

The 2019 Enterprise‑Level DevOps Empowerment Forum in Chengdu gathered experts from major Chinese banks and telecoms to share practical experiences, including China Merchants Bank’s K8s‑based pipeline, measurement challenges, and collaborative Q&A, illustrating how organizations can accelerate DevOps adoption and improve delivery efficiency.

Enterprise · Kubernetes · Operations
0 likes · 9 min read
MaGe Linux Operations
May 29, 2019 · Operations

Essential Linux Ops Tools: Install & Use Nethogs, IOZone, IOTop, and More

This guide introduces a collection of practical Linux operations tools—including Nethogs, IOZone, IOTop, IPtraf, IFTop, HTop, NMON, MultiTail, Fail2ban, Tmux, Agedu, NMap and Httperf—providing concise installation commands, basic usage examples, and key options to help system administrators monitor performance, security and resources efficiently.

Linux · Operations · performance
0 likes · 11 min read
21CTO
May 24, 2019 · Operations

How Meituan’s R&D Team Cut Tens of Millions in Resource Costs: A Practical Guide

This article details Meituan's R&D team's systematic PDCA‑based approach to resource cost optimization, covering methodology definition, planning, execution, checking, and iterative improvement across infrastructure, big‑data, and shared services, ultimately saving tens of millions of yuan.

Big Data · Cost Optimization · Operations
0 likes · 22 min read
Beike Product & Technology
May 23, 2019 · Backend Development

Investigation of Nginx 502 Errors Caused by PHP‑FPM Warning Triggering a FastCGI Buffer Defect

This article analyses why seemingly normal PHP‑FPM requests can cause Nginx to return 502 errors, revealing a FastCGI fastcgi_buffer_size bug triggered by warning output, describing the reproduction steps, detailed packet analysis, the underlying protocol mechanics, and practical recommendations for developers and operators.

502 error · Nginx · Operations
0 likes · 17 min read
Efficient Ops
May 21, 2019 · Operations

Essential Linux Ops Tools: Nethogs, IOZone, IOTop, and More

This guide introduces a dozen practical Linux operations tools, including Nethogs, IOZone, IOTop, IPtraf, IFTop, Fail2ban, and Tmux, with concise descriptions, download links, and ready‑to‑run installation commands to help system administrators boost monitoring, performance testing, and security on their servers.

Linux · Operations · monitoring
0 likes · 12 min read
Efficient Ops
May 16, 2019 · Operations

How Alibaba’s AI‑Powered Data Centers Achieve Scalable, Reliable Operations

This article examines Alibaba Cloud’s intelligent data center ecosystem, covering market share, global distribution, operational challenges, AIOps evolution, multi‑layered infrastructure platforms, demand forecasting, fault prediction, and future smart‑automation prospects for large‑scale cloud operations.

Alibaba Cloud · Infrastructure · Operations
0 likes · 13 min read
Liangxu Linux
May 15, 2019 · Operations

Essential Backup Tools for Developers: Git, Rsync, Dropbox, and Time Machine

This guide reviews four practical backup solutions—Git for versioned file control, Rsync for command‑line incremental syncing, Dropbox for cloud‑based GUI storage, and macOS Time Machine for full system snapshots—explaining their key features, typical use cases, and basic setup steps.

Backup · Git · Operations
0 likes · 6 min read
Efficient Ops
May 14, 2019 · Operations

How to Master Multi‑Cloud Operations: Lessons from a Gaming Company’s Hybrid Architecture

This talk shares a senior director’s experience building a hybrid multi‑cloud infrastructure for a game company, covering stability, efficiency, cost challenges, design‑for‑failure principles, standardization, resource automation, and the cultural and organizational factors that affect successful cloud operations.

Cost Optimization · DevOps · Operations
0 likes · 20 min read
Qunar Tech Salon
May 14, 2019 · Operations

Understanding Linux Cgroups for Container Resource Management

This article explains the fundamentals of Linux control groups (cgroups), their components and relationships, and provides step‑by‑step guidance on creating hierarchies, mounting, configuring subsystems, and applying cgroup limits to Docker and Kubernetes containers.

Container · Operations · cgroups
0 likes · 9 min read
Architects' Tech Alliance
May 8, 2019 · Operations

How to Choose the Right Server Rack: Key Factors and Best Practices

This guide explains how to select and grade server racks, outlines essential criteria such as load capacity, ventilation, power distribution, and cable management, and compares three cable‑routing techniques to help data‑center operators make reliable, future‑proof decisions.

Operations · cable management · hardware selection
0 likes · 12 min read
Efficient Ops
May 6, 2019 · Operations

How Live Streaming Ops Ensure Real-Time Reliability at Scale

Zhang Guanshi, the operations director at Huya Live, shares how his team designs a hybrid‑cloud architecture, implements a six‑pillar reliability framework, and leverages real‑time monitoring, AIOps, and rapid‑recovery tools to maintain stable, low‑latency live video streams for millions of viewers.

Operations · cloud architecture · live streaming
0 likes · 22 min read
Efficient Ops
May 5, 2019 · Operations

How Qunar Uses AI-Driven Fault Prediction to Boost System Reliability

This article outlines Qunar's operational strategy for reducing failures and extending uptime through precise fault detection, rapid recovery, and AI-powered predictive health management, detailing the evolution of their OPS processes, practical implementations, and future challenges in applying PHM to internet services.

Operations · PHM · aiops
0 likes · 18 min read
iQIYI Technical Product Team
Apr 26, 2019 · Operations

Design and Implementation of iQIYI CDN Inspection System

iQIYI built a three‑component CDN Inspection System that automatically generates tasks, centrally processes and analyzes results, and runs edge measurements to monitor millions of hybrid CDN servers in real time, detecting configuration errors, file mismatches and traffic anomalies, enabling proactive remediation and 100 % local coverage.

CDN · Distributed Systems · Operations
0 likes · 11 min read
DevOps
Apr 24, 2019 · Operations

2019 Accelerate State of DevOps Survey: Participation Guide, Insights, and Interview with Nicole Forsgren

This article introduces the 2019 Accelerate State of DevOps survey, explains how to join the questionnaire, provides background on previous reports, shares a detailed interview with Nicole Forsgren about research design and key findings such as architecture, cloud adoption, and outsourcing, and encourages community participation.

Accelerate · DevOps · Nicole Forsgren
0 likes · 35 min read
dbaplus Community
Apr 24, 2019 · Operations

Choosing and Tuning Open‑Source Monitoring Stacks for Large‑Scale Operations

This article reviews common open‑source monitoring tools, traces the evolution of China Unicom's big‑data platform monitoring, and provides practical guidance on selecting collectors, databases, and visualization components, with detailed configurations for Prometheus, Alertmanager, and Grafana and techniques for automated recovery.

Alertmanager · Grafana · InfluxDB
0 likes · 19 min read
Efficient Ops
Apr 24, 2019 · Operations

Why Every Ops Change Should Be Treated Like a Project

This article shares practical lessons from a real‑world ops incident, emphasizing the need for clear change background, optimal timing, project‑style management, and strict process adherence to reduce risk and improve production reliability.

DevOps · Operations · best practices
0 likes · 9 min read
Didi Tech
Apr 23, 2019 · Industry Insights

What the First Global DevOps Standard Means for Didi and the Industry

The article explains the launch of the world’s first DevOps capability maturity model, the collaborative effort behind it, Didi’s role as a standards workgroup member, and how its OE (OneExperience) platform embodies the new guidelines to streamline the entire software delivery lifecycle.

Capability Maturity Model · DevOps · Didi
0 likes · 5 min read
21CTO
Apr 19, 2019 · Operations

From Junior to Senior Ops Engineer: Master the Skills to Level Up

This guide walks you through the career ladder from junior to senior operations engineer, covering essential Linux, networking, monitoring, container, automation, and security skills, while offering practical advice on job roles, learning paths, and professional growth.

DevOps · Operations · containerization
0 likes · 13 min read
ITPUB
Apr 19, 2019 · Operations

How to Level Up from Junior to Senior DevOps Engineer: A Complete Roadmap

This guide outlines the career stages, skill sets, and practical tasks for DevOps engineers—from entry‑level troubleshooting to senior‑level architecture, automation, and performance optimization—providing concrete learning paths, tools, and personal development advice to help engineers advance their operations careers.

DevOps · Linux · Operations
0 likes · 12 min read