Tagged articles
3281 articles
Efficient Ops
Jun 21, 2022 · Operations

How ICBC Revamped Its Dev/Test Environments for Agile, Scalable Operations

This article outlines how the Industrial and Commercial Bank of China's software development center redesigned its development‑testing environment operations—highlighting key characteristics, practical governance measures, current challenges, and strategic improvements to boost efficiency, automation, and resource utilization.

DevOps · Operations · Testing Environment
0 likes · 10 min read
政采云技术
Jun 21, 2022 · Big Data

Overview of the Traffic Domain and Its Data Governance Architecture

This document presents a comprehensive overview of the traffic domain in a data warehouse, covering its concepts, objectives, guiding principles, core and extension models, data quality, monitoring, scheduling, and operational practices to achieve a complete, accurate, efficient, low‑cost, and high‑value traffic data system while addressing massive data volume, consistency, and SLA challenges.

Big Data · Data Governance · Operations
0 likes · 15 min read
Efficient Ops
Jun 20, 2022 · Operations

How ICBC Built an Enterprise‑Scale DevOps Toolchain to Boost R&D Efficiency

This article details how Industrial and Commercial Bank of China (ICBC) tackled rapid product demand and limited R&D resources by designing a unified, enterprise‑level DevOps platform that streamlines continuous integration, delivery, and deployment, improves collaboration, and supports future digital transformation initiatives.

Continuous Delivery · DevOps · Enterprise
0 likes · 9 min read
Top Architect
Jun 20, 2022 · Operations

Comprehensive Nginx Installation, Configuration, and Optimization Guide

This article provides a step‑by‑step tutorial on installing Nginx, explains core directives such as listen, server_name, and location, and covers advanced topics like rate limiting, various load‑balancing algorithms, reverse proxy setup, keepalive tuning, gzip compression, CORS, anti‑leech, and integration with LVS and keepalived for high‑availability deployments.

Configuration · Operations · load balancing
0 likes · 15 min read
ITPUB
Jun 18, 2022 · Operations

How MDD and SRE Cut Mini‑Program Image‑Upload Failures from Days to Minutes

This article recounts a three‑day image‑upload outage in a mini‑program, analyzes the multi‑layer causes, and shows how combining Metrics‑Driven Development with SRE and a custom observability platform dramatically reduces diagnosis time and improves reliability.

Metrics-Driven Development · Mini Program · Operations
0 likes · 20 min read
Qunar Tech Salon
Jun 16, 2022 · Operations

Practical Chaos Engineering Practices at Qunar Travel: Architecture, Scenarios, and Automation

This article details Qunar Travel's mature chaos engineering platform built on chaosblade, covering value analysis, system architecture, shutdown and dependency drills, automated closed‑loop testing, attack‑defense exercises, and the measurable reliability improvements achieved across thousands of services.

Distributed Systems · Fault Injection · Operations
0 likes · 18 min read
Alibaba Cloud Infrastructure
Jun 16, 2022 · Operations

Renewable Energy‑Driven Data Center “Computing‑Power–Electricity” Optimized Scheduling Pilot in China

Alibaba and North China Electric Power University conducted a pioneering cross‑regional “computing‑power–electricity” optimization pilot, shifting workloads from a Jiangsu data center to a Hebei renewable‑powered site, demonstrating millisecond‑level coordinated scheduling that reduces power demand, cuts CO₂ emissions, and aligns with national green‑energy policies.

Operations · cloud computing · grid integration
0 likes · 5 min read
Ops Development Stories
Jun 16, 2022 · Operations

How to Streamline Call Center Incident Management: From Rapid Diagnosis to Automated Recovery

This article outlines a comprehensive approach to handling call‑center incidents, covering fault boundary definition, emergency recovery actions, rapid root‑cause localization, enhanced monitoring strategies, clear alerting, proactive automation, and the creation of concise, regularly exercised emergency response plans.

Operations · call center · fault-recovery
0 likes · 14 min read
IT Architects Alliance
Jun 14, 2022 · Cloud Native

Design and Challenges of Multi‑Active Architecture in Hybrid Cloud Environments

This article examines the design principles, challenges, and implementation details of a multi‑active architecture for hybrid cloud environments, covering stability, cost, efficiency, network topology, container orchestration, service discovery, traffic scheduling, and data storage, and outlines practical solutions used by the Zuoyebang platform.

Cloud Native · Cost Optimization · Kubernetes
0 likes · 13 min read
Laravel Tech Community
Jun 14, 2022 · Operations

LNMP One-Click Installation Package V1.9 Release Notes and Usage Guide

The LNMP one‑click installation package V1.9 adds support for Rocky Linux, AlmaLinux, CentOS Stream 9 and several Chinese Linux distributions, introduces PHP 8.1, new PHP extensions, MySQL 5.7/8.0 binaries, IPv6 and ZeroSSL options, and provides detailed management commands for deploying and maintaining Nginx, MySQL, PHP and related services.

LNMP · Linux · Operations
0 likes · 7 min read
Efficient Ops
Jun 14, 2022 · Operations

Unlocking XOps: From DevOps Metrics to AIOps, BizDevOps, and FinOps

This article summarizes Professor Niu Xiaoling’s GNSEC 2022 keynote, outlining the XOps framework and its five pillars—XOps, DevOps research and operational efficiency metrics, AIOps, BizDevOps, and FinOps—while detailing their drivers, maturity models, implementation examples, and the role of standards in guiding enterprises toward intelligent, cost‑effective, and business‑value‑focused software delivery.

BizDevOps · DevOps · FinOps
0 likes · 18 min read
Bilibili Tech
Jun 14, 2022 · Operations

SRE Practices for Large‑Scale Event Assurance at Bilibili

Bilibili’s SRE team ensures flawless large‑scale online events by meticulously gathering activity details, provisioning DNS, CDN, networking and compute resources, conducting multi‑stage performance tests and chaos‑engineering drills, applying layered traffic controls, maintaining historical checklists, executing predefined contingency responses, and iterating post‑mortems to drive continuous automation and reliability.

Cloud Native · Event Reliability · Operations
0 likes · 20 min read
DevOps Cloud Academy
Jun 12, 2022 · Operations

How to Retrieve Project Branches in Jenkins Pipelines

This tutorial explains how to use Jenkins, the Git Parameter plugin, and parameterized builds to fetch and display Git branch information within a pipeline, including installation steps, configuration details, and common troubleshooting tips.

Git · Jenkins · Operations
0 likes · 4 min read
Java Baker
Jun 12, 2022 · Operations

System Capacity Checklist: Key Metrics Every Architect Should Track

Architects should treat system capacity like a pre‑flight checklist, using this comprehensive guide to monitor resource usage across services, databases, and queues, and to define business metrics and state‑machine indicators that reveal bottlenecks and guide scaling decisions.

Metrics · Operations · architecture
0 likes · 5 min read
Top Architect
Jun 11, 2022 · Operations

Comprehensive Fault Handling and Emergency Response Guide for Call Center Systems

This guide details a call‑center system fault scenario and provides a step‑by‑step approach for operations teams to identify symptoms, assess impact, implement rapid recovery actions, improve monitoring, and maintain an effective emergency response plan, ensuring faster resolution and long‑term fault self‑healing.

Operations · call center · emergency plan
0 likes · 12 min read
Efficient Ops
Jun 9, 2022 · Operations

What Are the Best Chinese Alternatives to CentOS in 2022?

With CentOS 8 discontinued and CentOS 7 reaching end‑of‑life in 2024, this guide reviews the leading domestic Linux replacements—including openEuler, Anolis OS, Alibaba Cloud Linux, TencentOS, KylinOS, deepin, and Red Flag Linux—detailing their features, compatibility, and suitability for cloud and on‑premise deployments.

CentOS alternatives · Operations · Server OS
0 likes · 7 min read
Efficient Ops
Jun 8, 2022 · Operations

What Are the Key Tool Requirements in the DevOps Capability Maturity Model?

This article explains the importance of tool platforms for DevOps, outlines the DevOps Capability Maturity Model's system and tool technical requirements—covering project management, work item management, planning, documentation, knowledge, team collaboration, metrics, and portfolio management—and provides guidance on selecting suitable tools.

DevOps · Operations · Project Management
0 likes · 9 min read
Architects Research Society
Jun 4, 2022 · Operations

Improving Solr Search Stability and Performance in a High‑Traffic Personalization Service

This article describes how a team tackled stability and performance problems in a SolrCloud‑based search and recommendation stack serving 150,000 requests per minute, detailing root‑cause analysis, memory and GC tuning, replica configuration changes, and the resulting reductions in latency, resource usage, and operational complexity.

Operations · Scalability · cloud
0 likes · 14 min read
Big Data Technology & Architecture
Jun 2, 2022 · Operations

Common Operational, Data, and SQL Issues in Apache Doris – FAQs and Solutions

This article compiles frequently asked questions and detailed solutions covering Apache Doris operational problems, data handling errors, and SQL query issues, providing step‑by‑step guidance, configuration tips, and command examples to help administrators troubleshoot and maintain a stable Doris cluster.

Apache Doris · Configuration · Operations
0 likes · 28 min read
Architecture Digest
Jun 2, 2022 · Operations

Incident Handling and Fault Recovery Practices for Call Center Systems

The article outlines a comprehensive approach to diagnosing, responding to, and preventing call‑center system failures by describing typical fault scenarios, step‑by‑step recovery actions, monitoring enhancements, emergency plan components, and continuous improvement strategies for operations teams.

Operations · call center · emergency procedures
0 likes · 13 min read
Programmer DD
Jun 1, 2022 · Operations

How Modern Payment Systems Evolve: From Metal Coins to Microservice Architectures

From ancient barter and metal coins to paper money and today’s electronic payments, this article traces the four stages of currency evolution, explains the significance of payment licenses, and details the architectural progression of payment systems—from monolithic designs to microservice‑based, high‑availability infrastructures.

Microservices · Operations · System Architecture
0 likes · 11 min read
Efficient Ops
May 31, 2022 · Operations

Essential Linux Command Cheatsheet for Sysadmins: 14 Handy Scripts

A concise collection of 14 practical Linux shell commands and scripts—ranging from file searching and batch extraction to log cleanup, directory checks, sed replacements, network capture, and firewall rules—helps operations engineers work faster and solve common problems without constantly searching online.

Operations · Shell · Sysadmin
0 likes · 6 min read
Efficient Ops
May 29, 2022 · Operations

How to Build a Semi‑Automated Prometheus Monitoring Stack for Small Teams

This article details a practical, semi‑automated monitoring solution for environments with fewer than 500 nodes, covering active monitoring concepts, Prometheus data modeling, service‑framework instrumentation, data scraping and visualization with Grafana, and alert handling via AlertManager.

Grafana · Operations · Prometheus
0 likes · 13 min read
Liangxu Linux
May 29, 2022 · Operations

Essential Linux Commands Every Sysadmin Should Know

A comprehensive reference of essential Linux commands covering file management, system monitoring, user administration, networking, compression, and process control, providing concise descriptions to help users navigate and operate Linux systems efficiently.

Operations
0 likes · 16 min read
Ctrip Technology
May 26, 2022 · Operations

TS Operations System and Practices at Ctrip's Public Technical Service Center

This article details how Ctrip transformed its technical support team into a public TS organization, describing the evolution of its support models, the architecture of its operation system, AI‑driven service accounts, wiki automation, crawler tools, tagging strategies, monitoring dashboards, and future plans to enhance efficiency and user satisfaction.

AI chatbot · Ctrip · Operations
0 likes · 13 min read
Open Source Linux
May 26, 2022 · Operations

Optimizing Zabbix Agent Monitoring for Linux and Windows: Best Practices

This guide explains how Zabbix agent monitors Linux and Windows systems, compares active and passive modes, and provides detailed optimization tips for OS metrics, CPU, memory, filesystem, Windows services, performance counters, and event logs, including alarm suppression and macro usage.

Agent · Linux · Operations
0 likes · 11 min read
Xianyu Technology
May 25, 2022 · Operations

How Xianyu Built a Scalable Test Data Generation Platform for Faster Testing

Facing high manual costs, steep data‑creation barriers, and a lack of test‑data support, Xianyu designed a configurable, multi‑endpoint platform that automates product, order, and discount data generation, dramatically speeding up testing and enabling shift‑left testing across PC, app, and DingTalk.

Operations · Xianyu · automation
0 likes · 9 min read
Laravel Tech Community
May 24, 2022 · Operations

Key New Features and Improvements in Ubuntu 22.04 LTS (Jammy Jellyfish)

Ubuntu 22.04 LTS introduces a default Wayland display server, a lighter Yaru theme, a more compact GNOME desktop, enhanced desktop icons, horizontal workspaces, revamped app launcher, dock refinements, new accent colours, touch‑pad gestures, password‑protected zip handling, microphone mute alerts, calendar events in the notification area, expanded power‑management modes, visible restart option, improved keyboard shortcuts, extensive multitasking settings, a new interactive screenshot tool, proper dark mode, and Firefox distributed as a Snap package.

Desktop · GNOME · Linux
0 likes · 10 min read
Architecture and Beyond
May 21, 2022 · Product Management

Mastering Product Prioritization: From Requirement Levels to Incident Management

This article explains how limited resources shape product requirement prioritization, test‑bug grading, product‑module classification, online bug severity, and incident response levels, offering practical frameworks and concrete grading tables to help teams make objective, value‑driven decisions throughout a product’s lifecycle.

Operations · Software Development · bug triage
0 likes · 13 min read
Open Source Linux
May 19, 2022 · Operations

Step-by-Step Guide to Installing phpIPAM on Linux: From Apache to MariaDB

This article provides a comprehensive, step‑by‑step tutorial for installing and configuring the phpIPAM IP address management web application on a Linux server, covering environment preparation, disabling SELinux, installing dependencies, setting up Apache and MariaDB, cloning the source, adjusting permissions, and completing the web‑based setup.

Apache · IP address management · Installation
0 likes · 5 min read
Efficient Ops
May 17, 2022 · Operations

How Top Chinese Companies Use DevOps Maturity to Boost IT Efficiency

This article reviews how leading Chinese enterprises such as Tencent, Quark, and Baoxin Software applied the CAICT DevOps Capability Maturity Model, detailing their assessment results, continuous delivery improvements, and security‑risk achievements to enhance IT performance and support digital transformation.

DevOps · Digital Transformation · IT efficiency
0 likes · 9 min read
Efficient Ops
May 16, 2022 · Operations

How Chinese Securities Firms Accelerate IT with DevOps: Real-World Maturity Model Successes

This article reviews how leading Chinese securities companies adopted the CAICT‑led DevOps Capability Maturity Model, detailing assessment results across dozens of projects, the improvements in delivery speed, test coverage and security, and the broader impact on digital transformation in the financial sector.

DevOps · Financial Services · IT transformation
0 likes · 19 min read
DeWu Technology
May 16, 2022 · Operations

NOC SLA Implementation for Consumer Trading Platform

To tackle growing production complexity and past incident delays, the consumer trading platform introduced a three‑tier NOC‑SLA with intelligent baselines powered by Facebook Prophet, streamlined alert rules, and an SOS‑linked workflow, boosting detection frequency, cutting critical response times to under five minutes, and improving overall system reliability while emphasizing ongoing baseline and rule maintenance.

Alert Management · NOC · Operations
0 likes · 13 min read
ByteDance Data Platform
May 16, 2022 · Operations

How ByteDance’s SLA Assurance Platform Guarantees Data Reliability at Scale

This article explains how ByteDance’s self‑built SLA assurance platform addresses data pipeline communication costs, unclear responsibilities, and operational pressure by introducing roles, a streamlined signing workflow, checkpoint and recommendation calculations, and real‑time monitoring to achieve a 99.1% SLA compliance rate.

Operations · SLA · monitoring
0 likes · 9 min read
Efficient Ops
May 15, 2022 · Operations

How Chinese Banks Accelerate IT Efficiency with DevOps Maturity Assessments

This article reviews how leading Chinese banks have adopted the CAICT DevOps Capability Maturity Model to improve IT efficiency, detailing assessment statistics, specific project implementations across continuous delivery, technical operation, security, and tooling, and highlighting the overall impact on digital transformation.

Banking · DevOps · IT efficiency
0 likes · 16 min read
Efficient Ops
May 15, 2022 · Operations

How Chinese Banks Accelerate Digital Transformation with DevOps Maturity Assessments

This article reviews how major Chinese joint‑stock banks have adopted the CAICT DevOps Capability Maturity Model, detailing the number of evaluated projects, each bank’s implementation experiences across continuous delivery, security, and toolchain standards, and the overall impact on IT efficiency and business agility.

Banking · Capability Maturity Model · Continuous Delivery
0 likes · 15 min read
dbaplus Community
May 11, 2022 · Backend Development

Mastering Failure‑Oriented Design: Mindset, Process, and Distributed Locks

This article explores the philosophy and practical techniques of failure‑oriented design, covering why anticipating failures is crucial for developers, the organizational and process changes needed, core design principles, and concrete implementations such as multi‑level Redis distributed locks with code examples.

Backend Engineering · Operations · distributed-lock
0 likes · 23 min read
NetEase Game Operations Platform
May 9, 2022 · Operations

Intelligent Log Classification and Anomaly Detection: Design and Implementation

This article presents a two‑stage streaming log classification system using an improved prefix‑tree and longest‑common‑subsequence algorithms, along with a statistical unsupervised anomaly detection method that leverages chi‑square aggregation and box‑plot scoring to reduce false alarms and accelerate template convergence.

LCS algorithm · Operations · Unsupervised Learning
0 likes · 11 min read
ITPUB
May 9, 2022 · Databases

How Meituan’s Database Autonomy Service Tackles Scale and Reliability Challenges

This article outlines the evolution of Meituan’s Database Autonomy Service (DAS), describing the growing scale‑vs‑operations imbalance, the strategic roadmap for self‑service and AI‑driven diagnostics, detailed architectural designs across data collection, compute/storage, and analysis layers, and the measurable outcomes and future plans for full database autonomy.

AI Diagnosis · Database Autonomy · Operations
0 likes · 19 min read
DevOps Cloud Academy
May 9, 2022 · Operations

Resetting the GitLab Root Password via Console and Password Recovery

This guide explains how to retrieve and change the temporary GitLab root password generated during a Terraform deployment, and outlines two recovery methods: using the password‑reset feature, or directly updating the password through the GitLab Rails console (with example code).

DevOps · GitLab · Operations
0 likes · 3 min read
Architect
May 8, 2022 · Operations

ELK Stack Common Deployment Architectures and Practical Solutions

This article introduces the ELK stack components, compares three typical deployment architectures—Logstash as collector, Filebeat as collector, and a cache‑queue‑enhanced design—then discusses common logging issues such as multiline merging, timestamp handling, and module filtering, providing concrete configuration examples and solutions.

ELK · Filebeat · Kibana
0 likes · 10 min read
DevOps Cloud Academy
May 7, 2022 · Operations

Optimizing Zabbix Monitoring for Linux and Windows Systems

This article provides a comprehensive guide on configuring and optimizing Zabbix agent monitoring for Linux and Windows, covering agent types, passive and active modes, macro variables, LLD macros, CPU/memory/file‑system metrics, and Windows service, performance counter, and event‑log monitoring.

Linux · Operations · Windows
0 likes · 9 min read
dbaplus Community
May 4, 2022 · Operations

How Tencent Game Teams Use Chaos Engineering to Boost Reliability and Reduce Outages

This article explains the concept of chaos engineering, its six key benefits, the design of a full‑lifecycle chaos platform, fault‑atom categories, experiment orchestration, risk control, automation, red‑blue war games, and practical experiments that helped Tencent Games improve system reliability while cutting operational costs.

DevOps · Gaming · Operations
0 likes · 21 min read
58UXD
Apr 29, 2022 · Operations

How 58 Home Service Standardized Cleaning: From User Research to SOP Success

This article examines how 58 Home Service identified service gaps through user research, built a detailed user‑experience map, created a comprehensive SOP handbook covering image, etiquette, and behavior, and implemented training, assessment, and incentives to dramatically improve customer satisfaction and reduce complaints.

Operations · Training · User experience
0 likes · 9 min read
DaTaobao Tech
Apr 29, 2022 · Industry Insights

How Taobao Mini Programs Cut Load Times by 30%: A Data‑Driven Performance Playbook

This article analyzes the performance challenges of Taobao Mini Programs, defines a multi‑dimensional experience metric, builds a standardized ops data pipeline, introduces the T2 first‑screen algorithm and a three‑stage performance model, and shares concrete optimization practices that reduced T2 from 2.7 s to 1.9 s while improving business metrics.

Mini Program · Operations · data analysis
0 likes · 10 min read
Bilibili Tech
Apr 26, 2022 · Operations

Bilibili's SRE Practice for Business Stability: Theory, Metrics, and Operational Implementation

Bilibili’s SRE team combines stability theory, detailed fault‑stage and operational metrics, and a unified emergency‑response platform—including on‑call scheduling, fault‑command incident commanders, automated fault portraits, and rapid post‑mortems—to transform frequent incidents into data‑driven, collaborative recoveries and lay groundwork for AI‑assisted self‑healing.

Business Stability · Metrics · Oncall
0 likes · 23 min read
Efficient Ops
Apr 26, 2022 · Operations

How Beijing Gas Achieved Advanced DevOps Maturity: A Detailed Case Study

Beijing Gas’s Tongzhou Call Center project passed the Level 2 DevOps continuous‑delivery assessment, showcasing how standardized processes, a cloud‑native tool platform, and agile practices dramatically improved delivery speed, quality, and digital transformation across the organization.

Continuous Delivery · DevOps · Digital Transformation
0 likes · 11 min read
Efficient Ops
Apr 26, 2022 · Operations

How China’s Top Firms Achieved Leading DevOps Maturity – Assessment Insights

The CAICT’s fourth‑batch DevOps assessment reveals that China Zhongjin Wealth’s platform passed the Excellent level, showcasing how standardized pipelines, tool empowerment, and the DevOps Capability Maturity Model dramatically boost delivery speed, quality, and competitiveness across major enterprises.

Capability Maturity Model · Continuous Delivery · DevOps
0 likes · 6 min read
dbaplus Community
Apr 25, 2022 · Operations

From Monitoring to Observability: Expert Insights on Evolving Cloud‑Native Operations

In this interview series, three industry experts explain how monitoring differs from observability, the shifts required for ops, developers, and architects, the core methodologies and technologies behind metrics, traces, and logs, and practical guidance for selecting and integrating observability tools in cloud‑native environments.

Metrics · Operations · cloud-native
0 likes · 16 min read
Cognitive Technology Team
Apr 24, 2022 · Backend Development

Thread Pool Misconfiguration Cases and Best Practices for Resilience

The article presents two 2018 incidents where improper Java thread‑pool settings caused service degradation and unavailability, analyzes the root causes such as insufficient core size, unbounded queues, and missing rejection handlers, and offers practical recommendations for dynamic sizing, alerting, degradation strategies, isolation, and auto‑scaling to prevent similar faults.

FaultTolerance · JavaConcurrency · Operations
0 likes · 3 min read
DevOps Cloud Academy
Apr 23, 2022 · Operations

A Comprehensive Overview of DevOps Tools and Their Roles

This article introduces the DevOps culture and systematically categorizes a wide range of DevOps tools—including source‑code management, CI/CD, containers, cloud providers, automation, monitoring, project management, and secret management—to help teams improve productivity and collaboration.

Containers · DevOps · Operations
0 likes · 9 min read
Liangxu Linux
Apr 20, 2022 · Operations

How to Quickly Identify Disk Space Hogs on Linux Servers

When a Linux server raises a disk‑space alarm, this guide shows step‑by‑step how to locate the offending directories or files using df, du, find, lsof and tune2fs, and explains why reported usage may differ from summed directory sizes.

Operations · du · find
0 likes · 4 min read
IT Architects Alliance
Apr 17, 2022 · Operations

Understanding the SRE Role: Responsibilities, Types, and Practices

This article explains what Site Reliability Engineering (SRE) is, why it was created, the challenges in hiring SREs, and breaks the role into three layers—Infrastructure, Platform, and Business—detailing their duties, deployment processes, on‑call practices, SLI/SLO management, incident post‑mortems, capacity planning, user support, and career advice.

Infrastructure · Oncall · Operations
0 likes · 21 min read
Architect
Apr 16, 2022 · Operations

A Comprehensive Overview of Site Reliability Engineering (SRE) Roles and Practices

This article explains what SRE is, why it was created, how its responsibilities differ across companies, and breaks the work into Infrastructure, Platform, and Business SRE while covering deployment, on‑call, SLI/SLO, incident post‑mortems, capacity planning, user support, and career advice.

Oncall · Operations · SLI/SLO
0 likes · 22 min read
Alibaba Cloud Native
Apr 16, 2022 · Cloud Native

How AHAS Feature Switches Simplify Dynamic Configuration in Cloud‑Native Microservices

This article explains common configuration challenges in microservice applications and introduces Alibaba Cloud's AHAS feature switch as a lightweight, dynamic configuration framework that offers zero‑code integration, strong type validation, persistent storage, and non‑intrusive deployment for real‑time business control.

AHAS · Cloud Native · Dynamic Configuration
0 likes · 8 min read
YunZhu Net Technology Team
Apr 15, 2022 · Operations

Design and Architecture of a Cloud‑Native Monitoring Platform for Business Systems

The document outlines the background, vision, current status, technical research, value, product and technical architecture, and functional design of a cloud‑native monitoring platform that integrates SkyWalking and Prometheus to provide comprehensive APM, resource utilization, alerting, and rapid fault localization for business and technical middle‑platform services.

APM · Metrics · Operations
0 likes · 10 min read
Dada Group Technology
Apr 8, 2022 · Operations

Marketing Guard: A Risk Pre‑Warning System for E‑Commerce Marketing Operations

The article presents a comprehensive analysis of marketing‑related financial loss cases, outlines the design and implementation of a non‑intrusive, event‑driven Marketing Guard system with dual‑layer ES‑HBase storage, and discusses its operational safeguards, achievements, shortcomings, and future development plans.

Operations · System Architecture · marketing risk
0 likes · 12 min read
Architecture Digest
Apr 6, 2022 · Operations

Why Organizations Struggle with DevOps: Leadership, Structure, Value‑Stream Mapping and Key Practices

The article explains that many organizations fail to achieve the promised business value of DevOps because they overlook four critical factors—leadership, organizational structure, value‑stream mapping, and regular pulse checks—and provides concrete recommendations to address each area.

Operations · Value Stream Mapping · organizational structure
0 likes · 9 min read
Open Source Linux
Apr 2, 2022 · Operations

How to Speed Up Call Center Incident Recovery with Proven Ops Strategies

This article walks through a real call‑center outage scenario, outlines systematic fault‑identification steps, practical emergency recovery actions, monitoring enhancements, concise emergency‑plan design, and introduces intelligent event‑handling to help operations teams resolve incidents faster and more reliably.

Operations · automation · call center
0 likes · 13 min read
dbaplus Community
Mar 31, 2022 · Databases

Why Build Your Own Database Middleware in a Multi‑Cloud Era?

The article explains why, despite cloud services, enterprises still need to develop their own database middleware to ensure multi‑cloud compatibility, vendor neutrality, high availability, and scalable performance, detailing the challenges, design principles, core features, technical metrics, and operational benefits of such a solution.

Database Middleware · Operations · cloud infrastructure
0 likes · 20 min read
TAL Education Technology
Mar 31, 2022 · Cloud Computing

Hybrid Cloud Governance at TAL Education: Challenges, Methods, and Future Plans

This article examines TAL Education's hybrid‑cloud journey: it explains what hybrid cloud is, presents industry adoption statistics, details the company's initial network chaos and the resulting governance difficulties, describes the first‑phase remediation measures, and sets out the objectives and methods for the second‑phase transformation.

Network Governance · Operations · cloud architecture
0 likes · 20 min read
IT Architects Alliance
Mar 30, 2022 · Operations

30 Essential Architecture Patterns for Scalable and Resilient Systems

This article systematically presents thirty architectural patterns—covering management, monitoring, performance, scalability, data handling, design, messaging, resilience, and security—to help engineers design, implement, and operate robust, high‑performance distributed systems.

Design Patterns · Operations · Scalability
0 likes · 33 min read
MaGe Linux Operations
Mar 28, 2022 · Databases

Why GitHub’s MySQL Cluster Crashed: Lessons from Recent Outages

GitHub suffered several service outages over recent weeks, caused by resource contention in its MySQL1 cluster that led to prolonged downtime. The company disclosed detailed timelines, root causes, and ongoing mitigation measures, including load audits, traffic shifting, and infrastructure scaling, to prevent future incidents.

GitHub · Operations · database outage
0 likes · 3 min read
DevOps Cloud Academy
Mar 28, 2022 · Operations

Understanding DevOps: Definition, Benefits, Practices, and Drawbacks

This article explains DevOps as a cultural, organizational, and technical shift that unifies development, operations, and quality assurance, outlines its benefits such as faster delivery and improved reliability, describes key practices like CI/CD, multi‑environment deployments, early failure detection, rollback, policy enforcement and observability, and discusses its potential drawbacks and considerations.

DevOps · Operations · automation
0 likes · 12 min read
Architecture Digest
Mar 26, 2022 · Operations

Top Free Docker GUI Tools for Efficient Container Management

This article reviews several free Docker graphical user interface (GUI) tools—including Portainer, DockStation, Docker Desktop, Lazydocker, and Docui—detailing their platform support, feature sets, Docker version compatibility, and practical usage scenarios for streamlined container administration.

Container Management · Docker · GUI
0 likes · 7 min read
Efficient Ops
Mar 20, 2022 · Operations

How to Keep Your Kubernetes Nodes and Pods Stable: Essential Ops Practices

This guide walks through essential Kubernetes operations—from node kernel upgrades and Docker daemon tuning to pod resource limits, scheduling policies, health probes, logging standards, and comprehensive monitoring—providing practical commands and configurations to keep clusters stable and observable.

Kubernetes · Node Management · Operations
0 likes · 18 min read
Open Source Linux
Mar 18, 2022 · Operations

Evolution of Open‑Source Monitoring Tools: From Nagios to Prometheus

This article traces the development of open‑source monitoring solutions from early tools like Nagios and Cacti through modern platforms such as Prometheus and Nightingale, comparing their strengths, weaknesses, and typical use cases while also looking ahead to emerging observability trends in cloud‑native environments.

Nagios · Operations · Prometheus
0 likes · 14 min read
Efficient Ops
Mar 17, 2022 · Operations

Inside China’s AIOps Standard: Key Insights from the 4th Draft Meeting

The article reports on the fourth draft discussion of China’s Cloud Computing Intelligent Operations (AIOps) Capability Maturity Model – Part 2, detailing the meeting’s participants, the finalized system and tool technical requirements, and the progress toward a comprehensive AIOps standard that addresses quality, cost, efficiency, and security across multiple functional modules.

Artificial Intelligence · Operations · aiops
0 likes · 5 min read
FunTester
Mar 17, 2022 · Operations

Turning Manual Performance Monitoring into Automated Multi‑Level Alerts

The author explains how they distinguished test automation from automated testing, identified monitoring pain points, built a custom scraper‑driven alert system with three escalation levels, tackled common pitfalls, and achieved faster, more reliable performance testing alerts.

Operations · Performance Monitoring · alert system
0 likes · 6 min read
Efficient Ops
Mar 15, 2022 · Cloud Native

How eBPF Powers Seamless Observability in Cloud‑Native Kubernetes Environments

This article explains why the rise of Kubernetes as a cloud‑native standard brings new observability challenges, outlines how eBPF enables non‑intrusive, multi‑language, multi‑protocol data collection, and describes a comprehensive monitoring stack—including golden metrics, service topology, tracing, alerts, and network diagnostics—to achieve end‑to‑end visibility in complex Kubernetes deployments.

Cloud Native · Kubernetes · Operations
0 likes · 22 min read
IT Architects Alliance
Mar 13, 2022 · Operations

30 Essential Architecture Patterns for Scalable, Resilient Systems

This article presents a comprehensive catalog of thirty architectural patterns, covering management, monitoring, performance, data management, design, messaging, resilience, and security, and explains their purpose, typical use cases, benefits, and implementation considerations to help engineers build robust, high‑performance distributed applications.

Architecture Patterns · Operations · Resilience
0 likes · 32 min read
Architects' Tech Alliance
Mar 12, 2022 · Cloud Computing

Understanding and Managing Complexity in Multi‑Cloud Infrastructure

The article examines the growing complexity of multi‑cloud and hybrid cloud environments, identifies security, API, and logging challenges, and proposes a flexible, cloud‑neutral automation platform with clear communication, audit, planning, and incremental implementation steps to reduce operational overhead and cost.

Cloud Native · Operations · automation
0 likes · 12 min read
AntTech
Mar 12, 2022 · Operations

Evolution of Large‑Scale Distributed System Stability at Ant Group

The article outlines Ant Group's multi‑stage journey of building large‑scale distributed system stability, describing architectural evolutions, risk‑inspection mechanisms, high‑availability solutions such as LDC and fine‑grained traffic scheduling, and intelligent risk‑defense products that together enable resilient, cost‑effective operations.

Cloud Native · Distributed Systems · Operations
0 likes · 15 min read
Dada Group Technology
Mar 11, 2022 · Operations

Design and Iteration of JD Daojia Order Timeliness System

This article details the background, overall architecture, iterative improvements, and future directions of JD Daojia's order timeliness system, covering early limitations, business‑driven challenges, solution iterations, order‑control mechanisms, product‑dimension handling, and the final business architecture to enhance fulfillment rates and user experience.

Backend · JD Daojia · Operations
0 likes · 11 min read
Open Source Linux
Mar 11, 2022 · Operations

Essential Linux Ops Tools: Monitoring, Performance, and Security Utilities

This article presents a curated list of practical Linux operation tools—including Nethogs, IOzone, IOTop, IPtraf, IFTop, HTop, NMON, MultiTail, Fail2ban, Tmux, Agedu, NMap, and Httperf—detailing their purpose, download links, installation commands, and basic usage to help system administrators improve monitoring, performance testing, and security on Linux servers.

Linux · Operations · Sysadmin
0 likes · 12 min read
MaGe Linux Operations
Mar 9, 2022 · Operations

Why Do 502 Errors Appear Only on POST Requests After Migrating to PaaS?

After moving an application to a PaaS platform, intermittent 502 errors occur, seemingly only for POST requests, but the root cause lies in Nginx‑Ingress and uwsgi HTTP version mismatches, connection reuse, and retry behavior, which can be diagnosed through traffic analysis and configuration changes.

502 error · HTTP version mismatch · Ingress
0 likes · 6 min read
Open Source Linux
Mar 8, 2022 · Operations

Master Kubernetes Troubleshooting: The Three Pillars Every Engineer Needs

This article breaks down Kubernetes troubleshooting into three essential steps—understanding the failure, managing the response, and preventing recurrence—while mapping key monitoring, observability, and incident‑response tools to each phase for reliable cloud‑native operations.

Kubernetes · Operations · chaos engineering
0 likes · 8 min read
High Availability Architecture
Mar 7, 2022 · Operations

Understanding High Concurrency, High Availability, Performance, and Scalability: Concepts and Metrics

This article systematically explains the relationships among high concurrency, high availability, performance, and scalability, defines their quantitative metrics, categorizes sources of change that affect system reliability, and outlines strategies for fault prediction, impact reduction, and rapid recovery in large‑scale services.

Operations · Reliability · Scalability
0 likes · 11 min read
Efficient Ops
Mar 6, 2022 · Operations

How Top Chinese Insurers Achieved DevOps Maturity: Real‑World Case Studies

This article examines how three leading Chinese insurance companies used the nationally‑backed DevOps Capability Maturity Model to evaluate and improve their IT operations, detailing project architectures, cloud‑native implementations, continuous‑delivery results, and the broader significance of the DevOps standard.

Continuous Delivery · DevOps · Insurance
0 likes · 8 min read
IT Architects Alliance
Mar 6, 2022 · Operations

Mastering Nginx: From Static Servers to Advanced Load Balancing and Reverse Proxy

This guide walks through deploying static files with Nginx, configuring location blocks and regex patterns, setting up reverse proxy to Java services, implementing various load‑balancing strategies (round‑robin, weight, ip_hash, fair, url_hash), separating static and dynamic content, and using essential directives such as return, rewrite, error_page, logging, deny, and built‑in variables.

Nginx · Operations · Server Configuration
0 likes · 16 min read