Tagged articles
3281 articles
Old Zhao – Management Systems Only
May 19, 2025 · Operations

Why 90% of ERP Projects Fail in China—and How to Make Yours Succeed

This article examines why most ERP projects in Chinese companies fail, explores the core benefits of ERP, identifies common pitfalls such as over‑complexity, resistance, and high costs, and offers practical steps (needs assessment, right‑sized selection, process clarification, phased rollout, and strong leadership) to ensure successful implementation.

ERP · Operations · SMEs
0 likes · 11 min read
Lin is Dream
May 18, 2025 · Operations

Master Server Disk & Network Monitoring with Command‑Line Tools

This guide explains why every server must monitor CPU, memory, disk and network usage, shows how to clean disks and analyze traffic using command‑line utilities such as df, du, iotop, iostat, iftop, lsof and tcpdump, and provides real‑world case studies for troubleshooting disk space exhaustion, port conflicts and abnormal outbound traffic.

Operations · Server Monitoring · disk-management
0 likes · 9 min read
php Courses
May 16, 2025 · Operations

Using Python for Automation in Operations (DevOps)

This article explains why Python is a leading language for DevOps automation, detailing its core advantages, typical use cases such as bulk server management, configuration management, log analysis, and scheduled tasks, and introduces common Python libraries and learning pathways for building robust operational workflows.

Configuration Management · DevOps · Operations
0 likes · 6 min read
FunTester
May 16, 2025 · Operations

Chaos Engineering: Evolution, Workflow, Advantages, and Practice Principles

Chaos engineering is a discipline that deliberately injects faults into distributed systems to test and improve resilience, tracing its evolution from Netflix's Chaos Monkey to modern platforms, outlining its operational workflow, benefits, and core principles for reliable system design.

Distributed Systems · Fault Injection · Operations
0 likes · 9 min read
FunTester
May 15, 2025 · Operations

Uncovering the Eight Hidden Pitfalls That Can Crash Your Distributed System

This article dissects the classic Eight Fallacies of Distributed Computing, explaining each mistaken assumption about network reliability, latency, bandwidth, security, topology, administration, cost, and homogeneity, and provides real‑world case studies and practical recommendations to help engineers design more resilient distributed systems.

Distributed Systems · Fallacies · Latency
0 likes · 16 min read
Lin is Dream
May 14, 2025 · Operations

Master Nginx Rate Limiting: Prevent Abuse with limit_req & limit_conn

Learn how to protect your services from abusive traffic and brute‑force attacks by using Nginx's rate‑limiting features—limit_req to control request rates and limit_conn to restrict concurrent connections—complete with configuration examples, explanations of zones, burst handling, custom error pages, and log monitoring.

Operations · Server Configuration · limit_conn
0 likes · 6 min read
JD Tech Talk
May 13, 2025 · Operations

Intelligent Supply Chain Planning Algorithms and Their Applications

The article introduces intelligent supply chain planning algorithms—including network design, inventory layout, and simulation—detailing their optimization models, high‑performance solving techniques, and real‑world impact on cost reduction, efficiency, and service experience across large‑scale logistics operations.

Logistics · Operations · Supply Chain
0 likes · 12 min read
Efficient Ops
May 11, 2025 · Operations

China’s Leading Banks Achieve Top DevOps Standard Certifications – What It Means

The 25th GOPS Global Operations Conference in Shenzhen announced the dual ITU DevOps international and domestic standard assessment results, highlighting Agricultural Bank of China as the first state‑owned bank to earn a five‑star internal coach talent rating and showcasing multiple financial institutions that have successfully passed BizDevOps and continuous delivery evaluations, underscoring the growing importance of standardized DevOps practices in China’s finance sector.

BizDevOps · DevOps · Financial Industry
0 likes · 9 min read
Architecture and Beyond
May 10, 2025 · Operations

What Heinrich’s 1:29:300 Rule Reveals About Preventing Online Outages

The article explains Heinrich's Law, its 1:29:300 accident pyramid, and how applying its principles—tracking minor incidents, hidden hazards, and systemic risks—can help software teams anticipate, diagnose, and prevent major online failures through systematic safety management and data‑driven practices.

Heinrich's Law · Operations · incident management
0 likes · 15 min read
Efficient Ops
May 7, 2025 · Operations

Why Choose SigNoz for Open‑Source Observability? A Deep Dive

This article introduces SigNoz, a self‑hosted open‑source observability platform that unifies metrics, logs, and traces, outlines its core capabilities, shows how to install it with Docker, and compares its resource efficiency to commercial solutions like DataDog and Elastic.

Metrics · OpenTelemetry · Operations
0 likes · 4 min read
ITPUB
May 5, 2025 · Operations

Turn Zabbix Alerts into an AI‑Powered Personal Assistant

This guide shows how to integrate Zabbix with a locally deployed DeepSeek large language model via Webhook, enabling automatic analysis of alert causes and solutions, feeding results back to operators through dashboards or enterprise WeChat, and dramatically reducing MTTR and manual effort.

Alert Automation · DeepSeek · Operations
0 likes · 5 min read
Dual-Track Product Journal
May 2, 2025 · Operations

How to Stop Warehouse Chaos: 100 Ways Wave Picking Can Fail—and How to Fix It

A disastrous beauty‑ecommerce promotion exposed how naïve wave‑picking designs can turn warehouses into mazes, cause urgent orders to disappear, and mix products, but by applying intelligent grouping, dynamic capacity, heat‑map path optimization, and a three‑level priority system, fulfillment efficiency can be dramatically restored.

Logistics · Operations · order fulfillment
0 likes · 5 min read
ITPUB
Apr 30, 2025 · Operations

Why a Credential‑Rotation Mistake Took Down Cloudflare R2 and Its Dependent Services

On March 21, 2025, a mis‑deployed credential during R2 Gateway's key rotation caused a 1‑hour‑7‑minute outage that blocked all write operations and about 35% of reads across R2 and several downstream Cloudflare services, prompting a detailed post‑mortem and a set of corrective actions to improve visibility and safety of credential changes.

Operations · cloud computing · credential management
0 likes · 15 min read
Efficient Ops
Apr 29, 2025 · Operations

How BizDevOps Standards Are Shaping China’s Cloud and AI Operations Landscape

This article outlines the evolution of BizDevOps standards in China, detailing recent policy mandates, the expansion of the DevOps maturity model to organization‑level practice, the AI‑driven SOMM operation assurance framework, and the integration of large‑model AI into R&D and operational workflows, highlighting their impact on enterprise efficiency and governance.

AI integration · BizDevOps · DevOps Standards
0 likes · 15 min read
BirdNest Tech Talk
Apr 29, 2025 · Cloud Native

How Docker Simplifies MCP Server Deployment for AI Agents

The article analyzes the challenges of manually deploying Model Context Protocol (MCP) servers for AI agents, compares them with Docker‑based deployment, and demonstrates step‑by‑step configurations, code snippets, and concrete benefits such as environment consistency, resource efficiency, and security.

AI agents · Cloud Native · Deployment
0 likes · 7 min read
dbaplus Community
Apr 28, 2025 · Operations

20 Common Ops Failures and How to Diagnose & Fix Them

This article compiles twenty frequent operational incidents—from server inaccessibility and database connection errors to disk‑space exhaustion, high CPU usage, memory leaks, network latency, DNS failures, service crashes, file‑system corruption, update problems, permission misconfigurations, web‑server and email issues, backup failures, load‑balancing anomalies, firewall rule mistakes, SSH connection problems, database performance degradation, dependency gaps, and virtual‑machine faults—detailing their symptoms, step‑by‑step troubleshooting procedures, and concrete remediation actions.

Fixes · Operations · Server
0 likes · 15 min read
Baidu Geek Talk
Apr 28, 2025 · Operations

How Baidu’s Log Platform Cuts Billions in Cost with Full‑Lifecycle Event Governance

This article details the event‑governance practice of Baidu's log platform, explaining why uncontrolled event logging inflates storage and compute costs, and describing a three‑stage solution (manual, semi‑automatic platform, and full‑lifecycle standardization) that uses anomaly detection, automated workflows, and IM bots to achieve massive PV reduction and annual cost savings.

Cost Optimization · Log Management · Operations
0 likes · 20 min read
Efficient Ops
Apr 27, 2025 · Operations

How ICBC’s BizDevOps Evaluation Drives Digital Transformation and Sets New Standards

The article details China’s push for international ITU DevOps standards, showcases Industrial and Commercial Bank of China’s successful BizDevOps dual‑certificate assessment, and presents an in‑depth Q&A with senior fintech managers discussing the value, challenges, and future roadmap of integrating business, development, and operations to accelerate digital transformation.

BizDevOps · DevOps · Digital Transformation
0 likes · 16 min read
Liangxu Linux
Apr 26, 2025 · Operations

Essential Linux Log Files Every Sysadmin Should Monitor

The article outlines the most important Linux log files located under /var/log, explains what each records—from system messages and authentication attempts to web server activity—and provides practical commands for viewing and alerting on critical entries to improve troubleshooting and security monitoring.

Operations · Sysadmin · system logs
0 likes · 10 min read
FunTester
Apr 26, 2025 · Operations

Curated List of Technical Articles on Fault Testing, Byteman, and Chrome Extension Development

This collection gathers recent technical articles covering fault testing fundamentals, Byteman usage guides, and Chrome extension development tutorials, providing developers with practical insights, best practices, and hands‑on examples to improve system reliability, testing strategies, and front‑end extension capabilities.

Byteman · Chrome Extension · Operations
0 likes · 4 min read
Efficient Ops
Apr 26, 2025 · Operations

How China Galaxy Securities Earned Dual International & Domestic DevOps Certification for Its Bond Platform

At the 25th GOPS Global Operations Conference in Shenzhen, China Galaxy Securities showcased its successful dual certification under ITU DevOps standards, detailing the Bond Investment Business Support System's continuous testing achievements, pipeline integration, and the broader impact on financial DevOps practices.

Bond Investment Platform · China Galaxy Securities · DevOps
0 likes · 9 min read
Efficient Ops
Apr 26, 2025 · Operations

Why China’s Top State Banks Are Leading the DevOps Certification Wave

The 25th GOPS Global Operations Conference in Shenzhen announced the dual ITU DevOps international and domestic standard assessment results, highlighting the Agricultural Bank of China's AAAAA‑level DevOps coach certification and showcasing multiple state‑owned banks, exchanges, and securities firms achieving leading domestic DevOps maturity across BizDevOps, continuous delivery, and continuous testing standards.

BizDevOps · DevOps · Financial Industry
0 likes · 10 min read
Efficient Ops
Apr 25, 2025 · Operations

How Changan Auto Earned Top‑Tier DevOps Certification with a Full‑Link Observability Platform

Changan Automobile’s full‑link observability platform passed both ITU DevOps international and domestic standards assessments, showcasing its advanced monitoring capabilities, improved system stability, and strategic role in the company’s digital transformation, while the interview reveals implementation challenges, benefits, and future AI‑driven enhancements.

DevOps · Digital Transformation · Full‑Link Monitoring
0 likes · 21 min read
Efficient Ops
Apr 25, 2025 · Operations

How Shenzhen Stock Exchange Achieved Leading DevOps Standards Through Dual International and Domestic Certification

The article details China's push for internationalized information standards, the CAICT's dual ITU DevOps and domestic assessments, and how the Shenzhen Stock Exchange successfully passed Level 3 continuous delivery evaluation, showcasing advanced DevOps capabilities and their alignment with national digital transformation policies.

DevOps · Digital Transformation · FinTech
0 likes · 11 min read
Efficient Ops
Apr 25, 2025 · Operations

How BizDevOps Standards Accelerate ICBC’s Digital Transformation

The article details the dual ITU DevOps international and domestic standard assessment run by the China Academy of Information and Communications Technology (CAICT), ICBC’s successful BizDevOps certification for its Fast‑Loan project, and insights from senior fintech managers on the value, challenges, and future roadmap of integrating business, development, and operations to drive digital transformation.

Banking · BizDevOps · DevOps
0 likes · 16 min read
Dual-Track Product Journal
Apr 25, 2025 · Operations

How to Stop Inventory Discrepancies and End the Blame Game

This article analyzes common inventory discrepancy scenarios, exposes typical blame‑shifting tactics across departments, and presents a comprehensive, operation‑focused solution stack—including traceability, dynamic calibration, and fool‑proof design—to eliminate errors and improve accountability.

Operations · dynamic calibration · error prevention
0 likes · 6 min read
Top Architecture Tech Stack
Apr 22, 2025 · Operations

Step-by-Step Guide to Deploy a Spring Boot Application with Docker and Jenkins CI/CD

This tutorial walks through installing Docker and Jenkins on CentOS, configuring system settings, creating a Jenkins job to pull, build, and package a Spring Boot project, testing the pipeline, and finally running the application via Docker, providing complete commands and configuration details for a reliable CI/CD workflow.

Backend · Jenkins · Operations
0 likes · 8 min read
Old Zhao – Management Systems Only
Apr 22, 2025 · Operations

How a Real SCM System Automates Procurement, Warehouse, and Logistics to Eliminate Manual Work

This article explains why many companies struggle with fragmented supply‑chain processes, outlines the three essential capabilities of a truly effective SCM system—smooth workflow, node monitoring, and data persistence—and details how such a system can transform procurement, warehousing, logistics, cross‑department collaboration, and data analysis into an automated, data‑driven operation.

Data Analytics · Operations · SCM
0 likes · 9 min read
Java Captain
Apr 22, 2025 · Operations

Improving Cron Job Stability and Monitoring with Best Practices and Healthchecks

The article analyzes common cron job failures such as accidental deletions, OOM crashes, and lack of monitoring, then proposes standardized Jenkins deployment, automatic server selection, lock mechanisms, queue-based processing, status awareness, and the use of the open‑source Healthchecks system to achieve proactive detection and alerting.

Operations · automation · cron
0 likes · 8 min read
dbaplus Community
Apr 21, 2025 · Operations

Turn Zabbix Alerts into AI‑Powered Insights with DeepSeek

This guide shows how to integrate Zabbix with a locally deployed DeepSeek large language model via Webhook, enabling automatic analysis of alerts, generation of root‑cause explanations and remediation suggestions, and delivering results through WeChat bots, dashboards, or email to reduce MTTR and manual effort.

AI Ops · Alert Automation · DeepSeek
0 likes · 4 min read
Raymond Ops
Apr 19, 2025 · Operations

Essential Apache & Nginx Log Analysis Commands for Linux Ops

This guide compiles practical Linux shell commands for analyzing Apache and Nginx access logs, covering IP frequency, page request counts, status code distribution, traffic volume, crawler detection, subnet aggregation, and time‑based request rates to help administrators monitor web service health efficiently.

Nginx · Operations · log analysis
0 likes · 15 min read
MaGe Linux Operations
Apr 18, 2025 · Operations

Master Docker: Essential Commands for Developers and Ops

This guide compiles the most commonly used Docker commands, organized by functionality—including installation, image management, container handling, network and volume operations, logging, debugging, and system cleanup—to help developers and operations engineers efficiently manage Docker environments.

Container · Operations · command-line
0 likes · 11 min read
JD Tech
Apr 17, 2025 · Operations

Chaos Engineering: Principles, Core Steps, Tool Selection, and AI Integration

This article explains chaos engineering—its definition, core principles, experimental workflow, tool selection, AI‑driven enhancements, and practical case studies—providing a comprehensive guide for building resilient distributed systems across backend, cloud‑native, mobile, and AI‑enabled environments.

AI integration · Distributed Systems · Fault Injection
0 likes · 26 min read
MaGe Linux Operations
Apr 17, 2025 · Operations

Top 10 Essential Ops Tools Every Engineer Should Master

This article introduces the ten most frequently used operations engineering tools, detailing each tool's functions, suitable scenarios, advantages, and real‑world examples, and includes practical code snippets to help engineers automate and streamline their daily workflows.

Infrastructure · Linux tools · Operations
0 likes · 8 min read
Efficient Ops
Apr 16, 2025 · Operations

Top 10 Essential Ops Tools Every Engineer Should Master

This article introduces ten indispensable operations engineering tools—Shell scripts, Git, Ansible, Prometheus, Grafana, Docker, Kubernetes, Nginx, ELK Stack, and Zabbix—detailing their functions, suitable scenarios, advantages, and real‑world examples, plus sample code snippets to help engineers automate and monitor infrastructure efficiently.

DevOps · Infrastructure · Operations
0 likes · 9 min read
Dual-Track Product Journal
Apr 11, 2025 · Operations

Why Your Replenishment System Traps You in a ‘More Restock, More Shortage’ Loop—and How to Fix It

This article dissects common failures in e‑commerce replenishment—such as hot‑product black holes, slow‑moving stock graves, and supply‑chain avalanches—and presents a seven‑step framework of dynamic forecasting, tiered strategies, distributed inventory, and automated safeguards to stabilize inventory levels.

Operations · Supply Chain · automation
0 likes · 9 min read
Tencent Cloud Middleware
Apr 9, 2025 · Operations

How TDMQ Pulsar’s Cluster‑Level and Topic‑Partition Throttling Keeps Your Messaging System Stable

This article explains why high‑throughput producers and consumers can saturate CPU, memory, network and disk I/O in TDMQ Pulsar clusters, describes the built‑in cluster‑level distributed and topic‑partition rate‑limiting mechanisms, and provides practical guidance for configuration, monitoring, and troubleshooting.

Cluster Management · Message Queue · Operations
0 likes · 12 min read
Liangxu Linux
Apr 6, 2025 · Operations

How to Define SLIs, SLOs, and SLAs for Effective SRE Practices

This guide explains how SRE teams should collaborate early in the software development lifecycle to define Service Level Indicators (SLIs), set realistic Service Level Objectives (SLOs) and Service Level Agreements (SLAs), and integrate observability signals, error budgeting, risk management, and incident handling into reliable operations.

Error Budget · Operations · SLA
0 likes · 13 min read
Raymond Ops
Apr 5, 2025 · Operations

Master Nginx Load Balancing: Step‑by‑Step Configuration Guide

This article explains how to configure Nginx as a load balancer for web applications, covering upstream and proxy_pass definitions, the three built‑in balancing methods, weight and connection settings, fail‑over options, and practical code examples for both HTTP and HTTPS deployments.

Configuration · Nginx · Operations
0 likes · 11 min read
Open Source Linux
Apr 3, 2025 · Operations

Understanding Linux Boot Process: From BIOS to Systemd

This article explains the Linux boot sequence, covering the BIOS/UEFI hardware check, GRUB2 bootloader configuration, kernel loading with initramfs, root filesystem mounting, systemd target units, essential services, and a CentOS 8 example with GRUB settings and module inspection.

Boot Process · CentOS · GRUB
0 likes · 8 min read
Baobao Algorithm Notes
Apr 2, 2025 · Industry Insights

Building AI‑Native Teams: Turning AI Agents into Reliable Digital Employees

This article analyses why current AI agents fall short of being true digital employees, identifies four major obstacles—undocumented knowledge, GUI‑only tools, lack of isolated test environments, and limited memory and initiative—and proposes a comprehensive, six‑step technical and cultural roadmap for creating AI‑native teams that treat AI as a collaborative team member.

AI integration · Operations · digital employee
0 likes · 61 min read
FunTester
Mar 31, 2025 · Operations

Performance Testing and Fault Testing: Complementary Pillars for System Stability

The article explains how performance testing measures system efficiency under load while fault testing validates resilience under abnormal conditions, highlighting their shared goals, differences, overlapping toolchains, and how their combined use drives architecture optimization and improves service level agreements in modern complex software systems.

Fault Injection · Load Testing · Operations
0 likes · 14 min read
Raymond Ops
Mar 30, 2025 · Operations

How to Permanently Disable Swap on Modern Linux Systems

This guide explains why the traditional /etc/fstab comment method fails on newer Linux distributions and provides two reliable techniques—masking the swap.target unit with systemd and adding the noauto option to the fstab swap entry—to permanently disable swap across Ubuntu, CentOS, openEuler, and other major systems.

Operations · fstab · noauto
0 likes · 7 min read
Raymond Ops
Mar 27, 2025 · Operations

All-in-One Linux Init Scripts for Rocky, AlmaLinux, CentOS, Ubuntu & More

This article introduces a comprehensive set of shell scripts that automate system initialization across dozens of Linux distributions, detailing supported features, version‑specific updates, usage instructions, and code examples for network, security, package management, and more.

DevOps · Linux · Operations
0 likes · 16 min read
Qunar Tech Salon
Mar 27, 2025 · Operations

Automated Capacity Planning and Auto‑Scaling for Hotel Services During Traffic Peaks

This document describes a comprehensive capacity‑planning solution that predicts traffic‑peak impacts for hotel services, automatically estimates required CPU resources, creates timed scaling tasks, and evaluates performance using detailed metrics, thereby improving operational efficiency and reducing manual effort during events such as exam‑ticket printing and holiday travel surges.

Auto Scaling · Operations · Resource Management
0 likes · 12 min read
FunTester
Mar 25, 2025 · Operations

Integrating Chaos Engineering into Service Dependency Governance for Resilient Cloud‑Native Systems

This article explores how to embed chaos engineering practices into service dependency governance, detailing dynamic validation versus static analysis, fault injection techniques, multi‑point failure simulations, and data‑driven optimizations to build robust, self‑healing microservice architectures in cloud‑native environments.

Cloud Native · Microservices · Operations
0 likes · 18 min read
Efficient Ops
Mar 24, 2025 · Operations

15 Common Ops Mistakes That Can Crash Your System – How to Avoid Them

This article outlines fifteen common operational and management mistakes—such as frequent incidents, excessive new hires, lack of automation, and missing rollback plans—that can trigger system outages, and offers guidance on how teams can strengthen testing, processes, and team capabilities to prevent downtime.

DevOps · Operations · SRE
0 likes · 6 min read
Ops Development Stories
Mar 24, 2025 · Operations

Why Do Some Ops Teams Face Value Challenges? Insights for CTOs

Operations leaders and CTOs often confront the question of the true value of their teams, and this article explores who asks it, why it matters, typical challenges, and practical ways to define and protect the operational role through unified platforms, processes, and strategic collaboration with development.

CTO · Operations · SRE
0 likes · 13 min read
Raymond Ops
Mar 20, 2025 · Operations

Master Linux Traceroute: Install, Use, and Advanced Options Explained

Learn how to install the traceroute utility on Debian/Ubuntu and CentOS/RHEL systems, understand its basic command syntax, explore common and advanced options, and see practical examples for network path tracing, while noting important considerations and usage tips.

Linux · Operations · command-line
0 likes · 6 min read
Bilibili Tech
Mar 18, 2025 · Operations

Technical Practices for Ensuring Stability of Bilibili’s 2025 Spring Festival Gala Live Stream

Bilibili’s engineering team built a scenario‑metadata and one‑click fault‑drill platform, implemented multi‑tier degradation, dynamic capacity planning, and extensive automated fault‑injection testing to guarantee zero‑severity incidents during the high‑traffic 2025 Spring Festival Gala live stream.

Fault Injection · Operations · high concurrency
0 likes · 16 min read
dbaplus Community
Mar 17, 2025 · Operations

Designing an AI‑Powered Ops Platform with DeepSeek: Architecture, Modules, and Implementation

This article outlines a comprehensive AI‑Ops solution built on DeepSeek, covering its technical architecture, data collection stack, AI engine deployment, key functional modules, implementation roadmap, model training, security design, cost estimates, and risk mitigation strategies for modern operations teams.

AI Ops · DeepSeek · Infrastructure Automation
0 likes · 7 min read
Dual-Track Product Journal
Mar 14, 2025 · Operations

How Bad Inventory Sync Can Kill Your E‑commerce Business—and 3 Fixes to Save It

This article examines how delayed or inconsistent inventory synchronization leads to costly overselling and deadstock in e‑commerce, presents three destructive synchronization patterns, and offers a step‑by‑step guide—including real‑time messaging, distributed locks, rule‑engine integration, and intelligent alerts—to transform inventory management from a liability into a self‑healing system.

Backend · Distributed Systems · Operations
0 likes · 8 min read
FunTester
Mar 14, 2025 · Operations

Fault Testing: Enhancing System Resilience through Controlled Failure Simulations

The article explains how fault testing—by deliberately injecting failures in a controlled environment—helps identify system weaknesses, validates post‑mortem improvements, and drives architectural optimization, thereby increasing high‑availability and resilience of modern internet services.

Operations · chaos engineering · fault testing
0 likes · 8 min read
Raymond Ops
Mar 13, 2025 · Operations

Boost Nginx Performance: Essential Linux Kernel Tweaks for High Concurrency

This guide explains why default Linux kernel settings are insufficient for high‑traffic Nginx servers and provides a curated list of sysctl parameters—such as file‑max, tcp_tw_reuse, and net.core buffers—along with explanations and tuning tips to maximize concurrent connections and overall performance.

Kernel Tuning · Nginx · Operations
0 likes · 8 min read
DevOps Cloud Academy
Mar 13, 2025 · Operations

Release Engineering Best Practices: Branching Models, CI/CD Guidelines, and Deployment Strategies

This article provides a comprehensive overview of release engineering, covering branch models, naming conventions, merge processes, Git commit standards, CI/CD stage design, environment isolation, artifact management, product delivery steps, deployment strategies, and rollback procedures to ensure reliable software releases.

Deployment · Kubernetes · Operations
0 likes · 26 min read
Efficient Ops
Mar 12, 2025 · Operations

How BizDevOps Is Accelerating Digital Transformation in Finance

This article explains the governmental push for digital transformation in financial institutions, introduces the BizDevOps integration model and its domestic and international standards, outlines the evaluation framework and process, showcases case studies, and announces the open registration for the 2025 BizDevOps assessment.

BizDevOps · Digital Transformation · Financial Industry
0 likes · 9 min read
FunTester
Mar 12, 2025 · Operations

Fault Injection Testing: Concepts, Scenarios, Process, and Best Practices

Fault injection testing deliberately introduces failures into a system to assess its resilience, helping identify weak points, improve retry and timeout mechanisms, and ensure robust operation across software, protocol, and infrastructure layers, with practical guidance on processes, tools, and Kubernetes-specific practices.

Fault Injection · Kubernetes · Operations
0 likes · 8 min read
Liangxu Linux
Mar 11, 2025 · Operations

Master Linux ‘ip’ Command: Essential Network Management Operations

This guide explains the Linux ip command—its syntax, how to view and control network interfaces, configure IP addresses, manage routes, set up VLANs, and handle ARP entries—providing practical examples that enable efficient network administration and troubleshooting on Linux systems.

ARP · Linux · Operations
0 likes · 6 min read
dbaplus Community
Mar 11, 2025 · Operations

How a Unified White‑Screen Ops Platform Transformed Multi‑Cloud Middleware Management

This article details the challenges of traditional middleware operations, explains how Kubernetes and Operators were leveraged to build a unified, visual, and automated platform that standardizes, automates, and visualizes multi‑cloud resource management, and reports the significant efficiency, cost, and safety gains achieved across dozens of clusters.

Kubernetes · Operations · Operator
0 likes · 23 min read
Python Programming Learning Circle
Mar 7, 2025 · Operations

Using Python Scripts for Operations Automation: Remote Execution, Log Parsing, Monitoring, Deployment, and Backup

This article explains how operations engineers can leverage Python scripts and popular libraries such as paramiko, regex, psutil, fabric, and shutil to automate tasks like remote command execution, log analysis, system monitoring, batch deployment, and backup, thereby improving efficiency and reducing manual errors.

Operations · Python · Scripting
0 likes · 10 min read
FunTester
Mar 7, 2025 · Operations

Fault Testing: Proactive Resilience Engineering for Distributed Systems

Fault testing deliberately injects failures into distributed and cloud‑native systems to expose weak points, verify recovery mechanisms, and improve overall reliability, so that business continuity holds even under unexpected disruptions.

Operations · Resilience · chaos engineering
0 likes · 11 min read
Raymond Ops
Feb 27, 2025 · Operations

Unlock Linux Secrets: Exploring /proc and /proc/self for Process Insight

This article explains the Linux /proc virtual file system and its /proc/self shortcut, detailing how to read process information such as command line, working directory, executable path, environment variables, memory maps, and memory image using simple shell commands.

Operations · proc · proc-self
0 likes · 5 min read
JD Tech Talk
Feb 26, 2025 · Operations

Business Monitoring: Importance, Metric System Design, and Practical Implementation

This article explains the significance of business monitoring, distinguishes technical and business metrics, outlines a step‑by‑step process for building a business metric system, and shares practical experiences, tools, and common pitfalls to help teams improve operational reliability and decision‑making.

Metrics · Operations · business monitoring
0 likes · 13 min read
JD Cloud Developers
Feb 26, 2025 · Operations

How to Build Effective Business Monitoring Metrics for Reliable Operations

This guide explains the significance of business monitoring, differentiates technical and business metrics, outlines a step‑by‑step process for building a robust business indicator system, and shares practical methods, tools, and common pitfalls to ensure reliable, actionable monitoring in operations.

Operations · business monitoring · incident response
0 likes · 12 min read
Open Source Linux
Feb 26, 2025 · Operations

Master Apache Log Analysis with 30 Essential Shell Commands

This guide presents a comprehensive collection of shell and awk commands for analyzing Apache access logs, covering IP counting, page request statistics, traffic filtering, performance metrics, connection states, and bandwidth usage, enabling administrators to efficiently monitor and troubleshoot web server activity.

Apache · Operations · Shell scripting
0 likes · 14 min read
DeWu Technology
Feb 24, 2025 · Mobile Development

Design and Implementation of a Mini‑Program Management Platform

The DeWu mini‑program platform unifies WeChat, Alipay and other channels into a single workflow by providing role‑based management, cross‑platform API abstraction, real‑time data synchronization, and Feishu‑linked approval, reducing manual tasks, speeding complaint handling, and boosting operational efficiency while addressing integration and security challenges.

AI tools · MiniProgram · Operations
0 likes · 9 min read
Chen Tian Universe
Feb 24, 2025 · Operations

How Enterprise Budget Management Systems Streamline Resource Allocation

This article explains how a comprehensive budget management system helps enterprises allocate limited financial, material, and human resources efficiently by defining processes, integrating with travel, expense, HR, and procurement systems, and providing real‑time control, data synchronization, and balance reporting to support strategic objectives.

Budget Control · Operations · budget management
0 likes · 10 min read
Ops Development & AI Practice
Feb 22, 2025 · Operations

Why Terraform Is the Go-To Tool for Modern Infrastructure Automation

This article explains how Terraform, as an Infrastructure as Code solution, streamlines infrastructure management by offering declarative configuration, automation, version control, repeatability, and testing, while outlining its advantages, typical use cases, a comparison with other IaC tools, and step‑by‑step installation and workflow guidance.

Infrastructure as Code · Operations · Terraform
0 likes · 9 min read
ByteDance SYS Tech
Feb 18, 2025 · Operations

How Can Data Center Planning Cut Costs and Boost Efficiency?

This article explains how a mixed‑integer programming tool developed by ByteDance's SYS‑DCD team integrates cost, reliability, delivery speed, and environmental metrics to optimize data‑center planning, reduce power waste, and accelerate deployment across multiple regional scenarios.

Linear Programming · Operations · Planning
0 likes · 15 min read
Efficient Ops
Feb 17, 2025 · Operations

From Bronze to AI‑Powered Ops: Mastering the Operations Career Ladder

This article explores the hierarchy of operations roles, outlines five career stages from entry‑level to AI‑driven expert, and offers practical advice on building foundations, automation, high‑availability design, and embracing emerging technologies.

Career Development · DevOps · Operations
0 likes · 6 min read
Baidu Geek Talk
Feb 17, 2025 · Operations

How Baidu Netdisk Prevents Service Avalanches: Dynamic Circuit Breaking & Queue Control

This article analyzes Baidu Netdisk's anti‑avalanche architecture, explaining how avalanche cascades occur in high‑concurrency services and detailing practical prevention, blocking, and mitigation techniques such as dynamic circuit breaking, traffic isolation, request‑validity checks, and socket‑level detection to maintain system reliability.

Backend Architecture · Circuit Breaking · Dynamic Throttling
0 likes · 18 min read
Alibaba Cloud Observability
Feb 17, 2025 · Cloud Native

Mastering Enterprise Alerting: Build a Robust Cloud‑Native Monitoring System

This article explores how to design and implement a comprehensive, enterprise‑grade alerting system—covering monitoring fundamentals, MTTF/MTTR concepts, multi‑layer metric collection, alert rule best practices, severity levels, notification channels, false‑positive reduction, and real‑world case studies—to ensure reliable cloud‑native operations.

Alerting · MTTR · Operations
0 likes · 35 min read
FunTester
Feb 16, 2025 · Operations

Master Byteman: Install, Build, and Configure Java Fault Injection

This guide walks you through downloading Byteman, setting up BYTEMAN_HOME, using Ant or Maven for integration, building from source, configuring the Java agent with detailed options, and leveraging tutorials for effective fault‑injection testing in Java applications.

Ant · Byteman · Fault Injection
0 likes · 8 min read
Software Development Quality
Feb 14, 2025 · R&D Management

Essential R&D Metrics: How to Measure Business Value, Delivery Speed, Quality, and Ops

This guide presents a comprehensive set of R&D performance indicators—including business value, delivery speed, engineering quality, and operational reliability—detailing each metric’s purpose, key calculation rules, and interpretation to help teams monitor and improve software development efficiency.

Delivery Speed · Operations · R&D metrics
0 likes · 8 min read
FunTester
Feb 14, 2025 · Operations

Debugging, Tracing, and Stack Management Operations in the Rule Engine

This article explains the built‑in debugging and tracing methods of the rule engine, including the debug API, trace operations, stack‑management functions such as caller checks, stack formatting, and thread‑stack tracing, along with usage examples and special cases for controlling output.

Operations · tracing
0 likes · 9 min read
FunTester
Feb 13, 2025 · Operations

Why Fault Testing Is Critical for Modern Online Systems

Online services face growing fault risks. Systematic fault testing, through chaos engineering, fault injection, stress testing, and disaster recovery drills, helps teams anticipate, evaluate, and improve system resilience, ultimately reducing downtime and protecting business continuity.

Cloud Native · Operations · automation
0 likes · 9 min read