Tagged articles
3281 articles
Raymond Ops
Jul 24, 2025 · Operations

Master Essential Linux Commands: From Navigation to File Management

This guide provides a comprehensive overview of common Linux commands for directory navigation, file manipulation, permission management, searching, compression, and system shutdown, complete with usage syntax, options, and practical examples to help users efficiently manage their Unix-like environments.

Bash · Operations · Shell
0 likes · 22 min read
Ops Community
Jul 24, 2025 · Operations

How a Small E‑commerce Site Scaled to 10 Million Daily Visits: Real‑World Architecture Lessons

This article details a small‑to‑mid‑size e‑commerce platform’s journey from a few thousand daily page views to ten million, covering business challenges, three architecture evolution stages, key technical solutions, performance optimizations, cost‑control strategies, and practical automation tips.

Operations · Performance Optimization · monitoring
0 likes · 14 min read
Old Zhao – Management Systems Only
Jul 24, 2025 · Operations

How to Tier and Manage Suppliers Effectively Across Their Full Lifecycle

This article explains why supplier tiering is essential, outlines three common classification models, describes a DMAIC‑based six‑stage supplier lifecycle, and shows how a digital procurement system can automate information management, quoting, execution, and settlement to improve operational efficiency.

Operations · procurement · supplier management
0 likes · 8 min read
Linux Ops Smart Journey
Jul 23, 2025 · Operations

Master Real-Time Kubernetes Logs with the kubectl tail Plugin

This guide explains how to install and use the kubectl tail plugin—a krew‑based tool that streams logs from multiple Kubernetes Pods and containers in real time, covering prerequisites, offline manifest download, installation steps, and practical command examples for various selectors.

Kubernetes · Log Monitoring · Operations
0 likes · 6 min read
MaGe Linux Operations
Jul 23, 2025 · Operations

How We Rescued a Crashed K8s Cluster: etcd 100% Fragmentation Recovery

This article details a P0 production incident where a Kubernetes cluster became completely unresponsive due to 100% etcd database fragmentation, describing the step‑by‑step diagnosis, emergency recovery actions, root‑cause analysis, and long‑term preventive measures for reliable cluster operation.

Cluster Recovery · Kubernetes · Operations
0 likes · 12 min read
Chen Tian Universe
Jul 23, 2025 · Operations

Designing a Robust Settlement System: 10 Essential Steps and Real‑World Scenarios

This comprehensive guide explains what a settlement system is, outlines ten critical design steps—from data source identification to bill provision—and illustrates their application across diverse scenarios such as corporate welfare platforms, government services, promotion platforms, bank acquiring, points e‑commerce, ETC, and consumer finance, while highlighting key rules, workflows, and interface considerations.

Financial Services · Operations · System Design
0 likes · 34 min read
Old Zhao – Management Systems Only
Jul 22, 2025 · Operations

How to Build an Automated Procurement Inquiry & Quotation System

This guide explains the common pain points of manual procurement inquiries and quotations, then walks through a five‑step approach to create an online, automated system that streamlines demand collection, supplier notification, price entry, real‑time comparison, and final price confirmation, improving accuracy and efficiency for both buyers and suppliers.

Operations · Supply Chain · automation
0 likes · 9 min read
High Availability Architecture
Jul 22, 2025 · Operations

How We Automated Server Fault Detection and Repair at Scale

This article explains the challenges of managing rapidly growing server fleets, outlines a systematic classification of hardware and software faults, and details an end‑to‑end automated solution that combines in‑band and out‑of‑band data collection, rule‑based detection, and fully automated repair workflows to improve fault coverage, accuracy, and recovery speed.

Operations · hardware detection · monitoring
0 likes · 16 min read
Dual-Track Product Journal
Jul 22, 2025 · Operations

How Intelligent Shelving Strategies Boost Warehouse Efficiency – A Product Manager’s Guide

This article explains how dynamic, data‑driven shelving strategies in a Warehouse Management System can dramatically improve space utilization, reduce picker travel, and streamline inventory handling by outlining key objectives, common tactics, detailed workflow steps, constraint checks, exception handling, and result processing.

Logistics · Operations · WMS
0 likes · 10 min read
MaGe Linux Operations
Jul 21, 2025 · Cloud Native

Master Kubernetes with Essential Commands: Efficient Container Cluster Management

This comprehensive guide walks operations engineers through essential Kubernetes commands, covering cluster inspection, pod lifecycle, service and network handling, storage configuration, troubleshooting, performance monitoring, scaling, security, and automation, enabling efficient and expert management of containerized clusters.

Cluster Management · Kubernetes · Operations
0 likes · 17 min read
Liangxu Linux
Jul 20, 2025 · Operations

Top 10 Essential Tools Every Operations Engineer Should Master

Discover the ten most widely used operations engineering tools—including Shell scripts, Git, Ansible, Prometheus, Grafana, Docker, Kubernetes, Nginx, ELK Stack, and Zabbix—detailing each tool's functions, ideal scenarios, advantages, and real‑world examples, plus sample code and configuration snippets.

Configuration · DevOps · Operations
0 likes · 8 min read
Su San Talks Tech
Jul 19, 2025 · Operations

Mastering Load Balancing: Architecture, Algorithms, and Real-World Pitfalls

This article explores the four‑layer load‑balancing architecture, five common algorithms (including Round Robin, Weighted RR, Least Connections, Consistent Hashing, and AI‑driven adaptive load), high‑availability design, deep pitfalls, and a self‑built load balancer implementation, providing practical code examples and best‑practice guidelines.

Backend Architecture · Operations · distributed algorithms
0 likes · 10 min read
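As a rough illustration of three of the algorithms this summary names (Round Robin, Weighted Round Robin, Least Connections), here is a minimal Python sketch. It is not taken from the article: the server names, weights, and connection counts are made-up values, and the weighted variant uses naive list expansion rather than the smoother interleaving a production balancer would likely apply.

```python
import itertools
from collections import Counter


def round_robin(servers):
    """Return an endless iterator that visits servers in a fixed order."""
    return itertools.cycle(servers)


def weighted_round_robin(weights):
    """Return an endless iterator that repeats each server `weight` times per cycle.

    `weights` maps a hypothetical server name to a positive integer weight.
    """
    expanded = [name for name, weight in weights.items() for _ in range(weight)]
    return itertools.cycle(expanded)


def least_connections(active):
    """Pick the server that currently has the fewest active connections."""
    return min(active, key=active.get)


# Hypothetical backends "a", "b", "c" used purely for illustration.
rr = round_robin(["a", "b", "c"])
rr_picks = [next(rr) for _ in range(4)]

wrr = weighted_round_robin({"a": 2, "b": 1})
wrr_counts = Counter(next(wrr) for _ in range(6))

lc_pick = least_connections({"a": 5, "b": 2, "c": 7})
```

With a 2:1 weighting, server "a" receives twice as many requests as "b" over any full cycle, while the least-connections picker ignores ordering and looks only at current load.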
Ops Development & AI Practice
Jul 18, 2025 · Operations

Mastering Modern Software Operations: The Six Essential Steps for Success

Modern software operations have shifted from a post‑launch checklist to an ongoing, automated discipline, and this article outlines the six core phases—requirement planning, CI/CD automation, comprehensive monitoring, incident response, performance tuning, and security compliance—providing concrete examples and practical advice for building a resilient DevOps culture.

DevOps · Operations · Performance Optimization
0 likes · 9 min read
Lin is Dream
Jul 18, 2025 · Backend Development

Master Nginx: From Installation to Advanced Configuration on Linux, Docker, and macOS

This guide walks you through installing and configuring Nginx on CentOS, Docker, and macOS, explains core concepts like event‑driven architecture, compares Nginx with Tomcat, and clarifies the roles of traffic gateways, API gateways, and service registries, providing practical commands and Q&A for developers.

Docker · Installation · Operations
0 likes · 10 min read
Raymond Ops
Jul 17, 2025 · Operations

Essential Ops Toolkit: Unified Account Management, Automation, DNS, and More

This guide reviews a comprehensive set of operations tools—including LDAP, JumpServer, Fabric, Ansible, dnsmasq, pdnsd, ApacheBench, TCPcopy, PortSentry, fail2ban, knockd, Vagrant, Docker, ELK, and Smokeping—detailing their features, advantages, and typical use cases for modern infrastructure management.

DNS · Operations · logging
0 likes · 8 min read
Raymond Ops
Jul 16, 2025 · Operations

Master Linux File Search: locate vs find – Fast Tips & Advanced Options

This guide explains how the Linux locate and find commands work, covering their performance characteristics, key features, common options, search criteria, combining conditions, actions, and how to use xargs for parameter passing, helping users choose the right tool for efficient file searching.

File Search · Operations · command-line
0 likes · 8 min read
Ops Community
Jul 15, 2025 · Operations

Why 90% of Ops Teams Choose the Wrong LVS Mode – A Deep Dive into Performance

This article examines the four Linux Virtual Server (LVS) clustering modes—NAT, Direct Routing, Tunneling, and FULLNAT—detailing their architectures, data flows, configuration steps, advantages, disadvantages, and ideal use cases, helping operations engineers select the most suitable load‑balancing solution for high‑performance, scalable web services.

LVS · Operations · load balancing
0 likes · 18 min read
Ops Community
Jul 14, 2025 · Operations

How Three Open‑Source Ops Tools Can Eliminate Manual Maintenance Nightmares

Drawing on eight years of ops experience, this article shares three open‑source tools for SSL certificate management, SMS verification, and website monitoring that automate routine tasks, cut costs, and prevent midnight outages, offering practical solutions for individual developers, small teams, and SMEs.

Cost reduction · Operations · SMS Verification
0 likes · 6 min read
Old Zhao – Management Systems Only
Jul 14, 2025 · Operations

How to Turn Uncooperative Suppliers into Collaborative Partners with Smart SRM Design

This article explains why suppliers often seem uncooperative, identifies the three core concerns that drive their behavior, and outlines five practical SRM system features—real‑time reconciliation, payment visibility, production planning, inventory alerts, and transparent performance—that transform supplier relationships into mutually beneficial collaborations.

Collaboration · Operations · SRM
0 likes · 8 min read
Efficient Ops
Jul 8, 2025 · Operations

How China International Aviation Achieved Advanced DevOps Maturity Through Dual International and Domestic Standards

This article details China International Aviation's successful DevOps assessment—earning a Level‑2 Continuous Delivery rating—by aligning its refund system with both ITU international standards and domestic Chinese standards, and includes an in‑depth interview on implementation, benefits, challenges, and future plans.

Case Study · Continuous Delivery · DevOps
0 likes · 16 min read
Raymond Ops
Jul 8, 2025 · Operations

Comprehensive Multi‑Distro Linux Initialization Script Guide

This article presents a complete collection of shell scripts for initializing a wide range of Linux distributions—including Rocky, AlmaLinux, CentOS, Ubuntu, Debian, openEuler, AnolisOS, OpenCloudOS, openSUSE, Kylin Server, and Uos Server—detailing supported features, version‑specific updates, and step‑by‑step usage instructions.

Operations · System Initialization · automation
0 likes · 19 min read
Practical DevOps Architecture
Jul 8, 2025 · Operations

Master Filebeat: Complete Configuration Guide for Log Shipping

This article provides a complete Filebeat configuration example, covering input settings for log files, field definitions, multiline handling, module loading, Elasticsearch output parameters, index naming, authentication, and processors for field cleanup, enabling efficient log collection and indexing in Elastic Stack environments.

Filebeat · Log Shipping · Operations
0 likes · 2 min read
Liangxu Linux
Jul 7, 2025 · Operations

Why Dish Beats Netstat: A Lightweight Socket Monitoring CLI

This article introduces Dish, a Go‑based command‑line utility for real‑time socket connection monitoring, explains its core features, technical principles, installation and usage examples, compares it with tools like netcat, nmap and telnet, and discusses its advantages and limitations for operations teams.

CLI tool · Go · Operations
0 likes · 10 min read
Ops Development & AI Practice
Jul 7, 2025 · Cloud Computing

Why Infrastructure Architecture Is the Hidden Backbone of Modern Cloud Systems

Infrastructure architecture, the often‑overlooked foundation of IT, defines how compute, storage, networking, and security are designed, integrated, and automated—linking software, ops, and cloud strategies—through processes like requirement analysis, technology selection, IaC implementation, and continuous optimization for reliability, performance, cost, and operational excellence.

DevOps · Infrastructure · Operations
0 likes · 8 min read
Ops Development & AI Practice
Jul 7, 2025 · Operations

Why Infrastructure as Code Is a Game‑Changer for Modern Ops

From manual server provisioning nightmares to automated, version‑controlled infrastructure, this article explains what IaC is, why it matters, and how to adopt it using Terraform and Ansible, offering practical steps, best‑practice tips, and real‑world benefits for operations teams.

Ansible · Infrastructure as Code · Operations
0 likes · 10 min read
Su San Talks Tech
Jul 7, 2025 · Operations

Mastering High Availability: Redundancy & Automatic Failover in Modern Internet Architecture

This article explains how to achieve high availability in internet systems by designing redundant components and automatic failover mechanisms across layers such as load balancers, reverse proxies, microservices, middleware, databases, and messaging, illustrating concepts with diagrams of architectures, clustering, leader election, and practical tools like keepalived, Zookeeper, Redis Sentinel, and Kafka.

Microservices · Operations · failover
0 likes · 19 min read
FunTester
Jul 5, 2025 · Operations

Essential Fault‑Testing Resources: Guides, Tools, and Articles for 2025

This curated collection gathers the most valuable fault‑testing, Byteman, Chrome Extension, and frontend development articles from 2024‑2025, providing concise titles, direct links, and publication dates to help engineers quickly locate essential technical knowledge.

Byteman · Chrome Extension · Operations
0 likes · 6 min read
dbaplus Community
Jul 3, 2025 · Cloud Native

Rescue Expired Kubernetes Certificates Offline: A 4‑Step Emergency Guide

Facing certificate expiration in isolated, regulated Kubernetes clusters? This guide explains the hidden risks, outlines a four‑step offline rescue toolkit, details automated rotation with Cert‑Manager and Vault, and provides compliance audit and disaster‑recovery strategies, illustrated with real‑world banking case studies.

Cloud Native · Kubernetes · Operations
0 likes · 11 min read
Ops Development & AI Practice
Jul 3, 2025 · Operations

Why Event-Driven Architecture Is the Secret Sauce for Resilient Ops

The article explains how Event‑Driven Architecture (EDA) transforms traditional request‑response systems into decoupled, asynchronous pipelines that boost system resilience, scalability, observability, and agility, and it demonstrates a practical AWS EventBridge image‑processing workflow.

AWS EventBridge · EDA · Event-Driven Architecture
0 likes · 10 min read
Efficient Ops
Jul 2, 2025 · Operations

How AI and DevOps Are Shaping the Future of Enterprise Operations

This article outlines the evolution of DevOps standards, organization‑level capability building, AI‑enabled R&D and operations, and the foundational AI‑driven assurance framework, highlighting recent assessments, standards, and industry adoption across finance, telecom, manufacturing, and other sectors.

AI · DevOps · Operations
0 likes · 14 min read
Efficient Ops
Jul 2, 2025 · Operations

Master Grafana: Key Features, Installation on Linux & Docker

This guide introduces Grafana, outlines its multi‑source monitoring features, and provides step‑by‑step installation instructions for Linux using systemd and for Docker Compose, including required commands, configuration files, and how to create and save a basic dashboard.

Docker · Grafana · Installation
0 likes · 4 min read
Ops Development & AI Practice
Jul 2, 2025 · Cloud Native

Demystifying Cloud Native: A Hands‑On Guide for Ops Engineers

This article breaks down the cloud‑native concept for operations teams, explaining its meaning, the three core pillars—containerization, microservices, and container orchestration—and how adopting them can accelerate delivery, improve resilience, cut costs, and free engineers from repetitive manual tasks.

Cloud Native · Containers · DevOps
0 likes · 8 min read
Old Zhao – Management Systems Only
Jul 1, 2025 · Operations

Why Inventory Turnover Rate Is the True KPI for Supply Chain Efficiency

Inventory turnover rate, not delivery time or fulfillment rate, is the key metric for evaluating supply chain efficiency; this article explains its definitions, business-specific calculations, common pitfalls, data requirements, and practical steps to set, monitor, and leverage the metric for better operational performance.

KPIs · Manufacturing · Operations
0 likes · 9 min read
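For readers unfamiliar with the metric, the sketch below uses the common textbook form of inventory turnover (cost of goods sold divided by average inventory) and the derived days-of-inventory figure. This is an assumption about the definition: the article says it covers business-specific calculations, which may differ, and all figures here are invented for illustration.

```python
def inventory_turnover(cogs, opening_inventory, closing_inventory):
    """Turnover = cost of goods sold / average inventory over the period."""
    average_inventory = (opening_inventory + closing_inventory) / 2
    if average_inventory <= 0:
        raise ValueError("average inventory must be positive")
    return cogs / average_inventory


def days_of_inventory(turnover, period_days=365):
    """Average number of days stock sits before it is sold."""
    return period_days / turnover


# Hypothetical annual figures, in any single currency.
turnover = inventory_turnover(
    cogs=1_200_000, opening_inventory=180_000, closing_inventory=220_000
)
days = days_of_inventory(turnover)
```

Here the average inventory is 200,000, so stock turns over 6 times a year, or roughly once every 61 days.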
macrozheng
Jul 1, 2025 · Operations

Best Log Management Tools Compared: Filebeat, Graylog, ELK, Loki, Datadog & More

This article provides a comprehensive comparison of popular log management solutions—including Filebeat, Graylog, the Elastic (ELK) stack, Grafana Loki, LogDNA, Datadog, Logstash, Fluentd, and Splunk—detailing their main features, pricing models, advantages, and drawbacks to help you choose the right tool for your needs.

ELK Stack · Log Management · Operations
0 likes · 16 min read
Chen Tian Universe
Jul 1, 2025 · Operations

How Do WeChat & Alipay Process Billions Daily? Inside Their Settlement System Design

This article dissects the core principles and architectural patterns behind Chinese super‑payment platforms' settlement systems, covering transaction‑clear‑settlement flow, settlement modes, account models, system components, processing steps, and key performance indicators for handling trillions of yuan each day.

Operations · financial systems · payment settlement
0 likes · 12 min read
ITPUB
Jun 30, 2025 · Operations

100 Essential Windows Command-Line Tools for System Administration

This comprehensive guide lists 100 practical Windows command‑line utilities, organized into categories such as system management, network diagnostics, file handling, process control, and advanced operational tasks, while highlighting high‑risk commands and best‑practice rules for safe administration.

Batch Scripts · Operations · Windows
0 likes · 13 min read
Old Zhao – Management Systems Only
Jun 30, 2025 · Operations

How to Slash Procurement Costs: 5 Proven Process Controls

This guide reveals how companies can dramatically cut procurement expenses by first mapping their spending flow, then tightening demand control, approval boundaries, pricing comparisons, payment accuracy, and leveraging a digital system to create a closed‑loop, data‑driven purchasing process.

Cost reduction · Digital Transformation · Operations
0 likes · 9 min read
Ops Development & AI Practice
Jun 28, 2025 · Information Security

Mastering AWS Temporary Credentials: Securely Assume IAM Roles

This guide explains why long‑lived IAM user keys are risky, introduces IAM roles and temporary security credentials, details trust and permissions policies, and provides step‑by‑step commands and profile configurations for safely using AWS STS assume‑role in production environments.

AWS · AssumeRole · IAM
0 likes · 8 min read
Raymond Ops
Jun 27, 2025 · Operations

How to Set Up Real‑Time NFS Backup with inotify and rsync

This guide walks through configuring rsync and inotify on multiple Linux hosts to achieve real‑time backup of NFS static resources, covering host preparation, rsync daemon setup, password handling, daemon activation, inotify‑driven monitoring scripts, and verification of successful synchronization.

NFS · Operations · Sysadmin
0 likes · 12 min read
Java Tech Enthusiast
Jun 26, 2025 · Information Security

Why Microsoft Office Users Saw TLS Certificate Errors and What It Means

A missed renewal of a TLS certificate for the domain https://support.content.office.net caused widespread certificate‑expired warnings for Office users on June 24, 2024. The lapse affects all services that rely on several related domains and will likely be resolved once Microsoft updates the certificate during regular working hours.

Microsoft Office · Operations · TLS
0 likes · 3 min read
dbaplus Community
Jun 25, 2025 · Operations

How We Boosted Kafka Production Capacity by 35% with Simple Compression Tweaks

Facing petabyte‑scale log traffic, the Qunar team identified low compression rates in their Kafka‑Filebeat pipeline as the main bottleneck and, through systematic tuning of batch size, memory queues, and round‑robin settings, achieved a 35% reduction in traffic and a 30‑42% drop in request volume while raising per‑minute throughput by 35%.

Backend · Filebeat · Kafka
0 likes · 10 min read
Alibaba Cloud Observability
Jun 24, 2025 · Operations

Avoid These 6 Log Management Anti‑Patterns to Keep Your Observability Reliable

This article examines common log‑management anti‑patterns—such as copy‑truncate rotation, NAS storage, multi‑process writes, file‑hole creation, frequent overwrites, and Vim edits—explains why they cause data loss or duplicate collection, and offers practical best‑practice recommendations for reliable log handling in cloud‑native environments.

Anti-Patterns · Operations · best practices
0 likes · 8 min read
Efficient Ops
Jun 24, 2025 · Operations

Essential Ops Toolkit: 58 Core Tools for Modern Infrastructure Management

This article compiles a comprehensive matrix of 58 mainstream operations tools—covering operating systems, open‑source mirrors, containers, AI‑assisted ops, basic services, databases, monitoring, automation, CI/CD and service mesh—to help engineers quickly locate the right technology stack for efficient infrastructure management.

DevOps · Infrastructure · Operations
0 likes · 6 min read
Old Zhao – Management Systems Only
Jun 19, 2025 · Operations

Boost Your Business: Master Inventory Turnover for Faster Cash Flow

This article explains what inventory turnover is, why it matters for cash flow and operational efficiency, and provides a three‑step framework—data dashboards, product rhythm management, and supply‑chain coordination—plus three key monitoring practices (trend, product, people) to continuously improve warehouse performance.

KPIs · Operations · Supply Chain
0 likes · 8 min read
Mingyi World Elasticsearch
Jun 18, 2025 · Operations

How to Reset a Forgotten Elasticsearch 8.x/9.x Password Safely

When the built‑in elastic user password is lost in Elasticsearch 8.x or 9.x, you can use the official elasticsearch‑reset‑password command‑line tool to generate or set a new password without restarting the service, following a few simple steps and troubleshooting tips.

Elasticsearch · Operations · elasticsearch-reset-password
0 likes · 4 min read
Efficient Ops
Jun 18, 2025 · Operations

Bizarre Ops Nightmares: Real-World Failures That Shocked Engineers

A collection of startling operational mishaps—from a disastrous database expansion during a sales event to a Kubernetes storage blunder, a misconfigured ESXi host, a company‑wide Excel crash, and a power‑maintenance disaster that fried servers—illustrates the critical importance of proper procedures, backups, and infrastructure monitoring.

Incident · Operations · UPS
0 likes · 7 min read
Old Zhao – Management Systems Only
Jun 18, 2025 · Operations

How to Build a Bulletproof Procurement System and Avoid Being Blamed

This article explains why procurement teams often get blamed when supply chain issues arise and outlines a practical, step‑by‑step framework—including standardized demand entry, automated approvals, consistent pricing comparisons, clear contract delivery nodes, and closed‑loop payment—to create a transparent, efficient procurement system.

Operations · Supply Chain · process automation
0 likes · 7 min read
Alibaba Cloud Native
Jun 18, 2025 · Operations

Avoid These 6 Log Management Anti‑Patterns to Keep Your Cloud‑Native Systems Reliable

Effective log management is crucial for cloud‑native observability, yet common practices like copy‑truncate rotation, NAS storage, multi‑process writes, file‑hole creation, frequent overwrites, and vim edits can cause data loss or duplicate collection; adopting create‑mode rotation, local disks, append‑only writes, and proper tools mitigates these risks.

Cloud Native · Log Management · Operations
0 likes · 10 min read
Raymond Ops
Jun 17, 2025 · Operations

Diagnosing Disk Space Issues on Linux with df and du Commands

This article walks through troubleshooting a failed deployment caused by a full disk, showing how to use df -h to check overall disk usage and various du options (including --max-depth and -sh) to pinpoint large directories and resolve the issue.

Operations · df · disk space
0 likes · 4 min read
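The same two-step diagnosis (overall usage first, then the largest directories) can be approximated in Python for scripted checks. The sketch below is an assumption-laden analogue, not the article's procedure: `usage_percent` and `dir_sizes` are hypothetical helper names, and `dir_sizes` sums apparent file sizes, whereas `du` reports allocated disk blocks, so the numbers will not match `du` exactly.

```python
import os
import shutil


def usage_percent(path="/"):
    """Percentage of the filesystem holding `path` that is in use (like df)."""
    total, used, _free = shutil.disk_usage(path)
    return 100 * used / total


def dir_sizes(root):
    """Apparent size in bytes of each immediate subdirectory of `root`.

    Results are sorted largest first, roughly like `du --max-depth=1 | sort -n -r`.
    """
    sizes = {}
    for entry in os.scandir(root):
        if not entry.is_dir(follow_symlinks=False):
            continue
        total = 0
        for dirpath, _dirnames, filenames in os.walk(entry.path):
            for name in filenames:
                try:
                    # lstat avoids following symlinks out of the tree.
                    total += os.lstat(os.path.join(dirpath, name)).st_size
                except OSError:
                    # Files may vanish or be unreadable while we scan.
                    pass
        sizes[entry.name] = total
    return sorted(sizes.items(), key=lambda item: item[1], reverse=True)
```

A typical use would be to call `usage_percent("/")` first and, if it is close to 100, run `dir_sizes` on a suspect parent directory and descend into the largest result.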
Open Source Linux
Jun 17, 2025 · Operations

Master HAProxy: Step‑by‑Step Installation, Configuration, and Advanced Load Balancing

This comprehensive guide walks you through installing HAProxy via yum, RPM packages, or source compilation, then details every core configuration block—including global, defaults, frontend, backend, and listen sections—while covering load‑balancing algorithms, ACL routing, health checks, SSL termination, statistics, and practical code examples for building a robust, high‑performance load‑balancer.

Configuration · HAProxy · Installation
0 likes · 53 min read
Open Source Linux
Jun 17, 2025 · Operations

Future‑Proof Your Ops Career: A Practical Skill & Personal Growth Blueprint

This article offers ops professionals a comprehensive roadmap to boost technical expertise, embrace AI and big‑data trends, and cultivate personal habits such as health, finance, communication, and hobbies, turning weekend time into a powerful engine for career resilience and lifelong fulfillment.

AI · Career Development · Operations
0 likes · 7 min read
Ops Development & AI Practice
Jun 14, 2025 · Information Security

Designing a Resilient Zero‑Trust Security Architecture on AWS for Small Ops Teams

This article outlines a comprehensive, financial‑grade security blueprint for a three‑person operations team using AWS services such as IAM, Secrets Manager, Session Manager, GuardDuty, and WAF, emphasizing Zero Trust, Least Privilege, and Defense‑in‑Depth to protect against external attacks, internal risks, and to enable clear audit trails for incident investigation.

AWS · IAM · Operations
0 likes · 13 min read
Raymond Ops
Jun 13, 2025 · Operations

Master HAProxy: Step-by-Step Deployment and Configuration Guide

This article provides a comprehensive, hands‑on guide to installing HAProxy, configuring global, defaults, listen, frontend, and backend sections, setting up ACL‑based load balancing, preparing backend web servers, testing the setup, and accessing the HAProxy statistics page.

ACL · Backend · Configuration
0 likes · 16 min read
TAL Education Technology
Jun 13, 2025 · Operations

How Large Language Models Are Revolutionizing Fault Localization

This article explores how the rapid rise of large language models and techniques like Retrieval‑Augmented Generation, Chain‑of‑Thought prompting, and multi‑agent architectures can dramatically improve the speed, accuracy, and automation of fault localization in modern operations environments.

Agent Architecture · CoT · Fault Localization
0 likes · 14 min read
Efficient Ops
Jun 10, 2025 · Operations

What Caused the June 6, 2025 Alibaba Cloud DNS Outage and How to Mitigate It?

On June 6, 2025, Alibaba Cloud experienced a widespread DNS resolution failure affecting OSS, CDN, container image services, and more, which was later linked to a Shadowserver sinkhole. The article outlines the incident timeline, root‑cause analysis, and practical mitigation steps for operators.

Alibaba Cloud · DNS outage · Operations
0 likes · 4 min read
MaGe Linux Operations
Jun 9, 2025 · Operations

Essential Kubernetes Troubleshooting Checklist for Ops Engineers

This guide provides Kubernetes operators with a comprehensive, step‑by‑step troubleshooting manual covering pod, node, and cluster‑level issues, common pod states, exit‑code analysis, and practical commands such as kubectl describe, logs, top, and drain, enabling rapid diagnosis and resolution of K8s problems.

Kubernetes · Node · Operations
0 likes · 10 min read
Efficient Ops
Jun 4, 2025 · Operations

Streamline Nginx Management with Nginx UI: Features, Installation & AI Agent Integration

This article introduces Nginx UI, a graphical tool that simplifies Nginx configuration and monitoring, outlines its core features—including AI Agent support—provides pre‑installation notes, and offers step‑by‑step installation guides for Systemd, Docker, and quick‑install scripts, concluding with its operational benefits.

Docker · Nginx · Operations
0 likes · 5 min read
Old Zhao – Management Systems Only
Jun 4, 2025 · Operations

Why Warehouses Overflow Yet Stockouts Occur? Root Causes & Solutions

The article explains why warehouses can be overfilled while customers still face stockouts, analyzing false and structural overstock, flawed demand planning, weak supply chain execution, and offers practical steps such as data‑driven forecasting, ABC inventory classification, transparent collaboration, fast‑response mechanisms, and accountability to resolve the paradox.

Operations · demand planning · inventory management
0 likes · 11 min read
dbaplus Community
Jun 3, 2025 · Operations

Mastering Kubernetes High Availability: Control Plane, Nodes, Networking, Storage, and More

This comprehensive guide walks you through designing a highly available Kubernetes cluster, covering multi‑master control‑plane deployment, worker‑node resilience, advanced networking with Cilium, durable storage with Rook/Ceph, monitoring with Thanos, security policies, disaster‑recovery strategies, cost control, and automated rollouts, all illustrated with concrete configuration snippets and real‑world performance results.

Cluster Design · DevOps · Kubernetes
0 likes · 13 min read
Old Zhao – Management Systems Only
May 29, 2025 · Operations

Master Supplier Performance Evaluation: A Complete SRM Guide

This comprehensive guide explains what supplier performance evaluation is, why it matters, and provides a step‑by‑step "3+1" framework—including metric definition, scoring methods, result grading, and system integration—to help organizations build a data‑driven, actionable SRM process that improves supply chain reliability and reduces costs.

Operations · Performance Evaluation · SRM
0 likes · 8 min read
Mike Chen's Internet Architecture
May 27, 2025 · Operations

Understanding L4 and L7 Load Balancing Architectures

This article explains the fundamentals of Layer‑4 and Layer‑7 load balancing, compares their advantages and disadvantages, and describes how a hybrid approach can combine high‑performance traffic handling with flexible application‑level routing for large‑scale systems.

L4 · L7 · Operations
0 likes · 4 min read
Bilibili Tech
May 27, 2025 · Operations

Automated Server Fault Detection and Repair: Architecture, Methods, and Future Outlook

This article presents a comprehensive overview of server fault management at scale, detailing the classification of failures, shortcomings of traditional manual processes, and the design of an automated detection and repair system that combines in‑band and out‑of‑band data collection, rule‑based alerting, and end‑to‑end repair workflows, while also outlining future directions for intelligent monitoring and reliability.

Infrastructure · Operations · automation
0 likes · 17 min read
Mingyi World Elasticsearch
May 27, 2025 · Operations

The Deep‑Dive Elasticsearch Settings List You Must Know

This article presents a comprehensive, source‑code‑derived list of every Elasticsearch configuration option—including hidden and undocumented settings—explains their scopes, default values, and types, and shows how the list can be used for quick lookups, performance tuning, debugging, and automation.

Cluster Configuration · Elasticsearch · Operations
0 likes · 10 min read
Efficient Ops
May 26, 2025 · Artificial Intelligence

How AI Agents Are Revolutionizing AIOps: Boosting Automation and Efficiency

This article explains how AI agents enhance large‑model capabilities for AIOps, detailing single‑agent use cases like knowledge retrieval, tool guidance, and fault diagnosis, as well as multi‑agent collaborations, required skills, and future prospects for autonomous operations.

AI · Operations · agent
0 likes · 7 min read
Raymond Ops
May 26, 2025 · Operations

Master Nginx Log Formatting: Customize, Test, and Optimize Your Access Logs

This guide explains how to use Nginx's HttpLogModule to control log output, defines key directives such as access_log, log_format, and open_log_file_cache, provides example configurations, demonstrates testing with curl, and offers practical tips for per‑location log management to improve troubleshooting and performance.

Access Log · Operations · log format
0 likes · 6 min read
Raymond Ops
May 24, 2025 · Operations

How to Install and Configure rsync on Windows Server for Automated Backups

This guide walks through the required environment, Windows Server rsync installation, configuration of rsyncd.conf and password files, service startup, port verification, and client-side commands to achieve reliable, scheduled file synchronization between Windows machines.

Backup · Operations · Windows server
0 likes · 4 min read
Alibaba Cloud Developer
May 23, 2025 · Operations

How to Schedule Dify Workflows with GitHub Actions and XXL‑JOB

This article explains how to overcome Dify's lack of built‑in scheduling and monitoring by integrating it with external task‑scheduling systems such as GitHub Actions and XXL‑JOB, detailing setup steps, limitations, and the advantages of using XXL‑JOB for precise, enterprise‑grade workflow automation.

AI workflow · Dify · GitHub Actions
0 likes · 11 min read
Qiming AI - Digital Management Talk
May 23, 2025 · Operations

Boost Warehouse Efficiency: 6 Proven Strategies from a 7‑Year Expert

This article explains what warehouse management really means, outlines its key goals such as safety, efficiency, supply‑chain coordination and data visualization, and presents six practical methods—including clear objectives, ABC classification, barcode usage, FIFO, space optimization, and digital automation—to dramatically improve warehouse performance.

Digital Transformation · Logistics · Operations
0 likes · 9 min read
Youzan Coder
May 23, 2025 · Artificial Intelligence

How LLMs Supercharge SaaS Alert Monitoring: An AI‑Powered Workflow

This article explains how a SaaS company leveraged large language models to automatically ingest, enrich, and analyze stability alerts, turning noisy notifications into actionable insights through configurable pipelines, Feishu integration, and a streamlined AI workflow that boosts incident response speed and reduces manual effort.

AI · Alert Monitoring · LLM
0 likes · 6 min read
Liangxu Linux
May 21, 2025 · Operations

Master Apache Log Analysis with 20 Essential Linux Commands

This guide presents a curated collection of 20 practical Linux one‑liners—using awk, grep, netstat, and other shell tools—to extract IP counts, page views, bandwidth, error rates, concurrency, and other key metrics from Apache access logs, enabling quick and thorough server traffic analysis.

Apache · Operations · Shell
0 likes · 10 min read
Efficient Ops
May 21, 2025 · Operations

Why We Dropped Kubernetes: Cutting Costs by 62% and Boosting DevOps Happiness

Six months after abandoning Kubernetes, our DevOps team reduced infrastructure spend by 62%, cut deployment time by 89%, eliminated weekend on‑call duties, and improved overall happiness, demonstrating that simplifying the tech stack can deliver substantial operational and business benefits.

Cost reduction · DevOps · Infrastructure
0 likes · 9 min read
Dual-Track Product Journal
May 21, 2025 · Operations

Master Warehouse Management: Essential Terms & Strategies Every PM Should Know

This comprehensive guide covers core WMS terminology: basic concepts such as locations, storage slots, and SKUs; inbound and outbound processes; inventory management techniques such as FIFO and safety stock; strategic approaches including wave and picking methods; essential equipment such as PDAs and RFID; and advanced industry jargon. It gives product managers the knowledge to navigate technical discussions, impress stakeholders, and optimize warehouse operations.

Logistics · Operations · inventory
0 likes · 11 min read
MaGe Linux Operations
May 19, 2025 · Operations

Simplify Domain and SSL Certificate Management with a Unified Platform

This article outlines common challenges in multi‑platform domain and HTTPS certificate management, introduces a unified management platform with features like automated syncing, Let’s Encrypt integration, and multi‑channel alerts, provides a step‑by‑step Docker deployment guide, and shares a curated collection of popular open‑source monitoring tools.

Docker deployment · Operations · SSL
0 likes · 7 min read