Tagged articles

Operations

3329 articles · Page 5 of 34

Jun 24, 2025 · Operations

How to Efficiently Back Up GitLab 17.8: Pure Repos and Full Instance Strategies

This guide explains why backing up GitLab as a pure Git repository is useful, compares the git clone --mirror method with GitLab's built‑in Rake backup, details how to back up critical configuration files, and offers practical automation tips for a robust, multi‑layered backup strategy.

GitLab · Operations · Rake

0 likes · 13 min read

Alibaba Cloud Observability

Jun 24, 2025 · Operations

Avoid These 6 Log Management Anti‑Patterns to Keep Your Observability Reliable

This article examines common log‑management anti‑patterns—such as copy‑truncate rotation, NAS storage, multi‑process writes, file‑hole creation, frequent overwrites, and Vim edits—explains why they cause data loss or duplicate collection, and offers practical best‑practice recommendations for reliable log handling in cloud‑native environments.

Anti-patterns · Best Practices · Observability

0 likes · 8 min read

Efficient Ops

Jun 24, 2025 · Operations

Essential Ops Toolkit: 58 Core Tools for Modern Infrastructure Management

This article compiles a comprehensive matrix of 58 mainstream operations tools—covering operating systems, open‑source mirrors, containers, AI‑assisted ops, basic services, databases, monitoring, automation, CI/CD and service mesh—to help engineers quickly locate the right technology stack for efficient infrastructure management.

Cloud · Operations · devops

0 likes · 6 min read

Old Zhao – Management Systems Only

Jun 19, 2025 · Operations

Boost Your Business: Master Inventory Turnover for Faster Cash Flow

This article explains what inventory turnover is, why it matters for cash flow and operational efficiency, and provides a three‑step framework—data dashboards, product rhythm management, and supply‑chain coordination—plus three key monitoring practices (trend, product, people) to continuously improve warehouse performance.

KPIs · Operations · inventory turnover

0 likes · 8 min read

Mingyi World Elasticsearch

Jun 18, 2025 · Operations

How to Reset a Forgotten Elasticsearch 8.x/9.x Password Safely

When the built‑in elastic user password is lost in Elasticsearch 8.x or 9.x, you can use the official elasticsearch‑reset‑password command‑line tool to generate or set a new password without restarting the service, following a few simple steps and troubleshooting tips.

Elasticsearch · Operations · elasticsearch-reset-password

0 likes · 4 min read

Efficient Ops

Jun 18, 2025 · Operations

Bizarre Ops Nightmares: Real-World Failures That Shocked Engineers

A collection of startling operational mishaps—from a disastrous database expansion during a sales event to a Kubernetes storage blunder, a misconfigured ESXi host, a company‑wide Excel crash, and a power‑maintenance disaster that fried servers—illustrates the critical importance of proper procedures, backups, and infrastructure monitoring.

Operations · UPS · failure

0 likes · 7 min read

Old Zhao – Management Systems Only

Jun 18, 2025 · Operations

How to Build a Bulletproof Procurement System and Avoid Being Blamed

This article explains why procurement teams often get blamed when supply chain issues arise and outlines a practical, step‑by‑step framework—including standardized demand entry, automated approvals, consistent pricing comparisons, clear contract delivery nodes, and closed‑loop payment—to create a transparent, efficient procurement system.

Operations · process automation · procurement

0 likes · 7 min read

Alibaba Cloud Native

Jun 18, 2025 · Operations

Avoid These 6 Log Management Anti‑Patterns to Keep Your Cloud‑Native Systems Reliable

Effective log management is crucial for cloud‑native observability, yet common practices such as copy‑truncate rotation, NAS storage, multi‑process writes, file‑hole creation, frequent overwrites, and Vim edits can cause data loss or duplicate collection. Adopting create‑mode rotation, local disks, append‑only writes, and proper tools mitigates these risks.

Operations · cloud-native · log management

0 likes · 10 min read

Raymond Ops

Jun 17, 2025 · Operations

Diagnosing Disk Space Issues on Linux with df and du Commands

This article walks through troubleshooting a failed deployment caused by a full disk, showing how to use df -h to check overall disk usage and various du options (including --max-depth and -sh) to pinpoint large directories and resolve the issue.

Linux · Operations · Troubleshooting

0 likes · 4 min read
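
The df/du workflow summarized above can be approximated in a few lines of Python. This is only a rough sketch, not the article's own commands: the function names are made up for illustration, and unlike `du` it sums apparent file sizes rather than allocated disk blocks.

```python
import os
import shutil


def used_percent(path="/"):
    """Percentage of space used on the filesystem holding `path` (like df)."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def child_sizes(root):
    """Map each immediate child of `root` to its total size in bytes.

    Roughly comparable to `du --max-depth=1`, but based on apparent sizes.
    Symlinks are skipped so nothing is counted twice.
    """
    sizes = {}
    for entry in os.scandir(root):
        if entry.is_symlink():
            continue
        if entry.is_file():
            sizes[entry.name] = entry.stat().st_size
        elif entry.is_dir():
            total = 0
            for dirpath, _dirnames, filenames in os.walk(entry.path):
                for name in filenames:
                    full = os.path.join(dirpath, name)
                    if os.path.islink(full):
                        continue
                    try:
                        total += os.path.getsize(full)
                    except OSError:
                        # File vanished or is unreadable; ignore it.
                        pass
            sizes[entry.name] = total
    return sizes


def largest_children(root, n=5):
    """Return the `n` largest children of `root` as (name, bytes) pairs."""
    ranked = sorted(child_sizes(root).items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:n]
```

Calling `largest_children("/var", 5)` repeatedly on the biggest result is the same drill-down that successive `du` runs perform by hand.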

Open Source Linux

Jun 17, 2025 · Operations

Master HAProxy: Step‑by‑Step Installation, Configuration, and Advanced Load Balancing

This comprehensive guide walks you through installing HAProxy via yum, RPM packages, or source compilation, then details every core configuration block—including global, defaults, frontend, backend, and listen sections—while covering load‑balancing algorithms, ACL routing, health checks, SSL termination, statistics, and practical code examples for building a robust, high‑performance load‑balancer.

Frontend · HAProxy · Installation

0 likes · 53 min read

Open Source Linux

Jun 17, 2025 · Operations

Future‑Proof Your Ops Career: A Practical Skill & Personal Growth Blueprint

This article offers ops professionals a comprehensive roadmap to boost technical expertise, embrace AI and big‑data trends, and cultivate personal habits such as health, finance, communication, and hobbies, turning weekend time into a powerful engine for career resilience and lifelong fulfillment.

AI · Operations · career development

0 likes · 7 min read

Efficient Ops

Jun 16, 2025 · Operations

How Guoxin Securities Transformed Ops with CMDB Data Governance: A Four‑Stage Blueprint

At the 25th GOPS Global Operations Conference in Shenzhen, Guoxin Securities senior manager Lin Ying presented a four‑stage CMDB development journey, a comprehensive data‑governance solution, a real‑world case study, and a future outlook, offering practical insights for digital transformation in operations.

CMDB · Case Study · Operations

0 likes · 3 min read

Ops Development & AI Practice

Jun 14, 2025 · Information Security

Designing a Resilient Zero‑Trust Security Architecture on AWS for Small Ops Teams

This article outlines a comprehensive, financial‑grade security blueprint for a three‑person operations team using AWS services such as IAM, Secrets Manager, Session Manager, GuardDuty, and WAF, emphasizing Zero Trust, Least Privilege, and Defense‑in‑Depth to protect against external attacks, internal risks, and to enable clear audit trails for incident investigation.

AWS · IAM · Operations

0 likes · 13 min read

Raymond Ops

Jun 13, 2025 · Operations

Master HAProxy: Step-by-Step Deployment and Configuration Guide

This article provides a comprehensive, hands‑on guide to installing HAProxy, configuring global, defaults, listen, frontend, and backend sections, setting up ACL‑based load balancing, preparing backend web servers, testing the setup, and accessing the HAProxy statistics page.

ACL · Frontend · HAProxy

0 likes · 16 min read

Linux Ops Smart Journey

Jun 13, 2025 · Operations

Master ServiceMonitor: Build Reliable Prometheus Monitoring for Kubernetes

This article dives deep into ServiceMonitor, comparing it with traditional Prometheus configurations, detailing its core fields, and providing hands‑on examples for Harbor and GitLab metrics, enabling you to create stable, flexible, and maintainable monitoring setups for Kubernetes services.

Kubernetes · Operations · Prometheus

0 likes · 5 min read

TAL Education Technology

Jun 13, 2025 · Operations

How Large Language Models Are Revolutionizing Fault Localization

This article explores how the rapid rise of large language models and techniques like Retrieval‑Augmented Generation, Chain‑of‑Thought prompting, and multi‑agent architectures can dramatically improve the speed, accuracy, and automation of fault localization in modern operations environments.

CoT · Fault Localization · Large Language Model

0 likes · 14 min read

Full-Stack DevOps & Kubernetes

Jun 12, 2025 · Cloud Native

Trace a containerd ID Back to Its Kubernetes Pod for Fast Debugging

This guide shows how to map a low‑level containerd task ID to the corresponding Kubernetes Pod using ctr, kubectl, and jq commands, then provides step‑by‑step debugging actions such as pod inspection, log retrieval, and resource checks to quickly resolve container issues.

CTR · Kubernetes · Operations

0 likes · 8 min read
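
The ID-to-Pod mapping described above can also be scripted. The sketch below is not the article's ctr/kubectl/jq pipeline; it assumes the Pod list has already been exported as JSON (for example with `kubectl get pods --all-namespaces -o json`) and loaded into a dict, and the function name is hypothetical. It relies on the `status.containerStatuses[].containerID` field, whose value looks like `containerd://<id>`.

```python
def find_pods_by_container_id(pod_list, container_id):
    """Return (namespace, pod, container) tuples whose runtime ID matches.

    `pod_list` is the parsed JSON of a Kubernetes Pod list.
    `container_id` may be a full runtime ID or a unique prefix of it.
    """
    if not container_id:
        return []
    matches = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        statuses = (status.get("containerStatuses") or []) + (
            status.get("initContainerStatuses") or []
        )
        for cs in statuses:
            full_id = cs.get("containerID") or ""
            # Strip the runtime scheme, e.g. "containerd://abc123" -> "abc123".
            bare_id = full_id.split("://", 1)[-1]
            if bare_id.startswith(container_id):
                meta = pod["metadata"]
                matches.append((meta["namespace"], meta["name"], cs["name"]))
    return matches
```

Once the namespace and Pod name are known, the usual `kubectl describe pod` and `kubectl logs` steps apply.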

Efficient Ops

Jun 10, 2025 · Operations

What Caused the June 6, 2025 Alibaba Cloud DNS Outage and How to Mitigate It?

On June 6, 2025, Alibaba Cloud experienced a widespread DNS resolution failure affecting OSS, CDN, container image services, and more; the failure was later linked to a Shadowserver sinkhole. The article outlines the incident timeline, root‑cause analysis, and practical mitigation steps for operators.

Alibaba Cloud · Cloud Computing · DNS outage

0 likes · 4 min read

Architecture Digest

Jun 10, 2025 · Operations

How Much Bandwidth Does Douyin (TikTok) Really Have? Inside Its Massive Data Centers

This article explains how Douyin, TikTok, Baidu, Alibaba Cloud and Tencent operate self‑built data centers with terabit‑level outbound bandwidth, details ByteDance's server count growth from tens of thousands to hundreds of thousands, and describes the CDN technologies that enable billions of users to stream smoothly.

CDN · Data Center · Operations

0 likes · 8 min read

macrozheng

Jun 10, 2025 · Operations

Why HertzBeat Is the Next‑Gen Open‑Source Monitoring Solution for Cloud‑Native Environments

HertzBeat is a powerful, agent‑less, open‑source real‑time monitoring and alerting platform that supports custom templates, high‑performance clustering, cloud‑edge collaboration, and a wide range of notification channels, making it ideal for modern cloud‑native operations.

Alerting · Operations · Real-time

0 likes · 14 min read

MaGe Linux Operations

Jun 9, 2025 · Operations

Essential Kubernetes Troubleshooting Checklist for Ops Engineers

This guide provides Kubernetes operators with a comprehensive, step‑by‑step troubleshooting manual covering pod, node, and cluster‑level issues, common pod states, exit‑code analysis, and practical commands such as kubectl describe, logs, top, and drain, enabling rapid diagnosis and resolution of K8s problems.

Kubernetes · Node · Operations

0 likes · 10 min read

DevOps Operations Practice

Jun 7, 2025 · Operations

How Ops Professionals Can Reach a 300k Annual Salary: Real‑World Tips

This article compiles practical advice from experienced operations engineers on the challenges and strategies for achieving a 300,000 CNY yearly salary, covering skill development, career moves, company size, automation, and the evolving role of SRE/DevOps.

Career · Operations · SRE

0 likes · 6 min read

Linux Cloud Computing Practice

Jun 6, 2025 · Operations

Essential Kubernetes Ops Checklist: Diagnose and Fix Common Cluster Issues

This guide provides a comprehensive, step‑by‑step troubleshooting checklist for Kubernetes operations, covering Pod, Node, and cluster‑level problems, common pod status anomalies, and container exit‑code analysis to help operators quickly locate and resolve issues.

Kubernetes · Node · Operations

0 likes · 10 min read

Linux Cloud Computing Practice

Jun 6, 2025 · Operations

Master Ceph Distributed Storage: Essential Ops Guide & Troubleshooting

This article introduces Ceph as a leading open‑source distributed storage solution and presents a comprehensive operations manual covering common tasks, fault handling, and advanced topics to help engineers efficiently manage and troubleshoot Ceph clusters.

Ceph · Distributed storage · Operations

0 likes · 3 min read

Efficient Ops

Jun 4, 2025 · Operations

Streamline Nginx Management with Nginx UI: Features, Installation & AI Agent Integration

This article introduces Nginx UI, a graphical tool that simplifies Nginx configuration and monitoring, outlines its core features—including AI Agent support—provides pre‑installation notes, and offers step‑by‑step installation guides for Systemd, Docker, and quick‑install scripts, concluding with its operational benefits.

Automation · Docker · Nginx

0 likes · 5 min read

Old Zhao – Management Systems Only

Jun 4, 2025 · Operations

Why Warehouses Overflow Yet Stockouts Occur? Root Causes & Solutions

The article explains why warehouses can be overfilled while customers still face stockouts, analyzing false and structural overstock, flawed demand planning, weak supply chain execution, and offers practical steps such as data‑driven forecasting, ABC inventory classification, transparent collaboration, fast‑response mechanisms, and accountability to resolve the paradox.

Operations · demand planning · inventory management

0 likes · 11 min read

dbaplus Community

Jun 3, 2025 · Operations

Mastering Kubernetes High Availability: Control Plane, Nodes, Networking, Storage, and More

This comprehensive guide walks you through designing a highly available Kubernetes cluster, covering multi‑master control‑plane deployment, worker‑node resilience, advanced networking with Cilium, durable storage with Rook/Ceph, monitoring with Thanos, security policies, disaster‑recovery strategies, cost control, and automated rollouts, all illustrated with concrete configuration snippets and real‑world performance results.

Cluster Design · Kubernetes · Operations

0 likes · 13 min read

MaGe Linux Operations

Jun 2, 2025 · Operations

Master journalctl: Persistent Systemd Logging, Viewing, and Cleanup Tips

This guide explains how to manage systemd journal logs on Linux, covering persistent storage configuration, size limits, various journalctl commands for viewing service, user, command, and boot logs, as well as log rotation, vacuuming, and integrity verification.

Linux · Operations · journalctl

0 likes · 11 min read

ITFLY8 Architecture Home

Jun 1, 2025 · Operations

How JD’s Digital Transformation Blueprint Empowers Large State Enterprises

This article presents JD's comprehensive methodology for digitally transforming large and medium-sized state-owned enterprises, outlining strategic phases, technology adoption, organizational changes, and implementation steps through a series of illustrative slides.

JD methodology · Operations · State‑owned enterprises

0 likes · 4 min read

Xiaokun's Architecture Exploration Notes

Jun 1, 2025 · Operations

Understanding SLA, SLO, and SLI: Key Metrics for High‑Availability Systems

This article explains the differences between SLA, SLO, and SLI, shows how to express user expectations as concrete service level agreements, and introduces essential high‑availability metrics such as availability percentages, MTBF, MTTR, RPO, RTO, WRT, and MTD for reliable system design.

High Availability · Operations · SLA

0 likes · 9 min read
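
To make the availability percentages mentioned above concrete, here is a minimal sketch of the two standard calculations: the downtime budget implied by an availability target, and steady-state availability derived from MTBF and MTTR. The function names and the default 365-day period are illustrative assumptions, not taken from the article.

```python
def allowed_downtime_minutes(availability_percent, period_days=365):
    """Downtime budget in minutes for a given availability target."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


def availability_from_mtbf_mttr(mtbf_hours, mttr_hours):
    """Steady-state availability as a fraction: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% over a year allows about {minutes:.1f} minutes of downtime")
```

Each extra "nine" cuts the budget by a factor of ten, which is why an SLO is usually stated together with the period it is measured over.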

Dual-Track Product Journal

May 30, 2025 · Operations

How to Design Offline Inventory Counting: Avoid Data Loss and Conflict

This article explains how to build a robust offline inventory counting system that prevents data loss, resolves synchronization conflicts, and ensures seamless operation even when network connectivity is interrupted, offering practical design patterns and pitfall‑avoidance tips for warehouse teams.

Inventory · Offline · Operations

0 likes · 6 min read

Qiming AI - Digital Management Talk

May 29, 2025 · Operations

Unlock Warehouse Efficiency: Proven SOPs to Align Inventory and Finance

This article outlines a complete warehouse management SOP—from accurate receiving and scientific storage to efficient shipping, regular inventory checks, safety standards, and 5S principles—showing how to turn the warehouse into a reliable “bank” for the factory while boosting operational efficiency.

5S · Operations · SOP

0 likes · 9 min read

Old Zhao – Management Systems Only

May 29, 2025 · Operations

Master Supplier Performance Evaluation: A Complete SRM Guide

This comprehensive guide explains what supplier performance evaluation is, why it matters, and provides a step‑by‑step "3+1" framework—including metric definition, scoring methods, result grading, and system integration—to help organizations build a data‑driven, actionable SRM process that improves supply chain reliability and reduces costs.

Operations · SRM · performance evaluation

0 likes · 8 min read

Efficient Ops

May 28, 2025 · Operations

Unlocking Intelligent Operations: Inside China’s SOMM Maturity Model for AIOps, SRE, and FinOps

The article introduces China’s System Operation Maturity Model (SOMM), detailing its three pillars—AIOps, SRE, and FinOps—along with the underlying standards, assessment results, and how enterprises leverage these frameworks to achieve smarter, more reliable, and cost‑effective IT operations.

AIOps · FinOps · IT Operations

0 likes · 7 min read

Mike Chen's Internet Architecture

May 27, 2025 · Operations

Understanding L4 and L7 Load Balancing Architectures

This article explains the fundamentals of Layer‑4 and Layer‑7 load balancing, compares their advantages and disadvantages, and describes how a hybrid approach can combine high‑performance traffic handling with flexible application‑level routing for large‑scale systems.

L4 · L7 · Network Architecture

0 likes · 4 min read

Bilibili Tech

May 27, 2025 · Operations

Automated Server Fault Detection and Repair: Architecture, Methods, and Future Outlook

This article presents a comprehensive overview of server fault management at scale, detailing the classification of failures, shortcomings of traditional manual processes, and the design of an automated detection and repair system that combines in‑band and out‑of‑band data collection, rule‑based alerting, and end‑to‑end repair workflows, while also outlining future directions for intelligent monitoring and reliability.

Automation · Operations · infrastructure

0 likes · 17 min read

Mingyi World Elasticsearch

May 27, 2025 · Operations

The Deep‑Dive Elasticsearch Settings List You Must Know

This article presents a comprehensive, source‑code‑derived list of every Elasticsearch configuration option—including hidden and undocumented settings—explains their scopes, default values, and types, and shows how the list can be used for quick lookups, performance tuning, debugging, and automation.

Cluster Configuration · Elasticsearch · Operations

0 likes · 10 min read

Efficient Ops

May 26, 2025 · Artificial Intelligence

How AI Agents Are Revolutionizing AIOps: Boosting Automation and Efficiency

This article explains how AI agents enhance large‑model capabilities for AIOps, detailing single‑agent use cases like knowledge retrieval, tool guidance, and fault diagnosis, as well as multi‑agent collaborations, required skills, and future prospects for autonomous operations.

AI · AIOps · Agent

0 likes · 7 min read

Raymond Ops

May 26, 2025 · Operations

Master Nginx Log Formatting: Customize, Test, and Optimize Your Access Logs

This guide explains how to use Nginx's HttpLogModule to control log output, defines key directives such as access_log, log_format, and open_log_file_cache, provides example configurations, demonstrates testing with curl, and offers practical tips for per‑location log management to improve troubleshooting and performance.

Access Log · Operations · log format

0 likes · 6 min read

Raymond Ops

May 24, 2025 · Operations

How to Install and Configure rsync on Windows Server for Automated Backups

This guide walks through the required environment, Windows Server rsync installation, configuration of rsyncd.conf and password files, service startup, port verification, and client-side commands to achieve reliable, scheduled file synchronization between Windows machines.

Operations · Windows Server · backup

0 likes · 4 min read

Alibaba Cloud Developer

May 23, 2025 · Operations

How to Schedule Dify Workflows with GitHub Actions and XXL‑JOB

This article explains how to overcome Dify's lack of built‑in scheduling and monitoring by integrating it with external task‑scheduling systems such as GitHub Actions and XXL‑JOB, detailing setup steps, limitations, and the advantages of using XXL‑JOB for precise, enterprise‑grade workflow automation.

AI workflow · Dify · GitHub Actions

0 likes · 11 min read

Qiming AI - Digital Management Talk

May 23, 2025 · Operations

Boost Warehouse Efficiency: 6 Proven Strategies from a 7‑Year Expert

This article explains what warehouse management really means, outlines its key goals such as safety, efficiency, supply‑chain coordination and data visualization, and presents six practical methods—including clear objectives, ABC classification, barcode usage, FIFO, space optimization, and digital automation—to dramatically improve warehouse performance.

Operations · digital transformation · inventory optimization

0 likes · 9 min read

Youzan Coder

May 23, 2025 · Artificial Intelligence

How LLMs Supercharge SaaS Alert Monitoring: An AI‑Powered Workflow

This article explains how a SaaS company leveraged large language models to automatically ingest, enrich, and analyze stability alerts, turning noisy notifications into actionable insights through configurable pipelines, Feishu integration, and a streamlined AI workflow that boosts incident response speed and reduces manual effort.

AI · Alert Monitoring · Automation

0 likes · 6 min read

Liangxu Linux

May 21, 2025 · Operations

Master Apache Log Analysis with 20 Essential Linux Commands

This guide presents a curated collection of 20 practical Linux one‑liners—using awk, grep, netstat, and other shell tools—to extract IP counts, page views, bandwidth, error rates, concurrency, and other key metrics from Apache access logs, enabling quick and thorough server traffic analysis.

Operations · Shell · apache

0 likes · 10 min read
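
The kind of one-liner summarized above (for example `awk '{print $1}' access.log | sort | uniq -c | sort -nr | head`) has a direct Python counterpart. The sketch below is illustrative rather than taken from the article: it assumes the common/combined log format, where the client IP is the first whitespace-separated field and the status code is the ninth, and both function names are made up.

```python
from collections import Counter


def top_client_ips(lines, n=10):
    """Count requests per client IP (first field) and return the top `n`."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if fields:
            counts[fields[0]] += 1
    return counts.most_common(n)


def status_counts(lines):
    """Count responses per HTTP status code (ninth field, like awk's $9)."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) >= 9:
            counts[fields[8]] += 1
    return dict(counts)
```

Both functions accept any iterable of lines, so an open file handle can be passed directly without loading the whole log into memory.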

Efficient Ops

May 21, 2025 · Operations

Why We Dropped Kubernetes: Cutting Costs by 62% and Boosting DevOps Happiness

Six months after abandoning Kubernetes, our DevOps team reduced infrastructure spend by 62%, cut deployment time by 89%, eliminated weekend on‑call duties, and improved overall happiness, demonstrating that simplifying the tech stack can deliver substantial operational and business benefits.

Kubernetes · Operations · cost reduction

0 likes · 9 min read

Dual-Track Product Journal

May 21, 2025 · Operations

Master Warehouse Management: Essential Terms & Strategies Every PM Should Know

This comprehensive guide covers core WMS terminology—from basic concepts like locations, storage slots, and SKUs to inbound/outbound processes, inventory management techniques such as FIFO and safety stock, strategic approaches including wave and picking methods, essential equipment like PDAs and RFID, and advanced industry jargon, providing product managers with the knowledge to navigate technical discussions, impress stakeholders, and optimize warehouse operations.

Inventory · Operations · Product Management

0 likes · 11 min read

MaGe Linux Operations

May 19, 2025 · Operations

Simplify Domain and SSL Certificate Management with a Unified Platform

This article outlines common challenges in multi‑platform domain and HTTPS certificate management, introduces a unified management platform with features like automated syncing, Let’s Encrypt integration, and multi‑channel alerts, provides a step‑by‑step Docker deployment guide, and shares a curated collection of popular open‑source monitoring tools.

Docker deployment · Operations · SSL

0 likes · 7 min read

Old Zhao – Management Systems Only

May 19, 2025 · Operations

Why 90% of ERP Projects Fail in China—and How to Make Yours Succeed

This article examines why most ERP projects in Chinese companies fail, explores the core benefits of ERP, identifies common pitfalls such as over‑complexity, resistance, and high costs, and offers practical steps—including need assessment, right‑sized selection, process clarification, phased rollout, and strong leadership—to ensure successful implementation.

ERP · Operations · SMEs

0 likes · 11 min read

Lin is Dream

May 18, 2025 · Operations

Master Server Disk & Network Monitoring with Command‑Line Tools

This guide explains why every server must monitor CPU, memory, disk and network usage, shows how to clean disks and analyze traffic using command‑line utilities such as df, du, iotop, iostat, iftop, lsof and tcpdump, and provides real‑world case studies for troubleshooting disk space exhaustion, port conflicts and abnormal outbound traffic.

Disk Management · Linux commands · Operations

0 likes · 9 min read

Linux Ops Smart Journey

May 16, 2025 · Operations

Turn Jenkins into a Real‑Time Monitoring Hub with Prometheus & Grafana

This guide shows how to integrate Jenkins with Prometheus and Grafana, covering plugin installation, metric endpoint exposure, Prometheus scraping configuration, verification via curl, and importing a ready‑made Grafana dashboard to achieve proactive, visualized CI/CD monitoring.

Grafana · Jenkins · Operations

0 likes · 4 min read

php Courses

May 16, 2025 · Operations

Using Python for Automation in Operations (DevOps)

This article explains why Python is a leading language for DevOps automation, detailing its core advantages, typical use cases such as bulk server management, configuration management, log analysis, and scheduled tasks, and introduces common Python libraries and learning pathways for building robust operational workflows.

Automation · Operations · Python

0 likes · 6 min read

FunTester

May 16, 2025 · Operations

Chaos Engineering: Evolution, Workflow, Advantages, and Practice Principles

Chaos engineering is a discipline that deliberately injects faults into distributed systems to test and improve resilience, tracing its evolution from Netflix's Chaos Monkey to modern platforms, outlining its operational workflow, benefits, and core principles for reliable system design.

Fault Injection · Operations · Reliability

0 likes · 9 min read

FunTester

May 15, 2025 · Operations

Uncovering the Eight Hidden Pitfalls That Can Crash Your Distributed System

This article dissects the classic Eight Fallacies of Distributed Computing, explaining each mistaken assumption about network reliability, latency, bandwidth, security, topology, administration, cost, and homogeneity, and provides real‑world case studies and practical recommendations to help engineers design more resilient distributed systems.

Fallacies · Latency · Network Reliability

0 likes · 16 min read

Lin is Dream

May 14, 2025 · Operations

Master Nginx Rate Limiting: Prevent Abuse with limit_req & limit_conn

Learn how to protect your services from abusive traffic and brute‑force attacks by using Nginx's rate‑limiting features—limit_req to control request rates and limit_conn to restrict concurrent connections—complete with configuration examples, explanations of zones, burst handling, custom error pages, and log monitoring.

Operations · Server Configuration · limit_conn

0 likes · 6 min read
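
Nginx documents `limit_req` as using the leaky bucket method. The sketch below is a simplified model of that idea for a single key, roughly corresponding to `limit_req ... burst=N nodelay`; it is not Nginx's source code, and the class and parameter names are invented for illustration.

```python
class LeakyBucket:
    """Simplified leaky-bucket limiter for one client key.

    `rate_per_sec` plays the role of the zone's rate and `burst` the number
    of extra requests tolerated above that rate before rejection.
    """

    def __init__(self, rate_per_sec, burst=0):
        self.rate = rate_per_sec
        self.burst = burst
        self.excess = 0.0  # requests currently "queued" above the rate
        self.last = None   # timestamp of the last accepted request

    def allow(self, now):
        """Return True if a request arriving at time `now` (seconds) passes."""
        if self.last is None:
            self.last = now
            return True
        elapsed = now - self.last
        # The bucket drains at `rate` per second; this request adds one unit.
        excess = max(0.0, self.excess - self.rate * elapsed + 1.0)
        if excess > self.burst:
            return False  # rejected requests do not change the bucket state
        self.excess = excess
        self.last = now
        return True
```

With `LeakyBucket(1, burst=2)`, three back-to-back requests pass and the fourth is rejected; one second later a single further request is admitted.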

JD Tech Talk

May 13, 2025 · Operations

Intelligent Supply Chain Planning Algorithms and Their Applications

The article introduces intelligent supply chain planning algorithms—including network design, inventory layout, and simulation—detailing their optimization models, high‑performance solving techniques, and real‑world impact on cost reduction, efficiency, and service experience across large‑scale logistics operations.

Operations · Optimization · Simulation

0 likes · 12 min read

Efficient Ops

May 11, 2025 · Operations

China’s Leading Banks Achieve Top DevOps Standard Certifications – What It Means

The 25th GOPS Global Operations Conference in Shenzhen announced the dual ITU DevOps international and domestic standard assessment results, highlighting Agricultural Bank as the first state bank to earn a five‑star internal coach talent rating and showcasing multiple financial institutions that have successfully passed BizDevOps and continuous delivery evaluations, underscoring the growing importance of standardized DevOps practices in China’s finance sector.

BizDevOps · Financial Industry · Operations

0 likes · 9 min read

macrozheng

May 11, 2025 · Operations

What’s the Longest‑Running Computer? Real‑World Server Uptime Stories

A collection of real‑world anecdotes from Zhihu users reveals servers and computers that have stayed online for decades, highlighting hardware longevity, power‑redundancy strategies, and the cultural mindset of “if it works, don’t touch it.”

Hardware Reliability · Linux · Operations

0 likes · 8 min read

Architecture and Beyond

May 10, 2025 · Operations

What Heinrich’s 1:29:300 Rule Reveals About Preventing Online Outages

The article explains Heinrich's Law, its 1:29:300 accident pyramid, and how applying its principles—tracking minor incidents, hidden hazards, and systemic risks—can help software teams anticipate, diagnose, and prevent major online failures through systematic safety management and data‑driven practices.

Heinrich's Law · Incident Management · Operations

0 likes · 15 min read

Qunar Tech Salon

May 9, 2025 · Operations

Kafka Production Optimization: Reducing Load and Improving Compression via Filebeat Tuning

This technical case study details how a high‑traffic Kafka logging cluster was optimized by adjusting Filebeat and Kafka parameters, increasing compression batch size, and tuning Kubernetes settings, resulting in significant reductions in request volume, network traffic, CPU usage, and overall resource consumption.

Kafka · Operations · Performance

0 likes · 10 min read

Efficient Ops

May 7, 2025 · Operations

Why Choose SigNoz for Open‑Source Observability? A Deep Dive

This article introduces SigNoz, a self‑hosted open‑source observability platform that unifies metrics, logs, and traces, outlines its core capabilities, shows how to install it with Docker, and compares its resource efficiency to commercial solutions like DataDog and Elastic.

Observability · OpenTelemetry · Operations

0 likes · 4 min read

Dual-Track Product Journal

May 7, 2025 · Operations

Why Aggressive Replenishment Can Destroy Your Supply Chain—and How to Fix It

The article reveals how over‑eager replenishment creates supply‑chain avalanches, zombie inventory, and bullwhip effects, then offers a step‑by‑step resurrection guide with dynamic safety stock, data‑validation loops, and self‑healing mechanisms to restore healthy inventory flow.

Business Analytics · Operations · inventory management

0 likes · 5 min read

DevOps Operations Practice

May 6, 2025 · Operations

Kubernetes Certificate Management: Common Pitfalls, Detection Methods, and Renewal Procedures

This article explains why Kubernetes certificates often become hidden "time bombs," describes the typical failures caused by expired certificates, and provides practical methods to detect upcoming expirations and safely renew or replace them to keep clusters running smoothly.

Kubernetes · Operations · certificate-management

0 likes · 6 min read

ITPUB

May 5, 2025 · Operations

Turn Zabbix Alerts into an AI‑Powered Personal Assistant

This guide shows how to integrate Zabbix with a locally deployed DeepSeek large language model via Webhook, enabling automatic analysis of alert causes and solutions, feeding results back to operators through dashboards or enterprise WeChat, and dramatically reducing MTTR and manual effort.

AI · Alert Automation · DeepSeek

0 likes · 5 min read

Dual-Track Product Journal

May 2, 2025 · Operations

How to Stop Warehouse Chaos: 100 Ways Wave Picking Can Fail—and How to Fix It

A disastrous beauty‑ecommerce promotion exposed how naïve wave‑picking designs can turn warehouses into mazes, cause urgent orders to disappear, and mix products, but by applying intelligent grouping, dynamic capacity, heat‑map path optimization, and a three‑level priority system, fulfillment efficiency can be dramatically restored.

Operations · logistics · order fulfillment

0 likes · 5 min read

Raymond Ops

May 1, 2025 · Fundamentals

Mastering SNAT and DNAT: How to Translate Network Addresses with iptables

This guide explains the concepts, mechanisms, and primary uses of SNAT and DNAT in network address translation, and provides step‑by‑step iptables commands for implementing source and destination address translation in typical networking scenarios.

DNAT · NAT · Operations

0 likes · 8 min read

ITPUB

Apr 30, 2025 · Operations

Why a Credential‑Rotation Mistake Took Down Cloudflare R2 and Its Dependent Services

On March 21, 2025, a mis‑deployed credential during R2 Gateway's key rotation caused a 1‑hour‑7‑minute outage that blocked all write operations and about 35% of reads across R2 and several downstream Cloudflare services, prompting a detailed post‑mortem and a set of corrective actions to improve visibility and safety of credential changes.

Cloud Computing · Operations · credential management

0 likes · 15 min read

Efficient Ops

Apr 29, 2025 · Operations

How BizDevOps Standards Are Shaping China’s Cloud and AI Operations Landscape

This article outlines the evolution of BizDevOps standards in China, detailing recent policy mandates, the expansion of the DevOps maturity model to organization‑level practice, the AI‑driven SOMM operation assurance framework, and the integration of large‑model AI into R&D and operational workflows, highlighting their impact on enterprise efficiency and governance.

AI integration · BizDevOps · DevOps Standards

0 likes · 15 min read

BirdNest Tech Talk

Apr 29, 2025 · Cloud Native

How Docker Simplifies MCP Server Deployment for AI Agents

The article analyzes the challenges of manually deploying Model Context Protocol (MCP) servers for AI agents, compares them with Docker‑based deployment, and demonstrates step‑by‑step configurations, code snippets, and concrete benefits such as environment consistency, resource efficiency, and security.

AI agents · Deployment · Docker

0 likes · 7 min read

dbaplus Community

Apr 28, 2025 · Operations

20 Common Ops Failures and How to Diagnose & Fix Them

This article compiles twenty frequent operational incidents—from server inaccessibility and database connection errors to disk‑space exhaustion, high CPU usage, memory leaks, network latency, DNS failures, service crashes, file‑system corruption, update problems, permission misconfigurations, web‑server and email issues, backup failures, load‑balancing anomalies, firewall rule mistakes, SSH connection problems, database performance degradation, dependency gaps, and virtual‑machine faults—detailing their symptoms, step‑by‑step troubleshooting procedures, and concrete remediation actions.

How to Stop Inventory Discrepancies and End the Blame Game

This article analyzes common inventory discrepancy scenarios, exposes typical blame‑shifting tactics across departments, and presents a comprehensive, operation‑focused solution stack—including traceability, dynamic calibration, and fool‑proof design—to eliminate errors and improve accountability.

Operations · Traceability · dynamic calibration

0 likes · 6 min read

How to Stop Inventory Discrepancies and End the Blame Game

Top Architecture Tech Stack

Apr 22, 2025 · Operations

Step-by-Step Guide to Deploy a Spring Boot Application with Docker and Jenkins CI/CD

This tutorial walks through installing Docker and Jenkins on CentOS, configuring system settings, creating a Jenkins job to pull, build, and package a Spring Boot project, testing the pipeline, and finally running the application via Docker, providing complete commands and configuration details for a reliable CI/CD workflow.

CI/CD · Jenkins · Operations

0 likes · 8 min read

Old Zhao – Management Systems Only

Apr 22, 2025 · Operations

How a Real SCM System Automates Procurement, Warehouse, and Logistics to Eliminate Manual Work

This article explains why many companies struggle with fragmented supply‑chain processes, outlines the three essential capabilities of a truly effective SCM system—smooth workflow, node monitoring, and data persistence—and details how such a system can transform procurement, warehousing, logistics, cross‑department collaboration, and data analysis into an automated, data‑driven operation.

Operations · SCM · Supply Chain Automation

0 likes · 9 min read

Java Captain

Apr 22, 2025 · Operations

Improving Cron Job Stability and Monitoring with Best Practices and Healthchecks

The article analyzes common cron job failures such as accidental deletions, OOM crashes, and lack of monitoring, then proposes standardized Jenkins deployment, automatic server selection, lock mechanisms, queue-based processing, status awareness, and the use of the open‑source Healthchecks system to achieve proactive detection and alerting.

Automation · Operations · Task scheduling

0 likes · 8 min read

dbaplus Community

Apr 21, 2025 · Operations

Turn Zabbix Alerts into AI‑Powered Insights with DeepSeek

This guide shows how to integrate Zabbix with a locally deployed DeepSeek large language model via Webhook, enabling automatic analysis of alerts, generation of root‑cause explanations and remediation suggestions, and delivering results through WeChat bots, dashboards, or email to reduce MTTR and manual effort.

AI Ops · Alert Automation · DeepSeek

0 likes · 4 min read

Raymond Ops

Apr 19, 2025 · Operations

Essential Apache & Nginx Log Analysis Commands for Linux Ops

This guide compiles practical Linux shell commands for analyzing Apache and Nginx access logs, covering IP frequency, page request counts, status code distribution, traffic volume, crawler detection, subnet aggregation, and time‑based request rates to help administrators monitor web service health efficiently.

Nginx · Operations · log analysis

0 likes · 15 min read

MaGe Linux Operations

Apr 18, 2025 · Operations

Master Docker: Essential Commands for Developers and Ops

This guide compiles the most commonly used Docker commands, organized by functionality—including installation, image management, container handling, network and volume operations, logging, debugging, and system cleanup—to help developers and operations engineers efficiently manage Docker environments.

Operations · command-line · container

0 likes · 11 min read

Linux Cloud Computing Practice

Apr 18, 2025 · Operations

Unlock the Full Zabbix 7.0 Manual: Features, Architecture & Installation Guide

This article introduces Zabbix as a powerful open‑source monitoring solution, outlines its 7.0 features, architecture, installation and configuration steps, highlights recent enhancements, and explains how to obtain a free 2000‑page Chinese manual via QR code.

Installation · Operations · Zabbix

0 likes · 4 min read

JD Tech

Apr 17, 2025 · Operations

Chaos Engineering: Principles, Core Steps, Tool Selection, and AI Integration

This article explains chaos engineering—its definition, core principles, experimental workflow, tool selection, AI‑driven enhancements, and practical case studies—providing a comprehensive guide for building resilient distributed systems across backend, cloud‑native, mobile, and AI‑enabled environments.

AI integration · Fault Injection · Operations

0 likes · 26 min read

MaGe Linux Operations

Apr 17, 2025 · Operations

Top 10 Essential Ops Tools Every Engineer Should Master

This article introduces the ten most frequently used operations engineering tools, detailing each tool's functions, suitable scenarios, advantages, and real‑world examples, and includes practical code snippets to help engineers automate and streamline their daily workflows.

Automation · Linux tools · Operations

0 likes · 8 min read

Tencent Cloud Middleware

Apr 17, 2025 · Operations

Boost RocketMQ Ops with LLM‑Powered Natural‑Language Queries via GraphQL

By integrating large language models, Chatbox, MCP, and GraphQL, the TDMQ RocketMQ team enables operators to retrieve cluster, topic, and message data across heterogeneous sources using a single natural‑language query, dramatically simplifying diagnostics and reducing manual query effort.

Chatbox · GraphQL · LLM

0 likes · 9 min read

Efficient Ops

Apr 16, 2025 · Operations

Top 10 Essential Ops Tools Every Engineer Should Master

This article introduces ten indispensable operations engineering tools—Shell scripts, Git, Ansible, Prometheus, Grafana, Docker, Kubernetes, Nginx, ELK Stack, and Zabbix—detailing their functions, suitable scenarios, advantages, and real‑world examples, plus sample code snippets to help engineers automate and monitor infrastructure efficiently.

Automation · Operations · devops

0 likes · 9 min read

Cognitive Technology Team

Apr 14, 2025 · Operations

TCP Out‑of‑Memory on AWS EC2: Diagnosis and Kernel Parameter Tuning

An AWS EC2 instance behind an Elastic Load Balancer became unresponsive due to repeated TCP out‑of‑memory errors, which were resolved by examining kernel messages, adjusting tcp_mem‑related kernel parameters, and rebooting the server after a long uptime.

AWS · EC2 · Linux

0 likes · 6 min read

Su San Talks Tech

Apr 13, 2025 · Operations

Unlock Zabbix: Complete Guide to Features, Architecture, and Hands‑On Deployment on CentOS

This article introduces Zabbix’s core features, flexible data collection, custom alerting, visualization, high‑availability architecture, security auditing, and compares it with Prometheus, then walks through a step‑by‑step installation, configuration, and deployment on a CentOS server.

Installation · Linux · Operations

0 likes · 20 min read

Dual-Track Product Journal

Apr 11, 2025 · Operations

Why Your Replenishment System Traps You in a ‘More Restock, More Shortage’ Loop—and How to Fix It

This article dissects common failures in e‑commerce replenishment—such as hot‑product black holes, slow‑moving stock graves, and supply‑chain avalanches—and presents a seven‑step framework of dynamic forecasting, tiered strategies, distributed inventory, and automated safeguards to stabilize inventory levels.

Automation · Operations · forecasting

0 likes · 9 min read

Tencent Cloud Middleware

Apr 9, 2025 · Operations

How TDMQ Pulsar’s Cluster‑Level and Topic‑Partition Throttling Keeps Your Messaging System Stable

This article explains why high‑throughput producers and consumers can saturate CPU, memory, network and disk I/O in TDMQ Pulsar clusters, describes the built‑in cluster‑level distributed and topic‑partition rate‑limiting mechanisms, and provides practical guidance for configuration, monitoring, and troubleshooting.

Message Queue · Operations · Pulsar

0 likes · 12 min read

Linux Ops Smart Journey

Apr 8, 2025 · Operations

How to Efficiently Monitor HAProxy with Prometheus and Grafana

This guide explains how to set up HAProxy monitoring by configuring a Prometheus exporter, adding HAProxy targets to Prometheus, verifying metric collection, and visualizing the data in Grafana with a ready-made dashboard, ensuring reliable and performant services.

Grafana · HAProxy · Kubernetes

0 likes · 4 min read

Liangxu Linux

Apr 6, 2025 · Operations

How to Define SLIs, SLOs, and SLAs for Effective SRE Practices

This guide explains how SRE teams should collaborate early in the software development lifecycle to define Service Level Indicators (SLIs), set realistic Service Level Objectives (SLOs) and Service Level Agreements (SLAs), and integrate observability signals, error budgeting, risk management, and incident handling into reliable operations.

Error Budget · Incident Management · Observability

0 likes · 13 min read

Xiaokun's Architecture Exploration Notes

Apr 6, 2025 · Operations

Mastering Performance Testing: Why It Matters and How to Use wrk Effectively

This article explains what performance testing is, why it is essential for reliable systems, outlines practical steps for conducting effective tests, and introduces the wrk benchmarking tool as a lightweight solution for generating realistic load and measuring key performance metrics.

Benchmarking · Operations · load testing

0 likes · 2 min read

Raymond Ops

Apr 5, 2025 · Operations

Master Nginx Load Balancing: Step‑by‑Step Configuration Guide

This article explains how to configure Nginx as a load balancer for web applications, covering upstream and proxy_pass definitions, the three built‑in balancing methods, weight and connection settings, fail‑over options, and practical code examples for both HTTP and HTTPS deployments.

Nginx · Operations · configuration

0 likes · 11 min read

Practical DevOps Architecture

Apr 4, 2025 · Operations

Using Ansible to Copy Single and Multiple Files to Target Servers

This guide demonstrates how to use Ansible playbooks to copy a single file or multiple files from a source host to a target server, detailing the required YAML configuration, copy module parameters, and execution steps for reliable file deployment in automated operations.

Ansible · Automation · Operations

0 likes · 3 min read

Open Source Linux

Apr 3, 2025 · Operations

Understanding Linux Boot Process: From BIOS to Systemd

This article explains the Linux boot sequence, covering the BIOS/UEFI hardware check, GRUB2 bootloader configuration, kernel loading with initramfs, root filesystem mounting, systemd target units, essential services, and a CentOS 8 example with GRUB settings and module inspection.

Boot Process · CentOS · GRUB

0 likes · 8 min read
