Tagged articles

Ops

131 articles · Page 1 of 2

Jun 24, 2026 · Operations

How to Diagnose Linux Server CPU Spikes: A Practical Step‑by‑Step Guide

This article presents a systematic, evidence‑driven process for locating and resolving high CPU usage on Linux servers, covering environment preparation, layered troubleshooting from whole‑machine to thread level, concrete command examples, real‑world case studies, best‑practice recommendations, and monitoring configurations.

CPU · Linux · Ops

0 likes · 33 min read

ITPUB

Jun 10, 2026 · Operations

Avoidable P1 Outage: How Nginx Changes Caused All Gateway Requests to Return 400

A production change replaced two Nginx reverse‑proxy servers, introduced an upstream name containing an underscore, broke the Host header required by HTTP/1.1, and caused Spring Cloud Gateway to return 400 Bad Request for every request until the configuration was corrected.

400-bad-request · HTTP · NGINX

0 likes · 16 min read

MaGe Linux Operations

May 30, 2026 · Databases

How Ops Engineers Fix MySQL Slow Queries: A Step‑by‑Step Guide

This article walks through the entire MySQL performance troubleshooting workflow from an operations perspective, covering architecture basics, slow‑query‑log configuration, analysis with mysqldumpslow and pt‑query‑digest, EXPLAIN interpretation, index design and optimization, configuration tuning, replication monitoring, real‑time diagnostic commands, risk mitigation, rollback procedures, and backup strategies.

Configuration · Indexing · Ops

0 likes · 40 min read

AI Agent Super App

May 30, 2026 · Operations

Production-Ready MongoDB 7.0: Single-Node, Replica Set, and Security Hardening Guide

This step‑by‑step guide shows how to install MongoDB 7.0 on Linux, configure a production‑grade replica set, enable keyfile‑based internal authentication, create RBAC users, restrict network access, set system limits, schedule backups, and apply performance‑tuning and monitoring practices to keep the database secure and reliable.

MongoDB · Ops · backup

0 likes · 15 min read

AI Agent Super App

May 18, 2026 · Operations

20 Programming Language Environment Setups: From Installation to PATH Configuration for Ops Professionals

This practical guide walks ops engineers through installing, configuring PATH, setting language‑specific environment variables, and managing versions for 20 major programming languages, illustrating common pitfalls with real‑world examples and offering concrete best‑practice rules to keep production systems stable.

Ops · environment variables · path configuration

0 likes · 23 min read

Architect's Ambition

May 5, 2026 · Operations

OpenClaw vs Hermes: Static Control vs Dynamic Evolution—Which Should You Choose?

The article compares OpenClaw, a manually configured, fully controllable automation tool, with Hermes Agent, a self‑evolving agent, details their design philosophies, learning mechanisms, and pros and cons, and offers a decision matrix along with a best‑practice recommendation to use the two together for optimal efficiency.

Automation · Hermes Agent · OpenClaw

0 likes · 8 min read

AI Agent Super App

May 2, 2026 · Operations

Step‑by‑Step Guide to Inspect All Linux Server System and Hardware Details

This tutorial walks you through the essential Linux commands for retrieving system version, kernel, CPU, memory, disk, network interface details, and reliably distinguishing between physical and virtual servers, providing clear examples and output explanations for each step.

Linux · Ops · hardware-commands

0 likes · 10 min read

MaGe Linux Operations

Apr 19, 2026 · Cloud Native

Unlock the Full Deployment‑to‑Service Workflow in Kubernetes

This comprehensive guide walks operators through the entire Kubernetes workflow from creating a Deployment to exposing a Service, explaining core resources, control loops, scheduling, networking, rolling updates, troubleshooting steps, best‑practice configurations, performance tuning, and security hardening.

Cloud Native · Ops · Service

0 likes · 29 min read

MaGe Linux Operations

Apr 19, 2026 · Operations

How to Diagnose and Fix Slow Static Asset Delivery: A Complete Ops Guide

This guide walks operations engineers through a systematic, multi‑layered approach to identifying why static resources load slowly, covering data collection, network diagnostics, server configuration, application settings, client‑side checks, common failure scenarios, and automated monitoring scripts.

CDN · Monitoring · Network

0 likes · 26 min read

Ray's Galactic Tech

Apr 11, 2026 · Operations

Mastering Production‑Grade Kubernetes: From kubectl Basics to Scalable Cluster Management

This comprehensive guide walks you through turning simple kubectl commands into a robust, production‑ready Kubernetes platform by covering core architecture, scheduling, resource governance, high‑availability design, observability, security, GitOps workflows, and real‑world case studies for large‑scale deployments.

Observability · Ops · Production

0 likes · 52 min read

Ops Community

Apr 11, 2026 · Operations

Master Linux CPU & Memory Bottleneck Diagnosis: Commands, Scripts, and Best Practices

This comprehensive guide walks Linux operators through systematic CPU and memory troubleshooting, detailing command sequences, deep metric interpretations, diagnostic scripts, and preventive tuning for modern multi‑core, cgroup‑v2 environments.

CPU · Linux · Monitoring

0 likes · 30 min read

MaGe Linux Operations

Apr 6, 2026 · Operations

Master Redis Monitoring: Essential Metrics, Scripts, and Alerting Strategies

This guide walks operations engineers through building a complete Redis monitoring system—covering why monitoring matters, which metrics to collect, how to gather them with Prometheus and Grafana, and practical Bash scripts for health checks, memory, persistence, replication, client connections, and alert thresholds.

Metrics · Monitoring · Ops

0 likes · 31 min read

dbaplus Community

Mar 2, 2026 · Operations

When Kubernetes Becomes a Burden: Why Top Engineers Walk Away

The article reflects on how Kubernetes, originally a lightweight orchestration tool, can evolve into a hidden source of technical and emotional debt that drains engineers, inflates operational costs, and ultimately drives talented staff to quit, highlighting the need for disciplined platform ownership.

Ops · Platform Engineering · kubernetes

0 likes · 6 min read

Raymond Ops

Feb 14, 2026 · Operations

How I Cut 80% of Ops Time with an Automated Service Management System

This article details a complete automated operations framework that replaces manual service restarts, log cleaning, and deployment tasks with health‑checks, systemd units, Kubernetes probes, monitoring scripts, fault‑diagnosis tools, auto‑scaling policies, and Ansible playbooks, saving roughly 80% of repetitive work and dramatically improving reliability.

Automation · Monitoring · Ops

0 likes · 38 min read

Ops Community

Feb 8, 2026 · Operations

Master Linux Network Troubleshooting with tcpdump, ss, and iptables

A comprehensive guide for ops engineers that explains how to use tcpdump, ss, and iptables to diagnose and resolve common Linux networking issues, covering tool basics, practical scenarios, detailed command examples, scripts, best practices, and monitoring strategies.

Network · Ops · iptables

0 likes · 58 min read

Ops Community

Feb 4, 2026 · Operations

Boost Your Ops Efficiency: 20 Must-Have Tools for Faster Server Management

Discover a curated collection of 20 open-source operations tools, covering terminal enhancements, file handling, system monitoring, network diagnostics, text processing, and container management, each with installation steps, configuration examples, and real-world use cases to dramatically improve productivity and streamline daily sysadmin tasks.

Ops · productivity · tools

0 likes · 44 min read

Code Wrench

Feb 2, 2026 · Operations

Isolate Goroutine Panics in 3 Lines: Build Self‑Healing Go Probes

Go's unhandled panics can crash an entire monitoring agent, but by isolating each goroutine with a defer‑recover wrapper and optionally adding a circuit‑breaker, you can achieve self‑healing probes that continue operating despite transient failures, improving tool resilience and overall system availability.

Ops · circuit-breaker · panic

0 likes · 9 min read

Code Wrench

Feb 1, 2026 · Operations

Detect and Fix Goroutine Leaks in Go with Context & pprof

This guide explains how Goroutine leaks cause hidden memory and CPU issues in long‑running Go health‑check tools, demonstrates how to reproduce the problem, and shows step‑by‑step detection using pprof and context, plus a production‑ready zero‑leak probe template with best‑practice code.

Ops · memory-leak · pprof

0 likes · 12 min read

MaGe Linux Operations

Jan 28, 2026 · Operations

8 Crontab Pitfalls Every SRE Should Avoid – Proven Fixes & Best Practices

Learn from a seasoned SRE’s hard‑won experience as we dissect eight common crontab pitfalls—environment variables, permissions, time zones, email spam, path issues, concurrency, logging, and special character quirks—and provide concrete solutions, best‑practice configurations, monitoring tips, and migration guidance to systemd timers.

Automation · Monitoring · Ops

0 likes · 43 min read

Raymond Ops

Jan 12, 2026 · Operations

Build a Real-Time Linux Performance Alert System with Prometheus & Grafana

This guide walks you through designing a layered Linux monitoring architecture, selecting a Prometheus‑Grafana stack, defining key CPU, memory and disk metrics, crafting smart alert rules, visualizing dashboards, and adding automation and AI‑driven predictive techniques for reliable, business‑focused operations.

Linux · Ops · grafana

0 likes · 13 min read

Raymond Ops

Jan 9, 2026 · Databases

Master MongoDB Sharding: From Single Server to Enterprise-Scale Cluster

When a single‑node MongoDB instance can no longer handle tens of millions of records, this guide walks you through the theory, architecture, deployment steps, shard key strategies, performance tuning, monitoring, backup, and troubleshooting needed to build a robust, production‑grade sharded cluster.

MongoDB · Ops · Performance Tuning

0 likes · 14 min read

Raymond Ops

Jan 2, 2026 · Operations

Avoid 3 Fatal Nginx+Keepalived HA Pitfalls That 90% of Ops Engineers Miss

This article reveals three hidden traps in Nginx‑Keepalived high‑availability setups—network‑partition split‑brain, inadequate health‑check scripts, and unsafe configuration‑sync timing—explains real incidents caused by each, and provides concrete configuration changes, Bash scripts, and automation tips to prevent service outages.

Automation · Keepalived · NGINX

0 likes · 16 min read

Ray's Galactic Tech

Dec 23, 2025 · Operations

20 Essential Kubernetes Ops Tips to Keep Production Clusters Stable

This guide compiles twenty practical Kubernetes operations tips drawn from real‑world production experience, covering high availability, performance tuning, monitoring, automation, security, and advanced learning to help teams build and maintain reliable, resilient clusters.

High Availability · Monitoring · Ops

0 likes · 8 min read

Raymond Ops

Dec 18, 2025 · Cloud Computing

How to Build Reusable Multi‑Cloud Infrastructure with Terraform

Learn how to replace manual, error‑prone cloud console clicks with Terraform‑driven, reusable multi‑cloud infrastructure, covering why multi‑cloud matters, Terraform fundamentals, project layout, example networking and compute modules for AWS and Alibaba Cloud, CI/CD integration, security scanning, cost optimization, and best‑practice guidelines.

CI/CD · Cloud · Multi-Cloud

0 likes · 18 min read

Raymond Ops

Dec 14, 2025 · Operations

5 Game-Changing One-Liner Shell Commands Every Ops Engineer Should Know

This article shares five practical one‑line Shell commands—covering bulk health checks, rapid log analysis, process ranking, network diagnostics, and precise disk cleanup—each explained with its scenario, inner workings, and real‑world performance impact for production environments.

Automation · Linux · One-liner

0 likes · 10 min read

Raymond Ops

Dec 13, 2025 · Operations

Boost Linux Server Management: Essential Automation Tools & Scripts

This article explains how Linux system administrators can dramatically improve efficiency and reliability by adopting automation tools like Ansible, Puppet, and SaltStack, along with practical shell and Python scripts for batch operations, scheduled tasks, log analysis, and automated backups.

Ansible · Automation · Linux

0 likes · 9 min read

Full-Stack DevOps & Kubernetes

Dec 9, 2025 · Information Security

How to Tame Kubernetes Security: From Roles to Token Risks

This article explains why Kubernetes security feels like navigating in the dark, breaks down the platform’s core resources, outlines common attack vectors such as container escape and token abuse, compares managed versus self‑hosted clusters, and presents a real‑world EKS attack case with practical mitigation insights.

Cloud Native · Ops · ServiceAccount

0 likes · 11 min read

Efficient Ops

Dec 2, 2025 · Operations

How to Detect and Renew Expired Kubernetes API Server Certificates

This guide explains how to view Kubernetes certificates, check their expiration dates with kubeadm, renew them when needed, restart kubelet services, verify the renewal, and automate the whole process with a Bash script.

Automation · Ops · Shell script

0 likes · 4 min read

Ray's Galactic Tech

Dec 2, 2025 · Operations

Build an End‑to‑End AIOps Solution: Log Alerts and Automated Self‑Healing Ops

This guide walks through designing and implementing an intelligent operations workflow that transforms passive log monitoring into proactive alerting and automated remediation, covering core concepts, tech‑stack selection, step‑by‑step configuration of log collection, alert rules, webhook integration, Ansible automation, and best‑practice considerations for scaling and security.

AIOps · Alerting · Ansible

0 likes · 7 min read

Raymond Ops

Dec 2, 2025 · Operations

All‑in‑One Linux Init Script: Automate Setup for Rocky, AlmaLinux, Ubuntu, and More

This article introduces a comprehensive shell script that automates initial system configuration—root login, network, hostname, repository, firewall, SELinux, swap, SSH, and more—across dozens of Linux distributions, provides source links, detailed feature tables, version‑specific changelogs, and step‑by‑step usage instructions.

Automation · Linux · Ops

0 likes · 20 min read

Open Source Linux

Nov 30, 2025 · Operations

How to Diagnose Linux Server Performance Issues in Minutes

A step‑by‑step guide shows how to use Linux commands like top, vmstat, free, iostat, and ss to quickly identify CPU overload, memory pressure, disk I/O bottlenecks, and network port problems, providing a practical cheat sheet for effective server troubleshooting.

Linux · Monitoring · Ops

0 likes · 9 min read

MaGe Linux Operations

Nov 17, 2025 · Operations

Production-Ready Prometheus Alerting: 50+ Core Metrics & Best Practices

This guide details production‑grade Prometheus alerting configurations, covering applicable scenarios, prerequisites, anti‑patterns, environment matrices, step‑by‑step deployment of Node Exporter, Prometheus and Alertmanager, comprehensive rule files, performance testing, troubleshooting, best practices, and ready‑to‑use scripts for backup and health checks.

Alerting · Ops · infrastructure

0 likes · 51 min read

Liangxu Linux

Nov 16, 2025 · Information Security

Essential Linux Security Vulnerabilities & Practical Hardening Guide for Ops Engineers

This comprehensive guide walks ops engineers through the most common Linux security flaws—from sudo misconfigurations and SUID/SGID risks to SSH, web server, kernel, container, file system, logging, firewall, and compliance issues—offering concrete code snippets, step‑by‑step hardening measures, and actionable best‑practice recommendations.

Linux · Ops · Vulnerability

0 likes · 16 min read

dbaplus Community

Nov 11, 2025 · Operations

How to Build a Fast Golang Offline RDB Parser for Redis Big‑Key Detection

This article walks through the motivation, design, implementation, performance tuning, deployment, and lessons learned when creating a Golang‑based offline Redis RDB parser that efficiently identifies large keys without impacting a live cluster.

Ops · RDB · Redis

0 likes · 12 min read

Xiao Liu Lab

Nov 8, 2025 · Operations

Generate a Complete Linux Server Health Report with a Single Command

This article introduces a lightweight Bash script that, with one curl command, automatically gathers CPU, memory, disk, and network information from a Linux server and outputs a formatted, color‑coded report in seconds, dramatically simplifying routine ops tasks.

Automation · Linux · Ops

0 likes · 6 min read

MaGe Linux Operations

Nov 1, 2025 · Operations

Zero‑Downtime HAProxy Load Balancing: Full 4‑Layer & 7‑Layer Deployment Guide

This guide walks through installing HAProxy, configuring both layer‑4 TCP and layer‑7 HTTP/HTTPS load balancing with health checks, session persistence, advanced algorithms, high‑availability via Keepalived, monitoring with HAProxy stats and Prometheus, performance tuning, security hardening, and step‑by‑step rollback procedures for zero‑downtime deployments.

HAProxy · High Availability · Ops

0 likes · 36 min read

Ops Community

Oct 28, 2025 · Operations

Master Linux Performance: Top, iotop, pidstat, sar – Real‑World Diagnostic Guide

This guide covers Linux performance analysis tools—including top, htop, iotop, pidstat, iostat, sar, and vmstat—detailing installation, usage, key metrics, troubleshooting scenarios, monitoring integration with Prometheus, and best‑practice recommendations for effective system diagnostics and capacity planning.

Ops · iotop · performance monitoring

0 likes · 29 min read

Xiao Liu Lab

Oct 23, 2025 · Operations

Automate Nginx Config Audits: Python Script to Export Structured Excel Reports

Learn how a lightweight Python script can automatically parse complex Nginx configuration files, extract upstream, server, and location details, and generate a structured Excel report for easy auditing, analysis, and collaboration, streamlining operations and configuration management.

Automation · Configuration · Excel

0 likes · 9 min read

Ops Community

Oct 15, 2025 · Operations

Master Ansible: Complete Playbook Guide for Managing Hundreds of Servers

This comprehensive guide explores Ansible’s architecture, core principles, inventory management, playbook creation, advanced techniques, role usage, variable handling, error handling, idempotency, and real‑world case studies to help engineers efficiently automate and maintain large server fleets.

Ansible · Ops · configuration management

0 likes · 37 min read

MaGe Linux Operations

Oct 4, 2025 · Operations

How to Stop 3 AM Alert Calls: 5 Smart Monitoring Techniques

This article reveals why engineers are woken up at 3 am by noisy alerts, analyzes the evolution and pain points of monitoring systems, and presents five practical techniques—including severity grading, aggregation, dynamic thresholds, intelligent routing, and data‑driven effectiveness analysis—to transform alert noise into actionable, reliable notifications.

Alerting · Automation · Monitoring

0 likes · 44 min read

MaGe Linux Operations

Oct 3, 2025 · Operations

Why Your Crontab Jobs Fail: 5 Common Mistakes and How to Fix Them

This article explains why scheduled tasks often break in crontab, outlines the five most frequent errors such as missing environment variables, wrong paths, silent output, incorrect time expressions, and permission issues, and provides concrete debugging steps and best‑practice solutions for reliable Linux scheduling.

Linux · Ops · Scheduling

0 likes · 30 min read

Ray's Galactic Tech

Sep 23, 2025 · Operations

How to Install and Deploy Nacos 2.3.2 on Windows and Linux – Full Production Guide

This step‑by‑step guide explains how to install Nacos 2.3.2 on Windows and Linux, covering version requirements, hardware recommendations, Docker‑Compose quick start, MySQL integration, production cluster topology, security hardening, monitoring with Prometheus, common troubleshooting, and clean uninstallation.

Docker · Installation · Ops

0 likes · 7 min read

Linux Cloud Computing Practice

Sep 12, 2025 · Operations

45 Must‑Know Linux Command Combos for Everyday Ops – Boost Efficiency

This guide compiles 45 essential Linux command combinations, organized into seven high‑frequency operational scenarios—file handling, find‑based searches, system monitoring, log analysis, text processing, network capture, and disk cleanup—providing a near‑complete toolbox that addresses roughly 99% of everyday sysadmin tasks.

Linux · Ops · Shell Scripting

0 likes · 5 min read

Raymond Ops

Aug 25, 2025 · Operations

How to Resolve Kubernetes Certificate Expiration Errors with kubeadm

When a Kubernetes cluster suddenly fails to respond with an x509 certificate expiration error, this guide walks you through using kubeadm commands to renew all certificates, update kubeconfig files, restart kubelet, and verify the new expiration dates, ensuring the cluster returns to normal operation.

Certificate · Ops · kubeadm

0 likes · 8 min read

Mike Chen's Internet Architecture

Aug 22, 2025 · Operations

10 Essential Nginx Settings to Boost Performance and Security

This guide walks you through ten crucial Nginx configuration tweaks—including optimal worker processes, connection limits, gzip compression, caching, request size limits, SSL/TLS setup, HTTP/2 enablement, timeout settings, version hiding, and Lua extensions—to improve server performance, security, and reliability.

Configuration · Ops · Web Server

0 likes · 4 min read

MaGe Linux Operations

Aug 21, 2025 · Operations

Master Docker Volume Management: From Basics to Enterprise‑Grade Persistence & Backup

This comprehensive guide walks you through Docker storage challenges, explains temporary, bind‑mount and named volumes, presents tiered storage architectures and dynamic scripts, and provides production‑grade backup, monitoring, and performance‑tuning strategies to ensure reliable data persistence in containerized environments.

Monitoring · Ops · backup

0 likes · 13 min read

Ops Development & AI Practice

Jul 11, 2025 · Industry Insights

Turning Full‑Stack Ops Skills into Interview Superpowers

The article explains why full‑stack operations engineers, despite their broad but shallow expertise, are invaluable system integrators and offers concrete interview strategies—reframing breadth as strength, storytelling with end‑to‑end impact, and showcasing a versatile toolset—to help them stand out against specialist interviewers.

Career · Full-Stack · Ops

0 likes · 8 min read

Full-Stack DevOps & Kubernetes

Jul 7, 2025 · Operations

Unlock Linux Performance: How eBPF Reveals Hidden Bottlenecks

This article explains why traditional Linux monitoring tools often miss deep kernel issues and shows how to use eBPF‑based utilities such as biolatency, runqlat, and offcputime to pinpoint CPU, I/O, and lock‑contention problems with concrete command examples and a practical troubleshooting workflow.

Linux · Ops · eBPF

0 likes · 8 min read

Efficient Ops

Jun 15, 2025 · Operations

Master Ansible: Automate 300+ Servers with Simple Playbooks

This guide introduces Ansible’s core concepts, installation steps, common commands, and a complete Nginx deployment playbook, showing how to efficiently automate configuration, scaling, and updates across hundreds of servers.

Ansible · Ops · configuration management

0 likes · 7 min read

Efficient Ops

May 6, 2025 · Databases

5 Must‑Have GUI Tools to Master Redis Management

Operations engineers struggling with countless Redis commands and opaque data structures can simplify their workflow with five recommended visual tools that turn complex Redis operations into intuitive interfaces, complete with monitoring, cluster support, and cross‑platform clients.

GUI · Ops · Redis

0 likes · 4 min read

Liangxu Linux

Mar 9, 2025 · Backend Development

Mastering Nginx Gzip: Configuration, Tips, and Common Pitfalls

Compressing HTTP responses with Nginx gzip improves user experience by reducing load times and cuts bandwidth costs, while proper directives, static gzip handling, and awareness of common misconfigurations ensure optimal performance in production environments.

Ops · backend · compression

0 likes · 6 min read

Efficient Ops

Mar 2, 2025 · Operations

How to Diagnose Linux Server Performance Issues in the First 60 Seconds

This article walks you through the ten essential Linux command‑line tools—such as uptime, vmstat, iostat, and top—that Netflix’s performance engineers use to quickly assess system load, resource saturation, and errors within the critical first minute of troubleshooting.

Linux · Ops · system-administration

0 likes · 18 min read

Ops Development & AI Practice

Feb 22, 2025 · Operations

Master Terraform: From Basics to Advanced Cloud Automation

Discover why Terraform is the go‑to IaC tool for ops engineers, explore its declarative syntax, cross‑cloud support, state management, and community ecosystem, and get an overview of a comprehensive three‑part tutorial series covering fundamentals, intermediate concepts, and advanced best‑practice projects.

Ops · Terraform · infrastructure as code

0 likes · 8 min read

Java Tech Enthusiast

Dec 2, 2024 · Operations

Sampler: A Visual Server Monitoring Tool for Linux

Sampler is a Linux visual monitoring tool that runs from a single binary, uses simple YAML files to define widgets such as sparklines and bar charts, and displays real‑time CPU, memory, network, Docker container statistics and other metrics, while being easily extensible to services like MySQL, MongoDB and Kafka.

Linux · Ops · Server monitoring

0 likes · 7 min read

MaGe Linux Operations

Nov 30, 2024 · Operations

Essential Linux System Monitoring and Troubleshooting Commands

This guide compiles crucial Linux commands for viewing logs, inspecting CPU, memory, disk I/O, network, system load, and performing common administrative tasks such as IP configuration, file system cleanup, and service health checks, helping sysadmins quickly diagnose and resolve issues.

Ops · journalctl · performance

0 likes · 10 min read

Linux Ops Smart Journey

Nov 10, 2024 · Operations

Master Ansible: Using yum_repository, yum, and systemd Modules for Efficient Automation

This article explores three frequently used Ansible modules—yum_repository, yum, and systemd—detailing their parameters, usage examples, and practical commands to streamline package management and service control, helping DevOps engineers boost automation efficiency in cloud and container environments.

Ansible · Ops · Yum Repository

0 likes · 10 min read

Linux Ops Smart Journey

Nov 5, 2024 · Operations

Master 8 Essential Ansible Modules for Efficient Automation

This article introduces eight essential Ansible modules, including file, copy, template, fetch, and get_url, explaining their parameters and usage examples and showing how they simplify automation tasks in operations, with code snippets and reference links for deeper learning.

Ansible · Ops · configuration management

0 likes · 11 min read

Baidu Tech Salon

Oct 16, 2024 · Big Data

Design and Implementation of an Online/Offline Integrated Task Scheduling System for Baidu's Mobile Operations Promotion Platform (OPS)

The paper presents Baidu’s Mobile Operations Promotion Platform redesign, introducing an online‑offline integrated task‑scheduling architecture that partitions settlement fields to the data‑warehouse, records all jobs in a unified MySQL operation table, orchestrates them via Turing Data Studio, and manages dependencies to achieve consistent, auditable, billion‑scale settlement processing.

Baidu · Data Warehouse · Ops

0 likes · 14 min read

MaGe Linux Operations

Oct 5, 2024 · Operations

Mastering Docker Container Logs: Drivers, Commands, and Best Practices

This article provides a comprehensive guide to Docker container log management, covering engine and container logs, log driver options, configuration commands, storage locations across various OSes, and practical techniques for rotating, filtering, and collecting logs in production environments.

Logging · Ops · container-logs

0 likes · 23 min read

dbaplus Community

Aug 6, 2024 · Operations

How to Slash MTTR: Proven Strategies for Faster Incident Recovery

This article explains what MTTR is, why it matters for system stability, and provides a step‑by‑step framework—including monitoring, alert tuning, rapid mitigation, clear role assignments, and post‑mortem practices—to dramatically shorten repair times and improve overall reliability.

Alerting · MTTR · Ops

0 likes · 24 min read

Beijing SF i-TECH City Technology Team

May 30, 2024 · Operations

Root Cause Analysis of CPU Sys Spikes and Memory Pressure in Linux Services

This article investigates two real‑world performance incidents—one caused by excessive disk I/O from a misconfigured Filebeat and another by kernel memory‑fragmentation bugs triggered by a trace feature—detailing observations, Linux diagnostic commands, analysis, and practical remediation steps.

CPU · Linux · Ops

0 likes · 15 min read

Python Programming Learning Circle

May 23, 2024 · Operations

Supervisor Process Monitoring and Management Guide

This article introduces Supervisor, a client/server process monitoring tool for Unix-like systems, explains its installation, configuration, and usage—including custom service and application files, command-line control with supervisorctl, advanced features like process groups, automatic restart policies, and web UI—providing practical examples and code snippets for reliable daemon management.

Ops · process management

0 likes · 17 min read

Liangxu Linux

Mar 19, 2024 · Operations

How to Diagnose High CPU Usage on Linux: A Step‑by‑Step Guide

This guide walks you through checking CPU utilization, system load, process resource consumption, tracing problematic processes, reviewing logs, and using performance tools to pinpoint and resolve high CPU usage on a Linux server.

CPU · Linux · Ops

0 likes · 3 min read

Practical DevOps Architecture

Jan 8, 2024 · Operations

Deploy MySQL and mysqld_exporter with Docker Compose and Configure Prometheus Monitoring

This guide shows how to set up a MySQL server and a mysqld_exporter using Docker Compose, configure Prometheus to scrape the exporter, and create alert rules for MySQL downtime and slow queries, providing a complete monitoring solution.

Docker · Exporter · Monitoring

0 likes · 5 min read

Java Tech Enthusiast

Jan 7, 2024 · Operations

Using the Linux top Command for Real-Time System Monitoring

The Linux top command offers a dynamic, real‑time view of system processes and resource usage—showing overall statistics, CPU and memory breakdowns, and detailed process columns—while supporting customizable refresh intervals, batch mode, and interactive shortcuts for sorting, column selection, and monitoring crucial metrics like %idle, %wa, and %steal.

CPU · Linux · Ops

0 likes · 7 min read

Efficient Ops

Sep 26, 2023 · Operations

Mastering Zabbix: From Installation to Advanced Monitoring and Automation

This comprehensive guide walks you through Zabbix monitoring concepts, reliability calculations, installation methods, web UI configuration, host and template management, custom monitoring, alert integration with OneAlert, Grafana visualization, distributed monitoring, SNMP support, and practical scripts for large‑scale server environments.

Alerting · Automation · Monitoring

0 likes · 28 min read

Liangxu Linux

Sep 7, 2023 · Operations

Essential Shell Scripts Every Ops Engineer Should Use

This article presents a collection of practical Bash scripts for system administrators, covering load monitoring, file backup, log cleanup, service health checks, automated deployment, disk usage alerts, temporary file removal, network connectivity testing, bulk renaming, and batch service control, each with ready-to-use code examples.

Automation · Linux · Ops

0 likes · 6 min read

MaGe Linux Operations

Sep 2, 2023 · Operations

Top 5 Linux Monitoring Tools Every Ops Engineer Should Use

This article introduces five essential Linux monitoring tools, including iotop, htop, IPTraf, and Monit, along with related resources. It explains how each tool helps operations engineers diagnose I/O, CPU, memory, and network issues in real time without a GUI, and offers guidance on installation and practical use cases.

Htop · IPTraf · Linux

0 likes · 6 min read

DeWu Technology

Aug 28, 2023 · Operations

Real-time Data Warehouse Business-Side Chaos Engineering Practice

The article describes how a real‑time data warehouse supporting ad‑delivery metrics adopts both technical and business‑side chaos‑engineering, using red‑blue team drills to inject faults, monitor indicator anomalies, and refine response procedures, thereby enhancing early risk detection, system resilience, and overall data stability for the advertising platform.

Backend Development · Data Quality · Data Warehousing

0 likes · 16 min read

Open Source Linux

Jul 27, 2023 · Operations

17 Essential Linux Ops Tricks to Boost Your Productivity

This article compiles seventeen practical Linux administration techniques—from batch file handling and directory checks to log analysis, disk monitoring, firewall rules, and network capture—each illustrated with ready‑to‑run shell commands and concise explanations for sysadmins.

Automation · Monitoring · Network

0 likes · 8 min read

DevOps

Jul 20, 2023 · Operations

Why Continuous Testing Is Essential for Infrastructure and How to Implement It

The article explains why continuous testing of infrastructure is critical for stability and reliability, outlines a comprehensive testing scope ranging from unit to reliability tests, discusses tool selection and practical Terraform‑based examples, and shows how test‑driven development can improve IaC workflows.

Continuous Integration · IaC · Infrastructure Testing

0 likes · 9 min read

Open Source Linux

Mar 31, 2023 · Operations

Boost Your Ops Efficiency: 5 Python Scripts Every Engineer Should Know

This article explains how Python can automate common operations tasks—remote command execution, log parsing, system monitoring with alerts, batch software deployment, and backup/recovery—providing code examples and practical tips to improve efficiency and reduce manual errors.

Automation · Monitoring · Ops

0 likes · 9 min read

Efficient Ops

Jan 30, 2023 · Operations

Master Redis Monitoring: Key Metrics, Commands, and Performance Testing

This guide explains essential Redis monitoring metrics, the tools and commands for collecting performance, memory, activity, persistence, and error data, and shows how to use INFO, slowlog, and redis-benchmark to assess and improve database operations.

Metrics · Monitoring · Ops

0 likes · 6 min read

Ops Development Stories

Dec 28, 2022 · Operations

When a Massive File Transfer Crashed My K8s Master: A Real‑World Docker Recovery Tale

The author recounts a sudden overload caused by copying hundreds of gigabytes of small files to an Alibaba Cloud NAS, which crashed the master node of a Kubernetes cluster, leading to Docker failures, and describes step‑by‑step troubleshooting, configuration changes, and lessons learned about backups, cautious operations, and calm analysis.

Cloud Native · Docker · Ops

0 likes · 5 min read

Open Source Linux

Oct 31, 2022 · Operations

Master Linux Performance: 10 Essential Commands to Diagnose Issues in 60 Seconds

This article from Netflix's performance engineering team outlines ten standard Linux command‑line tools and the USE method to quickly assess system health, focusing on error and saturation metrics before utilization, enabling rapid diagnosis of CPU, memory, disk, or network bottlenecks within the first minute.

Linux · Ops · Server

0 likes · 18 min read

Practical DevOps Architecture

Sep 30, 2022 · Operations

Resolving Filebeat Startup Failure: EOF Error in Registrar State

This guide explains how to troubleshoot Filebeat failing to start due to an EOF error while loading registrar state, by inspecting logs, resetting the registry directory, and restarting the service on a Linux host.

Monitoring · Ops · filebeat

0 likes · 4 min read

Efficient Ops

Aug 8, 2022 · Operations

Master Essential Linux Ops: xargs, Background Jobs, Process Monitoring & More

This guide walks you through practical Linux operations—from using xargs for efficient file handling and running commands in the background, to monitoring high‑memory and high‑CPU processes, viewing multiple logs with multitail, continuous ping logging, checking TCP states, identifying top IPs on port 80, and leveraging SSH for port forwarding.

Monitoring · Ops · multitail

0 likes · 10 min read

Efficient Ops

Jul 12, 2022 · Operations

Master Linux Performance Troubleshooting in the First 60 Seconds

This guide walks you through the ten essential Linux command‑line tools that Netflix’s performance team uses to quickly assess system health, focusing on error and saturation metrics before utilization, so you can pinpoint and resolve server issues within the critical first minute.

Ops · command-line · performance monitoring

0 likes · 18 min read

Efficient Ops

May 10, 2022 · Operations

How to Containerize Ansible for Automated MySQL Backups

This article demonstrates how to package Ansible in a Docker container, use the mysql_db module to create MySQL backups, and run a simple playbook, highlighting the benefits of containerized deployment for clean, flexible operations automation.

Ansible · Automation · Ops

0 likes · 10 min read

Architect's Tech Stack

May 4, 2022 · Operations

Lessons Learned from Improving Web Application Performance: A Case Study

This article shares a real‑world case study of a company operating fifteen web applications, describing how a hidden DB‑connection leak in a pod liveness probe caused severe latency, how the issue was diagnosed and fixed, and the four key take‑aways for reliable performance engineering.

Ops · backend · load testing

0 likes · 8 min read

Open Source Linux

Jan 5, 2022 · Operations

Designing Scalable High‑Availability Prometheus Architectures

This article explains how to build both small‑scale and large‑scale high‑availability Prometheus setups using local and remote storage, federation, keepalived, and PostgreSQL + TimescaleDB adapters to ensure reliable monitoring and alerting across growing infrastructures.

Federation · Ops · Remote Storage

0 likes · 6 min read

Architecture Digest

Dec 30, 2021 · Operations

Step‑by‑Step Deployment of JumpServer with MariaDB, Redis, and Docker

This tutorial walks through installing MariaDB and Redis on a backend node, configuring Docker on a separate host, pulling and running the JumpServer container, and then setting up users, assets, and permissions so that operations teams can securely manage internal servers via a bastion host.

BastionHost · Docker · JumpServer

0 likes · 15 min read

Efficient Ops

Nov 22, 2021 · Operations

Essential Linux Shell Commands for System Monitoring & Troubleshooting

This guide compiles a comprehensive set of Linux shell commands and common regular expressions for checking processes, CPU, memory, disk usage, network activity, logs, and other system metrics, helping administrators quickly diagnose and resolve performance issues.

Linux · Ops · command-line

0 likes · 14 min read

IT Architects Alliance

Nov 16, 2021 · Cloud Native

Kubernetes and CI/CD Architecture Diagrams Overview

This article presents a collection of visual diagrams illustrating Kubernetes cluster structures, OpenShift/Kubernetes architectures, and several common CI/CD pipeline designs, providing readers with clear reference material for modern cloud‑native operations and deployment workflows.

CI/CD · Ops

0 likes · 2 min read

Ops Development Stories

Oct 19, 2021 · Operations

How to Build a Highly Available Alertmanager Cluster with Gossip

Learn to set up a highly available Alertmanager cluster using the Gossip protocol, covering deduplication, routing, HA architecture, required cluster parameters, systemd service files, and Prometheus integration, with step‑by‑step commands and configuration examples.

Alertmanager · Gossip · HA

0 likes · 8 min read

Java Architect Essentials

Aug 30, 2021 · Databases

How to Monitor and Optimize Redis Performance

This article explains how to use Redis INFO commands to track memory usage, command processing, latency, key eviction and fragmentation, and provides practical tips such as adjusting maxmemory, using hash structures, pipelines, and slowlog to diagnose and improve Redis performance.

Latency · Monitoring · Ops

0 likes · 23 min read

MaGe Linux Operations

Aug 17, 2021 · Operations

Essential Linux Ops Interview Questions & Answers for High‑Paying Jobs

A comprehensive collection of Linux operations interview questions covering fundamentals, server management, RAID, load balancing, MySQL, networking, security, scripting, and best‑practice solutions to help candidates ace high‑salary positions.

Ops · Server · interview

0 likes · 44 min read

Code Ape Tech Column

Jul 8, 2021 · Operations

Essential Nginx Configuration Cheat Sheet: Quick Snippets for Ports, Logs, SSL, and More

This article compiles a concise Nginx cheat sheet covering common configuration blocks such as port listening, access logging, server name handling, static file serving, redirects, reverse proxy, load balancing, and SSL settings, plus a brief note on a visual configuration tool.

Configuration · Ops · Reverse Proxy

0 likes · 6 min read

dbaplus Community

Jun 28, 2021 · Cloud Native

From chroot to Kubernetes: Choosing the Right Redis Container Strategy

This talk walks through the evolution of containerization—from early chroot and jails to modern Kubernetes—explains Redis’s core features, compares various container solutions for Redis deployment, and offers practical guidance on installation, scaling, monitoring, and fault recovery in both single‑instance and clustered environments.

Docker · Namespace · Ops

0 likes · 30 min read

Efficient Ops

May 12, 2021 · Operations

7 Ready‑to‑Use Python & Shell Scripts to Supercharge Your Ops

This article shares a curated collection of ready‑to‑run Python and Shell scripts—including Enterprise WeChat alerts, FTP and SSH clients, SaltStack and vCenter utilities, SSL certificate checks, weather notifications, SVN backups, Zabbix password monitoring, local YUM mirroring, and high‑load detection—complete with full source code and usage notes to help engineers automate routine tasks and boost operational efficiency.

Ops · Python · scripts

0 likes · 30 min read

Big Data Technology & Architecture

Apr 26, 2021 · Operations

Comprehensive Guide to Prometheus: Installation, Configuration, PromQL, Exporters, Grafana, and Alerting

This article provides a complete tutorial on Prometheus, covering its origins, core features, installation methods (binary and Docker), configuration file structure, PromQL basics, HTTP API usage, Grafana integration, various exporters for metrics collection, and alerting with Alertmanager, all within a cloud‑native monitoring context.

Alerting · Exporters · Monitoring

0 likes · 32 min read

Top Architect

Mar 25, 2021 · Operations

Improving REPL Container Shutdown Performance at Replit

Replit engineers analyzed why container shutdown on preemptible VMs caused REPL sessions to stall for up to a minute, identified Docker's network‑release bottleneck, and implemented a direct SIGKILL workaround that reduced error rates and startup latency dramatically.

Container Management · Docker · Ops

0 likes · 12 min read

dbaplus Community

Mar 24, 2021 · Cloud Native

Three Years of Production Kubernetes: Key Lessons and Practical Tips

Over three years of running Kubernetes in production across on‑premise RHEL VMs and AWS EC2, we learned hard‑won lessons about Java container compatibility, upgrade strategies, build and deployment pipelines, probe tuning, external IP scaling, and when Kubernetes truly adds value.

Cloud Native · Java · Ops

0 likes · 11 min read

Programmer DD

Jan 15, 2021 · Operations

Why Does Prometheus Sometimes Fail to Trigger Alerts?

This article explains why Prometheus alerts may not fire or may fire unexpectedly, covering the role of the for parameter, sampling intervals, Grafana range queries, and practical steps to diagnose and fix alerting issues.

Alerting · Monitoring · Observability

0 likes · 7 min read

Ops Development Stories

Jan 15, 2021 · Operations

How to Deploy a Multi‑Node Ceph Cluster on CentOS 7 – Step‑by‑Step Guide

This article provides a comprehensive, step‑by‑step tutorial for setting up a three‑node Ceph storage cluster on CentOS 7.9, covering host configuration, firewall and SELinux settings, package installation, monitor, manager, OSD, MDS, and RGW deployment, along with required keyrings, configuration files, and troubleshooting tips.

CentOS · Ceph · Cluster Deployment

0 likes · 20 min read

MaGe Linux Operations

Jan 9, 2021 · Operations

How to Monitor Kubernetes API with Python and Zabbix Sender – Step‑by‑Step Guide

This tutorial walks you through using Python's requests library and Zabbix Sender to retrieve Kubernetes API metrics, covering API endpoint discovery, token generation, script deployment, host configuration, and manual trigger of checks to visualize the data.

API · Automation · Monitoring

0 likes · 3 min read

MaGe Linux Operations

Aug 7, 2020 · Operations

How to Diagnose Linux Server Issues in the First 60 Seconds with 10 Essential Commands

This article explains how Netflix's performance team uses ten standard Linux command‑line tools to quickly assess system health within the first minute, focusing on error detection, resource saturation, and utilization across CPU, memory, disk, and network to pinpoint performance problems.

Monitoring · Ops · command-line

0 likes · 18 min read

Big Data Technology & Architecture

Jun 24, 2020 · Operations

Design and Implementation of a General Business Monitoring and Alert Engine Using Prometheus and ClickHouse

This article describes how a company replaced its Zabbix‑based monitoring with a scalable, Prometheus‑driven alert engine that leverages ClickHouse for storage, remote‑storage integration via Prom2Click, and materialized views to provide flexible, SQL‑based business metric alerts.

Alerting · ClickHouse · Ops

0 likes · 11 min read

MaGe Linux Operations

May 26, 2020 · Operations

How to Optimize Ceph Cluster Hardware: CPU, RAM, Storage & Network Guidelines

This guide explains how to plan Ceph hardware by balancing CPU, memory, storage, network bandwidth, and failure domains, offering practical recommendations for daemons, OSDs, monitors, managers, and SSD vs HDD choices to achieve cost‑effective, high‑performance large‑scale clusters.

Ceph · Cluster Performance · Hardware Planning

0 likes · 15 min read

Efficient Ops

Apr 1, 2020 · Operations

How to Use Nagios for Business-Level Service Monitoring: A Step-by-Step Guide

This article explains why traditional server and service monitoring (e.g., Zabbix) may miss business outages, then walks through setting up Nagios on Debian to monitor web page URLs, API health checks, and related services, including configuration files, plugins, and a desktop alert tool, Nagstamon.

Linux · Monitoring · Ops

0 likes · 18 min read
