Tagged articles
3281 articles
Raymond Ops
Dec 24, 2025 · Operations

How to Combine Terraform and Ansible for Seamless Multi‑Cloud Orchestration

This guide explains why single‑tool approaches fall short in modern IaC, compares Terraform’s state management and multi‑cloud support with Ansible’s configuration capabilities, and provides a step‑by‑step architecture, code samples, CI/CD integration, monitoring, cost‑saving, and security practices for enterprise‑grade deployments.

Ansible · Infrastructure Automation · Operations
0 likes · 17 min read
Java Architect Handbook
Dec 24, 2025 · Operations

Why Tencent’s SOPS Is the Go‑To Open‑Source Workflow Engine for Modern Ops

The article introduces a Java learning community offering multiple hands‑on projects and then provides a detailed overview of Tencent BlueKing's open‑source Standard Operations (SOPS) workflow engine, highlighting its BPMN‑2.0 modeling, one‑click automation, integration capabilities, and self‑service benefits for IT teams.

Automation · Operations · SOPS
0 likes · 5 min read
Mike Chen's Internet Architecture
Dec 24, 2025 · Operations

How to Deploy a Two‑Location Three‑Center Disaster‑Recovery Architecture for High Availability

This guide explains the two‑location three‑center disaster‑recovery pattern, describing its purpose, typical deployment across two cities and three data centers, and step‑by‑step recommendations for same‑city dual‑active or primary‑backup setups, remote backup strategies, traffic routing, and essential monitoring.

GSLB · Operations · SLB
0 likes · 5 min read
Xiao Liu Lab
Dec 23, 2025 · Operations

Master Incident Response: Diagnose and Recover Service Outages in 15 Minutes

When a service crashes and users flood you with complaints, following a structured 15‑minute workflow—first narrowing the impact, then probing six layers (network, system, application, data, external services, security), and finally documenting the incident—lets you pinpoint and fix most outages quickly and reliably.

Operations · network debugging · service recovery
0 likes · 10 min read
Raymond Ops
Dec 22, 2025 · Operations

Mastering Production Site Backup: A Multi‑Layer Disaster Recovery Blueprint

After a midnight disk failure that threatened 300,000 users, this article presents a production‑grade, multi‑layer backup architecture with 3‑2‑1 redundancy, RTO ≤30 min and RPO ≤5 min, covering application code, configuration, database (physical and logical), file storage, automated scheduling, monitoring, performance tuning, a real‑world recovery case, and future AI‑driven enhancements.

Automation · Backup · Operations
0 likes · 15 min read
Alibaba Cloud Observability
Dec 22, 2025 · Operations

How to Pinpoint Packet Loss in Cloud‑Native Deployments with SysOM

This article walks through two real‑world cases of network packet loss in Alibaba Cloud ACK clusters, showing how SysOM’s intelligent diagnostics and systematic checks—covering iptables, kernel drops, hooks, and nftables rules—can quickly locate the root cause and restore service continuity.

Alibaba Cloud · Operations · Packet Loss
0 likes · 10 min read
Alibaba Cloud Native
Dec 21, 2025 · Operations

How to Pinpoint and Resolve Packet Loss in Cloud‑Native Deployments with SysOM

This article walks through real‑world cases of network packet loss in Alibaba Cloud Kubernetes clusters, showing how SysOM’s diagnostics quickly locate root causes—ranging from kernel‑level drops to hidden netfilter hooks and nftables rules—and provides a step‑by‑step troubleshooting guide for cloud‑native operations teams.

Alibaba Cloud · Operations · Packet Loss
0 likes · 10 min read
Ray's Galactic Tech
Dec 20, 2025 · Operations

How RocketMQ Achieves High‑Availability Storage and Fast Fault Recovery

RocketMQ ensures durable, consistent, and highly available message storage through fixed‑length append‑only files, efficient index rebuilding, checkpoint tracking, and configurable master‑slave replication, offering both synchronous and asynchronous HA modes, detailed recovery steps, performance trade‑offs, and practical operational guidelines for robust fault tolerance.

Operations · RocketMQ · fault-recovery
0 likes · 10 min read
Eric Tech Circle
Dec 19, 2025 · Operations

Step‑by‑Step Guide to Adding Google AdSense to a Halo‑Based Blog

This tutorial walks through registering a Google AdSense account, passing site approval, and three practical integration methods—including inserting the AdSense script, using a meta tag, and configuring an ads.txt file with Nginx—followed by tips for ad placement on a personal blog.

Blog Monetization · Frontend Integration · Google AdSense
0 likes · 6 min read
Xiao Liu Lab
Dec 18, 2025 · Operations

scp vs rsync: Choose the Right Tool for Fast, Efficient File Transfers

This guide explains the fundamental differences between scp and rsync, outlines their mechanisms, advantages, and drawbacks, provides practical command examples for various scenarios, highlights common pitfalls, and offers a concise comparison table to help operations engineers select the appropriate tool for secure and efficient file transfers.

Linux · Operations · Sysadmin
0 likes · 10 min read
Code Wrench
Dec 16, 2025 · Operations

Rescuing a Failing System: 3‑Step Triage Playbook Every Go Engineer Needs

This article explains how to demonstrate real‑world system‑engineering expertise in Go interviews by mastering incident triage, diagnosing CPU, memory, GC, and goroutine problems, and applying a three‑step "stop‑bleed, diagnose, cure" strategy to keep services alive.

Go · Operations · incident management
0 likes · 11 min read
IT Architects Alliance
Dec 14, 2025 · Operations

How to Build a Scientific KPI System for Enterprise Architecture Efficiency

This article explains why many enterprises lack quantitative metrics for architecture efficiency, outlines the challenges of assessing performance across technical, business, cost, and organizational dimensions, and provides a step‑by‑step KPI framework that covers indicators for each dimension, automated data collection, monitoring dashboards, and continuous improvement practices to enable data‑driven architecture optimization.

Enterprise · KPI · Operations
0 likes · 9 min read
Architect Chen
Dec 11, 2025 · Operations

How to Boost Nginx Concurrency from 5K to 50K: Key Config Tweaks

This guide explains how to dramatically increase Nginx's concurrent handling capacity by tuning worker processes, connections, keep‑alive settings, and high‑performance I/O options, providing concrete configuration examples and practical advice for high‑traffic deployments.

Configuration · Nginx · Operations
0 likes · 4 min read
Efficient Ops
Dec 10, 2025 · Operations

5 Essential Skills Ops Engineers Need to Stay Valuable in the K8s & AI Era

In the fast‑changing world of Kubernetes and AI, operations professionals must cultivate five compound abilities—communication, problem‑solving, ownership, stress handling, and continuous improvement—to transform technical expertise into lasting career growth and higher compensation.

Continuous Improvement · Operations · Ownership
0 likes · 11 min read
MaGe Linux Operations
Dec 10, 2025 · Operations

Standardized SRE On‑Call Handbook: Alert Grading, Response Flow, and Handoff Templates

This handbook presents a complete, two‑year‑tested SRE on‑call process that defines alert severity tiers, response requirements, escalation paths, War‑Room roles, handoff schedules, post‑mortem procedures, and provides ready‑to‑use configuration snippets, checklists and templates to reduce MTTR and repeat incidents.

Alert Management · On-Call · Operations
0 likes · 26 min read
Raymond Ops
Dec 9, 2025 · Databases

Deep Dive into MySQL Architecture, SQL Syntax, and Performance Tuning

This comprehensive guide explores MySQL’s layered architecture, core components, storage engines, and detailed SQL language structures, while providing practical commands, optimization techniques, security best practices, and operational procedures for administrators to efficiently manage, tune, and secure MySQL databases.

Operations · SQL · database
0 likes · 31 min read
Raymond Ops
Dec 9, 2025 · Operations

Master the Must‑Know Linux Commands Every Ops Engineer Needs

This comprehensive guide lists essential Linux commands for file handling, system monitoring, text processing, process control, network troubleshooting, compression, backup, security, and scripting, providing practical examples and interview tips to boost an operations engineer's efficiency and expertise.

Linux · Operations · Shell scripting
0 likes · 18 min read
Continuous Delivery 2.0
Dec 9, 2025 · Operations

How Tencent Interactive Entertainment Scaled SRE: From Traditional Ops to Modern Reliability Engineering

This article examines Tencent Interactive Entertainment's eight‑year journey from a traditional operations team to a 400‑person SRE organization, detailing timeline milestones, the shift in mindset and practices, management challenges, and the broader industry trends driving reliability engineering adoption.

Operations · SRE · Tencent
0 likes · 13 min read
DevOps Coach
Dec 8, 2025 · Operations

How to Quantify SRE ROI: Turning Reliability Metrics into Business Value

This article explains how SRE leaders can bridge the gap between technical reliability metrics and business outcomes by defining core SRE concepts, applying a step‑by‑step ROI formula, illustrating code‑level impact, avoiding common pitfalls, and looking ahead to AI‑driven reliability forecasting.

BusinessValue · Metrics · Operations
0 likes · 10 min read
Raymond Ops
Dec 8, 2025 · Operations

Mastering the Linux Filesystem Hierarchy: A Complete Guide for Sysadmins

This comprehensive guide explains the Linux Filesystem Hierarchy Standard (FHS), details the purpose and typical contents of each top‑level directory such as /, /bin, /sbin, /usr, /var, /etc, /home, /root, /tmp, /dev, /proc, /sys, /boot and /run, and provides practical sysadmin commands and best‑practice recommendations for managing permissions, mounting strategies, performance tuning and troubleshooting.

Directory Hierarchy · FHS · Filesystem
0 likes · 27 min read
Xiao Liu Lab
Dec 7, 2025 · Operations

How to Diagnose and Prevent 502 Bad Gateway Errors in an Nginx‑PHP‑MySQL Stack

This article walks through a real‑world 502 outage, explains why the error is rarely a simple gateway failure, shows how to use enhanced Nginx upstream logs and automated scripts to pinpoint timeouts, misconfigurations, and database bottlenecks, and provides concrete tuning, monitoring, and self‑healing measures to stop the problem from recurring.

502 · Nginx · Operations
0 likes · 11 min read
Raymond Ops
Dec 7, 2025 · Operations

Ceph Uncovered: Architecture, Deployment, and Ops Best Practices

Ceph is an open‑source distributed storage platform offering object, block, and file services with high availability, scalability, and self‑management; the guide explains its core components, CRUSH algorithm, storage interfaces, deployment steps using ceph‑deploy, operational monitoring, performance tuning, and common use cases in cloud and big‑data environments.

Big Data · Ceph · Deployment
0 likes · 11 min read
Linux Cloud Computing Practice
Dec 5, 2025 · Operations

Essential Ceph Command Cheat Sheet for Cluster Management

This guide provides a concise collection of essential Ceph commands for starting services, checking health and status, managing monitors, metadata servers, and OSDs, as well as creating admin users, purging nodes, and handling crush maps, enabling administrators to efficiently operate and troubleshoot a Ceph storage cluster.

Ceph · Cluster Management · Linux
0 likes · 6 min read
Efficient Ops
Dec 3, 2025 · Artificial Intelligence

Unlocking AI Agent Paradigms: 6 Patterns to Supercharge Operations

This article introduces six core AI agent paradigms—Prompt Chain, Routing & Handoff, Parallelization, Tool Use, ReAct, and Multi‑Agent—explaining their concepts, real‑world analogies, and practical examples for enhancing efficiency and intelligence in operational workflows.

AI Agent · Automation · Operations
0 likes · 6 min read
Cloud Native Technology Community
Dec 3, 2025 · Operations

5 Hard‑Won Lessons for Managing Kubernetes at Scale

Drawing from years of real‑world Kubernetes deployments, this article outlines five practical lessons—covering operational overload, hidden security risks, scaling costs, talent shortages, and accelerating technical debt—plus extra guidance on workload suitability, policy enforcement, and building a reliable, cost‑effective cluster environment.

Cloud Native · Cost Management · Kubernetes
0 likes · 10 min read
Liangxu Linux
Nov 29, 2025 · Operations

20 Essential Linux Command Combos Every Sysadmin Must Master

This article presents 20 powerful Linux command combinations, grouped by file management, process monitoring, network diagnostics, log analysis, and system maintenance, each with clear examples, real‑world scenarios, common pitfalls, and practical tips to help administrators troubleshoot and automate daily operations efficiently.

Automation · Linux · Operations
0 likes · 13 min read
MaGe Linux Operations
Nov 28, 2025 · Operations

10 Essential Linux Ops Tools Every Engineer Should Master

This article presents a curated list of ten widely used Linux operations tools, detailing each tool's core functions, typical use cases, key advantages, and real‑world examples, while also providing practical shell and Ansible code snippets to help engineers apply them immediately.

Ansible · Docker · Grafana
0 likes · 9 min read
dbaplus Community
Nov 24, 2025 · Operations

How We Rescued a Critical etcd Outage in 4 Hours: Step‑by‑Step Recovery Guide

A midnight Kubernetes disaster caused API server timeouts, etcd health failures, and a full service outage, prompting a detailed investigation, root‑cause analysis of massive database fragmentation, and a four‑stage emergency recovery that restored the cluster within 4 hours while outlining preventive measures.

Kubernetes · Operations · database fragmentation
0 likes · 10 min read
Liangxu Linux
Nov 23, 2025 · Operations

20 Essential Linux Commands Every Ops Engineer Must Master

This article presents twenty indispensable Linux command‑line tools—covering system monitoring, performance analysis, process management, network diagnostics, disk handling, and kernel tuning—explaining their syntax, practical tips, common pitfalls, and how they integrate with modern cloud‑native environments.

Linux · Network Diagnostics · Operations
0 likes · 12 min read
Raymond Ops
Nov 22, 2025 · Operations

Master Rsync Backup: From Basics to Real-World Deployment

This guide walks through the fundamentals of data backup, explains why backups are essential, and provides a comprehensive tutorial on using Rsync—including its concepts, sync modes, configuration, common options, service deployment, and real‑world scenarios such as push/pull transfers, bidirectional sync, and bandwidth‑limited backups.

Backup · Linux · Operations
0 likes · 16 min read
Xiao Liu Lab
Nov 21, 2025 · Operations

How to Stop Docker from Eating Your Disk Space: Proven Cleanup Strategies

This guide explains why Docker can rapidly fill storage, shows how to pinpoint the biggest space consumers, and provides tiered, production‑ready cleanup commands, automation scripts, and monitoring setups to keep container environments healthy and efficient.

Container Management · Disk Cleanup · Docker
0 likes · 10 min read
ITPUB
Nov 20, 2025 · Operations

What Triggered Cloudflare’s Massive November 2025 Outage? Inside the Bot Management Failure

On November 18, 2025, Cloudflare suffered a multi‑hour network outage that crippled major services worldwide. A ClickHouse permission change generated oversized bot‑management feature files, leading to 5xx errors across CDN, security, and authentication layers and prompting a complex, step‑by‑step remediation effort.

Bot Management · ClickHouse · Cloudflare
0 likes · 19 min read
DevOps Coach
Nov 18, 2025 · Operations

Why Platform Engineering Is the Next Evolution of DevOps for 2025

Platform engineering emerges as the new DevOps, offering internal developer platforms that streamline complex microservice ecosystems, reduce tool sprawl, enforce golden paths, and empower developers while relieving ops teams, with practical steps, real‑world case studies, and a roadmap for organizations of any size to boost productivity and reliability.

DevOps · Golden Path · Internal Developer Platform
0 likes · 9 min read
Xiao Liu Lab
Nov 18, 2025 · Operations

Mastering Ops: Security, High Availability, and Fault Diagnosis for Interviews

This article compiles concise, high‑scoring answers to essential operations interview questions, covering security hardening, intrusion response, high‑availability architecture, disaster‑recovery design, Redis replication and clustering, Docker fundamentals and networking, Kubernetes components, monitoring, CI/CD pipelines, and the evolving role of DevOps.

Docker · Kubernetes · Operations
0 likes · 14 min read
21CTO
Nov 18, 2025 · Operations

What Cloudflare’s Latest Outage Reveals About Cloud Dependency Risks

A massive Cloudflare outage on November 18, 2025 crippled DNS and CDN services, causing widespread failures for platforms like ChatGPT and Discord. The article analyzes the incident, reviews past failures, and offers four practical resilience strategies to mitigate over‑reliance on a single cloud provider.

CDN · Cloudflare · DNS
0 likes · 7 min read
Efficient Ops
Nov 17, 2025 · Operations

Mastering pwru: A Step‑by‑Step Guide to eBPF Packet Tracing with Cilium

This article introduces pwru, Cilium's eBPF‑based packet‑tracing tool, explains kernel requirements, shows how to install the pre‑built binary, details command‑line options, and provides practical examples for filtering, output customization, and debugging dropped packets in Linux networking.

Cilium · Linux networking · Operations
0 likes · 6 min read
dbaplus Community
Nov 15, 2025 · Operations

What 20 Years of Ops Mishaps Reveal About Infrastructure Resilience

A chronicle of real‑world operations incidents from 2003 to 2024 shows how simple mistakes—mis‑configured passwords, unplugged cables, hardware mix‑ups, and rushed cloud migrations—can cascade into massive outages, offering hard‑earned lessons for anyone managing production systems.

Case Study · Incident · Infrastructure
0 likes · 11 min read
Ray's Galactic Tech
Nov 14, 2025 · Operations

Mastering Nginx Under High Load: Practical Tuning & Troubleshooting Guide

Learn how to identify and resolve common high‑concurrency bottlenecks for Nginx by optimizing OS limits, network stack, Nginx configuration, logging, reverse‑proxy settings, backend services, and hardware resources, with concrete commands, parameter values, and step‑by‑step troubleshooting procedures.

Backend · Operations · System optimization
0 likes · 6 min read
Raymond Ops
Nov 14, 2025 · Operations

Step‑by‑Step Guide to Install an ElasticSearch 7.17.x Cluster on Ubuntu

This tutorial walks through installing Java, configuring hostnames and hosts files, synchronizing time, tuning system parameters, creating Elasticsearch directories and users, downloading and extracting ElasticSearch 7.17.x, setting up its configuration and systemd service, starting the three‑node cluster, and verifying its health on Ubuntu 22.04.

Cluster · DevOps · Installation
0 likes · 12 min read
Xiao Liu Lab
Nov 10, 2025 · Operations

Why a Healthy Frontend Still Returns 504 Errors: An MTU Mismatch Case Study

A production incident showed that despite flawless frontend health metrics and no logged errors, a subset of users experienced 504 Gateway Timeout errors caused by an MTU mismatch in the network path, highlighting the need for end‑to‑end connectivity checks beyond application monitoring.

504 timeout · MSS clamping · MTU
0 likes · 9 min read
Ops Community
Nov 10, 2025 · Operations

8 Essential Skills Every Senior Ops Engineer Must Master

This article outlines the eight critical competencies—from Linux and scripting to cloud, container orchestration, and automation—that define the career progression of an operations professional and explains why mastering them is key to advancing to senior roles.

Linux · Operations · System Administration
0 likes · 3 min read
MaGe Linux Operations
Nov 10, 2025 · Operations

8 Essential Skills Every Senior Ops Engineer Must Master

This article outlines the eight critical competencies—ranging from Linux and scripting to cloud, container orchestration, and automation—that distinguish senior operations engineers and are essential for career advancement and personal growth.

Automation · Container · Linux
0 likes · 3 min read
MaGe Linux Operations
Nov 10, 2025 · Operations

100 Essential Operations Interview Questions to Boost Your Career

This article compiles 100 common operations interview questions from major tech companies, covering DevOps principles, CI/CD, infrastructure as code, monitoring, automation, Linux system administration, security, and scripting, providing a comprehensive study guide for aspiring sysadmins and site reliability engineers.

Automation · DevOps · Operations
0 likes · 4 min read
Alibaba Cloud Observability
Nov 10, 2025 · Cloud Native

How a Next‑Gen Cloud‑Native Observability Platform Boosted Ticketing Stability by 80%

A leading digital‑entertainment group tackled severe stability and monitoring challenges in its high‑traffic ticketing system by building a cloud‑native, full‑link observability platform on Alibaba Cloud, achieving an 80% improvement in fault detection speed, a 40% reduction in operational costs, and establishing data‑driven operations as the digital foundation for product growth.

Observability · Operations · aiops
0 likes · 15 min read
Xiao Liu Lab
Nov 9, 2025 · Operations

50 Essential Docker Maintenance Commands for Daily Ops and Security

This guide compiles 50 practical Docker commands covering daily status checks, weekly resource cleanup, monthly security hardening, logging and monitoring, image management, high‑availability, and disaster‑recovery, helping operators maintain healthy containers across Rocky, CentOS, and Kylin environments.

Container · Docker · Operations
0 likes · 10 min read
Liangxu Linux
Nov 8, 2025 · Operations

Boost Your Ops Efficiency: 30 Essential Vim Shortcuts Every Engineer Should Master

This comprehensive guide explains why Vim is a must‑have tool for modern operations engineers, introduces its three core modes, details 30 high‑impact shortcuts with real‑world examples such as Nginx configuration tuning, log file analysis, and bulk parameter updates, and provides advanced techniques, performance tweaks, plugin recommendations, and a skill‑development roadmap to dramatically accelerate daily text‑editing tasks.

Configuration Management · Linux · Operations
0 likes · 21 min read
DataFunTalk
Nov 6, 2025 · Cloud Native

How Tencent Music Cut Kafka Costs by 50% with Cloud‑Native AutoMQ

Tencent Music migrated its massive Kafka streaming infrastructure to the cloud‑native AutoMQ platform, slashing operational costs by over half, achieving second‑level partition migration, and dramatically improving scaling efficiency while maintaining high‑throughput, low‑latency data processing for its music services.

AutoMQ · Cost Optimization · Data Streaming
0 likes · 16 min read
Open Source Linux
Nov 6, 2025 · Operations

How to Break the 20K Salary Ceiling in Operations: 4 Power Moves

This article reveals why many ops engineers are stuck below 20K, outlines four high‑impact practices—including coding automation, mastering cloud‑native, aligning with business performance, and shifting from firefighting to prevention—and presents concrete career paths and daily actions to boost expertise and salary.

Operations · career · cloud-native
0 likes · 7 min read
Continuous Delivery 2.0
Nov 6, 2025 · Operations

How Spotify Manages Weekly Mobile App Releases: Balancing Speed and Quality

Spotify’s weekly iOS and Android mobile app releases reach over 675 million users, and the release team balances rapid delivery with rigorous quality checks through coordinated tooling, bug prioritization, and a detailed release‑cycle process that includes dashboards, alpha/beta testing, and staged rollouts.

Continuous Delivery · Operations · Spotify
0 likes · 13 min read
Radish, Keep Going!
Nov 4, 2025 · Artificial Intelligence

What You Need to Know: Backpropagation, FreeBSD, AI MoE, and More Tech Insights

This roundup covers essential insights on backpropagation fundamentals, FreeBSD self‑hosting benefits, an open‑source 30B MoE AI model, misuse of cybercrime laws, historic moving sidewalks, party‑planning hacks, deceptive signal‑strength tricks, a 1000‑hp micro motor, Nextcloud performance fixes, and Google Cloud account suspensions, offering a blend of technical depth and practical advice.

AI · Backpropagation · Deep Learning
0 likes · 11 min read
Efficient Ops
Nov 3, 2025 · Operations

Why Uptime Kuma Is the Lightweight Self‑Hosted Monitoring Solution You Need

Uptime Kuma is a lightweight, self‑hosted monitoring tool with a web UI that tracks service uptime across multiple protocols, offers rich notification integrations, 20‑second intervals, and easy Docker or manual installation, making it a practical alternative to heavyweight solutions for ops teams.

Docker · Operations · Uptime Kuma
0 likes · 4 min read
Open Source Linux
Nov 3, 2025 · Operations

Master Linux ‘top’: Decode System Metrics and Boost Performance

This guide walks you through every line of the Linux top command output, explaining system summaries, CPU and memory metrics, process details, and advanced shortcuts, so you can quickly diagnose performance bottlenecks and become proficient at real‑time system troubleshooting.

CPU usage · Linux · Operations
0 likes · 7 min read
Liangxu Linux
Nov 1, 2025 · Operations

Master Essential Linux Command-Line Tricks for Faster Sysadmin Work

This guide presents ten practical Linux command-line shortcuts and techniques—ranging from cursor navigation and Vim editing to quick directory switching, file transfer, process inspection, and output logging—designed to boost productivity for system administrators and developers working in terminal environments.

Bash · Linux · Operations
0 likes · 10 min read
转转QA
Oct 31, 2025 · Operations

Boosting Service Quality with Intelligent Inspection, Notification, and Automation Engines

This article outlines the design and value of an automated service quality monitoring platform, detailing its core benefits—intelligent detection, automated execution, data‑driven decisions, and precise notifications—along with functional architecture, key modules, code examples, technical requirements, and practical recommendations.

AI Monitoring · Automation · Backend
0 likes · 10 min read
Liangxu Linux
Oct 30, 2025 · Operations

Boost Your Linux Ops: Advanced CLI Tricks and Fast Error Fixes

This guide equips Linux operations engineers with powerful command‑line shortcuts, automation tips, and step‑by‑step troubleshooting procedures for common errors such as permission issues, disk‑space exhaustion, missing commands, high load, and port conflicts, dramatically improving incident response speed and system reliability.

CLI · Linux · Operations
0 likes · 13 min read
Xiao Liu Lab
Oct 30, 2025 · Operations

Why systemd Timers Outperform crontab and How to Migrate Your Jobs

This article explains why the built‑in systemd timer engine is a more reliable, observable, and feature‑rich replacement for traditional crontab, and provides a step‑by‑step guide to rewrite, configure, and manage your scheduled tasks on Linux.

Automation · Operations · crontab
0 likes · 9 min read
High Availability Architecture
Oct 30, 2025 · Operations

How Tencent Music Cut Kafka Costs by 50% with Cloud‑Native AutoMQ

Tencent Music replaced its traditional Kafka clusters with the cloud‑native AutoMQ platform, slashing infrastructure costs by over half, achieving second‑level partition migration, and dramatically simplifying operations while maintaining high‑throughput, low‑latency data streams for its massive music services.

AutoMQ · Cloud Native · Data Streaming
0 likes · 17 min read
Alibaba Cloud Big Data AI Platform
Oct 28, 2025 · Big Data

How Huolala Scaled Elasticsearch to 40B Records with Serverless Cloud Architecture

Huolala, a leading smart logistics platform serving over 14 markets and millions of users, detailed its massive Elasticsearch deployment (over 15,000 CPU cores, 40 billion records, and 4 PB of data), highlighting its multi‑AZ design, serverless migration, and a comprehensive management platform that boosted performance, reduced costs, and enabled AI‑driven services.

AI search · Big Data · Elasticsearch
0 likes · 10 min read
MaGe Linux Operations
Oct 27, 2025 · Operations

Essential Ops Playbook: Real‑World Linux Tuning & Incident Diagnosis

This article walks ops engineers through a real production incident, explains why deep Linux kernel knowledge is crucial, presents typical high‑traffic, log‑burst, and DB‑slow‑query scenarios, and shares a three‑step practical tuning methodology with code snippets, monitoring scripts, and future‑proof tips such as eBPF and AIOps.

Linux · Operations · System Tuning
0 likes · 14 min read
Ray's Galactic Tech
Oct 26, 2025 · Operations

How to Diagnose and Fix the 9 Most Common Nginx Errors

This guide works through common Nginx problems: typical error codes, missing client IPs, WebSocket proxy failures, load‑balancing issues, static file problems, large upload limits, SSL/TLS errors, cache misses, and rate limiting. For each, it provides root‑cause analysis, step‑by‑step checks, configuration fixes, and useful command‑line tools.

502 · 504 · Nginx
0 likes · 7 min read
MaGe Linux Operations
Oct 24, 2025 · Operations

Master Linux Ops: 50 Essential Commands for Fast Troubleshooting

A comprehensive cheat sheet of 50 essential Linux commands covering system information, file management, process control, network diagnostics, disk utilities, log inspection, performance monitoring, security, and SSH, complete with usage examples, practical scenarios, and best‑practice recommendations for daily operations and incident response.

Operations · command-line · system-administration
0 likes · 46 min read
Xiao Liu Lab
Oct 24, 2025 · Operations

Why Nginx Caches DNS for Weeks and How to Fix It

In production, Nginx cached DNS lookups for up to a month, so requests kept going to stale IPs after CDN changes. This article explains the root cause and demonstrates how to configure upstream health checks and the built‑in resolver to ensure timely DNS updates and avoid prolonged outages.

DNS · Operations · load balancing
0 likes · 6 min read
DevOps Coach
Oct 22, 2025 · Cloud Native

Simplify Scalable Kubernetes Pod Logging with Grafana podLogs

This guide explains how Grafana's podLogs feature, powered by Vector.dev, transforms raw Kubernetes pod logs into enriched, searchable, cluster‑wide observability data, covering why pod‑level logs matter, configuration steps, advanced custom log paths, and practical examples.

Cloud Native · Grafana · Kubernetes
0 likes · 14 min read
Ray's Galactic Tech
Oct 22, 2025 · Operations

Master Docker Management with a Powerful Bash Automation Script

This article provides a comprehensive, enhanced Docker automation Bash script—docker‑manager.sh—covering container lifecycle commands, image cleanup, backup, network inspection, log viewing, and configuration export, along with step‑by‑step usage instructions and additional handy Docker commands for efficient container management.

Automation · Bash · Container Management
0 likes · 11 min read
MaGe Linux Operations
Oct 21, 2025 · Operations

Mastering Prometheus: Proven Strategies to Optimize Monitoring Performance

This article shares real‑world experiences and step‑by‑step techniques—including metric pruning, sampling interval tuning, TSDB configuration, query rewriting, and federation—to dramatically improve Prometheus memory usage, query latency, and overall scalability for large‑scale cloud‑native environments.

Operations · Prometheus · cloud-native
0 likes · 11 min read
Liangxu Linux
Oct 19, 2025 · Operations

100 Essential Windows Command-Line Tricks Every Sysadmin Should Know

This comprehensive guide lists 100 practical Windows command‑line utilities covering system management, network diagnostics, file and disk operations, process and user handling, as well as advanced operational commands, complete with high‑risk warnings and best‑practice tips for safe administration.

Operations · PowerShell · Windows
0 likes · 14 min read
Efficient Ops
Oct 19, 2025 · Operations

How China’s New DevOps Standards Are Shaping Global IT Operations

The article outlines China’s 2024‑2027 information standard action plan, CAICT’s dual ITU‑DevOps and domestic assessments, key results from the 27th GOPS conference, industry participation statistics, and the emerging BizDevOps framework that together drive international standardization and digital transformation in operations.

BizDevOps · International Standards · Operations
0 likes · 11 min read
dbaplus Community
Oct 14, 2025 · Operations

Can Intelligent DNS Safely Power Dual‑Active Database Access? Pros, Cons, and Best Practices

This article explains the fundamentals of intelligent DNS, its resolution process and strategy algorithms, evaluates its advantages and drawbacks, and critically examines whether using intelligent DNS for database access in dual‑active architectures is appropriate, highlighting potential pitfalls and recommended practices.

Database Access · Dual-Active Architecture · Intelligent DNS
0 likes · 16 min read
Ops Community
Oct 14, 2025 · Operations

Mastering Ansible: A Complete Guide to Automated Operations Standards

Discover how to replace chaotic shell scripts with a comprehensive, Ansible‑based automation framework that covers tool selection, architecture design, standardized directory structures, inventory management, variable hierarchy, role development, secure vault usage, real‑world multi‑environment deployments, baseline configurations, monitoring, CI/CD integration, and best‑practice guidelines for modern operations teams.

Ansible · Infrastructure as Code · Operations
0 likes · 34 min read
Continuous Delivery 2.0
Oct 13, 2025 · Operations

How Google’s SRE Evolved Over 20 Years: From Crisis to Industry Standard

This article traces Google Site Reliability Engineering from its 2003 inception addressing scale crises, through organizational growth, core principles, team structures, and recent security integrations, showing how SRE transformed operations into a software‑engineering discipline that drives reliable, scalable digital services.

Error Budget · Google · Operations
0 likes · 13 min read
Liangxu Linux
Oct 8, 2025 · Operations

Mastering High‑Load Linux Server Performance: Diagnose and Fix Bottlenecks

When a Linux server spikes to 90% CPU, memory pressure grows, and database connections are exhausted, this guide walks you through a systematic methodology, essential tools, real‑world case studies, and practical optimizations to quickly locate and resolve performance bottlenecks.

Linux · Operations · Server Monitoring
0 likes · 15 min read
MaGe Linux Operations
Oct 7, 2025 · Operations

7 Fatal Monitoring Alert Mistakes That Keep You Up at 3 AM—and How to Fix Them

This article examines why ops engineers are repeatedly woken by false alerts, outlines seven common monitoring alert pitfalls—from over‑alerting to static thresholds—and provides practical solutions such as golden‑signal rules, dynamic baselines, alert enrichment, routing, suppression, and continuous quality audits.

Alerting · DevOps · Observability
0 likes · 27 min read
dbaplus Community
Oct 7, 2025 · Operations

Why Ops Engineers Need a Massive Knowledge Stack—and How to Master It

This comprehensive guide explains why modern operations engineers must cover the full technology stack, outlines common learning pitfalls, presents a three‑layer, nine‑domain knowledge framework, and offers a step‑by‑step, personalized roadmap with practical labs and career‑growth advice.

Automation · Career Development · DevOps
0 likes · 14 min read
MaGe Linux Operations
Oct 6, 2025 · Operations

Avoid the Fatal Ops Mistakes That Could Ruin Your Career – 10 Critical Pitfalls and How to Prevent Them

Drawing on real-world incidents and Gartner 2023 data, this article reveals ten deadly operational pitfalls—from executing untested commands in production to inadequate backups—and offers concrete technical safeguards, process controls, and cultural practices to help engineers avoid costly errors and protect their careers.

Automation · Backup · Operations
0 likes · 27 min read
Architect's Guide
Oct 6, 2025 · Operations

Mastering Graceful Shutdown in Kubernetes: Real-World Spring Boot & Nacos Cases

This article explains the concept of graceful shutdown, walks through detailed Kubernetes pod termination steps, presents real-world Spring Boot and Nacos integration cases, analyzes common pitfalls such as premature termination and message loss, and offers practical optimization strategies for handling MQ, scheduled tasks, and traffic control.

Graceful Shutdown · Kubernetes · Nacos
0 likes · 12 min read
MaGe Linux Operations
Oct 5, 2025 · Operations

What Skills Do 500k‑Salary Ops Engineers Master? A Complete Roadmap

This comprehensive guide breaks down the eight essential competencies—from deep Linux kernel knowledge and database optimization to cloud‑native orchestration, observability, automation, security, and business‑focused soft skills—that distinguish 500k‑salary operations engineers and provides a practical roadmap for mastering each area.

Career Development · Operations · monitoring
0 likes · 45 min read
MaGe Linux Operations
Oct 4, 2025 · Operations

How I Doubled My Salary by Switching from Traditional Ops to SRE in 18 Months

Over 18 months, the author details a step‑by‑step transformation from a fire‑fighting traditional operations role to a high‑paying SRE/DevOps career, covering motivations, skill gaps, learning plans, project implementations, interview preparation, and real‑world outcomes, offering a practical roadmap for engineers seeking similar growth.

Cloud Native · Operations · SRE
0 likes · 44 min read
Ops Community
Oct 4, 2025 · Databases

How to Quickly Diagnose and Fix a Frozen MySQL in Production: 5 Proven Steps

Facing a MySQL that suddenly becomes unresponsive in production? This article walks through the exact five‑step investigative process—checking process status, examining connections, locating lock waits, analyzing slow queries and system bottlenecks, and applying emergency recovery—illustrated with real‑world examples and command‑line snippets.

Operations · Production Incident · database troubleshooting
0 likes · 19 min read