Author

Raymond Ops

Linux ops automation, cloud-native, Kubernetes, SRE, DevOps, Python, Golang and related tech discussions.

627 Articles · 3.1k Views

Latest from Raymond Ops

100 recent articles max

Raymond Ops

Mar 10, 2026 · Operations

How to Master Service Avalanche Recovery: A Complete SRE Playbook from Alert to Restoration

This guide walks SRE and senior operations engineers through a real-world service‑avalanche incident, detailing alert hierarchy design, fault‑location commands, emergency SOPs, capacity‑baseline building, and post‑mortem best practices to dramatically reduce MTTR in distributed micro‑service environments.

Prometheus · SRE · Service Avalanche

0 likes · 19 min read


Raymond Ops

Mar 10, 2026 · Operations

How to Quickly Diagnose and Fix High CPU Usage on Linux: 10 Root Causes & Step‑by‑Step Guide

This guide walks you through detecting, analyzing, and resolving Linux CPU spikes by monitoring overall load, pinpointing the offending process, drilling down with tools like top, ps, strace, perf, and sar, and applying targeted fixes for the ten most common causes.

CPU · Linux · troubleshooting

0 likes · 19 min read


Raymond Ops

Mar 7, 2026 · Cloud Native

Master Kubernetes Troubleshooting: From Pod Crashes to Network Failures

This guide covers Kubernetes troubleshooting end to end: core components, six major failure types, a three-step troubleshooting methodology, and six real-world case studies with commands, manifests, monitoring setups, and preventive best practices.

Network · pod · storage

0 likes · 36 min read


Raymond Ops

Mar 7, 2026 · Operations

7 Hidden Traps in Nginx+Lua Gray Releases and How to Fix Them

This article reveals seven pitfalls that can cripple Nginx+Lua gray-release deployments: memory leaks, blocking I/O, uneven traffic hashing, configuration reload races, cross-datacenter latency, session stickiness issues, and monitoring blind spots. It provides concrete Lua scripts, Nginx configurations, monitoring commands, and step-by-step remediation strategies.

DevOps · Gray Release · Lua

0 likes · 43 min read


Raymond Ops

Mar 6, 2026 · Cloud Native

Scaling Kubernetes from 1k to 5k Nodes: Complete Performance Tuning Playbook

This article presents a comprehensive, real‑world guide for expanding a Kubernetes cluster from 1,000 to 5,000 nodes, covering control‑plane HA, etcd optimization, network and scheduler tuning, monitoring, and automation, with detailed configurations, code snippets, and a step‑by‑step case study of a large‑scale production environment.

CNI · Control Plane · Performance Tuning

0 likes · 22 min read


Raymond Ops

Mar 4, 2026 · Operations

Build an Enterprise‑Grade DevOps CI/CD Pipeline in 7 Days with Ready‑to‑Use Scripts

This guide walks you through constructing a full‑stack, enterprise‑level DevOps pipeline—from environment preparation and tool installation to Jenkins pipeline scripting, Kubernetes deployment, monitoring, security hardening, and cost optimization—providing complete scripts and step‑by‑step instructions to achieve automated, reliable releases within a week.

CI/CD · DevOps · Docker

0 likes · 27 min read


Raymond Ops

Mar 3, 2026 · Operations

How I Turned a Firefighter Ops Engineer into a High‑Paid Tech Expert in 3 Years

This article chronicles a three‑year journey from a junior operations engineer blamed for outages to a senior technical specialist, detailing the four pivotal turning points, concrete learning plans, automation projects, cost‑optimization strategies, and actionable advice for anyone seeking to advance in modern operations.

career · cloud-native · monitoring

0 likes · 27 min read


Raymond Ops

Mar 2, 2026 · Operations

Why Most Alerts Fail and How to Build a Night‑Quiet, High‑Signal Monitoring System

This article examines the root causes of alert fatigue—mis‑configured thresholds, noisy alerts, lack of context, and poor routing—then presents a step‑by‑step guide using golden signals, dynamic baselines, enriched alert payloads, severity‑based routing, and suppression techniques to create an effective, low‑noise monitoring system.

Alertmanager · Prometheus · SRE

0 likes · 24 min read


Raymond Ops

Mar 2, 2026 · Cloud Native

ELK vs EFK vs Loki: 2025’s Best Log Solution for Cost, Performance & Simplicity

This comprehensive 2025 guide compares ELK, EFK, and Loki across architecture, deployment complexity, storage cost, query performance, feature completeness, high‑availability, and real‑world case studies, helping teams of any size choose the most cost‑effective and operationally suitable log collection stack.

EFK · ELK · Log Aggregation

0 likes · 37 min read


Raymond Ops

Mar 1, 2026 · Operations

How I Transitioned from Traditional Ops to SRE/DevOps in 18 Months

This detailed guide shares a step‑by‑step 18‑month roadmap, covering self‑assessment, skill acquisition (Python, Kubernetes, monitoring), project execution, interview preparation, and real‑world outcomes for engineers moving from legacy operations to SRE/DevOps roles.

CI/CD · Kubernetes · Python

0 likes · 35 min read
