Tagged articles

kubernetes

4150 articles · Page 1 of 42

Jul 4, 2026 · Operations

20 Common Ops Rookie Mistakes and How to Avoid Them

This guide lists the twenty most frequent pitfalls that new operations engineers encounter, explains why they happen, and provides step‑by‑step safe practices, code examples, risk classifications and a verification checklist to help prevent costly outages and data loss.

Linux · Operations · database

0 likes · 28 min read

Raymond Ops

Jul 1, 2026 · Operations

Memory Leak Postmortem: Combining free, smem, pmap, and perf for Effective Diagnosis

A thumbnail service suffered sudden latency spikes and OOM kills shortly after a new release. The author walks through a systematic investigation using free, smem, pmap, and perf to distinguish true memory leaks from page‑cache or shared‑page artifacts, pinpoint the native decoder buffer issue, and outline remediation steps.

Linux · kubernetes · memory-leak

0 likes · 29 min read

Golang Shines

Jul 1, 2026 · Operations

10 Essential Ops Tools That Can Cut Your Overtime by 80%

This article introduces ten Linux operations tools—Shell scripts, Git, Ansible, Prometheus, Grafana, Docker, Kubernetes, Nginx, ELK Stack, and Zabbix—detailing their functions, typical use cases, advantages, and concrete examples to help engineers streamline daily tasks and dramatically reduce overtime.

Ansible · Docker · Git

0 likes · 9 min read

dbaplus Community

Jun 29, 2026 · Cloud Computing

Why More Companies Are Dropping VMware for Proxmox

Since 2024, a growing number of enterprises, especially small‑to‑medium businesses and some large firms, have been re‑evaluating the cost‑driven VMware licensing model and migrating to the open‑source Proxmox VE platform, which bundles KVM, LXC, Ceph, backup and clustering into a free, easy‑to‑manage solution that fits modern AI and Kubernetes workloads.

Cloud Native · Proxmox · VMware

0 likes · 6 min read

Golang Shines

Jun 29, 2026 · Cloud Native

How I Built a Production‑Ready HA Kubernetes Cluster with Private Harbor in Minutes

The author shares a complete 83‑page step‑by‑step guide that enabled a rapid, production‑grade high‑availability Kubernetes cluster integrated with a private Harbor registry, dramatically cutting setup time and improving cloud‑native operational reliability.

Cloud Native · Cluster Setup · Harbor

0 likes · 2 min read

Raymond Ops

Jun 28, 2026 · Operations

Why Large‑Model Services Keep Running Out of GPU Memory: An Ops View from KV Cache to Concurrency

The article explains why large‑model inference services frequently hit GPU memory limits, breaks down static vs. dynamic memory consumption, shows how KV‑Cache, request length, and concurrency amplify usage, and provides a step‑by‑step troubleshooting and mitigation workflow for production environments.

GPU memory · Inference Optimization · KV cache

0 likes · 26 min read

Architect's Guide

Jun 28, 2026 · Cloud Native

Kubernetes Networking Explained with 16 Detailed Diagrams

This article provides a comprehensive, diagram‑driven analysis of Kubernetes networking, covering underlay and overlay models, the role of VLAN, OSPF, BGP, and various CNI plugins such as Flannel host‑gw, Calico BGP, IPVLAN/MACVLAN, Multus, and DANM, as well as tunnel technologies like VXLAN and IPIP.

CNI · Calico · Flannel

0 likes · 13 min read

Raymond Ops

Jun 27, 2026 · Operations

Hands‑On DNS Ops: Deploy BIND and CoreDNS with Full Troubleshooting Guide

This comprehensive guide walks you through DNS fundamentals, compares BIND, CoreDNS, PowerDNS and Unbound, provides step‑by‑step deployment scripts for BIND 9.20 and CoreDNS 1.12, explains DNSSEC configuration, caching optimizations, security hardening, high‑availability designs, monitoring, backup and recovery procedures, and advanced troubleshooting techniques.

BIND · CoreDNS · DNS

0 likes · 43 min read

Cloud Native Technology Community

Jun 26, 2026 · Cloud Native

What New Changes in Kubernetes 1.36 Should Platform Teams Watch in the AI Era?

Kubernetes 1.36 marks a shift toward native governance of complex AI resources, workload‑aware scheduling, fine‑grained security, and control‑plane scalability, urging platform teams to rethink resource allocation, isolation, and governance as AI workloads move into production.

AI workloads · Cloud Native · DRA

0 likes · 10 min read

Golang Shines

Jun 26, 2026 · Cloud Native

Why Every Ops Role Now Demands Kubernetes Skills (And a 100‑Question K8s Interview Guide)

Laid off after five years in operations, the author found that every job listing now requires Docker and Kubernetes expertise and compiled a comprehensive "100 K8s Interview Questions" guide covering core concepts, architecture, resource management, networking, storage, security, troubleshooting, and ecosystem tools.

Cloud Native · Container Orchestration · Docker

0 likes · 7 min read

Alibaba Cloud Infrastructure

Jun 26, 2026 · Cloud Computing

How Kimi’s AI Agent Scales on Alibaba Cloud – Architecture, Elastic Sandbox, and Cost Optimisation

The article analyses how Kimi’s AI Agent workloads are deployed on Alibaba Cloud using ACK and the ACS Agent Sandbox, detailing the challenges of massive concurrency, rapid sandbox start‑up, state continuity, cost‑effective scaling, and the security and scheduling mechanisms that enable production‑grade performance.

AI Agent · Alibaba Cloud · Cost Optimisation

0 likes · 19 min read

Architect Chen

Jun 25, 2026 · Cloud Native

Four Key Ways to Deploy Microservices: From Bare Metal to Kubernetes

The article compares four microservice deployment approaches—physical servers, virtual machines, containerization with Docker, and Kubernetes clusters—detailing their implementation, advantages, drawbacks, and ideal scenarios, helping teams choose the most suitable strategy based on resource isolation, scalability, operational complexity, and team expertise.

Cloud Native · Microservices · container

0 likes · 6 min read

Ops Development & AI Practice

Jun 24, 2026 · Information Security

Ending Hard‑Coded Rules: OPA Policy‑as‑Code for Unified SecOps Guardrails

The article explains how enterprises can replace fragmented, hard‑coded security checks in Terraform, CI/CD pipelines, Kubernetes admission webhooks, and API gateways with a unified, declarative policy engine—Open Policy Agent—using Rego to decouple decision and enforcement, enabling fast, auditable SecOps guardrails across the entire software lifecycle.

CI/CD · OPA · Policy-as-Code

0 likes · 12 min read

Raymond Ops

Jun 22, 2026 · Artificial Intelligence

Elastic Deployment and GPU Scheduling for Large‑Model Inference with vLLM on Kubernetes

This article presents a detailed, step‑by‑step analysis of deploying the high‑performance vLLM inference engine on Kubernetes, covering GPU memory management, tensor parallelism, quantization choices, continuous batching, and automated scaling with HPA/KEDA to achieve low latency and high throughput for large language models.

Docker · GPU scheduling · LLM Inference

0 likes · 49 min read

TonyBai

Jun 22, 2026 · Cloud Native

Why Go Dominates CNCF: How It Outpaces Java, C++ and Rust in the Cloud‑Native Era

An in‑depth analysis explains how Go’s historical ties to Google, lightweight binaries, memory safety, cross‑compilation ease, and balanced performance‑vs‑devex make it the default language for CNCF projects, sidelining Java, C++, and Rust despite their technical merits.

CNCF · Cloud Native · Developer Experience

0 likes · 11 min read

Raymond Ops

Jun 21, 2026 · Cloud Native

Stop Pods From “Running Wild”: A Practical Guide to Kubernetes Scheduling Strategies

This guide explains why default Kubernetes scheduling often falls short in production, introduces nodeSelector, nodeAffinity, podAffinity/anti‑affinity, taints/tolerations, topologySpreadConstraints and PriorityClass, and provides step‑by‑step configuration examples, real‑world use cases, best‑practice recommendations, troubleshooting tips, and monitoring alerts to ensure reliable pod placement.

Pod Scheduling · PriorityClass · Taints and Tolerations

0 likes · 36 min read

Raymond Ops

Jun 20, 2026 · Operations

Eliminate Monitoring Blind Spots: Hands‑On Enterprise‑Grade Prometheus + Grafana Deployment

This comprehensive guide walks you through the end‑to‑end setup of a production‑grade Prometheus and Grafana monitoring stack, covering architecture choices, installation steps, configuration details, high‑availability designs, performance tuning, security hardening, troubleshooting, backup strategies, and best‑practice recommendations.

Alerting · High Availability · Monitoring

0 likes · 49 min read

DataFunTalk

Jun 19, 2026 · Artificial Intelligence

How NVIDIA Dynamo Boosts Multi‑Node Distributed Inference MFU for Agentic AI

The article explains how NVIDIA Dynamo tackles the production bottlenecks of Agentic AI by using KV‑Cache‑aware routing, a three‑stage multimodal inference architecture, and intelligent cache scheduling on Kubernetes to improve multi‑node MFU (model FLOPs utilization) while maintaining latency SLAs.

Agentic AI · Distributed Inference · KV cache

0 likes · 3 min read

AI Agent Super App

Jun 18, 2026 · Operations

Free 50GB+ Operations Learning Pack: Linux, Data Center, Kubernetes, Engineer Roadmap, Security

The author shares a curated collection of over 50 GB of free operations learning materials—including Linux system administration, data‑center fundamentals, Kubernetes clusters, a complete engineer learning path, and information‑security compliance—each with Baidu Cloud download links for beginners.

Data Center · Linux · Operations

0 likes · 6 min read

Programmer XiaoFu

Jun 18, 2026 · Cloud Native

Why Use Service Registration When Nginx Already Handles Load Balancing?

The article explains that Nginx’s static upstream configuration and passive health checks cannot keep up with dynamic microservice environments, while a service registry provides real‑time instance awareness, automatic failure detection, and metadata‑driven routing, making both tools complementary rather than interchangeable.

NGINX · eureka · kubernetes

0 likes · 9 min read

Architecture & Thinking

Jun 18, 2026 · Backend Development

How to Scale a Flash‑Sale System from Zero to 1 Million QPS: A Step‑by‑Step Architecture Guide

This article dissects the evolution of a flash‑sale system from a simple monolithic controller to a cloud‑native, micro‑service architecture that can handle over one million requests per second, detailing traffic‑shaping, multi‑level caching, async processing, and inventory‑consistency techniques.

Caching · Flash Sale · High concurrency

0 likes · 18 min read

Deepin Linux

Jun 18, 2026 · Cloud Native

Linux Kernel Networking: How veth, Bridges, and Overlay Forwarding Power Container Communication

This article explains the Linux kernel networking components—veth pairs, Linux bridges, and overlay networks—detailing their creation, configuration, packet flow, and troubleshooting with concrete Docker and Kubernetes examples, commands, and packet‑capture analysis.

Docker · Linux · Overlay

0 likes · 30 min read

Sohu Tech Products

Jun 17, 2026 · Cloud Native

Breaking Cloud‑Native Gateway Limits: Routing & Session Persistence for AI Sandboxes

The article details a cloud‑native gateway design that solves the zero‑loss routing and session‑persistence challenges of massive AI sandbox Web VNC streams by dissecting protocol stages, exposing classic gateway pitfalls, and presenting a two‑phase URL‑plus‑cookie routing architecture built on OpenResty, Lua, and Redis.

API Gateway · Dynamic Routing · OpenResty

0 likes · 26 min read

Raymond Ops

Jun 17, 2026 · Operations

Enterprise Monitoring with Prometheus: Rule Hierarchy and Alertmanager Notification Orchestration

This guide explains how to turn a fully built Prometheus monitoring system into a closed‑loop alerting solution by designing layered PromQL rules, configuring Alertmanager routing, grouping, inhibition and silencing, integrating DingTalk and WeChat webhooks, and applying best‑practice performance, security, high‑availability, and troubleshooting techniques.

Alerting · Alertmanager · High Availability

0 likes · 34 min read

Alibaba Cloud Native

Jun 17, 2026 · Cloud Native

From Half-Day to 6 Minutes: Embedding AI Agents into Organizational Structure to Accelerate Ticket Resolution

A 3 am alert that once required hours of manual triage is now closed in six minutes thanks to AgentTeams, a cloud‑native platform that treats AI agents as first‑class citizens, defines declarative organization structures, and orchestrates multi‑agent collaboration across development, operations, and open‑source workflows.

AI Agents · Automation · Cloud Native

0 likes · 21 min read

DataFunSummit

Jun 17, 2026 · Artificial Intelligence

Why Agentic AI Inference Is Slow and How NVIDIA Dynamo 1.1 Solves It

Developers deploying Agentic AI face multi‑turn latency caused by repeated token recomputation, KV‑cache eviction, and cold starts. NVIDIA Dynamo 1.1 addresses these issues with KV‑cache‑aware routing, multi‑level cache offload, priority scheduling, and Prefill/Decode separation, to be demonstrated in an upcoming Kubernetes‑based live session.

AI inference · Agentic AI · Distributed Inference

0 likes · 3 min read

Raymond Ops

Jun 16, 2026 · Cloud Native

Eliminate Permission Chaos: Kubernetes RBAC Design Standards and Implementation Guide

This guide explains how to design and implement a secure, least‑privilege RBAC model for multi‑team Kubernetes clusters, covering authentication methods, role and binding definitions, concrete YAML examples, CI/CD integration, audit scripts, performance tips, backup and recovery procedures, and common troubleshooting steps.

Access Control · RBAC · devops

0 likes · 35 min read

Architect's Tech Stack

Jun 14, 2026 · Backend Development

Why Quarkus Can Outrun Spring Boot: Launching Apps in Under 0.002 Seconds

The article compares Spring Boot and Quarkus, explaining how Quarkus’s build‑time optimizations, native image support, and container‑first design dramatically reduce startup time and memory usage, while also discussing development experience, extension mechanisms, and the trade‑offs involved in adopting the framework.

Java · MicroProfile · Native Image

0 likes · 14 min read

Raymond Ops

Jun 14, 2026 · Cloud Native

How to Handle Traffic Spikes and Optimize Resources with Kubernetes HPA + VPA

This guide walks through the problem of fluctuating traffic in Kubernetes, explains the differences between Horizontal Pod Autoscaler (HPA) and Vertical Pod Autoscaler (VPA), and provides step‑by‑step commands, YAML examples, best‑practice recommendations, troubleshooting tips, and monitoring alerts for deploying a production‑grade HPA + VPA solution.

Cloud Native · HPA · Metrics Server

0 likes · 41 min read

Architect Chen

Jun 14, 2026 · Cloud Native

All Essential Kubernetes Commands – 2026 Updated Guide

This article provides a concise, step‑by‑step reference of the most frequently used kubectl commands for Kubernetes, explaining each command's purpose, typical scenarios, useful options, and the information it reveals to help operators troubleshoot clusters, nodes, pods, deployments, logs, and resources.

Cloud Native · command-line · kubectl

0 likes · 4 min read

Raymond Ops

Jun 13, 2026 · Operations

What Is Load Average? Uncovering the Truth Behind System Load Metrics

Load Average is the average number of runnable and uninterruptible processes over 1‑, 5‑, and 15‑minute windows; it is not the same as CPU usage and is often misinterpreted. This article explains its kernel calculation, how to assess overload, how to troubleshoot CPU, I/O, or process‑count issues, and how to handle container‑specific distortions with cgroup v2 and LXCFS.

Linux · Monitoring · cgroup

0 likes · 38 min read

Golang Shines

Jun 13, 2026 · Cloud Native

Kubernetes (K8s) from Beginner to Hands‑On: Complete 2026 Guide

This step‑by‑step tutorial walks you through preparing the environment, installing container runtimes, setting up a single‑master multi‑worker K8s cluster, deploying applications, managing configurations, enabling persistent storage, configuring health probes, applying namespaces and quotas, troubleshooting common pitfalls, and adding Prometheus‑Grafana monitoring, all with concrete commands and examples.

Container Orchestration · Monitoring · deployment

0 likes · 14 min read

Raymond Ops

Jun 12, 2026 · Cloud Native

Choosing Between containerd and CRI‑O for Production Kubernetes: A Detailed Comparison

This article provides a comprehensive analysis of containerd and CRI‑O as Kubernetes container runtimes, covering their architectures, feature sets, installation procedures, migration strategies, performance benchmarks, best‑practice configurations, troubleshooting tips, and monitoring approaches to help operators decide which runtime best fits a production environment.

CRI-O · Monitoring · Production

0 likes · 47 min read

Huawei Cloud Developer Alliance

Jun 12, 2026 · Cloud Native

Unlock AgentCube on Huawei Cloud CCE to Build High‑Performance AI Agents

This guide explains how AgentCube, a Volcano sub‑project, enables rapid startup, high‑throughput scheduling, native session management, and strong isolation for AI Agent workloads on Huawei Cloud CCE, with step‑by‑step installation, configuration, and code examples demonstrating both CodeInterpreter and AgentRuntime.

AI Agent · AgentCube · AgentRuntime

0 likes · 15 min read

AI Agent Super App

Jun 12, 2026 · Operations

End‑to‑End Prometheus Monitoring: Deployment, Tuning, HA & Troubleshooting

This guide walks through the complete Prometheus monitoring lifecycle—from binary, Docker, and Kubernetes deployments to Ansible‑driven node_exporter rollout, SNMP switch and router monitoring, alert routing via WeChat, SMS and email, production‑grade tuning, high‑availability designs, and systematic troubleshooting.

Alertmanager · Ansible · Monitoring

0 likes · 25 min read

Raymond Ops

Jun 11, 2026 · Cloud Native

Master Istio: Core Service Mesh Concepts and Hands‑On Deployment Guide

This comprehensive guide explains Istio’s sidecar architecture, traffic management, mutual TLS security, and observability features, then walks through prerequisite checks, installation with istioctl and Helm, sample Bookinfo deployment, advanced configuration, troubleshooting, monitoring, and backup strategies for production‑grade service meshes.

Istio · Observability · Service Mesh

0 likes · 29 min read

Xiao Liu Lab

Jun 11, 2026 · Operations

Ops Engineer Core Skills: From Basic Commands to High‑Availability Architecture

This article provides a comprehensive roadmap for operations engineers, covering essential Linux commands, core system concepts, service principles, fault‑diagnosis methods, high‑availability architecture designs, data security, backup strategies, performance tuning, and automation scripts to handle both single‑machine and large‑scale cluster environments.

Automation · Docker · High Availability

0 likes · 13 min read

Ops Community

Jun 11, 2026 · Cloud Native

etcd Operations Handbook: Backup, Restore, Scaling, and Performance Tuning for Kubernetes

This guide explains why mastering etcd is essential for Kubernetes stability and walks through its core concepts, Raft consensus, MVCC storage, deployment, backup and restore procedures, scaling from three to five nodes, performance optimization, monitoring, alerting, troubleshooting, upgrade strategies, security hardening, and real‑world best‑practice recommendations.

Etcd · Monitoring · backup

0 likes · 49 min read

Alibaba Cloud Developer

Jun 11, 2026 · Artificial Intelligence

Building an AI‑Native Multi‑Agent Digital Human Architecture on Cloud Native

The article details how a cloud‑native platform called AgentTeams enables AI‑Native multi‑agent digital‑human teams to replace manual incident response, automate end‑to‑end development workflows, and securely integrate LLMs and internal services through declarative orchestration and fine‑grained permission models.

AI-native · AgentTeams · Automation

0 likes · 24 min read

dbaplus Community

Jun 10, 2026 · Operations

Why Deploying Kubernetes on Just Three Servers Is Overkill

The article argues that for startups with only a handful of servers, using systemd and simple scripts is far more practical and cost‑effective than adopting heavyweight Kubernetes orchestration, which adds unnecessary complexity and hidden expenses.

Operations · cost analysis · kubernetes

0 likes · 8 min read

Java Architect Essentials

Jun 9, 2026 · Cloud Native

Boost Spring Boot Service Availability to 99.9% with Smart K8s Probe Configurations

The article walks through common Kubernetes health‑probe pitfalls for Spring Boot services and presents a concrete set of liveness, readiness, graceful‑shutdown, autoscaling, and configuration‑separation techniques that together raise production availability to 99.9%, backed by real‑world incidents and code snippets.

Config Management · Graceful Shutdown · Health Probes

0 likes · 8 min read

Raymond Ops

Jun 9, 2026 · Cloud Native

Kubernetes Outage? Essential Troubleshooting Guide for Production Clusters

This comprehensive, step‑by‑step guide explains the most common Kubernetes failure scenarios, from pod crashes and image pull errors to node NotReady and API server timeouts, and provides concrete kubectl commands, diagnostic scripts, real‑world case studies, best‑practice recommendations, monitoring metrics, and backup‑restore procedures to keep production clusters healthy.

Cluster Operations · Etcd · Monitoring

0 likes · 37 min read

Ops Community

Jun 7, 2026 · Information Security

Practical Container Escape Detection and Defense Strategies

This article outlines a comprehensive, step‑by‑step approach to detecting and preventing container escape attacks, covering threat modeling, vulnerability classification, hardening layers, key open‑source tools, CI/CD integration, incident response, compliance checks, and ATT&CK matrix mapping for robust Kubernetes security.

attack detection · cis benchmark · container security

0 likes · 43 min read

Alibaba Cloud Native

Jun 7, 2026 · Cloud Native

Eliminate Complex Integration: AI Agent Skill Powers Cloud Monitoring

The article shows how Alibaba Cloud's CMS CLI and the AI‑driven alibabacloud‑cms‑manage Skill turn a multi‑step observability setup into a single natural‑language command, detailing the six‑step CLI workflow, the two‑stage confirmation safeguard, and a full K8s LangChain auto‑integration demo.

AI Agent · Automation · CLI

0 likes · 10 min read

MaGe Linux Operations

Jun 6, 2026 · Operations

Kubernetes etcd Operations Guide: From Backup & Restore to Cluster Performance Tuning

This comprehensive guide walks Kubernetes operators through the role of etcd, version compatibility, manual and automated backup strategies, disaster‑recovery procedures, performance tuning parameters, monitoring with Prometheus and Grafana, common failure troubleshooting, upgrade paths, and data‑at‑rest encryption, providing concrete commands and best‑practice recommendations for production clusters.

Encryption · Etcd · Monitoring

0 likes · 47 min read

Ops Community

Jun 5, 2026 · Cloud Native

Practical Cloud‑Native Log Aggregation with Loki, Promtail & Grafana

This guide walks SREs and DevOps engineers through the challenges of log aggregation in containerized Kubernetes environments and shows how Loki, Promtail, and Grafana together provide a low‑cost, label‑based alternative to the ELK stack, covering architecture, deployment, query language, multi‑tenant security, performance tuning, alerting, and disaster recovery.

Cloud Native · LogQL · Observability

0 likes · 36 min read

Black & White Path

Jun 5, 2026 · Information Security

Hackers Strike Tianya Within 12 Hours of Its Revival: A Data Crisis Amid Nostalgia

When the iconic Tianya community relaunched on June 1, 2026, hackers exploited its modern stack within twelve hours, dumping over 127 million user records and exposing how nostalgic platforms can suffer severe security flaws under sudden traffic spikes.

DDoS · SQL Injection · TiDB

0 likes · 6 min read

Raymond Ops

Jun 3, 2026 · Operations

10 Critical Kubernetes Production Failures I Caused and How to Recover

The article walks through ten real‑world Kubernetes production incidents—from an etcd disk‑full disaster to image‑pull failures—detailing symptoms, root‑cause analysis, step‑by‑step remediation commands, and preventive measures such as monitoring, quota alerts, and configuration best practices.

API Server · Alerting · Certificate

0 likes · 25 min read

Raymond Ops

Jun 2, 2026 · Cloud Native

200+ Essential kubectl Commands for Managing and Troubleshooting Kubernetes Clusters

This guide compiles over 200 practical kubectl commands, covering cluster setup, context switching, resource inspection, workload management, networking, storage, security hardening, high‑availability patterns, troubleshooting techniques, and performance monitoring to help operators efficiently administer Kubernetes environments.

Cloud Native · cluster management · devops

0 likes · 39 min read

dbaplus Community

Jun 1, 2026 · Operations

One Nginx Config Change Triggered a P0 Outage on Promotion Day – 5 Hard‑Earned Lessons

A single missing keepalive setting in Nginx caused a massive P0 outage during a sales promotion, and the article walks through five real incidents—covering logging, WebSocket timeouts, Docker worker counts, reload pitfalls, and SSL expiry—offering concrete configuration fixes and preventive best practices.

Docker · NGINX · WebSocket

0 likes · 12 min read

Woodpecker Software Testing

Jun 1, 2026 · Artificial Intelligence

Adversarial Testing Performance Optimization: Practical Strategies for Test Engineers

The article analyzes why adversarial testing is slow—highlighting redundant PGD steps, full model re‑execution, and serial verification—and presents a four‑stage optimization framework (intelligent termination, hierarchical reuse, parallel orchestration, feedback‑driven iteration) that dramatically speeds testing and enables CI/CD integration.

AI robustness · CI/CD · PGD

0 likes · 8 min read

Ops Community

Jun 1, 2026 · Cloud Native

Prevent a Single Pod from Crashing Your Kubernetes Cluster with Resource Quota

This article explains why missing ResourceQuota and LimitRange cause cluster-wide failures, walks through core concepts, provides step‑by‑step commands for quota inspection, creation, and validation, shares a real‑world outage case study, and offers best‑practice recommendations, advanced configurations, monitoring, and rollback procedures for Kubernetes resource management.

ClusterOperations · LimitRange · Monitoring

0 likes · 40 min read

Linux Cloud-Native Ops Stack

Jun 1, 2026 · Cloud Native

Interview Question: How Does Kubernetes Schedule Pods to Nodes? (Full Answer)

The article explains Kubernetes pod scheduling in detail, covering how the kube‑scheduler filters and scores nodes, binds the pod, and how kubelet launches containers, plus common reasons for pods staying Pending and useful troubleshooting commands.

NodeSelector · Pod Scheduling · Taint and Toleration

0 likes · 8 min read

Full-Stack DevOps & Kubernetes

Jun 1, 2026 · Cloud Native

Beyond Traditional HPA: AI‑Agent‑Driven Intelligent Autoscaling for Kubernetes Pods

The article analyzes the shortcomings of Kubernetes' native HPA and presents a comprehensive AI‑Agent architecture that predicts load, makes autonomous scaling decisions, and integrates with the K8s API to achieve proactive, adaptive, and globally coordinated pod autoscaling.

AI Agent · Cloud Native · HPA

0 likes · 16 min read

MaGe Linux Operations

May 31, 2026 · Fundamentals

Essential Network Basics for Ops: IP Addresses, Subnet Masks, and Gateways Explained

This guide walks operations engineers through core networking concepts—including IP address structure, binary‑decimal conversion, private address ranges, subnet masks, CIDR notation, gateway functions, VLAN isolation, routing tables, DNS resolution, Docker/Kubernetes networking, and firewall configuration—while providing concrete command‑line examples and step‑by‑step troubleshooting workflows.

Docker · IP addressing · Linux

0 likes · 35 min read

Ops Community

May 29, 2026 · Cloud Native

10 Common Pitfalls When Migrating Docker‑Compose to Kubernetes

This guide details the ten most frequent issues encountered when converting Docker‑Compose configurations to Kubernetes, explains why direct mappings often fail, and provides concrete examples, correct configurations, validation steps, and best‑practice recommendations to help teams avoid weeks of troubleshooting.

Containers · Docker Compose · best practices

0 likes · 47 min read

Alibaba Cloud Infrastructure

May 29, 2026 · Cloud Native

Alibaba Cloud Knative Gets a Major Upgrade to Fully Support AI Agents

Alibaba Cloud's Knative now integrates a dedicated Agent Sandbox workload type, enabling stateful AI agents to run in a serverless Kubernetes environment with per‑user isolation, automatic scaling, instant pause/resume, and warm‑pool pre‑warming for zero‑cost idle periods.

AI Agent · Agent Sandbox · Cloud Native

0 likes · 13 min read

MaGe Linux Operations

May 28, 2026 · Cloud Native

7 Quick Ways to Diagnose a Kubernetes Pod Stuck in Pending

When a Kubernetes Pod remains in the Pending state, this guide walks through seven systematic troubleshooting directions—covering node resource shortages, taints and tolerations, node selectors and affinity, PVC binding issues, image pull problems, quota limits, and priority or topology constraints—providing concrete commands, examples, and remediation steps to get the pod running.

Affinity · PVC · Pending

0 likes · 47 min read

Full-Stack DevOps & Kubernetes

May 28, 2026 · Cloud Native

How to Diagnose CrashLoopBackOff in Kubernetes: A Practical Guide

This article explains that CrashLoopBackOff is a symptom, not the root cause, and walks through a production‑grade troubleshooting workflow—including checking pod status, describing events, examining logs (current and previous), and exec‑ing into containers—while covering common failures such as OOMKilled, liveness‑probe misconfiguration, bad config files, database connection issues, image command errors, and disk‑pressure problems, and warns against premature pod deletion.

Cloud Native · CrashLoopBackOff · OOMKilled

0 likes · 10 min read

Xiaohongshu Tech REDtech

May 27, 2026 · Cloud Native

How RedProcess Evolved into DES: Optimizing Xiaohongshu’s Multimedia Task Scheduler

The article details the evolution from the first‑generation RedProcess scheduler to the Distributed Execution Scheduler (DES), explaining how architectural redesigns in storage layering, push‑based dispatch, and systematic disaster‑recovery transformed Xiaohongshu’s video‑cloud task scheduling from merely usable to highly efficient and resilient.

DES · Redis · Task scheduling

0 likes · 15 min read

Alibaba Cloud Infrastructure

May 26, 2026 · Cloud Native

How BYD and Alibaba Cloud Use Argo Workflows to Efficiently Schedule Millions of Autonomous Driving Tasks

Facing over 1 PB of daily sensor data, BYD replaced Airflow with a multi‑cluster Argo Workflows and Argo CD architecture, integrated Ray for GPU workloads, and achieved 20‑40 k concurrent workflows, an 11‑fold efficiency boost, 30% cost reduction, and near‑99% success rates.

Argo Workflows · Cloud Native · Ray

0 likes · 11 min read

TonyBai

May 26, 2026 · Artificial Intelligence

Why NVIDIA Chose Go for Its GPU Cloud Platform: Inside the AI Infrastructure Rewrite

NVIDIA quietly rewrote its AI cloud platform using Go, open‑sourcing NVCF, AICR, and AIStore, where Go accounts for over 80% of the code, enabling a three‑plane architecture, scale‑to‑zero via NATS JetStream, and a cloud‑native stack that balances performance, maintainability, and rapid iteration.

AI Infrastructure · Cloud Native · GPU

0 likes · 15 min read

ITPUB

May 25, 2026 · Operations

Why Manually Pulling Server Logs Is Inefficient: Comparing ELK, EFK, and PLG Stacks

The article compares popular log‑collection stacks—ELK/Elastic Stack, EFK with Fluent Bit, and the PLG solution (Promtail + Loki + Grafana)—detailing their components, deployment scenarios, and trade‑offs such as indexing strategy, storage options, and integration with Kubernetes for observability.

EFK · ELK · PLG

0 likes · 5 min read

Coder Trainee

May 24, 2026 · Backend Development

Load Testing and Tuning Insights for a Spring Cloud Microservice System

This article walks through the complete load‑testing and performance‑tuning workflow for a Spring Cloud microservice application, covering environment preparation, JMeter script creation, benchmark execution, bottleneck analysis, JVM, database pool, and Sentinel optimizations, and presents before‑and‑after results with a detailed checklist.

Docker · JMeter · Microservices

0 likes · 11 min read

Coder Trainee

May 23, 2026 · Cloud Native

Deploy Spring Cloud Microservices to Production on Kubernetes – Revised Edition

This article walks through migrating a Spring Cloud microservice suite from local Docker Compose to a production‑grade Kubernetes deployment, covering namespace setup, ConfigMaps, Secrets, service deployments, auto‑scaling, rolling updates, self‑healing, load balancing, Docker image builds, deployment scripts, common operational commands, and validation steps.

Docker · HPA · Ingress

0 likes · 16 min read

Ops Community

May 21, 2026 · Information Security

How to Harden Docker in Production: From Image Scanning to Runtime Protection

This guide walks DevOps engineers through a complete Docker hardening workflow—explaining the security model, recommending safe base images, removing secrets, applying multi‑stage builds, enforcing image signing, configuring runtime privileges, resource limits, network isolation, logging, and continuous audit with tools like Trivy, Cosign, Falco and CIS benchmarks.

Docker · cis benchmark · hardening

0 likes · 29 min read

Go Development Architecture Practice

May 20, 2026 · Operations

10 Essential Linux Ops Tools to Cut 80% of Overtime

This article introduces ten widely used Linux operations tools—Shell, Git, Ansible, Prometheus, Grafana, Docker, Kubernetes, Nginx, ELK Stack, and Zabbix—detailing their functions, typical scenarios, advantages, and concrete usage examples to help engineers streamline daily tasks.

Ansible · Docker · ELK

0 likes · 9 min read

MaGe Linux Operations

May 18, 2026 · Cloud Native

Does Your Application Really Need Kubernetes? Consider These 3 Critical Questions

This article guides ops engineers and development leads through three essential questions—architecture suitability, team capability, and cost‑benefit analysis—to determine whether migrating to Kubernetes adds real value or just extra complexity.

Cluster Operations · Cost-Benefit Analysis · K8s migration

0 likes · 43 min read

Cloud Native Technology Community

May 18, 2026 · Operations

How to Cut Engineering Time on Kubernetes Upgrades

Kubernetes upgrades can consume 4‑6 weeks of engineering effort per minor release, delaying product roadmaps and inflating cloud costs, while reports show teams lose dozens of workdays to incidents and over‑provisioned resources, highlighting the need for dedicated SRE ownership to reclaim time for business‑impacting work.

Operational Cost · Platform Engineering · SRE

0 likes · 8 min read

Architecture & Thinking

May 18, 2026 · Backend Development

Practical Traffic Governance: Canary Release, Circuit Breaking, and Auto Fault Recovery

This article explains how canary releases, circuit‑breaker degradation, and automatic fault‑recovery mechanisms work together to ensure high availability and stability in distributed microservice systems, providing detailed principles, configuration steps, code samples, and real‑world case studies.

Auto Fault Recovery · Canary Release · Microservices

0 likes · 18 min read

Ops Community

May 17, 2026 · Cloud Native

Istio Service Mesh Basics: What Is the Sidecar Pattern and Why Microservices Need It?

The article explains how traditional microservice architectures embed network concerns such as time‑outs, retries, circuit breaking, traffic monitoring and mTLS in application code, why this leads to code coupling, upgrade difficulty and duplicated effort, and how Istio’s sidecar‑based service mesh cleanly separates those concerns while providing traffic management, observability and security features.

Envoy · Istio · Observability

0 likes · 30 min read

AI Engineering

May 17, 2026 · Information Security

LiteLLM Agent Platform: K8s Sandbox Stops Agents Accessing Real API Keys

The open‑source LiteLLM Agent Platform isolates each coding agent in a fresh Kubernetes pod and swaps stub tokens for real credentials only on outbound TLS requests, preventing any agent from ever seeing or leaking actual API keys.

API Security · LLM Agents · LiteLLM

0 likes · 4 min read

MaGe Linux Operations

May 16, 2026 · Cloud Native

Why Pods Are the Most Powerful Unit in Kubernetes – A Deep Dive

This article provides a comprehensive, step‑by‑step analysis of Kubernetes Pods, covering their design as a shared‑namespace container group, the role of the pause (infra) container, creation flow, lifecycle phases, resource requests and limits, QoS classes, scheduling mechanics, volume types, and detailed troubleshooting techniques with concrete command‑line examples.

Namespace · Resource Management · Scheduling

0 likes · 30 min read

AI Agent Super App

May 16, 2026 · Operations

14 Open‑Source Monitoring Tools Compared – Stop Guessing the Right One

This article systematically reviews 14 open‑source server‑monitoring solutions, explains the three monitoring layers, dives deep into Prometheus + Alertmanager and Zabbix, compares architectures, performance, and costs, and provides a practical decision‑making guide with real‑world scenarios and pitfalls.

Alerting · Monitoring · Zabbix

0 likes · 31 min read

MaGe Linux Operations

May 14, 2026 · Operations

Ops Veteran's Secret: Master These 10 Tools to Cut Overtime by 80%

The article lists ten essential Linux operations tools—Shell scripting, Git, Ansible, Prometheus, Grafana, Docker, Kubernetes, Nginx, ELK Stack, and Zabbix—detailing their functions, typical scenarios, advantages, and concrete usage examples, helping engineers streamline daily tasks and reduce overtime.

Ansible · Docker · ELK Stack

0 likes · 9 min read

Ops Community

May 13, 2026 · Operations

Kubernetes Node Failures: One‑Stop Guide to Diagnose and Fix Common Issues

This comprehensive guide walks Kubernetes operators through a step‑by‑step process for diagnosing node health problems—such as NotReady, MemoryPressure, DiskPressure, PIDPressure, and NetworkUnavailable—by examining node conditions, reviewing events, checking system resources, inspecting component logs, applying targeted fixes, and verifying recovery, all illustrated with real‑world commands and examples.

CNI · DiskPressure · MemoryPressure

0 likes · 44 min read

Huawei Cloud Developer Alliance

May 13, 2026 · Cloud Native

Why HPA Falls Short for LLMs and How Kthena Autoscaler Redefines Elastic Scaling

The article explains why traditional Kubernetes HPA cannot meet the unique demands of large‑language‑model inference, introduces Kthena Autoscaler’s model‑aware architecture, its dual stable/panic scaling modes, cost‑aware algorithms, flexible policy bindings, and provides practical configuration and observability guidance.

Kthena Autoscaler · LLM Inference · autoscaling

0 likes · 10 min read

Coder Trainee

May 13, 2026 · Cloud Native

Spring Cloud Microservices Revised Edition – Intro and New Tech Stack

After finishing the Spring Boot source‑code series, the author launches a refreshed Spring Cloud microservices tutorial built on Spring Boot 3.x, Jakarta EE, GraalVM native images, full production‑grade demos, Kubernetes deployment, observability and performance testing, outlining a 12‑episode roadmap.

GraalVM · Microservices · Observability

0 likes · 7 min read

Architect's Guide

May 10, 2026 · Backend Development

Why We Dropped Nacos for Apollo: A Hands‑On Guide to Apollo Configuration Center

This article walks through the reasons for abandoning Nacos in favor of Apollo and provides a step‑by‑step tutorial that covers Apollo’s core concepts, architecture, client integration with Spring Boot, dynamic updates, environment/cluster/namespace handling, and deployment on Kubernetes.

Apollo · Docker · Spring Boot

0 likes · 26 min read

Weekly Large Model Application

May 6, 2026 · Cloud Native

How OpenAI Scales Low-Latency Voice AI with WebRTC: Architecture Deep Dive

The article dissects OpenAI's engineering approach to delivering low‑latency voice AI at scale, explaining why WebRTC was chosen, how a Relay + Transceiver split solves Kubernetes integration challenges, the use of ICE ufrag for deterministic routing, and how global relay and implementation choices reduce perceived latency.

OpenAI · Relay · Transceiver

0 likes · 9 min read

DevOps Operations Practice

May 3, 2026 · Cloud Native

Kubernetes Dashboard Is Deprecated—Officially Recommended Replacement Headlamp

The article explains why the Kubernetes Dashboard has been deprecated due to security and multi‑cluster limitations, and introduces Headlamp as the officially endorsed, lightweight web UI that offers multi‑cluster management, strict RBAC enforcement, and extensible plugins, with simple installation steps.

Headlamp · RBAC · Web UI

0 likes · 3 min read

MaGe Linux Operations

May 3, 2026 · Cloud Native

How to Troubleshoot Kubernetes NotReady Nodes: A Complete Step‑by‑Step Guide

This article walks Kubernetes operators through a systematic investigation of NotReady node symptoms, explaining the kubelet status mechanism, detailing each diagnostic step—from verifying node conditions with kubectl to checking kubelet, container runtime, resources, network, and certificates—and providing concrete remediation and preventive measures.

Etcd · Monitoring · NotReady

0 likes · 35 min read

Coder Trainee

May 2, 2026 · Cloud Native

Spring Cloud Microservices Series #10: Key Takeaways and Best Practices

This article reviews the entire Spring Cloud microservices series, presents a full technology stack diagram, outlines production‑grade best practices for service decomposition, configuration, remote calls, rate limiting, databases, logging and monitoring, lists common pitfalls, offers performance‑tuning tips, discusses the pros and cons of microservices, and points to future directions such as service mesh, serverless and cloud‑native adoption.

Microservices · Monitoring · Service Mesh

0 likes · 14 min read

Coder Trainee

May 1, 2026 · Cloud Native

Containerizing Spring Cloud Microservices with Docker and Kubernetes (Part 9)

This article explains why traditional deployment is problematic, then walks through building Docker images, composing services with Docker‑Compose, deploying to a Kubernetes cluster, setting up CI/CD pipelines, and addressing common pitfalls such as slow starts and service discovery failures.

CI/CD · Docker · Docker Compose

0 likes · 12 min read

MaGe Linux Operations

Apr 30, 2026 · Cloud Native

Kubernetes Service Connectivity Issues? A Step‑by‑Step Guide from Pods to Services to Ingress

This article provides a systematic, layer‑by‑layer troubleshooting guide for Kubernetes service connectivity problems, covering pod health, service and endpoint configuration, kube‑proxy rules, CNI plugins, Ingress controllers, DNS resolution, and NetworkPolicy, with concrete commands, examples, and preventive scripts.

Ingress · Network · Service

0 likes · 39 min read

Data STUDIO

Apr 28, 2026 · Backend Development

FastAPI in Production: Auth, Rate Limiting, and Zero‑Downtime with One Codebase

This article walks through a complete production‑ready FastAPI setup, covering secure OIDC/JWKS authentication, Redis‑backed token‑bucket rate limiting, zero‑downtime rolling deployments on Docker/Kubernetes, and observability best practices such as request‑ID middleware and structured JSON logging.

Docker · FastAPI · Observability

0 likes · 20 min read

dbaplus Community

Apr 27, 2026 · Cloud Native

When MTU Misconfiguration Turns Into a Two‑Day Network Mystery

A two‑day investigation of intermittent packet loss in a hybrid‑cloud Kubernetes environment revealed that an oversized VXLAN MTU caused fragmentation, prompting a step‑by‑step analysis of MTU fundamentals, diagnostic commands, Cilium configuration changes, and best‑practice recommendations for cloud‑native networks.

Cilium · MTU · Overlay Networks

0 likes · 30 min read
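
The arithmetic behind an oversized overlay MTU can be sketched in a few lines. This is an illustration only, assuming VXLAN over an IPv4 underlay with no extra VLAN tag or encryption; the function names are hypothetical and nothing here is taken from the article's actual Cilium configuration.

```python
# Bytes that VXLAN encapsulation adds to each inner IP packet, assuming an
# IPv4 underlay: outer IPv4 (20) + UDP (8) + VXLAN (8) + inner Ethernet (14).
VXLAN_IPV4_OVERHEAD = 20 + 8 + 8 + 14


def max_overlay_mtu(underlay_mtu, overhead=VXLAN_IPV4_OVERHEAD):
    """Largest pod-side MTU that still fits in one underlay packet."""
    if underlay_mtu <= overhead:
        raise ValueError("underlay MTU is smaller than the encapsulation overhead")
    return underlay_mtu - overhead


def would_fragment(overlay_mtu, underlay_mtu, overhead=VXLAN_IPV4_OVERHEAD):
    """True if a full-size overlay packet no longer fits after encapsulation."""
    return overlay_mtu + overhead > underlay_mtu
```

Under these assumptions, a 1500-byte underlay leaves 1450 bytes for the overlay, so an overlay interface left at 1500 would exceed the underlay once encapsulated.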

DevOps Coach

Apr 27, 2026 · Operations

How a 2 AM Kubernetes Change Cost $47,000: My Nightmare Incident and 7 Lessons

A mis‑timed production resource change triggered a cascading Kubernetes failure that cost $47,000, and the author details the incident timeline, mistakes made, and seven concrete operational safeguards introduced to prevent similar outages.

circuit breaking · incident response · kubernetes

0 likes · 12 min read

ITPUB

Apr 27, 2026 · Cloud Native

Why Skipping Backups Makes Kubernetes Operations Impossible

The article explains that running production Kubernetes clusters without regular backup and recovery plans exposes businesses to severe risks such as cluster failures, data loss, and prolonged downtime, and it details practical etcd physical and Velero logical backup strategies to mitigate these threats.

Cloud Native · Etcd · Velero

0 likes · 9 min read

DevOps Coach

Apr 26, 2026 · Cloud Native

Accelerating Kubernetes Automation: Mastering GitOps Best Practices

This guide explains GitOps fundamentals—declarative, versioned, automated deployments—and shows how tools like Argo CD, Flux, Helm, Kustomize, Tekton, and Sealed Secrets can speed up Kubernetes delivery, improve reliability, enhance security, and foster better collaboration across DevOps teams.

Argo CD · CI/CD · Cloud Native

0 likes · 16 min read

Ray's Galactic Tech

Apr 26, 2026 · Cloud Native

Kubernetes Networking Unpacked: How a Service Timeout Reveals iptables‑CNI Collaboration

A real‑world Service timeout in a high‑traffic e‑commerce cluster exposed a saturated conntrack table, prompting a step‑by‑step dissection of Pods, Services, iptables, conntrack, CNI plugins, DNS and NetworkPolicy, and culminating in concrete production‑grade remediation tactics.

CNI · Service · conntrack

0 likes · 28 min read

AI Explorer

Apr 26, 2026 · Artificial Intelligence

Take Control of AI: Choose Any Model and Keep Your Data Private

Thunderbolt, an open‑source AI client from Mozilla’s Thunderbird team, lets developers pick any OpenAI‑compatible model, run it on‑premises via Docker or Kubernetes, and keep all conversation data on their own servers, eliminating vendor lock‑in and enhancing privacy.

AI client · Docker · data privacy

0 likes · 6 min read

DevOps Coach

Apr 24, 2026 · Cloud Native

After Years Using Kubernetes, I Finally Grasped CRDs – Build One from Scratch

The article reveals why most Kubernetes engineers use Custom Resource Definitions without truly understanding them, explains how CRDs act as the language that extends the Kubernetes API, and provides a step‑by‑step walkthrough to create a production‑ready DatabaseCluster CRD, interact with it via kubectl and the Python client, and avoid common pitfalls.

API extension · CRD · CustomResourceDefinition

0 likes · 17 min read

Cloud Native Technology Community

Apr 24, 2026 · Cloud Native

Kubernetes v1.36 “Haru”: Why Some Changes Aren’t Worth the Wait

Kubernetes v1.36 focuses on clearing technical debt rather than adding flashy features, retiring ingress‑nginx, tightening kubelet API auth, optimizing SELinux mounts, externalizing ServiceAccount token signing, expanding DRA for GPU scheduling, graduating MutatingAdmissionPolicy, and removing long‑standing legacy components, all accompanied by a concrete upgrade checklist.

DRA · MutatingAdmissionPolicy · gitRepo

0 likes · 15 min read

Ray's Galactic Tech

Apr 23, 2026 · Backend Development

Stop Treating LLMs as 'All‑Purpose Tools': Practical Spring AI Multi‑Agent Architecture for Production

This article analyses why a single‑agent LLM approach quickly hits scalability, context, and governance limits, and presents a production‑ready Spring AI Multi‑Agent design—including layered architecture, agent metadata, skill engineering, routing strategies, orchestration, resilience, A2A service discovery, Kubernetes deployment, observability, security, and cost‑control—backed by concrete Java code examples.

A2A · Java · Resilience4j

0 likes · 38 min read

Linux Cloud-Native Ops Stack

Apr 23, 2026 · Cloud Native

Kubernetes Interview: What Exactly Happens When You Delete a Pod?

When a pod is deleted in Kubernetes, the API server timestamps the request, services stop routing traffic, kubelet runs any preStop hook, sends SIGTERM followed by a configurable grace period, then SIGKILL if needed, cleans up resources, and finally removes the pod metadata, with controllers optionally recreating a replacement pod.

Graceful Termination · Pod Deletion · deployment

0 likes · 7 min read
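
From the application's side, the SIGTERM step in the sequence above is the cue to stop taking new work before the grace period runs out. The following is a minimal sketch of that cooperation, not code from the article; `stop`, `on_sigterm`, `install_handler`, and `serve` are hypothetical names.

```python
import signal
import threading

# Set once SIGTERM has been received.
stop = threading.Event()


def on_sigterm(signum, frame):
    # Only record the request; the serving loop decides when to exit.
    stop.set()


def install_handler():
    # Must be called from the main thread of the process.
    signal.signal(signal.SIGTERM, on_sigterm)


def serve(handle_one, poll_seconds=0.1):
    """Call handle_one() until SIGTERM is seen; return how many calls ran."""
    handled = 0
    while not stop.is_set():
        handle_one()
        handled += 1
        # Sleep between iterations, waking early if the event is set.
        stop.wait(poll_seconds)
    return handled
```

A process that calls `install_handler()` at startup and returns from `serve` promptly exits on its own, so the follow-up SIGKILL described in the summary should not be needed.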

DevOps Coach

Apr 22, 2026 · Operations

2026 AI DevOps Outlook: 10 Must‑Watch MCP Servers Transforming SRE

The article surveys the rapidly growing Model Context Protocol (MCP) ecosystem in 2026, detailing ten AI‑enabled DevOps servers, their core capabilities, real‑world impact on SRE workflows, and a practical framework for selecting the most valuable servers for a given team.

AI DevOps · MCP · Observability

0 likes · 16 min read

Ray's Galactic Tech

Apr 22, 2026 · Cloud Native

Solving K8s Stateful App Storage Pain: Production-Ready Longhorn + MySQL StatefulSet

This article dissects the challenges of running MySQL as a stateful workload on Kubernetes, explains why storage, consistency, and fail‑over are the real pain points, and provides a production‑grade solution that combines Longhorn distributed block storage with a carefully engineered MySQL 8.0 StatefulSet, complete with YAML manifests, performance tuning, backup strategies, and disaster‑recovery playbooks.

Longhorn · Production · kubernetes

0 likes · 50 min read

Raymond Ops

Apr 22, 2026 · Operations

How Prometheus Recording Rules Can Reduce Alert Noise by 70%

This guide explains how to use Prometheus Recording Rules to pre‑compute, aggregate, and smooth metrics in large‑scale microservice environments, cutting daily alert noise by up to 70% through hierarchical alert design, practical examples, and best‑practice recommendations.

Alert Noise Reduction · Monitoring · Observability

0 likes · 22 min read

Full-Stack DevOps & Kubernetes

Apr 22, 2026 · Operations

Avoid 90% of Kubernetes Ops Pitfalls: A Definitive Guide

This guide outlines the five most common Kubernetes operational pitfalls, offers step‑by‑step remediation practices, introduces three emerging trends such as AI‑assisted troubleshooting, serverless clusters, and Tekton CI/CD, and provides three ready‑to‑copy kubectl commands to streamline daily management.

AIOps · Operations · Serverless

0 likes · 9 min read
