Tagged articles

cluster reliability

2 articles · Page 1 of 1
AI Agent Super App
May 7, 2026 · Operations

Linux Time Drift Can Crash Clusters – A Rescue Guide to Save Your Ops

A 47-second clock skew once broke MySQL replication, Redis clustering, and Kubernetes scheduling. That incident prompted a three-year deep dive into Linux time services, from hardware clocks to chrony configuration. This guide covers practical commands, common pitfalls, monitoring, and a checklist for keeping production systems in sync.

Linux · NTP · Operations
12 min read
Alibaba Cloud Developer
Dec 26, 2024 · Cloud Native

How a New Telemetry Service Overwhelmed OpenAI’s Kubernetes API Server

This post-mortem examines how OpenAI’s newly deployed telemetry service flooded the Kubernetes API server with requests, overloading it, breaking DNS resolution, and slowing recovery. It then contrasts OpenAI’s approach with the design of LoongCollector/iLogtail, which aims to minimize API load and improve cluster stability.

API Server · Cloud Native · Observability
15 min read