Tagged articles

Cluster Operations

21 articles
Raymond Ops
Jun 9, 2026 · Cloud Native

Kubernetes Outage? Essential Troubleshooting Guide for Production Clusters

A step-by-step guide to the most common Kubernetes failure scenarios, from pod crashes and image pull errors to NotReady nodes and API server timeouts. It provides concrete kubectl commands, diagnostic scripts, real-world case studies, best-practice recommendations, monitoring metrics, and backup-and-restore procedures for keeping production clusters healthy.

Cluster Operations · Etcd · Monitoring
37 min read
Qunar Tech Salon
Jun 9, 2026 · Operations

Mastering Elasticsearch Shard Management: From Fundamentals to 100k‑Shard Scale

This article explains Elasticsearch shard fundamentals, primary and replica roles, allocation rules, recovery and rebalance mechanisms, tuning parameters, best‑practice sizing, and presents real‑world production cases—including a 100,000‑shard cluster—along with concrete API commands for effective shard operations.

Cluster Operations · Elasticsearch · Large Scale
28 min read
Alibaba Cloud Big Data AI Platform
Aug 7, 2025 · Operations

How Alibaba Scales Flink to Millions of Cores: Real‑Time Ops Secrets

This article details Alibaba's decade‑long evolution of its real‑time computing platform, the massive operational challenges of managing Flink clusters at million‑core scale, and the comprehensive strategies—including SLA metrics, self‑healing services, cloud‑native redesign, and job‑level advisory tools—used to ensure stability, cost efficiency, and performance during peak events like Double‑11.

Apache Flink · Cloud Native · Cluster Operations
19 min read
360 Zhihui Cloud Developer
May 9, 2025 · Big Data

Mastering Multi‑AZ Replication in HDFS with AZ Mover

This article introduces AZ Mover, a lightweight HDFS client‑side tool that intelligently scans, schedules, and migrates block replicas across multiple availability zones, detailing its design goals, core workflow, command‑line options, concurrency controls, and future enhancements for robust big‑data disaster recovery.

AZ Mover · Cluster Operations · Data Governance
9 min read
Full-Stack DevOps & Kubernetes
Dec 6, 2024 · Operations

Boost Kubernetes API Server Performance: Tuning max-mutating-requests-inflight & watch-cache-size

This guide explains how to optimize Kubernetes API Server performance by configuring the max-mutating-requests-inflight limit and watch-cache-size, offering recommended values for different cluster sizes, monitoring metrics, and step‑by‑step adjustment strategies for stable, high‑throughput clusters.

API Server · Cluster Operations · Performance Tuning
7 min read
dbaplus Community
Mar 11, 2024 · Cloud Native

Why Kubernetes’ One‑Year Loopback Certificate Breaks and How the Community Is Tackling It

The article explains that the kube‑apiserver’s built‑in LoopbackClient certificate expires after one year, causing API server failures in long‑running clusters, and examines the community’s support policies, recent Alibaba Cloud changes, and ongoing discussions about introducing a true Kubernetes LTS.

Certificate · Cluster Operations · LTS
13 min read
政采云技术
Jan 10, 2024 · Operations

Understanding and Improving Elasticsearch Shard Balancing Strategies

This article analyzes Elasticsearch shard imbalance incidents, explains the built‑in shard balancing algorithm and its configuration parameters, demonstrates weight calculations with source code, and proposes practical improvements—including shard count adjustments and a custom load‑aware balancing tool—to achieve more effective cluster load distribution.

Cluster Operations · Elasticsearch · Performance Optimization
17 min read
dbaplus Community
Sep 14, 2023 · Cloud Native

Mastering Kubernetes: 30+ Essential Pod, Node, and Cluster Troubleshooting Techniques

This guide compiles over thirty practical Kubernetes troubleshooting steps, covering pod startup failures, networking issues, resource bottlenecks, node abnormalities, cluster‑wide service problems, and detailed explanations of common container exit codes to help operators quickly diagnose and resolve issues.

Cluster Operations · Container exit codes · Node diagnostics
22 min read
ITPUB
Aug 9, 2023 · Operations

Why Is My Elasticsearch Cluster Using 15 GB Heap? A Deep Dive into Memory Bottlenecks

The article examines a 7‑node Elasticsearch cluster with 500 million documents, uncovering excessive heap usage, high OS memory pressure, numerous deleted documents, large translog, low query‑cache hit rate, and an over‑sharded design, then offers concrete tuning and redesign recommendations to restore performance.

Cluster Operations · Elasticsearch · Memory optimization
16 min read
Efficient Ops
Nov 23, 2022 · Operations

How to Diagnose and Fix Node2 Ceph‑Related cgroup Leaks in a Kubernetes Cluster

This article walks through a real‑world Kubernetes incident where a node ran out of space due to Ceph storage inconsistencies and cgroup leaks, detailing step‑by‑step diagnostics, Ceph repair commands, pod eviction, node reboot, and post‑mortem recommendations for cluster operations.

Ceph · Cluster Operations · Node troubleshooting
6 min read
政采云技术
Jul 14, 2022 · Operations

Diagnosing and Optimizing Elasticsearch IO Bottlenecks for Billion-Scale Product Catalogs

Facing severe IO-wait and read bottlenecks as product data grew from tens of millions to billions, this article analyzes root causes in Elasticsearch clusters and presents a comprehensive solution involving index parameter tuning, merge settings, translog async writes, query optimizations, and hardware upgrades to restore performance and stability.

Cluster Operations · Elasticsearch · IO optimization
14 min read
Ops Development Stories
Oct 9, 2021 · Cloud Native

Why Do Some Kubernetes Pods Stay Stuck in Terminating? Causes and Fixes

This article explains the Kubernetes pod lifecycle, the meaning of the Terminating state, detailed pod creation and deletion processes, and the eviction mechanisms of both kube‑controller‑manager and kubelet, offering troubleshooting steps and best practices to resolve pods that remain stuck in Terminating.

Cloud Native · Cluster Operations · Pod Lifecycle
13 min read
Code Ape Tech Column
Jan 19, 2021 · Operations

Scaling Kafka Clusters to Support Millions of Partitions: Challenges and Solutions

This article examines the technical challenges of scaling Kafka clusters to millions of partitions, including ZooKeeper node explosion, replication overhead, controller recovery latency, and broker restart delays. It proposes solutions such as parallel ZooKeeper fetching, metadata synchronization via internal topics, logical cluster composition, and physical cluster splitting.

Cluster Operations · distributed systems · kafka
13 min read
Tencent Cloud Middleware
Apr 9, 2020 · Operations

Scaling Kafka to Support Millions of Partitions Without Downtime

This article explains the metadata, controller, and ZooKeeper challenges of supporting more than a million Kafka partitions. It presents practical solutions such as parallel ZooKeeper fetching, a metadata-via-topic redesign, logical cluster assembly, and physical cluster splitting to achieve stable, large-scale Kafka deployments.

Cluster Operations · Zookeeper · controller optimization
15 min read
Big Data Technology Architecture
Jul 9, 2019 · Operations

Elasticsearch Node Shutdown Process and Risks During Rolling Upgrade

During a rolling upgrade of an Elasticsearch cluster, stopping nodes—especially the master—can block write requests, cause client connection failures, trigger master re‑election, and lead to temporary data duplication, making it essential to understand the shutdown sequence and its impact on read/write operations.

Cluster Operations · Elasticsearch · Node Shutdown
5 min read
JD Tech Talk
Aug 9, 2018 · Operations

Ensuring Stability and Scalability in Large‑Scale Kubernetes Clusters: Three Key Questions and Operational Practices

The article explains why operating massive Kubernetes clusters is as challenging as building large systems, outlines three critical stability questions, shares real‑world data collection, visualization, and tooling practices, and provides concrete recommendations for high‑availability, monitoring, and performance optimization.

Automation · Cluster Operations · Observability
12 min read
JD Retail Technology
Jul 24, 2018 · Operations

Stability and Operational Practices for Large‑Scale Kubernetes Clusters

This article shares practical experience and best‑practice guidelines for operating large‑scale Kubernetes clusters, covering stability checks, component failure impact, recovery strategies, alerting mechanisms, data collection, visualization, and the suite of operational tools that help ensure reliable, high‑performance cloud‑native infrastructure.

Cluster Operations · High Availability · Observability
10 min read