Tag

Cluster Operations


360 Zhihui Cloud Developer
May 9, 2025 · Big Data

Mastering Multi‑AZ Replication in HDFS with AZ Mover

This article introduces AZ Mover, a lightweight HDFS client‑side tool that intelligently scans, schedules, and migrates block replicas across multiple availability zones, detailing its design goals, core workflow, command‑line options, concurrency controls, and future enhancements for robust big‑data disaster recovery.

AZ Mover · Big Data · Cluster Operations
9 min read
政采云技术
Jan 10, 2024 · Operations

Understanding and Improving Elasticsearch Shard Balancing Strategies

This article analyzes Elasticsearch shard imbalance incidents, explains the built‑in shard balancing algorithm and its configuration parameters, demonstrates weight calculations with source code, and proposes practical improvements—including shard count adjustments and a custom load‑aware balancing tool—to achieve more effective cluster load distribution.

Cluster Operations · Elasticsearch · Load Balancing
17 min read
Efficient Ops
Nov 23, 2022 · Operations

How to Diagnose and Fix Node2 Ceph‑Related cgroup Leaks in a Kubernetes Cluster

This article walks through a real‑world Kubernetes incident where a node ran out of space due to Ceph storage inconsistencies and cgroup leaks, detailing step‑by‑step diagnostics, Ceph repair commands, pod eviction, node reboot, and post‑mortem recommendations for cluster operations.

Ceph · Cluster Operations · Kubernetes
6 min read
政采云技术
Jul 14, 2022 · Operations

Diagnosing and Optimizing Elasticsearch IO Bottlenecks for Billion-Scale Product Catalogs

As product data grew from tens of millions to billions, the Elasticsearch clusters ran into severe IO-wait and read bottlenecks. This article analyzes the root causes and presents a comprehensive solution involving index parameter tuning, merge settings, asynchronous translog writes, query optimizations, and hardware upgrades to restore performance and stability.

Cluster Operations · Elasticsearch · Index Tuning
14 min read
Efficient Ops
Mar 23, 2022 · Operations

Master Elasticsearch Node Commands: Inspect, Monitor, and Troubleshoot Your Cluster

This article walks through essential Elasticsearch node‑level APIs—covering how to retrieve basic node info, detailed statistics, thread‑pool usage, and hot‑thread diagnostics—complete with request examples, response samples, and practical tips for diagnosing common cluster issues.

API · Cluster Operations · Elasticsearch
12 min read
Code Ape Tech Column
Jan 19, 2021 · Operations

Scaling Kafka Clusters to Support Millions of Partitions: Challenges and Solutions

This article examines the technical challenges of scaling Kafka clusters to handle millions of partitions, including ZooKeeper node explosion, replication overhead, controller recovery latency, and broker restart delays, and proposes solutions such as parallel ZooKeeper fetching, metadata synchronization via internal topics, logical cluster composition, and physical cluster splitting.

Cluster Operations · Kafka · Partition Scaling
13 min read
Big Data Technology Architecture
Jul 9, 2019 · Operations

Elasticsearch Node Shutdown Process and Risks During Rolling Upgrade

During a rolling upgrade of an Elasticsearch cluster, stopping nodes—especially the master—can block write requests, cause client connection failures, trigger master re‑election, and lead to temporary data duplication, making it essential to understand the shutdown sequence and its impact on read/write operations.

Cluster Operations · Elasticsearch · Node Shutdown
5 min read
JD Tech Talk
Aug 9, 2018 · Operations

Ensuring Stability and Scalability in Large‑Scale Kubernetes Clusters: Three Key Questions and Operational Practices

The article explains why operating massive Kubernetes clusters is as challenging as building large systems, outlines three critical stability questions, shares real‑world data collection, visualization, and tooling practices, and provides concrete recommendations for high‑availability, monitoring, and performance optimization.

Cluster Operations · Kubernetes · Observability
12 min read
JD Retail Technology
Jul 24, 2018 · Operations

Stability and Operational Practices for Large‑Scale Kubernetes Clusters

This article shares practical experience and best‑practice guidelines for operating large‑scale Kubernetes clusters, covering stability checks, component failure impact, recovery strategies, alerting mechanisms, data collection, visualization, and the suite of operational tools that help ensure reliable, high‑performance cloud‑native infrastructure.

Cluster Operations · High Availability · Kubernetes
10 min read