Tagged articles
19 articles
Alibaba Cloud Big Data AI Platform
Aug 7, 2025 · Operations

How Alibaba Scales Flink to Millions of Cores: Real‑Time Ops Secrets

This article details Alibaba's decade‑long evolution of its real‑time computing platform, the massive operational challenges of managing Flink clusters at million‑core scale, and the comprehensive strategies—including SLA metrics, self‑healing services, cloud‑native redesign, and job‑level advisory tools—used to ensure stability, cost efficiency, and performance during peak events like Double‑11.

Apache Flink · Cloud Native · Job Advisory
0 likes · 19 min read
360 Zhihui Cloud Developer
May 9, 2025 · Big Data

Mastering Multi‑AZ Replication in HDFS with AZ Mover

This article introduces AZ Mover, a lightweight HDFS client‑side tool that intelligently scans, schedules, and migrates block replicas across multiple availability zones, detailing its design goals, core workflow, command‑line options, concurrency controls, and future enhancements for robust big‑data disaster recovery.

AZ Mover · Data Governance · HDFS
0 likes · 9 min read
Full-Stack DevOps & Kubernetes
Dec 6, 2024 · Operations

Boost Kubernetes API Server Performance: Tuning max-mutating-requests-inflight & watch-cache-size

This guide explains how to optimize Kubernetes API Server performance by configuring the max-mutating-requests-inflight limit and watch-cache-size, offering recommended values for different cluster sizes, monitoring metrics, and step‑by‑step adjustment strategies for stable, high‑throughput clusters.

API Server · Kubernetes · cluster operations
0 likes · 7 min read
政采云技术
Jan 10, 2024 · Operations

Understanding and Improving Elasticsearch Shard Balancing Strategies

This article analyzes Elasticsearch shard imbalance incidents, explains the built‑in shard balancing algorithm and its configuration parameters, demonstrates weight calculations with source code, and proposes practical improvements—including shard count adjustments and a custom load‑aware balancing tool—to achieve more effective cluster load distribution.

Elasticsearch · Performance Optimization · cluster operations
0 likes · 17 min read
dbaplus Community
Sep 14, 2023 · Cloud Native

Mastering Kubernetes: 30+ Essential Pod, Node, and Cluster Troubleshooting Techniques

This guide compiles over thirty practical Kubernetes troubleshooting steps, covering pod startup failures, networking issues, resource bottlenecks, node abnormalities, cluster‑wide service problems, and detailed explanations of common container exit codes to help operators quickly diagnose and resolve issues.

Container exit codes · Kubernetes · Node diagnostics
0 likes · 22 min read
ITPUB
Aug 9, 2023 · Operations

Why Is My Elasticsearch Cluster Using 15 GB Heap? A Deep Dive into Memory Bottlenecks

The article examines a 7‑node Elasticsearch cluster holding 500 million documents and uncovers excessive heap usage, high OS memory pressure, numerous deleted documents, a large translog, a low query‑cache hit rate, and an over‑sharded design, then offers concrete tuning and redesign recommendations to restore performance.

Elasticsearch · Memory Optimization · cluster operations
0 likes · 16 min read
政采云技术
Jul 14, 2022 · Operations

Diagnosing and Optimizing Elasticsearch IO Bottlenecks for Billion-Scale Product Catalogs

Facing severe IO-wait and read bottlenecks as product data grew from tens of millions to billions, this article analyzes root causes in Elasticsearch clusters and presents a comprehensive solution involving index parameter tuning, merge settings, translog async writes, query optimizations, and hardware upgrades to restore performance and stability.

Elasticsearch · IO optimization · Index Tuning
0 likes · 14 min read
Ops Development Stories
Oct 9, 2021 · Cloud Native

Why Do Some Kubernetes Pods Stay Stuck in Terminating? Causes and Fixes

This article explains the Kubernetes pod lifecycle, the meaning of the Terminating state, detailed pod creation and deletion processes, and the eviction mechanisms of both kube‑controller‑manager and kubelet, offering troubleshooting steps and best practices to resolve pods that remain stuck in Terminating.

Cloud Native · Kubernetes · Pod Lifecycle
0 likes · 13 min read
Code Ape Tech Column
Jan 19, 2021 · Operations

Scaling Kafka Clusters to Support Millions of Partitions: Challenges and Solutions

This article examines the technical challenges of scaling Kafka clusters to handle millions of partitions (ZooKeeper node explosion, replication overhead, controller recovery latency, and broker restart delays) and proposes solutions such as parallel ZK fetching, metadata synchronization via internal topics, logical cluster composition, and physical cluster splitting.

Distributed Systems · Kafka · cluster operations
0 likes · 13 min read
Tencent Cloud Middleware
Apr 9, 2020 · Operations

Scaling Kafka to Support Millions of Partitions Without Downtime

This article explains the metadata, controller, and ZooKeeper challenges of supporting more than a million Kafka partitions and presents practical solutions such as parallel ZK fetching, metadata‑via‑topic redesign, logical cluster assembly, and physical cluster splitting to achieve large‑scale, stable Kafka deployments.

Kafka · ZooKeeper · cluster operations
0 likes · 15 min read
Big Data Technology Architecture
Jul 9, 2019 · Operations

Elasticsearch Node Shutdown Process and Risks During Rolling Upgrade

During a rolling upgrade of an Elasticsearch cluster, stopping nodes—especially the master—can block write requests, cause client connection failures, trigger master re‑election, and lead to temporary data duplication, making it essential to understand the shutdown sequence and its impact on read/write operations.

Elasticsearch · Node Shutdown · Rolling Upgrade
0 likes · 5 min read
JD Tech Talk
Aug 9, 2018 · Operations

Ensuring Stability and Scalability in Large‑Scale Kubernetes Clusters: Three Key Questions and Operational Practices

The article explains why operating massive Kubernetes clusters is as challenging as building large systems, outlines three critical stability questions, shares real‑world data collection, visualization, and tooling practices, and provides concrete recommendations for high‑availability, monitoring, and performance optimization.

Automation · Kubernetes · Observability
0 likes · 12 min read
JD Retail Technology
Jul 24, 2018 · Operations

Stability and Operational Practices for Large‑Scale Kubernetes Clusters

This article shares practical experience and best‑practice guidelines for operating large‑scale Kubernetes clusters, covering stability checks, component failure impact, recovery strategies, alerting mechanisms, data collection, visualization, and the suite of operational tools that help ensure reliable, high‑performance cloud‑native infrastructure.

Kubernetes · Observability · cluster operations
0 likes · 10 min read