Tagged articles

19 articles

May 18, 2026 · Cloud Native

Does Your Application Really Need Kubernetes? Consider These 3 Critical Questions

This article guides ops engineers and development leads through three essential questions—architecture suitability, team capability, and cost‑benefit analysis—to determine whether migrating to Kubernetes adds real value or just extra complexity.

K8s migration · Kubernetes · Microservices

0 likes · 43 min read

Alibaba Cloud Big Data AI Platform

Aug 7, 2025 · Operations

How Alibaba Scales Flink to Millions of Cores: Real‑Time Ops Secrets

This article details Alibaba's decade‑long evolution of its real‑time computing platform, the massive operational challenges of managing Flink clusters at million‑core scale, and the comprehensive strategies—including SLA metrics, self‑healing services, cloud‑native redesign, and job‑level advisory tools—used to ensure stability, cost efficiency, and performance during peak events like Double‑11.

Apache Flink · Cloud Native · Job Advisory

0 likes · 19 min read

360 Zhihui Cloud Developer

May 9, 2025 · Big Data

Mastering Multi‑AZ Replication in HDFS with AZ Mover

This article introduces AZ Mover, a lightweight HDFS client‑side tool that intelligently scans, schedules, and migrates block replicas across multiple availability zones, detailing its design goals, core workflow, command‑line options, concurrency controls, and future enhancements for robust big‑data disaster recovery.

AZ Mover · Data Governance · HDFS

0 likes · 9 min read

Full-Stack DevOps & Kubernetes

Dec 6, 2024 · Operations

Boost Kubernetes API Server Performance: Tuning max-mutating-requests-inflight & watch-cache-size

This guide explains how to optimize Kubernetes API Server performance by configuring the max-mutating-requests-inflight limit and watch-cache-size, offering recommended values for different cluster sizes, monitoring metrics, and step‑by‑step adjustment strategies for stable, high‑throughput clusters.

API Server · Kubernetes · cluster operations

0 likes · 7 min read

Full-Stack DevOps & Kubernetes

Nov 11, 2024 · Cloud Native

How K3s Embedded Registry Enables Offline Image Sharing in Kubernetes Clusters

This article explains how K3s's new embedded registry feature lets Kubernetes nodes share OCI images peer‑to‑peer, eliminating dependence on an external registry and speeding up distribution in offline, low‑bandwidth, or geographically dispersed environments.

Embedded Registry · K3s · Kubernetes

0 likes · 8 min read

dbaplus Community

Mar 11, 2024 · Cloud Native

Why Kubernetes’ One‑Year Loopback Certificate Breaks and How the Community Is Tackling It

The article explains that the kube‑apiserver’s built‑in LoopbackClient certificate expires after one year, causing API server failures in long‑running clusters, and examines the community’s support policies, recent Alibaba Cloud changes, and ongoing discussions about introducing a true Kubernetes LTS.

Certificate · Kubernetes · LTS

0 likes · 13 min read

Zhengcaiyun Technology (政采云技术)

Jan 10, 2024 · Operations

Understanding and Improving Elasticsearch Shard Balancing Strategies

This article analyzes Elasticsearch shard imbalance incidents, explains the built‑in shard balancing algorithm and its configuration parameters, demonstrates weight calculations with source code, and proposes practical improvements—including shard count adjustments and a custom load‑aware balancing tool—to achieve more effective cluster load distribution.

Elasticsearch · Performance Optimization · cluster operations

0 likes · 17 min read

dbaplus Community

Sep 14, 2023 · Cloud Native

Mastering Kubernetes: 30+ Essential Pod, Node, and Cluster Troubleshooting Techniques

This guide compiles over thirty practical Kubernetes troubleshooting steps, covering pod startup failures, networking issues, resource bottlenecks, node abnormalities, cluster‑wide service problems, and detailed explanations of common container exit codes to help operators quickly diagnose and resolve issues.

Container exit codes · Kubernetes · Node diagnostics

0 likes · 22 min read

ITPUB

Aug 9, 2023 · Operations

Why Is My Elasticsearch Cluster Using 15 GB Heap? A Deep Dive into Memory Bottlenecks

The article examines a 7‑node Elasticsearch cluster with 500 million documents, uncovering excessive heap usage, high OS memory pressure, numerous deleted documents, large translog, low query‑cache hit rate, and an over‑sharded design, then offers concrete tuning and redesign recommendations to restore performance.

Elasticsearch · Memory Optimization · cluster operations

0 likes · 16 min read

Efficient Ops

Nov 23, 2022 · Operations

How to Diagnose and Fix Node2 Ceph‑Related cgroup Leaks in a Kubernetes Cluster

This article walks through a real‑world Kubernetes incident where a node ran out of space due to Ceph storage inconsistencies and cgroup leaks, detailing step‑by‑step diagnostics, Ceph repair commands, pod eviction, node reboot, and post‑mortem recommendations for cluster operations.

Ceph · Kubernetes · Node troubleshooting

0 likes · 6 min read

Zhengcaiyun Technology (政采云技术)

Jul 14, 2022 · Operations

Diagnosing and Optimizing Elasticsearch IO Bottlenecks for Billion-Scale Product Catalogs

Facing severe IO-wait and read bottlenecks as product data grew from tens of millions to billions, this article analyzes root causes in Elasticsearch clusters and presents a comprehensive solution involving index parameter tuning, merge settings, translog async writes, query optimizations, and hardware upgrades to restore performance and stability.

Elasticsearch · IO optimization · Index Tuning

0 likes · 14 min read

Efficient Ops

Mar 23, 2022 · Operations

Master Elasticsearch Node Commands: Inspect, Monitor, and Troubleshoot Your Cluster

This article walks through essential Elasticsearch node‑level APIs—covering how to retrieve basic node info, detailed statistics, thread‑pool usage, and hot‑thread diagnostics—complete with request examples, response samples, and practical tips for diagnosing common cluster issues.

API · Elasticsearch · cluster operations

0 likes · 12 min read

Ops Development Stories

Oct 9, 2021 · Cloud Native

Why Do Some Kubernetes Pods Stay Stuck in Terminating? Causes and Fixes

This article explains the Kubernetes pod lifecycle, the meaning of the Terminating state, detailed pod creation and deletion processes, and the eviction mechanisms of both kube‑controller‑manager and kubelet, offering troubleshooting steps and best practices to resolve pods that remain stuck in Terminating.

Cloud Native · Kubernetes · Pod Lifecycle

0 likes · 13 min read

Code Ape Tech Column

Jan 19, 2021 · Operations

Scaling Kafka Clusters to Support Millions of Partitions: Challenges and Solutions

This article examines the technical challenges of scaling Kafka clusters to millions of partitions, including ZooKeeper node explosion, replication overhead, controller recovery latency, and broker restart delays, and proposes solutions such as parallel ZK fetching, metadata synchronization via internal topics, logical cluster composition, and physical cluster splitting.

Distributed Systems · Kafka · cluster operations

0 likes · 13 min read

Tencent Cloud Middleware

Apr 9, 2020 · Operations

Scaling Kafka to Support Millions of Partitions Without Downtime

This article explains the metadata, controller, and ZooKeeper challenges of supporting more than a million Kafka partitions and presents practical solutions such as parallel ZK fetching, metadata‑via‑topic redesign, logical cluster assembly, and physical cluster splitting to achieve large‑scale, stable Kafka deployments.

Kafka · ZooKeeper · cluster operations

0 likes · 15 min read

Big Data Technology Architecture

Jul 9, 2019 · Operations

Elasticsearch Node Shutdown Process and Risks During Rolling Upgrade

During a rolling upgrade of an Elasticsearch cluster, stopping nodes—especially the master—can block write requests, cause client connection failures, trigger master re‑election, and lead to temporary data duplication, making it essential to understand the shutdown sequence and its impact on read/write operations.

Elasticsearch · Node Shutdown · Rolling Upgrade

0 likes · 5 min read

JD Tech Talk

Aug 9, 2018 · Operations

Ensuring Stability and Scalability in Large‑Scale Kubernetes Clusters: Three Key Questions and Operational Practices

The article explains why operating massive Kubernetes clusters is as challenging as building large systems, outlines three critical stability questions, shares real‑world data collection, visualization, and tooling practices, and provides concrete recommendations for high‑availability, monitoring, and performance optimization.

Automation · Kubernetes · Observability

0 likes · 12 min read

JD Tech

Aug 6, 2018 · Operations

Ensuring Stability in Large‑Scale Kubernetes Clusters: Three Key Questions and Operational Practices

The article shares practical experience on operating massive Kubernetes clusters, focusing on three stability questions, data collection and visualization, and a suite of operational tools to achieve reliable, high‑availability services in production environments.

Kubernetes · cluster operations · large scale

0 likes · 12 min read

JD Retail Technology

Jul 24, 2018 · Operations

Stability and Operational Practices for Large‑Scale Kubernetes Clusters

This article shares practical experience and best‑practice guidelines for operating large‑scale Kubernetes clusters, covering stability checks, component failure impact, recovery strategies, alerting mechanisms, data collection, visualization, and the suite of operational tools that help ensure reliable, high‑performance cloud‑native infrastructure.

Kubernetes · Observability · cluster operations

0 likes · 10 min read
