Tagged articles

cluster management

188 articles · Page 1 of 2

Jun 16, 2026 · Operations

Open‑Source OCManager: A Smart Manager that Handles 7 Million Daily Alerts

OCManager, an open‑source integrated platform from OpenCloudOS, unifies cluster management, whole‑machine monitoring, and AI‑driven operations in a single web console, supporting millions of daily alerts, thousands of incidents, and multi‑OS environments with a four‑layer architecture and Docker‑based deployment.

AI Ops · Docker · Monitoring

0 likes · 15 min read

Raymond Ops

Jun 2, 2026 · Cloud Native

200+ Essential kubectl Commands for Managing and Troubleshooting Kubernetes Clusters

This guide compiles over 200 practical kubectl commands, covering cluster setup, context switching, resource inspection, workload management, networking, storage, security hardening, high‑availability patterns, troubleshooting techniques, and performance monitoring to help operators efficiently administer Kubernetes environments.

Cloud Native · cluster management · devops

0 likes · 39 min read

DevOps Operations Practice

May 3, 2026 · Cloud Native

Kubernetes Dashboard Is Deprecated—Officially Recommended Replacement Headlamp

The article explains why the Kubernetes Dashboard has been deprecated due to security and multi‑cluster limitations, and introduces Headlamp as the officially endorsed, lightweight web UI that offers multi‑cluster management, strict RBAC enforcement, and extensible plugins, with simple installation steps.

Headlamp · RBAC · Web UI

0 likes · 3 min read

Mingyi World Elasticsearch

Mar 30, 2026 · Operations

Cerebro + Easysearch: A Practical Guide to Avoid Common Pitfalls

This guide explains how to integrate the lightweight Cerebro monitoring tool with Easysearch, covering core features, configuration steps, and detailed solutions for frequent issues such as Java version conflicts, SSL certificate errors, and authentication mismatches.

Cerebro · Configuration · Easysearch

0 likes · 8 min read

Raymond Ops

Dec 27, 2025 · Cloud Native

15 Powerful kubectl Tricks to Master Kubernetes Management

Learn 15 practical kubectl techniques—from resource shortcuts and context switching to advanced JSONPath queries, custom output formats, and efficient alias configurations—that enable Kubernetes administrators to streamline cluster management, improve debugging, and boost operational productivity.

CLI · cluster management · devops

0 likes · 12 min read

Linux Cloud Computing Practice

Dec 5, 2025 · Operations

Essential Ceph Command Cheat Sheet for Cluster Management

This guide provides a concise collection of essential Ceph commands for starting services, checking health and status, managing monitors, metadata servers, and OSDs, as well as creating admin users, purging nodes, and handling CRUSH maps, enabling administrators to efficiently operate and troubleshoot a Ceph storage cluster.

Ceph · Linux · Operations

0 likes · 6 min read

Mingyi World Elasticsearch

Nov 24, 2025 · Operations

Master Easysearch Write Throttling: Node, Shard, and Index Level Controls to Tame Bulk Write Spikes

This article walks through configuring Easysearch write throttling at the node, shard, and index levels, showing dynamic cluster settings, key parameters, DSL examples, and when to use retry versus drop actions to protect cluster stability during massive bulk indexing operations.

Easysearch · Elasticsearch · Index Throttling

0 likes · 8 min read

Rare Earth Juejin Tech Community

Nov 19, 2025 · Backend Development

Master Elasticsearch: Index Design, Field Types, and Cluster Management Tips

An experienced engineer shares practical Elasticsearch insights covering index design with aliases and routing, field type choices, query optimization techniques, pagination strategies, real‑time refresh settings, memory limits, and cluster management, offering concrete examples and actionable recommendations for robust search implementations.

Elasticsearch · Query Optimization · cluster management

0 likes · 12 min read

DevOps Coach

Oct 28, 2025 · Cloud Native

20 Essential Kubernetes Tips to Boost Security, Reliability, and Manageability

This guide presents twenty practical Kubernetes best‑practice tips covering productivity shortcuts, resource limits, health probes, node draining, PodDisruptionBudgets, RBAC hardening, read‑only ConfigMaps/Secrets, non‑root containers, network policies, image version pinning, secret rotation, centralized logging, etcd backups, resource cleanup, and secure access methods.

Reliability · best practices · cluster management

0 likes · 8 min read

Liangxu Linux

Oct 1, 2025 · Cloud Native

Master Kubernetes with Essential kubectl Commands: From Cluster Overview to Advanced Ops

This comprehensive guide walks you through the most useful kubectl commands for Kubernetes, covering cluster inspection, pod lifecycle, services, networking, storage, troubleshooting, performance tuning, security, and automation, empowering ops engineers to manage containerized clusters efficiently.

Cloud Native · cluster management · kubectl

0 likes · 15 min read

Ray's Galactic Tech

Sep 20, 2025 · Operations

How to Safely Upgrade a ZooKeeper Node’s IP Without Disrupting the Cluster

This guide explains why changing a ZooKeeper node’s IP requires updating the configuration on all members, then walks through a step‑by‑step procedure—including stopping the target node, editing zoo.cfg on every server, restarting the remaining nodes, and verifying the quorum—plus best‑practice tips for Kubernetes deployments.

IP upgrade · cluster management · kubernetes

0 likes · 7 min read

Mingyi World Elasticsearch

Aug 30, 2025 · Operations

INFINI Console FAQ: Enterprise‑Grade Unified Elasticsearch Management

The article introduces INFINI Console, an open‑source, lightweight platform for unified, multi‑cluster and cross‑version Elasticsearch governance, compares it with Kibana, details deployment options, enterprise‑level features such as monitoring, alerting and security, and analyzes cost advantages and practical migration scenarios.

Elasticsearch · INFINI Console · Monitoring

0 likes · 13 min read

DataFunSummit

Aug 28, 2025 · Artificial Intelligence

How We Scaled AI Compute to Millions of Nodes with Ray on WeChat

This article explains how Tencent's WeChat team built the Astra platform on Ray to manage millions of AI compute nodes, addressing challenges of massive scale, heterogeneous GPU resources, low‑priority node instability, deployment complexity, and cost, while detailing architecture, scheduling strategies, and practical usage examples.

AI scaling · Distributed Computing · Ray

0 likes · 21 min read

360 Zhihui Cloud Developer

Aug 6, 2025 · Cloud Native

Step‑by‑Step Rancher Deployment for Multi‑Cluster Kubernetes Management

This guide explains the background of multi‑IDC Kubernetes clusters, why a unified platform like Rancher is needed, and provides detailed step‑by‑step instructions for single‑node, high‑availability RKE, lightweight K3s deployments, Helm installation, cert‑manager setup, ingress configuration, and best‑practice recommendations.

HA deployment · RKE · cluster management

0 likes · 12 min read

MaGe Linux Operations

Jul 21, 2025 · Cloud Native

Master Kubernetes with Essential Commands: Efficient Container Cluster Management

This comprehensive guide walks operations engineers through essential Kubernetes commands, covering cluster inspection, pod lifecycle, service and network handling, storage configuration, troubleshooting, performance monitoring, scaling, security, and automation, enabling efficient and expert management of containerized clusters.

Operations · cluster management · kubectl

0 likes · 17 min read

Raymond Ops

Jul 19, 2025 · Cloud Native

Step-by-Step Guide to Upgrading Kubernetes Nodes to v1.15.12

This tutorial walks you through downloading the latest Kubernetes packages, preparing master and node services, adjusting nginx proxy settings, cordoning and draining nodes, replacing binaries and certificates, restarting services, and verifying the upgrade across a two‑node cluster.

NGINX · Node Upgrade · cluster management

0 likes · 13 min read

Raymond Ops

Jun 19, 2025 · Operations

Master Kubernetes Cluster Management: Essential kubectl Commands Explained

This guide walks you through essential kubectl commands for viewing cluster status, inspecting resources, creating and modifying objects, labeling, annotating, and launching pods, providing practical examples and command syntax to help you manage Kubernetes clusters effectively.

cluster management · devops · kubectl

0 likes · 14 min read

Mingyi World Elasticsearch

Jun 18, 2025 · Operations

Comprehensively Manage Elasticsearch 9.X with INFINI Console

The article provides a detailed technical overview of INFINI Console, an open‑source, lightweight governance platform that enables multi‑cluster, cross‑version management, dynamic registration, monitoring, alerting, and developer tools for Elasticsearch 9.X, comparing it with Kibana and highlighting deployment simplicity across various OS and CPU architectures.

Cross-Version Support · Elasticsearch · INFINI Console

0 likes · 11 min read

DevOps Operations Practice

Jun 16, 2025 · Cloud Native

Mastering Kubernetes: 6 Essential Tools for Cluster Management

This article introduces six indispensable tools—kubectl, Helm, Prometheus + Grafana, Istio, Velero, and K9s—that simplify Kubernetes cluster management by covering resource handling, monitoring, networking, security, backup, and interactive UI, helping readers efficiently operate production‑grade clusters.

Cloud Native · Monitoring · cluster management

0 likes · 7 min read

Mingyi World Elasticsearch

Jun 4, 2025 · Operations

When Should You Deploy Dedicated Coordinating Nodes in Elasticsearch?

The article explains what Elasticsearch coordinating nodes are, why dedicated coordinating‑only nodes can off‑load HTTP handling from data and master nodes to reduce load, lower latency and simplify client configuration, and outlines the associated hardware and cluster‑state costs, usage scenarios, deployment steps and monitoring tips.

Coordinating Node · Elasticsearch · Node Roles

0 likes · 12 min read

Full-Stack DevOps & Kubernetes

May 14, 2025 · Cloud Native

Step‑by‑Step Binary Upgrade of a Kubernetes v1.30 Cluster to v1.32.3

This guide walks through upgrading a binary‑deployed Kubernetes cluster from v1.30.0 to v1.32.3, covering preparation, master and node upgrade procedures, deprecated API handling, validation, rollback, and best‑practice recommendations for production environments.

Binary Upgrade · Deprecated API · Rollback

0 likes · 11 min read

Efficient Ops

May 12, 2025 · Cloud Native

Master Kubernetes Management with Kuboard: Visual UI Guide & Installation

Kuboard is a web‑based visual tool for managing Kubernetes clusters, offering multi‑auth, multi‑cluster support, micro‑service layering, and storage integration; the guide explains Docker installation, adding clusters via KubeConfig, workload inspection, and how the UI simplifies complex command‑line operations.

Cloud Native · Docker · cluster management

0 likes · 5 min read

Linux Cloud Computing Practice

Apr 10, 2025 · Cloud Computing

Unlock Scalable, Reliable Storage: A Complete Guide to Deploying Ceph

This article provides a comprehensive overview of Ceph distributed storage, covering storage fundamentals, Ceph architecture, advantages, version lifecycle, and step‑by‑step deployment using ceph‑deploy, including environment preparation, monitor and OSD setup, manager configuration, and dashboard activation.

Ceph · Distributed storage · Linux Deployment

0 likes · 28 min read

Tencent Cloud Middleware

Apr 9, 2025 · Operations

How TDMQ Pulsar’s Cluster‑Level and Topic‑Partition Throttling Keeps Your Messaging System Stable

This article explains why high‑throughput producers and consumers can saturate CPU, memory, network and disk I/O in TDMQ Pulsar clusters, describes the built‑in cluster‑level distributed and topic‑partition rate‑limiting mechanisms, and provides practical guidance for configuration, monitoring, and troubleshooting.

Message Queue · Operations · Pulsar

0 likes · 12 min read

Raymond Ops

Mar 30, 2025 · Operations

Mastering Elasticsearch Data Sync and Cluster Architecture: 3 Strategies Explained

This article explains three Elasticsearch data‑synchronization methods, compares their pros and cons, and then dives into ES cluster structure, node roles, shard allocation, distributed queries, split‑brain handling, and fault‑tolerance mechanisms, providing a comprehensive guide for developers and ops engineers.

Data synchronization · Elasticsearch · cluster management

0 likes · 9 min read

Mingyi World Elasticsearch

Mar 29, 2025 · Operations

How to Reset a Forgotten INFINI Console Password

The article explains two ways to recover access to INFINI Console when the password is lost: locating the original console_configuration.json file to retrieve the stored credentials, or using the built‑in Reset Password feature in the user management UI, with step‑by‑step instructions and screenshots.

Configuration File · INFINI Console · admin guide

0 likes · 5 min read

Cloud Native Technology Community

Mar 18, 2025 · Cloud Native

Best Practices for Managing Core Services in Large‑Scale Kubernetes Deployments

Scaling Kubernetes across dozens or hundreds of clusters requires standardized core services—networking, security, observability, and automation—so organizations should adopt templated configurations, GitOps tools, centralized monitoring, and automated certificate management to reduce complexity, improve security, and lower operational overhead.

Automation · GitOps · Observability

0 likes · 8 min read

Linux Cloud Computing Practice

Feb 28, 2025 · Cloud Native

Why Does a Kubernetes Node Stay Ready Only 3 Minutes After Restart?

This article examines a recurring Kubernetes node NotReady issue where nodes become ready for only three minutes after a kubelet restart, detailing the underlying PLEG mechanism, component interactions, and diagnostic steps to resolve the problem.

Cloud Native · NodeReady · PLEG

0 likes · 8 min read

dbaplus Community

Feb 13, 2025 · Databases

Automating Redis Resource Balancing to Cut DBA Effort

To handle growing memory pressure across thousands of Redis servers, the platform implements an automated, daily resource‑balancing scheduler that selects overloaded hosts, chooses optimal nodes based on instance count, tier, and placement rules, then safely migrates them through a multi‑step process with rigorous validation.

Automation · Database operations · Redis

0 likes · 14 min read

Mingyi World Elasticsearch

Feb 11, 2025 · Operations

How to Ace the Elastic Certified Engineer Exam: Full 8.15 Syllabus Breakdown and Fast‑Track Tips

This guide dissects the Elastic Certified Engineer 8.15 exam syllabus, explains each core topic—from searchable snapshots and async search to ILM policies and cross‑cluster replication—while offering a step‑by‑step study roadmap, hands‑on lab ideas, and resource recommendations to help candidates pass efficiently.

8.15 · Elastic Certified Engineer · Elasticsearch

0 likes · 6 min read

Architect

Dec 27, 2024 · Big Data

Fault Self‑Healing System for Large‑Scale Big Data Clusters

This article describes the design, architecture, and technical implementation of BMR's fault self‑healing platform, which automatically collects data, analyzes failures, defines decision rules, and executes safe recovery workflows to improve reliability and efficiency of massive, heterogeneous big‑data environments.

Big Data · Monitoring · cluster management

0 likes · 16 min read

Bilibili Tech

Dec 10, 2024 · Big Data

Fault Self‑Healing System for Bilibili's Large‑Scale Big Data Cluster (BMR)

Bilibili's fault‑self‑healing platform for its massive BMR big‑data cluster—over 10,000 machines and 1 EB storage—adds near‑real‑time fault discovery, intelligent diagnosis, and automated workflow handling, dramatically cutting resolution time, improving stability across services, and scaling to dozens of daily automated repairs.

BMR · cluster management · fault self-healing

0 likes · 16 min read

System Architect Go

Nov 6, 2024 · Cloud Native

How Kubernetes Extended Resources Enable Custom Scheduling (and Their Limits)

This article explains how Kubernetes Extended Resources let you define custom resource types, describes the creation, synchronization, and scheduling workflow, highlights the non‑real‑time allocatable status behavior, and discusses practical limitations and the role of Device Plugins and Operators.

Custom Scheduling · Device Plugin · Extended Resource

0 likes · 6 min read

Bilibili Tech

Oct 29, 2024 · Big Data

Bilibili One‑Stop Big Data Cluster Management Platform (BMR): Architecture, Modules, and Future Outlook

Bilibili's One‑Stop Big Data Cluster Management Platform (BMR) unifies cluster, metadata, intelligent operations, and custom managers to oversee 50+ services, 10,000 machines, exabyte storage, and millions of cores, using cloud‑native containers, fault prediction, and resource‑sharing techniques to boost efficiency, stability, and cost savings.

BMR · Intelligent Operations · Metadata Warehouse

0 likes · 17 min read

Baidu Geek Talk

Oct 9, 2024 · Artificial Intelligence

How Baidu’s Baige 4.0 Architecture Redefines AI Compute Efficiency

This article analyzes Baidu's Baige 4.0 AI infrastructure, detailing its four‑layer architecture, XMAN 5.0 hardware, HPN network, BCCL communication library, and AIAK inference upgrades, and explains how these innovations address large‑model training and inference challenges while boosting performance, utilization, and cost efficiency.

AI Infrastructure · GPU Acceleration · High-performance computing

0 likes · 16 min read

Architects' Tech Alliance

Sep 12, 2024 · Industry Insights

Managing and Optimizing Large‑Scale AI Compute Clusters: Practical Insights

This article examines the key pain points of massive AI compute clusters—including heterogeneous hardware compatibility, efficient scheduling, training and inference acceleration, and fault‑tolerant operations—while presenting practical management and performance‑tuning strategies, a cloud‑native AI platform implementation, and future directions for the ecosystem.

AI computing · Operations · Performance Tuning

0 likes · 7 min read

Mike Chen's Internet Architecture

Aug 29, 2024 · Cloud Native

Mastering Kubernetes: Core Concepts, Architecture, and Real‑World Use Cases

This article provides a comprehensive overview of Kubernetes (K8s), covering its origins, key problems it solves, master‑node architecture, core components such as kube‑apiserver, scheduler, controllers, node agents, and practical applications like CI/CD integration, multi‑tenant and micro‑service deployments.

CI/CD · Cloud Native · Container Orchestration

0 likes · 9 min read

Rare Earth Juejin Tech Community

Aug 6, 2024 · Operations

ZooKeeper Core Concepts: Data Model, Node Types, Sessions, Cluster, Election, ZAB, Watch, ACL, and Distributed Lock Patterns

This article explains ZooKeeper's hierarchical data model, node types, session mechanism, cluster roles and election process, ZAB protocol, watch mechanism, ACL permissions, and common distributed lock implementations, providing a comprehensive overview of its core concepts and practical usage.

ACL · Coordination Service · Distributed Lock

0 likes · 17 min read

Bilibili Tech

Jul 19, 2024 · Big Data

Bilibili's One-Stop Big Data Cluster Management Platform (BMR) - Architecture and Implementation

Bilibili’s one‑stop Big Data Cluster Management Platform (BMR) consolidates HDFS, Spark, Flink, ClickHouse, Kafka and other services into a unified system that evolved through four stages—standardization, metadata‑driven construction, containerization, and observability—addressing node consistency, scaling, fault self‑healing, and resource optimization while delivering elastic scaling, automated start/stop, and future cost‑saving and stability enhancements.

Observability · big data platform · cluster management

0 likes · 12 min read

DevOps Cloud Academy

Jun 18, 2024 · Operations

Essential kubectl Commands for DevOps Engineers

This guide presents a comprehensive collection of the most important and frequently used kubectl commands, explaining how to retrieve version information, manage clusters, list resources, manipulate contexts, create, update, patch, scale, expose, delete, and debug Kubernetes objects, as well as format output and control verbosity, enabling DevOps engineers to efficiently operate Kubernetes clusters.

CLI · cluster management · devops

0 likes · 14 min read

Baidu Geek Talk

Apr 24, 2024 · Industry Insights

How Baidu’s New AI OS “WanYuan” Redefines Intelligent Computing

At the Create 2024 Baidu AI Developer Conference, Baidu unveiled its next‑generation intelligent computing operating system WanYuan, detailing its cluster‑scale management, GPU‑centric performance, integrated large‑model services, and a layered architecture that aims to simplify AI‑native application development and accelerate the AI era.

AI · Baidu · Cloud Computing

0 likes · 12 min read

Practical DevOps Architecture

Apr 18, 2024 · Cloud Native

Kubernetes Source Code Deep Dive and Secondary Development Course Outline

This curriculum provides a comprehensive, step‑by‑step exploration of Kubernetes internals—including kubeadm core source, Go module management, cobra libraries, kubeadm init/join processes, client‑go components, code generators, custom resources, operators, and practical deployment automation—aimed at mastering cluster setup, configuration, and advanced development.

Go · Source Code · client-go

0 likes · 10 min read

Ops Development Stories

Apr 7, 2024 · Cloud Native

How to Build a Kube‑on‑Kube Controller for Managing Multiple Kubernetes Clusters

This article explains the concept of kube‑on‑kube—creating a Kubernetes meta‑cluster to manage other clusters via declarative APIs—detailing its architecture, controller design, execution flow, and step‑by‑step code walkthrough including CRD definitions, Docker images, and deployment procedures.

Ansible · CRD · Docker

0 likes · 15 min read

NewBeeNLP

Mar 8, 2024 · Industry Insights

Why Building LLMs Is Like Buying a Hardware Lottery – Lessons from a Startup

The article recounts Yi Tay’s experience founding Reka and building large language models from scratch, highlighting the unpredictable quality of GPU clusters, the challenges of multi‑cluster orchestration, code‑base choices, and how startups must rely on fast, intuition‑driven experimentation to succeed.

GPU · Hardware · LLM

0 likes · 12 min read

dbaplus Community

Feb 26, 2024 · Cloud Native

10 Hard‑Earned Lessons from 3 Years Managing Kubernetes Clusters

After three years of hands‑on Kubernetes administration, the author shares ten practical lessons covering cloud‑hosted clusters, infrastructure‑as‑code, Helm chart usage, service mesh decisions, tool selection, resource limits, stateless design, HPA configuration, and upgrade strategies to help both newcomers and seasoned engineers manage clusters effectively.

Cloud Native · best practices · cluster management

0 likes · 7 min read

Practical DevOps Architecture

Feb 26, 2024 · Big Data

Advanced ElasticStack Development and Architecture Course (P6)

This course provides comprehensive, hands‑on training on Elasticsearch, Logstash, Kibana, and the ElasticStack ecosystem, covering advanced development, cluster design, performance tuning, security, and real‑world integration techniques for large‑scale data processing.

Big Data · ElasticStack · Elasticsearch

0 likes · 6 min read

Full-Stack DevOps & Kubernetes

Feb 22, 2024 · Cloud Native

Getting Started with Rancher: Simplify Kubernetes Cluster Management

This guide introduces Rancher as an open‑source container management platform and outlines step‑by‑step how to install Rancher, create clusters, add nodes, deploy applications, monitor resources, and manage user permissions for Kubernetes environments.

Container Management · cluster management · rancher

0 likes · 5 min read

Ops Development Stories

Feb 2, 2024 · Cloud Native

Essential kubectl Commands for Efficient Kubernetes Management

This guide compiles a comprehensive set of kubectl and Docker commands for retrieving logs, sorting pods, managing secrets, cleaning resources, debugging, port forwarding, and performing cluster maintenance tasks, helping administrators streamline Kubernetes operations and troubleshoot issues effectively.

CLI · Cloud Native · cluster management

0 likes · 15 min read

Didi Tech

Jan 9, 2024 · Big Data

Introducing Apache Pulsar: Technical Benefits and Solutions for Didi Big Data Messaging System

Apache Pulsar, a cloud‑native distributed messaging platform, solves Didi Big Data’s DKafka bottlenecks by separating compute and storage, using sequential log writes, heterogeneous disks, multi‑level caching, bundle‑based load balancing and automatic scaling, dramatically improving stability while introducing richer monitoring complexity.

Apache Pulsar · DKafka · Messaging System

0 likes · 17 min read

dbaplus Community

Dec 20, 2023 · Operations

Scaling Kafka to 1000+ Nodes: Governance, Auto‑Balancing & Tiered Storage

This article outlines how a large‑scale Kafka deployment of over a thousand machines across dozens of clusters was engineered for stability and efficiency through a custom Guardian controller that adds partition‑level throttling, automatic balancing, multi‑tenant isolation, cross‑IDC management, tiered storage, audit capabilities, and fully automated operational workflows.

Monitoring · Multi‑tenant · Operations

0 likes · 21 min read

WeiLi Technology Team

Nov 1, 2023 · Big Data

How to Diagnose and Resolve HDFS Safe Mode Issues

This guide explains why HDFS enters safe mode after a DataNode failure, describes the safe‑mode state and its exit conditions, and provides step‑by‑step commands and troubleshooting procedures to analyze, fix, and recover from safe‑mode incidents in Hadoop clusters.

Big Data · HDFS · Hadoop

0 likes · 10 min read

Efficient Ops

Sep 17, 2023 · Cloud Native

Top 9 Essential Kubernetes Tools to Streamline Your Cloud‑Native Workflows

Explore nine indispensable Kubernetes tools (Kubie, Kubespray, Helm, Minikube, K3s, Kustomize, kOps, Prometheus, and Krew) that simplify cluster management, accelerate deployments, and enhance efficiency, helping you choose the right solution for smoother, more productive cloud‑native operations.

cloud-native · cluster management · devops tools

0 likes · 6 min read

Aikesheng Open Source Community

Jul 3, 2023 · Databases

Replacing OCP Nodes Using the ANTMAN Tool in OceanBase Cloud Platform

This article provides a step‑by‑step guide on how to replace OceanBase Cloud Platform (OCP) nodes using the ANTMAN tool, covering environment preparation, configuration adjustments, execution of management scripts, tenant migration, cleanup of old services, and troubleshooting tips for a seamless database cluster upgrade.

ANTMAN · Docker · OCP

0 likes · 25 min read

Liangxu Linux

Jul 2, 2023 · Cloud Native

Mastering kubectl: Essential Commands for Kubernetes Management

This guide explains what kubectl is, how it interacts with the Kubernetes API server, and provides a categorized list of essential commands for retrieving information, debugging, state management, scaling, deployment, and security, helping users efficiently operate and automate K8s clusters.

Cloud Native · cluster management · devops

0 likes · 5 min read

Test Development Learning Exchange

Jun 29, 2023 · Cloud Native

Essential Kubernetes Commands for Testers: 50 Commands with Practical Examples

This article presents a comprehensive collection of 50 essential kubectl commands covering cluster, namespace, pod, deployment, service, ConfigMap, secret, volume, logging, debugging, scaling, configuration, and cleanup operations, providing testers with practical examples to efficiently manage and troubleshoot Kubernetes environments.

cluster management · kubectl · testing

0 likes · 9 min read

High Availability Architecture

May 26, 2023 · Big Data

Amiya: Dynamic Overcommit Component for Bilibili Offline Big Data Cluster Resource Scheduling

This article introduces Amiya, a self‑developed overcommit component that dynamically increases Yarn memory and vCore capacity on Bilibili's offline big‑data clusters, details its architecture, key implementation of overcommit, eviction and mixed‑deployment strategies, and evaluates its resource‑utilization impact.

Overcommit · Resource Scheduling · YARN

0 likes · 22 min read

Bilibili Tech

May 23, 2023 · Big Data

Amiya: Dynamic Overcommit Component for Bilibili Offline Big Data Cluster

Amiya, a self‑developed dynamic over‑commit component for Bilibili’s offline big‑data cluster, inflates reported resources on under‑utilized nodes and adjusts them when load rises, adding roughly 683 TB of memory and 137 k vCores, boosting per‑node memory by 15 % and CPU usage by over 20 % while keeping eviction rates below 3 %.

Amiya · Bilibili · Resource Overcommit

0 likes · 22 min read

Cloud Native Technology Community

May 17, 2023 · Cloud Native

Why Do You Need Kubernetes Multi‑Cluster? Core Challenges and Design Principles

This article explains the motivations behind Kubernetes multi‑cluster deployments, outlines common use cases such as isolation and high‑availability, and analyzes core management elements including deployment models, control‑plane architectures, network connectivity, service discovery, cross‑cluster scheduling, application model extensions, and treating clusters as resources.

Cloud Native · Multi-Cluster · Network Model

0 likes · 23 min read

ITPUB

Apr 5, 2023 · Operations

Automating TiDB Operations: From Manual Pain Points to a Scalable Platform

This article details how Zhaozhuan's DBA team transformed TiDB cluster management by addressing metadata, resource allocation, upgrade, and alert challenges through a comprehensive automation platform that streamlines work orders, node operations, scaling, monitoring, and alert handling, ultimately reducing manual effort and improving reliability.

Alerting · Database Automation · TiDB

0 likes · 22 min read

MaGe Linux Operations

Mar 30, 2023 · Cloud Native

Why Is Kubernetes So Hard to Master? A Beginner’s Q&A Guide

This article explains the core concepts of Kubernetes—including its architecture, node communication, pod scheduling, data storage, service exposure, scaling, and controller coordination—through a series of clear questions and answers, helping beginners grasp why the platform feels complex.

Cloud Native · Container Orchestration · Pod Scheduling

0 likes · 9 min read

Architecture Digest

Mar 20, 2023 · Cloud Native

Kubernetes: What It Is and Why It’s Hard to Get Started

This article provides a concise, question‑and‑answer overview of Kubernetes, explaining its role as a distributed container‑orchestration system, the architecture of master and worker nodes, core components such as etcd, kube‑apiserver, scheduler, controllers, and how services, pods, labels, and scaling operate within a cluster.

Cloud Native · Container Orchestration · Controllers

0 likes · 8 min read

21CTO

Feb 10, 2023 · Cloud Native

Why Kubernetes Is So Hard to Master: A Beginner’s Q&A Walkthrough

This article introduces Kubernetes fundamentals through a series of questions and answers, covering its architecture, node communication, pod scheduling, data storage, external access, scaling mechanisms, and component coordination, all illustrated with clear diagrams.

Containers · Pod Scheduling · cluster management

0 likes · 9 min read

Top Architect

Feb 7, 2023 · Cloud Native

Understanding Kubernetes: Core Concepts and Architecture

This article provides a concise, question‑driven overview of Kubernetes, covering its architecture, node and master communication, pod fundamentals, scheduling, storage via etcd, service exposure, scaling mechanisms, and the roles of core components such as kube‑apiserver, kubelet, kube‑proxy and controllers.

Cloud Native · Containers · Microservices

0 likes · 9 min read

Ziru Technology

Jan 12, 2023 · Operations

Why Alertmanager Config Keeps Getting Overwritten in TiDB Clusters and How to Fix It

This guide explains why the Alertmanager configuration file in a TiDB cluster is repeatedly overwritten during reloads, analyzes error logs and TiUP documentation, and provides step‑by‑step instructions to edit the topology, set a custom config file, reload the service, and verify the fix.

Alertmanager · Configuration · Monitoring

0 likes · 8 min read

Open Source Linux

Dec 30, 2022 · Operations

Top 7 Kubernetes Management Tools to Simplify Cluster Operations

This article introduces seven popular Kubernetes management solutions—including K9s, Rancher, the native Dashboard with Kubectl and Kubeadm, Helm, KubeSpray, Kontena Lens, and WKSctl—detailing their key features, usage scenarios, and how they help streamline cluster monitoring, deployment, scaling, and security across cloud‑native environments.

Operations · cluster management · devops

0 likes · 9 min read

Thoughts on Knowledge and Action

Dec 7, 2022 · Big Data

Mastering Elasticsearch: Core Concepts, Cluster Architecture, and Indexing Mechanics

This article explains Elasticsearch’s fundamental building blocks, cluster roles, shard and replica strategies, master election, split‑brain prevention, inverted index structure, and the complete search and indexing lifecycle for handling large‑scale data efficiently.

Big Data · Elasticsearch · Indexing

0 likes · 10 min read

Architecture Digest

Nov 30, 2022 · Backend Development

Meituan Kafka at Scale: Challenges and Optimizations for Latency, Cluster Management, and Reliability

This article details Meituan's large‑scale Kafka deployment—over 15,000 machines and petabyte‑level daily traffic—its operational challenges such as slow nodes, load imbalance, and resource contention, and the comprehensive read/write latency, system‑level, and cluster‑management optimizations implemented to improve performance and reliability.

Performance Optimization · cluster management · distributed systems

0 likes · 22 min read

ITPUB

Nov 23, 2022 · Backend Development

How Zookeeper Elects Its Leader: A Human Election Analogy Explained

This article explains Zookeeper's leader election mechanism by comparing it to human voting, detailing the four core concepts, the role of zxid, the step‑by‑step process during startup and runtime failures, and the key terms every interviewee should know.

Backend Development · Leader Election · cluster management

0 likes · 11 min read

Java Architect Essentials

Nov 11, 2022 · Big Data

Meituan Kafka at Scale: Challenges and Optimizations for Latency, Cluster Management, and Reliability

This article details Meituan's large‑scale Kafka deployment, describing the current state, performance challenges such as slow nodes and disk imbalance, and the comprehensive optimizations applied—including read/write latency reductions, migration pipelines, fetcher isolation, SSD caching, RAID acceleration, cgroup isolation, full‑link monitoring, service lifecycle management, and TOR disaster recovery—to improve reliability and prepare for future growth.

Latency Reduction · Meituan · cluster management

0 likes · 21 min read

Java High-Performance Architecture

Oct 11, 2022 · Operations

How Meituan Optimized Kafka for Massive Scale: Reducing Latency and Managing Clusters

This article details Meituan's real‑world challenges with a 15,000‑node Kafka deployment and explains the application‑layer and system‑layer optimizations—such as disk balancing, migration pipeline acceleration, fetcher isolation, RAID acceleration, cgroup isolation, and an SSD‑based cache—that together dramatically cut read/write latency and simplify large‑scale cluster management.

Large Scale · Meituan · Optimization

0 likes · 23 min read

Code Ape Tech Column

Sep 24, 2022 · Operations

Overview of Redis Monitoring, Data Migration, and Cluster Management Tools

This article introduces essential Redis operational tools, covering real‑time monitoring with the INFO command and Prometheus‑exporter, data migration using Redis‑shake, consistency checking via Redis‑full‑check, and cluster management through CacheCloud, providing practical guidance for administrators.

Data Migration · Operations · cluster management

0 likes · 10 min read

Aikesheng Open Source Community

Sep 22, 2022 · Databases

Using Orchestrator for Automatic MySQL Cluster Failover: Configuration and Test Cases

This article demonstrates how to configure the open-source Orchestrator tool for automatic MySQL cluster failover, explains key parameters, and presents three test cases covering normal failover, lag‑induced prevention, and the effect of disabling global recoveries.

Database operations · High Availability · Orchestrator

0 likes · 6 min read

vivo Internet Technology

Sep 14, 2022 · Big Data

Exploring and Practicing Apache Pulsar at vivo: Cluster Management, Monitoring, and Optimization

The vivo big‑data team details how they migrated massive real‑time workloads from Kafka to Apache Pulsar, describing cluster‑level bundle and ledger management, retention policies, a Prometheus‑Kafka‑Druid monitoring pipeline, load‑balancing tweaks, client tuning, rapid broker‑failure recovery, and future cloud‑native tracing and migration plans.

Apache Pulsar · Big Data · cluster management

0 likes · 19 min read

Python Crawling & Data Mining

Aug 12, 2022 · Big Data

Master the Big Data Ecosystem: 9 Core Technology Frameworks Explained

This article provides a comprehensive overview of the big data ecosystem, detailing nine essential technology categories—including data collection, storage, computation, analysis, resource management, retrieval, underlying infrastructure, and cluster installation—while comparing popular tools and illustrating their typical use‑cases with diagrams.

cluster management · data collection · data storage

0 likes · 11 min read

Meituan Technology Team

Aug 11, 2022 · Cloud Native

LAR: Load Auto-Regulator System for Resource Utilization and Service Quality

The article analyzes Meituan’s self‑designed Load Auto‑Regulator (LAR), detailing its tiered resource‑pool architecture, dynamic load‑to‑static‑resource mapping, and QoS mechanisms that together raise data‑center CPU utilization by 5‑10% while keeping online service quality stable, and discusses its deployment in online and mixed‑workload scenarios.

Cloud Native · QoS · Resource Scheduling

0 likes · 28 min read

Meituan Technology Team

Aug 4, 2022 · Big Data

Optimizing Kafka for Large-Scale Data Platforms at Meituan

The article details Meituan's massive Kafka deployment—over 15,000 machines handling more than 30 PB of daily data—its performance and management challenges, and the comprehensive application‑layer, system‑layer, and hybrid‑layer optimizations Meituan implemented to reduce read/write latency and improve large‑scale cluster reliability.

Data Platform · Full‑Link Monitoring · Meituan

0 likes · 25 min read

NetEase Game Operations Platform

Jun 10, 2022 · Databases

Apache Doris Deployment and Optimization at NetEase Interactive Entertainment

This article details NetEase Interactive Entertainment's adoption of Apache Doris for large‑scale game data analytics, covering background, Doris architecture, cluster governance, tablet and compaction tuning, scaling strategies, monitoring, alerting, and fault‑handling practices to improve performance and stability.

Apache Doris · Big Data · Compaction Tuning

0 likes · 22 min read

vivo Internet Technology

Jun 8, 2022 · Cloud Native

Vivo’s Large‑Scale Kubernetes Operator Practice for Multi‑Data‑Center Cluster Management

Vivo replaced error‑prone manual Ansible playbooks with a custom Kubernetes Operator that uses declarative CRDs and modular Ansible scripts to automate the full lifecycle—deployment, scaling, upgrades, and recovery—of thousands of nodes across multiple data‑centers, supported by extensive CI testing and future kubeadm integration.

Ansible · CI/CD · Cloud Native

0 likes · 14 min read

ITPUB

Jun 4, 2022 · Operations

Mastering Kafka Load Balancing with Cruise Control: From Manual Migration to Automated Optimization

This article explains why Kafka suffers from broker‑side load imbalance, walks through manual replica migration examples, and then details how Cruise Control automates load balancing, supports resource‑group targeting, leader‑replica dispersion, and provides step‑by‑step deployment instructions.

Cruise Control · Operations · Replica Migration

0 likes · 21 min read

vivo Internet Technology

May 31, 2022 · Big Data

Kafka Load Balancing and Cruise Control: Concepts, Manual Migration, and Deployment

Kafka’s server‑side load imbalance, caused by static replica placement on broker disks, makes manual replica migration infeasible at scale, but Cruise Control automates metric collection, analysis, and execution of fine‑grained rebalance plans—including broker de‑commissioning and leader dispersion—allowing large clusters to expand and operate efficiently.

Big Data · Cruise Control · Replica Migration

0 likes · 21 min read

dbaplus Community

May 12, 2022 · Big Data

How Bilibili Scaled Presto on Hadoop: Architecture, Optimizations, and Performance Gains

This article details Bilibili's end‑to‑end Presto on Hadoop architecture, covering the multi‑engine SQL stack, dispatcher routing, cluster scale, stability enhancements such as coordinator HA and real‑time punish, query limits, Hive UDF compatibility, insert‑overwrite support, Alluxio caching, multi‑datacenter routing, query result caching, the RaptorX local cache, JDK upgrades, dynamic filtering, and the future roadmap, and shows how these changes boosted query throughput and reduced latency.

Big Data · Hadoop · Performance Optimization

0 likes · 32 min read

Senior Brother's Insights

May 10, 2022 · Backend Development

Mastering Elasticsearch: Core Concepts, Architecture, and Performance Tuning

This comprehensive guide explains what Elasticsearch does, its underlying Lucene engine, core concepts like clusters, shards, replicas, mappings, and provides practical steps for installation, configuration, indexing, storage mechanics, and performance optimization.

Lucene · Sharding · cluster management

0 likes · 36 min read

Aikesheng Open Source Community

Apr 7, 2022 · Databases

TiDB 2.1.x to 4.0.13 Upgrade and Data Migration Guide

This article provides a comprehensive step‑by‑step guide for senior DBAs to upgrade an online TiDB 2.1.x cluster to version 4.0.13 via data migration, detailing environment assessment, configuration changes, component deployment, full and incremental data transfer, consistency verification, permission synchronization, and traffic switchover.

Ansible · Database Upgrade · TiDB

0 likes · 26 min read

IT Services Circle

Apr 3, 2022 · Cloud Native

Understanding Kubernetes Federation: kubefed and Karmada Multi‑Cluster Management

This article explains why Kubernetes single‑cluster scalability is limited to about 5,000 nodes, introduces the concept of multi‑cluster federation, compares the legacy kubefed project with the actively maintained Karmada solution, and shows how policies and replica‑scheduling enable flexible cross‑AZ deployments and failover.

Cloud Native · Federation · Karmada

0 likes · 13 min read

Open Source Linux

Mar 14, 2022 · Cloud Native

Mastering Kubernetes Hard‑Way Upgrade: From v1.22 to v1.23 Step‑by‑Step

This guide walks you through the hard‑way method for upgrading a Kubernetes cluster from version 1.22 to 1.23, covering prerequisites, master and worker node procedures, package handling, and verification steps to ensure a successful, fully controlled upgrade.

Hard Way · Linux · Upgrade

0 likes · 7 min read

Architect

Feb 6, 2022 · Big Data

Elasticsearch Overview: Architecture, Core Concepts, and Performance Optimization

This article provides a comprehensive introduction to Elasticsearch, covering data types, Lucene fundamentals, inverted indexes, cluster components, node roles, shard and replica mechanisms, mapping, installation, health monitoring, write path, storage strategies, segment management, refresh and translog processes, as well as practical performance and JVM tuning tips.

Distributed Search · Elasticsearch · Performance Optimization

0 likes · 37 min read

DataFunTalk

Feb 1, 2022 · Big Data

Kafka at Meituan: Practices, Challenges, and Optimizations for Large‑Scale Data Platforms

This article presents Meituan's large‑scale Kafka deployment, describing the current state and challenges of massive data ingestion, detailing latency‑reduction techniques, cluster‑level optimizations, SSD‑based caching, isolation strategies, full‑link monitoring, lifecycle management, and future directions for high availability.

Large-Scale Data · Meituan · Monitoring

0 likes · 22 min read

MaGe Linux Operations

Jan 28, 2022 · Cloud Native

Top 7 Kubernetes Management Tools to Simplify Cluster Operations

Discover the most popular Kubernetes management solutions—including K9s, Rancher, Dashboard, Helm, Kubespray, Lens, and WKSctl—detailing their features, deployment options, and how they streamline cluster monitoring, scaling, and security for cloud-native environments and improve operational efficiency.

Cloud Native · cluster management · devops

0 likes · 9 min read

Yiche Technology

Jan 11, 2022 · Databases

Elasticsearch Overview, Comparison, Maintenance Challenges, Deployment Strategies, and Automation Management Platform

This document provides a comprehensive technical overview of Elasticsearch, comparing it with Solr and ClickHouse, detailing common operational pain points and configuration solutions, describing containerized and ECK deployments, and outlining a company‑wide automation platform for cluster provisioning, monitoring, index and security management, with future directions for lifecycle and backup strategies.

Automation · cluster management · kubernetes

0 likes · 31 min read

21CTO

Jan 4, 2022 · Operations

Deploy Searchable Snapshots in Elasticsearch 7.14: A Complete Step‑by‑Step Guide

This article explains the principles behind Elasticsearch searchable snapshots, details the data tier model and node role optimizations, and provides a full practical walkthrough using ES 7.14.2, covering cluster setup, COS repository creation, ILM policy configuration, index templates, mounting strategies, and performance considerations.

Data Tier · Elasticsearch · ILM

0 likes · 15 min read

Efficient Ops

Jan 3, 2022 · Operations

Master Elasticsearch Cluster: Essential Commands for Health, Tasks, and Settings

This article explains how to manage Tencent Cloud Elasticsearch clusters by using key APIs to check health status, monitor pending tasks, retrieve metadata, view statistics, adjust shard allocation, modify cluster settings, and control tasks, providing practical command examples and detailed explanations for effective operations.

API · Settings · cluster management

0 likes · 19 min read

Zhengcaiyun Technology

Nov 9, 2021 · Cloud Native

Design and Usage of Clusterfile in Sealer for Cluster Configuration and Plugins

This article explains the design principles of Sealer's Clusterfile, details its configuration parameters, demonstrates how to inject additional settings and environment variables, and describes the supported plugins for customizing Kubernetes clusters, providing practical examples and code snippets.

Cloud Native · Clusterfile · Sealer

0 likes · 10 min read

Alibaba Cloud Native

Oct 29, 2021 · Cloud Native

Unified Management & Secure Governance for Alibaba Cloud ACK and On-Prem Kubernetes

This article explains how cloud‑native technologies enable a unified control plane for Alibaba Cloud ACK clusters and self‑built Kubernetes clusters, detailing the ACK registered‑cluster architecture, one‑way registration, non‑managed security mechanisms, step‑by‑step cluster onboarding, and consistent security governance across environments.

ACK · Cloud Native · cluster management

0 likes · 11 min read

Tencent Cloud Developer

Oct 8, 2021 · Operations

Unveiling Kafka’s Controller: Architecture, Election, and Monitoring Deep Dive

This article provides a comprehensive technical analysis of Kafka’s Controller component, covering its background, core responsibilities, data storage, election process, version‑specific improvements, monitoring techniques, and key source‑code excerpts to help engineers understand and manage Kafka clusters effectively.

Monitoring · Zookeeper · cluster management

0 likes · 27 min read

MaGe Linux Operations

Oct 5, 2021 · Cloud Native

Unlock Advanced kubectl Tricks for Faster Kubernetes Management

This article shares a collection of powerful kubectl commands and tips—including API debugging, status‑based pod filtering and deletion, node‑specific pod listing, distribution counting with awk, and proxy usage—to help experienced Kubernetes users work more efficiently and avoid manual API client coding.

CLI · cluster management · devops

0 likes · 7 min read

DevOps Cloud Academy

Sep 21, 2021 · Operations

Practical Elasticsearch Operations and Performance Tuning Guide

This article extends previous Elasticsearch cheat sheets with practical commands and step‑by‑step instructions for shard allocation, replica adjustment, cluster settings, slow‑log configuration, mapping routing, force merge, bulk writes, refresh intervals, translog durability, heap sizing, disk‑space monitoring, and troubleshooting strategies.

Elasticsearch · Operations · Performance Tuning

0 likes · 7 min read

Selected Java Interview Questions

Sep 7, 2021 · Big Data

Elasticsearch Basics: Core Concepts, Indexing, Write and Search Processes, Cluster Management and Performance Tips

This article provides a comprehensive overview of Elasticsearch, covering its fundamental architecture, key concepts such as indices, shards and replicas, the complete write and search workflows, consistency mechanisms, master node election, and practical performance‑tuning recommendations for large‑scale deployments.

Big Data · Elasticsearch · Indexing

0 likes · 15 min read

Tencent Cloud Developer

Aug 26, 2021 · Big Data

Recap of Shenzhen Elasticsearch Meetup – Community Growth, Compression Optimization, Real‑time Data Fusion, and Cluster Practices

The first Shenzhen Elasticsearch meetup on August 21, 2021, jointly hosted by the ES Chinese community and Tencent Cloud, gathered experts from Tencent, Tapdata, ByteDance and Vivo to showcase rapid community growth, compression‑encoding optimizations, real‑time ES‑MongoDB data fusion, custom kernel extensions, large‑scale cluster practices, and concluded with extensive Q&A and networking.

Big Data · Elasticsearch · Real-time Data Fusion

0 likes · 11 min read

Senior Brother's Insights

Jul 28, 2021 · Operations

How Zookeeper Prevents Split‑Brain: Inside Quorum‑Based Leader Election

This article explains the split‑brain phenomenon in distributed clusters, uses Zookeeper as a case study to illustrate how network partitions can create multiple leaders, and details Zookeeper's majority‑quorum mechanism, node count considerations, and common strategies for avoiding split‑brain scenarios.

Leader Election · Split-Brain · Zookeeper

0 likes · 13 min read

Sohu Tech Products

Jul 14, 2021 · Cloud Native

Limitations and Challenges of Kubernetes in Cluster Management and Application Scenarios

The article examines Kubernetes' widespread adoption, outlines its scalability and multi‑cluster management constraints, discusses practical application scenarios such as deployment models, batch scheduling, and hard multi‑tenancy, and highlights the gaps that still limit its use in large‑scale production environments.

Cloud Native · Multi-Cluster · cluster management

0 likes · 21 min read
