Tagged articles

cluster management

188 articles
Tencent Architect
Jun 16, 2026 · Operations

Open‑Source OCManager: A Smart Manager that Handles 7 Million Daily Alerts

OCManager, an open‑source integrated platform from OpenCloudOS, unifies cluster management, whole‑machine monitoring, and AI‑driven operations in a single web console, supporting millions of daily alerts, thousands of incidents, and multi‑OS environments with a four‑layer architecture and Docker‑based deployment.

AI Ops · Docker · Monitoring
0 likes · 15 min read
Raymond Ops
Jun 2, 2026 · Cloud Native

200+ Essential kubectl Commands for Managing and Troubleshooting Kubernetes Clusters

This guide compiles over 200 practical kubectl commands, covering cluster setup, context switching, resource inspection, workload management, networking, storage, security hardening, high‑availability patterns, troubleshooting techniques, and performance monitoring to help operators efficiently administer Kubernetes environments.

Cloud Native · cluster management · devops
0 likes · 39 min read
Raymond Ops
Dec 27, 2025 · Cloud Native

15 Powerful kubectl Tricks to Master Kubernetes Management

Learn 15 practical kubectl techniques—from resource shortcuts and context switching to advanced JSONPath queries, custom output formats, and efficient alias configurations—that enable Kubernetes administrators to streamline cluster management, improve debugging, and boost operational productivity.

CLI · cluster management · devops
0 likes · 12 min read
Linux Cloud Computing Practice
Dec 5, 2025 · Operations

Essential Ceph Command Cheat Sheet for Cluster Management

This guide provides a concise collection of essential Ceph commands for starting services, checking health and status, managing monitors, metadata servers, and OSDs, as well as creating admin users, purging nodes, and handling crush maps, enabling administrators to efficiently operate and troubleshoot a Ceph storage cluster.

Ceph · Linux · Operations
0 likes · 6 min read
Rare Earth Juejin Tech Community
Nov 19, 2025 · Backend Development

Master Elasticsearch: Index Design, Field Types, and Cluster Management Tips

An experienced engineer shares practical Elasticsearch insights covering index design with aliases and routing, field type choices, query optimization techniques, pagination strategies, real‑time refresh settings, memory limits, and cluster management, offering concrete examples and actionable recommendations for robust search implementations.

Elasticsearch · Query Optimization · cluster management
0 likes · 12 min read
DevOps Coach
Oct 28, 2025 · Cloud Native

20 Essential Kubernetes Tips to Boost Security, Reliability, and Manageability

This guide presents twenty practical Kubernetes best‑practice tips covering productivity shortcuts, resource limits, health probes, node draining, PodDisruptionBudgets, RBAC hardening, read‑only ConfigMaps/Secrets, non‑root containers, network policies, image version pinning, secret rotation, centralized logging, etcd backups, resource cleanup, and secure access methods.

Reliability · best practices · cluster management
0 likes · 8 min read
Ray's Galactic Tech
Sep 20, 2025 · Operations

How to Safely Upgrade a ZooKeeper Node’s IP Without Disrupting the Cluster

This guide explains why changing a ZooKeeper node’s IP requires updating the configuration on all members, then walks through a step‑by‑step procedure—including stopping the target node, editing zoo.cfg on every server, restarting the remaining nodes, and verifying the quorum—plus best‑practice tips for Kubernetes deployments.

IP upgrade · cluster management · kubernetes
0 likes · 7 min read
Mingyi World Elasticsearch
Aug 30, 2025 · Operations

INFINI Console FAQ: Enterprise‑Grade Unified Elasticsearch Management

The article introduces INFINI Console, an open‑source, lightweight platform for unified, multi‑cluster and cross‑version Elasticsearch governance, compares it with Kibana, details deployment options, enterprise‑level features such as monitoring, alerting and security, and analyzes cost advantages and practical migration scenarios.

Elasticsearch · INFINI Console · Monitoring
0 likes · 13 min read
DataFunSummit
Aug 28, 2025 · Artificial Intelligence

How We Scaled AI Compute to Millions of Nodes with Ray on WeChat

This article explains how Tencent's WeChat team built the Astra platform on Ray to manage millions of AI compute nodes, addressing challenges of massive scale, heterogeneous GPU resources, low‑priority node instability, deployment complexity, and cost, while detailing architecture, scheduling strategies, and practical usage examples.

AI scaling · Distributed Computing · Ray
0 likes · 21 min read
360 Zhihui Cloud Developer
Aug 6, 2025 · Cloud Native

Step‑by‑Step Rancher Deployment for Multi‑Cluster Kubernetes Management

This guide explains the background of multi‑IDC Kubernetes clusters, why a unified platform like Rancher is needed, and provides detailed step‑by‑step instructions for single‑node, high‑availability RKE, lightweight K3s deployments, Helm installation, cert‑manager setup, ingress configuration, and best‑practice recommendations.

HA deployment · RKE · cluster management
0 likes · 12 min read
MaGe Linux Operations
Jul 21, 2025 · Cloud Native

Master Kubernetes with Essential Commands: Efficient Container Cluster Management

This comprehensive guide walks operations engineers through essential Kubernetes commands, covering cluster inspection, pod lifecycle, service and network handling, storage configuration, troubleshooting, performance monitoring, scaling, security, and automation, enabling efficient and expert management of containerized clusters.

Operations · cluster management · kubectl
0 likes · 17 min read
Raymond Ops
Jul 19, 2025 · Cloud Native

Step-by-Step Guide to Upgrading Kubernetes Nodes to v1.15.12

This tutorial walks you through downloading the latest Kubernetes packages, preparing master and node services, adjusting nginx proxy settings, cordoning and draining nodes, replacing binaries and certificates, restarting services, and verifying the upgrade across a two‑node cluster.

NGINX · Node Upgrade · cluster management
0 likes · 13 min read
Raymond Ops
Jun 19, 2025 · Operations

Master Kubernetes Cluster Management: Essential kubectl Commands Explained

This guide walks you through essential kubectl commands for viewing cluster status, inspecting resources, creating and modifying objects, labeling, annotating, and launching pods, providing practical examples and command syntax to help you manage Kubernetes clusters effectively.

cluster management · devops · kubectl
0 likes · 14 min read
Mingyi World Elasticsearch
Jun 18, 2025 · Operations

Comprehensively Manage Elasticsearch 9.X with INFINI Console

The article provides a detailed technical overview of INFINI Console, an open‑source, lightweight governance platform that enables multi‑cluster, cross‑version management, dynamic registration, monitoring, alerting, and developer tools for Elasticsearch 9.X, comparing it with Kibana and highlighting deployment simplicity across various OS and CPU architectures.

Cross-Version Support · Elasticsearch · INFINI Console
0 likes · 11 min read
DevOps Operations Practice
Jun 16, 2025 · Cloud Native

Mastering Kubernetes: 6 Essential Tools for Cluster Management

This article introduces six indispensable tools—kubectl, Helm, Prometheus + Grafana, Istio, Velero, and K9s—that simplify Kubernetes cluster management by covering resource handling, monitoring, networking, security, backup, and interactive UI, helping readers efficiently operate production‑grade clusters.

Cloud Native · Monitoring · cluster management
0 likes · 7 min read
Mingyi World Elasticsearch
Jun 4, 2025 · Operations

When Should You Deploy Dedicated Coordinating Nodes in Elasticsearch?

The article explains what Elasticsearch coordinating nodes are, why dedicated coordinating‑only nodes can off‑load HTTP handling from data and master nodes to reduce load, lower latency and simplify client configuration, and outlines the associated hardware and cluster‑state costs, usage scenarios, deployment steps and monitoring tips.

Coordinating Node · Elasticsearch · Node Roles
0 likes · 12 min read
Efficient Ops
May 12, 2025 · Cloud Native

Master Kubernetes Management with Kuboard: Visual UI Guide & Installation

Kuboard is a web‑based visual tool for managing Kubernetes clusters, offering multi‑auth, multi‑cluster support, micro‑service layering, and storage integration; the guide explains Docker installation, adding clusters via KubeConfig, workload inspection, and how the UI simplifies complex command‑line operations.

Cloud Native · Docker · cluster management
0 likes · 5 min read
Linux Cloud Computing Practice
Apr 10, 2025 · Cloud Computing

Unlock Scalable, Reliable Storage: A Complete Guide to Deploying Ceph

This article provides a comprehensive overview of Ceph distributed storage, covering storage fundamentals, Ceph architecture, advantages, version lifecycle, and step‑by‑step deployment using ceph‑deploy, including environment preparation, monitor and OSD setup, manager configuration, and dashboard activation.

Ceph · Distributed storage · Linux Deployment
0 likes · 28 min read
Tencent Cloud Middleware
Apr 9, 2025 · Operations

How TDMQ Pulsar’s Cluster‑Level and Topic‑Partition Throttling Keeps Your Messaging System Stable

This article explains why high‑throughput producers and consumers can saturate CPU, memory, network and disk I/O in TDMQ Pulsar clusters, describes the built‑in cluster‑level distributed and topic‑partition rate‑limiting mechanisms, and provides practical guidance for configuration, monitoring, and troubleshooting.

Message Queue · Operations · Pulsar
0 likes · 12 min read
Raymond Ops
Mar 30, 2025 · Operations

Mastering Elasticsearch Data Sync and Cluster Architecture: 3 Strategies Explained

This article explains three Elasticsearch data‑synchronization methods, compares their pros and cons, and then dives into ES cluster structure, node roles, shard allocation, distributed queries, split‑brain handling, and fault‑tolerance mechanisms, providing a comprehensive guide for developers and ops engineers.

Data synchronization · Elasticsearch · cluster management
0 likes · 9 min read
Mingyi World Elasticsearch
Mar 29, 2025 · Operations

How to Reset a Forgotten INFINI Console Password

The article explains two ways to recover access to INFINI Console when the password is lost: locating the original console_configuration.json file to retrieve the stored credentials, or using the built‑in Reset Password feature in the user management UI, with step‑by‑step instructions and screenshots.

Configuration File · INFINI Console · admin guide
0 likes · 5 min read
Cloud Native Technology Community
Mar 18, 2025 · Cloud Native

Best Practices for Managing Core Services in Large‑Scale Kubernetes Deployments

Scaling Kubernetes across dozens or hundreds of clusters requires standardized core services—networking, security, observability, and automation—so organizations should adopt templated configurations, GitOps tools, centralized monitoring, and automated certificate management to reduce complexity, improve security, and lower operational overhead.

Automation · GitOps · Observability
0 likes · 8 min read
dbaplus Community
Feb 13, 2025 · Databases

Automating Redis Resource Balancing to Cut DBA Effort

To handle growing memory pressure across thousands of Redis servers, the platform implements an automated, daily resource‑balancing scheduler that selects overloaded hosts, chooses optimal nodes based on instance count, tier, and placement rules, then safely migrates them through a multi‑step process with rigorous validation.

Automation · Database operations · Redis
0 likes · 14 min read
Mingyi World Elasticsearch
Feb 11, 2025 · Operations

How to Ace the Elastic Certified Engineer Exam: Full 8.15 Syllabus Breakdown and Fast‑Track Tips

This guide dissects the Elastic Certified Engineer 8.15 exam syllabus, explains each core topic—from searchable snapshots and async search to ILM policies and cross‑cluster replication—while offering a step‑by‑step study roadmap, hands‑on lab ideas, and resource recommendations to help candidates pass efficiently.

8.15 · Elastic Certified Engineer · Elasticsearch
0 likes · 6 min read
Architect
Dec 27, 2024 · Big Data

Fault Self‑Healing System for Large‑Scale Big Data Clusters

This article describes the design, architecture, and technical implementation of BMR's fault self‑healing platform, which automatically collects data, analyzes failures, defines decision rules, and executes safe recovery workflows to improve reliability and efficiency of massive, heterogeneous big‑data environments.

Big Data · Monitoring · cluster management
0 likes · 16 min read
Bilibili Tech
Dec 10, 2024 · Big Data

Fault Self‑Healing System for Bilibili's Large‑Scale Big Data Cluster (BMR)

Bilibili's fault‑self‑healing platform for its massive BMR big‑data cluster—over 10,000 machines and 1 EB storage—adds near‑real‑time fault discovery, intelligent diagnosis, and automated workflow handling, dramatically cutting resolution time, improving stability across services, and scaling to dozens of daily automated repairs.

BMR · cluster management · fault self-healing
0 likes · 16 min read
System Architect Go
Nov 6, 2024 · Cloud Native

How Kubernetes Extended Resources Enable Custom Scheduling (and Their Limits)

This article explains how Kubernetes Extended Resources let you define custom resource types, describes the creation, synchronization, and scheduling workflow, highlights the non‑real‑time allocatable status behavior, and discusses practical limitations and the role of Device Plugins and Operators.

Custom Scheduling · Device Plugin · Extended Resource
0 likes · 6 min read
Bilibili Tech
Oct 29, 2024 · Big Data

Bilibili One‑Stop Big Data Cluster Management Platform (BMR): Architecture, Modules, and Future Outlook

Bilibili's One‑Stop Big Data Cluster Management Platform (BMR) unifies cluster, metadata, intelligent operations, and custom managers to oversee 50+ services, 10,000 machines, exabyte storage, and millions of cores, using cloud‑native containers, fault prediction, and resource‑sharing techniques to boost efficiency, stability, and cost savings.

BMR · Intelligent Operations · Metadata Warehouse
0 likes · 17 min read
Baidu Geek Talk
Oct 9, 2024 · Artificial Intelligence

How Baidu’s Baige 4.0 Architecture Redefines AI Compute Efficiency

This article analyzes Baidu's Baige 4.0 AI infrastructure, detailing its four‑layer architecture, XMAN 5.0 hardware, HPN network, BCCL communication library, and AIAK inference upgrades, and explains how these innovations address large‑model training and inference challenges while boosting performance, utilization, and cost efficiency.

AI Infrastructure · GPU Acceleration · High-performance computing
0 likes · 16 min read
Architects' Tech Alliance
Sep 12, 2024 · Industry Insights

Managing and Optimizing Large‑Scale AI Compute Clusters: Practical Insights

This article examines the key pain points of massive AI compute clusters—including heterogeneous hardware compatibility, efficient scheduling, training and inference acceleration, and fault‑tolerant operations—while presenting practical management and performance‑tuning strategies, a cloud‑native AI platform implementation, and future directions for the ecosystem.

AI computing · Operations · Performance Tuning
0 likes · 7 min read
Mike Chen's Internet Architecture
Aug 29, 2024 · Cloud Native

Mastering Kubernetes: Core Concepts, Architecture, and Real‑World Use Cases

This article provides a comprehensive overview of Kubernetes (K8S), covering its origins, key problems it solves, master‑node architecture, core components such as kube‑apiserver, scheduler, controllers, node agents, and practical applications like CI/CD integration, multi‑tenant and micro‑service deployments.

CI/CD · Cloud Native · Container Orchestration
0 likes · 9 min read
Rare Earth Juejin Tech Community
Aug 6, 2024 · Operations

ZooKeeper Core Concepts: Data Model, Node Types, Sessions, Cluster, Election, ZAB, Watch, ACL, and Distributed Lock Patterns

This article explains ZooKeeper's hierarchical data model, node types, session mechanism, cluster roles and election process, ZAB protocol, watch mechanism, ACL permissions, and common distributed lock implementations, providing a comprehensive overview of its core concepts and practical usage.

ACL · Coordination Service · Distributed Lock
0 likes · 17 min read
Bilibili Tech
Jul 19, 2024 · Big Data

Bilibili's One-Stop Big Data Cluster Management Platform (BMR) - Architecture and Implementation

Bilibili’s one‑stop Big Data Cluster Management Platform (BMR) consolidates HDFS, Spark, Flink, ClickHouse, Kafka and other services into a unified system that evolved through four stages—standardization, metadata‑driven construction, containerization, and observability—addressing node consistency, scaling, fault self‑healing, and resource optimization while delivering elastic scaling, automated start/stop, and future cost‑saving and stability enhancements.

Observability · big data platform · cluster management
0 likes · 12 min read
DevOps Cloud Academy
Jun 18, 2024 · Operations

Essential kubectl Commands for DevOps Engineers

This guide presents a comprehensive collection of the most important and frequently used kubectl commands, explaining how to retrieve version information, manage clusters, list resources, manipulate contexts, create, update, patch, scale, expose, delete, and debug Kubernetes objects, as well as format output and control verbosity, enabling DevOps engineers to efficiently operate Kubernetes clusters.

CLI · cluster management · devops
0 likes · 14 min read
Baidu Geek Talk
Apr 24, 2024 · Industry Insights

How Baidu’s New AI OS “WanYuan” Redefines Intelligent Computing

At the Create 2024 Baidu AI Developer Conference, Baidu unveiled its next‑generation intelligent computing operating system WanYuan, detailing its cluster‑scale management, GPU‑centric performance, integrated large‑model services, and a layered architecture that aims to simplify AI‑native application development and accelerate the AI era.

AI · Baidu · Cloud Computing
0 likes · 12 min read
Practical DevOps Architecture
Apr 18, 2024 · Cloud Native

Kubernetes Source Code Deep Dive and Secondary Development Course Outline

This curriculum provides a comprehensive, step‑by‑step exploration of Kubernetes internals—including kubeadm core source, Go module management, cobra libraries, kubeadm init/join processes, client‑go components, code generators, custom resources, operators, and practical deployment automation—aimed at mastering cluster setup, configuration, and advanced development.

Go · Source Code · client-go
0 likes · 10 min read
NewBeeNLP
Mar 8, 2024 · Industry Insights

Why Building LLMs Is Like Buying a Hardware Lottery – Lessons from a Startup

The article recounts Yi Tay’s experience founding Reka and building large language models from scratch, highlighting the unpredictable quality of GPU clusters, the challenges of multi‑cluster orchestration, code‑base choices, and how startups must rely on fast, intuition‑driven experimentation to succeed.

GPU · Hardware · LLM
0 likes · 12 min read
dbaplus Community
Feb 26, 2024 · Cloud Native

10 Hard‑Earned Lessons from 3 Years Managing Kubernetes Clusters

After three years of hands‑on Kubernetes administration, the author shares ten practical lessons covering cloud‑hosted clusters, infrastructure‑as‑code, Helm chart usage, service mesh decisions, tool selection, resource limits, stateless design, HPA configuration, and upgrade strategies to help both newcomers and seasoned engineers manage clusters effectively.

Cloud Native · best practices · cluster management
0 likes · 7 min read
Ops Development Stories
Feb 2, 2024 · Cloud Native

Essential kubectl Commands for Efficient Kubernetes Management

This guide compiles a comprehensive set of kubectl and Docker commands for retrieving logs, sorting pods, managing secrets, cleaning resources, debugging, port forwarding, and performing cluster maintenance tasks, helping administrators streamline Kubernetes operations and troubleshoot issues effectively.

CLI · Cloud Native · cluster management
0 likes · 15 min read
Didi Tech
Jan 9, 2024 · Big Data

Introducing Apache Pulsar: Technical Benefits and Solutions for Didi Big Data Messaging System

Apache Pulsar, a cloud‑native distributed messaging platform, solves Didi Big Data’s DKafka bottlenecks by separating compute and storage, using sequential log writes, heterogeneous disks, multi‑level caching, bundle‑based load balancing and automatic scaling, dramatically improving stability while introducing richer monitoring complexity.

Apache Pulsar · DKafka · Messaging System
0 likes · 17 min read
dbaplus Community
Dec 20, 2023 · Operations

Scaling Kafka to 1000+ Nodes: Governance, Auto‑Balancing & Tiered Storage

This article outlines how a large‑scale Kafka deployment of over a thousand machines across dozens of clusters was engineered for stability and efficiency through a custom Guardian controller that adds partition‑level throttling, automatic balancing, multi‑tenant isolation, cross‑IDC management, tiered storage, audit capabilities, and fully automated operational workflows.

Monitoring · Multi‑tenant · Operations
0 likes · 21 min read
WeiLi Technology Team
Nov 1, 2023 · Big Data

How to Diagnose and Resolve HDFS Safe Mode Issues

This guide explains why HDFS enters safe mode after a DataNode failure, describes the safe‑mode state and its exit conditions, and provides step‑by‑step commands and troubleshooting procedures to analyze, fix, and recover from safe‑mode incidents in Hadoop clusters.

Big Data · HDFS · Hadoop
0 likes · 10 min read
Efficient Ops
Sep 17, 2023 · Cloud Native

Top 9 Essential Kubernetes Tools to Streamline Your Cloud‑Native Workflows

Explore nine indispensable Kubernetes tools—including Kubie, Kubespray, Helm, Minikube, K3s, Kustomize, KOps, Prometheus, and krew—that simplify cluster management, accelerate deployments, and enhance efficiency, helping you choose the right solution for smoother, more productive cloud‑native operations.

cloud-native · cluster management · devops tools
0 likes · 6 min read
Liangxu Linux
Jul 2, 2023 · Cloud Native

Mastering kubectl: Essential Commands for Kubernetes Management

This guide explains what kubectl is, how it interacts with the Kubernetes API server, and provides a categorized list of essential commands for retrieving information, debugging, state management, scaling, deployment, and security, helping users efficiently operate and automate K8s clusters.

Cloud Native · cluster management · devops
0 likes · 5 min read
Test Development Learning Exchange
Jun 29, 2023 · Cloud Native

Essential Kubernetes Commands for Testers: 50 Commands with Practical Examples

This article presents a comprehensive collection of 50 essential kubectl commands covering cluster, namespace, pod, deployment, service, ConfigMap, secret, volume, logging, debugging, scaling, configuration, and cleanup operations, providing testers with practical examples to efficiently manage and troubleshoot Kubernetes environments.

cluster management · kubectl · testing
0 likes · 9 min read
High Availability Architecture
May 26, 2023 · Big Data

Amiya: Dynamic Overcommit Component for Bilibili Offline Big Data Cluster Resource Scheduling

This article introduces Amiya, a self‑developed overcommit component that dynamically increases Yarn memory and vCore capacity on Bilibili's offline big‑data clusters, details its architecture, key implementation of overcommit, eviction and mixed‑deployment strategies, and evaluates its resource‑utilization impact.

Overcommit · Resource Scheduling · YARN
0 likes · 22 min read
Bilibili Tech
May 23, 2023 · Big Data

Amiya: Dynamic Overcommit Component for Bilibili Offline Big Data Cluster

Amiya, a self‑developed dynamic over‑commit component for Bilibili’s offline big‑data cluster, inflates reported resources on under‑utilized nodes and adjusts them when load rises, adding roughly 683 TB of memory and 137 k vCores, boosting per‑node memory by 15 % and CPU usage by over 20 % while keeping eviction rates below 3 %.

Amiya · Bilibili · Resource Overcommit
0 likes · 22 min read
Cloud Native Technology Community
May 17, 2023 · Cloud Native

Why Do You Need Kubernetes Multi‑Cluster? Core Challenges and Design Principles

This article explains the motivations behind Kubernetes multi‑cluster deployments, outlines common use cases such as isolation and high‑availability, and analyzes core management elements including deployment models, control‑plane architectures, network connectivity, service discovery, cross‑cluster scheduling, application model extensions, and treating clusters as resources.

Cloud Native · Multi-Cluster · Network Model
0 likes · 23 min read
ITPUB
Apr 5, 2023 · Operations

Automating TiDB Operations: From Manual Pain Points to a Scalable Platform

This article details how Zhaozhuan's DBA team transformed TiDB cluster management by addressing metadata, resource allocation, upgrade, and alert challenges through a comprehensive automation platform that streamlines work orders, node operations, scaling, monitoring, and alert handling, ultimately reducing manual effort and improving reliability.

Alerting · Database Automation · TiDB
0 likes · 22 min read
MaGe Linux Operations
Mar 30, 2023 · Cloud Native

Why Is Kubernetes So Hard to Master? A Beginner’s Q&A Guide

This article explains the core concepts of Kubernetes—including its architecture, node communication, pod scheduling, data storage, service exposure, scaling, and controller coordination—through a series of clear questions and answers, helping beginners grasp why the platform feels complex.

Cloud Native · Container Orchestration · Pod Scheduling
0 likes · 9 min read
Architecture Digest
Mar 20, 2023 · Cloud Native

Kubernetes: What It Is and Why It’s Hard to Get Started

This article provides a concise, question‑and‑answer overview of Kubernetes, explaining its role as a distributed container‑orchestration system, the architecture of master and worker nodes, core components such as etcd, kube‑apiserver, scheduler, controllers, and how services, pods, labels, and scaling operate within a cluster.

Cloud Native · Container Orchestration · Controllers
0 likes · 8 min read
21CTO
Feb 10, 2023 · Cloud Native

Why Kubernetes Is So Hard to Master: A Beginner’s Q&A Walkthrough

This article introduces Kubernetes fundamentals through a series of questions and answers, covering its architecture, node communication, pod scheduling, data storage, external access, scaling mechanisms, and component coordination, all illustrated with clear diagrams.

Containers · Pod Scheduling · cluster management
0 likes · 9 min read
Top Architect
Feb 7, 2023 · Cloud Native

Understanding Kubernetes: Core Concepts and Architecture

This article provides a concise, question‑driven overview of Kubernetes, covering its architecture, node and master communication, pod fundamentals, scheduling, storage via etcd, service exposure, scaling mechanisms, and the roles of core components such as kube‑apiserver, kubelet, kube‑proxy and controllers.

Cloud Native · Containers · Microservices
0 likes · 9 min read
Open Source Linux
Dec 30, 2022 · Operations

Top 7 Kubernetes Management Tools to Simplify Cluster Operations

This article introduces seven popular Kubernetes management solutions—including K9s, Rancher, the native Dashboard with kubectl and kubeadm, Helm, Kubespray, Kontena Lens, and WKSctl—detailing their key features, usage scenarios, and how they help streamline cluster monitoring, deployment, scaling, and security across cloud‑native environments.

Operations · cluster management · devops
0 likes · 9 min read
Architecture Digest
Nov 30, 2022 · Backend Development

Meituan Kafka at Scale: Challenges and Optimizations for Latency, Cluster Management, and Reliability

This article details Meituan's large‑scale Kafka deployment—over 15,000 machines and petabyte‑level daily traffic—its operational challenges such as slow nodes, load imbalance, and resource contention, and the comprehensive read/write latency, system‑level, and cluster‑management optimizations implemented to improve performance and reliability.

Performance Optimization · cluster management · distributed systems
0 likes · 22 min read
ITPUB
Nov 23, 2022 · Backend Development

How Zookeeper Elects Its Leader: A Human Election Analogy Explained

This article explains Zookeeper's leader election mechanism by comparing it to human voting, detailing the four core concepts, the role of zxid, the step‑by‑step process during startup and runtime failures, and the key terms every interviewee should know.

Backend Development · Leader Election · cluster management
0 likes · 11 min read
Java Architect Essentials
Nov 11, 2022 · Big Data

Meituan Kafka at Scale: Challenges and Optimizations for Latency, Cluster Management, and Reliability

This article details Meituan's large‑scale Kafka deployment, describing the current state, performance challenges such as slow nodes and disk imbalance, and the comprehensive optimizations applied—including read/write latency reductions, migration pipelines, fetcher isolation, SSD caching, RAID acceleration, cgroup isolation, full‑link monitoring, service lifecycle management, and TOR disaster recovery—to improve reliability and prepare for future growth.

Latency Reduction · Meituan · cluster management
0 likes · 21 min read
Java High-Performance Architecture
Oct 11, 2022 · Operations

How Meituan Optimized Kafka for Massive Scale: Reducing Latency and Managing Clusters

This article details Meituan's real‑world challenges with a 15,000‑node Kafka deployment and explains the application‑layer and system‑layer optimizations—such as disk balancing, migration pipeline acceleration, fetcher isolation, RAID acceleration, cgroup isolation, and an SSD‑based cache—that together dramatically cut read/write latency and simplify large‑scale cluster management.

Large Scale · Meituan · Optimization
0 likes · 23 min read
Code Ape Tech Column
Sep 24, 2022 · Operations

Overview of Redis Monitoring, Data Migration, and Cluster Management Tools

This article introduces essential Redis operational tools, covering real‑time monitoring with the INFO command and Prometheus‑exporter, data migration using Redis‑shake, consistency checking via Redis‑full‑check, and cluster management through CacheCloud, providing practical guidance for administrators.

Data Migration · Operations · cluster management
0 likes · 10 min read
vivo Internet Technology
Sep 14, 2022 · Big Data

Exploring and Practicing Apache Pulsar at vivo: Cluster Management, Monitoring, and Optimization

The vivo big‑data team details how they migrated massive real‑time workloads from Kafka to Apache Pulsar, describing cluster‑level bundle and ledger management, retention policies, a Prometheus‑Kafka‑Druid monitoring pipeline, load‑balancing tweaks, client tuning, rapid broker‑failure recovery, and future cloud‑native tracing and migration plans.

Apache Pulsar · Big Data · cluster management
0 likes · 19 min read
Python Crawling & Data Mining
Aug 12, 2022 · Big Data

Master the Big Data Ecosystem: 9 Core Technology Frameworks Explained

This article provides a comprehensive overview of the big data ecosystem, detailing nine essential technology categories—including data collection, storage, computation, analysis, resource management, retrieval, underlying infrastructure, and cluster installation—while comparing popular tools and illustrating their typical use‑cases with diagrams.

cluster management · data collection · data storage
0 likes · 11 min read
Meituan Technology Team
Aug 11, 2022 · Cloud Native

LAR: Load Auto-Regulator System for Resource Utilization and Service Quality

The article analyzes Meituan’s self‑designed Load Auto‑Regulator (LAR), detailing its tiered resource‑pool architecture, dynamic load‑to‑static‑resource mapping, and QoS mechanisms that together raise data‑center CPU utilization by 5‑10% while keeping online service quality stable, and discusses its deployment in online and mixed‑workload scenarios.

Cloud Native · QoS · Resource Scheduling
0 likes · 28 min read
Meituan Technology Team
Aug 4, 2022 · Big Data

Optimizing Kafka for Large-Scale Data Platforms at Meituan

The article details Meituan's massive Kafka deployment—over 15,000 machines handling more than 30 PB of daily data—its performance and management challenges, and the comprehensive application‑layer, system‑layer, and hybrid‑layer optimizations Meituan implemented to reduce read/write latency and improve large‑scale cluster reliability.

Data Platform · Full‑Link Monitoring · Meituan
0 likes · 25 min read
NetEase Game Operations Platform
Jun 10, 2022 · Databases

Apache Doris Deployment and Optimization at NetEase Interactive Entertainment

This article details NetEase Interactive Entertainment's adoption of Apache Doris for large‑scale game data analytics, covering background, Doris architecture, cluster governance, tablet and compaction tuning, scaling strategies, monitoring, alerting, and fault‑handling practices to improve performance and stability.

Apache Doris · Big Data · Compaction Tuning
0 likes · 22 min read
vivo Internet Technology
Jun 8, 2022 · Cloud Native

Vivo’s Large‑Scale Kubernetes Operator Practice for Multi‑Data‑Center Cluster Management

Vivo replaced error‑prone manual Ansible playbooks with a custom Kubernetes Operator that uses declarative CRDs and modular Ansible scripts to automate the full lifecycle—deployment, scaling, upgrades, and recovery—of thousands of nodes across multiple data‑centers, supported by extensive CI testing and future kubeadm integration.

Ansible · CI/CD · Cloud Native
0 likes · 14 min read
vivo Internet Technology
May 31, 2022 · Big Data

Kafka Load Balancing and Cruise Control: Concepts, Manual Migration, and Deployment

Kafka’s server‑side load imbalance, caused by static replica placement on broker disks, makes manual replica migration infeasible at scale, but Cruise Control automates metric collection, analysis, and execution of fine‑grained rebalance plans—including broker de‑commissioning and leader dispersion—allowing large clusters to expand and operate efficiently.

Big Data · Cruise Control · Replica Migration
0 likes · 21 min read
dbaplus Community
May 12, 2022 · Big Data

How Bilibili Scaled Presto on Hadoop: Architecture, Optimizations, and Performance Gains

This article details Bilibili's end‑to‑end Presto on Hadoop architecture, covering the multi‑engine SQL stack, dispatcher routing, cluster scale, stability enhancements such as coordinator HA and real‑time punish, query limits, Hive UDF compatibility, insert‑overwrite support, Alluxio caching, multi‑datacenter routing, query result caching, the RaptorX local cache, JDK upgrades, dynamic filtering, and the future roadmap, and shows how these changes boosted query throughput and reduced latency.

Big Data · Hadoop · Performance Optimization
0 likes · 32 min read
Aikesheng Open Source Community
Apr 7, 2022 · Databases

TiDB 2.1.x to 4.0.13 Upgrade and Data Migration Guide

This article provides a comprehensive step‑by‑step guide for senior DBAs to upgrade an online TiDB 2.1.x cluster to version 4.0.13 via data migration, detailing environment assessment, configuration changes, component deployment, full and incremental data transfer, consistency verification, permission synchronization, and traffic switchover.

Ansible · Database Upgrade · TiDB
0 likes · 26 min read
IT Services Circle
Apr 3, 2022 · Cloud Native

Understanding Kubernetes Federation: kubefed and Karmada Multi‑Cluster Management

This article explains why Kubernetes single‑cluster scalability is limited to about 5,000 nodes, introduces the concept of multi‑cluster federation, compares the legacy kubefed project with the actively maintained Karmada solution, and shows how policies and replica‑scheduling enable flexible cross‑AZ deployments and failover.

Cloud Native · Federation · Karmada
0 likes · 13 min read
Architect
Feb 6, 2022 · Big Data

Elasticsearch Overview: Architecture, Core Concepts, and Performance Optimization

This article provides a comprehensive introduction to Elasticsearch, covering data types, Lucene fundamentals, inverted indexes, cluster components, node roles, shard and replica mechanisms, mapping, installation, health monitoring, write path, storage strategies, segment management, refresh and translog processes, as well as practical performance and JVM tuning tips.

Distributed Search · Elasticsearch · Performance Optimization
0 likes · 37 min read
DataFunTalk
Feb 1, 2022 · Big Data

Kafka at Meituan: Practices, Challenges, and Optimizations for Large‑Scale Data Platforms

This article presents Meituan's large‑scale Kafka deployment, describing the current state and challenges of massive data ingestion, detailing latency‑reduction techniques, cluster‑level optimizations, SSD‑based caching, isolation strategies, full‑link monitoring, lifecycle management, and future directions for high availability.

Large-Scale Data · Meituan · Monitoring
0 likes · 22 min read
MaGe Linux Operations
Jan 28, 2022 · Cloud Native

Top 7 Kubernetes Management Tools to Simplify Cluster Operations

Discover the most popular Kubernetes management solutions—including K9s, Rancher, Dashboard, Helm, Kubespray, Lens, and WKSctl—detailing their features, deployment options, and how they streamline cluster monitoring, scaling, and security for cloud-native environments and improve operational efficiency.

Cloud Native · cluster management · devops
0 likes · 9 min read
Yiche Technology
Jan 11, 2022 · Databases

Elasticsearch Overview, Comparison, Maintenance Challenges, Deployment Strategies, and Automation Management Platform

This document provides a comprehensive technical overview of Elasticsearch, comparing it with Solr and ClickHouse, detailing common operational pain points and configuration solutions, describing containerized and ECK deployments, and outlining a company‑wide automation platform for cluster provisioning, monitoring, index and security management, with future directions for lifecycle and backup strategies.

Automation · cluster management · kubernetes
0 likes · 31 min read
21CTO
Jan 4, 2022 · Operations

Deploy Searchable Snapshots in Elasticsearch 7.14: A Complete Step‑by‑Step Guide

This article explains the principles behind Elasticsearch searchable snapshots, details the DataTier model and node role optimizations, and provides a full practical walkthrough—including cluster setup, COS repository creation, ILM policy configuration, index templates, mounting strategies, and performance considerations—using ES 7.14.2.

Data Tier · Elasticsearch · ILM
0 likes · 15 min read
Efficient Ops
Jan 3, 2022 · Operations

Master Elasticsearch Cluster: Essential Commands for Health, Tasks, and Settings

This article explains how to manage Tencent Cloud Elasticsearch clusters by using key APIs to check health status, monitor pending tasks, retrieve metadata, view statistics, adjust shard allocation, modify cluster settings, and control tasks, providing practical command examples and detailed explanations for effective operations.

API · Settings · cluster management
0 likes · 19 min read
政采云技术
Nov 9, 2021 · Cloud Native

Design and Usage of Clusterfile in Sealer for Cluster Configuration and Plugins

This article explains the design principles of Sealer's Clusterfile, details its configuration parameters, demonstrates how to inject additional settings and environment variables, and describes the supported plugins for customizing Kubernetes clusters, providing practical examples and code snippets.

Cloud Native · Clusterfile · Sealer
0 likes · 10 min read
Alibaba Cloud Native
Oct 29, 2021 · Cloud Native

Unified Management & Secure Governance for Alibaba Cloud ACK and On-Prem Kubernetes

This article explains how cloud‑native technologies enable a unified control plane for Alibaba Cloud ACK clusters and self‑built Kubernetes clusters, detailing the ACK registered‑cluster architecture, one‑way registration, non‑managed security mechanisms, step‑by‑step cluster onboarding, and consistent security governance across environments.

ACK · Cloud Native · cluster management
0 likes · 11 min read
Tencent Cloud Developer
Oct 8, 2021 · Operations

Unveiling Kafka’s Controller: Architecture, Election, and Monitoring Deep Dive

This article provides a comprehensive technical analysis of Kafka’s Controller component, covering its background, core responsibilities, data storage, election process, version‑specific improvements, monitoring techniques, and key source‑code excerpts to help engineers understand and manage Kafka clusters effectively.

Monitoring · Zookeeper · cluster management
0 likes · 27 min read
MaGe Linux Operations
Oct 5, 2021 · Cloud Native

Unlock Advanced kubectl Tricks for Faster Kubernetes Management

This article shares a collection of powerful kubectl commands and tips—including API debugging, status‑based pod filtering and deletion, node‑specific pod listing, distribution counting with awk, and proxy usage—to help experienced Kubernetes users work more efficiently and avoid manual API client coding.

CLI · cluster management · devops
0 likes · 7 min read
DevOps Cloud Academy
Sep 21, 2021 · Operations

Practical Elasticsearch Operations and Performance Tuning Guide

This article extends previous Elasticsearch cheat sheets with practical commands and step‑by‑step instructions for shard allocation, replica adjustment, cluster settings, slow‑log configuration, mapping routing, force merge, bulk writes, refresh intervals, translog durability, heap sizing, disk‑space monitoring, and troubleshooting strategies.

Elasticsearch · Operations · Performance Tuning
0 likes · 7 min read
Selected Java Interview Questions
Sep 7, 2021 · Big Data

Elasticsearch Basics: Core Concepts, Indexing, Write and Search Processes, Cluster Management and Performance Tips

This article provides a comprehensive overview of Elasticsearch, covering its fundamental architecture, key concepts such as indices, shards and replicas, the complete write and search workflows, consistency mechanisms, master node election, and practical performance‑tuning recommendations for large‑scale deployments.

Big Data · Elasticsearch · Indexing
0 likes · 15 min read
Tencent Cloud Developer
Aug 26, 2021 · Big Data

Recap of Shenzhen Elasticsearch Meetup – Community Growth, Compression Optimization, Real‑time Data Fusion, and Cluster Practices

The first Shenzhen Elasticsearch meetup on August 21, 2021, jointly hosted by the ES Chinese community and Tencent Cloud, gathered experts from Tencent, Tapdata, ByteDance and Vivo to showcase rapid community growth, compression‑encoding optimizations, real‑time ES‑MongoDB data fusion, custom kernel extensions, large‑scale cluster practices, and concluded with extensive Q&A and networking.

Big Data · Elasticsearch · Real-time Data Fusion
0 likes · 11 min read
Sohu Tech Products
Jul 14, 2021 · Cloud Native

Limitations and Challenges of Kubernetes in Cluster Management and Application Scenarios

The article examines Kubernetes' widespread adoption, outlines its scalability and multi‑cluster management constraints, discusses practical application scenarios such as deployment models, batch scheduling, and hard multi‑tenancy, and highlights the gaps that still limit its use in large‑scale production environments.

Cloud Native · Multi-Cluster · cluster management
0 likes · 21 min read