Tagged articles

186 articles

Apr 2, 2021 · Backend Development

Traffic Governance and Protection: Threshold Configuration

Effective traffic governance and protection rely on properly configuring thresholds—either a fixed global limit that stays constant regardless of node count, or a per‑machine allocation that scales the total capacity as nodes are added or removed—to prevent sudden surges from overwhelming services and ensure high availability.

Backend Development · Cluster Management · high availability

0 likes · 4 min read

Ops Development Stories

Apr 1, 2021 · Operations

Zookeeper Leader Election Explained: Cluster Architecture & Code Walkthrough

This article provides a comprehensive overview of Zookeeper's cluster deployment, explains the four server states, details the leader election process—including initialization, voting, and decision logic—and presents key source code snippets to help developers understand and implement Zookeeper's high‑availability mechanisms.

Cluster Management · Distributed Systems · Java

0 likes · 10 min read

JD Tech

Mar 30, 2021 · Artificial Intelligence

JD Retail's Jiushu Business Analytics Platform: AI‑Driven Solutions for Retail

The article introduces JD Retail's Jiushu Business Analytics Platform, detailing how AI, big‑data, and distributed‑training technologies address fragmented retail scenarios, high deployment barriers, large‑scale application difficulties, and cost concerns through specialized frameworks, fault‑tolerant training, and advanced cluster optimization.

AI · Cluster Management · Distributed Training

0 likes · 12 min read

Cloud Native Technology Community

Mar 3, 2021 · Operations

How Facebook Scales Millions of Servers with Twine: Inside Its Cluster Management Engine

This article explains how Facebook’s Twine system orchestrates containers across millions of servers, detailing its architecture, support for stateful services, cross‑data‑center control, elastic capacity handling, and the lessons learned from eight years of large‑scale operations.

Cluster Management · Facebook · Operations

0 likes · 15 min read

Liangxu Linux

Feb 28, 2021 · Cloud Native

Essential Kubernetes Best Practices for Production‑Ready Clusters

This guide presents a comprehensive checklist of Kubernetes best practices covering container image selection, registry authentication, namespace isolation, labeling, annotations, RBAC, pod security policies, network policies, secrets management, image scanning, CI/CD, canary releases, monitoring, service mesh, and admission controllers to help you build secure, stable, and scalable production clusters.

Cloud Native · Cluster Management · Kubernetes

0 likes · 17 min read

Open Source Linux

Feb 20, 2021 · Cloud Native

Fix Inconsistent Kubernetes rc/deployment/service Deletions and Etcd Failures

This guide walks through troubleshooting Kubernetes issues such as partially deleted resources, resetting etcd, apiserver start failures due to missing ServiceAccount certificates, SELinux permission errors, ServiceAccount key generation, etcd startup errors, host trust configuration, and resource limit pitfalls, providing concrete commands and scripts for each problem.

Cluster Management · Kubernetes · Linux

0 likes · 17 min read

DataFunTalk

Feb 14, 2021 · Big Data

Impala at NetEase: Architecture, Iceberg Integration, Management System, Optimizations and Future Roadmap

This talk presents NetEase's practical experience with Impala, covering its core architecture, new features in version 3.x, integration with Apache Iceberg, a custom management platform, profiling and statistics enhancements, as well as future plans involving Kubernetes, Alluxio caching and pre‑computation strategies.

Apache Iceberg · Big Data · Cluster Management

0 likes · 13 min read

Programmer DD

Jan 21, 2021 · Cloud Native

Master Kubernetes with k9s: The Ultimate CLI for Real-Time Cluster Management

k9s is a command‑line interface that streamlines Kubernetes cluster management by wrapping kubectl, offering real‑time resource tracking, metric monitoring, support for standard and custom resources, a customizable UI, multi‑resource views, RBAC inspection, and easy navigation of related objects; it is available for download from the official site.

CLI · Cloud Native · Cluster Management

0 likes · 2 min read

Alibaba Cloud Native

Jan 5, 2021 · Cloud Native

How to Transform a Native Kubernetes Cluster into an OpenYurt Edge‑Native Cluster

This guide walks through setting up a cloud‑hosted Kubernetes control plane with a Raspberry Pi edge node, demonstrates the limitations of a native cluster under edge conditions, converts the cluster to OpenYurt using yurtctl, and evaluates operation commands and network‑disruption scenarios to showcase OpenYurt’s edge‑native capabilities.

Cloud Native · Cluster Management · Edge Computing

0 likes · 21 min read

Programmer DD

Dec 28, 2020 · Operations

How to Install and Use Cerebro for Easy Elasticsearch Cluster Management

This guide explains what Cerebro is, how to install it (including binary and Docker options), how to run it on Linux, macOS, and Windows, and how to use its UI to connect to an Elasticsearch node, view cluster overviews, manage shards, and execute DSL queries.

AngularJS · Cerebro · Cluster Management

0 likes · 5 min read

vivo Internet Technology

Dec 23, 2020 · Cloud Native

ZooKeeper: Comprehensive Guide to Distributed Coordination Service

ZooKeeper, Apache’s distributed coordination service, offers a highly available in‑memory hierarchical file system with leader‑follower‑observer clustering and the ZAB protocol, guaranteeing sequential consistency, atomicity and a single view while supporting publish/subscribe, configuration management, distributed locks, master election and queueing for robust distributed applications.

Apache · Cluster Management · Coordination Service

0 likes · 20 min read

Big Data Technology & Architecture

Dec 16, 2020 · Big Data

Designing a Real‑Time Data Processing Platform with Flink: Architecture, Deployment, and Operations

This article explains how to build a real‑time data processing platform using Flink, covering the Lambda architecture, design approaches, SQL and custom‑Jar task definitions, UI drag‑and‑drop, cluster resource management on Yarn and Kubernetes, submission modes, scheduling, permission and metadata handling, logging, and monitoring with Prometheus and Grafana.

Cluster Management · Flink · Lambda architecture

0 likes · 19 min read

MaGe Linux Operations

Dec 3, 2020 · Cloud Native

Essential Kubernetes Tools: Deploy, Monitor, and Develop with Ease

This article introduces a curated list of Kubernetes tools—including cluster deployment solutions, monitoring utilities, CLI helpers, and development aids—explaining how each simplifies container orchestration, enhances DevOps workflows, and empowers engineers to manage, observe, and extend their Kubernetes environments efficiently.

CLI tools · Cluster Management · DevOps

0 likes · 7 min read

DataFunTalk

Nov 27, 2020 · Big Data

Evolution of Kafka‑Based Data Pipeline at Chehaoduo Group: Architecture, Scaling, and Best Practices

This article chronicles the four‑year evolution of Chehaoduo Group’s Kafka ecosystem, from its initial role as a simple data‑ingestion layer to becoming the core of the company’s large‑scale data pipeline, and details cluster management, upgrade strategies, multi‑cluster deployment, Avro schema handling, SDK development, and operational lessons learned.

Avro · Cluster Management · Kafka

0 likes · 21 min read

Architecture Digest

Nov 21, 2020 · Operations

Understanding and Handling ZooKeeper Split‑Brain Issues

This article explains the causes of ZooKeeper split‑brain situations, why odd‑numbered node deployments are preferred, how the quorum (majority) rule prevents split‑brain, and outlines practical methods such as quorum configuration, redundant communication, fencing, and pause‑before‑failover to handle and avoid the issue.

Cluster Management · Split-Brain · high availability

0 likes · 13 min read

vivo Internet Technology

Oct 14, 2020 · Backend Development

Design and Implementation of a High‑Availability RabbitMQ Middleware Platform at vivo

vivo built a high‑availability RabbitMQ middleware platform that combines an MQ‑Portal for request‑driven provisioning, an SDK that adds application‑level authentication, automatic cluster discovery, rate‑limiting, reset and blockage‑transfer capabilities, and a stateless MQ‑NameServer for name resolution and health‑based failover, enabling ten‑fold traffic growth without incidents.

Backend · Cluster Management · Message Queue

0 likes · 14 min read

ITPUB

Oct 10, 2020 · Big Data

How Didi Scaled Presto for Petabyte‑Scale Queries: Architecture & Optimizations

Didi’s three‑year journey with Presto transformed it into the company’s primary ad‑hoc and Hive‑SQL acceleration engine, serving over 6,000 users, processing 2‑3 PB of HDFS data daily, and achieving major gains in stability, performance, cost, and usability through extensive architectural tweaks, resource isolation, connector extensions, and monitoring enhancements.

Big Data · Cluster Management · Druid Connector

0 likes · 18 min read

Didi Tech

Oct 9, 2020 · Big Data

Presto at Didi: Architecture, Optimizations, and Operational Experience

At Didi, Presto has been the default ad‑hoc and Hive‑SQL engine for over three years, serving 6,000 users, processing 2‑3 PB daily and 30‑35 trillion rows, with mixed and dedicated clusters, migration to PrestoSQL 340, extensive Hive compatibility, label‑based isolation, a native Druid connector, usability and stability enhancements, and JVM‑level performance optimizations, while planning further resource‑saving upgrades.

Big Data · Cluster Management · Distributed SQL

0 likes · 17 min read

Big Data Technology & Architecture

Sep 18, 2020 · Big Data

Understanding the Elasticsearch Master Election Process

This article explains when Elasticsearch triggers a master election, describes each election stage—including active master and candidate selection, Bully algorithm comparison, and master node responsibilities—while providing code excerpts that illustrate the underlying implementation details.

Big Data · Cluster Management · Distributed Systems

0 likes · 8 min read

Java Architect Essentials

Aug 19, 2020 · Cloud Native

Inside Borg: The Predecessor of Kubernetes and Its Architecture Explained

This article provides a comprehensive analysis of Google’s Borg system, covering its design goals, user view, job and task model, resource allocation, scheduling algorithms, fault tolerance, scalability techniques, and operational metrics that shaped modern cloud‑native orchestration platforms.

Borg · Cloud Native · Cluster Management

0 likes · 26 min read

MaGe Linux Operations

Aug 5, 2020 · Cloud Native

Top Open-Source Tools to Simplify Kubernetes Management Across Any Environment

Discover a curated list of powerful open-source Kubernetes management solutions—including K9s, Rancher, Dashboard, Kubectl, Kubeadm, Helm, KubeSpray, Kontena Lens, and WKSctl—detailing their core features, deployment options, and how they streamline cluster monitoring, configuration, and application lifecycle across cloud-native environments.

Cloud Native · Cluster Management · DevOps

0 likes · 8 min read

DataFunTalk

Jul 16, 2020 · Big Data

Elasticsearch Practices and Platform Construction at 58.com

This article details 58.com’s extensive use of Elasticsearch for search, analytics, and log processing, covering cluster optimization challenges, typical issues like disk exhaustion and write slowdown, practical solutions, development standards, ELKB architecture, real‑time log and MySQL slow‑log applications, platform‑as‑a‑service construction, and future roadmap plans.

Cluster Management · Elasticsearch · Log Analytics

0 likes · 17 min read

Open Source Linux

Jul 2, 2020 · Operations

Master Ceph Cluster Management: Fix Nearfull OSD, PG States & Config Commands

This guide explains how to troubleshoot Ceph near‑full OSD warnings, understand PG fault states, manage OSD and monitor statuses, perform cluster configuration changes without restarts, add or remove OSDs and monitors, adjust pool settings, and handle user permissions using detailed command examples.

Ceph · Cluster Management · OSD

0 likes · 15 min read

Big Data Technology & Architecture

Jun 22, 2020 · Databases

JDHBase Multi‑Active Architecture and Replication Mechanisms

This article describes JDHBase’s large‑scale KV storage, its HBase‑based replication principle, the multi‑active cluster architecture with Fox Manager, client routing, automatic failover, dynamic replication tuning, serial replication guarantees, and future directions for improving cross‑region disaster recovery.

Cluster Management · HBase · JDHBase

0 likes · 11 min read

Ops Development Stories

Jun 22, 2020 · Cloud Native

Master Fast Kubernetes Cluster Switching with kubectx and kubens

Learn how to merge multiple kubeconfig files, resolve TLS user conflicts, and use the third‑party tools kubectx and kubens to quickly list, switch, and manage Kubernetes clusters and namespaces with simple command‑line shortcuts.

Cluster Management · Kubernetes · context

0 likes · 10 min read

HomeTech

Jun 17, 2020 · Big Data

Apache Kylin at AutoHome: Development History, Architecture, Optimizations, and Future Plans

This article details AutoHome's use of Apache Kylin as its core OLAP engine, covering its development timeline, architecture, large‑scale cube deployment, performance optimizations, cluster upgrade experiences, and future directions such as real‑time OLAP and cloud‑native deployment.

Apache Kylin · Cluster Management · OLAP

0 likes · 23 min read

Big Data Technology Architecture

Jun 11, 2020 · Big Data

Kylin at Autohome: Development History, Deployment Practices, Optimizations, and Future Roadmap

This article details Autohome's use of Apache Kylin as its core OLAP engine, covering its architecture, large‑scale Cube deployment, real‑world business applications, a series of performance and operational optimizations, cluster upgrade experiences, and upcoming plans for real‑time OLAP and cloud‑native evolution.

Cloud Native · Cluster Management · Kylin

0 likes · 24 min read

Programmer DD

Jun 10, 2020 · Operations

Unlock Nacos 1.3.0: Embedded DB, New Raft Protocol, and Cluster Management Tips

Nacos 1.3.0 introduces an embedded relational database, unified cluster management, an upgraded Raft consistency protocol, and security fixes, providing simpler deployment, higher performance, and safer operations for both small‑scale and enterprise environments.

Cluster Management · Embedded Database · Nacos

0 likes · 9 min read

Java Architecture Diary

Jun 6, 2020 · Cloud Native

Explore Nacos 1.3.0: Embedded DB, New Raft Protocol, and High‑Availability

Nacos 1.3.0 introduces an embedded relational database, unified cluster management, an upgraded Raft consistency layer, security patches, Snowflake ID configuration, data migration guidance, new cluster addressing modes, and a set of Open‑API operations for Raft administration, all aimed at simplicity, performance, and high availability.

Cluster Management · Embedded Database · Nacos

0 likes · 10 min read

Architect

May 15, 2020 · Databases

Understanding Elasticsearch Architecture: Segments, Translog, Refresh, Shard Allocation and Cluster Operations

This article provides a comprehensive overview of Elasticsearch's internal architecture, explaining how data flows from memory buffers to Lucene segments, the role of refresh and translog for durability, segment merging strategies, shard routing, replica consistency, allocation controls, hot‑cold data separation, and cluster discovery settings.

Cluster Management · Elasticsearch · Segments

0 likes · 23 min read

Open Source Linux

May 2, 2020 · Big Data

Mastering Apache Zookeeper: Core Concepts and Real-World Big Data Applications

Apache Zookeeper is an open‑source coordination service that provides reliable distributed synchronization, configuration management, and naming for big‑data components such as Hadoop, HBase, and Kafka, offering features like hierarchical znode structures, watches, master election, and distributed locks to maintain cluster health.

Apache Zookeeper · Big Data · Cluster Management

0 likes · 17 min read

Big Data Technology Architecture

Mar 7, 2020 · Operations

How to Perform a Graceful Shutdown of an Elasticsearch Node

This article outlines a step‑by‑step procedure for safely taking an Elasticsearch node offline—checking master‑eligible settings, adjusting minimum_master_nodes, excluding the node from routing, waiting for shard relocation, stopping the service, and restoring the cluster routing—ensuring no data loss or service interruption.

Cluster Management · DevOps · Elasticsearch

0 likes · 6 min read

Java Backend Technology

Mar 7, 2020 · Backend Development

Mastering ZooKeeper: Core Concepts, Distributed Locks, and Cluster Management

This article explains ZooKeeper's role as a distributed coordination service, covering its architecture, znode types, watch mechanism, configuration management, distributed locking, queue handling, data replication, leader election, and synchronization processes for building reliable backend systems.

Cluster Management · Coordination Service · ZooKeeper

0 likes · 18 min read

dbaplus Community

Feb 22, 2020 · Databases

How to Perform Daily Maintenance on GaussDB T Clusters Without Pitfalls

This guide walks you through the essential daily maintenance tasks for GaussDB T clusters, covering ETCD startup, cluster health checks, host resource monitoring, tablespace usage, abnormal wait events, log inspection, and common error troubleshooting with concrete commands and SQL examples.

Cluster Management · Database Maintenance · Error Handling

0 likes · 11 min read

JD Retail Technology

Jan 6, 2020 · Backend Development

JDHBase Multi‑Active Architecture and Replication Practices

This article describes JDHBase’s large‑scale KV storage deployment, its HBase‑based asynchronous replication mechanism, the multi‑active architecture with active‑standby clusters, client interaction via Fox Manager, automatic failover strategies, dynamic replication tuning, and serial replication techniques to ensure data consistency across data centers.

Cluster Management · HBase · Replication

0 likes · 13 min read

Efficient Ops

Dec 17, 2019 · Operations

How Alibaba Scales Flink: Lessons in Big Data Operations

This article details Alibaba's massive Flink deployment, covering its historical background, the operational challenges of managing tens of thousands of nodes, the design of a comprehensive Flink management platform, and the automated solutions for fault handling, resource allocation, and performance testing in a large‑scale big‑data environment.

Automation · Big Data Operations · Cluster Management

0 likes · 20 min read

Alibaba Cloud Native

Nov 30, 2019 · Cloud Native

How Alibaba Cloud Manages Over 10,000 Kubernetes Clusters at Double‑11 Scale

This article explains how Alibaba Cloud Container Service (ACK) designs a unit‑based, tiered management system, capacity planning model, global observability architecture, and pluggable components to reliably operate more than ten thousand diverse Kubernetes clusters during the massive Double‑11 shopping event.

ACK · Alibaba Cloud · Cluster Management

0 likes · 13 min read

dbaplus Community

Nov 13, 2019 · Databases

Scaling TiDB to 200TB at Zhuanzhuan: Key Performance and Management Lessons

Zhuanzhuan’s adoption of TiDB addressed sharding challenges and massive data storage needs, and the team shares six common issues encountered in large‑scale online deployments—including performance diagnosis, cluster management, log inconsistencies, slow‑SQL impact, optimizer limitations, and transaction conflicts—along with their standardized solutions for deployment, monitoring, alerting, and business rollout.

Cluster Management · SQL Optimization · TiDB

0 likes · 12 min read

Alibaba Cloud Developer

Nov 13, 2019 · Cloud Native

Scaling Kubernetes at Ant Financial: The Kube‑on‑Kube Architecture Explained

Ant Financial’s article details how its large‑scale Kubernetes management system—built on a meta‑cluster, end‑state operators, and a Kube‑on‑Kube design—ensures reliable creation, upgrade, and self‑healing of thousands of nodes, while providing gray‑scale changes, risk assessment, and fault‑tolerant automation.

Cloud Native · Cluster Management · Kubernetes

0 likes · 15 min read

dbaplus Community

Nov 4, 2019 · Cloud Native

How Ant Financial Scales Kubernetes: Inside Their Cloud‑Native Cluster Management System

This article explains how Ant Financial designs and operates a large‑scale, highly available Kubernetes management platform—detailing its architecture, core operators, desired‑state controllers, fault‑self‑healing mechanisms, risk mitigation, and practical Q&A for production environments.

Automation · Cloud Native · Cluster Management

0 likes · 16 min read

dbaplus Community

Oct 29, 2019 · Cloud Native

How Meituan‑Dianping Scaled Kubernetes to 100k+ Nodes with HULK2.0

Meituan‑Dianping describes its evolution from a custom Docker‑based scheduler (HULK1.0) to an open‑source Kubernetes‑based platform (HULK2.0), detailing architecture, resource‑management strategies, scheduler optimizations, Kubelet enhancements, and online‑cluster tuning that together enable stable, cost‑effective operation of a 100k+ node fleet.

Cloud Native · Cluster Management · Kubernetes

0 likes · 19 min read

Alibaba Cloud Native

Oct 27, 2019 · Cloud Native

How Ant Financial Scales Kubernetes: Inside Their Kube‑on‑Kube Management System

This article explains how Ant Financial designs and operates a large‑scale, highly available Kubernetes management platform—using end‑state driven operators, custom CRDs, self‑healing mechanisms, and risk‑mitigation strategies—to reliably run thousands of nodes and dozens of business clusters in production.

Cluster Management · Kube-on-Kube · Kubernetes

0 likes · 15 min read

Big Data Technology & Architecture

Oct 20, 2019 · Cloud Native

Meituan-Dianping Kubernetes Cluster Management and Optimization Practices

This article details Meituan-Dianping's evolution from custom Docker‑based scaling to a Kubernetes‑driven, cloud‑native cluster management platform (HULK), describing its architecture, scheduler enhancements, Kubelet modifications, and resource‑optimization strategies for large‑scale operations.

Cloud Native · Cluster Management · Kubernetes

0 likes · 17 min read

dbaplus Community

Oct 8, 2019 · Big Data

How to Master Large-Scale Cluster Management: 10 Real-World Troubleshooting Cases

This article shares a senior data‑platform engineer's hands‑on experience managing dozens of thousand‑node clusters, detailing nine common cluster problems and step‑by‑step solutions—including performance tuning, RPC fixes, HDFS cleanup, Hive metadata repair, Spark shuffle optimization, HBase region recovery, and Kafka bottleneck mitigation.

Big Data · Cluster Management · HBase

0 likes · 17 min read

Big Data Technology Architecture

Sep 26, 2019 · Databases

Elasticsearch Core Overview and Key Performance Metrics

This article provides a comprehensive guide to Elasticsearch’s architecture, node roles, data organization, and the most important performance metrics—including search, indexing, memory, JVM garbage collection, host‑level system metrics, cluster health, and resource saturation—offering practical advice on monitoring and tuning the cluster for reliability and efficiency.

Cluster Management · Elasticsearch · JVM

0 likes · 27 min read

21CTO

Aug 23, 2019 · Cloud Native

How Meituan Optimized Kubernetes at Scale: Lessons from HULK2.0

This article details Meituan‑Dianping's evolution from a custom Docker‑based cluster manager to the open‑source Kubernetes‑powered HULK2.0 platform, describing its architecture, operational practices, scheduler and Kubelet optimizations, and resource‑management techniques that enable massive, cost‑effective scaling.

Cluster Management · Meituan · Performance Optimization

0 likes · 19 min read

Meituan Technology Team

Aug 22, 2019 · Cloud Native

Meituan-Dianping Kubernetes Cluster Management and Optimization Practices

Meituan‑Dianping’s evolution from virtualization to the HULK‑2.0 Kubernetes platform enables a 100,000‑instance, multi‑region cluster to achieve high elasticity and availability, using scheduler optimizations, local‑optimal placement, enhanced kubelet features, and fine‑grained resource management to maximize throughput during traffic spikes.

Cluster Management · Kubernetes · Meituan

0 likes · 19 min read

AntTech

Aug 15, 2019 · Cloud Native

Design and Implementation of Ant Financial’s Large‑Scale Kubernetes Cluster Management System

This article explains how Ant Financial designs a highly reliable, end‑state‑driven Kubernetes management platform that handles lifecycle operations, node self‑healing, and risk‑controlled changes for clusters with tens of thousands of nodes, using operators, custom resources, and a meta‑cluster architecture.

Cluster Management · Kubernetes · large scale

0 likes · 9 min read

MaGe Linux Operations

May 28, 2019 · Operations

What Skills and Knowledge Do You Need to Master Large‑Scale Website Operations?

This article explains what large‑scale website operations entail, outlines the product lifecycle and the crucial role of operations engineers, lists essential technical skills and personal qualities, and discusses current challenges, future prospects, and key technical topics such as cluster management, monitoring, fault handling, and automation.

Automation · Cluster Management · DevOps

0 likes · 18 min read

Youzan Coder

Apr 27, 2019 · Big Data

Recap of Elastic Community Technical Salon: Cluster Management, Multi‑Tenant Practices, and Search Platform Engineering

On April 27, Youzan Technology and the Elastic Chinese community hosted a “starry sky” technical salon where experts from Getui, Ant Financial, Haipai Ke and Youzan presented four talks on large‑cluster proxy management, multi‑tenant ES optimization, search‑platform engineering, and the evolution of Youzan’s log platform, followed by lively Q&A and resource sharing.

Cluster Management · Elasticsearch · Log Analytics

0 likes · 6 min read

Java Captain

Apr 9, 2019 · Operations

Zookeeper Overview: Functions, Deployment Modes, Synchronization, and Notification Mechanism

This article explains Zookeeper as an open‑source distributed coordination service, detailing its core functions such as cluster management, leader election, distributed locks, and naming service, along with its three deployment modes, state‑synchronization via the ZAB protocol, and its watcher‑based notification mechanism.

Cluster Management · Distributed Coordination · Naming Service

0 likes · 4 min read

MaGe Linux Operations

Jan 24, 2019 · Operations

What It Takes to Master Large‑Scale Website Operations?

This article explores the definition, responsibilities, required skills, career challenges, and key technologies of large‑scale website operations, offering a comprehensive guide for aspiring and current operations engineers to understand and excel in this demanding field.

Automation · Career Development · Cluster Management

0 likes · 20 min read

Architecture Talk

Jan 8, 2019 · Big Data

Boost Elasticsearch Performance: Bulk API, Gateway & Caching Secrets

This article explains how to dramatically improve Elasticsearch throughput by using the bulk API, tuning bulk request sizes, configuring gateway settings, optimizing cluster state updates, managing caches, leveraging fielddata and doc values, and employing tools like Curator and the Profiler for efficient cluster operations.

Cluster Management · Elasticsearch · bulk API

0 likes · 27 min read

vivo Internet Technology

Dec 28, 2018 · Big Data

Running a 400+ Node Elasticsearch Cluster: Architecture, Scaling, and Performance Tuning

Meltwater’s media‑monitoring platform runs a custom Elasticsearch 1.7.6 cluster of over 400 nodes on AWS, handling 200 TB of primary data and 3 million daily documents while serving thousands of complex queries per minute, achieved through careful shard design, master‑node configuration, extensive performance tuning, and automated provisioning.

AWS · Cluster Management · Elasticsearch

0 likes · 13 min read

360 Tech Engineering

Sep 29, 2018 · Operations

Design and Implementation of a Multi‑Task Scheduling System Based on Apache Mesos

This article describes how we identified underutilized CPU and memory resources in our company's servers, evaluated Kubernetes versus Apache Mesos, and built a non‑intrusive, Mesos‑based multi‑task scheduling system with dynamic resource reservation, monitoring, task isolation, and cluster‑wide observability, while addressing deployment challenges.

Cluster Management · Docker alternative · Mesos

0 likes · 11 min read

360 Zhihui Cloud Developer

Sep 26, 2018 · Cloud Computing

How We Built a Multi‑Task Scheduler with Mesos on Legacy Servers

This article explains how we leveraged Apache Mesos to create a multi‑task scheduling system that maximizes idle CPU and memory on legacy CentOS machines without kernel upgrades, detailing architecture, deployment, monitoring, resource isolation, and remaining challenges.

Cluster Management · Mesos · containerization

0 likes · 12 min read

dbaplus Community

Aug 14, 2018 · Operations

How Ant Financial Automates Tens of Thousands of Kubernetes Nodes with Operators

Ant Financial tackles the challenge of managing dozens of Kubernetes clusters and over a hundred thousand worker nodes by employing a meta‑cluster with Kube‑on‑Kube and Node Operators, enabling automated lifecycle management, scaling, upgrades, and fault recovery for both master components and worker nodes.

Automation · Cluster Management · Kubernetes

0 likes · 12 min read

UCloud Tech

Jul 31, 2018 · Fundamentals

What’s New in Ceph? July 2018 Developer Highlights and Key Feature Updates

The July 2018 Ceph Developer Monthly report from the UMCloud storage team summarizes the latest community contributions, including enhancements to object and block storage, new OPA integration for fine‑grained access control, crash‑dump management, dashboard user UI, and batch operations for ceph‑volume.

Ceph · Cluster Management · Developer Updates

0 likes · 6 min read

Alibaba Cloud Developer

Jun 26, 2018 · Operations

How Scheduling Algorithms Power Efficient Data Center Resource Management

Scheduling algorithms are a crucial component of cluster resource management systems: they determine where containerized tasks run so that resource requirements, high availability, fault tolerance, and cost efficiency are met across individual containers, applications, and entire data centers, and they underpin Alibaba’s global scheduling challenge.

Cluster Management · Data center · algorithm competition

0 likes · 10 min read

ITPUB

May 31, 2018 · Big Data

Mastering Spark on DataMagic: Fast‑Track Your Big Data Skills

This article explains Spark's role in the DataMagic platform, outlines four practical steps to quickly master Spark, details key configuration and parallelism settings, shows how to modify Spark code, and provides operational tips for cluster management and job troubleshooting.

Big Data · Cluster Management · Configuration

0 likes · 10 min read

UCloud Tech

Apr 28, 2018 · Operations

Ceph April 2018 Update: New Object, Block, and Cluster Features

The April 2018 Ceph monthly report highlights LTTng tracing for RGW, SSL support for the beast frontend, MFA integration, a notrim option for rbd mapping, runtime lz4 and brotli compression, Zabbix PG metrics, asynchronous dashboard tasks, detailed operation tracking, osdmap pruning, and a new iostat manager plugin.

Ceph · Cluster Management · LTTng

0 likes · 9 min read

Alibaba Cloud Developer

Apr 12, 2018 · Backend Development

How Alibaba’s Cainiao Scales a Lightweight Timer Engine for Billions of Packages

Facing the challenge of processing over 100 million daily parcels, Alibaba’s Cainiao designed a lightweight, time‑wheel‑based scheduling engine that decouples task storage from timing, leverages partitioned task chains, master‑driven node IDs, and cluster‑wide soft‑load balancing to achieve scalable, fault‑tolerant timer processing.

Backend Engineering · Cluster Management · Time Wheel

0 likes · 12 min read

Architects' Tech Alliance

Dec 18, 2017 · Fundamentals

GPFS Technical Practice Sharing and Building‑Block Design Overview

This article provides a comprehensive overview of IBM GPFS, covering its architecture, management components, networking models, cluster and storage design, as well as practical guidance on building‑block configurations for performance and capacity scaling in high‑performance computing environments.

Building Block · Cluster Management · Distributed File System

0 likes · 13 min read

Efficient Ops

Nov 2, 2017 · Operations

Managing Massive Elasticsearch Clusters: Lessons from a 120‑Node Deployment

This article shares practical insights on operating large‑scale Elasticsearch clusters for log analysis, covering use cases, essential tools, hardware choices, node role separation, shard management, hot‑cold data strategies, version upgrades, and key monitoring metrics to ensure stability and performance.

Cluster Management · Elasticsearch · Hardware Scaling

0 likes · 12 min read

37 Interactive Technology Team

Oct 19, 2017 · Big Data

Ambari Technical Practice for Managing Hadoop Big Data Platforms

The team adopted Apache Ambari to streamline deployment, scaling, monitoring, and upgrade of their Hadoop‑centric big‑data platform, overcoming HA cluster takeover and custom Hive 2.1 integration through a three‑phase test, gray‑scale, and production rollout, thereby improving management efficiency and reducing O&M costs.

Ambari · Cluster Management · HDP

0 likes · 18 min read

Alibaba Cloud Developer

Sep 6, 2017 · Operations

How Mixed Offline/Online Scheduling Boosted Alibaba’s Data Center Utilization by 30%

The article explains how rapid internet growth has expanded data centers, why traditional operations fall short, presents a simple utilization formula, shows Alibaba’s mixed offline‑online scheduling experiment that raised server usage from 10% to over 40%, and announces an open dataset for academic research.

Alibaba · Cluster Management · data center utilization

0 likes · 7 min read

High Availability Architecture

Apr 14, 2017 · Databases

Recent Improvements in Elasticsearch 5.x and Outlook for 6.0

This article reviews the latest Elasticsearch 5.x enhancements—including append‑only indexing, range fields, removal of the _all field, unified highlighter, keyword normalizer, multi‑word synonyms, field collapsing, cancellable searches, partitioned term aggregations, cluster allocation explain, Java REST client updates, cross‑cluster search, batched reduce phases—and previews the major features expected in Elasticsearch 6.0 such as sparse doc values, index sorting, sequence numbers, seamless rolling upgrades, type removal, index‑template inheritance, load‑aware shard routing, and X‑Pack extensions like SQL and machine learning.

Cluster Management · Elasticsearch · Search

0 likes · 15 min read

DevOps

Apr 10, 2017 · Cloud Native

Applying Docker and Kubernetes to Build Scalable, Automated Test Environments

The talk outlines how Docker and Kubernetes were adopted to streamline test environment provisioning, address challenges like environment inconsistency and resource scarcity, and enable automated, standardized, and scalable testing infrastructure through containerization, networking, storage, and cluster management techniques.

Cluster Management · DevOps · Docker

0 likes · 19 min read

Architects' Tech Alliance

Mar 15, 2017 · Cloud Native

Docker Swarm on Apache Mesos: Architecture, Integration Guide, and Practical Considerations

This article explains the architecture of Docker Swarm, the reasons for running it on Apache Mesos, the integration process—including resource offers, task scheduling, and container creation—and discusses current limitations and future improvement directions.

Apache Mesos · Cloud Native · Cluster Management

0 likes · 12 min read

Efficient Ops

Mar 8, 2017 · Big Data

Inside iQIYI’s Massive Hadoop Platform: Architecture, Ops, and the Gear Workflow Engine

This article traces iQIYI’s Hadoop platform, built since 2010 and now spanning over a thousand nodes and 60 PB of storage, covering its architectural evolution, operational management practices, the challenges encountered, and the custom Gear workflow system that streamlines job scheduling, dependencies, and alerts for large‑scale data processing.

Cluster Management · Gear · Hadoop

0 likes · 19 min read

Qunar Tech Salon

Dec 30, 2016 · Operations

Mesos Architecture and Its Practical Use at Qunar: Framework Unification and Operational Insights

This article explains the Mesos distributed system kernel, its resource‑allocation workflow, and how Qunar engineers applied and evolved Mesos, Marathon, and custom frameworks to achieve fine‑grained scheduling, high availability, service discovery, and multi‑tenant management in a large‑scale production environment.

Cluster Management · Distributed Systems · Framework

0 likes · 14 min read

Qunar Tech Salon

Nov 10, 2016 · Operations

Zookeeper Operational Best Practices and Common Pitfalls

This article shares practical experience on operating Zookeeper clusters, covering core concepts, deployment recommendations, configuration tuning, monitoring, migration strategies, and a list of common issues to avoid for reliable distributed coordination.

Cluster Management · Distributed Coordination · best practices

0 likes · 11 min read

360 Zhihui Cloud Developer

Oct 27, 2016 · Operations

How Hulk’s Private Cloud Optimizes SaltStack for Scalable Command Execution

This article explains how Hulk’s private‑cloud platform customizes SaltStack for large‑scale command execution, detailing its three‑layer architecture, Redis‑based data flow, and a seven‑step workflow that achieves a 99% success rate, while highlighting current limitations and future improvements.

Cluster Management · SaltStack · ZeroMQ

0 likes · 6 min read

ITPUB

Oct 13, 2016 · Databases

Mastering Oracle RAC: Step‑by‑Step Commands to Start and Stop Clusters

This guide provides a comprehensive, command‑by‑command walkthrough for shutting down and starting Oracle RAC clusters, covering srvctl and crsctl usage, status checks, and essential options to help administrators manage database instances and cluster services reliably.

Cluster Management · Database Shutdown · Database Startup

0 likes · 12 min read

Alibaba Cloud Developer

Aug 25, 2016 · Operations

Comparing Modern Data‑Center Schedulers: Borg, Mesos, Omega, Kubernetes & Zeus

This article examines resource allocation philosophies—auction, budgeting, and preemption—and compares the architectures, data models, and APIs of major schedulers such as Borg, Omega, Mesos, Kubernetes, and Alibaba’s Zeus, while also exploring sharing strategies, task classifications, utilization metrics, and predictive techniques for efficient resource management.

Borg · Cluster Management · Kubernetes

0 likes · 34 min read

dbaplus Community

Aug 23, 2016 · Operations

How to Build a Scalable Automated Deployment System for Multi‑Node Clusters

This article walks through the shortcomings of manual code releases, designs a multi‑environment automated deployment workflow, details step‑by‑step implementation—including code fetching, configuration handling, logging, parallel execution, and rollback—while sharing practical scripts and common pitfalls for large‑scale clusters.

Cluster Management · Deployment Automation · DevOps

0 likes · 10 min read

MaGe Linux Operations

Aug 10, 2016 · Cloud Native

Master Docker Swarm: Build, Deploy, and Manage High‑Availability Clusters

This guide explains Docker Swarm’s architecture, installation methods, and step‑by‑step procedures for creating a high‑availability Swarm cluster, including manager and node setup, TLS security, service discovery with Consul, and essential Docker commands for operating and testing the cluster.

Cluster Management · DevOps · Docker

0 likes · 9 min read

Architecture Digest

Aug 8, 2016 · Databases

Understanding Elasticsearch Architecture: Clusters, Shards, Discovery, and Scaling

This article provides a comprehensive overview of Elasticsearch 2.x, covering its distributed architecture, core concepts such as clusters, nodes, indices, shards and replicas, the ZenDiscovery master‑election process, scaling mechanisms, recovery, query features, and the underlying system components like Guice, Netty, and thread‑pool designs.

Cluster Management · Elasticsearch · NoSQL

0 likes · 20 min read

High Availability Architecture

Apr 27, 2016 · Operations

Mesos Architecture and Its Deployment at Qunar: Framework Unification and Operational Strategies

This article explains the Mesos distributed system kernel, its master‑slave architecture, fine‑grained resource scheduling, and how Qunar leverages Mesos and Marathon for log processing, Spark, Alluxio, and multi‑tenant services while addressing framework unification, HA, service discovery, and operational challenges.

Cluster Management · Framework · Marathon

0 likes · 14 min read

dbaplus Community

Apr 13, 2016 · Databases

Secure Redis Cluster: Adding Password Authentication and Automated Node Management

This guide explains why the official Redis Cluster tools lack password support, outlines the security risks of an unauthenticated cluster, and introduces a custom management utility that adds password authentication, automates slot migration, and simplifies adding or removing nodes, complete with step‑by‑step testing procedures.

Cluster Management · Data Migration · Redis Cluster

0 likes · 8 min read

Architect

Mar 12, 2016 · Backend Development

Design and Evolution of Ctrip's Hermes Message Queue System

This article presents a detailed overview of Ctrip's Hermes message queue system, covering its architectural evolution from a simple Mongo‑based design to a broker‑centric, multi‑storage solution with meta‑server coordination, and discusses practical techniques for building high‑performance, scalable messaging infrastructure.

Cluster Management · Ctrip · Distributed Systems

0 likes · 21 min read

ITPUB

Feb 24, 2016 · Big Data

How Pepperdata Optimizes Hadoop Cluster Resources and Improves Performance

The article explains how Hadoop clusters suffer from resource contention among multiple users, why YARN alone often fails to prioritize workloads, and how Pepperdata provides deeper visibility and automatic adjustments that reduce low‑priority usage, cut node count, and lower cloud costs.

Big Data · Cluster Management · Hadoop

0 likes · 7 min read

21CTO

Dec 20, 2015 · Backend Development

How Twitter Scales Redis to 105 TB RAM and 39 M QPS

This article summarizes Yao Yu's "Scaling Redis at Twitter" talk, detailing why Twitter chose Redis, the massive memory and QPS requirements, custom data models, Hybrid List and BTree extensions, cluster management, and operational lessons for building a high‑performance caching service.

Cluster Management · Twitter · backend infrastructure

0 likes · 21 min read

ITPUB

Nov 10, 2015 · Operations

Mastering RHCS: Key Components and Practical Commands for Cluster Management

This guide explains the essential RHCS components—including CMAN, DLM, CCS, and FENCE—details how to start and stop the cluster, manage application services with clusvcadm, monitor cluster status using cman_tool, clustat, and ccs_tool, and maintain GFS2 file systems with dedicated utilities.

Cluster Management · GFS2 · Linux

0 likes · 14 min read

Java High-Performance Architecture

Jun 23, 2015 · Databases

How to Safely Delete Master and Slave Nodes in a Redis Cluster

This guide explains the two scenarios for removing nodes from a Redis cluster—deleting a master node by first migrating its slots and then removing it, and deleting a slave node directly—along with the exact redis-trib.rb commands and verification steps to ensure successful removal.

Cluster Management · Database Administration · Node Deletion

0 likes · 3 min read

Art of Distributed System Architecture Design

Jun 4, 2015 · Operations

An Overview of Google’s Borg Cluster Management System

This article provides a comprehensive overview of Google’s Borg system, detailing its purpose, user perspective, workload types, cluster and cell architecture, job and task management, scheduling algorithms, scalability techniques, and availability mechanisms for large‑scale distributed environments.

Borg · Cluster Management · Operations

0 likes · 22 min read
