Tagged articles
186 articles
Page 2 of 2
HelloTech
Apr 2, 2021 · Backend Development

Traffic Governance and Protection: Threshold Configuration

Effective traffic governance and protection rely on properly configuring thresholds—either a fixed global limit that stays constant regardless of node count, or a per‑machine allocation that scales the total capacity as nodes are added or removed—to prevent sudden surges from overwhelming services and ensure high availability.

Backend Development · Cluster Management · high availability
0 likes · 4 min read
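As a rough illustration of the two threshold modes the summary describes, the sketch below computes the effective limits for each. The function names and the request-rate figures are assumptions made for this example, not values from the article.

```python
def per_node_share(global_limit: int, node_count: int) -> float:
    """Fixed global threshold: the cluster-wide limit is constant,
    so each node's share shrinks as nodes are added."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    return global_limit / node_count


def cluster_capacity(per_machine_limit: int, node_count: int) -> int:
    """Per-machine threshold: each node keeps the same limit,
    so total capacity grows or shrinks with the node count."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    return per_machine_limit * node_count


if __name__ == "__main__":
    # Hypothetical figures: a global limit of 1000 and a per-machine limit of 250.
    for nodes in (2, 4, 8):
        print(nodes, per_node_share(1000, nodes), cluster_capacity(250, nodes))
```

With a global limit, scaling out does not raise the total admitted traffic; with a per-machine limit it does, so the backend behind the cluster has to be able to absorb the extra load.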
Ops Development Stories
Apr 1, 2021 · Operations

Zookeeper Leader Election Explained: Cluster Architecture & Code Walkthrough

This article provides a comprehensive overview of Zookeeper's cluster deployment, explains the four server states, details the leader election process—including initialization, voting, and decision logic—and presents key source code snippets to help developers understand and implement Zookeeper's high‑availability mechanisms.

Cluster Management · Distributed Systems · Java
0 likes · 10 min read
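The decision logic in ZooKeeper's leader election is usually described as an ordering on votes: a higher epoch wins, then a higher zxid, then a higher server id. A minimal sketch of that ordering follows, assuming a simplified `Vote` type that is hypothetical and not ZooKeeper's own class; the real election also requires a quorum of servers to agree on the winning vote.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Vote:
    """Simplified vote (hypothetical type for illustration only)."""
    epoch: int
    zxid: int
    server_id: int

    def rank(self) -> Tuple[int, int, int]:
        # Compared left to right: epoch first, then zxid, then server id.
        return (self.epoch, self.zxid, self.server_id)


def wins(candidate: Vote, current: Vote) -> bool:
    """Return True if `candidate` should replace `current` as the preferred vote."""
    return candidate.rank() > current.rank()


def preferred(votes: Iterable[Vote]) -> Vote:
    """Pick the highest-ranked vote from those seen so far."""
    return max(votes, key=Vote.rank)
```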
JD Tech
Mar 30, 2021 · Artificial Intelligence

JD Retail's Jiushu Business Analytics Platform: AI‑Driven Solutions for Retail

The article introduces JD Retail's Jiushu Business Analytics Platform, detailing how AI, big‑data, and distributed‑training technologies address fragmented retail scenarios, high deployment barriers, large‑scale application difficulties, and cost concerns through specialized frameworks, fault‑tolerant training, and advanced cluster optimization.

AI · Cluster Management · Distributed Training
0 likes · 12 min read
Liangxu Linux
Feb 28, 2021 · Cloud Native

Essential Kubernetes Best Practices for Production‑Ready Clusters

This guide presents a comprehensive checklist of Kubernetes best practices covering container image selection, registry authentication, namespace isolation, labeling, annotations, RBAC, pod security policies, network policies, secrets management, image scanning, CI/CD, canary releases, monitoring, service mesh, and admission controllers to help you build secure, stable, and scalable production clusters.

Cloud Native · Cluster Management · Kubernetes
0 likes · 17 min read
Open Source Linux
Feb 20, 2021 · Cloud Native

Fix Inconsistent Kubernetes rc/deployment/service Deletions and Etcd Failures

This guide walks through troubleshooting Kubernetes issues such as partially deleted resources, resetting etcd, apiserver start failures due to missing ServiceAccount certificates, SELinux permission errors, ServiceAccount key generation, etcd startup errors, host trust configuration, and resource limit pitfalls, providing concrete commands and scripts for each problem.

Cluster Management · Kubernetes · Linux
0 likes · 17 min read
DataFunTalk
Feb 14, 2021 · Big Data

Impala at NetEase: Architecture, Iceberg Integration, Management System, Optimizations and Future Roadmap

This talk presents NetEase's practical experience with Impala, covering its core architecture, new features in version 3.x, integration with Apache Iceberg, a custom management platform, profiling and statistics enhancements, as well as future plans involving Kubernetes, Alluxio caching and pre‑computation strategies.

Apache Iceberg · Big Data · Cluster Management
0 likes · 13 min read
Programmer DD
Jan 21, 2021 · Cloud Native

Master Kubernetes with k9s: The Ultimate CLI for Real-Time Cluster Management

k9s is a command‑line interface that streamlines Kubernetes cluster management by wrapping kubectl. It offers real‑time resource tracking, metric monitoring, support for standard and custom resources, a customizable UI, multi‑resource views, RBAC inspection, and easy navigation of related objects, and it can be downloaded from the official site.

CLI · Cloud Native · Cluster Management
0 likes · 2 min read
Alibaba Cloud Native
Jan 5, 2021 · Cloud Native

How to Transform a Native Kubernetes Cluster into an OpenYurt Edge‑Native Cluster

This guide walks through setting up a cloud‑hosted Kubernetes control plane with a Raspberry Pi edge node, demonstrates the limitations of a native cluster under edge conditions, converts the cluster to OpenYurt using yurtctl, and evaluates operation commands and network‑disruption scenarios to showcase OpenYurt’s edge‑native capabilities.

Cloud Native · Cluster Management · Edge Computing
0 likes · 21 min read
Programmer DD
Dec 28, 2020 · Operations

How to Install and Use Cerebro for Easy Elasticsearch Cluster Management

This guide explains what Cerebro is, how to install it (including binary and Docker options), how to run it on Linux, macOS, and Windows, and how to use its UI to connect to an Elasticsearch node, view cluster overviews, manage shards, and execute DSL queries.

AngularJS · Cerebro · Cluster Management
0 likes · 5 min read
vivo Internet Technology
Dec 23, 2020 · Cloud Native

ZooKeeper: Comprehensive Guide to Distributed Coordination Service

ZooKeeper, Apache’s distributed coordination service, offers a highly available in‑memory hierarchical file system with leader‑follower‑observer clustering and the ZAB protocol, guaranteeing sequential consistency, atomicity and a single view while supporting publish/subscribe, configuration management, distributed locks, master election and queueing for robust distributed applications.

Apache · Cluster Management · Coordination Service
0 likes · 20 min read
Big Data Technology & Architecture
Dec 16, 2020 · Big Data

Designing a Real‑Time Data Processing Platform with Flink: Architecture, Deployment, and Operations

This article explains how to build a real‑time data processing platform using Flink, covering the Lambda architecture, design approaches, SQL and custom‑Jar task definitions, UI drag‑and‑drop, cluster resource management on Yarn and Kubernetes, submission modes, scheduling, permission and metadata handling, logging, and monitoring with Prometheus and Grafana.

Cluster Management · Flink · Lambda architecture
0 likes · 19 min read
MaGe Linux Operations
Dec 3, 2020 · Cloud Native

Essential Kubernetes Tools: Deploy, Monitor, and Develop with Ease

This article introduces a curated list of Kubernetes tools—including cluster deployment solutions, monitoring utilities, CLI helpers, and development aids—explaining how each simplifies container orchestration, enhances DevOps workflows, and empowers engineers to manage, observe, and extend their Kubernetes environments efficiently.

CLI tools · Cluster Management · DevOps
0 likes · 7 min read
DataFunTalk
Nov 27, 2020 · Big Data

Evolution of Kafka‑Based Data Pipeline at Chehaoduo Group: Architecture, Scaling, and Best Practices

This article chronicles the four‑year evolution of Chehaoduo Group’s Kafka ecosystem—from its initial role as a simple data‑ingestion layer to becoming the core of the company’s large‑scale data pipeline—detailing cluster management, upgrade strategies, multi‑cluster deployment, AVRO schema handling, SDK development, and operational lessons learned.

Avro · Cluster Management · Kafka
0 likes · 21 min read
Architecture Digest
Nov 21, 2020 · Operations

Understanding and Handling ZooKeeper Split‑Brain Issues

This article explains the causes of ZooKeeper split‑brain situations, why odd‑numbered node deployments are preferred, how the quorum (majority) rule prevents split‑brain, and outlines practical methods such as quorum configuration, redundant communication, fencing, and pause‑before‑failover to handle and avoid the issue.

Cluster Management · Split-Brain · high availability
0 likes · 13 min read
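The quorum rule in the summary can be worked through with a little arithmetic. The sketch below is an illustration with hypothetical function names, not code from the article: it shows why an even-sized ensemble tolerates no more failures than the next smaller odd one, and why two sides of a network partition can never both hold a majority.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of servers that forms a strict majority."""
    if ensemble_size < 1:
        raise ValueError("ensemble_size must be at least 1")
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many servers can fail while a majority is still reachable."""
    return ensemble_size - quorum_size(ensemble_size)


def has_quorum(partition_size: int, ensemble_size: int) -> bool:
    """True if one side of a partition can still elect a leader."""
    return partition_size >= quorum_size(ensemble_size)


if __name__ == "__main__":
    for n in (3, 4, 5, 6):
        print(n, quorum_size(n), tolerated_failures(n))
```

A 6-node ensemble split 3/3 leaves neither side with the 4 votes it needs, so no second leader can be elected; the cost is that the service is unavailable until the partition heals.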
vivo Internet Technology
Oct 14, 2020 · Backend Development

Design and Implementation of a High‑Availability RabbitMQ Middleware Platform at vivo

vivo built a high‑availability RabbitMQ middleware platform that combines an MQ‑Portal for request‑driven provisioning, an SDK that adds application‑level authentication, automatic cluster discovery, rate‑limiting, reset and blockage‑transfer capabilities, and a stateless MQ‑NameServer for name resolution and health‑based failover, enabling ten‑fold traffic growth without incidents.

Backend · Cluster Management · Message Queue
0 likes · 14 min read
ITPUB
Oct 10, 2020 · Big Data

How Didi Scaled Presto for Petabyte‑Scale Queries: Architecture & Optimizations

Didi’s three‑year journey with Presto transformed it into the company’s primary ad‑hoc and Hive‑SQL acceleration engine, serving over 6 000 users, processing 2‑3 PB of HDFS data daily, and achieving major gains in stability, performance, cost, and usability through extensive architectural tweaks, resource isolation, connector extensions, and monitoring enhancements.

Big Data · Cluster Management · Druid Connector
0 likes · 18 min read
Didi Tech
Oct 9, 2020 · Big Data

Presto at Didi: Architecture, Optimizations, and Operational Experience

At Didi, Presto has been the default ad‑hoc and Hive‑SQL engine for over three years, serving 6,000 users, processing 2‑3 PB daily and 30‑35 trillion rows, with mixed and dedicated clusters, migration to PrestoSQL 340, extensive Hive compatibility, label‑based isolation, a native Druid connector, usability and stability enhancements, and JVM‑level performance optimizations, while planning further resource‑saving upgrades.

Big Data · Cluster Management · Distributed SQL
0 likes · 17 min read
Big Data Technology & Architecture
Sep 18, 2020 · Big Data

Understanding the Elasticsearch Master Election Process

This article explains when Elasticsearch triggers a master election, describes each election stage—including active master and candidate selection, Bully algorithm comparison, and master node responsibilities—while providing code excerpts that illustrate the underlying implementation details.

Big Data · Cluster Management · Distributed Systems
0 likes · 8 min read
MaGe Linux Operations
Aug 5, 2020 · Cloud Native

Top Open-Source Tools to Simplify Kubernetes Management Across Any Environment

Discover a curated list of powerful open-source Kubernetes management solutions—including K9s, Rancher, Dashboard, Kubectl, Kubeadm, Helm, KubeSpray, Kontena Lens, and WKSctl—detailing their core features, deployment options, and how they streamline cluster monitoring, configuration, and application lifecycle across cloud-native environments.

Cloud Native · Cluster Management · DevOps
0 likes · 8 min read
DataFunTalk
Jul 16, 2020 · Big Data

Elasticsearch Practices and Platform Construction at 58.com

This article details 58.com’s extensive use of Elasticsearch for search, analytics, and log processing, covering cluster optimization challenges, typical issues like disk exhaustion and write slowdown, practical solutions, development standards, ELKB architecture, real‑time log and MySQL slow‑log applications, platform‑as‑a‑service construction, and future roadmap plans.

Cluster Management · Elasticsearch · Log Analytics
0 likes · 17 min read
Big Data Technology & Architecture
Jun 22, 2020 · Databases

JDHBase Multi‑Active Architecture and Replication Mechanisms

This article describes JDHBase’s large‑scale KV storage, its HBase‑based replication principle, the multi‑active cluster architecture with Fox Manager, client routing, automatic failover, dynamic replication tuning, serial replication guarantees, and future directions for improving cross‑region disaster recovery.

Cluster Management · HBase · JDHBase
0 likes · 11 min read
Big Data Technology Architecture
Jun 11, 2020 · Big Data

Kylin at Autohome: Development History, Deployment Practices, Optimizations, and Future Roadmap

This article details Autohome's use of Apache Kylin as its core OLAP engine, covering its architecture, large‑scale Cube deployment, real‑world business applications, a series of performance and operational optimizations, cluster upgrade experiences, and upcoming plans for real‑time OLAP and cloud‑native evolution.

Cloud Native · Cluster Management · Kylin
0 likes · 24 min read
Java Architecture Diary
Jun 6, 2020 · Cloud Native

Explore Nacos 1.3.0: Embedded DB, New Raft Protocol, and High‑Availability

Nacos 1.3.0 introduces an embedded relational database, unified cluster management, an upgraded Raft consistency layer, security patches, Snowflake ID configuration, data migration guidance, new cluster addressing modes, and a set of Open‑API operations for Raft administration, all aimed at simplicity, performance, and high availability.

Cluster Management · Embedded Database · Nacos
0 likes · 10 min read
Architect
May 15, 2020 · Databases

Understanding Elasticsearch Architecture: Segments, Translog, Refresh, Shard Allocation and Cluster Operations

This article provides a comprehensive overview of Elasticsearch's internal architecture, explaining how data flows from memory buffers to Lucene segments, the role of refresh and translog for durability, segment merging strategies, shard routing, replica consistency, allocation controls, hot‑cold data separation, and cluster discovery settings.

Cluster Management · Elasticsearch · Segments
0 likes · 23 min read
Open Source Linux
May 2, 2020 · Big Data

Mastering Apache Zookeeper: Core Concepts and Real-World Big Data Applications

Apache Zookeeper is an open‑source coordination service that provides reliable distributed synchronization, configuration management, and naming for big‑data components such as Hadoop, HBase, and Kafka, offering features like hierarchical znode structures, watches, master election, and distributed locks to maintain cluster health.

Apache Zookeeper · Big Data · Cluster Management
0 likes · 17 min read
Big Data Technology Architecture
Mar 7, 2020 · Operations

How to Perform a Graceful Shutdown of an Elasticsearch Node

This article outlines a step‑by‑step procedure for safely taking an Elasticsearch node offline—checking master‑eligible settings, adjusting minimum_master_nodes, excluding the node from routing, waiting for shard relocation, stopping the service, and restoring the cluster routing—ensuring no data loss or service interruption.

Cluster Management · DevOps · Elasticsearch
0 likes · 6 min read
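The summary's steps map onto a few cluster-settings updates. The sketch below only builds the JSON bodies that would be sent with `PUT /_cluster/settings`; it is an illustration under assumptions, not the article's own commands. The node name `node-3` is hypothetical, and `minimum_master_nodes` only applies to Elasticsearch versions before 7.x.

```python
import json
from typing import Any, Dict


def minimum_master_nodes(master_eligible: int) -> int:
    """Majority of master-eligible nodes, used for the pre-7.x
    discovery.zen.minimum_master_nodes setting."""
    if master_eligible < 1:
        raise ValueError("master_eligible must be at least 1")
    return master_eligible // 2 + 1


def exclude_node(node_name: str) -> Dict[str, Any]:
    """Settings body that tells the cluster to move shards off one node."""
    return {"transient": {"cluster.routing.allocation.exclude._name": node_name}}


def clear_exclusion() -> Dict[str, Any]:
    """Settings body that resets the exclusion once the node is back or retired."""
    return {"transient": {"cluster.routing.allocation.exclude._name": None}}


if __name__ == "__main__":
    # Hypothetical node name; print the bodies instead of sending them.
    print(json.dumps(exclude_node("node-3")))
    print(json.dumps(clear_exclusion()))
```

Between the two calls, an operator would wait until the excluded node holds no shards before stopping the service.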
Java Backend Technology
Mar 7, 2020 · Backend Development

Mastering ZooKeeper: Core Concepts, Distributed Locks, and Cluster Management

This article explains ZooKeeper's role as a distributed coordination service, covering its architecture, znode types, watch mechanism, configuration management, distributed locking, queue handling, data replication, leader election, and synchronization processes for building reliable backend systems.

Cluster Management · Coordination Service · ZooKeeper
0 likes · 18 min read
dbaplus Community
Feb 22, 2020 · Databases

How to Perform Daily Maintenance on GaussDB T Clusters Without Pitfalls

This guide walks you through the essential daily maintenance tasks for GaussDB T clusters, covering ETCD startup, cluster health checks, host resource monitoring, tablespace usage, abnormal wait events, log inspection, and common error troubleshooting with concrete commands and SQL examples.

Cluster Management · Database Maintenance · Error Handling
0 likes · 11 min read
JD Retail Technology
Jan 6, 2020 · Backend Development

JDHBase Multi‑Active Architecture and Replication Practices

This article describes JDHBase’s large‑scale KV storage deployment, its HBase‑based asynchronous replication mechanism, the multi‑active architecture with active‑standby clusters, client interaction via Fox Manager, automatic failover strategies, dynamic replication tuning, and serial replication techniques to ensure data consistency across data centers.

Cluster Management · HBase · Replication
0 likes · 13 min read
Efficient Ops
Dec 17, 2019 · Operations

How Alibaba Scales Flink: Lessons in Big Data Operations

This article details Alibaba's massive Flink deployment, covering its historical background, the operational challenges of managing tens of thousands of nodes, the design of a comprehensive Flink management platform, and the automated solutions for fault handling, resource allocation, and performance testing in a large‑scale big‑data environment.

Automation · Big Data Operations · Cluster Management
0 likes · 20 min read
Alibaba Cloud Native
Nov 30, 2019 · Cloud Native

How Alibaba Cloud Manages Over 10,000 Kubernetes Clusters at Double‑11 Scale

This article explains how Alibaba Cloud Container Service (ACK) designs a unit‑based, tiered management system, capacity planning model, global observability architecture, and pluggable components to reliably operate more than ten thousand diverse Kubernetes clusters during the massive Double‑11 shopping event.

ACK · Alibaba Cloud · Cluster Management
0 likes · 13 min read
dbaplus Community
Nov 13, 2019 · Databases

Scaling TiDB to 200TB at Zhuanzhuan: Key Performance and Management Lessons

Zhuanzhuan’s adoption of TiDB addressed sharding challenges and massive data storage needs, and the team shares six common issues encountered in large‑scale online deployments—including performance diagnosis, cluster management, log inconsistencies, slow‑SQL impact, optimizer limitations, and transaction conflicts—along with their standardized solutions for deployment, monitoring, alerting, and business rollout.

Cluster Management · SQL Optimization · TiDB
0 likes · 12 min read
Alibaba Cloud Developer
Nov 13, 2019 · Cloud Native

Scaling Kubernetes at Ant Financial: The Kube‑on‑Kube Architecture Explained

Ant Financial’s article details how its large‑scale Kubernetes management system—built on a meta‑cluster, end‑state operators, and a Kube‑on‑Kube design—ensures reliable creation, upgrade, and self‑healing of thousands of nodes, while providing gray‑scale changes, risk assessment, and fault‑tolerant automation.

Cloud Native · Cluster Management · Kubernetes
0 likes · 15 min read
dbaplus Community
Oct 29, 2019 · Cloud Native

How Meituan‑Dianping Scaled Kubernetes to 100k+ Nodes with HULK2.0

Meituan‑Dianping describes its evolution from a custom Docker‑based scheduler (HULK1.0) to an open‑source Kubernetes‑based platform (HULK2.0), detailing architecture, resource‑management strategies, scheduler optimizations, Kubelet enhancements, and online‑cluster tuning that together enable stable, cost‑effective operation of a 100k+ node fleet.

Cloud Native · Cluster Management · Kubernetes
0 likes · 19 min read
Alibaba Cloud Native
Oct 27, 2019 · Cloud Native

How Ant Financial Scales Kubernetes: Inside Their Kube‑on‑Kube Management System

This article explains how Ant Financial designs and operates a large‑scale, highly available Kubernetes management platform—using end‑state driven operators, custom CRDs, self‑healing mechanisms, and risk‑mitigation strategies—to reliably run thousands of nodes and dozens of business clusters in production.

Cluster Management · Kube-on-Kube · Kubernetes
0 likes · 15 min read
Big Data Technology & Architecture
Oct 20, 2019 · Cloud Native

Meituan-Dianping Kubernetes Cluster Management and Optimization Practices

This article details Meituan-Dianping's evolution from custom Docker‑based scaling to a Kubernetes‑driven, cloud‑native cluster management platform (HULK), describing its architecture, scheduler enhancements, Kubelet modifications, and resource‑optimization strategies for large‑scale operations.

Cloud Native · Cluster Management · Kubernetes
0 likes · 17 min read
dbaplus Community
Oct 8, 2019 · Big Data

How to Master Large-Scale Cluster Management: 10 Real-World Troubleshooting Cases

This article shares a senior data‑platform engineer's hands‑on experience managing dozens of thousand‑node clusters, detailing nine common cluster problems and step‑by‑step solutions—including performance tuning, RPC fixes, HDFS cleanup, Hive metadata repair, Spark shuffle optimization, HBase region recovery, and Kafka bottleneck mitigation.

Big Data · Cluster Management · HBase
0 likes · 17 min read
Big Data Technology Architecture
Sep 26, 2019 · Databases

Elasticsearch Core Overview and Key Performance Metrics

This article provides a comprehensive guide to Elasticsearch’s architecture, node roles, data organization, and the most important performance metrics—including search, indexing, memory, JVM garbage collection, host‑level system metrics, cluster health, and resource saturation—offering practical advice on monitoring and tuning the cluster for reliability and efficiency.

Cluster Management · Elasticsearch · JVM
0 likes · 27 min read
21CTO
Aug 23, 2019 · Cloud Native

How Meituan Optimized Kubernetes at Scale: Lessons from HULK2.0

This article details Meituan‑Dianping's evolution from a custom Docker‑based cluster manager to the open‑source Kubernetes‑powered HULK2.0 platform, describing its architecture, operational practices, scheduler and Kubelet optimizations, and resource‑management techniques that enable massive, cost‑effective scaling.

Cluster Management · Meituan · Performance Optimization
0 likes · 19 min read
Meituan Technology Team
Aug 22, 2019 · Cloud Native

Meituan-Dianping Kubernetes Cluster Management and Optimization Practices

Meituan‑Dianping’s evolution from virtualization to the HULK‑2.0 Kubernetes platform enables a 100,000‑instance, multi‑region cluster to achieve high elasticity and availability, using scheduler optimizations, local‑optimal placement, enhanced kubelet features, and fine‑grained resource management to maximize throughput during traffic spikes.

Cluster Management · Kubernetes · Meituan
0 likes · 19 min read
AntTech
Aug 15, 2019 · Cloud Native

Design and Implementation of Ant Financial’s Large‑Scale Kubernetes Cluster Management System

This article explains how Ant Financial designs a highly reliable, end‑state‑driven Kubernetes management platform that handles lifecycle operations, node self‑healing, and risk‑controlled changes for clusters with tens of thousands of nodes, using operators, custom resources, and a meta‑cluster architecture.

Cluster Management · Kubernetes · large scale
0 likes · 9 min read
MaGe Linux Operations
May 28, 2019 · Operations

What Skills and Knowledge Do You Need to Master Large‑Scale Website Operations?

This article explains what large‑scale website operations entail, outlines the product lifecycle and the crucial role of operations engineers, lists essential technical skills and personal qualities, and discusses current challenges, future prospects, and key technical topics such as cluster management, monitoring, fault handling, and automation.

Automation · Cluster Management · DevOps
0 likes · 18 min read
Youzan Coder
Apr 27, 2019 · Big Data

Recap of Elastic Community Technical Salon: Cluster Management, Multi‑Tenant Practices, and Search Platform Engineering

On April 27, Youzan Technology and the Elastic Chinese community hosted a “starry sky” technical salon where experts from Getui, Ant Financial, Haipai Ke and Youzan presented four talks on large‑cluster proxy management, multi‑tenant ES optimization, search‑platform engineering, and the evolution of Youzan’s log platform, followed by lively Q&A and resource sharing.

Cluster Management · Elasticsearch · Log Analytics
0 likes · 6 min read
Java Captain
Apr 9, 2019 · Operations

Zookeeper Overview: Functions, Deployment Modes, Synchronization, and Notification Mechanism

This article explains Zookeeper as an open‑source distributed coordination service, detailing its core functions such as cluster management, leader election, distributed locks, and naming service, along with its three deployment modes, state‑synchronization via the ZAB protocol, and its watcher‑based notification mechanism.

Cluster Management · Distributed Coordination · Naming Service
0 likes · 4 min read
MaGe Linux Operations
Jan 24, 2019 · Operations

What It Takes to Master Large‑Scale Website Operations?

This article explores the definition, responsibilities, required skills, career challenges, and key technologies of large‑scale website operations, offering a comprehensive guide for aspiring and current operations engineers to understand and excel in this demanding field.

Automation · Career Development · Cluster Management
0 likes · 20 min read
Architecture Talk
Jan 8, 2019 · Big Data

Boost Elasticsearch Performance: Bulk API, Gateway & Caching Secrets

This article explains how to dramatically improve Elasticsearch throughput by using the bulk API, tuning bulk request sizes, configuring gateway settings, optimizing cluster state updates, managing caches, leveraging fielddata and doc values, and employing tools like Curator and the Profiler for efficient cluster operations.

Cluster Management · Elasticsearch · bulk API
0 likes · 27 min read
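To make the bulk-request idea concrete, the sketch below builds the newline-delimited body that the Elasticsearch `_bulk` endpoint expects (one action line followed by one source line per document) and splits a document list into batches. The helper names, the index name, and the batch size are assumptions for illustration; older Elasticsearch versions also expect a document type, which is omitted here.

```python
import json
from typing import Any, Dict, Iterable, Iterator, List, Tuple

Doc = Tuple[str, Dict[str, Any]]  # (document id, document source)


def bulk_body(index: str, docs: Iterable[Doc]) -> str:
    """Serialize documents as NDJSON: an action line, then the source line.
    Every line, including the last, ends with a newline."""
    lines: List[str] = []
    for doc_id, source in docs:
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps(source))
    return "".join(line + "\n" for line in lines)


def chunked(docs: List[Doc], size: int) -> Iterator[List[Doc]]:
    """Split documents into batches so each bulk request stays a tunable size."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(docs), size):
        yield docs[start:start + size]


if __name__ == "__main__":
    docs = [(str(i), {"msg": f"event {i}"}) for i in range(5)]
    for batch in chunked(docs, 2):  # hypothetical batch size
        print(bulk_body("logs", batch), end="")
```

The right batch size depends on document size and cluster capacity, so it is usually found by measuring throughput at a few settings rather than fixed up front.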
vivo Internet Technology
Dec 28, 2018 · Big Data

Running a 400+ Node Elasticsearch Cluster: Architecture, Scaling, and Performance Tuning

Meltwater’s media‑monitoring platform runs a custom Elasticsearch 1.7.6 cluster of over 400 nodes on AWS, handling 200 TB of primary data and 3 million daily documents while serving thousands of complex queries per minute, achieved through careful shard design, master‑node configuration, extensive performance tuning, and automated provisioning.

AWS · Cluster Management · Elasticsearch
0 likes · 13 min read
360 Tech Engineering
Sep 29, 2018 · Operations

Design and Implementation of a Multi‑Task Scheduling System Based on Apache Mesos

This article describes how we identified underutilized CPU and memory resources in our company's servers, evaluated Kubernetes versus Apache Mesos, and built a non‑intrusive, Mesos‑based multi‑task scheduling system with dynamic resource reservation, monitoring, task isolation, and cluster‑wide observability, while addressing deployment challenges.

Cluster Management · Docker alternative · Mesos
0 likes · 11 min read
dbaplus Community
Aug 14, 2018 · Operations

How Ant Financial Automates Tens of Thousands of Kubernetes Nodes with Operators

Ant Financial tackles the challenge of managing dozens of Kubernetes clusters and over a hundred thousand worker nodes by employing a meta‑cluster with Kube‑on‑Kube and Node Operators, enabling automated lifecycle management, scaling, upgrades, and fault recovery for both master components and worker nodes.

Automation · Cluster Management · Kubernetes
0 likes · 12 min read
UCloud Tech
Jul 31, 2018 · Fundamentals

What’s New in Ceph? July 2018 Developer Highlights and Key Feature Updates

The July 2018 Ceph Developer Monthly report from the UMCloud storage team summarizes the latest community contributions, including enhancements to object and block storage, new OPA integration for fine‑grained access control, crash‑dump management, dashboard user UI, and batch operations for ceph‑volume.

Ceph · Cluster Management · Developer Updates
0 likes · 6 min read
Alibaba Cloud Developer
Jun 26, 2018 · Operations

How Scheduling Algorithms Power Efficient Data Center Resource Management

Scheduling algorithms are a crucial component of cluster resource management systems, determining where containerized tasks run to ensure resource needs, high availability, fault tolerance, and cost efficiency across individual containers, applications, and entire data centers, while also supporting Alibaba’s global scheduling challenge.

Cluster Management · Data center · algorithm competition
0 likes · 10 min read
ITPUB
May 31, 2018 · Big Data

Mastering Spark on DataMagic: Fast‑Track Your Big Data Skills

This article explains Spark's role in the DataMagic platform, outlines four practical steps to quickly master Spark, details key configuration and parallelism settings, shows how to modify Spark code, and provides operational tips for cluster management and job troubleshooting.

Big Data · Cluster Management · Configuration
0 likes · 10 min read
UCloud Tech
Apr 28, 2018 · Operations

Ceph April 2018 Update: New Object, Block, and Cluster Features

The April 2018 Ceph monthly report highlights LTTng tracing for RGW, SSL support for the beast frontend, MFA integration, a notrim option for rbd mapping, runtime lz4 and brotli compression, Zabbix PG metrics, asynchronous dashboard tasks, detailed operation tracking, osdmap pruning, and a new iostat manager plugin.

Ceph · Cluster Management · LTTng
0 likes · 9 min read
Alibaba Cloud Developer
Apr 12, 2018 · Backend Development

How Alibaba’s Cainiao Scales a Lightweight Timer Engine for Billions of Packages

Facing the challenge of processing over 100 million daily parcels, Alibaba’s Cainiao designed a lightweight, time‑wheel‑based scheduling engine that decouples task storage from timing, leverages partitioned task chains, master‑driven node IDs, and cluster‑wide soft‑load balancing to achieve scalable, fault‑tolerant timer processing.
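
The time-wheel idea named in this summary can be illustrated with a minimal single-level hashed timing wheel. This is a generic sketch, not Cainiao's engine: the class `TimeWheel`, its slot count, and the task ids are hypothetical, and persistence, partitioned task chains, and cluster coordination are all left out. The wheel holds only task ids, on the assumption that task payloads live in separate storage.

```python
from collections import defaultdict


class TimeWheel:
    """Single-level hashed timing wheel; one tick is one abstract time unit."""

    def __init__(self, slots=8):
        self.slots = slots
        self.current = 0                   # slot the pointer is currently on
        self.buckets = defaultdict(list)   # slot index -> [(rounds, task_id)]

    def schedule(self, delay_ticks, task_id):
        # A task due in `delay_ticks` ticks goes into the slot the pointer
        # reaches at that time; `rounds` counts full revolutions still to wait.
        if delay_ticks < 1:
            raise ValueError("delay_ticks must be >= 1")
        slot = (self.current + delay_ticks) % self.slots
        rounds = (delay_ticks - 1) // self.slots
        self.buckets[slot].append((rounds, task_id))

    def tick(self):
        # Advance the pointer by one slot and return the task ids now due.
        self.current = (self.current + 1) % self.slots
        due, waiting = [], []
        for rounds, task_id in self.buckets.pop(self.current, []):
            if rounds == 0:
                due.append(task_id)
            else:
                waiting.append((rounds - 1, task_id))
        if waiting:
            self.buckets[self.current] = waiting
        return due
```

Each `tick()` touches only one bucket, so the per-tick cost depends on the tasks in that slot rather than on the total number of pending timers, which is the usual reason for choosing a time wheel over scanning a sorted task list.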

Backend Engineering · Cluster Management · Time Wheel
0 likes · 12 min read
Architects' Tech Alliance
Dec 18, 2017 · Fundamentals

GPFS Technical Practice Sharing and Building‑Block Design Overview

This article provides a comprehensive overview of IBM GPFS, covering its architecture, management components, networking models, cluster and storage design, as well as practical guidance on building‑block configurations for performance and capacity scaling in high‑performance computing environments.

Building Block · Cluster Management · Distributed File System
0 likes · 13 min read
Efficient Ops
Nov 2, 2017 · Operations

Managing Massive Elasticsearch Clusters: Lessons from a 120‑Node Deployment

This article shares practical insights on operating large‑scale Elasticsearch clusters for log analysis, covering use cases, essential tools, hardware choices, node role separation, shard management, hot‑cold data strategies, version upgrades, and key monitoring metrics to ensure stability and performance.

Cluster Management · Elasticsearch · Hardware Scaling
0 likes · 12 min read
37 Interactive Technology Team
Oct 19, 2017 · Big Data

Ambari Technical Practice for Managing Hadoop Big Data Platforms

The team adopted Apache Ambari to streamline deployment, scaling, monitoring, and upgrades of their Hadoop‑centric big‑data platform. They worked through two obstacles, taking over existing HA clusters and integrating a custom Hive 2.1 build, and rolled the change out in three phases (test, gray‑scale, production), improving management efficiency and reducing O&M costs.

Ambari · Cluster Management · HDP
0 likes · 18 min read
Alibaba Cloud Developer
Sep 6, 2017 · Operations

How Mixed Offline/Online Scheduling Boosted Alibaba’s Data Center Utilization by 30%

The article explains how rapid internet growth has expanded data centers, why traditional operations fall short, presents a simple utilization formula, shows Alibaba’s mixed offline‑online scheduling experiment that raised server usage from 10% to over 40%, and announces an open dataset for academic research.

Alibaba · Cluster Management · data center utilization
0 likes · 7 min read
High Availability Architecture
Apr 14, 2017 · Databases

Recent Improvements in Elasticsearch 5.x and Outlook for 6.0

This article reviews the latest Elasticsearch 5.x enhancements: append‑only indexing, range fields, removal of the _all field, unified highlighter, keyword normalizer, multi‑word synonyms, field collapsing, cancellable searches, partitioned term aggregations, cluster allocation explain, Java REST client updates, cross‑cluster search, and batched reduce phases. It also previews the major features expected in Elasticsearch 6.0, such as sparse doc values, index sorting, sequence numbers, seamless rolling upgrades, type removal, index‑template inheritance, load‑aware shard routing, and X‑Pack extensions like SQL and machine learning.

Cluster Management · Elasticsearch · Search
0 likes · 15 min read
DevOps
Apr 10, 2017 · Cloud Native

Applying Docker and Kubernetes to Build Scalable, Automated Test Environments

The talk outlines how Docker and Kubernetes were adopted to streamline test environment provisioning, address challenges like environment inconsistency and resource scarcity, and enable automated, standardized, and scalable testing infrastructure through containerization, networking, storage, and cluster management techniques.

Cluster Management · DevOps · Docker
0 likes · 19 min read
Qunar Tech Salon
Dec 30, 2016 · Operations

Mesos Architecture and Its Practical Use at Qunar: Framework Unification and Operational Insights

This article explains the Mesos distributed system kernel, its resource‑allocation workflow, and how Qunar engineers applied and evolved Mesos, Marathon, and custom frameworks to achieve fine‑grained scheduling, high availability, service discovery, and multi‑tenant management in a large‑scale production environment.

Cluster Management · Distributed Systems · Framework
0 likes · 14 min read
Qunar Tech Salon
Nov 10, 2016 · Operations

Zookeeper Operational Best Practices and Common Pitfalls

This article shares practical experience on operating Zookeeper clusters, covering core concepts, deployment recommendations, configuration tuning, monitoring, migration strategies, and a list of common issues to avoid for reliable distributed coordination.

Cluster Management · Distributed Coordination · best practices
0 likes · 11 min read
ITPUB
Oct 13, 2016 · Databases

Mastering Oracle RAC: Step‑by‑Step Commands to Start and Stop Clusters

This guide provides a comprehensive, command‑by‑command walkthrough for shutting down and starting Oracle RAC clusters, covering srvctl and crsctl usage, status checks, and essential options to help administrators manage database instances and cluster services reliably.

Cluster Management · Database Shutdown · Database Startup
0 likes · 12 min read
Alibaba Cloud Developer
Aug 25, 2016 · Operations

Comparing Modern Data‑Center Schedulers: Borg, Mesos, Omega, Kubernetes & Zeus

This article examines resource allocation philosophies—auction, budgeting, and preemption—and compares the architectures, data models, and APIs of major schedulers such as Borg, Omega, Mesos, Kubernetes, and Alibaba’s Zeus, while also exploring sharing strategies, task classifications, utilization metrics, and predictive techniques for efficient resource management.

Borg · Cluster Management · Kubernetes
0 likes · 34 min read
dbaplus Community
Aug 23, 2016 · Operations

How to Build a Scalable Automated Deployment System for Multi‑Node Clusters

This article walks through the shortcomings of manual code releases, designs a multi‑environment automated deployment workflow, details step‑by‑step implementation—including code fetching, configuration handling, logging, parallel execution, and rollback—while sharing practical scripts and common pitfalls for large‑scale clusters.

Cluster Management · Deployment Automation · DevOps
0 likes · 10 min read
Architecture Digest
Aug 8, 2016 · Databases

Understanding Elasticsearch Architecture: Clusters, Shards, Discovery, and Scaling

This article provides a comprehensive overview of Elasticsearch 2.x, covering its distributed architecture, core concepts such as clusters, nodes, indices, shards and replicas, the ZenDiscovery master‑election process, scaling mechanisms, recovery, query features, and the underlying system components like Guice, Netty, and thread‑pool designs.

Cluster Management · Elasticsearch · NoSQL
0 likes · 20 min read
High Availability Architecture
Apr 27, 2016 · Operations

Mesos Architecture and Its Deployment at Qunar: Framework Unification and Operational Strategies

This article explains the Mesos distributed system kernel, its master‑slave architecture, fine‑grained resource scheduling, and how Qunar leverages Mesos and Marathon for log processing, Spark, Alluxio, and multi‑tenant services while addressing framework unification, HA, service discovery, and operational challenges.

Cluster Management · Framework · Marathon
0 likes · 14 min read
dbaplus Community
Apr 13, 2016 · Databases

Secure Redis Cluster: Adding Password Authentication and Automated Node Management

This guide explains why the official Redis Cluster tools lack password support, outlines the security risks of an unauthenticated cluster, and introduces a custom management utility that adds password authentication, automates slot migration, and simplifies adding or removing nodes, complete with step‑by‑step testing procedures.

Cluster Management · Data Migration · Redis Cluster
0 likes · 8 min read
Architect
Mar 12, 2016 · Backend Development

Design and Evolution of Ctrip's Hermes Message Queue System

This article presents a detailed overview of Ctrip's Hermes message queue system, covering its architectural evolution from a simple Mongo‑based design to a broker‑centric, multi‑storage solution with meta‑server coordination, and discusses practical techniques for building high‑performance, scalable messaging infrastructure.

Cluster Management · Ctrip · Distributed Systems
0 likes · 21 min read
ITPUB
Feb 24, 2016 · Big Data

How Pepperdata Optimizes Hadoop Cluster Resources and Improves Performance

The article explains how Hadoop clusters suffer from resource contention among multiple users, why YARN alone often fails to prioritize workloads, and how Pepperdata provides deeper visibility and automatic adjustments that reduce low‑priority usage, cut node count, and lower cloud costs.

Big Data · Cluster Management · Hadoop
0 likes · 7 min read
21CTO
Dec 20, 2015 · Backend Development

How Twitter Scales Redis to 105 TB RAM and 39 M QPS

This article summarizes Yao Yu's "Scaling Redis at Twitter" talk, detailing why Twitter chose Redis, the massive memory and QPS requirements, custom data models, Hybrid List and BTree extensions, cluster management, and operational lessons for building a high‑performance caching service.

Cluster Management · Twitter · backend infrastructure
0 likes · 21 min read
ITPUB
Nov 10, 2015 · Operations

Mastering RHCS: Key Components and Practical Commands for Cluster Management

This guide explains the essential RHCS components—including CMAN, DLM, CCS, and FENCE—details how to start and stop the cluster, manage application services with clusvcadm, monitor cluster status using cman_tool, clustat, and ccs_tool, and maintain GFS2 file systems with dedicated utilities.

Cluster Management · GFS2 · Linux
0 likes · 14 min read
Java High-Performance Architecture
Jun 23, 2015 · Databases

How to Safely Delete Master and Slave Nodes in a Redis Cluster

This guide explains the two scenarios for removing nodes from a Redis cluster—deleting a master node by first migrating its slots and then removing it, and deleting a slave node directly—along with the exact redis-trib.rb commands and verification steps to ensure successful removal.

Cluster Management · Database Administration · Node Deletion
0 likes · 3 min read