Tagged articles
33 articles
Raymond Ops
Dec 17, 2025 · Operations

Build a Production‑Ready Prometheus HA Architecture with Federation and Remote Storage

Learn how to design and implement a robust, production‑grade Prometheus high‑availability solution using a federated global cluster, multiple business‑level instances, remote storage with Thanos or VictoriaMetrics, Docker‑Compose deployment, health‑check scripts, performance metrics, alerting rules, and best‑practice operational guidelines.

Docker Compose · Federation · Remote Storage
17 min read
Kuaishou Tech
Oct 31, 2024 · Cloud Native

Stateful Service Cloud‑Native Practices: Kuaishou’s Redis on Kubernetes

This article examines the challenges and benefits of running stateful services such as Redis on Kubernetes, presents Kuaishou’s practical experience with cloud‑native migration, evaluates risks and performance impacts, and details the custom workloads, operators, federation and KubeBlocks solutions that enable large‑scale, reliable stateful service orchestration.

Cloud Native · Federation · KubeBlocks
12 min read
DataFunSummit
Oct 17, 2024 · Big Data

Waggle Dance Based Metadata Solution at Tongcheng Travel: Architecture, Migration Strategies, and Future Outlook

This article presents Tongcheng Travel's metadata solution built on the open‑source Waggle Dance project, detailing the three‑layer architecture, challenges of a monolithic Hive Metastore, evaluated migration plans, federation implementation, migration workflow, and future directions for unified metadata governance.

Data Migration · Federation · Hive Metastore
11 min read
Ops Development Stories
Jun 28, 2024 · Cloud Native

Multi-Cluster Kubernetes: Benefits, Federation, Karmada, and Practical Tips

This article explains why organizations adopt multi‑cluster Kubernetes for high availability, hybrid‑cloud scaling, and fault isolation, outlines the preparatory steps, compares Federation v1 and v2, introduces Karmada as a CNCF project, and shares practical non‑federated deployment, monitoring, traffic management, and migration techniques with code examples.

Cloud Native · DevOps · Federation
18 min read
Alibaba Cloud Native
Apr 8, 2024 · Cloud Native

How to Build a Global View for Multiple Prometheus Instances – Community and Alibaba Cloud Solutions

This article explains why a global view is needed when Prometheus metrics are scattered across many instances, compares community approaches such as Federation, Thanos, and Remote Write, and details Alibaba Cloud's Global Aggregation Instance and Remote Write solutions with configuration examples and a real‑world case study.

Federation · Global View · Prometheus
25 min read
DevOps Operations Practice
Mar 14, 2024 · Operations

Resolving Frequent Crashes of a Single-Node Prometheus Deployment: Analysis and Solutions

This article analyzes why a single Prometheus instance repeatedly runs out of memory and crashes, explains the underlying storage mechanisms, and presents practical solutions such as metric reduction, retention tuning, federation architecture, and remote storage integration to improve stability and scalability.

Federation · Prometheus · monitoring
6 min read
Volcano Engine Developer Services
Jul 7, 2023 · Cloud Native

How KubeAdmiral Redefines Multi-Cluster Kubernetes Federation for Scale and Efficiency

After Kubernetes became the de‑facto standard, ByteDance ran into the scaling limits of single‑cluster setups. It first adopted KubeFed V2 and later developed KubeAdmiral, a next‑generation multi‑cluster federation system that enhances scheduling, resource efficiency, native API support, and dynamic scaling across clouds.

Federation · KubeAdmiral · Kubernetes
15 min read
Efficient Ops
Aug 28, 2022 · Cloud Native

Mastering Kubernetes Federation: Install, Join Clusters, and Sync Resources

This guide explains the purpose of Kubernetes Federation, its benefits for multi‑cluster management, step‑by‑step installation using Helm and kubefedctl, how to join and unjoin clusters, enable resource federation, and provides a cheat sheet of common commands for reliable cross‑cluster deployments.

Federation · KubeFed · Kubernetes
8 min read
Architect's Guide
Jun 26, 2022 · Backend Development

Building a Million‑Message‑Per‑Second RabbitMQ Service: Architecture, Scaling, and High Availability

This article explains how to design and operate a RabbitMQ cluster capable of handling millions of messages per second by describing RabbitMQ fundamentals, Google‑scale deployment, sharding and consistent‑hash plugins, high‑availability mirroring, federation, and integration with Spring AMQP, while also covering practical deployment scenarios and performance trade‑offs.

Federation · Message Queue · RabbitMQ
23 min read
ITPUB
May 7, 2022 · Big Data

How eBay Scaled HDFS to 800 PB Using Federation and Router‑Based Architecture

This article details eBay's evolution of its massive HDFS storage—from a single‑cluster design to ViewFS Federation, then to Router‑Based Federation—highlighting the performance bottlenecks, optimization techniques, FastCopy integration, and future plans for further scaling and automation.

Federation · HDFS · Performance Optimization
11 min read
IT Services Circle
Apr 3, 2022 · Cloud Native

Understanding Kubernetes Federation: kubefed and Karmada Multi‑Cluster Management

This article explains why Kubernetes single‑cluster scalability is limited to about 5,000 nodes, introduces the concept of multi‑cluster federation, compares the legacy kubefed project with the actively maintained Karmada solution, and shows how policies and replica‑scheduling enable flexible cross‑AZ deployments and failover.

Cloud Native · Cluster Management · Federation
13 min read
IT Architects Alliance
Jan 14, 2022 · Operations

Scaling RabbitMQ to Million‑Message Throughput: Architecture, Plugins, and High‑Availability Practices

This article explains how to horizontally scale RabbitMQ clusters, use sharding and federation plugins, configure mirror queues and other high‑availability features, and apply practical patterns such as confirms, retries, and delayed delivery to achieve million‑level message throughput in production environments.

Federation · Message Queue · RabbitMQ
23 min read
Architecture Digest
Jan 13, 2022 · Backend Development

Scaling RabbitMQ to Million‑Message Throughput: Architecture, Sharding, Federation, and High‑Availability Practices

This article explains how to horizontally scale RabbitMQ clusters to handle millions of messages per second by leveraging cluster modes, mirror queues, sharding plugins, consistent‑hash exchanges, federation, and high‑availability configurations, while also covering practical scenarios such as retries, delayed tasks, and Spring AMQP integration.

Federation · Message Queue · RabbitMQ
22 min read
Open Source Linux
Jan 5, 2022 · Operations

Designing Scalable High‑Availability Prometheus Architectures

This article explains how to build both small‑scale and large‑scale high‑availability Prometheus setups using local and remote storage, federation, keepalived, and PostgreSQL + TimescaleDB adapters to ensure reliable monitoring and alerting across growing infrastructures.

Federation · Ops · Prometheus
6 min read
MaGe Linux Operations
Dec 1, 2021 · Operations

Scalable High‑Availability Prometheus: Small‑Scale to Massive Deployments

This article explains how Prometheus’s local storage limits scalability and how Remote Storage, federation, and high‑availability setups—using dual instances, keepalived, and adapters with PostgreSQL + TimescaleDB—can overcome data persistence and performance challenges for both small‑scale and large‑scale monitoring environments.

Federation · Prometheus · Remote Storage
5 min read
Qingyun Technology Community
Sep 15, 2021 · Cloud Native

Why Enterprises Embrace Hybrid Multi‑Cloud and Kubernetes Multi‑Cluster Strategies

Enterprises adopt hybrid multi‑cloud architectures driven by high‑profile security incidents and regulatory demands, leveraging Kubernetes multi‑cluster capabilities such as disaster recovery, latency reduction, isolation, fault containment, and vendor‑lock‑in avoidance, with solutions like Federation v1/v2 and KubeSphere illustrated through real‑world case studies.

Federation · KubeSphere · Kubernetes
13 min read
Efficient Ops
Jul 25, 2021 · Cloud Native

Why Enterprises Need Multi‑Cluster Kubernetes and How to Implement It

This article explains why modern enterprises adopt multiple Kubernetes clusters, covering single‑cluster capacity limits, hybrid‑cloud requirements, fault‑tolerance concerns, the benefits of multi‑cluster setups, architectural models, and community‑driven implementation patterns.

Cloud Native · Federation · Multi-Cluster
9 min read
Programmer DD
Jun 13, 2021 · Operations

How to Build a High‑Availability Prometheus Setup Using Federation and Multi‑Remote‑Read

This article examines common misuse of Prometheus federation, explains its limitations, and presents a pure‑Prometheus solution using multi_remote_read to achieve high‑availability monitoring, including configuration examples, code analysis, and best‑practice recommendations for proper data aggregation and query merging.

Federation · Prometheus · multi_remote_read
11 min read
Programmer DD
Apr 13, 2021 · Big Data

What Makes HDFS the Backbone of Big Data? Overview, Architecture & Key Features

This article provides a comprehensive overview of HDFS—including its design goals, core components, data read/write workflows, high‑availability mechanisms, federation, storage policies, colocation benefits, and practical usage scenarios—explaining why it is the foundational distributed file system for large‑scale data processing.

Big Data · Federation · HDFS
17 min read
Big Data Technology Architecture
Mar 11, 2021 · Big Data

Challenges and Optimizations of Hive MetaStore at Kuaishou

This article details how Kuaishou tackled performance, scalability, and stability challenges of Hive MetaStore by introducing a BeaconServer hook architecture, read‑write separation, API refinements, traffic control, and federation designs, resulting in significant query efficiency and service reliability improvements.

Federation · Hive · Read-Write Separation
14 min read
Cloud Native Technology Community
Mar 30, 2020 · Cloud Native

Building a Cloud‑Native Large‑Scale Distributed Monitoring System with Prometheus

This article explains how to design and implement a cloud‑native, large‑scale distributed monitoring system using Prometheus, covering its limitations, service‑level sharding, centralized storage, federation, and high‑availability strategies to overcome scaling challenges in Kubernetes environments.

Cloud Native · Federation · Prometheus
12 min read
DataFunTalk
Jan 2, 2020 · Big Data

ByteDance’s HDFS Architecture and Evolution: Design, Challenges, and Optimizations

This article presents an in‑depth overview of ByteDance’s large‑scale HDFS deployment, describing its unique access layer, metadata and data layers, the evolution through multiple growth stages, and the key architectural improvements such as NNProxy, DanceNN, lock redesign, startup acceleration, and slow‑node mitigation techniques.

Big Data · ByteDance · Federation
18 min read
Alibaba Cloud Native
Aug 6, 2019 · Cloud Native

Why Multi-Cluster Architecture Is the Future of Cloud‑Native Applications

This article explains the rise of multi‑cluster designs, outlines three common scenarios—cloud burst, disaster recovery, and active‑active—examines the complexities of application delivery across clusters, and details how Kubernetes and Alibaba Cloud’s ACK implement unified APIs, tunnel mechanisms, and high‑availability to enable true multi‑cloud operations.

ACK · Cluster Tunnel · Federation
19 min read
Beike Product & Technology
Jun 28, 2019 · Big Data

Hadoop NameNode Performance Bottlenecks and Solutions: Federation, ViewFS, FastCopy, Balance & Mover

This article analyzes the performance and stability bottlenecks of a Hadoop 2.7.3 NameNode caused by memory limits, RPC QPS, and long restart times, and presents a comprehensive solution stack—including HDFS federation, ViewFS, FastCopy, and tuned Balance/Mover tools—to improve scalability and reduce downtime.

Balance · FastCopy · Federation
11 min read
Qunar Tech Salon
May 16, 2019 · Big Data

Optimizing HDFS Federation Data Migration with FastCopy and qFastCopy at Qunar

This article describes the challenges of scaling Qunar's Hadoop NameNode, introduces HDFS Federation and the FastCopy tool, presents performance tests comparing FastCopy with DistCp, and details the development and evaluation of an optimized qFastCopy solution that shortens multi‑petabyte migrations, which previously took hours.

Big Data · Data Migration · FastCopy
8 min read
dbaplus Community
May 13, 2019 · Big Data

Tackling HDFS Performance Bottlenecks: Real‑World Optimizations from VIP.com

This article examines the performance challenges encountered after upgrading a large‑scale HDFS cluster at VIP.com, explains the root causes of NameNode RPC latency, and presents concrete solutions—including delayed block reports, configurable block deletion, federation redesign, client monitoring, temp‑directory sharding, and small‑file handling—along with configuration snippets and real‑world results.

Big Data · Federation · HDFS
13 min read
Meituan Technology Team
Apr 14, 2017 · Big Data

Practical Experience of HDFS Federation at Meituan: Challenges, Improvements, and Automation

Meituan‑Dianping migrated its 2,000‑node HDFS cluster to Federation by fixing ViewFs compatibility, simplifying mount points, leveraging FastCopy for massive data moves, improving token handling, and automating split‑workflow steps, thereby overcoming single‑NameNode bottlenecks and providing a practical blueprint for large‑scale Hadoop deployments.

Big Data · FastCopy · Federation
22 min read

Design and Implementation of Alibaba Cloud's Cross‑Data‑Center Hadoop Cluster

In 2013 Alibaba Cloud faced full rack capacity in a single IDC, prompting the development of a multi‑NameNode, cross‑data‑center Hadoop solution that overcomes NameNode scalability, inter‑site bandwidth limits, data placement, job scheduling, massive data migration, and user transparency challenges.

Cross‑Data‑Center · Federation · Hadoop
14 min read