Tagged articles
43 articles
JD Tech
Apr 23, 2026 · Backend Development

How JD Upgraded Its B‑Side Order Storage Architecture to Tackle Elasticsearch High‑Concurrency Pressure

Facing explosive merchant growth and soaring order volumes, JD redesigned its B‑side POP order storage by isolating large tenants, applying double‑hash routing, expanding clusters, buffering updates, and automating data archiving, ultimately delivering a high‑performance, scalable Elasticsearch platform that sustains massive traffic spikes.

Backend Architecture · Data Skew · Elasticsearch
0 likes · 16 min read
Raymond Ops
Mar 6, 2026 · Cloud Native

Scaling Kubernetes from 1k to 5k Nodes: Complete Performance Tuning Playbook

This article presents a comprehensive, real‑world guide for expanding a Kubernetes cluster from 1,000 to 5,000 nodes, covering control‑plane HA, etcd optimization, network and scheduler tuning, monitoring, and automation, with detailed configurations, code snippets, and a step‑by‑step case study of a large‑scale production environment.

CNI · Control Plane · cluster scaling
0 likes · 22 min read
Machine Learning Algorithms & Natural Language Processing
Feb 28, 2026 · Artificial Intelligence

How DualPath Revives Idle Network Cards to Break Long‑Context I/O Bottlenecks in DeepSeek V4

The article analyzes the KV‑Cache storage I/O bottleneck that limits agentic LLM inference, introduces the DualPath architecture with a storage‑to‑decode data path and RDMA‑based scheduling, and shows up to 1.87× offline and 1.96× online throughput gains on large‑scale GPU clusters.

DeepSeek · DualPath · KV cache
0 likes · 13 min read
Architect's Guide
Jan 22, 2026 · Big Data

Unlock Kafka’s Power: Core Concepts, High‑Performance Architecture & Real‑World Scaling Tips

This comprehensive guide explores Kafka’s core value as a message queue, explains producers, consumers, topics, partitions, and replication, dives into cluster architecture, zero‑copy I/O, resource planning for disks, memory, CPU and network, and provides practical configuration, consumer‑group management, and operational tooling tips for building high‑throughput, highly available Kafka deployments.

Distributed Systems · Kafka · Message Queue
0 likes · 31 min read
MaGe Linux Operations
Nov 18, 2025 · Big Data

Zero‑Data‑Loss Kafka Cluster Scaling: Complete Step‑by‑Step Guide

This comprehensive guide explains how to safely expand a Kafka cluster without data loss by covering applicable scenarios, pre‑conditions, anti‑pattern warnings, environment matrices, a detailed checklist, step‑by‑step Linux commands for broker preparation, partition‑rebalancing plan generation, throttled execution, real‑time monitoring, verification, rollback procedures, backup strategies, performance testing, common troubleshooting, FAQs and best‑practice scripts, all illustrated with code snippets and practical examples.

Kafka · Partition Rebalancing · Shell Scripts
0 likes · 47 min read
Ops Community
Nov 6, 2025 · Big Data

Zero Data Loss Kafka Cluster Scaling: From 3 to 10 Nodes – A Complete Guide

This comprehensive guide walks you through expanding or shrinking a production‑grade Kafka cluster—covering prerequisites, anti‑pattern warnings, environment matrices, step‑by‑step expansion and contraction procedures, partition rebalancing principles, monitoring, best practices, and troubleshooting—to ensure zero data loss during scaling.

Big Data · Kafka · Partition Rebalancing
0 likes · 27 min read
360 Zhihui Cloud Developer
Jun 30, 2025 · Fundamentals

Can Distributed File Systems Outperform Local NVMe? A Deep Performance Evaluation

This article explains what a Distributed File System (DFS) is, outlines key evaluation criteria such as reliability, availability, performance, and scalability, compares HDD and SSD performance, investigates whether a DFS can surpass local NVMe in large‑IO workloads, and discusses user‑side, cluster‑level, and cache‑level performance assessment methods.

Distributed File System · NVMe · Performance Evaluation
0 likes · 14 min read
IT Architects Alliance
Mar 16, 2025 · Cloud Native

Why Does Scaling a Kubernetes Cluster Slow Down? Uncover the Hidden Bottlenecks

When a Kubernetes cluster grows, many teams expect better performance, yet scaling often slows things down because of hardware limits, network congestion, data‑sync overhead, load‑balancing misconfigurations, and component bottlenecks. This article explains each cause and offers concrete optimization strategies.

Cloud Native · Kubernetes · cluster scaling
0 likes · 27 min read
dbaplus Community
Mar 4, 2025 · Databases

Why Does Redis Prefer Hash Slots Over Consistent Hashing?

Redis Cluster distributes data across 16,384 hash slots, with each key's slot calculated via CRC16. Compared with traditional consistent hashing, this design offers flexible slot allocation, simpler data migration, and better performance. This article explains the slot mechanism, node scaling, client routing, and the reasons behind the 16K slot choice.

CRC16 · Hash Slots · cluster scaling
0 likes · 9 min read
dbaplus Community
Aug 3, 2023 · Databases

Scaling eBay’s Sherlock.io ClickHouse Platform with Read/Write Separation and Keeper

The article details how eBay’s Sherlock.io event monitoring platform, built on ClickHouse, faced scaling and performance challenges due to ZooKeeper bottlenecks, and explains the design and implementation of read/write separation, shard‑level Keeper coordination, and related operational fixes to improve reliability and latency.

ClickHouse · Keeper · Read-Write Separation
0 likes · 19 min read
Top Architect
May 15, 2023 · Backend Development

Comprehensive Guide to Kafka: Architecture, Performance Tuning, and Operational Practices

This article provides an in-depth overview of Kafka, covering its core value as a message queue, fundamental concepts, cluster architecture, producer and consumer configurations, scaling strategies, monitoring tools, and practical operational commands for building and maintaining high‑throughput, highly available streaming systems.

Backend · Kafka · Message Queue
0 likes · 31 min read
21CTO
Apr 25, 2023 · Databases

How Baidu’s PegaDB Redefines Redis with Low‑Cost, High‑Capacity KV Storage

This article summarizes Liu Donghui’s presentation at DTCC2022, detailing Baidu Intelligent Cloud’s Redis‑compatible, high‑capacity, low‑cost PegaDB, covering its design goals, architecture, KV storage engine choices, cluster scaling, replication enhancements, performance optimizations, multi‑region active‑active support, and future roadmap.

KV storage · PegaDB · Performance Optimization
0 likes · 17 min read
Aikesheng Open Source Community
Mar 1, 2023 · Operations

Guide to Expanding an OceanBase Cluster: Adding Zones and Resources

This article provides a step‑by‑step guide for scaling an OceanBase cluster, covering both white‑screen (GUI) and black‑screen (command‑line) methods to add zones (replicas) and resources (OBServers), including configuration file preparation, deployment commands, zone addition, verification queries, and procedures for both expansion and contraction.

Database operations · Observer · OceanBase
0 likes · 12 min read
Full-Stack DevOps & Kubernetes
Dec 7, 2022 · Cloud Native

How to Scale Kubernetes to 5,000 Nodes: Master, API Server, and Component Tuning

This guide explains how to push a Kubernetes cluster toward its theoretical limit of 5,000 nodes by detailing official limits, master node sizing for GCE and AWS, kube‑apiserver high‑availability and connection‑count tuning, scheduler and controller‑manager leader election settings, kubelet optimizations, and DNS anti‑affinity configuration.

Cloud Native · Kubernetes · Operations
0 likes · 6 min read
58 Tech
Nov 17, 2022 · Backend Development

Design and Migration Strategies for the WLock Distributed Lock Service

The article presents the architecture of WLock, a Paxos‑based distributed lock service, analyzes key isolation schemes, evaluates cluster expansion and splitting, and details a multi‑step key migration process, including forward and reverse migration, node scaling, and consistency safeguards, to achieve high availability and isolated lock handling in multi‑tenant environments.

Consistency · Key Migration · Paxos
0 likes · 18 min read
MaGe Linux Operations
Aug 28, 2022 · Cloud Native

Master MinIO: From Client Commands to Scalable Distributed Clusters

This guide walks through MinIO client (mc) usage, bucket management, user and policy administration, and two practical methods for expanding a MinIO distributed cluster—peer‑to‑peer scaling and federation with etcd—providing step‑by‑step commands, scripts, and configuration details for cloud‑native object storage.

Distributed Systems · Minio · cluster scaling
0 likes · 31 min read
Big Data Technology & Architecture
Jun 20, 2022 · Databases

Apache Doris Installation, Cluster Deployment, Operations Manual, and Integration with Spark & Flink

This guide provides step‑by‑step instructions for downloading Apache Doris, configuring and deploying FE, BE, and Broker nodes, performing scaling operations, managing users and tables, importing and exporting data, and integrating Doris with Spark and Flink using code examples.

Apache Doris · Database Deployment · Flink Integration
0 likes · 17 min read
Bilibili Tech
Apr 9, 2022 · Big Data

Bilibili Presto on Hadoop: Architecture, Scaling, and Performance Enhancements

Bilibili’s Presto on Hadoop combines a multi‑engine offline platform with Kubernetes‑managed YARN scheduling, Ranger security, and a custom dispatcher, scaling to over 400 nodes handling 160k daily queries on 10 PB. The team added coordinator HA, resource‑group punishment, query limits, Alluxio caching, dynamic filtering, and numerous SQL‑level enhancements, with auto‑scaling and materialized‑view automation planned for the future.

Big Data · Hadoop · Presto
0 likes · 30 min read
vivo Internet Technology
Feb 9, 2022 · Databases

Redis Optimization for Vivo Push Platform: Architecture, Bottlenecks, and Solutions

To sustain Vivo Push Platform’s massive real‑time traffic, engineers re‑architected two Redis clusters, trimmed capacity by 58 %, split clusters, randomized hotspot‑prone keys, and introduced three‑level caching, cutting peak CPU load by 15 %, halving response time and improving overall Redis efficiency during peak loads.

Hot Key Mitigation · Performance Optimization · cluster scaling
0 likes · 15 min read
Tencent Cloud Middleware
Dec 16, 2021 · Operations

Inside ZooKeeper: Source Code Walkthrough, Thread Model, and Real‑World Ops Tips

This article provides a comprehensive overview of Apache ZooKeeper, covering its purpose, client‑server thread architecture, key source‑code snippets, watch mechanism, performance characteristics of large‑scale clusters, and practical operational strategies for disaster recovery, observer load, GC pauses, and configuration tuning.

Client-Server Architecture · Distributed Coordination · ZooKeeper
0 likes · 20 min read
dbaplus Community
Dec 15, 2021 · Big Data

How We Migrated Hundreds of Petabytes of Hadoop Data Without Downtime

This article details the background, challenges, and step‑by‑step solutions for migrating over a hundred petabytes of Hadoop HDFS data across data centers within a month, covering strategy selection, code modifications, balance optimization, and tool enhancements.

Balance Optimization · Big Data Operations · Data Migration
0 likes · 14 min read
政采云技术
Nov 11, 2021 · Cloud Native

Cluster Scaling, Backup, and Upgrade Using Sealer Clusterfile

This article explains how to scale, back up, and upgrade Kubernetes clusters with Sealer by modifying the Clusterfile, using join/delete commands for both ALI_CLOUD and BAREMETAL providers, and configuring backup plugins and upgrade workflows.

Backup · Cloud Native · Kubernetes
0 likes · 7 min read
21CTO
Oct 14, 2021 · Big Data

How LinkedIn Scaled Hadoop to 11,000 Nodes and Solved YARN Delays

LinkedIn’s engineers detail how they repeatedly doubled their Hadoop cluster to over 11,000 nodes, tackled YARN scheduling delays caused by workload imbalances, and created the DynoYARN simulation tool to predict performance impacts of massive scaling.

Big Data · DynoYARN · Hadoop
0 likes · 7 min read
Efficient Ops
Aug 11, 2021 · Operations

Scaling Kubernetes Clusters: Node Quotas, Kernel Tweaks & Etcd Tips

This guide outlines how to prepare large‑scale Kubernetes clusters on public clouds by increasing node quotas, adjusting kernel parameters, configuring high‑availability etcd with the etcd‑operator, tuning kube‑apiserver settings, and applying pod‑level best practices for resource limits and affinity.

Kernel Tuning · Operations · cluster scaling
0 likes · 8 min read
Programmer DD
Sep 13, 2020 · Backend Development

How JD.com Scaled Its Order System with Elasticsearch: Architecture Evolution

This article details how JD.com's order center migrated from MySQL‑only reads to a high‑throughput Elasticsearch cluster, describing each architectural phase—from the initial bare‑metal setup, through isolation, replica tuning, primary‑secondary adjustments, to the current real‑time dual‑cluster—while sharing synchronization strategies and performance pitfalls.

Elasticsearch · cluster scaling · data synchronization
0 likes · 12 min read
Full-Stack DevOps & Kubernetes
Sep 7, 2020 · Cloud Native

How to Scale a Kubernetes Cluster: Node Quotas, Kernel Tweaks, and Component Settings

This guide explains how to prepare a large‑scale Kubernetes cluster by increasing cloud resource quotas, adjusting kernel parameters, configuring master node sizes, optimizing etcd storage, tuning Docker and Kubelet image pull settings, and applying best‑practice pod and scheduler configurations for thousands of nodes.

Image Pull · Kernel Parameters · Kubernetes
0 likes · 11 min read
Efficient Ops
Aug 24, 2020 · Operations

How to Scale Elasticsearch for PB‑Level Game Logs: Real‑World Strategies & Lessons

This article walks through a mid‑size gaming company's journey of deploying, tuning, and scaling an Elasticsearch cluster for massive log volumes, covering hot‑cold node architecture, ILM policies, shard management, Logstash‑Kafka optimization, emergency expansions, and the promise of searchable snapshots to achieve petabyte‑scale storage with cost efficiency.

Big Data · Elasticsearch · ILM
0 likes · 28 min read
Tencent Cloud Developer
Jul 29, 2020 · Big Data

Case Study: Optimizing Tencent Cloud Elasticsearch for High‑Volume Game Log Analytics

To handle a gaming company's million‑QPS log stream, the team built a hot‑cold Tencent Cloud Elasticsearch cluster with ILM‑driven tiering, scaled CPU/heap, reduced shard count via shrink and replica tweaks, tuned Logstash‑Kafka pipelines, and employed COS snapshots and searchable snapshots, achieving stable performance and lower cost.

Big Data · Elasticsearch · ILM
0 likes · 29 min read
Big Data Technology Architecture
Jun 4, 2020 · Big Data

58.com Big Data Offline Computing Platform: Architecture, Scaling, Optimization, and Cross‑Data‑Center Migration

This article presents a comprehensive case study of 58.com’s massive Hadoop‑based offline computing platform, detailing its architecture, scaling challenges, performance‑tuning measures, YARN and SparkSQL upgrades, and the systematic cross‑data‑center migration of thousands of nodes and petabytes of data.

Big Data · Data Migration · Hadoop
0 likes · 23 min read