
How Alibaba’s Table Store Auto‑Solves Hotspot Issues with Real‑Time Load Balancing

This article explains the architecture and mechanisms of Alibaba Cloud's Table Store load‑balancing system, detailing how it collects metrics, detects user‑access and machine hotspots, and automatically applies actions such as partition moves, splits, merges, and isolation to maintain high availability and performance.

dbaplus Community

Background

Table Store (formerly OTS) is Alibaba's self‑developed multi‑tenant NoSQL distributed database. The presentation focuses on how its load‑balancing system addresses hotspot problems.

Table Store Architecture

The system is built on the Feitian kernel, providing distributed shared storage, lock services, and communication components. Above the kernel lies the engine composed of a few masters and many workers; masters manage worker states and schedule partitions. The front‑end provides unified HTTP services that forward requests to workers.

Load‑Balancing Context

Data is sharded by a partition key into partitions, the basic scheduling unit. Operations on partitions include move, split, merge, and group, all completed within seconds.
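Because each partition owns a contiguous range of the partition key, split and merge reduce to arithmetic on key ranges. The sketch below illustrates this under that assumption; the class and function names are illustrative, not Table Store's actual API.

```python
from dataclasses import dataclass

@dataclass
class Partition:
    begin: str  # inclusive lower bound of the partition-key range
    end: str    # exclusive upper bound

def split(p: Partition, split_key: str) -> tuple[Partition, Partition]:
    """Split one partition into two at split_key."""
    assert p.begin < split_key < p.end, "split key must fall inside the range"
    return Partition(p.begin, split_key), Partition(split_key, p.end)

def merge(left: Partition, right: Partition) -> Partition:
    """Merge two adjacent partitions back into one."""
    assert left.end == right.begin, "partitions must be adjacent"
    return Partition(left.begin, right.end)
```

Move and group, by contrast, change only which worker serves a range, not the range itself, which is why all four operations can finish in seconds: no data rewrite is required, only metadata and routing updates.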

Hotspot Issues

Hotspots in a multi‑tenant NoSQL system fall into two categories:

User‑access hotspots: reasonable burst traffic (e.g., promotions) or unreasonable bursts caused by poor table design.

Machine hotspots: sudden spikes in CPU or network usage that make a machine a bottleneck.

These problems are hard to locate and resolve due to incomplete statistics, limited mitigation mechanisms, and slow manual handling.

Load‑Balancing System Overview

The system consists of LBAgent (deployed with each worker) and LBMaster. LBAgent collects all local metrics, keeps recent data in memory, and asynchronously persists it to an external MetricStore. LBMaster aggregates data, provides an HTTP command layer (OpsServer), and hosts two analysis modules:

OnlineAnalyzer: real‑time analysis of in‑memory data to detect and act on hotspots.

OfflineAnalyzer: batch analysis of long‑term data from MetricStore to discover potential issues.

Both modules generate actions that are sent to workers or masters via the Executor. Results and actions are stored in ResultDataStore for further inspection.

Information Collection Module

The module aims for comprehensive metrics while minimally impacting the main request path. Every module records its own statistics, which are streamed through RPC layers and asynchronously pushed to a backend aggregation component, then pulled by LBAgent.

LBAgent Functions

Collect all local metrics.

Pre‑aggregate the data.

Persist metrics asynchronously to MetricStore.
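The pre‑aggregation step can be sketched as rolling raw per‑request samples into fixed time windows per partition, so that LBMaster and MetricStore receive compact aggregates rather than raw events. The window size and field names below are assumptions for illustration.

```python
from collections import defaultdict

def pre_aggregate(samples, window_sec=10):
    """Roll raw samples into per-window, per-partition aggregates.

    samples: iterable of (timestamp_sec, partition_id, latency_ms).
    Returns {(window_start, partition_id): {"qps": ..., "avg_latency_ms": ...}}.
    """
    buckets = defaultdict(lambda: [0, 0.0])  # key -> [request_count, latency_sum]
    for ts, pid, latency in samples:
        key = (ts - ts % window_sec, pid)  # align timestamp to window start
        buckets[key][0] += 1
        buckets[key][1] += latency
    return {
        key: {"qps": count / window_sec, "avg_latency_ms": total / count}
        for key, (count, total) in buckets.items()
    }
```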

LBMaster Core Functions

Gather cluster‑wide information.

Provide multi‑dimensional top‑N queries (error rate, latency, QPS, etc.).

Analyze data, generate actions, and execute them.

Self‑evaluate action effectiveness.
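A multi‑dimensional top‑N query over the aggregated metrics might look like the following sketch, which ranks partitions on any collected dimension (QPS, latency, error rate). This is an illustrative implementation, not LBMaster's actual code.

```python
import heapq

def top_n(partition_metrics, metric, n=5):
    """Return the n partitions with the highest value for the given metric.

    partition_metrics: {partition_id: {metric_name: value, ...}}
    """
    return heapq.nlargest(n, partition_metrics.items(), key=lambda kv: kv[1][metric])
```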

LBMaster also offers a graphical (web) control panel for real‑time queries and manual operation commands, and supports dynamic, hot‑reloadable configuration.

Online Analysis Path

Online analysis processes short‑term data entirely in memory, ensuring sub‑second detection and remediation without relying on external systems. Example actions:

When a worker's queue is full, the system detects a read‑write bottleneck on a partition and issues a split action, creating two new partitions and redistributing the load.

If a partition saturates a machine's resources, the system isolates it on a dedicated machine via a group action.
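The two actions above can be framed as simple threshold rules evaluated over in‑memory metrics. The sketch below is a hypothetical reconstruction of that rule logic; the specific thresholds and field names are assumptions, not Table Store's documented values.

```python
def decide_action(partition, worker):
    """Map observed conditions to a mitigation action, or None.

    partition: {"id", "qps_share", "cpu_share"} - this partition's share of the worker's load.
    worker: {"queue_len", "queue_cap", "cpu_pct"} - the worker hosting it.
    """
    if worker["queue_len"] >= worker["queue_cap"] and partition["qps_share"] > 0.5:
        # One partition dominates a full request queue: split it to spread the load.
        return ("split", partition["id"])
    if worker["cpu_pct"] > 90 and partition["cpu_share"] > 0.8:
        # One partition saturates the machine: isolate it on a dedicated worker.
        return ("group", partition["id"])
    return None  # no hotspot detected
```

Because the rules run entirely against in‑memory data, detection and remediation stay on the sub‑second path described above.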

Offline Analysis Path

Offline analysis works on long‑term data stored in MetricStore, using external analytics to identify low‑utilization partitions for auto‑merge or to forecast future traffic peaks for proactive resource adjustment.
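One such offline pass could scan long‑term averages and propose merging adjacent low‑utilization partitions. The sketch below assumes partitions are supplied sorted by key range and uses an illustrative QPS threshold; it is not the actual OfflineAnalyzer.

```python
def merge_candidates(partitions, qps_threshold=10):
    """Find adjacent partition pairs that are both under-utilized.

    partitions: list of dicts sorted by key range, each with 'id' and 'avg_qps'
    (long-term average QPS pulled from the metric store).
    """
    pairs = []
    for left, right in zip(partitions, partitions[1:]):
        if left["avg_qps"] < qps_threshold and right["avg_qps"] < qps_threshold:
            pairs.append((left["id"], right["id"]))
    return pairs
```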

Effectiveness

Post‑deployment metrics show significant reductions in read/write error rates and latency, alongside increased throughput, especially for workloads that previously exhibited clear hotspots.

Key Takeaways

Comprehensive per‑module statistics are essential for accurate hotspot detection.

Automating previously manual mitigation steps resolves the majority of online issues without complex machine‑learning models.

Rich, configurable strategies allow rapid iteration and tailored responses to diverse workloads.

Q&A

Q1: What statistics are collected, and why worry about hotspots despite having worker queues?

A1: Some queues are shared across all partitions; a hotspot on one partition can monopolize resources, affecting all partitions on that machine.

Q2: Why prefer partition‑key based distribution over hash‑ring sharding?

A2: Hash sharding is hard to adjust dynamically; partition‑key based schemes allow easy operations like split to rebalance.

Q3: Does the solution rely on RabbitMQ, and were alternative approaches considered?

A3: If your system uses a queue service, its own load‑balancing capabilities may limit what can be done; the presented approach helps when the underlying system also suffers from hotspots.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: Monitoring, load balancing, NoSQL, Alibaba Cloud, hotspot mitigation
Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
