How AState Reduces Trillion‑Parameter RL Weight Sync to 6 Seconds

AState is a general‑purpose state data management system for reinforcement‑learning workloads. It tackles low I/O efficiency, slow weight synchronization, and state‑recovery challenges, achieving sub‑10‑second weight sync for trillion‑parameter models through a three‑layer architecture, zero‑redundancy transfers, and hardware‑aware co‑design. The code is openly available on GitHub.


Introduction

AState is a universal state data management system for reinforcement‑learning (RL) workloads, addressing low I/O efficiency, slow weight synchronization, and difficulty in state recovery at large scale.

RL State Data Overview

RL workloads involve training states (model weights, optimizer states, activations), inference states (weights, KV cache), RL‑specific states (weight sync data), and agent‑related states (multi‑turn conversation context, external tool call status).

Challenges in Large‑Scale RL

I/O inefficiency due to heavy checkpoint read/write and activation recomputation.

Weight synchronization for trillion‑parameter models can require tens of terabytes per iteration, traditionally taking minutes.

Lack of efficient state caching for multi‑turn dialogs and external tool interactions.

Design Goals of AState

AState provides a unified API for RL tasks, enabling high‑performance weight exchange, reliable state caching, and seamless scaling across deployment and pipeline modes without modifying existing RL frameworks.

System Architecture

The system consists of three layers:

API Layer: Tensor‑native one‑sided read/write semantics for easy integration with training and inference frameworks.

Service Layer: Multiple weight‑sync protocols (pull‑based for co‑located training/inference, fully asynchronous for off‑policy setups), zero‑redundancy tensor sharding, and RL‑topology‑aware execution plans (a protocol‑selection sketch follows this list).

Transport Layer: NUMA‑aware topology, support for PCIe, NVLink, RoCE/InfiniBand, and a lightweight RDMA component for high‑throughput data movement.
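To make the service layer's role concrete, here is a minimal sketch of how a sync protocol might be chosen from the deployment topology. The names (SyncProtocol, Deployment, select_protocol) are illustrative assumptions, not part of the AState API.

from dataclasses import dataclass
from enum import Enum, auto

class SyncProtocol(Enum):
    PULL_BASED = auto()   # inference ranks pull shards from co-located trainers
    FULLY_ASYNC = auto()  # training pushes in the background for off-policy setups

@dataclass
class Deployment:
    colocated: bool   # training and inference share the same nodes/GPUs
    off_policy: bool  # rollout may lag the latest policy version

def select_protocol(d: Deployment) -> SyncProtocol:
    # Co-located setups can use one-sided pulls over NVLink/PCIe;
    # decoupled or off-policy setups overlap sync with rollout asynchronously.
    if d.colocated and not d.off_policy:
        return SyncProtocol.PULL_BASED
    return SyncProtocol.FULLY_ASYNC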

Weight Synchronization Background

In RL, weight synchronization transfers updated policy parameters from training nodes to rollout (inference) nodes. For trillion‑parameter models, the data volume can reach ~100 TB, making minute‑level sync a severe bottleneck.
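A back‑of‑the‑envelope calculation shows why (the replica count and bandwidth here are assumptions for illustration): a 1‑trillion‑parameter model in bf16 occupies roughly 10^12 × 2 bytes = 2 TB per full copy, so fanning fresh weights out to on the order of 50 inference replicas moves ~100 TB per iteration. Even at a sustained aggregate of 100 GB/s, that is over 15 minutes of pure transfer time.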

Industry Solutions

Early frameworks relied on distributed file systems, resulting in tens of minutes of sync time.

Recent open‑source solutions (e.g., VeRL, Kimi checkpoint engine) use NCCL‑based layer‑by‑layer or bucket‑by‑bucket sync, still limited to minute‑level latency and suffering from data redundancy.
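For contrast, here is a minimal sketch of that bucket‑by‑bucket NCCL pattern, assuming a simple blocking broadcast from the training rank; it is not the actual VeRL or checkpoint‑engine code, which adds overlap and other optimizations. Note that every rank receives every bucket, which is the data redundancy AState eliminates.

import torch
import torch.distributed as dist

def _flush(bucket, src_rank):
    # Flatten the bucket, broadcast it, and scatter the results back in place.
    flat = torch.cat([t.reshape(-1) for t in bucket])
    dist.broadcast(flat, src=src_rank)  # blocking collective; every rank gets every byte
    offset = 0
    for t in bucket:
        t.copy_(flat[offset:offset + t.numel()].view_as(t))
        offset += t.numel()

def broadcast_weights_bucketed(params, src_rank=0, bucket_bytes=512 << 20):
    bucket, used = [], 0
    for p in params:
        bucket.append(p)
        used += p.numel() * p.element_size()
        if used >= bucket_bytes:
            _flush(bucket, src_rank)
            bucket, used = [], 0
    if bucket:
        _flush(bucket, src_rank)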

AState Weight‑Sync Solution

AState introduces:

Zero‑redundancy transmission by resharding only the tensor fragments needed on the inference side (see the sketch after this list).

DMA‑based zero‑copy transfers to minimize host‑device copying.

In‑place weight updates on the inference side, avoiding extra memory allocation.

Hardware‑aware scheduling that balances load across NUMA domains and network links.
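A minimal sketch of the resharding idea in the first bullet, assuming both sides shard a weight tensor along flattened row ranges; the helper names are hypothetical, not AState's internals.

from typing import Optional, Tuple

# A shard is a half-open row interval [start, stop) of the full weight tensor.
Shard = Tuple[int, int]

def overlap(train_shard: Shard, infer_shard: Shard) -> Optional[Shard]:
    """Rows that the inference shard actually needs from this training shard.

    Transferring only this intersection, instead of the whole training
    shard, is what makes the transmission zero-redundancy.
    """
    start = max(train_shard[0], infer_shard[0])
    stop = min(train_shard[1], infer_shard[1])
    return (start, stop) if start < stop else None

# Example: a trainer holds rows [0, 4096) of a weight matrix; an inference
# rank with a different tensor-parallel degree needs rows [3072, 6144).
# Only rows [3072, 4096) cross the wire from this trainer.
assert overlap((0, 4096), (3072, 6144)) == (3072, 4096)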

Performance Optimization Timeline

4 min → 40 s: Optimized inference‑side read ordering with shuffle‑based load balancing (see the sketch after this list), zero‑redundancy row‑parallel transfers, and aggregation of small tensors.

40 s → 10 s: Implemented zero‑redundancy column‑parallel transfers via pre‑aggregation of non‑contiguous shards and overlapped offload and pre‑read pipelines.

10 s → 6 s: Integrated NUMA topology awareness and global execution‑plan optimization to eliminate hotspots.
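A minimal sketch of the shuffle‑based read ordering from the first step above: each inference reader permutes the list of source shards with its own seed, so concurrent one‑sided reads do not all hit the same training rank first. The function name and parameters are illustrative.

import random
from typing import List

def read_order(num_shards: int, reader_id: int) -> List[int]:
    """Per-reader permutation of source shards.

    Without shuffling, every reader starts at shard 0 and the rank
    holding it becomes a hotspot; a reader-seeded shuffle spreads the
    concurrent reads roughly evenly across source ranks.
    """
    order = list(range(num_shards))
    random.Random(reader_id).shuffle(order)  # deterministic per reader
    return order

# Example: 4 readers pulling 8 shards each start at a different shard.
for reader in range(4):
    print(reader, read_order(8, reader))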

Key Differentiators

No hard dependency on NCCL; works across co‑located and decoupled training/inference deployments.

One‑sided API enables high‑concurrency access without blocking peer processes.

Zero‑redundancy, in‑place updates adapt to row, column, and tensor‑parallel schemes.

Lightweight RDMA stack combined with NVLink/NCCL for optimal bandwidth utilization.

Code Example

from typing import List, Optional, Tuple

import torch

# ShardedKey, TensorTableType, and ParallelConfig are provided by the
# AState package; they appear here only as type references.

class TensorTable:
    """High-level Python API for managing TensorTable instances.

    Attributes:
        name (str): The name of the managed table
        table_type (Union[str, TensorTableType]): The type of table implementation
        parallel_config (ParallelConfig): Parallel configuration for distributed operations

    Example:
        >>> from astra.parallel_config import ParallelConfig
        >>> config = ParallelConfig.create_training_config(world_size=4, global_rank=0, role_rank=0)
        >>> table = TensorTable("my_table", parallel_config=config)
        >>> tensor = torch.randn(3, 4)
        >>> success = table.put(1, "key1", tensor)
        >>> retrieved = table.get(1, "key1", torch.empty(3, 4))
    """

    def put(self, seq_id: int, key: ShardedKey, tensor: torch.Tensor) -> bool:
        ...

    def get(self, seq_id: int, key: ShardedKey, tensor: torch.Tensor) -> Optional[torch.Tensor]:
        # One-sided read: the result is written into the caller-provided tensor.
        ...

    def multi_put(self, seq_id: int, tensor_pairs: List[Tuple[ShardedKey, torch.Tensor]]) -> bool:
        ...

    def multi_get(self, seq_id: int, tensor_pairs: List[Tuple[ShardedKey, torch.Tensor]]) -> List[Tuple[ShardedKey, torch.Tensor]]:
        ...

    def complete(self, seq_id: int) -> None:
        ...
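A hypothetical end‑to‑end usage sketch built on the API above. Treating the weight version as the seq_id, the key naming, and the shard layout are all assumptions for illustration, not AState's documented protocol: the trainer publishes its local shards for a version, then inference ranks pull the fragments they need into preallocated buffers.

# Assumed setup: `train_table` and `infer_table` are TensorTable instances
# created with training- and inference-side ParallelConfig respectively,
# and each ShardedKey identifies a (tensor name, shard range) pair.

def publish_weights(train_table, version, named_shards):
    # Trainer side: write every local shard for this weight version.
    ok = train_table.multi_put(version, list(named_shards))
    train_table.complete(version)  # mark the version as fully written
    return ok

def fetch_weights(infer_table, version, needed_keys, buffers):
    # Inference side: one-sided reads into preallocated buffers, so the
    # update happens in place and no extra GPU memory is allocated.
    pairs = list(zip(needed_keys, buffers))
    return infer_table.multi_get(version, pairs)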

Benchmarks

Tests on Ling‑flash‑2.0 (100 B) and Ling‑1T (1 T) models show that AState reduces end‑to‑end weight sync from minutes to approximately 6 seconds on a thousand‑GPU cluster.

Open‑Source Repository

The full AState implementation is available at https://github.com/inclusionAI/asystem-astate.
