Dual Alignment Theory Redefines Cross-Domain Offline RL Transfer

The paper revisits cross-domain offline reinforcement learning, showing that source data must be aligned with the target domain in both dynamics and value for effective policy transfer, and introduces the DVDF framework, which filters source samples on both criteria jointly, achieving consistent performance gains across multiple robotic control benchmarks.

Machine Heart

Motivation: Offline reinforcement learning (offline RL) avoids costly and unsafe online interaction, but when the target domain lacks sufficient data, cross-domain offline RL seeks to leverage abundant source-domain data. Existing methods focus only on dynamics alignment, assuming that matching transition dynamics is enough for transfer.

Problem Identification: The authors point out two hidden issues: (1) dynamics alignment alone ignores the quality of source data, and (2) the prevailing theoretical frameworks do not match the true learning objective of maximizing target-domain performance. An illustrative Hopper experiment shows the consequence: current methods, which filter source samples purely by dynamics alignment, end up retaining low-quality samples whenever those happen to be the dynamics-aligned ones, so training proceeds on essentially random data and yields sub-optimal policies.

Theoretical Reconstruction: By deriving a sub-optimality-gap bound for the target-domain policy, the paper proves that both dynamics misalignment and value misalignment contribute to the gap. Consequently, efficient cross-domain offline RL must achieve both dynamics alignment and value alignment; that is, the source data must be transition-compatible and of high quality.
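
The exact statement of the bound is not reproduced in this summary; as a schematic only, assuming a divergence D between source and target dynamics and a target-domain advantage function (all constants and symbols here are illustrative), the result has the flavor of:

```latex
% Schematic sub-optimality-gap bound (illustrative notation, not the paper's exact statement).
% \pi^*: optimal target policy; \hat{\pi}: policy learned from filtered source data;
% J_T: expected return in the target domain.
J_T(\pi^*) - J_T(\hat{\pi})
  \le
  \underbrace{C_1\, \mathbb{E}_{(s,a)\sim\mathcal{D}_{\mathrm{src}}}
    \!\left[ D\!\left( P_{\mathrm{src}}(\cdot \mid s,a),\; P_{\mathrm{tgt}}(\cdot \mid s,a) \right) \right]}_{\text{dynamics misalignment}}
  +
  \underbrace{C_2\, \mathbb{E}_{(s,a)\sim\mathcal{D}_{\mathrm{src}}}
    \!\left[ -A_{\mathrm{tgt}}(s,a) \right]}_{\text{value misalignment}}
```

Filtering that drives both expectations down simultaneously is exactly the dual-alignment criterion the paper argues for.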

DVDF Framework: Building on this insight, the authors propose the Dynamics- and Value-aligned Data Filtering (DVDF) framework. DVDF defines a scoring function that combines a dynamics-alignment score (obtained via contrastive learning or optimal transport) with a value-alignment score (estimated by a pre-trained advantage function). The value-alignment score comes from a policy pre-trained on source data with Sparse Q-learning (SQL), which yields more accurate advantage estimates than IQL; IQL tends to over-estimate advantages when the data contains low-quality actions.
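
As a minimal sketch of how such a combined score could look in practice (the function name dvdf_score, the standardization step, and the coefficient eta are illustrative, not the authors' definitions), assuming per-sample dynamics scores and SQL-derived advantages are already computed:

```python
import numpy as np

def dvdf_score(dyn_score: np.ndarray,
               advantage: np.ndarray,
               eta: float = 1.0) -> np.ndarray:
    """Combine dynamics- and value-alignment scores per source sample.

    dyn_score:  per-sample dynamics-alignment score, e.g. from a
                contrastive classifier or an optimal-transport coupling
                (higher = better aligned with target dynamics).
    advantage:  per-sample advantage A(s, a) = Q(s, a) - V(s) from a
                policy pre-trained on source data with SQL.
    eta:        hypothetical coefficient trading off the two terms.
    """
    # Standardize each term so neither dominates purely by scale.
    dyn = (dyn_score - dyn_score.mean()) / (dyn_score.std() + 1e-8)
    val = (advantage - advantage.mean()) / (advantage.std() + 1e-8)
    return dyn + eta * val
```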

The scoring function includes a tunable hyper‑parameter to balance the two alignment terms and a percentile‑based indicator to select a desired proportion of source samples. The filtered samples are then used to train the final policy with a standard offline RL algorithm (e.g., IQL) on the target domain.
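A percentile-based selector over those scores might then look like the following sketch (filter_source_samples and keep_fraction are hypothetical names; the batch is assumed to be a dict of equal-length arrays keyed by field):

```python
import numpy as np

def filter_source_samples(source_batch: dict,
                          scores: np.ndarray,
                          keep_fraction: float = 0.25) -> dict:
    """Keep the top `keep_fraction` of source samples by combined score."""
    # Percentile threshold: samples scoring at or above it are retained.
    threshold = np.quantile(scores, 1.0 - keep_fraction)
    mask = scores >= threshold
    return {key: value[mask] for key, value in source_batch.items()}

# The retained samples would then be merged with the target dataset and
# handed to a standard offline RL learner such as IQL for final training.
```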

Experiments:

Dynamics-shift benchmarks: In four robot control tasks (HalfCheetah, Hopper, Walker2d, Ant) with kinematic and morphology shifts, DVDF consistently outperforms the baselines (IGDF, OTDF). For example, on kinematic shifts, DVDF-IGDF improves the summed normalized score from 1001.6 to 1164.7, a 16.3% gain.

Ablation study: Replacing SQL pre-training with IQL leads to higher advantage-estimation error and lower final performance, confirming the importance of accurate value alignment (the IQL value objective sketched after this list illustrates why the value-learning rule matters).

Parameter sensitivity: The alignment-balance coefficient and the data-selection percentile are examined. A default setting works well across most datasets, reducing the need for extensive hyper-parameter tuning.
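
For context on the SQL-versus-IQL comparison: IQL fits its value function by expectile regression (this is the standard IQL objective, reproduced here for background, not something introduced by this paper), and advantages are read off as A = Q - V:

```latex
% IQL value objective (standard IQL): expectile regression of Q_theta onto V_psi,
% with expectile tau in (0.5, 1); advantages are A(s,a) = Q_theta(s,a) - V_psi(s).
L_V(\psi) = \mathbb{E}_{(s,a)\sim\mathcal{D}}
  \Big[\, \big| \tau - \mathbb{1}\{\, Q_\theta(s,a) - V_\psi(s) < 0 \,\} \big|
        \cdot \big( Q_\theta(s,a) - V_\psi(s) \big)^{2} \,\Big]
```

Because the expectile is taken over the actions actually present in the dataset, a dataset dominated by low-quality actions skews the fitted value and hence every advantage estimate; SQL's sparsity-regularized value learning is the paper's remedy for this.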

Conclusion: By jointly aligning dynamics and value, DVDF bridges the gap between source and target domains, delivering superior policy transfer in offline RL settings. Theoretical analysis and extensive empirical results validate the necessity of dual alignment for effective cross-domain reinforcement learning.

Tags: policy optimization, offline reinforcement learning, cross-domain transfer, value alignment, DVDF, dynamics alignment, robotic control