GoLongRL Open‑Source: 23K Samples, 9 Task Types, and the End of the Long‑Context RL Desert
GoLongRL is a fully open‑source long‑context reinforcement‑learning pipeline. It provides a 23K‑sample RLVR dataset covering nine capability‑oriented tasks and TMN‑Reweight, an optimizer for heterogeneous multitask training, and it reports SOTA performance at the 4B and 30B scales, surpassing leading baselines.
Why existing long‑context RL methods fall short
Current mainstream methods such as LoongRL, LongRLVR, and QwenLong‑L1.5 share two problems. First, their training data focuses on “finding harder answers in longer texts”, so the tasks are highly homogeneous. Second, their reward design is compressed into a single exact‑match or accuracy metric, which provides little supervision for ranking, summarization, or exhaustive retrieval.
Data: Capability‑Oriented
Three design principles guide the dataset: capability‑orientation, reward‑task semantic alignment, and real‑document priority.
Capability‑oriented. Following LongBench Pro’s taxonomy, nine core task types are defined, covering key ability dimensions for long‑context understanding. Tasks T1‑T4 form the training backbone (>90% of samples) covering basic abilities; T6‑T9 are rarer (<4%) but retain their natural reward forms to ensure full coverage.
Reward‑Task Semantic Alignment. Different tasks require different evaluation metrics (ROUGE for summarization, NDCG for ranking, F1 for extraction). GoLongRL assigns the most appropriate metric as the reward function for each task, preserving semantic consistency between training feedback and task objectives.
Real‑Document Priority. Synthetic data can leak structural cues that models exploit. GoLongRL therefore draws its training sources primarily from real documents: books, academic papers, legal texts, and financial reports. For domains with scarce annotations, only the question‑answer pairs are synthesized on top of real documents, not the documents themselves.
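To make the reward‑task alignment principle concrete, here is a minimal sketch of a per‑task reward dispatcher. The task labels ("single_fact", "extraction", "ranking"), the function names, and the simplified metric definitions are illustrative assumptions; the source only states that each task is paired with its natural metric and does not show GoLongRL's actual reward code.

```python
import math
from collections import Counter

def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the stripped strings are identical, else 0.0."""
    return float(prediction.strip() == reference.strip())

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 over whitespace tokens (a simplified extraction metric)."""
    pred, ref = prediction.split(), reference.split()
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def ndcg(ranked_ids: list, relevance: dict) -> float:
    """NDCG of a predicted ranking; assumes ranked_ids contains no duplicates."""
    def dcg(gains):
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    gains = [relevance.get(item, 0.0) for item in ranked_ids]
    ideal = sorted(relevance.values(), reverse=True)[: len(ranked_ids)]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical task labels mapped to the metric that matches each task's goal.
REWARD_FNS = {
    "single_fact": exact_match,
    "extraction": token_f1,
    "ranking": ndcg,
}

def compute_reward(task: str, prediction, reference) -> float:
    """Route a rollout to the reward function registered for its task."""
    return REWARD_FNS[task](prediction, reference)
```

With a table like this, adding a new task type only requires registering another metric, and each rollout is scored by the same quantity the task is evaluated on.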
Dataset Construction
The 22,965 samples come from two complementary pools:
≈14K open‑source samples rewritten from existing long‑context corpora such as CLongEval, LongBench Pro, MultiTableQA, and CAIL2018, covering legal cases, financial reports, novels, and multi‑turn dialogues.
≈9K synthetic samples where QA pairs are generated from real source documents (Project Gutenberg books, arXiv CC0, etc.), while the documents themselves remain untouched.
Four‑Stage Construction Pipeline
All data are produced by a unified four‑stage pipeline:
P1 Source Collection: Gather annotated open‑source datasets and unannotated real documents for each of the nine tasks, aiming for diverse domains, structures, and lengths.
P2 Task Filtering & Assignment: Assign a unique task label to each sample based on semantics (e.g., single‑fact CLongEval samples → T1; multi‑law‑clause CAIL2018 samples → T3; dialogue memory sub‑tasks keep only dialogues >50 turns and >30K tokens).
P3 Sample Construction: Apply compatibility filtering and reward‑format standardization to open‑source data (e.g., convert numeric answers to a format that math_verify can parse). For synthetic data, bucket documents by length; use DeepSeek‑V3.2 to generate QA pairs for ordinary lengths and Gemini‑2.5‑Pro for ultra‑long documents. Then perform two‑stage quality filtering: Gemini‑2.5‑Pro checks that each answer is unique and free of hallucination, and Qwen3‑4B and Qwen3‑30B‑A3B measure pass rates at multiple levels to remove noisy labels.
P4 Iterative Refinement: Apply 13‑gram overlap filtering to prevent data contamination, then train and run benchmark diagnostics. If a dimension stalls, investigate reward cheating or answer ambiguity and clean the data; if the signal is insufficient, return to P1‑P3 to add targeted data, looping until performance stabilizes.
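The 13‑gram overlap filter in P4 can be sketched as follows. The source does not specify the tokenizer, which fields are compared, or the overlap threshold, so this version assumes lowercase whitespace tokens and drops a training sample as soon as it shares a single 13‑gram with any benchmark text. All function names are hypothetical.

```python
def ngrams(text: str, n: int = 13) -> set:
    """Set of word n-grams (as tuples) over lowercase whitespace tokens."""
    tokens = text.lower().split()
    # range() is empty when the text has fewer than n tokens.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_benchmark_index(benchmark_texts, n: int = 13) -> set:
    """Union of all n-grams that appear in any benchmark text."""
    index = set()
    for text in benchmark_texts:
        index |= ngrams(text, n)
    return index

def is_contaminated(sample_text: str, index: set, n: int = 13) -> bool:
    """True if the sample shares at least one n-gram with the benchmark index."""
    return not ngrams(sample_text, n).isdisjoint(index)

def decontaminate(samples, benchmark_texts, n: int = 13):
    """Keep only the training samples with no n-gram overlap with the benchmarks."""
    index = build_benchmark_index(benchmark_texts, n)
    return [s for s in samples if not is_contaminated(s, index, n)]
```

Building the benchmark index once and testing each sample with a set intersection keeps the cost roughly linear in total token count, which matters when documents run to hundreds of thousands of tokens.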
TMN‑Reweight: Optimizer for Heterogeneous Multitask Training
The capability‑oriented dataset yields nine distinct reward functions with varying scales and variances. Standard GRPO mixing faces two intertwined issues:
Issue 1: Difficulty‑induced advantage estimation bias. GRPO divides advantage by the within‑group reward standard deviation, inflating advantage for extremely easy or hard prompts and compressing it for medium‑difficulty prompts, which are actually the most valuable for training.
Issue 2: Inconsistent reward scales across tasks. Different evaluation metrics (EM, F1, ROUGE‑L) produce disparate reward distributions. Dr. GRPO’s removal of per‑prompt variance leads high‑variance tasks (e.g., F1 retrieval) to dominate gradients, drowning out low‑variance tasks (e.g., binary accuracy).
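Both issues can be reproduced with a few lines of arithmetic. The sketch below assumes the usual formulations: GRPO centers each group's rewards and divides by the group's standard deviation (population standard deviation here, which is an assumption since implementations differ), while Dr. GRPO only centers. The reward vectors are made‑up examples, not data from the paper.

```python
from statistics import fmean, pstdev

def grpo_advantages(rewards, eps=1e-8):
    """Center by the group mean, then divide by the group standard deviation."""
    mu, sigma = fmean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def dr_grpo_advantages(rewards):
    """Center by the group mean only; no per-prompt variance normalization."""
    mu = fmean(rewards)
    return [r - mu for r in rewards]

def mean_abs(values):
    """Average magnitude, used as a rough proxy for gradient contribution."""
    return fmean([abs(v) for v in values])

# Issue 1: a hard prompt (1 of 8 rollouts correct) versus a medium one (4 of 8).
hard = [1, 0, 0, 0, 0, 0, 0, 0]
medium = [1, 1, 1, 1, 0, 0, 0, 0]
print(grpo_advantages(hard)[0])    # about 2.65 for the single success
print(grpo_advantages(medium)[0])  # about 1.0 for each success

# Issue 2: without variance normalization, a wide reward spread dominates.
low_variance = [0.50, 0.52, 0.48, 0.50]
high_variance = [0.1, 0.9, 0.2, 0.8]
print(mean_abs(dr_grpo_advantages(low_variance)))   # about 0.01
print(mean_abs(dr_grpo_advantages(high_variance)))  # about 0.35
```

The successful rollout on the hard prompt receives more than twice the advantage of a success on the medium prompt under GRPO. Under Dr. GRPO, the high‑spread group contributes advantages roughly 35 times larger than the low‑spread one.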
Core Idea of TMN‑Reweight. Decouple scale normalization and difficulty correction into two independent steps.
Step 1: Task‑level Mean Normalization (TMN). Instead of dividing each prompt's advantages by that prompt's own standard deviation, TMN computes the group‑wise reward standard deviation of every prompt, then aggregates these values by root mean square (RMS) within each task to obtain one denominator shared by all prompts of that task. This aligns reward scales across tasks while retaining the difficulty structure within each task. Experiments show TMN reduces the coefficient of variation of cross‑task advantages from 0.54 (Dr. GRPO) and 0.34 (standard GRPO) to 0.18.
Step 2: Difficulty‑Adaptive Re‑weighting. After scale alignment, estimate prompt difficulty using a smoothed pass‑rate (interpolating per‑prompt average reward with the task‑level baseline to avoid high variance under small batches). Compute weights in a four‑quadrant asymmetric fashion: for difficult prompts, amplify positive advantage and shrink negative advantage; for easy prompts, do the opposite. This “four‑quadrant” gradient redistribution strengthens exploration on hard samples while maintaining diversity on easy ones.
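The two steps can be sketched together as below. This is a reading of the prose description and not the released implementation. The source gives no formulas, so the interpolation factor `lam`, the strength `alpha`, the linear weight form, the 0.5 difficulty threshold, and the assumption that rewards lie in [0, 1] are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from math import sqrt
from statistics import fmean, pstdev

def tmn_advantages(groups, eps=1e-8):
    """Step 1. groups: list of {"task": str, "rewards": [float, ...]}, one per prompt.
    Divides centered rewards by the RMS of per-prompt stds within the same task."""
    stds_by_task = defaultdict(list)
    for g in groups:
        stds_by_task[g["task"]].append(pstdev(g["rewards"]))
    # One shared denominator per task: root mean square of the per-prompt stds.
    denom = {t: sqrt(fmean([s * s for s in stds])) for t, stds in stds_by_task.items()}
    out = []
    for g in groups:
        mu = fmean(g["rewards"])
        out.append([(r - mu) / (denom[g["task"]] + eps) for r in g["rewards"]])
    return out

def smoothed_pass_rate(rewards, task_baseline, lam=0.5):
    """Interpolate the prompt's mean reward with the task-level baseline."""
    return (1 - lam) * fmean(rewards) + lam * task_baseline

def difficulty_weight(advantage, pass_rate, alpha=1.0):
    """Step 2 (assumed linear form). Hard prompts (pass_rate < 0.5) amplify positive
    and shrink negative advantages; easy prompts do the opposite."""
    hardness = 0.5 - pass_rate            # > 0 for hard prompts, < 0 for easy ones
    sign = 1.0 if advantage > 0 else -1.0
    return max(0.0, 1.0 + alpha * sign * hardness)

def tmn_reweight(groups, lam=0.5, alpha=1.0):
    """Apply task-level normalization, then difficulty-adaptive reweighting."""
    advantages = tmn_advantages(groups)
    rewards_by_task = defaultdict(list)
    for g in groups:
        rewards_by_task[g["task"]].extend(g["rewards"])
    baseline = {t: fmean(r) for t, r in rewards_by_task.items()}
    out = []
    for g, adv in zip(groups, advantages):
        p = smoothed_pass_rate(g["rewards"], baseline[g["task"]], lam)
        out.append([a * difficulty_weight(a, p, alpha) for a in adv])
    return out
```

Because the denominator is shared per task, a prompt with a small reward spread keeps a small advantage relative to its task peers, so the difficulty information survives Step 1 and is then adjusted explicitly in Step 2.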
Experimental Results
Main Result: 4B model reaches SOTA. At the 4B scale, data and algorithm contributions can be isolated. Trained with vanilla GRPO, the 4B model already outperforms QwenLong‑L1.5 (GRPO) by 6.1 points (62.2 vs 56.1) and exceeds the specialized AEPO version (59.4). Adding TMN‑Reweight lifts the score further to 63.0.
Main Result: 30B model surpasses top flagship models. GoLongRL‑30B‑A3B achieves 69.8, beating DeepSeek‑R1‑0528 (68.67), Qwen3‑235B‑A22B‑Thinking‑2507 (68.45), and Gemini‑2.5‑Flash‑Thinking (68.73), and also outperforms the same‑algorithm baseline QwenLong‑L1.5‑30B (67.2).
General Ability Retention and Transfer. Long‑context RL training does not cause negative transfer. Both 4B and 30B models show modest gains on MMLU‑Pro, AIME24/25, GPQA‑Diamond, with consistent trends across scales.
Transfer effects are notable: tasks unseen during training (Memory‑Vec, Memory‑Rec_Sum) still improve (4B +9.7, 30B +4.5). Dialogue memory (LongMemEval) gains 13.6 points at both scales, with the 30B model exceeding QwenLong‑L1.5‑30B’s 72.2. This indicates that the learned information‑integration ability transfers to novel tasks.
Length Extrapolation. Training context length is 160K tokens, yet capabilities generalize to longer sequences. The 4B model improves by 12.27 points from 128K→512K and 3.50 points from 512K→1M on MRCR; the 30B model shows larger gains (+12.61, +5.45, and +2.74 on CorpusQA 1M).
Summary. Data coverage and reward diversity are the primary bottlenecks for long‑context RL, not the algorithms themselves. Expanding tasks beyond “complex retrieval paths” to a broader ability spectrum and matching each task with a semantically appropriate reward enables even modest models to achieve flagship‑level long‑context performance.
All datasets, models, and training/evaluation code are fully open‑source.
