How Galaxea’s Autoregressive G0.5 Model Sweeps Seven Embodied Benchmarks
Galaxea’s new G0.5 model outperforms the π0.5 baseline on seven embodied‑AI benchmarks. It uses a unified autoregressive Transformer that generates reasoning and action tokens in a single sequence, and it reports large gains in zero‑shot transfer, real‑robot fine‑tuning, simulation, and long‑horizon tasks.
Benchmark Results
Zero‑shot transfer (DROID): average success 82.5% on 10 desktop tasks, a 25‑point gain over π0.5‑DROID (57.5%).
Real‑robot fine‑tuning on the self‑developed R1 Lite / R1 Pro platform (6 tasks: folding towels, folding boxes, organising pencil cases, box stacking, etc.): average success 76.7%, 23.7 points higher than π0.5 (53.0%) and more than three times GR00T‑N1.7 (24.4%).
Simulation benchmarks:
LIBERO: 98.9% overall, 98.6% on the long‑horizon subset.
RoboTwin 2.0: 93.3% average (highest recorded).
SimplerEnv‑Bridge: 87.3% average.
Long‑horizon benchmark BEHAVIOR‑1K (50 tasks, each ~6.6 min): a single checkpoint after 1 epoch yields a Task Success Score of 0.2904, surpassing π0.5 after 4 epochs (0.2626) and the ensemble champion (0.2605). Extending training to 4 epochs raises the score to 0.3136; G0.5 leads π0.5 on 29 of 50 tasks.
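The headline margins can be rechecked from the reported figures alone. The short sketch below does that arithmetic; the helper name `point_gain` and the dictionary layout are illustrative choices, and only the numbers come from the results above.

```python
# Success rates in percent, copied from the reported results: (G0.5, pi0.5).
results = {
    "DROID zero-shot": (82.5, 57.5),
    "R1 Lite / R1 Pro fine-tuning": (76.7, 53.0),
}

def point_gain(g05: float, baseline: float) -> float:
    """Absolute gain in percentage points, rounded to one decimal place."""
    return round(g05 - baseline, 1)

gains = {name: point_gain(*pair) for name, pair in results.items()}

# Ratio of G0.5 to GR00T-N1.7 on the real-robot tasks (76.7% vs 24.4%).
groot_ratio = round(76.7 / 24.4, 2)

# BEHAVIOR-1K: 1-epoch G0.5 score minus 4-epoch pi0.5 score.
behavior_gain = round(0.2904 - 0.2626, 4)

for name, gain in gains.items():
    print(f"{name}: +{gain} points")
print(f"G0.5 / GR00T-N1.7 ratio: {groot_ratio}")
print(f"BEHAVIOR-1K margin after 1 epoch: +{behavior_gain}")
```

On these figures the real‑robot margin over π0.5 is 23.7 points, and G0.5's success rate is a little over three times that of GR00T‑N1.7.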
Architectural Redesign
G0.5 replaces the conventional split‑model pipeline (vision‑language encoder + separate action expert) with a single Transformer decoder that autoregressively generates both reasoning and action tokens in one sequence.
Cross‑Embodiment ActionCodec: maps 18 robot embodiments to a 27‑dimensional action space; generates tokens only for moving joints, enabling sparse high‑frequency control.
Native Chain‑of‑Thought (CoT): inserts four reasoning token types (sub‑task text, object bounding box, 2‑D end‑effector trajectory, action hint) before action tokens. On out‑of‑distribution tasks “bread into air‑fryer” and “bacon frying”, CoT adds 30‑35 percentage points to success.
Vision Memory Module: inserts decomposed spatio‑temporal attention every four Vision Transformer layers with 30% random historic‑frame dropout, improving robustness on long‑horizon tasks.
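To make the "tokens only for moving joints" idea concrete, here is a minimal sketch of sparse action tokenization over a shared 27‑dimensional vector. It is not the ActionCodec from the report: the movement threshold, the 256‑bin uniform quantizer, the delta range, and the 7‑joint slot assignment are all assumptions chosen for illustration. Only the width of 27 comes from the text.

```python
ACTION_DIM = 27      # shared action-space width stated in the article
MOVE_EPS = 1e-3      # ASSUMED threshold below which a joint counts as idle
NUM_BINS = 256       # ASSUMED number of uniform quantization bins
DELTA_RANGE = 1.0    # ASSUMED symmetric range of a per-step joint delta

def to_shared_space(action, joint_slots):
    """Scatter an embodiment's native action into the shared 27-dim vector."""
    shared = [0.0] * ACTION_DIM
    for slot, value in zip(joint_slots, action):
        shared[slot] = value
    return shared

def encode_sparse(shared):
    """Emit (slot, bin) tokens only for joints that actually move."""
    tokens = []
    for slot, value in enumerate(shared):
        if abs(value) <= MOVE_EPS:
            continue  # idle joint: no token is generated
        clipped = max(-DELTA_RANGE, min(DELTA_RANGE, value))
        unit = (clipped + DELTA_RANGE) / (2 * DELTA_RANGE)  # map to [0, 1]
        tokens.append((slot, round(unit * (NUM_BINS - 1))))
    return tokens

def decode_sparse(tokens):
    """Rebuild the shared vector; joints without a token stay at zero."""
    shared = [0.0] * ACTION_DIM
    for slot, bin_id in tokens:
        shared[slot] = bin_id / (NUM_BINS - 1) * 2 * DELTA_RANGE - DELTA_RANGE
    return shared

# Hypothetical 7-joint arm occupying slots 0..6; only two joints move.
arm_slots = list(range(7))
arm_action = [0.0, 0.25, 0.0, 0.0, -0.5, 0.0, 0.0]

tokens = encode_sparse(to_shared_space(arm_action, arm_slots))
recovered = decode_sparse(tokens)
print(tokens)
```

In this toy case a 27‑wide action collapses to two tokens, which is the kind of saving that would make higher control rates cheaper to decode. The quantization error per joint is bounded by one bin width.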
Why Autoregressive Decoding Works
An ablation compares autoregressive (AR) decoding with flow matching (FM) using the same pretrained checkpoint and the same CoT. AR gains 30‑35 points on complex tasks, whereas FM gains only about 10 points. Human evaluation rates CoT quality as comparable (≈90% on PP‑Bench for AR, 85% for FM), which suggests the performance gap originates from the decoding strategy rather than from reasoning quality.
Language Prompting as Direct Control
Because reasoning and action share a single sequence, modifying the natural‑language prompt changes the robot’s execution style without additional training. Enriching simple commands with adverbs and spatial modifiers raises success on out‑of‑distribution tasks by 10‑15 points.
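One way to see why the prompt can steer execution is to look at a causal attention mask over a single sequence. The sketch below is a toy illustration under assumptions: the segment lengths are made up, image tokens are omitted, and placing the prompt at the start of the sequence is assumed rather than taken from the report. The order of the four reasoning types before the action tokens follows the article.

```python
# (segment name, ASSUMED token count). Order: prompt, reasoning, action.
SEGMENTS = [
    ("prompt", 6),
    ("subtask_text", 4),
    ("bounding_box", 4),
    ("ee_trajectory", 8),
    ("action_hint", 2),
    ("action", 5),
]

def segment_labels(segments):
    """Expand segments into one label per token position."""
    labels = []
    for name, length in segments:
        labels.extend([name] * length)
    return labels

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def visible_segments(labels, mask, position):
    """Segments that the token at `position` can attend to, in order."""
    seen = []
    for j, allowed in enumerate(mask[position]):
        if allowed and labels[j] not in seen:
            seen.append(labels[j])
    return seen

labels = segment_labels(SEGMENTS)
mask = causal_mask(len(labels))

first_action = labels.index("action")
context_of_first_action = visible_segments(labels, mask, first_action)
print(context_of_first_action)
```

Under this layout the first action token attends to the prompt and to every reasoning segment, so a reworded prompt changes the context that conditions both the reasoning and the action tokens, with no weight update involved.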
Vision Context Improves Language Following
Providing cropped visual patches of target objects alongside textual commands raises language‑following accuracy from 84.4% to 98.4% and task success from 75.0% to 84.4% under a 50‑hour training setting.
Full‑Stack Closed‑Loop Pipeline
Backbone: Qwen‑3.5 2B vision‑language model, pretrained on ~1 billion image‑text pairs (including 5 million embodied VQA samples) and data from 18 robot embodiments over ~12k steps. Pipeline: embodiment data → pre‑training → autoregressive inference → robot action.
Technical report: https://opengalaxea.github.io/G05/Galaxea_G0_5.pdf
Project page: https://opengalaxea.github.io/G05/
