How G3PT Uses Autoregressive Modeling to Revolutionize 3D Generation
The paper introduces G3PT, a groundbreaking autoregressive 3D generation model that employs a Cross‑Scale Querying Transformer and multi‑scale tokenization to produce high‑quality meshes from a single image, outperforming diffusion‑based methods and revealing a scaling law for 3D generation.
Introduction
Recent advances in 3D shape generation have relied on large reconstruction models (LRMs) that convert images to 3D shapes or extend 2D diffusion models to the 3D domain. These approaches suffer from dependence on multi‑view image fidelity, difficulty generating high‑quality meshes, and limited ability to capture complex geometry. Additionally, 3D variational auto‑encoders and diffusion models require long training times and lack scalable strategies.
At the same time, autoregressive (AR) large language models and multimodal AR models have achieved remarkable success in language and image generation by predicting the next token in a discrete token sequence. Extending AR models to 3D generation is challenging because 3D data is inherently unordered, making direct token prediction incompatible.
G3PT is proposed to address this challenge. Recognizing that 3D data has a natural multi‑resolution structure, G3PT introduces a multi‑scale tokenizer built on a Cross‑Scale Querying Transformer (CQT) and a cross‑scale autoregressive (CAR) modeling framework, which map unordered 3D data into discrete tokens at different levels of detail and thereby establish a sequential order suitable for AR modeling.
Approach
The core of G3PT consists of the Cross‑Scale Querying Transformer (CQT) and the Cross‑Scale Autoregressive (CAR) framework.
Tokenization stage (CQT): A Transformer‑based tokenizer encodes high‑resolution point clouds into latent tokens. Cross‑attention layers fuse the input point cloud with learnable latent queries. CQT then applies attention with learnable down‑sampling and up‑sampling queries to decompose these tokens into multiple scales, using multi‑level residual quantization to capture geometric detail at each resolution. Finally, a decoder converts the quantized tokens back into a 3D occupancy grid, from which the mesh is extracted.
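The multi‑level residual quantization step can be sketched as follows. This is a minimal illustration, not the paper's implementation: the random codebook, the scale schedule, and the pooling‑based down/up‑sampling are placeholder assumptions standing in for CQT's learned query‑based attention.

```python
import numpy as np

def quantize(x, codebook):
    """Nearest-neighbor lookup: map each vector in x to its closest codebook entry."""
    # x: (n, d), codebook: (k, d)
    d2 = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (n, k) distances
    idx = d2.argmin(axis=1)
    return codebook[idx], idx

def multi_scale_residual_quantize(tokens, codebook, scales):
    """Quantize latent tokens scale by scale; each level encodes the residual
    left over after all coarser levels (multi-level residual quantization)."""
    residual = tokens.copy()
    reconstruction = np.zeros_like(tokens)
    all_indices = []
    for n in scales:  # e.g. [1, 4, 16, 64] tokens per scale, coarse to fine
        # placeholder down-sampling: average-pool the residual to n tokens
        pooled = residual.reshape(n, -1, residual.shape[-1]).mean(axis=1)
        q, idx = quantize(pooled, codebook)
        # placeholder up-sampling: repeat each quantized token back to full length
        up = np.repeat(q, residual.shape[0] // n, axis=0)
        reconstruction += up
        residual -= up
        all_indices.append(idx)
    return reconstruction, all_indices
```

Each scale only has to encode what the coarser scales missed, which is what makes the token sequence naturally ordered from coarse to fine.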
CAR stage: Using the CQT learned in the tokenization stage, G3PT aligns tokens across scales and autoregressively predicts the tokens of the next scale. Starting from the coarsest scale, the model progressively refines the representation until the desired level of detail is reached; cross‑scale dimensional alignment lets information flow between scales, enabling fine‑grained 3D generation.
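The coarse‑to‑fine decoding loop can be sketched like this. The `predict_next_scale` stub is a hypothetical stand‑in for the actual transformer, and the scale schedule, vocabulary size, and greedy decoding are illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_next_scale(context_tokens, target_len, vocab_size=32):
    """Stand-in for the transformer: returns logits for every position of the
    next (finer) scale. A real model would attend over context_tokens."""
    return rng.normal(size=(target_len, vocab_size))

def generate_coarse_to_fine(scales, vocab_size=32):
    """Next-scale autoregressive decoding: start from the coarsest scale and
    predict all tokens of each finer scale in turn, conditioned on the rest."""
    sequence = []  # flat list of token ids, coarsest scale first
    for n in scales:
        logits = predict_next_scale(np.array(sequence), n, vocab_size)
        tokens = logits.argmax(axis=1)  # greedy decode one whole scale at once
        sequence.extend(tokens.tolist())
    return sequence

ids = generate_coarse_to_fine([1, 4, 16])  # 1 + 4 + 16 = 21 token ids
```

Unlike next-token prediction, each autoregressive step here emits an entire scale, so the number of steps grows with the number of scales rather than the number of tokens.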
G3PT also supports conditional generation. For image conditioning, a pretrained DINO‑v2 model extracts image features, which are fused with 3D tokens via attention to ensure semantic consistency. For text conditioning, a pretrained CLIP model provides text embeddings, and an AdaLN mechanism guides the generation to match the textual description.
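AdaLN conditioning, as used here for text, amounts to regressing layer normalization's scale and shift from the condition embedding. A minimal numpy sketch under assumed shapes; the projection matrices `W_scale` and `W_shift` are hypothetical stand‑ins for learned parameters:

```python
import numpy as np

def ada_layer_norm(x, cond_embed, W_scale, W_shift, eps=1e-5):
    """AdaLN: normalize token features, then scale and shift them with
    parameters regressed from a conditioning embedding (e.g. a CLIP text vector)."""
    # x: (n_tokens, d), cond_embed: (c,), W_scale/W_shift: (c, d)
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_norm = (x - mu) / np.sqrt(var + eps)          # standard LayerNorm
    gamma = cond_embed @ W_scale                    # condition-dependent scale
    beta = cond_embed @ W_shift                     # condition-dependent shift
    return x_norm * (1 + gamma) + beta
```

Because the condition only modulates normalization statistics rather than joining the token sequence, the same backbone can be steered by different text prompts without changing its attention pattern.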
Experiments
In the image‑to‑3D task on the Objaverse dataset, G3PT outperforms a range of LRMs and diffusion models on IoU, Chamfer distance, and F‑score, especially when scaled to 1.5 billion parameters. The experiments also reveal a scaling law for 3D generation: test loss decreases as a power law of model size.
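The paper reports only that test loss follows a power law in model size; the fit itself is standard linear regression in log‑log space. A minimal sketch, with entirely synthetic numbers that are not the paper's measurements:

```python
import numpy as np

def fit_power_law(params, losses):
    """Fit L(N) = a * N**(-b) by linear regression in log-log space:
    log L = log a - b * log N."""
    slope, intercept = np.polyfit(np.log(params), np.log(losses), 1)
    return np.exp(intercept), -slope  # prefactor a, exponent b

# Synthetic loss measurements at three hypothetical model sizes:
sizes = np.array([1e8, 5e8, 1.5e9])
losses = 2.0 * sizes ** -0.1
a, b = fit_power_law(sizes, losses)  # recovers a ≈ 2.0, b ≈ 0.1
```

A straight line on a log‑log plot of loss versus parameter count is the usual visual evidence for such a law.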
For text‑to‑3D generation, G3PT produces high‑quality meshes that align well with textual prompts.
Figures (omitted here): quantitative comparison against baselines; visual comparison of generated meshes; text‑controlled generation examples; demonstration of the 3D autoregressive scaling law.
Conclusion
G3PT introduces a Cross‑Scale Querying Transformer and Cross‑Scale Autoregressive modeling to provide an innovative AR framework for unordered 3D data, enabling coarse‑to‑fine high‑quality mesh generation and supporting multiple conditioning modalities.
Experimental results show that G3PT surpasses existing 3D generation methods in quality, establishing a new benchmark for 3D content creation.
However, training G3PT demands substantial computational resources and long training times. Future work will explore more efficient training techniques and richer control conditions to further improve performance and applicability.
Amap Tech
Official Amap technology account showcasing all of Amap's technical innovations.
