
How QuanTaichi Cuts GPU Memory Needs for High‑Fidelity Physics Simulations

QuanTaichi introduces a new language abstraction and compiler system that quantizes simulation data, dramatically reducing memory and bandwidth usage so that high‑precision physical effects—once requiring multiple GPUs—can now run on a single GPU, even on mobile devices.


Advances in computer simulation make it possible to recreate realistic worlds for films like Frozen, but high‑fidelity physics still demands massive memory and expensive GPU clusters.

Researchers from Kuaishou, MIT, Zhejiang University, and Tsinghua University developed a compiler‑level quantization framework for physical simulation called QuanTaichi. By packing low‑precision numeric types, it cuts memory and bandwidth usage, letting effects that once required multiple GPUs run on a single card with visual quality close to full‑precision results.

Technical Foundations

QuanTaichi builds on the Taichi language and compiler, offering custom numeric types:

Custom integers of user‑specified bit widths (signed/unsigned).

Custom floats with three implementations:

- Fixed‑point: an integer plus a compile‑time scaling factor.
- Standard floating‑point with user‑defined mantissa and exponent widths.
- Shared‑exponent floats, where a group of values shares one common exponent, saving bits when their magnitudes are similar.
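To make these representations concrete, here is a pure‑Python sketch of the fixed‑point and shared‑exponent encodings. The function names and bit widths are illustrative, not the Taichi API; in QuanTaichi these types are declared in the language and the compiler generates the packing code.

```python
import math

def fixed_encode(x, frac_bits):
    # Fixed point: the value is stored as an integer scaled by 2**frac_bits.
    return round(x * (1 << frac_bits))

def fixed_decode(q, frac_bits):
    return q / (1 << frac_bits)

def shared_exp_encode(values, mant_bits):
    # One exponent for the whole group, a short mantissa per value.
    e = max(math.frexp(v)[1] for v in values)   # largest binary exponent
    scale = 2.0 ** (e - mant_bits)
    return e, [round(v / scale) for v in values]

def shared_exp_decode(e, mants, mant_bits):
    scale = 2.0 ** (e - mant_bits)
    return [m * scale for m in mants]

# A fixed-point value with 6 fractional bits resolves steps of 1/64.
print(fixed_decode(fixed_encode(3.14159, frac_bits=6), frac_bits=6))  # 3.140625
```

Values whose magnitudes are close to the group maximum survive the shared‑exponent round trip well; much smaller values lose precision, which is the trade‑off this format accepts.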

It also provides bit adapters to map these types onto hardware‑native widths:

Bit structs combine several custom types into a native 32‑bit word.

Bit arrays store many values of the same custom type within one native word.
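The bit‑struct idea can be sketched in plain Python as packing several custom‑width fields into one native 32‑bit word. Again, `pack_bits`/`unpack_bits` are hypothetical helpers for illustration, not Taichi functions:

```python
def pack_bits(fields):
    """Pack (value, width) pairs into one unsigned 32-bit word,
    least-significant field first. Values must fit their widths."""
    word, shift = 0, 0
    for value, width in fields:
        assert 0 <= value < (1 << width)
        word |= value << shift
        shift += width
    assert shift <= 32, "fields exceed one native word"
    return word

def unpack_bits(word, widths):
    out, shift = [], 0
    for width in widths:
        out.append((word >> shift) & ((1 << width) - 1))
        shift += width
    return out

# A hypothetical bit struct: 12 + 12 + 8 bits fill one 32-bit word exactly.
w = pack_bits([(1000, 12), (2000, 12), (200, 8)])
print(unpack_bits(w, [12, 12, 8]))  # [1000, 2000, 200]
```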

Compiler Optimizations

Three key optimizations reduce memory traffic and improve performance:

Bit‑struct fusion storage: batch writes of struct members to minimize atomic operations.

Thread‑safety inference: detect when operations are inherently thread‑safe and avoid costly atomic writes, supporting element‑wise and whole‑struct storage modes.

Bit‑array vectorization: process 32 one‑bit values per native word instead of looping bit by bit, eliminating excessive atomicRMW instructions.
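The payoff of bit‑array vectorization can be seen in a small pure‑Python model (illustrative only): inverting a bit array element by element requires one read‑modify‑write per bit, while operating on whole 32‑bit words handles 32 elements with a single bitwise instruction.

```python
def invert_bits_per_bit(words):
    """Naive: read-modify-write each 1-bit element individually
    (the pattern that costs one atomicRMW per bit on a GPU)."""
    out = list(words)
    for i in range(len(words) * 32):
        w, b = divmod(i, 32)
        bit = (out[w] >> b) & 1
        out[w] = (out[w] & ~(1 << b) & 0xFFFFFFFF) | ((bit ^ 1) << b)
    return out

def invert_bits_vectorized(words):
    """Vectorized: one bitwise op covers 32 elements per native word."""
    return [w ^ 0xFFFFFFFF for w in words]

words = [0x0000FFFF, 0x12345678]
assert invert_bits_per_bit(words) == invert_bits_vectorized(words)
```

Both functions produce identical results; the vectorized form simply replaces a 32‑iteration loop with one XOR per word, which is what the compiler optimization automates.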

Experimental Results

Game of Life: with QuanTaichi, each binary cell state requires one bit instead of a byte, an 8× storage reduction. On an RTX 3080 Ti, the team simulated over 20 billion cells (2048×2048 OTCA tiles).
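A quick back‑of‑the‑envelope check (my arithmetic, using the cell count reported above) shows why the 1‑bit encoding makes this fit on one GPU:

```python
def storage_bytes(num_cells, bits_per_cell):
    # Total bytes needed to store num_cells states at the given bit width.
    return num_cells * bits_per_cell // 8

cells = 20_000_000_000                  # >20 billion cells, as reported
naive = storage_bytes(cells, 8)         # one byte per binary state
packed = storage_bytes(cells, 1)        # one bit per state
print(naive // packed)                  # 8  (the 8x reduction)
print(round(packed / 2**30, 1))         # ~2.3 GiB for the packed grid
```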

Euler fluid simulation: quantization reduced per‑cell storage from 84 bytes to 44 bytes, enabling >420 million sparse‑grid smoke cells on a Tesla V100 (32 GB).

MLS‑MPM elasticity test: custom float quantization lowered per‑particle storage from 68 bytes to 40 bytes, allowing >230 million particles on an RTX 3090.

On an iPhone XS, the quantized MLS‑MPM showed significant speed‑ups because the mobile GPU can perform native 32‑bit integer atomic adds, while floating‑point atomics are not hardware‑supported.
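The mobile speed‑up follows from the encoding: once grid quantities are stored as fixed‑point integers, every particle‑to‑grid scatter becomes an integer add, which the hardware supports as a native atomic. A single‑threaded Python sketch of the idea (the fixed‑point format and values here are illustrative assumptions):

```python
FRAC_BITS = 16  # illustrative fixed-point format: 16 fractional bits

def to_fixed(x):
    return round(x * (1 << FRAC_BITS))

def from_fixed(q):
    return q / (1 << FRAC_BITS)

# Particle-to-grid scatter: each contribution becomes an integer add,
# which a mobile GPU can execute as a native atomicAdd on int32.
contributions = [0.125, 0.5, -0.25, 0.0625]
cell = 0                        # grid-cell accumulator, stored as an int
for c in contributions:
    cell += to_fixed(c)         # on a GPU: atomicAdd(&cell, to_fixed(c))
print(from_fixed(cell))         # 0.4375
```

Floating‑point atomics would require an expensive compare‑and‑swap loop on such hardware; the fixed‑point route sidesteps that entirely.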

Impact

QuanTaichi not only accelerates R&D for games, large‑scale image processing, media codecs, and scientific computing, but also enhances storage efficiency across the Taichi ecosystem, paving the way for broader adoption of quantized physical simulation.

References: paper PDF, project page, GitHub repository.

Tags: graphics, compiler, quantization, GPU optimization, physics simulation, Taichi
Written by Kuaishou Large Model (Official Kuaishou Account)