How UniCodebook’s Unified 2D‑3D Discrete Priors Boost Noise‑Robust, Calibration‑Free 3D Human Pose Estimation

UniCodebook introduces a unified 2D‑3D discrete prior that combines continuous and discrete representations, enabling calibration‑free multiview 3D human pose estimation with superior noise robustness and higher accuracy, as demonstrated by state‑of‑the‑art results on Human3.6M and MPI‑INF‑3DHP.

Network Intelligence Research Center (NIRC)

Problem and Challenges

3D human pose estimation (3D HPE) is a core technology for AR/VR interaction, action recognition, and robot vision. Existing multiview approaches rely heavily on precise camera calibration, which is costly and inflexible, and calibration‑free methods suffer from noise propagation because they lack geometric constraints.

Motivation

The authors observe that discrete representations naturally resist noise: mapping continuous poses to a finite set of prototypes means small perturbations are snapped to the same token, providing inherent denoising. They also note that human poses obey biomechanical priors, which can be captured in a learned discrete codebook.
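The snapping argument above can be illustrated with a toy example (this is an illustration of the general principle, not the paper's model): continuous poses are mapped to the nearest prototype in a small codebook, so a small perturbation lands on the same discrete token and is effectively removed.

```python
import numpy as np

# Toy codebook of 2-D "pose" prototypes. Real codebooks hold learned
# high-dimensional pose embeddings; this one is hand-picked for clarity.
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

def quantize(x, codebook):
    """Index of the nearest prototype under L2 distance."""
    return int(np.argmin(np.linalg.norm(codebook - x, axis=1)))

clean = np.array([1.0, 0.0])
noisy = clean + np.array([0.04, -0.03])   # small keypoint perturbation

# Both the clean and the perturbed pose snap to the same token.
print(quantize(clean, codebook), quantize(noisy, codebook))  # 1 1
```

Because quantization is many-to-one, any perturbation smaller than half the distance to the next prototype is absorbed entirely, which is the denoising behaviour the authors exploit.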

Core Idea: Unified 2D‑3D Discrete Priors

UniCodebook builds a unified 2D‑3D discrete prior that serves as an “anchor” for continuous features, guiding them with geometric and anatomical cues. The discrete codebook stores plausible human poses, allowing the model to fall back on stable prototypes when input noise is present.

Model Architecture

The framework consists of two streams:

Backbone network: a continuous Transformer-based model that provides high-precision regression.

Side branch: integrates the discrete prior into the continuous regression pipeline, preserving accuracy while enhancing robustness.

Two complementary flows are defined:

Continuous stream: retains the Transformer's fine-grained modeling capability for precise 3D regression.

Discrete stream: maps the current pose to a discrete token via the UniCodebook, supplying a structured memory that stabilizes predictions against occlusion and detection errors.

Fusion occurs through a soft injection mechanism: continuous joint features attend to discrete prototypes via a Discrete‑Continuous Spatial Attention (DCSA) module, which combines joint‑to‑joint self‑attention with joint‑to‑token cross‑attention.
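The DCSA fusion described above can be sketched as two chained attention steps. This is a simplified single-head, projection-free sketch of the stated idea (joint-to-joint self-attention followed by joint-to-token cross-attention with a residual "soft injection"); the shapes and the residual combination are assumptions, not the paper's exact design.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention without learned projections."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def dcsa(joints, codebook):
    """joints: (J, d) continuous joint features; codebook: (K, d) tokens."""
    self_out = attention(joints, joints, joints)         # joint-to-joint
    cross_out = attention(self_out, codebook, codebook)  # joint-to-token
    return self_out + cross_out                          # soft injection

rng = np.random.default_rng(0)
fused = dcsa(rng.normal(size=(17, 8)),    # 17 joints, 8-dim features
             rng.normal(size=(32, 8)))    # 32 codebook tokens
print(fused.shape)  # (17, 8)
```

The key point is that the discrete prototypes enter only as attention keys and values, so they guide the continuous features without hard-replacing them.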

Training Stages

Stage I: Train a unified 2D-3D discrete codebook on the AMASS dataset using four strategies (2D→2D, 2D→3D, 3D→2D, 3D→3D) to learn a shared representation that bridges 2D and 3D poses.

Stage II: Perform multiview 2D→3D lifting, injecting the learned discrete prior as soft guidance into the main lifting network.
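Stage I is a vector-quantization setup: 2D and 3D pose encodings pass through the same quantizer so they share one token vocabulary. The sketch below shows only the quantization step and the standard VQ codebook/commitment losses; the four reconstruction directions, the encoder/decoder, and the weight `beta` are assumptions based on common VQ practice, not details confirmed by the source.

```python
import numpy as np

def vq_losses(z_e, codebook, beta=0.25):
    """z_e: (N, d) encoder outputs; codebook: (K, d) learnable tokens.

    Returns nearest-token indices and the combined VQ loss. In a gradient
    framework the first term updates only the codebook (stop-gradient on
    z_e) and the second only the encoder; numpy omits that machinery.
    """
    d2 = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (N, K)
    idx = d2.argmin(axis=1)
    z_q = codebook[idx]
    codebook_loss = ((z_q - z_e) ** 2).mean()       # pull codes to encodings
    commit_loss = beta * ((z_e - z_q) ** 2).mean()  # pull encodings to codes
    return idx, codebook_loss + commit_loss

# An encoding that already sits on a codebook entry incurs zero loss.
cb = np.eye(4)
idx, loss = vq_losses(cb[1:2], cb)
print(idx[0], loss)  # 1 0.0
```

Training both 2D-derived and 3D-derived encodings against the same `codebook` is what makes the prior "unified": a 2D pose and its 3D counterpart are pushed toward the same token.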

Experiments

On Human3.6M, UniCodebook + DCSA achieves state-of-the-art MPJPE of 26.0 mm with CPN-detected 2D input and 7.74 mm with ground-truth 2D input, demonstrating that the method maintains high-precision regression while improving cross-view fusion stability.
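For reference, the metric quoted above, MPJPE (Mean Per-Joint Position Error), is the average Euclidean distance between predicted and ground-truth joint positions, typically reported in millimetres:

```python
import numpy as np

def mpjpe(pred, gt):
    """pred, gt: (J, 3) joint coordinates in mm; returns mean L2 error."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

gt = np.zeros((17, 3))
pred = gt + np.array([3.0, 0.0, 4.0])   # every joint offset by a 3-4-5 vector
print(mpjpe(pred, gt))  # 5.0
```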

On MPI-INF-3DHP, without extra fine-tuning, the model reaches an MPJPE of 3.37 mm, PCK of 99.9%, and AUC of 95.94%, surpassing existing calibration-free approaches and confirming the benefit of the unified discrete prior for generalization.

Noise‑robustness tests add Gaussian noise of varying intensity to 1–4 random views’ 2D keypoints without retraining. UniCodebook‑enhanced models remain stable across all noise levels, with larger gains at higher noise, indicating that discrete prototypes act as structured anchors that suppress noise propagation.
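The corruption protocol described above can be sketched as follows. The function name, the array layout, and the noise level are illustrative assumptions; only the protocol itself (Gaussian noise added to the 2D keypoints of 1-4 randomly chosen views, no retraining) comes from the source.

```python
import numpy as np

def corrupt_views(kpts_2d, sigma, n_views, rng):
    """kpts_2d: (V, J, 2) per-view 2D keypoints.

    Adds Gaussian noise of std `sigma` (pixels) to `n_views` randomly
    selected views and leaves the rest untouched.
    """
    out = kpts_2d.copy()
    chosen = rng.choice(kpts_2d.shape[0], size=n_views, replace=False)
    out[chosen] += rng.normal(scale=sigma, size=(n_views,) + kpts_2d.shape[1:])
    return out, chosen

rng = np.random.default_rng(0)
clean = np.zeros((4, 17, 2))                       # 4 views, 17 joints
noisy, views = corrupt_views(clean, sigma=5.0, n_views=2, rng=rng)
```

Sweeping `sigma` and `n_views` then traces out the robustness curves the paper reports, with the unquantized baseline degrading faster than the codebook-guided model.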

Conclusion

The study proposes a novel way to improve noise robustness in multiview 3D human pose estimation by unifying discrete and continuous representations. The approach yields higher accuracy on clean data and stronger stability under noisy conditions, suggesting promising applications in VR/AR, intelligent monitoring, and sports analysis.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

transformer, multiview, 3D pose estimation, NeurIPS 2025, noise robustness, discrete priors
Written by

Network Intelligence Research Center (NIRC)

NIRC is based on the National Key Laboratory of Network and Switching Technology at Beijing University of Posts and Telecommunications. It has built a technology matrix across four AI domains—intelligent cloud networking, natural language processing, computer vision, and machine learning systems—dedicated to solving real‑world problems, creating top‑tier systems, publishing high‑impact papers, and contributing significantly to the rapid advancement of China's network technology.
