Capture Character Animation from Any Object Using Just a Phone – CHI 2026 Best Paper Nominee

DancingBox demonstrates that a single RGB camera, a flat calibration board, and any handheld object can be used to capture realistic character animation by first estimating coarse 3D bounding‑box motion with visual foundation models and then refining it with a diffusion‑based motion generation model, validated by a user study.

Machine Heart

Introduction

Creating character animation is essential for film and game production but traditionally requires expensive motion‑capture rigs or skilled 3D animators. The authors identify the "digital puppetry" problem: enabling intuitive physical interaction (e.g., with a phone or toy) to generate virtual skeletal animation.

System Overview

DancingBox, a CHI 2026 best‑paper nominee, achieves high‑quality animation using only an RGB camera, a ground‑plane calibration board, and any arbitrary object. The system bridges coarse motion capture and fine motion generation via a bounding‑box intermediate representation.

Coarse Motion Capture (MoCap)

The MoCap pipeline combines three visual foundation models: SAM2, CoTracker3, and π3. From a video of the user‑manipulated object, π3 produces per‑frame monocular 3D point clouds, while the user interacts with SAM2's video mode to segment the object and its parts in the first frame; the masks are then propagated through the remaining frames. Combining the masks with the point clouds yields per‑frame 3D point clouds for each object part.

To obtain a continuous 3D bounding‑box sequence, the method initializes a PCA‑derived box on the first frame, then uses CoTracker3 to extract pixel‑level correspondences across frames, which π3 lifts to 3D point‑cloud correspondences. A singular value decomposition (SVD) then solves the rigid transform of the box at each frame, yielding the full trajectory.
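The SVD step above is the classic least‑squares rigid‑alignment (Kabsch) solve. Below is a minimal NumPy sketch of the idea, assuming correspondences between consecutive frames; the function names and the exact chaining scheme are ours, not necessarily the paper's.

```python
import numpy as np

def kabsch(src, dst):
    """Least-squares rigid transform (R, t) mapping src -> dst, both (N, 3)."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)        # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))     # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t

def propagate_box(box0, frame_pairs):
    """box0: (8, 3) initial corners; frame_pairs: list of (src, dst) 3D point
    correspondences between consecutive frames. Chains per-frame rigid fits
    to carry the box through the whole sequence."""
    boxes = [box0]
    for src, dst in frame_pairs:
        R, t = kabsch(src, dst)
        boxes.append(boxes[-1] @ R.T + t)      # apply the frame's transform
    return boxes
```

In practice the tracked correspondences are noisy, so the least‑squares nature of the SVD solve is what keeps the box trajectory stable.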

The authors address a potential objection: why not feed the raw point clouds directly to the next stage? They argue that training the fine‑grained motion generator requires paired spatial signals (point cloud or bounding box) and ground‑truth skeletal motion, which is unavailable from point clouds alone.

Bounding‑Box as a Bridge

Bounding‑box sequences solve the data‑pairing problem: point‑cloud tracking provides the spatial signal, while skeletal motion datasets can be converted to corresponding bounding‑box trajectories. By fixing a size range for the boxes, this intermediate representation cleanly connects the two modules.
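The skeleton‑to‑box conversion can be sketched as computing a per‑frame axis‑aligned box over the joints with its extents clamped to the fixed size range. This is a simplified NumPy illustration with assumed parameter values; the paper's actual merging strategy (Fig. 4) may differ.

```python
import numpy as np

def joints_to_boxes(joints, min_ext=0.2, max_ext=2.0):
    """joints: (T, J, 3) skeletal animation -> (T, 2, 3) per-frame box as
    (min corner, max corner), with each extent clamped to the fixed size
    range [min_ext, max_ext]. Size limits here are illustrative."""
    lo, hi = joints.min(axis=1), joints.max(axis=1)      # (T, 3) each
    center = 0.5 * (lo + hi)
    half = np.clip(0.5 * (hi - lo), 0.5 * min_ext, 0.5 * max_ext)
    return np.stack([center - half, center + half], axis=1)
```

Because this conversion is cheap and deterministic, any skeletal dataset such as HumanML3D can be turned into paired (box trajectory, skeletal motion) training data.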

Fine Motion Generation (MoGen)

MoGen trains a ControlNet that injects bounding‑box control signals into a pre‑trained text‑to‑motion diffusion model (Human‑Motion‑Diffusion‑Model, MDM). Using the HumanML3D dataset, the authors compute bounding‑box trajectories for each skeletal animation via a merging strategy (see Fig. 4). To mimic real‑world estimation errors, they randomly scale, drop, and add noise to the boxes.
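The three augmentations (random scaling, frame drops, additive noise) can be sketched as below. The perturbation magnitudes and the function interface are assumptions for illustration, not the paper's reported settings.

```python
import numpy as np

def augment_boxes(boxes, rng, scale=(0.9, 1.1), drop_p=0.1, sigma=0.02):
    """boxes: (T, 8, 3) corner trajectory. Returns a perturbed copy and a
    boolean mask of surviving frames, imitating three MoCap error modes:
    global scale error, frame dropouts, and per-corner jitter."""
    out = boxes * rng.uniform(*scale)                   # random scale error
    keep = rng.random(len(boxes)) >= drop_p             # random frame drops
    out = out + rng.normal(0.0, sigma, boxes.shape)     # additive noise
    return out[keep], keep
```

Training on perturbed boxes teaches the generator to tolerate the imperfect trajectories produced by the coarse MoCap stage at test time.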

Following PointNet principles, the ControlNet aggregates box features with order‑invariant max and mean pooling, ensuring that vertex order or box ordering does not affect the extracted features.
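The order‑invariance property comes from using symmetric (permutation‑invariant) pooling functions, as in PointNet. A minimal NumPy sketch, with a function name of our choosing:

```python
import numpy as np

def pool_box_features(feats):
    """feats: (..., 8, C) per-corner features for each box. Concatenating
    max- and mean-pooling over the corner axis yields a descriptor that is
    unchanged by any reordering of the corners (or points)."""
    return np.concatenate([feats.max(axis=-2), feats.mean(axis=-2)], axis=-1)
```

Any permutation of the corner axis produces the identical descriptor, so the ControlNet never has to learn a canonical vertex ordering.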

User Study

In the user study, participants reported that DancingBox was intuitive and easy to use, even for novices. Sample questionnaire results (Fig. 6) highlight two main findings:

Users desire more flexible objects to enable diverse, detailed performances.

Controlling multi‑joint objects with two hands is difficult, and object stability significantly impacts usability.

The team notes a trade‑off between degrees of freedom and interaction simplicity, hoping to inspire further research on interactive devices.

Results and Demonstrations

Video demos (Figs. 7–9) showcase a variety of animations generated from everyday objects such as a phone. The authors invite researchers and industry practitioners to explore the project page for additional cases.

Conclusion

DancingBox is the first system to produce high‑fidelity character animation from arbitrary objects using only an RGB camera, leveraging visual foundation models for coarse capture and diffusion models for fine synthesis. Ongoing work aims to refine robustness and expand interactive possibilities.

Tags: AI, diffusion model, motion capture, human-computer interaction, character animation, DancingBox
Written by Machine Heart, a professional AI media and industry service platform.