MagicHOI: Leveraging 3D Priors for Accurate Hand-object Reconstruction from Short Monocular Video Clips

1 The Hong Kong University of Science and Technology (Guangzhou)
2 ETH Zürich, Switzerland
3 The Hong Kong University of Science and Technology
4 Max Planck Institute for Intelligent Systems, Tübingen, Germany

🌴ICCV 2025🥥



MagicHOI: Given a short monocular video clip capturing hand-object interaction, our method reconstructs high-quality 3D object surfaces and a realistic hand-object spatial relationship, including regions occluded by the hand or by object self-occlusion.
(a) Input images and the corresponding reconstructed hand and object surfaces.
(b, c) Comparison of our method with and without the 3D prior, shown from the object's front and back views, demonstrating improved reconstruction in occluded areas.

Abstract

Most RGB-based hand-object reconstruction methods depend on object templates, while existing template-free approaches assume the object is fully visible. This assumption often fails in real-world scenarios, where the camera is fixed and the object is held in a static grip, so parts of the object are never observed, leading to unrealistic reconstructions.

To address this challenge, we introduce MagicHOI, a method for reconstructing hands and objects from short monocular interaction videos, even with limited view variations. Our key insight is that, although paired 3D hand-object data is extremely scarce, large-scale diffusion models, such as image-to-3D models, provide abundant object supervision. This additional supervision acts as a prior to help regularize unseen object regions during hand interactions.

Leveraging this insight, MagicHOI integrates the image-to-3D diffusion model into the reconstruction framework. We further refine hand poses by incorporating hand-object interaction constraints. Our results demonstrate that MagicHOI significantly outperforms existing state-of-the-art template-free reconstruction methods. We also show that image-to-3D diffusion priors effectively regularize unseen object regions, enhancing 3D hand-object reconstruction. Moreover, the improved object geometries lead to more accurate hand poses.
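The idea above, fitting the observed views while letting a 3D prior regularize unseen object regions, can be sketched as a weighted two-term objective. The snippet below is a hypothetical toy illustration only; the function name `total_loss`, the NumPy arrays standing in for geometry, and the 0.5 weight are our assumptions, not the paper's actual losses or models:

```python
import numpy as np

def total_loss(pred, observed, mask, prior, w_prior=0.5):
    """Toy two-term objective (illustrative, not the paper's formulation).

    pred:     predicted object geometry (here just a flat array of values).
    observed: image-derived supervision, valid only where mask == 1 (seen).
    mask:     1 for regions visible in the video, 0 for unseen regions.
    prior:    an estimate from a 3D prior (e.g. an image-to-3D diffusion
              model) that supervises the unseen regions (mask == 0).
    """
    seen = mask.astype(bool)
    # Data term: match the observations on visible regions.
    recon = np.mean((pred[seen] - observed[seen]) ** 2)
    # Prior term: pull unseen regions toward the diffusion-prior estimate.
    reg = np.mean((pred[~seen] - prior[~seen]) ** 2)
    return recon + w_prior * reg

# Toy usage: two seen entries match the observations exactly, and the two
# unseen entries agree with the prior, so the loss is zero.
pred = np.array([1.0, 2.0, 3.0, 4.0])
observed = np.array([1.0, 2.0, 0.0, 0.0])  # values beyond the mask are unused
mask = np.array([1, 1, 0, 0])
prior = np.array([0.0, 0.0, 3.0, 4.0])
print(total_loss(pred, observed, mask, prior))  # → 0.0
```

In practice the prior term in such pipelines is typically applied through rendered novel views rather than directly on geometry, but the structure, a data term plus a weighted prior term over unobserved regions, is the same.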

Video

Comparison to SOTA Methods

We compare our method, which integrates geometry-driven and prior-driven approaches, with the purely geometry-driven method HOLD and the purely prior-driven method EasyHOI.

Scenes: MC1, SM4, ABF12, GPMF12
Methods: Ours, HOLD, EasyHOI





In-the-wild results of our method

Scenes: Controller, Glue Gun, Osmo Pocket, Toy Plane

BibTeX

@article{xxx,
  author  = {xxx},
  title   = {xxx},
  journal = {xxx},
  year    = {xxx},
}