Most RGB-based hand-object reconstruction methods rely on object templates, while template-free approaches typically assume full object visibility.
This assumption, however, often fails in real-world scenarios where the camera is fixed and the object is held in a static grip, leaving parts of the object unobserved and producing unrealistic reconstructions.
To address this challenge, we introduce MagicHOI, a method for reconstructing hands and objects from short monocular interaction videos, even under limited view variation.
Our key insight is that, although paired 3D hand-object data is extremely scarce, large-scale diffusion models, such as image-to-3D models, provide abundant object supervision.
This additional supervision serves as a prior that regularizes unseen object regions during hand interaction.
Leveraging this insight, MagicHOI integrates an image-to-3D diffusion model into its reconstruction framework.
We further refine hand poses by incorporating hand-object interaction constraints.
Our results demonstrate that MagicHOI significantly outperforms state-of-the-art template-free reconstruction methods.
We also show that image-to-3D diffusion priors effectively regularize unseen object regions, enhancing 3D hand-object reconstruction.
Moreover, the improved object geometry yields more accurate hand poses.