Hyper-VolTran: Fast and Generalizable One-Shot Image to 3D Object Structure via HyperNetworks
Abstract: Solving image-to-3D from a single view is an ill-posed problem, and current neural reconstruction methods that address it through diffusion models still rely on scene-specific optimization, constraining their generalization capability. To overcome the limitations of existing approaches regarding generalization and consistency, we introduce a novel neural rendering technique. Our approach employs the signed distance function (SDF) as the surface representation and incorporates generalizable priors through geometry-encoding volumes and HyperNetworks. Specifically, our method builds neural encoding volumes from generated multi-view inputs. At test time, we adjust the weights of the SDF network conditioned on an input image, allowing the model to adapt to novel scenes in a feed-forward manner via HyperNetworks. To mitigate artifacts arising from the synthesized views, we propose a volume transformer module that improves the aggregation of image features instead of processing each viewpoint separately. Through our proposed method, dubbed Hyper-VolTran, we avoid the bottleneck of scene-specific optimization and maintain consistency across the images generated from multiple viewpoints. Our experiments demonstrate the advantages of our approach, with consistent results and rapid generation.
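The core idea of feed-forward adaptation via a HyperNetwork can be sketched as follows: a network maps an image-conditioned embedding to the weight tensors of a small SDF MLP, so a novel scene is handled by a single forward pass rather than per-scene optimization. This is a minimal NumPy sketch, not the paper's implementation; all dimensions, names, and the random initialization are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

EMB = 32   # assumed size of the image-conditioned embedding
HID = 16   # assumed hidden width of the generated SDF MLP
IN = 3     # 3D query point (x, y, z)

# One flat hypernetwork output covering every target weight tensor
# of the two-layer SDF MLP (layer 1: IN->HID, layer 2: HID->1).
n_w1, n_b1 = IN * HID, HID
n_w2, n_b2 = HID * 1, 1
n_out = n_w1 + n_b1 + n_w2 + n_b2

# Hypernetwork parameters (a single linear map here, for brevity).
H = rng.normal(0.0, 0.1, size=(EMB, n_out))

def hyper_sdf_weights(z):
    """Predict the SDF MLP's weights from an image embedding z (feed-forward)."""
    theta = z @ H
    w1 = theta[:n_w1].reshape(IN, HID)
    b1 = theta[n_w1:n_w1 + n_b1]
    w2 = theta[n_w1 + n_b1:n_w1 + n_b1 + n_w2].reshape(HID, 1)
    b2 = theta[-n_b2:]
    return w1, b1, w2, b2

def sdf(points, weights):
    """Evaluate the generated SDF MLP at query points of shape (N, 3) -> (N,)."""
    w1, b1, w2, b2 = weights
    h = np.tanh(points @ w1 + b1)
    return (h @ w2 + b2).ravel()

z = rng.normal(size=EMB)            # embedding of one input image
weights = hyper_sdf_weights(z)      # scene-specific weights, no optimization loop
pts = rng.normal(size=(5, 3))
d = sdf(pts, weights)               # one signed distance per query point
```

In the paper's full pipeline, the embedding would come from the input image and the geometry-encoding volumes, and the generated SDF network would be rendered via volume rendering; the sketch only shows the weight-generation mechanism that replaces scene-specific optimization.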