Hyper-VolTran: Fast and Generalizable One-Shot Image to 3D Object Structure via HyperNetworks

Published 24 Dec 2023 in cs.CV (arXiv:2312.16218v2)

Abstract: Solving image-to-3D from a single view is an ill-posed problem, and current neural reconstruction methods addressing it through diffusion models still rely on scene-specific optimization, constraining their generalization capability. To overcome the limitations of existing approaches regarding generalization and consistency, we introduce a novel neural rendering technique. Our approach employs the signed distance function as the surface representation and incorporates generalizable priors through geometry-encoding volumes and HyperNetworks. Specifically, our method builds neural encoding volumes from generated multi-view inputs. We adjust the weights of the SDF network conditioned on an input image at test-time to allow model adaptation to novel scenes in a feed-forward manner via HyperNetworks. To mitigate artifacts derived from the synthesized views, we propose the use of a volume transformer module to improve the aggregation of image features instead of processing each viewpoint separately. Through our proposed method, dubbed as Hyper-VolTran, we avoid the bottleneck of scene-specific optimization and maintain consistency across the images generated from multiple viewpoints. Our experiments show the advantages of our proposed approach with consistent results and rapid generation.


Summary

  • The paper introduces Hyper-VolTran, a method that uses HyperNetworks to adapt SDF weights on the fly, eliminating the need for scene-specific optimization.
  • It employs a volume transformer module that aggregates image features across synthesized views jointly, rather than per view, ensuring consistent 3D geometry reconstruction from a single input image.
  • Experimental results show that Hyper-VolTran rapidly generates high-quality 3D meshes, outperforming conventional iterative methods.

Introduction

Image-to-3D reconstruction from a single viewpoint is a challenging, ill-posed problem in neural rendering. Despite substantial progress, most existing methods depend on scene-specific optimization, which prevents them from being applied to novel scenes quickly and consistently. To overcome these constraints, the paper introduces a method designed for generalization and efficiency.

Neural Rendering Technique

The technique leverages Signed Distance Functions (SDFs) as the surface representation for neural rendering. What sets the method apart is its combination of geometry-encoding volumes with HyperNetworks: the HyperNetworks adapt the SDF network's weights on the fly, conditioned on the input image at test time. This dynamic adjustment removes the need for time-consuming scene-specific optimization and enables feed-forward adaptation to novel scenes.
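As a rough illustration of the weight-prediction idea, the sketch below uses NumPy to show a toy hypernetwork that maps an image embedding to the weights of a single SDF-MLP layer in one forward pass. All dimensions, function names, and the single-layer setup are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions are illustrative, not the paper's actual sizes.
EMBED_DIM = 32    # image-conditioned feature from an encoder (assumed)
HIDDEN = 16       # hidden width of the toy SDF MLP
IN_DIM = 3        # 3D query point (x, y, z)

# Hypernetwork: a single linear map from the image embedding to the
# flattened weights of one SDF layer (a toy stand-in for learned
# per-layer hypernetworks).
n_weights = IN_DIM * HIDDEN + HIDDEN          # W1 and b1 of the SDF MLP
H = rng.normal(0, 0.1, size=(EMBED_DIM, n_weights))

def predict_sdf_layer(image_embedding):
    """Map an image embedding to SDF-layer weights in one forward pass."""
    flat = image_embedding @ H
    W1 = flat[: IN_DIM * HIDDEN].reshape(IN_DIM, HIDDEN)
    b1 = flat[IN_DIM * HIDDEN:]
    return W1, b1

def sdf(points, W1, b1, W2, b2):
    """Tiny SDF MLP: 3D points -> signed distance values."""
    h = np.tanh(points @ W1 + b1)
    return h @ W2 + b2

# Shared output head (not generated by the hypernetwork in this sketch).
W2 = rng.normal(0, 0.1, size=(HIDDEN, 1))
b2 = np.zeros(1)

embedding = rng.normal(size=EMBED_DIM)        # stand-in image feature
W1, b1 = predict_sdf_layer(embedding)
d = sdf(rng.normal(size=(5, 3)), W1, b1, W2, b2)
print(d.shape)  # (5, 1): one signed distance per query point
```

The point of the sketch is the control flow: the SDF weights are an output of a forward pass conditioned on the image, not the result of per-scene gradient descent.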

To mitigate the inconsistencies and artifacts that often arise from synthesized views, a volume transformer module, named VolTran, is introduced. Rather than processing each viewpoint separately, VolTran aggregates image features with global attention across views, which is crucial given the method's reliance on synthesized views for multi-perspective consistency.
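A minimal sketch of attention-based aggregation across views is given below. This is not the paper's actual VolTran module; the feature dimensions, the plain dot-product attention, and the mean-pooling step are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def aggregate_views(view_feats):
    """Self-attention over the view axis, then mean-pool.

    view_feats: (n_views, dim) features for one 3D sample point,
    one row per synthesized view.
    """
    d = view_feats.shape[-1]
    scores = view_feats @ view_feats.T / np.sqrt(d)  # (V, V) view-to-view affinities
    attn = softmax(scores, axis=-1)                  # each row sums to 1
    mixed = attn @ view_feats                        # each view attends to all others
    return mixed.mean(axis=0)                        # (dim,) fused feature

rng = np.random.default_rng(1)
feats = rng.normal(size=(6, 8))   # 6 synthesized views, 8-dim features
fused = aggregate_views(feats)
print(fused.shape)  # (8,)
```

Because every view attends to every other view, an artifact confined to one synthesized view is down-weighted by the consensus of the others, which is the intuition behind aggregating jointly rather than per viewpoint.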

Training and Results

The method first builds neural encoding volumes from generated multi-view inputs, using the synthesized images to guide the production of 3D geometry representations. A trained HyperNetwork then computes SDF weights from the features of the input image, streamlining application to unseen scenes. Experiments show that Hyper-VolTran generates results rapidly and consistently, outperforming baseline models both qualitatively and quantitatively.
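The overall flow described above can be sketched as a chain of stages, with each stage replaced by a trivial stand-in. The view synthesizer, volume builder, and hypernetwork below are placeholders for illustration only, not the paper's components.

```python
import numpy as np

rng = np.random.default_rng(2)

def synthesize_views(image, n_views=6):
    # Stand-in for a multi-view diffusion model: perturbed copies.
    return [image + rng.normal(0, 0.05, image.shape) for _ in range(n_views)]

def build_encoding_volume(views):
    # A real method would unproject 2D features into a shared 3D grid;
    # here we just stack and average as a placeholder.
    return np.stack(views).mean(axis=0)

def hypernetwork(image):
    # Stand-in for a trained hypernetwork: image -> flat SDF weights,
    # computed in one forward pass.
    return rng.normal(0, 0.1, size=128)

def reconstruct(image):
    views = synthesize_views(image)
    volume = build_encoding_volume(views)
    sdf_weights = hypernetwork(image)  # no per-scene optimization loop
    return volume, sdf_weights

image = rng.normal(size=(16, 16))
volume, weights = reconstruct(image)
print(volume.shape, weights.shape)  # (16, 16) (128,)
```

The structural point is that `reconstruct` is a single feed-forward call per input image, in contrast to methods that run an optimization loop for every new scene.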

Implications and Future Work

The proposed model marks a significant step toward efficient construction of 3D meshes from single images. The system is particularly fast, requiring only a single forward pass at test time, compared with the extensive optimization existing models demand. The results demonstrate generalization across a range of objects and suggest broad applicability within computer vision and related neural rendering tasks. Moving forward, this approach opens new avenues for 3D reconstruction in real-world scenarios.
