
MagicDrive: Street View Generation with Diverse 3D Geometry Control

Published 4 Oct 2023 in cs.CV and cs.AI | (2310.02601v7)

Abstract: Recent advances in diffusion models have significantly improved data synthesis with 2D control. Yet precise 3D control in street view generation, which is crucial for 3D perception tasks, remains elusive. Specifically, using Bird's-Eye View (BEV) as the primary condition often makes geometry control (e.g., height) difficult, which affects the representation of object shapes, occlusion patterns, and road surface elevations, all of which are essential to perception data synthesis, especially for 3D object detection tasks. In this paper, we introduce MagicDrive, a novel street view generation framework that offers diverse 3D geometry controls, including camera poses, road maps, and 3D bounding boxes, together with textual descriptions, achieved through tailored encoding strategies. In addition, our design incorporates a cross-view attention module that ensures consistency across multiple camera views. With MagicDrive, we achieve high-fidelity street-view image and video synthesis that captures nuanced 3D geometry and various scene descriptions, enhancing tasks such as BEV segmentation and 3D object detection.
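
The abstract names a cross-view attention module but does not specify its design. As a rough illustration only, the sketch below shows one way such a module could work, assuming the cameras form a ring and each view attends to the tokens of its left and right neighbours. The function name `cross_view_attention`, the neighbour-only layout, the residual connection, and all shapes and weights are assumptions made for this example, not details taken from the paper.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_view_attention(feats, w_q, w_k, w_v):
    """Hypothetical cross-view attention over a ring of cameras.

    feats: array of shape (V, N, D) with N tokens of width D per view.
    Each view forms queries from its own tokens and keys/values from
    the tokens of its two neighbouring views (an assumption).
    """
    num_views, _, dim = feats.shape
    out = np.empty_like(feats)
    for i in range(num_views):
        left = feats[(i - 1) % num_views]
        right = feats[(i + 1) % num_views]
        context = np.concatenate([left, right], axis=0)  # (2N, D)
        q = feats[i] @ w_q                               # (N, D)
        k = context @ w_k                                # (2N, D)
        v = context @ w_v                                # (2N, D)
        attn = softmax(q @ k.T / np.sqrt(dim), axis=-1)  # (N, 2N)
        out[i] = feats[i] + attn @ v                     # residual update
    return out


rng = np.random.default_rng(0)
V, N, D = 6, 4, 8  # e.g. six surround cameras; sizes are arbitrary
feats = rng.standard_normal((V, N, D))
w_q, w_k, w_v = (rng.standard_normal((D, D)) * 0.1 for _ in range(3))
fused = cross_view_attention(feats, w_q, w_k, w_v)
print(fused.shape)  # (6, 4, 8)
```

Because the update is residual, setting the value projection to zero leaves the per-view features unchanged, which makes it easy to check that the module only adds information from neighbouring views.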
