MagicDrive: Street View Generation with Diverse 3D Geometry Control
Abstract: Recent advancements in diffusion models have significantly enhanced data synthesis with 2D control. Yet precise 3D control in street view generation, crucial for 3D perception tasks, remains elusive. Specifically, using Bird's-Eye View (BEV) as the primary condition often makes geometry control (e.g., height) difficult, which affects the representation of object shapes, occlusion patterns, and road surface elevations, all of which are essential to perception data synthesis, especially for 3D object detection. In this paper, we introduce MagicDrive, a novel street view generation framework that offers diverse 3D geometry controls, including camera poses, road maps, and 3D bounding boxes, together with textual descriptions, achieved through tailored encoding strategies. In addition, our design incorporates a cross-view attention module that ensures consistency across multiple camera views. With MagicDrive, we achieve high-fidelity street-view image and video synthesis that captures nuanced 3D geometry and various scene descriptions, enhancing tasks such as BEV segmentation and 3D object detection.
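To make the height limitation of a BEV-only condition concrete, here is a minimal sketch. It is not taken from the paper: the pinhole intrinsics, the camera-frame convention (x right, y down, z forward), the box sizes, and all helper names are illustrative assumptions. It builds two boxes with identical BEV footprints but different heights and shows that they cover different regions once projected into a camera image.

```python
from itertools import product


def box_corners(cx, cy, cz, length, width, height):
    """Eight corners of an axis-aligned 3D box centered at (cx, cy, cz).

    Assumed camera frame: x right, y down, z forward, units in meters.
    """
    return [
        (cx + sx * width / 2, cy + sy * height / 2, cz + sz * length / 2)
        for sx, sy, sz in product((-1, 1), repeat=3)
    ]


def project(point, fx=1000.0, fy=1000.0, u0=800.0, v0=450.0):
    """Pinhole projection with hypothetical intrinsics (pixels)."""
    x, y, z = point
    return (fx * x / z + u0, fy * y / z + v0)


def image_extent(corners):
    """2D bounding rectangle (u_min, v_min, u_max, v_max) of the projected box."""
    pixels = [project(p) for p in corners]
    us = [u for u, _ in pixels]
    vs = [v for _, v in pixels]
    return (min(us), min(vs), max(us), max(vs))


def bev_footprint(corners):
    """Top-down rectangle (x_min, z_min, x_max, z_max); the height axis is dropped."""
    xs = [x for x, _, _ in corners]
    zs = [z for _, _, z in corners]
    return (min(xs), min(zs), max(xs), max(zs))


GROUND_Y = 1.5  # assumed: camera mounted 1.5 m above the road, y points down


def grounded_box(height):
    """Box 20 m ahead, 4 m long, 2 m wide, resting on the ground plane."""
    return box_corners(0.0, GROUND_Y - height / 2, 20.0, 4.0, 2.0, height)


car = grounded_box(1.5)    # low vehicle
truck = grounded_box(3.5)  # tall vehicle with the same footprint

print("same BEV footprint:", bev_footprint(car) == bev_footprint(truck))
print("car image extent:  ", image_extent(car))
print("truck image extent:", image_extent(truck))
```

Under these assumptions the two footprints are indistinguishable from above, while the taller box reaches noticeably higher in the image. This is the kind of information that an explicit 3D bounding box condition carries and a BEV map alone does not.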