- The paper introduces a stepwise framework that integrates human expertise with multimodal diffusion models to produce realistic and compliant urban design layouts.
- It leverages ControlNet to translate textual prompts and spatial constraints into structured road networks, building footprints, and detailed renderings.
- Experiments in NYC and Chicago demonstrate improved visual fidelity, instruction compliance, and design diversity compared to baseline models.
Generative AI for Urban Design: A Stepwise Approach Integrating Human Expertise with Multimodal Diffusion Models
This essay provides an expert summary of the research paper "Generative AI for Urban Design: A Stepwise Approach Integrating Human Expertise with Multimodal Diffusion Models," which presents a novel framework for integrating human expertise into the urban design process using multimodal diffusion models.
Introduction
Urban design plays a critical role in structuring the spatial organization and physical form of urban areas. Given the complexity of urban planning, with its many stakeholders and site-specific constraints, generative artificial intelligence (GenAI) offers promising means to improve design efficiency and collaboration. Existing methods, however, typically follow end-to-end pipelines, which preclude the iterative control that practical urban design workflows require. In response, this paper introduces a stepwise framework that integrates multimodal diffusion models with human expertise. The approach decomposes the design process into three sequential stages: road network and land use planning, building layout planning, and detailed planning and rendering. Each stage uses a multimodal diffusion model guided by human textual prompts and image-based constraints, allowing iterative design refinement.
Proposed Framework
The study introduces a human-in-the-loop, stepwise generative framework. This departure from one-pass design generation allows human intervention at crucial stages, reflecting the non-linear, iterative nature of real-world urban planning. Human expertise enters through structured textual prompts that guide the diffusion model in generating design diagrams, and review and refinement by human designers at each stage are integral to the process.
Figure 1: Framework Overview
The framework builds on ControlNet, an extension of Stable Diffusion for guided image synthesis: text supplies the planning guidance, while images encode the spatial constraints. The urban design process is divided into three stages (a minimal code sketch follows the list):
- Road Network and Land Use Planning: Combining site constraints (e.g., water bodies, railways) and human prompts to produce detailed network and land use layouts.
- Building Layout Planning: Generating building footprints and height distributions from the previous stage's outputs, while accommodating city-specific regulations and aesthetic preferences for design realism.
- Detailed Planning and Rendering: Producing a comprehensive satellite-style diagram showcasing an intuitive view of the complete urban schema.
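To make the stage mechanics concrete, here is a minimal sketch of the first stage using the Hugging Face diffusers API. The checkpoint name, constraint file, and prompt are hypothetical placeholders rather than the paper's artifacts; the paper fine-tunes its own ControlNet weights.

```python
# Minimal sketch of the first stage (road network and land use planning).
# "urban-roadnet-controlnet" is a hypothetical fine-tuned checkpoint;
# the prompt and constraint file are illustrative placeholders.
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from PIL import Image

controlnet = ControlNetModel.from_pretrained(
    "urban-roadnet-controlnet", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# Image input: site constraints (water bodies, railways) rasterized to RGB.
constraints = Image.open("site_constraints.png").convert("RGB")

# Text input: the human planner's guidance for this stage.
prompt = "dense grid road network, mixed residential and commercial land use"

layout = pipe(prompt, image=constraints, num_inference_steps=30).images[0]
layout.save("stage1_road_landuse.png")  # reviewed, then fed to stage two
```

Stages two and three would chain analogously, each taking the reviewed output of the previous stage as its conditioning image.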
Training pairs multimodal inputs, a conditioning image and a textual prompt, with target design diagrams, so that each ControlNet stage learns to produce structured layouts that respect both the road network and the specific site constraints.
Figure 2: Image Construction and Metric Computation
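Figure 2 covers how conditioning images are constructed; since the summary does not spell out an exact pixel encoding, the channel scheme below is purely an assumption for illustration.

```python
# Sketch of rasterizing site constraints into an RGB conditioning image.
# The blue/red channel assignment is an assumed encoding, not the paper's.
import numpy as np
from PIL import Image

def build_condition_image(water_mask: np.ndarray, rail_mask: np.ndarray) -> Image.Image:
    """Combine boolean constraint masks of shape (H, W) into one RGB image."""
    h, w = water_mask.shape
    canvas = np.zeros((h, w, 3), dtype=np.uint8)
    canvas[water_mask] = (0, 0, 255)  # water bodies in blue
    canvas[rail_mask] = (255, 0, 0)   # railways in red
    return Image.fromarray(canvas)

# Toy example: a river along the left edge, a rail line across the middle.
water = np.zeros((512, 512), dtype=bool)
water[:, :64] = True
rail = np.zeros((512, 512), dtype=bool)
rail[254:258, :] = True
build_condition_image(water, rail).save("site_constraints.png")
```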
Results and Quantitative Evaluation
In experiments on data from New York City (NYC) and Chicago, the stepwise framework outperformed GAN-based baselines (Pix2Pix and an enhanced variant) in visual fidelity, instruction compliance, and design diversity.
Visual Fidelity
Evaluation using Fréchet Inception Distance (FID) showed that ControlNet surpassed the GAN-based baselines by a significant margin, producing high-fidelity images with sharp edges and well-structured layouts.
Figure 3: Fidelity comparison of generated images
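For readers who want to reproduce this kind of comparison, a minimal FID computation with torchmetrics might look like the following; the random tensors stand in for batches of ground-truth and generated layout images.

```python
# Sketch of an FID comparison; random tensors stand in for image batches.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(feature=2048)

# Batches of uint8 RGB images, shape (N, 3, H, W).
real_batch = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)
fake_batch = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)

fid.update(real_batch, real=True)   # ground-truth layouts
fid.update(fake_batch, real=False)  # generated layouts
print(f"FID: {fid.compute():.2f}")  # lower is better
```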
Instruction Compliance
ControlNet demonstrated a superior ability to adhere to human-specified urban planning directives, achieving high R² scores for key metrics including road density, land use composition, and building height distribution.
Figure 4: Results of Road Network and Land Use Planning Stage
Figure 5: Results of Building Layout Planning Stage
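Compliance of this kind can be scored by comparing the metric value requested in the prompt against the value measured in the generated layout; a minimal sketch with scikit-learn, using made-up numbers, is shown below.

```python
# Sketch of an instruction-compliance check via R^2; values are made up.
from sklearn.metrics import r2_score

# Road density requested in the prompts vs. measured in generated layouts.
instructed = [0.12, 0.18, 0.25, 0.31, 0.40]
measured = [0.11, 0.19, 0.24, 0.33, 0.38]

print(f"road-density R^2: {r2_score(instructed, measured):.3f}")
```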
Stepwise vs. End-to-End Approach
Comparison with an end-to-end approach, in which the complete urban design is generated from site constraints in a single step, further validated the framework's effectiveness. The stepwise approach performed significantly better on image fidelity and compliance metrics, as its intermediate outputs provide evaluation points that guide the design process.
Figure 6: Comparison of generated images using stepwise and end-to-end frameworks
Design Diversity
ControlNet reliably produces a diverse array of realistic urban designs that adhere to the specified constraints, helping urban planners explore multiple candidate layouts. This flexibility in generating alternative configurations is valuable in real-world planning, which often demands exploring a wide range of scenarios.
Figure 7: Diversity of Generated Images
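Generating such alternatives is operationally simple: hold the prompt and the constraint image fixed and vary the random seed. A sketch, reusing the hypothetical pipe and constraints from the stage-one example:

```python
# Sketch of sampling design alternatives: same prompt and constraints,
# different seeds. `pipe` and `constraints` come from the stage-one sketch.
import torch

prompt = "dense grid road network, mixed residential and commercial land use"
for seed in range(4):
    generator = torch.Generator(device="cuda").manual_seed(seed)
    alternative = pipe(prompt, image=constraints, generator=generator).images[0]
    alternative.save(f"alternative_{seed}.png")
```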
Urban Transferability
Experiments transferring design styles across urban contexts, such as using a model trained on Chicago to generate plans for New York City, highlighted ControlNet’s potential to extend design applications across different cities while aligning with local urban design principles.
Figure 8: Urban Transferability Results
Conclusion
The study establishes the efficacy of a human-in-the-loop, stepwise generative framework for urban design. Integrating the capabilities of multimodal diffusion models and human expertise effectively addresses the shortcomings of existing GenAI frameworks by improving both the realism and compliance of generated urban designs, and by aligning more closely with urban planning processes. The framework's strong results across multiple metrics suggest its utility for real-world urban design applications, allowing for a more flexible, interactive, and context-sensitive approach while also bolstering public engagement through realistic and diverse visual outputs. Future work should aim to further explore cross-city applications and the incorporation of broader socio-economic and cultural design elements into the model. This advancement may serve as a foundation for leveraging GenAI technology in collaborative urban design tasks that demand both computational efficiency and human creativity. The potential inclusion of participatory stakeholder models and qualitative urban planning principles could greatly enhance the proposed framework's applicability to urban renewal contexts.