
Text-to-3D Scene Generation

Updated 10 February 2026
  • Text-to-3D scene generation is a process that transforms textual descriptions into spatially structured 3D scenes using advanced NLP, computer vision, and generative modeling techniques.
  • It employs diverse representations—scene graphs, implicit fields, and explicit primitives—to ensure semantic fidelity, geometric consistency, and physical plausibility.
  • Diffusion models and physics-based optimization techniques are key to achieving photorealistic detail, stable object placement, and interactive scene editing.

Text-to-3D scene generation refers to the automatic synthesis of structured, spatially coherent three-dimensional scenes directly from textual prompts. The field integrates advances in natural language processing, computer vision, and 3D generative modeling to enable applications such as virtual world creation, embodied agent simulation, film production, design, and gaming. Recent work leverages vision-language models (VLMs), large language models (LLMs), diffusion-based generative models, and scene representations such as scene graphs, 3D Gaussian splatting, radiance fields, and explicit meshes (Ruiz et al., 18 Nov 2025, Zhang et al., 2023, Zhang et al., 2023, Li et al., 18 Jul 2025, Xiong et al., 7 Apr 2025, Zhang et al., 2 Apr 2025, Chu et al., 27 Jan 2026, Yang et al., 2024, Zhou et al., 4 Feb 2025, Chen et al., 2023, Li et al., 2024, Li et al., 2024, Zhou et al., 2024, Lin et al., 26 Nov 2025, Ling et al., 5 May 2025, Chang et al., 2015, Chang et al., 2017, Zhou et al., 2024). The domain has converged on several architectural paradigms and on evaluation protocols for semantic fidelity, geometric consistency, and physical plausibility.

1. Representations for 3D Scene Synthesis

Text-to-3D scene generation relies on diverse scene representations. Early systems constructed 3D scenes by selecting assets from model databases and arranging them according to probabilistic spatial priors or rule-based templates (Chang et al., 2017, Chang et al., 2015). Modern generative systems adopt parameterized 3D representations such as scene graphs, radiance fields, 3D Gaussian splatting, implicit fields, and explicit meshes.
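As a minimal illustration of the early retrieval-and-arrange approach, the sketch below samples object positions from hand-specified Gaussian spatial priors relative to anchor objects. The prior table, object names, and all numeric values are invented for illustration; real systems learn such priors from annotated scene databases.

```python
import random

# Hypothetical spatial priors: for each (object, anchor) pair, a Gaussian
# over the (x, y, z) offset from the anchor's position. All values here are
# illustrative, not learned from data.
SPATIAL_PRIORS = {
    ("lamp", "desk"): {"mean": (0.3, 0.75, 0.0), "std": 0.05},
    ("chair", "desk"): {"mean": (0.0, 0.0, -0.6), "std": 0.10},
}

def place_object(obj, anchor, anchor_pos, rng=random):
    """Sample a position for `obj` relative to `anchor` from its prior."""
    prior = SPATIAL_PRIORS[(obj, anchor)]
    return tuple(
        a + m + rng.gauss(0.0, prior["std"])
        for a, m in zip(anchor_pos, prior["mean"])
    )

desk_pos = (0.0, 0.7, 0.0)  # desk surface roughly 0.7 m above the floor
lamp_pos = place_object("lamp", "desk", desk_pos)
chair_pos = place_object("chair", "desk", desk_pos)
```

A rule-based template would then reject or resample placements that violate hard constraints (e.g., the chair must face the desk).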

2. Text and Semantic Conditioning Schemes

A core challenge is grounding textual semantics into spatially distributed 3D entities and relations.
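A toy version of this grounding step extracts a scene graph (objects plus pairwise spatial relations) from a prompt. The regex grammar below is a simplified stand-in for the LLM- or VLM-based extractors used in practice; the relation vocabulary and function names are invented.

```python
import re

# Minimal scene-graph extraction covering only "<obj> <relation> <obj>"
# phrases. Real systems prompt an LLM/VLM for this step.
RELATIONS = ("on", "next to", "under", "above")

def extract_scene_graph(prompt):
    nodes, edges = set(), []
    pattern = r"(\w+) (" + "|".join(RELATIONS) + r") (?:the |a )?(\w+)"
    for subj, rel, obj in re.findall(pattern, prompt):
        nodes.update([subj, obj])
        edges.append((subj, rel, obj))
    return {"nodes": sorted(nodes), "edges": edges}

g = extract_scene_graph("a lamp on the desk and a rug under the desk")
```

The resulting graph `{"nodes": [...], "edges": [(subj, rel, obj), ...]}` is what downstream layout modules consume as placement constraints.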

3. Scene Layout, Geometric Reasoning, and Physical Plausibility

Determining plausible and controllable 3D object placement is central:

  • Combinatorial Layout: Particle Swarm Optimization (PSO) samples object poses that maximize CLIP text alignment (SceneWiz3D) (Zhang et al., 2023). Graph-based placement algorithms traverse scene graphs, enforcing region anchors and spatial relations (DreamScene) (Li et al., 18 Jul 2025).
  • Physics-Guided and Physically-Aware Placement: Multiple systems inject energy-based or physically plausible objectives:
    • PAT3D uses a differentiable rigid-body simulator enforcing static equilibrium and intersection-free placement, with a semantics loss enforcing alignment to the scene tree (Lin et al., 26 Nov 2025).
    • LayoutDreamer and PhiP-G introduce explicit physical energies (gravity, penetration, contact, anchor) and iterative adjustments for collision avoidance, stability, and alignment, guided by scene graphs or visual agents (Zhou et al., 4 Feb 2025, Lin et al., 26 Nov 2025).
    • Scenethesis applies signed distance field–based losses for collision elimination, object stability, and coherence (Ling et al., 5 May 2025).
  • Iterative Layout Correction: Visual and language agents analyze current scenes and recommend layout corrections for gaps, overlaps, and “floating” objects (PhiP-G, Scenethesis) (Zhou et al., 4 Feb 2025, Ling et al., 5 May 2025).
  • Adaptive Path Planning: RoamScene3D uses scene-graph reasoning to plan camera trajectories that explore salient objects, ensuring visibility and refined inpainting (Chu et al., 27 Jan 2026).
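The gravity and penetration terms used by the physics-guided systems above can be illustrated with a minimal layout energy over axis-aligned bounding boxes. The functional forms and the scene below are invented for illustration; real systems add contact, anchor, and semantic terms and optimize via differentiable simulators or iterative agent-driven adjustment.

```python
# Boxes are dicts with center "c" = (x, y, z) and half-extents "h".

def overlap_1d(c1, h1, c2, h2):
    """Length of the overlap interval along one axis (0 if disjoint)."""
    return max(0.0, (h1 + h2) - abs(c1 - c2))

def penetration_energy(a, b):
    """Product of per-axis overlaps: zero iff the two AABBs do not intersect."""
    o = [overlap_1d(a["c"][i], a["h"][i], b["c"][i], b["h"][i]) for i in range(3)]
    return o[0] * o[1] * o[2]

def gravity_energy(obj, floor_y=0.0):
    """Squared gap between the object's bottom face and its support plane."""
    gap = obj["c"][1] - obj["h"][1] - floor_y
    return gap * gap

def layout_energy(objs):
    e = sum(gravity_energy(o) for o in objs)
    for i in range(len(objs)):
        for j in range(i + 1, len(objs)):
            e += penetration_energy(objs[i], objs[j])
    return e

# A floating box penetrating its neighbor scores worse than a resolved layout.
bad = [{"c": [0.0, 1.0, 0.0], "h": [0.5, 0.5, 0.5]},
       {"c": [0.2, 0.6, 0.0], "h": [0.3, 0.3, 0.3]}]
good = [{"c": [0.0, 0.5, 0.0], "h": [0.5, 0.5, 0.5]},
        {"c": [1.2, 0.3, 0.0], "h": [0.3, 0.3, 0.3]}]
```

Minimizing such an energy (by simulation, gradient descent, or agent-proposed moves) drives layouts toward grounded, intersection-free configurations.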

4. Diffusion-Based 3D Generation and Optimization

Diffusion models and their variants underpin modern text-to-3D pipelines.
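Many of these pipelines optimize scene parameters with score distillation sampling (SDS), where a frozen diffusion denoiser supplies a gradient pulling rendered views toward the text prompt. The one-dimensional toy below captures only the structure of the update; the linear `denoiser`, the cosine/sine noise schedule, and the constant weighting w(t) = 1 are stand-ins for a real pretrained image diffusion model and a differentiable renderer.

```python
import math
import random

TARGET = 2.0  # stand-in for "what the text prompt prefers"

def denoiser(x_t, t):
    """Toy noise predictor whose score pulls noisy samples toward TARGET."""
    alpha, sigma = math.cos(t), math.sin(t)
    return (x_t - alpha * TARGET) / max(sigma, 1e-6)

def sds_grad(x, t, rng):
    """One SDS sample: diffuse the render, query the denoiser, subtract the noise."""
    eps = rng.gauss(0.0, 1.0)
    alpha, sigma = math.cos(t), math.sin(t)
    x_t = alpha * x + sigma * eps          # forward-diffused "render"
    return denoiser(x_t, t) - eps          # w(t) = 1 for simplicity

rng = random.Random(0)
x = -1.0  # scalar "scene parameter" standing in for NeRF/Gaussian parameters
for _ in range(500):
    t = rng.uniform(0.1, 1.4)              # random timestep per step, as in SDS
    x -= 0.01 * sds_grad(x, t, rng)
```

After the loop, `x` has been distilled toward the prompt-preferred value, mirroring how SDS steers a 3D representation without backpropagating through the diffusion sampler.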

5. Novel Scene Editing, Control, and Interactivity

Recent pipelines support scene editing, interactive adjustment, and fine-grained control.
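Graph-structured scenes make such edits local: replacing one node leaves objects anchored elsewhere untouched. A minimal sketch, with an invented scene-dictionary format (real systems re-run placement and rendering only for the affected subtree):

```python
# Each entry stores a position and, optionally, the node it is anchored to.
scene = {
    "desk": {"pos": (0.0, 0.7, 0.0)},
    "lamp": {"pos": (0.3, 1.45, 0.0), "anchor": "desk"},
}

def replace_object(scene, old, new):
    """Rename a node in place; children anchored to it follow automatically."""
    scene[new] = scene.pop(old)
    for obj in scene.values():
        if obj.get("anchor") == old:
            obj["anchor"] = new
    return scene

replace_object(scene, "desk", "table")
```

This locality is what lets interactive systems swap one asset without re-optimizing the whole layout.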

6. Evaluation Protocols and Comparative Metrics

Evaluation is multifaceted, using both automated quantitative indices and human studies.
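One common automated index is a CLIP-style alignment score: the average cosine similarity between the prompt's text embedding and embeddings of rendered views. The sketch below assumes the embeddings are already computed; in practice both come from a pretrained CLIP encoder, and the toy vectors here are invented.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def clip_score(text_emb, view_embs):
    """Mean text-to-view cosine similarity over rendered viewpoints."""
    return sum(cosine(text_emb, v) for v in view_embs) / len(view_embs)

text = [1.0, 0.0, 0.5]                      # toy prompt embedding
views = [[0.9, 0.1, 0.4], [1.1, -0.1, 0.6]]  # toy embeddings of two renders
score = clip_score(text, views)
```

Averaging over many viewpoints penalizes scenes that match the prompt from one angle but degrade elsewhere, which single-view scores miss.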

7. Limitations and Future Directions

Several limitations remain in current systems.

Future research directions include integrating stronger LLMs for richer, more fine-grained scene-graph extraction (Chu et al., 27 Jan 2026, Li et al., 18 Jul 2025), end-to-end networks spanning scene planning through generation, extension to dynamic and outdoor environments, richer material and lighting modeling, and fusion with embodied-agent and robotics pipelines (Lin et al., 26 Nov 2025, Ling et al., 5 May 2025, Ruiz et al., 18 Nov 2025, Zhou et al., 2024).


