
ZONE: Zero-Shot Instruction-Guided Local Editing

Published 28 Dec 2023 in cs.CV (arXiv:2312.16794v2)

Abstract: Recent advances in vision-language models like Stable Diffusion have shown remarkable power in creative image synthesis and editing. However, most existing text-to-image editing methods encounter two obstacles: first, the text prompt must be carefully crafted to achieve good results, which is neither intuitive nor user-friendly; second, they are insensitive to local edits and can irreversibly affect non-edited regions, leaving obvious editing traces. To tackle these problems, we propose a Zero-shot instructiON-guided local image Editing approach, termed ZONE. We first convert the editing intent from the user-provided instruction (e.g., "make his tie blue") into specific image editing regions through InstructPix2Pix. We then propose a Region-IoU scheme for precise image layer extraction from an off-the-shelf segmentation model. We further develop an FFT-based edge smoother for seamless blending between the layer and the image. Our method allows arbitrary manipulation of a specific region with a single instruction while preserving the rest of the image. Extensive experiments demonstrate that ZONE achieves remarkable local editing results and user-friendliness, outperforming state-of-the-art methods. Code is available at https://github.com/lsl001006/ZONE.
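The post-processing steps the abstract describes — scoring candidate segmentation masks by IoU against the edited region, softening the selected mask's edges in the frequency domain, and compositing the edited layer back onto the original — can be sketched as below. This is a minimal NumPy illustration, not the paper's implementation; the function names and the `cutoff` parameter are assumptions for the sketch.

```python
import numpy as np

def region_iou(a: np.ndarray, b: np.ndarray) -> float:
    """IoU between two binary masks (illustrative stand-in for the
    paper's Region-IoU scheme used to pick a segment mask)."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / float(union) if union else 0.0

def fft_smooth_mask(mask: np.ndarray, cutoff: float = 0.25) -> np.ndarray:
    """Soften mask edges by low-pass filtering in the frequency domain.
    `cutoff` (an assumed parameter) is the fraction of the spectrum
    radius that is kept; smaller values give softer edges."""
    h, w = mask.shape
    spectrum = np.fft.fftshift(np.fft.fft2(mask.astype(np.float64)))
    # Circular low-pass filter centred on the DC component.
    yy, xx = np.ogrid[:h, :w]
    radius = np.hypot(yy - h / 2.0, xx - w / 2.0)
    lowpass = radius <= cutoff * min(h, w) / 2.0
    smoothed = np.fft.ifft2(np.fft.ifftshift(spectrum * lowpass)).real
    return np.clip(smoothed, 0.0, 1.0)

def blend(layer: np.ndarray, image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Alpha-blend the edited layer onto the original image using the
    softened mask, so the seam between edited and untouched regions fades."""
    alpha = fft_smooth_mask(mask)[..., None]
    return alpha * layer + (1.0 - alpha) * image
```

In this reading, `region_iou` would rank masks proposed by the segmentation model against the region InstructPix2Pix actually changed, and `blend` composites only that layer, leaving the rest of the image untouched.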
