Images that Sound: Composing Images and Sounds on a Single Canvas
Abstract: Spectrograms are 2D representations of sound that look very different from the images found in our visual world. And natural images, when played as spectrograms, make unnatural sounds. In this paper, we show that it is possible to synthesize spectrograms that simultaneously look like natural images and sound like natural audio. We call these visual spectrograms images that sound. Our approach is simple and zero-shot, and it leverages pre-trained text-to-image and text-to-spectrogram diffusion models that operate in a shared latent space. During the reverse process, we denoise noisy latents with both the audio and image diffusion models in parallel, resulting in a sample that is likely under both models. Through quantitative evaluations and perceptual studies, we find that our method successfully generates spectrograms that align with a desired audio prompt while also taking on the visual appearance of a desired image prompt. Please see our project page for video results: https://ificl.github.io/images-that-sound/
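The core idea in the abstract, denoising a single latent with two diffusion models in parallel so the final sample is likely under both, can be sketched as follows. This is a hypothetical illustration, not the authors' exact update rule: it combines the two models' noise estimates with a weight `w` inside a standard DDIM (eta=0) step, and the `eps_image`/`eps_audio` callables stand in for the pre-trained image and spectrogram denoisers.

```python
import numpy as np


def denoise_parallel(z_T, eps_image, eps_audio, alpha_bars, w=0.5):
    """Hedged sketch of parallel denoising on a shared latent.

    z_T        : initial noisy latent (array)
    eps_image  : placeholder for the image model's noise prediction, eps(z, t)
    eps_audio  : placeholder for the audio model's noise prediction, eps(z, t)
    alpha_bars : cumulative noise schedule, decreasing in t
    w          : weight balancing the two models' estimates (assumed, not
                 taken from the paper)
    """
    z = z_T
    for t in reversed(range(len(alpha_bars))):
        a_t = alpha_bars[t]
        a_prev = alpha_bars[t - 1] if t > 0 else 1.0
        # Combine the two noise estimates so the step moves toward a latent
        # that is plausible under both models.
        eps = w * eps_image(z, t) + (1.0 - w) * eps_audio(z, t)
        # Deterministic DDIM update (eta = 0).
        z0 = (z - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
        z = np.sqrt(a_prev) * z0 + np.sqrt(1.0 - a_prev) * eps
    return z
```

Because the latent space is shared between the text-to-image and text-to-spectrogram models, the same decoded latent can be rendered either as an image or converted to a waveform.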