Sound Infusions: Audio & Cross-Modal Integration

Updated 4 February 2026
  • Sound Infusions are techniques that blend audio signals with other sensory modalities to create immersive, multi-sensory experiences using algorithmic and physical processes.
  • Modern systems employ diffusion models, latent conditioning, and cross-modal mapping to generate seamless audio morphs and spatial soundscapes.
  • Applications span VR/AR, immersive media, and culinary experiences, validated through objective metrics like FAD and subjective listener tests.

Sound Infusions are a diverse set of audio and multisensory techniques in which sound is algorithmically blended, injected, or mapped into another information channel, material, or context to create perceptually salient, often transformative, effects. The scope of “sound infusions” encompasses fusion of audio signals into each other, spatial integration of sound into visual scenes or environments, cross-modal mappings to non-auditory sensory modalities, and the injection of acoustic phenomena into physical substrates, images, video, or interactive media. Key contemporary research threads span generative asymmetric audio morphing, diffusion-based abstract sound synthesis, spatial and cross-modal soundscape embedding, real-time control of multisensory experience, and the physical infusion of sound into tangible or visual substrates, for purposes ranging from sound design to immersive media and human-computer interaction.

1. Generative Morphing and Fusion of Sound Signals

Modern sound infusions include sophisticated algorithms for morphing two or more audio sources into novel signals whose perceptual character transcends that of a simple sum. In the text-to-audio diffusion architecture Mix2Morph, a “sound infusion” is characterized as an asymmetric static morph: one sound (the primary) provides the overall temporal structure (envelope, rhythm, event onsets), while a secondary sound is woven throughout to enrich timbral and textural qualities. The morph is not a convex (weighted) combination of the inputs; it is learned by fine-tuning a latent diffusion transformer on noisy surrogate mixes, with augmentations that align dynamics (RMS) and spectrum, and with explicit prompt conditioning to encode the directionality of the infusion. Objective metrics such as latent compressibility, semantic correspondence, intermediateness, and directionality are combined with listening tests to validate the distinctiveness and perceptual coherence of the resulting morphs. Mix2Morph consistently outperforms prior morphing baselines and naive mixing, yielding infusions perceived as single emergent events rather than overlays (Chu et al., 28 Jan 2026).
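
The RMS-alignment step in this surrogate-mix construction can be made concrete with a short sketch. This is a minimal reading of the augmentation described above, not Mix2Morph's actual training code; `secondary_gain` is a hypothetical knob for how strongly the secondary sound infuses.

```python
import numpy as np

def rms(x, eps=1e-8):
    """Root-mean-square level of a mono signal."""
    return np.sqrt(np.mean(x ** 2) + eps)

def surrogate_mix(primary, secondary, secondary_gain=0.7):
    """Toy surrogate mix: align the secondary's RMS to the primary's,
    then mix at reduced gain so the primary keeps temporal dominance."""
    n = min(len(primary), len(secondary))
    p, s = primary[:n], secondary[:n]
    s = s * (rms(p) / rms(s))        # dynamics (RMS) alignment
    return p + secondary_gain * s    # asymmetric: primary sets the structure
```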

“Sound fusion” of abstract audio signals (those with no assignable real-world semantic referents) has been addressed with an unconditioned inversion model. Here, DPMSolver++-based SDE/ODE inversion formulas enable the injection of clean latent encodings of an original abstract clip into the intermediate stages of a diffusion process fine-tuned on a reference sound. By swapping in the inverted latent at a chosen timestep, the pipeline produces artifacts with genuinely novel timbral textures, confirmed via spectrograms and listener categorization as distinct “third category” events, which rules out simple superposition as an explanation (Liu et al., 13 Jun 2025).
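
The latent-swap mechanism can be illustrated with a self-contained toy. The sketch below uses a deterministic DDIM-style update with a placeholder zero noise predictor standing in for the fine-tuned network; the schedule, `swap_t`, and all names are illustrative and do not reproduce the paper's DPMSolver++ formulas.

```python
import numpy as np

T = 50
abar = np.linspace(0.999, 0.02, T)   # toy cumulative-alpha noise schedule

def eps_model(z, t):
    """Placeholder noise predictor (a network fine-tuned on the
    reference sound in the actual pipeline)."""
    return np.zeros_like(z)

def ddim_step(z, t_from, t_to):
    """Deterministic update between noise levels t_from and t_to."""
    eps = eps_model(z, t_from)
    x0 = (z - np.sqrt(1 - abar[t_from]) * eps) / np.sqrt(abar[t_from])
    return np.sqrt(abar[t_to]) * x0 + np.sqrt(1 - abar[t_to]) * eps

def invert(z_clean, until_t):
    """Run the update clean -> noisy: the 'inversion' of a source clip."""
    z = z_clean
    for t in range(until_t):
        z = ddim_step(z, t, t + 1)
    return z

def fuse(z_init, z_src_clean, swap_t):
    """Sample noisy -> clean, but inject the inverted source latent at
    swap_t; the (reference-tuned) model then denoises the injected latent,
    blending source structure with reference character."""
    z = z_init
    for t in range(T - 1, 0, -1):
        if t == swap_t:
            z = invert(z_src_clean, swap_t)   # latent injection
        z = ddim_step(z, t, t - 1)
    return z

fused = fuse(np.random.randn(16), np.random.randn(16), swap_t=25)
```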

2. Spatial and Cross-Modal Integration of Sound

Sound infusions generalize beyond pure audio morphing to include the synthesis of soundscapes whose elements are mapped and arranged in accordance with spatial or cross-modal (e.g., visual, tactile) cues. ImmerseDiffusion provides a paradigm for infusing sound objects into three-dimensional spatial audio scenes. Using a VAE-based codec for first-order ambisonics (FOA), the system supports conditioning on text, spatial, temporal, and acoustic environment parameters, allowing end-to-end sampling of FOA waveforms in accordance with user-specified or learned spatial arrangements and environmental characteristics. Quantitative metrics include direct spatial error (L1 or great-circle distance) between intended and synthesized azimuth, elevation, and distance, as well as audio–text alignment metrics like FAD and CLAP score (Heydari et al., 2024).
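
For example, the great-circle variant of the spatial error reduces to the angle between intended and synthesized unit direction vectors. A minimal sketch (function and argument names are mine, not ImmerseDiffusion's):

```python
import numpy as np

def great_circle_error(az_ref, el_ref, az_est, el_est):
    """Angle (radians) between intended and synthesized source directions,
    given azimuth/elevation in radians."""
    def unit(az, el):
        # Spherical direction -> unit vector on the sphere.
        return np.array([np.cos(el) * np.cos(az),
                         np.cos(el) * np.sin(az),
                         np.sin(el)])
    cos_angle = np.clip(unit(az_ref, el_ref) @ unit(az_est, el_est), -1.0, 1.0)
    return float(np.arccos(cos_angle))
```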

Zero-shot spatial infusions are realized in the SEE-2-SOUND pipeline, which factorizes the task into discrete stages: segmentation of visual regions-of-interest (with SAM), depth-based 3D localization (with Depth Anything), conditional audio generation (with CoDi diffusion), and physical room simulation (via multi-channel convolution). This pipeline enables the direct mapping of objects detected in images or videos to spatialized sound sources in surround output, validated by scene-grounded metrics (AViTAR, MFCC-DTW, ZCR, Chroma, Spectral) and human ratings for realism and directional accuracy (Dagli et al., 2024).
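
The final room-simulation stage amounts to multi-channel convolution. A minimal sketch, assuming per-channel room impulse responses (`rirs`) have already been obtained; the actual SEE-2-SOUND implementation details may differ:

```python
import numpy as np
from scipy.signal import fftconvolve

def spatialize(mono, rirs):
    """Render a mono source to surround output by convolving it with one
    room impulse response per output channel (`rirs`: list of 1-D arrays).
    Output is truncated to the input length for simplicity."""
    return np.stack([fftconvolve(mono, rir)[: len(mono)] for rir in rirs])
```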

3. Infusions Across Sensory Modalities and Physical Substrates

The concept of sound infusion extends into physical, cross-modal contexts. The Cymatics Cup system routes specific acoustic waveforms through a liquid medium, inducing standing-wave morphologies (Faraday and capillary waves) whose visual patterns encode taste qualities. Carefully chosen frequency and amplitude patterns create liquid-surface deformations mapped to taste profiles (“sweet,” “sour,” etc.), with significant effects on taste perception demonstrated in controlled user studies. The studies also systematically explore the visual dominance of these deformations in altering taste experience, the minimal impact of tactile vibration, and trade-offs in vessel design, rhythm, and material selection, highlighting how algorithmic sound properties translate into multi-sensory augmentation of dining experiences (Chen et al., 2024).
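
Mechanically, the drive signal is a parameterized tone routed to a transducer beneath the liquid. The sketch below uses invented frequency/amplitude pairs; the study's actual taste-to-parameter mappings are not reproduced here.

```python
import numpy as np

# Hypothetical taste -> (frequency in Hz, amplitude) mapping.
TASTE_PARAMS = {"sweet": (40.0, 0.6), "sour": (110.0, 0.9)}

def drive_signal(taste, dur=2.0, sr=48_000):
    """Sine tone whose standing-wave pattern on the liquid surface is
    intended to visually encode the given taste."""
    f, a = TASTE_PARAMS[taste]
    t = np.arange(int(dur * sr)) / sr
    return a * np.sin(2 * np.pi * f * t)
```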

4. Algorithmic and Systemic Mechanisms

Sound infusion architectures leverage a broad array of algorithmic primitives, including:

  • Diffusion-based generative modeling: The systems surveyed here all rely on diffusion models, whether for direct waveform synthesis (text-to-audio), spatial encoding (ambisonic latents), or guided deterministic inversion.
  • Latent conditioning and cross-attention: Fine-grained alignment and manipulation are achieved via cross-attention layers linking audio and non-audio modality representations (text tokens, embeddings of spatial parameters, or images), as in Auffusion and Mix2Morph (Xue et al., 2024, Chu et al., 28 Jan 2026); see the first sketch after this list.
  • Conditional separation and semantic embedding: In universal source separation, semantic embeddings from large-scale classifiers are injected to guide and improve separation quality, yielding measurable SNR gain (Tzinis et al., 2019).
  • Physical signal processing: Systems like Cymatics Cup and UltrasonicSpheres implement AM/FDM carrier-based signal chains, controlling propagation and demodulation of multiple sound streams in either physical liquids or spatial sound zones, with well-characterized beam patterns, cross-talk suppression, and perceptual SNR (Küttner et al., 3 Jun 2025); see the second sketch after this list.
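
The first sketch below shows the cross-attention primitive in its simplest single-head form; the weight matrices stand in for learned parameters, and no particular system's dimensions are assumed.

```python
import numpy as np

def cross_attention(audio_latents, cond_tokens, W_q, W_k, W_v):
    """Audio latents (queries) attend over conditioning tokens (keys/values),
    e.g., text or spatial-parameter embeddings."""
    Q = audio_latents @ W_q                       # (n_audio, d)
    K = cond_tokens @ W_k                         # (n_cond, d)
    V = cond_tokens @ W_v                         # (n_cond, d_v)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])       # scaled dot-product
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)                 # row-wise softmax
    return w @ V                                  # conditioned audio features
```

The second sketch illustrates an AM carrier chain of the kind used for parametric ultrasonic delivery: the audible signal modulates an ultrasonic carrier, and a square-law stage crudely models demodulation in air. Carrier frequency, sample rate, and modulation depth are illustrative only.

```python
import numpy as np

def am_modulate(audio, carrier_hz=40_000.0, sr=192_000, depth=0.8):
    """Amplitude-modulate audio (assumed resampled to `sr`) onto a carrier."""
    t = np.arange(len(audio)) / sr
    return (1.0 + depth * audio) * np.sin(2 * np.pi * carrier_hz * t)

def square_law_demodulate(x):
    """Square-law envelope detection: a crude model of self-demodulation."""
    return x ** 2 - np.mean(x ** 2)    # squared signal with DC removed
```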

5. Applications, Evaluation Protocols, and Experimental Validation

Sound infusions are validated with both objective and subjective protocols, in pure acoustic domains and in multi- or cross-modal deployments:

  • Objective metrics: FAD, Mel/STFT distances, latent compressibility, clustering in semantic embedding space, spatial accuracy (in FOA), and SNR for separation/morphing; a minimal FAD sketch follows this list.
  • Subjective evaluation: MOS (Mean Opinion Score), forced-choice categorization, cross-modal perception tests (taste/vision), and physical experiments (e.g., pointing error in spatial sound zones).
  • Representative use cases: Asymmetric sound design (e.g., infusing timbre or environment into a primary source), adaptive and individualized spatial soundscapes in VR/AR or public installations, multisensory augmentation in gastronomy, and localized, unobtrusive sound delivery without physical barriers.
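
As one example from the objective side, FAD is the Fréchet distance between Gaussians fitted to embedding sets of reference and generated audio (the embeddings typically come from a pretrained audio classifier). A minimal sketch:

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_audio_distance(emb_ref, emb_gen):
    """Fréchet distance between Gaussians fit to two (n, d) embedding sets."""
    mu_r, mu_g = emb_ref.mean(0), emb_gen.mean(0)
    cov_r = np.cov(emb_ref, rowvar=False)
    cov_g = np.cov(emb_gen, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):      # discard tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```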

6. Limitations and Future Perspectives

Limitations are context-dependent. In generative fusion, balancing structural preservation against genuine novelty is non-trivial and depends on the granularity and location of latent injection and on the noise schedule. Cross-modal infusions may suffer from mapping ambiguities or interaction effects (e.g., vibration dampening taste). Spatial audio synthesis is limited by the quality of object detection, geometric localization, and real-time computational constraints. Room-acoustic infusion methods presently only add reverberation; they do not handle dereverberation or time-varying acoustics (Verma et al., 2022). Physical sound infusions must trade amplitude and pattern stability against practical factors such as spillage, environmental interference, and user comfort (Chen et al., 2024, Küttner et al., 3 Jun 2025).

Across all domains, the field continues to push toward enhanced controllability, finer granularity in cross-modal alignment, real-time operation, and integration with both learned and physically-grounded models of perception and materiality.
