
Diffusion-Based Gaze Synthesis

Updated 30 January 2026
  • Diffusion-based gaze synthesis is a generative approach that uses denoising diffusion models to create temporally coherent and perceptually realistic gaze trajectories and images.
  • It leverages iterative stochastic denoising with specialized conditioning (e.g., image features, user embeddings) for subject-specific synthesis across diverse applications.
  • Empirical evaluations demonstrate robust performance through metrics such as DTW, MAE, and FID while ensuring privacy-preserving, controllable synthesis in AR/VR and medical imaging.

Diffusion-based gaze synthesis refers to a family of generative frameworks that employ denoising diffusion probabilistic models (DDPMs) to synthesize, impute, manipulate, or redirect human gaze time-series or gaze-conditioned imagery. These models leverage the iterative stochastic inversion of a forward noising process to generate complex, temporally coherent, and perceptually realistic gaze trajectories or images. Diffusion approaches have demonstrated state-of-the-art fidelity and diversity in continuous gaze sequence generation, subject-aware bio-signal synthesis, scanpath simulation, image-conditioned gaze vector synthesis, 3D gaze redirection, imputation under missing data, and the incorporation of gaze as a semantic control for downstream visual synthesis tasks.

1. Mathematical Foundations of Diffusion-Based Gaze Synthesis

Diffusion-based gaze synthesis methods operate by defining a forward noising process $q(x_t \mid x_{t-1}) = \mathcal{N}\left(\sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right)$ for $t = 1, \dots, T$ on the clean gaze trajectory $x_0$ (which may be a sequence of gaze vectors, velocities, or higher-order temporally organized gaze features) (Jiao et al., 2024, Jiao et al., 2024, Hasan et al., 13 Nov 2025, Cartella et al., 30 Jul 2025, Jiao et al., 4 Nov 2025). The process transforms $x_0$ into an isotropic Gaussian latent $x_T$ by repeated application of stochastic Gaussian noise. The reverse denoising process, parameterized by a neural model $\epsilon_\theta$, iteratively reconstructs a synthetic gaze sequence by sampling from a series of conditional Gaussians of the form

$$p_\theta(x_{t-1} \mid x_t, c) = \mathcal{N}\left(\mu_\theta(x_t, t, c),\ \sigma_t^2 I\right)$$

with explicit dependence on the conditioning variable $c$ (such as user embeddings, image features, or auxiliary sensor data). The objective minimized during training is an MSE (simplified score-matching) loss between the true and predicted noise:

$$\mathcal{L} = \mathbb{E}_{x_0,\, t,\, \epsilon}\left\|\epsilon - \epsilon_\theta(x_t, t, c)\right\|_2^2$$

where $x_t$ is constructed as $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon$.
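
As a concrete toy illustration of these definitions, the following sketch implements the closed-form forward jump and the simplified noise-prediction loss in NumPy. The linear $\beta$ schedule and the placeholder denoiser are illustrative assumptions, not the configuration of any specific model in the literature.

```python
import numpy as np

# Toy linear noise schedule (an assumption; real models tune this).
T = 1000
betas = np.linspace(1e-4, 0.02, T)   # beta_t
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)       # \bar{alpha}_t = prod_{s<=t} alpha_s

def q_sample(x0, t, eps):
    """Closed-form forward jump: x_t = sqrt(abar_t) x0 + sqrt(1 - abar_t) eps."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

def training_loss(denoiser, x0, t, cond, rng):
    """Simplified score-matching MSE between the true and predicted noise."""
    eps = rng.standard_normal(x0.shape)
    x_t = q_sample(x0, t, eps)
    eps_hat = denoiser(x_t, t, cond)   # real models: transformer / U-Net
    return np.mean((eps - eps_hat) ** 2)

# Usage: a smooth 2D (yaw/pitch) toy gaze trajectory and a dummy denoiser
# that predicts zero noise, so the loss is roughly E[eps^2] ~ 1.
rng = np.random.default_rng(0)
x0 = np.stack([np.sin(np.linspace(0, 3, 250)),
               np.cos(np.linspace(0, 3, 250))], axis=-1)
loss = training_loss(lambda x, t, c: np.zeros_like(x), x0, t=500, cond=None, rng=rng)
```

A real conditioner would pass $c$ (user embedding, image features) into `denoiser` instead of `None`.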

Many diffusion-based gaze synthesis models augment this with specialized conditioning mechanisms—for example, user identity embeddings (for personalized sequence generation (Jiao et al., 2024, Hasan et al., 13 Nov 2025)), image/scene embeddings (image- or scanpath-conditioned generation (Jiao et al., 2024, Cartella et al., 30 Jul 2025)), auxiliary sensor tokens (for motion-based imputation (Jiao et al., 4 Nov 2025)), or explicit orthogonality constraints (for disentangling gaze from other facial or behavioral factors (Panchalingam et al., 14 Nov 2025)). Some models introduce additional losses, such as cosine similarity in a user-authenticator embedding space, to enforce subject specificity (Jiao et al., 2024, Hasan et al., 13 Nov 2025).
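The identity-guidance idea described above can be sketched as an auxiliary cosine term added to the diffusion loss. The embedding extractor (e.g., an EKYT-style authenticator) is assumed to be an external pre-trained model and is not implemented here; the weighting hyperparameter `lam` is likewise hypothetical.

```python
import numpy as np

def cosine_identity_loss(emb_synth, emb_real):
    """1 - cosine similarity between authenticator embeddings of synthetic
    and real gaze for the same user; driving this toward 0 enforces
    subject specificity of the generated sequence."""
    num = float(np.dot(emb_synth, emb_real))
    den = float(np.linalg.norm(emb_synth) * np.linalg.norm(emb_real)) + 1e-8
    return 1.0 - num / den

def total_loss(mse_loss, emb_synth, emb_real, lam=0.1):
    """Combined objective: noise-prediction MSE plus the weighted identity
    term (lam is an illustrative hyperparameter)."""
    return mse_loss + lam * cosine_identity_loss(emb_synth, emb_real)
```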

2. Key Model Architectures and Conditioning Strategies

Diffusion-based gaze synthesis encompasses a variety of architectural paradigms, unified by the core denoising framework but differentiated in their encoder, conditioner, and denoiser network design:

  • DiffGaze employs a 1D residual backbone with dual transformers (temporal and spatial) to model continuous gaze sequences on 360° images, conditioned on Spherical-CNN extracts from equirectangular environments (Jiao et al., 2024).
  • DiffEyeSyn and its successors utilize a DiffWave-style or U-Net-based denoising network, with conditioning on both identity-stripped gaze dynamics and subject-identity vectors from EKYT authenticator models (Jiao et al., 2024, Hasan et al., 13 Nov 2025, Hasan et al., 28 Jan 2026). Feature-wise linear modulation (FiLM) and per-block user embedding injection are commonly employed.
  • ScanDiff fuses Vision Transformer (ViT) feature maps from the stimulus image and language-derived task encoding at every transformer layer to stochastically generate scanpaths as sequences of fixation-state tokens (Cartella et al., 30 Jul 2025).
  • HAGI++ demonstrates a multi-modal transformer denoiser, using cross-modal self-attention and FiLM layers to fuse time-aligned head orientation and wrist motion signals with gaze data for robust imputation and generation under arbitrary missingness (Jiao et al., 4 Nov 2025).
  • TextGaze cascades a sketch diffusion module driven by CLIP-based attention embeddings with a sketch-conditioned face image diffusion module, enabling generation of facial images with specified gaze from free-form textual input (Wang et al., 2024).
  • 3D Gaussian/DiT-Gaze combines a head/eye-parameterized 3D Gaussian splatting front-end with a diffusion transformer renderer for fine-grained, disentangled 3D gaze redirection (Panchalingam et al., 14 Nov 2025).
  • RadGazeGen introduces ControlNet-style adapters to condition chest X-ray image diffusion on expert gaze heatmaps and radiomic feature maps (Bhattacharya et al., 2024).

Most models optimize conditioning fusion via residual connections, cross-attention, or explicit concatenation, and exploit geometric tokenization, time/frequency encoding, and domain-specific normalization (e.g., down-up sampling, velocity clamping, sin-map scaling) in the gaze feature space.
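FiLM-style conditioning fusion, mentioned above, reduces to a per-channel affine modulation of the denoiser's activations. The sketch below shows the mechanism with a hypothetical linear conditioning head; names and dimensions are illustrative.

```python
import numpy as np

def film(features, gamma, beta):
    """Feature-wise Linear Modulation (FiLM): scale and shift each channel
    of a (time, channels) activation map using parameters predicted from
    the conditioning vector (e.g., a user-identity embedding)."""
    return gamma[None, :] * features + beta[None, :]

# Hypothetical conditioning head: a linear projection from a condition
# vector c to per-channel (gamma, beta), applied per denoiser block.
rng = np.random.default_rng(0)
c = rng.standard_normal(16)                   # e.g., user embedding
W = rng.standard_normal((2 * 8, 16)) * 0.1    # projects c -> gamma/beta for 8 channels
gamma, beta = np.split(W @ c, 2)
modulated = film(np.ones((100, 8)), 1.0 + gamma, beta)  # (time, channels)
```

Centering the scale at 1 (`1.0 + gamma`) is a common initialization trick so the modulation starts near the identity.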

3. Applications and Problem Domains

The scope of diffusion-based gaze synthesis includes:

| Application Domain | Representative Model(s) | Conditioning Modalities |
| --- | --- | --- |
| Continuous sequence generation | DiffGaze, DiffEyeSyn | Image, user embedding, scanpath |
| User-specific synthesis | DiffEyeSyn, updated DiffEyeSyn | Identity embedding, dynamics |
| Scanpath prediction, saliency | ScanDiff, DiffGaze | Image, text/task, scene |
| Gaze imputation/generation | HAGI++ | Head/wrist motion, partial gaze |
| Controllable gaze face synthesis | TextGaze | Text, sketch, 3D pose, facial landmarks |
| Gaze-guided image synthesis | RadGazeGen | Gaze heatmap, radiomics, text |
| 3D gaze redirection | DiT-Gaze | 3D Gaussian splatting, latent codes |

Use cases include large-scale data augmentation for calibration-free tracking and biometrics (Jiao et al., 2024, Hasan et al., 13 Nov 2025), privacy-preserving surrogate data (Hasan et al., 28 Jan 2026), robust scanpath modeling for visual attention studies (Cartella et al., 30 Jul 2025, Jiao et al., 2024), and multimodal medical image generation guided by expert radiologist attention (Bhattacharya et al., 2024). Advanced models such as HAGI++ enable recovery of natural gaze kinematics even under extreme (100%) data loss by leveraging multimodal wearable inputs (Jiao et al., 4 Nov 2025).

4. Empirical Evaluation and Metrics

Quantitative and qualitative benchmarks are standardized across the literature, with task-specific metrics:

  • DiffGaze narrows the gap to human agreement in continuous scanpath generation (DTW ↓ 20% over the previous SoTA) and matches human-level saccade/fixation statistics (Jiao et al., 2024).
  • Subject-specific diffusion (DiffEyeSyn variants) reduces spatial error by a factor of 4× over GANs and reaches cosine embedding similarity ≈ 0.92–0.95, near the human within-subject level (Jiao et al., 2024, Hasan et al., 13 Nov 2025).
  • 3D-aware DiT-Gaze achieves gaze-redirection errors of 6.353° (↓ 4.1% vs. 3DGS) and improved perceptual scores (Panchalingam et al., 14 Nov 2025).
  • HAGI++ attains a 25% reduction in MAE compared to strong interpolation baselines and accurate velocity-profile matching, even when gaze is fully missing and inferred purely from head/wrist motion (Jiao et al., 4 Nov 2025).
  • Gaze-guided image synthesis (RadGazeGen) improves localization and structural agreement (SSIM up to 0.7045) and attains lower FID than prior multi-control diffusion methods (Bhattacharya et al., 2024).
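For reference, DTW, the trajectory metric cited throughout these benchmarks, can be computed with the standard dynamic-programming recurrence. This minimal implementation uses Euclidean point distances and is a generic sketch, not any specific benchmark's configuration.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping cost between two gaze trajectories given as
    (length, dims) arrays, with Euclidean point-to-point distances."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            # Allow match, insertion, or deletion alignments.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

Lower DTW between synthetic and human scanpaths on the same stimulus indicates closer temporal-spatial agreement.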

5. User Specificity, Privacy, and State Signal Attenuation

Recent diffusion-based gaze synthesis frameworks explicitly address the dual requirements of subject specificity and privacy. The introduction of user identity guidance losses and pre-trained authenticator-based embeddings allows fine control over idiosyncratic oculomotor signatures (Jiao et al., 2024, Hasan et al., 13 Nov 2025, Hasan et al., 28 Jan 2026). Updated models demonstrate that, while high-level kinematic realism and subject-identity cues are preserved, internal state features (such as fatigue or mental workload) are suppressed: correlations between synthetic gaze features and subjective reports drop to non-significant levels, contrasting with original real-gaze signals, where $\rho \sim 0.2$–$0.4$ (p < 0.05) can be detected for state-dependent attributes (Hasan et al., 28 Jan 2026). This property enables the use of synthetic gaze for privacy-preserving benchmarking, AR/VR simulation, and data dissemination where state leakage is a concern.

6. Emerging Directions: Multimodality, Control, and Synthesis in Context

Contemporary research extends diffusion-based gaze synthesis to encompass multimodal and task-aware synthesis frameworks:

  • Text and semantic control is effected via cross-attention on CLIP-derived features from free-form text, enabling natural language to control facial gaze in photo-realistic synthesis without direct gaze-value annotation (Wang et al., 2024).
  • Image and scanpath co-conditioning allows for the generation of plausible scanpaths modulated by both visual features and explicit task cues (object reference, search instructions) via transformer cross-modality (Cartella et al., 30 Jul 2025, Jiao et al., 2024).
  • Medical imaging applications now exploit gaze-encoded attention (radiologist fixation heatmaps) as a spatial prior in text-to-image diffusion, improving the anatomical plausibility and regional fidelity of synthesized diagnostics (Bhattacharya et al., 2024).

Pure gaze sequence generation, as in HAGI++, demonstrates robust gaze inference under substantially missing input, leveraging auxiliary inertial or limb-mounted sensors (head, wrist) for behavioral context (Jiao et al., 4 Nov 2025). 3D Gaussian/diffusion fusion (DiT-Gaze) provides a differentiable, geometry-aware representation for fine-grained gaze redirection, with orthogonality constraints for disentangling gaze from expression/pose (Panchalingam et al., 14 Nov 2025).
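At inference time, the sequence models above share the same ancestral sampling loop: start from Gaussian noise $x_T$ and apply the learned reverse transitions down to $x_0$. A minimal NumPy sketch follows; the short schedule and the zero-predicting placeholder denoiser are toy assumptions.

```python
import numpy as np

# Toy schedule (an assumption; real samplers use the training schedule).
T = 50
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def ddpm_sample(denoiser, shape, cond, rng):
    """Ancestral DDPM sampling of a gaze sequence of the given shape,
    conditioned on cond (user embedding, image features, etc.)."""
    x = rng.standard_normal(shape)            # x_T ~ N(0, I)
    for t in reversed(range(T)):
        eps_hat = denoiser(x, t, cond)
        # Posterior mean: (x - beta_t / sqrt(1 - abar_t) * eps_hat) / sqrt(alpha_t)
        mu = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:
            x = mu + np.sqrt(betas[t]) * rng.standard_normal(shape)
        else:
            x = mu                             # final step is deterministic
    return x

# Usage: sample a 250-step 2D trajectory with a placeholder denoiser.
sample = ddpm_sample(lambda x, t, c: np.zeros_like(x),
                     (250, 2), None, np.random.default_rng(0))
```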

7. Limitations and Current Challenges

Despite significant advances, key methodological and practical challenges remain:

  • Dependency on domain-specific embeddings: User-specific synthesis relies on the representational power of authenticator models (e.g., EKYT) and could be limited by their discriminative granularity (Jiao et al., 2024, Hasan et al., 28 Jan 2026).
  • Computational and data costs: High-resolution diffusion synthesis, especially for images or large datasets, remains computationally intensive (Bhattacharya et al., 2024, Panchalingam et al., 14 Nov 2025). Gaze conditioning data can be expensive or unavailable for certain modalities (e.g., 3D medical imaging).
  • Generalization: Cross-domain robustness—transfer to novel settings (AR/VR, head-mounted tracks, lower sampling frequencies)—requires new architectures and foundation models (Jiao et al., 2024, Bhattacharya et al., 2024).
  • Interpretability: While gaze synthesis can mask internal state cues, the interpretability and controllability of synthesized output, particularly in privacy-sensitive scenarios, requires further quantification (Hasan et al., 28 Jan 2026).

Ongoing research directions include larger and more robust user-embedding extractors ("foundation" user-authenticator analogues), tighter temporal integration of multimodal controls, extension to AR/VR environments, and compositional generative frameworks spanning gaze, motion, and semantic context.

