Diffusion-Based Gaze Synthesis
- Diffusion-based gaze synthesis is a generative approach that uses denoising diffusion models to create temporally coherent and perceptually realistic gaze trajectories and images.
- It leverages iterative stochastic denoising with specialized conditioning (e.g., image features, user embeddings) for subject-specific synthesis across diverse applications.
- Empirical evaluations demonstrate robust performance through metrics such as DTW, MAE, and FID while ensuring privacy-preserving, controllable synthesis in AR/VR and medical imaging.
Diffusion-based gaze synthesis refers to a family of generative frameworks that employ denoising diffusion probabilistic models (DDPMs) to synthesize, impute, manipulate, or redirect human gaze time series or gaze-conditioned imagery. These models invert a forward noising process through iterative stochastic denoising to generate complex, temporally coherent, and perceptually realistic gaze trajectories or images. Diffusion approaches have demonstrated state-of-the-art fidelity and diversity in continuous gaze sequence generation, subject-aware bio-signal synthesis, scanpath simulation, image-conditioned gaze vector synthesis, 3D gaze redirection, imputation under missing data, and the incorporation of gaze as a semantic control for downstream visual synthesis tasks.
1. Mathematical Foundations of Diffusion-Based Gaze Synthesis
Diffusion-based gaze synthesis methods operate by defining a forward noising process $q(x_t \mid x_{t-1}) = \mathcal{N}(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I)$ on the clean gaze trajectory $x_0$ (which may be a sequence of gaze vectors, velocities, or higher-order temporally organized gaze features) (Jiao et al., 2024, Jiao et al., 2024, Hasan et al., 13 Nov 2025, Cartella et al., 30 Jul 2025, Jiao et al., 4 Nov 2025). Over $T$ steps, the process transforms $x_0$ into an approximately isotropic Gaussian latent $x_T$ by repeated application of stochastic Gaussian noise. The reverse denoising process, parameterized by a neural model $\epsilon_\theta$, iteratively reconstructs a synthetic gaze sequence by sampling from a series of conditional Gaussians of the form

$$p_\theta(x_{t-1} \mid x_t, c) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t, c),\ \sigma_t^2 I\big),$$

with explicit dependence on the conditioning variable $c$ (such as user embeddings, image features, or auxiliary sensor data). The objective minimized during training is an MSE (simplified score matching) loss between the true and predicted noise:

$$\mathcal{L} = \mathbb{E}_{x_0,\ \epsilon \sim \mathcal{N}(0, I),\ t}\Big[\big\lVert \epsilon - \epsilon_\theta(x_t, t, c) \big\rVert_2^2\Big],$$

where $x_t$ is constructed as $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$, with $\bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s)$.
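As a concrete illustration, one training step under this objective can be sketched in NumPy. This is a minimal sketch, not any published architecture: the denoiser is a zero placeholder standing in for $\epsilon_\theta(x_t, t, c)$, and the toy trajectory shape is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear noise schedule and its cumulative products (standard DDPM quantities)
T = 100
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

def noise_gaze(x0, t, eps):
    """Forward process: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    ab = alpha_bars[t]
    return np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps

# Toy clean gaze trajectory: 50 samples of 2D gaze angles (yaw, pitch)
x0 = rng.standard_normal((50, 2)) * 0.1

# One training step: sample t and eps, form x_t, and compute the MSE
# between the true noise and the (placeholder) predicted noise.
t = int(rng.integers(0, T))
eps = rng.standard_normal(x0.shape)
x_t = noise_gaze(x0, t, eps)
eps_pred = np.zeros_like(eps)  # stand-in for eps_theta(x_t, t, c)
loss = np.mean((eps - eps_pred) ** 2)
```

In a real model, `eps_pred` would come from a conditioned denoiser network and `loss` would be backpropagated.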
Many diffusion-based gaze synthesis models augment this with specialized conditioning mechanisms—for example, user identity embeddings (for personalized sequence generation (Jiao et al., 2024, Hasan et al., 13 Nov 2025)), image/scene embeddings (image- or scanpath-conditioned generation (Jiao et al., 2024, Cartella et al., 30 Jul 2025)), auxiliary sensor tokens (for motion-based imputation (Jiao et al., 4 Nov 2025)), or explicit orthogonality constraints (for disentangling gaze from other facial or behavioral factors (Panchalingam et al., 14 Nov 2025)). Some models introduce additional losses, such as cosine similarity in a user-authenticator embedding space, to enforce subject specificity (Jiao et al., 2024, Hasan et al., 13 Nov 2025).
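FiLM-style conditioning, as used for per-block user-embedding injection, amounts to predicting a per-channel scale and shift from the conditioning vector. A minimal sketch, with all shapes and weight names hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

def film(features, cond, w_scale, w_shift):
    """Feature-wise linear modulation: h' = gamma(c) * h + beta(c)."""
    gamma = cond @ w_scale          # per-channel scale, shape (channels,)
    beta = cond @ w_shift           # per-channel shift, shape (channels,)
    return features * gamma + beta  # broadcast over the time axis

# Hidden gaze features: 50 timesteps x 16 channels; user embedding: 8-d
h = rng.standard_normal((50, 16))
c = rng.standard_normal(8)
w_scale = rng.standard_normal((8, 16)) * 0.1
w_shift = rng.standard_normal((8, 16)) * 0.1
h_mod = film(h, c, w_scale, w_shift)
```

In practice the scale/shift projections are learned jointly with the denoiser, one pair per residual block.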
2. Key Model Architectures and Conditioning Strategies
Diffusion-based gaze synthesis encompasses a variety of architectural paradigms, unified by the core denoising framework but differentiated in their encoder, conditioner, and denoiser network design:
- DiffGaze employs a 1D residual backbone with dual transformers (temporal and spatial) to model continuous gaze sequences on 360° images, conditioned on Spherical-CNN extracts from equirectangular environments (Jiao et al., 2024).
- DiffEyeSyn and its successors utilize a DiffWave-style or U-Net-based denoising network, with conditioning on both identity-stripped gaze dynamics and subject-identity vectors from EKYT authenticator models (Jiao et al., 2024, Hasan et al., 13 Nov 2025, Hasan et al., 28 Jan 2026). Feature-wise linear modulation (FiLM) and per-block user embedding injection are commonly employed.
- ScanDiff fuses Vision Transformer (ViT) feature maps from the stimulus image and language-derived task encoding at every transformer layer to stochastically generate scanpaths as sequences of fixation-state tokens (Cartella et al., 30 Jul 2025).
- HAGI++ employs a multimodal transformer denoiser, using cross-modal self-attention and FiLM layers to fuse time-aligned head orientation and wrist motion signals with gaze data for robust imputation and generation under arbitrary missingness (Jiao et al., 4 Nov 2025).
- TextGaze cascades a sketch diffusion module driven by CLIP-based attention embeddings with a sketch-conditioned face image diffusion module, enabling generation of facial images with specified gaze from free-form textual input (Wang et al., 2024).
- 3D Gaussian/DiT-Gaze combines a head/eye-parameterized 3D Gaussian splatting front-end with a diffusion transformer renderer for fine-grained, disentangled 3D gaze redirection (Panchalingam et al., 14 Nov 2025).
- RadGazeGen introduces ControlNet-style adapters to condition chest X-ray image diffusion on expert gaze heatmaps and radiomic feature maps (Bhattacharya et al., 2024).
Most models optimize conditioning fusion via residual connections, cross-attention, or explicit concatenation, and exploit geometric tokenization, time/frequency encoding, and domain-specific normalization (e.g., down-up sampling, velocity clamping, sin-map scaling) in the gaze feature space.
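The time encoding mentioned above is typically the standard sinusoidal embedding of the diffusion timestep; a minimal sketch (dimension and `max_period` are conventional defaults, not tied to any specific paper):

```python
import numpy as np

def timestep_embedding(t, dim, max_period=10000.0):
    """Standard sinusoidal embedding of a scalar diffusion timestep t."""
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.sin(args), np.cos(args)])

emb = timestep_embedding(t=25, dim=32)
```

The embedding is usually passed through a small MLP before being injected into each denoiser block.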
3. Applications and Problem Domains
The scope of diffusion-based gaze synthesis includes:
| Application Domain | Representative Model(s) | Conditioning Modalities |
|---|---|---|
| Continuous sequence generation | DiffGaze, DiffEyeSyn | Image, user-embedding, scanpath |
| User-specific synthesis | DiffEyeSyn, updated DiffEyeSyn | Identity embedding, dynamics |
| Scanpath prediction, saliency | ScanDiff, DiffGaze | Image, text/task, scene |
| Gaze imputation/generation | HAGI++ | Head/wrist motion, partial gaze |
| Controllable gaze face synthesis | TextGaze | Text, sketch, 3D pose, facial landmarks |
| Gaze-guided image synthesis | RadGazeGen | Gaze heatmap, radiomics, text |
| 3D gaze redirection | DiT-Gaze | 3D Gaussian splatting, latent codes |
Use cases include large-scale data augmentation for calibration-free tracking and biometrics (Jiao et al., 2024, Hasan et al., 13 Nov 2025), privacy-preserving surrogate data (Hasan et al., 28 Jan 2026), robust scanpath modeling for visual attention studies (Cartella et al., 30 Jul 2025, Jiao et al., 2024), and multimodal medical image generation guided by expert radiologist attention (Bhattacharya et al., 2024). Advanced models such as HAGI++ enable recovery of natural gaze kinematics even under extreme (100%) data loss by leveraging multimodal wearable inputs (Jiao et al., 4 Nov 2025).
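One common way to condition a diffusion sampler on partially observed gaze (a generic inpainting-style scheme, not necessarily HAGI++'s exact mechanism) is to re-noise the observed samples to the current noise level at each reverse step and paste them over the model's iterate, so the sampler only fills in the missing entries. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(2)

T = 100
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, T))

def impute_step(x_t, observed, mask, t):
    """Re-noise observed gaze samples to level t and overwrite the
    current iterate there; missing entries keep the model's values."""
    ab = alpha_bars[t]
    eps = rng.standard_normal(observed.shape)
    observed_t = np.sqrt(ab) * observed + np.sqrt(1.0 - ab) * eps
    return np.where(mask, observed_t, x_t)

observed = rng.standard_normal((50, 2)) * 0.1   # recorded gaze (where available)
mask = rng.random((50, 1)) < 0.6                # True where gaze was recorded
x_t = rng.standard_normal((50, 2))              # current diffusion iterate
x_next = impute_step(x_t, observed, mask, t=10)
```

This step would be interleaved with the usual reverse-denoising update at every timestep of sampling.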
4. Empirical Evaluation and Metrics
Quantitative and qualitative benchmarks are standardized across the literature, with task-specific metrics:
- Trajectory similarity: Dynamic Time Warping (DTW), Levenshtein (LEV), Mean Absolute Error (MAE), Root Mean Squared Error (RMSE) for continuous sequence matching (Jiao et al., 2024, Hasan et al., 13 Nov 2025).
- Temporal and scanpath structure: MultiMatch, ScanMatch, Sequence Score (SS), Semantic SS, Recurrence measures (Cartella et al., 30 Jul 2025).
- User specificity: Cosine similarity in authenticator embedding space (Jiao et al., 2024, Hasan et al., 13 Nov 2025).
- Perceptual/realism: User studies, median human-vs-synth rating, Fréchet Inception Distance (FID), Inception Score (IS), CLIP-score (Jiao et al., 2024, Wang et al., 2024, Bhattacharya et al., 2024).
- Saliency maps: Normalized Scanpath Saliency (NSS), AUC, CC, SIM, KL divergence for saliency evaluation (Jiao et al., 2024).
- Gaze-redirection accuracy: Angular error of gaze estimator on generated images (Panchalingam et al., 14 Nov 2025, Wang et al., 2024).
- Imputation/generation: Mean angular error (MAE), Jensen–Shannon divergence of gaze velocity distributions in imputation/generation settings (Jiao et al., 4 Nov 2025).
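Several of the trajectory metrics above are simple dynamic programs; as an illustration, a minimal DTW over 2D gaze points (a textbook implementation, not any specific paper's variant):

```python
import numpy as np

def dtw(a, b):
    """Dynamic time warping distance between two gaze trajectories,
    using Euclidean distance between individual 2D gaze samples."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Toy trajectories: identical paths give distance 0
a = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
b = np.array([[0.0, 0.0], [2.0, 0.0]])
d_self = dtw(a, a)
d_ab = dtw(a, b)
```

For gaze on spherical stimuli (e.g., 360° images), the Euclidean cost would typically be replaced by a great-circle or angular distance.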
Models such as DiffGaze narrow the gap to human agreement in continuous scanpath generation (a 20% DTW reduction over the previous state of the art) and match human-level saccade/fixation statistics (Jiao et al., 2024). Subject-specific diffusion (DiffEyeSyn variants) reduces spatial error by a factor of 4 over GAN baselines and reaches cosine embedding similarities of ≈0.92–0.95, near the human within-subject level (Jiao et al., 2024, Hasan et al., 13 Nov 2025). The 3D-aware DiT-Gaze achieves gaze-redirection errors of 6.353° (a 4.1% reduction vs. 3DGS) and higher perceptual scores (Panchalingam et al., 14 Nov 2025). HAGI++ attains a 25% reduction in MAE compared to strong interpolation baselines and accurate velocity-profile matching, even when gaze is fully missing and must be inferred purely from head/wrist motion (Jiao et al., 4 Nov 2025). Gaze-guided image synthesis (RadGazeGen) improves localization and structural agreement (SSIM up to 0.7045) and attains lower FID than prior multi-control diffusion methods (Bhattacharya et al., 2024).
5. User Specificity, Privacy, and State Signal Attenuation
Recent diffusion-based gaze synthesis frameworks explicitly address the dual requirements of subject specificity and privacy. The introduction of user identity guidance losses and pre-trained authenticator-based embeddings allows fine control over idiosyncratic oculomotor signatures (Jiao et al., 2024, Hasan et al., 13 Nov 2025, Hasan et al., 28 Jan 2026). Updated models demonstrate that, while high-level kinematic realism and subject-identity cues are preserved, internal state features (such as fatigue or mental workload) are suppressed: correlations between synthetic gaze features and subjective reports drop to non-significant levels, in contrast with the original real-gaze signals, where significant correlations (p < 0.05) can be detected for state-dependent attributes (Hasan et al., 28 Jan 2026). This property enables the use of synthetic gaze for privacy-preserving benchmarking, AR/VR simulation, and data dissemination where state leakage is a concern.
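The state-attenuation check described above boils down to correlating per-trial gaze features with subjective state reports and testing for significance. A minimal Pearson-correlation sketch on placeholder data (the feature and workload values below are random, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

def pearson_r(x, y):
    """Pearson correlation coefficient between two 1-D arrays."""
    x = x - x.mean()
    y = y - y.mean()
    return float((x @ y) / np.sqrt((x @ x) * (y @ y)))

# Placeholder per-trial gaze features and self-reported workload scores:
# the "real" feature is constructed to correlate with workload, while the
# "synthetic" feature is drawn independently (mimicking state attenuation).
feature_real = rng.standard_normal(40)
workload = 0.8 * feature_real + 0.2 * rng.standard_normal(40)
feature_synth = rng.standard_normal(40)

r_real = pearson_r(feature_real, workload)
r_synth = pearson_r(feature_synth, workload)
```

In an actual analysis, each correlation would be paired with a significance test (e.g., a t-test on r) to decide whether state signatures survive synthesis.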
6. Emerging Directions: Multimodality, Control, and Synthesis in Context
Contemporary research extends diffusion-based gaze synthesis to encompass multimodal and task-aware synthesis frameworks:
- Text and semantic control is effected via cross-attention on CLIP-derived features from free-form text, enabling natural language to control facial gaze in photo-realistic synthesis without direct gaze-value annotation (Wang et al., 2024).
- Image and scanpath co-conditioning allows for the generation of plausible scanpaths modulated by both visual features and explicit task cues (object reference, search instructions) via transformer cross-modality (Cartella et al., 30 Jul 2025, Jiao et al., 2024).
- Medical imaging applications now exploit gaze-encoded attention (radiologist fixation heatmaps) as a spatial prior in text-to-image diffusion, improving the anatomical plausibility and regional fidelity of synthesized diagnostics (Bhattacharya et al., 2024).
Pure gaze sequence generation, as in HAGI++, demonstrates robust gaze inference under substantially missing input, leveraging auxiliary inertial or limb-mounted sensors (head, wrist) for behavioral context (Jiao et al., 4 Nov 2025). 3D Gaussian/diffusion fusion (DiT-Gaze) provides a differentiable, geometry-aware representation for fine-grained gaze redirection, with orthogonality constraints for disentangling gaze from expression/pose (Panchalingam et al., 14 Nov 2025).
7. Limitations and Current Challenges
Despite significant advances, key methodological and practical challenges remain:
- Dependency on domain-specific embeddings: User-specific synthesis relies on the representational power of authenticator models (e.g., EKYT) and could be limited by their discriminative granularity (Jiao et al., 2024, Hasan et al., 28 Jan 2026).
- Computational and data costs: High-resolution diffusion synthesis, especially for images or large datasets, remains computationally intensive (Bhattacharya et al., 2024, Panchalingam et al., 14 Nov 2025). Gaze conditioning data can be expensive or unavailable for certain modalities (e.g., 3D medical imaging).
- Generalization: Cross-domain robustness—transfer to novel settings (AR/VR, head-mounted tracks, lower sampling frequencies)—requires new architectures and foundation models (Jiao et al., 2024, Bhattacharya et al., 2024).
- Interpretability: While gaze synthesis can mask internal state cues, the interpretability and controllability of synthesized output, particularly in privacy-sensitive scenarios, requires further quantification (Hasan et al., 28 Jan 2026).
Further research is investigating larger and more robust user-embedding extractors ("foundation" user-authenticator analogues), temporal integrations for multimodal controls, extension to AR/VR environments, and compositional generative frameworks for gaze, motion, and semantic context.
References
- DiffGaze: A Diffusion Model for Continuous Gaze Sequence Generation on 360° Images (Jiao et al., 2024)
- DiffEyeSyn: Diffusion-based User-specific Eye Movement Synthesis (Jiao et al., 2024)
- Quantitative and Qualitative Comparison of Generative Models for Subject-Specific Gaze Synthesis: Diffusion vs GAN (Hasan et al., 13 Nov 2025)
- Modeling Human Gaze Behavior with Diffusion Models for Unified Scanpath Prediction (Cartella et al., 30 Jul 2025)
- HAGI++: Head-Assisted Gaze Imputation and Generation (Jiao et al., 4 Nov 2025)
- TextGaze: Gaze-Controllable Face Generation with Natural Language (Wang et al., 2024)
- 3D Gaussian and Diffusion-Based Gaze Redirection (Panchalingam et al., 14 Nov 2025)
- RadGazeGen: Radiomics and Gaze-guided Medical Image Generation using Diffusion Models (Bhattacharya et al., 2024)
- Privatization of Synthetic Gaze: Attenuating State Signatures in Diffusion-Generated Eye Movements (Hasan et al., 28 Jan 2026)