Gaze-Driven Personalization
- Gaze-driven personalization is a technique that uses user eye movement patterns and saliency maps to adapt AI and HCI systems for tailored experiences.
- Modeling user identity from gaze embeddings employs Siamese networks with triplet loss, achieving up to 98% accuracy in discriminating individual attentional signatures.
- Methodologies span saliency prediction, gaze estimation with few-shot calibration, and federated, privacy-aware personalization that enhances UI, streaming, and content generation.
Gaze-driven personalization refers to the process of adapting artificial intelligence and human–computer interaction systems to individual users by modeling, inferring, or directly utilizing patterns in their eye movements (gaze) and visual attention. This paradigm exploits the stable, idiosyncratic tendencies in where, when, and how users look at visual content to drive adaptation in predictive, generative, or interactive models. By leveraging both fine-grained gaze traces and high-level attention statistics, gaze-driven personalization underpins a diverse array of applications, including personalized saliency prediction, attention-aware UI/UX, federated learning for gaze estimation, human-aligned language generation, adaptive streaming, and semantic image or video editing.
1. Modeling User Identity from Gaze Embeddings
The core technical challenge is to construct compact, expressive user representations (embeddings) that capture the persistent behavioral and perceptual biases reflected in an individual's gaze data. One state-of-the-art approach employs a Siamese convolutional encoder to map sets of $n$ image–personal-saliency-map (PSM) pairs to a $d$-dimensional unit-normalized embedding for each user, using either eye-tracker-derived or synthetically generated saliency maps (Strohm et al., 2024). A triplet margin loss with semi-hard negative mining enforces inter-user discriminability:

$$\mathcal{L}_{\text{triplet}} = \max\!\left(0,\; \|f(a) - f(p)\|_2^2 - \|f(a) - f(n)\|_2^2 + m\right),$$

where $f(\cdot)$ is the unit-normalized embedding function, $(a, p, n)$ are anchor, positive, and negative samples, and $m > 0$ is the margin.
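The triplet objective with semi-hard negative mining can be sketched as follows; this is a generic formulation (the margin value here is a placeholder, not the paper's setting):

```python
import numpy as np

def triplet_margin_loss(anchor, positive, negatives, margin=0.2):
    """Triplet margin loss with semi-hard negative mining.

    Embeddings are assumed unit-normalized. Semi-hard negatives lie
    farther from the anchor than the positive, but within the margin;
    if none exist, the hardest (closest) negative is used instead.
    """
    d_ap = np.sum((anchor - positive) ** 2)           # anchor-positive distance
    d_an = np.sum((anchor - negatives) ** 2, axis=1)  # anchor-negative distances
    semi_hard = (d_an > d_ap) & (d_an < d_ap + margin)
    d_neg = d_an[semi_hard].min() if semi_hard.any() else d_an.min()
    return max(0.0, d_ap - d_neg + margin)
```

A well-separated triplet (positive near, negative far) yields zero loss, while a violated triplet yields a positive penalty that pushes users' embeddings apart.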
The number of samples per embedding and the embedding dimensionality are chosen to balance discriminative power against data efficiency. Classification accuracy on held-out users rises from ≈80% to ≈98% as the number of samples per embedding increases, illustrating the capacity of gaze-derived embeddings to uniquely identify user attentional signatures (Strohm et al., 2024).
2. Gaze-Based Personalization Methodologies
2.1. Saliency Prediction
Gaze-driven personalization frameworks predict how an individual user will distribute visual attention over an image by refining a universal saliency predictor (e.g., DeepGaze IIE) with personalized discrepancy maps (Strohm et al., 2024). The typical model fuses the user embedding with input image features to modulate convolutional kernels and produce the user-specific output; training minimizes the MSE between predicted and ground-truth discrepancy maps.
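One simple way to realize embedding-conditioned modulation is FiLM-style channel-wise scale and shift; the sketch below is an illustrative stand-in (the projection matrices `W_gamma`/`W_beta` and the 1×1 readout are assumptions, not the paper's exact parameterization):

```python
import numpy as np

def predict_discrepancy(features, user_emb, W_gamma, W_beta):
    """Condition backbone features on a user embedding via channel-wise
    modulation (scale gamma, shift beta), then collapse channels with a
    toy 1x1 readout to produce a personal discrepancy map."""
    gamma = W_gamma @ user_emb              # (C,) per-channel scale
    beta = W_beta @ user_emb                # (C,) per-channel shift
    mod = features * gamma[:, None, None] + beta[:, None, None]
    return mod.sum(axis=0)                  # (H, W) discrepancy map

def discrepancy_mse(pred, target):
    """MSE training objective between predicted and ground-truth maps."""
    return float(np.mean((pred - target) ** 2))
```

The user embedding thus steers which feature channels contribute to the predicted deviation from the universal saliency map.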
Alternative approaches condition on observer identity or traits using cGANs, incorporating binary or categorical labels to generate observer-specific saliency maps and demonstrating improved KL-divergence and SSIM over non-personalized baselines in age and language tasks (Yu et al., 2017).
2.2. Gaze Estimation Personalization
Personalization in gaze estimation typically proceeds either by adapting a small per-user latent representation (e.g., 3–6 parameters per eye in SPAZE (Lindén et al., 2018)) fitted to a handful of user-specific calibration samples, or by introducing compact, modular adaptation layers (adapters) into lightweight neural architectures, as in DFT Gaze (Hsieh et al., 2024). Few-shot fine-tuning (5–9 samples) suffices to substantially reduce gaze angular error, and adapters allow low-latency deployment on edge devices.
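A deliberately minimal sketch of few-shot calibration: instead of SPAZE's latent eye parameters or neural adapters, it fits a single constant per-user bias in closed form from a handful of calibration samples:

```python
import numpy as np

def few_shot_bias_calibration(base_preds, targets):
    """Fit a constant per-user gaze bias from a few calibration samples
    (closed-form least squares). A simplified stand-in for the
    low-dimensional person-specific parameters adapted in SPAZE-style
    models or via adapter layers."""
    return (targets - base_preds).mean(axis=0)

# Usage: 5 calibration samples from a user with a fixed angular offset.
base = np.tile([1.0, 2.0], (5, 1))           # generic model predictions
targets = base + np.array([0.5, -0.3])       # user-specific offset
bias = few_shot_bias_calibration(base, targets)
corrected = base + bias                      # personalized predictions
```

Even this trivial correction captures the dominant systematic (person-specific) component of gaze error; richer parameterizations refine the residual.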
Recent methods such as Test-Time Personalized Gaze Estimation (TPGaze) meta-learn prompt-like parameters to be optimized unsupervised at inference, leveraging a symmetry loss to calibrate predictions in the absence of labels and achieving adaptation speeds at least 10× faster than full fine-tuning (Liu et al., 2024).
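The symmetry idea admits a compact unsupervised formulation: a horizontally mirrored input should produce a mirrored gaze direction (yaw sign flipped, pitch unchanged). The sketch below is a generic version of this consistency loss; TPGaze's exact loss may differ in detail:

```python
import numpy as np

def symmetry_loss(gaze, gaze_of_flipped):
    """Unsupervised symmetry consistency for test-time adaptation.
    `gaze` and `gaze_of_flipped` are (pitch, yaw) predictions for an
    image and its horizontal mirror; a perfect model satisfies
    gaze == (pitch, -yaw) of the flipped prediction."""
    mirrored = gaze_of_flipped * np.array([1.0, -1.0])
    return float(np.mean((gaze - mirrored) ** 2))
```

Because the loss needs no labels, it can be minimized at inference time over a small set of prompt-like parameters.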
Spatio-temporal attention modules and post-hoc bias correction via Gaussian processes enable robust, sample-efficient personalization in video-based gaze estimation, with as few as 3 calibration samples yielding a further reduction in angular error (Jindal et al., 2024).
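Post-hoc Gaussian-process correction can be sketched as GP regression on calibration residuals (ground truth minus base prediction), whose posterior mean is added to new predictions. The RBF kernel and hyperparameters below are illustrative choices, not the paper's configuration:

```python
import numpy as np

def gp_bias_correction(X_cal, residuals, X_query, length=1.0, noise=1e-3):
    """GP posterior mean over calibration residuals, evaluated at query
    inputs; adding it to base predictions gives a sample-efficient,
    post-hoc bias correction."""
    def rbf(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / length ** 2)
    K = rbf(X_cal, X_cal) + noise * np.eye(len(X_cal))   # regularized Gram matrix
    return rbf(X_query, X_cal) @ np.linalg.solve(K, residuals)
```

With only 3 calibration points, the GP interpolates the user's systematic error near those points while smoothly decaying to zero correction elsewhere.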
2.3. Federated and Privacy-Aware Personalization
In multi-user, data-distributed settings, parameter freezing strategies—such as the FedCPF approach—identify and freeze the subset of model parameters exhibiting highest per-user adaptation during federated learning, thus capturing personal gaze dynamics without full model retraining (Feng et al., 25 Feb 2025). FedCPF yields superior recall, precision, and F1 for gaze estimation in egocentric video, outperforming other PFL baselines.
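The freezing strategy can be sketched as follows: parameters whose local update diverged most from the global model are treated as person-specific and kept at their local values, while the rest follow the aggregated global model. The top-fraction selection rule here is illustrative, not FedCPF's exact criterion:

```python
import numpy as np

def personalize_by_freezing(global_params, local_params, freeze_frac=0.2):
    """Keep the most-adapted parameters frozen at their local values
    (capturing personal gaze dynamics); take the rest from the global
    federated model."""
    delta = np.abs(local_params - global_params)
    k = max(1, int(freeze_frac * delta.size))
    frozen = np.argsort(delta)[-k:]          # indices with largest adaptation
    merged = global_params.copy()
    merged[frozen] = local_params[frozen]
    return merged, frozen
```

This avoids full per-user retraining: only the frozen subset is personal, and the shared remainder continues to benefit from federated aggregation.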
3. Gaze-Driven Personalization in Generative and Interactive Systems
3.1. Content Generation and Manipulation
Gaze-driven systems support direct content manipulation via gaze-contingent control. GazeGen (Hsieh et al., 2024) introduces a distilled, adapter-personalized gaze estimator (281K parameters) to localize user-specified regions for generative tasks: object addition, deletion, style transfer, and even gaze-seeded text-to-video translation. Real-time gaze prediction conditions region-of-interest (ROI) selection without hand or voice input, facilitating highly personalized visual workflows on low-power edge devices.
3.2. Personalization in Vision-Language Generation
Recent models show that jointly encoding user-specific gaze patterns yields more human-aligned personalized image descriptions compared with linguistic style alone. The DEPER framework learns a subject embedding that fuses scanpath dynamics with linguistic tendencies, aligning a lightweight adapter with a frozen VLM for few-shot, cross-modal persona conditioning (Xue et al., 7 Dec 2025). Empirical results show a 24% absolute improvement over traditional baselines in personalized description tasks.
For text summarization, gaze density and heatmap features extracted from reading sessions enable LLMs to produce summaries that emphasize user-attended content, outperforming selections based on explicit prompts with lower user effort (Ding et al., 25 Jan 2026).
3.3. Attention-Guided User Interfaces and Streaming
Gaze-adaptive presentation systems deploy real-time gaze tracking to guide interface interventions. Adaptive interventions in magazine-style visualizations dynamically highlight data points linked to fixated text references, with user-study evidence that low-literacy users benefit from gaze-driven cuing, while high-literacy users see ceiling effects—suggesting scope for literacy-personalized adaptation (Lallé et al., 2019). In streaming, EyeNexus (Wu et al., 15 Sep 2025) combines low-latency, bandwidth-adaptive, foveated rendering with real-time gaze tracking to improve visual quality and reduce motion sickness in cloud VR.
4. Downstream Applications and Broader Implications
Gaze-driven personalization underpins a range of HCI and perception-centric applications:
- Attentive UI/widgets adapting to user’s active focus (Strohm et al., 2024).
- Personalization of recommender systems via gaze-driven re-ranking or inferred latent traits (He et al., 13 Jan 2026).
- Video summarization, cropping, and content selection tailored by personal gaze (Strohm et al., 2024).
- Gaze-augmented image captioning, VQA, and LLM prompting (Xue et al., 7 Dec 2025, Ding et al., 25 Jan 2026).
- Gaze-based biometric authentication, cognitive/affective state inference (Strohm et al., 2024, He et al., 13 Jan 2026).
Risks include the inadvertent encoding of sensitive demographic or psychological traits (e.g., age, gender, personality), with privacy and profiling implications for biometric embeddings (Strohm et al., 2024, He et al., 13 Jan 2026).
5. Evaluation Metrics and Empirical Benchmarks
Quantitative evaluation of gaze-driven personalization relies on task-specific metrics:
- Saliency: CC, SIM, AUC, NSS, KLD (Strohm et al., 2024), KL-divergence, SSIM, MSE for text regions (Yu et al., 2017).
- Gaze estimation: mean angular error (degrees), Euclidean error (mm, cm) (Lindén et al., 2018, Hsieh et al., 2024), macro F1/recall/precision in egocentric video (Feng et al., 25 Feb 2025).
- Generation: BLEU, CIDEr, METEOR, Object Sequence Score, classification accuracy for matching subject style (Xue et al., 7 Dec 2025).
- Streaming/UI: end-to-end latency, perceptual quality (EWPSNR, EWSSIM), user-rated visual quality/playability/motion sickness (Wu et al., 15 Sep 2025, Lallé et al., 2019).
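As a concrete instance of the gaze-estimation metric above, mean angular error is the average angle between predicted and ground-truth 3D gaze vectors:

```python
import numpy as np

def mean_angular_error_deg(pred, true):
    """Mean angle (degrees) between predicted and ground-truth 3D gaze
    vectors, the standard gaze-estimation error metric."""
    pred = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    true = true / np.linalg.norm(true, axis=1, keepdims=True)
    cos = np.clip(np.sum(pred * true, axis=1), -1.0, 1.0)  # clip for safe arccos
    return float(np.degrees(np.arccos(cos)).mean())
```

Clipping the cosine guards against floating-point values marginally outside [-1, 1], which would otherwise make `arccos` return NaN.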
Statistically significant improvements are regularly observed across both closed-set (seen users) and open-set (unseen users) personalization scenarios, with rigorous ablations isolating the contribution of gaze-specific components (Strohm et al., 2024).
6. Future Directions and Open Challenges
Emerging directions involve:
- Multimodal fusion of gaze with other user signals (e.g., head pose, physiological metrics) to refine embeddings and trait inference.
- Real-time adaptation with privacy-preserving on-device models (Hsieh et al., 2024, Ding et al., 25 Jan 2026).
- Trait-continuous and temporally dynamic personalization for richer user modeling (Yu et al., 2017, He et al., 13 Jan 2026).
- Robustness under data scarcity and cross-domain generalization, with meta-learning and Bayesian methods supporting few-shot adaptation (Liu et al., 2024, Jindal et al., 2024).
- Interpretability and transparency in gaze-driven adaptation, both for user trust and regulatory compliance (Yu et al., 2017).
By systematically leveraging user gaze as an implicit, high-information channel, gaze-driven personalization offers principled, robust pathways to synthesize human factors into adaptive computational systems across vision, language, streaming, and interactive platforms (Strohm et al., 2024, Hsieh et al., 2024, Xue et al., 7 Dec 2025, Feng et al., 25 Feb 2025, He et al., 13 Jan 2026).