VISTA-Beyond: Human Visual Attention Dataset
- VISTA-Beyond is a large-scale, human-annotated dataset that explicitly aligns natural language segments with spatio-temporal human gaze in everyday scenes.
- It employs high-frequency eye-tracking and kernel-density estimation to create precise saliency maps linked to detailed verbal descriptions.
- The dataset supports benchmarking vision-language models using metrics like NCC and AUC, offering insights into human-aligned caption ranking and model explainability.
VISTA-Beyond is a large-scale, human-annotated dataset designed to explicitly align natural language segments with human visual attention in complex, real-world scenes. Unlike prior datasets that rely on bounding-box or region-to-phrase mappings, VISTA-Beyond captures the real-time spatio-temporal dynamics of human gaze and language production. Each image–text pair in the dataset is accompanied by a fine-grained eye-tracking saliency map tied to precise linguistic segments, supporting rigorous analysis of alignment and interpretability in vision–language models (VLMs) (Harshit et al., 2024).
1. Composition and Scope
VISTA-Beyond comprises 508 triplets: each includes a unique image, a corresponding natural language description, and a grayscale, kernel-density-estimated (KDE) saliency map derived from eye-tracking data. All images depict generic "everyday scenes," spanning both indoor and outdoor lifestyle contexts. Each image was described in free-form by a single annotator (one description per image), leading to 508 image–description pairs and 508 associated human attention maps.
Distinctive features relative to established datasets (e.g., Flickr30k-Entities, COCO Captions) include:
- Real-time gaze capture, eschewing bounding box proxies for direct measurement of attentional focus.
- Alignment at the granularity of any text segment, from single words to full sentences.
- Fixation duration weighting in KDE maps, providing a measure sensitive to cognitive effort.
2. Annotation Methodology
Data collection employed the EyeLink 1000 Plus tracker, enabling high-frequency (≥500 Hz) recording of gaze fixations as annotators viewed and described each image. The verbal stream was audio recorded and transcribed, with each word or phrase timestamped to match the tracker's temporal base. For every linguistic unit with onset and offset times $t_{\text{on}}$ and $t_{\text{off}}$, all fixations with timestamps $t_i$ satisfying $t_{\text{on}} \le t_i \le t_{\text{off}}$ were aggregated.
The spatial distribution of fixations was converted into a KDE-based saliency map:

$$S(x, y) = \sum_{i} d_i \, K\!\left(\frac{x - x_i}{h}, \frac{y - y_i}{h}\right),$$

where $(x_i, y_i)$ and $d_i$ are the coordinates and duration of fixation $i$, respectively, $K$ is the smoothing kernel, and $h$ its bandwidth. This procedure yields a temporally localized, continuous-valued attention map for each language segment.
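The temporal aggregation and duration-weighted KDE described above can be sketched as follows. This is a minimal illustration, not the dataset's actual pipeline: the Gaussian kernel choice, the fixation tuple format `(x, y, timestamp, duration)`, and the bandwidth `sigma` are all assumptions.

```python
import numpy as np

def kde_saliency_map(fixations, t_on, t_off, height, width, sigma=25.0):
    """Duration-weighted Gaussian KDE map for one linguistic segment.

    `fixations` is a list of (x, y, timestamp, duration) tuples; only
    fixations whose timestamp falls in [t_on, t_off] contribute.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    smap = np.zeros((height, width), dtype=np.float64)
    for x_i, y_i, t_i, d_i in fixations:
        if t_on <= t_i <= t_off:  # temporal alignment with the segment
            smap += d_i * np.exp(-((xs - x_i) ** 2 + (ys - y_i) ** 2)
                                 / (2.0 * sigma ** 2))
    if smap.max() > 0:
        smap /= smap.max()  # per-map scaling, as in the released maps
    return smap
```

Note that the second fixation in a trial may fall outside the segment's time window and is then simply ignored, which is what localizes each map to its linguistic unit.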
Quality assurance was enforced via three mechanisms:
- Standardized eye-tracker calibration prior to annotation.
- Secondary review of each transcript for accuracy.
- Visual inspection of every heatmap–transcript overlay for cross-modal coherence. Pairs with gross misalignments (e.g., references to never-fixated objects) were excluded.
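The third check, overlaying each heatmap on its image for visual inspection, might look like the following minimal sketch; the red-channel rendering and the `alpha` blend are our choices for illustration, not the authors' tooling.

```python
import numpy as np

def overlay(image, saliency, alpha=0.5):
    """Alpha-blend a [0, 1]-normalized saliency map onto an RGB uint8
    image, rendering attention in the red channel for manual review."""
    heat = np.zeros_like(image, dtype=np.float64)
    heat[..., 0] = 255.0 * saliency  # attention shown as red intensity
    blended = (1.0 - alpha) * image + alpha * heat
    return blended.astype(np.uint8)
```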
Every image–text–attention triplet originates from a unique annotator; therefore, no inter-annotator agreement or IoU score is reported.
3. Data Structure and Partitioning
The released dataset follows a directory-based organization:
| Directory | Content |
|---|---|
| images/ | JPEG files, one per sample |
| transcripts/ | Cleaned text files (one per image) |
| saliency_maps/ | PNG images or NumPy arrays (one per map, normalized to $[0, 1]$) |
No explicit train/validation/test split is specified. A conventional 80/10/10 partition yields:
- 406 examples for training
- 51 for validation
- 51 for testing
This facilitates standard experimental protocols while allowing for flexible downstream evaluation.
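One reproducible way to realize such a partition is a seeded shuffle over example indices. This is only a sketch: the seed and the round-based sizing are our choices, since the release specifies no split.

```python
import random

def split_indices(n, seed=0, ratios=(0.8, 0.1, 0.1)):
    """Reproducible train/val/test index split; with n = 508 and the
    default 80/10/10 ratios this yields 406 / 51 / 51 examples."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # deterministic for a fixed seed
    n_train = round(n * ratios[0])
    n_val = round(n * ratios[1])
    return (idx[:n_train],
            idx[n_train:n_train + n_val],
            idx[n_train + n_val:])
```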
4. Dataset Characteristics and Analysis
Key summary statistics of VISTA-Beyond are as follows:
- Total examples: 508 triplets
- Mean description length: $9.6$ words (standard deviation $4.1$)
- Average saliency-map sparsity: of pixels non-zero (standard deviation )
- By default, each description is treated as one segment; further segmentation (e.g., into noun phrases) is feasible post hoc
- Absence of center bias: mean fixation center-of-mass is less than $50$ pixels from the image center in frames
- No additional normalization aside from per-map scaling
This suggests that the dataset’s design minimizes data-driven inductive biases (e.g., center bias) that could confound visual-linguistic alignment studies.
5. Baseline Model Evaluation
Nine off-the-shelf vision–language models were evaluated on VISTA-Beyond using two quantitative metrics:
- Normalized Cross-Correlation (NCC):

  $$\mathrm{NCC} = \frac{1}{N} \sum_{i=1}^{N} \frac{(P_i - \mu_P)(G_i - \mu_G)}{\sigma_P \, \sigma_G},$$

  where $N$ is the number of pixels, $P$ and $G$ denote the model-predicted and human saliency maps, and $\mu$ and $\sigma$ their per-map means and standard deviations.
- Borji’s Area Under the Curve (AUC): Saliency maps are interpreted as binary classifiers predicting fixation locations, sampling 1,000 positive and 1,000 negative points per map.
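Under common definitions of these two metrics, a minimal NumPy implementation might look as follows. Treating every non-zero ground-truth pixel as a fixated positive, and scoring ranked positive/negative pairs rather than sweeping explicit thresholds, are our assumptions about the evaluation details.

```python
import numpy as np

def ncc(pred, gt):
    """Normalized cross-correlation between two saliency maps,
    averaged over the N pixels."""
    p = (pred - pred.mean()) / (pred.std() + 1e-8)
    g = (gt - gt.mean()) / (gt.std() + 1e-8)
    return float((p * g).mean())

def borji_auc(pred, gt, n_samples=1000, rng=None):
    """Borji's AUC: predicted saliency as a classifier of fixated
    pixels, with negatives sampled uniformly over the image."""
    rng = rng or np.random.default_rng(0)
    pos_idx = np.flatnonzero(gt.ravel() > 0)        # assumed positives
    pos = rng.choice(pos_idx, size=n_samples, replace=True)
    neg = rng.integers(0, pred.size, size=n_samples)  # uniform negatives
    p = pred.ravel()
    pos_vals = p[pos][:, None]
    neg_vals = p[neg][None, :]
    # fraction of correctly ranked (positive, negative) pairs; ties count half
    return float((pos_vals > neg_vals).mean()
                 + 0.5 * (pos_vals == neg_vals).mean())
```

A predicted map identical to the human map scores an NCC of 1.0 and an AUC near 1.0, while an uninformative map hovers near 0.0 NCC and 0.5 AUC, matching the scale of the table below.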
The following table summarizes mean ± bootstrapped standard deviation (five resamples; see Table 2 in the original):
| Model | NCC (Mean ± SD) | AUC (Mean ± SD) |
|---|---|---|
| CLIP | 0.13 ± 0.01 | 0.57 ± 0.005 |
| BLIP-ITM-Base | 0.24 ± 0.01 | 0.63 ± 0.005 |
| BLIP-ITM-Large | 0.17 ± 0.01 | 0.60 ± 0.005 |
| ALBEF | 0.19 ± 0.01 | 0.57 ± 0.005 |
| ViLT | –0.02 ± 0.015 | 0.49 ± 0.01 |
| CLIP-Seg | 0.31 ± 0.02 | 0.67 ± 0.01 |
| OV-Seg | 0.18 ± 0.01 | 0.59 ± 0.005 |
| OpenSeg | 0.14 ± 0.01 | 0.58 ± 0.005 |
| ODISE | 0.16 ± 0.01 | 0.59 ± 0.005 |
CLIP-Seg demonstrates the highest NCC and AUC, and qualitatively its generated heatmaps most closely mirror the spatial patterns of human fixations during naturalistic description.
6. Applications and Constraints
Principal use cases for VISTA-Beyond include:
- Systematic comparison of model-generated saliency with human attention for interpretable VLM debugging and module fine-tuning.
- Human-aligned caption ranking, enabling prioritization of descriptions whose attentional profiles correspond to actual human gaze.
- Benchmarking and evaluation of saliency-driven explainability methods (e.g., Grad-CAM variants) against human ground truth.
Primary limitations are:
- Dataset size: 508 examples, each annotated by a single human subject.
- Limited domain coverage: all images are "everyday scenes," excluding specialized visual domains (e.g., medical imaging).
- Inability to report formal inter-annotator reliability scores, as each image–description pair is uniquely annotated.
- While no "center bias" is apparent in the data, downstream applications requiring different image statistics may need to address this for compatibility.
7. Significance for Vision–Language Research
VISTA-Beyond constitutes a critical resource bridging linguistic and visual attention for the purpose of human-aligned model interpretability. By providing ground-truth triplets at the level of spatio-temporal attentional traces explicitly linked to linguistic content, it supports both algorithmic benchmarking and the development of novel interpretability or grounding techniques. A plausible implication is that future expansion in both annotator pool size and domain coverage could address current dataset limitations and further enable more generalizable insights in multimodal alignment research (Harshit et al., 2024).