RoViST: Visual Storytelling Evaluation Metric
- RoViST is a reference-free metric that evaluates visual storytelling by quantifying visual grounding, narrative coherence, and non-redundancy.
- It leverages fixed backbones like CLIP-ViT, GLoVe, and ALBERT to compute sub-scores, aligning automated evaluation closely with human judgment.
- Experimental results show that RoViST effectively ranks multimodal storytelling models with reduced human–machine evaluation gaps.
RoViST is a reference-free evaluation metric for the visual storytelling task, designed to assess generated narratives grounded in sequences of images. Unlike traditional n-gram-based textual metrics, RoViST directly evaluates the degree to which a story is visually grounded, coherent across sentence transitions, and non-redundant. These dimensions are explicitly encoded as sub-scores: Visual-Grounding (RoViST-VG), Coherence (RoViST-C), and Non-Redundancy (RoViST-NR). RoViST’s design aligns with the unique demands of multimodal generation, offering an automated yet human-aligned approach to narrative evaluation at the intersection of computer vision and natural language processing (Gado et al., 27 Apr 2025).
1. Formal Structure and Sub-metrics
RoViST comprises three sub-scores, each quantifying an essential narrative quality:
- Visual-Grounding ($\mathrm{VG}$): Measures the semantic alignment between story nouns and salient visual regions in the associated images. For each noun $w_j$ in sentence $s_i$, mapped to a word embedding $e_{w_j}$, the maximum cosine similarity is computed against all region embeddings $r_k$ (obtained from a Vision Transformer backbone, e.g., CLIP ViT-L/14). $\mathrm{VG}$ is the average of these maximum similarities across all $N$ nouns in all sentences:

$$\mathrm{VG} = \frac{1}{N} \sum_{j=1}^{N} \max_{k} \cos\!\left(e_{w_j}, r_k\right)$$
- Coherence ($\mathrm{C}$): Assesses narrative consistency across sentence transitions. A next-sentence-prediction model (e.g., ALBERT, fine-tuned on sequential storytelling corpora) estimates $P(s_t \mid s_{t-1})$ for each consecutive sentence pair. $\mathrm{C}$ is the mean conditional probability over the $T$ sentences of the story:

$$\mathrm{C} = \frac{1}{T-1} \sum_{t=2}^{T} P(s_t \mid s_{t-1})$$
- Non-Redundancy ($\mathrm{NR}$): Quantifies the lexical overlap between story sentences via Jaccard similarity. For distinct sentence pairs $(s_i, s_j)$ with unigram sets $U_i, U_j$:

$$J(s_i, s_j) = \frac{|U_i \cap U_j|}{|U_i \cup U_j|}$$

The redundancy $R$ is the average $J(s_i, s_j)$ across all sentence pairs, and $\mathrm{NR}$ is defined as $1 - R$:

$$\mathrm{NR} = 1 - \frac{2}{T(T-1)} \sum_{i < j} J(s_i, s_j)$$
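The non-redundancy computation can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions: whitespace tokenization and a toy stop-word list stand in for the paper's actual preprocessing.

```python
from itertools import combinations

STOP_WORDS = {"the", "a", "an", "and", "is", "was", "to", "of"}  # toy list

def unigram_set(sentence: str) -> set:
    """Lowercased, stop-word-stripped unigram set for one sentence."""
    return {w for w in sentence.lower().split() if w not in STOP_WORDS}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| (0.0 if both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def non_redundancy(sentences: list) -> float:
    """NR = 1 - mean pairwise Jaccard similarity over all sentence pairs."""
    sets = [unigram_set(s) for s in sentences]
    pairs = list(combinations(range(len(sets)), 2))
    r = sum(jaccard(sets[i], sets[j]) for i, j in pairs) / len(pairs)
    return 1.0 - r
```

A story that repeats the same sentence scores 0.0 on this axis, while fully disjoint sentences score 1.0, matching the "higher is better" orientation of the metric.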
The overall RoViST score is given by an equally-weighted mean of these three axes:
$$\mathrm{RoViST} = \frac{\mathrm{VG} + \mathrm{C} + \mathrm{NR}}{3}$$
2. Algorithm and Implementation
RoViST’s workflow can be operationalized as follows:
- Image Feature Extraction: For each image $I_i$, extract region embeddings $\{r_k\}$ using a fixed, pre-trained ViT model (e.g., CLIP ViT-L/14, 768-d features).
- Text Feature Computation: For each sentence $s_i$ in the story, apply tokenization and POS-tagging to extract nouns $\{w_j\}$. Each $w_j$ is embedded via a static word vector (typically 300-d GLoVe).
- Visual Grounding Calculation: For each noun $w_j$ in $s_i$, compute the cosine similarity with all region embeddings $r_k$ from the corresponding image, and select the maximum. These maxima are averaged over all nouns to produce $\mathrm{VG}$.
- Coherence Calculation: Input the story into a fine-tuned ALBERT model. For $t = 2$ to $T$, compute $P(s_t \mid s_{t-1})$, and average to yield $\mathrm{C}$.
- Non-Redundancy Calculation: Form lowercased, stop-word-stripped unigram sets $U_i$ for each sentence. Compute pairwise Jaccard similarity, average to obtain $R$, and set $\mathrm{NR} = 1 - R$.
- Aggregation: Compute the overall RoViST score as the mean of $(\mathrm{VG}, \mathrm{C}, \mathrm{NR})$.
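The steps above can be sketched end to end in Python. This is an illustrative skeleton, not the reference implementation: embeddings are assumed to be precomputed (by CLIP-ViT for regions, GLoVe for nouns), and `next_sentence_prob` is a placeholder callable where a fine-tuned ALBERT scorer would plug in; only the score arithmetic reflects the metric itself.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def visual_grounding(noun_embs, region_embs):
    """VG: mean over nouns of the max cosine similarity to any image region."""
    return sum(max(cosine(n, r) for r in region_embs)
               for n in noun_embs) / len(noun_embs)

def coherence(sentences, next_sentence_prob):
    """C: mean P(s_t | s_{t-1}) over consecutive sentence pairs."""
    probs = [next_sentence_prob(sentences[t - 1], sentences[t])
             for t in range(1, len(sentences))]
    return sum(probs) / len(probs)

def rovist(vg, c, nr):
    """Aggregate RoViST score: equally weighted mean of the three axes."""
    return (vg + c + nr) / 3.0
```

Substituting real CLIP ViT-L/14 region features, GLoVe noun vectors, and an ALBERT next-sentence scorer into these functions reproduces the sub-score arithmetic described above.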
Hyperparameters and core model settings (as employed in (Gado et al., 27 Apr 2025)):
- Visual encoder: CLIP ViT-L/14, frozen
- Noun embeddings: 300-d GLoVe (Twitter or CommonCrawl)
- Coherence model: ALBERT (base), learning rate 2e-5, batch size 16, 3 epochs
- Jaccard: unigrams, case-insensitive, stop-words removed
3. Experimental Evaluation and Human Alignment
RoViST was used in (Gado et al., 27 Apr 2025) to evaluate leading visual storytelling models on a 900-example VIST test subset. Sub-metric and aggregate scores were all “higher is better.” Key results (Table 1 reproduced):
| Model | RoViST-VG | RoViST-C | RoViST-NR |
|---|---|---|---|
| AREL | 0.6001 | 0.5692 | 0.8325 |
| GLACNET | 0.5158 | 0.6875 | 0.9506 |
| KG-Story | 0.7325 | 0.6493 | 0.9991 |
| MCSM+BART | 0.8648 | 0.6651 | 0.8999 |
| VIST-GPT v1 | 0.9401 | 0.7495 | 0.8821 |
| VIST-GPT v2 | 0.9962 | 0.7837 | 0.9301 |
Notably, VIST-GPT v2 achieves the highest Visual-Grounding and Coherence, while KG-Story yields a near-perfect Non-Redundancy score. VIST-GPT v2 provides the most balanced performance across axes.
To evaluate metric validity, a Human–Machine Distance $D_{\mathrm{HM}}$ was introduced:

$$D_{\mathrm{HM}} = \left| S_{\mathrm{human}} - S_{\mathrm{model}} \right|$$

where $S_{\mathrm{human}}$ denotes RoViST scores computed on human-written stories and $S_{\mathrm{model}}$ the scores of the corresponding model outputs.
Empirical values (lower is better):
| Model | $D_{\mathrm{HM}}$ |
|---|---|
| AREL | 0.2403 |
| GLACNET | 0.1896 |
| KG-Story | 0.1457 |
| MCSM+BART | 0.0976 |
| VIST-GPT v1 | 0.0546 |
| VIST-GPT v2 | 0.0459 |
VIST-GPT v2 achieves the lowest human–machine gap, supporting the claim that RoViST aligns more closely with human assessment than prior metrics. Direct correlations between RoViST and n-gram metrics or subjective ratings are not reported; the authors assert only that n-gram-based measures correlate poorly in this context.
4. Design Rationale and Practical Considerations
Traditional sequence-level metrics (e.g., BLEU, METEOR, ROUGE, CIDEr) were found inadequate for evaluating visual storytelling because they do not measure grounding, narrative logicality, or redundancy. RoViST was designed to address these gaps, leveraging both vision-language modeling and advances in contextual sentence modeling.
Notable features:
- Reference-Free: RoViST requires only the image sequence and generated story, without reliance on ground-truth narrative references.
- Frozen Vision and Language Backbones: No model-specific finetuning is required to apply RoViST, as all employed encoders (CLIP-ViT, GLoVe, ALBERT) are fixed during evaluation.
- Sentence-Level Analysis: The metric explicitly models sentence-to-image and sentence-to-sentence relationships, moving beyond surface-level text features.
Implementation in (Gado et al., 27 Apr 2025) used the following default settings: ViT region encoder frozen; word embeddings fixed; ALBERT coherence model fine-tuned on relevant corpora.
5. Limitations and Comparative Context
While RoViST’s multi-axis reference-free evaluation introduces rigor to visual storytelling, it is fundamentally linked to the properties of its underlying encoders and dataset biases. The value of $\mathrm{VG}$ depends on the expressiveness and coverage of both the noun embedding lexicon and the ViT region representations. $\mathrm{C}$ reflects the ability of the fine-tuned ALBERT model to proxy narrative plausibility, which may be sensitive to domain transfer. $\mathrm{NR}$ relies on unigram-level overlap and may not capture subtler repetitive structures.
No direct pre-registered correlation with human qualitative ratings or systematic ablation on metric sensitivity is reported in (Gado et al., 27 Apr 2025). However, the low values of $D_{\mathrm{HM}}$ and the consistent ranking of SOTA models suggest practical utility.
A plausible implication is that as reference-free and multimodal generation expand, metrics in the style of RoViST will become increasingly central, both for benchmarking and for model-selection in generative vision-language research.
6. Relationship to Broader Multimodal Evaluation
RoViST’s focus on region grounding and narrative axes parallels the broader shift toward embedding-based and model-based metrics in vision–language domains. Similar methodological patterns appear in video captioning, image paragraph generation, and visual question answering, emphasizing semantic entailment and grounding over string-based overlap.
RoViST’s reliance on standard feature extractors (CLIP ViT-L/14, GLoVe, ALBERT–all widely adopted in vision-language tasks) ensures reproducibility and facilitates cross-model benchmarking. The metric, as adopted in (Gado et al., 27 Apr 2025), exemplifies a principled approach to multimodal evaluation that aligns with current trends in both NLP and computer vision communities.