
RoViST: Visual Storytelling Evaluation Metric

Updated 25 January 2026
  • RoViST is a reference-free metric that evaluates visual storytelling by quantifying visual grounding, narrative coherence, and non-redundancy.
  • It leverages fixed backbones like CLIP-ViT, GloVe, and ALBERT to compute sub-scores, aligning automated evaluation closely with human judgment.
  • Experimental results show that RoViST effectively ranks multimodal storytelling models with reduced human–machine evaluation gaps.

RoViST is a reference-free evaluation metric for the visual storytelling task, designed to assess generated narratives grounded in sequences of images. Distinct from traditional n-gram based textual metrics, RoViST directly evaluates the degree to which a story is visually grounded, coherent across sentence transitions, and non-redundant. These dimensions are explicitly encoded as sub-scores: Visual Grounding (RoViST-VG, G), Coherence (RoViST-C, C), and Non-Redundancy (RoViST-NR, NR). RoViST’s design aligns with the unique demands of multimodal generation, offering an automated yet human-aligned approach to narrative evaluation in the context of computer vision and natural language processing (Gado et al., 27 Apr 2025).

1. Formal Structure and Sub-metrics

RoViST comprises three sub-scores, each quantifying an essential narrative quality:

  • Visual Grounding (G): Measures the semantic alignment between story nouns and salient visual regions in the associated images. For each noun n in sentence s, mapped to a word embedding v_n, the maximum cosine similarity is computed against all region embeddings u_r (obtained from a Vision Transformer backbone, e.g., CLIP ViT-L/14). G is the average of these maximum similarities across all nouns in all sentences:

G = \frac{1}{\sum_{s=1}^{S} |N_s|} \sum_{s=1}^{S} \sum_{n \in N_s} \max_{r \in R_s} \frac{v_n^\top u_r}{\|v_n\| \, \|u_r\|}

  • Coherence (C): Assesses narrative consistency across sentence transitions. A next-sentence-prediction model (e.g., ALBERT, fine-tuned on sequential storytelling corpora) estimates the conditional probability P(s_{i+1} \mid s_i) for each adjacent sentence pair. C is the mean conditional probability over the story:

C = \frac{1}{S-1} \sum_{i=1}^{S-1} P(s_{i+1} \mid s_i)

  • Non-Redundancy (NR): Quantifies the lexical overlap between story sentences via Jaccard similarity. For distinct sentence pairs (s_i, s_j) with unigram sets U_i and U_j:

J(s_i, s_j) = \frac{|U_i \cap U_j|}{|U_i \cup U_j|}

The redundancy is the average of J(s_i, s_j) across all distinct sentence pairs, and NR is defined as one minus this average:

NR = 1 - \frac{2}{S(S-1)} \sum_{i < j} J(s_i, s_j)
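The non-redundancy computation can be sketched in plain Python. This is a minimal illustration; the stop-word list below is a small placeholder, not the lexicon used in the paper:

```python
from itertools import combinations

STOP_WORDS = {"the", "a", "an", "and", "is", "was", "to", "of"}  # placeholder list

def unigram_set(sentence):
    """Lowercased, stop-word-stripped unigram set for one sentence."""
    return {w for w in sentence.lower().split() if w not in STOP_WORDS}

def jaccard(u_i, u_j):
    """Jaccard similarity of two unigram sets (0 if both are empty)."""
    union = u_i | u_j
    return len(u_i & u_j) / len(union) if union else 0.0

def non_redundancy(sentences):
    """NR = 1 - mean pairwise Jaccard similarity over all sentence pairs."""
    sets = [unigram_set(s) for s in sentences]
    pairs = list(combinations(sets, 2))
    r_bar = sum(jaccard(a, b) for a, b in pairs) / len(pairs)
    return 1.0 - r_bar
```

A story that repeats a sentence verbatim is penalized: the repeated pair contributes a Jaccard similarity of 1, pulling NR down toward 0.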

The overall RoViST score is given by an equally-weighted mean of these three axes:

\text{RoViST} = \frac{1}{3}\,(G + C + NR)
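The equally weighted aggregation is a plain arithmetic mean (a trivial helper; the argument names are illustrative):

```python
def rovist(g, c, nr):
    """Overall RoViST score: equally weighted mean of the three sub-scores."""
    return (g + c + nr) / 3.0
```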

2. Algorithm and Implementation

RoViST’s workflow can be operationalized as follows:

  1. Image Feature Extraction: For each image I_s, extract region embeddings {u_r} using a fixed, pre-trained ViT model (e.g., CLIP ViT-L/14, 768-d features).
  2. Text Feature Computation: For each sentence s in the story, apply tokenization and POS-tagging to extract the noun set N_s. Each noun n is embedded via a static word vector (typically 300-d GloVe).
  3. Visual Grounding Calculation: For each noun n in N_s, compute the cosine similarity with all region embeddings u_r from the corresponding image, and select the maximum. These maxima are averaged over all nouns to produce G.
  4. Coherence Calculation: Input the story into a fine-tuned ALBERT model. For i = 1 to S-1, compute P(s_{i+1} | s_i); average these probabilities to yield C.
  5. Non-Redundancy Calculation: Form lowercased, stop-word-stripped unigram sets U_i for each sentence. Compute pairwise Jaccard similarities, average them, and set NR to one minus this average.
  6. Aggregation: Compute the RoViST score as the mean of G, C, and NR.
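The grounding step above (the per-noun maximum cosine similarity over regions) can be sketched with NumPy. This minimal sketch assumes the noun and region embeddings have already been projected into a shared space; in practice GloVe vectors (300-d) and ViT features (768-d) differ in dimensionality, a detail the sketch glosses over:

```python
import numpy as np

def visual_grounding(noun_vecs_per_sentence, region_vecs_per_image):
    """G: mean over all nouns of the max cosine similarity to any
    region embedding of the corresponding image.

    noun_vecs_per_sentence: list of (num_nouns, d) arrays, one per sentence
    region_vecs_per_image:  list of (num_regions, d) arrays, aligned 1:1
    """
    sims = []
    for nouns, regions in zip(noun_vecs_per_sentence, region_vecs_per_image):
        # Normalize rows so dot products become cosine similarities.
        n = nouns / np.linalg.norm(nouns, axis=1, keepdims=True)
        r = regions / np.linalg.norm(regions, axis=1, keepdims=True)
        cos = n @ r.T                 # (num_nouns, num_regions)
        sims.extend(cos.max(axis=1))  # best-matching region per noun
    return float(np.mean(sims))
```

Taking the maximum over regions (rather than the mean) rewards a noun for matching any one salient region, which is what the grounding sub-score is after.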

Hyperparameters and core model settings (as employed in (Gado et al., 27 Apr 2025)):

  • Visual encoder: CLIP ViT-L/14, frozen
  • Noun embeddings: 300-d GloVe (Twitter or Common Crawl)
  • Coherence model: ALBERT (base), fine-tuned (batch size 16, 3 epochs)
  • Jaccard: unigrams, case-insensitive, stop-words removed

3. Experimental Evaluation and Human Alignment

RoViST was used in (Gado et al., 27 Apr 2025) to evaluate leading visual storytelling models on a 900-example VIST test subset. Sub-metric and aggregate scores were all “higher is better.” Key results (Table 1 reproduced):

Model          G        C        NR
AREL           0.6001   0.5692   0.8325
GLACNET        0.5158   0.6875   0.9506
KG-Story       0.7325   0.6493   0.9991
MCSM+BART      0.8648   0.6651   0.8999
VIST-GPT v1    0.9401   0.7495   0.8821
VIST-GPT v2    0.9962   0.7837   0.9301

Notably, VIST-GPT v2 achieves the highest Visual-Grounding and Coherence, while KG-Story yields a near-perfect Non-Redundancy score. VIST-GPT v2 provides the most balanced performance across axes.

To evaluate metric validity, a human–machine distance D_HM was introduced:

D_{HM} = \left| R_{\text{human}} - R_{\text{model}} \right|

where R_{\text{human}} is the aggregate RoViST score of the human-written stories and R_{\text{model}} that of the model-generated stories.
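Read as an absolute gap between aggregate scores (one plausible reading; the exact form and the example numbers below are illustrative, not taken from the paper), the distance is a one-liner:

```python
def human_machine_distance(r_human, r_model):
    """Absolute gap between human-story and model-story RoViST scores."""
    return abs(r_human - r_model)
```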

Empirical values (lower is better):

Model          D_HM
AREL           0.2403
GLACNET        0.1896
KG-Story       0.1457
MCSM+BART      0.0976
VIST-GPT v1    0.0546
VIST-GPT v2    0.0459

VIST-GPT v2 achieves the lowest human–machine gap, supporting the claim that RoViST aligns more closely with human assessment than prior metrics. The authors do not report direct correlations between RoViST and n-gram metrics or subjective ratings, asserting only that n-gram-based measures correlate poorly in this context.

4. Design Rationale and Practical Considerations

Traditional sequence-level metrics (e.g., BLEU, METEOR, ROUGE, CIDEr) were found inadequate for evaluating visual storytelling because they do not measure grounding, narrative logic, or redundancy. RoViST was designed to address these gaps, leveraging both vision-language modeling and advances in contextual sentence modeling.

Notable features:

  • Reference-Free: RoViST requires only the image sequence and generated story, without reliance on ground-truth narrative references.
  • Frozen Vision and Language Backbones: No model-specific fine-tuning is required to apply RoViST, as all employed encoders (CLIP-ViT, GloVe, ALBERT) are fixed during evaluation.
  • Sentence-Level Analysis: The metric explicitly models sentence-to-image and sentence-to-sentence relationships, moving beyond surface-level text features.

Implementation in (Gado et al., 27 Apr 2025) used the following default settings: ViT region encoder frozen; word embeddings fixed; ALBERT coherence model fine-tuned on relevant corpora.

5. Limitations and Comparative Context

While RoViST’s multi-axis reference-free evaluation introduces rigor to visual storytelling, it is fundamentally tied to the properties of its underlying encoders and to dataset biases. The value of G depends on the expressiveness and coverage of both the noun-embedding lexicon and the ViT region representations. C reflects the ability of the fine-tuned ALBERT model to proxy narrative plausibility, which may be sensitive to domain transfer. NR relies on unigram-level overlap and may not capture subtler repetitive structures.

No direct pre-registered correlation with human qualitative ratings or systematic ablation of metric sensitivity is reported in (Gado et al., 27 Apr 2025). However, the low human–machine distance values and the consistent ranking of SOTA models suggest practical utility.

A plausible implication is that as reference-free and multimodal generation expand, metrics in the style of RoViST will become increasingly central, both for benchmarking and for model-selection in generative vision-language research.

6. Relationship to Broader Multimodal Evaluation

RoViST’s focus on region grounding and narrative axes parallels the broader shift toward embedding-based and model-based metrics in vision–language domains. Similar methodological patterns appear in video captioning, image paragraph generation, and visual question answering, emphasizing semantic entailment and grounding over string-based overlap.

RoViST’s reliance on standard feature extractors (CLIP ViT-L/14, GloVe, and ALBERT, all widely adopted in vision-language tasks) ensures reproducibility and facilitates cross-model benchmarking. The metric, as adopted in (Gado et al., 27 Apr 2025), exemplifies a principled approach to multimodal evaluation that aligns with current trends in both the NLP and computer vision communities.
