VaseVQA-3D: 3D VQA Benchmark for Greek Vases

Updated 12 October 2025
  • VaseVQA-3D is a domain-specific benchmark offering 664 high-quality 3D models of ancient Greek vases paired with 4,460 structured question-answer pairs.
  • The dataset employs a rigorous pipeline combining ResNet-50, CLIP filtering, and TripoSG-based 2D-to-3D conversion to ensure archaeological authenticity.
  • The specialized VaseVLM model, optimized via supervised fine-tuning and reinforcement learning, shows a 12.8% improvement in Recall@1 over previous benchmarks.

VaseVQA-3D is a domain-specific benchmark for multimodal, 3D visual question answering (VQA) focused on ancient Greek pottery. It provides a curated collection of high-fidelity 3D vase models accompanied by structured question–answer pairs and enhanced descriptive captions, enabling robust evaluation and training of vision-language models (VLMs) in cultural heritage analysis. VaseVQA-3D establishes a comprehensive data construction pipeline and specialized model training methodologies, addressing the unique challenges posed by data scarcity, domain specificity, and the need for archaeologically faithful reasoning in automated artifact analysis (Zhang et al., 6 Oct 2025).

1. Dataset Composition and Pipeline

VaseVQA-3D consists of 664 high-quality 3D models of ancient Greek vases, annotated with a total of 4,460 visual question–answer pairs covering six key archaeological attributes: Fabric, Technique, Shape, Dating, Decoration, and Attribution. The dataset also includes an enhanced caption for each vase, merging curated archaeological metadata with GPT-4o–produced language refinements.
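The composition described above can be pictured as one record per vase. The following is a minimal sketch, not the dataset's actual schema: the class names, field names (`vase_id`, `mesh_path`), and the helper `mean_qa_per_model` are hypothetical; only the six attribute names and the counts 664 and 4,460 come from the text.

```python
from dataclasses import dataclass, field
from typing import List

# The six archaeological attributes named in the text.
ATTRIBUTES = ("Fabric", "Technique", "Shape", "Dating", "Decoration", "Attribution")


@dataclass
class QAPair:
    attribute: str  # expected to be one of ATTRIBUTES
    question: str
    answer: str

    def __post_init__(self) -> None:
        # Reject attributes outside the six-way taxonomy.
        if self.attribute not in ATTRIBUTES:
            raise ValueError(f"unknown attribute: {self.attribute!r}")


@dataclass
class VaseRecord:
    vase_id: str    # hypothetical identifier
    mesh_path: str  # hypothetical path to the 3D mesh file
    caption: str    # enhanced descriptive caption
    qa_pairs: List[QAPair] = field(default_factory=list)


def mean_qa_per_model(total_qa: int, total_models: int) -> float:
    """Average number of question-answer pairs per 3D model."""
    return total_qa / total_models


print(round(mean_qa_per_model(4460, 664), 2))
```

With the reported totals, this works out to roughly 6.7 question–answer pairs per model on average.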

The data construction pipeline employs several filtering stages originating from a raw corpus of over 30,000 2D images:

  • A ResNet-50 classifier removes low-quality (blurry or low-resolution) candidates.
  • CLIP-based semantic filtering eliminates fragmented objects using prompts ("a complete intact vase…").
  • CLIP-based view selection identifies optimal representative images.

After filtering, a subset of 3,880 images undergoes 2D-to-3D conversion using TripoSG, chosen for its geometric and semantic fidelity after benchmarking against alternatives such as Hunyuan3D. The resulting mesh models (GLB format) serve as the reference for VQA and caption generation, forming the core of the VaseVQA-3D resource.

| Stage | Technique | Purpose |
|---|---|---|
| Quality Filtering | ResNet-50 | Removes low-quality raw images |
| Fragment Removal | CLIP + Prompt | Ensures complete, unbroken vases |
| View Selection | CLIP | Picks best model perspectives |
| 2D-to-3D Conversion | TripoSG | Generates high-fidelity vase models |
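The filtering stages can be sketched as a chain of score thresholds followed by a best-view pick. This is an illustrative sketch only: the three scoring callables stand in for the ResNet-50 classifier and the CLIP models, the thresholds and the `vase_id` grouping key are hypothetical, and keeping exactly one view per vase is an assumption rather than something the source states. The TripoSG conversion step is not shown.

```python
from typing import Callable, Dict, Iterable


def filter_and_select(
    images: Iterable[dict],
    quality_score: Callable[[dict], float],  # stand-in for the ResNet-50 quality classifier
    intact_score: Callable[[dict], float],   # stand-in for CLIP similarity to "a complete intact vase"
    view_score: Callable[[dict], float],     # stand-in for the CLIP view-selection score
    quality_min: float = 0.5,                # hypothetical threshold
    intact_min: float = 0.5,                 # hypothetical threshold
) -> Dict[str, dict]:
    """Return the highest-scoring surviving view for each vase."""
    best: Dict[str, dict] = {}
    for img in images:
        # Stage 1: drop blurry or low-resolution candidates.
        if quality_score(img) < quality_min:
            continue
        # Stage 2: drop fragmented objects.
        if intact_score(img) < intact_min:
            continue
        # Stage 3: keep the most representative view per vase.
        vid = img["vase_id"]
        if vid not in best or view_score(img) > view_score(best[vid]):
            best[vid] = img
    return best
```

Passing the scorers in as callables keeps the sketch independent of any particular model library; in practice each would wrap a model forward pass.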

2. Addressed Challenges

VaseVQA-3D targets several substantive challenges unique to the cultural heritage domain:

  • Data Scarcity and Long-Tail Distribution: Ancient Greek pottery, being a narrow and culturally significant domain, lacks sufficient large-scale annotated 3D datasets. Generic VLMs poorly generalize to domain-specific terminology and forms.
  • Cultural and Archaeological Authenticity: The pipeline is tuned to preserve archaeological accuracy, retaining only complete, contextually meaningful vases and eliminating visual degradation artifacts common in museum or excavation sources.
  • 3D Understanding for Artifact Analysis: Most prior VQA datasets are strictly 2D; VaseVQA-3D’s QA pairs require spatial reasoning, geometric comprehension, and multi-modal fusion to correctly answer questions about 3D morphology and decorative detail.

3. Model Development: VaseVLM and RLVR

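The VaseVLM model is trained specifically for the analysis of ancient vases. The development protocol comprises two stages: supervised fine-tuning (SFT), followed by reinforcement learning with verifiable rewards (RLVR).

In the RLVR stage, the reward for each attribute $i$ is: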

$$
r_i = \begin{cases} \mathrm{sim}(g_i, t_i) & \text{if } \mathrm{sim}(g_i, t_i) \geq \tau \\ 0 & \text{otherwise} \end{cases}
$$

where $g_i$ and $t_i$ are the generated and target content for attribute $i$, $\mathrm{sim}(\cdot, \cdot)$ denotes cosine similarity, and the threshold is $\tau = 0.7$. The total reward aggregates the attribute scores with weights $w_i$ and applies penalties $P$ for inappropriate length, repetition, and irrelevance, as well as bonuses $B$ for sequence matching:

$$
R = \sum_{i} w_i r_i - P + B
$$

This approach specifically tailors policy optimization toward attributes frequently missed or misclassified during SFT.
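The reward described above can be written out in a few lines. The sketch below assumes that generated and target content have already been embedded as numeric vectors (the source does not say which embedding model is used) and treats the weights, penalty, and bonus as caller-supplied values; the function names are illustrative, not the authors' code.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def attribute_reward(g: Sequence[float], t: Sequence[float], tau: float = 0.7) -> float:
    """r_i = sim(g_i, t_i) if sim(g_i, t_i) >= tau, else 0."""
    s = cosine_similarity(g, t)
    return s if s >= tau else 0.0


def total_reward(
    generated: Dict[str, Sequence[float]],  # attribute name -> embedding of generated content
    target: Dict[str, Sequence[float]],     # attribute name -> embedding of target content
    weights: Dict[str, float],              # attribute name -> w_i (hypothetical values)
    penalty: float = 0.0,                   # P: length / repetition / irrelevance penalties
    bonus: float = 0.0,                     # B: sequence-matching bonus
    tau: float = 0.7,
) -> float:
    """R = sum_i w_i * r_i - P + B."""
    weighted = sum(
        w * attribute_reward(generated[name], target[name], tau)
        for name, w in weights.items()
    )
    return weighted - penalty + bonus
```

The hard threshold means a partially correct attribute below $\tau$ contributes nothing, so the policy is only rewarded for attribute content that is close to the target.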

4. Performance Evaluation and Metrics

VaseVLM, trained on VaseVQA-3D, improves on the prior state of the art. The evaluation uses metrics sensitive to both lexical and semantic correctness:

  • Recall@1 (R@1): Measures top-1 prediction accuracy for key attributes; VaseVLM shows a 12.8% improvement relative to the prior state of the art.
  • Lexical Similarity: Reflects mastery of domain-specific terminology; improvement of 6.6% reported.
  • Human Expert Judgement: Archaeologists assess the produced captions and answers for accuracy and appropriateness, with reinforcement-learning–trained models yielding demonstrably superior outcomes.
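The two automatic metrics can be illustrated with a short sketch. The source does not give exact definitions, so both functions below are assumptions: Recall@1 is computed as case-insensitive exact match of the top-ranked prediction, and lexical similarity uses the standard library's `difflib.SequenceMatcher` ratio as a stand-in for whatever measure the authors use.

```python
from difflib import SequenceMatcher
from typing import List, Sequence


def recall_at_1(ranked_predictions: Sequence[List[str]], targets: Sequence[str]) -> float:
    """Fraction of items whose top-ranked prediction matches the target.

    Matching is exact after stripping whitespace and lowercasing (an assumption).
    """
    if not targets:
        raise ValueError("targets must not be empty")
    hits = sum(
        1
        for ranked, target in zip(ranked_predictions, targets)
        if ranked and ranked[0].strip().lower() == target.strip().lower()
    )
    return hits / len(targets)


def lexical_similarity(generated: str, reference: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in, not the paper's metric."""
    return SequenceMatcher(None, generated.lower(), reference.lower()).ratio()


preds = [["red-figure", "black-figure"], ["amphora"], ["kylix"]]
gold = ["Red-Figure", "hydria", "kylix"]
print(recall_at_1(preds, gold))  # two of the three top-1 predictions match
```

Expert judgement, the third criterion listed above, has no automatic counterpart in this sketch.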

5. Methodological Connections and Comparative Context

VaseVQA-3D stands in methodological continuity with recent 3DQA datasets and multi-modal VQA research:

  • The dual-encoder and transformer-based fusion paradigms seen in 3DQA-TR (Ye et al., 2021) and advances in programmatic neural symbolic reasoning (Wang et al., 2023) motivate the multi-stage training and evaluation schema used for VaseVLM.
  • Space3D-Bench (Szymanska et al., 2024) illustrates the utility and necessity of diverse data modalities and balanced spatial taxonomy, both concepts mirrored and specialized in VaseVQA-3D’s focus on artifact-centric attribute questions.
  • The attribute-based taxonomy (Fabric, Technique, Shape, Dating, Decoration, Attribution) is rooted in expert analysis and evaluation protocols, expanding on prior dual-encoder appearance-geometry strategies which proved decisive for high-fidelity answer prediction in more generalized 3D scenarios.

6. Implications for Digital Heritage and Technical Pathways

VaseVQA-3D represents a technical advance in digital heritage research:

  • The integration of cutting-edge 3D modeling (TripoSG), semantic filtering, and multi-modal QA annotation sets a new standard for artifact-oriented evaluation.
  • RLVR-based reward engineering and its multi-dimensional verification protocol provide a template for future domain-specific extensions in areas where standard VQA metrics fail to capture expert-level accuracy.
  • The approach is adaptable to other cultural domains and interdisciplinary collaborations, where automated, scalable, and archaeologically faithful analysis is required for preservation, study, and dissemination.
  • The dataset and methods offer a foundation for benchmarking future 3D VLMs in specialized scientific and heritage contexts.

7. Future Directions and Open Research Problems

Current results demonstrate substantial progress but highlight salient open challenges in the field:

  • Scale and Coverage: Extending VaseVQA-3D to encompass other objects, periods, or cross-cultural artifacts remains an open challenge, principally limited by the labor-intensive annotation and faithful 3D reconstruction pipeline.
  • Spatial Reasoning Extensions: Incorporating compositional spatial reasoning, multi-object interactions, and higher-order historical inference remains an area for ongoing research.
  • Generalization and Transfer: The transferability of VaseVLM and RLVR frameworks to real-world (e.g., museum, field survey) settings and broader categories depends on continuing advances in 3D reconstruction and multi-modal modeling as well as richer, more diverse datasets.

A plausible implication is that the technical blueprint established by VaseVQA-3D—end-to-end pipeline, multi-stage filtering, RL-based verifiability, and semantic taxonomy conditioning—will underpin future developments in both the automated analysis of cultural heritage and in the more general domain of specialized 3D VQA, with rigorous, explainable benchmarks increasingly central to progress in this area.
