VaseVQA-3D: 3D VQA Benchmark for Greek Vases
- VaseVQA-3D is a domain-specific benchmark offering 664 high-quality 3D models of ancient Greek vases paired with 4,460 structured question-answer pairs.
- The dataset employs a rigorous pipeline combining ResNet-50, CLIP filtering, and TripoSG-based 2D-to-3D conversion to ensure archaeological authenticity.
- The specialized VaseVLM model, optimized via supervised fine-tuning and reinforcement learning, shows a 12.8% improvement in Recall@1 over the previous state-of-the-art.
VaseVQA-3D is a domain-specific benchmark for multimodal, 3D visual question answering (VQA) focused on ancient Greek pottery. It provides a curated collection of high-fidelity 3D vase models accompanied by structured question–answer pairs and enhanced descriptive captions, enabling robust evaluation and training of vision-language models (VLMs) in cultural heritage analysis. VaseVQA-3D establishes a comprehensive data construction pipeline and specialized model training methodologies, addressing the unique challenges posed by data scarcity, domain specificity, and the need for archaeologically faithful reasoning in automated artifact analysis (Zhang et al., 6 Oct 2025).
1. Dataset Composition and Pipeline
VaseVQA-3D consists of 664 high-quality 3D models of ancient Greek vases, annotated with a total of 4,460 visual question–answer pairs encompassing six key archaeological attributes: Fabric, Technique, Shape, Dating, Decoration, and Attribution. The dataset includes enhanced captions for each vase, merging curated archaeological metadata with GPT-4o–produced language refinements.
The data construction pipeline employs several filtering stages originating from a raw corpus of over 30,000 2D images:
- A ResNet-50 classifier removes low-quality (blurry or low-resolution) candidates.
- CLIP-based semantic filtering eliminates fragmented objects using prompts ("a complete intact vase…").
- CLIP-based view selection identifies optimal representative images.

After filtering, a subset of 3,880 images undergoes 2D-to-3D conversion using TripoSG, chosen for its geometric and semantic fidelity after benchmarking against alternatives such as Hunyuan3D. The resulting mesh models (GLB format) serve as the reference for VQA and caption generation, forming the core of the VaseVQA-3D resource.
| Stage | Technique | Purpose |
|---|---|---|
| Quality Filtering | ResNet-50 | Removes low-quality raw images |
| Fragment Removal | CLIP + Prompt | Ensures complete, unbroken vases |
| View Selection | CLIP | Picks best model perspectives |
| 2D-to-3D Conversion | TripoSG | Generates high-fidelity vase models |
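The staged filtering logic above can be sketched in a few lines. This is an illustrative skeleton only: the `quality_score`, `intactness_score`, and `view_score` fields stand in for real ResNet-50 and CLIP model outputs, and the thresholds and `"<vase>_<view>"` id convention are assumptions, not details from the paper.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    image_id: str             # assumed format "<vase>_<view>"
    quality_score: float      # stand-in for a ResNet-50 quality classifier output
    intactness_score: float   # stand-in for CLIP similarity to "a complete intact vase…"
    view_score: float         # stand-in for a CLIP-based view-quality score

def filter_pipeline(candidates, q_thresh=0.5, intact_thresh=0.5):
    """Mirror the three filtering stages; threshold values are illustrative."""
    # Stage 1: drop low-quality (blurry / low-resolution) images.
    kept = [c for c in candidates if c.quality_score >= q_thresh]
    # Stage 2: drop fragmented vases via semantic (CLIP-style) filtering.
    kept = [c for c in kept if c.intactness_score >= intact_thresh]
    # Stage 3: keep the best representative view per vase (highest view score).
    best = {}
    for c in kept:
        vase = c.image_id.split("_")[0]
        if vase not in best or c.view_score > best[vase].view_score:
            best[vase] = c
    return list(best.values())
```

Survivors of the first two stages are deduplicated per vase in the third, matching the one-representative-image-per-model selection that precedes TripoSG conversion.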
2. Addressed Challenges
VaseVQA-3D targets several substantive challenges unique to the cultural heritage domain:
- Data Scarcity and Long-Tail Distribution: Ancient Greek pottery, being a narrow and culturally significant domain, lacks sufficient large-scale annotated 3D datasets. Generic VLMs generalize poorly to domain-specific terminology and forms.
- Cultural and Archaeological Authenticity: The pipeline is tuned to preserve archaeological accuracy, retaining only complete, contextually meaningful vases and eliminating visual degradation artifacts common in museum or excavation sources.
- 3D Understanding for Artifact Analysis: Most prior VQA datasets are strictly 2D; VaseVQA-3D’s QA pairs require spatial reasoning, geometric comprehension, and multi-modal fusion to correctly answer questions about 3D morphology and decorative detail.
3. Model Development: VaseVLM and RLVR
The VaseVLM model is explicitly trained for ancient vase artifact analysis. The development protocol comprises two stages:
- Supervised Fine-Tuning (SFT): Foundation vision-language models (Qwen2.5-VL variants) undergo LoRA-based parameter-efficient tuning leveraging rendered 360-degree vase videos and the enhanced captions.
- Reinforcement Learning with Verifiable Rewards (RLVR): VaseVLM is further optimized using Group Relative Policy Optimization (GRPO) within the RLVR framework. This framework decomposes the archaeological caption into six attributes (Fabric, Technique, Shape, Dating, Decoration, Attribution), assigning semantic weights (Fabric/Technique/Decoration: 0.20 each; Shape/Dating: 0.15; Attribution: 0.10).
For each attribute $i$, the reward is

$$
r_i = \begin{cases} \operatorname{sim}(g_i, t_i), & \operatorname{sim}(g_i, t_i) \ge \tau \\ 0, & \text{otherwise,} \end{cases}
$$

where $g_i$ and $t_i$ are the generated and target content for attribute $i$, $\operatorname{sim}$ denotes cosine similarity, and $\tau$ is the similarity threshold. The total reward aggregates the weighted attribute scores and applies penalties for inappropriate length, repetition, and irrelevance, as well as a bonus for sequence matching:

$$
R = \sum_i w_i \, r_i + B_{\text{seq}} - P_{\text{len}} - P_{\text{rep}} - P_{\text{irr}}.
$$
This approach specifically tailors policy optimization toward attributes frequently missed or misclassified during SFT.
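The weighted-aggregation scheme can be sketched as follows. The attribute weights are the values reported in the text; the threshold `tau`, the scalar `penalties`/`bonus` arguments, and the function names are illustrative assumptions, since the paper's exact formulation and constants are not reproduced here.

```python
# Attribute weights as reported for the RLVR reward decomposition.
WEIGHTS = {
    "Fabric": 0.20, "Technique": 0.20, "Decoration": 0.20,
    "Shape": 0.15, "Dating": 0.15, "Attribution": 0.10,
}

def attribute_reward(sim: float, tau: float = 0.5) -> float:
    """Thresholded cosine-similarity reward for one attribute.
    The value of tau here is illustrative, not the paper's setting."""
    return sim if sim >= tau else 0.0

def total_reward(sims: dict, penalties: float = 0.0,
                 bonus: float = 0.0, tau: float = 0.5) -> float:
    """Weighted sum of per-attribute rewards, minus aggregate penalties
    (length, repetition, irrelevance) plus a sequence-matching bonus."""
    base = sum(WEIGHTS[a] * attribute_reward(s, tau) for a, s in sims.items())
    return base + bonus - penalties
```

Note that the weights sum to 1.0, so a caption matching every attribute perfectly (all similarities at 1.0, no penalties) yields a base reward of 1.0.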
4. Performance Evaluation and Metrics
VaseVLM, trained on VaseVQA-3D, demonstrates marked advances over prior state-of-the-art. The experimental evaluation uses metrics sensitive to both lexical and semantic correctness—in particular:
- Recall@1 (R@1): Measures top-1 prediction accuracy for key attributes; VaseVLM shows a 12.8% improvement relative to the previous state-of-the-art.
- Lexical Similarity: Reflects mastery of domain-specific terminology; improvement of 6.6% reported.
- Human Expert Judgement: Archaeologists assess the produced captions and answers for accuracy and appropriateness, with reinforcement-learning–trained models yielding demonstrably superior outcomes.
5. Methodological Connections and Comparative Context
VaseVQA-3D stands in methodological continuity with recent 3DQA datasets and multi-modal VQA research:
- The dual-encoder and transformer-based fusion paradigms seen in 3DQA-TR (Ye et al., 2021) and advances in programmatic neural symbolic reasoning (Wang et al., 2023) motivate the multi-stage training and evaluation schema used for VaseVLM.
- Space3D-Bench (Szymanska et al., 2024) illustrates the utility and necessity of diverse data modalities and balanced spatial taxonomy, both concepts mirrored and specialized in VaseVQA-3D’s focus on artifact-centric attribute questions.
- The attribute-based taxonomy (Fabric, Technique, Shape, Dating, Decoration, Attribution) is rooted in expert analysis and evaluation protocols, expanding on prior dual-encoder appearance-geometry strategies which proved decisive for high-fidelity answer prediction in more generalized 3D scenarios.
6. Implications for Digital Heritage and Technical Pathways
VaseVQA-3D represents a technical advance in digital heritage research:
- The integration of cutting-edge 3D modeling (TripoSG), semantic filtering, and multi-modal QA annotation sets a new standard for artifact-oriented evaluation.
- RLVR-based reward engineering and its multi-dimensional verification protocol provide a template for future domain-specific extensions in areas where standard VQA metrics fail to capture expert-level accuracy.
- The approach is adaptable to other cultural domains and interdisciplinary collaborations, where automated, scalable, and archaeologically faithful analysis is required for preservation, study, and dissemination.
- The dataset and methods offer a foundation for benchmarking future 3D VLMs in specialized scientific and heritage contexts.
7. Future Directions and Open Research Problems
Current results demonstrate substantial progress but highlight salient open challenges in the field:
- Scale and Coverage: Extending VaseVQA-3D to encompass other objects, periods, or cross-cultural artifacts remains an open challenge, principally limited by the labor-intensive annotation and faithful 3D reconstruction pipeline.
- Spatial Reasoning Extensions: Further incorporation of compositional spatial reasoning, multi-object interactions, and higher-order historical inference remain areas for ongoing research.
- Generalization and Transfer: The transferability of VaseVLM and RLVR frameworks to real-world (e.g., museum, field survey) settings and broader categories depends on continuing advances in 3D reconstruction and multi-modal modeling as well as richer, more diverse datasets.
A plausible implication is that the technical blueprint established by VaseVQA-3D—end-to-end pipeline, multi-stage filtering, RL-based verifiability, and semantic taxonomy conditioning—will underpin future developments in both the automated analysis of cultural heritage and in the more general domain of specialized 3D VQA, with rigorous, explainable benchmarks increasingly central to progress in this area.