VaseVQA-3D: 3D VQA Benchmark for Greek Vases
- VaseVQA-3D is a domain-specific benchmark offering 664 high-quality 3D models of ancient Greek vases paired with 4,460 structured question-answer pairs.
- The dataset employs a rigorous pipeline combining ResNet-50, CLIP filtering, and TripoSG-based 2D-to-3D conversion to ensure archaeological authenticity.
- The specialized VaseVLM model, optimized via supervised fine-tuning and reinforcement learning, shows a 12.8% improvement in Recall@1 over the previous state-of-the-art.
VaseVQA-3D is a domain-specific benchmark for multimodal, 3D visual question answering (VQA) focused on ancient Greek pottery. It provides a curated collection of high-fidelity 3D vase models accompanied by structured question–answer pairs and enhanced descriptive captions, enabling robust evaluation and training of vision-language models (VLMs) in cultural heritage analysis. VaseVQA-3D establishes a comprehensive data construction pipeline and specialized model training methodologies, addressing the unique challenges posed by data scarcity, domain specificity, and the need for archaeologically faithful reasoning in automated artifact analysis (Zhang et al., 6 Oct 2025).
1. Dataset Composition and Pipeline
VaseVQA-3D consists of 664 high-quality 3D models of ancient Greek vases, annotated with a total of 4,460 visual question–answer pairs encompassing six key archaeological attributes: Fabric, Technique, Shape, Dating, Decoration, and Attribution. The dataset includes enhanced captions for each vase, merging curated archaeological metadata with GPT-4o–produced language refinements.
The data construction pipeline employs several filtering stages originating from a raw corpus of over 30,000 2D images:
- A ResNet-50 classifier removes low-quality (blurry or low-resolution) candidates.
- CLIP-based semantic filtering eliminates fragmented objects using prompts ("a complete intact vase…").
- CLIP-based view selection identifies optimal representative images.

After filtering, a subset of 3,880 images undergoes 2D-to-3D conversion using TripoSG, chosen for its geometric and semantic fidelity after benchmarking against alternatives such as Hunyuan3D. The resulting mesh models (GLB format) serve as the reference for VQA and caption generation, forming the core of the VaseVQA-3D resource.
| Stage | Technique | Purpose |
|---|---|---|
| Quality Filtering | ResNet-50 | Removes low-quality raw images |
| Fragment Removal | CLIP + Prompt | Ensures complete, unbroken vases |
| View Selection | CLIP | Picks best model perspectives |
| 2D-to-3D Conversion | TripoSG | Generates high-fidelity vase models |
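The staged filtering logic above can be sketched in a few lines. This is an illustrative skeleton only: the `quality_score`, `intactness_score`, and `view_score` fields stand in for real ResNet-50 and CLIP model outputs, and the thresholds and `"<vase>_<view>"` id convention are assumptions, not details from the paper.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    image_id: str             # assumed format "<vase>_<view>"
    quality_score: float      # stand-in for a ResNet-50 quality classifier output
    intactness_score: float   # stand-in for CLIP similarity to "a complete intact vase…"
    view_score: float         # stand-in for a CLIP-based view-quality score

def filter_pipeline(candidates, q_thresh=0.5, intact_thresh=0.5):
    """Mirror the three filtering stages; threshold values are illustrative."""
    # Stage 1: drop low-quality (blurry / low-resolution) images.
    kept = [c for c in candidates if c.quality_score >= q_thresh]
    # Stage 2: drop fragmented vases via semantic (CLIP-style) filtering.
    kept = [c for c in kept if c.intactness_score >= intact_thresh]
    # Stage 3: keep the best representative view per vase (highest view score).
    best = {}
    for c in kept:
        vase = c.image_id.split("_")[0]
        if vase not in best or c.view_score > best[vase].view_score:
            best[vase] = c
    return list(best.values())
```

Survivors of the first two stages are deduplicated per vase in the third, matching the one-representative-image-per-model selection that precedes TripoSG conversion.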
2. Addressed Challenges
VaseVQA-3D targets several substantive challenges unique to the cultural heritage domain:
- Data Scarcity and Long-Tail Distribution: Ancient Greek pottery, being a narrow and culturally significant domain, lacks sufficient large-scale annotated 3D datasets. Generic VLMs generalize poorly to domain-specific terminology and forms.
- Cultural and Archaeological Authenticity: The pipeline is tuned to preserve archaeological accuracy, retaining only complete, contextually meaningful vases and eliminating visual degradation artifacts common in museum or excavation sources.
- 3D Understanding for Artifact Analysis: Most prior VQA datasets are strictly 2D; VaseVQA-3D’s QA pairs require spatial reasoning, geometric comprehension, and multi-modal fusion to correctly answer questions about 3D morphology and decorative detail.
3. Model Development: VaseVLM and RLVR
The VaseVLM model is explicitly trained for ancient vase artifact analysis. The development protocol comprises two stages:
- Supervised Fine-Tuning (SFT): Foundation vision-language models (Qwen2.5-VL variants) undergo LoRA-based parameter-efficient tuning leveraging rendered 360-degree vase videos and the enhanced captions.
- Reinforcement Learning with Verifiable Rewards (RLVR): VaseVLM is further optimized using Group Relative Policy Optimization (GRPO) within the RLVR framework. This framework decomposes the archaeological caption into six attributes (Fabric, Technique, Shape, Dating, Decoration, Attribution), assigning semantic weights (Fabric/Technique/Decoration: 0.20 each; Shape/Dating: 0.15; Attribution: 0.10).
For each attribute $i$, the reward is

$$
r_i = \begin{cases} \operatorname{sim}(g_i, t_i), & \operatorname{sim}(g_i, t_i) \ge \tau \\ 0, & \text{otherwise,} \end{cases}
$$

where $g_i$ and $t_i$ are the generated and target content for attribute $i$, $\operatorname{sim}$ denotes cosine similarity, and $\tau$ is the similarity threshold. The total reward aggregates the weighted attribute scores and applies penalties for inappropriate length, repetition, and irrelevance, as well as a bonus for sequence matching:

$$
R = \sum_i w_i \, r_i + B_{\text{seq}} - P_{\text{len}} - P_{\text{rep}} - P_{\text{irr}}.
$$
This approach specifically tailors policy optimization toward attributes frequently missed or misclassified during SFT.
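The weighted-aggregation scheme can be sketched as follows. The attribute weights are the values reported in the text; the threshold `tau`, the scalar `penalties`/`bonus` arguments, and the function names are illustrative assumptions, since the paper's exact formulation and constants are not reproduced here.

```python
# Attribute weights as reported for the RLVR reward decomposition.
WEIGHTS = {
    "Fabric": 0.20, "Technique": 0.20, "Decoration": 0.20,
    "Shape": 0.15, "Dating": 0.15, "Attribution": 0.10,
}

def attribute_reward(sim: float, tau: float = 0.5) -> float:
    """Thresholded cosine-similarity reward for one attribute.
    The value of tau here is illustrative, not the paper's setting."""
    return sim if sim >= tau else 0.0

def total_reward(sims: dict, penalties: float = 0.0,
                 bonus: float = 0.0, tau: float = 0.5) -> float:
    """Weighted sum of per-attribute rewards, minus aggregate penalties
    (length, repetition, irrelevance) plus a sequence-matching bonus."""
    base = sum(WEIGHTS[a] * attribute_reward(s, tau) for a, s in sims.items())
    return base + bonus - penalties
```

Note that the weights sum to 1.0, so a caption matching every attribute perfectly (all similarities at 1.0, no penalties) yields a base reward of 1.0.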
4. Performance Evaluation and Metrics
VaseVLM, trained on VaseVQA-3D, demonstrates marked advances over prior state-of-the-art. The experimental evaluation uses metrics sensitive to both lexical and semantic correctness—in particular:
- Recall@1 (R@1): Measures top-1 prediction accuracy for key attributes; VaseVLM shows a 12.8% improvement relative to the previous state-of-the-art.
- Lexical Similarity: Reflects mastery of domain-specific terminology; improvement of 6.6% reported.
- Human Expert Judgement: Archaeologists assess the produced captions and answers for accuracy and appropriateness, with reinforcement-learning–trained models yielding demonstrably superior outcomes.
5. Methodological Connections and Comparative Context
VaseVQA-3D stands in methodological continuity with recent 3DQA datasets and multi-modal VQA research:
- The dual-encoder and transformer-based fusion paradigms seen in 3DQA-TR (Ye et al., 2021) and advances in programmatic neural symbolic reasoning (Wang et al., 2023) motivate the multi-stage training and evaluation schema used for VaseVLM.
- Space3D-Bench (Szymanska et al., 2024) illustrates the utility and necessity of diverse data modalities and balanced spatial taxonomy, both concepts mirrored and specialized in VaseVQA-3D’s focus on artifact-centric attribute questions.
- The attribute-based taxonomy (Fabric, Technique, Shape, Dating, Decoration, Attribution) is rooted in expert analysis and evaluation protocols, expanding on prior dual-encoder appearance-geometry strategies which proved decisive for high-fidelity answer prediction in more generalized 3D scenarios.
6. Implications for Digital Heritage and Technical Pathways
VaseVQA-3D represents a technical advance in digital heritage research:
- The integration of cutting-edge 3D modeling (TripoSG), semantic filtering, and multi-modal QA annotation sets a new standard for artifact-oriented evaluation.
- RLVR-based reward engineering and its multi-dimensional verification protocol provide a template for future domain-specific extensions in areas where standard VQA metrics fail to capture expert-level accuracy.
- The approach is adaptable to other cultural domains and interdisciplinary collaborations, where automated, scalable, and archaeologically faithful analysis is required for preservation, study, and dissemination.
- The dataset and methods offer a foundation for benchmarking future 3D VLMs in specialized scientific and heritage contexts.
7. Future Directions and Open Research Problems
Current results demonstrate substantial progress but highlight salient open challenges in the field:
- Scale and Coverage: Extending VaseVQA-3D to encompass other objects, periods, or cross-cultural artifacts remains an open challenge, principally limited by the labor-intensive annotation and faithful 3D reconstruction pipeline.
- Spatial Reasoning Extensions: Further incorporation of compositional spatial reasoning, multi-object interactions, and higher-order historical inference remain areas for ongoing research.
- Generalization and Transfer: The transferability of VaseVLM and RLVR frameworks to real-world (e.g., museum, field survey) settings and broader categories depends on continuing advances in 3D reconstruction and multi-modal modeling as well as richer, more diverse datasets.
A plausible implication is that the technical blueprint established by VaseVQA-3D—end-to-end pipeline, multi-stage filtering, RL-based verifiability, and semantic taxonomy conditioning—will underpin future developments in both the automated analysis of cultural heritage and in the more general domain of specialized 3D VQA, with rigorous, explainable benchmarks increasingly central to progress in this area.