
ComiCap: A VLMs pipeline for dense captioning of Comic Panels

Published 24 Sep 2024 in cs.CV (arXiv:2409.16159v1)

Abstract: The comic domain is rapidly advancing with the development of single- and multi-page analysis and synthesis models. Recent benchmarks and datasets have been introduced to support and assess models' capabilities in tasks such as detection (panels, characters, text), linking (character re-identification and speaker identification), and analysis of comic elements (e.g., dialog transcription). However, to provide a comprehensive understanding of the storyline, a model must not only extract elements but also understand their relationships and generate highly informative captions. In this work, we propose a pipeline that leverages Vision-Language Models (VLMs) to obtain dense, grounded captions. To construct our pipeline, we introduce an attribute-retaining metric that assesses whether all important attributes are identified in the caption. Additionally, we created a densely annotated test set to fairly evaluate open-source VLMs and select the best captioning model according to our metric. Our pipeline generates dense captions with bounding boxes that are quantitatively and qualitatively superior to those produced by specifically trained models, without requiring any additional training. Using this pipeline, we annotated over 2 million panels across 13,000 books, which will be available on the project page https://github.com/emanuelevivoli/ComiCap.


Summary

  • The paper introduces a novel VLM-based pipeline that uses a custom Attribute Retaining Metric (ARM) to generate detailed, attribute-rich captions for comic panels.
  • The methodology integrates multiple models, including MiniCPM and GroundingDINO, to enhance caption accuracy and ensure proper text grounding.
  • The release of the extensive ComiCap dataset significantly advances comic analysis and improves accessibility for visually impaired readers.

ComiCap: A VLMs Pipeline for Dense Captioning of Comic Panels

The paper "ComiCap: A VLMs Pipeline for Dense Captioning of Comic Panels" presents a significant advancement in the domain of comic book analysis. The authors introduce a comprehensive pipeline leveraging Vision-Language Models (VLMs) to generate dense and grounded captions for comic panels. This research addresses existing gaps in comic book analysis by providing a detailed contextual understanding of comic scenes, which is particularly beneficial for People with Visual Impairments (PVI).

Introduction

The study of comics poses unique challenges due to the complex interplay of visual and textual elements that convey the story. While humans can easily understand these media, computational methods often fall short, especially in providing context to PVI. Previous research has primarily focused on text extraction and dialog transcription, neglecting the crucial context provided by the visual elements. To bridge this gap, the paper proposes a novel pipeline that utilizes VLMs to generate captions enriched with detailed descriptions and bounding boxes.

Methodology

Captioning Models

The authors employ several state-of-the-art VLMs for caption generation, including PaliGemma, Idefics2, Florence2, and MiniCPM. Each model has undergone unique training regimens, which include combinations of extensive datasets and sophisticated techniques like reinforcement learning from AI feedback (RLAIF-V). Among these, MiniCPM demonstrated superior performance in retaining crucial attributes of comic panels, as evaluated by the newly introduced Attribute Retaining Metric (ARM).

Attribute Retaining Metric (ARM)

ARM is a custom metric designed to evaluate whether the important attributes of a panel are present in the generated caption, moving beyond conventional n-gram metrics like BLEU or METEOR. ARM combines BERTScore with a Jaccard similarity index, ensuring that the captions accurately represent all significant elements in a comic panel. This approach prioritizes comprehensive attribute extraction over exact linguistic matches, providing a more robust assessment of the VLMs' effectiveness in this context.
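The attribute-retention idea can be sketched in a few lines. The snippet below is a simplified illustration, not the authors' exact formulation: it uses plain word-set Jaccard similarity as a stand-in for the semantic-similarity component (the paper also uses BERTScore, which requires a pretrained model), and scores a caption by the fraction of ground-truth attributes it retains. The function names and threshold are hypothetical.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the word sets of two phrases."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def arm_score(caption: str, attributes: list[str], threshold: float = 0.5) -> float:
    """Fraction of ground-truth attributes retained by the caption.

    An attribute counts as retained if some caption window of the same
    length matches it with Jaccard similarity >= threshold. A real ARM
    implementation would use a semantic score such as BERTScore here.
    """
    words = caption.lower().split()
    retained = 0
    for attr in attributes:
        n = len(attr.split())
        windows = [" ".join(words[i:i + n])
                   for i in range(max(1, len(words) - n + 1))]
        if any(jaccard(attr.lower(), w) >= threshold for w in windows):
            retained += 1
    return retained / len(attributes) if attributes else 0.0
```

For example, the caption "a tall man in a red cape jumps over a fence" retains the attributes "red cape" and "tall man" but misses "blue hat", yielding a score of 2/3 under this toy formulation.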

Text Grounding

The effectiveness of dense captioning partially depends on the accurate association of textual descriptions with corresponding visual elements. The authors employ GroundingDINO for this purpose, which excels in zero-shot object detection scenarios. This model is used to generate bounding boxes for the identified attributes, enhancing the explainability and usability of the captions.
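The association step downstream of the detector can be sketched as follows. This is an illustrative stand-in, not the paper's implementation: it assumes the zero-shot detector (e.g., GroundingDINO) returns scored (phrase, score, box) tuples for the prompted attribute phrases, and simply attaches the highest-scoring matching box to each caption attribute. All names and the score threshold are hypothetical.

```python
def ground_attributes(attributes, detections, min_score=0.3):
    """Attach the best-scoring detected box to each caption attribute.

    `detections` is a list of (phrase, score, box) tuples, as a zero-shot
    detector like GroundingDINO might return after thresholding; `box` is
    an (x0, y0, x1, y1) tuple. Attributes with no confident match map to
    None, so ungrounded attributes remain visible downstream.
    """
    grounded = {}
    for attr in attributes:
        best = None
        for phrase, score, box in detections:
            if score >= min_score and attr.lower() in phrase.lower():
                if best is None or score > best[0]:
                    best = (score, box)
        grounded[attr] = best[1] if best else None
    return grounded
```

Keeping unmatched attributes as explicit `None` entries (rather than dropping them) makes it easy to report which parts of a caption could not be grounded.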

Results

The proposed pipeline was evaluated on a newly created, densely annotated test set of comic panels. The results, summarized in Table 1, indicate that MiniCPM, as measured by the ARM metric, outperformed the other models at retaining panel attributes. The pipeline was further refined by captioning panels and characters separately, significantly improving the detail and accuracy of the generated descriptions.
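The two-stage refinement (whole panel, then each character crop) can be sketched as a small orchestration loop. This is a hypothetical skeleton, not the authors' code: `vlm` is any callable mapping an image to a caption, images are represented as 2D row-major arrays, and boxes are (x0, y0, x1, y1) tuples.

```python
def crop(image, box):
    """Crop a row-major 2D image to the (x0, y0, x1, y1) box."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]

def caption_panel(panel, character_boxes, vlm):
    """Two-stage captioning: caption the whole panel, then each
    detected character crop, and return both for later merging."""
    panel_caption = vlm(panel)
    char_captions = [vlm(crop(panel, box)) for box in character_boxes]
    return {"panel": panel_caption, "characters": char_captions}
```

Captioning each character crop in isolation gives the VLM a tightly framed input, which is one plausible reason per-character attributes (clothing, pose, expression) come out more detailed than in a single whole-panel pass.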

ComiCap Dataset

One of the noteworthy contributions of this work is the creation of the ComiCap dataset. The authors curated over 13,000 comic books from the Digital Comic Museum, resulting in more than 1.5 million panels and 2 million character annotations. This dataset, accompanied by dense captions, is made publicly available, providing a valuable resource for further research in comic analysis.

Implications and Future Work

The implications of this research are multifold. Practically, the proposed pipeline enhances the accessibility of comic books for visually impaired individuals by providing detailed contextual descriptions. Theoretically, the ARM metric introduces a novel approach to evaluating VLM-generated captions, which could be adapted for other domains requiring detailed scene understanding and attribute extraction.

Future research could explore several extensions. Improving attribute extraction and text grounding by integrating more capable models or ensemble methods could enhance caption accuracy. Expanding the dataset to include diverse comic styles and languages would improve the pipeline’s generalizability. Additionally, integrating real-time captioning systems and soliciting feedback from blind and visually impaired communities could further tailor the tool to user needs.

Conclusion

"ComiCap: A VLMs Pipeline for Dense Captioning of Comic Panels" significantly advances the field of automatic comic book analysis. By leveraging VLMs and introducing a novel metric for attribute evaluation, the authors provide a comprehensive method for generating detailed captions that enrich the reading experience for visually impaired individuals. The release of the ComiCap dataset further supports ongoing research, laying the groundwork for future developments in this domain.
