
TextAug: Test time Text Augmentation for Multimodal Person Re-identification

Published 4 Dec 2023 in cs.CV and cs.LG | (2312.01605v1)

Abstract: Multimodal Person Re-identification is gaining popularity in the research community due to its effectiveness compared to counterpart unimodal frameworks. However, the bottleneck for multimodal deep learning is the need for a large volume of multimodal training examples. Data augmentation techniques such as cropping, flipping, rotation, etc. are often employed in the image domain to improve the generalization of deep learning models. Augmenting in modalities other than images, such as text, is challenging and requires significant computational resources and external data sources. In this study, we investigate the effectiveness of two computer vision data augmentation techniques, cutout and cutmix, for text augmentation in multimodal person re-identification. Our approach merges these two augmentation strategies into one strategy called CutMixOut, which involves randomly removing words or sub-phrases from a sentence (Cutout) and blending parts of two or more sentences to create diverse examples (CutMix), with a certain probability assigned to each operation. This augmentation is applied at inference time without any prior training. Our results demonstrate that the proposed technique is simple and effective in improving performance on multiple multimodal person re-identification benchmarks.


Summary

  • The paper introduces TextAug, which adapts Cutout and CutMix techniques for text to enhance multimodal person re-identification.
  • It demonstrates significant performance gains on two datasets using models such as vision transformers and ResNet50.
  • TextAug outperforms traditional text augmentation methods by eliminating dependency on external linguistic resources.

Text augmentation, commonly used to enhance the generalizability of machine learning models, has been predominantly applied to image data. However, the field of Multimodal Person Re-identification, which benefits from both visual and textual data, faces the challenge of extending these techniques to text. The concern lies not only in the computational resources required but also in the availability and quality of multimodal datasets. Notably, most existing text augmentation methods necessitate external data such as thesauri for synonym replacement or pre-trained LLMs for linguistic transformation, which adds to the complexity.

In response to these challenges, a new study introduces a method named "TextAug," which adapts two computer vision data augmentation techniques, "Cutout" (random erasure of image parts) and "CutMix" (blended combination of different images), for textual data in the context of person re-identification. The integration of these techniques, coined “CutMixOut,” creates diverse text examples by randomly removing words or phrases (Cutout) and intermixing parts of multiple sentences (CutMix), augmenting the input without prior training.
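The combined CutMixOut operation described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the word-level granularity, and the hyperparameters `drop_prob` and `p_cutout` are assumptions for the sake of the example.

```python
import random

def text_cutout(words, drop_prob=0.2):
    # Text Cutout: randomly drop words from a sentence.
    # drop_prob is a hypothetical hyperparameter, not taken from the paper.
    kept = [w for w in words if random.random() >= drop_prob]
    return kept if kept else words  # never return an empty sentence

def text_cutmix(words_a, words_b):
    # Text CutMix: splice a prefix of one sentence onto a suffix of another.
    cut_a = random.randint(1, len(words_a))      # keep at least one word of A
    cut_b = random.randint(0, len(words_b) - 1)  # take a non-empty suffix of B
    return words_a[:cut_a] + words_b[cut_b:]

def cutmixout(sentence_a, sentence_b, p_cutout=0.5):
    # CutMixOut: apply Cutout or CutMix, with a probability
    # assigned to each operation.
    a, b = sentence_a.split(), sentence_b.split()
    if random.random() < p_cutout:
        return " ".join(text_cutout(a))
    return " ".join(text_cutmix(a, b))
```

Because both operations only recombine or drop words already present in the inputs, the augmented text needs no thesaurus, pre-trained language model, or other external resource, which is the key practical advantage the paper claims over methods like synonym replacement.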

The TextAug approach was found to enhance performance across multiple multimodal person re-identification benchmarks. When the technique was applied, substantial improvements were recorded over both image-only and non-augmented text models. The strategy delivered promising results on two different datasets and across model architectures, including vision transformers (ViTs) and the more traditional ResNet50. Furthermore, TextAug outperformed other text augmentation methods, such as synonym replacement, confirming its potential for creating robust inputs for re-identification systems.

In essence, TextAug offers a simple yet effective way to improve the generalization of text-processing models and the robustness of multimodal person re-identification systems. It demonstrates that image augmentation concepts can transfer to text data, and it highlights the value of combining visual and textual modalities to build stronger data representations for machine learning tasks.
