Integrating Vision and Location with Transformers: A Multimodal Deep Learning Framework for Medical Wound Analysis

Published 14 Apr 2025 in cs.CV (arXiv:2504.10452v1)

Abstract: Effective recognition of acute and difficult-to-heal wounds is a necessary step in wound diagnosis. An efficient classification model can help wound specialists classify wound types with less financial and time cost and also help in deciding on the optimal treatment method. Traditional machine learning models depend on manual feature selection and are usually cumbersome models for accurate recognition. Recently, deep learning (DL) has emerged as a powerful tool in wound diagnosis. Although DL seems promising for wound type recognition, there is still large scope for improving the efficiency and accuracy of such models. In this study, a DL-based multimodal classifier was developed using wound images and their corresponding locations to classify them into multiple classes, including diabetic, pressure, surgical, and venous ulcers. A body map was also created to provide location data, which can help wound specialists label wound locations more effectively. The model uses a Vision Transformer to extract hierarchical features from input images, a Discrete Wavelet Transform (DWT) layer to capture low- and high-frequency components, and a Transformer to extract spatial features. The number of neurons and the weight vectors were optimized using three swarm-based optimization techniques: the improved Gorilla Troops Optimizer (mGTO), the Improved Grey Wolf Optimizer (IGWO), and the Fox Optimization Algorithm (FOX). The evaluation results show that weight vector optimization with these algorithms can increase diagnostic accuracy, making the approach very effective for wound detection. In classification using the original body map, the proposed model achieved an accuracy of 0.8123 using image data and an accuracy of 0.8007 using a combination of image data and wound location. The accuracy of the model combined with the optimization algorithms varied from 0.7801 to 0.8342.

Summary

  • The paper demonstrates that combining visual data with binary-encoded location information using a dual-branch transformer architecture significantly improves wound classification accuracy.
  • It utilizes a Vision Transformer enhanced with a Discrete Wavelet Transform for detailed image feature extraction, paired with a location branch that captures spatial context.
  • Swarm-based optimization techniques like FOX and mGTO further refine network parameters, achieving accuracies up to 0.8644 in multi-class classification tasks.

This paper presents a multimodal deep learning framework for classifying medical wounds into categories such as diabetic, pressure, surgical, and venous ulcers. The core idea is to integrate both visual information from wound images and spatial information about the wound's location on the body to improve classification accuracy compared to using images alone.

Methodology:

  1. Data: The study utilizes the AZH dataset, containing 730 wound images labeled by specialists, along with corresponding location information. A body map (initially 484 locations, later simplified to 323) is used to standardize location data.
  2. Architecture: A dual-branch network architecture is proposed:
    • Vision Branch: A Vision Transformer (ViT) is used to process the wound images. To enhance feature extraction, especially for textures and edges relevant in wounds, a Discrete Wavelet Transform (DWT) layer is applied to the images before they are fed into the ViT. This ViT+Wavelet module extracts visual features ($Vit_{latent}$).
    • Location Branch: Wound location data, initially numerical, is converted into 9-bit binary vectors. These binary sequences are then processed by a standard Transformer architecture to capture spatial context ($Transformer_{latent}$).
    • Fusion: The latent feature vectors from both branches ($Vit_{latent}$ and $Transformer_{latent}$) are concatenated ($Final_{vector} = Vit_{latent} \oplus Transformer_{latent}$) and fed into subsequent layers for final classification.
  3. Optimization: The study explores optimizing the network's hyperparameters (like filter sizes, learning rate, regularization, batch size, epochs) and weight vectors using three swarm-based metaheuristic algorithms: Improved Grey Wolf Optimizer (IGWO), Fox Optimization Algorithm (FOX), and improved Gorilla Troops Optimizer (mGTO).
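
The pipeline above can be sketched in plain Python. Everything below is an illustrative stand-in under our own naming (`haar_dwt_2x2`, `encode_location`, and `fuse` are hypothetical helpers, not the authors' code): the Haar step shows what one DWT level computes, the encoder produces the 9-bit location vectors, and `fuse` performs the concatenation. The ViT and Transformer feature extractors that sit between these steps in the real model are omitted.

```python
# Illustrative sketch of the dual-branch pipeline (not the paper's implementation).

def haar_dwt_2x2(img):
    """One-level 2D Haar DWT on an even-sized grayscale image, returning the
    approximation sub-band LL and the detail sub-bands LH, HL, HH."""
    h, w = len(img), len(img[0])
    LL, LH, HL, HH = [], [], [], []
    for i in range(0, h, 2):
        ll, lh, hl, hh = [], [], [], []
        for j in range(0, w, 2):
            a, b = img[i][j], img[i][j + 1]
            c, d = img[i + 1][j], img[i + 1][j + 1]
            ll.append((a + b + c + d) / 4)   # local average (low frequency)
            lh.append((a - b + c - d) / 4)   # left-right difference
            hl.append((a + b - c - d) / 4)   # top-bottom difference
            hh.append((a - b - c + d) / 4)   # diagonal difference
        LL.append(ll); LH.append(lh); HL.append(hl); HH.append(hh)
    return LL, LH, HL, HH

def encode_location(index, bits=9):
    """Convert a numeric body-map location into a 9-bit binary vector."""
    return [(index >> b) & 1 for b in reversed(range(bits))]

def fuse(vit_latent, transformer_latent):
    """Concatenate the two branch embeddings (the paper's ⊕ fusion)."""
    return vit_latent + transformer_latent

# Toy usage: a 4x4 "image" and a made-up body-map location index 275.
img = [[1, 2, 3, 4],
       [5, 6, 7, 8],
       [9, 10, 11, 12],
       [13, 14, 15, 16]]
LL, LH, HL, HH = haar_dwt_2x2(img)
loc_bits = encode_location(275)
fused = fuse([v for row in LL for v in row], [float(b) for b in loc_bits])
print(LL)          # [[3.5, 5.5], [11.5, 13.5]]
print(loc_bits)    # [1, 0, 0, 0, 1, 0, 0, 1, 1]
print(len(fused))  # 13
```

In the actual model the concatenated vector would be the two branches' latent embeddings rather than raw sub-band pixels and bits; the sketch only shows the data flow.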

Experiments and Results:

  • Evaluation Setup: The framework was evaluated on the AZH dataset (both original and augmented versions) using various classification tasks (2-class, 3-class, 4-class, 5-class, and 6-class including background/normal skin). Performance was measured using accuracy, precision, recall, F1-score, specificity, and sensitivity. Comparisons were made against various baseline models (MLP, LSTM, GMRNN, CNNs like VGG, ResNet, EfficientNet, etc.) and different input modalities (Location only, Image only, Image + Location).
  • Key Findings:
    • Multimodal Superiority: Combining image and location data (Image + Location) consistently outperformed models using only images or only locations across most classification tasks.
    • Proposed Model Performance: The proposed Vit+Wavelet+Transformer model generally achieved the highest accuracy among all tested configurations for multimodal input. For instance, in the 4-class task on augmented data, it achieved 0.8354 accuracy, and 0.8644 accuracy in the 6-class task on original data.
    • ViT+Wavelet Effectiveness: The ViT+Wavelet configuration for image processing outperformed standard ViT and various CNN-based approaches, highlighting the benefit of integrating DWT for wound image analysis.
    • Location Encoding: Using binary encoding for location data processed by a Transformer was more effective than simpler methods like MLP or LSTM for location-only input.
    • Optimization Impact: Applying swarm-based optimization algorithms (especially FOX and mGTO) further improved the F1-scores compared to the non-optimized models, demonstrating their utility in fine-tuning the network parameters. For the Vit+Wavelet+Transformer model, FOX optimization boosted accuracy to 0.8342 in the 4-class task.
    • Complexity: The paper includes an analysis of model complexity (parameters, GFlops, memory usage), showing the trade-offs associated with the proposed architecture and optimization methods.
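
As a rough illustration of the swarm-based weight tuning behind the optimization findings, here is a heavily simplified grey-wolf-style loop in plain Python. It is our own toy sketch (the function `gwo_minimize` and all of its parameters are hypothetical), not the paper's IGWO, FOX, or mGTO implementations, and it optimizes a stand-in loss rather than network weights.

```python
import random

def gwo_minimize(loss, dim, n_wolves=10, iters=60, lb=-1.0, ub=1.0, seed=0):
    """Toy grey-wolf-style optimizer: each candidate vector is pulled toward
    the three current best solutions (alpha, beta, delta), with an
    exploration factor that decays over the iterations."""
    rng = random.Random(seed)
    wolves = [[rng.uniform(lb, ub) for _ in range(dim)] for _ in range(n_wolves)]
    for t in range(iters):
        wolves.sort(key=loss)
        leaders = [w[:] for w in wolves[:3]]   # alpha, beta, delta snapshots
        a = 2.0 * (1 - t / iters)              # exploration decays to 0
        for w in wolves:
            for d in range(dim):
                x = 0.0
                for leader in leaders:
                    A = a * (2 * rng.random() - 1)
                    C = 2 * rng.random()
                    x += leader[d] - A * abs(C * leader[d] - w[d])
                w[d] = min(ub, max(lb, x / 3))  # average of the three pulls
    return min(wolves, key=loss)

# Toy usage: minimize a sphere loss over a 3-dimensional "weight vector"
# as a stand-in for the network's classification loss.
best = gwo_minimize(lambda w: sum(x * x for x in w), dim=3)
```

The real algorithms differ in their update rules and in how they encode hyperparameters such as filter sizes and learning rate, but the pattern (a population of candidate solutions guided by the current best) is the same.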

Conclusion and Future Work:

The study concludes that integrating vision (via ViT+Wavelet) and location (via Transformer on binary encoded data) significantly enhances wound classification accuracy. The proposed multimodal framework, particularly when optimized with swarm algorithms, offers a robust approach. Future work includes exploring advanced loss functions, data augmentation techniques, explainability methods (like SHAP), incorporating more clinical metadata, and large-scale clinical validation.
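
For reference, the accuracy, precision, recall/sensitivity, specificity, and F1 figures quoted throughout can all be derived from a confusion matrix. A minimal binary-case sketch (the helper name `binary_metrics` is ours; the paper's multi-class scores would average these per class):

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall (sensitivity), specificity,
    and F1 from binary labels, where 1 marks the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0        # a.k.a. sensitivity
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, specificity, f1

# Toy usage on six hand-made predictions.
print(binary_metrics([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0]))
```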
