- The paper demonstrates that combining visual data with binary-encoded location information using a dual-branch transformer architecture significantly improves wound classification accuracy.
- It utilizes a Vision Transformer enhanced with a Discrete Wavelet Transform for detailed image feature extraction, paired with a location branch that captures spatial context.
- Swarm-based optimization techniques such as FOX and mGTO further refine network parameters, achieving accuracies up to 0.8644 in multi-class classification tasks.
This paper presents a multimodal deep learning framework for classifying medical wounds into categories such as diabetic, pressure, surgical, and venous ulcers. The core idea is to integrate both visual information from wound images and spatial information about the wound's location on the body to improve classification accuracy compared to using images alone.
Methodology:
- Data: The study utilizes the AZH dataset, containing 730 wound images labeled by specialists, along with corresponding location information. A body map (initially 484 locations, later simplified to 323) is used to standardize location data.
- Architecture: A dual-branch network architecture is proposed:
- Vision Branch: A Vision Transformer (ViT) processes the wound images. To enhance feature extraction, especially for the textures and edges relevant in wounds, a Discrete Wavelet Transform (DWT) layer is applied to the images before they are fed into the ViT. This ViT+Wavelet module extracts the visual features (ViT_latent).
- Location Branch: Wound location data, initially numerical, is converted into 9-bit binary vectors. These binary sequences are then processed by a standard Transformer architecture to capture spatial context (Transformer_latent).
- Fusion: The latent feature vectors from both branches are concatenated (Final_vector = ViT_latent ⊕ Transformer_latent) and fed into subsequent layers for final classification.
- Optimization: The study explores optimizing the network's hyperparameters (like filter sizes, learning rate, regularization, batch size, epochs) and weight vectors using three swarm-based metaheuristic algorithms: Improved Grey Wolf Optimizer (IGWO), Fox Optimization Algorithm (FOX), and improved Gorilla Troops Optimizer (mGTO).
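The architectural building blocks above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the Haar wavelet, the helper names, and the vector sizes are illustrative assumptions. It shows a single-level 2D DWT (the subband decomposition a DWT layer performs), the 9-bit binary location encoding (2^9 = 512 codes, enough for either the 484- or 323-location body map), and the concatenation fusion:

```python
import numpy as np

def haar_dwt2(img):
    """Single-level 2D Haar DWT: returns LL, LH, HL, HH subbands at half resolution.
    The high-frequency subbands (LH, HL, HH) carry the edge/texture detail."""
    a = (img[0::2, :] + img[1::2, :]) / 2.0  # row-wise averages
    d = (img[0::2, :] - img[1::2, :]) / 2.0  # row-wise details
    ll = (a[:, 0::2] + a[:, 1::2]) / 2.0
    lh = (a[:, 0::2] - a[:, 1::2]) / 2.0
    hl = (d[:, 0::2] + d[:, 1::2]) / 2.0
    hh = (d[:, 0::2] - d[:, 1::2]) / 2.0
    return ll, lh, hl, hh

def encode_location(loc_id, bits=9):
    """Encode an integer body-map location ID as a 9-bit binary vector (MSB first)."""
    return np.array([(loc_id >> (bits - 1 - i)) & 1 for i in range(bits)],
                    dtype=np.float32)

def fuse(vit_latent, transformer_latent):
    """Fusion step: concatenate the two latent vectors."""
    return np.concatenate([vit_latent, transformer_latent])
```

In the full model, the DWT subbands would feed the ViT, the binary vectors would feed the location Transformer, and `fuse` would combine their latents before the classification head.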
Experiments and Results:
- Evaluation Setup: The framework was evaluated on the AZH dataset (both original and augmented versions) using various classification tasks (2-class, 3-class, 4-class, 5-class, and 6-class including background/normal skin). Performance was measured using accuracy, precision, recall, F1-score, specificity, and sensitivity. Comparisons were made against various baseline models (MLP, LSTM, GMRNN, CNNs like VGG, ResNet, EfficientNet, etc.) and different input modalities (Location only, Image only, Image + Location).
- Key Findings:
- Multimodal Superiority: Combining image and location data (Image + Location) consistently outperformed models using only images or only locations across most classification tasks.
- Proposed Model Performance: The proposed ViT+Wavelet+Transformer model generally achieved the highest accuracy among all tested configurations for multimodal input. For instance, it achieved 0.8354 accuracy in the 4-class task on augmented data and 0.8644 accuracy in the 6-class task on original data.
- ViT+Wavelet Effectiveness: The ViT+Wavelet configuration for image processing outperformed standard ViT and various CNN-based approaches, highlighting the benefit of integrating the DWT into wound image analysis.
- Location Encoding: Using binary encoding for location data processed by a Transformer was more effective than simpler methods like MLP or LSTM for location-only input.
- Optimization Impact: Applying the swarm-based optimization algorithms (especially FOX and mGTO) further improved F1-scores compared to the non-optimized models, demonstrating their utility in fine-tuning the network parameters. For the ViT+Wavelet+Transformer model, FOX optimization reached 0.8342 accuracy in the 4-class task.
- Complexity: The paper includes an analysis of model complexity (parameters, GFlops, memory usage), showing the trade-offs associated with the proposed architecture and optimization methods.
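The hyperparameter tuning described above can be sketched generically. The snippet below is deliberately not FOX, IGWO, or mGTO; it is a simplified population-based search over a hypothetical two-parameter space, shown only to illustrate the sample-evaluate-perturb pattern these swarm metaheuristics share:

```python
import random

# Hypothetical search space; the paper tunes learning rate, batch size,
# regularization, epochs, etc. The bounds here are illustrative.
SPACE = {
    "learning_rate": (1e-5, 1e-2),
    "dropout": (0.0, 0.5),
}

def sample():
    """Draw a random candidate uniformly from the search space."""
    return {k: random.uniform(lo, hi) for k, (lo, hi) in SPACE.items()}

def evolve(parent, step=0.1):
    """Perturb a candidate around the current best, clipped to the bounds."""
    child = {}
    for k, (lo, hi) in SPACE.items():
        child[k] = min(hi, max(lo, parent[k] + random.uniform(-step, step) * (hi - lo)))
    return child

def optimize(score_fn, pop_size=8, iters=20, seed=0):
    """Minimal population-based search: keep the best candidate so far
    and regenerate the population around it each iteration."""
    random.seed(seed)
    pop = [sample() for _ in range(pop_size)]
    best = max(pop, key=score_fn)
    for _ in range(iters):
        pop = [evolve(best) for _ in range(pop_size)]
        cand = max(pop, key=score_fn)
        if score_fn(cand) > score_fn(best):
            best = cand
    return best
```

In practice `score_fn` would train the network with the candidate hyperparameters and return a validation metric (e.g. F1-score); real swarm algorithms add richer update rules (leader hierarchies, exploration/exploitation schedules) on top of this skeleton.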
Conclusion and Future Work:
The study concludes that integrating vision (via ViT+Wavelet) and location (via Transformer on binary encoded data) significantly enhances wound classification accuracy. The proposed multimodal framework, particularly when optimized with swarm algorithms, offers a robust approach. Future work includes exploring advanced loss functions, data augmentation techniques, explainability methods (like SHAP), incorporating more clinical metadata, and large-scale clinical validation.