Evan_V2 Hybrid Model for Alzheimer’s MRI

Updated 28 January 2026
  • The paper demonstrates that the hybrid ensemble, Evan_V2, achieves near-perfect accuracy in multiclass Alzheimer’s disease staging from MRI scans by integrating CNN and Transformer representations.
  • It utilizes ten pretrained backbones, projecting and fusing mid-level features to counter class imbalance and instability, ensuring robust diagnostic performance.
  • Through comprehensive training, dynamic augmentation, and optimization protocols, Evan_V2 sets a new benchmark for reliable image-based medical diagnostics.

The Evan_V2 hybrid model is a computer vision ensemble architecture designed for the classification of Alzheimer’s disease (AD) stages from brain MRI scans. It strategically leverages both convolutional neural networks (CNNs) and Transformer-based models, integrating their mid-level representations through feature-level fusion to optimize diagnostic accuracy for multiclass AD staging. As described in "A Computer Vision Hybrid Approach: CNN and Transformer Models for Accurate Alzheimer's Detection from Brain MRI Scans" (Hoque et al., 21 Jan 2026), Evan_V2 achieves near-perfect performance on a challenging four-class AD task and exemplifies the upper bound of reliability in image-based ensemble systems for medical diagnostics.

1. Architectural Composition and Workflow

Evan_V2 employs a parallel-processing architecture, taking as input a single 224×224×3 MRI slice. The model routes the input through ten distinct pretrained vision backbones—comprising five CNNs and five Transformer-family models—before projecting and fusing these features for final classification.

The backbone selection includes:

  • CNNs: EfficientNet-B0, ResNet50, DenseNet201, MobileNetV3-Small, VGG16
  • Transformer-family models: ViT-Base (Vision Transformer), ConvTransformer, PatchTransformer, MLP-Mixer, SimpleTransformer

Each backbone produces a mid-level feature vector via global average pooling (CNNs) or direct token embedding (Transformers). The vectors are individually projected via fully connected layers with ReLU activations to a common subspace (e.g., ℝ²⁵⁶), concatenated, further fused through another dense (ReLU) layer, and finally mapped to class logits via a fully connected classification head with softmax activation. The architectural pipeline ensures all model outputs contribute to a unified joint representation prior to decision-making.
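This projection–concatenation–fusion–classification pipeline can be sketched end-to-end with NumPy. Only the d = 256 projection size comes from the text; the backbone feature sizes, the fused width h, and all weight values below are illustrative stand-ins, not the paper's parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical per-backbone feature sizes (illustrative, not from the paper).
feat_dims = [1280, 2048, 1920, 576, 512, 768, 768, 768, 768, 256]
d, h, n_classes = 256, 512, 4   # d from the text; h is an assumption

# Random stand-ins for pretrained backbone outputs and learned weights.
features = [rng.standard_normal(di) for di in feat_dims]
proj = [(rng.standard_normal((d, di)) * 0.01, np.zeros(d)) for di in feat_dims]

# 1) Project each backbone feature to R^d with a ReLU dense layer.
projected = [relu(W @ f + b) for f, (W, b) in zip(features, proj)]

# 2) Concatenate into a joint vector in R^{10d}.
joint = np.concatenate(projected)

# 3) Fuse through another dense ReLU layer.
W_fuse, b_fuse = rng.standard_normal((h, 10 * d)) * 0.01, np.zeros(h)
z = relu(W_fuse @ joint + b_fuse)

# 4) Classification head with softmax over the four AD stages.
W_cls, b_cls = rng.standard_normal((n_classes, h)) * 0.01, np.zeros(n_classes)
y_hat = softmax(W_cls @ z + b_cls)
print(y_hat.shape)  # a 4-way probability vector
```

The key design point is that fusion happens on the concatenated joint vector, so gradients from the shared classifier flow back into every projection head at once.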

2. Mathematical Formalism of Feature-Level Fusion

Let $f_i \in \mathbb{R}^{d_i}$ denote the feature vector from backbone $i$. For each $i$:

  1. Project to $\mathbb{R}^d$:

$$\hat{f}_i = W_i f_i + b_i$$

where $W_i \in \mathbb{R}^{d \times d_i}$, $b_i \in \mathbb{R}^d$.

  2. Concatenate the projected vectors into a joint feature:

$$\tilde{f} = [\hat{f}_1; \hat{f}_2; \dots; \hat{f}_{10}] \in \mathbb{R}^{10d}$$

  3. Fuse features:

$$z = \mathrm{ReLU}(W_{\mathrm{fuse}}\,\tilde{f} + b_{\mathrm{fuse}}), \quad W_{\mathrm{fuse}} \in \mathbb{R}^{h \times 10d}, \; b_{\mathrm{fuse}} \in \mathbb{R}^h$$

  4. Classification output:

$$\hat{y} = \mathrm{softmax}(W_{\mathrm{cls}}\, z + b_{\mathrm{cls}})$$

Alternatively, learnable per-backbone weights $\alpha_i$ can define a convex combination:

$$z = \sum_{i=1}^{10} \alpha_i \hat{f}_i, \quad \alpha = \mathrm{softmax}(w), \quad w \in \mathbb{R}^{10}$$

This fusion is executed at the feature (not probability) level, countering class-imbalance and individual model instability by learning a robust aggregate representation that reflects both local and global cues.
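A minimal NumPy sketch of the convex-combination alternative; the projected features and the logit vector w are random stand-ins, and the softmax guarantees the weights are non-negative and sum to one:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(1)
d = 256
# Stand-ins for the ten projected backbone features f_hat_i in R^d.
projected = [rng.standard_normal(d) for _ in range(10)]

w = rng.standard_normal(10)   # learnable logits, one per backbone
alpha = softmax(w)            # convex weights over the backbones
z = sum(a * f for a, f in zip(alpha, projected))  # fused feature in R^d
print(z.shape)
```

Because the weights are tied to backbones rather than to samples, this variant learns a single global trust profile over the ensemble members.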

3. Training Protocol and Optimization

Evan_V2 is trained by minimizing categorical cross-entropy across four AD classes. Optimization uses Adam (learning rate 1×1041\times10^{-4}, β1=0.9\beta_1=0.9, β2=0.999\beta_2=0.999, ϵ=108\epsilon=10^{-8}), with L2 weight decay (1×105\approx1\times10^{-5}) and a Reduce-On-Plateau scheduler (factor 0.5, patience 5 epochs). Regularization includes dropout (p=0.5p=0.5) within each projection head and in the fusion head.
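The reduce-on-plateau behaviour can be sketched in a few lines of plain Python. This is one common formulation (halve the rate after `patience` consecutive non-improving epochs), stated here as an assumption rather than the authors' implementation:

```python
class ReduceOnPlateau:
    """Halve the learning rate when validation loss fails to improve
    for `patience` consecutive epochs (factor=0.5, patience=5 match
    the values quoted in the text)."""

    def __init__(self, lr=1e-4, factor=0.5, patience=5, min_delta=0.0):
        self.lr, self.factor, self.patience = lr, factor, patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        if val_loss < self.best - self.min_delta:
            # Improvement: record it and reset the counter.
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr

sched = ReduceOnPlateau()
for _ in range(6):              # one improving epoch, then five flat ones
    lr = sched.step(1.0)
print(lr)                       # rate halved after the plateau
```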

Dynamic data augmentation applied on the fly (TensorFlow tf.data) comprises random rotations (±15\pm15^\circ), horizontal flips, and random zoom (0.8–1.2 scaling). Training is performed in batches of 32, for up to 50 epochs, with early stopping (patience=10 on validation loss). Hardware utilized is NVIDIA Tesla T4 (16GB VRAM) with CUDA 12.2 and cuDNN 9.1.
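As a rough illustration of the stated ranges (not the paper's tf.data pipeline), a per-image augmentation-parameter sampler might look like:

```python
import random

def sample_augmentation(rng=random):
    """Draw one set of augmentation parameters matching the ranges in
    the text: rotation in [-15, 15] degrees, a 50% horizontal flip,
    and zoom scaling in [0.8, 1.2]. Illustrative sketch only."""
    return {
        "rotation_deg": rng.uniform(-15.0, 15.0),
        "horizontal_flip": rng.random() < 0.5,
        "zoom": rng.uniform(0.8, 1.2),
    }

params = sample_augmentation()
print(params)
```

Sampling fresh parameters per image per epoch (rather than precomputing an augmented dataset) is what makes the augmentation "dynamic".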

4. Empirical Performance and Quantitative Evaluation

On the four-class AD detection task, Evan_V2 demonstrates near-perfect results:

| Model | Accuracy | Precision | Recall | F1-score | ROC AUC |
|---|---|---|---|---|---|
| EfficientNet-B0 | 0.9821 | 0.9818 | 0.9821 | 0.9819 | 0.9924 |
| ResNet50 | 0.9883 | 0.9879 | 0.9883 | 0.9877 | 0.9951 |
| ViT-Base | 0.9538 | 0.9541 | 0.9538 | 0.9530 | 0.9852 |
| … | … | … | … | … | … |
| Evan_V2 (Hybrid) | 0.9999 | 0.9872 | 0.9923 | 0.9989 | 0.9968 |

The confusion matrix for Evan_V2 is nearly diagonal, with misclassifications essentially eliminated across all AD stages, including the underrepresented Moderate Dementia class: 13,445/13,445 correct predictions for Non-Demented; 2,691/2,694 for Very Mild Dementia (3 misclassified as Mild Dementia); 2,499/2,500 for Mild Dementia; and 487/488 for Moderate Dementia (1 misclassified as Mild Dementia). No statistical significance test is reported, but the near-diagonal structure indicates ensemble stability.
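These per-class counts can be sanity-checked directly; a short script recomputes overall and per-class accuracy from the figures quoted above:

```python
# Correct predictions per class / total samples per class, from the text.
correct = {"Non-Demented": 13445, "Very Mild": 2691,
           "Mild": 2499, "Moderate": 487}
totals  = {"Non-Demented": 13445, "Very Mild": 2694,
           "Mild": 2500, "Moderate": 488}

overall_acc = sum(correct.values()) / sum(totals.values())
per_class = {k: correct[k] / totals[k] for k in totals}

print(f"overall accuracy = {overall_acc:.4f}")
for name, acc in per_class.items():
    print(f"{name}: {acc:.4f}")
```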

5. Clinical Implications and Robustness Considerations

The hybridization in Evan_V2 exploits the spatial discrimination of CNNs for fine-scale features and the contextual modeling capacity of Transformers for global dependencies. The resultant ensemble diminishes individual model idiosyncrasies through feature fusion, yielding stable predictions even for low-prevalence classes.

Clinical strengths include:

  • High robustness across all AD severity stages, decreasing the risk of mis-staging that could delay intervention.
  • Exceptional ROC AUC (0.9968), minimizing false positives in Non-Demented predictions.
  • Stable performance on class-imbalanced real-world data.

Noted limitations:

  • Elevated computational cost: parallel execution of ten large models increases inference latency.
  • Dataset bias: validation restricted to single-center, 2D-slice OASIS MRI; further studies needed for 3D/longitudinal or multi-center data.
  • Limited explainability: the black-box fusion impedes interpretability; saliency mapping (e.g., Grad-CAM) and attention-based explanation mechanisms are recommended enhancements.
  • Restricted modalities: extension to multimodal staging (incorporation of clinical scores such as CDR, MMSE) remains out of scope in the current implementation.

6. Comparative Model Analysis

Compared to single-model variants, the hybrid strategy reduces both global and class-specific errors. In performance terms, Evan_V2 surpasses the strongest standalone CNN (ResNet50: 98.83% accuracy) and Transformer (ViT-Base: 95.38% accuracy), achieving 99.99% accuracy and 0.9989 F1-score. Individual Transformer models, while competitive, display greater class-wise instability, which the hybrid effectively mitigates through learned ensemble weighting.

In sum, Evan_V2 exemplifies the synergistic benefit of feature-level fusion across diverse deep-learning backbones for medical image recognition, achieving state-of-the-art reliability in AD staging within the evaluated scope (Hoque et al., 21 Jan 2026).
