EKDC-Net: Expert-Guided Calibration for Fine Classification
- The paper introduces a dual-module approach that integrates CAM-based local knowledge extraction with uncertainty-guided decision calibration.
- EKDC-Net is a modular architecture that fuses data-driven and expert representations, significantly enhancing tree species identification accuracy.
- The system achieves state-of-the-art performance with a Top-1 accuracy gain of +6.42% and robust improvements on long-tail and few-shot benchmark datasets.
The Expert Knowledge-Guided Classification Decision Calibration Network (EKDC-Net) is a modular architecture for fine-grained visual classification in domains exhibiting pronounced long-tail distributions and high inter-class similarity, with primary application to tree species identification. EKDC-Net introduces an external domain expert into the learning workflow, coupling data-driven representations from standard image classification backbones with knowledge-centric calibration, via a dual-module framework for knowledge extraction and uncertainty-aware fusion. The system demonstrates significant improvements over backbone-only and conventional fusion approaches, establishing state-of-the-art results on the large-scale CU-Tree102 dataset as well as auxiliary challenging benchmarks (Long et al., 23 Jan 2026).
1. Network Architecture and Workflow
EKDC-Net is designed as a lightweight "plug-and-play" add-on to conventional image classifiers (ResNet-50, Swin, ViT, etc.). Its processing pipeline comprises three principal stages:
1. Backbone Feature and Logit Extraction: The input image is processed by a backbone to extract multi-scale feature maps. Initial class logits are also generated, one per species.
2. Local Prior–Guided Knowledge Extraction Module (LPKEM): Uses the backbone's feature maps and logits (plus the original image) to compute Class Activation Maps (CAMs) that spatially highlight discriminative regions and filter out background. CAM-derived binary masks are applied to a frozen vision transformer expert (BioCLIP2), which processes only the foreground token sequences to output expert-level feature representations. These are aggregated into expert logits via an MLP.
3. Uncertainty-Guided Decision Calibration Module (UDCM): Integrates information from both the backbone and the expert by quantifying class-level and instance-level uncertainties for each. These are concatenated and projected to generate a bin-based distribution over calibration weights, yielding a soft blending coefficient that drives adaptive logit fusion.
2. Local Prior–Guided Knowledge Extraction Module (LPKEM)
LPKEM operationalizes expert knowledge extraction and grounding via three sub-routines:
- CAM-Based Local Prior:
Pseudo-label selection is performed first. The activation computation then runs over every scale and channel, producing scale-specific activation maps.
- Binary Mask Generation:
Each activation map is upsampled to the foundation expert's token grid and binarized via median thresholding.
- Expert Feature Extraction:
Masked token sequences are fed into the frozen BioCLIP2 expert to obtain foreground-only features, while the unmasked full image yields full-image features. The two sets of features are then aggregated.
This module explicitly localizes expert attention, suppressing background distractions and enhancing key discriminative features.
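The localization steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the CAM follows the classic formulation (a weighted sum of feature channels using the classifier weights of the pseudo-label class), the pseudo-label is assumed to be the argmax of the backbone logits, and the helper names `class_activation_map`, `resize_nearest`, and `median_mask` are hypothetical. Random arrays stand in for real backbone features.

```python
import numpy as np

def class_activation_map(features, class_weights):
    """Weighted sum of feature channels for one class.

    features:      (C, H, W) feature map from one backbone scale.
    class_weights: (C,) per-channel weights for the pseudo-label class
                   (assumed: the matching row of a linear classifier head).
    """
    cam = np.tensordot(class_weights, features, axes=([0], [0]))  # (H, W)
    cam = np.maximum(cam, 0.0)                 # keep positive evidence only
    peak = cam.max()
    return cam / peak if peak > 0 else cam     # normalise to [0, 1]

def resize_nearest(cam, grid):
    """Nearest-neighbour resize of an (H, W) map to a (grid, grid) token grid."""
    h, w = cam.shape
    rows = np.arange(grid) * h // grid
    cols = np.arange(grid) * w // grid
    return cam[np.ix_(rows, cols)]

def median_mask(cam):
    """Binarise a map: True where the value exceeds the map's median."""
    return cam > np.median(cam)

rng = np.random.default_rng(0)
features = rng.random((8, 7, 7))               # toy stand-in for one scale
classifier = rng.standard_normal((5, 8))       # toy head: 5 classes, 8 channels
logits = classifier @ features.mean(axis=(1, 2))
pseudo_label = int(np.argmax(logits))          # assumed selection rule

cam = class_activation_map(features, classifier[pseudo_label])
mask = median_mask(resize_nearest(cam, grid=14))
kept_tokens = np.flatnonzero(mask.ravel())     # token indices passed to the expert
```

Because the threshold is the median, at most half of the tokens survive the mask, which bounds the sequence length the frozen expert has to process.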
3. Uncertainty-Guided Decision Calibration Module (UDCM)
UDCM addresses the fusion of backbone and expert output via a dual-uncertainty mechanism:
- Class-Level Uncertainty:
Encoded in a learnable embedding initialized to represent class difficulty. The difficulty weights of the top-3 classes (ranked by logit) are extracted for both the backbone and the expert.
- Instance-Level Uncertainty:
Measured by the prediction entropy of each agent's softmax probabilities.
- Calibration Coefficient and Fusion:
The class-level and instance-level uncertainties are concatenated, separately for the backbone and for the expert. The resulting vectors are passed through MLPs and a bin classifier to yield soft calibration weights, which are then used to fuse the backbone and expert logits.
This adaptive scheme dynamically allocates trust between the backbone and expert based on both prior class difficulty and sample-level prediction entropy.
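A small numerical sketch may help make the calibration idea concrete. Several parts are assumptions rather than the paper's formulas: the fusion is written as a convex blend with `alpha` as the expert's weight, the entropy is normalized by the log of the class count, bins are spaced evenly over [0, 1], and a hand-written scoring rule stands in for the learned MLP and bin classifier. Class-level difficulty embeddings are left out for brevity.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def normalized_entropy(logits):
    """Entropy of softmax(logits), scaled to [0, 1] by log(num_classes)."""
    p = softmax(logits)
    h = -np.sum(p * np.log(p + 1e-12))
    return float(h / np.log(len(logits)))

def soft_coefficient(bin_logits):
    """Expected bin centre under softmax(bin_logits); bins evenly cover [0, 1]."""
    centres = (np.arange(len(bin_logits)) + 0.5) / len(bin_logits)
    return float(softmax(bin_logits) @ centres)

def fuse(backbone_logits, expert_logits, alpha):
    """Assumed convex blend; alpha is the weight given to the expert."""
    return (1.0 - alpha) * backbone_logits + alpha * expert_logits

backbone_logits = np.array([2.0, 1.9, 0.1])   # backbone is torn between 0 and 1
expert_logits = np.array([0.2, 4.0, 0.1])     # expert is confident in class 1

u_backbone = normalized_entropy(backbone_logits)
u_expert = normalized_entropy(expert_logits)

# Hypothetical stand-in for the learned MLP + bin classifier: score each of
# 10 bins by how close its centre is to the backbone's share of uncertainty.
centres = (np.arange(10) + 0.5) / 10
target = u_backbone / (u_backbone + u_expert)
alpha = soft_coefficient(-50.0 * (centres - target) ** 2)

fused = fuse(backbone_logits, expert_logits, alpha)
```

In this toy case the backbone's entropy is much higher than the expert's, so `alpha` lands above 0.5 and the fused prediction follows the expert. Taking the expectation over bins, rather than the argmax bin, keeps the coefficient continuous.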
4. Training Strategy and Optimization
EKDC-Net is optimized end-to-end, except for the frozen expert, using a cross-entropy objective. A stop-gradient is applied to prevent trivial collapse. All trainable components (mask projections, MLPs, uncertainty predictors) are updated jointly using SGD with a learning rate of 0.0005 over 100 epochs. The bin count and the smoothing parameter are set as hyperparameters.
5. Dataset Design: CU-Tree102 and Evaluation Protocols
The CU-Tree102 dataset comprises 9,134 expert-curated images of tree species, split into train/val/test sets (80%/10%/10%). Samples are drawn from real outdoor sources to ensure broad coverage and to include challenging, ambiguous cases. CU-Tree102 features pronounced class imbalance (largest class: 286 samples; smallest: 11), reflecting real-world frequency. Two auxiliary datasets evaluate generalizability and long-tail robustness: RSTree (8,324 samples, 23 classes, severe tail) and Jekyll (4,804 samples, 23 classes).
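To see what an 80/10/10 split means for the rarest classes, the following sketch splits per class and applies it to the two reported extremes (286 and 11 samples). Whether the actual split was stratified is not stated, so the per-class splitting and the helper name `stratified_split` are assumptions made for illustration.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.8, 0.1), seed=0):
    """Split sample indices per class into train/val/test lists.

    fractions gives the train and val shares; the remainder goes to test.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]
    return train, val, test

# Toy labels mirroring the reported extremes: 286 vs 11 samples per class.
labels = ["largest"] * 286 + ["smallest"] * 11
train, val, test = stratified_split(labels)

imbalance_ratio = 286 / 11          # largest class / smallest class = 26.0
tail_counts = [sum(labels[i] == "smallest" for i in part)
               for part in (train, val, test)]   # [9, 1, 1]
```

Under this assumed scheme the smallest class contributes a single image to validation and a single image to test, so per-class scores in the tail are decided by one prediction each.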
6. Experimental Results and Comparative Analysis
- Performance Gains:
EKDC-Net consistently improves on backbone-only and FGVC-specific methods, yielding a Top-1 accuracy gain of +6.42% and a macro-F1 improvement of +11.93% over baselines. The greatest improvements are observed in tail classes and few-shot regimes.
- Ablation and Fusion Paradigms:
Adding expert features without masking or calibration provides moderate gains; adding LPKEM alone with naïve fusion provides a minor additional benefit. The full LPKEM+UDCM configuration gives the best performance (e.g., with the CGL backbone, accuracy rises from 81.47% to 86.65%). Standard feature-level and logit-level fusion methods saturate at around 81% accuracy, whereas UDCM outperforms them by about 5 percentage points.
- Robustness on RSTree:
Standard backbones collapse in the extreme long-tail setting (macro-F1 6.6%), whereas EKDC-Net achieves a macro-F1 of 49.6% and a Top-1 accuracy of 87.8%. Macro-F1 improves by 170% in the most imbalanced settings.
- Generalization (Jekyll):
Models pretrained on CU-Tree102 drop to 30% accuracy on Jekyll without fine-tuning. EKDC-Net raises results to the 39–59% range, an average relative gain of +30.56%.
- Parameters and Efficiency:
The system introduces only 0.08M additional parameters.
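The results above report both Top-1 accuracy and macro-F1, and the two can diverge sharply on long-tailed data. The toy example below, using made-up labels rather than any of the benchmark datasets, shows how a classifier that ignores the tail can keep high accuracy while its macro-F1 collapses, which is the failure pattern described for standard backbones on RSTree.

```python
import numpy as np

def top1_accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class matches the label."""
    return float(np.mean(y_true == y_pred))

def macro_f1(y_true, y_pred, num_classes):
    """Unweighted mean of per-class F1, so every class counts equally."""
    scores = []
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom > 0 else 0.0)
    return float(np.mean(scores))

# Toy long-tailed test set: 90 head-class samples, 5 in each of two tail classes.
y_true = np.array([0] * 90 + [1] * 5 + [2] * 5)
y_majority = np.zeros_like(y_true)      # always predicts the head class

accuracy = top1_accuracy(y_true, y_majority)          # 0.90
macro = macro_f1(y_true, y_majority, num_classes=3)   # about 0.316
```

Accuracy is dominated by the head class, while macro-F1 averages over classes, so two tail classes with zero F1 pull the score down to roughly a third regardless of how many head samples are correct.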
7. Technical Significance and Applicability
EKDC-Net advances fine-grained classification by tightly integrating region-wise localization via CAMs, leveraging foundation-model priors, and blending outputs using explicit uncertainty modeling. This approach effectively mitigates biases and performance collapse in imbalanced scenarios and visually ambiguous cases. The lightweight, modular nature makes it suitable for deployment across diverse backbones and datasets. A plausible implication is that expert-guided, uncertainty-calibrated fusion can generalize to other domains suffering from data scarcity, long-tail distributions, and confounding similar-class noise.
The CU-Tree102 dataset and reference implementation are publicly available, facilitating further benchmarking and adaptation (Long et al., 23 Jan 2026).