GRAFT-Net: Adapting CNNs for Novel Sensor Modalities
- GRAFT-Net is a self-supervised algorithm that adapts intensity-trained CNNs to novel sensor modalities (thermal, event-based, hyperspectral) by grafting a tailored front end.
- The method uses a composite loss combining feature reconstruction, feature evaluation, and feature style terms to align the new front end's features with those of the original network, adding no inference overhead.
- GRAFT-Net achieves near-backbone performance with minimal parameter updates and training data, significantly improving detection benchmarks and enabling rapid prototyping for emerging sensors.
GRAFT-Net is a network grafting algorithm that enables adaptation of powerful deep convolutional architectures, originally trained on standard intensity images, to accommodate arbitrary new vision sensor modalities such as thermal, event-based, or hyperspectral inputs. The method is self-supervised, requiring no manual labels in the novel modality and incurring no additional inference costs, making it practical for emerging sensor types lacking large annotated datasets (Hu et al., 2020).
1. Architectural Overview
GRAFT-Net separates the pre-existing, intensity-trained convolutional network ("backbone") into three consecutive blocks: the front end ($F$), the middle network ($M$), and the final layers ($L$). The novel contribution is the replacement of $F$ with a lightweight, modality-specific front end ($G$) designed to process unconventional inputs (e.g., thermal frames, event volume grids) while ensuring that the dimensions of the output feature tensor match those produced by $F$.
During deployment, the grafted architecture operates as follows on an input $x_n$ from the new modality:

$y = L(M(G(x_n)))$
The original middle and last blocks ($M$ and $L$) remain fixed, transferring high-level task expertise learned on labeled intensity images. The grafted front end $G$ typically mirrors the structure of the earliest residual blocks of the original backbone, at a configurable depth.
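The split-and-graft structure can be sketched with toy stand-in blocks. Everything below (names `F`, `M`, `L`, `G`, all shapes and channel counts) is illustrative, not the original implementation:

```python
import numpy as np

# Toy stand-ins for the backbone blocks: F (front end), M (middle), L (last),
# plus the grafted front end G. Real blocks are CNN stages; fixed linear maps
# with ReLU keep only the composition structure visible.
rng = np.random.default_rng(0)

def make_block(d_in, d_out):
    W = rng.standard_normal((d_out, d_in)) * 0.1
    return lambda x: np.maximum(W @ x, 0.0)   # toy stage: linear + ReLU

F = make_block(3, 16)    # original intensity front end (unused at deployment)
M = make_block(16, 32)   # frozen middle network
L = make_block(32, 10)   # frozen final layers
G = make_block(5, 16)    # grafted front end: 5-channel input (e.g. an event
                         # volume), same 16-dim output F would produce

def backbone(x_intensity):      # original network: L(M(F(x)))
    return L(M(F(x_intensity)))

def grafted(x_new):             # grafted network: L(M(G(x_n)))
    return L(M(G(x_new)))

out = grafted(rng.standard_normal(5))
# The frozen task head sees identically shaped features either way.
assert out.shape == backbone(rng.standard_normal(3)).shape == (10,)
```

The only constraint `G` must satisfy is the output-shape match; its input can be any tensor the new sensor produces.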
2. Self-Supervised Alignment Loss
Training requires only synchronized, unlabeled pairs of standard intensity images ($x_i$) and new-modality inputs ($x_n$). The objective is to align the features produced by $G$ with those generated by the original front end $F$ on matched pairs, using the following composite loss:
- Feature Reconstruction Loss (FRL): $\mathcal{L}_{\mathrm{FRL}} = \lVert G(x_n) - F(x_i) \rVert_2^2$
Forces $G$ to produce front-end activations closely matching $F(x_i)$.
- Feature Evaluation Loss (FEL): $\mathcal{L}_{\mathrm{FEL}} = \lVert M(G(x_n)) - M(F(x_i)) \rVert_2^2$
Ensures alignment of downstream, mid-network activations.
- Feature Style Loss (FSL): $\mathcal{L}_{\mathrm{FSL}} = \lVert \mathrm{Gr}(G(x_n)) - \mathrm{Gr}(F(x_i)) \rVert_2^2$
Penalizes discrepancies in second-order feature statistics via Gram matrices $\mathrm{Gr}(A)_{cc'} = \sum_{h,w} A_{chw} A_{c'hw}$, with weighting hyperparameters set as in the original work (Hu et al., 2020).
The total alignment loss is:

$\mathcal{L} = \mathcal{L}_{\mathrm{FRL}} + \mathcal{L}_{\mathrm{FEL}} + \lambda\,\mathcal{L}_{\mathrm{FSL}}$

with the style weight $\lambda$ at its default reported value. Empirical ablation shows most gains come from FRL and FEL, with FSL providing modest additional benefit.
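The three terms can be written as a short plain-NumPy sketch on matched $(C, H, W)$ feature maps; the helpers `mse` and `gram` and all sizes are illustrative:

```python
import numpy as np

def mse(a, b):
    return np.mean((a - b) ** 2)

def gram(feat):
    C = feat.shape[0]
    flat = feat.reshape(C, -1)            # (C, H*W)
    return flat @ flat.T / flat.shape[1]  # (C, C) second-order statistics

def alignment_loss(g_feat, f_feat, mg_feat, mf_feat, lam_style=1.0):
    frl = mse(g_feat, f_feat)                 # Feature Reconstruction Loss
    fel = mse(mg_feat, mf_feat)               # Feature Evaluation Loss
    fsl = mse(gram(g_feat), gram(f_feat))     # Feature Style Loss (Gram)
    return frl + fel + lam_style * fsl

rng = np.random.default_rng(1)
f = rng.standard_normal((4, 8, 8))            # a (C, H, W) feature map
loss_zero = alignment_loss(f, f, f, f)        # perfectly aligned -> 0
loss_pos = alignment_loss(f + 1.0, f, f, f)   # misaligned front end -> > 0
```

Note that all four feature arguments come from forward passes of the frozen blocks and the trainable $G$; only $G$ receives gradients.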
3. Training Procedure and Parameterization
GRAFT-Net training uses only modest amounts of unlabeled, time-synchronized intensity and novel-modality frames (typically cropped into patches), matched per timestep. The parameters of $G$ are trained with the Adam optimizer (learning rate 1e-4), batch size 8, and 100 epochs on a single NVIDIA RTX 2080 Ti GPU, typically converging within 1.5 hours. Because only the light front end is trained, not the backbone, training time is approximately 5% of that required for full supervision.
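The freeze-the-backbone, train-only-$G$ regime can be illustrated with a toy linear front end. Plain full-batch gradient descent stands in for Adam, and the matrix `P` is a made-up stand-in for the sensor relation between synchronized captures:

```python
import numpy as np

rng = np.random.default_rng(2)
W_f = rng.standard_normal((16, 3))   # frozen intensity front end F (linear toy)
W_g = np.zeros((16, 5))              # trainable grafted front end G

X_i = rng.standard_normal((3, 256))  # intensity inputs (dims x batch)
P = rng.standard_normal((5, 3))      # hypothetical sensor relation
X_n = P @ X_i                        # matched, synchronized new-modality inputs

targets = W_f @ X_i                  # front-end features G must reconstruct
initial_frl = np.mean((W_g @ X_n - targets) ** 2)

lr = 3e-2
for _ in range(1000):                # gradient descent stands in for Adam
    residual = W_g @ X_n - targets
    grad = 2.0 * residual @ X_n.T / X_n.shape[1]
    W_g -= lr * grad                 # only G is updated; W_f stays frozen

final_frl = np.mean((W_g @ X_n - targets) ** 2)
```

The key structural point survives the simplification: gradients flow only into the grafted front end, which is why training costs a small fraction of full supervision.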
The depth of the grafted front end $G$ significantly affects performance:
- S1: 0 residual blocks (40k parameters, 0.06% total)
- S2: 1 residual block (279k parameters, 0.45%)
- S3: 2 residual blocks (3.2M parameters, 5.17%)
Deeper front ends (e.g., S3) yield better alignment and task performance, while shallower configurations lead to significant drops.
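The listed fractions imply a backbone of roughly 62M parameters (the scale of YOLOv3); that figure is an inference from the table rather than a number stated above, but the stated fractions are mutually consistent with it:

```python
# Cross-check the front-end sizes against their stated percentages,
# assuming a ~62M-parameter backbone (inferred, YOLOv3 scale).
backbone_params = 62_000_000

configs = {
    "S1": (40_000, 0.06),
    "S2": (279_000, 0.45),
    "S3": (3_200_000, 5.17),
}

for name, (params, stated_pct) in configs.items():
    pct = 100.0 * params / backbone_params
    assert abs(pct - stated_pct) < 0.1, (name, pct)
```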
4. Empirical Results and Performance Benchmarks
GRAFT-Net achieves near-backbone performance with substantially reduced parameter updates and training labels. Summary of reported results:
| Task | Standard Supervised Model | GRAFT-Net (Self-supervised) | Notes |
|---|---|---|---|
| FLIR thermal object detection | YOLOv3 on intensity: AP50=30.36 | GN+backbone: AP50=45.27±1.14 | 49.1% rel. gain vs. night intensity |
| FLIR thermal (fully supervised baseline) | Thermal Faster R-CNN: AP50=53.97 | — | Requires 47M labeled images |
| Event-camera car detection (MVSEC) | YOLOv3 on intensity: AP50=73.53 | DVS-3: 70.14±0.36 / DVS-10: 70.35±0.51 | 40% data: 66.75, 10% data: 47.88 |
| MNIST digit classification (N-MNIST) | LeNet-5 on MNIST: 0.92% error | 1.47% error (8% params) | Fully supervised N-MNIST: 1.11% error |
Total FLOPs and inference latency for GRAFT-Net are essentially unchanged from the original backbone, with no extra preprocessing (e.g., no intensity reconstruction for event-based data).
5. Practical Considerations and Modalities
GRAFT-Net is modality-agnostic, accommodating any raw sensor output structured as a 2D or 3D tensor, including hyperspectral cubes, polarization maps, and time-of-flight depth data. The only prerequisite is the availability of a modest quantity of synchronized, unlabeled pairs with standard intensity frames.
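In practice the modality-specific part reduces to the first layer of $G$, which maps whatever channel count the sensor produces onto the channel width the frozen backbone expects. A minimal sketch with made-up channel counts:

```python
import numpy as np

rng = np.random.default_rng(3)
BACKBONE_CHANNELS = 64               # channel width F would emit (assumed)

def make_front_end(in_channels):
    # 1x1-conv stand-in: a per-pixel linear map across channels
    W = rng.standard_normal((BACKBONE_CHANNELS, in_channels)) * 0.1
    return lambda x: np.einsum('oc,chw->ohw', W, x)

modalities = {
    "thermal": 1,          # single-channel frames
    "polarization": 4,     # Stokes-style channel maps
    "event_volume": 10,    # time-binned event grids
    "hyperspectral": 31,   # spectral bands
}

feats = {}
for name, c in modalities.items():
    G_first = make_front_end(c)
    feats[name] = G_first(rng.standard_normal((c, 32, 32)))

# Every modality lands in the same feature space the backbone consumes.
assert all(f.shape == (BACKBONE_CHANNELS, 32, 32) for f in feats.values())
```

Downstream layers of $G$ and the frozen backbone are then identical regardless of which sensor feeds the network.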
The optimal grafting point trades off between alignment (deeper front ends) and efficiency (shallower front ends). Middle network alignment depth for FEL is less critical; it is mainly a choice balancing spatial and channel resolution. Style losses are optional, with primary alignment benefits delivered by FRL and FEL. Extensions may include contrastive or adversarial alignment losses.
6. Limitations and Prospective Research
GRAFT-Net requires accurate spatial calibration between the intensity and novel modality sensors (e.g., consistent cropping and alignment). The method underperforms if the new sensor data are weakly correlated with intensity features or if the available unlabeled set is too small (less than 500 pairs). Modalities that share little representational structure with intensity images (e.g., sonar) present significant challenges.
Future research directions include multi-scale grafts (aligning features at various spatial resolutions), learnable gating (allowing adaptive fusion of intensity and modality features), and dataset selection strategies (active sampling for enhanced convergence of $G$).
7. Context and Significance
By enabling adaptation of pretrained convolutional backbones to new vision modalities without labels or costly retraining, GRAFT-Net widens access to deep learning for resource-constrained sensor applications. The method facilitates rapid prototyping and deployment for novel sensors, maintaining inference speed and parameter efficiency, with demonstrated competitive performance across diverse benchmarks (Hu et al., 2020). A plausible implication is accelerated adoption of unconventional imaging sensors in detection and classification tasks, circumventing the prohibitive annotation costs previously required for deep architectures.