ContraSoM: Multi-Modal Channel Pre-training

Updated 22 January 2026

ContraSoM is a contrastive pre-training strategy that aligns visual, LiDAR, and CSI modalities to extract robust, channel-aware embeddings.
It employs cross-modal contrastive learning with GRU-based temporal extrapolation and modality-specific augmentations to enhance generalization.
The modular WiFo-M² architecture facilitates plug‐and‐play transfer to diverse wireless tasks, demonstrating marked improvements in performance and NMSE metrics.

ContraSoM is a contrastive pre-training strategy developed to extract generalizable out-of-band (OOB) channel features from multi-modal sensing, central to the WiFo-M $^2$ foundation model for wireless communications. The methodology tightly couples visual (image), LiDAR, and channel state information (CSI) modalities, leveraging sophisticated cross-modal contrastive learning, temporal extrapolation, and modality-specific augmentation. ContraSoM enables plug-and-play transfer of robust channel-aware embeddings across a wide range of wireless transceiver tasks and environments, thereby improving generalization and minimizing downstream fine-tuning overhead (Zhang et al., 14 Jan 2026).

1. Objective and Contrastive Loss Formulations

The core objective of ContraSoM is to enforce alignment between embeddings from visual/LiDAR sequences and their corresponding CSI representations at future timestamps while maximizing dissimilarity to other (cross-sample or augmented) embedding pairs. This is formalized as a symmetric InfoNCE loss operating over multi-modal embedding sequences.

Processed image sequences $I^{(i)}$ , labeled LiDAR point clouds $P^{(i)}$ , and CSI $H^{(i)}$ are independently encoded to frame-level features $u_t^{I,(i)}$ , $u_t^{L,(i)} \in \mathbb{R}^d$ . These pass through a gated recurrent unit (GRU) with a projection head, producing temporally indexed embeddings $v_\tau^{I,(i)}$ , $v_\tau^{L,(i)} \in \mathbb{R}^d$ for timestamps $\tau \in \{T, T+\Delta\tau, \ldots\}$ , while a frozen WiFo transformer outputs CSI embeddings $z_\tau^{C,(i)} \in \mathbb{R}^d$ .

For each sample $i$ and timestamp $\tau$ , $(v_\tau^{x,(i)}, z_\tau^{C,(i)})$ is a positive pair ( $x \in \{I, L\}$ ), with all cross-sample and, for LiDAR, augmented pairs as negatives. The symmetric InfoNCE contrastive loss is

$\mathcal{L}_{\text{InfoNCE}}(Z_a, Z_b) = -\frac{1}{N} \sum_{i=1}^N \log \frac{\exp(s(z_a^{(i)}, z_b^{(i)})/\tau_e)}{\sum_{j=1}^N \exp(s(z_a^{(i)}, z_b^{(j)})/\tau_e)}$

$\mathcal{L}_{\text{sym}}(Z_a, Z_b) = \tfrac{1}{2}\left[\mathcal{L}_{\text{InfoNCE}}(Z_a, Z_b) + \mathcal{L}_{\text{InfoNCE}}(Z_b, Z_a)\right]$

where $s(u,v)$ denotes cosine similarity, and $\tau_e$ is the temperature parameter (set to $0.1$ in practice).

The branch-specific losses are:

Image–CSI: $\mathcal{L}^{I} = \mathcal{L}_{\mathrm{sym}}(\{v^{I,(i)}_\tau\}, \{z^{C,(i)}_\tau\})$
LiDAR–CSI (with diffusion augmentation):
- $\mathcal{L}_{\text{contra}} = \lambda_1\mathcal{L}_{\mathrm{sym}}(v^0, z^C) + \lambda_2\mathcal{L}_{\mathrm{sym}}(\tilde{v}^{L(1)}, z^C) + \lambda_3\mathcal{L}_{\mathrm{sym}}(\tilde{v}^{L(2)}, z^C)$
- Standard diffusion (DDPM) objective: $\mathcal{L}_{\text{diff}} = \mathbb{E}_{t,v^0,\epsilon} [\|\epsilon - \epsilon_\theta(v^0_t, t)\|^2]$
- Full loss: $\mathcal{L}^L = \mathcal{L}_{\text{contra}} + \lambda_{\text{diff}}\mathcal{L}_{\text{diff}}$ , with $(\lambda_1, \lambda_2, \lambda_3) = (1, 0.25, 0.25)$ , $\lambda_{\text{diff}} = 0.3$

This contrastive formulation compels embeddings from heterogeneous sensors to encode the latent physical drivers of the wireless channel at high temporal granularity.

2. Model Architecture

ContraSoM leverages a modular, multi-branch encoder design:

Image branch (WiFo-M $^2$ -Img): ResNet-34 extracts spatial features, followed by a GRU for temporal modeling and a linear projection head, yielding $v_\tau^{I,(i)}$ .
LiDAR branch (WiFo-M $^2$ -LiDAR): PointNet backbone for 3D geometry, GRU, and projection head to generate $v_\tau^{L,(i)}$ .
CSI branch: Frozen WiFo transformer yields $z_\tau^{C,(i)}$ .
Temporal sequence-to-sequence: Each GRU predicts embeddings at multiple future timestamps, bridging the sparse sensor frame rates and the denser CSI sampling frequency.
Downstream adapters: At inference, modality embeddings $v^I_\tau, v^L_\tau$ are concatenated or fused and input to lightweight adapters (e.g., MLPs for beam prediction or channel estimation) with only minor parameter overhead.

This architecture facilities plug-and-play transfer; frozen backbones provide task-agnostic, robust OOB features, and adapters are fine-tuned for specific transceiver tasks.

3. Modality-Specific Data Augmentation

Robust generalization is supported by modality-tailored data augmentation:

Images: Pixel-level augmentations comprise color jitter (brightness, contrast, saturation, hue), Gaussian blur ( $p=0.5$ , $\sigma \sim U(0.1,2.0)$ ), random erasing ( $p=0.3$ ), and normalization (ImageNet mean/std). These generate two randomly perturbed “views” $\tilde{I}$ per raw image to encourage invariance to appearance variations.
LiDAR: Feature-space augmentation via diffusion models. Forward noising uses the DDPM process

$v_m = \sqrt{\bar{\alpha}_m} v_0 + \sqrt{1-\bar{\alpha}_m} \epsilon, \quad \epsilon \sim \mathcal{N}(0,I),\quad \bar{\alpha}_m=\prod_{s=1}^m(1-\beta_s)$

Followed by DDIM-based backward sampling (24 steps, stride $\Delta m \approx m/8$ ) to produce two distinct, geometry-preserving augmented features $\tilde{v}^{L(1)}, \tilde{v}^{L(2)}$ .

These augmentations yield embeddings that are invariant to nuisance variations (illumination, occlusion, viewpoint, point-cloud sparsity), yet sensitive to channel-relevant geometric cues.

4. Pre-training Pipeline

ContraSoM's unified multi-modal pre-training pipeline consists of:

Data preprocessing: Images are preprocessed via YOLOv8 for vehicle detection, bounding boxes are assigned to targets via azimuth, and boxes are overwritten with HSV→RGB encodings of receiver relative angle. LiDAR frames undergo DBSCAN clustering, centroid matching, and framewise cluster tracking, with points labeled as receiver/building/background.
Feature extraction: Temporal sequences of images ( $n_I + 1$ frames) and point clouds ( $n_L+1$ frames) are processed by the relevant backbones and GRUs to form $u_t^{I/L}$ and downstream embeddings $v_\tau^{I/L}$ . CSI sequences are synchronously encoded by the frozen WiFo transformer.
Batch construction: For each mini-batch ( $N=1024$ for LiDAR), for each sample $i$ and timestamp $\tau$ , one positive pair and $2N-1$ negatives are constructed.
Optimization: AdamW optimizer; 100 epochs; learning rate $1 \times 10^{-4}$ (LiDAR), $5 \times 10^{-4}$ (Image); weight decay $5 \times 10^{-4}$ ; cosine learning rate annealing with a 10-epoch warm-up; embedding dimension $d=512$ .

The full pipeline yields densely sampled, robust OOB embeddings amenable to downstream wireless communication tasks.

5. Theoretical Basis for OOB Generalization

ContraSoM’s training strategy enhances OOB feature generalization through several mechanisms:

Cross-modal alignment: By forcing visual/LiDAR embeddings at time $\tau$ to align with CSI embeddings $z_\tau^C$ (and repel others), the network learns to attend to latent factors such as physical scatterer geometry, LoS/NLoS conditions, and mobility, which causally affect wireless channel evolution.
Temporal extrapolation: The GRUs are explicitly trained to forecast future representations (“hallucinate” forthcoming fine-grained features), compensating for the temporal sampling disparity between sensors and channel measurements.
Augmentation-induced invariance: Pixel-space augmentations for images and diffusion-based feature augmentation for LiDAR instill invariance to superficial perturbations while maintaining task-relevant sensitivity, thus yielding channel-aware but robust embeddings.
Task and scenario transferability: These embeddings generalize across tasks (beam selection, channel estimation, interpolation, prediction) and propagate to diverse operational scenarios including traffic intersections, urban environments, campuses, and real 60 GHz channel measurements.

6. Experimental Results and Ablation Analysis

Empirical results demonstrate the effectiveness and generalizability of ContraSoM-pretrained WiFo-M $^2$ backbones:

Beam Prediction: WiFo-M $^2$ -Img matches or outperforms Vision-BP baselines (+3% Top-1 accuracy on L4–L6) on 7 test links, despite frozen backbones. Multi-modal fusion (MM-BP vs. WiFo-M $^2$ full) yields only a ∼2% Top-1 gap, evidencing high channel-awareness in learned embeddings.
Channel Estimation (NMSE): WiFo-M $^2$ variants consistently lower NMSE by $0.2$–$0.8$ dB against CENN and FCDAMP baselines. Multi-modal fusion gives maximum improvements; single modalities still yield $0.1$–$0.5$ dB gain.
Channel Interpolation (NMSE): On SR-CI and LPCCNet baselines, average NMSE gains are $1$–$2$ dB in favorable configurations, with smaller/no gains at very high resolution (e.g., 256 antennas).
Channel Prediction (uplink→downlink NMSE): Up to $1.5$ dB NMSE reduction is observed (from $-10.74$ dB→ $-12.21$ dB on L1) with consistent gains on 5/7 links.
Ablation Studies:
- Removing temporal feature extrapolation degrades CE (NMSE $-3.38\to -3.23$ ), CI ( $-9.63 \to -8.87$ ), and CP ( $-3.38 \to -3.30$ ).
- Replacing ContraSoM pre-training with ImageNet-based weights degrades LPCCNet performance from $-9.63$ dB to $-7.55$ dB.
- Eliminating diffusion augmentation lowers performance by $0.1$–$0.6$ dB.
Cross-Scenario Generalization: WiFo-M $^2$ features maintain strong gains on three unseen datasets: up to $1.3$ dB NMSE improvement on CI, $2.3$ dB on CP, and improvement to $100\%$ Top-3 accuracy on one beam-prediction scenario (from $91\%$ ).
Deployment Overhead: WiFo-M $^2$ -LiDAR has $0.07$ million parameters and $3.3$ ms inference; WiFo-M $^2$ -Img uses $9.62$ million parameters and $6.0$ ms inference; per-task adapters add at most $0.26$ million parameters.

7. Significance and Impact

ContraSoM achieves universal, task-agnostic OOB channel embeddings suitable for boosting a broad spectrum of wireless communication tasks with minimal per-task adaptation. By unifying contrastive cross-modal learning, temporal feature extrapolation, and modality-wise data augmentation, it realizes a plug-and-play foundation model paradigm, significantly improving scalability, robustness, and out-of-distribution generalization in sensor-aided wireless systems (Zhang et al., 14 Jan 2026).

Markdown Report Issue Upgrade to Chat

References (1)

WiFo-M$^2$: Plug-and-Play Multi-Modal Sensing via Foundation Model to Empower Wireless Communications (2026)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to ContraSoM.

ContraSoM: Multi-Modal Channel Pre-training

1. Objective and Contrastive Loss Formulations

2. Model Architecture

3. Modality-Specific Data Augmentation

4. Pre-training Pipeline

5. Theoretical Basis for OOB Generalization

6. Experimental Results and Ablation Analysis

7. Significance and Impact

Topic to Video (Beta)

Whiteboard

Follow Topic

Continue Learning

Don't miss out on important new AI/ML research

ContraSoM: Multi-Modal Channel Pre-training

1. Objective and Contrastive Loss Formulations

2. Model Architecture

3. Modality-Specific Data Augmentation

4. Pre-training Pipeline

5. Theoretical Basis for OOB Generalization

6. Experimental Results and Ablation Analysis

7. Significance and Impact

Topic to Video (Beta)

Whiteboard

Follow Topic

Continue Learning

Related Topics

Don't miss out on important new AI/ML research

Sign up for free to explore the frontiers of research