ContraSoM: Multi-Modal Channel Pre-training
- ContraSoM is a contrastive pre-training strategy that aligns visual, LiDAR, and CSI modalities to extract robust, channel-aware embeddings.
- It employs cross-modal contrastive learning with GRU-based temporal extrapolation and modality-specific augmentations to enhance generalization.
- The modular WiFo-M² architecture facilitates plug-and-play transfer to diverse wireless tasks, demonstrating marked improvements in task accuracy and NMSE.
ContraSoM is a contrastive pre-training strategy developed to extract generalizable out-of-band (OOB) channel features from multi-modal sensing, central to the WiFo-M foundation model for wireless communications. The methodology tightly couples visual (image), LiDAR, and channel state information (CSI) modalities, leveraging sophisticated cross-modal contrastive learning, temporal extrapolation, and modality-specific augmentation. ContraSoM enables plug-and-play transfer of robust channel-aware embeddings across a wide range of wireless transceiver tasks and environments, thereby improving generalization and minimizing downstream fine-tuning overhead (Zhang et al., 14 Jan 2026).
1. Objective and Contrastive Loss Formulations
The core objective of ContraSoM is to enforce alignment between embeddings from visual/LiDAR sequences and their corresponding CSI representations at future timestamps while maximizing dissimilarity to other (cross-sample or augmented) embedding pairs. This is formalized as a symmetric InfoNCE loss operating over multi-modal embedding sequences.
Processed image sequences $\{I_t\}$, labeled LiDAR point clouds $\{P_t\}$, and CSI are independently encoded to frame-level features. The image and LiDAR features pass through a gated recurrent unit (GRU) with a projection head, producing temporally indexed embeddings $z^{\mathrm{img}}_{i,\tau}$ and $z^{\mathrm{lid}}_{i,\tau}$ for future timestamps $\tau$, while a frozen WiFo transformer outputs CSI embeddings $z^{\mathrm{csi}}_{i,\tau}$.
For each sample $i$ and timestamp $\tau$, $(z^{m}_{i,\tau}, z^{\mathrm{csi}}_{i,\tau})$ is a positive pair, where $m \in \{\mathrm{img}, \mathrm{lid}\}$, with all cross-sample and, for LiDAR, augmented pairs as negatives. The symmetric InfoNCE contrastive loss over a batch of $N$ samples is

$$\mathcal{L}_{\mathrm{NCE}}^{m} = -\frac{1}{2N}\sum_{i=1}^{N}\left[\log\frac{\exp\!\big(\mathrm{sim}(z^{m}_{i,\tau}, z^{\mathrm{csi}}_{i,\tau})/\kappa\big)}{\sum_{j=1}^{N}\exp\!\big(\mathrm{sim}(z^{m}_{i,\tau}, z^{\mathrm{csi}}_{j,\tau})/\kappa\big)} + \log\frac{\exp\!\big(\mathrm{sim}(z^{\mathrm{csi}}_{i,\tau}, z^{m}_{i,\tau})/\kappa\big)}{\sum_{j=1}^{N}\exp\!\big(\mathrm{sim}(z^{\mathrm{csi}}_{i,\tau}, z^{m}_{j,\tau})/\kappa\big)}\right],$$

where $\mathrm{sim}(\cdot,\cdot)$ denotes cosine similarity, and $\kappa$ is the temperature parameter (set to $0.1$ in practice).
The branch-specific losses are:
- Image–CSI: $\mathcal{L}_{\mathrm{img}} = \mathcal{L}_{\mathrm{NCE}}^{\mathrm{img}}$, computed between image and CSI embeddings.
- LiDAR–CSI (with diffusion augmentation): $\mathcal{L}_{\mathrm{lid}} = \mathcal{L}_{\mathrm{NCE}}^{\mathrm{lid}}$, computed between diffusion-augmented LiDAR embeddings and CSI embeddings.
- Standard diffusion (DDPM) objective: $\mathcal{L}_{\mathrm{DDPM}} = \mathbb{E}_{t, z_0, \epsilon}\big[\|\epsilon - \epsilon_{\theta}(z_t, t)\|_2^2\big]$
- Full loss: $\mathcal{L} = \mathcal{L}_{\mathrm{img}} + \mathcal{L}_{\mathrm{lid}} + \lambda\,\mathcal{L}_{\mathrm{DDPM}}$, with weighting coefficient $\lambda$.
This contrastive formulation compels embeddings from heterogeneous sensors to encode the latent physical drivers of the wireless channel at high temporal granularity.
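The symmetric InfoNCE objective above can be sketched in NumPy as follows; variable names and sizes are illustrative, not from the paper:

```python
import numpy as np

def symmetric_info_nce(z_mod, z_csi, kappa=0.1):
    """Symmetric InfoNCE over a batch of N embedding pairs.

    z_mod : (N, D) sensor-branch embeddings (image or LiDAR) at one timestamp.
    z_csi : (N, D) CSI embeddings from the frozen WiFo transformer.
    kappa : temperature (0.1 in practice).
    """
    # Cosine similarity = dot product of L2-normalized embeddings.
    z_mod = z_mod / np.linalg.norm(z_mod, axis=1, keepdims=True)
    z_csi = z_csi / np.linalg.norm(z_csi, axis=1, keepdims=True)
    logits = z_mod @ z_csi.T / kappa          # (N, N); diagonal = positives

    def nce(l):
        # Cross-entropy with the diagonal entry as the target class.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Symmetrize over both directions: sensor->CSI and CSI->sensor.
    return 0.5 * (nce(logits) + nce(logits.T))

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
aligned = symmetric_info_nce(z, z)                        # perfectly aligned pairs
random = symmetric_info_nce(z, rng.normal(size=(8, 16)))  # unrelated pairs
```

As expected, the loss is small when sensor and CSI embeddings coincide and near $\log N$ for unrelated pairs.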
2. Model Architecture
ContraSoM leverages a modular, multi-branch encoder design:
- Image branch (WiFo-M-Img): ResNet-34 extracts spatial features, followed by a GRU for temporal modeling and a linear projection head, yielding image embeddings $z^{\mathrm{img}}_\tau$.
- LiDAR branch (WiFo-M-LiDAR): PointNet backbone for 3D geometry, followed by a GRU and projection head, generates LiDAR embeddings $z^{\mathrm{lid}}_\tau$.
- CSI branch: A frozen WiFo transformer yields CSI embeddings $z^{\mathrm{csi}}_\tau$.
- Temporal sequence-to-sequence: Each GRU predicts embeddings at multiple future timestamps, bridging the sparse sensor frame rates and the denser CSI sampling frequency.
- Downstream adapters: At inference, modality embeddings are concatenated or fused and input to lightweight adapters (e.g., MLPs for beam prediction or channel estimation) with only minor parameter overhead.
This architecture facilitates plug-and-play transfer: frozen backbones provide task-agnostic, robust OOB features, and adapters are fine-tuned for specific transceiver tasks.
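The adapter stage can be sketched as concatenation fusion of frozen modality embeddings followed by a small MLP head; the layer sizes and the beam-codebook size below are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

def mlp_adapter(x, w1, b1, w2, b2):
    """Lightweight two-layer MLP adapter; only these weights are fine-tuned."""
    h = np.maximum(x @ w1 + b1, 0.0)   # ReLU hidden layer
    return h @ w2 + b2                 # task head (e.g., beam logits)

# Frozen backbones produce per-timestamp embeddings (illustrative sizes).
z_img, z_lidar, z_csi = (rng.normal(size=(4, 64)) for _ in range(3))
fused = np.concatenate([z_img, z_lidar, z_csi], axis=1)   # (4, 192) fused features

n_beams = 32  # hypothetical beam-codebook size
w1 = rng.normal(scale=0.1, size=(192, 128)); b1 = np.zeros(128)
w2 = rng.normal(scale=0.1, size=(128, n_beams)); b2 = np.zeros(n_beams)
beam_logits = mlp_adapter(fused, w1, b1, w2, b2)          # (4, 32) logits
beam_choice = beam_logits.argmax(axis=1)                  # predicted beam indices
```

Because the backbones stay frozen, only the adapter weights (`w1`, `b1`, `w2`, `b2` here) contribute to the per-task parameter overhead.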
3. Modality-Specific Data Augmentation
Robust generalization is supported by modality-tailored data augmentation:
- Images: Pixel-level augmentations comprise color jitter (brightness, contrast, saturation, hue), Gaussian blur, random erasing, and normalization (ImageNet mean/std). These generate two randomly perturbed “views” per raw image to encourage invariance to appearance variations.
- LiDAR: Feature-space augmentation via diffusion models. Forward noising uses the DDPM process
$$z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I),$$
followed by DDIM-based backward sampling (24 steps with a fixed stride) to produce two distinct, geometry-preserving augmented features.
These augmentations yield embeddings that are invariant to nuisance variations (illumination, occlusion, viewpoint, point-cloud sparsity), yet sensitive to channel-relevant geometric cues.
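The forward-noising/DDIM round trip can be sketched in NumPy. This is a sketch only: it assumes a standard linear $\beta$ schedule, and the learned noise predictor is replaced by an oracle $\epsilon$ so the trajectory is exactly invertible:

```python
import numpy as np

rng = np.random.default_rng(2)
T = 1000
betas = np.linspace(1e-4, 0.02, T)            # assumed linear DDPM noise schedule
alpha_bar = np.cumprod(1.0 - betas)

def ddpm_forward(z0, t, eps):
    """Closed-form forward noising: z_t = sqrt(a_bar_t) z0 + sqrt(1 - a_bar_t) eps."""
    return np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps

def ddim_step(z_t, t, t_prev, eps_hat):
    """One deterministic DDIM step using a noise estimate eps_hat (oracle here)."""
    z0_hat = (z_t - np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alpha_bar[t])
    return np.sqrt(alpha_bar[t_prev]) * z0_hat + np.sqrt(1.0 - alpha_bar[t_prev]) * eps_hat

# Noise a LiDAR feature vector, then denoise along a strided 24-step trajectory.
z0 = rng.normal(size=128)
eps = rng.normal(size=128)
steps = np.linspace(240, 0, 25).astype(int)    # 24 DDIM steps, fixed stride
z = ddpm_forward(z0, steps[0], eps)
for t, t_prev in zip(steps[:-1], steps[1:]):
    z = ddim_step(z, t, t_prev, eps)           # oracle eps -> near-exact recovery
```

In the actual augmentation, a trained denoiser supplies `eps_hat`, so the two sampled trajectories land near, but not exactly on, the original feature, yielding geometry-preserving perturbations.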
4. Pre-training Pipeline
ContraSoM's unified multi-modal pre-training pipeline consists of:
- Data preprocessing: Images are preprocessed via YOLOv8 for vehicle detection, bounding boxes are assigned to targets via azimuth, and boxes are overwritten with HSV→RGB encodings of receiver relative angle. LiDAR frames undergo DBSCAN clustering, centroid matching, and framewise cluster tracking, with points labeled as receiver/building/background.
- Feature extraction: Temporal sequences of images and point clouds are processed by the relevant backbones and GRUs to form frame-level features and downstream embeddings. CSI sequences are synchronously encoded by the frozen WiFo transformer.
- Batch construction: For each mini-batch of size $N$ (a separate batch size is used for LiDAR), for each sample and timestamp, one positive pair and $2N-1$ negatives are constructed.
- Optimization: AdamW optimizer with weight decay; 100 epochs; branch-specific learning rates for LiDAR and image; cosine learning rate annealing with a 10-epoch warm-up; shared embedding dimension across branches.
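The warm-up plus cosine-annealing schedule can be sketched as follows; the base learning rate is illustrative, since the branch-specific values are not reproduced here:

```python
import math

def lr_at_epoch(epoch, base_lr, total_epochs=100, warmup_epochs=10):
    """Linear warm-up for the first 10 epochs, then cosine annealing toward zero."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# One value per training epoch (illustrative base rate of 1e-3).
schedule = [lr_at_epoch(e, base_lr=1e-3) for e in range(100)]
```

The schedule ramps linearly to the base rate by epoch 10, then decays smoothly, which stabilizes the early contrastive updates.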
The full pipeline yields densely sampled, robust OOB embeddings amenable to downstream wireless communication tasks.
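The HSV→RGB receiver-angle encoding from the preprocessing step can be sketched with the standard-library `colorsys` module; the exact angle-to-hue mapping below is an assumption for illustration:

```python
import colorsys

def angle_to_rgb(azimuth_deg, s=1.0, v=1.0):
    """Encode a receiver's relative azimuth as an RGB fill color:
    map the angle onto the hue circle, then convert HSV -> RGB."""
    hue = (azimuth_deg % 360.0) / 360.0          # wrap the angle into [0, 1)
    r, g, b = colorsys.hsv_to_rgb(hue, s, v)
    return int(r * 255), int(g * 255), int(b * 255)

# A detected vehicle's bounding box would be filled with this color,
# so the image encoder sees the relative angle as appearance.
front = angle_to_rgb(0.0)      # hue 0 -> red
side = angle_to_rgb(120.0)     # hue 1/3 -> green
```

Painting the box with an angle-dependent color injects geometric side information into an otherwise purely visual input.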
5. Theoretical Basis for OOB Generalization
ContraSoM’s training strategy enhances OOB feature generalization through several mechanisms:
- Cross-modal alignment: By forcing visual/LiDAR embeddings at each timestamp to align with the corresponding CSI embeddings (and repel others), the network learns to attend to latent factors such as physical scatterer geometry, LoS/NLoS conditions, and mobility, which causally affect wireless channel evolution.
- Temporal extrapolation: The GRUs are explicitly trained to forecast future representations (“hallucinate” forthcoming fine-grained features), compensating for the temporal sampling disparity between sensors and channel measurements.
- Augmentation-induced invariance: Pixel-space augmentations for images and diffusion-based feature augmentation for LiDAR instill invariance to superficial perturbations while maintaining task-relevant sensitivity, thus yielding channel-aware but robust embeddings.
- Task and scenario transferability: These embeddings generalize across tasks (beam selection, channel estimation, interpolation, prediction) and propagate to diverse operational scenarios including traffic intersections, urban environments, campuses, and real 60 GHz channel measurements.
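The temporal-extrapolation mechanism above can be illustrated with a toy GRU rollout: the cell consumes sparse sensor frames, then free-runs to emit embeddings at denser future timestamps. Weights here are random and purely illustrative, not the trained model:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TinyGRU:
    """Minimal GRU cell with square, randomly initialized weights."""
    def __init__(self, dim, seed=3):
        rng = np.random.default_rng(seed)
        # Stacked update/reset/candidate weights for input and hidden state.
        self.wx = rng.normal(scale=0.2, size=(3, dim, dim))
        self.wh = rng.normal(scale=0.2, size=(3, dim, dim))

    def step(self, x, h):
        z = sigmoid(x @ self.wx[0] + h @ self.wh[0])              # update gate
        r = sigmoid(x @ self.wx[1] + h @ self.wh[1])              # reset gate
        h_tilde = np.tanh(x @ self.wx[2] + (r * h) @ self.wh[2])  # candidate
        return (1 - z) * h + z * h_tilde

def extrapolate(gru, frames, n_future):
    """Encode sparse sensor frames, then free-run the GRU to emit one
    embedding per (denser) future CSI timestamp."""
    h = np.zeros(frames.shape[1])
    for x in frames:                   # consume the observed frames
        h = gru.step(x, h)
    preds = []
    for _ in range(n_future):          # feed the state back as the next input
        h = gru.step(h, h)
        preds.append(h)
    return np.stack(preds)

gru = TinyGRU(dim=16)
frames = np.random.default_rng(4).normal(size=(5, 16))  # 5 sparse sensor frames
future = extrapolate(gru, frames, n_future=8)           # 8 dense future embeddings
```

During pre-training, each of these forecast embeddings is pulled toward the CSI embedding at its timestamp, which is what forces the rollout to track channel dynamics rather than drift.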
6. Experimental Results and Ablation Analysis
Empirical results demonstrate the effectiveness and generalizability of ContraSoM-pretrained WiFo-M backbones:
- Beam Prediction: WiFo-M-Img matches or outperforms Vision-BP baselines (+3% Top-1 accuracy on L4–L6) on 7 test links, despite frozen backbones. Multi-modal fusion (MM-BP vs. WiFo-M full) yields only a ∼2% Top-1 gap, evidencing high channel-awareness in learned embeddings.
- Channel Estimation (NMSE): WiFo-M variants consistently lower NMSE by $0.2$–$0.8$ dB against CENN and FCDAMP baselines. Multi-modal fusion gives maximum improvements; single modalities still yield $0.1$–$0.5$ dB gain.
- Channel Interpolation (NMSE): On SR-CI and LPCCNet baselines, average NMSE gains are $1$–$2$ dB in favorable configurations, with smaller/no gains at very high resolution (e.g., 256 antennas).
- Channel Prediction (uplink→downlink NMSE): Up to $1.5$ dB NMSE reduction is observed on link L1, with consistent gains on 5/7 links.
- Ablation Studies:
- Removing temporal feature extrapolation degrades NMSE on channel estimation, interpolation, and prediction.
- Replacing ContraSoM pre-training with ImageNet-based weights degrades LPCCNet NMSE performance.
- Eliminating diffusion augmentation lowers performance by $0.1$–$0.6$ dB.
- Cross-Scenario Generalization: WiFo-M features maintain strong gains on three unseen datasets: up to $1.3$ dB NMSE improvement on CI, $2.3$ dB on CP, and improved Top-3 accuracy on one beam-prediction scenario.
- Deployment Overhead: WiFo-M-LiDAR has $0.07$ million parameters and $3.3$ ms inference; WiFo-M-Img uses $9.62$ million parameters and $6.0$ ms inference; per-task adapters add at most $0.26$ million parameters.
7. Significance and Impact
ContraSoM achieves universal, task-agnostic OOB channel embeddings suitable for boosting a broad spectrum of wireless communication tasks with minimal per-task adaptation. By unifying contrastive cross-modal learning, temporal feature extrapolation, and modality-wise data augmentation, it realizes a plug-and-play foundation model paradigm, significantly improving scalability, robustness, and out-of-distribution generalization in sensor-aided wireless systems (Zhang et al., 14 Jan 2026).