WiFo-M²: Wireless Optimization and Multi-Modal Sensing

Updated 22 January 2026

WiFo-M² is a dual-approach framework that combines hardware-level programmable metasurfaces for targeted Wi-Fi beamforming with software-driven multi-modal sensing for enhanced transceiver performance.
The metasurface module achieves focused energy redistribution using binary phase control, delivering up to 10× received power gains and efficient spatial reconfiguration in dynamic environments.
The foundation model leverages contrastive and diffusion pretraining to synchronize heterogeneous sensor data with high-speed CSI updates, facilitating plug-and-play wireless augmentation.

WiFo-M $^2$ refers to a pair of distinct yet convergent research directions that use advanced hardware and machine learning to enhance wireless communication systems through programmable electromagnetic control and universal multi-modal sensing. The term encompasses (1) a programmable metasurface-based wireless field optimizer for commodity Wi-Fi ("Controllable Enhancements of Wi-Fi Signals at Desired Locations Without Extra Energy Using Programmable Metasurface" (Shuang et al., 2019)) and (2) a foundation model for plug-and-play multi-modal out-of-band (OOB) sensing and transceiver augmentation ("WiFo-M $^2$: Plug-and-Play Multi-Modal Sensing via Foundation Model to Empower Wireless Communications" (Zhang et al., 14 Jan 2026)). The two directions manipulate radio signals in space and time at different layers of the wireless stack: the first through hardware-level spatial field reconfiguration, and the second through software-level extraction of universal OOB features and their transfer into transceiver modules.

1. Motivation and Problem Statement

The rapid proliferation of sensor-rich environments—ranging from intelligent transportation and autonomous factories to Internet of Things (IoT) deployments—has driven a need for more adaptive, efficient, and context-aware wireless infrastructure. Two key challenges persist:

Physical-layer spatial control: Traditional Wi-Fi and wireless deployments with omnidirectional or fixed-pattern antennas lack “on-demand” spatial energy control, leading to inefficiencies and signal wastage. Passive or semi-passive field control devices are attractive, but have historically lacked dynamic, fine-grained programmability.
Universal multi-modal augmentation: Modern intelligent platforms possess cameras and LiDAR providing valuable out-of-band (OOB) environmental context. However, integrating these heterogeneous streams into transceiver design remains bespoke and scenario-specific. Further, the low framerate of sensing modules (10–20 Hz) contrasts with the high CSI update rate (hundreds of Hz), complicating synchronized OOB feature extraction.

WiFo-M $^2$ addresses each challenge with a concrete solution: (1) reconfigurable metasurfaces for physical-layer spatial optimization, and (2) a foundation model for software-layer OOB feature alignment, inference, and fusion across diverse wireless tasks.

2. Programmable Metasurface Enhancement for Wi-Fi (WiFo-M $^2$ –PM)

2.1 Structural Overview

WiFo-M $^2$ –PM is a binary-phase programmable metasurface of 24 × 24 unit cells designed for the 2.4 GHz Wi-Fi band (Shuang et al., 2019). Each meta-atom consists of a copper patch (37 mm × 37 mm) and an FR-4 bias layer, and is connected to ground through a PIN diode and an RF choke, which gives each cell a discrete 0/π phase toggle. The metasurface’s 576 individually addressable cells (total aperture 1.296 m × 1.296 m) are controlled by a master PC connected to a Cyclone IV FPGA, which in turn drives 24 parallel shift-register chains at 50 MHz. The typical group response is |Γ| > 0.8 over a 55 MHz band (2.401–2.456 GHz).
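The stated dimensions can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the nominal 2.4 GHz carrier and the assumption that each of the 24 chains serves 24 cells and shifts one bit per clock cycle are not stated in the source.

```python
# Cross-check of the metasurface geometry quoted above.
C = 299_792_458.0        # speed of light in vacuum, m/s

N_SIDE = 24              # cells per side (from the text)
APERTURE_M = 1.296       # aperture side length in metres (from the text)
CLOCK_HZ = 50e6          # shift-register clock (from the text)
F_HZ = 2.4e9             # nominal carrier frequency (assumption)

cells = N_SIDE * N_SIDE                  # total addressable cells
pitch_m = APERTURE_M / N_SIDE            # centre-to-centre cell spacing
wavelength_m = C / F_HZ                  # free-space wavelength
pitch_in_wavelengths = pitch_m / wavelength_m

# Assumption: each of the 24 chains serves 24 cells and shifts one bit per clock.
shift_time_s = (cells // N_SIDE) / CLOCK_HZ

print(f"{cells} cells, pitch {pitch_m * 1e3:.1f} mm "
      f"({pitch_in_wavelengths:.2f} wavelengths), "
      f"shift-in time {shift_time_s * 1e6:.2f} us")
```

Under these assumptions the cell pitch is 54 mm, a little under half a free-space wavelength, and shifting one full code into the array takes well under a microsecond.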

2.2 Electromagnetic Principles and Control

The field manipulation is modeled via the Huygens’ equivalent-current paradigm. The induced currents in each meta-atom generate the scattered electric field; the total reflected field at an observation point is computed by superposition of all addressable cells:

$E_s(r) = \sum_{n_x=1}^{24} \sum_{n_y=1}^{24} A(r; n_x, n_y) \cdot I_{n_x, n_y}$

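where $I_{n_x, n_y}$ denotes the binary-phase state (0 or π) of cell $(n_x, n_y)$ and $A(r; n_x, n_y)$ weights that cell's contribution to the field at the observation point $r$.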

A modified Gerchberg–Saxton (G-S) iterative procedure solves for binary coding matrices to focus or pattern ambient Wi-Fi power at prescribed locations or shapes, subject to a cost function penalizing the difference between normalized field amplitude and target maps.
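The superposition sum and a G-S-style alternating projection can be sketched as follows. This is a minimal illustration, not the authors' modified procedure: it assumes a scalar free-space spherical-wave kernel for $A$, uniform illumination of the surface, a nominal 2.4 GHz carrier, and cell states encoded as +1 (phase 0) and −1 (phase π). All function names are hypothetical.

```python
import cmath
import math

C = 299_792_458.0
N, PITCH = 24, 0.054                 # 24 x 24 cells on a 54 mm grid (1.296 m / 24)
K = 2 * math.pi * 2.4e9 / C          # wavenumber at a nominal 2.4 GHz (assumption)

# Cell centres in the z = 0 plane, centred on the origin.
CELLS = [((ix - (N - 1) / 2) * PITCH, (iy - (N - 1) / 2) * PITCH, 0.0)
         for ix in range(N) for iy in range(N)]


def kernel(r, cell):
    """Assumed A(r; n_x, n_y): scalar free-space spherical wave from one cell."""
    d = math.dist(r, cell)
    return cmath.exp(-1j * K * d) / d


def forward(A, states):
    """Superpose all cells: E_s(r_m) = sum_n A[m][n] * I_n, with I_n in {+1, -1}."""
    return [sum(a * s for a, s in zip(row, states)) for row in A]


def binary_gs(targets, amplitudes, iterations=10):
    """Alternate between the target-amplitude and binary-phase constraints."""
    A = [[kernel(r, c) for c in CELLS] for r in targets]
    states = [1] * len(CELLS)                      # start from the full-ON state
    for _ in range(iterations):
        fields = forward(A, states)
        # Keep the current phase at each target, impose the desired amplitude.
        desired = [cmath.rect(amp, cmath.phase(e))
                   for amp, e in zip(amplitudes, fields)]
        # Back-propagate with the conjugate kernel, then quantise to 0 / pi.
        back = [sum(row[n].conjugate() * d for row, d in zip(A, desired))
                for n in range(len(CELLS))]
        states = [1 if b.real >= 0 else -1 for b in back]
    return states, A


targets = [(0.0, 0.0, 1.5)]                        # one focus 1.5 m in front
states, A = binary_gs(targets, [1.0])
focused = abs(forward(A, states)[0])
full_on = abs(forward(A, [1] * len(CELLS))[0])
print(f"power gain over full-ON: {(focused / full_on) ** 2:.1f}x")
```

With a single target, no iteration of this sketch can reduce the field magnitude at the focus, so the result is at least as strong as the full-ON state. With several targets it may settle in a local solution. The printed gain reflects the idealized kernel only and is not comparable to the measured results.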

2.3 Experimental and Quantitative Results

Focusing/Beamforming

Near-field scans show focused energy lobes with a full-width at half-maximum (FWHM) of $0.2$–$0.3$ m (at $z=1.5$ m), with predicted and measured values agreeing to within $10$–[…]%.
Efficiency: For example, at […] m, the […] dB focus efficiency varies from 21% to 40% across patterns.

Over-the-air Wi-Fi Gains

At a focus point […], optimized sequences enhance received power by approximately […] compared to the full-ON metasurface state.
Dual-point focusing demonstrates concurrent power enhancement at multiple spatial locations.

Robustness and Limitations

Patterns are stable within […]–[…] dB under static conditions; movement of the router or changes in the environment degrade performance.
The NP-hardness of the binary phase-optimization restricts global optimality (G-S finds local solutions unless assisted by metaheuristics).
Indoor multipath and metasurface aperture limit pattern contrast and sharpness.

2.4 System Latency and Improvements

Full reconfiguration latency (bitstream and switching) is under […] μs, with PC-FPGA latency dominated by USB ([…] milliseconds).
Future improvements include multi-level phase cells, SoC integration for sub-ms control, and active amplitude/phase adjustment.

2.5 Use Cases

Targeted wireless coverage in homes/offices, interference mitigation, dynamic spatial multiplexing, reconfigurable “digital walls,” device-free sensing, and energy harvesting are identified as practical domains.

3. Multi-Modal Sensing Foundation Model (WiFo-M $^2$ –FM)

3.1 Architecture and Workflow

WiFo-M $^2$ –FM is the first reported plug-and-play foundation model for OOB multi-modal sensing in wireless communication (Zhang et al., 14 Jan 2026). Its architecture includes:

Backbones for each modality: Images are processed via a ResNet-34 + GRU temporal head (WiFo-M $^2$ -Img); LiDAR streams use a PointNet + GRU (WiFo-M $^2$ -LiDAR).
Temporal sequence-to-sequence modeling: GRUs extrapolate (“future infer”) OOB features, from sequences of historical input, to times at which sensor frames may not be available.
Plug-in workflow: The pre-trained, frozen WiFo-M $^2$ backbone is connected to downstream transceiver modules (beam prediction, channel estimation, channel interpolation, channel prediction) via a small adapter (typically a 2-layer MLP), with only the adapter and task head fine-tuned per deployment.
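To give a sense of scale for the plug-in adapter, the sketch below counts the parameters of a 2-layer MLP and runs one forward pass on a stand-in backbone feature. The layer sizes (256, 128, 64), the ReLU activation, and all function names are hypothetical; the source only states that the adapter is typically a 2-layer MLP.

```python
import random


def mlp_param_count(in_dim, hidden_dim, out_dim):
    """Weights plus biases of a 2-layer MLP: in -> hidden -> out."""
    return (in_dim * hidden_dim + hidden_dim) + (hidden_dim * out_dim + out_dim)


def init_linear(in_dim, out_dim, rng):
    """Small random weights and zero biases for one linear layer."""
    weights = [[rng.gauss(0.0, 0.02) for _ in range(in_dim)] for _ in range(out_dim)]
    return weights, [0.0] * out_dim


def linear(x, layer):
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]


def adapter_forward(feature, layers):
    """Trainable adapter applied to a feature from the frozen backbone."""
    hidden = [max(0.0, h) for h in linear(feature, layers[0])]   # ReLU (assumption)
    return linear(hidden, layers[1])


rng = random.Random(0)
IN_DIM, HIDDEN_DIM, OUT_DIM = 256, 128, 64                 # hypothetical sizes
layers = [init_linear(IN_DIM, HIDDEN_DIM, rng), init_linear(HIDDEN_DIM, OUT_DIM, rng)]

backbone_feature = [rng.gauss(0.0, 1.0) for _ in range(IN_DIM)]  # stand-in for a frozen-backbone output
task_input = adapter_forward(backbone_feature, layers)
n_params = mlp_param_count(IN_DIM, HIDDEN_DIM, OUT_DIM)
print(f"adapter parameters: {n_params}")
```

With these illustrative sizes the adapter has about 0.04 M parameters, which only the fine-tuning step would update; the backbone weights stay untouched.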

3.2 Data Preprocessing and Labeling

Image branch: Candidate receivers are detected and tracked (YOLOv8), their horizontal angular spans computed, and the ground truth azimuth is hue-encoded into the bounding box for paired channel state information (CSI) matching.
LiDAR branch: Point clouds are ground-removed and clustered with DBSCAN; the resulting clusters are labeled (+1 for receivers, −1 for static structures), and the label is appended as a fourth channel for the backbone.
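Both labeling steps can be sketched in a few lines. This is an illustration under stated assumptions: detection, tracking, ground removal, and DBSCAN clustering are taken as already done upstream; the azimuth range of ±90° and the use of two thirds of the hue circle are hypothetical choices; and any point outside a receiver cluster is treated as static structure.

```python
import colorsys


def azimuth_to_rgb(azimuth_deg, min_deg=-90.0, max_deg=90.0):
    """Map a ground-truth azimuth to a hue and return the RGB fill colour
    that would be painted into the receiver's bounding box."""
    frac = (azimuth_deg - min_deg) / (max_deg - min_deg)
    frac = min(max(frac, 0.0), 1.0)
    # Use only part of the hue circle so both extremes do not map to red.
    return colorsys.hsv_to_rgb(frac * 2.0 / 3.0, 1.0, 1.0)


def label_points(points, cluster_ids, receiver_clusters):
    """Append +1 (receiver) or -1 (static structure) as a fourth channel."""
    return [(x, y, z, 1.0 if cid in receiver_clusters else -1.0)
            for (x, y, z), cid in zip(points, cluster_ids)]


points = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
cluster_ids = [3, 7]                      # cluster index per point (assumed given)
labeled = label_points(points, cluster_ids, receiver_clusters={3})
box_colour = azimuth_to_rgb(-90.0)
```

Encoding the azimuth as a colour keeps the label inside the image itself, so the image backbone needs no separate label input.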

3.3 ContraSoM Pre-training Strategy

A modality-agnostic contrastive pre-training pipeline (“ContraSoM”) aligns modality-specific OOB features with CSI features via symmetric InfoNCE losses. Specifically:

For images and CSI, symmetric InfoNCE aligns batch-paired feature vectors using cosine similarity at shared time indices.
LiDAR pretraining incorporates diffusion augmentations, adding two randomly noised feature-space views to increase feature invariance, with an additional DDPM diffusion loss term for robustness.
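The symmetric InfoNCE alignment described above can be sketched as follows. This is the standard form of the loss, not necessarily the exact objective used for ContraSoM: the temperature value is an assumption, the function names are hypothetical, and the additional DDPM diffusion term for LiDAR is not included.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def info_nce(queries, keys, temperature):
    """Mean cross-entropy of matching queries[i] to keys[i] among all keys."""
    loss = 0.0
    for i, q in enumerate(queries):
        logits = [cosine(q, k) / temperature for k in keys]
        top = max(logits)                                   # for numerical stability
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        loss += log_denominator - logits[i]
    return loss / len(queries)


def symmetric_info_nce(oob_feats, csi_feats, temperature=0.07):
    """Average of the OOB-to-CSI and CSI-to-OOB directions."""
    return 0.5 * (info_nce(oob_feats, csi_feats, temperature)
                  + info_nce(csi_feats, oob_feats, temperature))


oob = [[1.0, 0.0], [0.0, 1.0]]            # toy OOB features, one row per time index
csi_aligned = [[0.9, 0.1], [0.1, 0.9]]    # CSI features paired at the same indices
csi_swapped = csi_aligned[::-1]           # deliberately mismatched pairing

loss_aligned = symmetric_info_nce(oob, csi_aligned)
loss_swapped = symmetric_info_nce(oob, csi_swapped)
```

Correctly paired features give a much lower loss than mismatched ones, which is what drives the two feature spaces toward each other during pre-training.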

3.4 Modality-Specific Augmentation

Image augmentation: Color jitter, Gaussian blur, and random erasing in pixel-space.
LiDAR augmentation: Feature-space forward noising (DDPM) and DDIM-based reverse denoising generate novel views for contrastive learning.
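The feature-space forward noising step can be sketched with the standard DDPM closed form. The linear beta schedule, the number of steps, and the chosen timestep are assumptions; the DDIM reverse pass is omitted because it needs a trained noise predictor.

```python
import math
import random


def alpha_bar_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule."""
    alpha_bars, product = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        product *= 1.0 - beta
        alpha_bars.append(product)
    return alpha_bars


def forward_noise(feature, t, alpha_bars, rng):
    """Sample x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps, eps ~ N(0, I)."""
    a = alpha_bars[t]
    return [math.sqrt(a) * x + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0)
            for x in feature]


rng = random.Random(0)
alpha_bars = alpha_bar_schedule(100)
feature = [0.5, -1.0, 2.0]                       # toy LiDAR feature vector

# Two independently noised views of the same feature for contrastive learning.
views = [forward_noise(feature, t=20, alpha_bars=alpha_bars, rng=rng)
         for _ in range(2)]
```

Because the two views share the same underlying feature but carry independent noise, treating them as positives pushes the encoder toward noise-invariant representations.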

3.5 OOB Feature Inference

At inference, WiFo-M $^2$ –FM utilizes only historical sensor windows to produce features for immediate and next-frame timepoints, allowing transceiver modules to operate at high update rates despite slow sensor framerates.
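The rate mismatch can be made concrete with a small scheduling sketch. The rates (10 Hz sensing, 200 Hz CSI), the integer rate ratio, and the nearest-in-time selection rule are all assumptions for illustration; the source does not specify how a CSI update chooses between the immediate and next-frame features.

```python
def choose_feature(csi_index, csi_rate_hz=200, sensor_rate_hz=10):
    """Return (latest sensor frame index, which inferred feature to use)
    for the CSI update with the given index."""
    per_frame = csi_rate_hz // sensor_rate_hz        # CSI updates per sensor frame
    frame, offset = divmod(csi_index, per_frame)
    # Hypothetical rule: pick whichever timepoint is closer in time.
    use_next = offset * 2 >= per_frame
    return frame, ("next" if use_next else "current")


schedule = [choose_feature(i) for i in range(40)]
```

With these rates, each sensor frame has to serve 20 CSI updates, which is why features extrapolated from historical windows are needed between frames.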

4. Experimental Validation and Performance

4.1 Pre-training and Dataset Scope

WiFo-M $^2$ –FM is pre-trained on two large synthetic datasets (intersection/traffic @28 GHz, dense building @4.95 GHz, ∼10k samples total), with evaluations across seven “seen” synthetic and three “unseen” (real or novel) testbeds.

4.2 Performance Benchmarks

Task	Comparison/Metric	WiFo-M $^2$ Outcome
Beam Prediction	Top-1 accuracy	LiDAR +5–10 pts vs MM-BP-LiDAR; image stream at parity with Vision-BP; multi-modal gap <2%
Channel Estimation	NMSE	0.1–0.4 dB lower NMSE vs baselines, up to 63% drop cross-scenario
Channel Interpolation	NMSE	Up to 50% RMSE drop; cross-scenario 5–20% improvement
Channel Prediction	NMSE	10–25% (−0.4…−1.0 dB) better than WiFo or Transformer baselines
Ablation	NMSE / Accuracy	Confirmed value of temporal extrapolation and contrastive/diffusion pretraining
Latency/Complexity	Runtime	<10 ms (all modalities serial) per instance, plug-in: 0.02–0.3 M params per adapter
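The table reports improvements both in dB and as percentages. The helper functions below convert between the two, assuming the percentages refer to reductions in linear-scale NMSE; the function names are hypothetical.

```python
import math


def reduction_percent_to_db(percent):
    """dB change corresponding to a given percentage reduction in linear NMSE."""
    return 10.0 * math.log10(1.0 - percent / 100.0)


def db_to_reduction_percent(db):
    """Percentage reduction in linear NMSE corresponding to a (negative) dB change."""
    return (1.0 - 10.0 ** (db / 10.0)) * 100.0


ten_percent_db = reduction_percent_to_db(10.0)
one_db_percent = db_to_reduction_percent(-1.0)
print(f"10% reduction = {ten_percent_db:.2f} dB; -1.0 dB = {one_db_percent:.1f}% reduction")
```

Under this assumption, a 10% reduction corresponds to about −0.46 dB and a −1.0 dB change to about a 20.6% reduction.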

Ablation studies highlight critical dependence on temporal feature extrapolation and the effectiveness of contrastive and diffusion augmentations. WiFo-M $^2$ demonstrates high generalization, as evidenced by improvements over both the MM-BP and the CENN/FCDAMP/LPCCNet baselines in cross-scenario evaluations (e.g., the “DeepSense-6G” and “ViWi” datasets).

4.3 Deployment and Scalability

Frozen, pre-trained WiFo-M $^2$ backbones can augment diverse wireless transceiver modules with negligible additional inference time and parameter count. Only shallow adapters per module are trained for task-specific deployment, yielding broad portability and scalability.

5. Limitations, Open Problems, and Future Directions

Programmable metasurfaces are subject to NP-hard binary optimization, limiting achievable pattern complexity and global search capacity. Multipath and specular reflection present enduring obstacles in uncontrolled environments.
WiFo-M $^2$ –FM performance is bounded by the quality of available OOB sensors and the fidelity of the synthetic-to-real domain adaptation. Temporal feature extrapolation alleviates, but does not fully bridge, the gap in sample rate disparity.
Potential enhancements involve integrating multi-level phase metasurface control, metaheuristic global search, closed-loop feedback via real-time SNR/RSSI from clients, amplitude-plus-phase hybrid elements, and expansion of foundation model pre-training to encompass further sensing modalities and tasks.
A plausible implication is the convergence of programmable meta-atoms for spatial control and universal OOB fusion models for adaptive, context-aware wireless systems—potentially enabling real-time, environment-sensitive communications infrastructure without necessitating custom tuning for each scenario.

6. Impact and Application Domains

WiFo-M $^2$ systems underscore an important trend toward physically and logically reconfigurable wireless environments:

Metasurface-based control offers direct energy redistribution for wireless coverage, dynamic interference mitigation, multi-user MIMO, spatial multiplexing, “digital wall” smart environments, passive sensing, and ambient energy harvesting (Shuang et al., 2019).
Foundation OOB models enable modular, universal enhancement of wireless transceivers, accelerating agile prototyping and deployment for multi-modal settings, fostering generalization across environments ("zero-shot" and "few-shot" OOB transfer), and offering systematic, quantifiable gains in core wireless signal processing tasks (Zhang et al., 14 Jan 2026).

These approaches jointly advance the programmability, adaptability, and universality of next-generation wireless communication and sensing systems.

References

Shuang et al. (2019). Controllable Enhancements of Wi-Fi Signals at Desired Locations Without Extra Energy Using Programmable Metasurface.

Zhang et al. (2026). WiFo-M$^2$: Plug-and-Play Multi-Modal Sensing via Foundation Model to Empower Wireless Communications.
