CSI-ResNet-A Architecture for Efficient Crowd Counting
- The paper introduces CSI-ResNet-A, a neural architecture that integrates self-supervised contrastive pre-training and adapter-based fine-tuning to overcome domain shift in CSI-based crowd counting.
- It processes sliding windows of CSI amplitude data through Conv1D stems, residual blocks with squeeze-and-excitation, and efficient adapter modules to drastically reduce trainable parameters.
- The architecture achieves near-perfect generalisation with low Mean Absolute Error and high accuracy on real-world datasets, enabling robust and scalable IoT occupancy estimation.
CSI-ResNet-A is a parameter-efficient neural architecture for device-free crowd-counting using WiFi Channel State Information (CSI) streams. It is designed to address the domain shift problem, where models trained on CSI data from one environment struggle to generalize to new locations. The architecture integrates self-supervised contrastive pre-training and adapter-based fine-tuning, achieving state-of-the-art domain generalization and robust deployment in practical IoT scenarios. CSI-ResNet-A processes CSI amplitude windows, transforms them into embeddings, and utilizes a stateful counting machine for stable occupancy estimation. Performance evaluations demonstrate near-perfect generalization indices and significant parameter reductions during adaptation, while maintaining accuracy benchmarks on real-world datasets.
1. Architectural Design and Layer Structure
CSI-ResNet-A processes a sliding window of CSI amplitude data of shape 52 × 100 (subcarriers × timesteps). The input undergoes an initial Conv1D stem (52 → 64 channels, kernel=7, stride=2, padding=3), BatchNorm1d, ReLU, and MaxPool1d (kernel=3, stride=2, padding=1), producing feature maps of size 64 × 25. Three sequential residual stages follow, each comprising two non-bottleneck basic blocks:
- Stage 1: 2 blocks at 64 channels (feature map remains 64 × 25)
- Stage 2: 2 blocks at 128 channels (128 × 13 after stride-2 downsampling; the second block keeps 128 × 13)
- Stage 3: 2 blocks at 256 channels (256 × 7 after downsampling; the second block keeps 256 × 7)
The output head applies global average pooling over the temporal axis, squeezing to a 256-dimensional vector, followed by a fully connected layer mapping 256 → 128 to produce the final embedding z ∈ ℝ^128.
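The temporal sizes quoted above follow directly from the standard 1-D convolution/pooling output-length formula; a quick sanity check:

```python
def conv1d_out(length, kernel, stride, padding):
    """Output length of a 1-D convolution or pooling layer."""
    return (length + 2 * padding - kernel) // stride + 1

t = 100                        # input window: 52 subcarriers x 100 timesteps
t = conv1d_out(t, 7, 2, 3)     # stem Conv1D  -> 50
t = conv1d_out(t, 3, 2, 1)     # MaxPool1d    -> 25 (Stage 1 keeps 64 x 25)
t2 = conv1d_out(t, 3, 2, 1)    # Stage 2 downsample -> 13
t3 = conv1d_out(t2, 3, 2, 1)   # Stage 3 downsample -> 7
print(t, t2, t3)               # 25 13 7
```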
Each residual block implements:
- Conv1D (C_in → C_out, kernel=3, stride=s, padding=1) → BN → ReLU,
- Conv1D (C_out → C_out, kernel=3, stride=1, padding=1) → BN,
- Squeeze-and-Excitation (SE) module (reduction ratio r): global-average-pool squeeze, excitation through a C → C/r → C bottleneck with ReLU and sigmoid, then channel-wise scaling,
- Adapter module (see next section),
- Shortcut addition (identity, or a strided 1×1 Conv1D projection when dimensions change),
- ReLU.
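The block structure above can be sketched in PyTorch. The SE reduction ratio r and the adapter bottleneck width d are left symbolic in the text, so r=16 and d=16 here are illustrative assumptions, as is the class name:

```python
import torch
import torch.nn as nn

class SEAdapterBlock(nn.Module):
    """Basic residual block: Conv-BN-ReLU, Conv-BN, SE, adapter, shortcut, ReLU."""
    def __init__(self, c_in, c_out, stride=1, r=16, d=16):
        super().__init__()
        self.conv1 = nn.Conv1d(c_in, c_out, 3, stride, 1)
        self.bn1 = nn.BatchNorm1d(c_out)
        self.conv2 = nn.Conv1d(c_out, c_out, 3, 1, 1)
        self.bn2 = nn.BatchNorm1d(c_out)
        # Squeeze-and-excitation: global pool -> C/r bottleneck -> sigmoid gate
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool1d(1),
            nn.Conv1d(c_out, c_out // r, 1), nn.ReLU(),
            nn.Conv1d(c_out // r, c_out, 1), nn.Sigmoid())
        # Adapter: channel-wise bottleneck applied with a residual connection
        self.adapter = nn.Sequential(
            nn.Conv1d(c_out, d, 1), nn.ReLU(), nn.Conv1d(d, c_out, 1))
        self.shortcut = (nn.Identity() if stride == 1 and c_in == c_out
                         else nn.Conv1d(c_in, c_out, 1, stride))

    def forward(self, x):
        h = torch.relu(self.bn1(self.conv1(x)))
        h = self.bn2(self.conv2(h))
        h = h * self.se(h)        # channel-wise scaling
        h = h + self.adapter(h)   # adapter residual
        return torch.relu(h + self.shortcut(x))
```

A stride-2 block halves the temporal axis: for example, `SEAdapterBlock(64, 128, stride=2)` maps 64 × 25 features to 128 × 13.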
Parameter counts per block: Stage 1, Block 1.1 (64 → 64) ≈ 27,600; Stage 2, Block 2.1 (64 → 128) ≈ 86,448; Stage 3, Block 3.1 (128 → 256) ≈ 346,192. The output head contributes 32,896 parameters. The grand total is 1,092,806 parameters (Custance et al., 5 Jan 2026).
2. Adapter Module and Parameter Efficiency
Adapter modules are inserted in every residual block after the SE module and before the residual shortcut. Each adapter performs a channel-wise bottleneck of width d < C, implemented as two Conv1D projections plus a residual connection:

Adapter(x) = x + W_up ∗ ReLU(W_down ∗ x),

where x ∈ ℝ^{C×T}, W_down projects C → d, W_up projects d → C, and "∗" denotes convolution along the temporal axis.
Across the six residual blocks, the adapters total 30,438 parameters, constituting only ≈2.8% of the model (Custance et al., 5 Jan 2026).
3. CSI Data Processing and Counting Pipeline
Raw CSI streams (sampling rate f_s = 100 Hz, 52 subcarriers) undergo preprocessing:
- A 4th-order Butterworth low-pass filter removes high-frequency noise.
- A sliding window of length 100 samples (1 s) with stride 50 (50% overlap) segments the stream.
- Windows are labeled "enter," "exit," or "no_event" only when the window is pure (contains a single event type).
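The windowing step can be sketched in NumPy (the Butterworth filtering is applied beforehand, e.g. with `scipy.signal`, and is omitted here; the function name is illustrative):

```python
import numpy as np

def sliding_windows(csi, win=100, stride=50):
    """Split a (timesteps, subcarriers) CSI stream into overlapping windows.

    Returns an array of shape (n_windows, subcarriers, win): each window is
    permuted to (subcarriers, time) for the Conv1D encoder.
    """
    n = (csi.shape[0] - win) // stride + 1
    idx = np.arange(win)[None, :] + stride * np.arange(n)[:, None]
    return csi[idx].transpose(0, 2, 1)

stream = np.random.randn(1000, 52)   # 10 s of CSI at 100 Hz
windows = sliding_windows(stream)
print(windows.shape)                 # (19, 52, 100)
```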
Processed windows (permuted to 52 × 100) enter the encoder, which outputs an embedding z ∈ ℝ^128. The counting head comprises a fully connected layer (128 → 3 logits), followed by a softmax yielding class probabilities over {enter, exit, no_event}. Event streams are integrated into an occupancy estimate by a three-state debounce/cooldown machine:
| State | Transition Rule | Purpose |
|---|---|---|
| NO_EVENT | Wait for non-zero event | Event detection |
| DEBOUNCING | Confirm event over EVENT_THRESHOLD=5 consecutive windows | Robust counting |
| WAIT_FOR_NO_EVENT | After update, wait for COOLDOWN_PERIOD=10 "no_event" to reset | Noise robustness |
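The state machine in the table can be sketched in plain Python; the thresholds come from the table, while the increment/decrement convention ("enter" +1, "exit" −1) and the class name are natural assumptions:

```python
EVENT_THRESHOLD = 5    # consecutive event windows needed to confirm an event
COOLDOWN_PERIOD = 10   # consecutive "no_event" windows needed to re-arm

class OccupancyCounter:
    """Integrates per-window event predictions into a stable occupancy count."""
    def __init__(self):
        self.count = 0
        self.state = "NO_EVENT"
        self.streak = 0
        self.pending = None

    def update(self, event):  # event in {"enter", "exit", "no_event"}
        if self.state == "NO_EVENT":
            if event != "no_event":
                self.state, self.pending, self.streak = "DEBOUNCING", event, 1
        elif self.state == "DEBOUNCING":
            if event == self.pending:
                self.streak += 1
                if self.streak >= EVENT_THRESHOLD:   # event confirmed
                    self.count += 1 if event == "enter" else -1
                    self.count = max(self.count, 0)
                    self.state, self.streak = "WAIT_FOR_NO_EVENT", 0
            else:                                    # noise: abort debounce
                self.state, self.streak = "NO_EVENT", 0
        else:  # WAIT_FOR_NO_EVENT: cooldown before accepting the next event
            if event == "no_event":
                self.streak += 1
                if self.streak >= COOLDOWN_PERIOD:
                    self.state, self.streak = "NO_EVENT", 0
            else:
                self.streak = 0
        return self.count
```

Five consecutive "enter" windows raise the count by one; the cooldown then suppresses spurious repeats of the same event.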
4. Self-Supervised Pre-Training and Contrastive Learning
CSI-ResNet-A leverages a two-stage pipeline, beginning with self-supervised pre-training using the NT-Xent/InfoNCE objective. For batch size N, each window yields two augmented views (x̃_i, x̃_j), encoded and projected to L2-normalized vectors z_i, z_j:

ℓ(i, j) = −log [ exp(sim(z_i, z_j)/τ) / Σ_{k=1}^{2N} 1[k ≠ i] exp(sim(z_i, z_k)/τ) ],

where sim(u, v) = uᵀv / (‖u‖‖v‖) is cosine similarity and τ is the temperature.
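This objective can be sketched in NumPy; the temperature value (0.5 here) and the positive-pair layout (rows i and i+N are the two views of sample i) are illustrative assumptions:

```python
import numpy as np

def nt_xent(z, tau=0.5):
    """NT-Xent loss. z: (2N, d); rows i and i+N are the two views of sample i."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)        # unit vectors
    sim = z @ z.T / tau                                     # cosine / tau
    np.fill_diagonal(sim, -np.inf)                          # exclude self-pairs
    n = len(z) // 2
    pos = np.concatenate([np.arange(n) + n, np.arange(n)])  # positive indices
    log_prob = sim[np.arange(2 * n), pos] - np.log(np.exp(sim).sum(axis=1))
    return -log_prob.mean()

z = np.random.randn(8, 16)   # N=4 pairs of 16-dim projections
loss = nt_xent(z)
```

When the two views of each sample coincide exactly, the loss drops well below the chance level of −log(1/(2N−1)).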
Augmentations include additive jitter, amplitude scaling, and random segment permutation.
Training specifics: large unlabeled source dataset (WiFlow), Adam optimizer (lr=1×10⁻³, batch size=128, epochs=50). Post-pre-training, the projection head is discarded; encoder weights are frozen (Custance et al., 5 Jan 2026).
5. Adapter-Based Few-Shot Domain Adaptation
Adapter-based fine-tuning utilizes K-shot labeled examples from the target domain, training exclusively the adapter modules and the classification head. Optionally, unsupervised ADDA aligns source and target feature distributions via the standard adversarial objectives

L_D = −E_{x_s}[log D(E(x_s))] − E_{x_t}[log(1 − D(E_t(x_t)))],   L_adv = −E_{x_t}[log D(E_t(x_t))],

where D is the domain discriminator, E the frozen source encoder, and E_t the adapted target encoder. Standard cross-entropy loss is employed for label prediction:

L_CE = −Σ_i Σ_{c=1}^{3} y_{i,c} log p_{i,c},

where y_{i,c} is the one-hot label and p_{i,c} the softmax output for labeled target example i.
Optimization: Adam with discriminative learning rates (1×10⁻⁴ for adapters, 1×10⁻³ for the head), batch size=16, epochs=25, repeated over 10 runs.
Trainable parameter breakdown:
- Adapters: 30,438
- Classification head: 387
- Total: 30,825 (≈2.8% of the model) (Custance et al., 5 Jan 2026).
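Selecting only adapters and head for training can be sketched in PyTorch; the toy model and the convention that adapter/head parameter names contain "adapter"/"head" are illustrative assumptions:

```python
import torch
import torch.nn as nn

def set_trainable(model, patterns=("adapter", "head")):
    """Freeze all parameters except those whose name contains a pattern."""
    for name, p in model.named_parameters():
        p.requires_grad = any(pat in name for pat in patterns)
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Illustrative toy model, not the full CSI-ResNet-A:
model = nn.Sequential()
model.add_module("encoder", nn.Linear(52, 256))   # frozen backbone
model.add_module("adapter", nn.Linear(256, 256))  # trainable
model.add_module("head", nn.Linear(256, 3))       # trainable
n_trainable = set_trainable(model)

# Discriminative learning rates, as in the text:
optimizer = torch.optim.Adam([
    {"params": model.adapter.parameters(), "lr": 1e-4},
    {"params": model.head.parameters(), "lr": 1e-3},
])
```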
6. Performance Metrics and Generalisation
CSI-ResNet-A sets new benchmarks for CSI crowd-counting. On WiFlow, in 10-shot regimes, it achieves a cleaned Mean Absolute Error (MAE) of 0.44 in unsupervised linear probe mode, surpassing traditional RF baselines (MAE > 2.1). On the WiAR benchmark, full fine-tuning yields 99.67% accuracy (all parameters trainable); adapter-only fine-tuning yields 98.84% accuracy (within 0.83% of full, with 97.2% fewer parameters).
The Generalisation Index (GI), defined for a metric M as GI = M_target / M_source (inverted for metrics where lower is better), formally quantifies robustness. For MAE, GI = MAE_source / MAE_target, and GI = 1 indicates perfect cross-domain performance. Empirically, CSI-ResNet-A attains generalisation indices approaching 1.00 in transfer scenarios, indicating highly effective domain invariance (Custance et al., 5 Jan 2026).
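A minimal helper for the index (function name and example values are illustrative):

```python
def generalisation_index(source, target, lower_is_better=True):
    """GI of a metric measured in-domain (source) and cross-domain (target).

    GI = 1 means no degradation when transferring to the target domain.
    """
    return source / target if lower_is_better else target / source

# For error metrics such as MAE, lower is better, so GI = MAE_source / MAE_target:
gi_mae = generalisation_index(0.44, 0.44)   # equal MAE in both domains -> 1.0
```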
7. Impact and Significance
CSI-ResNet-A demonstrates high efficacy for real-world, privacy-preserving device-free occupancy sensing using CSI. The integration of self-supervised contrastive learning with adapter-based parameter-efficient fine-tuning provides practical and scalable domain adaptation without extensive retraining. The architecture's robust performance, quantified by low MAE and high generalisation index under minimal supervised data, represents a significant advancement for deployable IoT crowd-counting systems.