CSI-ResNet-A Architecture for Efficient Crowd Counting
- The paper introduces CSI-ResNet-A, a neural architecture that integrates self-supervised contrastive pre-training and adapter-based fine-tuning to overcome domain shift in CSI-based crowd counting.
- It processes sliding windows of CSI amplitude data through Conv1D stems, residual blocks with squeeze-and-excitation, and efficient adapter modules to drastically reduce trainable parameters.
- The architecture achieves near-perfect generalisation with low Mean Absolute Error and high accuracy on real-world datasets, enabling robust and scalable IoT occupancy estimation.
CSI-ResNet-A is a parameter-efficient neural architecture for device-free crowd-counting using WiFi Channel State Information (CSI) streams. It is designed to address the domain shift problem, where models trained on CSI data from one environment struggle to generalize to new locations. The architecture integrates self-supervised contrastive pre-training and adapter-based fine-tuning, achieving state-of-the-art domain generalization and robust deployment in practical IoT scenarios. CSI-ResNet-A processes CSI amplitude windows, transforms them into embeddings, and utilizes a stateful counting machine for stable occupancy estimation. Performance evaluations demonstrate near-perfect generalization indices and significant parameter reductions during adaptation, while maintaining accuracy benchmarks on real-world datasets.
1. Architectural Design and Layer Structure
CSI-ResNet-A processes a sliding window of CSI amplitude data of shape 52 × 100 (subcarriers × timesteps). The input undergoes an initial Conv1D stem (52 → 64 channels, kernel=7, stride=2, padding=3), BatchNorm1d, ReLU, and MaxPool1d (kernel=3, stride=2, padding=1), producing feature maps of size 64 × 25. Three sequential residual stages follow, each comprising two non-bottleneck basic blocks:
- Stage 1: 2 blocks at 64 channels (feature map remains 64 × 25)
- Stage 2: 2 blocks at 128 channels (128 × 13 after stride-2 downsampling; the second block keeps 128 × 13)
- Stage 3: 2 blocks at 256 channels (256 × 7 after downsampling; the second block keeps 256 × 7)
The output head applies global average pooling over the temporal axis, squeezing to a 256-dimensional vector, followed by a fully connected layer mapping 256 → 128 to produce the final embedding z ∈ ℝ^128.
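The temporal sizes quoted above follow directly from the standard 1-D convolution/pooling output-length formula; a quick sanity check:

```python
def conv1d_out(length, kernel, stride, padding):
    """Output length of a 1-D convolution or pooling layer."""
    return (length + 2 * padding - kernel) // stride + 1

t = 100                        # input window: 52 subcarriers x 100 timesteps
t = conv1d_out(t, 7, 2, 3)     # stem Conv1D  -> 50
t = conv1d_out(t, 3, 2, 1)     # MaxPool1d    -> 25 (Stage 1 keeps 64 x 25)
t2 = conv1d_out(t, 3, 2, 1)    # Stage 2 downsample -> 13
t3 = conv1d_out(t2, 3, 2, 1)   # Stage 3 downsample -> 7
print(t, t2, t3)               # 25 13 7
```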
Each residual block implements:
- Conv1D (C_in → C_out, kernel=3, stride=s, padding=1) → BN → ReLU,
- Conv1D (C_out → C_out, kernel=3, stride=1, padding=1) → BN,
- Squeeze-and-Excitation (SE) module (reduction ratio r): global-average-pool squeeze, excitation through a C → C/r → C bottleneck with ReLU and sigmoid, then channel-wise scaling,
- Adapter module (see next section),
- Shortcut addition (identity, or a strided 1×1 Conv1D projection when dimensions change),
- ReLU.
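The block structure above can be sketched in PyTorch. The SE reduction ratio r and the adapter bottleneck width d are left symbolic in the text, so r=16 and d=16 here are illustrative assumptions, as is the class name:

```python
import torch
import torch.nn as nn

class SEAdapterBlock(nn.Module):
    """Basic residual block: Conv-BN-ReLU, Conv-BN, SE, adapter, shortcut, ReLU."""
    def __init__(self, c_in, c_out, stride=1, r=16, d=16):
        super().__init__()
        self.conv1 = nn.Conv1d(c_in, c_out, 3, stride, 1)
        self.bn1 = nn.BatchNorm1d(c_out)
        self.conv2 = nn.Conv1d(c_out, c_out, 3, 1, 1)
        self.bn2 = nn.BatchNorm1d(c_out)
        # Squeeze-and-excitation: global pool -> C/r bottleneck -> sigmoid gate
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool1d(1),
            nn.Conv1d(c_out, c_out // r, 1), nn.ReLU(),
            nn.Conv1d(c_out // r, c_out, 1), nn.Sigmoid())
        # Adapter: channel-wise bottleneck applied with a residual connection
        self.adapter = nn.Sequential(
            nn.Conv1d(c_out, d, 1), nn.ReLU(), nn.Conv1d(d, c_out, 1))
        self.shortcut = (nn.Identity() if stride == 1 and c_in == c_out
                         else nn.Conv1d(c_in, c_out, 1, stride))

    def forward(self, x):
        h = torch.relu(self.bn1(self.conv1(x)))
        h = self.bn2(self.conv2(h))
        h = h * self.se(h)        # channel-wise scaling
        h = h + self.adapter(h)   # adapter residual
        return torch.relu(h + self.shortcut(x))
```

A stride-2 block halves the temporal axis: for example, `SEAdapterBlock(64, 128, stride=2)` maps 64 × 25 features to 128 × 13.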
Parameter counts per block: Stage 1, Block 1.1 (64 → 64) ≈ 27,600; Stage 2, Block 2.1 (64 → 128) ≈ 86,448; Stage 3, Block 3.1 (128 → 256) ≈ 346,192. The output head contributes 32,896 parameters. The grand total is 1,092,806 parameters (Custance et al., 5 Jan 2026).
2. Adapter Module and Parameter Efficiency
Adapter modules are inserted in every residual block after the SE module and before the residual shortcut. Each adapter performs a channel-wise bottleneck of width d < C, implemented as two Conv1D projections plus a residual connection:

Adapter(x) = x + W_up ∗ ReLU(W_down ∗ x),

where x ∈ ℝ^{C×T}, W_down projects C → d, W_up projects d → C, and "∗" denotes convolution along the temporal axis.
Across the six residual blocks, the adapters total 30,438 parameters, constituting only ≈2.8% of the model (Custance et al., 5 Jan 2026).
3. CSI Data Processing and Counting Pipeline
Raw CSI streams (sampling rate f_s = 100 Hz, 52 subcarriers) undergo preprocessing:
- A 4th-order Butterworth low-pass filter removes high-frequency noise.
- A sliding window of length 100 samples (1 s) with stride 50 (50% overlap) segments the stream.
- Windows are labeled "enter," "exit," or "no_event" only when the window is pure (contains a single event type).
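The windowing step can be sketched in NumPy (the Butterworth filtering is applied beforehand, e.g. with `scipy.signal`, and is omitted here; the function name is illustrative):

```python
import numpy as np

def sliding_windows(csi, win=100, stride=50):
    """Split a (timesteps, subcarriers) CSI stream into overlapping windows.

    Returns an array of shape (n_windows, subcarriers, win): each window is
    permuted to (subcarriers, time) for the Conv1D encoder.
    """
    n = (csi.shape[0] - win) // stride + 1
    idx = np.arange(win)[None, :] + stride * np.arange(n)[:, None]
    return csi[idx].transpose(0, 2, 1)

stream = np.random.randn(1000, 52)   # 10 s of CSI at 100 Hz
windows = sliding_windows(stream)
print(windows.shape)                 # (19, 52, 100)
```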
Processed windows (permuted to 52 × 100) enter the encoder, which outputs an embedding z ∈ ℝ^128. The counting head comprises a fully connected layer (128 → 3 logits), followed by a softmax yielding class probabilities over {enter, exit, no_event}. Event streams are integrated into an occupancy estimate by a three-state debounce/cooldown machine:
| State | Transition Rule | Purpose |
|---|---|---|
| NO_EVENT | Wait for non-zero event | Event detection |
| DEBOUNCING | Confirm event over EVENT_THRESHOLD=5 consecutive windows | Robust counting |
| WAIT_FOR_NO_EVENT | After update, wait for COOLDOWN_PERIOD=10 "no_event" to reset | Noise robustness |
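The state machine in the table can be sketched in plain Python; the thresholds come from the table, while the increment/decrement convention ("enter" +1, "exit" −1) and the class name are natural assumptions:

```python
EVENT_THRESHOLD = 5    # consecutive event windows needed to confirm an event
COOLDOWN_PERIOD = 10   # consecutive "no_event" windows needed to re-arm

class OccupancyCounter:
    """Integrates per-window event predictions into a stable occupancy count."""
    def __init__(self):
        self.count = 0
        self.state = "NO_EVENT"
        self.streak = 0
        self.pending = None

    def update(self, event):  # event in {"enter", "exit", "no_event"}
        if self.state == "NO_EVENT":
            if event != "no_event":
                self.state, self.pending, self.streak = "DEBOUNCING", event, 1
        elif self.state == "DEBOUNCING":
            if event == self.pending:
                self.streak += 1
                if self.streak >= EVENT_THRESHOLD:   # event confirmed
                    self.count += 1 if event == "enter" else -1
                    self.count = max(self.count, 0)
                    self.state, self.streak = "WAIT_FOR_NO_EVENT", 0
            else:                                    # noise: abort debounce
                self.state, self.streak = "NO_EVENT", 0
        else:  # WAIT_FOR_NO_EVENT: cooldown before accepting the next event
            if event == "no_event":
                self.streak += 1
                if self.streak >= COOLDOWN_PERIOD:
                    self.state, self.streak = "NO_EVENT", 0
            else:
                self.streak = 0
        return self.count
```

Five consecutive "enter" windows raise the count by one; the cooldown then suppresses spurious repeats of the same event.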
4. Self-Supervised Pre-Training and Contrastive Learning
CSI-ResNet-A leverages a two-stage pipeline, beginning with self-supervised pre-training using the NT-Xent/InfoNCE objective. For batch size N, each window yields two augmented views (x̃_i, x̃_j), encoded and projected to L2-normalized vectors z_i, z_j:

ℓ(i, j) = −log [ exp(sim(z_i, z_j)/τ) / Σ_{k=1}^{2N} 1[k ≠ i] exp(sim(z_i, z_k)/τ) ],

where sim(u, v) = uᵀv / (‖u‖‖v‖) is cosine similarity and τ is the temperature.
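This objective can be sketched in NumPy; the temperature value (0.5 here) and the positive-pair layout (rows i and i+N are the two views of sample i) are illustrative assumptions:

```python
import numpy as np

def nt_xent(z, tau=0.5):
    """NT-Xent loss. z: (2N, d); rows i and i+N are the two views of sample i."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)        # unit vectors
    sim = z @ z.T / tau                                     # cosine / tau
    np.fill_diagonal(sim, -np.inf)                          # exclude self-pairs
    n = len(z) // 2
    pos = np.concatenate([np.arange(n) + n, np.arange(n)])  # positive indices
    log_prob = sim[np.arange(2 * n), pos] - np.log(np.exp(sim).sum(axis=1))
    return -log_prob.mean()

z = np.random.randn(8, 16)   # N=4 pairs of 16-dim projections
loss = nt_xent(z)
```

When the two views of each sample coincide exactly, the loss drops well below the chance level of −log(1/(2N−1)).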
Augmentations include additive jitter, amplitude scaling, and random segment permutation.
Training specifics: large unlabeled source dataset (WiFlow), Adam optimizer (lr=1×10⁻³, batch size=128, epochs=50). Post-pre-training, the projection head is discarded; encoder weights are frozen (Custance et al., 5 Jan 2026).
5. Adapter-Based Few-Shot Domain Adaptation
Adapter-based fine-tuning utilizes K-shot labeled examples from the target domain, training exclusively the adapter modules and the classification head. Optionally, unsupervised ADDA aligns source and target feature distributions via the standard adversarial objectives

L_D = −E_{x_s}[log D(E(x_s))] − E_{x_t}[log(1 − D(E_t(x_t)))],   L_adv = −E_{x_t}[log D(E_t(x_t))],

where D is the domain discriminator, E the frozen source encoder, and E_t the adapted target encoder. Standard cross-entropy loss is employed for label prediction:

L_CE = −Σ_i Σ_{c=1}^{3} y_{i,c} log p_{i,c},

where y_{i,c} is the one-hot label and p_{i,c} the softmax output for labeled target example i.
Optimization: Adam with discriminative learning rates (1×10⁻⁴ for adapters, 1×10⁻³ for the head), batch size=16, epochs=25, repeated over 10 runs.
Trainable parameter breakdown:
- Adapters: 30,438
- Classification head: 387
- Total: 30,825 (≈2.8% of the model) (Custance et al., 5 Jan 2026).
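Selecting only adapters and head for training can be sketched in PyTorch; the toy model and the convention that adapter/head parameter names contain "adapter"/"head" are illustrative assumptions:

```python
import torch
import torch.nn as nn

def set_trainable(model, patterns=("adapter", "head")):
    """Freeze all parameters except those whose name contains a pattern."""
    for name, p in model.named_parameters():
        p.requires_grad = any(pat in name for pat in patterns)
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Illustrative toy model, not the full CSI-ResNet-A:
model = nn.Sequential()
model.add_module("encoder", nn.Linear(52, 256))   # frozen backbone
model.add_module("adapter", nn.Linear(256, 256))  # trainable
model.add_module("head", nn.Linear(256, 3))       # trainable
n_trainable = set_trainable(model)

# Discriminative learning rates, as in the text:
optimizer = torch.optim.Adam([
    {"params": model.adapter.parameters(), "lr": 1e-4},
    {"params": model.head.parameters(), "lr": 1e-3},
])
```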
6. Performance Metrics and Generalisation
CSI-ResNet-A sets new benchmarks for CSI crowd-counting. On WiFlow, in 10-shot regimes, it achieves a cleaned Mean Absolute Error (MAE) of 0.44 in unsupervised linear probe mode, surpassing traditional RF baselines (MAE > 2.1). On the WiAR benchmark, full fine-tuning yields 99.67% accuracy (all parameters trainable); adapter-only fine-tuning yields 98.84% accuracy (within 0.83% of full, with 97.2% fewer parameters).
The Generalisation Index (GI), defined for a metric M as GI = M_target / M_source (inverted for metrics where lower is better), formally quantifies robustness. For MAE, GI = MAE_source / MAE_target, and GI = 1 indicates perfect cross-domain performance. Empirically, CSI-ResNet-A attains generalisation indices approaching 1.00 in transfer scenarios, indicating highly effective domain invariance (Custance et al., 5 Jan 2026).
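A minimal helper for the index (function name and example values are illustrative):

```python
def generalisation_index(source, target, lower_is_better=True):
    """GI of a metric measured in-domain (source) and cross-domain (target).

    GI = 1 means no degradation when transferring to the target domain.
    """
    return source / target if lower_is_better else target / source

# For error metrics such as MAE, lower is better, so GI = MAE_source / MAE_target:
gi_mae = generalisation_index(0.44, 0.44)   # equal MAE in both domains -> 1.0
```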
7. Impact and Significance
CSI-ResNet-A demonstrates high efficacy for real-world, privacy-preserving device-free occupancy sensing using CSI. The integration of self-supervised contrastive learning with adapter-based parameter-efficient fine-tuning provides practical and scalable domain adaptation without extensive retraining. The architecture's robust performance, quantified by low MAE and high generalisation index under minimal supervised data, represents a significant advancement for deployable IoT crowd-counting systems.