
CSI-ResNet-A Architecture for Efficient Crowd Counting

Updated 12 January 2026
  • The paper introduces CSI-ResNet-A, a neural architecture that integrates self-supervised contrastive pre-training and adapter-based fine-tuning to overcome domain shift in CSI-based crowd counting.
  • It processes sliding windows of CSI amplitude data through Conv1D stems, residual blocks with squeeze-and-excitation, and efficient adapter modules to drastically reduce trainable parameters.
  • The architecture achieves near-perfect generalisation with low Mean Absolute Error and high accuracy on real-world datasets, enabling robust and scalable IoT occupancy estimation.

CSI-ResNet-A is a parameter-efficient neural architecture for device-free crowd-counting using WiFi Channel State Information (CSI) streams. It is designed to address the domain shift problem, where models trained on CSI data from one environment struggle to generalize to new locations. The architecture integrates self-supervised contrastive pre-training and adapter-based fine-tuning, achieving state-of-the-art domain generalization and robust deployment in practical IoT scenarios. CSI-ResNet-A processes CSI amplitude windows, transforms them into embeddings, and utilizes a stateful counting machine for stable occupancy estimation. Performance evaluations demonstrate near-perfect generalization indices and significant parameter reductions during adaptation, while maintaining accuracy benchmarks on real-world datasets.

1. Architectural Design and Layer Structure

CSI-ResNet-A processes a sliding window of CSI amplitude data of shape C=52 subcarriers × L=100 timesteps. The input undergoes an initial Conv1D stem (52 → 64 channels, kernel=7, stride=2, padding=3), BatchNorm1d, ReLU, and MaxPool1d (kernel=3, stride=2, padding=1), producing feature maps sized 64 × 25. Three sequential residual stages follow, each comprising two non-bottleneck basic blocks:

  • Stage 1: 2 blocks at 64 channels (feature map remains 64 × 25)
  • Stage 2: 2 blocks at 128 channels (128 × 13 after stride-2 downsampling; second block keeps 128 × 13)
  • Stage 3: 2 blocks at 256 channels (256 × 7 after downsampling; second block keeps 256 × 7)

The output head applies global average pooling over the temporal axis, squeezing to a 256-dimensional vector, followed by a fully connected layer mapping 256 → 128 to produce the final embedding e ∈ ℝ^128.

Each residual block implements:

  1. Conv1D (C′ → C′, kernel=3, stride=s, padding=1) → BN → ReLU
  2. Conv1D (C′ → C′, kernel=3, stride=1, padding=1) → BN
  3. Squeeze-and-Excitation (SE) module (reduction ratio r=16): squeeze, excitation by s = σ(W₂ ReLU(W₁ z)), then channel-wise scaling
  4. Adapter module (see next section)
  5. Shortcut addition x_ℓ + F(x_ℓ)
  6. ReLU

Parameter counts per block: Stage 1, Block 1.1 (C′=64) ≈ 27,600; Stage 2, Block 2.1 (C′=128) ≈ 86,448; Stage 3, Block 3.1 (C′=256) ≈ 346,192. Output head parameters: 32,896. The grand total is 1,092,806 parameters (Custance et al., 5 Jan 2026).
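The downsampling arithmetic above can be checked with the standard Conv1D output-length formula. A minimal sketch (not the authors' code), assuming the kernel/stride/padding values listed for the stem and the stride-2 blocks:

```python
def conv1d_out_len(length, kernel, stride, padding):
    """Standard Conv1D / MaxPool1d output-length formula."""
    return (length + 2 * padding - kernel) // stride + 1

L = 100                                   # input timesteps
L = conv1d_out_len(L, 7, 2, 3)            # stem conv: 100 -> 50
stage1 = conv1d_out_len(L, 3, 2, 1)       # max pool:   50 -> 25
stage2 = conv1d_out_len(stage1, 3, 2, 1)  # stride-2 block: 25 -> 13
stage3 = conv1d_out_len(stage2, 3, 2, 1)  # stride-2 block: 13 -> 7
```

The three results (25, 13, 7) match the stage feature-map sizes given above.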

2. Adapter Module and Parameter Efficiency

Adapter modules are inserted in every residual block after the SE module and before the residual shortcut. Each adapter performs a channel-wise bottleneck of size d = C′/16, implemented as two 1×1 Conv1D projections plus a residual connection:

x_{out} = x_{in} + W_{up} \circ \mathrm{ReLU}(W_{down} \circ x_{in} + b_{down}) + b_{up}

where W_{down} \in \mathbb{R}^{d \times C' \times 1}, b_{down} \in \mathbb{R}^{d}, W_{up} \in \mathbb{R}^{C' \times d \times 1}, b_{up} \in \mathbb{R}^{C'}, and "\circ" denotes convolution along the temporal axis.

Per-block adapter parameters: C′·d + d + d·C′ + C′ ≈ 2C′²/16 + (C′ + C′/16). Total across six blocks is 30,438, constituting only ≈2.8% of model parameters (Custance et al., 5 Jan 2026).
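As a concrete illustration of the adapter equation, here is a NumPy sketch (shapes are illustrative; the 1×1 temporal convolutions reduce to matrix products over channels). With the up-projection initialized to zero — a common choice in the adapter literature, not stated in the source — the adapter starts as the identity, so fine-tuning begins from the pre-trained function:

```python
import numpy as np

rng = np.random.default_rng(0)

def adapter(x, W_down, b_down, W_up, b_up):
    """x: (C', L). 1x1 convs over time act as per-timestep channel matmuls."""
    h = np.maximum(W_down @ x + b_down[:, None], 0.0)   # bottleneck (d, L), ReLU
    return x + W_up @ h + b_up[:, None]                 # residual add

C, d, L = 64, 4, 25                                     # stage-1 sizes, d = C'/16
x = rng.standard_normal((C, L))
W_down, b_down = 0.01 * rng.standard_normal((d, C)), np.zeros(d)
W_up, b_up = np.zeros((C, d)), np.zeros(C)              # zero-init up-projection

y = adapter(x, W_down, b_down, W_up, b_up)              # identity at init
```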

3. CSI Data Processing and Counting Pipeline

Raw CSI streams (F_s = 100 Hz, 52 subcarriers) undergo preprocessing:

  • A 4th-order Butterworth low-pass filter at f_c = 8 Hz (normalized cutoff W_n = 0.16) removes high-frequency noise.
  • A sliding window of length 100 samples (1 s) with stride 50 (50% overlap) segments the stream.
  • Windows are labeled "enter," "exit," or "no_event" only when the window contains that event exclusively.
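The windowing step can be sketched as follows (NumPy, illustrative only; the Butterworth filtering would be done with a standard signal-processing library and is omitted here):

```python
import numpy as np

def make_windows(csi, win=100, stride=50):
    """Slice a (T, 52) CSI amplitude stream into overlapping windows."""
    n = (csi.shape[0] - win) // stride + 1
    return np.stack([csi[i * stride : i * stride + win] for i in range(n)])

stream = np.zeros((500, 52))      # 5 s of CSI at 100 Hz
windows = make_windows(stream)    # shape (9, 100, 52): 50% overlap
```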

Processed windows w_i ∈ ℝ^{52×100} (permuted to ℝ^{100×52}) enter the encoder, which outputs e_i ∈ ℝ^{128}. The counting head comprises a fully connected layer (128 → 3 logits) followed by softmax, yielding p_enter, p_exit, p_no_event. Event streams are integrated into an occupancy estimate by a three-state debounce/cooldown machine:

| State | Transition Rule | Purpose |
| --- | --- | --- |
| NO_EVENT | Wait for a non-zero event | Event detection |
| DEBOUNCING | Confirm the event over EVENT_THRESHOLD = 5 consecutive windows | Robust counting |
| WAIT_FOR_NO_EVENT | After a count update, wait for COOLDOWN_PERIOD = 10 "no_event" windows to reset | Noise robustness |
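A minimal sketch of such a debounce/cooldown machine, following our reading of the transition rules (class and attribute names are hypothetical, not from the paper):

```python
class OccupancyCounter:
    """Three-state debounce/cooldown machine for integrating window events."""
    NO_EVENT, DEBOUNCING, WAIT = "NO_EVENT", "DEBOUNCING", "WAIT_FOR_NO_EVENT"

    def __init__(self, event_threshold=5, cooldown_period=10):
        self.event_threshold = event_threshold
        self.cooldown_period = cooldown_period
        self.state, self.count = self.NO_EVENT, 0
        self._candidate, self._streak = None, 0

    def step(self, event):
        """event is 'enter', 'exit', or 'no_event'; returns current count."""
        if self.state == self.NO_EVENT:
            if event in ("enter", "exit"):          # non-zero event seen
                self.state, self._candidate, self._streak = self.DEBOUNCING, event, 1
        elif self.state == self.DEBOUNCING:
            if event == self._candidate:
                self._streak += 1
                if self._streak >= self.event_threshold:   # confirmed event
                    self.count = max(0, self.count + (1 if event == "enter" else -1))
                    self.state, self._streak = self.WAIT, 0
            else:                                   # broken streak: treat as noise
                self.state, self._candidate, self._streak = self.NO_EVENT, None, 0
        else:                                       # WAIT_FOR_NO_EVENT cooldown
            if event == "no_event":
                self._streak += 1
                if self._streak >= self.cooldown_period:
                    self.state, self._streak = self.NO_EVENT, 0
            else:
                self._streak = 0
        return self.count

counter = OccupancyCounter()
for _ in range(5):
    occupancy = counter.step("enter")   # confirmed after 5 consecutive windows
```

A lone spurious "enter" window followed by "no_event" never updates the count, which is the point of the debounce stage.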

4. Self-Supervised Pre-Training and Contrastive Learning

CSI-ResNet-A leverages a two-stage pipeline, beginning with self-supervised pre-training using the NT-Xent/InfoNCE objective. For batch size N, each window yields two augmented views (w_i, w_j), encoded and projected to normalized vectors z_i, z_j:

\mathcal{L}_{i,j} = -\log \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbf{1}_{[k \neq i]} \exp(\mathrm{sim}(z_i, z_k)/\tau)}

where \mathrm{sim}(u,v) = u^\top v / (\|u\|\,\|v\|) and \tau = 0.1.
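The NT-Xent objective can be sketched in NumPy (illustrative; we assume rows 2i and 2i+1 of the batch are the positive pairs, which is a common but here hypothetical batch layout):

```python
import numpy as np

def nt_xent(z, tau=0.1):
    """NT-Xent / InfoNCE over 2N embeddings; rows 2i and 2i+1 are paired."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine similarity
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)                    # exclude k == i
    m = sim.max(axis=1, keepdims=True)                # stable log-sum-exp
    log_den = m + np.log(np.exp(sim - m).sum(axis=1, keepdims=True))
    pos = np.arange(len(z)) ^ 1                       # partner index: 0<->1, 2<->3
    return float((log_den[:, 0] - sim[np.arange(len(z)), pos]).mean())

aligned = np.array([[1., 0.], [1., 0.], [0., 1.], [0., 1.]])   # good positives
shuffled = np.array([[1., 0.], [0., 1.], [1., 0.], [0., 1.]])  # mismatched pairs
```

Well-aligned positive pairs yield a near-zero loss; mismatched pairs are penalized heavily, which is what drives the encoder toward augmentation-invariant embeddings.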

Augmentations include jitter (ε ~ N(0, σ_j²), σ_j = 0.03), scaling (α ~ N(1, σ_s²), σ_s = 0.1), and random segment permutation (k ∈ {2, 3, 4, 5}).
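These three augmentations might be sketched as follows (NumPy; parameter values taken from the text, function name hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(w, sigma_j=0.03, sigma_s=0.1, k_choices=(2, 3, 4, 5)):
    """Jitter + scaling + random segment permutation of a (100, 52) CSI window."""
    w = w + rng.normal(0.0, sigma_j, size=w.shape)     # additive Gaussian jitter
    w = w * rng.normal(1.0, sigma_s)                   # global amplitude scaling
    k = rng.choice(k_choices)                          # number of time segments
    segments = np.array_split(w, k, axis=0)
    order = rng.permutation(k)                         # shuffle segment order
    return np.concatenate([segments[i] for i in order], axis=0)

view = augment(np.zeros((100, 52)))                    # same shape as the input
```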

Training specifics: a large unlabeled source dataset (WiFlow), Adam optimizer (lr=1×10⁻³, batch size=128, 50 epochs). After pre-training, the projection head is discarded and the encoder weights are frozen (Custance et al., 5 Jan 2026).

5. Adapter-Based Few-Shot Domain Adaptation

Adapter-based fine-tuning utilizes k-shot labeled examples (k = 1, 5, 10) from the target domain, training exclusively the adapter modules and the classification head. Optionally, unsupervised ADDA aligns source and target domains via adversarial objectives:

\mathcal{L}_D = -\mathbb{E}_{x_S \sim S}[\log D(G_S(x_S))] - \mathbb{E}_{x_T \sim T}[\log(1 - D(G_T(x_T)))]

\mathcal{L}_{G_T} = -\mathbb{E}_{x_T \sim T}[\log D(G_T(x_T))]

Standard cross-entropy loss is employed for label prediction:

\mathcal{L}_{CE} = -\sum_{i=1}^{N} y_i \log \hat{y}_i

Optimization: Adam with discriminative learning rates (1×10⁻⁴ for adapters, 1×10⁻³ for the head), batch size=16, 25 epochs, repeated over 10 runs.

Trainable parameter breakdown:

  • Adapters: 30,438
  • Classification head: ~387

Total: ~30,800 (≈2.8% of the model) (Custance et al., 5 Jan 2026).
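The parameter arithmetic above checks out directly (counts taken from the text; the head size follows from a 128 → 3 linear layer):

```python
adapters = 30_438              # all six adapter modules (from the text)
head = 128 * 3 + 3             # 128 -> 3 classifier: weights + biases = 387
total = 1_092_806              # full CSI-ResNet-A parameter count

trainable = adapters + head    # ~30,800 trainable during adaptation
share = 100 * trainable / total  # ~2.8% of the model
```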

6. Performance Metrics and Generalisation

CSI-ResNet-A sets new benchmarks for CSI crowd-counting. On WiFlow, in 10-shot regimes, it achieves a cleaned Mean Absolute Error (MAE) of 0.44 in unsupervised linear probe mode, surpassing traditional RF baselines (MAE > 2.1). On the WiAR benchmark, full fine-tuning yields 99.67% accuracy (all parameters trainable); adapter-only fine-tuning yields 98.84% accuracy (within 0.83% of full, with 97.2% fewer parameters).

The Generalisation Index (GI_P), defined for any metric P as GI_P = P_target / P_source (inverted for metrics where lower is better), formally quantifies robustness. For MAE, GI_MAE = MAE_source / MAE_target; GI = 1.0 indicates perfect cross-domain performance. Empirically, CSI-ResNet-A attains GI_Acc ≈ 0.98–1.00 and GI_MAE ≫ 1 in transfer scenarios, indicating highly effective domain invariance (Custance et al., 5 Jan 2026).
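The index definition reduces to a one-liner; this sketch uses a hypothetical `lower_is_better` flag to handle error metrics like MAE:

```python
def generalisation_index(p_target, p_source, lower_is_better=False):
    """GI_P = P_target / P_source, inverted when smaller values are better."""
    if lower_is_better:
        return p_source / p_target
    return p_target / p_source

gi_acc = generalisation_index(0.98, 1.00)                       # accuracy-style
gi_mae = generalisation_index(0.5, 1.0, lower_is_better=True)   # error-style
```

With these conventions a value of 1.0 always means the target domain matches the source, regardless of the metric's direction.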

7. Impact and Significance

CSI-ResNet-A demonstrates high efficacy for real-world, privacy-preserving device-free occupancy sensing using CSI. The integration of self-supervised contrastive learning with adapter-based parameter-efficient fine-tuning provides practical and scalable domain adaptation without extensive retraining. The architecture's robust performance, quantified by low MAE and high generalisation index under minimal supervised data, represents a significant advancement for deployable IoT crowd-counting systems.
