Lightweight Transformer Model for IDS
- The paper presents lightweight transformer architectures that achieve over 90% parameter reduction while maintaining robust IDS detection accuracy.
- Key methodologies include layer pruning, dimensionality reduction, quantization, and knowledge distillation to ensure efficient operation in resource-constrained environments.
- Applications span IoT, autonomous vehicles, and drones, with federated learning and privacy measures bolstering secure real-time intrusion detection.
A lightweight transformer model for intrusion detection systems (IDS) represents a class of architectures that utilize transformer-based self-attention mechanisms while optimizing for parameter-efficiency, low latency, and minimal computational resource consumption. Such models are explicitly designed for deployment in resource-constrained contexts, including edge devices, IoT nodes, and in-vehicle platforms supporting safety-critical real-time operations.
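At the core of every model discussed below is scaled dot-product self-attention. A minimal numpy sketch (projection weights omitted for brevity; the 16 tokens × 8 dims shape is borrowed from TSLT-Net's layout, purely for illustration):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Core transformer operation: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # (tokens, tokens) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ V                                   # weighted sum of value vectors

rng = np.random.default_rng(0)
tokens, d = 16, 8                                        # e.g. 16 tokens x 8 dims
x = rng.normal(size=(tokens, d))
out = scaled_dot_product_attention(x, x, x)              # self-attention: Q = K = V = x
print(out.shape)                                         # (16, 8)
```

In a real multi-head block, `x` would first be projected into per-head Q/K/V subspaces; the attention arithmetic itself is unchanged.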
1. Architectures and Key Components
Lightweight IDS transformers depart from canonical transformer designs by aggressively reducing network depth, hidden dimensionality, and the number of attention heads, while leveraging tailored architectural innovations for efficiency. Representative examples include:
- FedSecureFormer: Employs a 6-layer encoder-only transformer with 2 self-attention heads per layer and correspondingly reduced model, per-head, and FFN dimensions. It introduces learnable absolute positional encodings and a multi-query, multi-head pooling block with 4 learned queries at the output stage. Classification is performed via a linear head over 20 classes. The total parameter count is approximately 1.7M (≈90% fewer than standard BERT encoders), which translates to ≈80% lower memory usage and ≈90% fewer FLOPs than 110M-parameter BERT variants (S et al., 30 Dec 2025).
- FedLiTeCAN: Implements an even more minimalist two-layer encoder-only transformer with 2 attention heads and an FFN of 256 units. Construction follows standard forward operations: MHSA, FFN, and LayerNorm with residual connections. The architecture prepends a learnable [CLS] token, adds learned or sinusoidal positional encodings, and employs a focal loss to handle class imbalance. It has only ~104K parameters and a model file size of 0.4 MB (S et al., 30 Dec 2025).
- TSLT-Net: Adopts a single MHA block (2 heads) following a dense spatial embedding and reshape; specifically, input features are projected to a 128-dim embedding, reshaped to 16 tokens × 8 dims, then passed through layer normalization, MHA, global average pooling, and a 64-unit dense layer before softmax classification. This yields 9,722 parameters and a 0.04 MB footprint (Biswas et al., 3 Oct 2025).
- Dynamic Temporal Positional EIDS: Utilizes only a single transformer encoder layer with 4 heads, for <5.1K parameters, directly embedding raw network packet bytes and leveraging temporal position encoding schemes (dynamic sinusoidal, Fourier, RoPE) for early intrusion detection in IoT (Panopoulos et al., 22 Jun 2025).
- Optimized BERT: A lightweight BERT for IDS is obtained by retaining only L=4 encoder layers, reducing the hidden size to 256, and using 4 heads. Post-training linear quantization to 8 bits yields an 89.85% parameter reduction relative to BERT-base (42.63 MB down to 4.26 MB unquantized; ≈30.38 MB quantized), with only a 0.02% accuracy drop (Adjewa et al., 2024).
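The parameter budgets above follow directly from the standard encoder arithmetic. A sketch of the count for a FedLiTeCAN-like configuration (the hidden size is not stated in the text, so `d_model=64` is an assumption chosen to match the reported ~104K order of magnitude; the 2 layers and 256-unit FFN are from the paper):

```python
def encoder_param_count(n_layers, d_model, d_ff):
    """Parameters of a standard transformer encoder stack.

    Per layer: Q, K, V, O projections (4*d^2 weights + 4*d biases),
    a two-matrix FFN (d*d_ff + d_ff*d plus biases), and two LayerNorms
    (gain + bias, 2*d each). The head count does not change the total,
    since heads partition the same projection matrices.
    """
    attn = 4 * d_model * d_model + 4 * d_model
    ffn = d_model * d_ff + d_ff + d_ff * d_model + d_model
    norms = 2 * (2 * d_model)
    return n_layers * (attn + ffn + norms)

# Hypothetical FedLiTeCAN-like config: 2 layers, assumed d_model=64, FFN=256.
total = encoder_param_count(2, 64, 256)
print(total)  # 99968 -- ~100K before embeddings/classifier, near the ~104K reported
```

Embedding, positional-encoding, and classification-head parameters (omitted here) account for the remaining few thousand.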
2. Methodologies for Model Compression and Efficiency
Architectural parameter reduction is generally obtained through a combination of the following techniques:
- Layer and Head Pruning: Decreasing the number of encoder layers to as low as 1–6 and limiting attention heads to 2–4 (compared to 12–16 in canonical transformers) dramatically decreases parameter count and inference cost (S et al., 30 Dec 2025, S et al., 30 Dec 2025, Panopoulos et al., 22 Jun 2025, Adjewa et al., 2024).
- Dimensionality Reduction: The hidden size is substantially reduced (FedSecureFormer, FedLiTeCAN, TSLT-Net), and feed-forward layers are resized accordingly.
- Quantization: Post-training linear quantization (e.g., 8-bit per-channel) reduces memory and may halve inference time, with minimal impact on accuracy (Adjewa et al., 2024, Biswas et al., 3 Oct 2025).
- Knowledge Distillation: Models such as BERT-of-Theseus progressively replace teacher modules with lightweight student modules, guided by teacher-student KL divergence loss, achieving up to 90% parameter reduction with competitive performance (Kheddar, 2024).
- Sparse and Hybrid Architectures: Hybrids integrating CNN/LSTM for local feature extraction and shallow self-attention blocks, or multi-frequency transformers, focus computation and may further restrict depth and width, although these remain relatively under-studied in IDS (Kheddar, 2024).
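Of these techniques, post-training quantization is the most mechanical. A minimal sketch of symmetric per-channel 8-bit linear quantization (matrix sizes are illustrative, not taken from any of the cited models):

```python
import numpy as np

def quantize_per_channel(W, n_bits=8):
    """Post-training symmetric linear quantization, one scale per output channel."""
    qmax = 2 ** (n_bits - 1) - 1                      # 127 for INT8
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)          # guard all-zero channels
    q = np.clip(np.round(W / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(1)
W = rng.normal(scale=0.1, size=(256, 256)).astype(np.float32)  # a toy weight matrix
q, s = quantize_per_channel(W)
err = np.abs(dequantize(q, s) - W).max()
print(q.dtype, err)  # int8, reconstruction error bounded by half a quantization step
```

Storing `q` (INT8) plus one FP32 scale per channel in place of FP32 weights gives the ≈4× memory reduction cited above.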
3. Federated Learning and Security Considerations
Lightweight IDS transformers are frequently embedded in federated learning (FL) frameworks to address data privacy, distribution skew, and regulatory constraints:
- Aggregation: The standard FedAvg protocol aggregates global model parameters as a weighted average of local updates; FedProx introduces proximal regularization to limit client drift (S et al., 30 Dec 2025, S et al., 30 Dec 2025, Adjewa et al., 2024).
- Differential Privacy: Gradient clipping and the addition of Gaussian noise to local gradients, monitored via a Rényi DP accountant, yield rigorous (ε, δ)-differential privacy guarantees during federated optimization (S et al., 30 Dec 2025).
- Device-Edge Convergence: FL experiments show minimal delay and performance degradation relative to centralized training (FedSecureFormer: –1.03% accuracy under FL, –4.04% with DP; FedLiTeCAN: 6.46% maximum FL drop), especially when client counts and local epochs are scaled to address non-IID data (S et al., 30 Dec 2025, S et al., 30 Dec 2025, Adjewa et al., 2024).
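The FedAvg aggregation step referenced above is a sample-size-weighted average of client parameters. A minimal sketch (client tensors and sizes are illustrative):

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """FedAvg: aggregate each parameter tensor as sum_k (n_k / n) * w_k."""
    total = sum(client_sizes)
    agg = [np.zeros_like(w) for w in client_weights[0]]
    for weights, n_k in zip(client_weights, client_sizes):
        for a, w in zip(agg, weights):
            a += (n_k / total) * w                     # weight by local dataset size
    return agg

# Two hypothetical clients, one parameter tensor each, holding 100 and 300 samples.
w1 = [np.full((2, 2), 1.0)]
w2 = [np.full((2, 2), 5.0)]
g = fedavg([w1, w2], [100, 300])
print(g[0])  # every entry is 0.25*1 + 0.75*5 = 4.0
```

FedProx modifies only the local objective (adding a proximal term pulling local weights toward the last global model); the server-side aggregation shown here is unchanged.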
4. Dataset Selection, Preprocessing, and Augmentation
Lightweight transformer IDS models are typically evaluated on specialized, multi-class vehicular, IoT, drone, or general-purpose network intrusion datasets:
- Datasets: VeReMi Extension for vehicular misbehavior (FedSecureFormer), CAN-bus attack sets (FedLiTeCAN), ISOT Drone dataset (TSLT-Net), CICIoT2023 (EIDS), and Edge-IIoTset (Optimized BERT) (S et al., 30 Dec 2025, S et al., 30 Dec 2025, Biswas et al., 3 Oct 2025, Panopoulos et al., 22 Jun 2025, Adjewa et al., 2024).
- Feature Engineering: Input representations include sliding-window feature matrices (FedSecureFormer), direct raw packet bytes (EIDS), and dense embedding projections (TSLT-Net: 128-dim, reshaped to 16 tokens × 8 dims). Positional encodings (learned or sinusoidal) are systematically incorporated (S et al., 30 Dec 2025, S et al., 30 Dec 2025, Biswas et al., 3 Oct 2025, Panopoulos et al., 22 Jun 2025).
- Augmentation: Comprehensive pipelines—subflow truncation, jitter/timing noise injection, packet drop/insertion, and GAN-based generation—are employed to enhance model robustness to unseen attacks and noise (S et al., 30 Dec 2025, Panopoulos et al., 22 Jun 2025).
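Two of the augmentation steps above (jitter noise and packet drops) can be sketched on a (timesteps, features) flow window as follows; the hyperparameter values and the zero-padding strategy are illustrative assumptions, not taken from the cited pipelines:

```python
import numpy as np

rng = np.random.default_rng(42)

def augment_window(x, jitter_std=0.01, drop_prob=0.05):
    """Robustness augmentation for a (timesteps, features) window:
    Gaussian feature/timing jitter plus random packet (row) dropout,
    re-padded with zeros so batch shapes stay fixed."""
    x = x + rng.normal(scale=jitter_std, size=x.shape)   # jitter / timing noise
    keep = rng.random(x.shape[0]) >= drop_prob           # simulate dropped packets
    x = x[keep]
    pad = np.zeros((len(keep) - keep.sum(), x.shape[1]))
    return np.vstack([x, pad])

window = rng.normal(size=(20, 16))   # toy flow window: 20 packets, 16 features
aug = augment_window(window)
print(aug.shape)                     # (20, 16) -- shape preserved after dropout
```

GAN-based generation and subflow truncation operate at the dataset level rather than per window and are omitted here.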
5. Performance Metrics and Empirical Results
Rigorous evaluation is performed using standard IDS metrics (accuracy, precision, recall, F1), early detection benchmarks, latency, and memory footprint:
| Model | Params | Accuracy | Inference Latency | Edge Memory | FL Drop | Dataset(s) |
|---|---|---|---|---|---|---|
| FedSecureFormer | 1.7M | 93.69% | 3.78 ms/seq (Nano) | 6.8 MB | –1.03% (FL) | VeReMi, GAN Attn |
| FedLiTeCAN | 104K | >98.5–99.9% | 0.61 ms/msg (Nano) | 0.4 MB | 6.46% max | Car-Hack, Survival |
| TSLT-Net | 9.7K | 99.99% | <1 ms (A53) | 0.04 MB | — | ISOT Drone |
| EIDS | 5K | 96.67% | <2 ms/flow (RasPi) | <20 MB | — | CICIoT2023 |
| Opt. BERT (4L) | 11.2M | 97.77% | 0.45 s (Pi4) | 30 MB quant. | 1–8% (FL) | Edge-IIoTset |
| BERT-of-Theseus | 788 | 99% | — | — | — | CIC-IDS, TON_IoT |
Centralized and federated deployments exhibit similar high detection rates; minimal, highly-optimized models such as TSLT-Net and EIDS achieve near-perfect accuracy with only 0.04 MB/5 KB parameter footprints and millisecond-level inference latencies on low-power MCUs (S et al., 30 Dec 2025, S et al., 30 Dec 2025, Biswas et al., 3 Oct 2025, Panopoulos et al., 22 Jun 2025, Adjewa et al., 2024, Kheddar, 2024).
6. Design Trade-Offs, Best Practices, and Open Challenges
Key guidelines and observed trade-offs include:
- Model Size vs. Accuracy: Parameter reduction of >90% (e.g., from 110M to 1.7M or <0.1M) typically yields only a 1–2% drop in classification accuracy. Aggressive compression below 1M parameters may require distillation or architecture-specific tuning (S et al., 30 Dec 2025, S et al., 30 Dec 2025, Panopoulos et al., 22 Jun 2025, Kheddar, 2024).
- Throughput and Latency: Real-time operation is achievable (<10 ms/sequence or <1 ms/sample) on embedded CPUs, ARM Cortex-A53, or Jetson Nano (S et al., 30 Dec 2025, S et al., 30 Dec 2025, Biswas et al., 3 Oct 2025).
- Critical Open Challenges: Underexplored avenues include systematic exploitation of sparse/low-rank attention, structured pruning, mixed-precision inference, and explicit design of IDS-specific lightweight transformers (e.g., Linformer, MobileViT), as well as dedicated benchmarking of latency and resource usage within real-world deployments (Kheddar, 2024).
- Deployment Practice: Recommendations include local hashing/tokenization of flows, per-channel INT8 quantization, strict sliding-window input packing to bound compute, and periodic student–teacher retraining pipelines (S et al., 30 Dec 2025, Adjewa et al., 2024, Kheddar, 2024).
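The "strict sliding-window input packing" recommendation above amounts to guaranteeing that every inference call sees a fixed-size slice of the packet stream, which bounds per-call compute on the edge device. A minimal sketch (window and stride values are illustrative):

```python
import numpy as np

def pack_sliding_windows(stream, window=20, stride=10):
    """Pack a (packets, features) stream into fixed-size overlapping windows.
    Fixed window size => fixed attention cost per model call."""
    n = (len(stream) - window) // stride + 1
    return np.stack([stream[i * stride : i * stride + window] for i in range(n)])

stream = np.arange(100, dtype=np.float32).reshape(50, 2)  # 50 packets, 2 features
batch = pack_sliding_windows(stream)
print(batch.shape)  # (4, 20, 2): 4 windows of 20 packets each
```

Trailing packets that do not fill a complete window are held over until the next stride boundary rather than processed with a variable-length input.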
7. Application Domains and Extensions
Lightweight transformer IDS models have been developed and evaluated in various application domains:
- Connected and Autonomous Vehicles (CAV): FedSecureFormer, FedLiTeCAN tailored for in-vehicle CAN bus and multi-client federated deployment (S et al., 30 Dec 2025, S et al., 30 Dec 2025).
- IoT Networks: EIDS and optimized BERT architectures for early detection and scalable 5G/IoT edge security (Panopoulos et al., 22 Jun 2025, Adjewa et al., 2024).
- Drone and UAV Networks: TSLT-Net for drone-specific multi-class and anomaly detection at the edge (Biswas et al., 3 Oct 2025).
This suggests that the architectural advances found in these works (parameter minimization, federated security, advanced temporal encoding, robust augmentation pipelines) are highly generalizable to a wider range of cyber-physical and embedded security domains. Significant future work is needed to standardize evaluation metrics for lightweight transformer IDSs, exploit more advanced efficiency techniques, and ensure interpretability and updatability of compact models over time.