qAttCNN: QoE Prediction for Encrypted Video
- The paper introduces a novel qAttCNN architecture that integrates a 1D-to-2D embedding, masked multi-head self-attention, and a ResNet CNN head to predict QoE metrics from encrypted traffic.
- It achieves state-of-the-art mean absolute error percentages of 2.14% for BRISQUE and 7.39% for FPS, outperforming baseline models and ablated versions.
- The method enables real-time QoE monitoring for ISPs using only fine-grained packet size sequences, paving the way for effective network-level quality management.
The QoE Attention Convolutional Neural Network (qAttCNN) is a self-attention-augmented deep convolutional neural architecture for real-time prediction of Quality of Experience (QoE) metrics from encrypted video-conferencing traffic. Designed to address the limitations faced by Internet Service Providers (ISPs) in QoE assessment on fully encrypted streams, qAttCNN leverages only the sequence of observed packet sizes at fine temporal resolution to infer no-reference metrics—specifically, BRISQUE and frames per second (FPS)—for video traffic such as WhatsApp video calls. By combining a 1D-to-2D embedding, masked multi-head self-attention, and a ResNet convolutional head, the model achieves state-of-the-art mean absolute error percentages (MAEPs) of 2.14% for BRISQUE and 7.39% for FPS without requiring access to media payloads (Sidorov et al., 11 Jan 2026).
1. qAttCNN Architecture Overview
The qAttCNN model is designed to extract QoE-relevant information from a window of consecutive packet-size measurements of the encrypted video traffic:
- Input: For each sample, 350 consecutive packet sizes sampled at 1 ms intervals, forming a vector $\mathbf{x} = (x_1, \ldots, x_{350})$.
- Preprocessing: Each window is normalized (zero mean, unit variance), transformed with a real-valued Fast Fourier Transform (FFT) of which only the real part is retained, and batched to form the model's input tensors.
- Embedding Module (EMBD): A single 1×1 convolution lifts the 1D input to a 2D “image,” enabling the learning of spatial (temporal-correlation) patterns and serving as a learnable feature projection.
- Multi-Head Self-Attention (MHSA): Operates on the embedding. The tensor is projected into queries, keys, and values ($Q$, $K$, $V$). Scaled dot-product attention with causal masking is applied:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}} + M\right)V$$

Here, the mask $M$ assigns $-\infty$ to future positions, head outputs are concatenated, and a second 1×1 convolution projects the result back to the embedding dimensions.
- ResNet-Based CNN Head: The self-attended feature map is processed by a ResNet (experimentally: ResNet-18, ‑34, and ‑50) built from standard blocks (3×3 convolutions, batch normalization, ReLU, identity skips, and pooling). The resulting feature vector (size 350) is fed to a fully connected (FC) layer.
- Output Head: A single linear neuron generates the scalar predicted QoE metric (either BRISQUE or FPS).
Separate models are trained for each QoE metric; inference is strictly no-reference and no-payload.
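The preprocessing and masking steps above can be sketched in NumPy. This is a minimal single-head illustration, not the paper's implementation: the real model derives $Q$, $K$, $V$ via learned 1×1 convolutions, whereas here $Q = K = V = x$, and the full-length FFT with retained real part is an assumption about how the 350-sample window is transformed.

```python
import numpy as np

def preprocess(window):
    """Z-score normalize a packet-size window, then keep the real part of its FFT."""
    w = (window - window.mean()) / (window.std() + 1e-8)
    return np.fft.fft(w).real  # real part only, as described in the paper

def causal_attention(x, d_k):
    """Single-head scaled dot-product attention with a causal mask.

    x: (T, d_k) embedded sequence; Q = K = V = x here for illustration --
    the actual model derives Q, K, V via learned 1x1 convolutions.
    """
    T = x.shape[0]
    scores = x @ x.T / np.sqrt(d_k)                  # (T, T) similarity scores
    mask = np.triu(np.ones((T, T), dtype=bool), 1)   # True above diagonal = future
    scores = np.where(mask, -np.inf, scores)         # block attention to the future
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)    # row-wise softmax
    return weights @ x

rng = np.random.default_rng(0)
window = rng.integers(40, 1500, size=350).astype(float)  # synthetic packet sizes
feats = preprocess(window)
out = causal_attention(feats.reshape(-1, 1), d_k=1)
print(out.shape)  # (350, 1)
```

Note that with a causal mask the first position can only attend to itself, so its output equals its input feature, a quick sanity check on the masking.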
2. Training Protocol and Optimization
- Loss Function: The objective aligns exactly with the evaluation criterion, Mean Absolute Error Percentage (MAEP):

$$\mathrm{MAEP} = \frac{100\%}{N}\sum_{i=1}^{N}\left|\frac{y_i - \hat{y}_i}{y_i}\right|$$

where $y_i$ is the target QoE value and $\hat{y}_i$ is the prediction.
- Optimizer and Hyperparameters:
- Adam optimizer with a cyclically scheduled learning rate.
- Cyclical learning rate (LR) scheduling: 200-epoch period, with stepwise decrements initially every 50 epochs and then exponential decay in the final quarter.
- Dropout applied to the final FC layer, with the rate increased over training and then held constant.
- Data augmentation: Additive Gaussian noise and random row shuffling.
- Transfer learning: The ResNet backbone is initialized from ImageNet weights; backbone layers are frozen except the final FC layer, while the EMBD/ATT modules are fully trainable.
- Training Duration: 1,000 epochs per cross-validation fold. The batch size is not specified; typical values are 32–128.
- Data Split: 10-fold cross-validation; in each fold, 90% for training and 10% for held-out testing.
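The MAEP objective above is straightforward to implement; a minimal NumPy version (the function name `maep` is ours, not the paper's):

```python
import numpy as np

def maep(y_true, y_pred):
    """Mean Absolute Error Percentage: mean of |y - y_hat| / |y|, in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 100.0 * np.mean(np.abs(y_true - y_pred) / np.abs(y_true))

# e.g. BRISQUE targets vs. predictions:
print(maep([50.0, 80.0], [49.0, 84.0]))  # 0.5 * (1/50 + 4/80) * 100 = 3.5
```

Because the error is normalized per-sample by the target, MAEP weights relative rather than absolute deviations, which is why the same loss is meaningful for both BRISQUE (scale ≈20–100) and FPS (scale ≈20).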
3. Dataset Construction and Feature Analysis
- WhatsApp Video Call Dataset:
- 51,341 samples, each built from packet sizes aggregated in 1 ms windows.
- Collected in a controlled smartphone–laptop scenario using Wireshark.
- Features comprise the packet-size columns packet_size_1 through packet_size_350; other captured fields (time, IP address/port, protocol) are ignored in modeling.
- Ground Truth Labels:
- BRISQUE: No-reference perceptual metric for video quality (values from ≈20 to ≈100).
- FPS: Client-computed, typically around 20 FPS.
- Feature Properties: The packet-size sequences were assessed for stationarity via the Augmented Dickey–Fuller (ADF) test (Section 5).
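Turning a raw 1 ms-resolution trace into model inputs amounts to slicing it into fixed-length windows. A sketch under stated assumptions: the paper specifies 350-sample windows but not the stride between consecutive samples, so the non-overlapping default below is illustrative.

```python
import numpy as np

def make_windows(packet_sizes, window=350, stride=350):
    """Slice a 1 ms-resolution packet-size trace into fixed-length model inputs.

    The 350-sample window matches the paper; the stride is an assumption
    (non-overlapping windows here).
    """
    trace = np.asarray(packet_sizes, dtype=float)
    n = (len(trace) - window) // stride + 1
    return np.stack([trace[i * stride : i * stride + window] for i in range(n)])

trace = np.arange(1400.0)  # synthetic 1.4 s trace of per-ms packet sizes
X = make_windows(trace)
print(X.shape)  # (4, 350)
```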
4. Benchmarking and Ablation Studies
The table below summarizes MAEP scores of qAttCNN against baseline models and under ablation conditions:
| Model / Variant | BRISQUE (MAEP %) | FPS (MAEP %) |
|---|---|---|
| qAttCNN (ResNet-34 head) | 2.14 ± 0.025 | — |
| qAttCNN (ResNet-18 head) | — | 7.39 ± 0.237 |
| QoENet1D-Optimized (baseline) | 2.61 ± 0.026 | — |
| QoENet1D-Basic (baseline) | — | 10.93 ± 0.301 |
| Random Forest | 5.54 ± 0.030 | 15.80 ± 0.297 |
| No CNN head | 16.34 | 28.16 |
| No Attention | 2.90 | 10.58 |
| No FFT | 2.47 | 11.07 |
| Trainable EMBD | 2.16 | 8.24 |
| Short data (256 features) | 2.88 | 9.72 |
Statistical significance is established via non-overlapping standard errors across 10 folds.
Model Depth Selection:
- For BRISQUE, ResNet-34 head yields the lowest MAEP (2.14%), outperforming ResNet-18 (2.99%) and ResNet-50 (2.53%).
- For FPS, ResNet-18 bests shallower or deeper heads (7.39%), while ResNet-34 (9.09%) and ResNet-50 (19.51%) underperform.
Ablation confirms the contributions of each architectural component: omitting the CNN head or attention module substantially increases error. Removing FFT or using truncated input modestly degrades accuracy.
5. Core Formulations and Theoretical Rationale
- Self-Attention:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}} + M\right)V$$

The division by $\sqrt{d_k}$ stabilizes the variance of the attention logits: if the components of $q, k \in \mathbb{R}^{d_k}$ are i.i.d. with zero mean and unit variance, then $\mathrm{Var}(q^{\top}k) = d_k$, so $\mathrm{Var}\!\left(q^{\top}k / \sqrt{d_k}\right) = 1$.
- MAEP Evaluation/Loss:

$$\mathrm{MAEP} = \frac{100\%}{N}\sum_{i=1}^{N}\left|\frac{y_i - \hat{y}_i}{y_i}\right|$$
- Cyclical Learning Rate: The rate follows a 200-epoch cycle, stepped down every 50 epochs and decayed exponentially over the final quarter of each cycle (Section 2).
- ADF Stationarity Test: The standard augmented Dickey–Fuller regression

$$\Delta y_t = \alpha + \beta t + \gamma\, y_{t-1} + \sum_{i=1}^{p} \delta_i\, \Delta y_{t-i} + \varepsilon_t,$$

where rejecting the unit-root null hypothesis ($\gamma = 0$) indicates a stationary sequence.
These formulations underpin the architectural and optimization choices.
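The $\sqrt{d_k}$ scaling argument can be verified numerically with a quick Monte Carlo check (dimensions and sample count here are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(42)
d_k = 64
n = 200_000

# Draw i.i.d. standard-normal query/key vectors and compute their dot products.
q = rng.standard_normal((n, d_k))
k = rng.standard_normal((n, d_k))
dots = np.einsum('ij,ij->i', q, k)  # n raw attention logits

print(dots.var())                   # close to d_k = 64
print((dots / np.sqrt(d_k)).var())  # close to 1 after scaling
```

The empirical variance of the raw dot products is close to $d_k$, and dividing by $\sqrt{d_k}$ brings it back near 1, keeping the softmax inputs in a well-conditioned range regardless of head dimensionality.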
6. Deployability, Limitations, and Future Directions
- Operational Relevance for ISPs:
- Enables real-time QoE monitoring via only packet size counters at 1 ms granularity, avoiding payload decryption or direct media access.
- Viable for deployment in on-path network devices with standard GPU resources, as the architecture is compact (EMBD and ATT 1×1 convs plus compact ResNet head).
- Predicts BRISQUE and FPS in time to trigger network-level mitigation (e.g., rerouting, traffic shaping).
- Limitations and Open Questions:
- Generalization to platforms beyond WhatsApp (e.g., Zoom, Google Meet, Telegram) requires empirical validation.
- Model backbone alternatives such as ResNeXt or MobileNet may further reduce latency.
- Multi-task extensions could enable joint BRISQUE and FPS prediction in a unified model.
- Online continual learning mechanisms could adapt the model to evolving codecs and transport protocols.
- Attention maps offer a potential pathway to explainable QoE prediction, identifying traffic segments most influential for quality degradation.
In sum, qAttCNN advances the state of the art in encrypted video QoE inference by fusing temporal packet statistics, self-attention, and convolutional feature extraction, delivering accurate, practical metrics for real-time operational use by ISPs (Sidorov et al., 11 Jan 2026).