qAttCNN: QoE Prediction for Encrypted Video
- The paper introduces a novel qAttCNN architecture that integrates a 1D-to-2D embedding, masked multi-head self-attention, and a ResNet CNN head to predict QoE metrics from encrypted traffic.
- It achieves state-of-the-art mean absolute error percentages of 2.14% for BRISQUE and 7.39% for FPS, outperforming baseline models and ablated versions.
- The method enables real-time QoE monitoring for ISPs using only fine-grained packet size sequences, paving the way for effective network-level quality management.
The QoE Attention Convolutional Neural Network (qAttCNN) is a self-attention-augmented deep convolutional neural architecture for real-time prediction of Quality of Experience (QoE) metrics from encrypted video-conferencing traffic. Designed to address the limitations faced by Internet Service Providers (ISPs) in QoE assessment on fully encrypted streams, qAttCNN leverages only the sequence of observed packet sizes at fine temporal resolution to infer no-reference metrics—specifically, BRISQUE and frames per second (FPS)—for video traffic such as WhatsApp video calls. By combining a 1D-to-2D embedding, masked multi-head self-attention, and a ResNet convolutional head, the model achieves state-of-the-art mean absolute error percentages (MAEPs) of 2.14% for BRISQUE and 7.39% for FPS without requiring access to media payloads (Sidorov et al., 11 Jan 2026).
1. qAttCNN Architecture Overview
The qAttCNN model is designed to extract QoE-relevant information from a window of consecutive packet-size measurements of the encrypted video traffic:
- Input: For each sample, 350 consecutive packet sizes sampled at 1 ms intervals, forming a vector $\mathbf{x} = (x_1, \ldots, x_{350})$.
- Preprocessing: Each window is normalized (zero mean, unit variance), transformed with a real-valued Fast Fourier Transform (FFT) of which only the real part is retained, and batched to form the model's input tensors.
- Embedding Module (EMBD): A single 1×1 convolution lifts the 1D input to a 2D “image,” enabling the learning of spatial (temporal-correlation) patterns and serving as a learnable feature projection.
- Multi-Head Self-Attention (MHSA): Operates on the embedding. The tensor is projected into queries, keys, and values ($Q$, $K$, $V$). Scaled dot-product attention with causal masking is applied:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}} + M\right)V$$

Here, the mask $M$ assigns $-\infty$ to future positions, head outputs are concatenated, and a second 1×1 convolution projects the result back to the embedding dimensions.
- ResNet-Based CNN Head: The self-attended feature map is processed by a ResNet (experimentally: ResNet-18, ‑34, and ‑50) built from standard blocks (3×3 convolutions, batch normalization, ReLU, identity skips, and pooling). The resulting feature vector (size 350) is fed to a fully connected (FC) layer.
- Output Head: A single linear neuron generates the scalar predicted QoE metric (either BRISQUE or FPS).
Separate models are trained for each QoE metric; inference is strictly no-reference and no-payload.
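The preprocessing and masking steps above can be sketched in NumPy. This is a minimal single-head illustration, not the paper's implementation: the real model derives $Q$, $K$, $V$ via learned 1×1 convolutions, whereas here $Q = K = V = x$, and the full-length FFT with retained real part is an assumption about how the 350-sample window is transformed.

```python
import numpy as np

def preprocess(window):
    """Z-score normalize a packet-size window, then keep the real part of its FFT."""
    w = (window - window.mean()) / (window.std() + 1e-8)
    return np.fft.fft(w).real  # real part only, as described in the paper

def causal_attention(x, d_k):
    """Single-head scaled dot-product attention with a causal mask.

    x: (T, d_k) embedded sequence; Q = K = V = x here for illustration --
    the actual model derives Q, K, V via learned 1x1 convolutions.
    """
    T = x.shape[0]
    scores = x @ x.T / np.sqrt(d_k)                  # (T, T) similarity scores
    mask = np.triu(np.ones((T, T), dtype=bool), 1)   # True above diagonal = future
    scores = np.where(mask, -np.inf, scores)         # block attention to the future
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)    # row-wise softmax
    return weights @ x

rng = np.random.default_rng(0)
window = rng.integers(40, 1500, size=350).astype(float)  # synthetic packet sizes
feats = preprocess(window)
out = causal_attention(feats.reshape(-1, 1), d_k=1)
print(out.shape)  # (350, 1)
```

Note that with a causal mask the first position can only attend to itself, so its output equals its input feature, a quick sanity check on the masking.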
2. Training Protocol and Optimization
- Loss Function: The objective aligns exactly with the evaluation criterion, Mean Absolute Error Percentage (MAEP):

$$\mathrm{MAEP} = \frac{100\%}{N}\sum_{i=1}^{N}\left|\frac{y_i - \hat{y}_i}{y_i}\right|$$

where $y_i$ is the target QoE value and $\hat{y}_i$ is the prediction.
- Optimizer and Hyperparameters:
- Adam optimizer with a cyclically scheduled learning rate.
- Cyclical learning rate (LR) scheduling: 200-epoch period, with stepwise decrements initially every 50 epochs and then exponential decay in the final quarter.
- Dropout applied to the final FC layer, with the rate increased over training and then held constant.
- Data augmentation: Additive Gaussian noise and random row shuffling.
- Transfer learning: The ResNet backbone is initialized from ImageNet weights; backbone layers are frozen except the final FC layer, while the EMBD/ATT modules are fully trainable.
- Training Duration: 1,000 epochs per cross-validation fold. The batch size is not specified; typical values are 32–128.
- Data Split: 10-fold cross-validation; in each fold, 90% for training and 10% for held-out testing.
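The MAEP objective above is straightforward to implement; a minimal NumPy version (the function name `maep` is ours, not the paper's):

```python
import numpy as np

def maep(y_true, y_pred):
    """Mean Absolute Error Percentage: mean of |y - y_hat| / |y|, in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 100.0 * np.mean(np.abs(y_true - y_pred) / np.abs(y_true))

# e.g. BRISQUE targets vs. predictions:
print(maep([50.0, 80.0], [49.0, 84.0]))  # 0.5 * (1/50 + 4/80) * 100 = 3.5
```

Because the error is normalized per-sample by the target, MAEP weights relative rather than absolute deviations, which is why the same loss is meaningful for both BRISQUE (scale ≈20–100) and FPS (scale ≈20).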
3. Dataset Construction and Feature Analysis
- WhatsApp Video Call Dataset:
- 51,341 samples, each built from packet sizes aggregated in 1 ms windows.
- Collected in a controlled smartphone–laptop scenario using Wireshark.
- Features comprise the packet-size columns packet_size_1 through packet_size_350; other captured fields (time, IP address/port, protocol) are ignored in modeling.
- Ground Truth Labels:
- BRISQUE: No-reference perceptual metric for video quality (values from ≈20 to ≈100).
- FPS: Client-computed, typically around 20 FPS.
- Feature Properties: The packet-size sequences were assessed for stationarity via the Augmented Dickey–Fuller (ADF) test (Section 5).
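Turning a raw 1 ms-resolution trace into model inputs amounts to slicing it into fixed-length windows. A sketch under stated assumptions: the paper specifies 350-sample windows but not the stride between consecutive samples, so the non-overlapping default below is illustrative.

```python
import numpy as np

def make_windows(packet_sizes, window=350, stride=350):
    """Slice a 1 ms-resolution packet-size trace into fixed-length model inputs.

    The 350-sample window matches the paper; the stride is an assumption
    (non-overlapping windows here).
    """
    trace = np.asarray(packet_sizes, dtype=float)
    n = (len(trace) - window) // stride + 1
    return np.stack([trace[i * stride : i * stride + window] for i in range(n)])

trace = np.arange(1400.0)  # synthetic 1.4 s trace of per-ms packet sizes
X = make_windows(trace)
print(X.shape)  # (4, 350)
```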
4. Benchmarking and Ablation Studies
The table below summarizes MAEP scores of qAttCNN against baseline models and under ablation conditions:
| Model / Variant | BRISQUE (MAEP %) | FPS (MAEP %) |
|---|---|---|
| qAttCNN (ResNet-34 head) | 2.14 ± 0.025 | — |
| qAttCNN (ResNet-18 head) | — | 7.39 ± 0.237 |
| QoENet1D-Optimized (baseline) | 2.61 ± 0.026 | — |
| QoENet1D-Basic (baseline) | — | 10.93 ± 0.301 |
| Random Forest | 5.54 ± 0.030 | 15.80 ± 0.297 |
| No CNN head | 16.34 | 28.16 |
| No Attention | 2.90 | 10.58 |
| No FFT | 2.47 | 11.07 |
| Trainable EMBD | 2.16 | 8.24 |
| Short data (256 features) | 2.88 | 9.72 |
Statistical significance is established via non-overlapping standard errors across 10 folds.
Model Depth Selection:
- For BRISQUE, ResNet-34 head yields the lowest MAEP (2.14%), outperforming ResNet-18 (2.99%) and ResNet-50 (2.53%).
- For FPS, ResNet-18 bests shallower or deeper heads (7.39%), while ResNet-34 (9.09%) and ResNet-50 (19.51%) underperform.
Ablation confirms the contributions of each architectural component: omitting the CNN head or attention module substantially increases error. Removing FFT or using truncated input modestly degrades accuracy.
5. Core Formulations and Theoretical Rationale
- Self-Attention:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}} + M\right)V$$

The division by $\sqrt{d_k}$ stabilizes the variance of the attention logits: if the components of $q, k \in \mathbb{R}^{d_k}$ are i.i.d. with zero mean and unit variance, then $\mathrm{Var}(q^{\top}k) = d_k$, so $\mathrm{Var}\!\left(q^{\top}k / \sqrt{d_k}\right) = 1$.
- MAEP Evaluation/Loss:

$$\mathrm{MAEP} = \frac{100\%}{N}\sum_{i=1}^{N}\left|\frac{y_i - \hat{y}_i}{y_i}\right|$$
- Cyclical Learning Rate: The rate follows a 200-epoch cycle, stepped down every 50 epochs and decayed exponentially over the final quarter of each cycle (Section 2).
- ADF Stationarity Test: The standard augmented Dickey–Fuller regression

$$\Delta y_t = \alpha + \beta t + \gamma\, y_{t-1} + \sum_{i=1}^{p} \delta_i\, \Delta y_{t-i} + \varepsilon_t,$$

where rejecting the unit-root null hypothesis ($\gamma = 0$) indicates a stationary sequence.
These formulations underpin the architectural and optimization choices.
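The $\sqrt{d_k}$ scaling argument can be verified numerically with a quick Monte Carlo check (dimensions and sample count here are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(42)
d_k = 64
n = 200_000

# Draw i.i.d. standard-normal query/key vectors and compute their dot products.
q = rng.standard_normal((n, d_k))
k = rng.standard_normal((n, d_k))
dots = np.einsum('ij,ij->i', q, k)  # n raw attention logits

print(dots.var())                   # close to d_k = 64
print((dots / np.sqrt(d_k)).var())  # close to 1 after scaling
```

The empirical variance of the raw dot products is close to $d_k$, and dividing by $\sqrt{d_k}$ brings it back near 1, keeping the softmax inputs in a well-conditioned range regardless of head dimensionality.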
6. Deployability, Limitations, and Future Directions
- Operational Relevance for ISPs:
- Enables real-time QoE monitoring via only packet size counters at 1 ms granularity, avoiding payload decryption or direct media access.
- Viable for deployment in on-path network devices with standard GPU resources, as the architecture is compact (EMBD and ATT 1×1 convs plus compact ResNet head).
- Predicts BRISQUE and FPS in time to trigger network-level mitigation (e.g., rerouting, traffic shaping).
- Limitations and Open Questions:
- Generalization to platforms beyond WhatsApp (e.g., Zoom, Google Meet, Telegram) requires empirical validation.
- Model backbone alternatives such as ResNeXt or MobileNet may further reduce latency.
- Multi-task extensions could enable joint BRISQUE and FPS prediction in a unified model.
- Online continual learning mechanisms could adapt the model to evolving codecs and transport protocols.
- Attention maps offer a potential pathway to explainable QoE prediction, identifying traffic segments most influential for quality degradation.
In sum, qAttCNN advances the state of the art in encrypted video QoE inference by fusing temporal packet statistics, self-attention, and convolutional feature extraction, delivering accurate, practical metrics for real-time operational use by ISPs (Sidorov et al., 11 Jan 2026).