
Federated Learning for Video Violence Detection

Updated 28 December 2025
  • The paper introduces a federated learning approach that decentralizes video data processing to enhance privacy while maintaining detection accuracy.
  • It employs methods like DP-SGD, secure aggregation, and LoRA-based adapters to optimize resource use and reduce communication overhead.
  • Hybrid architectures combining CNNs and vision-language models address non-IID challenges, ensuring efficient, real-time violence detection.

Federated learning frameworks for video violence detection address the dual requirement of data privacy and effective moderation in increasingly decentralized surveillance and content-moderation environments. By shifting computation and training onto edge devices, such frameworks prevent raw video data from leaving client devices, mitigate bandwidth costs, and reduce the risks associated with centralized inference pipelines. These systems combine advanced spatio-temporal modeling, parameter-efficient adaptation, privacy amplification techniques, and resource-aware deployment to operationalize robust, privacy-preserving violence detectors on massive video corpora.

1. Federated System Architectures and Protocols

Federated video violence detection frameworks orchestrate collaborative model training between a central server and multiple distributed client devices, each holding exclusive access to local video data. Most implementations use the FedAvg protocol, wherein at each round, the server broadcasts the current global model (or partial parameters) to all or a subset of clients, who then perform local updates and transmit masked or encrypted gradients or parameter deltas back to the server for aggregation (Tao et al., 21 Dec 2025, Thuau et al., 10 Nov 2025, Kassir et al., 1 Apr 2025).
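The round structure above can be sketched in a few lines. This is a toy, self-contained illustration of the FedAvg averaging rule, not any paper's implementation: `local_update` stands in for client-side training (here it just nudges parameters toward the mean of each client's private scalars), and the server applies $\theta^{(t+1)} = \theta^{(t)} + \frac{1}{K}\sum_i \Delta\theta_i$.

```python
# Schematic FedAvg round: the server broadcasts parameters, each client
# computes a local delta on private data, and the server averages the deltas.
# local_update is a toy stand-in for real client-side training.

def local_update(theta, client_data, lr=0.1):
    """Toy 'training' step: nudge each parameter toward the client's data mean."""
    target = sum(client_data) / len(client_data)
    return [lr * (target - p) for p in theta]  # a delta, not new parameters

def fedavg_round(theta, clients):
    """theta^(t+1) = theta^(t) + (1/K) * sum_i delta_i."""
    deltas = [local_update(theta, data) for data in clients]
    k = len(deltas)
    return [p + sum(d[j] for d in deltas) / k for j, p in enumerate(theta)]

theta = [0.0, 0.0]
clients = [[1.0, 3.0], [2.0, 4.0], [0.0, 2.0]]  # each client's private data
for _ in range(50):
    theta = fedavg_round(theta, clients)        # converges toward the global mean 2.0
```

With these toy clients the global optimum is the overall data mean (2.0), which the averaged updates approach without any client ever sharing raw data.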

Key architectural instantiations include:

  • FedVideoMAE: The server maintains only LoRA-based adapter parameters $\theta$ (comprising low-rank adapters, soft prompts, and classification head), while all clients store the entire (frozen) VideoMAE backbone $W_0$ and their raw video data. Each round consists of the server broadcasting $\theta^{(t)}$, client-side DP-SGD adaptation of $\theta$, secure aggregation with pairwise random masking, and simple averaging on the server: $\theta^{(t+1)} = \theta^{(t)} + \frac{1}{K}\sum_{i=1}^{K} \Delta\theta_i$ (Tao et al., 21 Dec 2025).
  • Personalized Federated Learning (PFL): The global model is partitioned into globally shared base layers and client-local personalization layers, with only the former averaged by the server. Clients refine their personal decision layers to adapt to skewed local distributions (Kassir et al., 1 Apr 2025).
  • Hybrid FL/VLMs/CNNs: In hybrid systems, lightweight 3D-CNNs serve as always-on detectors, and LoRA-adapted vision-LLMs (VLMs) are engaged on-demand for ambiguous or context-rich queries. Servers coordinate backbone/adapter aggregation and may selectively invoke clients for resource efficiency (Thuau et al., 10 Nov 2025, Thuau et al., 20 Oct 2025).
  • FedMIL + DPPQ: In weakly-supervised settings, each client holds "bags" (clips) segmented into instances (frames or chunks), with attention-based pooling for bag-level labels. A determinantal point process with quality kernel (DPPQ) selects a diverse client subset for each round, improving training under non-IID distributions and bandwidth constraints (Bastola et al., 2024).
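To make the DPPQ idea concrete, here is a deliberately simplified stand-in: a greedy selection that trades off quality (low local loss) against diversity (feature distance to already-selected clients). A true DPP samples subsets via a kernel determinant; this greedy quality-times-distance rule only approximates that behavior, and all names and values are illustrative.

```python
# Simplified quality-diversity client selection in the spirit of DPPQ.
# NOT the actual determinantal point process from the cited paper: a real DPP
# scores subsets by the determinant of a quality-weighted similarity kernel.
import math

def select_clients(features, losses, k):
    quality = [math.exp(-l) for l in losses]   # lower local loss -> higher quality
    chosen = [max(range(len(losses)), key=lambda i: quality[i])]
    while len(chosen) < k:
        def score(i):
            # distance to the nearest already-chosen client = diversity term
            dist = min(math.dist(features[i], features[j]) for j in chosen)
            return quality[i] * dist
        rest = [i for i in range(len(losses)) if i not in chosen]
        chosen.append(max(rest, key=score))
    return chosen

# Two tight clusters of clients; selection should span both clusters.
features = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
losses = [0.2, 0.3, 0.25, 0.9]
picked = select_clients(features, losses, k=2)  # picks one client per cluster
```

Even this crude rule reproduces the key property the paper relies on: under participation limits, the selected subset covers distinct feature regions rather than oversampling one cluster.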

2. Spatio-temporal Video Representation and Model Design

State-of-the-art federated frameworks exploit transformer-based, convolutional, and multi-instance architectures to encode the temporal dynamics and spatial semantics of video segments:

  • VideoMAE: Employs a masked autoencoding strategy over 16-frame clips (224×224), dividing input into 16×16×2 patches, embedding them, and applying high-ratio (e.g., 75%) random token masking. The encoder processes only unmasked tokens, while a lightweight decoder reconstructs the original. The self-supervised reconstruction loss for each clip is $L_{\text{rec}} = \lVert X_{\text{recon}} - X_{\text{original}} \rVert_2^2$ (Tao et al., 21 Dec 2025).
  • 3D CNNs (Diff-Gated, CNN3D): Spatio-temporal blocks apply 3D convolutions over video blocks (e.g., 24 or 16 frames/windows), with gating mechanisms on either frame differences or optical flow. Personalized or frugal versions decouple the global backbone and client-local heads (Quentin et al., 2023, Thuau et al., 10 Nov 2025, Thuau et al., 20 Oct 2025).
  • Vision-LLMs with LoRA: VLMs (e.g., LLaVA-7B, Ovis-8B) process frame sequences through a frozen image encoder (CLIP), feeding the resulting embeddings, with a prompt, to a language module. Only low-rank adapters (LoRA) at each attention block are trained/federated. For each weight matrix $W_0$, LoRA updates are $W = W_0 + \Delta W = W_0 + AB$, with $A \in \mathbb{R}^{d \times r}$, $B \in \mathbb{R}^{r \times d}$, and $r \ll d$ (Thuau et al., 20 Oct 2025, Thuau et al., 10 Nov 2025, Tao et al., 21 Dec 2025).
  • Attention-based MIL Heads: For weakly supervised video interpretation, attention-MIL heads pool frame- or chunk-level features to bag-level predictions, with learnable gating parameters and attention weights (Bastola et al., 2024).
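The LoRA decomposition can be verified with toy matrices: the effective weight is $W = W_0 + AB$, and only the $2dr$ adapter entries are trainable rather than the $d^2$ backbone entries. The dimensions below are illustrative, not the actual VLM sizes from the cited papers.

```python
# LoRA sketch: W = W0 + A @ B with A (d x r), B (r x d), r << d.
# Only 2*d*r adapter parameters are trained; the d*d backbone stays frozen.

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def lora_weight(W0, A, B):
    delta = matmul(A, B)                       # low-rank update Delta W = A B
    return [[w + d for w, d in zip(wr, dr)] for wr, dr in zip(W0, delta)]

d, r = 4, 1                                    # toy: full dim 4, rank 1
W0 = [[0.0] * d for _ in range(d)]             # frozen backbone weight
A = [[1.0] for _ in range(d)]                  # d x r
B = [[0.5] * d]                                # r x d
W = lora_weight(W0, A, B)                      # every entry becomes 0.5
trainable = 2 * d * r                          # 8 LoRA params vs d*d = 16 frozen
```

At realistic scale the same arithmetic is what yields the small trainable fractions quoted below (e.g., 5.5M adapter parameters against a 156M backbone).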

3. Privacy, Communication, and Efficiency Optimizations

Federated video violence detection must enforce strong defense-in-depth to prevent data leakage and reduce resource consumption:

  • Differential Privacy (DP-SGD): Clients apply per-sample gradient clipping (e.g., $\ell_2$ norm bound $C = 1.0$), add Gaussian noise (calibrated by a Rényi-DP accountant for $(\epsilon, \delta)$-DP), and update only adapters or heads. For a single-step Gaussian mechanism, $\sigma \ge \sqrt{2 \ln(1.25/\delta)}/\epsilon$; in practice $\sigma$ is adjusted for composition and subsampling (Tao et al., 21 Dec 2025).
  • Secure Aggregation: Before transmission, each client applies pairwise random masks to their updates, ensuring the server can only recover the aggregate, not individual updates. Masks cancel out across clients, guaranteeing update privacy during the aggregation step (Tao et al., 21 Dec 2025).
  • Parameter- and Communication-Efficiency: By restricting updates to adapters/prompts/class heads (e.g., 5.5M out of 156M VideoMAE params), frameworks such as FedVideoMAE achieve up to $28.3\times$ bandwidth reduction (22 MB vs. 624 MB per round) (Tao et al., 21 Dec 2025). Quantized LoRA adapters and local heads further decrease communication burden (Thuau et al., 10 Nov 2025).
  • DPPQ Client Selection: FedMIL utilizes a determinantal point process with quality kernel leveraging client feature diversity and local loss, maximizing representational coverage when only a portion of clients can participate each round, thus maintaining accuracy under non-IID and bandwidth constraints (Bastola et al., 2024).

| Method | Trainable Params | Per-round Bandwidth | DP/SA Supported | Test Accuracy |
| --- | --- | --- | --- | --- |
| FedVideoMAE ($\epsilon = 1$–$5$) | 5.5M (3.5%) | 22 MB | Yes | 65–66% |
| Full-model FL (VideoMAE) | 156M | 624 MB | No | 77.25% |
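Two of the defenses listed above have short, checkable sketches: the single-step Gaussian noise scale $\sigma \ge \sqrt{2\ln(1.25/\delta)}/\epsilon$, and pairwise random masks that cancel in the sum so the server recovers only the aggregate. Integer updates and the helper names are illustrative; real secure aggregation derives masks from shared secrets and handles dropouts.

```python
# (1) Gaussian noise scale for a single-step (eps, delta)-DP mechanism.
# (2) Pairwise masking: client i adds +m_ij, client j adds -m_ij, so masks
#     cancel in the aggregate while hiding each individual update.
import math
import random

def gaussian_sigma(eps, delta):
    return math.sqrt(2 * math.log(1.25 / delta)) / eps

def masked_updates(updates, seed=0):
    rng = random.Random(seed)
    n = len(updates)
    masked = list(updates)
    for i in range(n):
        for j in range(i + 1, n):
            m = rng.randint(-1000, 1000)   # shared secret between clients i and j
            masked[i] += m                  # client i adds the mask
            masked[j] -= m                  # client j subtracts the same mask
    return masked

sigma = gaussian_sigma(eps=1.0, delta=1e-5)  # roughly 4.85 for eps=1, delta=1e-5
updates = [3, 7, -2, 5]                      # toy integer client updates
masked = masked_updates(updates)
assert sum(masked) == sum(updates)           # masks cancel: aggregate is exact
```

The invariant in the final assertion is the whole point of secure aggregation: individual masked values look random, but their sum equals the true sum of client updates.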

4. Empirical Performance and Privacy–Utility Trade-offs

Quantitative results are established on canonical datasets, often under strict privacy regimes and realistic non-IID conditions:

  • FedVideoMAE (RWF-2000): Baseline accuracy 77.25% (no DP). Under $\epsilon = 1$ and $5$, $\delta = 10^{-5}$: 65.25% accuracy, F1 = 65.04%, ROC-AUC = 71.15%; for $\epsilon = 10$: 66.00% accuracy, F1 = 63.43%, ROC-AUC = 71.89%. Utility degradation plateaus (flat accuracy curve) for $\epsilon \in [1, 5]$ due to strong noise (Tao et al., 21 Dec 2025).
  • Hybrid CNN/VLM Approaches: Personalized CNN3D yields 90.75% accuracy, AUC = 92.59%, and energy cost ~240 Wh. LoRA-tuned VLMs (LLaVA) achieve slightly lower calibration (ROC AUC = 91.24%), at higher training energy (570 Wh). Zero-shot VLM inference (e.g., Ovis-8B) delivers decent accuracy (65.31–81% for multiclass with semantic grouping), but at the lowest per-inference energy (Thuau et al., 10 Nov 2025, Thuau et al., 20 Oct 2025).
  • PFL Architectures: With personalization layers, accuracy increases to 99.3% (RWF-2000 + Crowd Violence, balanced/imbalanced splits) (Kassir et al., 1 Apr 2025).
  • FedMIL + DPPQ: On non-IID crash detection, DPPQ achieves AUC improvements of 0.005–0.08 over vanilla DPP or random selection, with persistent gains at low data/utilization rates. The same architecture is directly adaptable to violence detection scenarios (Bastola et al., 2024).

5. Mitigation of Non-IID and Heterogeneity

Heterogeneity in client data and label distribution (non-IIDness) is a prevailing challenge in federated video analysis:

  • Stratified Data Split and Sampling: For experimental simulation, data is partitioned such that each client receives balanced numbers of violent/non-violent clips (stratified split), or is subject to label/domain skew (e.g., Dirichlet sampling per client) (Quentin et al., 2023, Thuau et al., 20 Oct 2025, Thuau et al., 10 Nov 2025).
  • Personalization Layers and Local Heads: Model partitioning (decoupled global backbone/personal head) addresses distributional divergence; personalization layers are never aggregated, stabilizing performance on heterogeneous clients (Kassir et al., 1 Apr 2025, Thuau et al., 10 Nov 2025).
  • Proximal Regularization: FedProx-style regularization terms penalize divergence from the global model, stabilizing local updates under skewed distributions (Kassir et al., 1 Apr 2025).
  • Hierarchical Label Grouping: For multiclass settings, semantic embedding and clustering (e.g., k-means over CLIP embeddings) group related categories to enhance VLM generalization (Thuau et al., 10 Nov 2025).
  • Client Clustering/DPPQ: Grouping clients or scheduling via DPPQ increases coverage over rare-event or diverse-feature domains, improving robustness (Bastola et al., 2024).
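The Dirichlet label-skew simulation mentioned above can be sketched with the standard library alone, since Dirichlet samples are normalized Gamma draws. The concentration `alpha` and the class counts below are illustrative; small `alpha` yields highly skewed per-client label mixes.

```python
# Simulating non-IID clients via Dirichlet label skew: for each class, draw
# client shares from Dirichlet(alpha) and allocate that class's clips
# proportionally. Pure-stdlib Dirichlet via normalized Gamma draws.
import random

def dirichlet(alpha, k, rng):
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def partition_counts(class_counts, n_clients, alpha, seed=0):
    rng = random.Random(seed)
    counts = [[0] * len(class_counts) for _ in range(n_clients)]
    for c, n in enumerate(class_counts):
        shares = dirichlet(alpha, n_clients, rng)
        alloc = [int(n * s) for s in shares]
        alloc[-1] += n - sum(alloc)          # rounding remainder to the last client
        for i in range(n_clients):
            counts[i][c] = alloc[i]
    return counts

# 1000 violent + 1000 non-violent clips split across 4 skewed clients.
counts = partition_counts([1000, 1000], n_clients=4, alpha=0.5)
```

Sweeping `alpha` from large (near-IID) to small (severe skew) is the usual way these papers stress-test aggregation and personalization strategies.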

6. Hybrid and Sustainable Deployment Strategies

Frameworks increasingly deploy hybrid inference architectures: efficient CNNs for always-on screening, composed with on-demand large VLMs for rare or ambiguous events.

  • Decision Logic: Inference pipelines run a CNN on every incoming clip, escalating to a VLM (LoRA or zero-shot) only if CNN confidence is low or contextual output is needed. Thresholds (e.g., CNN confidence $< 0.8$) can be tuned for the optimal trade-off between efficiency and accuracy (Thuau et al., 10 Nov 2025, Thuau et al., 20 Oct 2025).
  • Resource Accounting: Energy consumption and carbon emissions are explicitly tracked: 3D CNN (PFL) ~240 Wh / 13.4 g CO$_2$e per full training, LoRA-tuned VLM ~570 Wh / 31.9 g, zero-shot VLM inference ~200 Wh / 11.2 g (Thuau et al., 10 Nov 2025, Thuau et al., 20 Oct 2025).
  • Architectural Modularity: Edge devices execute lightweight CNNs and selectively download or update adapters/prompts for VLMs. Regional servers may coordinate adapter fusion and advanced reasoning (Thuau et al., 20 Oct 2025, Thuau et al., 10 Nov 2025).
  • Weak Supervision and MIL: In low-label or rare-event settings, MIL enables bag-level learning without frame-level annotation, reducing annotation effort and facilitating scalable training via federated protocols (Bastola et al., 2024).
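The escalation logic above reduces to a one-line router. The 0.8 threshold matches the example in the text; the function name and confidence values are illustrative.

```python
# CNN-first / VLM-on-escalation routing: the cheap CNN handles every clip,
# and the expensive VLM is invoked only below a tunable confidence threshold.

def route(cnn_confidence, threshold=0.8):
    """Return which model should produce the final decision for this clip."""
    return "cnn" if cnn_confidence >= threshold else "vlm"

decisions = [route(c) for c in (0.95, 0.62, 0.81, 0.40)]
# high-confidence clips stay on the CNN; ambiguous ones escalate to the VLM
```

In deployment the threshold is swept against validation accuracy and per-inference energy, which is exactly the efficiency/accuracy trade-off the cited papers report.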

7. Implementation Recipes and Pseudocode References

Research contributions typically include explicit algorithmic blueprints. For instance, FedVideoMAE presents a detailed pseudocode for on-device federated adapter training with DP-SGD and secure aggregation (Tao et al., 21 Dec 2025). Hybrid pipelines are articulated via stepwise logic and code snippets delineating CNN-first/VLM-on-escalation workflows (Thuau et al., 10 Nov 2025, Thuau et al., 20 Oct 2025). DPPQ sampling for FedMIL is outlined mathematically and algorithmically for reproducibility in non-IID distributed settings (Bastola et al., 2024).

These frameworks collectively provide quantitative, reproducible templates for deploying privacy-preserving, resource-frugal, and accurate video violence detection across heterogeneous, decentralized surveillance and moderation networks.
