From Skeletons to Semantics: Design and Deployment of a Hybrid Edge-Based Action Detection System for Public Safety

Published 31 Mar 2026 in cs.CV and cs.AI | (2603.29777v1)

Abstract: Public spaces such as transport hubs, city centres, and event venues require timely and reliable detection of potentially violent behaviour to support public safety. While automated video analysis has made significant progress, practical deployment remains constrained by latency, privacy, and resource limitations, particularly under edge-computing conditions. This paper presents the design and demonstrator-based deployment of a hybrid edge-based action detection system that combines skeleton-based motion analysis with vision-language models for semantic scene interpretation. Skeleton-based processing enables continuous, privacy-aware monitoring with low computational overhead, while vision-language models provide contextual understanding and zero-shot reasoning capabilities for complex and previously unseen situations. Rather than proposing new recognition models, the contribution focuses on a system-level comparison of both paradigms under realistic edge constraints. The system is implemented on a GPU-enabled edge device and evaluated with respect to latency, resource usage, and operational trade-offs using a demonstrator-based setup. The results highlight the complementary strengths and limitations of motion-centric and semantic approaches and motivate a hybrid architecture that selectively augments fast skeleton-based detection with higher-level semantic reasoning. The presented system provides a practical foundation for privacy-aware, real-time video analysis in public safety applications.


Authors (3)

Summary

The paper introduces a hybrid edge-based system combining lightweight skeleton motion analysis with vision-language models for semantic action detection in public spaces.
The skeleton pipeline runs TensorRT-optimized pose estimation on the edge GPU (13.2 ms/frame) and builds on skeleton-GCN models with strong benchmark accuracy (e.g., 93.1% for BlockGCN on NTU-60, X-Sub protocol).
The evaluation reveals trade-offs between latency, resource consumption, and contextual interpretation, supporting scalable, privacy-preserving public surveillance.

Hybrid Edge-Based Action Detection for Public Safety: System Design, Evaluation, and Implications

Introduction

"From Skeletons to Semantics: Design and Deployment of a Hybrid Edge-Based Action Detection System for Public Safety" (2603.29777) proposes a system-level approach to real-time action detection in public spaces using edge computing. The architecture combines lightweight skeleton-based motion analysis with vision-language models (VLMs) for semantic interpretation. The work deploys this hybrid system on a GPU-enabled edge device and empirically evaluates trade-offs between latency, resource consumption, interpretability, and operational robustness under realistic constraints. The primary contributions are the architectural synthesis, comparative analysis, and deployment-centric validation rather than novel model development.

System Architecture and Implementation

The deployed system runs on an NVIDIA Jetson AGX Thor and directly ingests RGB video streams from a 5 MP USB camera. The architecture is modular, supporting both skeleton and VLM backends via FastAPI, with a unified React-based surveillance dashboard. All data processing stays on the device, which preserves privacy because no raw video is transmitted.

Skeleton-Based Pipeline

Pose estimation leverages YOLOv26L-Pose for efficient multi-person 2D keypoint extraction, with inference latency optimized via TensorRT (13.2 ms/frame). Persistent identity tracking uses ByteTrack, mitigating trajectory fragmentation under occlusion. Skeleton data buffering amortizes inference costs, producing action classification at ~1 Hz. 2D-to-3D pose lifting is achieved via MotionBERT, enabling compatibility with pretrained GCN models (ProtoGCN, CTR-GCN). Rigorous keypoint/joint remapping bridges COCO, Human3.6M, and Kinect V2 skeleton formats. Interaction pairing incorporates spatial proximity, supporting NTU RGB+D mutual action classes, while classification thresholds for risk alerts are empirically tuned.
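The keypoint remapping step can be illustrated with a small sketch. It converts one COCO-17 pose into the 17-joint Human3.6M layout using a commonly used convention (pelvis, thorax, spine, and head are synthesized as midpoints). The source does not state the exact mapping the authors use, so the joint order and the helper names `midpoint` and `coco_to_h36m` are assumptions for illustration only.

```python
from typing import List, Tuple

Point = Tuple[float, float]

# COCO-17 keypoint indices (standard COCO order).
NOSE, L_EYE, R_EYE = 0, 1, 2
L_SHOULDER, R_SHOULDER = 5, 6
L_ELBOW, R_ELBOW = 7, 8
L_WRIST, R_WRIST = 9, 10
L_HIP, R_HIP = 11, 12
L_KNEE, R_KNEE = 13, 14
L_ANKLE, R_ANKLE = 15, 16


def midpoint(a: Point, b: Point) -> Point:
    """Return the point halfway between a and b."""
    return ((a[0] + b[0]) / 2.0, (a[1] + b[1]) / 2.0)


def coco_to_h36m(coco: List[Point]) -> List[Point]:
    """Remap 17 COCO keypoints to an assumed Human3.6M 17-joint order.

    Joints that COCO lacks (pelvis, spine, thorax, head) are
    approximated from neighbouring keypoints.
    """
    if len(coco) != 17:
        raise ValueError("expected 17 COCO keypoints")
    pelvis = midpoint(coco[L_HIP], coco[R_HIP])
    thorax = midpoint(coco[L_SHOULDER], coco[R_SHOULDER])
    spine = midpoint(pelvis, thorax)
    head = midpoint(coco[L_EYE], coco[R_EYE])
    return [
        pelvis,            # 0: pelvis (synthesized)
        coco[R_HIP],       # 1
        coco[R_KNEE],      # 2
        coco[R_ANKLE],     # 3
        coco[L_HIP],       # 4
        coco[L_KNEE],      # 5
        coco[L_ANKLE],     # 6
        spine,             # 7: spine (synthesized)
        thorax,            # 8: thorax (synthesized)
        coco[NOSE],        # 9: neck/nose
        head,              # 10: head (synthesized from the eyes)
        coco[L_SHOULDER],  # 11
        coco[L_ELBOW],     # 12
        coco[L_WRIST],     # 13
        coco[R_SHOULDER],  # 14
        coco[R_ELBOW],     # 15
        coco[R_WRIST],     # 16
    ]
```

A remapping to the 25-joint Kinect V2 layout used by NTU RGB+D would follow the same pattern, with more joints approximated from their neighbours.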

Vision-Language Model Pipeline

The VLM backend employs Qwen3.5-VL and similar large-scale multimodal models, providing object/context awareness and zero-shot reasoning with natural language prompts. Scene analysis integrates dual-stream sampling (context stream at 1 FPS, action stream at 6 FPS), recursive summary-based short-term memory for narrative continuity, and prompt-based deployment for flexible anomaly detection across diverse environments.
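The dual-stream sampling and recursive summary memory can be sketched as follows. This is a minimal illustration, not the authors' implementation: `sample_indices`, `RollingSummary`, and `toy_summarize` are hypothetical names, the 30 FPS camera rate is an assumption, and `toy_summarize` is a stand-in for a call to the vision-language model.

```python
from typing import Callable, List


def sample_indices(num_frames: int, source_fps: int, target_fps: int) -> List[int]:
    """Pick frame indices so a source_fps clip is sampled at about target_fps."""
    if source_fps <= 0 or target_fps <= 0:
        raise ValueError("frame rates must be positive")
    target_fps = min(target_fps, source_fps)  # cannot sample faster than the source
    indices: List[int] = []
    k = 0
    while True:
        idx = (k * source_fps) // target_fps
        if idx >= num_frames:
            break
        indices.append(idx)
        k += 1
    return indices


class RollingSummary:
    """Short-term memory: each update folds a new observation into one summary."""

    def __init__(self, summarize: Callable[[str, str], str]) -> None:
        self.summarize = summarize
        self.summary = ""

    def update(self, observation: str) -> str:
        # Recursive step: new summary = f(previous summary, new observation).
        self.summary = self.summarize(self.summary, observation)
        return self.summary


def toy_summarize(previous: str, observation: str, max_chars: int = 80) -> str:
    """Placeholder for a model call: concatenate, then keep the most recent text."""
    merged = f"{previous} | {observation}" if previous else observation
    return merged[-max_chars:]


# Two seconds of video from an assumed 30 FPS camera.
context_frames = sample_indices(60, source_fps=30, target_fps=1)
action_frames = sample_indices(60, source_fps=30, target_fps=6)

memory = RollingSummary(toy_summarize)
memory.update("two people enter the platform")
memory.update("one person starts running")
```

Because the memory holds a single bounded summary rather than the full history, the prompt size stays roughly constant no matter how long the stream runs.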

Comparative Performance Evaluation

Edge deployment constrains compute, memory, and power. Skeleton-based processing yields stable real-time throughput (41.9 eFPS), lower latency (2.54 s), and moderate resource utilization (14.5 GB unified memory; 36.3 W average GPU power). In contrast, VLM inference, especially with 35B-parameter models, incurs higher latency (5.49 s), much lower throughput (1.34 eFPS), and elevated memory usage (101.7 GB), with GPU power draw peaking at 34.1 W. These measurements, obtained under demonstrator operation, highlight the operational feasibility and scalability limitations of semantic models on current edge hardware.
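To put the reported figures side by side, the short calculation below derives the relative factors between the two pipelines. It uses only the numbers quoted above; the dictionary keys and variable names are illustrative. Power is left out because the text reports an average for the skeleton pipeline and a peak for the VLM, which are not directly comparable.

```python
# Reported demonstrator measurements (values taken from the text above).
skeleton = {"efps": 41.9, "latency_s": 2.54, "memory_gb": 14.5}
vlm = {"efps": 1.34, "latency_s": 5.49, "memory_gb": 101.7}

# How many times higher the skeleton pipeline's throughput is.
throughput_gain = skeleton["efps"] / vlm["efps"]

# How many times higher the VLM's latency and memory footprint are.
latency_factor = vlm["latency_s"] / skeleton["latency_s"]
memory_factor = vlm["memory_gb"] / skeleton["memory_gb"]

print(f"throughput: {throughput_gain:.1f}x in favour of the skeleton pipeline")
print(f"latency:    {latency_factor:.2f}x higher for the VLM")
print(f"memory:     {memory_factor:.2f}x higher for the VLM")
```

On these figures the skeleton pipeline delivers roughly 31 times the throughput, while the VLM needs about 2.2 times the latency and about 7 times the memory.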

Strong numerical results for skeleton-GCN models include BlockGCN accuracy of 93.1% (NTU-60, X-Sub protocol) and comparable figures for CTR-GCN/InfoGCN. VLMs (InternVideo2, Video-STAR) achieve up to 99.7% base-to-novel accuracy and competitive zero-shot performance across benchmarks (UCF-101, Kinetics-400/600), demonstrating open-vocabulary capabilities with minimal supervision.

Advantages and Limitations

Skeleton-based analysis confers low latency, privacy, and edge feasibility by abstracting actors to non-identifiable coordinate representations. However, it suffers from context blindness, leading to confusion between kinematically similar actions with distinct semantics (e.g., theft vs. handshake) and vulnerability to occlusion/shadow-induced errors. Closed-set supervision limits responsiveness to emergent threats.

VLMs overcome these semantic limitations, extracting scene-level context and enabling zero-shot anomaly detection. However, they introduce significant computational overhead and are currently bounded by hardware constraints for real-time, multi-stream deployment.

Hybrid Architecture and Agent-Based Design

A hybrid agent-based workflow enables fast skeletal filtering with selective triggering of semantic VLM analysis, reducing resource consumption while enhancing contextual interpretation. Planned agent layer integration aims to coordinate perception, confidence handling, and operator-in-the-loop feedback. Decentralized deployment across multiple edge devices necessitates robust, privacy-conscious event exchange protocols, addressing synchronization, reliability, and legal compliance.
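The selective-triggering idea can be sketched as a simple gate: the skeleton classifier runs continuously, and the VLM is invoked only when the skeleton risk score crosses a threshold and a cooldown has elapsed. This is a hypothetical illustration; the class name `HybridGate`, the threshold of 0.6, and the 5-second cooldown are assumptions, not values from the paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HybridGate:
    """Decide when a skeleton-based alert should escalate to VLM analysis."""

    threshold: float = 0.6   # assumed risk-score threshold
    cooldown_s: float = 5.0  # assumed minimum gap between VLM calls
    _last_trigger: Optional[float] = None

    def should_escalate(self, risk_score: float, now_s: float) -> bool:
        # Low-risk windows never reach the expensive semantic stage.
        if risk_score < self.threshold:
            return False
        # Rate-limit VLM calls so repeated alerts do not saturate the GPU.
        if self._last_trigger is not None and now_s - self._last_trigger < self.cooldown_s:
            return False
        self._last_trigger = now_s
        return True


gate = HybridGate()
decisions = [
    gate.should_escalate(0.30, now_s=0.0),  # below threshold
    gate.should_escalate(0.90, now_s=1.0),  # first high-risk window
    gate.should_escalate(0.95, now_s=3.0),  # still inside the cooldown
    gate.should_escalate(0.80, now_s=6.5),  # cooldown elapsed
]
```

In a deployment, the cooldown and threshold would need empirical tuning, in the same way the text describes for the risk-alert classification thresholds.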

Operational and Regulatory Challenges

Edge-deployed AI in public spaces faces practical obstacles including sensor noise, input variability, and regulatory complexity. The system's suitability is shaped by real-world iterative testing, legal framework adaptation, and acceptance criteria centered on transparency and auditability. Inter-device communication and federated situation awareness are key for scaling deployment while preserving data locality and compliance.

Theoretical and Practical Implications, Future Directions

This work substantiates the complementary strengths and weaknesses of geometric motion-based and semantic scene-based action detection. Practically, the hybrid architecture provides a scalable foundation for privacy-aware, real-time surveillance systems in public safety. Theoretically, it foregrounds the necessity of combining closed-set supervised models with open-vocabulary reasoning for robustness in dynamic environments.

Progress in edge hardware and model optimization will likely enable more comprehensive integration and faster VLM inference. Future work should focus on agent orchestration, multi-modal interaction modeling, low-latency pipelines, and extensive operational validation under complex scenarios. Research into scalable communication architectures and regulatory harmonization will be vital for large-scale deployment.

Conclusion

A hybrid edge-based action detection architecture that combines skeleton-driven motion analysis and vision-language semantic reasoning offers a balanced approach for public safety video analysis. Skeleton models provide robust, privacy-preserving real-time monitoring, while VLMs enable nuanced interpretation of complex and previously unseen events. Real-world constraints dictate a system-level design that integrates these paradigms, facilitating practical, legally compliant deployment in security-relevant contexts. Future advances in edge computing, model co-design, and agent-based orchestration will further expand the efficacy and operational reach of such systems.
