VideoFace2.0: Real-Time Face Re-ID System
- The paper introduces VideoFace2.0, a modular system for real-time face re-identification that achieves high precision (91.5%) and a 93% reduction in false-ID assignments.
- It integrates SCRFD-based detection, ArcFace recognition, and passive tracking-by-detection to ensure spatial-temporal consistency and efficient gallery management.
- The system supports applications in broadcast production, media analytics, and dataset curation, with prospects for adaptive filtering and embedded deployment.
VideoFace2.0 is a modular video analytics system designed for real-time, open-world face re-identification (ReID), cataloging, and structured story creation from raw video content. Developed as an extension to conventional video production pipelines, VideoFace2.0 executes spatial-temporal localization, identity assignment, and data export for every unique face in a video sequence. The system architecture synthesizes high-sensitivity face detection, deep face recognition, passive tracking-by-detection, and structured identity management, enabling robust operation in media analysis, television production, and automated ML dataset curation (Brkljač et al., 4 May 2025).
1. System Architecture and Data Flow
VideoFace2.0 comprises a four-stage core pipeline (detection, recognition, tracking, and gallery management), preceded by input ingestion and followed by post-filtering:
- Input Ingestion: Video streams, files, or live camera feeds are decoded into frames $F_t$.
- Face Detection ($D$): For each frame, the SCRFD detector yields face hypotheses $b_i$ with detection confidences $c_i$.
- Face Recognition ($R$): Each valid detection is embedded as $\mathbf{e}_i \in \mathbb{R}^{512}$ via an ArcFace-based network.
- Passive Tracking-by-Detection ($T$): Invoked when no gallery match meets the similarity threshold; tracking confirms spatial-temporal consistency for new identity candidates using an IoU criterion.
- Gallery Management ($G$): Confirmed identities and their reference embeddings $\mathbf{g}_j$ are maintained.
- Post-Filtering: Newly emergent identities are retained for $N$ frames before being promoted to the gallery.
The per-frame processing workflow is as follows:
- Detect face candidates $\{b_i\}$ in the current frame $F_t$.
- Prune candidates with confidence $c_i < \tau_{\mathrm{det}}$.
- For each remaining candidate $b_i$, compute the embedding $\mathbf{e}_i$ and search for a gallery match using the distance metric $d(\mathbf{e}_i, \mathbf{g}_j)$.
- If $\min_j d(\mathbf{e}_i, \mathbf{g}_j) < \tau_{\mathrm{sim}}$, assign the matched identity; otherwise, invoke tracking.
- If tracking confirms (via IoU $\geq \tau_{\mathrm{IoU}}$), add $\mathbf{e}_i$ as a new gallery entry.
- Post-filter new identities until verified persistence.
This architecture emphasizes efficiency by invoking computationally costly tracking only when necessary, and by post-filtering transient artifacts.
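The match-or-track decision above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the paper's implementation: the function names (`assign_identity`, `cosine_distance`, `track_confirms`) and the use of cosine distance over ArcFace embeddings are assumptions, cosine being the standard choice for ArcFace-style models.

```python
import math

def cosine_distance(e1, e2):
    """Cosine distance between two embedding vectors."""
    dot = sum(a * b for a, b in zip(e1, e2))
    n1 = math.sqrt(sum(a * a for a in e1))
    n2 = math.sqrt(sum(b * b for b in e2))
    return 1.0 - dot / (n1 * n2)

def assign_identity(embedding, gallery, tau_sim, track_confirms):
    """Match an embedding against the gallery; fall back to tracking on a miss.

    gallery: dict mapping identity id -> reference embedding.
    track_confirms: callable invoked only on a gallery miss
                    (the tracking-on-failure paradigm).
    Returns (identity_id, is_new).
    """
    if gallery:
        best_id = min(gallery, key=lambda j: cosine_distance(embedding, gallery[j]))
        if cosine_distance(embedding, gallery[best_id]) < tau_sim:
            return best_id, False           # existing identity re-identified
    if track_confirms():                    # spatial-temporal consistency via IoU
        new_id = max(gallery, default=0) + 1
        gallery[new_id] = embedding         # promote to a new gallery entry
        return new_id, True
    return None, False                      # transient artifact, discarded
```

Note that the costly tracking callback runs only when the gallery search fails, which is what keeps the common case (a re-identified known face) cheap.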
2. Mathematical Formulation
Key computational operations are formally defined:
- Face Embedding: For face $b_i$ in frame $F_t$, compute an aligned embedding:
$$\mathbf{e}_i = R(\mathrm{align}(F_t, b_i)),$$
where $\mathbf{e}_i \in \mathbb{R}^{512}$ for the implemented ArcFace model.
- Similarity Metric: Assign a candidate to an existing identity $j$ if the distance
$$d(\mathbf{e}_i, \mathbf{g}_j) = 1 - \frac{\mathbf{e}_i \cdot \mathbf{g}_j}{\lVert \mathbf{e}_i \rVert \, \lVert \mathbf{g}_j \rVert}$$
is minimized over the gallery and falls below threshold $\tau_{\mathrm{sim}}$.
- Identity Assignment: If
$$\min_j d(\mathbf{e}_i, \mathbf{g}_j) < \tau_{\mathrm{sim}},$$
assign $b_i$ to the gallery identity $j^* = \arg\min_j d(\mathbf{e}_i, \mathbf{g}_j)$; else treat $b_i$ as a new identity candidate.
- Tracking Confirmation: Spatial consistency is confirmed if
$$\mathrm{IoU}(b_t, b_{t-1}) \geq \tau_{\mathrm{IoU}},$$
where $\mathrm{IoU}$ is the intersection-over-union of bounding boxes in consecutive frames.
- Post-Filtering: New identities must reappear within $N$ frames to qualify for gallery inclusion.
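The IoU confirmation test is straightforward to implement. A minimal sketch follows; the box format `(x1, y1, x2, y2)` and the default threshold of 0.5 are illustrative assumptions, as the paper's configured value is not reproduced here.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)   # overlap area (0 if disjoint)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def confirms_track(prev_box, cur_box, tau_iou=0.5):
    """Spatial consistency check between consecutive-frame detections."""
    return iou(prev_box, cur_box) >= tau_iou
```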
3. Component Implementation and Performance
VideoFace2.0 integrates the following:
- Face Detector: SCRFD (buffalo_l model), pretrained on WebFace.
- Recognizer: ArcFace (additive angular margin, 512-dim embedding, 325 MB package).
- Runtime Stack: ONNX Runtime with CUDA on an NVIDIA RTX 3050 (4 GB), Intel i7 CPU, and 16 GB RAM.
- I/O Libraries: OpenCV and FFmpeg.
- Optimizations:
- Detector tuned for high recall.
- Tracking-on-failure paradigm to minimize overhead.
- Batched inference via ONNX CUDA bindings.
- Configurable thresholds: detection confidence $\tau_{\mathrm{det}}$, similarity $\tau_{\mathrm{sim}}$, IoU $\tau_{\mathrm{IoU}}$, and the persistence window $N$.
- Performance: 18–25 fps on consumer-grade notebooks with all modules enabled.
The technical stack is designed for portability and ease of integration with conventional video workflows.
4. Ablation Study and Quantitative Evaluation
Effectiveness is assessed through staged ablation:
| Method | Precision | Recall | F₁ | False-ID Reduction |
|---|---|---|---|---|
| A. Detection only | 62.3 % | 48.1 % | 0.54 | Baseline |
| B. + Recognition | 78.9 % | 65.2 % | 0.71 | 73 % |
| C. + Track+Filter | 91.5 % | 88.0 % | 0.90 | 93 % |
The full system achieves a 93 % reduction in false new-identity assignments over the baseline and yields the most temporally consistent tracks with fewest identity-switches, as documented in qualitative analyses (Fig. 2d of (Brkljač et al., 4 May 2025)).
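The F₁ column in the table is the harmonic mean of precision and recall, which can be verified directly from the reported percentages:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Row C of the ablation table: precision 91.5 %, recall 88.0 %
# f1_score(0.915, 0.880) ≈ 0.897, matching the reported F₁ of 0.90 after rounding.
```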
5. Identity Cataloging and Structured Output
For each confirmed identity $k$, VideoFace2.0 maintains a comprehensive log:
- Catalog Elements:
- Time-stamped frame indices and bounding-box coordinates
- Per-frame embeddings
- Optional: facial landmarks, age/gender estimation, temporal dynamics, and audio transcript links (if present)
- Export Modes:
- Overlay videos with bounding box and ID labels
- Identity-cropped face video streams or mouth-region clips for lip-reading
- Anonymized “silent” logs suitable for dataset sharing
Export is performed in a single input traversal; subsequent rendering of stories is offline and computationally inexpensive.
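A minimal sketch of such a per-identity catalog record and its anonymized "silent" export is shown below. The field names and JSON layout are illustrative assumptions, not the system's actual schema.

```python
import json

def make_catalog_entry(identity_id, frame_idx, bbox, embedding, landmarks=None):
    """One time-stamped catalog record for a confirmed identity."""
    entry = {
        "identity": identity_id,
        "frame": frame_idx,
        "bbox": list(bbox),           # bounding box (x1, y1, x2, y2) in pixels
        "embedding": list(embedding), # per-frame ArcFace vector (512-dim in practice)
    }
    if landmarks is not None:         # optional facial landmarks
        entry["landmarks"] = landmarks
    return entry

def export_silent_log(entries, anonymize=True):
    """Serialize the catalog; the anonymized 'silent' log drops the embeddings."""
    if anonymize:
        entries = [{k: v for k, v in e.items() if k != "embedding"} for e in entries]
    return json.dumps(entries, indent=2)
```

Because each record is self-contained, the whole catalog can be written during the single input traversal, and story rendering can later read it back without touching the video.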
6. Application Domains and Prospective Development
Principal deployment scenarios include:
- Broadcast Production: Automated assembly of speaker-specific segments (e.g., talk shows, interviews, sports coverage).
- Media Analytics: Quantification of participant screen time, co-presence, and diarization for editorial purposes.
- Large-Scale Dataset Creation: Curated, identity-resolved face and mouth clips for research in lip reading, multimodal speech recognition, and related ML tasks.
Proposed future directions encompass adaptive gallery management (e.g., online clustering, dynamic thresholds), advanced embedding models with robustness to pose/occlusion, embedded system porting, and explicit temporal modeling to enhance identity persistence through temporary occlusion (Brkljač et al., 4 May 2025).