VideoFace2.0: Real-Time Face Re-ID System
- The paper introduces VideoFace2.0, a modular system for real-time face re-identification that achieves high precision (91.5%) and a 93% reduction in false-ID assignments.
- It integrates SCRFD-based detection, ArcFace recognition, and passive tracking-by-detection to ensure spatial-temporal consistency and efficient gallery management.
- The system supports applications in broadcast production, media analytics, and dataset curation, with prospects for adaptive filtering and embedded deployment.
VideoFace2.0 is a modular video analytics system designed for real-time, open-world face re-identification (ReID), cataloging, and structured story creation from raw video content. Developed as an extension to conventional video production pipelines, VideoFace2.0 executes spatial-temporal localization, identity assignment, and data export for every unique face in a video sequence. The system architecture synthesizes high-sensitivity face detection, deep face recognition, passive tracking-by-detection, and structured identity management, enabling robust operation in media analysis, television production, and automated ML dataset curation (Brkljač et al., 4 May 2025).
1. System Architecture and Data Flow
VideoFace2.0 comprises a four-stage core pipeline (detection, recognition, tracking, and gallery management), preceded by input ingestion and followed by post-filtering:
- Input Ingestion: Video streams, files, or live camera feeds are decoded into frames $F_t$.
- Face Detection ($D$): For each frame, the SCRFD detector yields face hypotheses $b_i$ with detection confidences $c_i$.
- Face Recognition ($R$): Each valid detection is embedded as $\mathbf{e}_i \in \mathbb{R}^{512}$ via an ArcFace-based network.
- Passive Tracking-by-Detection ($T$): Invoked when no gallery match meets the similarity threshold; tracking confirms spatial-temporal consistency for new identity candidates using an IoU criterion.
- Gallery Management ($G$): Confirmed identities and their reference embeddings $\mathbf{g}_j$ are maintained.
- Post-Filtering: Newly emergent identities are retained for $N$ frames before being promoted to the gallery.
The per-frame processing workflow is as follows:
- Detect face candidates $\{b_i\}$ in the current frame $F_t$.
- Prune candidates with confidence $c_i < \tau_{\mathrm{det}}$.
- For each remaining candidate $b_i$, compute the embedding $\mathbf{e}_i$ and search for a gallery match using the distance metric $d(\mathbf{e}_i, \mathbf{g}_j)$.
- If $\min_j d(\mathbf{e}_i, \mathbf{g}_j) < \tau_{\mathrm{sim}}$, assign the matched identity; otherwise, invoke tracking.
- If tracking confirms (via IoU $\geq \tau_{\mathrm{IoU}}$), add $\mathbf{e}_i$ as a new gallery entry.
- Post-filter new identities until verified persistence.
This architecture emphasizes efficiency by invoking computationally costly tracking only when necessary, and by post-filtering transient artifacts.
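The match-or-track decision above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the paper's implementation: the function names (`assign_identity`, `cosine_distance`, `track_confirms`) and the use of cosine distance over ArcFace embeddings are assumptions, cosine being the standard choice for ArcFace-style models.

```python
import math

def cosine_distance(e1, e2):
    """Cosine distance between two embedding vectors."""
    dot = sum(a * b for a, b in zip(e1, e2))
    n1 = math.sqrt(sum(a * a for a in e1))
    n2 = math.sqrt(sum(b * b for b in e2))
    return 1.0 - dot / (n1 * n2)

def assign_identity(embedding, gallery, tau_sim, track_confirms):
    """Match an embedding against the gallery; fall back to tracking on a miss.

    gallery: dict mapping identity id -> reference embedding.
    track_confirms: callable invoked only on a gallery miss
                    (the tracking-on-failure paradigm).
    Returns (identity_id, is_new).
    """
    if gallery:
        best_id = min(gallery, key=lambda j: cosine_distance(embedding, gallery[j]))
        if cosine_distance(embedding, gallery[best_id]) < tau_sim:
            return best_id, False           # existing identity re-identified
    if track_confirms():                    # spatial-temporal consistency via IoU
        new_id = max(gallery, default=0) + 1
        gallery[new_id] = embedding         # promote to a new gallery entry
        return new_id, True
    return None, False                      # transient artifact, discarded
```

Note that the costly tracking callback runs only when the gallery search fails, which is what keeps the common case (a re-identified known face) cheap.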
2. Mathematical Formulation
Key computational operations are formally defined:
- Face Embedding: For face $b_i$ in frame $F_t$, compute an aligned embedding:
$$\mathbf{e}_i = R(\mathrm{align}(F_t, b_i)),$$
where $\mathbf{e}_i \in \mathbb{R}^{512}$ for the implemented ArcFace model.
- Similarity Metric: Assign a candidate to an existing identity $j$ if the distance
$$d(\mathbf{e}_i, \mathbf{g}_j) = 1 - \frac{\mathbf{e}_i \cdot \mathbf{g}_j}{\lVert \mathbf{e}_i \rVert \, \lVert \mathbf{g}_j \rVert}$$
is minimized over the gallery and falls below threshold $\tau_{\mathrm{sim}}$.
- Identity Assignment: If
$$\min_j d(\mathbf{e}_i, \mathbf{g}_j) < \tau_{\mathrm{sim}},$$
assign $b_i$ to the gallery identity $j^* = \arg\min_j d(\mathbf{e}_i, \mathbf{g}_j)$; else treat $b_i$ as a new identity candidate.
- Tracking Confirmation: Spatial consistency is confirmed if
$$\mathrm{IoU}(b_t, b_{t-1}) \geq \tau_{\mathrm{IoU}},$$
where $\mathrm{IoU}$ is the intersection-over-union of bounding boxes in consecutive frames.
- Post-Filtering: New identities must reappear within $N$ frames to qualify for gallery inclusion.
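The IoU confirmation test is straightforward to implement. A minimal sketch follows; the box format `(x1, y1, x2, y2)` and the default threshold of 0.5 are illustrative assumptions, as the paper's configured value is not reproduced here.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)   # overlap area (0 if disjoint)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def confirms_track(prev_box, cur_box, tau_iou=0.5):
    """Spatial consistency check between consecutive-frame detections."""
    return iou(prev_box, cur_box) >= tau_iou
```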
3. Component Implementation and Performance
VideoFace2.0 integrates the following:
- Face Detector: SCRFD (buffalo_l model), pretrained on WebFace.
- Recognizer: ArcFace (additive angular margin, 512-dim embedding, 325 MB package).
- Runtime Stack: ONNX Runtime with CUDA on an NVIDIA RTX 3050 (4 GB), Intel i7 CPU, and 16 GB RAM.
- I/O Libraries: OpenCV and FFmpeg.
- Optimizations:
- Detector tuned for high recall.
- Tracking-on-failure paradigm to minimize overhead.
- Batched inference via ONNX CUDA bindings.
- Configurable thresholds: detection confidence $\tau_{\mathrm{det}}$, similarity $\tau_{\mathrm{sim}}$, IoU $\tau_{\mathrm{IoU}}$, and the persistence window $N$.
- Performance: 18–25 fps on consumer-grade notebooks with all modules enabled.
The technical stack is designed for portability and ease of integration with conventional video workflows.
4. Ablation Study and Quantitative Evaluation
Effectiveness is assessed through staged ablation:
| Method | Precision | Recall | F₁ | False-ID Reduction |
|---|---|---|---|---|
| A. Detection only | 62.3 % | 48.1 % | 0.54 | Baseline |
| B. + Recognition | 78.9 % | 65.2 % | 0.71 | 73 % |
| C. + Track+Filter | 91.5 % | 88.0 % | 0.90 | 93 % |
The full system achieves a 93 % reduction in false new-identity assignments over the baseline and yields the most temporally consistent tracks with fewest identity-switches, as documented in qualitative analyses (Fig. 2d of (Brkljač et al., 4 May 2025)).
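The F₁ column in the table is the harmonic mean of precision and recall, which can be verified directly from the reported percentages:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Row C of the ablation table: precision 91.5 %, recall 88.0 %
# f1_score(0.915, 0.880) ≈ 0.897, matching the reported F₁ of 0.90 after rounding.
```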
5. Identity Cataloging and Structured Output
For each confirmed identity $k$, VideoFace2.0 maintains a comprehensive log:
- Catalog Elements:
- Time-stamped frame indices and bounding-box coordinates
- Per-frame embeddings
- Optional: facial landmarks, age/gender estimation, temporal dynamics, and audio transcript links (if present)
- Export Modes:
- Overlay videos with bounding box and ID labels
- Identity-cropped face video streams or mouth-region clips for lip-reading
- Anonymized “silent” logs suitable for dataset sharing
Export is performed in a single input traversal; subsequent rendering of stories is offline and computationally inexpensive.
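A minimal sketch of such a per-identity catalog record and its anonymized "silent" export is shown below. The field names and JSON layout are illustrative assumptions, not the system's actual schema.

```python
import json

def make_catalog_entry(identity_id, frame_idx, bbox, embedding, landmarks=None):
    """One time-stamped catalog record for a confirmed identity."""
    entry = {
        "identity": identity_id,
        "frame": frame_idx,
        "bbox": list(bbox),           # bounding box (x1, y1, x2, y2) in pixels
        "embedding": list(embedding), # per-frame ArcFace vector (512-dim in practice)
    }
    if landmarks is not None:         # optional facial landmarks
        entry["landmarks"] = landmarks
    return entry

def export_silent_log(entries, anonymize=True):
    """Serialize the catalog; the anonymized 'silent' log drops the embeddings."""
    if anonymize:
        entries = [{k: v for k, v in e.items() if k != "embedding"} for e in entries]
    return json.dumps(entries, indent=2)
```

Because each record is self-contained, the whole catalog can be written during the single input traversal, and story rendering can later read it back without touching the video.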
6. Application Domains and Prospective Development
Principal deployment scenarios include:
- Broadcast Production: Automated assembly of speaker-specific segments (e.g., talk shows, interviews, sports coverage).
- Media Analytics: Quantification of participant screen time, co-presence, and diarization for editorial purposes.
- Large-Scale Dataset Creation: Curated, identity-resolved face and mouth clips for research in lip reading, multimodal speech recognition, and related ML tasks.
Proposed future directions encompass adaptive gallery management (e.g., online clustering, dynamic thresholds), advanced embedding models with robustness to pose/occlusion, embedded system porting, and explicit temporal modeling to enhance identity persistence through temporary occlusion (Brkljač et al., 4 May 2025).