UIBVFED: Virtual Facial Expression Dataset
- UIBVFED is a virtual facial expression dataset comprising 20 stylized 3D avatars, 7 expression categories, and 5 viewpoints per expression, totaling 700 images.
- It offers detailed CSV/JSON annotations that support both zero-shot and supervised learning paradigms in HCI, VR, and affective computing research.
- Benchmarked for real-time performance under latency constraints, UIBVFED motivates the development of lightweight, domain-specific architectures for facial expression recognition.
The UIBVFED (Virtual Facial Expression Dataset) is a standardized benchmark collection for facial expression recognition in virtual avatars. Distinguished by its use of stylized, non-photorealistic 3D character renderings, UIBVFED is designed to support zero-shot and conventional supervised learning paradigms in Human-Computer Interaction (HCI), Virtual Reality (VR), and affective computing. It provides a controlled combination of expression categories, avatar geometries, and viewpoint variations to facilitate systematic analysis of facial emotion classification, especially under real-time latency constraints relevant to therapeutic and social computing contexts (Benyamin, 22 Jan 2026).
1. Dataset Structure and Content
UIBVFED consists of high-fidelity synthetic images of 3D avatars (“characters”), each rendered under tightly controlled facial configurations. The canonical dataset configuration includes:
- Characters (C): 20 distinct stylized avatars, differentiated by head shape, facial proportions, hairstyle, skin tone, and overall stylization (ranging from realistic to highly exaggerated “chibi”/anime forms). Mesh topology and rigging are otherwise consistent across all characters.
- Expressions (E): 7 distinct facial expressions, directly mapped onto Ekman’s “Basic Six” emotions (ANGER, DISGUST, FEAR, JOY, SADNESS, SURPRISE) plus NEUTRAL.
- Viewpoints/Variations (I): Each character-expression pair is captured from approximately 5 camera angles, including frontal and ±30° of yaw and pitch, with occasional intensity gradations.
This results in a combinatorial dataset size of N_total = C × E × I = 20 × 7 × 5 = 700 images in its canonical release (with some subsequent versions reaching 1,000–1,200 images due to expanded viewpoint and intensity labels). All images are provided in 512×512 PNG format, with optional access to the source VRoid/OBJ/FBX geometry files.
| Quantity | Symbol | Value (canonical) | Notes |
|---|---|---|---|
| Number of characters | C | 20 | Stylized avatar rigs |
| Number of expressions | E | 7 | Basic 6 emotions + Neutral |
| Views/variations per expr. | I | 5 | Camera angle/intensity sweeps |
| Total images | N_total | 700 | 512×512 PNG |
The organizational directory structure follows per-character and per-expression subfolders, with filenames tagged by viewpoint, character, and class.
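A directory layout of this shape can be indexed with a few lines of Python. The filename pattern below (e.g. "char03_joy_view2.png") is an illustrative assumption, not the dataset's official naming scheme:

```python
# Sketch of indexing a UIBVFED-style per-character/per-expression layout.
# The filename pattern "charNN_label_viewK.png" is a hypothetical example.
import re
from pathlib import Path

FILENAME_RE = re.compile(r"char(?P<char>\d+)_(?P<expr>[a-z]+)_view(?P<view>\d+)\.png")

def parse_filename(name: str) -> dict:
    """Extract character, expression, and viewpoint ids from a filename."""
    m = FILENAME_RE.fullmatch(name)
    if m is None:
        raise ValueError(f"unrecognized filename: {name}")
    return {
        "character_id": int(m.group("char")),
        "expression_label": m.group("expr"),
        "viewpoint_id": int(m.group("view")),
    }

def build_index(root: Path) -> list[dict]:
    """Walk character/expression subfolders and collect one record per image."""
    return [
        {**parse_filename(p.name), "filename": str(p)}
        for p in sorted(root.glob("*/*/*.png"))
    ]
```

Because all metadata is encoded in the path, such an index can be rebuilt deterministically without consulting the auxiliary annotation tables.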
2. Annotation Schema and Metadata
Annotations are distributed in auxiliary CSV or JSON tables, and detail the following fields for each image:
- character_id: unique integer index pointing to the avatar mesh.
- expression_label: one of seven categorical codes (e.g., “anger”, “joy”).
- viewpoint_id: encodes the camera angle or intensity setting.
- filename: direct path to the PNG file.
- Optionally: tags for “stylistic intensity” and “lighting condition”.
There are no bounding-box or landmark coordinates, as recognition tasks are framed at the global face level. All images are standardized for resolution and orientation.
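A CSV table with these fields can be consumed with the standard library alone. The column names below follow the fields listed above, though the exact header spelling in a given release is an assumption:

```python
# Minimal reader for a UIBVFED-style annotation CSV; header names assumed.
import csv
import io

SAMPLE = """character_id,expression_label,viewpoint_id,filename
3,joy,2,char03/joy/char03_joy_view2.png
3,anger,0,char03/anger/char03_anger_view0.png
"""

def load_annotations(fp) -> list[dict]:
    """Parse annotation rows, coercing numeric id columns to int."""
    rows = []
    for row in csv.DictReader(fp):
        row["character_id"] = int(row["character_id"])
        row["viewpoint_id"] = int(row["viewpoint_id"])
        rows.append(row)
    return rows

records = load_annotations(io.StringIO(SAMPLE))
```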
3. Expression Taxonomy and Rendering Protocol
Each expression is constructed according to widely accepted psychophysical and action-unit conventions, aiming for cross-avatar consistency. For example:
- ANGER: brows lowered and drawn together, lips pressed, eyes glaring.
- DISGUST: nose wrinkled, upper lip raised, cheeks lifted.
- JOY: cheeks raised, orbicularis oculi activation (simulating “crow’s feet”), smiling mouth.
These configurations are mapped directly onto mesh rig controls in the avatar generator (VRoid Studio, Blender, or proprietary engines referenced by Oliver et al.), ensuring that muscle group articulations are realistic and suitable for machine vision models. The NEUTRAL case is defined by an absence of marked contraction in any major facial area.
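The mapping from categorical labels to rig controls can be pictured as a table of normalized blendshape weights. The control names and values below are illustrative, not the actual VRoid/Blender channel names, and only a subset of the seven classes is shown:

```python
# Hypothetical expression-to-rig mapping; control names and weights are
# illustrative stand-ins for the avatar generator's actual blendshape channels.
EXPRESSION_RIG = {
    "anger":   {"brow_lower": 1.0, "lip_press": 0.8, "eye_glare": 0.7},
    "disgust": {"nose_wrinkle": 1.0, "upper_lip_raise": 0.8, "cheek_raise": 0.5},
    "joy":     {"cheek_raise": 1.0, "eye_squint": 0.6, "mouth_smile": 1.0},
    "neutral": {},  # neutral = no marked contraction in any major facial area
}

def apply_expression(label: str, intensity: float = 1.0) -> dict:
    """Scale every control of an expression by an overall intensity in [0, 1]."""
    return {ctrl: w * intensity for ctrl, w in EXPRESSION_RIG[label].items()}
```

Scaling all controls by a single intensity factor is one simple way to realize the “intensity gradations” mentioned earlier.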
Multiple viewpoints are achieved via virtual camera sweeps at specified angular intervals.
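One plausible realization of the five-view sweep described above is a frontal shot plus ±30° excursions in yaw and pitch; the exact angles used by UIBVFED's camera rig are an assumption here:

```python
# Sketch of a 5-view camera sweep: frontal plus ±max_angle yaw and pitch.
def viewpoint_sweep(max_angle: int = 30) -> list[tuple]:
    """Return (name, yaw_deg, pitch_deg) tuples for each virtual camera pose."""
    views = [("frontal", 0, 0)]
    for a in (-max_angle, max_angle):
        views.append((f"yaw{a:+d}", a, 0))
        views.append((f"pitch{a:+d}", 0, a))
    return views
```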
4. Preprocessing, Augmentation, and Splitting
In Benyamin’s benchmark (Benyamin, 22 Jan 2026), no preprocessing or augmentation is performed beyond the dataset’s intrinsic normalization. Images are consumed “as-is” in both zero-shot and pretrained evaluation runs. However, for model retraining, standard augmentations include:
- Random horizontal flips (excluding neutral frames).
- Brightness/contrast jitter to simulate variable VR lighting.
- Small rotations/zooms (±5°), mainly for robustness analyses.
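The augmentation policy above can be sketched as a per-image parameter sampler. The jitter ranges are illustrative defaults, not values prescribed by the benchmark:

```python
# Parameter sampler for the augmentations listed above; ranges are assumptions.
import random

def sample_augmentation(label, rng, jitter=0.2, max_rot=5.0, max_zoom=0.05):
    """Draw one set of augmentation parameters for a single image."""
    return {
        # neutral frames are excluded from flipping, per the convention above
        "hflip": label != "neutral" and rng.random() < 0.5,
        "brightness": 1.0 + rng.uniform(-jitter, jitter),
        "contrast": 1.0 + rng.uniform(-jitter, jitter),
        "rotation_deg": rng.uniform(-max_rot, max_rot),
        "zoom": 1.0 + rng.uniform(-max_zoom, max_zoom),
    }
```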
UIBVFED does not release fixed train/validation/test splits, as its original context emphasized psychophysical analysis rather than pure benchmarking. A common default is a 70/15/15 split across the image set, stratified by character and expression to ensure per-class balance along both axes.
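A stratified 70/15/15 split over (character, expression) strata can be sketched as follows; the ratios and seeding convention are typical choices, not an official protocol:

```python
# Sketch of a stratified 70/15/15 split over (character, expression) strata.
import random
from collections import defaultdict

def stratified_split(records, seed=0, ratios=(0.70, 0.15, 0.15)):
    """Shuffle each (character, expression) stratum and cut it by the ratios."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for r in records:
        strata[(r["character_id"], r["expression_label"])].append(r)
    train, val, test = [], [], []
    for items in strata.values():
        rng.shuffle(items)
        n_train = round(len(items) * ratios[0])
        n_val = round(len(items) * ratios[1])
        train += items[:n_train]
        val += items[n_train:n_train + n_val]
        test += items[n_train + n_val:]
    return train, val, test
```

Splitting within each stratum (rather than globally) guarantees that every character and every expression class appears in all three partitions.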
5. Application Domains and Benchmarking Significance
UIBVFED serves as a primary testbed for facial expression recognition in non-photorealistic virtual characters, with special emphasis on latency-accuracy constraints imposed by real-time VR and social computing. In recent studies:
- YOLOv11n (Nano architecture) achieves optimal face detection latency of ~54 ms (CPU-only inference) and 100% detection accuracy for avatar faces (Benyamin, 22 Jan 2026).
- State-of-the-art Vision Transformers (CLIP, SigLIP, ViT-FER) exhibit severe trade-offs: although these architectures can process UIBVFED-style images, accuracy remains sub-threshold (<23%) or computational latency is excessive (>150 ms), failing real-time benchmarks.
This points to a substantial bottleneck (“Latency Wall”) for generic FER models in stylized settings and motivates the development of lightweight, domain-specific architectures for VR rehabilitation, especially in contexts involving Autism Spectrum Disorder (ASD) therapy or telepresence social skill training.
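A latency-budget check of the kind implied by these benchmarks can be sketched with wall-clock timing. The 54 ms budget comes from the YOLOv11n figure cited above; the warmup/median methodology is an assumed convention:

```python
# Sketch: compare a model's median per-image latency against a real-time budget.
import time

REALTIME_BUDGET_MS = 54.0  # CPU-only face-detection latency cited above

def meets_budget(infer_fn, image, budget_ms=REALTIME_BUDGET_MS, warmup=2, runs=10):
    """Time infer_fn on one image; return (within_budget, median_latency_ms)."""
    for _ in range(warmup):          # discard warmup runs (caches, JIT, etc.)
        infer_fn(image)
    times_ms = []
    for _ in range(runs):
        t0 = time.perf_counter()
        infer_fn(image)
        times_ms.append((time.perf_counter() - t0) * 1000.0)
    median = sorted(times_ms)[len(times_ms) // 2]
    return median <= budget_ms, median
```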
6. Licensing, Access, and Citation
UIBVFED is released under the Creative Commons Attribution (CC BY 4.0) license, permitting unrestricted academic and commercial usage with proper attribution. The canonical download site is the PLOS ONE supplement (https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0231266), with the primary citation: Oliver, M. M., Amengual Alcover, E. (2020). “UIBVFED: Virtual facial expression dataset,” PLoS ONE 15(4): e0231266. All benchmarking results under latency constraints are attributed to Benyamin (Benyamin, 22 Jan 2026).
7. Limitations and Considerations
Key limitations of UIBVFED arise from the exclusive use of stylized synthetic data:
- Absence of naturalistic photorealism may restrict generalization to real human faces.
- No landmark or pixel-level facial muscular annotations are provided.
- Expression intensities and head pose sweeps, though controlled, represent a fixed, finite set.
- No temporal sequences (video frames) are included.
Use cases in psychophysical research, VR therapy, and real-time emotion-aware AI must account for these constraints and validate models against more naturalistic benchmarks when targeting deployment in heterogeneous environments.
A plausible implication is that future expansion of UIBVFED may involve the inclusion of more diverse avatar styles, continuous expression intensity scales, and annotated temporal dynamics, further supporting the calibration of facial expression recognizers in VR and HCI applications.