UIBVFED: Virtual Facial Expression Dataset
- UIBVFED is a virtual facial expression dataset comprising 20 stylized 3D avatars, 7 expression categories, and 5 viewpoints per expression, totaling 700 images.
- It offers detailed CSV/JSON annotations that support both zero-shot and supervised learning paradigms in HCI, VR, and affective computing research.
- Benchmarked for real-time performance under latency constraints, UIBVFED motivates the development of lightweight, domain-specific architectures for facial expression recognition.
The UIBVFED (Virtual Facial Expression Dataset) is a standardized benchmark collection for facial expression recognition in virtual avatars. Distinguished by its use of stylized, non-photorealistic 3D character renderings, UIBVFED is designed to support zero-shot and conventional supervised learning paradigms in Human-Computer Interaction (HCI), Virtual Reality (VR), and affective computing. It provides a controlled combination of expression categories, avatar geometries, and viewpoint variations to facilitate systematic analysis of facial emotion classification, especially under real-time latency constraints relevant to therapeutic and social computing contexts (Benyamin, 22 Jan 2026).
1. Dataset Structure and Content
UIBVFED consists of high-fidelity synthetic images of 3D avatars (“characters”), each rendered under tightly controlled facial configurations. The canonical dataset configuration includes:
- Characters (C): 20 distinct stylized avatars, differentiated by head shape, facial proportions, hairstyle, skin tone, and overall stylization (ranging from realistic to highly exaggerated “chibi”/anime forms). Mesh topology and rigging are otherwise consistent across all characters.
- Expressions (E): 7 distinct facial expressions, directly mapped onto Ekman’s “Basic Six” emotions (ANGER, DISGUST, FEAR, JOY, SADNESS, SURPRISE) plus NEUTRAL.
- Viewpoints/Variations (I): Each character-expression pair is captured from approximately 5 camera angles, including frontal and ±30° of yaw and pitch, with occasional intensity gradations.
This results in a combinatorial dataset size of N_total = C × E × I = 20 × 7 × 5 = 700 images in its canonical release (with some subsequent versions reaching 1,000–1,200 images due to expanded viewpoint and intensity labels). All images are provided in 512×512 PNG format, with optional access to the source VRoid/OBJ/FBX geometry files.
| Quantity | Symbol | Value (canonical) | Notes |
|---|---|---|---|
| Number of characters | C | 20 | Stylized avatar rigs |
| Number of expressions | E | 7 | Basic 6 emotions + Neutral |
| Views/variations per expr. | I | 5 | Camera angle/intensity sweeps |
| Total images | N_total | 700 | 512×512 PNG |
The organizational directory structure follows per-character and per-expression subfolders, with filenames tagged by viewpoint, character, and class.
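A directory layout of this shape can be indexed with a few lines of Python. The filename pattern below (e.g. "char03_joy_view2.png") is an illustrative assumption, not the dataset's official naming scheme:

```python
# Sketch of indexing a UIBVFED-style per-character/per-expression layout.
# The filename pattern "charNN_label_viewK.png" is a hypothetical example.
import re
from pathlib import Path

FILENAME_RE = re.compile(r"char(?P<char>\d+)_(?P<expr>[a-z]+)_view(?P<view>\d+)\.png")

def parse_filename(name: str) -> dict:
    """Extract character, expression, and viewpoint ids from a filename."""
    m = FILENAME_RE.fullmatch(name)
    if m is None:
        raise ValueError(f"unrecognized filename: {name}")
    return {
        "character_id": int(m.group("char")),
        "expression_label": m.group("expr"),
        "viewpoint_id": int(m.group("view")),
    }

def build_index(root: Path) -> list[dict]:
    """Walk character/expression subfolders and collect one record per image."""
    return [
        {**parse_filename(p.name), "filename": str(p)}
        for p in sorted(root.glob("*/*/*.png"))
    ]
```

Because all metadata is encoded in the path, such an index can be rebuilt deterministically without consulting the auxiliary annotation tables.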
2. Annotation Schema and Metadata
Annotations are distributed in auxiliary CSV or JSON tables, and detail the following fields for each image:
- character_id: unique integer index pointing to the avatar mesh.
- expression_label: one of seven categorical codes (e.g., “anger”, “joy”).
- viewpoint_id: encodes the camera angle or intensity setting.
- filename: direct path to the PNG file.
- Optionally: tags for “stylistic intensity” and “lighting condition”.
There are no bounding-box or landmark coordinates, as recognition tasks are framed at the global face level. All images are standardized for resolution and orientation.
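A CSV table with these fields can be consumed with the standard library alone. The column names below follow the fields listed above, though the exact header spelling in a given release is an assumption:

```python
# Minimal reader for a UIBVFED-style annotation CSV; header names assumed.
import csv
import io

SAMPLE = """character_id,expression_label,viewpoint_id,filename
3,joy,2,char03/joy/char03_joy_view2.png
3,anger,0,char03/anger/char03_anger_view0.png
"""

def load_annotations(fp) -> list[dict]:
    """Parse annotation rows, coercing numeric id columns to int."""
    rows = []
    for row in csv.DictReader(fp):
        row["character_id"] = int(row["character_id"])
        row["viewpoint_id"] = int(row["viewpoint_id"])
        rows.append(row)
    return rows

records = load_annotations(io.StringIO(SAMPLE))
```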
3. Expression Taxonomy and Rendering Protocol
Each expression is constructed according to widely accepted psychophysical and action-unit conventions, aiming for cross-avatar consistency. For example:
- ANGER: brows lowered and drawn together, lips pressed, eyes glaring.
- DISGUST: nose wrinkled, upper lip raised, cheeks lifted.
- JOY: cheeks raised, orbicularis oculi activation (simulating “crow’s feet”), smiling mouth.
These configurations are mapped directly onto mesh rig controls in the avatar generator (VRoid Studio, Blender, or proprietary engines referenced by Oliver et al.), ensuring that muscle group articulations are realistic and suitable for machine vision models. The NEUTRAL case is defined by an absence of marked contraction in any major facial area.
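The mapping from categorical labels to rig controls can be pictured as a table of normalized blendshape weights. The control names and values below are illustrative, not the actual VRoid/Blender channel names, and only a subset of the seven classes is shown:

```python
# Hypothetical expression-to-rig mapping; control names and weights are
# illustrative stand-ins for the avatar generator's actual blendshape channels.
EXPRESSION_RIG = {
    "anger":   {"brow_lower": 1.0, "lip_press": 0.8, "eye_glare": 0.7},
    "disgust": {"nose_wrinkle": 1.0, "upper_lip_raise": 0.8, "cheek_raise": 0.5},
    "joy":     {"cheek_raise": 1.0, "eye_squint": 0.6, "mouth_smile": 1.0},
    "neutral": {},  # neutral = no marked contraction in any major facial area
}

def apply_expression(label: str, intensity: float = 1.0) -> dict:
    """Scale every control of an expression by an overall intensity in [0, 1]."""
    return {ctrl: w * intensity for ctrl, w in EXPRESSION_RIG[label].items()}
```

Scaling all controls by a single intensity factor is one simple way to realize the “intensity gradations” mentioned earlier.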
Multiple viewpoints are achieved via virtual camera sweeps at specified angular intervals.
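One plausible realization of the five-view sweep described above is a frontal shot plus ±30° excursions in yaw and pitch; the exact angles used by UIBVFED's camera rig are an assumption here:

```python
# Sketch of a 5-view camera sweep: frontal plus ±max_angle yaw and pitch.
def viewpoint_sweep(max_angle: int = 30) -> list[tuple]:
    """Return (name, yaw_deg, pitch_deg) tuples for each virtual camera pose."""
    views = [("frontal", 0, 0)]
    for a in (-max_angle, max_angle):
        views.append((f"yaw{a:+d}", a, 0))
        views.append((f"pitch{a:+d}", 0, a))
    return views
```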
4. Preprocessing, Augmentation, and Splitting
In Benyamin’s benchmark (Benyamin, 22 Jan 2026), no preprocessing or augmentation is performed beyond the dataset’s intrinsic normalization. Images are consumed “as-is” in both zero-shot and pretrained evaluation runs. However, for model retraining, standard augmentations include:
- Random horizontal flips (excluding neutral frames).
- Brightness/contrast jitter to simulate variable VR lighting.
- Small rotations/zooms (±5°), mainly for robustness analyses.
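The augmentation policy above can be sketched as a per-image parameter sampler. The jitter ranges are illustrative defaults, not values prescribed by the benchmark:

```python
# Parameter sampler for the augmentations listed above; ranges are assumptions.
import random

def sample_augmentation(label, rng, jitter=0.2, max_rot=5.0, max_zoom=0.05):
    """Draw one set of augmentation parameters for a single image."""
    return {
        # neutral frames are excluded from flipping, per the convention above
        "hflip": label != "neutral" and rng.random() < 0.5,
        "brightness": 1.0 + rng.uniform(-jitter, jitter),
        "contrast": 1.0 + rng.uniform(-jitter, jitter),
        "rotation_deg": rng.uniform(-max_rot, max_rot),
        "zoom": 1.0 + rng.uniform(-max_zoom, max_zoom),
    }
```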
UIBVFED does not release fixed train/validation/test splits, as its original context emphasized psychophysical analysis rather than pure benchmarking. A common default is a 70/15/15 split across the image set, stratified by character and expression to ensure per-class balance along both axes.
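A stratified 70/15/15 split over (character, expression) strata can be sketched as follows; the ratios and seeding convention are typical choices, not an official protocol:

```python
# Sketch of a stratified 70/15/15 split over (character, expression) strata.
import random
from collections import defaultdict

def stratified_split(records, seed=0, ratios=(0.70, 0.15, 0.15)):
    """Shuffle each (character, expression) stratum and cut it by the ratios."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for r in records:
        strata[(r["character_id"], r["expression_label"])].append(r)
    train, val, test = [], [], []
    for items in strata.values():
        rng.shuffle(items)
        n_train = round(len(items) * ratios[0])
        n_val = round(len(items) * ratios[1])
        train += items[:n_train]
        val += items[n_train:n_train + n_val]
        test += items[n_train + n_val:]
    return train, val, test
```

Splitting within each stratum (rather than globally) guarantees that every character and every expression class appears in all three partitions.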
5. Application Domains and Benchmarking Significance
UIBVFED serves as a primary testbed for facial expression recognition in non-photorealistic virtual characters, with special emphasis on latency-accuracy constraints imposed by real-time VR and social computing. In recent studies:
- YOLOv11n (Nano architecture) achieves optimal face detection latency of ~54 ms (CPU-only inference) and 100% detection accuracy for avatar faces (Benyamin, 22 Jan 2026).
- State-of-the-art Vision Transformers (CLIP, SigLIP, ViT-FER) exhibit severe trade-offs: although these architectures can process UIBVFED-style images, accuracy remains sub-threshold (<23%) or computational latency is excessive (>150 ms), failing real-time benchmarks.
This points to a substantial bottleneck (“Latency Wall”) for generic FER models in stylized settings and motivates the development of lightweight, domain-specific architectures for VR rehabilitation, especially in contexts involving Autism Spectrum Disorder (ASD) therapy or telepresence social skill training.
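A latency-budget check of the kind implied by these benchmarks can be sketched with wall-clock timing. The 54 ms budget comes from the YOLOv11n figure cited above; the warmup/median methodology is an assumed convention:

```python
# Sketch: compare a model's median per-image latency against a real-time budget.
import time

REALTIME_BUDGET_MS = 54.0  # CPU-only face-detection latency cited above

def meets_budget(infer_fn, image, budget_ms=REALTIME_BUDGET_MS, warmup=2, runs=10):
    """Time infer_fn on one image; return (within_budget, median_latency_ms)."""
    for _ in range(warmup):          # discard warmup runs (caches, JIT, etc.)
        infer_fn(image)
    times_ms = []
    for _ in range(runs):
        t0 = time.perf_counter()
        infer_fn(image)
        times_ms.append((time.perf_counter() - t0) * 1000.0)
    median = sorted(times_ms)[len(times_ms) // 2]
    return median <= budget_ms, median
```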
6. Licensing, Access, and Citation
UIBVFED is released under the Creative Commons Attribution (CC BY 4.0) license, permitting unrestricted academic and commercial usage with proper attribution. The canonical download site is the PLOS ONE supplement (https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0231266), with the primary citation: Oliver, M. M., Amengual Alcover, E. (2020). “UIBVFED: Virtual facial expression dataset,” PLoS ONE 15(4): e0231266. All benchmarking results under latency constraints are attributed to Benyamin (Benyamin, 22 Jan 2026).
7. Limitations and Considerations
Key limitations of UIBVFED arise from the exclusive use of stylized synthetic data:
- Absence of naturalistic photorealism may restrict generalization to real human faces.
- No landmark or pixel-level facial muscular annotations are provided.
- Expression intensities and head pose sweeps, though controlled, represent a fixed, finite set.
- No temporal sequences (video frames) are included.
Use cases in psychophysical research, VR therapy, and real-time emotion-aware AI must account for these constraints and validate models against more naturalistic benchmarks when targeting deployment in heterogeneous environments.
A plausible implication is that future expansion of UIBVFED may involve the inclusion of more diverse avatar styles, continuous expression intensity scales, and annotated temporal dynamics, further supporting the calibration of facial expression recognizers in VR and HCI applications.