MultiEgo Dataset for Egocentric Multi-View Research

Updated 20 December 2025
  • MultiEgo dataset is a comprehensive collection of multi-view egocentric videos featuring synchronized 3D pose annotations to support advanced human motion and scene reconstruction studies.
  • The dataset integrates synthetic and real-world modalities, enabling performance benchmarking in multi-person pose estimation, free-viewpoint video synthesis, and interaction analysis.
  • Its rigorous calibration and synchronization protocols deliver sub-millisecond temporal alignment and centimeter-level accuracy, setting new standards in egocentric computer vision research.

The MultiEgo dataset refers to a class of recent large-scale multi-view egocentric video datasets specifically designed for research in human motion understanding, dynamic scene reconstruction, and free-viewpoint video synthesis. These datasets feature synchronized video streams from multiple body-worn or head-mounted cameras and precise 3D pose or 6-DoF trajectory annotations, aiming to support the development and benchmarking of perception algorithms in egocentric, social, and activity-rich contexts. Two principal datasets—MultiEgoView (Hollidt et al., 25 Feb 2025) and MultiEgo (Li et al., 12 Dec 2025)—define the state of the art in this domain, offering both synthetic and real-world modalities, rigorous calibration protocols, and multi-participant social scenarios.

1. Dataset Purpose and Research Motivation

MultiEgo datasets address a critical gap in egocentric computer vision: the lack of large, rigorously annotated corpora capturing multi-view video from lightweight, body- or head-mounted cameras during human activity and social interaction. Previous work focused predominantly on static multi-view camera arrays or single-view head-mounted footage, which is insufficient for multi-person pose estimation, dynamic scene reconstruction, and action recognition under natural camera motion or frequent occlusion. MultiEgoView and MultiEgo provide a unified resource to enable training and objective evaluation of algorithms for:

  • Multi-view egocentric 3D human pose estimation and tracking
  • 4D (spatio-temporal) dynamic scene modeling and free-viewpoint video (FVV) synthesis
  • Human–object and human–human interaction understanding from densely overlapping egocentric streams
  • Camera localization, inertial fusion, and visual-inertial navigation

2. Dataset Composition and Scope

MultiEgoView (sometimes called “MultiEgo”; the longer name is used here for disambiguation) consists of a large synthetic–real paired corpus designed around six egocentric camera views rigidly affixed to human subjects:

| Aspect | Synthetic | Real-world |
|---|---|---|
| Duration | 119.4 hr (77.4M frames @ 30 Hz) | 5 hr (~1–3 sessions × 13 participants) |
| Cameras | 6 (head, pelvis, L/R wrist, L/R knee) | 6 GoPros worn identically |
| Annotation | SMPL-X 3D pose, per-frame, per-camera IMU | Xsens-based SMPL-X pose (17-sensor suit) |
| Environments | 4 virtual scenes (city, courtyard, etc.) | Scanned university courtyard |

Synthetic data is generated with the EgoSim simulator, which replays AMASS motion-capture trajectories on virtual avatars in Unreal Engine; sensor noise and motion artifacts are simulated through spring-damper camera mounts and virtual IMU readouts. Real-world data combines synchronized GoPro video with Xsens 3D pose data, time-aligned via inertial–audio events (claps), with body dimensions obtained from Virtual Caliper measurements.
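The spring-damper mounting can be sketched as a damped harmonic oscillator tracking the body anchor point. The stiffness, damping, and mass values below are illustrative assumptions, not EgoSim's actual mount parameters:

```python
def spring_damper_mount(anchor, k=40.0, c=2.0, m=0.1, dt=1.0 / 30):
    """Track an anchor trajectory through a 1-DoF spring-damper mount.

    k (stiffness), c (damping), and m (mass) are illustrative values,
    not EgoSim's actual mount parameters.
    """
    x, v = anchor[0], 0.0  # mount starts at rest on the anchor
    out = []
    for a in anchor:
        # Hooke's-law pull toward the moving anchor plus viscous damping
        acc = (k * (a - x) - c * v) / m
        v += acc * dt  # semi-implicit Euler: velocity first...
        x += v * dt    # ...then position, for better stability
        out.append(x)
    return out
```

Applying this to each camera's anchor trajectory yields the characteristic lag and overshoot of a loosely mounted body-worn sensor.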

MultiEgo targets 4D dynamic scene reconstruction in real social interaction scenarios:

| Aspect | Details |
|---|---|
| Scenarios | 5 canonical social settings (meetings, performances, presentations) |
| Participants | 5 per scene, each wearing RayNeo X2 AR glasses (head-mounted) |
| Modalities | RGB video (1920×1080, 30 Hz) + gyroscope (3-DoF, 50 Hz) |
| Duration | ~90 s per camera per scene (13,735 total frames across all scenes/views) |
| Annotation | Per-frame 6-DoF camera pose, sub-millisecond synchronization |

All AR glasses are synchronized via a dedicated client-server Wi-Fi system with hardware timestamping, ensuring frame-level temporal alignment and accurate multi-participant pose recovery. Intrinsics are calibrated via COLMAP, with radial and tangential distortion supported; extrinsics and 6-DoF trajectories are derived via multi-view structure-from-motion (SfM) and visual-inertial fusion.
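Two-way clock-offset estimation is a standard way to realize such cross-device synchronization; the datasets' concrete protocol beyond hardware timestamping is not specified, so the NTP-style estimator below is only a generic sketch:

```python
def estimate_clock_offset(t0_server, t1_device, t2_device, t3_server):
    """NTP-style two-way clock-offset estimate (device minus server).

    t0: server sends request, t1: device receives it, t2: device sends
    the reply, t3: server receives it. Assumes roughly symmetric
    network delay in both directions.
    """
    return ((t1_device - t0_server) + (t2_device - t3_server)) / 2.0
```

Once each device's offset is known, its frame timestamps can be mapped onto a common server timeline before multi-view processing.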

3. Data Acquisition, Calibration, and Annotation

MultiEgoView Pipeline

  • Synthetic Production: Avatars are animated by AMASS MoCap trajectories (SMPL-X/FBX), with virtual cameras attached at six anatomical locations. Environments span hand-modeled cities, Polycam-scanned courtyards, skyscraper scenes, and parks. Up to four avatars interact in the same scene, enabling multi-person interaction scenarios.
  • Real-World Recording: 13 participants donned six GoPro cameras plus Xsens suits, performing 35 BABEL-annotated AMASS motions. Cameras (5× HERO 10, 1× HERO 9) captured 1080p video at 30 fps, with 118° horizontal FOVs.
  • Synchronization: An initial calibration sets the body reference; a loud clap provides a synchronization anchor. Video and pose sequences are temporally aligned using this event.
  • Annotation: All data is paired per-frame: PNG or MP4 video by camera site and frame, CSV/JSON files for SMPL-X pose parameters (55 joints, 6D representation per Zhou et al. 2019), and IMU data streams (simulated in synthetic, real in GoPro via Xsens).
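The 6D rotation representation of Zhou et al. 2019, used for the pose parameters above, maps two 3-vectors to a rotation matrix via Gram-Schmidt orthogonalization. A minimal pure-Python sketch:

```python
import math

def rot6d_to_matrix(d6):
    """Map a 6D rotation representation (Zhou et al. 2019) to a 3x3
    rotation matrix via Gram-Schmidt orthogonalization.

    Returns the orthonormal basis stacked as rows (the transpose of the
    column convention is equally common).
    """
    a1, a2 = d6[:3], d6[3:]
    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))
    def normalize(v):
        n = math.sqrt(dot(v, v))
        return [vi / n for vi in v]
    b1 = normalize(a1)
    # Remove b1's component from a2, then normalize
    proj = dot(b1, a2)
    b2 = normalize([a2i - proj * b1i for a2i, b1i in zip(a2, b1)])
    # Third axis as the cross product, guaranteeing a right-handed frame
    b3 = [b1[1] * b2[2] - b1[2] * b2[1],
          b1[2] * b2[0] - b1[0] * b2[2],
          b1[0] * b2[1] - b1[1] * b2[0]]
    return [b1, b2, b3]
```

The representation is continuous in the 6D input, which is why it is preferred over Euler angles or quaternions for regression targets.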

MultiEgo Pipeline

  • Acquisition: Each participant wears a RayNeo X2 AR-glass, providing RGB+gyroscope data. Server-client Wi-Fi handles synchronized remote control and timestamping; wall-clock UTC timestamps are assigned at 100 ns resolution.
  • Processing: Color grading is performed for white-balance and de-flickering. Monocular camera trajectories are estimated via multiple visual SLAM/SfM pipelines (AnyCam, Mega-SaM, CUT3R, MonST3R, PySLAM), fused with gyroscope data through an Extended Kalman Filter for smooth 6-DoF trajectories.
  • Calibration: COLMAP is used for intrinsics reconstruction (radial and tangential distortion), with a mean reprojection error ≲0.5 px across all devices. Scene coordinate systems are anchored to fixed scene objects.
  • Annotation and Quality: Rotation RMSE is <0.3°, translation RMSE <10 mm, temporal jitter <0.5 ms after synchronization—suitable for precision scene alignment and dynamic FVV.
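The EKF fusion step above addresses a full 6-DoF visual-inertial problem; as a toy illustration of the predict/update pattern, a 1-D Kalman-style filter fusing a gyro-integrated yaw prediction with SLAM yaw measurements might look like this (noise parameters q and r are assumed values):

```python
def fuse_yaw(gyro_rates, slam_yaws, dt=0.02, q=1e-4, r=1e-2):
    """Toy 1-D Kalman-style fusion of gyro-integrated yaw (prediction)
    with SLAM yaw estimates (measurement).

    A drastic simplification of a 6-DoF visual-inertial EKF; q (process
    noise) and r (measurement noise) are assumptions.
    """
    x, p = slam_yaws[0], 1.0  # state (yaw) and its variance
    fused = []
    for w, z in zip(gyro_rates, slam_yaws):
        # Predict: propagate yaw with the gyro rate, inflate uncertainty
        x += w * dt
        p += q
        # Update: blend in the SLAM measurement via the Kalman gain
        gain = p / (p + r)
        x += gain * (z - x)
        p *= 1.0 - gain
        fused.append(x)
    return fused
```

The same predict/update structure, lifted to rotation and translation states, underlies the smooth 6-DoF trajectories reported for the dataset.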

4. Data Structure, Access, and Usage Protocols

Both datasets are organized for reproducibility and ease of downstream benchmarking:

  • Directory Structure: Segregated by subset (Synthetic/Real) or scenario/participant, then further into per-camera video streams (“camera_<site>_frame.png”) and aligned per-frame pose annotation files.
  • Splits and Benchmarks:
    • MultiEgoView: BABEL-60 protocol used for synthetic splits (60/20/20 train/val/test across 5 s clips). Real data split 80/20 train/test or by participant. Baselines include global MPJPE, PA-MPJPE, MTE, MRE, MJAE, Jerk metrics—benchmarked with a ViT-based multi-view model.
    • MultiEgo: Each scenario is ~90 s, promoting benchmarking of short-duration, multi-participant reconstructions. Baselines are established on 4DGaussian, 3DGStream, and Deformable-3DGS methods, reporting PSNR, SSIM, and LPIPS on held-out viewpoints.
  • Licensing and Access: MultiEgoView is distributed under CC BY-NC-SA; EgoSim code and models are GPL-3.0. MultiEgo is available at https://woxelikeloud.github.io/multiego/; MultiEgoView at https://siplab.org/projects/EgoSim.
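The "split by participant" protocol holds out whole participants rather than random clips, preventing identity leakage between train and test. A generic sketch (the `participant` field name is hypothetical, not the datasets' actual schema):

```python
def split_by_participant(clips, held_out):
    """Hold out whole participants for testing, as in a 'split by
    participant' protocol. The 'participant' field is hypothetical."""
    train = [c for c in clips if c["participant"] not in held_out]
    test = [c for c in clips if c["participant"] in held_out]
    return train, test
```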

5. Baseline Methods, Evaluation, and Quantitative Results

MultiEgoView Baselines

A composite loss is used to train a baseline 3D pose estimator:

L = \lambda_\theta L_\theta + \lambda_p L_p + \lambda_v L_v + \lambda_{t_r} L_{t_r} + \lambda_{R_r} L_{R_r} + \lambda_{t_g} L_{t_g} + \lambda_{R_g} L_{R_g} + \lambda_z L_z

Terms include the 6D joint rotation loss, ℓ₁ joint position error, root translation/rotation errors (relative and global), velocity consistency, and embedding-token regularization. Example weights: \lambda_\theta = 10, \lambda_p = 25, \lambda_v = 40, \lambda_{t_r} = 25, \lambda_{R_r} = 15, \lambda_{t_g} = 1, \lambda_{R_g} = 0.025, \lambda_z = 5\times10^{-4}.
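The weighted sum itself is straightforward; a sketch using the published example weights, with the computation of the individual loss terms elided:

```python
# Example weights for the composite pose-estimation loss (from the paper)
WEIGHTS = {"theta": 10, "p": 25, "v": 40, "t_r": 25,
           "R_r": 15, "t_g": 1, "R_g": 0.025, "z": 5e-4}

def composite_loss(terms, weights=WEIGHTS):
    """Weighted sum of the individual loss terms; how each term is
    computed (6D rotation loss, l1 position error, ...) is elided."""
    return sum(weights[name] * terms[name] for name in weights)
```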

MultiEgo 4D Reconstruction Benchmarks

Three scene reconstruction approaches—3DGStream, 4DGaussian, Deformable-3DGS—are evaluated:

| Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
|---|---|---|---|
| 4DGaussian | 25.74 | 0.843 | 0.298 |
| 4DGaussian w/ timestamps | 25.62 | 0.841 | 0.301 |
| Deformable-3DGS | 23.03 | 0.833 | 0.296 |
| 3DGStream | 22.74 | 0.763 | 0.316 |

The Presentation scene achieved the highest PSNR (≈28.24 dB) and SSIM (≈0.899), while the Statement scene, characterized by large rotations and specularities, was the most challenging (PSNR ≈20.4–24.0 dB) (Li et al., 12 Dec 2025).
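PSNR, the primary metric above, is a simple function of the mean squared error between rendered and held-out views:

```python
import math

def psnr(mse, max_val=1.0):
    """Peak signal-to-noise ratio in dB from mean squared error, the
    'higher is better' metric reported for the reconstruction baselines."""
    return 10.0 * math.log10(max_val ** 2 / mse)
```

For images in [0, 1], an MSE of 0.01 corresponds to 20 dB, so the ~28 dB Presentation-scene result implies a much lower per-pixel error.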

Qualitative analysis indicates that 4DGaussian preserves static background fidelity but oversmooths dynamic content; 3DGStream better captures fast nonrigid motion but increases noise in static regions. Timestamp-aware optimization provides marginal gains, reflecting effective pipeline synchronization.

6. Applications, Limitations, and Future Directions

Strengths

  • Dense, authentic egocentric coverage and multi-person interactions
  • Rigid temporal synchronization (sub-ms in MultiEgo), cm-level pose accuracy
  • Challenging capture settings—occlusions, reflections, rapid dynamics, projector-induced color shifts

Limitations

  • Absence of depth or other sensor modalities beyond RGB and gyroscope in MultiEgo
  • Short per-scene durations (~90 s) in MultiEgo may constrain long-term tracking studies
  • No ground-truth volumetric geometry or per-pixel semantic labels in MultiEgo
  • Fixed 30 Hz video frame rate may limit capture fidelity for very high-speed motion (>10 m/s)

This suggests that while MultiEgo datasets advance the field, augmentations such as volumetric ground truth, longer-duration captures, or additional sensing modalities would further benefit research in 4D dynamic perception.

7. Impact and Benchmark Status

MultiEgoView and MultiEgo establish a new benchmark paradigm for evaluating egocentric perception algorithms under multi-view, multi-person, and motion-rich conditions. They have been foundational for developing and validating approaches in 3D pose estimation, dynamic neural rendering, and FVV systems. The datasets’ rigorous synchronization, calibration, and annotation protocols set new standards for reproducibility and cross-method comparability in egocentric scene understanding (Hollidt et al., 25 Feb 2025, Li et al., 12 Dec 2025).
