
InterHuman Dataset Overview

Updated 24 November 2025
  • InterHuman is a comprehensive dataset featuring high-fidelity 3D skeletal motion captures and natural language annotations for two-person interactions.
  • It provides precise SMPL-based kinematic data from 7,779 clips and synchronized multi-view RGB captures at 30 FPS, enabling detailed motion analysis.
  • Rich, diverse textual descriptions paired with refined data acquisition support research in generative modeling and multimodal synthesis of coordinated behaviors.

The InterHuman dataset is a large-scale resource for the study and generation of two-person interaction motions, providing high-fidelity 3D skeleton trajectories and richly annotated natural language descriptions. It is intended to support research in generative modeling of human interaction, motion synthesis, and multimodal understanding of coordinated behaviors.

1. Dataset Composition and Modalities

InterHuman contains 7,779 two-person interaction clips, amounting to about 6.56 hours of segmented motion at 30 frames per second; the underlying multi-view capture comprises approximately 107 million frames. Each clip spans up to 10 seconds and includes the synchronized motion trajectories of both participants. Modalities provided are:

  • 3D skeletal motion: Recovered by fitting the SMPL model to multi-view RGB footage (76 synchronized cameras at 1920×1080, 30 FPS). Each subject’s kinematic skeleton is parameterized by positions, velocities, local joint rotations, and foot-ground contact flags.
  • Natural language descriptions: Each interaction is paired with three independent free-form textual captions, emphasizing varied and comprehensive annotation (Liang et al., 2023).
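The published totals can be sanity-checked with simple arithmetic. The sketch below derives per-clip statistics from the reported 6.56 hours, 7,779 clips, and 30 FPS; the printed values are direct consequences of those figures, not additional dataset facts.

```python
# Back-of-envelope check of the published InterHuman statistics.
FPS = 30
NUM_CLIPS = 7_779
TOTAL_HOURS = 6.56  # total segmented motion duration reported for the dataset

motion_frames = TOTAL_HOURS * 3600 * FPS              # per-sequence motion frames
mean_seconds_per_clip = TOTAL_HOURS * 3600 / NUM_CLIPS
mean_frames_per_clip = mean_seconds_per_clip * FPS

print(f"motion frames:        {motion_frames:,.0f}")        # 708,480
print(f"mean clip length:     {mean_seconds_per_clip:.2f} s")  # ~3.04 s
print(f"mean frames per clip: {mean_frames_per_clip:.0f}")     # ~91
```

Note that 6.56 hours of motion at 30 FPS is roughly 708k sequence frames, so the ~107M figure evidently counts raw capture frames across the camera rig rather than per-sequence motion frames.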

2. Motion and Textual Annotation Schema

The dataset adopts SMPL-derived kinematic trees with 22–24 joints per subject. The state at frame $t$ for a performer $p$ is expressed as:

$$x_t^{(p)} = [\, \mathbf{j}_g^p,\ \mathbf{j}_g^v,\ \mathbf{j}^r,\ \mathbf{c}^f \,]$$

where jgpR3Nj\mathbf{j}_g^p \in \mathbb{R}^{3N_j} (joint world positions), jgvR3Nj\mathbf{j}_g^v \in \mathbb{R}^{3N_j} (joint velocities), jrR6Nj\mathbf{j}^r \in \mathbb{R}^{6N_j} (6D local joint rotations), and cf{0,1}4\mathbf{c}^f \in \{0,1\}^4 (foot contacts).

Each clip carries three free-form text descriptions (no fixed template) that narrate the interaction type, the physical actions, and the relations between the actors. This textual diversity is emphasized to enable robust text-conditioned generation (Liang et al., 2023).

3. Data Acquisition and Processing Pipeline

3D motion data is acquired from markerless multi-view setups, leveraging 76 Z-CAM RGB cameras in a dome arrangement. The SMPL model fitting is performed per frame to obtain accurate joint trajectories. Annotation involves:

  • Clip segmentation: Each interaction is manually segmented to a duration not exceeding 10 seconds.
  • Text captioning: Three annotators supply unique free-form descriptions for each interaction.
  • Preprocessing: Data are cleaned to remove incomplete frames and smoothed for minor jitters.
  • Augmentation: Training augmentation is performed via left–right performer mirroring and by swapping roles in captions (“person A” ↔ “person B”) (Liang et al., 2023).
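The mirroring and role-swap augmentations can be sketched as follows. This is a minimal illustration, not the dataset's actual preprocessing code: the left/right joint index pairs are hypothetical (the real mapping depends on the dataset's joint ordering), and the caption swap assumes the literal phrases "person A"/"person B".

```python
import numpy as np

# Hypothetical left/right joint index pairs for a 22-joint skeleton;
# the exact pairing depends on the dataset's joint ordering.
LR_PAIRS = [(1, 2), (4, 5), (7, 8), (10, 11), (13, 14), (16, 17), (18, 19), (20, 21)]

def mirror_motion(joints: np.ndarray) -> np.ndarray:
    """Reflect joint positions of shape (T, N_J, 3) across the x = 0 plane
    and exchange left/right joints so the skeleton stays anatomically valid."""
    m = joints.copy()
    m[..., 0] *= -1.0                  # flip the lateral axis
    for l, r in LR_PAIRS:
        m[:, [l, r]] = m[:, [r, l]]    # swap left/right joints
    return m

def swap_roles(caption: str) -> str:
    """Swap performer roles in a caption ('person A' <-> 'person B')."""
    return (caption.replace("person A", "\0")
                   .replace("person B", "person A")
                   .replace("\0", "person B"))

assert swap_roles("person A hugs person B") == "person B hugs person A"
```

Applying `mirror_motion` twice is the identity, which is a convenient correctness check for the joint-pair table.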

4. Data Representation and Mathematical Structure

The InterHuman dataset structures two-person interaction motion as synchronized trajectories:

  • For each time index $i$, the state pair $\{x_a^i, x_b^i\}$ encodes both performers' kinematic features.
  • Inter-actor global relational features are provided:
    • Joint-distance map: $M(x_a, x_b) \in \mathbb{R}^{N_j \times N_j}$, where $M_{ij} = \|\text{position}_i^a - \text{position}_j^b\|_2$
    • Relative orientation: $O(\mathrm{IK}(x_a), \mathrm{IK}(x_b))$ is the signed yaw between subjects (computed via inverse kinematics).
  • Regularization terms in downstream models encourage preservation of spatial relations and orientation during synthesis, with a damped loss schedule applied only in earlier diffusion steps (Liang et al., 2023, Ponce et al., 2024).

5. Extensions: Individual Motion Descriptions

To enable more granular conditioning, the in2IN model (Ponce et al., 2024) automatically generates per-person textual descriptions using GPT-3.5-turbo. For each interaction sample described by a “global” caption, a prompt instructs the LLM to produce two lines, one per individual, detailing that person’s specific motion (e.g., “Individual Motion 1: One person is moving and then throws a punch. Individual Motion 2: One person falls over and stays on the ground.”). These captions are not exhaustively verified and may contain hallucinated content, though spot checks were used to assess their quality.

The overall schema after extension is:

$$(X_a,\ X_b,\ c_{\mathrm{I}},\ c_{i,1},\ c_{i,2})$$

where $c_{\mathrm{I}}$ is the original interaction prompt, and $c_{i,1}$, $c_{i,2}$ are the LLM-generated per-person prompts (Ponce et al., 2024).
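As a concrete view of the extended schema, the sketch below models one training sample as a dataclass. The class and field names are illustrative (not an official API); the two individual captions are the example texts quoted above, while the interaction caption is a made-up placeholder.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class InteractionSample:
    """One in2IN-style extended sample: two motion tracks, the original
    interaction-level caption, and two LLM-generated per-person captions."""
    motion_a: np.ndarray       # (T, D) features for performer a (X_a)
    motion_b: np.ndarray       # (T, D) features for performer b (X_b)
    interaction_text: str      # c_I: original interaction prompt
    individual_text_a: str     # c_{i,1}: per-person description (LLM-generated)
    individual_text_b: str     # c_{i,2}: per-person description (LLM-generated)

sample = InteractionSample(
    motion_a=np.zeros((90, 268)),  # ~3 s at 30 FPS, 268-dim frame features
    motion_b=np.zeros((90, 268)),
    interaction_text="two people spar; one punches and the other falls",  # placeholder
    individual_text_a="One person is moving and then throws a punch.",
    individual_text_b="One person falls over and stays on the ground.",
)
assert sample.motion_a.shape == sample.motion_b.shape
```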

6. Dataset Splits, Coverage, and Distribution

InterHuman’s original publication specifies:

  • Total clips: 7,779
  • Total frames: ~107M
  • Frame rate: 30 FPS (fixed)
  • Clip length: Up to 10 seconds; mean and median around 3 seconds (~90 frames per clip)

Clips span a wide variety of interaction types, such as handshake, hug, pass-object, greeting, arguing, dancing, and martial arts, among others. The training/validation/test splits are 90%/5%/5%, organized by script/dialogue such that there is no actor overlap across splits.

7. Limitations and Considerations

Documented limitations include:

  • Actor pool: 18 pairs for daily and 12 pairs for expert motions, predominantly young adults; thus, limited demographic diversity.
  • Interaction types: Skewed toward scripted or professional actions, with limited representation of spontaneous or chaotic multi-person interactions.
  • Scope: Restricted to two-person interactions. There is no support for group (more than two) conversational or physical exchanges.
  • Ethical precautions: All actors provided informed consent; identifying visuals are not released; content from sensitive contexts (e.g., arguments) is anonymized (Liang et al., 2023).

A plausible implication is that while InterHuman is comprehensive for dyadic interactions in controlled environments, it does not sample the full spectrum of human physical or social diversity seen in unconstrained group settings.

8. Relevance to Modeling and Research

InterHuman is widely used for training and benchmarking generative models of human interaction. All current algorithms utilizing InterHuman adopt the explicit joint-level representation, often incorporating both the motion sequence and text conditioning (global and, with extensions, individual). Model architectures frequently leverage mutual/cross-attention across actors and are evaluated on their ability to synthesize semantically consistent, physically plausible, and diverse interactions when conditioned on text (Liang et al., 2023, Ponce et al., 2024).

The dataset has been further extended for research into individualized motion generation, as in in2IN, enabling greater control over intra-personal diversity within generated interactions, while maintaining inter-person coherence via novel attention conditioning and composition strategies (Ponce et al., 2024).
