CoDanceBench: Multi-Subject Animation Benchmark
- CoDanceBench is a curated benchmark featuring 20 dance video clips with 2-5 subjects, including humans and cartoons, to evaluate multi-agent animation.
- It defines two core tasks—multi-subject pose transfer and cardinality-mismatch animation—addressing spatial misalignment and heterogeneous subject presence using per-subject masks.
- On this benchmark, the CoDance model—built on an Unbind–Rebind paradigm—achieves superior quantitative metrics (PSNR, SSIM, LPIPS) and high human preference scores relative to existing models.
CoDanceBench is a curated multi-subject character animation evaluation benchmark introduced to facilitate comprehensive, controlled assessment of pose-driven animation models under robust, realistic, and challenging multi-agent conditions. It is developed as part of the "CoDance: An Unbind-Rebind Paradigm for Robust Multi-Subject Animation" framework, specifically to address the deficiencies of existing benchmarks, which are largely limited to single-person settings and do not account for spatial misalignment, arbitrary subject cardinality, or heterogeneous agent types (Tan et al., 16 Jan 2026).
1. Dataset Construction and Protocol
CoDanceBench comprises 20 multi-subject dance video clips, each containing between 2 and 5 actors, including both humans and anthropomorphic cartoon characters. The actors demonstrate diverse spatial formations (side-by-side, staggered depth, crossing trajectories, etc.), providing a critical testbed for algorithms that must generalize across heterogeneous spatial relationships and appearances.
From each raw video, the following are extracted:
- Reference Image: A single frame depicting all subjects in their initial poses.
- Driving Pose Sequence: a per-frame multi-person skeleton map, following an OpenPose-style representation.
- Subject Masks: high-quality binary segmentation maps, produced with Segment-Anything, for per-actor localization in the reference image.
CoDanceBench functions exclusively as a test set. There are no designated train or validation splits; all 20 videos constitute the evaluation protocol. Importantly, none of these videos are used for training—CoDance is trained solely on single-subject datasets (including TikTok, UBC Fashion, and 1,200 self-collected solo dance clips).
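Under purely illustrative assumptions (the class name, field names, and array shapes below are ours, not an official schema), one benchmark clip's annotations could be organized as:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class CoDanceSample:
    """One benchmark clip: reference frame, driving poses, per-subject masks."""
    reference: np.ndarray  # (H, W, 3) uint8 frame showing all subjects in initial poses
    poses: np.ndarray      # (T, H, W, 3) OpenPose-style skeleton maps, one per frame
    masks: np.ndarray      # (N, H, W) binary per-subject masks (Segment-Anything style)

    def n_subjects(self) -> int:
        # One mask per actor in the reference image
        return self.masks.shape[0]

# A toy 2-subject clip with 8 frames at 64x64 resolution
sample = CoDanceSample(
    reference=np.zeros((64, 64, 3), dtype=np.uint8),
    poses=np.zeros((8, 64, 64, 3), dtype=np.uint8),
    masks=np.zeros((2, 64, 64), dtype=bool),
)
print(sample.n_subjects())  # 2
```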
2. Principal Benchmarking Tasks
Two core evaluation tasks are defined:
2.1 Multi-Subject Pose Transfer
- Inputs:
- Reference image containing N subjects.
- Driving pose video with per-frame multi-person skeletons (often spatially misaligned with the reference image).
- N per-subject binary masks.
- Goal: Synthesize a video where each reference subject follows the corresponding driving skeleton, with faithful identity preservation, accurate subject-pose correspondence, and temporal coherence.
2.2 Cardinality-Mismatch Animation
- Inputs: The same reference image with N subjects, but only one driving skeleton track.
- Goal: Animate all subjects using the single available pose track, thus evaluating robustness to cardinality mismatch (number of skeletons ≠ number of subjects).
In both cases, models are evaluated on data they have not previously encountered, with no exposure to multi-agent configurations during training.
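To make the cardinality-mismatch setting concrete, the sketch below broadcasts a single driving skeleton to every subject by recentering it at each mask's centroid. This is a naive illustrative heuristic of our own, not CoDance's mechanism (CoDance handles the mismatch through its Unbind–Rebind design):

```python
import numpy as np

def broadcast_pose_to_subjects(keypoints, masks):
    """Reuse one driving skeleton for every subject by translating it
    to each subject's mask centroid.

    keypoints: (J, 2) array of (x, y) joint coordinates for the single track
    masks:     (N, H, W) boolean per-subject masks
    returns:   (N, J, 2) per-subject keypoints
    """
    center = keypoints.mean(axis=0)                # skeleton centroid
    out = []
    for m in masks:
        ys, xs = np.nonzero(m)
        target = np.array([xs.mean(), ys.mean()])  # mask centroid as (x, y)
        out.append(keypoints - center + target)    # rigid translation only
    return np.stack(out)

kp = np.array([[10.0, 10.0], [14.0, 18.0]])        # toy skeleton with 2 joints
masks = np.zeros((2, 32, 32), dtype=bool)
masks[0, 5:9, 5:9] = True                          # subject 0 near (6.5, 6.5)
masks[1, 20:24, 20:24] = True                      # subject 1 near (21.5, 21.5)
per_subject = broadcast_pose_to_subjects(kp, masks)
print(per_subject.shape)  # (2, 2, 2)
```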
3. Metrics and Human Evaluation
Quantitative assessment leverages both frame-level and sequence-level metrics. Let x̂ denote a generated frame and x the corresponding ground-truth frame.
- Peak Signal-to-Noise Ratio (PSNR): PSNR = 10 · log₁₀(MAX² / MSE(x̂, x)), where MAX is the maximum pixel value (e.g., 255 for 8-bit images).
- Structural Similarity Index (SSIM) per Wang & Bovik.
- L₁ Distance: the mean absolute per-pixel error between x̂ and x.
- LPIPS: Learned Perceptual Image Patch Similarity, a perceptual distance computed from deep-network activations of x̂ and x.
- Fréchet Inception Distance (FID): FID = ‖μ_r − μ_g‖² + Tr(Σ_r + Σ_g − 2(Σ_rΣ_g)^{1/2}), the Fréchet distance between Gaussians fitted to Inception features of real and generated frames.
- FID-VID: Same as FID but using video-level features (e.g., from 3D CNNs).
- Fréchet Video Distance (FVD): Computed on intermediate video features as per Unterthiner et al.
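The simpler frame-level metrics can be sketched directly with NumPy; SSIM, LPIPS, FID, and FVD require their respective reference implementations or pretrained feature extractors and are omitted here:

```python
import numpy as np

def psnr(gen, gt, max_val=255.0):
    """Frame-level PSNR = 10 * log10(MAX^2 / MSE)."""
    mse = np.mean((gen.astype(np.float64) - gt.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val**2 / mse)

def l1_distance(gen, gt):
    """Mean absolute per-pixel error."""
    return np.mean(np.abs(gen.astype(np.float64) - gt.astype(np.float64)))

gt = np.full((4, 4, 3), 100, dtype=np.uint8)
gen = np.full((4, 4, 3), 110, dtype=np.uint8)  # uniform error of 10 per pixel
print(round(psnr(gen, gt), 2))  # 28.13
print(l1_distance(gen, gt))     # 10.0
```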
A human evaluation protocol involves 10 raters conducting pairwise A/B tests over 400 trials per method (20 reference identities × 20 driving clips), scored on video quality, identity preservation, and temporal consistency.
4. Baselines and Comparative Results
Eight contemporary state-of-the-art single-subject animation models are repurposed as baselines:
| Baseline Method | Venue/Dataset | Year |
|---|---|---|
| AnimateAnyone | CVPR | 2024 |
| MusePose | arXiv | 2024 |
| ControlNeXt | arXiv | 2024 |
| MimicMotion | ICML | 2025 |
| UniAnimate | SCIS | 2025 |
| Animate-X | ICLR | 2025 |
| StableAnimator | CVPR | 2025 |
| UniAnimateDiT | arXiv | 2025 |
CoDance sets the best or near-best numeric scores for all frame-based and sequence-based metrics, notably:
- LPIPS: 0.580 (best)
- PSNR: 12.21 (best)
- SSIM: 0.592 (best)
- L₁: best
- FID: 221.40 (close to best)
- FID-VID: 180.50 (best)
- FVD: 2494.76 (best)
Competing methods typically score LPIPS 0.58–0.63, PSNR 11.0–12.0, and SSIM 0.55–0.60, but record substantially worse FID-VID/FVD scores (above 2800). In human studies, CoDance achieves 90% preference for video quality, 88% for identity preservation, and 83% for temporal consistency, outperforming all baselines.
5. Problem Motivation and Methodological Innovations
Traditional animation models impose a rigid “pose-to-pixel” spatial binding, leading to severe failures (entangled limbs, identity confusion, hallucinated subjects) when the number of subjects or their spatial arrangement diverges from training conditions. CoDanceBench was established to surface these failure cases and rigorously test approaches that claim robustness to multi-subject and misaligned settings.
The CoDance solution employs an Unbind–Rebind paradigm:
- Unbind: Injects stochastic pose/feature perturbations via a pose shift encoder, enforcing the learning of a location-agnostic, subject-agnostic motion code.
- Rebind: Utilizes semantic conditioning from text prompts and spatial cues from per-subject masks to reattach motion to explicit subjects, maintaining correspondences despite spatial/identity ambiguities.
This dual-stage approach is empirically validated on CoDanceBench, demonstrating seamless generalization to arbitrary subject counts, classes (human/cartoon), and spatial configurations.
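The Unbind step can be illustrated with a minimal sketch, assuming a simple random global translation of the driving skeleton (the paper's pose shift encoder is a learned module; only the perturbation idea is shown here):

```python
import numpy as np

rng = np.random.default_rng(0)

def unbind_shift(keypoints, max_shift=0.2, frame_size=(256, 256)):
    """Apply one random global translation per clip so the model cannot rely
    on a fixed pose-to-pixel alignment between skeleton and reference image.

    keypoints: (T, J, 2) joint coordinates over T frames
    max_shift: maximum offset as a fraction of frame width/height
    """
    h, w = frame_size
    dx = rng.uniform(-max_shift, max_shift) * w  # one offset per clip,
    dy = rng.uniform(-max_shift, max_shift) * h  # shared across all frames
    return keypoints + np.array([dx, dy])

poses = np.zeros((8, 17, 2))  # 8 frames, 17 joints
shifted = unbind_shift(poses)
# Every joint moves by the same (dx, dy): absolute location changes,
# but the skeleton's internal geometry (the motion itself) is preserved.
print(np.allclose(shifted - poses, shifted[0, 0] - poses[0, 0]))  # True
```

A rigid translation is the simplest location-agnostic perturbation; stronger variants could also rescale or jitter per-frame, at the cost of distorting the motion signal.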
6. Limitations and Prospective Extensions
The creators of CoDanceBench identify several constraints:
- Dataset Scale: The benchmark comprises 20 videos, restricting statistical power and the diversity of motion and subject configurations. Larger and more eclectic datasets would better challenge generalization.
- Motion Diversity: Current focus is limited to dance movements; inclusion of broader activities (e.g., sports, daily actions) would offer more comprehensive evaluation.
- Background Complexity: All scenes feature static backgrounds, precluding the assessment of occlusion handling and dynamic background integration.
- Absence of Train/Validation Splits: Only test data is provided, leaving a gap for community-driven training and annotation of multi-subject video material.
A plausible implication is that future benchmarks expanding context, action richness, and annotation depth would enable deeper investigation of multi-agent animation models’ real-world readiness.
7. Significance and Position in the Field
CoDanceBench constitutes the first dedicated benchmark for evaluating multi-subject pose transfer and cardinality-mismatch animation under realistic misalignment and identity-ambiguity scenarios. Its protocols and metrics illuminate key challenges neglected by prevailing single-subject paradigms. CoDance’s performance on this benchmark, guided by Unbind–Rebind mechanisms, underscores the importance of decoupling motion semantics from spatial reference and reassociating them through per-subject controls (Tan et al., 16 Jan 2026). CoDanceBench thereby sets a precedent for the evaluation of robust, flexible multi-agent character animation models and provides foundational infrastructure for future work in this domain.