CoDanceBench: Multi-Subject Animation Benchmark
- CoDanceBench is a curated benchmark featuring 20 dance video clips with 2-5 subjects, including humans and cartoons, to evaluate multi-agent animation.
- It defines two core tasks—multi-subject pose transfer and cardinality-mismatch animation—addressing spatial misalignment and heterogeneous subject presence using per-subject masks.
- On this benchmark, the CoDance model—built on an Unbind–Rebind paradigm—achieves superior quantitative metrics (PSNR, SSIM, LPIPS) and high human preference scores relative to existing models.
CoDanceBench is a curated multi-subject character animation evaluation benchmark introduced to facilitate comprehensive, controlled assessment of pose-driven animation models under robust, realistic, and challenging multi-agent conditions. It is developed as part of the "CoDance: An Unbind-Rebind Paradigm for Robust Multi-Subject Animation" framework, specifically to address the deficiencies of existing benchmarks, which are largely limited to single-person settings and do not account for spatial misalignment, arbitrary subject cardinality, or heterogeneous agent types (Tan et al., 16 Jan 2026).
1. Dataset Construction and Protocol
CoDanceBench comprises 20 multi-subject dance video clips, each containing between 2 and 5 actors, including both humans and anthropomorphic cartoon characters. The actors demonstrate diverse spatial formations (side-by-side, staggered depth, crossing trajectories, etc.), providing a critical testbed for algorithms that must generalize across heterogeneous spatial relationships and appearances.
From each raw video, the following are extracted:
- Reference Image: A single frame depicting all subjects in their initial poses.
- Driving Pose Sequence: a per-frame multi-person skeleton map, following an OpenPose-style representation.
- Subject Masks: high-quality binary segmentation maps, produced with Segment-Anything, for per-actor localization in the reference image.
CoDanceBench functions exclusively as a test set. There are no designated train or validation splits; all 20 videos constitute the evaluation protocol. Importantly, none of these videos are used for training—CoDance is trained solely on single-subject datasets (including TikTok, UBC Fashion, and 1,200 self-collected solo dance clips).
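Under purely illustrative assumptions (the class name, field names, and array shapes below are ours, not an official schema), one benchmark clip's annotations could be organized as:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class CoDanceSample:
    """One benchmark clip: reference frame, driving poses, per-subject masks."""
    reference: np.ndarray  # (H, W, 3) uint8 frame showing all subjects in initial poses
    poses: np.ndarray      # (T, H, W, 3) OpenPose-style skeleton maps, one per frame
    masks: np.ndarray      # (N, H, W) binary per-subject masks (Segment-Anything style)

    def n_subjects(self) -> int:
        # One mask per actor in the reference image
        return self.masks.shape[0]

# A toy 2-subject clip with 8 frames at 64x64 resolution
sample = CoDanceSample(
    reference=np.zeros((64, 64, 3), dtype=np.uint8),
    poses=np.zeros((8, 64, 64, 3), dtype=np.uint8),
    masks=np.zeros((2, 64, 64), dtype=bool),
)
print(sample.n_subjects())  # 2
```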
2. Principal Benchmarking Tasks
Two core evaluation tasks are defined:
2.1 Multi-Subject Pose Transfer
- Inputs:
- Reference image containing N subjects.
- Driving pose video with per-frame multi-person skeletons (often spatially misaligned with the reference image).
- N per-subject binary masks.
- Goal: Synthesize a video where each reference subject follows the corresponding driving skeleton, with faithful identity preservation, accurate subject-pose correspondence, and temporal coherence.
2.2 Cardinality-Mismatch Animation
- Inputs: The same reference image with N subjects, but only one driving skeleton track.
- Goal: Animate all subjects using the single available pose track, thus evaluating robustness to cardinality mismatch (number of skeletons ≠ number of subjects).
In both cases, models are evaluated on data they have not previously encountered, with no exposure to multi-agent configurations during training.
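To make the cardinality-mismatch setting concrete, the sketch below broadcasts a single driving skeleton to every subject by recentering it at each mask's centroid. This is a naive illustrative heuristic of our own, not CoDance's mechanism (CoDance handles the mismatch through its Unbind–Rebind design):

```python
import numpy as np

def broadcast_pose_to_subjects(keypoints, masks):
    """Reuse one driving skeleton for every subject by translating it
    to each subject's mask centroid.

    keypoints: (J, 2) array of (x, y) joint coordinates for the single track
    masks:     (N, H, W) boolean per-subject masks
    returns:   (N, J, 2) per-subject keypoints
    """
    center = keypoints.mean(axis=0)                # skeleton centroid
    out = []
    for m in masks:
        ys, xs = np.nonzero(m)
        target = np.array([xs.mean(), ys.mean()])  # mask centroid as (x, y)
        out.append(keypoints - center + target)    # rigid translation only
    return np.stack(out)

kp = np.array([[10.0, 10.0], [14.0, 18.0]])        # toy skeleton with 2 joints
masks = np.zeros((2, 32, 32), dtype=bool)
masks[0, 5:9, 5:9] = True                          # subject 0 near (6.5, 6.5)
masks[1, 20:24, 20:24] = True                      # subject 1 near (21.5, 21.5)
per_subject = broadcast_pose_to_subjects(kp, masks)
print(per_subject.shape)  # (2, 2, 2)
```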
3. Metrics and Human Evaluation
Quantitative assessment leverages both frame-level and sequence-level metrics. Let x̂ denote a generated frame and x the corresponding ground-truth frame.
- Peak Signal-to-Noise Ratio (PSNR): PSNR = 10 · log₁₀(MAX² / MSE(x̂, x)), where MAX is the maximum pixel value (e.g., 255 for 8-bit images).
- Structural Similarity Index (SSIM) per Wang & Bovik.
- L₁ Distance: the mean absolute per-pixel error between x̂ and x.
- LPIPS: Learned Perceptual Image Patch Similarity, a perceptual distance computed from deep-network activations of x̂ and x.
- Fréchet Inception Distance (FID): FID = ‖μ_r − μ_g‖² + Tr(Σ_r + Σ_g − 2(Σ_rΣ_g)^{1/2}), the Fréchet distance between Gaussians fitted to Inception features of real and generated frames.
- FID-VID: Same as FID but using video-level features (e.g., from 3D CNNs).
- Fréchet Video Distance (FVD): Computed on intermediate video features as per Unterthiner et al.
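The simpler frame-level metrics can be sketched directly with NumPy; SSIM, LPIPS, FID, and FVD require their respective reference implementations or pretrained feature extractors and are omitted here:

```python
import numpy as np

def psnr(gen, gt, max_val=255.0):
    """Frame-level PSNR = 10 * log10(MAX^2 / MSE)."""
    mse = np.mean((gen.astype(np.float64) - gt.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val**2 / mse)

def l1_distance(gen, gt):
    """Mean absolute per-pixel error."""
    return np.mean(np.abs(gen.astype(np.float64) - gt.astype(np.float64)))

gt = np.full((4, 4, 3), 100, dtype=np.uint8)
gen = np.full((4, 4, 3), 110, dtype=np.uint8)  # uniform error of 10 per pixel
print(round(psnr(gen, gt), 2))  # 28.13
print(l1_distance(gen, gt))     # 10.0
```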
A human evaluation protocol involves 10 raters conducting pairwise A/B tests over 400 trials per method (20 reference identities × 20 driving clips), scored on video quality, identity preservation, and temporal consistency.
4. Baselines and Comparative Results
Eight contemporary state-of-the-art single-subject animation models are repurposed as baselines:
| Baseline Method | Venue/Dataset | Year |
|---|---|---|
| AnimateAnyone | CVPR | 2024 |
| MusePose | arXiv | 2024 |
| ControlNeXt | arXiv | 2024 |
| MimicMotion | ICML | 2025 |
| UniAnimate | SCIS | 2025 |
| Animate-X | ICLR | 2025 |
| StableAnimator | CVPR | 2025 |
| UniAnimateDiT | arXiv | 2025 |
CoDance sets the best or near-best numeric scores for all frame-based and sequence-based metrics, notably:
- LPIPS: 0.580 (best)
- PSNR: 12.21 (best)
- SSIM: 0.592 (best)
- L₁: best
- FID: 221.40 (close to best)
- FID-VID: 180.50 (best)
- FVD: 2494.76 (best)
Competing methods typically score LPIPS 0.58–0.63, PSNR 11.0–12.0, and SSIM 0.55–0.60, but record substantially worse FID-VID/FVD scores (above 2800). In human studies, CoDance achieves 90% preference for video quality, 88% for identity preservation, and 83% for temporal consistency, outperforming all baselines.
5. Problem Motivation and Methodological Innovations
Traditional animation models impose a rigid “pose-to-pixel” spatial binding, leading to severe failures (entangled limbs, identity confusion, hallucinated subjects) when the number of subjects or their spatial arrangement diverges from training conditions. CoDanceBench was established to surface these failure cases and rigorously test approaches that claim robustness to multi-subject and misaligned settings.
The CoDance solution employs an Unbind–Rebind paradigm:
- Unbind: Injects stochastic pose/feature perturbations via a pose shift encoder, enforcing the learning of a location-agnostic, subject-agnostic motion code.
- Rebind: Utilizes semantic conditioning from text prompts and spatial cues from per-subject masks to reattach motion to explicit subjects, maintaining correspondences despite spatial/identity ambiguities.
This dual-stage approach is empirically validated on CoDanceBench, demonstrating seamless generalization to arbitrary subject counts, classes (human/cartoon), and spatial configurations.
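The Unbind step can be illustrated with a minimal sketch, assuming a simple random global translation of the driving skeleton (the paper's pose shift encoder is a learned module; only the perturbation idea is shown here):

```python
import numpy as np

rng = np.random.default_rng(0)

def unbind_shift(keypoints, max_shift=0.2, frame_size=(256, 256)):
    """Apply one random global translation per clip so the model cannot rely
    on a fixed pose-to-pixel alignment between skeleton and reference image.

    keypoints: (T, J, 2) joint coordinates over T frames
    max_shift: maximum offset as a fraction of frame width/height
    """
    h, w = frame_size
    dx = rng.uniform(-max_shift, max_shift) * w  # one offset per clip,
    dy = rng.uniform(-max_shift, max_shift) * h  # shared across all frames
    return keypoints + np.array([dx, dy])

poses = np.zeros((8, 17, 2))  # 8 frames, 17 joints
shifted = unbind_shift(poses)
# Every joint moves by the same (dx, dy): absolute location changes,
# but the skeleton's internal geometry (the motion itself) is preserved.
print(np.allclose(shifted - poses, shifted[0, 0] - poses[0, 0]))  # True
```

A rigid translation is the simplest location-agnostic perturbation; stronger variants could also rescale or jitter per-frame, at the cost of distorting the motion signal.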
6. Limitations and Prospective Extensions
The creators of CoDanceBench identify several constraints:
- Dataset Scale: The benchmark comprises 20 videos, restricting statistical power and the diversity of motion and subject configurations. Larger and more eclectic datasets would better challenge generalization.
- Motion Diversity: Current focus is limited to dance movements; inclusion of broader activities (e.g., sports, daily actions) would offer more comprehensive evaluation.
- Background Complexity: All scenes feature static backgrounds, precluding the assessment of occlusion handling and dynamic background integration.
- Absence of Train/Validation Splits: Only test data is provided, leaving a gap for community-driven training and annotation of multi-subject video material.
A plausible implication is that future benchmarks expanding context, action richness, and annotation depth would enable deeper investigation of multi-agent animation models’ real-world readiness.
7. Significance and Position in the Field
CoDanceBench constitutes the first dedicated benchmark for evaluating multi-subject pose transfer and cardinality-mismatch animation under realistic misalignment and identity-ambiguity scenarios. Its protocols and metrics illuminate key challenges neglected by prevailing single-subject paradigms. CoDance’s performance on this benchmark, guided by Unbind–Rebind mechanisms, underscores the importance of decoupling motion semantics from spatial reference and reassociating them through per-subject controls (Tan et al., 16 Jan 2026). CoDanceBench thereby sets a precedent for the evaluation of robust, flexible multi-agent character animation models and provides foundational infrastructure for future work in this domain.