
AnyTop: Character Animation Diffusion with Any Topology

Published 24 Feb 2025 in cs.GR, cs.AI, and cs.CV | (2502.17327v2)

Abstract: Generating motion for arbitrary skeletons is a longstanding challenge in computer graphics, remaining largely unexplored due to the scarcity of diverse datasets and the irregular nature of the data. In this work, we introduce AnyTop, a diffusion model that generates motions for diverse characters with distinct motion dynamics, using only their skeletal structure as input. Our work features a transformer-based denoising network, tailored for arbitrary skeleton learning, integrating topology information into the traditional attention mechanism. Additionally, by incorporating textual joint descriptions into the latent feature representation, AnyTop learns semantic correspondences between joints across diverse skeletons. Our evaluation demonstrates that AnyTop generalizes well, even with as few as three training examples per topology, and can produce motions for unseen skeletons as well. Furthermore, our model's latent space is highly informative, enabling downstream tasks such as joint correspondence, temporal segmentation and motion editing. Our webpage, https://anytop2025.github.io/Anytop-page, includes links to videos and code.

Summary

  • The paper presents a diffusion model that generates realistic motion animations across arbitrary skeletal topologies by leveraging transformer-based denoising and graph learning techniques.
  • It details a novel skeletal representation that processes irregular motion data with joint-level features and integrates topological conditioning for enhanced spatial control.
  • The approach offers practical applications in animation generation, motion retargeting, segmentation, and editing while addressing scalability and computational challenges.

Animating characters with diverse skeletal structures is a significant challenge in computer graphics due to the inherent irregularity of skeletal data and the scarcity of datasets containing motions for various topologies. Traditional methods often focus on a single skeleton or variations within a limited range of topologies (like homeomorphic graphs), requiring topology-specific model adjustments or separate models for each skeleton. This limits their scalability and generalization capabilities.

AnyTop is a diffusion model designed to overcome these limitations by generating motions for characters with arbitrary skeletal structures using only their skeleton as input. It employs a transformer-based denoising network specifically adapted for graph learning and irregular data.

Core Concepts and Implementation

AnyTop's design hinges on handling the variability and irregular nature of skeletons and their motion data.

  1. Motion Representation: Motion is represented as a 3D tensor $X \in \mathbb{R}^{N \times J \times D}$, where $N$ is the maximum number of frames, $J$ is the maximum number of joints across the dataset, and $D$ is the feature dimension per joint. Varying frame and joint counts are handled by padding. Each joint (except the root) is represented by a 13-dimensional vector: root-relative position (3D), 6D joint rotation, linear velocity (3D), and a foot-contact label (1D). The root joint has its own features (rotational velocity, linear velocity, height). A key distinction from some prior work is that AnyTop maintains features at the joint level, treating each joint at each frame as a separate token, enabling fine-grained spatial control and generalization.
    
    # Motion data structure (illustrative; N, J, D as defined above)
    import numpy as np

    N, J, D = 120, 64, 13  # example sizes
    motion_data = {
        'frames': N,
        'joints': J,
        'features_per_joint': D,
        'data_tensor': np.zeros((N, J, D)),            # zero-padded motion tensor
        'padding_mask': np.zeros((N, J), dtype=bool),  # True for real, False for padded joints
        # Per-joint features (for a non-root joint):
        # [pos_x, pos_y, pos_z, rot6d_1, ..., rot6d_6, vel_x, vel_y, vel_z, contact]
    }
  2. Skeletal Structure Representation: A skeleton (or topology) $S$ is represented by four components:
    • $P_S \in \mathbb{R}^{J \times D}$: The rest pose, formatted similarly to a single motion frame's joint features (root-relative positions, zero rotations/velocities/contacts).
    • $R_S \in \mathbb{N}_0^{J \times J}$: Joint-relations matrix. $R_S[i, j]$ encodes the relationship between joints $i$ and $j$ (child, parent, sibling, no-relation, self, end-effector).
    • $D_S \in \mathbb{N}_0^{J \times J}$: Topological-distances matrix. $D_S[i, j]$ is the shortest-path distance between joints $i$ and $j$ in the skeletal graph, capped at a maximum distance $d_{max}$.
    • $N_S$: Textual descriptions (names) of the joints.
  3. Architecture: AnyTop is a DDPM consisting of:
    • Enrichment Block: Integrates skeleton information into the noisy motion $X_t$. $P_S$ is projected and concatenated along the temporal axis as frame 0. The joint names $N_S$ are encoded with a T5 model, projected, and added to the features of their corresponding joints across all frames, yielding enriched data of shape $\mathbb{R}^{(N+1) \times J \times F}$.
    • Skeletal Temporal Transformer (STT) Block: A stack of $L$ encoder layers processing the enriched motion tokens. Each layer contains:
      • Skeletal Attention: Self-attention across joints within the same frame. Crucially, unlike prior methods, this attention spans all joints, not just adjacent ones. Topological information ($R_S$, $D_S$) is integrated into the attention maps so joints can prioritize topologically closer neighbors while still accessing information from distant parts.
      • Temporal Attention: Self-attention along the temporal axis for each joint independently, restricted to a temporal window $W$ for efficiency.
      • Feed-forward block.
  4. Topological Conditioning Scheme: Graph properties ($R_S$, $D_S$) are integrated into the skeletal attention mechanism by learning separate query ($E^D_q$, $E^R_q$) and key ($E^D_k$, $E^R_k$) embeddings for distances and relations. These embeddings are used to compute additional attention maps ($a^D$, $a^R$), which are added to the standard dot-product term $q_i \cdot k_j$. The final attention score is the scaled sum $a_{ij} = \frac{q_i \cdot k_j + a^D_{ij} + a^R_{ij}}{\sqrt{F}}$. This lets the model learn relationships informed by skeletal structure.
  5. Training:
    • Dataset: Trained on a processed version of the Truebones Zoo dataset, which contains diverse skeletons.
    • Sampling & Augmentations: A balancing sampler addresses data imbalance. Skeletal augmentations (random joint removal/addition) improve generalization to unseen topologies, though updating $D_S$ after each augmentation is computationally expensive.
    • Objectives: The standard simple diffusion loss $\mathcal{L}_{simple}$ is combined with a geodesic loss $\mathcal{L}_{rot}$ on the 6D joint rotations to ensure fidelity in rotation space: $\mathcal{L} = \mathcal{L}_{simple} + \lambda_{rot}\mathcal{L}_{rot}$.
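As an illustration of the topological-distances matrix $D_S$ above, the following sketch builds it by breadth-first search over a hypothetical `parents` array (the index of each joint's parent, with -1 for the root), clipping distances at $d_{max}$. The function name and input convention are assumptions for illustration, not the paper's implementation:

```python
import numpy as np
from collections import deque

def topological_distances(parents, d_max):
    """Pairwise shortest-path distances between joints, clipped at d_max."""
    J = len(parents)
    # Build an undirected adjacency list from the parent array
    adj = [[] for _ in range(J)]
    for j, p in enumerate(parents):
        if p >= 0:
            adj[j].append(p)
            adj[p].append(j)
    D = np.full((J, J), d_max, dtype=int)  # entries >= d_max stay clipped
    for src in range(J):
        D[src, src] = 0
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if D[src, u] + 1 < D[src, v]:
                    D[src, v] = D[src, u] + 1
                    queue.append(v)
    return D
```

For a simple chain skeleton 0-1-2-3, distances beyond `d_max` are reported as `d_max`, matching the capped definition of $D_S$.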
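The topological conditioning scheme can be sketched in NumPy for a single frame. The exact way the paper combines the learned query/key embeddings into $a^D$ and $a^R$ is not spelled out in this summary, so this sketch uses a relative-attention-style reading in which the bias maps pair token queries/keys with embeddings looked up by $D_S[i,j]$ and $R_S[i,j]$; function name, shapes, and the single-head simplification are all illustrative assumptions:

```python
import numpy as np

def skeletal_attention(q, k, v, rel_ids, dist_ids, Eq_D, Ek_D, Eq_R, Ek_R):
    """Self-attention over the joints of one frame with topological bias.

    q, k, v:    (J, F) per-joint queries, keys, values
    rel_ids:    (J, J) integer relation IDs (R_S)
    dist_ids:   (J, J) integer clipped distances (D_S)
    E*_D, E*_R: (num_ids, F) learned embedding tables for distances/relations
    """
    J, F = q.shape
    content = q @ k.T  # standard dot-product term q_i . k_j, shape (J, J)
    # Bias maps: query-side and key-side lookups (one plausible reading)
    a_D = (np.einsum('if,ijf->ij', q, Ek_D[dist_ids])
           + np.einsum('jf,ijf->ij', k, Eq_D[dist_ids]))
    a_R = (np.einsum('if,ijf->ij', q, Ek_R[rel_ids])
           + np.einsum('jf,ijf->ij', k, Eq_R[rel_ids]))
    # a_ij = (q_i . k_j + a^D_ij + a^R_ij) / sqrt(F)
    scores = (content + a_D + a_R) / np.sqrt(F)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v  # (J, F)
```

The key point the sketch captures is that the bias terms are added to the attention logits before scaling and softmax, so topologically close or specially related joints can receive systematically higher attention without hard-masking distant ones.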

Practical Applications

AnyTop's design and its informative latent space enable several practical applications:

  1. Character Animation Generation: The primary application is generating natural and diverse motions for a given skeleton, including those not seen during training. This is achieved by running the DDPM inference process, starting from a noisy sample and denoising it using the trained AnyTop model conditioned on the target skeleton's properties. The model can handle various skeleton types (bipeds, quadrupeds, insects, etc.).
  2. Spatial and Temporal Correspondence: Using Diffusion Features (DIFT) extracted from intermediate layers, AnyTop's latent space can reveal semantic similarities.
    • Spatial Correspondence: Averaging DIFT features along the temporal axis for each joint provides a joint-level descriptor. Cosine similarity between joint features from different skeletons can identify semantically corresponding joints (e.g., mapping a monkey's hand joint to a fox's paw joint).
    • Temporal Correspondence: Averaging DIFT features along the skeletal axis for each frame provides a frame-level descriptor. Cosine similarity between frame features can identify similar poses or actions across different motions or skeletons (e.g., finding the equivalent of a 'pecking' frame in a chicken motion within a different bird's motion). These correspondence capabilities can be used for tasks like motion retargeting or aligning actions.
  3. Temporal Segmentation: By extracting frame-level DIFT features from a generated or ground truth motion, reducing dimensionality (e.g., via PCA), and applying clustering (e.g., K-means), motion sequences can be automatically segmented into meaningful action phases (e.g., idle, walking, aggressive).
  4. Motion Editing (In-betweening and Body-Part Editing): AnyTop supports editing by fixing a subset of the motion tokens during diffusion sampling, either temporally (for in-betweening) or spatially (for body-part editing). At each denoising step, the model's prediction for the fixed tokens is overwritten with the original fixed values, while the remaining tokens are synthesized, yielding plausible motion seamlessly connected to the fixed parts. This extends motion editing techniques previously limited to standard humanoids to arbitrary topologies.
    • In-betweening: Fix the initial and final frames (or segments) of a motion sequence and generate the intermediate frames.
    • Body-Part Editing: Fix the motion of specific joints (e.g., lower body) and generate the motion for the remaining joints (e.g., upper body) for the entire sequence.
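The fixed-token overwrite described above can be sketched as follows, with a stand-in `denoise_step` in place of the trained AnyTop model and noise scheduler (the function signature and mask layout are illustrative assumptions):

```python
import numpy as np

def edited_sampling(denoise_step, x_T, fixed_values, fixed_mask, num_steps):
    """Diffusion sampling with fixed tokens (in-betweening / body-part editing).

    denoise_step: callable (x, t) -> x at the next (less noisy) step
    x_T:          (N, J, D) initial noise
    fixed_values: (N, J, D) tokens to keep: frames for in-betweening,
                  joints for body-part editing
    fixed_mask:   (N, J, 1) boolean, True where tokens are held fixed
    """
    x = x_T
    for t in reversed(range(num_steps)):
        x = denoise_step(x, t)
        # Overwrite the model's prediction at fixed tokens with the target values;
        # only the unmasked tokens are actually synthesized
        x = np.where(fixed_mask, fixed_values, x)
    return x
```

The same loop serves both editing modes: the mask selects whole frames for in-betweening and whole joint columns for body-part editing.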

Implementation Considerations

  • Data Preprocessing: A robust preprocessing pipeline is essential to align diverse motion capture data from sources like Truebones Zoo. This includes standardizing orientation, scale, root definition, joint naming, and connectivity, as well as ensuring natural rest poses and generating foot contact indicators.
  • Handling Varying Sizes: Padding is necessary to handle skeletons with different numbers of joints and motions with different durations. Padding masks should be used in the attention mechanisms to ignore padded tokens.
  • Computational Requirements: Training AnyTop on a dataset of diverse skeletons requires significant compute (e.g., 24 hours on an NVIDIA RTX A6000); inference is faster and runs on more modest hardware (e.g., an RTX 2080 Ti). The skeletal augmentation that updates topological distances has $O(J^2)$ complexity, which can become a bottleneck for skeletons with many joints.
  • Hyperparameter Tuning: Parameters such as the number of diffusion steps ($T$), transformer layers ($L$), latent dimension ($F$), temporal window size ($W$), and maximum topological distance ($d_{max}$) need to be tuned based on dataset size and complexity.
  • Robustness to Data Noise: Despite preprocessing, data artifacts might remain. The diffusion model's inherent robustness to noise can help, but data quality is still a limitation.
  • Foot Contact: Post-processing with Inverse Kinematics (IK) is used to ensure accurate foot contact with the ground, which is crucial for motion quality, especially for ground-based characters.
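The padding masks mentioned above are typically applied by setting the attention scores of padded keys to $-\infty$ before the softmax, so padded tokens receive zero weight. A minimal sketch, assuming each row has at least one real joint:

```python
import numpy as np

def masked_softmax(scores, valid_mask):
    """Softmax over the last axis, ignoring padded tokens.

    scores:     (..., J) raw attention scores
    valid_mask: (..., J) boolean, True for real joints, False for padding
    """
    masked = np.where(valid_mask, scores, -np.inf)  # padded keys get -inf
    masked = masked - masked.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(masked)
    return exp / exp.sum(axis=-1, keepdims=True)
```

Because `exp(-inf)` evaluates to zero, padded joints contribute nothing to the weighted sum over values, and the real joints' weights still sum to one.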

In summary, AnyTop provides a practical framework for arbitrary character animation generation by adapting diffusion models and transformer architectures to handle the irregular and diverse nature of skeletal data. Its key innovations in skeletal representation and topological conditioning enable generalization across different topologies and empower downstream applications like correspondence, segmentation, and editing, making it a valuable tool for 3D content creation pipelines involving diverse characters.
