Interact2Ar: Full-Body Human-Human Interaction Generation via Autoregressive Diffusion Models

Published 22 Dec 2025 in cs.CV | (2512.19692v1)

Abstract: Generating realistic human-human interactions is a challenging task that requires not only high-quality individual body and hand motions, but also coherent coordination among all interactants. Due to limitations in available data and increased learning complexity, previous methods tend to ignore hand motions, limiting the realism and expressivity of the interactions. Additionally, current diffusion-based approaches generate entire motion sequences simultaneously, limiting their ability to capture the reactive and adaptive nature of human interactions. To address these limitations, we introduce Interact2Ar, the first end-to-end text-conditioned autoregressive diffusion model for generating full-body, human-human interactions. Interact2Ar incorporates detailed hand kinematics through dedicated parallel branches, enabling high-fidelity full-body generation. Furthermore, we introduce an autoregressive pipeline coupled with a novel memory technique that facilitates adaptation to the inherent variability of human interactions using efficient large context windows. The adaptability of our model enables a series of downstream applications, including temporal motion composition, real-time adaptation to disturbances, and extension beyond dyadic to multi-person scenarios. To validate the generated motions, we introduce a set of robust evaluators and extended metrics designed specifically for assessing full-body interactions. Through quantitative and qualitative experiments, we demonstrate the state-of-the-art performance of Interact2Ar.

Authors (5)

Summary

The paper introduces a novel autoregressive diffusion model with cooperative denoising and mixed memory, enabling high-fidelity, text-driven full-body interactions.
It employs specialized denoising heads for global trajectories, body, and hand articulation to ensure detailed motion generation and smooth transitions.
The model outperforms prior state-of-the-art methods in metrics like R-Precision, FID, and jerk-based smoothness while supporting versatile downstream applications.

Introduction

Synthesizing physically plausible, text-conditioned, full-body human-human interactions remains difficult because it demands fine-grained temporal and spatial coherence across multiple agents, including detailed hand articulation. Previous approaches omitted hand modeling, limited the information flow between agents, or generated motion that could not adapt to its context. “Interact2Ar: Full-Body Human-Human Interaction Generation via Autoregressive Diffusion Models” (2512.19692) addresses these deficiencies with an architecture and learning paradigm that combines cooperative agent denoising, body-part specialization, and autoregressive diffusion with a mixed-memory mechanism. The model supports high-fidelity motion generation as well as downstream adaptive applications, including compositional motion synthesis, adaptation to perturbations, and multi-human interactions.

Architectural Innovations

Interact2Ar extends the diffusion modeling paradigm with three architectural components:

Multi-Head Denoiser: A joint encoder feeds noised motion and conditioning to three specialized denoising heads for the global trajectory, body, and hands, allowing each submodule to concentrate on relevant subspaces while maintaining cross-part dependencies.
Cooperative Denoisers: Each interactant is modeled with a parallel stream; cross-attention between streams ensures effective bidirectional flow of interaction cues and physical constraints, with weight sharing ensuring parameter efficiency and generalization to arbitrary dyad configurations.
Autoregressive Inference: Unlike prior diffusion-based approaches, which generate the entire motion sequence at once, Interact2Ar generates motion block by block, with each block conditioned on previous context frames. This improves adaptability to ongoing interaction dynamics and enables real-time or incremental synthesis.
Figure 1: Architectural illustration of Interact2Ar’s denoising heads, cooperative streams, and autoregressive pipeline.

This configuration enables strong performance in both precision and fine-grained realism, particularly for high-dimensional hand articulations, which previous architectures either omitted or modeled inefficiently.
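The block-wise generation described above can be illustrated with a minimal sketch. It is not the authors' implementation: `rollout`, `toy_sampler`, `context_len`, and the one-dimensional "frames" are hypothetical stand-ins, and the toy sampler simply continues from the last context frame where a real system would run a text-conditioned diffusion sampler.

```python
from typing import Callable, List, Sequence

Frame = List[float]  # toy stand-in for a full-body pose feature vector


def rollout(
    sample_block: Callable[[Sequence[Frame], str], List[Frame]],
    prompt: str,
    num_blocks: int,
    context_len: int,
) -> List[Frame]:
    """Generate motion block by block, conditioning each block on recent frames."""
    motion: List[Frame] = []
    for _ in range(num_blocks):
        # The most recent frames generated so far act as the context.
        context = motion[-context_len:] if context_len > 0 else []
        motion.extend(sample_block(context, prompt))
    return motion


def toy_sampler(context: Sequence[Frame], prompt: str) -> List[Frame]:
    """Hypothetical sampler: emits 4 frames continuing from the last context frame."""
    start = context[-1][0] if context else 0.0
    return [[start + i + 1.0] for i in range(4)]


motion = rollout(toy_sampler, "two people shake hands", num_blocks=3, context_len=2)
print(len(motion), motion[-1])
```

Because every block only sees frames that already exist, the context can be inspected or modified between blocks, which is what makes incremental synthesis possible in this style of pipeline.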

Mixed Memory for Temporal Context Management

Interacting agents often require both immediate and long-term historical information to maintain physical plausibility and avoid local repetitive artifacts or global inconsistency. To this end, Interact2Ar integrates a novel Mixed Memory mechanism:

Short-Term (Dense) Memory: Immediate preceding frames are retained at full temporal resolution as context for the denoiser, enabling seamless transitions, local continuity, and rapid adaptation to recent cues.
Long-Term (Sparse) Memory: A downsampled buffer of frames extends the context window without proportional memory overhead, allowing global consistency and avoidance of mid-sequence drift or repetition artifacts in extended generative rollouts.
Figure 2: Mixed Memory enables access to detailed local context and long-term interaction history with limited memory cost.

Ablation experiments indicate that naively enlarging the context window impedes convergence, while Mixed Memory yields better metrics at lower resource cost.
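One way to picture the dense-plus-sparse context is as a selection of past frame indices. The sketch below is an assumption-laden illustration: the function name, the buffer sizes, and the fixed `stride` used for downsampling are all hypothetical, and this summary does not state how the paper actually subsamples its long-term buffer.

```python
from typing import List


def mixed_memory_indices(
    num_past: int, dense_len: int, sparse_len: int, stride: int
) -> List[int]:
    """Return the indices (oldest first) of past frames kept as context.

    The last `dense_len` frames are kept at full temporal resolution.
    Up to `sparse_len` older frames are kept, one every `stride` frames.
    """
    dense_start = max(0, num_past - dense_len)
    dense = list(range(dense_start, num_past))

    # Walk backwards from just before the dense window in steps of `stride`.
    sparse: List[int] = []
    idx = dense_start - stride
    while idx >= 0 and len(sparse) < sparse_len:
        sparse.append(idx)
        idx -= stride
    sparse.reverse()
    return sparse + dense


indices = mixed_memory_indices(num_past=100, dense_len=10, sparse_len=5, stride=8)
print(indices)  # 15 context frames spanning frames 50..99
```

With these illustrative settings, 15 stored frames cover a 50-frame history, whereas a dense buffer of the same reach would need 50 frames.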

Evaluation Methodology

To address the known limitations of prior evaluation protocols (e.g., insensitivity to global trajectory artifacts and inadequate hand assessment), the authors develop a robust set of evaluators:

All evaluators operate on global joint positions (not rotations), ensuring sensitivity to spatial misalignments, especially critical in interaction contexts.
Evaluators are body-component-specific (full, body, hands), supporting fine-grained analysis. This mitigates the risk of architectures overfitting to specific body parts at the expense of others.

Comprehensive quantitative and qualitative evaluation on the Inter-X dataset demonstrates the superiority of Interact2Ar across R-Precision, FID, multimodality/diversity, and jerk-based smoothness metrics, using both the baseline and newly proposed evaluators.
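As an illustration of a jerk-based smoothness measure computed on global joint positions, the sketch below uses a third-order finite difference. The function name, the input layout (frames of `(x, y, z)` joints), and the choice to average jerk magnitude over joints and time are assumptions; the paper's exact formulation is not given in this summary.

```python
import math
from typing import List, Tuple

Joint = Tuple[float, float, float]


def mean_jerk(positions: List[List[Joint]], fps: float) -> float:
    """Mean magnitude of the third time derivative of global joint positions.

    positions[t][j] is the (x, y, z) position of joint j at frame t.
    """
    dt = 1.0 / fps
    total, count = 0.0, 0
    for t in range(len(positions) - 3):
        for j in range(len(positions[0])):
            sq = 0.0
            for c in range(3):
                p0, p1, p2, p3 = (positions[t + k][j][c] for k in range(4))
                # Third-order forward difference approximates the jerk.
                d3 = (p3 - 3.0 * p2 + 3.0 * p1 - p0) / dt ** 3
                sq += d3 * d3
            total += math.sqrt(sq)
            count += 1
    return total / count if count else 0.0


# Constant acceleration (x = t^2) has zero jerk; x = t^3 has constant jerk 6.
quadratic = [[(float(t ** 2), 0.0, 0.0)] for t in range(6)]
cubic = [[(float(t ** 3), 0.0, 0.0)] for t in range(6)]
print(mean_jerk(quadratic, fps=1.0), mean_jerk(cubic, fps=1.0))
```

Restricting the joint list to body joints or hand joints would give a component-specific variant of the same measure.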

Comparative Results

Interact2Ar (both with and without autoregression) consistently surpasses prior SOTA models (e.g., InterGen, InterMask) in full-body, body-specific, and hand-specific metrics. Most notably, the autoregressive variant provides additional gains in adaptability and transition smoothness, a dimension that earlier multi-human generation work left unaddressed.

Figure 3: User study: Average ranking results highlight Interact2Ar’s strong performance in text alignment and hand fidelity.

Qualitative comparisons show superior compositional alignment and hand motion for Interact2Ar.

Figure 4: Interact2Ar vs. InterMask: Higher-quality interactions with improved text alignment and hand realism via Interact2Ar.

Adaptive and Downstream Applications

Autoregressive sampling with Mixed Memory enables several downstream applications:

Temporal Motion Composition: Supports concatenation of distinct interaction segments (e.g., multiple textual prompts in sequence) with physically plausible transitions.
Disturbance Adaptation: Handles exogenous perturbations (e.g., trajectory jarring, actor swap) without global sequence artifacts.
Multi-Human Synthesis: Enables expansion beyond dyadic scenes, with smooth handoff between sequentially interacting agents by leveraging episodic memory.
Figure 5: Interact2Ar’s autoregressive diffusion enables compositionality, adaptation to state perturbation, and sequential multi-human interactions.
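Temporal motion composition can be pictured as switching the text prompt between blocks while the motion context carries over. The following toy sketch rests on stated assumptions: `compose`, `toy_sampler`, the prompt strings, and the scalar "frames" are hypothetical, and a real system would call a text-conditioned diffusion sampler in place of the toy one.

```python
from typing import Callable, List, Sequence, Tuple


def compose(
    sample_block: Callable[[Sequence[float], str], List[float]],
    schedule: List[Tuple[str, int]],
    context_len: int,
) -> Tuple[List[float], List[str]]:
    """Generate one continuous motion from a list of (prompt, num_blocks) segments."""
    motion: List[float] = []
    labels: List[str] = []
    for prompt, num_blocks in schedule:
        for _ in range(num_blocks):
            # Context carries over across prompt boundaries.
            context = motion[-context_len:] if context_len > 0 else []
            block = sample_block(context, prompt)
            motion.extend(block)
            labels.extend([prompt] * len(block))
    return motion, labels


def toy_sampler(context: Sequence[float], prompt: str) -> List[float]:
    """Hypothetical sampler: moves forward or backward from the last frame."""
    start = context[-1] if context else 0.0
    step = 1.0 if prompt == "approach" else -1.0
    return [start + step * (i + 1) for i in range(2)]


motion, labels = compose(toy_sampler, [("approach", 2), ("retreat", 1)], context_len=1)
print(motion)  # the "retreat" segment starts from where "approach" ended
```

The same loop structure also suggests how a disturbance could be handled: if the stored frames are altered between blocks, the next block is conditioned on the altered context.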

Implications, Limitations, and Future Work

The architectural and evaluation advances embodied by Interact2Ar meaningfully shift the baseline for text-conditioned, full-body, multi-human interaction generation. By incorporating detailed hand kinematics and autoregressive adaptation, this model facilitates practical deployment in virtual assistants, simulation, telepresence, and entertainment pipelines, where high-quality, adaptive, and compositional interaction is essential.

However, constraints remain primarily in dataset limitations, especially the lack of shape diversity in ground truth motion data. While SMPL-X provides strong parameterization for generative efficiency, it may obscure population-level variability crucial for some downstream tasks (e.g., hand contact precision, social signal diversity). Extending datasets and model conditioning to cover broader anthropometric distributions will further enhance realism.

From a theoretical perspective, the efficacy of Mixed Memory for extended context windows suggests promise for analogous architectures in other autoregressive diffusion tasks (e.g., language and vision multimodal modeling), especially those requiring efficient balance between local coherence and global structure.

Conclusion

Interact2Ar advances the modeling of realistic, adaptable, text-driven human-human interactions with explicit hand modeling. The combination of cooperative denoising, body-part specialization, autoregressive generation, and Mixed Memory improves on prior methods in both quantitative metrics and downstream adaptability, and sets a rigorous evaluation and ablation standard for future research in multi-agent human motion synthesis (2512.19692).
