AnimeGamer: Infinite Anime Life Simulation with Next Game State Prediction

Published 1 Apr 2025 in cs.CV (arXiv:2504.01014v2)

Abstract: Recent advancements in image and video synthesis have opened up new promise in generative games. One particularly intriguing application is transforming characters from anime films into interactive, playable entities. This allows players to immerse themselves in the dynamic anime world as their favorite characters for life simulation through language instructions. Such games are defined as infinite games since they eliminate predetermined boundaries and fixed gameplay rules, where players can interact with the game world through open-ended language and experience ever-evolving storylines and environments. Recently, a pioneering approach for infinite anime life simulation employs LLMs to translate multi-turn text dialogues into language instructions for image generation. However, it neglects historical visual context, leading to inconsistent gameplay. Furthermore, it only generates static images, failing to incorporate the dynamics necessary for an engaging gaming experience. In this work, we propose AnimeGamer, which is built upon Multimodal LLMs (MLLMs) to generate each game state, including dynamic animation shots that depict character movements and updates to character states, as illustrated in Figure 1. We introduce novel action-aware multimodal representations to represent animation shots, which can be decoded into high-quality video clips using a video diffusion model. By taking historical animation shot representations as context and predicting subsequent representations, AnimeGamer can generate games with contextual consistency and satisfactory dynamics. Extensive evaluations using both automated metrics and human evaluations demonstrate that AnimeGamer outperforms existing methods in various aspects of the gaming experience. Codes and checkpoints are available at https://github.com/TencentARC/AnimeGamer.

Summary

Overview of AnimeGamer: Infinite Anime Life Simulation

This paper presents AnimeGamer, a novel framework for infinite anime life simulation built on Multimodal LLMs (MLLMs). The aim is to transform anime film characters into interactive entities within a gaming environment, enabling a seamless life-simulation experience guided by language instructions. The approach targets dynamic, evolving games without predefined boundaries, redefining how players interact with anime worlds.

Technical Foundation

AnimeGamer addresses the shortcomings of previous attempts, which largely failed to incorporate the visual context and dynamics vital for coherent gameplay. Those prior methods generated only static images, which fall short of a comprehensive gaming experience. AnimeGamer overcomes these limitations by introducing action-aware multimodal representations: it combines visual and text-based cues into representations that a video diffusion model decodes into high-quality, dynamic video clips.

Key innovations in AnimeGamer include:

  • Action-Aware Multimodal Representations: These capture nuanced character movements and video context, which are crucial for coherent game dynamics, by encoding both motion descriptors and visual data as inputs.
  • Video Diffusion Model: Utilized for the precise transformation of multimodal representations into coherent video outputs, this model is fine-tuned to ensure contextual consistency across animations.
  • Contextual Consistency Mechanism: By taking into account historical game states and dynamically predicting forthcoming states, the system maintains coherence throughout gameplay.
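The contextual-consistency mechanism above is essentially an autoregressive loop over game states: each turn conditions on the full history of shot representations plus the player's instruction. A minimal sketch of that loop, with hypothetical interfaces standing in for the real MLLM and video diffusion decoder:

```python
# Sketch of AnimeGamer-style next-game-state prediction. GameState,
# predict_next_state, and the toy stamina update are illustrative
# placeholders, not the paper's actual interfaces.
from dataclasses import dataclass


@dataclass
class GameState:
    shot_repr: list[float]           # action-aware multimodal representation
    character_stats: dict[str, int]  # e.g. stamina, mood


def predict_next_state(history: list[GameState], instruction: str) -> GameState:
    """Stand-in for the MLLM: condition on all past shot representations
    plus the player's language instruction, and emit the next shot
    representation along with updated character stats."""
    prev = history[-1]
    new_stats = dict(prev.character_stats)
    # Toy update rule: any action costs one point of stamina.
    new_stats["stamina"] = max(0, new_stats["stamina"] - 1)
    return GameState(shot_repr=[float(len(history))], character_stats=new_stats)


# Autoregressive game loop: each turn feeds the full history back in,
# which is what gives the generated game its contextual consistency.
# (In the real system, each shot_repr would also be decoded to a video
# clip by the diffusion model.)
history = [GameState(shot_repr=[0.0], character_stats={"stamina": 10})]
for instruction in ["pick flowers", "fly on a broom"]:
    history.append(predict_next_state(history, instruction))

print(history[-1].character_stats["stamina"])  # 8 after two actions
```

The key design point is that the predictor consumes representations, not rendered frames, so the history stays compact while still carrying the visual context that earlier image-only approaches discarded.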

Evaluation Metrics and Results

AnimeGamer's effectiveness was assessed with a comprehensive set of automated and human evaluation metrics, covering character and semantic consistency, motion quality, and state updates. The system outperformed comparable methods such as GSC, GFC, and GC, particularly in maintaining character and contextual consistency. Notably, AnimeGamer demonstrated:

  • Improved Character Consistency: With higher CLIP-I and DreamSim scores, reflecting better alignment with reference characters.
  • Superior Semantic Consistency: Evident from scores on CLIP-T evaluations, indicating higher fidelity between the generated and intended actions and settings.
  • Strong Motion Quality: High accuracy and low error rates in predicting action intensities consistent with game dynamics.
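The CLIP-based metrics above reduce to cosine similarity in embedding space: CLIP-I compares image embeddings of generated frames against a reference character image, and CLIP-T compares image embeddings against the text embedding of the intended action. A minimal sketch of the scoring arithmetic, using fixed placeholder vectors in place of real CLIP encoder outputs:

```python
# Illustrative CLIP-I-style score: mean cosine similarity between the
# embeddings of generated frames and a reference character embedding.
# Real embeddings would come from a CLIP image encoder; the vectors here
# are placeholders chosen so the arithmetic is easy to check.
import numpy as np


def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def clip_i_score(frame_embs: np.ndarray, ref_emb: np.ndarray) -> float:
    """Average similarity of each generated frame to the reference."""
    return float(np.mean([cosine(f, ref_emb) for f in frame_embs]))


ref = np.array([1.0, 0.0])
frames = np.array([[1.0, 0.0],   # identical to the reference -> 1.0
                   [0.0, 1.0]])  # orthogonal to the reference -> 0.0
print(round(clip_i_score(frames, ref), 2))  # 0.5
```

A CLIP-T score follows the same pattern with the reference embedding replaced by the text embedding of the instruction, which is why both metrics reward generations that stay close to the intended character and action.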

These results underscore AnimeGamer's potential for providing an immersive gaming experience aligned with players' linguistic inputs, surpassing current baseline methodologies in coherence and variability.

Future Directions and Implications

The implications of AnimeGamer extend beyond mere entertainment, as this technology could spearhead developments in interactive storytelling, personalized game experiences, and virtual environment design. The framework's capacity to generalize from predefined datasets to broader narratives across diverse character archetypes offers promising avenues for integrating AI into creative disciplines.

The paper asserts that while AnimeGamer sets a new standard for generative anime simulations, further research is warranted to explore the generalization potential for characters and scenarios beyond the existing closed domains. This progression could lead to more personalized and complex interactions within AI-generated environments in both gaming and virtual applications.

In conclusion, AnimeGamer exemplifies a significant advancement in leveraging MLLMs to enrich interaction within virtual worlds, offering a more dynamic and contextually rich anime simulation experience. This work sets a foundation for future explorations into the applications of AI in interactive media and games.
