Interactive Fiction with LLM Agents
- Interactive fiction games with LLM agents are narrative systems that integrate LLM creativity with symbolic planning and rule-based filtering.
- They use modular architectures that decouple context management, action generation, and narrative feedback to enhance player immersion and narrative coherence.
- Recent evaluations highlight advances in reinforcement learning and episodic memory while exposing persistent challenges in long-term context retention and scalability.
Interactive fiction (IF) games with LLM agents represent the intersection of natural language processing, agentic reasoning, and narrative generation, enabling highly flexible and adaptive player experiences. These systems have catalyzed advances in game AI, cognitive modeling, psychological measurement, and interactive drama, driven by a diverse methodological landscape that integrates prompt engineering, reinforcement learning, symbolic planning, rule-based filtering, and self-supervised training. The following sections synthesize the technical and conceptual architectures, key evaluation paradigms, and emerging research directions that define the field.
1. Core Architectures and Design Patterns
LLM-based IF agent architectures are typically organized around modular pipelines that decouple context management, action generation, and narrative feedback. Dominant patterns include:
- Prompt-centric orchestration: Many contemporary IF engines use context managers to concatenate evolving game state, instructions, scenario templates, and chat history into a prompt for the LLM—examples include the empathy-driven "A Day in Their Shoes" system, which relies on system prompt engineering and few-shot exemplars without model fine-tuning (Yuan et al., 9 May 2025).
- Hybrid generation-filtering pipelines: Human-devised rules, templates, or heuristics filter or override LLM outputs at key junctures. The "Werewolf" agent interleaves LLM output with a 14-rule engine for situation detection, outputting a pre-authored template for strategic moves (e.g., situational lying) and otherwise deferring to free-form LLM generation (Sato et al., 2024).
- Dynamic action synthesis and state management: Systems like STORY2GAME explicitly parse story events into actions with LLM-generated preconditions and effects, formalize them as executable code, and support on-the-fly dynamic action generation in response to unanticipated player input (Zhou et al., 6 May 2025). State transitions are grounded in executable predicates over a symbolic data model.
- Structured map-building and episodic memory: The LPLH framework advocates for internal graph-based mapping of locations and affordances (nodes, edges, labeled with descriptions), modular action representations, and experience libraries for feedback-driven refinement (Zhang et al., 18 May 2025).
- Multi-agent dramatization: The Drama Machine decomposes character simulation into parallel LLM agents with "Ego" and "Superego" roles, supporting both intersubjective dialogue and intra-character monologue, alongside a Director and, in some cases, a global Narrator agent (Magee et al., 2024).
These architectural choices reflect trade-offs among linguistic creativity, strategic consistency, computational efficiency, and memory management.
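The STORY2GAME-style grounding of story events in executable predicates can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Action` class, predicate names, and the flat string-set state model are all assumptions made for clarity.

```python
from dataclasses import dataclass

# World state as a flat predicate set, e.g. {"player.has_key", "door.locked"}.
State = set[str]

@dataclass
class Action:
    """An LLM-parsed story event formalized as preconditions and effects."""
    name: str
    preconditions: frozenset[str]  # predicates that must hold to apply
    add_effects: frozenset[str]    # predicates made true by the action
    del_effects: frozenset[str]    # predicates made false by the action

    def applicable(self, state: State) -> bool:
        return self.preconditions <= state

    def apply(self, state: State) -> State:
        if not self.applicable(state):
            raise ValueError(f"preconditions not met for {self.name}")
        return (state - self.del_effects) | self.add_effects

# An LLM might emit this structure when parsing "the hero unlocks the door":
unlock = Action(
    name="unlock_door",
    preconditions=frozenset({"player.has_key", "door.locked"}),
    add_effects=frozenset({"door.unlocked"}),
    del_effects=frozenset({"door.locked"}),
)

state = {"player.has_key", "door.locked"}
state = unlock.apply(state)  # door.locked is removed, door.unlocked added
```

Because state transitions are pure set operations over symbolic predicates, the engine can check solvability or reject inconsistent LLM-generated actions before execution.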
2. Sequential Interaction, State Update, and Planning
IF agents driven by LLMs must manage both a long-horizon memory and immediate interactive context, especially in open-ended environments:
- Context/history management: Interactive frameworks limit and order prompt content by preserving only salient recent turns (as in NetPlay's 500-token cap (Jeurissen et al., 2024)) or by regularly condensing or summarizing past events (as in PsychoGAT's episodic memory summarization (Yang et al., 2024)).
- Autosave/backtracking: The TextQuests benchmark uniquely empowers agents to autosave after every move and restore prior states, creating a controlled paradigm for trial-and-error learning without external knowledge (Phan et al., 31 Jul 2025).
- Skill selection API: Agents like NetPlay decompose the decision loop into discrete "skills" (automated subroutines parameterized over game-relevant arguments). The LLM is prompted to select a skill and arguments at each step, with execution and event tracking managed by a separate module (Jeurissen et al., 2024).
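A NetPlay-style skill-selection step might look like the following sketch. The skill registry, prompts, and `call_llm` function are hypothetical stand-ins (any chat-completion client returning JSON would do); only the overall decomposition—LLM chooses a skill and arguments, a separate module executes and tracks events—reflects the cited design.

```python
import json

# Hypothetical skill registry: each skill is a subroutine over game arguments.
SKILLS = {
    "move_to": lambda game, target: game.pathfind(target),
    "attack": lambda game, monster: game.melee(monster),
    "pick_up": lambda game, item: game.take(item),
}

def decide_and_act(game, call_llm, history):
    """One decision-loop step: the LLM picks a skill and arguments;
    execution and event tracking happen outside the model."""
    prompt = (
        "You control an agent. Available skills: "
        + ", ".join(SKILLS)
        + ". Recent events:\n"
        + "\n".join(history[-10:])  # keep only salient recent turns
        + '\nReply as JSON: {"skill": ..., "args": [...]}'
    )
    choice = json.loads(call_llm(prompt))
    skill = SKILLS[choice["skill"]]  # guard: unknown skills raise KeyError
    events = skill(game, *choice["args"])
    history.append(f"{choice['skill']}({choice['args']}) -> {events}")
    return events
```

Bounding the history slice mirrors the token-cap strategy above; appending executed events back into the history closes the interactive loop.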
Action selection incorporates chain-of-thought prompting, structured scoring over action candidates, or hybrid pipelines that include rule-based filters and template selection (Zhang et al., 18 May 2025, Sato et al., 2024, Kostka et al., 2017).
3. Narrative and Persona Conditioning
LLM agent behavior in IF emphasizes not just mechanical competence but also character coherence and emotional engagement:
- Persona-based style transfer: The Werewolf agent conditions LLM utterance generation on hand-defined persona prefixes, each specifying stylistic and grammatical quirks to simulate distinctive characters (e.g., "Princess" vs. "Kansai dialect") (Sato et al., 2024).
- Playwriting-guided narrative control: Hybrid systems employ playwriting constraints and iterative LLM critique to enforce structural plot elements (e.g., suspense, emotional tension) and inject macro-/micro-level narrative techniques. Playwriting-guided pipelines show marked improvements in dramatic qualities over naive baseline generation, as measured by human evaluators (Wu et al., 25 Feb 2025).
- Superego/Ego modeling: In multi-agent simulations, an "Ego" agent authors a character’s external utterances, while a "Superego" agent acts as an internal critic, either rewriting input, revising drafts, or critiquing proposed actions before they are finalized and presented to the user (Magee et al., 2024).
- Adaptive reflection: Plot-based reflection mechanisms periodically revise the set of plot objectives and character motivations based on detected player intent, allowing for incremental, player-driven story divergence (Wu et al., 25 Feb 2025).
These strategies enhance immersion (first-person presence and narrative coherence) and agency (ability of player actions to meaningfully shift story trajectories).
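The Ego/Superego split described above can be sketched as a simple draft-critique-revise loop. The prompts and the generic `call_llm` function are illustrative assumptions; the Drama Machine's actual agents are more elaborate.

```python
def dramatize(call_llm, character, situation, max_revisions=2):
    """Ego drafts the character's utterance; Superego critiques it as an
    internal critic and may request a rewrite before the line is finalized."""
    draft = call_llm(f"As {character}'s Ego, speak in this situation: {situation}")
    for _ in range(max_revisions):
        critique = call_llm(
            f"As {character}'s Superego (internal critic), review this line "
            f"for consistency with {character}: {draft}. "
            "Reply APPROVE or give a revision instruction."
        )
        if critique.strip().startswith("APPROVE"):
            break
        draft = call_llm(f"Revise the line per this note: {critique}\nLine: {draft}")
    return draft
```

Bounding the number of revisions keeps latency predictable; a Director or Narrator agent could wrap this call per character turn.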
4. Symbolic and Subsymbolic Learning Mechanisms
LLM-IF agent research deploys a range of learning strategies to support exploration, generalization, and decision-making in stateful, combinatorial environments:
- Supervised and reinforcement learning: LPLH formalizes policy learning as a softmax over context-embedded state representations, combining cross-entropy loss on human-like actions with RL-based update terms to maximize expected cumulative reward (Zhang et al., 18 May 2025).
- Self-supervised skill transfer via procedural generation: STARLING constructs large corpora of LLM-generated IF games, spanning diverse skills and subtasks, for self-supervised RL pretraining (Basavatia et al., 2024). Agents pretrained on multiple procedurally generated games show improved generalization to new tasks.
- Heuristic/NLP-based action scoring: The Golovin agent scores actions via a mixture of corpus statistics (pattern frequency), semantic similarity (word2vec cosine), attention weights (LSTM), rarity (IDF), and surface word overlap. Action selection is performed by sampling from a softmax over these scores (Kostka et al., 2017).
- Dynamic action expansion: STORY2GAME supports real-time extension of the game's action set by parsing new player commands into symbolic predicates, updating engine state, and retroactively adjusting previously generated logic (e.g., if a new object attribute is suddenly invoked by a dynamic command) (Zhou et al., 6 May 2025).
These learning modules are informed by cognitive science principles (e.g., dual-process theory, schema acquisition, episodic memory) (Zhang et al., 18 May 2025), and serve as substrates for spatial/narrative reasoning, affordance extraction, and adaptive planning.
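The Golovin-style mixture scoring and softmax sampling above can be sketched as follows. The feature functions and weights are placeholders (the cited agent combines corpus frequency, word2vec similarity, LSTM attention, IDF, and word overlap); only the weighted-sum-then-softmax-sample structure is from the source.

```python
import math
import random

def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of raw scores."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def select_action(candidates, feature_fns, weights, rng=random.random):
    """Score each candidate action as a weighted mixture of features,
    then sample one from the softmax distribution over the scores."""
    scores = [
        sum(w * f(a) for w, f in zip(weights, feature_fns)) for a in candidates
    ]
    probs = softmax(scores)
    r, acc = rng(), 0.0
    for action, p in zip(candidates, probs):
        acc += p
        if r <= acc:
            return action
    return candidates[-1]  # guard against floating-point shortfall
```

Sampling rather than taking the argmax preserves exploration, which matters in combinatorial IF state spaces where the top-scored action may be a dead end.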
5. Evaluation Benchmarks and Metrics
Rigorous empirical evaluation is a hallmark of recent IF-LLM research:
- Game progress & solution rate: TextQuests quantifies progress as the fraction of labeled narrative checkpoints completed, with auxiliary metrics for harm (ethicality) and normalized score. No model in the latest benchmarks fully completes any title without external hints (Phan et al., 31 Jul 2025).
- Qualitative human judgment: Narrative-driven and role-agent systems are evaluated on multiple subjective axes—immersion, agency, character consistency, narrative interest—and often utilize blind ratings from expert evaluators (Wu et al., 25 Feb 2025, Sato et al., 2024).
- Psychometric validity: In PsychoGAT, narrative-embedded psychological assessments are scored on reliability (Cronbach's alpha, Guttman's λ₆), convergent/discriminant validity (AVE, Fornell-Larcker), and human-rated content quality (Yang et al., 2024).
- Sample efficiency and transfer: In the STARLING framework, pretraining across hundreds of generated games yields marked gains in both normalized score and sample efficiency compared to vanilla RL and human baselines (Basavatia et al., 2024).
Results consistently underscore the tension between generative linguistic capacity, strategic reliability, and the limits of context retention for long-term planning.
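The reliability coefficient cited above (Cronbach's alpha) is computed from a respondents-by-items score matrix; a plain-Python sketch, using population variance for simplicity (a real analysis would use a stats package):

```python
def cronbach_alpha(item_scores):
    """Internal-consistency reliability over rows=respondents, cols=items:
        alpha = k/(k-1) * (1 - sum(item variances) / variance(total score))
    """
    k = len(item_scores[0])  # number of items

    def var(xs):  # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    item_vars = [var([row[i] for row in item_scores]) for i in range(k)]
    total_var = var([sum(row) for row in item_scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)
```

Perfectly correlated items yield alpha = 1.0; uncorrelated items drive it toward 0, which is why narrative-embedded assessments must keep their items measuring the same construct despite generated surface variation.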
6. Open Challenges and Future Directions
Despite advances in system modularity and prompt design, persistent issues remain:
- Long-term memory and retrieval: Context truncation, memory bloat, and "long-context hallucinations" degrade agent performance as history grows (e.g., NetPlay, TextQuests) (Jeurissen et al., 2024, Phan et al., 31 Jul 2025). Use of dynamic memory nets and context summarization is recommended but not fully realized.
- Hierarchical planning and subgoal tracking: Agents often lack explicit subgoal decomposition, leading to looping, unproductive exploration, and failure to backtrack—integrated "notes" modules or internal planners are needed (Phan et al., 31 Jul 2025, Zhang et al., 18 May 2025).
- Rule-template scalability and coverage: Hand-crafted rule sets (e.g., in the Werewolf agent) scale poorly as environment complexity grows, suggesting a need for learned or automatically induced controllers (Sato et al., 2024).
- Narrative integrity under dynamic expansion: Real-time action expansion and retroactive game logic repair, as in STORY2GAME, are hampered by object disambiguation, attribute clashes, and broken solvability (Zhou et al., 6 May 2025).
- Evaluation generalizability: Current benchmarks often privilege text-only, single-agent games; there is an identified need for multi-agent, multimodal, or cross-cultural evaluation frameworks (Yuan et al., 9 May 2025).
- Ethical behavior and stereotype avoidance: Qualitative studies warn of LLMs unintentionally reinforcing occupational stereotypes or narrative biases in interactive assessments (Yuan et al., 9 May 2025).
Solutions explored or proposed include retrieval-augmented generation, explicit feedback signals, dynamic skill adaptation, periodic reflection/inference loops, and the integration of multimodal input/output layers (VR, image generation).
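One of the remedies above for context bloat, periodic summarization, can be sketched as a rolling memory that keeps a fixed window of verbatim turns and folds older turns into an LLM-written summary. The prompts and the generic `call_llm` function are illustrative assumptions, not a specific system's API.

```python
def update_memory(call_llm, summary, recent_turns, max_recent=8):
    """Fold turns that overflow the verbatim window into a running summary."""
    if len(recent_turns) <= max_recent:
        return summary, recent_turns
    overflow, kept = recent_turns[:-max_recent], recent_turns[-max_recent:]
    summary = call_llm(
        "Condense into one paragraph, preserving goals, items, and locations.\n"
        f"Existing summary: {summary}\nNew events: {' '.join(overflow)}"
    )
    return summary, kept

def build_prompt(summary, recent_turns, instruction):
    """Assemble the bounded context actually sent to the model."""
    return (
        f"Story so far: {summary}\nRecent turns:\n"
        + "\n".join(recent_turns)
        + f"\n{instruction}"
    )
```

The prompt size stays bounded regardless of episode length, at the cost of lossy compression—exactly the trade-off behind the "long-context hallucination" failures noted above.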
Interactive fiction games with LLM agents embody a rapidly evolving, multi-paradigm research direction spanning symbolic architectures, statistical RL, narrative control, and psychological modeling. Leading work demonstrates the promise of modular, hybrid systems that interleave LLM creativity with symbolic checking, episodic memory, and explicit narrative scaffolding, but substantial challenges remain in scaling, robustness, and human-like adaptability (Yuan et al., 9 May 2025, Zhang et al., 18 May 2025, Yang et al., 2024, Phan et al., 31 Jul 2025, Sato et al., 2024, Zhou et al., 6 May 2025, Kostka et al., 2017, Jeurissen et al., 2024, Basavatia et al., 2024, Wu et al., 25 Feb 2025, Magee et al., 2024, Hausknecht et al., 2019).