MeepleLM: Persona-Driven Playtesting LLM

Updated 19 January 2026
  • MeepleLM is a large language model framework that simulates subjective board game playtesting using explicit Mechanics–Dynamics–Aesthetics reasoning and persona-specific feedback.
  • It leverages a Qwen3-8B backbone with LoRA adaptation and a 16,384-token context window to efficiently process detailed rulebooks and reviews.
  • The framework integrates sophisticated persona distillation and MDA pipelines to deliver diverse, scalable, and actionable game design critiques.

MeepleLM is an LLM framework designed to function as a virtual playtester for board games by simulating the subjective feedback of diverse player archetypes. The model advances Human–AI collaboration in game design by operationalizing explicit Mechanics–Dynamics–Aesthetics (MDA) reasoning and encoding persona-specific player preferences, outperforming contemporary LLMs in community alignment, critique quality, and opinion diversity (Li et al., 12 Jan 2026).

1. Model Architecture and Generation Pipeline

MeepleLM is built on the Qwen3-8B LLM backbone, extended via low-rank adaptation (LoRA) using the LLaMA-Factory toolkit. The model supports a context window of 16,384 tokens with Flash Attention v2 for efficient inference over lengthy rulebook and review sequences.

The generation process is two-staged:

  • Stage 1: Latent MDA Chain Construction ("Slow Thinking")
    • Mechanics: The model identifies salient rule components referenced in reviews.
    • Dynamics: It infers real-time system interactions triggered by those components.
    • Aesthetics: It anticipates the emotional and experiential outcomes, modulated by the designated player persona profile.
  • Stage 2: Critique Output
    • Produces both a numeric rating and detailed textual feedback reflective of the chosen persona.
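The two-stage flow can be sketched as prompt and target assembly. The template wording, field names, and persona encoding below are illustrative assumptions; only the "> " thinking-step prefix reflects the paper's setup:

```python
# Illustrative sketch of MeepleLM's two-stage generation: the conditioning
# prompt carries the persona profile and rulebook, while the training target
# places the latent MDA chain ("slow thinking") before the critique.
# Template wording and field names are assumptions of this sketch.

def build_prompt(rulebook: str, persona: dict) -> str:
    """Assemble the conditioning context: persona profile as a system
    instruction, followed by the rulebook text."""
    system = (
        f"You are a board-game playtester of archetype '{persona['name']}': "
        f"{persona['traits']}."
    )
    return f"{system}\n\n[RULEBOOK]\n{rulebook}\n\n[TASK] Rate and critique."

def format_target(mda_chain: dict, rating: float, critique: str) -> str:
    """Training target: MDA thinking steps (prefixed '> ') before the
    numeric rating and textual critique."""
    thinking = "\n".join(
        f"> {layer}: {content}" for layer, content in mda_chain.items()
    )
    return f"{thinking}\nRating: {rating}\nCritique: {critique}"

persona = {"name": "System Purist", "traits": "favors control, dislikes randomness"}
chain = {
    "Mechanics": "dice-driven movement",
    "Dynamics": "swingy turn outcomes",
    "Aesthetics": "frustration at lost control",
}
prompt = build_prompt("Players roll 2d6 and move...", persona)
target = format_target(chain, 5.5, "The dice undermine planning.")
print(target.splitlines()[0])  # first "slow thinking" step
```

Conditioning the persona through the system instruction (rather than the user turn) matches the persona-as-system-prompt encoding described later for training.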

Supervised fine-tuning is applied to maximize the joint likelihood across the full chain of MDA reasoning and critique output:

$$L = -\sum_{t=1}^{|[\mathcal{Z};\,\mathcal{Y}]|} \log P\left(s_t \mid s_{<t}, \mathcal{R}, \mathcal{P}_{\text{profile}}\right)$$

where $\mathcal{R}$ encodes the rulebook context and $\mathcal{P}_{\text{profile}}$ the persona. LoRA hyperparameters include rank $r = 32$, $\alpha = 64$, dropout $0.1$, the AdamW optimizer, a learning rate of $5 \times 10^{-5}$, and three training epochs (Li et al., 12 Jan 2026).
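As a toy numeric illustration of the objective: the loss is just the negative log-likelihood summed over every token of the concatenated chain-plus-critique sequence [Z; Y]. The per-token probabilities below are invented for demonstration:

```python
import math

# Toy illustration of the SFT objective: negative log-likelihood summed over
# every token of the concatenated reasoning chain Z and critique Y, each
# token conditioned on its prefix plus rulebook R and persona profile P.
# The probabilities below are made up.

def sft_loss(token_probs):
    """L = -sum_t log P(s_t | s_<t, R, P_profile)."""
    return -sum(math.log(p) for p in token_probs)

# P(s_t | ...) for the 5 tokens of a tiny [Z; Y] sequence.
probs = [0.9, 0.8, 0.7, 0.95, 0.85]
loss = sft_loss(probs)
print(round(loss, 4))
```

Maximizing the joint likelihood over the full chain (rather than the critique alone) is what forces the MDA reasoning steps to be learned explicitly.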

2. Dataset Construction and MDA Reasoning

Rulebook Structure

A curated corpus of 1,727 structurally normalized board-game rulebooks is compiled from BoardGameGeek, stratified by market rank, cognitive weight, publication year, and key mechanics. The preprocessing pipeline comprises:

  • PDF extraction via MinerU,
  • Markdown restructuring using Qwen3-235B (yielding sections such as Objective, Components, Gameplay Flow),
  • Logical consistency cross-checks and completion by GPT-5.1.
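A consistency check along these lines might look as follows; the required section names come from the text, while the parsing and validation logic is an illustrative sketch:

```python
import re

# Illustrative rulebook-structure check: after Markdown restructuring, verify
# that the required sections are present and non-empty before a rulebook
# enters the corpus. The section list follows the text; the logic is a sketch.

REQUIRED = ("Objective", "Components", "Gameplay Flow")

def parse_sections(markdown: str) -> dict:
    """Split a restructured rulebook into {heading: body} pairs."""
    sections = {}
    current = None
    for line in markdown.splitlines():
        m = re.match(r"^#+\s+(.*)", line)
        if m:
            current = m.group(1).strip()
            sections[current] = ""
        elif current is not None:
            sections[current] += line + "\n"
    return sections

def is_consistent(markdown: str) -> bool:
    """True when every required section exists with non-empty content."""
    secs = parse_sections(markdown)
    return all(name in secs and secs[name].strip() for name in REQUIRED)

rulebook = """# Objective
Score the most points.
# Components
52 cards, 4 meeples.
# Gameplay Flow
Draw, play, score.
"""
print(is_consistent(rulebook))  # True
```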

Review Selection and Facet Labeling

From an initial pool of approximately 1.8 million user reviews across BoardGameGeek, BoardGameArena, Tabletopia, GStone, and QPBG, 150,000 reviews are selected through:

  • Automated filtering (removing off-topic/short/rating-incongruent samples),
  • Quality scoring based on explicit anchoring to MDA facets,
  • Facet-aware stratified sampling that maximizes semantic diversity while preserving sentiment fidelity (Pearson $r = 0.92$ with source ratings).
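A much-simplified sketch of the sampling-and-fidelity check; the equal per-facet quotas, synthetic data, and field names are all assumptions of this sketch:

```python
import random
from collections import defaultdict

# Simplified sketch of facet-aware stratified sampling: draw a fixed quota
# per MDA facet, then confirm that sampled sentiment still tracks the source
# star ratings via Pearson correlation. Data and quota scheme are synthetic.

def stratified_sample(reviews, per_facet, rng):
    """Group reviews by facet, then sample a quota from each group."""
    buckets = defaultdict(list)
    for r in reviews:
        buckets[r["facet"]].append(r)
    sample = []
    for facet, group in sorted(buckets.items()):
        sample.extend(rng.sample(group, min(per_facet, len(group))))
    return sample

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

rng = random.Random(0)
reviews = [
    {"facet": f, "rating": rng.uniform(1, 10)}
    for f in ("mechanics", "dynamics", "aesthetics") for _ in range(50)
]
for r in reviews:  # sentiment score loosely tracks the star rating
    r["sentiment"] = r["rating"] + rng.gauss(0, 1)

sample = stratified_sample(reviews, per_facet=20, rng=rng)
r_val = pearson([s["rating"] for s in sample], [s["sentiment"] for s in sample])
print(len(sample), round(r_val, 2))
```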

Operationalization of MDA

For each (rulebook, review) pair, a latent chain $\mathcal{Z} = \{z_{\text{mech}}, z_{\text{dyn}}, z_{\text{aest}}\}$ is distilled using Qwen3-235B and verified by GPT-5.1 to mitigate hallucinations. These chains are prepended during supervised fine-tuning as explicit thinking steps (delimited as "> …"), compelling the model to reason through MDA before generating critiques.

3. Persona Distillation and Annotation

Archetype Identification

A semi-automated clustering pipeline forms the foundation of persona modeling:

  • Reviews are embedded by concatenating sentiment tier, facet focus, and raw text, then processed via Qwen3-Embedding-8B and K-Means ($K = 15$).
  • For each cluster, central reviews are sampled and processed by GPT-5.1 to draft persona sketches.
  • Overlapping clusters are manually merged to yield five canonical archetypes:
    • System Purist: Favors control, dislikes randomness.
    • Efficiency Essentialist: Prioritizes fun/time optimization.
    • Narrative Architect: Values immersive storytelling and 4X dynamics.
    • Social Lubricator: Seeks party and interaction-driven experiences.
    • Thrill Seeker: Engages with risk, tension, and push-your-luck elements.

Persona Labeling

All reviews are meta-annotated with one of the five personas by a majority-vote among three GPT-5.1 inferences. During model training, the persona profile is encoded as a system instruction, directly influencing the Dynamics→Aesthetics reasoning pathway for accurate persona-aligned simulation.
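The majority-vote step reduces to taking the modal label over three inferences; the votes below are hard-coded stand-ins for GPT-5.1 outputs:

```python
from collections import Counter

# Sketch of the majority-vote labeling step: three independent classifier
# calls per review, and the modal persona wins. The vote lists here are
# hard-coded stand-ins for GPT-5.1 inferences.

PERSONAS = (
    "System Purist", "Efficiency Essentialist", "Narrative Architect",
    "Social Lubricator", "Thrill Seeker",
)

def majority_vote(votes):
    """Return the most common persona among the inference votes."""
    return Counter(votes).most_common(1)[0][0]

votes = ["Thrill Seeker", "Thrill Seeker", "System Purist"]
label = majority_vote(votes)
print(label)  # Thrill Seeker
```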

Clustering–Refinement Pseudocode

Embed ← QwenEmbed(review_text_with_meta)
Clusters ← KMeans(Embed, K=15)
for each cluster c in Clusters:
    samples ← top20(reviews in c by centrality)
    sketch_c ← GPT5.1_Profile(samples)
personas ← HumanMerge(sketch_1…sketch_15)  # merged to 5 archetypes
for each review r:
    persona_label[r] ← MajorityVote(GPT5.1_Classify(r, personas), n=3)

4. Experimental Design and Quantitative Results

MeepleLM is evaluated on a held-out test split of 207 games (including 34 unpublished at training time), stratified by complexity and average rating. For each game–persona pair, 100 simulated reviews are generated, reflecting real-world persona distributions. Baselines include GPT-5.1, Gemini3-Pro, Qwen3-235B (high-quality API mode), and an untuned Qwen3-8B.

Evaluation encompasses macro-level, micro-level, and utility-centric metrics:

| Model | MAE↓ | WD↓ | τ↑ | Fact.%↑ | Dist-2↑ | Div.↑ | Op-Rec↑ |
|---|---|---|---|---|---|---|---|
| MeepleLM | 0.6576 | 0.2205 | 0.2817 | 98.86 | 0.7117 | 4.34 | 69.77 |
| GPT-5.1 | 0.9874 | 0.9496 | 0.2555 | 99.46 | 0.6934 | 4.26 | 63.44 |
| Gemini3-Pro | 1.4277 | 0.5092 | 0.2465 | 98.28 | 0.6480 | 3.98 | 57.74 |
| Qwen3-235B | 1.2288 | 0.6350 | 0.1449 | 98.95 | 0.6572 | 3.56 | 54.27 |
| Qwen3-8B | 0.8906 | 1.0119 | 0.0492 | 97.88 | 0.5936 | 1.58 | 11.39 |

Metrics definitions include:

  • MAE: Mean absolute error of predicted vs. ground-truth ratings,
  • WD: Wasserstein distance between predicted and empirical rating distributions,
  • Kendall's τ: Rank correlation,
  • Fact.%: Fact-check accuracy (verified by Gemini3-Flash),
  • Dist-2: Distinct-2 bigram ratio for lexical variety,
  • Div.: Semantic diversity across MDA layers,
  • Op-Rec: Opinion Recovery Rate, the percentage of ground-truth viewpoints $V_{GT}$ recovered by generated critiques: $|V_{\text{matched}}| / |V_{GT}| \times 100\%$.
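Several of these metrics have simple closed forms. A pure-Python sketch over toy data (the ratings below are made up; in practice library implementations such as scipy's would be used):

```python
from itertools import combinations

# Pure-Python sketches of four reported metrics: MAE, 1-D Wasserstein
# distance, Kendall's tau, and Distinct-2. Toy data is invented.

def mae(pred, true):
    """Mean absolute error of predicted vs. ground-truth ratings."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(pred)

def wasserstein_1d(a, b):
    """For equal-size samples, W1 is the mean gap between sorted values."""
    a, b = sorted(a), sorted(b)
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def kendall_tau(pred, true):
    """Rank correlation: (concordant - discordant) / total pairs."""
    conc = disc = 0
    for i, j in combinations(range(len(pred)), 2):
        s = (pred[i] - pred[j]) * (true[i] - true[j])
        conc += s > 0
        disc += s < 0
    n = len(pred) * (len(pred) - 1) / 2
    return (conc - disc) / n

def distinct_2(tokens):
    """Ratio of unique bigrams to total bigrams (lexical variety)."""
    bigrams = list(zip(tokens, tokens[1:]))
    return len(set(bigrams)) / len(bigrams)

pred = [7.1, 6.0, 8.2, 5.5]
true = [7.5, 6.3, 7.9, 5.0]
print(round(mae(pred, true), 2), round(kendall_tau(pred, true), 2))
```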

Ablation studies indicate critical dependency on both MDA reasoning and persona conditioning: removal of MDA or persona reduces rank correlation (τ) and viewpoint diversity. Human A/B preference studies yield a MeepleLM win rate of 78.3% on familiar games and 74.2% on unfamiliar titles regarding authenticity, resonance, and risk warning (Li et al., 12 Jan 2026).

5. Applications and Generalization

MeepleLM functions as a low-cost, rapid virtual playtester in board game co-design, serving as an alternative to iterative human testing. Its persona-aligned critique aids designers targeting audience-specific gameplay adjustments (e.g., minimizing randomness for System Purists or intensifying social mechanics for Social Lubricators) and supports early-stage detection of negative user experiences.

The methodology is extendable to interactive systems beyond physical board games:

  • Video games: Simulating archetypal player reactions for iterative level design and balancing.
  • Educational or serious games: Tailoring feedback to learning-style persona profiles.
  • Multimodal extensions: Planned enhancements include integration of visual (component images, board layouts) and audiovisual modalities to enrich the Aesthetics reasoning dimension.

6. Implications for Human–AI Collaboration and Future Research

MeepleLM establishes a new paradigm for audience-aligned, empathy-aware AI critique in game design. By transforming static rulebooks into dynamic, persona-aware virtual playtesting agents, the model offers a scalable solution to capturing emergent user experience diversity. Future research avenues involve:

  • Modeling individual-level avatar players derived from longitudinal play traces,
  • Integration with procedural content generation (PCGML) for real-time, persona-driven design optimization,
  • Cross-modal grounding approaches to critique that evaluate not only gameplay rules but also art, user interface, and tactile components.

A plausible implication is the potential for MeepleLM to influence iterative design workflows across entertainment and educational domains, as well as to function as a prototype framework for experience-aware evaluation in other forms of interactive media (Li et al., 12 Jan 2026).
