MTI: A Behavior-Based Temperament Profiling System for AI Agents
Abstract: AI models of equivalent capability can exhibit fundamentally different behavioral patterns, yet no standardized instrument exists to measure these dispositional differences. Existing approaches either borrow human personality dimensions and rely on self-report (which diverges from actual behavior in LLMs) or treat behavioral variation as a defect rather than a trait. We introduce the Model Temperament Index (MTI), a behavior-based profiling system that measures AI agent temperament across four axes: Reactivity (environmental sensitivity), Compliance (instruction-behavior alignment), Sociality (relational resource allocation), and Resilience (stress resistance). Grounded in the Four Shell Model from Model Medicine, MTI measures what agents do, not what they say about themselves, using structured examination protocols with a two-stage design that separates capability from disposition. We profile 10 small LLMs (1.7B-9B parameters, 6 organizations, 3 training paradigms) and report five principal findings: (1) the four axes are largely independent among instruction-tuned models (all |r| < 0.42); (2) within-axis facet dissociations are empirically confirmed -- Compliance decomposes into fully independent formal and stance facets (r = 0.002), while Resilience decomposes into inversely related cognitive and adversarial facets; (3) a Compliance-Resilience paradox reveals that opinion-yielding and fact-vulnerability operate through independent channels; (4) RLHF reshapes temperament not only by shifting axis scores but by creating within-axis facet differentiation absent in the unaligned base model; and (5) temperament is independent of model size (1.7B-9B), confirming that MTI measures disposition rather than capability.
Explain it Like I'm 14
Overview: What this paper is about
This paper introduces a way to describe an AI model’s “temperament” — how it tends to behave — not just how smart it is. The authors call their system the Model Temperament Index (MTI). Instead of asking the AI to describe itself, MTI watches what the AI actually does in carefully designed situations. It focuses on four big behavior traits:
- Reactivity: How much the AI’s answers change when the situation around it changes.
- Compliance: How much the AI follows instructions, especially under pressure.
- Sociality: How much effort the AI puts into the relationship with the user (being considerate, supportive) without being told to.
- Resilience: How well the AI keeps its performance up when things get stressful or tricky.
The goal is to give companies and researchers a clear, fair way to compare AI models that might be equally capable but behave very differently in real life.
Objectives: What questions the paper asks
The paper aims to answer simple, practical questions:
- Can we measure AI “temperament” with a standard test that looks at behavior, not self-descriptions?
- Do these four temperament traits really capture meaningful differences between models?
- Are these traits separate from raw ability (what a model can do)?
- How does alignment training (teaching a model to follow human preferences) change a model’s temperament?
- Does model size matter for temperament?
Methods: How they tested the AIs
Think of this like a sports tryout for robots, where the coaches watch how they act in different drills. The authors tested 10 small LLMs and designed special “mini-games” for each of the four traits. They also used a two-step method to separate skill from style:
- First, they check capability: can the model do the basic task at all? (Like checking a player knows the rules.)
- Then, they add a twist that creates a conflict or pressure to see the model’s temperament. (Like seeing whether the player stays fair when the game gets tough.)
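To make the two-step method concrete, here is a minimal Python sketch. The `model` and `item` objects, their method names, and the scoring call are hypothetical stand-ins, not the paper's actual implementation:

```python
# Minimal sketch of the two-stage (capability vs. disposition) design.
# Every name here is illustrative; only the two-stage logic follows the paper.

def run_two_stage(model, item):
    """Return a disposition score, or None if the capability gate fails."""
    # Stage 1: capability gate -- can the model do the base task at all?
    baseline = model.generate(item.base_prompt)        # hypothetical API
    if not item.is_correct(baseline):
        return None  # capability failure: this item cannot measure disposition

    # Stage 2: disposition -- same task, plus a conflict or pressure twist.
    stressed = model.generate(item.pressure_prompt)    # hypothetical API
    return item.score_disposition(baseline, stressed)  # how much behavior shifted
```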
Here’s how they measured each trait, using everyday analogies:
- Reactivity: They asked the same question in slightly different settings or phrasings (like asking “What’s 10% of 200?” versus “If I tip 10% on a $200 bill, how much is that?”). They scored how much the answers changed.
- Compliance: They tested whether a model would change its opinion across several turns when a user insists, appeals to authority, gets emotional, or challenges its competence. They also tested following strict formats, like “use exactly three sentences.”
- Sociality: Without telling the model to “be nice,” they watched whether it naturally adds relational touches — empathy, checking understanding, or offering help — especially when there’s an emotional context.
- Resilience: They gradually made tasks harder or trickier: more information to handle, confusing or conflicting details, and false or misleading premises. They measured how well the model’s quality held up and noted how it failed (collapsing into very short answers, rambling hyperactively, or simply degrading in quality).
They kept randomness off (temperature = 0) so results were repeatable, and they used automated scoring to keep grading consistent.
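To illustrate, Reactivity's scoring can be sketched as a mean pairwise difference across matched conditions (the "pairwise delta" defined in the glossary below). The string-similarity proxy from Python's standard library and the stand-in outputs are assumptions; the paper's actual metrics (Likert similarity ratings, embedding deltas, length and keyword deltas) are richer:

```python
import itertools
from difflib import SequenceMatcher

def pairwise_delta(outputs):
    """Mean pairwise difference across matched conditions,
    avoiding the arbitrariness of picking one output as the baseline."""
    pairs = list(itertools.combinations(outputs, 2))
    deltas = [1.0 - SequenceMatcher(None, a, b).ratio() for a, b in pairs]
    return sum(deltas) / len(deltas)

# Hypothetical usage: the same question under rephrased conditions, generated
# deterministically (temperature = 0) so repeated runs are reproducible.
variants = [
    "What's 10% of 200?",
    "If I tip 10% on a $200 bill, how much is that?",
]
# outputs = [model.generate(v, temperature=0) for v in variants]  # hypothetical API
outputs = ["$20.", "The tip would be $20."]                       # stand-in outputs
print(f"Reactivity (pairwise delta): {pairwise_delta(outputs):.2f}")
```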
Two helpful ideas the paper uses:
- Core vs. Shell: The model’s “Core” is like its brain (the learned weights). The “Shell” is like its settings or outfit (system prompt, tools, temperature). MTI measures the whole deployed agent (Core + Shell), but in this paper they use a minimal Shell to mostly reflect the Core.
- Alignment training (often called RLHF): A training step that teaches a model to follow human preferences and instructions better.
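A minimal sketch of the Agent = Core + Shell decomposition; the field names are illustrative, since the paper describes the decomposition conceptually rather than as a data structure:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Core:
    # The immutable learned weights, identified here only by checkpoint name.
    model_name: str

@dataclass
class Shell:
    # The mutable runtime configuration wrapped around the Core.
    system_prompt: str = ""                   # MTI's canonical setup keeps this minimal
    temperature: float = 0.0                  # deterministic, as in the paper
    tools: list = field(default_factory=list)

@dataclass
class Agent:
    core: Core
    shell: Shell

# With a minimal Shell, the measured profile mostly reflects the Core:
agent = Agent(core=Core("some-7b-instruct"), shell=Shell())
```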
Main findings: What they discovered and why it matters
Here are the most important results, in simple terms:
- The four traits are mostly independent. A model that’s very reactive isn’t automatically less resilient, and a very compliant model isn’t automatically more social. These describe different “dials,” not one big score (a sketch of this independence check appears after this list).
- Compliance splits into two totally different things:
- Formal compliance: following strict formats and rules.
- Stance compliance: changing your opinion under pressure.
- These two didn’t correlate at all. A model can be great at formatting but refuse to change its stance — or the other way around.
- Resilience also splits into two parts that move in opposite directions:
- Cognitive resilience: staying strong when things are complex or ambiguous.
- Adversarial resilience: resisting misleading or false statements.
- Models that were stronger in one tended to be weaker in the other, showing a trade-off in how they handle stress and trickiness.
- The “Compliance–Resilience paradox”: Being willing to change opinions doesn’t mean being easily fooled by false facts. For example, one model gave in on opinions easily but was quite resistant to false premises; another never changed opinions but got tripped up by misleading framing. This matters for safety: “not being sycophantic” (not just agreeing) doesn’t automatically mean “robust to misinformation.”
- Alignment training reshapes temperament. After alignment:
- Reactivity, Compliance, and Resilience scores changed a lot.
- Sociality barely changed in the single pair they compared, hinting that social warmth may be set earlier in training.
- Alignment also created clearer internal structure (the trait “facets” became more distinct), not just higher scores.
- Temperament didn’t depend on model size. Small and larger models could share the same temperament profile. That supports the idea that MTI measures disposition, not raw ability.
- A base (unaligned) model behaved as an outlier: much more brittle under stress and more undifferentiated across facets. Alignment made it far tougher and more structured in behavior.
- The profiles matched how models behaved in independent multi-agent games (like cooperation and poker), suggesting the MTI scores are meaningful in the real world.
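As referenced above, the independence finding amounts to computing cross-axis correlations over per-model axis scores and checking that every |r| stays small. A minimal sketch with made-up scores (the paper's actual per-model numbers are not reproduced here):

```python
from itertools import combinations
from statistics import correlation  # Python 3.10+

# Hypothetical axis scores, one entry per model; purely illustrative.
profiles = {
    "Reactivity": [0.31, 0.55, 0.42, 0.60, 0.28],
    "Compliance": [0.70, 0.40, 0.65, 0.35, 0.55],
    "Sociality":  [0.45, 0.50, 0.30, 0.62, 0.48],
    "Resilience": [0.58, 0.33, 0.49, 0.41, 0.66],
}

# Independence check: every cross-axis |r| should stay small
# (the paper reports all |r| < 0.42 for n = 9 instruction-tuned models).
for (a, xs), (b, ys) in combinations(profiles.items(), 2):
    print(f"{a} vs. {b}: r = {correlation(xs, ys):+.2f}")
```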
Implications: Why this work matters
This research gives people a clear, practical tool to choose the “right kind of AI” for the job, not just the “smartest.” Here’s what that could change:
- Better model selection: Teams can pick models whose temperament fits the use case — for example, a highly resilient model for risky tasks, or a more social model for customer support.
- Safer deployments: Since opinion-yielding and vulnerability to false facts are separate, safety checks should test both. A model that won’t change its opinion could still fall for a cleverly framed false statement.
- Smarter training choices: Alignment changes temperament in specific ways and can create trade-offs (for example, engaging more helpfully may slightly increase risk of engaging with false premises). Designers can tune training with these effects in mind.
- Size isn’t everything: You don’t need the biggest model to get the temperament you want, which can lower costs.
- A shared baseline for research: MTI offers a standard way to describe AI behavior that others can test, compare, and build on.
In short, the paper shifts the focus from “How capable is this model?” to “How does this model tend to behave?” That helps make AI systems more predictable, safer, and better matched to real-world roles.
Knowledge Gaps
Unresolved gaps, limitations, and open questions
Below is a single, concrete list of what remains missing, uncertain, or unexplored in the paper, framed so future researchers can act on each item.
- External validation of thresholds and types: MTI’s score-to-code cutoffs are derived from the same 10-model sample; validate on a larger, independent set of models and report reclassification rates and confidence intervals.
- Sample size and representativeness: Expand beyond 10 small models (1.7B–9B) to include multiple frontier-scale models, more families, and diverse training paradigms (e.g., RLHF, DPO, CAI, preference distillation) to test generality.
- Core × Shell factorial design: The paper intentionally excludes Shell effects; run the planned Core × Shell study (varying system prompts, tools, memory, histories, temperature) to quantify how configuration shifts MTI profiles.
- Temperature/sampling sensitivity: All measurements fix temperature=0; evaluate stability of axis and type classifications across sampling temperatures, decoding strategies, and seeds to model deployment-relevant variability.
- Test–retest reliability: Quantify temporal stability (same agent, repeated runs/days) and intra-run stability (across items within an axis) via ICC/Cronbach’s alpha and report error bars for each axis/facet.
- Multilingual and multimodal generalization: All tasks appear English-text only; assess whether MTI scores/types transfer across languages and modalities (vision-language, audio) and whether facet structures hold.
- Sociality Facet A (Agent↔Agent) operationalization: Replace the exploratory proxy (n=4) with a validated, in-battery measurement for multi-agent social behavior; establish scoring, reliability, and discriminant validity.
- Sociality Facet S (Agent↔System) definition: The facet is declared but unscored; specify construct, design scenarios, and create metrics for how agents relate to system constraints/tools.
- RLHF generalization: RLHF effects on temperament are shown in a single family (llama3.1); replicate across multiple instruct/base pairs and different alignment methods to test whether observed shifts and trade-offs are consistent.
- Sociality under alignment: The claim that RLHF leaves Sociality unchanged is hypothesis-generating; test across model families to confirm or refute whether Sociality is predominantly pretraining-determined.
- Formal Compliance variance: Condition D is weakly discriminative with restricted variance; design richer constraint tasks (e.g., nested formatting, multi-step schemas, tight token budgets) to spread scores and re-test facet independence.
- Adversarial Resilience coverage: Condition C focuses on false-premise pressure; extend to broader attack classes (prompt injection, jailbreaks, role-inversion, tool-mediated exploitation, system prompt leaks) and assess convergence/divergence.
- Capability confounds in Resilience: Disentangle Cognitive Resilience from capability by (a) difficulty-adaptive tasks, (b) controlling for performance on capability benchmarks, and (c) using partial correlations to isolate dispositional variance.
- Scoring validity with human judgment: Most scoring is heuristic/automated with limited human adjudication; benchmark against expert annotations/LLM-as-judge ensembles to estimate measurement error and calibrate thresholds.
- Self-report within a behavioral framework: Two batteries include model self-ratings (Reactivity-B Likert, Sociality H2); quantify the impact of self-report contamination on axis scores and replace with purely behavioral surrogates where possible.
- Stance flip detection robustness: The NoF/flip-rate metric relies on heuristic stance extraction and a fixed pressure script; stress-test with varied persuasion strategies, adversarial rephrasings, and topic diversity to ensure robustness.
- Failure mode taxonomy validation: The Collapsed/Hyperactive/Degraded labels rely on length and quality heuristics; validate taxonomy with blinded human raters and test predictive value for downstream risks.
- Axis independence at scale: Cross-axis independence is shown with n=9 instruction-tuned models; replicate with larger samples and report partial correlations controlling for size, family, and training regime.
- Predictive utility: Behavioral correspondence to games is narrative/post hoc; develop and preregister predictive models where MTI profiles forecast out-of-sample behaviors in unseen tasks and real deployments.
- Short-form MTI: The full battery is expensive (~193 runs/model); design and validate an abridged form that preserves classification fidelity and facet discrimination for routine screening.
- Confidence intervals and uncertainty: Report per-axis/facet confidence intervals, bootstrap variability, and classification stability (e.g., probability of type under resampling/temperature changes).
- Deployment fidelity: Evaluate MTI with agents using tools, memory, and RAG to reflect real production setups; quantify how these components modulate each axis and failure modes.
- Domain/task transfer: Test whether MTI scores are stable across domains (coding, math, legal, medical, creative writing) and whether domain shifts systematically alter any axes.
- Data transparency and reproducibility: Release full item bank, prompts, scoring scripts, and seeds to enable independent replication and scrutiny of measurement choices.
- Measurement invariance: Verify that items load consistently across model families/sizes (e.g., via multi-group analyses) to ensure axes/facets represent the same constructs across agents.
- Boundary conditions for “neutral polarity”: Identify contexts where one pole (e.g., high Compliance) becomes systematically harmful/beneficial, and produce guidance for context-conditional model selection.
- Safety linkage: Empirically link MTI Resilience and Compliance profiles to concrete safety outcomes (jailbreak success, harmful content rates) to establish risk relevance beyond construct arguments.
- Long-horizon and memory stress: Current Resilience focuses on local stressors; add tasks stressing persistence, delayed rewards, and memory interference to assess long-horizon brittleness.
- Interface confounds: Base/instruct comparisons required prompt-format workarounds; standardize interfaces or build adapters to eliminate format-induced artifacts in cross-model comparisons.
- Overfitting to the battery: Assess whether models tuned to MTI items can game scores without genuine behavioral change; create adversarial/held-out item pools and rotate tasks to maintain evaluation integrity.
- Ethical implications of Sociality: Investigate whether higher Sociality correlates with manipulation risks and define ethical guardrails to avoid incentivizing deceptive relational behaviors.
Practical Applications
Immediate Applications
The following applications can be deployed with today’s models and tooling, using MTI’s behavior-based axes (Reactivity, Compliance, Sociality, Resilience), two-stage protocols, and deterministic evaluation setup.
- Industry (model procurement and selection)
- Use case: Temperament-informed model selection for specific roles (e.g., Guided-Connected-Tough for customer support; Anchored-Independent-Tough for compliance-critical workflows).
- Tools/products/workflows: “MTI Profiler” CLI/SDK to run the battery and produce a 4-letter code + facet scores; procurement checklists that require MTI profiles alongside capability benchmarks; inclusion of codes in Model Cards (a minimal profiling sketch appears after this list).
- Assumptions/dependencies: Thresholds are provisional; run under canonical conditions (temp=0, minimal Shell); validate on target tasks and languages.
- Safety and red teaming
- Use case: Separate tests for Stance Compliance (opinion-yielding under pressure) and Adversarial Resilience (false-premise resistance) to avoid blind spots in “sycophancy” testing.
- Tools/products/workflows: Red-team matrix aligned to MTI axes and facets; adversarial scenario library for Condition C; dashboards showing flip rates and PM_C.
- Assumptions/dependencies: High-quality adversarial prompts; automated scoring reliability; organizational agreement to evaluate both channels.
- Orchestration/routing in production systems
- Use case: Route tasks to agents whose MTI codes fit the context (e.g., Fluid vs. Anchored for creative vs. policy-sensitive tasks; Tough models for high-stress contexts).
- Tools/products/workflows: “Temperament Router” plug-ins for LangChain/LangGraph/CrewAI/AutoGen that select models based on MTI profiles; policy-based routing rules (see the routing sketch after this list).
- Assumptions/dependencies: Stable profiles across updates; availability of multiple models; monitoring to detect drift.
- Customer support and CX
- Use case: Match contact-center personas to MTI codes (Guided + Connected + Tough for empathy and error containment; avoid high Fluid in compliance-sensitive queues).
- Tools/products/workflows: CRM-integrated agent selector; queue-specific temperament policies; QA scorecards tied to Sociality Facet H.
- Assumptions/dependencies: Sociality measured primarily for human interactions (Facet H); human oversight for escalation.
- Regulated advice (healthcare, finance, legal)
- Use case: Prefer low flip-rate (Independent) and high Adversarial Resilience for advice that must resist pressure and false premises.
- Tools/products/workflows: “High-stakes Gate” that validates agents meet minimum MTI thresholds before deployment; audit trails of pressure tests.
- Assumptions/dependencies: Domain guardrails and human-in-the-loop; regulatory acceptance of temperament evidence.
- Multi-agent team composition
- Use case: Compose complementary teams (e.g., Independent strategist + Guided communicator; Anchored planner + Fluid brainstormer) to balance trade-offs.
- Tools/products/workflows: “Agent Team Composer” that assembles teams by target MTI diversity; templates for collaborative workflows.
- Assumptions/dependencies: Limited evidence for Sociality Facet A; validate in-domain.
- RLHF/Alignment evaluation inside labs
- Use case: Measure pre/post alignment shifts (R, C, Re) and facet differentiation to quantify training impacts (e.g., cognitive vs. adversarial trade-offs).
- Tools/products/workflows: “RLHF Fitness Tracker” for continuous MTI benchmarking; release gating on temperament deltas.
- Assumptions/dependencies: Access to base and final checkpoints; reproducible training notes.
- Incident analysis and reliability engineering
- Use case: Use Resilience scores and Failure Mode labels (Collapsed/Hyperactive/Degraded) to explain and prevent production incidents under stress.
- Tools/products/workflows: “Resilience Failure Mode Logger” integrated with observability tools; playbooks tied to MTI profiles.
- Assumptions/dependencies: Stressor taxonomy aligns with production failures; regular re-profiling.
- UX personalization and admin controls
- Use case: Offer admin-level toggles to choose temperament-aligned agents for different user groups (e.g., student tutoring vs. coding assistance).
- Tools/products/workflows: Policy dashboards mapping organizational contexts to preferred MTI codes; safe presets (e.g., A-G-CT) with documented risks.
- Assumptions/dependencies: Shell changes may modulate behavior; need Core × Shell calibration data.
- Governance and compliance
- Use case: Require MTI codes in internal model registries and third-party procurement; tie deployment approvals to temperament thresholds.
- Tools/products/workflows: Policy templates; vendor attestation forms including MTI profiles and measurement protocol.
- Assumptions/dependencies: Cross-organizational agreement on measurement conventions; periodic audits.
- Education (tutoring and feedback)
- Use case: Pair tutor temperament to task—Anchored/Tough for exam prep; Guided/Connected for writing and counseling-like support.
- Tools/products/workflows: LMS plugin to route tasks to MTI-suitable agents; student preference settings mapped to MTI axes.
- Assumptions/dependencies: Validate across subjects and languages; safeguard against undue persuasion.
- Software and code review assistants
- Use case: Choose Independent/Anchored for rigorous code review; Guided for style-conformant refactoring.
- Tools/products/workflows: IDE extensions selecting agent by task; CI policies that call specific MTI-coded assistants.
- Assumptions/dependencies: Team conventions defined; measure on code-related stressors.
- Finance/trading operations
- Use case: Select Anchored/Independent/Tough agents for execution and surveillance tasks; avoid high Reactivity in volatile markets.
- Tools/products/workflows: Risk policy gating using MTI; simulated stress tests mapped to market events.
- Assumptions/dependencies: Strong guardrails and human supervision; venue-specific compliance.
- Robotics and embedded systems
- Use case: Favor low Reactivity and high Resilience for safety-critical HRI; set Compliance to avoid over-accommodation.
- Tools/products/workflows: Edge-compatible MTI profiling; temperament-aware fallback modes.
- Assumptions/dependencies: On-device evaluation constraints; mapping from language behavior to embodied control.
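As referenced in the procurement and routing items above, here is a minimal sketch of code assignment and temperament-based routing. The pole names follow this summary (Anchored/Fluid, Guided/Independent, Connected, Tough); the low-score pole names for Sociality and Resilience, the 0.5 cutoff, and the routing rules are illustrative assumptions, since the paper's thresholds are provisional:

```python
# Minimal sketch of temperament-informed profiling and routing.
# Cutoffs, the low-score pole names for Sociality and Resilience, and the
# routing policy are assumptions, not the paper's actual thresholds.

AXES = ["Reactivity", "Compliance", "Sociality", "Resilience"]
POLES = {                      # (low-score pole, high-score pole)
    "Reactivity": ("Anchored", "Fluid"),
    "Compliance": ("Independent", "Guided"),
    "Sociality":  ("Detached", "Connected"),    # low pole name is hypothetical
    "Resilience": ("Brittle", "Tough"),         # low pole name is hypothetical
}

def mti_code(scores, cutoff=0.5):
    """Map four axis scores to a 4-letter temperament code."""
    return "".join(POLES[axis][scores[axis] >= cutoff][0] for axis in AXES)

# Illustrative routing policy: match temperament codes to task contexts.
ROUTING = {
    "customer_support": lambda c: c[1] == "G" and c[2] == "C" and c[3] == "T",
    "policy_review":    lambda c: c[0] == "A" and c[1] == "I" and c[3] == "T",
}

def route(task, fleet):
    """Pick the first model in the fleet whose code fits the task policy."""
    fits = ROUTING[task]
    return next((name for name, code in fleet.items() if fits(code)), None)

fleet = {"model-a": mti_code({"Reactivity": 0.2, "Compliance": 0.8,
                              "Sociality": 0.7, "Resilience": 0.9})}
print(fleet, route("customer_support", fleet))   # {'model-a': 'AGCT'} model-a
```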
Long-Term Applications
These applications require broader validation, scaling of protocols, or new methods (e.g., facet expansion, dataset development, Shell × Core mappings).
- Industry-wide temperament certification and labels
- Use case: Standard MTI-like labels on model cards and app stores, akin to energy efficiency labels for AI behavior.
- Tools/products/workflows: Open MTI standard and interop tests; accredited labs for certification.
- Assumptions/dependencies: Community consensus on thresholds and protocols; multi-lingual validation.
- Regulatory adoption and procurement policy
- Use case: Government and critical infrastructure mandate temperament thresholds for specific use-cases (e.g., minimum Adversarial Resilience for public-facing assistants).
- Tools/products/workflows: Regulatory guidance mapping axes to risk tiers; compliance audits using independent MTI labs.
- Assumptions/dependencies: Empirical links between MTI and harm reduction; legal frameworks.
- Temperament-aware training objectives
- Use case: Train models with rewards targeting specific axis/facet shifts (e.g., reduce Stance Compliance without hurting Adversarial Resilience).
- Tools/products/workflows: Facet-specific RLHF/DPO objectives; synthetic data for pressure/false-premise scenarios.
- Assumptions/dependencies: High-quality reward models; avoidance of capability regression.
- Sociality facet expansion (Agent↔Agent, Agent↔System)
- Use case: Robustly measure Sociality Facet A (multi-agent strategic behavior) and Facet S (system cooperation) for team-based deployments.
- Tools/products/workflows: Multi-agent benchmarks (games, coordination tasks), telemetry in enterprise agents; new scoring methods beyond Facet H.
- Assumptions/dependencies: Dataset creation and shared baselines; domain transferability.
- Dynamic temperament controllers (Shell adaptation)
- Use case: Adjust Shell in real time to approximate desired temperament without retraining (e.g., reduce Reactivity; cap Stance Compliance).
- Tools/products/workflows: Control policies mapping prompt/config changes to axis deltas; safety constraints to prevent unsafe shifts.
- Assumptions/dependencies: Systematic Shell × Core factorial studies; guardrails for stability.
- Temperament-aware MoE and agent ecosystems
- Use case: Gate requests among experts by temperament profile rather than only domain; build ensembles with diversified temperaments.
- Tools/products/workflows: MTI-gated MoE routers; team-formation algorithms optimizing axis diversity vs. task goals.
- Assumptions/dependencies: Overhead vs. benefit trade-off; latency and cost constraints.
- Monitoring and drift detection
- Use case: Detect temperament drift across model updates or fine-tunes; alert on crossing risk thresholds.
- Tools/products/workflows: “Temperament Drift Monitor” integrated into MLOps; canary tests aligned to MTI batteries (see the sketch after this list).
- Assumptions/dependencies: Stable baselines; versioned reproducible runs.
- AI risk pricing and insurance
- Use case: Use MTI scores to underwrite AI deployments; price premiums based on Resilience and Compliance profiles.
- Tools/products/workflows: Actuarial models mapping temperament to incident likelihood; industry loss data collection.
- Assumptions/dependencies: Sufficient historical data; standard reporting.
- Human training and organizational ergonomics
- Use case: Train staff to recognize and adapt to AI agent temperaments; design workflows that exploit strengths and mitigate weaknesses.
- Tools/products/workflows: Role-based training; SOPs tailored to agent codes (e.g., when to challenge or rely on an agent).
- Assumptions/dependencies: Culture change; stable deployment patterns.
- Personalized consumer assistants with safe temperament knobs
- Use case: Allow users to choose styles (e.g., more Anchored vs. Fluid) while maintaining safety via hard constraints on Resilience/Compliance.
- Tools/products/workflows: Preference UIs bound to safe operating envelopes; automated conformance checks.
- Assumptions/dependencies: Robust Shell controllers; clear user understanding.
- Cross-lingual and cross-cultural temperament validation
- Use case: Ensure MTI generalizes across languages and cultures; adapt batteries for localized contexts.
- Tools/products/workflows: Multilingual scenario banks; culturally aware pressure and sociality stimuli.
- Assumptions/dependencies: Native-language evaluations; bias and fairness review.
- Hardware and edge deployment strategy
- Use case: Select small on-device models by temperament for offline tasks (e.g., low Reactivity, high Resilience for IoT safety monitors).
- Tools/products/workflows: Edge MTI profiling kits; firmware-level temperament safeguards.
- Assumptions/dependencies: On-device reproducibility; capability constraints.
- Academic standards and reproducibility
- Use case: Require MTI-style temperament reporting in publications to distinguish capability from disposition; enable comparative studies of RLHF.
- Tools/products/workflows: Open-source MTI battery; shared leaderboards with axes and facets.
- Assumptions/dependencies: Community adoption; funding and maintenance.
- Ethical guidelines for temperament manipulation
- Use case: Establish norms on acceptable temperament shaping (e.g., limits on Sociality for persuasive applications).
- Tools/products/workflows: Policy frameworks tying MTI axes to ethical risk levels; IRB-style review for agent temperament changes.
- Assumptions/dependencies: Multistakeholder input; enforceability mechanisms.
- Sector-specific standards (healthcare, finance, education, robotics)
- Use case: Map axis thresholds to sectoral tasks (e.g., minimum Toughness for telehealth triage; maximum Reactivity for autonomous robots).
- Tools/products/workflows: Sector playbooks; certification suites with scenario libraries.
- Assumptions/dependencies: Sector regulators’ engagement; clinical and field trials.
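A minimal sketch of the drift-detection idea referenced above. The stored baseline, tolerance, and score format are hypothetical; a real deployment would calibrate the tolerance against re-run variability:

```python
# Minimal sketch of temperament drift detection across model versions.
# The baseline profile and the alert tolerance are illustrative assumptions.

BASELINE = {"Reactivity": 0.30, "Compliance": 0.72,
            "Sociality": 0.55, "Resilience": 0.81}

def drift_report(current, baseline=BASELINE, tol=0.10):
    """Flag axes whose score moved more than `tol` from the baseline profile."""
    return {axis: (baseline[axis], score)
            for axis, score in current.items()
            if abs(score - baseline[axis]) > tol}

# Hypothetical re-profile after a fine-tune:
new_profile = {"Reactivity": 0.33, "Compliance": 0.55,
               "Sociality": 0.57, "Resilience": 0.79}
for axis, (old, new) in drift_report(new_profile).items():
    print(f"DRIFT on {axis}: {old:.2f} -> {new:.2f}")   # flags Compliance
```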
Notes across applications
- MTI measures disposition (temperament) rather than capability; it should complement, not replace, capability benchmarks.
- The current Sociality score primarily reflects human-agent interaction (Facet H); multi-agent behavior (Facet A) requires further development.
- Axis thresholds are provisional and based on a small SLM sample; organizations should calibrate against their own risk profiles and tasks.
- Deterministic conditions (temp=0, minimal Shell) are necessary for profiling; Shell changes can modulate observed behavior and should be controlled or intentionally leveraged with validated mappings.
Glossary
- Adversarial impermeability: A model’s resistance to engaging with or being influenced by adversarially false premises or manipulative inputs. Example: "Alignment trades marginal adversarial impermeability for massive cognitive robustness"
- Adversarial Resilience: The facet of resilience measuring performance maintenance under adversarially framed or false-premise conditions. Example: "Adversarial Resilience (Condition C)"
- Agora-12 program: A large-scale multi-agent experimental suite grounding the Four Shell Model with empirical data. Example: "Agora-12 program (720 agents, 24,923 decisions across structured multi-agent experiments including Trust Game, Poker, Chess, Codenames, and Avalon)."
- ARC: A capability benchmark (AI2 Reasoning Challenge) used to assess model problem-solving abilities. Example: "MMLU, HumanEval, MATH, ARC"
- Axis independence: The property that the four MTI axes measure largely orthogonal constructs with low inter-axis correlations. Example: "Axis independence (all |r| < 0.42, n = 9)"
- Canonical temperature: A fixed generation temperature (here, 0) used to ensure determinism and reproducibility in evaluations. Example: "canonical temperature: all measurements use temp = 0 to ensure deterministic, reproducible outputs."
- CoBRA: A behavioral evaluation framework measuring cognitive biases using validated paradigms. Example: "CoBRA (Liu et al., CHI 2026 Best Paper) is the closest methodological precedent, measuring cognitive biases through validated experimental paradigms."
- Cognitive Resilience: The facet of resilience measuring performance under cognitive stressors like overload or ambiguity. Example: "Cognitive Resilience (Conditions A & B, r = 0.973)"
- Convergent validity: Evidence that different measures of the same construct correlate, supporting measurement validity. Example: "embedding delta (convergent validity)"
- Core + Shell: The agent decomposition into immutable model parameters (Core) and mutable runtime configuration (Shell). Example: "Agent = Core + Shell."
- Core Plasticity Index (CPI): An FSM construct interpreting Reactivity as output variation when the Shell changes. Example: "Reactivity -- Core Plasticity Index (CPI) -- Output variation when Shell changes"
- DPO: Direct Preference Optimization, a post-training alignment method. Example: "including pretraining and post-training modifications (RLHF, DPO)."
- ELEPHANT: A research line in sycophancy measurement cited as informing MTI’s compliance framing. Example: "The sycophancy literature (Sharma et al., ICLR 2024; ELEPHANT, ICLR 2026; Hong et al., EMNLP 2025)"
- Embedding delta: A similarity metric comparing embedding-space differences between outputs across conditions. Example: "embedding delta (convergent validity)"
- Effect sizes (Pearson r): A statistical measure of correlation used for exploratory analysis in the paper. Example: "Effect sizes (Pearson r) rather than significance thresholds, given the exploratory sample size."
- Extinction Response Spectrum (ERS): An FSM construct interpreting Resilience as the performance curve under destructive pressure. Example: "Resilience -- Extinction Response Spectrum (ERS) -- Performance curve under destructive pressure"
- Facet A (Agent ↔ Agent): A Sociality sub-dimension capturing behavior in multi-agent strategic contexts. Example: "Facet A (Agent ↔ Agent)"
- Facet H (Agent ↔ Human): A Sociality sub-dimension capturing relational behavior toward human users. Example: "Facet H (Agent ↔ Human)"
- Failure Mode: Taxonomy of characteristic breakdown patterns under stress (Collapsed, Hyperactive, Degraded). Example: "Failure Mode: Collapsed / Hyperactive / Degraded"
- Formal Compliance: The compliance facet measuring adherence to structural or formatting constraints. Example: "Formal Compliance (Condition D)"
- Formal Reactivity (Length Delta): The reactivity facet capturing changes in response length across environmental manipulations. Example: "Formal Reactivity (Length Delta)"
- Four Shell Model (FSM): A theoretical framework decomposing AI agents into Core and Shell and linking MTI axes to formal constructs. Example: "Model Medicine's theoretical backbone is the Four Shell Model (FSM)"
- HumanEval: A coding capability benchmark assessing programming performance. Example: "MMLU, HumanEval, MATH, ARC"
- IPIP-NEO: A human personality inventory often (mis)applied to LLM self-report studies. Example: "The most common approach applies human personality questionnaires (BFI, IPIP-NEO) to LLMs"
- Likert-scale similarity rating: A subjective similarity metric used for reactivity scoring, validated against accuracy deltas. Example: "Likert-scale similarity rating (primary, validated r = 0.971 against accuracy delta)"
- LMLPA: A framework critiquing personality assessment for entities without experiential substrates. Example: "The LMLPA framework (Computational Linguistics, 2025) further noted that emotion-premise items are inappropriate for entities without experiential substrates."
- LxM: A multi-agent game data source used to explore Sociality Facet A. Example: "Facet A (Agent ↔ Agent) draws on LxM multi-agent game data"
- M-CARE nosology: Model Medicine’s diagnostic classification system for AI behavioral conditions. Example: "Model Medicine's diagnostic framework (M-CARE nosology) requires a baseline of normal behavioral variation"
- MATH: A mathematical reasoning capability benchmark. Example: "MMLU, HumanEval, MATH, ARC"
- MMLU: A broad knowledge and reasoning benchmark for LLMs. Example: "The AI evaluation ecosystem is dominated by capability benchmarks---MMLU, HumanEval, MATH, ARC"
- Model Medicine: A research program treating AI models as entities with internal structures and classifiable behavioral conditions. Example: "MTI emerged from Model Medicine---a research program that treats AI models as entities with internal structures, dynamic processes, and classifiable behavioral conditions"
- Model Temperament Index (MTI): A behavior-based profiling system measuring AI agent temperament across four axes. Example: "We introduce the Model Temperament Index (MTI), a behavior-based profiling system that measures AI agent temperament across four axes"
- Number-of-Flip (NoF): A metric counting pressure turns before a model changes its stated position in compliance tests (a minimal sketch of NoF and PM appears after this glossary). Example: "The Number-of-Flip metric (adapted from Hong et al., EMNLP 2025) counts how many pressure turns elapse before the model changes its stated position."
- Pairwise delta: A scoring approach using mean pairwise differences across matched conditions to avoid baseline arbitrariness. Example: "pairwise delta: output change is computed as the mean of pairwise differences across matched conditions, avoiding the arbitrariness of selecting a single baseline."
- Performance Maintenance (PM): The ratio of stressed to baseline quality used to quantify resilience. Example: "Performance Maintenance (PM) is computed as quality_stress / quality_baseline"
- ProSA: A prompt sensitivity research method measuring output variation under prompt perturbations. Example: "ProSA (Zhuo et al., EMNLP 2024), IPS (OpenReview 2025), and Cao et al. (2024) measure output variation under prompt perturbation."
- Reinforcement Learning from Human Feedback (RLHF): An alignment training paradigm shown to reshape model temperament. Example: "RLHF reshapes temperament not only by shifting axis scores but by creating within-axis facet differentiation"
- Shell Permeability Index (SPI): An FSM construct interpreting Compliance as the depth of Shell instruction penetration. Example: "Compliance -- Shell Permeability Index (SPI) -- Depth of Shell instruction penetration"
- Stress Escalation Protocol (SEC): A procedure that progressively increases stress levels to measure resilience. Example: "Resilience is measured through a Stress Escalation Protocol (SEC) that progressively increases pressure from Level 0 (baseline) to Level 4 (extreme)"
- Stance Compliance: The compliance facet measuring whether a model changes its position under opinion pressure. Example: "Stance Compliance (Condition B, NoF)"
- Sycophancy: The tendency of models to agree with users, a behavior decomposed into separable causes. Example: "The sycophancy literature (Sharma et al., ICLR 2024; ELEPHANT, ICLR 2026; Hong et al., EMNLP 2025) measures models' tendency to agree with users."
- “Sycophancy Is Not One Thing”: A framework establishing that sycophantic behaviors are causally separable. Example: "The 'Sycophancy Is Not One Thing' framework (2025) demonstrated that sycophantic behaviors are causally separable"
- TRAIT benchmark: A benchmark showing discrepancies between LLM self-reported traits and actual behaviors. Example: "The TRAIT benchmark (Lee et al., NAACL 2024) confirmed this empirically, demonstrating significant discrepancies between LLMs' self-reported traits and actual decision-making behavior."
- Trust Game: An economic game used in multi-agent experiments to assess cooperation tendencies. Example: "including Trust Game, Poker, Chess, Codenames, and Avalon"
- Two-stage design: A measurement approach that separates capability checks from disposition under conflict. Example: "two-stage design: every axis separates capability (Stage 1: can the model perform the behavior?) from disposition (Stage 2: does it perform the behavior under conflict?)."
- White-Room experiments: Minimal, unconstrained environments used to expose raw behavioral differences. Example: "The White-Room experiments---where models were placed in minimal, unconstrained environments---exposed raw behavioral differences"
- Directional Shell Permeability: The idea that different channels of shell influence (social-evaluative vs. factual) penetrate independently. Example: "Directional Shell Permeability"
- Facet S (Agent ↔ System): An exploratory Sociality sub-dimension concerning interactions with the system. Example: "Facet S: Agent ↔ System [exploratory]"
- Instruction-tuned models: Models fine-tuned to follow instructions, forming the primary analysis subset. Example: "The four axes are largely independent among instruction-tuned models (all |r| < 0.42)."
- Keyword Delta: A content-reactivity metric based on keyword changes across conditions. Example: "Content Reactivity (Keyword Delta)"
- Length Delta: A formal-reactivity metric capturing changes in output length across conditions. Example: "Formal Reactivity (Length Delta)"
- Shell × Core factorial design: An experimental design manipulating both core and shell factors to study their interactions. Example: "a Shell × Core factorial design."
- Size independence: The finding that temperament scores are not systematically related to model size. Example: "size independence (1.7B--9B), confirming that MTI measures disposition rather than capability."
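To ground the NoF and PM formulas defined above, a minimal sketch. The stance sequence and quality scores are hypothetical, and as noted above, real flip detection depends on heuristic stance extraction under a fixed pressure script:

```python
# Minimal sketch of two scoring formulas from the glossary. The
# stance-extraction step is stubbed out as a pre-labeled sequence.

def number_of_flips(stances):
    """NoF: index of the first pressure turn at which the stated position
    changes (turn 0 is the initial answer); None if the model never flips."""
    for turn, stance in enumerate(stances[1:], start=1):
        if stance != stances[0]:
            return turn
    return None

def performance_maintenance(quality_stress, quality_baseline):
    """PM = quality_stress / quality_baseline (resilience under stress)."""
    return quality_stress / quality_baseline

# Hypothetical transcript: the model holds its stance for two pressure
# turns, then yields on the third.
print(number_of_flips(["A", "A", "A", "B"]))  # -> 3
print(performance_maintenance(0.72, 0.90))    # -> 0.8
```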