
Base LLM Behavior Analysis

Updated 29 January 2026
  • Base LLM behavior is defined as autoregressive generation that learns statistical patterns from large-scale corpora without explicit alignment or safety interventions.
  • Experimental frameworks like LLM-ABS control prompt structure and stimuli to reveal intrinsic social preferences and fairness biases in these models.
  • Mechanistic studies decompose base LLM activations into interpretable components, shedding light on inherent rationality limits and emergent linguistic correlates.

A base LLM is an autoregressive transformer trained solely with next-token or similar objectives on large-scale text corpora, without application-specific supervised instruction fine-tuning or reinforcement-learning-based alignment. The core behavioral profile of such models—referred to as "base LLM behavior"—is shaped primarily by pretraining dynamics and the model's statistical learning of linguistic patterns, rather than by explicit social preference conditioning or safety interventions. Understanding the behavioral, cognitive, and decision-theoretic properties exhibited by base LLMs is essential both for interpretability research and for delineating the effects of downstream modifications.

1. Foundational Properties of Base LLMs

Base LLMs generate completions based on the statistical structure of their pretraining corpora, absent direct alignment or reward shaping for specific tasks. This statistical learning gives rise to emergent behaviors that can be systematically analyzed using decision-theoretic, economic, and linguistic frameworks. Unlike instruction-tuned models, base LLMs do not natively respond to meta-level instructions or safety guardrails, and their outputs reflect priors acquired during pretraining, modulated only by the input prompt context (Einwiller et al., 11 Nov 2025).

In economic games such as the dictator game, base LLMs generate allocations in the absence of contextual incentives or explicit persona priming, providing a "clean" baseline for social preference inquiry. In psycholinguistic tasks, their open-ended justifications expose the latent reasoning chains and hedging present in their output distributions.

2. Experimental Frameworks for Behavioral Assessment

Frameworks such as LLM-ABS ("LLM Agent Behavior Study") have been developed to robustly measure base LLM behavior by controlling for confounds including prompt structure, token framing, and system instructions. The LLM-ABS protocol involves:

  • Manipulating system prompts (x_sys), user inputs (amount A, unit U), and neutral prompt paraphrases (10 variants).
  • Utilizing ensembles of diverse LLM architectures.
  • Presenting open-ended stimuli to the model, followed by reduction to structured outputs (e.g., closed-form JSON for dictator game allocations).
  • Repeating trials (up to 100 per experimental cell) to estimate behavioral variance and outlier rates.
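The protocol above can be sketched as a trial loop. This is a minimal illustration, not the LLM-ABS implementation: `query_model` is a stub standing in for a real model API, and the simulated near-equal split is an assumption made so the example runs self-contained.

```python
import json
import random
import statistics

def query_model(system_prompt: str, user_prompt: str) -> str:
    """Stub for an LLM API call. Here we simulate a near-equal
    dictator-game split and return it as closed-form JSON."""
    kept = min(max(random.gauss(0.49, 0.05), 0.0), 1.0)
    return json.dumps({"kept": round(kept, 3), "given": round(1 - kept, 3)})

def run_cell(system_prompt: str, paraphrases: list, trials: int = 100) -> dict:
    """One experimental cell: repeat trials across neutral prompt
    paraphrases, reduce each open-ended output to a structured
    allocation, and estimate the behavioral mean and variance."""
    kept_fractions = []
    for t in range(trials):
        user_prompt = paraphrases[t % len(paraphrases)]
        reply = query_model(system_prompt, user_prompt)
        kept_fractions.append(json.loads(reply)["kept"])
    return {
        "mean_kept": statistics.mean(kept_fractions),
        "sd_kept": statistics.stdev(kept_fractions),
        "n": len(kept_fractions),
    }

random.seed(0)
paraphrases = [f"Split 100 units between yourself and a stranger (variant {i})."
               for i in range(10)]
cell = run_cell("You are a helpful assistant.", paraphrases)
```

Each cell yields a mean kept proportion with an attached variance estimate, which is what makes cross-prompt and cross-model comparisons statistically meaningful.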

Such frameworks are essential to isolate prompt sensitivity and obtain valid baselines for comparative studies of agentic or aligned LLM variants (Einwiller et al., 11 Nov 2025).

3. Base Social Preferences and Fairness Biases

Base LLM agents, when tasked with splitting an endowment in the dictator game under "neutral" prompts, display an intrinsic social preference bias. Aggregate analysis finds that, averaged over short and generic system prompts (e.g., the empty string or "You are a helpful assistant."), the mean kept proportion F̄ hovers around 0.48–0.50, indicating a near-equal split between self and other. This behavior is robust to small phrasing changes and closely approximates an equal-split fairness norm, in sharp contrast to the typical human pattern of keeping roughly 72% (a 72/28 split) reported by Engel (2011):

| System prompt | Mean kept F̄ | SD | p vs. neutral |
|---|---|---|---|
| Empty ("") | 0.48 | 0.06 | 0.12 (n.s.) |
| Helpful assistant | 0.50 | 0.05 | — |
| Assistant | 0.49 | 0.05 | 0.35 (n.s.) |
| ChatGPT | 0.47 | 0.07 | 0.08 |
| DeepSeek default | 0.42 | 0.07 | <0.01 (**) |
| Claude standard | 0.50 | 0.04 | 0.90 (n.s.) |
| Grok default | 0.40 | 0.08 | <0.05 (*) |
| Gemini 2.0 flash | 0.43 | 0.09 | <0.01 (**) |
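Significance tests of this kind can be approximated with a simple permutation test on per-trial kept proportions. The data below are synthetic, matched only to the table's means and SDs, so the resulting p-value is illustrative rather than the paper's reported statistic.

```python
import random
import statistics

def perm_test(a, b, n_perm=2000, seed=0):
    """Two-sided permutation test for a difference in mean kept
    proportion between two system-prompt conditions."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(a) - statistics.mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        pa, pb = pooled[:len(a)], pooled[len(a):]
        if abs(statistics.mean(pa) - statistics.mean(pb)) >= observed:
            hits += 1
    return hits / n_perm

# Synthetic per-trial data matched to the neutral and Grok rows above.
rng = random.Random(1)
neutral = [min(max(rng.gauss(0.50, 0.05), 0.0), 1.0) for _ in range(100)]
grok = [min(max(rng.gauss(0.40, 0.08), 0.0), 1.0) for _ in range(100)]
p_value = perm_test(neutral, grok)
```

With 100 trials per cell, a 0.10 gap in mean kept proportion is easily distinguished from the permutation null, consistent with the significant rows in the table.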

Elaborate or branded system prompts can increase generosity markedly (lowering F̄ by 5–10 percentage points), indicating that base behavior is sensitive to system-level framing; this effect is independent of prompt length (Einwiller et al., 11 Nov 2025).

4. Linguistic Correlates and Justification Patterns

Analysis of open-ended rationales in behavioral games reveals structured linguistic differences underpinning LLM decisions:

  • Discourse markers (e.g., "because," "therefore"): frequency correlates positively with fairness (r = +0.32, p < 0.01), indicating that responses with more explicit reasoning tend toward 50/50 splits.
  • Epistemic markers (e.g., "I think," "possibly"): frequency correlates positively with self-interest (r = +0.28, p < 0.05), reflecting increased hedging when the LLM chooses a self-favoring split.

These correlations persist across architectures and prompt variants, but model-specific styles are present. For example, DeepSeek prompts associated with fair splits also yield a 30% higher rate of discourse markers compared to self-interested splits, while GPT-4-like models justify less regardless of allocation (Einwiller et al., 11 Nov 2025). This linguistic diagnostic enables a mechanistic probe into LLM "reasoning" motifs without reference to external alignment signals.

5. Behavioral Biases in Decision-making and Expectation Updating

Base LLMs exhibit systematic deviations from normative rationality in settings involving statistical inference, expectation formation, and test-taking:

  • Bounded rationality and recency effects: As quantified using a Behavioral Kalman Filter (BKF), base LLMs systematically underweight priors (β_prior < 1), overweight individual (micro) signals versus aggregate (macro) signals, and exhibit negative interaction terms (β_int < 0) reflecting a "cognitive discount" when multiple signals are presented. For instance, in economic expectation updating, CEO personas place more weight on macro news than household personas, modulated by the prompt context (Wang et al., 24 Jan 2026).
  • Base-rate effect: In multiple-choice tasks, base LLMs have intrinsic base-rate preferences for certain answer tokens ("label bias"), leading to inflated accuracy estimates when the preferred token is the correct answer. Cloze prompting yields strong correlations (ρ ≈ 0.9–1.0 for some models) between base-rate probabilities and measured performance. Counterfactual prompting and the Nvr-X-MMLU benchmark partially correct for this, but residual biases persist (Moore et al., 2024).
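The BKF-style update can be written as a small regression-form sketch. The coefficient values below are illustrative assumptions chosen to match the qualitative pattern (β_prior < 1, β_micro > β_macro, β_int < 0), not fitted values from the paper.

```python
def bkf_update(prior, micro, macro,
               b_prior=0.7, b_micro=0.25, b_macro=0.10, b_int=-0.05):
    """Behavioral Kalman Filter-style expectation update.
    Illustrative coefficients: b_prior < 1 encodes prior
    underweighting, b_micro > b_macro encodes micro-signal
    overweighting, and b_int < 0 encodes the "cognitive discount"
    observed when both signals are presented together."""
    return (b_prior * prior
            + b_micro * micro
            + b_macro * macro
            + b_int * micro * macro)

# Example: prior inflation expectation 2.0%, micro signal 4.0%, macro 3.0%.
posterior = bkf_update(2.0, 4.0, 3.0)  # 1.4 + 1.0 + 0.3 - 0.6 = 2.1
```

Fitting these coefficients to a model's actual expectation updates is what lets the framework quantify how far base LLM inference departs from the normative Kalman weighting.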

These findings challenge the interpretability of zero-shot benchmark scores as indicators of genuine understanding, making explicit the need for robust, bias-controlled evaluation protocols.

6. Mechanistic Insights and Internal Decomposition

Recent mechanistic studies decompose base LLM behavior into linear activation-space components:

  • Steering vectors: Low-rank gradient-based interventions in layer activations, even when trained on a single example, can reliably induce or suppress complex behaviors in LLMs. Many near-orthogonal steering directions exist for the same behavioral modification, indicating the activation manifold contains broad, flat subspaces for mediating behavior. Promotion interventions are generally more reliable than suppression (Dunefsky et al., 26 Feb 2025).
  • Assertiveness structure: In assertiveness calibration studies with Llama 3.2, base representations decompose into orthogonal emotional (v_e) and logical (v_l) directions, as established by clustering and Gram–Schmidt procedures on mid-layer activations. Emotional steering drives broad increases in assertiveness predictions, while logical steering produces more selective, fact-driven shifts. Removing v_e broadly reduces overconfidence, while removing v_l yields highly localized prediction changes (Tsujimura et al., 24 Aug 2025).
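The Gram–Schmidt orthogonalization and direction-ablation steps can be sketched in a few lines. The 3-D vectors below are toy stand-ins for mid-layer activation directions, not extracted model internals.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def gram_schmidt(v_e, v_l):
    """Orthogonalize the logical direction against the emotional one
    and return unit vectors, mirroring the cluster + Gram-Schmidt step."""
    e = normalize(v_e)
    l_orth = [x - dot(v_l, e) * e_i for x, e_i in zip(v_l, e)]
    return e, normalize(l_orth)

def remove_component(activation, direction):
    """Ablate one (unit-norm) steering direction from an activation,
    as in the v_e / v_l removal interventions."""
    c = dot(activation, direction)
    return [a - c * d for a, d in zip(activation, direction)]

# Toy 3-D stand-ins for mid-layer activation directions.
v_e, v_l = gram_schmidt([1.0, 1.0, 0.0], [1.0, 0.0, 1.0])
activation = [2.0, 3.0, 4.0]
ablated = remove_component(activation, v_e)  # emotional component removed
```

In an actual intervention the same projection-removal step would be applied to residual-stream activations at the chosen layer via a forward hook.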

These mechanistic decompositions both ground empirical behavioral biases and provide levers for post-hoc alignment or mitigation.

7. Limitations in Mimicking Genuine Human Motivations

Systematic replications of motivated reasoning studies confirm that base LLMs diverge from human behavioral patterns:

  • In classic political-motivation tasks, base LLMs fail to amplify in-group alignment under directional prompts and do not moderate responses under accuracy prompts.
  • Models display dramatically reduced variance in outputs (σ_LLM/σ_Human ≈ 0.1–0.3), resulting in excessively homogeneous opinions.
  • Correlations with human condition means are weak, and sign agreement on argument strength assessments is at or below chance.

These results indicate that base LLMs do not exhibit the variance or directional biases characteristic of human motivated reasoning, highlighting the necessity of explicit persona priming or downstream alignment to recover human-like motivational structure (Pate et al., 22 Jan 2026).

8. Synthesis and Implications

Base LLM behavior is the product of learned statistical regularities, prompt context, and architectural constraints. The default social preference is a fairness bias distinct from human norms, but it is highly prompt-sensitive. Linguistic analysis uncovers internal correlates of fairness and self-interest with consistent predictive value. At the same time, base models exhibit (1) bounded rationality, (2) susceptibility to superficial heuristics in benchmarking, (3) linearly decomposable activation subspaces underlying interpretable behaviors, and (4) an absence of genuine motivated reasoning. Practitioners are therefore advised to:

  • Always document and control system prompt framing.
  • Use robust baselines and adversarial prompt variations.
  • Employ linguistic and mechanistic probes to diagnose emergent model biases.

Behavioral steering and decompositional methods provide powerful tools for both modulating and understanding base LLM behavior, but do not obviate the necessity for downstream safety and alignment interventions. Understanding the baseline is thus enduringly foundational in both mechanistic interpretability and system design (Einwiller et al., 11 Nov 2025, Dunefsky et al., 26 Feb 2025, Moore et al., 2024, Tsujimura et al., 24 Aug 2025, Wang et al., 24 Jan 2026, Pate et al., 22 Jan 2026).
