Olmo 3 Think 32B LLM Overview
- Olmo 3 Think 32B is a flagship large language model that uses a dense transformer architecture and a multi-stage training recipe to achieve state-of-the-art chain-of-thought reasoning and coding capabilities.
- It combines diverse, STEM-centric data curation with extended context processing (up to 65K tokens), and its fully open release supports reproducibility alongside strong performance on academic and practical tasks.
- Post-training alignment through techniques like supervised finetuning, Direct Preference Optimization, and RLVR enhances its reasoning precision and safety while maintaining transparency.
Olmo 3 Think 32B ("O3T 32B") is the flagship LLM of the Olmo 3 family, calibrated for state-of-the-art chain-of-thought reasoning, instruction following, coding, function calling, and long-context comprehension. Released as a fully-open artifact, O3T 32B integrates dense transformer architectural innovations, a multi-stage post-training recipe, curated STEM-centric data mixes, and advanced alignment techniques to deliver leading performance across a broad spectrum of academic and practical tasks. Distinct from other open models, O3T 32B provides exhaustive transparency with full model flow—including checkpoints, dependencies, and data curation—supporting reproducibility and research extension (OLMo et al., 15 Dec 2025).
1. Architectural Foundations
O3T 32B implements a decoder-only transformer architecture comprising 64 layers, each with 5120 hidden units. The attention configuration uses 40 query heads and 8 key-value heads in a grouped-query arrangement for computational efficiency. The feed-forward subnetwork adopts SwiGLU activation with an expanded intermediate dimension. RMSNorm is employed for output normalization, while QK-Norm regularizes query and key representations. Rotary positional encoding (RoPE) is used, further extended by YaRN across the global attention layers; 75% of layers employ sliding-window attention (window size 4096), while the final layer uses global attention for extended contexts.
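As a sketch of how grouped-query attention reduces key/value cost, the toy NumPy implementation below shares each of 8 KV heads across 5 of the 40 query heads. It omits RoPE, QK-Norm, and sliding windows, and the weights and dimensions are illustrative, not the actual Olmo implementation.

```python
import numpy as np

def grouped_query_attention(x, wq, wk, wv, n_q_heads=40, n_kv_heads=8):
    """Toy causal grouped-query attention: n_q_heads query heads share
    n_kv_heads key/value heads (here 5 query heads per KV head), which
    shrinks the KV cache 5x versus full multi-head attention."""
    seq, d_model = x.shape
    head_dim = d_model // n_q_heads
    q = (x @ wq).reshape(seq, n_q_heads, head_dim)
    k = (x @ wk).reshape(seq, n_kv_heads, head_dim)
    v = (x @ wv).reshape(seq, n_kv_heads, head_dim)
    group = n_q_heads // n_kv_heads
    k = np.repeat(k, group, axis=1)  # replicate each KV head for its query group
    v = np.repeat(v, group, axis=1)
    scores = np.einsum('qhd,khd->hqk', q, k) / np.sqrt(head_dim)
    causal = np.triu(np.ones((seq, seq), dtype=bool), k=1)  # mask future tokens
    scores = np.where(causal, -1e9, scores)
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    return np.einsum('hqk,khd->qhd', probs, v).reshape(seq, d_model)
```

Because only 8 KV heads are cached rather than 40, long-context serving memory drops proportionally, which is the main motivation for GQA at 65K-token windows.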
During pretraining and midtraining, the standard next-token cross-entropy loss is minimized with an additional Z-loss regularizer term (weight $\lambda_z$) to penalize over-confident logits:

$$\mathcal{L} = \mathcal{L}_{\text{CE}} + \lambda_z \left( \log \sum_i e^{z_i} \right)^2$$

where $z_i$ are the output logits. No auxiliary objectives are used beyond Z-loss (OLMo et al., 15 Dec 2025).
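The combined objective can be sketched numerically as follows; the Z-loss weight here is illustrative, since the actual coefficient is not stated in this excerpt.

```python
import numpy as np

def loss_with_z_reg(logits, target, z_weight=1e-4):
    """Next-token cross-entropy plus Z-loss. Z-loss penalizes the squared
    log-partition function, keeping the overall logit scale from drifting.
    The 1e-4 weight is illustrative; the excerpt does not state the value."""
    m = logits.max()
    z = m + np.log(np.sum(np.exp(logits - m)))  # numerically stable logsumexp
    ce = z - logits[target]                     # -log softmax(logits)[target]
    return ce + z_weight * z ** 2
```

Because the penalty grows with the squared log-partition, it discourages the logits from inflating uniformly, a failure mode cross-entropy alone does not constrain.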
2. Data Curation and Model Flow
The model flow for O3T 32B is structured in three phases:
- Dolma 3 Mix (Pretraining): 5.93 trillion tokens selected from a 9.31 T pool; STEM-weighted via Dirichlet swarm mixing and upsampling. Constituent sources include CommonCrawl (4.51 T), academic PDFs via olmOCR (0.81 T), Stack-Edu GitHub code (0.41 T), arXiv papers in LaTeX (50 B), FineMath3+ pages (152 B), and Wikipedia/Wikibooks (2.5 B). Deduplication is applied at three granularities, removing ~75% redundancy.
- Dolmino Mix (Midtraining): 100 B tokens drawn from a 2.19 T pool, incorporating synthetic math (TinyMATH, CraneMath, MegaMath), instructional prompts (Tūlu 3 SFT), meta-reasoning traces, QA tasks, and coding data. Decontamination via IDF-weighted n-gram overlap mitigates evaluation-leakage risk (OLMo et al., 15 Dec 2025).
- Longmino Mix (Long-Context Extension): 50 B tokens from a 639 B pool, emphasizing synthetic long-context aggregation tasks and long PDFs. Document packing and intra-document masking are applied, and the context length is extended to 65,536 tokens.
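The Dirichlet "swarm" step of the mixing procedure can be illustrated as below. Only the candidate-weight sampling is shown; the source names and concentration parameter are hypothetical, and the real pipeline would score each candidate mix (e.g., via small proxy training runs) before committing to a STEM-weighted mixture.

```python
import numpy as np

# Sample a "swarm" of candidate mixture-weight vectors over data sources.
rng = np.random.default_rng(42)
sources = ["common_crawl", "academic_pdfs", "code", "arxiv", "math", "wiki"]
alpha = np.full(len(sources), 0.5)       # concentration < 1 favors peaked mixes
swarm = rng.dirichlet(alpha, size=128)   # 128 candidate weight vectors
# e.g., pick the most math-heavy candidate as a stand-in for "upsampling STEM"
best = swarm[np.argmax(swarm[:, sources.index("math")])]
```

Each row of `swarm` sums to 1 and can be used directly as per-source sampling probabilities when assembling a token budget.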
3. Model Training and Hyperparameters
Training uses a batch size of 8 million tokens per step with the AdamW optimizer (β₂ = 0.95, no weight decay on embeddings) and bfloat16 precision. Schedules vary: cosine annealing with warmup during pretraining (over 5.93 T tokens), and linear decay during midtraining and long-context extension. Computational resources include 1024 H100 GPUs and hybrid sharded data parallelism, yielding model FLOPs utilization (MFU) ≈ 41% in pretraining.
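A minimal sketch of the warmup-plus-cosine pretraining schedule described above; the peak learning rate and warmup length are not given in this excerpt, so the values used below are illustrative.

```python
import math

def lr_at_step(step, total_steps, peak_lr, warmup_steps):
    """Linear warmup to peak_lr, then cosine annealing toward zero.
    A generic sketch of the pretraining schedule with illustrative
    hyperparameters, not the reported Olmo 3 values."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))
```

The midtraining and extension phases would replace the cosine branch with a simple linear decay over the remaining steps.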
Checkpoints are saved frequently; midtraining averages two runs, and extension merges the final three checkpoints ("model soup"). Context-parallel inference supports 65K token windows via 8-way all-gather attention. Throughput benchmarks indicate 1960 tok/s/GPU for pretraining, and ≈250 tok/s/GPU for long-context serving on vLLM (OLMo et al., 15 Dec 2025).
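The "model soup" merge of the final checkpoints amounts to a plain element-wise parameter average, which can be sketched as:

```python
def average_checkpoints(state_dicts):
    """'Model soup': element-wise average of the parameters from the final
    few checkpoints. Shown with plain dicts of scalars for clarity; a real
    implementation averages each parameter tensor the same way."""
    n = len(state_dicts)
    return {key: sum(sd[key] for sd in state_dicts) / n for key in state_dicts[0]}
```

Averaging the last three checkpoints, as described above, smooths late-training noise without any additional training steps.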
4. “Think” Reasoning: Post-Training Alignment
O3T 32B's "Think" capability is elicited through a three-stage post-training recipe:
- Supervised Finetuning (SFT): Dolci Think SFT prompts (~2.25 M), explicitly teaching chain-of-thought (CoT) reasoning.
- Direct Preference Optimization (DPO): Contrastive ranking over 200 K Dolci Think DPO pairs employing a delta-learning heuristic (Qwen3-32B vs. Qwen3-0.6B completions) and GPT-judge rewards, refining reasoning fluency and style (cf. Table 26 and Figure 1).
- Reinforcement Learning with Verifiable Rewards (RLVR): RL via GRPO, augmented with truncated importance sampling, clip-higher, zero-grad filtering, and token-level objectives. Verifiers specific to math (SymPy), code (AWS Lambda testcases), instruction (function-constraint checkers), and chat (GPT-judge) maintain domain-grounded reward signal. RL inference leverages continuous batching and inflight updates for 4× accelerated training (OLMo et al., 15 Dec 2025).
Empirical analysis indicates that the DPO→RL sequence outperforms SFT→RL. The resulting model demonstrates robust, task-aligned reasoning without sacrificing stylistic diversity.
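The DPO stage's contrastive objective can be sketched for a single preference pair as follows; the β value is a common illustrative choice, not the reported Olmo 3 setting, and the log-probability arguments stand in for sequence-level scores from the policy and a frozen reference model.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Direct Preference Optimization loss for one preference pair: a
    logistic loss on the beta-scaled gap between the policy's and the
    reference model's log-probability ratios for chosen vs. rejected."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))
```

In the delta-learning setup described above, the "chosen" and "rejected" completions would come from the stronger and weaker teacher models, respectively.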
5. ThinkPilot Prompt Optimization
The ThinkPilot framework (Li et al., 14 Oct 2025), when adapted to O3T 32B, enables automated optimization of the model's reasoning chain via think-prefixes—special instructions prefixed to queries to steer the reasoning process. ThinkPilot utilizes an evolutionary search comprising population initialization (12–20 diverse seed prefixes), fitness evaluation (accuracy-length trade-off), selection (top-k), crossover (LLM blending of behavioral traits), and mutation (targeted amplification/suppression of reasoning behaviors). The behavioral taxonomy includes Task Initialization, Strategic Planning, Knowledge Retrieval, Stepwise Reasoning, Uncertainty Management, and Final Conclusion.
On O3T 32B, optimized think-prefixes significantly improve reasoning efficiency (e.g., MATH 500 accuracy rises from 94.2% to 95.0% while response length falls by 16%) and safety (StrongREJECT harm rate drops from 27.0% to 1.2% with no reduction in math accuracy). Convergence to task-preferred behavioral distributions is monitored using an external behavior-classifier prompt. Evolution typically converges within six generations, with a length-penalty hyperparameter controlling the accuracy-length trade-off (Li et al., 14 Oct 2025).
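The evolutionary loop can be sketched as below. This is a minimal skeleton, not the ThinkPilot implementation: `evaluate`, `crossover`, and `mutate` are user-supplied stand-ins for the accuracy-length fitness and the LLM-driven trait-blending and trait-amplification operators described above.

```python
import random

def evolve_prefixes(population, evaluate, crossover, mutate,
                    generations=6, top_k=4):
    """Skeleton of a ThinkPilot-style evolutionary search over think-prefixes:
    score every prefix, keep the top-k elites, and refill the population by
    crossover and mutation of elites."""
    size = len(population)
    for _ in range(generations):
        elites = sorted(population, key=evaluate, reverse=True)[:top_k]
        children = []
        while len(elites) + len(children) < size:
            a, b = random.sample(elites, 2)        # pick two elite parents
            children.append(mutate(crossover(a, b)))
        population = elites + children
    return max(population, key=evaluate)
```

Keeping elites unchanged between generations guarantees the best-known prefix is never lost while the rest of the population explores.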
6. Empirical Benchmarks and Capabilities
O3T 32B outperforms other fully-open models in several domains:
- Long-Context Reasoning: RULER scores range from 96.10% at 8K context to 79.70% at 65K; first among fully-open models on HELMET held-out tasks (52.11→43.15).
- Mathematical and Code Tasks: OlmoBaseEval: Math 61.9, Code 39.7, MC_STEM 74.5, MC_NonSTEM 85.6, GenQA 79.8. MATH 96.2, AIME24 80.6, OMEGA 53.4; BigBenchHard 88.6; HumanEval+ 91.5; IFEval 93.8; MMLU 86.4; AlpacaEval2 LC 69.1.
- Instruction and Safety Performance: HarmBench 88.6%, DAN 80.8%, XSTest 51.6%, BBQ 69.1% accuracy, WMDP 40.2%. Function calling lifts pass@1 accuracy on LitQA2 to 55.6% from 39.6% without tools, and reaches 75.9% on SimpleQA (OLMo et al., 15 Dec 2025).
Table: Representative O3T 32B Benchmark Scores
| Benchmark | O3T 32B Score | Reference Model |
|---|---|---|
| MATH | 96.2 | Qwen 3: 96.7 |
| BigBenchHard | 88.6 | Qwen 3: 90.6 |
| HumanEval+ | 91.5 | Qwen 3: 91.2 |
| LitQA2 (FC/pass@1) | 55.6 | No tools: 39.6 |
| SimpleQA | 75.9 | No tools: 3.2 |
7. Limitations and Directions for Research
O3T 32B shows persistent gaps on benchmarks such as AIME and BBH relative to top open-weight models (e.g., Qwen 3 VL) despite strong reasoning alignment. RLVR remains computationally intensive, especially with token-level verifiable reward assignment. Future directions include refining post-training strategies, expanding domain-specific reasoning taxonomies (e.g., adding API-retrieval and syntax-checking behaviors for code), increasing model diversity during prefix optimization, and broadening the auditable scope of verifiable rewards (OLMo et al., 15 Dec 2025, Li et al., 14 Oct 2025).
A plausible implication is that the full model flow and open post-training recipes significantly lower barriers for research-driven distillation, alignment, and safety experimentation, especially in academic and STEM reasoning contexts.