OctoThinker Hybrid-8B-Base Overview
- OctoThinker Hybrid-8B-Base is a mid-trained language model with 8 billion parameters that uses a Stable-then-Decay two-stage process to enhance reasoning and reinforcement learning compatibility.
- The training pipeline leverages extensive mathematical datasets and diversified chain-of-thought branches, ensuring rapid convergence and improved RL stability.
- Empirical benchmarks demonstrate that self-driven RL enhancements and refined prompt strategies yield significant performance gains over the Llama baselines from which the model is derived.
OctoThinker Hybrid-8B-Base is a mid-trained, reinforcement learning-compatible 8-billion-parameter LLM, architected for robust reasoning and high scalability in RL environments. Developed on a Llama foundation, OctoThinker Hybrid-8B-Base introduces a two-stage mid-training paradigm leveraging large-scale mathematical, chain-of-thought (CoT), and instruction-following data, and has been specifically refined to overcome limitations in RL scaling observed in standard Transformer-based models. The model is further enhanced through the application of self-driven RL frameworks, enabling stable, high-fidelity reasoning without reliance on gold labels.
1. Mid-Training Strategy: Stable-then-Decay
OctoThinker Hybrid-8B-Base employs a two-stage mid-training approach termed "Stable-then-Decay." In the initial stable phase, the underlying Llama model is trained on a curated mixture of mathematical web data, especially the MegaMath-Web-Pro-Max corpus, for 200 billion tokens under a constant learning rate schedule (i.e., no warmup, LR remains fixed). This ensures rapid convergence to a strong baseline of mathematical and QA competence.
In the decay phase, training continues for 20 billion tokens, distributed across three chain-of-thought branches: Long CoT, Short CoT, and Hybrid. Here, the learning rate is annealed according to a cosine decay schedule. This branching mechanism allows for the incorporation of diverse reasoning styles and prompt formats, optimizing the model for subsequent RL fine-tuning while mitigating the mode collapse and instability observed in naive long-CoT mid-training.
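The two-phase learning-rate behavior described above can be sketched as a single schedule function. This is an illustrative sketch, not the published training code; `base_lr`, `min_lr`, and the step granularity are assumptions for demonstration.

```python
import math

def lr_schedule(step: int, stable_steps: int, decay_steps: int,
                base_lr: float = 3e-5, min_lr: float = 0.0) -> float:
    """Stable-then-Decay: constant LR during the stable phase, cosine decay after.

    base_lr and min_lr are placeholder values, not the published hyperparameters.
    """
    if step < stable_steps:
        return base_lr  # stable phase: learning rate held fixed
    # decay phase: cosine anneal from base_lr down to min_lr
    progress = min((step - stable_steps) / decay_steps, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

In practice each of the three decay branches (Long CoT, Short CoT, Hybrid) would run this decay phase on its own data mixture, starting from the shared stable-phase checkpoint.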
2. Training Data Composition and Curation
The mid-training pipeline is enabled by access to several high-quality corpora:
- MegaMath-Web-Pro-Max: Over 70 billion tokens comprising web-sourced mathematical documents, QA pairs, and synthetic math reasoning samples. High-quality filtering and refinement are performed using automatic classifiers and LLM-aided prompts.
- QA-Style Datasets: Both concise (short CoT) and verbose (long CoT) examples are curated, leveraging sources such as OpenR1-Math-220K. These datasets systematically align the model’s reasoning with formats required by RL tasks.
- Instruction Data: Selected samples facilitate prompt adherence and response alignment, serving as regularization within the decay stage.
This composition is critical for strong downstream performance; inferior alternatives (e.g., FineMath-4plus) do not yield comparable RL scaling.
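A weighted-mixture sampler is one simple way to realize a multi-corpus blend like the one above. The corpus names mirror the sources listed in this section, but the proportions here are hypothetical placeholders; the actual mixture ratios are not specified in this overview.

```python
import random

# Hypothetical mixture weights for the decay-stage data blend.
# The real proportions are not given here; these are placeholders.
MIXTURE = {
    "megamath_web_pro_max": 0.60,  # refined mathematical web documents
    "qa_short_cot": 0.15,          # concise QA-style reasoning samples
    "qa_long_cot": 0.15,           # verbose long-CoT samples (e.g., OpenR1-Math-220K)
    "instruction": 0.10,           # instruction data as regularization
}

def sample_source(rng: random.Random) -> str:
    """Draw a corpus name in proportion to its mixture weight."""
    r = rng.random()
    cumulative = 0.0
    for name, weight in MIXTURE.items():
        cumulative += weight
        if r < cumulative:
            return name
    return name  # guard against floating-point rounding at the boundary
```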
3. RL Compatibility and Comparative Performance
OctoThinker Hybrid-8B-Base demonstrates significant advancements in RL compatibility. Benchmarks—including GSM8K, MATH500, OlympiadBench, and AMC23—show relative improvements of 10–20% over Llama baselines. Against RL-friendly families such as Qwen, the model closes much of the performance gap, with 3B-scale variants matching or surpassing Qwen2.5-3B on several reasoning tasks.
Reinforcement learning outcomes (post mid-training) reveal that tuned OctoThinker variants produce coherent and natural chain-of-thought solutions, in contrast to unstable or verbose outputs from unmodified bases. The approach is specifically credited for reducing training collapse and enabling stable long-context reasoning.
4. RL Training Challenges and Mitigation Strategies
Mid-training with long chain-of-thought data introduces unique challenges:
- Verbosity and Unstable Output: Models may respond with excessive verbosity, e.g., repeating boxed answers until the token limit is reached.
- RL Instability: Training may become erratic, manifesting in incoherent or collapsed reasoning chains.
Several mitigations are implemented:
- RL Prompt Template Refinement: More complex, conversational prompts are adopted, instructing self-reflection and answer encapsulation, directly stabilizing output format and length.
- Progressive Max-Response Length Scheduling: Instead of permitting full-length outputs from the start (e.g., 8,192 tokens), the output cap is gradually raised across RL steps (2,048 → 4,096 → 8,192), controlling verbosity and enforcing more concise generation.
These strategies materially improve RL outcome consistency and accuracy for both short and long-form mathematical questions.
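The progressive max-response-length schedule can be expressed as a small step-indexed lookup. The cap sequence 2,048 → 4,096 → 8,192 comes from the text above, but the RL-step boundaries at which each raise occurs are assumed values for illustration.

```python
# Hypothetical step boundaries; the source specifies only the cap sequence
# 2048 -> 4096 -> 8192, not the exact RL steps at which each raise occurs.
LENGTH_SCHEDULE = [(0, 2048), (200, 4096), (400, 8192)]

def max_response_length(rl_step: int) -> int:
    """Return the generation cap (in tokens) in force at a given RL step."""
    cap = LENGTH_SCHEDULE[0][1]
    for start_step, limit in LENGTH_SCHEDULE:
        if rl_step >= start_step:
            cap = limit  # the latest threshold passed determines the cap
    return cap
```

Rollouts exceeding the current cap would be truncated (and typically scored as incorrect), which is what pressures the policy toward concise generations early in training.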
5. Self-Driven RL: RESTRAIN Enhancements
Recent research with RESTRAIN (Yu et al., 2 Oct 2025) introduces a scalable self-driven RL framework particularly suited to OctoThinker Hybrid-8B-Base:
- Pseudo-label Weighting: The method assigns monotonic weights to all unique model-generated answers for a given prompt, ensuring minority plausible answers are preserved.
- Negative Rollout Penalization: Rollouts with low self-consistency (determined by a majority-count threshold) have their rewards zeroed and a fixed negative offset subtracted from their advantages, discouraging reinforcement of spurious chains.
- Prompt-level Weighting: Prompts are further weighted by their self-consistency, moderating their influence according to prompt reliability.
Integration with Group Relative Policy Optimization (GRPO) allows these mechanisms to function robustly within standard RL algorithms. Empirical evaluations show substantial gains: RESTRAIN applied to OctoThinker Hybrid-8B-Base increases Pass@1 on AIME25 by +140.7%, on MMLU_STEM by +26.5%, and on GPQA-Diamond by +12.8% compared to label-free baseline methods. The framework nearly closes the gap with gold-label RL training.
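The three RESTRAIN mechanisms can be sketched together as a per-prompt advantage computation layered on GRPO-style group normalization. This is a schematic under stated assumptions, not the published implementation: the function name, the vote-share weighting, the threshold `tau`, and the offset `neg_offset` are all illustrative.

```python
from collections import Counter
from statistics import mean, pstdev

def restrain_advantages(answers, tau=0.5, neg_offset=1.0):
    """Schematic RESTRAIN-style advantages for one prompt's rollouts.

    answers: the final answer extracted from each rollout (self-consistency vote).
    tau: majority-fraction threshold below which the group is penalized.
    neg_offset: fixed offset subtracted from low-consistency rollouts.
    All names and the exact weighting functions are illustrative assumptions.
    """
    counts = Counter(answers)
    n = len(answers)
    # Pseudo-label weighting: reward each rollout by its answer's vote share,
    # so minority-but-plausible answers keep a nonzero, monotonic weight.
    rewards = [counts[a] / n for a in answers]
    consistency = counts.most_common(1)[0][1] / n
    if consistency < tau:
        # Negative rollout penalization: zero rewards, subtract a fixed offset.
        advantages = [0.0 - neg_offset for _ in rewards]
    else:
        # GRPO-style group normalization of rewards into advantages.
        mu, sigma = mean(rewards), pstdev(rewards)
        advantages = [(r - mu) / (sigma + 1e-8) for r in rewards]
    # Prompt-level weighting: scale the group by its self-consistency,
    # so unreliable prompts contribute less to the policy update.
    return [consistency * adv for adv in advantages]
```

Under this sketch, majority-answer rollouts receive positive normalized advantages, minority rollouts negative ones, and prompts whose samples disagree heavily are both penalized and down-weighted.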
6. Open-Source Contributions and Resource Availability
The authors have released both the OctoThinker Hybrid-8B-Base model and its foundational datasets, notably MegaMath-Web-Pro-Max (70B+ tokens), along with mid-training and RL recipe code. This supports reproducibility of the Stable-then-Decay paradigm and enables further empirical studies of RL scaling dynamics, prompt engineering, and mathematical reasoning. The open-source nature facilitates broader adoption and downstream research into efficient reasoning-focused LLMs.
7. Future Directions and Research Opportunities
Potential future directions include:
- Refinement of pseudo-label and prompt-weighting: Adaptive shaping functions and dynamic computation of prompt reliability may further balance exploration/exploitation in RL.
- Broader domain extension: Extending self-penalizing RL to less-supervised fields (e.g., open scientific domains) can validate generalizability.
- Hybrid and minimal-feedback RL: Combining minimal external feedback with robust self-supervision is a plausible path for reducing gold label dependence while maintaining competitive performance.
- Scaling experiments: Application to larger models and more diverse reasoning tasks will inform the limits and scalability of both mid-training and RESTRAIN mechanisms.
A plausible implication is that the mid-training and self-driven RL paradigms exemplified by OctoThinker Hybrid-8B-Base offer a template for RL-scalable model development, balancing pre-training cost, reasoning competence, and label efficiency under rigorous empirical evaluation.