
Falcon Mamba 7B: Pure SSM LLM

Updated 19 February 2026
  • Falcon Mamba 7B is a 7-billion parameter, attention-free LLM that employs a pure SSM architecture with state-space recurrences, causal convolutions, and low-rank projections.
  • It is trained on 5.8 trillion tokens using a staged curriculum that scales sequence lengths and integrates techniques like AdamW optimization and RMSNorm stabilization.
  • The model achieves efficient long-context processing with linear time complexity, supporting context windows from 8K tokens up to 130K tokens via chunk-based inference.

Falcon Mamba 7B is a 7-billion parameter LLM based on a pure State Space Model (SSM) architecture, designed to offer highly competitive performance and superior inference efficiency for long-context processing while entirely eliminating attention mechanisms. Unlike classical Transformer-based LLMs, Falcon Mamba 7B integrates structured state-space recurrences, causal convolutions, and lightweight projections, demonstrating the practical viability of attention-free, sub-quadratic LLMs at scale (Zuo et al., 2024).

1. Model Architecture

Falcon Mamba 7B builds on the "pure" Mamba architecture introduced by Gu & Dao (2023), eschewing all dense self-attention in favor of stacked Mamba layers. Each layer comprises several parallel streams operating on the input $x_t \in \mathbb{R}^{d}$: (1) a two-layer MLP (expansion factor $E = 2$), (2) a causal 1D convolution (width $d_{\text{conv}} = 4$), (3) a low-rank $\Delta$ projection (rank 16), and (4) a structured state-space (SSM) cell updating an $N = 16$-dimensional hidden state $h_t$ per channel:

$$h_t = A\,h_{t-1} + B\,x_t, \quad s_t = C^\top h_t$$

Trainable (usually diagonal) matrices $A, B, C$ parameterize the SSM recurrence. The combined streams are summed, projected via $W_{\text{out}}$, nonlinearly activated, and residually added to the input:

$$y_t = x_t + W_{\text{out}}\,\phi\!\left( W_1 x_t + s_t + \mathrm{Conv}(x_{t-3:t}) + \Delta x_t \right)$$

Layer outputs are stabilized with RMSNorm after each sub-stream. The overall parameter regime is $d = 4096$ channels per layer with an SSM hidden state size of $d_s = 16$, yielding a recurrent memory tensor $M_t \in \mathbb{R}^{4096 \times 16}$, i.e., 65,536 scalar values per memory state (Ben-Kish et al., 12 May 2025).

Falcon Mamba 7B abandons the quadratic complexity of key-value attention, reducing both memory and computational cost. In contrast to a Transformer, which incurs $O(L^2 d)$ time and $O(L^2)$ space per context window of length $L$, Mamba models compute forward passes in $O(Ld)$ time with $O(d)$ working memory for the recurrence. Autoregressive generation is $O(1)$ per token.
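The recurrence above can be sketched in a few lines of NumPy. The toy dimensions, the diagonal-$A$ simplification, and the random parameters below are illustrative only, not the trained model's weights; the point is that working memory is a single $(d, N)$ state regardless of sequence length:

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Sequential scan of h_t = A * h_{t-1} + B x_t, s_t = C^T h_t.

    x: (L, d) input sequence; A: (d, N) diagonal decay per channel;
    B, C: (d, N) per-channel input/output projections.
    Working memory is one (d, N) state, independent of L.
    """
    L, d = x.shape
    N = A.shape[1]
    h = np.zeros((d, N))                 # recurrent state M_t, d * N values
    s = np.empty((L, d))
    for t in range(L):                   # O(L * d * N) time, O(d * N) memory
        h = A * h + B * x[t][:, None]    # elementwise update (diagonal A)
        s[t] = (C * h).sum(axis=1)       # s_t = C^T h_t, per channel
    return s

# Toy dimensions (the real model uses d = 4096, N = 16 per layer)
rng = np.random.default_rng(0)
L, d, N = 8, 4, 16
x = rng.standard_normal((L, d))
A = np.full((d, N), 0.9)                 # stable decay factors
B = rng.standard_normal((d, N)) * 0.1
C = rng.standard_normal((d, N)) * 0.1
out = ssm_scan(x, A, B, C)
print(out.shape)  # (8, 4)
```

In practice the scan is fused into a hardware-aware kernel rather than a Python loop, but the memory behavior is the same: the state `h` never grows with `L`.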

2. Training Regime and Data Pipeline

Pre-training took place over 5.8 trillion tokens, utilizing a staged data curriculum to improve scaling and generalization. The primary data mix encompassed:

  • RefinedWeb ($\sim$5T tokens): high-quality filtered web text
  • Curated corpora: books, arXiv and PubMed abstracts, USPTO patents, and structured dialogues from Reddit, StackExchange, HackerNews
  • Code: The Stack (11 programming languages, deduplicated)
  • Math: Proof-Pile-2 augmented with curated web math (filtered via FastText)
  • Instruction/data tuning ("decay stage"): $\sim$3.7% instruction-style data, synthetic QA, and educational samples

Sequence lengths grew from 2048 to 8192 across four curriculum stages. The last $\sim$10% of tokens introduced instruction tuning and synthetic data over four final epochs. Optimization used AdamW ($\beta_1 = 0.9$, $\beta_2 = 0.95$) with a 'Warmup–Stable–Decay' LR schedule, RMSNorm stabilization, batch-size ramp-up ($b_{\min} = 128$ to $b_{\max} = 2048$), and distributed training on 256 H100 GPUs (Zuo et al., 2024).
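A 'Warmup–Stable–Decay' schedule is a simple piecewise function: linear warmup to a peak rate, a long flat plateau, then a decay phase at the end of training. The sketch below uses illustrative boundary fractions and learning rates, not the published hyperparameters:

```python
def wsd_lr(step, total_steps, peak_lr=1e-3, min_lr=1e-5,
           warmup_frac=0.01, decay_frac=0.10):
    """Warmup-Stable-Decay: linear warmup, flat plateau, linear decay.

    warmup_frac / decay_frac and the learning rates are placeholders;
    the real run's values are not reproduced here.
    """
    warmup_steps = int(total_steps * warmup_frac)
    decay_steps = int(total_steps * decay_frac)
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps     # linear warmup
    if step < decay_start:
        return peak_lr                                 # stable plateau
    progress = (step - decay_start) / decay_steps      # final decay phase
    return peak_lr + (min_lr - peak_lr) * progress

total = 100_000
print(wsd_lr(0, total))        # tiny warmup value
print(wsd_lr(50_000, total))   # plateau: peak_lr
print(wsd_lr(99_999, total))   # approaching min_lr
```

The appeal of WSD over cosine decay is that the plateau can be extended indefinitely (useful for staged curricula like this one), with only the final decay fraction re-run when training is extended.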

3. Inference Efficiency and Memory Scaling

Falcon Mamba 7B exhibits superior inference efficiency for both short and long context windows due to its absence of key-value (KV) cache scaling with sequence length:

  • Prefill (parallel): $O(Ld)$ time and $O(Ld)$ memory, vs. $O(L^2 d)$ time and $O(Ld)$ memory for a Transformer
  • Decode: $O(d)$ per token, vs. $O(Ld)$ for attention models
  • Maximum context (A10, 24 GB): Falcon Mamba 7B supports parallel prefill for contexts up to $\sim$8K tokens, and sequential prefill for virtually unlimited length (GPU memory is bounded by the model weights, not the context)
  • Throughput: On H100 80GB, Falcon Mamba 7B maintains constant throughput and peak CUDA memory regardless of generated sequence length, supporting generations up to 130K tokens without latency drift (Zuo et al., 2024)

This allows for inference with long contexts without encountering GPU OOM conditions, unlike Transformers which typically exhaust device memory at 4–6K tokens due to quadratic KV cache scaling.
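The memory gap is easy to see with back-of-envelope arithmetic. The layer and head counts below are illustrative placeholders (fp16 elements assumed), not the exact configurations of any particular checkpoint:

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128,
                   bytes_per_elem=2):
    """Transformer KV cache: grows linearly with sequence length
    (one K and one V tensor per layer). Dimensions are illustrative."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

def ssm_state_bytes(n_layers=64, d=4096, n_state=16, bytes_per_elem=2):
    """Recurrent SSM state: fixed size, independent of sequence length.
    Per-layer state is d x n_state values (illustrative layer count)."""
    return n_layers * d * n_state * bytes_per_elem

for L in (4_096, 32_768, 131_072):
    print(f"{L:>7} tokens: {kv_cache_bytes(L) / 2**30:.2f} GiB KV cache")
print(f"SSM state: {ssm_state_bytes() / 2**20:.0f} MiB at any length")
```

Under these assumptions, a 131K-token KV cache occupies 16 GiB while the recurrent state stays at a few MiB, which is why the SSM's decode throughput and peak memory remain flat as generation length grows.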

4. Empirical Performance and Benchmarks

Falcon Mamba 7B achieves leading results among SSM-based LLMs and is competitive with or superior to comparable scale Transformers and hybrid (Transformer-SSM) models. Notable evaluation results include:

| Benchmark (HF v1/v2) | Falcon Mamba 7B | Transformer/Hybrid Reference |
|---|---|---|
| ARC-25 | 62.03 | Llama 3.1 8B: 62.28 |
| HellaSwag | 80.82 | Mistral 7B: 80.82 |
| MMLU | 62.11 | Falcon 2 11B: 64.28 |
| Winogrande | 73.64 | Gemma 7B: 63.75 |
| TruthfulQA | 53.42 | - |
| GSM8K | 52.54 | - |
| HF v1 Average | 64.09 | Mistral 7B: 60.97 |
| HF v2 Average | 15.04 | Gemma 7B: 15.28 |

On challenging long-context benchmarks, Falcon-Mamba-Inst-7B augmented with chunk-based inference improves performance by 28% on LongBench and achieves a gain from 2.8 to 27.7 accuracy (+888%) on LongBench v2 with contexts up to 150K tokens, surpassing several Transformer baselines (Ben-Kish et al., 12 May 2025).

5. Long-Context Processing and Overflow Mitigation

Recurrent SSMs have a fixed-size memory of $m = d \cdot d_s = 65{,}536$ values, which introduces an inherent limitation when condensing very long sequences. When the input length $N \gg m$, the memory must summarize an excessive volume of information, leading to "overflow," i.e., loss of retrievable context.

To address this, Falcon-Mamba-Inst-7B implements a chunk-based inference algorithm (Overflow Prevention with Recurrent Memory, OPRM):

  • Procedure: Split the long context $C$ into $n$ non-overlapping chunks $C_1, \ldots, C_n$ of maximum safe size $c \leq c_{\max}$. Select the single most relevant chunk for each query using either a min-entropy or max-probability-of-target scoring function.
  • Overflow avoidance: Because each chunk is processed independently into the $m$-width recurrent memory, overflow is prevented for $c$ below empirical limits, typically $c_{\max} \approx 2000$–$3000$ tokens.
  • Complexity: Baseline prefill complexity $O(N \log N)$ improves to $O(N \log c)$ for OPRM, since $c \ll N$, yielding sub-quadratic scaling for ultra-long contexts.
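The split-score-select loop can be sketched as follows. The function names, the character-level "tokens," and the stubbed scorer are hypothetical stand-ins for real model calls (in OPRM the score would be, e.g., the entropy of the model's output distribution over each chunk):

```python
def oprm_answer(context, query, score_fn, answer_fn, c_max=2000):
    """Overflow-prevention sketch: split the context into chunks of at
    most c_max tokens, score each chunk independently (so the fixed-size
    recurrent state never compresses more than c_max tokens at once),
    then answer from the single best-scoring chunk.

    score_fn(chunk, query) -> float, lower is better (e.g. min-entropy);
    answer_fn(chunk, query) -> str. Both are placeholders for model calls.
    """
    chunks = [context[i:i + c_max] for i in range(0, len(context), c_max)]
    best = min(chunks, key=lambda ch: score_fn(ch, query))
    return answer_fn(best, query)

# Toy demo: "tokens" are characters; the stub scorer prefers the chunk
# that actually contains the query string.
ctx = "x" * 5000 + " the secret code is 42 " + "y" * 5000
score = lambda ch, q: 0.0 if q in ch else 1.0
answer = lambda ch, q: "42" if q in ch else "unknown"
print(oprm_answer(ctx, "secret code", score, answer))  # 42
```

Because each chunk's prefill is independent, the chunks can also be scored in parallel, which is where the $O(N \log c)$ cost quoted above comes from.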

Empirically, chunk-based inference not only stabilizes retrieval and QA tasks but also substantially improves model F1 score and accuracy on long-context benchmarks (Ben-Kish et al., 12 May 2025).

6. Comparative Analysis and Architectural Trade-Offs

Despite literature indicating that hybrid SSM-attention models (e.g., interleaved with sparse or local attention) may excel in some settings, Falcon Mamba 7B demonstrates that a "pure Mamba" design—absent any attention—can match or outperform such hybrids given sufficient parameter count and data scale. Notably, pure SSMs still lag behind classic Transformers in certain in-context learning scenarios (e.g., multi-step copying or retrieval tasks), although targeted injection of chain-of-thought or instruction data can partially alleviate this gap (Zuo et al., 2024).

Advantages:

  • Linear time and constant memory in context length for both prefill and generation
  • No sequence-length dependent OOM on commodity GPUs
  • State-of-the-art accuracy among attention-free LLMs at 7B scale

Limitations:

  • Ultra-long-context (>8K tokens) training remains underexplored in the current public checkpoints
  • Certain retrieval and copy-intensive tasks may still favor Transformer architectures
  • Maximum per-chunk memory capacity bounds retrievable factual content per pass

Future directions include larger training curricula, multilingual scaling, hybridization with sparse attention, and pushing SSM parameter counts beyond 7B to further extend performance boundaries (Zuo et al., 2024, Ben-Kish et al., 12 May 2025).

7. Relationship to Falcon Series and Open Science

Falcon Mamba 7B is distinct from the Falcon-7B model presented in "The Falcon Series of Open LLMs" (Almazrouei et al., 2023). The latter is a Transformer-based decoder-only LLM trained on 1.5T tokens, with multiquery-group attention and rotary positional encodings, while Falcon Mamba 7B is an attention-free State Space Language Model (SSLM) trained on 5.8T tokens. Both models are openly released under permissive licenses with checkpoints and training code to foster research and ecosystem development, but Falcon Mamba 7B currently represents the most performant attention-free LLM at the 7B parameter scale.

Further performance comparisons, training regime, and architectural details are summarized below:

| Model | Architecture | Params | Pretraining Data | Context Length Trained | Benchmarked Max Context | Benchmark Performance |
|---|---|---|---|---|---|---|
| Falcon Mamba 7B | Pure Mamba (SSM) | 7B | 5.8T tokens | up to 8K tokens | $\gg$8K (with OPRM) | Tops HF v1/v2 averages; LongBench (with OPRM) |
| Falcon-7B | Transformer | 7B | 1.5T tokens | 2K tokens | 2K tokens | EleutherAI avg: 60.8% |

This comparison highlights the methodological shift towards SSM-based architectures for efficient, high-performance open-domain LLMs.
