
Falcon Mamba 7B: Pure SSM LLM

Updated 19 February 2026
  • Falcon Mamba 7B is a 7-billion parameter, attention-free LLM that employs a pure SSM architecture with state-space recurrences, causal convolutions, and low-rank projections.
  • It is trained on 5.8 trillion tokens using a staged curriculum that scales sequence lengths and integrates techniques like AdamW optimization and RMSNorm stabilization.
  • The model achieves efficient long-context processing with linear time complexity, supporting context windows from 8K tokens up to 130K tokens via chunk-based inference.

Falcon Mamba 7B is a 7-billion parameter LLM based on a pure State Space Model (SSM) architecture, designed to offer highly competitive performance and superior inference efficiency for long-context processing while entirely eliminating attention mechanisms. Unlike classical Transformer-based LLMs, Falcon Mamba 7B integrates structured state-space recurrences, causal convolutions, and lightweight projections, demonstrating the practical viability of attention-free, sub-quadratic LLMs at scale (Zuo et al., 2024).

1. Model Architecture

Falcon Mamba 7B builds on the "pure" Mamba architecture introduced by Gu & Dao (2023), eschewing all dense self-attention in favor of stacked Mamba layers. Each layer comprises several parallel streams operating on the input $x_t \in \mathbb{R}^{d}$: (1) a two-layer MLP (expansion factor $E = 2$), (2) a causal 1D convolution (width $d_{\text{conv}} = 4$), (3) a low-rank $\Delta$ projection (rank 16), and (4) a structured state-space (SSM) cell updating an $N = 16$-dimensional hidden state $h_t$ per channel:

$$h_t = A\,h_{t-1} + B\,x_t, \quad s_t = C^\top h_t$$

Trainable (usually diagonal) matrices $A, B, C$ parameterize the SSM recurrence. The combined streams are summed, projected via $W_{\text{out}}$, nonlinearly activated, and residually added to the input:

$$y_t = x_t + W_{\text{out}}\,\phi\!\left( W_1 x_t + s_t + \mathrm{Conv}(x_{t-3:t}) + \Delta x_t \right)$$

Layer outputs are stabilized with RMSNorm after each sub-stream. The overall parameter regime is $d = 4096$ channels per layer with an SSM hidden state size of $d_s = 16$, yielding a recurrent memory tensor $M_t \in \mathbb{R}^{4096 \times 16}$, i.e., 65,536 scalar values per memory state (Ben-Kish et al., 12 May 2025).

Falcon Mamba 7B abandons the quadratic complexity of key-value attention, reducing both memory and computational cost. In contrast to a Transformer, which incurs $O(L^2 d)$ time and $O(L^2)$ space per context window of length $L$, Mamba models compute forward passes in $O(Ld)$ time with $O(d)$ working memory for the recurrence. Autoregressive generation is $O(1)$ per token.
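The recurrence above can be sketched in a few lines of NumPy. The toy dimensions, the diagonal-$A$ simplification, and the random parameters below are illustrative only, not the trained model's weights; the point is that working memory is a single $(d, N)$ state regardless of sequence length:

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Sequential scan of h_t = A * h_{t-1} + B x_t, s_t = C^T h_t.

    x: (L, d) input sequence; A: (d, N) diagonal decay per channel;
    B, C: (d, N) per-channel input/output projections.
    Working memory is one (d, N) state, independent of L.
    """
    L, d = x.shape
    N = A.shape[1]
    h = np.zeros((d, N))                 # recurrent state M_t, d * N values
    s = np.empty((L, d))
    for t in range(L):                   # O(L * d * N) time, O(d * N) memory
        h = A * h + B * x[t][:, None]    # elementwise update (diagonal A)
        s[t] = (C * h).sum(axis=1)       # s_t = C^T h_t, per channel
    return s

# Toy dimensions (the real model uses d = 4096, N = 16 per layer)
rng = np.random.default_rng(0)
L, d, N = 8, 4, 16
x = rng.standard_normal((L, d))
A = np.full((d, N), 0.9)                 # stable decay factors
B = rng.standard_normal((d, N)) * 0.1
C = rng.standard_normal((d, N)) * 0.1
out = ssm_scan(x, A, B, C)
print(out.shape)  # (8, 4)
```

In practice the scan is fused into a hardware-aware kernel rather than a Python loop, but the memory behavior is the same: the state `h` never grows with `L`.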

2. Training Regime and Data Pipeline

Pre-training took place over 5.8 trillion tokens, utilizing a staged data curriculum to improve scaling and generalization. The primary data mix encompassed:

  • RefinedWeb ($\sim$5T tokens): high-quality filtered web text
  • Curated corpora: books, arXiv and PubMed abstracts, USPTO patents, and structured dialogues from Reddit, StackExchange, HackerNews
  • Code: The Stack (11 programming languages, deduplicated)
  • Math: Proof-Pile-2 augmented with curated web math (filtered via FastText)
  • Instruction/data tuning ("decay stage"): $\sim$3.7% instruction-style data, synthetic QA, and educational samples

Sequence lengths grew from 2048 to 8192 across four curriculum stages. The last $\sim$10% of tokens introduced instruction tuning and synthetic data over four final epochs. Optimization used AdamW ($\beta_1 = 0.9$, $\beta_2 = 0.95$) with a 'Warmup–Stable–Decay' LR schedule, RMSNorm stabilization, batch-size ramp-up ($b_{\min} = 128$ to $b_{\max} = 2048$), and distributed training on 256 H100 GPUs (Zuo et al., 2024).
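A 'Warmup–Stable–Decay' schedule is a simple piecewise function: linear warmup to a peak rate, a long flat plateau, then a decay phase at the end of training. The sketch below uses illustrative boundary fractions and learning rates, not the published hyperparameters:

```python
def wsd_lr(step, total_steps, peak_lr=1e-3, min_lr=1e-5,
           warmup_frac=0.01, decay_frac=0.10):
    """Warmup-Stable-Decay: linear warmup, flat plateau, linear decay.

    warmup_frac / decay_frac and the learning rates are placeholders;
    the real run's values are not reproduced here.
    """
    warmup_steps = int(total_steps * warmup_frac)
    decay_steps = int(total_steps * decay_frac)
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps     # linear warmup
    if step < decay_start:
        return peak_lr                                 # stable plateau
    progress = (step - decay_start) / decay_steps      # final decay phase
    return peak_lr + (min_lr - peak_lr) * progress

total = 100_000
print(wsd_lr(0, total))        # tiny warmup value
print(wsd_lr(50_000, total))   # plateau: peak_lr
print(wsd_lr(99_999, total))   # approaching min_lr
```

The appeal of WSD over cosine decay is that the plateau can be extended indefinitely (useful for staged curricula like this one), with only the final decay fraction re-run when training is extended.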

3. Inference Efficiency and Memory Scaling

Falcon Mamba 7B exhibits superior inference efficiency for both short and long context windows due to its absence of key-value (KV) cache scaling with sequence length:

  • Prefill (parallel): $O(Ld)$ time and $O(Ld)$ memory, vs. $O(L^2 d)$ time and $O(Ld)$ memory for a Transformer
  • Decode: $O(d)$ per token, vs. $O(Ld)$ for attention models
  • Maximum context (A10, 24 GB): Falcon Mamba 7B supports parallel prefill for contexts up to $\sim$8K tokens, and sequential prefill for virtually unlimited length (GPU memory is bounded by the model weights, not the context)
  • Throughput: On H100 80GB, Falcon Mamba 7B maintains constant throughput and peak CUDA memory regardless of generated sequence length, supporting generations up to 130K tokens without latency drift (Zuo et al., 2024)

This allows for inference with long contexts without encountering GPU OOM conditions, unlike Transformers which typically exhaust device memory at 4–6K tokens due to quadratic KV cache scaling.
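The memory gap is easy to see with back-of-envelope arithmetic. The layer and head counts below are illustrative placeholders (fp16 elements assumed), not the exact configurations of any particular checkpoint:

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128,
                   bytes_per_elem=2):
    """Transformer KV cache: grows linearly with sequence length
    (one K and one V tensor per layer). Dimensions are illustrative."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

def ssm_state_bytes(n_layers=64, d=4096, n_state=16, bytes_per_elem=2):
    """Recurrent SSM state: fixed size, independent of sequence length.
    Per-layer state is d x n_state values (illustrative layer count)."""
    return n_layers * d * n_state * bytes_per_elem

for L in (4_096, 32_768, 131_072):
    print(f"{L:>7} tokens: {kv_cache_bytes(L) / 2**30:.2f} GiB KV cache")
print(f"SSM state: {ssm_state_bytes() / 2**20:.0f} MiB at any length")
```

Under these assumptions, a 131K-token KV cache occupies 16 GiB while the recurrent state stays at a few MiB, which is why the SSM's decode throughput and peak memory remain flat as generation length grows.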

4. Empirical Performance and Benchmarks

Falcon Mamba 7B achieves leading results among SSM-based LLMs and is competitive with or superior to comparable scale Transformers and hybrid (Transformer-SSM) models. Notable evaluation results include:

| Benchmark (HF v1/v2) | Falcon Mamba 7B | Transformer/Hybrid Reference |
|---|---|---|
| ARC-25 | 62.03 | Llama 3.1 8B: 62.28 |
| HellaSwag | 80.82 | Mistral 7B: 80.82 |
| MMLU | 62.11 | Falcon 2 11B: 64.28 |
| Winogrande | 73.64 | Gemma 7B: 63.75 |
| TruthfulQA | 53.42 | - |
| GSM8K | 52.54 | - |
| HF v1 Average | 64.09 | Mistral 7B: 60.97 |
| HF v2 Average | 15.04 | Gemma 7B: 15.28 |

On challenging long-context benchmarks, Falcon-Mamba-Inst-7B augmented with chunk-based inference improves performance by 28% on LongBench and achieves a gain from 2.8 to 27.7 accuracy (+888%) on LongBench v2 with contexts up to 150K tokens, surpassing several Transformer baselines (Ben-Kish et al., 12 May 2025).

5. Long-Context Processing and Overflow Mitigation

Recurrent SSMs have a fixed-size memory of $m = d \cdot d_s = 65{,}536$ values, which introduces an inherent limitation when condensing very long sequences. When the input length $N \gg m$, the memory must summarize an excessive volume of information, leading to "overflow," i.e., loss of retrievable context.

To address this, Falcon-Mamba-Inst-7B implements a chunk-based inference algorithm (Overflow Prevention with Recurrent Memory, OPRM):

  • Procedure: Split the long context $C$ into $n$ non-overlapping chunks $C_1, \ldots, C_n$ of maximum safe size $c \leq c_{\max}$. Select the single most relevant chunk for each query using either a min-entropy or max-probability-of-target scoring function.
  • Overflow avoidance: Because each chunk is processed independently into the $m$-width recurrent memory, overflow is prevented for $c$ below empirical limits, typically $c_{\max} \approx 2000$–$3000$ tokens.
  • Complexity: Baseline prefill complexity $O(N \log N)$ improves to $O(N \log c)$ for OPRM, since $c \ll N$, yielding sub-quadratic scaling for ultra-long contexts.
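The split-score-select loop can be sketched as follows. The function names, the character-level "tokens," and the stubbed scorer are hypothetical stand-ins for real model calls (in OPRM the score would be, e.g., the entropy of the model's output distribution over each chunk):

```python
def oprm_answer(context, query, score_fn, answer_fn, c_max=2000):
    """Overflow-prevention sketch: split the context into chunks of at
    most c_max tokens, score each chunk independently (so the fixed-size
    recurrent state never compresses more than c_max tokens at once),
    then answer from the single best-scoring chunk.

    score_fn(chunk, query) -> float, lower is better (e.g. min-entropy);
    answer_fn(chunk, query) -> str. Both are placeholders for model calls.
    """
    chunks = [context[i:i + c_max] for i in range(0, len(context), c_max)]
    best = min(chunks, key=lambda ch: score_fn(ch, query))
    return answer_fn(best, query)

# Toy demo: "tokens" are characters; the stub scorer prefers the chunk
# that actually contains the query string.
ctx = "x" * 5000 + " the secret code is 42 " + "y" * 5000
score = lambda ch, q: 0.0 if q in ch else 1.0
answer = lambda ch, q: "42" if q in ch else "unknown"
print(oprm_answer(ctx, "secret code", score, answer))  # 42
```

Because each chunk's prefill is independent, the chunks can also be scored in parallel, which is where the $O(N \log c)$ cost quoted above comes from.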

Empirically, chunk-based inference not only stabilizes retrieval and QA tasks but also substantially improves model F1 score and accuracy on long-context benchmarks (Ben-Kish et al., 12 May 2025).

6. Comparative Analysis and Architectural Trade-Offs

Despite literature indicating that hybrid SSM-attention models (e.g., interleaved with sparse or local attention) may excel in some settings, Falcon Mamba 7B demonstrates that a "pure Mamba" design—absent any attention—can match or outperform such hybrids given sufficient parameter count and data scale. Notably, pure SSMs still lag behind classic Transformers in certain in-context learning scenarios (e.g., multi-step copying or retrieval tasks), although targeted injection of chain-of-thought or instruction data can partially alleviate this gap (Zuo et al., 2024).

Advantages:

  • Linear time and constant memory in context length for both prefill and generation
  • No sequence-length dependent OOM on commodity GPUs
  • State-of-the-art accuracy among attention-free LLMs at 7B scale

Limitations:

  • Ultra-long-context (>8K tokens) training remains underexplored in the current public checkpoints
  • Certain retrieval and copy-intensive tasks may still favor Transformer architectures
  • Maximum per-chunk memory capacity bounds retrievable factual content per pass

Future directions include larger training curricula, multilingual scaling, hybridization with sparse attention, and pushing SSM parameter counts beyond 7B to further extend performance boundaries (Zuo et al., 2024, Ben-Kish et al., 12 May 2025).

7. Relationship to Falcon Series and Open Science

Falcon Mamba 7B is distinct from the Falcon-7B model presented in "The Falcon Series of Open LLMs" (Almazrouei et al., 2023). The latter is a Transformer-based decoder-only LLM trained on 1.5T tokens, with multiquery-group attention and rotary positional encodings, while Falcon Mamba 7B is an attention-free State Space Language Model (SSLM) trained on 5.8T tokens. Both models are openly released under permissive licenses with checkpoints and training code to foster research and ecosystem development, but Falcon Mamba 7B currently represents the most performant attention-free LLM at the 7B parameter scale.

Further performance comparisons, training regime, and architectural details are summarized below:

| Model | Architecture | Params | Pretraining Data | Context Length Trained | Benchmarked Max Context | Benchmark Performance |
|---|---|---|---|---|---|---|
| Falcon Mamba 7B | Pure Mamba (SSM) | 7B | 5.8T tokens | up to 8K tokens | $\gg$8K (with OPRM) | Tops HF v1/v2 averages; LongBench (with OPRM) |
| Falcon-7B | Transformer | 7B | 1.5T tokens | 2K tokens | 2K tokens | EleutherAI avg: 60.8% |

This comparison highlights the methodological shift towards SSM-based architectures for efficient, high-performance open-domain LLMs.
