Small Decoder-Only LLMs Overview
- Small Decoder-Only LLMs are autoregressive transformer models that exclusively use a decoder stack for left-to-right sequence modeling, enabling efficient training and deployment.
- They incorporate architectural innovations such as depth/width compression, grouped-query attention, and adapter-based fine-tuning to optimize performance within a parameter range of tens of millions to several billion.
- They are applied in diverse tasks like machine translation, sequence labeling, and robotic planning, though they face challenges such as output hallucination and limitations in complex reasoning.
A small decoder-only LLM is an autoregressive transformer-based neural network with 70 million to roughly 9 billion parameters, using only a decoder (causal) stack for left-to-right sequence modeling, distinguishing it from encoder–decoder or purely masked-language-model (MLM) architectures. The “small” descriptor in academic discourse typically refers to models between tens of millions and several billion parameters—where single- or few-GPU training and inference is tractable. Interest in this class arises from a need for efficient, accessible, and privacy-preserving alternatives to multi-hundred-billion-parameter models, especially in resource-constrained or domain-specialized scenarios.
1. Model Architectures and Training Regimes
Small decoder-only LLMs universally implement variants of the Transformer decoder block: stacked masked multi-head self-attention layers followed by positionwise feed-forward sublayers, with residual connections and layer normalization. Key design axes include:
- Parameter Scale: Models range from as small as 68–125 M parameters (e.g., GPT-2 117M, GPT-Neo-125M, “scratch” decoders) up to 1–2 B (LLämmlein-1B, TinyLlama-1.1B, Gemma 2B), and occasionally to 7–9 B when termed ‘small’ relative to frontier LLMs (Pfister et al., 2024, Lamelas, 7 Jan 2026).
- Width/Depth Variants: Increasing model width (hidden size per layer) or depth (layer count) provides equivalent improvements in test loss per unit FLOPs, but width scaling favors computational efficiency (higher per-step sample throughput) (Caillaut et al., 2024).
- Architectural Extensions:
- "ParallelGPT" splits the stack and fuses twin transformer branches, retaining language modeling performance with marginal parameter overhead.
- "LinearlyCompressedGPT" and "ConvCompressedGPT" reduce dimensionality progressively in deeper layers (via linear or convolutional projections), saving up to 36% parameters and achieving 18% faster training without convergent loss increase (Suresh et al., 2024).
- Tokenization: Most use subword BPE or SentencePiece vocabularies sized for compactness and throughput (e.g., 32k for German-only corpora, 50–61k for English GPT) (Pfister et al., 2024).
- Pretraining Data: High-quality, extensively filtered open datasets (RedPajama-German, The Pile, BookCorpus, or CodeParrot) are the standard, with deduplication, document/paragraph filtering, and tailored token-to-word ratios for noise reduction (Pfister et al., 2024, Lamelas, 7 Jan 2026).
- Optimization: AdamW is employed throughout, with linear-cosine or cosine learning-rate schedules, modest initial rates (1e-4 to 6e-4), and careful validation with early stopping during adaptation (Pfister et al., 2024, Favero et al., 20 Feb 2025, Lamelas, 7 Jan 2026).
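The decoder block described above (masked multi-head self-attention, position-wise feed-forward sublayer, residual connections, layer normalization) can be sketched minimally in PyTorch. This is an illustrative pre-LN variant with made-up default dimensions, not the exact block of any surveyed model:

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Pre-LN transformer decoder block: causally masked self-attention
    followed by a position-wise FFN, each wrapped in a residual connection."""

    def __init__(self, d_model: int = 768, n_heads: int = 12, d_ff: int = 3072):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        T = x.size(1)
        # True above the diagonal = position not allowed to attend (future blocked).
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        h = self.ln1(x)
        h, _ = self.attn(h, h, h, attn_mask=causal, need_weights=False)
        x = x + h
        return x + self.ff(self.ln2(x))
```

Stacking such blocks, plus token/position embeddings and a tied output head, yields the architecture the parameter counts above refer to; width scaling grows `d_model`/`d_ff`, depth scaling grows the number of blocks.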
2. Scaling Laws and Empirical Performance
Scaling properties for small decoder-only LLMs have been modeled with both parameter-only and data-aware laws:
- Power Law in Parameters:
- Fit to cross-entropy loss as a power law in model size (non-embedding parameters). Holds from tens of millions up to ~1B parameters, but deviates when extrapolated to 7B+ (Caillaut et al., 2024).
- Bivariate “Chinchilla-style” Law:
- Jointly models loss as a function of model size $N$ and seen target tokens $D$, in the Chinchilla form $L(N, D) = E + A/N^{\alpha} + B/D^{\beta}$. Accurately fits observed losses across multiple checkpoints, but fails when generalized across domains or extrapolated beyond the fitted range (Caillaut et al., 2024).
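A bivariate law of this kind can be fit to checkpoint losses with ordinary nonlinear least squares. The sketch below uses entirely synthetic coefficients and grid points, purely to illustrate the fitting procedure, not to reproduce any reported fit:

```python
import numpy as np
from scipy.optimize import curve_fit

# Chinchilla-style bivariate law: L(N, D) = E + A / N**alpha + B / D**beta
def bivariate_law(X, E, A, alpha, B, beta):
    N, D = X
    return E + A / N**alpha + B / D**beta

# Synthetic checkpoint grid (model sizes x token counts); coefficients are made up.
N_grid, D_grid = np.meshgrid([70e6, 160e6, 410e6], [5e9, 10e9, 20e9])
N, D = N_grid.ravel(), D_grid.ravel()
true_params = (1.7, 400.0, 0.34, 200.0, 0.28)
L = bivariate_law((N, D), *true_params)

# Recover the coefficients from the (noiseless) synthetic losses.
popt, _ = curve_fit(bivariate_law, (N, D), L,
                    p0=(1.5, 300.0, 0.30, 150.0, 0.25), maxfev=50000)
```

With real checkpoints the residuals are nonzero, and (as noted above) a fit obtained on one domain or size range generally does not transfer to another.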
Empirical test loss, convergence rates, and task metrics demonstrate:
| Model (MT) | Final Loss (nats/token) | Convergence Steps |
|---|---|---|
| 70M | 3.1 | ~80k |
| 160M | 2.9 | ~50k |
| 410M | 2.8 | ~30k |
- Performance Plateaus: Simple sequence labeling or classification tasks saturate early with increasing model size, while complex reasoning continues to benefit from additional tokens or scaling beyond 1B (Pfister et al., 2024).
- Task Competitiveness: On certain benchmarks (e.g., German SuperGLEBer, educational argument mining), models of 1–9B can match or outperform same-size MLM encoders and approach larger LLMs with careful fine-tuning and instruction adaptation (Pfister et al., 2024, Favero et al., 20 Feb 2025).
3. Applications and Task-Specific Adaptation
Small decoder-only LLMs have been deployed and rigorously evaluated in diverse downstream settings:
- Machine Translation: Models as small as 70M–410M achieve monotonic improvements with size for standard translation objectives using causal left-to-right decoding, but law extrapolation fails above 7B parameters. Optimal budget allocation favors smaller models with more data, rather than maximizing parameter count (Caillaut et al., 2024).
- Sequence Labeling (NER, chunking, information extraction): Layer-wise causal-mask removal in large decoders (7B+) enables these models to match or surpass SOTA MLM methods for IOB2 tagging. In contrast, removing masks in sub-100M decoders offers no benefit; MLM pretraining is preferable at this scale. Selective unmasking of higher layers, rather than full removal, yields the best results (Dukić et al., 2024).
- Text Rewriting (Grammar Correction, Simplification): Small models (125–345M) fine-tuned on grammar or simplification tasks fall short of LLMs (GPT-3.5, GPT-4) by factors of 2–3 on metrics such as GLEU and M², displaying high hallucination rates and limited meaning preservation. Larger SLMs (1–2B) narrow the gap on SARI and Flesch but still trail LLMs (Lamelas, 7 Jan 2026).
- Argument Mining in Education: For essay segmentation and argument-type classification, 7–9B parameter models (Qwen 2.5, Llama 3.1, Gemma 2) outperform encoder/encoder-decoder baselines, with macro-F1 up to ∼87.5. Fine-tuning is crucial for segmentation and type; few-shot prompting suffices for quality assessment. Latency and compute requirements remain compatible with local deployment (Favero et al., 20 Feb 2025).
- Robotic Planning (Task Decomposition): Fine-tuned GPT2-medium (345M) matches in-domain planning accuracy of GPT3.5 (90% vs 95% simulator success), supporting chain-of-thought (CoT) reasoning in single-domain tasks. Qualitative and quantitative analysis reveals competitive breakdown of commands into action sequences (Choi et al., 2024).
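The selective unmasking used for sequence labeling amounts to giving each decoder layer its own attention mask: lower layers stay causal, and only layers above a chosen boundary attend bidirectionally. A schematic helper (the function name and signature are my own, not from the cited work):

```python
import torch

def per_layer_masks(num_layers: int, seq_len: int, unmask_from: int):
    """Boolean attention masks per layer: True = attention allowed.

    Layers below `unmask_from` keep the causal (left-to-right) mask;
    layers at or above it attend bidirectionally, as in selective
    unmasking of higher layers.
    """
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    full = torch.ones(seq_len, seq_len, dtype=torch.bool)
    return [full if layer >= unmask_from else causal
            for layer in range(num_layers)]
```

The boundary `unmask_from` is the quantity one would tune on validation F1, as recommended in the best-practices section below; keeping early layers causal preserves the model's generative behavior.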
4. Efficient Scaling and Architectural Modifications
Reducing model size while retaining task efficacy has advanced through several architectural strategies:
- Parallel Stacks: ParallelGPT doubles the embedding and splits the stack, incurring ~10% parameter overhead but allowing branches to be dropped at inference for a speedup at the cost of minor quality loss (~0.2 nats).
- Depth/Width Compression: LinearlyCompressedGPT and ConvCompressedGPT shrink model width after each stage, enabling 36% parameter reduction and ~18% faster training, empirically matching standard transformer learning curves for medium-size code generation tasks (Suresh et al., 2024).
- Grouped-Query Attention: For small LLMs, grouped queries (group size = 4) yield memory and speed gains with minimal impact on downstream performance. Combined with parameter sharding and flash attention, they permit scaling up to 1B parameters on commodity GPUs (Pfister et al., 2024).
- Adapter-based Fine-tuning: QLoRA and LoRA, with 4-bit quantization, make possible the fine-tuning of 7–9B decoders within 8–40GB RAM, favoring low-cost, privacy-preserving customization (Dukić et al., 2024, Favero et al., 20 Feb 2025).
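Grouped-query attention shares one key/value head across each group of query heads, shrinking the KV cache by the group size. A minimal sketch (shapes and the group size of 4 in the usage are illustrative):

```python
import math
import torch

def grouped_query_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor):
    """q: (B, Hq, T, d); k, v: (B, Hkv, T, d), with Hq a multiple of Hkv.

    Each group of Hq // Hkv query heads shares one key/value head, so the
    KV cache is smaller by that factor than in standard multi-head attention.
    """
    Hq, Hkv = q.shape[1], k.shape[1]
    rep = Hq // Hkv
    k = k.repeat_interleave(rep, dim=1)  # expand KV heads to match query heads
    v = v.repeat_interleave(rep, dim=1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    T = q.shape[2]
    causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(causal, float("-inf"))
    return scores.softmax(dim=-1) @ v
```

For example, 8 query heads over 2 KV heads gives group size 4; production implementations avoid materializing the expanded k/v and instead index KV heads directly, but the arithmetic is the same.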
5. Performance Limitations and Failure Modes
- Scaling Law Breakdown: Single-variable power laws derived from <1B models systematically underestimate losses at 6.9B; bivariate laws fit better but cannot generalize across untrained domains or to vastly larger models (Caillaut et al., 2024).
- Data Distribution Dependency: Fitting scaling laws per domain or language direction yields substantially different exponents and loss offsets; cross-domain extrapolation is unreliable (Caillaut et al., 2024).
- Mask Removal in Small Models: Causal mask removal and bidirectional attention injection are futile in sub-100M decoders; bidirectionality requires both model and data scale (Dukić et al., 2024).
- Output Hallucination and Meaning Drift: Fine-tuned SLMs (grammar/simplification) display hallucination rates 3–8× higher than LLMs; output diverges from the source, undermining reliability for correction tasks. Cascading or multi-pass decoding yields inconsistent improvements (Lamelas, 7 Jan 2026).
- Domain Specificity: Small models match LLMs for domain-specific CoT decomposition but fail to generalize to broader or longer-horizon tasks, due to limited capacity and pretraining data scope (Choi et al., 2024).
6. Practical Recommendations and Best Practices
- Model-Data Allocation: Given fixed computational budget, extending data coverage (training tokens) is typically more efficient than scaling model size, especially for models up to 1B parameters (Caillaut et al., 2024).
- Design Ratios: For an optimal trade-off in MT and sequence tasks, scale depth and hidden size together at a roughly fixed ratio (e.g., 12 layers × 768 hidden for 160M) (Caillaut et al., 2024).
- Tokenizer Construction: Corpus- and domain-optimized subword vocabulary (e.g., 32k BPE for German) decreases token count per document, accelerating training (Pfister et al., 2024).
- Validation Strategy: Always cross-validate predicted scaling law–derived model sizes with intermediate checkpoints rather than allocating compute naïvely (Caillaut et al., 2024).
- Fine-tuning: Employ parameter-efficient adapters (e.g., LoRA) rather than full weight adaptation. For sequence labeling, tune groupwise mask removal based on validation F1; do not remove masks in early layers to preserve generation (Dukić et al., 2024).
- Deployment: Use quantized SLMs for client-side/NLP tasks in privacy-sensitive contexts (education, healthcare, internal tools); exploit efficiency gains for on-device processing and edge deployment (Favero et al., 20 Feb 2025, Pfister et al., 2024).
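The parameter-efficient adapters recommended above can be sketched as a LoRA wrapper around a frozen linear layer. Rank and scaling below are illustrative defaults, and the class is a minimal sketch rather than a full PEFT implementation:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update:
    y = W x + (alpha / r) * B A x, with A (r x in) and B (out x r)."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # only the adapter is trained
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no drift at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale
```

Because `B` starts at zero, the wrapped layer is initially identical to the frozen base model; combined with 4-bit quantization of the base weights (as in QLoRA), this is what brings 7–9B decoders within the 8–40GB memory budgets cited above.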
7. Outlook and Open Directions
Key research avenues and unresolved challenges include:
- Scaling Law Universality: Investigating scaling formulations that unify behavior across domains and extrapolate reliably to both smaller and larger models remains open (Caillaut et al., 2024).
- Controlling Hallucinations: Enforcing semantic fidelity and penalizing hallucination presents a durable challenge for SLMs; new objectives, alignment protocols, and data augmentations are necessary (Lamelas, 7 Jan 2026).
- Instruction Tuning and Curriculum: Extending instruction tuning regimes, dynamic mask schedules, and curriculum learning to smaller models promises to improve sample efficiency and generalize bidirectionality or CoT reasoning (Dukić et al., 2024, Choi et al., 2024).
- Task Generalization: Approaches such as mix-of-experts and retrieval-augmentation are proposed to diminish the small-model–LLM performance gap in rewriting, reasoning, and NLU (Lamelas, 7 Jan 2026).
- Tooling and Open Data: Releasing high-quality datasets (“COST” for robotics, domain-specific BPEs, parallel code for compact models) and training scripts is emphasized to facilitate reproducible, domain-custom SLM development (Pfister et al., 2024, Choi et al., 2024, Suresh et al., 2024).
Small decoder-only LLMs stand as a crucial focus for practitioners seeking capable, interpretable, and deployable language technologies, provided their operational scope and intrinsic limitations are rigorously validated and respected.