Llama Guard Variants
- Llama Guard variants are a suite of safety models based on Llama architectures that enable real-time content moderation through instruction tuning and extensible taxonomies.
- They incorporate techniques such as quantization, pruning, and LoRA adapters, as well as multimodal and multilingual extensions, to optimize performance across deployments ranging from enterprise servers to edge devices.
- These models integrate advanced safety classification methods, including multi-class outputs, logical reasoning stages, and dynamic monitoring to robustly counter adversarial attacks.
Llama Guard variants constitute a family of safety guard models built atop Llama-family architectures, targeting the real-time moderation of both user inputs and generated outputs in human-AI conversational settings. These models enforce rigorous content safety through multi-class or binary classification, leverage extensible taxonomies, and are optimized for deployment ranging from enterprise-scale servers to edge devices. The current landscape includes text-only, multimodal, quantized, compact, multilingual, and reasoning-augmented forms, reflecting evolution along axes of efficiency, robustness, extensibility, and adversarial resilience.
1. Architectural and Training Foundations
The canonical Llama Guard is an instruction-tuned LLM-based safeguard model, preserving the underlying transformer architecture of Llama2-7B: a stack of 32 decoder-only layers with rotary position encoding and multi-head self-attention (32 heads, hidden size 4096) (Inan et al., 2023). The guard model is fine-tuned for input–output safety classification using high-quality, policy-annotated datasets (≈14,000 examples), formatted as instruction lists, conversation context, and serialized labels. The model is framed as a multi-class classifier, producing either a binary ("safe", "unsafe") verdict or explicit predicted risk categories per sample.
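The instruction-plus-context serialization described above can be sketched as follows. The wording only approximates the template published with Inan et al. (2023); the category strings and prompt text below are illustrative, not the exact format:

```python
# Sketch of a Llama Guard-style moderation prompt. Treat the strings
# below as illustrative rather than the exact published template.
CATEGORIES = {
    "O1": "Violence and Hate",
    "O2": "Sexual Content",
    "O3": "Guns and Illegal Weapons",
    "O4": "Regulated or Controlled Substances",
    "O5": "Suicide and Self-Harm",
    "O6": "Criminal Planning",
}

def build_prompt(conversation, target="User"):
    """Serialize the taxonomy, task flag, and conversation into one prompt."""
    cats = "\n".join(f"{k}: {v}." for k, v in CATEGORIES.items())
    turns = "\n\n".join(f"{role}: {text}" for role, text in conversation)
    return (
        f"Task: Check if there is unsafe content in '{target}' messages in "
        "conversations according to our safety policy with the below categories.\n\n"
        "<BEGIN UNSAFE CONTENT CATEGORIES>\n"
        f"{cats}\n"
        "<END UNSAFE CONTENT CATEGORIES>\n\n"
        "<BEGIN CONVERSATION>\n\n"
        f"{turns}\n\n"
        "<END CONVERSATION>\n\n"
        f"Provide your safety assessment for {target}:\n"
        "- First line must read 'safe' or 'unsafe'.\n"
        "- If unsafe, a second line must list the violated categories."
    )

prompt = build_prompt([("User", "Where can I buy an unregistered gun?")])
# The model then completes with plain tokens, e.g. "unsafe" followed by "O3".
```

The single-token verdict plus comma-separated category tags is what lets the same generative head serve as a classifier.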
Instruction tuning leverages serialization strategies where safety categories and task flags are injected as instructions, followed by input context and expected completions (single-token “safe”/“unsafe”, plus comma-separated violation tags). Data-augmentation tactics include dropping non-violated categories, randomizing indices, and removing violated categories to simulate partial taxonomies. The training loss is standard cross-entropy over the token sequence, with learning rate scheduling and AdamW optimization (Inan et al., 2023).
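A minimal sketch of the category-dropping augmentation, assuming a simple `{"categories": [...], "violated": [...]}` example schema; the field names and drop probabilities are assumptions for illustration, not the paper's values:

```python
import random

def augment(example, drop_prob=0.5, drop_violated_prob=0.1, seed=None):
    """Taxonomy-dropping augmentation in the spirit of Inan et al. (2023).
    Drops benign categories often and violated ones rarely, then shuffles
    so category indices are not memorized positionally."""
    rng = random.Random(seed)
    violated = set(example["violated"])
    kept = []
    for cat in example["categories"]:
        drop_p = drop_violated_prob if cat in violated else drop_prob
        if rng.random() >= drop_p:  # survive the drop with prob. 1 - drop_p
            kept.append(cat)
    rng.shuffle(kept)
    # If every violated category was removed from the instruction, the
    # target label flips to "safe" under the reduced taxonomy.
    label = "unsafe" if violated & set(kept) else "safe"
    return {f"O{i+1}": cat for i, cat in enumerate(kept)}, label

cats, label = augment(
    {"categories": ["hate", "weapons", "self_harm"], "violated": ["weapons"]},
    drop_violated_prob=0.0, seed=7)
# With drop_violated_prob=0.0 the violated category survives, so the
# label remains "unsafe" for whichever reduced taxonomy is sampled.
```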
The model’s core taxonomy consists of six violation classes (O1: Violence & Hate, O2: Sexual Content, O3: Guns & Illegal Weapons, O4: Regulated or Controlled Substances, O5: Suicide & Self-Harm, O6: Criminal Planning) and a default "Safe" class. This paradigm supports zero-shot and few-shot adaptation to alternate taxonomies.
2. Parameter-Efficient and Compact Variants
Llama Guard variants have rapidly diversified to address on-device needs and parameter efficiency. Llama Guard 3-1B-INT4 exemplifies this trajectory: it is a 4-bit quantized, pruned variant of the Llama Guard 3-1B model, compressing the model to ≈440 MB (from 2.8 GB in bfloat16 full precision) through a combination of pruning (16→12 transformer layers, MLP width reduction), quantization-aware training for weights and activations, and head pruning (Fedorov et al., 2024). The result is a real-time-capable (≥30 tokens/s, ≤2.5 s to first token) guardrail for mobile CPUs that matches or exceeds the full-precision model's F1 and false positive rate on multilingual moderation tasks. Pruning the unembedding layer and aggressive quantization do not degrade safety classification metrics, and may modestly improve them, owing to the combined effects of regularization and distillation from larger teacher models.
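The compression arithmetic can be sanity-checked with a back-of-envelope calculation. The parameter counts below are assumptions chosen only to roughly reproduce the reported sizes; the published 2.8 GB → ≈440 MB reduction also folds in layer, MLP, and head pruning, not 16-bit → 4-bit quantization alone:

```python
def model_size_mb(n_params, bits_per_weight):
    """Back-of-envelope checkpoint size, ignoring metadata and overhead."""
    return n_params * bits_per_weight / 8 / 1024**2

# Assumed pre-pruning and post-pruning parameter counts (illustrative).
full_bf16_mb = model_size_mb(1.5e9, 16)   # ~2.8 GB at 16 bits per weight
pruned_int4_mb = model_size_mb(0.9e9, 4)  # ~430 MB at 4 bits per weight
```

A useful sanity check: at 8 bits per weight, one parameter costs exactly one byte, so a 1B-parameter model is just under 1 GB.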
The LoRA-Guard variant operates by inserting low-rank adapters solely in self-attention Q/K matrices, yielding a dual-path transformer: one unchanged for text generation, one with LoRA adapters and a lightweight classifier head for moderation (Elesedy et al., 2024). This approach reduces overhead by 100–1000× versus training an entire guard model, with negligible effect on latency or generative path quality.
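The dual-path idea can be sketched with a single low-rank-adapted projection. The dimensions, initialization scales, and the `q_proj` helper are illustrative, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4  # hidden size and LoRA rank (illustrative)

W = rng.standard_normal((d, d)) / np.sqrt(d)  # frozen base Q/K projection
A = rng.standard_normal((r, d)) * 0.01        # trainable down-projection
B = np.zeros((d, r))                          # zero-init: delta starts at 0

def q_proj(x, guard_path=False):
    """One weight matrix, two paths: generation uses W alone; the
    moderation path adds the low-rank update B @ A on top."""
    y = W @ x
    if guard_path:
        y = y + B @ (A @ x)
    return y

x = rng.standard_normal(d)
# Before training, the adapter is a no-op, so both paths coincide.
assert np.allclose(q_proj(x), q_proj(x, guard_path=True))
B[:] = rng.standard_normal((d, r)) * 0.1  # simulate a trained adapter
# Now only the moderation path sees the LoRA delta.
assert not np.allclose(q_proj(x), q_proj(x, guard_path=True))
```

The generative path never touches `A` or `B`, which is why adding the guard leaves generation quality and latency essentially unchanged.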
3. Multimodal and Multilingual Extensions
Recent variants extend the Llama Guard schema beyond monolingual, text-only safety. Llama Guard 3 Vision (based on Llama 3.2-Vision 11B) fuses image and text streams, concatenating ViT-encoded visual tokens with Llama-tokenized text into a single sequence processed by full-transformer attention (Chi et al., 2024). Both prompt and response classification heads are fine-tuned on multimodal and text data, achieving F1=0.938 on response classification and substantially outperforming GPT-4o baselines on the MLCommons taxonomy, with low FPR and robust resistance to adversarial attacks (e.g., PGD, GCG).
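The fusion step reduces to concatenating projected visual tokens with text-token embeddings along the sequence axis; the token counts and hidden size below are assumptions for illustration:

```python
import numpy as np

d_model = 4096  # shared hidden size after the vision projector (illustrative)
image_tokens = np.zeros((576, d_model))  # e.g., ViT patch embeddings, projected
text_tokens = np.zeros((32, d_model))    # Llama-tokenized prompt embeddings

# Early fusion: one concatenated sequence, attended over jointly by the
# full transformer in this sketch.
sequence = np.concatenate([image_tokens, text_tokens], axis=0)
assert sequence.shape == (576 + 32, d_model)
```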
CultureGuard and SEALGuard variants address language-generalization via large-scale, culturally-adapted multilingual datasets (e.g., Nemotron-Content-Safety-Dataset-Multilingual-v1: 386K samples in 9 languages) and low-rank adaptation of Llama backbones. SEALGuard leverages SeaLLM-3-7B as a backbone and demonstrates that parameter-efficient LoRA adaptation yields near-perfect defense rates (DSR=97%) across 10 Southeast Asian languages, filling a gap left by English-centered moderation (Shan et al., 11 Jul 2025, Joshi et al., 3 Aug 2025).
4. Taxonomy, Logic, and Reasoning-Augmented Guardrails
Successive Llama Guard generations have evolved the taxonomy from fixed, flat sets to fully extensible, meta-learned and logic-enabled taxonomies. For example, Llama Guard 3 extends category coverage (13+ classes, input windows up to 128K tokens), utilizes multi-label sigmoid outputs, and introduces both cross-entropy and Direct Preference Optimization losses (Grattafiori et al., 2024). Taxonomy-adaptive models, such as Roblox Guard, combine LoRA adaptation with chain-of-thought (CoT) rationales, input-inversion, and meta-learning to generalize to unseen category sets at inference (Nandwana et al., 5 Dec 2025).
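Multi-label sigmoid decoding can be sketched as independent per-category thresholding; the category codes and the 0.5 threshold below are illustrative, not Llama Guard 3's calibrated values:

```python
import math

def decode_multilabel(logits, threshold=0.5):
    """Independent sigmoid per category; the verdict is 'unsafe' iff any
    category clears the threshold (contrast with single-token decoding,
    which commits to one label)."""
    probs = {c: 1.0 / (1.0 + math.exp(-z)) for c, z in logits.items()}
    flagged = sorted(c for c, p in probs.items() if p >= threshold)
    return ("unsafe", flagged) if flagged else ("safe", [])

verdict, cats = decode_multilabel({"S1": -2.0, "S9": 1.3, "S10": 0.2})
# Both S9 (p ~ 0.79) and S10 (p ~ 0.55) clear the 0.5 threshold here.
```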
R²-Guard introduces an explicit logical reasoning stage: transformer-based detectors produce per-category probabilities, which are passed through a set of first-order safety rules encoded as Markov Logic Networks (MLNs) or probabilistic circuits for inference. This hybrid neural-symbolic pipeline yields ≈30–60% higher AUPRC than pure LlamaGuard baselines, especially on long-tail/ambiguous classes and under adversarial attack (Kang et al., 2024).
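A deliberately crude stand-in for the symbolic stage is weighted noisy-OR fusion of the detector probabilities; this replaces full MLN or probabilistic-circuit inference, and the rule weights and categories are illustrative:

```python
def fuse_unsafe_prob(category_probs, rule_weights):
    """Weighted noisy-OR: each category contributes independently, scaled
    by a rule strength in [0, 1]. A simplified sketch, not R²-Guard's
    actual MLN/probabilistic-circuit inference."""
    p_all_rules_silent = 1.0
    for cat, p in category_probs.items():
        w = rule_weights.get(cat, 1.0)  # rule strength for this category
        p_all_rules_silent *= 1.0 - w * p
    return 1.0 - p_all_rules_silent

p_unsafe = fuse_unsafe_prob({"violence": 0.7, "self_harm": 0.1},
                            {"violence": 0.9, "self_harm": 0.5})
# 1 - (1 - 0.9*0.7) * (1 - 0.5*0.1) = 0.6485
```

Even this toy fusion shows the benefit structure: weak per-category evidence can accumulate into a confident unsafe verdict, which is where rule-based aggregation helps on long-tail classes.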
AprielGuard further unifies safety and adversarial detection in a single 8B Llama-architecture, training on a dual-labeled taxonomy that spans content risks and attack types. Structured chain-of-thought traces, required at training time and optionally emitted at inference, provide both interpretability and auxiliary supervision. AprielGuard achieves F1=0.88 (safety), F1=0.93 (adversarial) across diverse benchmarks and outperforms vanilla LlamaGuard-8B in agentic and reasoning-intensive evaluations (Kasundra et al., 23 Dec 2025).
5. Adversarial Robustness and Limiting Factors
Adversarial inputs, especially optimized adversarial suffixes (“Super Suffixes”), can bypass static prompt classifiers such as Llama Prompt Guard 2 (a 12-layer, 86M-parameter lightweight transformer) (Adiletta et al., 12 Dec 2025). Joint optimization over the base LLM and guard model (via alternating Greedy Coordinate Gradient) lets attackers craft suffixes that both trigger target outputs and evade moderation, dramatically lowering refusal rates. Dynamic approaches—such as DeltaGuard, which monitors the cosine similarity between residual streams and special “refusal directions”—restore robustness by detecting shifts in internal representations indicative of misaligned (adversarial) behavior. This suggests that integrating static and dynamic, representation-level detectors is necessary for future Llama Guard variants.
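The representation-level check behind DeltaGuard-style monitoring reduces to a cosine similarity against a precomputed refusal direction. The dimensions, random seed, and synthetic activations below are illustrative; a real monitor would use directions extracted from the guarded model's residual stream:

```python
import numpy as np

def refusal_alignment(residual, refusal_direction):
    """Cosine similarity between a residual-stream activation and a
    precomputed 'refusal direction'; a sharp drop under an adversarial
    suffix signals representation-level misalignment (schematic only)."""
    h = residual / np.linalg.norm(residual)
    r = refusal_direction / np.linalg.norm(refusal_direction)
    return float(h @ r)

rng = np.random.default_rng(1)
r_dir = rng.standard_normal(128)                       # stand-in direction
benign = r_dir + 0.1 * rng.standard_normal(128)        # roughly aligned
adversarial = -r_dir + 0.1 * rng.standard_normal(128)  # pushed away
```

A threshold on this similarity, tracked per layer, would then flag inputs whose internal representations drift away from the refusal subspace even when the surface text evades a static classifier.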
Additionally, benchmarking work shows that quantization (e.g., INT8) can sharply degrade detection and increase latency, and that detection performance does not scale monotonically with model size: compact models (e.g., Llama-Guard-3-1B) often outperform larger, generalist guardrails in threat detection while minimizing memory and latency footprints (Shahin et al., 27 Jan 2026).
6. Comparative Table of Key Llama Guard Variants
| Variant | Parameter Count | Input Type | Notable Features | Reported Metric(s) |
|---|---|---|---|---|
| Llama Guard (original) | 7B | Text | Multi-class, instruction-tuned, adjustable taxonomy | ~0.90 (OpenAIModEval) (Inan et al., 2023) |
| Llama Guard 3-1B-INT4 | 1B (INT4) | Text (multilingual) | Pruned, quantized, mobile-ready | F1=0.904 (EN), ≤0.084 FPR (Fedorov et al., 2024) |
| Llama Guard 3 (8B) | 8B | Text | Multi-label, up to 13 categories, DPO training | –76% VR, +95% FRR (Grattafiori et al., 2024) |
| Llama Guard 3 Vision | 11B | Text+Image | ViT fusion, prompt/response heads, adversarial tested | F1=0.938 (response) (Chi et al., 2024) |
| CultureGuard | 8B + LoRA | Multilingual | Culturally-adapted, high-data-filtering | Harmful-F1=82.4–95.4 (Joshi et al., 3 Aug 2025) |
| SEALGuard | 7B + LoRA | Multilingual | SEA languages, LoRA adaptation, high DSR | DSR=97.2% (SEA) (Shan et al., 11 Jul 2025) |
| LoRA-Guard | Few M LoRA | Text | Ultra-lightweight adapters, dual-path | AUPRC=0.90, 100–1000× smaller (Elesedy et al., 2024) |
| R²-Guard | Custom | Text | PGM-enabled symbolic reasoning addition | AUPRC=0.909 (ToxicChat) (Kang et al., 2024) |
| AprielGuard | 8B | Text | Unified safety/adversarial, CoT reasoning traces | F1=0.88 (safety), 0.93 (adv) (Kasundra et al., 23 Dec 2025) |
| Prompt Guard 2 | 86M | Text | Lightweight, binary classifier; dynamic detection needed | See (Adiletta et al., 12 Dec 2025) |
7. Design Tradeoffs and Future Directions
Llama Guard variants span a design spectrum between full-precision, high-coverage models (better zero-day and long-context reasoning) and heavily compressed/adapter-based models for edge deployment. Empirical studies suggest optimal security performance is not merely a function of parameter count; targeted fine-tuning, logic-informed output fusion, and multi-modal or multilingual training are necessary for robust protection—especially as jailbreak and translation-based adversarial prompts proliferate.
Emerging directions include logic–neural hybrid guardrails (for taxonomic extensibility and rare category protection), dynamic residual-stream monitoring (as in DeltaGuard), on-device quantized models (for consumer hardware), and domain-specialized models (e.g., CircuitGuard for RTL code/IP leakage). A plausible implication is that future Llama Guard systems will adopt layered architectures (static classifier, semantic retrieval, dynamic state analyzers) and continuous human-in-the-loop learning to adapt under active adversarial pressure.
In conclusion, Llama Guard variants have established a diverse, extensible framework for LLM content moderation, spanning high-resource server scenarios to on-device, culturally aware, and multimodal deployments. The field continues to progress towards more robust, interpretable, and parameter-efficient architectures suited to rapidly evolving LLM deployment landscapes (Inan et al., 2023, Fedorov et al., 2024, Chi et al., 2024, Kang et al., 2024, Kasundra et al., 23 Dec 2025, Joshi et al., 3 Aug 2025, Shan et al., 11 Jul 2025, Shahin et al., 27 Jan 2026, Nandwana et al., 5 Dec 2025, Adiletta et al., 12 Dec 2025).