Evaluating Parameter Efficient Methods for RLVR
Abstract: We systematically evaluate Parameter-Efficient Fine-Tuning (PEFT) methods under the paradigm of Reinforcement Learning with Verifiable Rewards (RLVR). RLVR incentivizes LLMs to enhance their reasoning capabilities through verifiable feedback; however, while methods like LoRA are commonly used, the optimal PEFT architecture for RLVR remains unidentified. In this work, we conduct the first comprehensive evaluation of over 12 PEFT methodologies across the DeepSeek-R1-Distill families on mathematical reasoning benchmarks. Our empirical results challenge the default adoption of standard LoRA with three main findings. First, we demonstrate that structural variants, such as DoRA, AdaLoRA, and MiSS, consistently outperform LoRA. Second, we uncover a spectral collapse phenomenon in SVD-informed initialization strategies (e.g., PiSSA, MiLoRA), attributing their failure to a fundamental misalignment between principal-component updates and RL optimization. Third, our ablations reveal that extreme parameter reduction (e.g., VeRA, Rank-1) severely bottlenecks reasoning capacity. We further conduct ablation studies and scaling experiments to validate these findings. This work provides a practical guide for PEFT selection in RLVR and advocates for further exploration of parameter-efficient reinforcement learning methods.
Explain it Like I'm 14
A Simple Explanation of “Evaluating Parameter Efficient Methods for RLVR”
1. What is this paper about?
This paper studies how to train big language models (LLMs) to get better at step-by-step reasoning (like solving math problems) without changing all of their settings. Instead of updating every part of the model, the authors test small “add-ons” called adapters that are cheaper and faster to train. They do this in a special training setup called RLVR, where the model gets a simple reward when its final answer is correct and no reward when it’s wrong.
In short: the paper asks which small, efficient add-ons are best for helping a model learn to reason through trial and error with right/wrong feedback.
2. What questions did the researchers ask?
The study focuses on a few easy-to-understand questions:
- Which adapter method works best for training with “verifiable rewards” (right/wrong checks)?
- Is the most popular adapter, called LoRA, actually the best choice here?
- Why do some “smart” starting tricks for these adapters fail during this kind of training?
- How small can we make these adapters before they become too weak to help?
- Do these results hold across different model sizes and training settings?
3. How did they do the study?
The authors ran a big, fair comparison of more than 12 adapter methods on math reasoning tests. They used two sizes of the same type of reasoning model (about 1.5 billion and 7 billion parameters). To keep things fair, they tried to keep training settings the same across methods (same learning rate, batch size, etc.), and they checked results on several math benchmarks like AIME and MATH-500.
How the training works (in simple terms):
- RLVR is like teaching with a grader that only says “right” or “wrong.” When the model solves a math problem, a program checks if the final answer matches the correct result. If it’s right, reward = 1; otherwise, reward = 0.
- Adapters are small sets of extra knobs added to the model. Instead of remodeling the whole engine, you bolt on a small adjustable part. This is much cheaper and faster to train.
- LoRA and its variants: LoRA is a popular adapter that updates the model using tiny “low-rank” changes. “Rank” here is like how many knobs you have to tweak; more rank = more flexibility. Variants like DoRA, AdaLoRA, and MiSS change the structure to give the adapter smarter flexibility.
- Some methods try special starting points (initializations) based on SVD, which is a math way of finding the “main directions” of change inside a model. The paper looks at whether starting in these directions helps or hurts in RLVR.
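The right/wrong grading described above can be sketched as a tiny verifier function. This is an illustrative sketch, not the paper's actual verifier (which relies on tools like symbolic math equivalence checkers); the `normalize` helper here is a hypothetical string cleanup.

```python
def normalize(answer: str) -> str:
    """Hypothetical cleanup: trim whitespace, lowercase, drop a leading '$'."""
    return answer.strip().lower().lstrip("$").strip()

def rlvr_reward(model_answer: str, gold_answer: str) -> float:
    """Binary verifiable reward: 1.0 if the final answer matches, else 0.0."""
    return 1.0 if normalize(model_answer) == normalize(gold_answer) else 0.0

# The grader only ever says "right" (1.0) or "wrong" (0.0).
print(rlvr_reward(" 42 ", "42"))  # 1.0
print(rlvr_reward("41", "42"))    # 0.0
```

There is no partial credit: a nearly correct derivation with a wrong final answer earns the same reward as nonsense, which is exactly what makes the signal sparse.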
They also ran “ablation” tests, which means changing one thing at a time (like batch size, learning rate, or adapter rank) to see if the main conclusions still hold. Finally, they checked if the best methods still win when the model is larger (7B).
Helpful analogies for tricky terms:
- Low-rank: Imagine changing a picture using only a few sliders instead of thousands. It’s faster, but you might miss some details.
- Principal components (from SVD): Think of these as the main “highways” the model usually uses to change. Not all learning needs to happen on the highways; sometimes the best path is the side streets.
- Spectral collapse: When an adapter meant to use side streets gets pulled back onto the main highways and stops learning well.
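The "low-rank = few sliders" analogy can be made concrete: a LoRA update ΔW = B·A with rank r can only express r independent directions of change. A minimal pure-Python sketch (real implementations use tensors and a scaling factor α/r, omitted here):

```python
def matmul(B, A):
    """Plain-Python matrix product: delta_W = B @ A."""
    rows, inner, cols = len(B), len(A), len(A[0])
    return [[sum(B[i][k] * A[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]

# Rank-1 LoRA update on a 3x3 weight: B is 3x1, A is 1x3.
B = [[1.0], [2.0], [0.0]]
A = [[0.5, 0.0, -1.0]]
delta_W = matmul(B, A)
# Every row of delta_W is a multiple of A's single row: one "slider",
# so a rank-1 adapter cannot make independent per-row changes.
```

Raising the rank adds rows to A (more sliders), which is why the paper finds a minimum rank below which reasoning ability suffers.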
4. What did they find, and why does it matter?
Here are the main results, with short reasons for why they’re important:
- Structural variants beat standard LoRA in RLVR.
- Methods like DoRA, AdaLoRA, and MiSS consistently did better than regular LoRA on math reasoning. DoRA even beat training the whole model in some cases. This means the usual go-to method (LoRA) isn’t the best choice when training with right/wrong rewards.
- “Smart” SVD-based starting tricks fail in RLVR.
- Methods like PiSSA and MiLoRA that try to start in the model’s “main directions” had serious problems. PiSSA basically collapsed to near-zero performance. MiLoRA started okay but then fell apart. The reason: RLVR seems to learn best off the main highways (on side streets), and these SVD tricks push updates onto the highways—or get dragged there by the gradients—so learning stops working properly.
- There’s a “minimum power” needed for the adapters.
- Ultra-tiny adapters that try to change almost nothing (like VeRA, rank-1, IA3, or just tuning LayerNorm) don’t have enough flexibility. The model needs a certain minimum amount of trainable capacity to learn complex reasoning by trial-and-error. Moderate savings are fine, but extreme cuts hurt.
- The results are robust across settings and scales.
- Changing batch sizes, learning rates, and adapter ranks didn’t change the big picture: structural variants stayed strong. On the bigger 7B model, methods like DoRA and LoRA+ still matched or beat standard LoRA, showing the findings generalize.
Why it matters: If you want to train smarter, cheaper, and faster for reasoning tasks, you shouldn’t just default to standard LoRA. Choose better-structured adapters and don’t compress too much.
5. What’s the impact of this research?
This paper is a practical guide for anyone training reasoning models with verifiable rewards:
- Pick adapters like DoRA, AdaLoRA, or MiSS over standard LoRA for stronger results.
- Avoid SVD-based initializations (PiSSA, MiLoRA) in RLVR; they misalign with how this kind of learning actually progresses.
- Don’t shrink adapters too far—there’s a floor where they become too weak to learn reasoning.
- These tips hold across different model sizes and training choices.
Big picture: With the same or less compute, we can get better reasoning performance by choosing smarter adapter designs. This helps make advanced reasoning models more accessible and efficient, which is useful for math, science, and other tasks that need careful step-by-step thinking.
Knowledge Gaps
Unresolved Knowledge Gaps, Limitations, and Open Questions
Below is a single, concrete list of what remains missing, uncertain, or unexplored in the paper, framed to guide future research.
- External validity beyond mathematical reasoning: Assess whether the reported PEFT ranking (DoRA > AdaLoRA/MiSS > LoRA; SVD-inits fail; extreme compression collapses) generalizes to other RLVR domains (coding, scientific QA, formal proofs, natural language reasoning) and to multimodal RLVR.
- Cold-start RLVR: Evaluate PEFT methods under R1-Zero–like training (no SFT warm-start) to understand exploration, stability, and sample-efficiency differences across adapters when starting from scratch.
- Longer-horizon training: Test whether conclusions hold under substantially longer training schedules (e.g., ≥50k steps), including asymptotic performance, stability, and catastrophic forgetting patterns for each adapter.
- Algorithm diversity: Go beyond GRPO/DAPO/Dr.GRPO to include PPO variants with KL regularization, entropy bonuses, off-policy methods, and different advantage estimators; quantify adapter-specific sensitivity to these RLVR objectives.
- Reward design breadth: Compare outcome-only binary rewards to process-based or partial-credit rewards, shaped rewards, and noisy/verifier-imperfect settings; study adapter robustness to verifier errors and reward sparsity.
- Group size and sampling strategy: Systematically vary group size G, clipping strategy (ε_low/ε_high), and dynamic sampling policies to test interaction effects with each adapter’s optimization dynamics.
- Per-method hyperparameter optimization: The study uses unified hyperparameters; conduct rigorous, adapter-specific tuning (rank, alpha, dropout, LR ratios, weight decay, optimizer betas, layer targeting) to disambiguate method-intrinsic merit from suboptimal shared settings.
- Rank and capacity scaling laws: Precisely quantify the “expressivity floor” by mapping performance vs. rank across layers and modules (attention and MLP), including higher ranks (>32), dynamic rank schedules (AdaLoRA-style) per layer, and minimum viable capacity per module.
- Adapter placement and granularity: Move beyond “target all linear modules” to evaluate layer-wise placement strategies (e.g., only attention vs. only MLP; early vs. late layers; per-head vs. per-layer) and their interaction with RLVR signals.
- Composition of methods: Test compound adapters (e.g., DoRA + LoRA+ LR ratios; AdaLoRA + magnitude-direction decoupling; MiSS with LR differentiation) to determine whether benefits are additive or interfere.
- Formal theory for spectral misalignment: Provide a mathematically grounded analysis (not just empirical) explaining why RLVR prefers off-principal updates, and why SVD-based initializations (PiSSA/MiLoRA) collapse; derive conditions under which spectral constraints can succeed.
- Stabilizing off-principal updates: Explore initializations and regularizers that maintain off-principal trajectories (e.g., non-zero minor component magnitude scaling, orthogonality constraints, spectral penalties or noise-injection) and compare to LoRA+ LR ratios.
- Layer- and step-wise spectral profiling: Extend spectral analyses beyond a single gate-proj layer to all attention/MLP modules and across training steps, model sizes, and datasets to confirm generality of gradient alignment patterns.
- Compute, memory, and throughput metrics: Report and compare wall-clock time, GPU memory footprint, tokens processed, and training/inference throughput for each adapter; quantify efficiency-performance trade-offs under equalized compute budgets.
- Reproducibility and variance: Provide multi-seed results, statistical significance tests, and confidence intervals (especially on small benchmarks like AIME); measure sensitivity to generation parameters (temperature, top-p, max tokens) and Pass@1 vs Avg@k.
- KL and regularization sensitivity: The paper fixes KL to zero in DAPO; analyze how non-zero KL penalties, entropy regularization, or ratio-clipping ranges affect adapter performance and stability.
- Dataset scale and composition: Examine how training dataset size, difficulty mix (algebra/geometry/combinatorics), and contamination impact adapter effectiveness; test larger and more diverse RLVR corpora.
- Generalization and side effects: Measure cross-task generalization and potential degradation of non-math capabilities post-RLVR; evaluate catastrophic forgetting and interference effects across adapters.
- Evaluation robustness to verifiers: Probe adversarial or edge cases where models exploit verifier weaknesses; quantify adapter-specific tendencies to overfit verifiable heuristics vs true reasoning.
- Generation policy sensitivity: Validate whether improvements persist under greedy decoding and lower temperatures; analyze adapter-specific diversity vs accuracy trade-offs in sampling.
- Model family breadth: Extend tests beyond DeepSeek-Qwen/Nemo to Llama/Mistral/Mixtral and transformer variants; identify architecture-adapter interactions that alter the performance hierarchy.
- IA3/LN-Tuning variants: Investigate richer activation-scaling schemes (e.g., per-head/per-channel gating, layer-wise mixtures) or combining IA3/LN with small-rank adapters to overcome observed bottlenecks in RLVR.
- Fairness of training steps across scales: The 1.5B and 7B settings use different step counts; equalize total tokens/updates to isolate size effects and confirm adapter rankings.
- Safety and stability: Systematically track reward hacking, entropy collapse, divergence events, and gradient pathologies per adapter; develop diagnostics and intervention protocols tailored to PEFT in RLVR.
- Weight merging and deployment: Empirically study merge-time numerical stability and inference-time consistency for each adapter (especially DoRA/AdaLoRA/MiSS), and propose robust merging schemes for production.
- Decision guidelines for practitioners: Identify regimes (task type, compute budget, model scale, reward design) where standard LoRA may still be preferable; derive actionable heuristics for adapter selection under practical constraints.
Practical Applications
Immediate Applications
Below are actionable, deployable uses that can be implemented with today’s tools, data, and compute.
- Swap standard LoRA for DoRA in RLVR fine-tuning to boost reasoning accuracy at similar or lower trainable parameter budgets
- Sector: software, education, finance, research/academia
- Tools/workflow: Hugging Face TRL + Accelerate + DeepSpeed ZeRO-2; Hugging Face PEFT with DoRA adapters; vLLM for rollouts; DAPO/GRPO/Dr.GRPO as objective; math/code verifiers (latex2sympy, unit tests)
- Why: Structural variants (DoRA, AdaLoRA, MiSS) consistently outperform standard LoRA; DoRA can surpass full-parameter FT in RLVR math tasks
- Assumptions/dependencies: Reliable verifiers exist for the target domain; a stable DoRA implementation targeting all linear modules; sufficient context length for CoT; licensing and access to suitable base models (e.g., DeepSeek-R1-Distill families)
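As one concrete starting point, Hugging Face PEFT exposes DoRA through `LoraConfig(use_dora=True)`. The rank, alpha, and dropout values below are illustrative placeholders, not the paper's exact settings:

```python
from peft import LoraConfig

# Illustrative DoRA config; r and lora_alpha are placeholder values.
dora_config = LoraConfig(
    r=16,                         # keep rank at or above the "expressivity floor"
    lora_alpha=32,
    use_dora=True,                # weight-decomposed (magnitude/direction) updates
    target_modules="all-linear",  # target all linear modules, as in the study
    lora_dropout=0.0,
)
```

The resulting config plugs into the same `get_peft_model` / TRL training path as standard LoRA, so swapping adapters is a one-line change.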
- Adopt LoRA+ when structural adapters aren’t feasible, prioritizing learning-rate ratio tuning over SVD-based initializations
- Sector: software, education, finance
- Tools/workflow: LoRA+ (higher LR on B than A), standard LoRA code paths; same RLVR stack (TRL, DAPO, vLLM)
- Why: LoRA+ proved robust; SVD-informed initializations (PiSSA, MiLoRA) collapse or underperform in RLVR
- Assumptions/dependencies: Adapter architecture remains standard LoRA; ability to set differential LRs; existing RLVR pipeline
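LoRA+'s learning-rate ratio can be sketched as a parameter-grouping rule: B matrices get a multiple of the base learning rate while A matrices keep it. The ratio of 16 below is illustrative; in practice it is a tuned hyperparameter.

```python
def loraplus_lr(param_name: str, base_lr: float, ratio: float = 16.0) -> float:
    """Assign a higher learning rate to LoRA B matrices than to A (LoRA+)."""
    return base_lr * ratio if "lora_B" in param_name else base_lr

params = ["model.q_proj.lora_A.weight", "model.q_proj.lora_B.weight"]
lrs = {p: loraplus_lr(p, base_lr=1e-5) for p in params}
# A keeps the base LR; B trains 16x faster.
```

In a real pipeline these per-parameter rates would be passed as optimizer parameter groups; the adapter architecture itself stays plain LoRA.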
- Enforce an “expressivity floor” in resource-limited training: use LoRA-FA or moderate ranks (r=16–32) rather than extreme compression (VeRA, IA3, Rank-1)
- Sector: software, research/academia, startups with constrained GPUs
- Tools/workflow: PEFT configs with LoRA-FA; rank selection of 16–32; monitor memory via ZeRO-2 offloading; long-CoT inference with vLLM
- Why: RLVR requires a minimum trainable capacity; extreme vector-only updates bottleneck reasoning
- Assumptions/dependencies: Some headroom in memory to keep r≥16; domain tasks benefit from RLVR (i.e., have verifiers)
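The expressivity-floor trade-off is easy to quantify: a rank-r LoRA on a d_in x d_out linear layer trains r * (d_in + d_out) parameters, so halving the rank halves adapter capacity. A quick sketch with an illustrative 4096 x 4096 layer:

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters of a rank-r LoRA on one d_in x d_out linear layer."""
    return r * (d_in + d_out)  # A is r x d_in, B is d_out x r

base = 4096 * 4096  # frozen weight parameters in the layer
for r in (1, 16, 32):
    frac = lora_params(4096, 4096, r) / base
    print(f"rank {r:>2}: {frac:.4%} of the layer's parameters are trainable")
```

Even r=32 touches well under 2% of the layer, so the memory savings from dropping to rank 1 are marginal while the capacity loss is severe.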
- Establish RLVR MLOps guardrails that encode the paper’s best practices
- Sector: software (MLOps), research/academia, platform providers
- Tools/workflow:
- Method selection policy: prefer DoRA/AdaLoRA/MiSS; avoid PiSSA/MiLoRA; allow LoRA+ as fallback
- Hyperparameter rules: LR tuned carefully; ranks ≥16; batch size flexibility (32–128); DAPO default with swaps to GRPO/Dr.GRPO as needed
- Spectral monitoring: add a “Spectral Guard” to track update energy across singular components; flag principal-spike regressions
- Evaluation: Avg@k and Pass@k metrics; W&B logging
- Why: Encodes findings into repeatable pipelines; early detection of collapse modes
- Assumptions/dependencies: Access to SVD or equivalent spectral tooling; reproducible evaluation harness; engineering time to add monitors
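The "Spectral Guard" idea above can be sketched as a single metric: the fraction of update energy carried by the top-k singular values of ΔW. This sketch assumes the singular values were computed elsewhere (e.g., via `torch.linalg.svdvals`); a sustained spike in the fraction would flag the principal-component collapse mode.

```python
def topk_energy_fraction(singular_values, k):
    """Fraction of squared spectral energy in the k largest singular values."""
    sq = sorted((s * s for s in singular_values), reverse=True)
    total = sum(sq)
    return sum(sq[:k]) / total if total else 0.0

# Healthy off-principal update: energy spread across components.
print(topk_energy_fraction([1.0, 0.9, 0.8, 0.7], k=1))   # ~0.34
# Collapse-like update: nearly all energy on the principal component.
print(topk_energy_fraction([10.0, 0.1, 0.1, 0.1], k=1))  # ~0.9997
```

Logging this fraction per layer alongside reward curves (e.g., in W&B) would give an early-warning signal before accuracy visibly degrades.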
- Upgrade existing math/coding tutors and internal reasoning assistants via RLVR + DoRA
- Sector: education (tutoring), software (code assistants, QA), finance (spreadsheet/checker tools)
- Tools/workflow: RLVR fine-tuning with strict verifiers (symbolic math equivalence, unit tests, schema validators); 1.5B–7B models for cost-effective deployment; nightly RL updates on curated verifiable datasets
- Why: RLVR improves reasoning on verifiable tasks; DoRA yields better accuracy/efficiency than LoRA
- Assumptions/dependencies: High-quality, deterministic verifiers; curated problem sets; governance to prevent reward hacking
- Build lightweight, iterative improvement cycles for small models (1.5B–7B) in production
- Sector: SaaS, startups, on-prem/edge deployments
- Tools/workflow: Continuous RLVR using DoRA or LoRA+ on customer-specific verifiable tasks; quantization-aware serving with vLLM; blue/green deploys with Avg@k gating
- Why: PEFT enables fast, inexpensive iterations; verified rewards simplify objective design
- Assumptions/dependencies: CI/CD for models; safe rollback; strong telemetry on correctness
- Standardize reporting for RLVR efficiency and stability in research and procurement
- Sector: policy (R&D funding and evaluation), research/academia, enterprise AI governance
- Tools/workflow: Require disclosures of trainable parameter fraction, adapter type, spectral diagnostics, Avg@k metrics, and RLVR algorithm settings
- Why: The paper shows large differences by adapter family and collapse modes with certain inits; standardized reporting improves comparability and safety
- Assumptions/dependencies: Consensus among stakeholders; minimal overhead to generate required artifacts
- Domain-specific RLVR with verifiable graders beyond math
- Sector: finance (formula validation, regulatory checks), data engineering (SQL correctness), software (API conformance), legal ops (citation/section matching), cybersecurity (policy-rule verification)
- Tools/workflow: Map binary/deterministic validators to task outputs; RLVR with DoRA/LoRA+; continuous dataset curation for edge cases
- Why: RLVR excels where deterministic verifiers exist; the study’s adapter guidance transfers across RLVR objectives
- Assumptions/dependencies: Robust, low-latency verifiers; scoped tasks where binary correctness is meaningful
Long-Term Applications
These require further research, scaling, or engineering maturity before broad deployment.
- Geometry-aware adapters and regularizers that preserve off-principal optimization in RL
- Sector: research/academia, foundation model labs
- Tools/workflow: New adapter designs or losses that maintain off-principal update trajectories (e.g., spectral penalties, direction-magnitude schedules, adaptive rank scheduling beyond AdaLoRA)
- Why: The paper links RLVR success to off-principal dynamics and shows SVD-based inits misalign; principled mechanisms could make RL more stable
- Assumptions/dependencies: Deeper theoretical grounding of spectral dynamics; efficient on-the-fly spectral estimators
- High-performance RLVR training infrastructure at scale (VeRL-style) with long-horizon CoT
- Sector: software platforms, cloud providers, foundation model labs
- Tools/workflow: Migration from TRL to high-throughput, distributed RLVR frameworks; large-batch sampling with dynamic filtering; longer training schedules; mixed precision and kernel fusion for 32k+ tokens
- Why: Scaling experiments suggest benefits persist to 7B; unlocking 8–70B regimes likely needs stronger infra
- Assumptions/dependencies: Budget for distributed training; engineering to stabilize ultra-long CoT rollouts
- Multimodal and multi-turn RLVR with verifiable graders
- Sector: robotics, healthcare IT, industrial automation, education
- Tools/workflow: Programmatic graders for perception (e.g., executable scene graphs), data-extraction verifiers (e.g., OCR-to-schema), dialogue-state validators; adapters like DoRA tailored to multimodal blocks
- Why: Extending verifiable signals beyond text broadens applicability; structural adapters may again outperform standard LoRA
- Assumptions/dependencies: Creating reliable multimodal verifiers is non-trivial; evaluation must handle ambiguity and partial credit
- Production-grade “PEFT-RL Orchestrator” that auto-selects adapters, ranks, and LRs per task and hardware
- Sector: MLOps, platform providers, enterprises
- Tools/workflow: Policy engine that:
- Detects domain verifier characteristics and selects DoRA/AdaLoRA/MiSS vs. LoRA+
- Tunes r, LR ratios, and batch size within compute/memory constraints
- Monitors spectral signals and auto-mitigates collapse (e.g., switch adapter or LR schedule)
- Why: Encapsulates best practices into a reusable product; reduces expert tuning burden
- Assumptions/dependencies: Sufficient telemetry; robust adapter implementations across frameworks
- Continual RLVR for on-device and edge models with safety/consistency guarantees
- Sector: mobile, IoT, privacy-preserving enterprise
- Tools/workflow: 7B-or-smaller models with quantization; PEFT updates on-device from locally verifiable tasks; periodic server reconciliation to prevent drift
- Why: PEFT enables tiny update footprints; verified tasks enable self-improvement loops
- Assumptions/dependencies: Secure update channels; verifier execution on-device; methods to prevent reward hacking and catastrophic forgetting
- Regulated-domain deployment (e.g., healthcare, finance, gov) with formal verifiers and audit trails
- Sector: healthcare (dose calculations, coding QA), finance (compliance checks), government (form validation, rules engines)
- Tools/workflow: RLVR with certified verifiers and audit logs; adapter-based updates for traceability; rollbacks with reproducible checkpoints
- Why: Verifiable reward fits high-stakes domains; adapter tuning provides controllable change surfaces
- Assumptions/dependencies: Regulatory acceptance of automated verifiers and RL updates; robust validation and monitoring frameworks
- Research programs on safe weight merging and inference-time consistency for adapter-trained RL models
- Sector: research/academia, safety labs
- Tools/workflow: Study numerical stability of merging adapters; inference-time adapter composition; safeguards against distribution shift
- Why: The paper flags deployment stability and merging concerns as open engineering issues
- Assumptions/dependencies: Availability of benchmarks to stress-test merging and compose adapters across tasks
- Cross-lingual and cross-domain RLVR benchmarks and incentives for open science
- Sector: policy (funders, standards bodies), research/academia
- Tools/workflow: Public leaderboards requiring Avg@k, Pass@k, trainable-parameter fractions, adapter types, and spectral diagnostics; seed-managed evaluation to reduce variance
- Why: The study demonstrates that adapter choice materially changes outcomes; standardization improves comparability and resource efficiency
- Assumptions/dependencies: Community buy-in; sustainable hosting and curation
Notes on feasibility across all applications:
- RLVR depends on high-quality, low-latency, deterministic verifiers; without them, benefits weaken or risk reward hacking.
- Gains are shown on math reasoning and extend most naturally to any domain with executable/verifiable outcomes (e.g., code, SQL, schema validation); transfer to open-ended tasks may need new verifier designs.
- Long CoT contexts (16k–32k tokens) raise compute and engineering demands; infrastructure maturity is a limiting factor at larger model scales.
- Adapter availability and correctness across libraries (PEFT/TRL/serving stacks) are practical dependencies; some methods may lag in ecosystem support.
Glossary
- AdaLoRA: A PEFT method that adaptively allocates rank using an SVD-like structure to improve efficiency and performance. "Similarly, AdaLoRA (44.2%) and MiSS (43.4%) consistently outperform standard LoRA."
- Advantage (standardized): In GRPO, the advantage normalized by group statistics to stabilize learning. "represents the standardized advantage within the group (Shao et al., 2024)."
- Avg@k: An evaluation metric that averages accuracy over k generations to reduce variance. "the specific evaluation metrics (e.g., Avg@k)"
- Chain-of-Thought (CoT): Explicit multi-step reasoning traces generated by a model to solve complex problems. "entropy collapse and training instability in long CoT scenarios"
- Clip-Higher strategy: A DAPO technique that decouples clipping bounds to encourage exploration on low-probability tokens. "DAPO introduces a Clip-Higher strategy, which decouples the clipping range into ε_low and ε_high (Yu et al., 2025)."
- DAPO: Decoupled Clip and Dynamic sampling Policy Optimization; an RLVR algorithm improving stability and exploration. "we adopt DAPO as our standard training algorithm"
- DeepSpeed ZeRO-2: A memory-optimization strategy that partitions/offloads optimizer states for large-scale training. "DeepSpeed ZeRO-2 optimization (offloading optimizer states)"
- DoRA: A structural PEFT variant that decouples magnitude and direction of weight updates (weight-decomposed). "DoRA breaks the ceiling with an overall average of 46.6%"
- Dr. GRPO: A GRPO variant that removes biases such as length normalization and difficulty weighting. "Another significant refinement is Dr. GRPO, which identifies and mitigates systematic biases inherent in the original GRPO formulation (Liu et al., 2025)."
- Dynamic Sampling: A training procedure that filters prompts with identical rewards to maintain informative gradients. "Furthermore, DAPO employs Dynamic Sampling to filter out prompts where all outputs yield identical rewards"
- Entropy collapse: A failure mode where policy entropy diminishes, reducing exploration and diversity. "To address challenges such as entropy collapse and training instability in long CoT scenarios"
- GRPO: Group Relative Policy Optimization; an RLVR method estimating advantages via group statistics without a separate critic. "Group Relative Policy Optimization (GRPO)"
- IA3: A PEFT method that scales activations using learned vectors for K, V, and FFN modules. "IA3 (Liu et al., 2022b), which scales activation vectors via element-wise multiplication"
- JustRL: A simple RL recipe that informs the reward formulation used in this work. "The overall reward recipe follows the principles of JustRL (He et al., 2025)."
- KL coefficient: A weight on the KL-divergence regularizer that penalizes deviation from a reference policy. "do not employ a KL coefficient (3)"
- LayerNorm Tuning: A PEFT approach that tunes only LayerNorm gain and bias parameters. "LayerNorm Tuning (Qi et al., 2022)"
- LoRA: Low-Rank Adaptation; constrains weight updates to a low-rank decomposition to reduce trainable parameters. "LoRA (Hu et al.) hypothesizes that the change in weights during adaptation has a low intrinsic rank."
- LoRA-FA: A memory-efficient LoRA variant that freezes matrix A and only trains B. "LoRA-FA (Zhang et al., 2023a) (freezing A)"
- LoRA+: A LoRA variant that uses differentiated learning rates for A and B to improve optimization. "LoRA+ (Hayou et al., 2024)"
- Magnitude-direction decoupling: Separating update magnitude from direction to improve optimization flexibility. "the advantages of magnitude-direction decoupling (in DoRA)"
- MiLoRA: An SVD-informed initialization that targets minor singular components; shown to underperform in RLVR. "MiLoRA (18.0%) significantly trails standard baselines."
- MiSS: A structural PEFT variant using an efficient shard-sharing structure to improve capacity-efficiency trade-offs. "MiSS (Kang & Yin, 2025)"
- Off-principal regime: An optimization behavior where updates occur in non-principal, low-curvature spectral subspaces. "RLVR operates in an off-principal regime"
- Parameter-Efficient Fine-Tuning (PEFT): Techniques that fine-tune large models by training a small subset of parameters or adapters. "Parameter-Efficient Fine-Tuning (PEFT) methods"
- PiSSA: Principal singular values/vectors adaptation; initializes adapters from top SVD components. "PiSSA suffers a catastrophic collapse to near-zero accuracy (0.2%)"
- Principal components: Dominant singular directions capturing most variance in a weight matrix. "Strategies prioritizing principal components e.g., PiSSA (Meng et al., 2024) experience training collapse"
- Rank-1 adapters: Minimal-capacity LoRA adapters with rank 1, often too restrictive for complex updates. "e.g., VeRA or Rank-1 adapters"
- Ratio clipping: PPO/GRPO mechanism that clips the policy probability ratio to stabilize updates. "loss function nuances (such as the specific implementation of KL penalties or ratio clipping)"
- Reinforcement Learning with Verifiable Rewards (RLVR): RL setup where rewards come from deterministic verifiers of outputs. "Reinforcement Learning with Verifiable Rewards (RLVR)"
- rsLoRA: Rank stabilization scaling for LoRA to adjust effective rank during training. "rsLoRA (Kalajdzievski, 2023)"
- Singular Value Decomposition (SVD): Matrix factorization into singular values and vectors used for spectral analysis/initialization. "The substantial underperformance of initialization strategies derived from Singular Value Decomposition warrants a mechanistic explanation."
- Spectral collapse: Failure mode where updates concentrate on principal components, harming learning dynamics. "we uncover a spectral collapse phenomenon in SVD-informed initialization strategies"
- Spectral geometry: The distribution and structure of singular values/vectors of model weights that training should respect. "to preserve the pre-trained spectral geometry."
- Stable rank scaling: A scaling technique to stabilize the effective rank of updates during adapter training. "rsLoRA (Kalajdzievski, 2023), which employs stable rank scaling factors."
- Supervised Fine-Tuning (SFT): Training with labeled targets, often via teacher forcing, as opposed to RL signals. "unlike Supervised Fine-Tuning (SFT), which benefits from dense knowledge transfer via teacher-forcing"
- Surrogate objective: A tractable objective optimized in policy-gradient methods to approximate the true RL objective. "optimizes the following surrogate objective:"
- Teacher-forcing: Training method that feeds the ground-truth token sequence as context to the model. "dense knowledge transfer via teacher-forcing"
- TRL: Transformer Reinforcement Learning; a library/framework for RL on LLMs. "utilizing the RLVR framework on TRL (von Werra et al., 2020)."
- vLLM: A high-throughput inference engine for LLM generation used during rollouts. "we employ the vLLM engine in co-location mode"
- VeRA: Vector-based random matrix adaptation that freezes random projections and trains only scaling vectors. "VeRA (Kopiczko et al., 2023) (freezing random projection matrices and training only scaling vectors)"
- Weight merging: The process of combining adapter weights back into the base model weights for deployment. "such as the numerical stability of weight merging"