Determine the optimal LoRA scaling factor α relative to rank

Determine the optimal choice of the LoRA scaling factor α as a function of the adapter rank r for low‑rank adaptation in decoder‑only large language model fine‑tuning, clarifying whether α should be constant across ranks, proportional to r (e.g., α = r or α = 2r), proportional to √r, or follow another scaling rule that yields the best training stability and downstream performance.

Background

LoRA implementations include a scaling hyperparameter α that, together with the adapter rank r, controls the effective step size of the low‑rank updates via the factor γ_r = α/r. Two common practices are to fix α across ranks or to scale it with r (e.g., α = r or α = 2r).
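As an illustration (not taken from the paper), here is a minimal NumPy sketch of how the factor γ_r = α/r enters the adapter update added to a frozen layer's output; the function name, shapes, and initialization shown are assumptions for demonstration only:

```python
import numpy as np

def lora_delta(x, A, B, alpha, r):
    """Low-rank update with the standard LoRA scaling gamma_r = alpha / r.

    x: (d_in,) input; A: (r, d_in) down-projection; B: (d_out, r) up-projection.
    The quantity added to the frozen layer's output is gamma_r * B @ A @ x.
    """
    gamma = alpha / r              # effective step-size factor gamma_r
    return gamma * (B @ (A @ x))

rng = np.random.default_rng(0)
d_in, d_out, r = 8, 8, 4
x = rng.normal(size=d_in)
A = rng.normal(size=(r, d_in))    # A randomly initialized, as in LoRA
B = np.zeros((d_out, r))          # B initialized to zero, so the update starts at 0
alpha = 2 * r                     # the alpha = 2r convention gives gamma_r = 2
print(lora_delta(x, A, B, alpha, r))  # all zeros at initialization
```

Note that doubling α while halving the learning rate leaves the first gradient step on B unchanged, which is the intuition behind the α/learning-rate equivalence mentioned below.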

In discussing prior work on this choice, the paper notes conflicting guidance: Kalajdzievski (2023) argues for α ∝ √r rather than α ∝ r, Biderman et al. (2024) empirically favor α = 2r, and recent theory suggests an equivalence between tuning α and the learning rate. Despite these perspectives, the paper explicitly states that the optimal α configuration remains unclear.
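To make the conflicting conventions concrete, the following sketch (illustrative numbers only, not from the paper) tabulates the effective factor γ_r across ranks under the three rules discussed above: a fixed α (here α = 16), α = 2r, and the √r rule of Kalajdzievski (2023), under which γ_r = α/√r:

```python
import math

# Effective scaling gamma_r under three conventions (alpha = 16 is illustrative):
#   fixed alpha:  gamma = alpha / r       -> decays like 1/r as rank grows
#   alpha = 2r:   gamma = 2r / r = 2      -> constant in r
#   alpha ~ sqrt(r) (rsLoRA-style): gamma = alpha / sqrt(r) -> decays like 1/sqrt(r)
for r in (4, 16, 64, 256):
    fixed = 16 / r
    linear = (2 * r) / r
    sqrt_rule = 16 / math.sqrt(r)
    print(f"r={r:4d}  fixed={fixed:.3f}  alpha=2r={linear:.1f}  sqrt={sqrt_rule:.3f}")
```

The table makes the disagreement visible: with a fixed α the update magnitude shrinks rapidly at high rank, α = 2r keeps it constant, and the √r rule sits in between.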

References

Kalajdzievski (2023) argued that α should scale with the square root of r rather than linearly (α ∝ r), though the optimal α setup remains unclear.

Learning Rate Matters: Vanilla LoRA May Suffice for LLM Fine-tuning (2602.04998, Lee et al., 4 Feb 2026), Appendix C, "On LoRA Scaling Factor".