Efficient hyperparameter tuning for LLM-JEPA

Develop an efficient hyperparameter tuning method for the LLM-JEPA training objective that identifies near-optimal values of the JEPA loss weight λ and the number of predictor tokens k used by the tied-weights predictor. The goal is to reduce the substantial cost of exhaustive grid search, since the best accuracy may occur anywhere within the tested grid (λ, k) ∈ {0.5, 1.0, 2.0, 4.0} × {0, 1, 2, 3, 4}.

Background

LLM-JEPA augments standard next-token prediction with a JEPA term, introducing two additional hyperparameters: λ, which balances the JEPA loss relative to the generative objective, and k, the number of predictor tokens appended to enable the tied-weights predictor. The authors evaluate performance over a grid of (λ, k) and observe that optimal configurations can occur anywhere in the grid.
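To make λ's role concrete, here is a minimal sketch (not the authors' implementation) of how the weight combines the two terms: the total loss is the next-token prediction loss plus λ times the JEPA term. The cosine-distance form of the JEPA term is an illustrative assumption; the actual predictor reuses the LLM's own weights with k extra predictor tokens, which this toy function does not model.

```python
import numpy as np

def combined_loss(ntp_loss, pred_emb, target_emb, lam):
    """Illustrative LLM-JEPA objective: total = L_NTP + lam * L_JEPA.

    The JEPA term here is a placeholder cosine distance between the
    predicted and target embeddings; the paper's tied-weights predictor
    and k predictor tokens are abstracted away.
    """
    pred_emb = np.asarray(pred_emb, dtype=float)
    target_emb = np.asarray(target_emb, dtype=float)
    cos = pred_emb @ target_emb / (
        np.linalg.norm(pred_emb) * np.linalg.norm(target_emb)
    )
    return ntp_loss + lam * (1.0 - cos)
```

With λ = 0 this reduces to plain next-token prediction; larger λ pushes the model to align predicted and target embeddings, which is why λ must be tuned jointly with k.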

This broad distribution of optimal points makes exhaustive tuning expensive. The authors explicitly note that they have not identified an efficient method to explore the hyperparameter space, although adjacent grid points often produce similar results, hinting at the potential for more efficient search strategies.
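One strategy the smoothness observation suggests, offered purely as a sketch and not as the authors' method, is greedy hill-climbing over the discrete (λ, k) grid: evaluate a starting cell, then repeatedly move to the best-scoring untried neighbor, stopping at a local maximum. The `evaluate` callback is a hypothetical stand-in for a full training-plus-validation run.

```python
# Hypothetical greedy neighborhood search over the (lambda, k) grid,
# exploiting the observation that adjacent grid points score similarly.
LAMBDAS = [0.5, 1.0, 2.0, 4.0]
KS = [0, 1, 2, 3, 4]

def local_grid_search(evaluate, start=(1, 2), max_evals=10):
    """Hill-climb on grid indices (i over LAMBDAS, j over KS).

    evaluate(lam, k) -> validation accuracy (expensive in practice).
    Returns ((lam, k), best accuracy, number of evaluations used).
    """
    cache = {}

    def score(i, j):
        # Evaluate lazily, respecting the evaluation budget.
        if (i, j) not in cache and len(cache) < max_evals:
            cache[(i, j)] = evaluate(LAMBDAS[i], KS[j])
        return cache.get((i, j), float("-inf"))

    i, j = start
    best = score(i, j)
    while True:
        neighbors = [
            (i + di, j + dj)
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1))
            if 0 <= i + di < len(LAMBDAS) and 0 <= j + dj < len(KS)
        ]
        top, a, b = max((score(a, b), a, b) for a, b in neighbors)
        if top <= best:
            break  # local maximum (or budget exhausted)
        best, i, j = top, a, b
    return (LAMBDAS[i], KS[j]), best, len(cache)
```

On a unimodal surface this visits roughly half the 20-cell grid; it can stall at a local maximum, so in practice one would add random restarts or fall back to a coarse-to-fine sweep. Bayesian optimization over the same discrete space is the natural heavier-weight alternative.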

References

"While we have not identified an efficient method to explore this space, we empirically observe that adjacent grid points often yield similar accuracy, suggesting the potential for a more efficient tuning algorithm."

LLM-JEPA: Large Language Models Meet Joint Embedding Predictive Architectures (arXiv:2509.14252, Huang et al., 11 Sep 2025), Appendix, Subsection "Hyperparameter Tuning for LLM-JEPA".