Addressing the long‑tail problem in data‑driven astronomy models

Develop robust techniques to handle non‑Gaussian, long‑tailed distributions and out‑of‑distribution samples in the OJALÁ transformer‑based autoregressive model trained on J-PAS narrow‑band photometry, so that performance does not degrade for edge‑of‑parameter‑space objects such as extreme emission‑line galaxies.

Background

Martínez-Solaeche et al. present OJALÁ, a transformer-based autoregressive foundation model that jointly classifies objects and predicts their physical parameters from J-PAS narrow-band photometry and ancillary broad-band data. While the model performs strongly overall, the authors note a persistent limitation shared by deep learning approaches: difficulty with non-Gaussian distributions and with out-of-distribution data.

This issue manifests as degraded performance at the edges of parameter space, particularly for rare populations such as extreme emission-line galaxies. The authors explicitly identify the "long tail" as an unresolved challenge, highlighting the need for methods that better capture and calibrate predictions for sparsely sampled, extreme cases within astronomical datasets.
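One way to make this kind of tail degradation visible is to break the prediction error down by bins of the true parameter value. The sketch below is illustrative only and is not taken from the paper: the function name `binned_abs_error`, the synthetic log-normal target, and the toy predictor that shrinks every value toward the median are all assumptions chosen to mimic a model that does well in the bulk and poorly in the tail.

```python
import numpy as np


def binned_abs_error(y_true, y_pred, n_bins=5):
    """Mean absolute error within equal-population bins of the true value.

    Hypothetical helper for illustration; not part of any OJALÁ code.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Quantile edges give bins with roughly equal counts, so the last
    # bin isolates the sparsely sampled upper tail of the distribution.
    edges = np.quantile(y_true, np.linspace(0.0, 1.0, n_bins + 1))
    # Interior edges only, so bin indices run from 0 to n_bins - 1.
    idx = np.digitize(y_true, edges[1:-1])
    abs_err = np.abs(y_pred - y_true)
    mae = np.array([abs_err[idx == b].mean() for b in range(n_bins)])
    return edges, mae


# Synthetic, long-tailed "physical parameter" (assumption: log-normal).
rng = np.random.default_rng(0)
y_true = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)

# Toy predictor (assumption): shrinks every value halfway toward the
# median, which is accurate near the bulk and increasingly wrong in the tail.
median = np.median(y_true)
y_pred = median + 0.5 * (y_true - median)

edges, mae = binned_abs_error(y_true, y_pred, n_bins=5)
for lo, hi, err in zip(edges[:-1], edges[1:], mae):
    print(f"true value in [{lo:.2f}, {hi:.2f}]: MAE = {err:.3f}")
```

With real predictions, `y_true` and `y_pred` would be replaced by reference values and model outputs for a held-out set. A steep rise in error in the outermost bins would be the signature of the long-tail problem described above, whereas a single global metric would largely hide it.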

References

Addressing this long tail problem remains an open challenge in data-driven astronomy.

OJALÁ: Optimizing J-PAS Astronomy for Large-scale Analysis. A foundation model for the SED of galaxies, QSOs and stars (2604.00661 - Martínez-Solaeche et al., 1 Apr 2026) in Section 6 (Discussion)