LEMUR Dataset: Multi-Domain Scientific Data
- The LEMUR designation spans several distinct scientific datasets, including high-resolution solar VUV imaging spectroscopy, neural network benchmarking suites, and lemur vocalization spectrograms for Bayesian analysis.
- It provides standardized data formats and detailed methodologies, ensuring reproducible research and smooth interoperability across diverse scientific fields.
- The dataset supports advanced AutoML benchmarking and astrophysical diagnostics, enabling precise hyperparameter optimization and in-depth bioacoustic studies.
The term “LEMUR Dataset” refers to several distinct datasets—all bearing the LEMUR designation—across scientific disciplines. These include: (1) the LEMUR solar ultraviolet (VUV) telescope and its high-resolution imaging spectrograph data for solar physics; (2) the LEMUR neural network (NN) datasets providing structured neural network architectures, training runs, hyperparameter-performance records, and Optuna-based benchmarking for AutoML and hyperparameter optimization research; and (3) the LEMUR animal vocalization spectrogram dataset collected for Bayesian modeling of latent animal spectral shapes. The following sections delineate the principal variants, their systematics, and methodologies, drawing directly from the source literature.
1. LEMUR: Solar Ultraviolet Research Dataset (Teriaca et al., 2011)
The Large European Module for solar Ultraviolet Research (LEMUR) dataset originates from a high-throughput VUV solar telescope, part of the JAXA Solar-C mission, designed for simultaneous high-resolution, multi-wavelength studies of the solar outer atmosphere. The LEMUR module combines a 30 cm off-axis paraboloid primary mirror (focal length 3.6 m, f/12) with a suite of high-resolution VUV spectrographs and imaging cameras.
Core instrument characteristics:
- Optics: VUV broadband coating (B₄C ∼100 Å over Mo/Si multilayers) yields SW-band reflectance of 25–35% and LW-band reflectance of 40–50%.
- Spectroscopy: Six channels spanning 170–1270 Å with high spectral resolving power and mÅ-per-pixel dispersion in both the SW and LW bands.
- Imaging: Spatial sampling $0.14''$/pixel; slit length $280''$ (∼2000 px), widths ranging $0.14''–5''$.
- Temporal cadence: Exposure times down to $0.5$ s; typical raster scans (at $0.28''$ steps) complete on timescales of seconds.
- Velocity sensitivity: Doppler shifts measurable at the 1–2 km s⁻¹ level from line-centroid determination.
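As a sanity check on that sensitivity figure, a centroid shift maps to velocity via $v = c\,\Delta\lambda/\lambda_0$. A minimal sketch, where the 6.2 mÅ shift and 930 Å rest wavelength are illustrative values (930 Å lies within LEMUR's 170–1270 Å coverage), not figures from the instrument paper:

```python
# Doppler velocity from a line-centroid shift: v = c * dlambda / lambda0.
C_KM_S = 2.998e5  # speed of light in km/s

def doppler_velocity_km_s(dlambda_angstrom: float, lambda0_angstrom: float) -> float:
    """Velocity for a centroid shift dlambda at rest wavelength lambda0 (both in A)."""
    return C_KM_S * dlambda_angstrom / lambda0_angstrom

# A ~6 mA shift at 930 A corresponds to ~2 km/s, the quoted sensitivity scale.
print(doppler_velocity_km_s(6.2e-3, 930.0))  # ~2.0 km/s
```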
LEMUR data calibration:
- Absolute radiometric calibration via standard-traceable VUV transfer; wavelength scale via onboard lamps and solar reference lines.
- Multi-level data reduction: Level 0 (raw telemetry), Level 1 (flux- and wavelength-calibrated spectrograms), Level 2 (physical maps: intensity, Doppler shift, density, temperature), Level 3 (higher-level data products such as 2D Dopplergrams, EM maps).
- Data formats: FITS, using standard keywords, Ångström units for wavelength, arcseconds for spatial axes, and radiance in standard physical units or photon equivalents.
File sizes and volume estimates:
- Data rates of a few Mbps; e.g., a 10-min “QS Fast” raster produces 200–300 MB of Level 0 data.
- Science campaigns typically yield 1–2 GB per 24 h, with archival interfaces through JAXA, ESA PSA, NASA SDAC, and integration into the Virtual Solar Observatory (VSO).
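The volume figures above follow directly from rate × duration. A back-of-envelope sketch, where the 3 Mbps rate is an assumed illustrative value consistent with the quoted 200–300 MB per 10-minute raster:

```python
# Level 0 volume in megabytes from a sustained rate in megabits/s
# over a raster duration in seconds (1 byte = 8 bits).
def level0_volume_mb(rate_mbps: float, duration_s: float) -> float:
    return rate_mbps * duration_s / 8.0

# 3 Mbps over a 10-min raster -> 225 MB, inside the 200-300 MB range quoted.
print(level0_volume_mb(3.0, 600.0))  # 225.0
```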
Significance:
- Enables tracking of plasma flows as small as 1–2 km s⁻¹.
- Supports construction of 2D maps of velocity, temperature, density, and composition at 0.14–0.20″ spatial and 10 s temporal scales.
- FITS and VO standards facilitate interoperability and multi-mission solar studies.
2. LEMUR Neural Network Dataset (NN): AutoML and Hyperparameter Optimization (Goodarzi et al., 14 Apr 2025, Kochnev et al., 8 Apr 2025)
The LEMUR NN dataset (Editor's term: “LEMUR-NN”) is an open-source suite built for benchmarking, neural architecture analysis, and AutoML workflows. It includes well-structured model implementations (Python, PyTorch), code for 36+ classification CNNs, transformer-based vision models, segmentation/detection heads, and Bayesian/complex-valued architectures, as well as complete hyperparameter-performance traces.
Dataset Scope and Storage:
- Architectures Covered: 14+ CV models (AlexNet, ResNet, EfficientNet, SwinTransformer, etc.), segmentation/detection networks (UNet, DeepLabV3, FasterRCNN, RetinaNet), planned NLP via HuggingFace-style transformers.
- Tasks: Image classification (CIFAR-10/100, MNIST, SVHN, etc.), semantic segmentation (COCO-Seg2017), object detection (COCO2017), and planned NLU/NLP.
- Storage: Each model in its own module (e.g. ab/nn/nn/<ModelName>.py). Results are written as per-trial JSON records, and all metrics are logged in a normalized SQLite database with a schema spanning model, metric, dataset, transforms, and parameters.
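To make the storage layout concrete, here is a toy sketch of a per-trial metrics store in the spirit of that normalized SQLite schema. The table name `stat` echoes the table mentioned later in this article, but the column set and values are illustrative, not the dataset's actual schema:

```python
import sqlite3

# Toy per-trial metrics table spanning model, dataset, metric, transform,
# and parameters, as in the normalized schema described above.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE stat (
        model TEXT, dataset TEXT, metric TEXT,
        transform TEXT, params TEXT, value REAL
    )
""")
conn.execute(
    "INSERT INTO stat VALUES (?, ?, ?, ?, ?, ?)",
    ("ResNet", "CIFAR-10", "accuracy", "norm_flip", '{"lr": 0.01}', 0.93),
)
# Typical query pattern: best recorded value for a given metric.
best = conn.execute(
    "SELECT model, MAX(value) FROM stat WHERE metric = 'accuracy'"
).fetchone()
print(best)  # ('ResNet', 0.93)
```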
Evaluation and Optimization Methodology:
- Hyperparameter search: learning rate, momentum, batch size, and categorical data transforms.
- Optimization: Optuna TPE-powered search loop, logging large numbers of trials per study for comprehensive search-space coverage.
- Metrics: accuracy, mean IoU, mAP@0.5, duration (ns/epoch).
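Of the metrics listed, mean IoU is the least self-explanatory; a minimal stdlib sketch of per-class intersection-over-union on flattened label grids, averaged over classes present in either prediction or target (the toy labels are made up):

```python
# Mean IoU over classes: |pred ∩ target| / |pred ∪ target| per class,
# averaged over classes with a non-empty union.
def mean_iou(pred, target, num_classes):
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious)

# Class 0: IoU 1/2; class 1: IoU 2/3; mean = 7/12.
print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
```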
API and Extension:
- Programmatic access: ab.nn.api.fetch_all_data(), ab.nn.api.data(), ab.nn.api.get_model_code(), all enabling one-shot retrieval of (code, hyperparams, metrics).
- VR and Edge: TorchScript/ONNX export, post-training int8 quantization, resource profiling (latency, memory).
Statistical and Graphical Tools:
- Outputs: Raw and aggregated performance plots, distribution histograms, rolling means, and correlation heatmaps.
- Plugins: Standalone tools for graphical summaries (nn-plots) and VR deployment (nn-vr).
- Repository: MIT license, source at https://github.com/ABrain-One/nn-dataset.
Application Example Table
| Component | Description | Access Pattern |
|---|---|---|
| Model code | Python class Net(...), per-model .py file | ab/nn/nn/<ModelName>.py |
| Hyperparameters/results | TPE/Optuna + JSON + SQLite per-trial record | /data/*.json, SQLite stat table |
| Programmatic API | Single-request for (code, perf, config) | ab.nn.api.* |
| Extensibility | Add new model, dataset loader with standardized API | ab/nn/nn/, ab/nn/dataset/ |
Significance:
- Designed for benchmarking AutoML (AutoGluon, AutoKeras, LLM code-generation).
- Enables reproducible comparison of architectures, hyperparameter policies, and training curves at scale.
- Single-API access supports rapid workflow prototyping and model-ensemble scenarios.
3. LEMUR Neural Network Hyperparameter–Performance Pair Dataset (Kochnev et al., 8 Apr 2025)
A specialized subset of the LEMUR dataset was constructed for systematic benchmarking of hyperparameter optimization algorithms—specifically, to rigorously compare Optuna’s TPE and LLM-guided proposals (by Code-Llama fine-tuned with LoRA).
Composition and Methodology:
- Scope: 7,107 records covering 17 NN architectures: 14 computer vision (on CIFAR-10) and 3 text-generation (Salesforce/WikiText).
- Hyperparameter Space: learning_rate, momentum, batch_size, and epochs.
- Generation:
- 3,700 Optuna TPE (Bayesian optimization) trials.
- 1,900 LLM (Code-Llama, LoRA rank=32, $35$ epochs) first fine-tuning cycle proposals, run and appended.
- 1,500 LLM second-cycle proposals.
- Record structure (JSON): Includes model name, task type, epochs, learning rate, momentum, batch size, accuracy, optimization method, timestamp.
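A stdlib sketch of one such per-trial JSON record, following the field list above; the key names and values are illustrative, not the dataset's exact serialization:

```python
import json
import time

# One per-trial record with the fields named above: model name, task type,
# epochs, learning rate, momentum, batch size, accuracy, optimization
# method, and timestamp.
record = {
    "model": "ResNet",
    "task": "img-classification",
    "epochs": 10,
    "learning_rate": 0.01,
    "momentum": 0.9,
    "batch_size": 64,
    "accuracy": 0.87,
    "optimization": "optuna-tpe",
    "timestamp": int(time.time()),
}
line = json.dumps(record, sort_keys=True)  # one JSON object per trial
print(line)
```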
Performance Metrics:
- RMSE: $\mathrm{RMSE} = \sqrt{\tfrac{1}{n}\sum_{i=1}^{n} e_i^{2}}$, where $e_i$ is the accuracy-based error of trial $i$.
- 95% CI: $\mathrm{RMSE} \pm 1.96\,\sigma/\sqrt{n}$, with $\sigma$ the standard deviation of the per-trial errors.
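A minimal sketch of these two statistics, assuming the table's σ is the standard deviation of per-trial errors and the CI is the standard normal approximation; the four error values are made up:

```python
import math

# RMSE and normal-approximation 95% CI over per-trial accuracy errors.
def rmse_ci(errors):
    n = len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mean = sum(errors) / n
    sigma = math.sqrt(sum((e - mean) ** 2 for e in errors) / n)
    half = 1.96 * sigma / math.sqrt(n)
    return rmse, sigma, (rmse - half, rmse + half)

rmse, sigma, ci = rmse_ci([0.4, 0.6, 0.5, 0.7])
print(rmse, sigma, ci)
```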
Key Results Table (summarized from Table 4):
| Method | Trial Type | RMSE | σ | 95% CI |
|---|---|---|---|---|
| Optuna All | all | 0.589 | 0.219 | [0.581, 0.597] |
| Optuna Best | best only | 0.416 | 0.115 | [0.375, 0.456] |
| LLM Cycle 1 | all | 0.563 | 0.182 | [0.556, 0.570] |
| LLM Cycle 2 | all | 0.567 | 0.159 | [0.563, 0.572] |
| LLM Best | best only | 0.404 | 0.118 | [0.358, 0.480] |
| LLM One-shot | first gen | 0.533 | 0.162 | [0.470, 0.596] |
Significance:
- LLM-based HPO achieves RMSE competitive with TPE, particularly for certain architectures (SwinTransformer, SqueezeNet, VGG).
- Demonstrates generalization of LLM proposals to unseen models (VisionTransformer, MaxViT), occasionally outperforming traditional HPO for those cases.
4. LEMUR Animal Vocalization Spectrogram Dataset (Yip et al., 2024)
Distinct from NN or solar datasets, the LEMUR animal vocalization dataset comprises spectrographic recordings of lemur “grunt” calls from 8 species (Lemuridae and Indriidae). The principal aim is to enable hierarchical Bayesian inference of latent species-level spectral shapes.
Collection and Preprocessing:
- Species: Eulemur coronatus, E. rubriventer, E. flavifrons, E. fulvus, E. macaco, E. mongoz, Indri indri, Propithecus diadema.
- Sampled Calls: e.g., EC (1966), ER (6594), with analysis restricted to the 100 longest per species.
- Acquisition: Distance 2–10 m, microphones (Sennheiser ME 66/67, AKG CK 98), recorders (Marantz PMD 671, Olympus S100, etc.), 44.1 kHz/16-bit.
- Spectrograms: STFT (Praat 6.0.28); frequency axis of 26 one-third-octave bands (63 Hz–20 kHz); quantized time steps; amplitudes in dB SPL.
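The 26-band count follows from the standard one-third-octave spacing: exact base-2 band centers are $1000 \cdot 2^{n/3}$ Hz, with nominal labels rounded (62.5 → 63, 20159 → 20000). A quick sketch confirming the count:

```python
# Base-2 one-third-octave band centers from the nominal 63 Hz band (n = -12)
# up to the nominal 20 kHz band (n = 13): exactly 26 bands.
centers = [1000.0 * 2.0 ** (n / 3.0) for n in range(-12, 14)]
print(len(centers), round(centers[0], 1), round(centers[-1]))  # 26 62.5 20159
```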
Bayesian Model:
- Each spectrogram $Y_i(t,f)$ is modeled around a latent, time-warped, artifact-corrected species-level spectral shape $g_s$.
- Recordings are synchronized via a nonlinear time warping $w_i(t)$, so the shared shape is evaluated as $g_s(w_i(t), f)$.
- Artifacts are handled via a Gaussian process with a circular-time kernel.
- Inference: Markov Chain Monte Carlo for all parameters, with NNGP approximations for scalability.
Species Summaries and Distances:
- Posterior mean shapes $\bar{g}_s$ are computed by averaging over MCMC samples.
- Pairwise species dissimilarity is quantified by a distance between posterior mean shapes, e.g. $d(s,s') = \lVert \bar{g}_s - \bar{g}_{s'} \rVert$.
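A minimal sketch of such a pairwise dissimilarity, here simply the L2 distance between two flattened spectrogram grids; the 2×3 grids are toy values, not real posterior means, and the L2 choice is an illustrative assumption:

```python
import math

# L2 distance between two time-frequency grids (lists of rows), a simple
# stand-in for a distance between posterior mean spectral shapes.
def shape_distance(g_a, g_b):
    return math.sqrt(sum((a - b) ** 2
                         for row_a, row_b in zip(g_a, g_b)
                         for a, b in zip(row_a, row_b)))

g1 = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]]
g2 = [[1.0, 1.0, 2.0], [3.0, 4.0, 3.0]]
print(shape_distance(g1, g2))  # sqrt(1 + 4) ~= 2.236
```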
Predictive Evaluation:
- Models are compared via the Continuous Ranked Probability Score (CRPS) in cross-validation; the full model outperforms its ablations in 5/8 species.
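The source does not specify the predictive family used for scoring; as a representative instance, here is the closed-form CRPS for a Gaussian predictive distribution $\mathcal{N}(\mu, \sigma^2)$, the most common analytic case:

```python
import math

# Closed-form CRPS for a Gaussian predictive distribution:
# CRPS = sigma * (z * (2*Phi(z) - 1) + 2*phi(z) - 1/sqrt(pi)),  z = (y - mu)/sigma,
# where phi and Phi are the standard normal pdf and cdf. Lower is better.
def crps_gaussian(y, mu, sigma):
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return sigma * (z * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / math.sqrt(math.pi))

# CRPS is minimized when the observation sits at the predictive mean.
print(crps_gaussian(0.0, 0.0, 1.0))  # ~0.2337
```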
Significance:
- Delivers a fully Bayesian, nonstationary, time-aligned model of animal spectrograms that scales to large numbers of spectrogram measurements and yields interpretable interspecies distances.
5. Data Access, Interoperability, and API Conventions
Each LEMUR variant supports structured, programmatic access:
- Solar LEMUR: Portal download (JAXA, ESA, NASA), with VO-compliance for cross-mission queries, and FITS format for tool compatibility (SunPy, SolarSoft).
- LEMUR-NN and HPO Datasets: Open-source (MIT) at https://github.com/ABrain-One/nn-dataset, JSON (per trial or record), SQLite relational database, and direct pandas API access. Plugins for graphical output and VR/edge export (nn-plots, nn-vr).
- Animal Vocalization Dataset: Formats consistent with Praat STFT for spectrograms; modeling performed with custom hierarchical Bayesian code.
6. Applications and Implications
The LEMUR datasets support:
- High-fidelity coronal imaging and plasma diagnostics (solar LEMUR) down to $0.14''$ resolution and 1–2 km s⁻¹ velocity sensitivity, at 10 s cadence, enabling dynamic studies of solar activity (Teriaca et al., 2011).
- Systematic benchmarking, hyperparameter selection, and model introspection in deep learning, with robust support for LLM-centric AutoML research and resource-constrained deployment (Kochnev et al., 8 Apr 2025, Goodarzi et al., 14 Apr 2025).
- Advanced comparative analyses of animal sound evolution and bioacoustics through latent spectral shape modeling (Yip et al., 2024).
A plausible implication is that the LEMUR NN dataset accelerates LLM-driven AutoML workflows by simultaneously providing code, configuration, and performance in a single query, while the full solar LEMUR dataset defines a new standard in VUV imaging and spectroscopy. The animal vocalization variant demonstrates the scalability of Bayesian latent variable models in high-dimensional, artifact-rich biological data.
7. Summary Table: LEMUR Dataset Variants
| Variant | Domain | Core Content/Methodology |
|---|---|---|
| LEMUR Solar UV | Solar Physics | FITS imaging/spectrograph, sub-arcsec, VUV |
| LEMUR Neural Network (NN) | Machine Learning/AutoML | NN codebase, hyperparam, performance, API |
| LEMUR Hyperparam–Performance Pairs | HPO Benchmarking | Optuna + LLM-driven hyperparam search |
| LEMUR Animal Vocalization | Bioacoustics | Lemur “grunt” spectrograms, Bayesian model |
Each dataset is directly documented in the cited arXiv works and supports open, reproducible research in its respective domain.