LEMUR Dataset: Multi-Domain Scientific Data
- The LEMUR designation spans several distinct scientific datasets, including high-resolution solar VUV imaging spectroscopy, neural network benchmarking suites, and lemur vocalization spectrograms for Bayesian analysis.
- It provides standardized data formats and detailed methodologies, ensuring reproducible research and smooth interoperability across diverse scientific fields.
- The dataset supports advanced AutoML benchmarking and astrophysical diagnostics, enabling precise hyperparameter optimization and in-depth bioacoustic studies.
The term “LEMUR Dataset” refers to several distinct datasets—all bearing the LEMUR designation—across scientific disciplines. These include: (1) the LEMUR solar ultraviolet (VUV) telescope and its high-resolution imaging spectrograph data for solar physics; (2) the LEMUR neural network (NN) datasets providing structured neural network architectures, training runs, hyperparameter-performance records, and Optuna-based benchmarking for AutoML and hyperparameter optimization research; and (3) the LEMUR animal vocalization spectrogram dataset collected for Bayesian modeling of latent animal spectral shapes. The following sections delineate the principal variants, their systematics, and methodologies, drawing directly from the source literature.
1. LEMUR: Solar Ultraviolet Research Dataset (Teriaca et al., 2011)
The Large European Module for solar Ultraviolet Research (LEMUR) dataset originates from a high-throughput VUV solar telescope, part of the JAXA Solar-C mission, designed for simultaneous high-resolution, multi-wavelength studies of the solar outer atmosphere. The LEMUR module combines a 30 cm off-axis paraboloid primary mirror (focal length 3.6 m, f/12) with a suite of high-resolution VUV spectrographs and imaging cameras.
Core instrument characteristics:
- Optics: VUV broadband coating (B₄C ∼100 Å over Mo/Si multilayers) yields SW-band reflectance of 25–35% and LW-band reflectance of 40–50%.
- Spectroscopy: Six channels spanning 170–1270 Å with high spectral resolving power and mÅ-per-pixel dispersion in both the SW and LW bands.
- Imaging: Spatial sampling $0.14''$/pixel; slit length $280''$ (∼2000 px), widths ranging $0.14''–5''$.
- Temporal cadence: Exposure times down to $0.5$ s; typical raster scans (at $0.28''$ steps) complete on timescales of seconds.
- Velocity sensitivity: Doppler shifts measurable at the 1–2 km s⁻¹ level from line-centroid determination.
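As a sanity check on that sensitivity figure, a centroid shift maps to velocity via $v = c\,\Delta\lambda/\lambda_0$. A minimal sketch, where the 6.2 mÅ shift and 930 Å rest wavelength are illustrative values (930 Å lies within LEMUR's 170–1270 Å coverage), not figures from the instrument paper:

```python
# Doppler velocity from a line-centroid shift: v = c * dlambda / lambda0.
C_KM_S = 2.998e5  # speed of light in km/s

def doppler_velocity_km_s(dlambda_angstrom: float, lambda0_angstrom: float) -> float:
    """Velocity for a centroid shift dlambda at rest wavelength lambda0 (both in A)."""
    return C_KM_S * dlambda_angstrom / lambda0_angstrom

# A ~6 mA shift at 930 A corresponds to ~2 km/s, the quoted sensitivity scale.
print(doppler_velocity_km_s(6.2e-3, 930.0))  # ~2.0 km/s
```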
LEMUR data calibration:
- Absolute radiometric calibration via standard-traceable VUV transfer; wavelength scale via onboard lamps and solar reference lines.
- Multi-level data reduction: Level 0 (raw telemetry), Level 1 (flux- and wavelength-calibrated spectrograms), Level 2 (physical maps: intensity, Doppler shift, density, temperature), Level 3 (higher-level data products such as 2D Dopplergrams, EM maps).
- Data formats: FITS, using standard keywords, Ångström units for wavelength, arcseconds for spatial axes, and radiance in standard physical units or photon equivalents.
File sizes and volume estimates:
- Data rates of a few Mbps; e.g., a 10-min “QS Fast” raster produces 200–300 MB of Level 0 data.
- Science campaigns typically yield 1–2 GB per 24 h, with archival interfaces through JAXA, ESA PSA, NASA SDAC, and integration into the Virtual Solar Observatory (VSO).
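The volume figures above follow directly from rate × duration. A back-of-envelope sketch, where the 3 Mbps rate is an assumed illustrative value consistent with the quoted 200–300 MB per 10-minute raster:

```python
# Level 0 volume in megabytes from a sustained rate in megabits/s
# over a raster duration in seconds (1 byte = 8 bits).
def level0_volume_mb(rate_mbps: float, duration_s: float) -> float:
    return rate_mbps * duration_s / 8.0

# 3 Mbps over a 10-min raster -> 225 MB, inside the 200-300 MB range quoted.
print(level0_volume_mb(3.0, 600.0))  # 225.0
```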
Significance:
- Enables tracking of plasma flows as small as 1–2 km s⁻¹.
- Supports construction of 2D maps of velocity, temperature, density, and composition at 0.14–0.20″ spatial and 10 s temporal scales.
- FITS and VO standards facilitate interoperability and multi-mission solar studies.
2. LEMUR Neural Network Dataset (NN): AutoML and Hyperparameter Optimization (Goodarzi et al., 14 Apr 2025, Kochnev et al., 8 Apr 2025)
The LEMUR NN dataset (Editor's term: “LEMUR-NN”) is an open-source suite built for benchmarking, neural architecture analysis, and AutoML workflows. It includes well-structured model implementations (Python, PyTorch), code for 36+ classification CNNs, transformer-based vision models, segmentation/detection heads, and Bayesian/complex-valued architectures, as well as complete hyperparameter-performance traces.
Dataset Scope and Storage:
- Architectures Covered: 14+ CV models (AlexNet, ResNet, EfficientNet, SwinTransformer, etc.), segmentation/detection networks (UNet, DeepLabV3, FasterRCNN, RetinaNet), planned NLP via HuggingFace-style transformers.
- Tasks: Image classification (CIFAR-10/100, MNIST, SVHN, etc.), semantic segmentation (COCO-Seg2017), object detection (COCO2017), and planned NLU/NLP.
- Storage: Each model in its own module (e.g. ab/nn/nn/<ModelName>.py). Results are written as per-trial JSON records, and all metrics are logged in a normalized SQLite database with a schema spanning model, metric, dataset, transforms, and parameters.
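To make the storage layout concrete, here is a toy sketch of a per-trial metrics store in the spirit of that normalized SQLite schema. The table name `stat` echoes the table mentioned later in this article, but the column set and values are illustrative, not the dataset's actual schema:

```python
import sqlite3

# Toy per-trial metrics table spanning model, dataset, metric, transform,
# and parameters, as in the normalized schema described above.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE stat (
        model TEXT, dataset TEXT, metric TEXT,
        transform TEXT, params TEXT, value REAL
    )
""")
conn.execute(
    "INSERT INTO stat VALUES (?, ?, ?, ?, ?, ?)",
    ("ResNet", "CIFAR-10", "accuracy", "norm_flip", '{"lr": 0.01}', 0.93),
)
# Typical query pattern: best recorded value for a given metric.
best = conn.execute(
    "SELECT model, MAX(value) FROM stat WHERE metric = 'accuracy'"
).fetchone()
print(best)  # ('ResNet', 0.93)
```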
Evaluation and Optimization Methodology:
- Hyperparameter search: learning rate, momentum, batch size, and categorical data transforms.
- Optimization: Optuna TPE-powered search loop, logging large numbers of trials per study for comprehensive search-space coverage.
- Metrics: accuracy, mean IoU, mAP@0.5, duration (ns/epoch).
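Of the metrics listed, mean IoU is the least self-explanatory; a minimal stdlib sketch of per-class intersection-over-union on flattened label grids, averaged over classes present in either prediction or target (the toy labels are made up):

```python
# Mean IoU over classes: |pred ∩ target| / |pred ∪ target| per class,
# averaged over classes with a non-empty union.
def mean_iou(pred, target, num_classes):
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious)

# Class 0: IoU 1/2; class 1: IoU 2/3; mean = 7/12.
print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
```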
API and Extension:
- Programmatic access: ab.nn.api.fetch_all_data(), ab.nn.api.data(), ab.nn.api.get_model_code(), all enabling one-shot retrieval of (code, hyperparams, metrics).
- VR and Edge: TorchScript/ONNX export, post-training int8 quantization, resource profiling (latency, memory).
Statistical and Graphical Tools:
- Outputs: Raw and aggregated performance plots, distribution histograms, rolling means, and correlation heatmaps.
- Plugins: Standalone tools for graphical summaries (nn-plots) and VR deployment (nn-vr).
- Repository: MIT license, source at https://github.com/ABrain-One/nn-dataset.
Application Example Table
| Component | Description | Access Pattern |
|---|---|---|
| Model code | Python class Net(...), per-model .py file | ab/nn/nn/<ModelName>.py |
| Hyperparameters/results | TPE/Optuna + JSON + SQLite per-trial record | /data/*.json, SQLite stat table |
| Programmatic API | Single-request for (code, perf, config) | ab.nn.api.* |
| Extensibility | Add new model, dataset loader with standardized API | ab/nn/nn/, ab/nn/dataset/ |
Significance:
- Designed for benchmarking AutoML (AutoGluon, AutoKeras, LLM code-generation).
- Enables reproducible comparison of architectures, hyperparameter policies, and training curves at scale.
- Single-API access supports rapid workflow prototyping and model-ensemble scenarios.
3. LEMUR Neural Network Hyperparameter–Performance Pair Dataset (Kochnev et al., 8 Apr 2025)
A specialized subset of the LEMUR dataset was constructed for systematic benchmarking of hyperparameter optimization algorithms—specifically, to rigorously compare Optuna’s TPE and LLM-guided proposals (by Code-Llama fine-tuned with LoRA).
Composition and Methodology:
- Scope: 7,107 records covering 17 NN architectures: 14 computer vision (on CIFAR-10) and 3 text-generation (Salesforce/WikiText).
- Hyperparameter Space: learning_rate, momentum, batch_size, and epochs.
- Generation:
- 3,700 Optuna TPE (Bayesian optimization) trials.
- 1,900 LLM (Code-Llama, LoRA rank=32, $35$ epochs) first fine-tuning cycle proposals, run and appended.
- 1,500 LLM second-cycle proposals.
- Record structure (JSON): Includes model name, task type, epochs, learning rate, momentum, batch size, accuracy, optimization method, timestamp.
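A stdlib sketch of one such per-trial JSON record, following the field list above; the key names and values are illustrative, not the dataset's exact serialization:

```python
import json
import time

# One per-trial record with the fields named above: model name, task type,
# epochs, learning rate, momentum, batch size, accuracy, optimization
# method, and timestamp.
record = {
    "model": "ResNet",
    "task": "img-classification",
    "epochs": 10,
    "learning_rate": 0.01,
    "momentum": 0.9,
    "batch_size": 64,
    "accuracy": 0.87,
    "optimization": "optuna-tpe",
    "timestamp": int(time.time()),
}
line = json.dumps(record, sort_keys=True)  # one JSON object per trial
print(line)
```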
Performance Metrics:
- RMSE: $\mathrm{RMSE} = \sqrt{\tfrac{1}{n}\sum_{i=1}^{n} e_i^{2}}$, where $e_i$ is the accuracy-based error of trial $i$.
- 95% CI: $\mathrm{RMSE} \pm 1.96\,\sigma/\sqrt{n}$, with $\sigma$ the standard deviation of the per-trial errors.
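A minimal sketch of these two statistics, assuming the table's σ is the standard deviation of per-trial errors and the CI is the standard normal approximation; the four error values are made up:

```python
import math

# RMSE and normal-approximation 95% CI over per-trial accuracy errors.
def rmse_ci(errors):
    n = len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mean = sum(errors) / n
    sigma = math.sqrt(sum((e - mean) ** 2 for e in errors) / n)
    half = 1.96 * sigma / math.sqrt(n)
    return rmse, sigma, (rmse - half, rmse + half)

rmse, sigma, ci = rmse_ci([0.4, 0.6, 0.5, 0.7])
print(rmse, sigma, ci)
```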
Key Results Table (summarized from Table 4):
| Method | Trial Type | RMSE | σ | 95% CI |
|---|---|---|---|---|
| Optuna All | all | 0.589 | 0.219 | [0.581, 0.597] |
| Optuna Best | best only | 0.416 | 0.115 | [0.375, 0.456] |
| LLM Cycle 1 | all | 0.563 | 0.182 | [0.556, 0.570] |
| LLM Cycle 2 | all | 0.567 | 0.159 | [0.563, 0.572] |
| LLM Best | best only | 0.404 | 0.118 | [0.358, 0.480] |
| LLM One-shot | first gen | 0.533 | 0.162 | [0.470, 0.596] |
Significance:
- LLM-based HPO achieves RMSE competitive with TPE, particularly for certain architectures (SwinTransformer, SqueezeNet, VGG).
- Demonstrates generalization of LLM proposals to unseen models (VisionTransformer, MaxViT), occasionally outperforming traditional HPO for those cases.
4. LEMUR Animal Vocalization Spectrogram Dataset (Yip et al., 2024)
Distinct from NN or solar datasets, the LEMUR animal vocalization dataset comprises spectrographic recordings of lemur “grunt” calls from 8 species (Lemuridae and Indriidae). The principal aim is to enable hierarchical Bayesian inference of latent species-level spectral shapes.
Collection and Preprocessing:
- Species: Eulemur coronatus, E. rubriventer, E. flavifrons, E. fulvus, E. macaco, E. mongoz, Indri indri, Propithecus diadema.
- Sampled Calls: e.g., EC (1966), ER (6594), with analysis restricted to the 100 longest per species.
- Acquisition: Distance 2–10 m, microphones (Sennheiser ME 66/67, AKG CK 98), recorders (Marantz PMD 671, Olympus S100, etc.), 44.1 kHz/16-bit.
- Spectrograms: STFT (Praat 6.0.28); frequency axis of 26 one-third-octave bands (63 Hz–20 kHz); quantized time steps; amplitudes in dB SPL.
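The 26-band count follows from the standard one-third-octave spacing: exact base-2 band centers are $1000 \cdot 2^{n/3}$ Hz, with nominal labels rounded (62.5 → 63, 20159 → 20000). A quick sketch confirming the count:

```python
# Base-2 one-third-octave band centers from the nominal 63 Hz band (n = -12)
# up to the nominal 20 kHz band (n = 13): exactly 26 bands.
centers = [1000.0 * 2.0 ** (n / 3.0) for n in range(-12, 14)]
print(len(centers), round(centers[0], 1), round(centers[-1]))  # 26 62.5 20159
```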
Bayesian Model:
- Each spectrogram $Y_i(t,f)$ is modeled around a latent, time-warped, artifact-corrected species-level spectral shape $g_s$.
- Recordings are synchronized via a nonlinear time warping $w_i(t)$, so the shared shape is evaluated as $g_s(w_i(t), f)$.
- Artifacts are handled via a Gaussian process with a circular-time kernel.
- Inference: Markov Chain Monte Carlo for all parameters, with NNGP approximations for scalability.
Species Summaries and Distances:
- Posterior mean shapes $\bar{g}_s$ are computed by averaging over MCMC samples.
- Pairwise species dissimilarity is quantified by a distance between posterior mean shapes, e.g. $d(s,s') = \lVert \bar{g}_s - \bar{g}_{s'} \rVert$.
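A minimal sketch of such a pairwise dissimilarity, here simply the L2 distance between two flattened spectrogram grids; the 2×3 grids are toy values, not real posterior means, and the L2 choice is an illustrative assumption:

```python
import math

# L2 distance between two time-frequency grids (lists of rows), a simple
# stand-in for a distance between posterior mean spectral shapes.
def shape_distance(g_a, g_b):
    return math.sqrt(sum((a - b) ** 2
                         for row_a, row_b in zip(g_a, g_b)
                         for a, b in zip(row_a, row_b)))

g1 = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]]
g2 = [[1.0, 1.0, 2.0], [3.0, 4.0, 3.0]]
print(shape_distance(g1, g2))  # sqrt(1 + 4) ~= 2.236
```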
Predictive Evaluation:
- Models are compared via the Continuous Ranked Probability Score (CRPS) in cross-validation; the full model outperforms its ablations in 5/8 species.
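The source does not specify the predictive family used for scoring; as a representative instance, here is the closed-form CRPS for a Gaussian predictive distribution $\mathcal{N}(\mu, \sigma^2)$, the most common analytic case:

```python
import math

# Closed-form CRPS for a Gaussian predictive distribution:
# CRPS = sigma * (z * (2*Phi(z) - 1) + 2*phi(z) - 1/sqrt(pi)),  z = (y - mu)/sigma,
# where phi and Phi are the standard normal pdf and cdf. Lower is better.
def crps_gaussian(y, mu, sigma):
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return sigma * (z * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / math.sqrt(math.pi))

# CRPS is minimized when the observation sits at the predictive mean.
print(crps_gaussian(0.0, 0.0, 1.0))  # ~0.2337
```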
Significance:
- Delivers a fully Bayesian, nonstationary, time-aligned model of animal spectrograms that scales to large numbers of spectrogram measurements and yields interpretable interspecies distances.
5. Data Access, Interoperability, and API Conventions
Each LEMUR variant supports structured, programmatic access:
- Solar LEMUR: Portal download (JAXA, ESA, NASA), with VO-compliance for cross-mission queries, and FITS format for tool compatibility (SunPy, SolarSoft).
- LEMUR-NN and HPO Datasets: Open-source (MIT) at https://github.com/ABrain-One/nn-dataset, JSON (per trial or record), SQLite relational database, and direct pandas API access. Plugins for graphical output and VR/edge export (nn-plots, nn-vr).
- Animal Vocalization Dataset: Formats consistent with Praat STFT for spectrograms; modeling performed with custom hierarchical Bayesian code.
6. Applications and Implications
The LEMUR datasets support:
- High-fidelity coronal imaging and plasma diagnostics (solar LEMUR) down to $0.14''$ resolution and 1–2 km s⁻¹ velocity sensitivity, at 10 s cadence, enabling dynamic studies of solar activity (Teriaca et al., 2011).
- Systematic benchmarking, hyperparameter selection, and model introspection in deep learning, with robust support for LLM-centric AutoML research and resource-constrained deployment (Kochnev et al., 8 Apr 2025, Goodarzi et al., 14 Apr 2025).
- Advanced comparative analyses of animal sound evolution and bioacoustics through latent spectral shape modeling (Yip et al., 2024).
A plausible implication is that the LEMUR NN dataset accelerates LLM-driven AutoML workflows by simultaneously providing code, configuration, and performance in a single query, while the full solar LEMUR dataset defines a new standard in VUV imaging and spectroscopy. The animal vocalization variant demonstrates the scalability of Bayesian latent variable models in high-dimensional, artifact-rich biological data.
7. Summary Table: LEMUR Dataset Variants
| Variant | Domain | Core Content/Methodology |
|---|---|---|
| LEMUR Solar UV | Solar Physics | FITS imaging/spectrograph, sub-arcsec, VUV |
| LEMUR Neural Network (NN) | Machine Learning/AutoML | NN codebase, hyperparam, performance, API |
| LEMUR Hyperparam–Performance Pairs | HPO Benchmarking | Optuna + LLM-driven hyperparam search |
| LEMUR Animal Vocalization | Bioacoustics | Lemur “grunt” spectrograms, Bayesian model |
Each dataset is directly documented in the cited arXiv works and supports open, reproducible research in its respective domain.