LEMUR Dataset: Multi-Domain Scientific Data

Updated 2 December 2025
  • LEMUR Dataset is a multi-domain collection featuring high-resolution solar imaging, neural network benchmarking, and lemur vocalization spectrograms for Bayesian analysis.
  • It provides standardized data formats and detailed methodologies, ensuring reproducible research and smooth interoperability across diverse scientific fields.
  • The dataset supports advanced AutoML benchmarking and astrophysical diagnostics, enabling precise hyperparameter optimization and in-depth bioacoustic studies.

The term “LEMUR Dataset” refers to several distinct datasets, all bearing the LEMUR designation, across scientific disciplines. These include: (1) the LEMUR solar vacuum-ultraviolet (VUV) telescope and its high-resolution imaging spectrograph data for solar physics; (2) the LEMUR neural network (NN) datasets providing structured neural network architectures, training runs, hyperparameter-performance records, and Optuna-based benchmarking for AutoML and hyperparameter optimization research; and (3) the LEMUR animal vocalization spectrogram dataset collected for Bayesian modeling of latent animal spectral shapes. The following sections delineate the principal variants, their systematics, and methodologies, drawing directly from the source literature.

1. Solar LEMUR: High-Resolution VUV Imaging Spectroscopy

The Large European Module for solar Ultraviolet Research (LEMUR) dataset originates from a high-throughput VUV solar telescope, part of the JAXA Solar-C mission, designed for simultaneous high-resolution, multi-wavelength studies of the solar outer atmosphere. The LEMUR module combines a 30 cm off-axis paraboloid primary mirror (focal length 3.6 m, f/12) with a suite of high-resolution VUV spectrographs and imaging cameras.

Core instrument characteristics:

  • Optics: VUV broadband coating (B₄C ∼100 Å over Mo/Si multilayers) yields SW-band reflectance 25–35%, and LW-band 40–50%.
  • Spectroscopy: Six channels spanning 170–1270 Å with resolving power R ∼ 1.7×10⁴ to 3.2×10⁴; dispersion ∼10 mÅ/pixel (SW), ∼40 mÅ/pixel (LW).
  • Imaging: Spatial sampling 0.14″/pixel; slit length 280″ (∼2000 px); slit widths 0.14″–5″.
  • Temporal cadence: Exposures down to 0.5 s; a typical raster scan (100″ × 280″ at 0.28″ step) completes in ∼50 s.
  • Velocity sensitivity: Doppler shifts measurable to Δv ≲ 2 km s⁻¹ from line centroids; e.g., a centroid precision of Δλ = 0.001 Å at λ = 200 Å implies Δv = cΔλ/λ ≈ 1.5 km s⁻¹.
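
The velocity figures above follow from the first-order Doppler relation Δv = c·Δλ/λ: a line-centroid precision of about 1 mÅ (a tenth of the ∼10 mÅ/pixel SW dispersion) at 200 Å corresponds to ∼1.5 km s⁻¹. A minimal sketch of the arithmetic, independent of any mission software:

```python
# Doppler velocity resolution from line-centroid precision: dv = c * dlambda / lam
C_KM_S = 2.99792458e5  # speed of light, km/s

def doppler_velocity(dlambda_angstrom: float, lam_angstrom: float) -> float:
    """First-order Doppler shift in km/s for a centroid shift dlambda at wavelength lam."""
    return C_KM_S * dlambda_angstrom / lam_angstrom

# A ~1 mA centroid precision at 200 A gives ~1.5 km/s, consistent with the
# quoted <~2 km/s sensitivity.
dv = doppler_velocity(0.001, 200.0)
print(f"{dv:.2f} km/s")  # 1.50 km/s
```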

LEMUR data calibration:

  • Absolute radiometric accuracy ≲15% (standard-traceable VUV transfer); wavelength scale set via onboard lamps and solar reference lines.
  • Multi-level data reduction: Level 0 (raw telemetry), Level 1 (flux- and wavelength-calibrated spectrograms), Level 2 (physical maps: intensity, Doppler shift, density, temperature), Level 3 (higher-level products such as 2D Dopplergrams and EM maps).
  • Data formats: FITS with standard keywords; Ångströms for wavelength, arcseconds for spatial axes, radiance in W m⁻² sr⁻¹ Å⁻¹ or photon equivalents.
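
The two radiance conventions are related through the photon energy E = hc/λ, so converting between watt and photon units is a one-line calculation. A sketch (plain physics, not mission calibration code):

```python
# Convert spectral radiance from W m^-2 sr^-1 A^-1 to
# photon s^-1 m^-2 sr^-1 A^-1 by dividing by the photon energy E = h*c/lambda.
H = 6.62607015e-34   # Planck constant, J s
C = 2.99792458e8     # speed of light, m/s

def watts_to_photons(radiance_w: float, lam_angstrom: float) -> float:
    """Photon-unit equivalent of a watt-unit radiance at wavelength lam."""
    photon_energy_j = H * C / (lam_angstrom * 1e-10)
    return radiance_w / photon_energy_j

# At 200 A, one watt-unit of radiance corresponds to ~1e17 photons per second.
n = watts_to_photons(1.0, 200.0)
```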

File sizes and volume estimates:

  • Data rate ∼1.5 Mbps; e.g., a 10-min “QS Fast” raster produces ∼200–300 MB of Level 0 data.
  • Science campaigns typically yield 1–2 GB per 24 h, with archival interfaces through JAXA, ESA PSA, NASA SDAC, and integration into the Virtual Solar Observatory (VSO).

Significance:

  • Enables tracking of plasma flows as small as 1–2 km s⁻¹.
  • Supports construction of 2D maps of velocity, temperature, density, and composition at 0.14–0.20″ spatial and 10 s temporal scales.
  • FITS and VO standards facilitate interoperability and multi-mission solar studies.

2. LEMUR-NN: Neural Network Benchmarking Suite

The LEMUR NN dataset (Editor's term: “LEMUR-NN”) is an open-source suite built for benchmarking, neural architecture analysis, and AutoML workflows. It includes well-structured model implementations (Python, PyTorch) covering 36+ classification CNNs, transformer-based vision models, segmentation/detection heads, and Bayesian/complex-valued architectures, along with complete hyperparameter-performance traces.

Dataset Scope and Storage:

  • Architectures Covered: 14+ CV models (AlexNet, ResNet, EfficientNet, SwinTransformer, etc.), segmentation/detection networks (UNet, DeepLabV3, FasterRCNN, RetinaNet), planned NLP via HuggingFace-style transformers.
  • Tasks: Image classification (CIFAR-10/100, MNIST, SVHN, etc.), semantic segmentation (COCO-Seg2017), object detection (COCO2017), and planned NLU/NLP.
  • Storage: Each model in its own module (e.g. ab/nn/nn/<ModelName>.py). Results written as per-trial JSON records, and all metrics logged in a normalized SQLite database with schema spanning model, metric, dataset, transforms, and parameters.
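
The normalized SQLite layout described above can be sketched with the standard library. The column set below is illustrative (chosen to mirror the model/metric/dataset/transform/parameter fields named in the text), not the dataset's actual schema:

```python
import sqlite3

# Illustrative schema: one row per trial, spanning model, dataset,
# transform, hyperparameters, and the resulting metric.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE stat (
        model     TEXT,
        dataset   TEXT,
        transform TEXT,
        lr        REAL,
        momentum  REAL,
        batch     INTEGER,
        accuracy  REAL
    )
""")
conn.executemany(
    "INSERT INTO stat VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("ResNet",  "CIFAR-10", "norm", 0.01, 0.9, 64, 0.83),
        ("ResNet",  "CIFAR-10", "norm", 0.10, 0.8, 32, 0.71),
        ("AlexNet", "CIFAR-10", "norm", 0.01, 0.9, 64, 0.64),
    ],
)

# Best accuracy per model, the kind of query an AutoML benchmark would run.
rows = conn.execute(
    "SELECT model, MAX(accuracy) FROM stat GROUP BY model ORDER BY model"
).fetchall()
```

A relational layout like this is what makes cross-model aggregation (best trial per architecture, accuracy-vs-batch-size curves) a single SQL query rather than a scan over per-trial JSON files.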

Evaluation and Optimization Methodology:

  • Hyperparameter search: learning rate α ∼ LogUniform, momentum μ ∼ Uniform, batch size 2ᵏ, categorical data transforms.
  • Optimization: Optuna TPE-driven search loop, logging up to 4×10⁴ trials per study for thorough coverage of the search space.
  • Metrics: Accuracy (1/N) Σᵢ 1(yᵢ = ŷᵢ), mean IoU, mAP@0.5, duration (ns/epoch).
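
The classification and segmentation metrics above are standard; a dependency-free sketch of accuracy and IoU (binary-mask case) for concreteness:

```python
def accuracy(y_true, y_pred):
    """(1/N) * sum of 1(y_i == yhat_i) over all samples."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def iou(mask_a, mask_b):
    """Intersection over union of two binary masks (flattened 0/1 lists)."""
    inter = sum(a and b for a, b in zip(mask_a, mask_b))
    union = sum(a or b for a, b in zip(mask_a, mask_b))
    return inter / union if union else 1.0

acc = accuracy([1, 0, 2, 1], [1, 0, 1, 1])   # 3 of 4 correct -> 0.75
jac = iou([1, 1, 0, 0], [1, 0, 1, 0])        # 1 overlap / 3 union -> 0.333...
```

Mean IoU averages this per-class score over classes; mAP@0.5 additionally ranks detections by confidence before averaging precision.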

API and Extension:

  • Programmatic access: ab.nn.api.fetch_all_data(), ab.nn.api.data(), ab.nn.api.get_model_code(), all enabling one-shot retrieval of (code, hyperparams, metrics).
  • VR and Edge: TorchScript/ONNX export, post-training int8 quantization, resource profiling (latency, memory).

Statistical and Graphical Tools:

  • Outputs: Raw and aggregated performance plots, distribution histograms, rolling means, and correlation heatmaps.
  • Plugins: Standalone tools for graphical summaries (nn-plots) and VR deployment (nn-vr).
  • Repository: MIT license, source at https://github.com/ABrain-One/nn-dataset.

Application Example Table

| Component | Description | Access Pattern |
|---|---|---|
| Model code | Python class Net(...), per-model .py file | ab/nn/nn/<ModelName>.py |
| Hyperparameters/results | TPE/Optuna trials, per-trial JSON + SQLite records | /data/*.json, SQLite stat table |
| Programmatic API | Single request for (code, perf, config) | ab.nn.api.* |
| Extensibility | Add new models and dataset loaders via standardized API | ab/nn/nn/, ab/nn/dataset/ |

Significance:

  • Designed for benchmarking AutoML (AutoGluon, AutoKeras, LLM code-generation).
  • Enables reproducible comparison of architectures, hyperparameter policies, and training curves at scale.
  • Single-API access augments rapid workflow prototyping and model ensemble scenarios.

3. LEMUR Hyperparameter–Performance Benchmark

A specialized subset of the LEMUR dataset was constructed for systematic benchmarking of hyperparameter optimization algorithms, specifically to rigorously compare Optuna's TPE against LLM-guided proposals (from Code-Llama fine-tuned with LoRA).

Composition and Methodology:

  • Scope: 7,107 records covering 17 NN architectures: 14 computer vision (on CIFAR-10), 3 text-generation (Salesforce/WikiText).
  • Hyperparameter Space:
    • learning_rate: α ∼ U(10⁻⁴, 1.0)
    • momentum: μ ∼ U(0.01, 0.99)
    • batch_size: b ∈ {4, 8, 16, 32, 64}
    • epochs: e ∈ {1, 2, 5}
  • Generation:
    • 3,700 Optuna TPE (Bayesian optimization) trials.
    • 1,900 LLM proposals (Code-Llama, LoRA rank 32, 35 epochs) from the first fine-tuning cycle, run and appended.
    • 1,500 LLM second-cycle proposals.
  • Record structure (JSON): Includes model name, task type, epochs, learning rate, momentum, batch size, accuracy, optimization method, timestamp.
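
The search space above is small enough to sample directly; a sketch using only the standard library, with the distributions taken verbatim from the list (uniform for learning rate and momentum, categorical for batch size and epochs):

```python
import random

random.seed(0)  # reproducible draws for the example

def sample_config():
    """Draw one hyperparameter configuration from the benchmark's search space."""
    return {
        "learning_rate": random.uniform(1e-4, 1.0),   # alpha ~ U(1e-4, 1.0)
        "momentum":      random.uniform(0.01, 0.99),  # mu ~ U(0.01, 0.99)
        "batch_size":    random.choice([4, 8, 16, 32, 64]),
        "epochs":        random.choice([1, 2, 5]),
    }

cfg = sample_config()
```

Random sampling like this is the usual baseline against which TPE and LLM-guided proposals are judged.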

Performance Metrics:

  • RMSE: RMSE = √((1/N) Σᵢ εᵢ²), with εᵢ = 1 − aᵢ (accuracy-based error).
  • 95% CI: SE = σ/√N; CI = RMSE ± t(α/2, N−1) · SE.
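
Both formulas fit in a few lines. The sketch below approximates the t-quantile by the normal quantile 1.96, which is adequate at the record counts involved (an approximation for illustration, not the paper's exact procedure):

```python
import math
import statistics

def rmse_ci(accuracies, z=1.96):
    """RMSE of errors e_i = 1 - a_i, with an approximate 95% CI.

    SE = sigma / sqrt(N); CI = RMSE +/- z * SE (z approximates t for large N).
    """
    errors = [1.0 - a for a in accuracies]
    n = len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    sigma = statistics.pstdev(errors)  # population std of the errors
    se = sigma / math.sqrt(n)
    return rmse, (rmse - z * se, rmse + z * se)

r, (lo, hi) = rmse_ci([0.6, 0.4, 0.5, 0.5])  # errors 0.4, 0.6, 0.5, 0.5
```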

Key Results Table (summarized from Table 4):

| Method | Trial Type | RMSE | σ | 95% CI |
|---|---|---|---|---|
| Optuna All | all | 0.589 | 0.219 | [0.581, 0.597] |
| Optuna Best | best only | 0.416 | 0.115 | [0.375, 0.456] |
| LLM Cycle 1 | all | 0.563 | 0.182 | [0.556, 0.570] |
| LLM Cycle 2 | all | 0.567 | 0.159 | [0.563, 0.572] |
| LLM Best | best only | 0.404 | 0.118 | [0.358, 0.480] |
| LLM One-shot | first gen | 0.533 | 0.162 | [0.470, 0.596] |

Significance:

  • LLM-based HPO achieves RMSE competitive with TPE, particularly for certain architectures (SwinTransformer, SqueezeNet, VGG).
  • Demonstrates generalization of LLM proposals to unseen models (VisionTransformer, MaxViT), occasionally outperforming traditional HPO for those cases.

4. LEMUR Animal Vocalization Dataset

Distinct from the NN or solar datasets, the LEMUR animal vocalization dataset comprises spectrographic recordings of lemur “grunt” calls from 8 species (Lemuridae and Indriidae). The principal aim is to enable hierarchical Bayesian inference of latent species-level spectral shapes.

Collection and Preprocessing:

  • Species: Eulemur coronatus, E. rubriventer, E. flavifrons, E. fulvus, E. macaco, E. mongoz, Indri indri, Propithecus diadema.
  • Sampled calls: e.g., E. coronatus (1,966 calls) and E. rubriventer (6,594), with analysis restricted to the 100 longest calls per species.
  • Acquisition: Distance 2–10 m, microphones (Sennheiser ME 66/67, AKG CK 98), recorders (Marantz PMD 671, Olympus S100, etc.), 44.1 kHz/16-bit.
  • Spectrograms: STFT (Praat 6.0.28); frequency axis of 26 one-third-octave bands (63 Hz–20 kHz); time quantized at Δt = 0.01 s; amplitudes in dB SPL.
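
The 26-band frequency axis is consistent with standard one-third-octave spacing, where each band center is 2^(1/3) times the previous: starting at 63 Hz, the 26th center lands near 20 kHz. A quick consistency check (assuming standard band spacing, not the authors' exact implementation):

```python
# One-third-octave band centers: f_k = 63 * 2^(k/3) Hz, k = 0..25.
centers_hz = [63.0 * 2 ** (k / 3) for k in range(26)]

lowest, highest = centers_hz[0], centers_hz[-1]
# 26 bands starting at 63 Hz reach ~20.3 kHz, matching the quoted 63 Hz-20 kHz span.
```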

Bayesian Model:

  • Each spectrogram is modeled as Y_i(t,h) = μ_i + A_i(t,h) + ε_i(t,h), with
    • A_i(t,h) the latent, time-warped, artifact-corrected spectral shape;
    • calls synchronized via nonlinear time warping ψ(t; χ_i) = α_i + β_i l_i Ω(t/l_i; ζ_i, δ_i);
    • recording artifacts handled via a circular-time Gaussian-process kernel, W_2(t,h).
  • Inference: Markov chain Monte Carlo for all parameters, with nearest-neighbor Gaussian process (NNGP) approximations for scalability.

Species Summaries and Distances:

  • Posterior mean S_ℓ(t_j, h_k) computed as E[A_ℓ(t_j, h_k) | y] for each species ℓ.
  • Pairwise species dissimilarity quantified by d(S_ℓ, S_ℓ′) = (1/(|T||H|)) Σ_{j,k} [S_ℓ(t_j, h_k) − S_ℓ′(t_j, h_k)]².
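
The dissimilarity is simply a mean squared difference over the shared time-frequency grid; a minimal sketch with nested lists standing in for posterior-mean surfaces:

```python
def dissimilarity(s_a, s_b):
    """d(S_a, S_b): mean squared difference over the |T| x |H| grid."""
    t_bins, h_bins = len(s_a), len(s_a[0])
    total = sum(
        (s_a[j][k] - s_b[j][k]) ** 2
        for j in range(t_bins)
        for k in range(h_bins)
    )
    return total / (t_bins * h_bins)

# Two toy 2x2 "posterior mean" surfaces; only one cell differs, by 2.
d = dissimilarity([[1.0, 2.0], [3.0, 4.0]],
                  [[1.0, 2.0], [3.0, 2.0]])
```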

Significance:

  • Delivers a fully Bayesian, nonstationary, time-aligned model of animal spectrograms, scaling to over 10⁶ measurements and yielding interpretable interspecies distances.

5. Data Access, Interoperability, and API Conventions

Each LEMUR variant supports structured, programmatic access:

  • Solar LEMUR: Portal download (JAXA, ESA, NASA), with VO-compliance for cross-mission queries, and FITS format for tool compatibility (SunPy, SolarSoft).
  • LEMUR-NN and HPO Datasets: Open-source (MIT) at https://github.com/ABrain-One/nn-dataset, JSON (per trial or record), SQLite relational database, and direct pandas API access. Plugins for graphical output and VR/edge export (nn-plots, nn-vr).
  • Animal Vocalization Dataset: Formats consistent with Praat STFT for spectrograms; modeling performed with custom hierarchical Bayesian code.

6. Applications and Implications

The LEMUR datasets support:

  • High-fidelity coronal imaging and plasma diagnostics (solar LEMUR) down to 0.14″ resolution and ∼2 km s⁻¹ velocity sensitivity at 10 s cadence, enabling dynamic studies of solar activity (Teriaca et al., 2011).
  • Systematic benchmarking, hyperparameter selection, and model introspection in deep learning, with robust support for LLM-centric AutoML research and resource-constrained deployment (Kochnev et al., 8 Apr 2025, Goodarzi et al., 14 Apr 2025).
  • Advanced comparative analyses of animal sound evolution and bioacoustics through latent spectral shape modeling (Yip et al., 2024).

A plausible implication is that the LEMUR NN dataset accelerates LLM-driven AutoML workflows by simultaneously providing code, configuration, and performance in a single query, while the full solar LEMUR dataset defines a new standard in VUV imaging and spectroscopy. The animal vocalization variant demonstrates the scalability of Bayesian latent variable models in high-dimensional, artifact-rich biological data.

7. Summary Table: LEMUR Dataset Variants

| Variant | Domain | Core Content/Methodology |
|---|---|---|
| LEMUR Solar UV | Solar Physics | FITS imaging/spectrograph data, sub-arcsec, VUV |
| LEMUR Neural Network (NN) | Machine Learning/AutoML | NN codebase, hyperparameters, performance, API |
| LEMUR Hyperparam–Performance Pairs | HPO Benchmarking | Optuna + LLM-driven hyperparameter search |
| LEMUR Animal Vocalization | Bioacoustics | Lemur “grunt” spectrograms, Bayesian model |

Each dataset is directly documented in the cited arXiv works and supports open, reproducible research in its respective domain.
