
EMBER Dataset Series Overview

Updated 25 January 2026
  • The EMBER Dataset Series comprises open, annotated datasets designed for scalable, reproducible ML research in both cosmology and cybersecurity.
  • The series employs U-Net variants and modulated convolutions for cosmological emulation, and gradient-boosted models for static malware detection.
  • The series provides comprehensive benchmarks, transparent data provenance, and open-source workflows to support rigorous evaluation across diverse applications.

The EMBER Dataset Series encompasses distinct, domain-specific open datasets that have had significant impact on machine learning research in cosmology and cybersecurity. These datasets share a common theme: enabling rigorous, scalable, and reproducible model development for challenging scientific and engineering tasks. In both baryonic emulation for cosmological simulations and static malware detection for Windows PE files, the EMBER series combines high-quality annotations, comprehensive feature engineering or field mapping, and open-source reference implementations for baseline models and data workflows.

1. Scope, Variants, and Community Motivation

The EMBER Dataset Series comprises two major lines of development:

  • Astrophysics/Cosmology—Baryonic Emulation Frameworks:
    • EMBER-1: Predicts two-dimensional gas and neutral hydrogen (H I) fields from dark-matter-only cosmological simulations at fixed redshift (z = 2). Primary focus: augmentation of large N-body cosmological runs with high-fidelity, hydrodynamically consistent baryon fields at high spatial resolution (Bernardini et al., 2021).
    • EMBER-2: Extends to multi-channel, continuous-redshift emulation of gas density, velocity, temperature, and H I across z = 6→0, incorporating modulated convolutional architectures that capture temporal evolution with substantially reduced parameter count (Bernardini et al., 21 Feb 2025).
  • Cybersecurity—Static Malware Detection:
    • EMBER-1.0: Provides a large, temporally stratified Windows PE dataset (1.1M samples with extracted features: balanced benign and malicious subsets, plus unlabeled data for semi-supervised research), with labels based on operational VirusTotal consensus. The series is envisioned as a foundation for future temporal, feature-driven, and multi-modal dataset releases (Anderson et al., 2018).

Both series are characterized by transparent data provenance, public code release, and explicit support for evaluation and benchmarking against baseline models (gradient-boosted trees, deep generative models).

2. Detailed Dataset Construction and Structure

EMBER (Cosmology)

  • Input/Target Fields: For EMBER-1, inputs are two-dimensional projected dark-matter density fields (full hydro or DM-only runs), with targets comprising matched-resolution gas and neutral hydrogen density grids. All grids are 4096² pixels (≈3.6 ckpc h⁻¹/pixel) for FIREbox sources.
  • Simulation Sources: Combines large-volume FIREbox hydrodynamical simulations (15 h⁻¹ cMpc box, 1024³ resolution) and zoom-in MassiveFIRE regions, capturing a hierarchy of environmental scales. Additional B100 DM-only boxes (100 h⁻¹ cMpc) extend application to larger volumes at inference.
  • Pre-processing: SPH-style smoothing radii are computed for all particles; particles are deposited with TIPGRID to dense, co-moving 2D slabs. A mixed-log normalization is then applied on a per-field basis to compress dynamic range and facilitate neural network training.
  • Slicing and Projection: Slabs are generated by axis-oriented slicing, yielding one 4096² map per slab. For zoom-ins, crops are randomly projected (10 per halo); a train/validation/test split is strictly enforced along axis and projection partitions.
  • Upsampling for Lower-resolution Inputs: Downsampling procedures allow training on lower-resolution DM fields, with mass conservation enforced via probabilistic mass redistribution.
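The mixed-log normalization step can be sketched as follows. This is an illustrative stand-in, not the exact per-field transform (which is defined in the released EMBER preprocessing code): here a simple log10 compression is rescaled to the unit interval, with the per-field statistics retained so the transform is invertible.

```python
import numpy as np

def mixed_log_normalize(field, eps=1e-8):
    """Log-compress a density field's dynamic range, then rescale to [0, 1].
    Illustrative stand-in for EMBER's per-field mixed-log normalization."""
    logf = np.log10(field + eps)
    lo, hi = float(logf.min()), float(logf.max())
    return (logf - lo) / (hi - lo), (lo, hi)

def mixed_log_denormalize(norm, stats, eps=1e-8):
    # Invert the transform using the stored per-field statistics.
    lo, hi = stats
    return 10.0 ** (norm * (hi - lo) + lo) - eps

rng = np.random.default_rng(0)
rho = rng.lognormal(mean=0.0, sigma=2.0, size=(64, 64))  # mock projected density
norm, stats = mixed_log_normalize(rho)
rho_back = mixed_log_denormalize(norm, stats)
print(norm.min(), norm.max())      # 0.0 1.0
print(np.allclose(rho, rho_back))  # True
```

Storing `(lo, hi)` per field is what makes predictions convertible back to physical densities after inference.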

EMBER (Malware)

  • Dataset Formation: 1.1M Windows PE samples, divided temporally (10 months for training, 2 months for a held-out test set), with benign and malicious instances balanced and labeled by VirusTotal consensus (0 detections for benign, >40 for malicious). 300K unlabeled samples are included for semi-supervised applications.
  • Feature Extraction: Each sample is parsed into eight groups of PE-specific and format-agnostic features (e.g., header info, imports/exports, section structure, byte and byte-entropy histograms, and string metadata), resulting after vectorization in a 2351-dimensional feature vector per executable.
  • Labeling and Splits: No human annotation is used; robust vendor majority voting maximizes labeling confidence. Released SHA-256 hashes and extraction scripts ensure reproducibility for future feature modeling.
  • Open Data Practices: All feature extraction code, vectorization, and baseline model training scripts are open-sourced for straightforward extensibility.
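As an illustration of the format-agnostic feature groups, a byte histogram and a whole-file Shannon entropy can be computed as below. Note this is a simplification: EMBER's actual byte-entropy feature is a 2-D histogram (window entropy versus byte value) over sliding windows, not a single scalar.

```python
import numpy as np

def byte_histogram(data: bytes) -> np.ndarray:
    # Normalized 256-bin histogram of raw byte values.
    counts = np.bincount(np.frombuffer(data, dtype=np.uint8), minlength=256)
    return counts / max(len(data), 1)

def shannon_entropy(data: bytes) -> float:
    # Entropy in bits/byte over the whole buffer (EMBER uses windowed,
    # binned entropies; this scalar version is for illustration only).
    p = byte_histogram(data)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

sample = bytes(range(256)) * 4  # perfectly uniform bytes -> maximal entropy
features = np.concatenate([byte_histogram(sample), [shannon_entropy(sample)]])
print(features.shape, shannon_entropy(sample))  # (257,) 8.0
```

In the real dataset, such groups are concatenated after vectorization into the 2351-dimensional feature vector per executable.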

3. Machine Learning Architectures and Baseline Methodologies

EMBER-1 and EMBER-2 (Cosmology)

  • Network Backbones: Both frameworks employ deep U-Net architectures but diverge in their approach to non-determinism and parameter efficiency:
    • EMBER-1: U-Net variants (deterministic and WGAN-based generative models with per-block noise injection for stochasticity).
    • EMBER-2: Adopts a modulated convolutional framework. A context-styling MLP encodes redshift into a style vector, which parametrically modulates each decoder block (via affine transformations), allowing continuous temporal adaptation. The architecture yields 4.5M parameters vs. 27M for EMBER-1.
  • Discriminator Designs: EMBER-2 employs a dual-path discriminator, integrating both spatial and Fourier-domain convolutional blocks, enhancing sensitivity to spectral fidelity and mitigating standard CNN spectral bias.
  • Loss Functions: EMBER-1 combines perceptual multi-scale SSIM and MSLE losses, while EMBER-2 relies exclusively on adversarial objectives with “noise-pluri-tuple” training to ensure the generator respects noise input for diverse baryonic realizations. No explicit physical conservation constraints are enforced during training in EMBER-2.
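The modulation idea in EMBER-2's decoder can be sketched with a minimal NumPy example: a small MLP maps the redshift to a style vector that rescales the kernel's input channels, followed by per-filter demodulation in the style of StyleGAN2. All shapes, the MLP, and the demodulation choice here are illustrative assumptions, not EMBER-2's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def style_mlp(z, w1, b1, w2, b2):
    # Tiny 2-layer MLP mapping scalar redshift z to a per-channel style vector.
    return w2 @ np.tanh(w1 * z + b1) + b2

def modulated_conv(x, weight, style):
    """Scale the kernel's input channels by the style vector, demodulate
    (unit norm per output filter), then apply a stride-1 'same' convolution.
    Shapes: x (C_in, H, W); weight (C_out, C_in, k, k); style (C_in,)."""
    w = weight * style[None, :, None, None]
    w = w / np.sqrt((w ** 2).sum(axis=(1, 2, 3), keepdims=True) + 1e-8)
    C_out, C_in, k, _ = w.shape
    pad = k // 2
    H, W = x.shape[1:]
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros((C_out, H, W))
    for o in range(C_out):          # naive convolution loop, fine for a demo
        for c in range(C_in):
            for i in range(k):
                for j in range(k):
                    out[o] += w[o, c, i, j] * xp[c, i:i + H, j:j + W]
    return out

C_in, C_out, hidden = 4, 2, 8
w1, b1 = rng.normal(size=hidden), np.zeros(hidden)
w2, b2 = rng.normal(size=(C_in, hidden)), np.ones(C_in)
weight = rng.normal(size=(C_out, C_in, 3, 3))
x = rng.normal(size=(C_in, 16, 16))  # mock dark-matter feature maps

# One shared kernel, conditioned on different redshifts via the style vector.
y_z0 = modulated_conv(x, weight, style_mlp(0.0, w1, b1, w2, b2))
y_z6 = modulated_conv(x, weight, style_mlp(6.0, w1, b1, w2, b2))
print(y_z0.shape)  # (2, 16, 16)
```

Because the conditioning enters through a handful of per-channel scales rather than separate per-redshift weights, one compact network covers the whole z = 6→0 range, which is how EMBER-2 reaches 4.5M parameters versus 27M for EMBER-1.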

EMBER (Malware)

  • Baselines: The canonical benchmark is a LightGBM gradient-boosted decision tree model, trained on the full vectorized feature matrix with default hyperparameters (100 trees, 31 leaves per tree, ~10K parameters), yielding AUC = 0.99911 on test. A direct comparison is provided to the MalConv byte-level CNN model, which, despite greater architectural complexity, underperforms the GBDT at identical test FPR.
  • Modeling Paradigm: The dataset and code are explicitly designed for both feature-based (human-interpretable) and featureless (“end-to-end”) research; the released code supports injection of new feature types and augmentations.

4. Evaluation Protocols, Metrics, and Dataset Utility

EMBER (Cosmology)

  • Summary Statistics: Evaluation includes global mass error (median ≈2% for gas, 5% for H I), pixelwise PDFs, 2D power spectra P(k) (with WGAN/ModConv model predictions matching the hydrodynamical reference to <10% at ≲10 ckpc), bispectra B(k,k,k), column density distribution functions f(N_HI), and halo abundance-matching metrics.
  • Cross-correlation Benchmarks: EMBER-2 quantifies the cross-power spectrum correlation r(k) between DM and baryon fields, maintaining Δ_r <5% for most scales and redshifts, with some degradation at small scales (∼20%) at z=0.
  • Efficiency Factors: EMBER-2 delivers ≈30,000× inference speedup over explicit hydrodynamics, enabling Monte Carlo sampling of baryonic field realizations for large-scale pipelines.
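A simple isotropically binned 2D power spectrum, the core of the P(k) comparison, can be computed as below. The normalization and binning conventions here are illustrative and vary between analyses; the published results use the conventions of the respective papers.

```python
import numpy as np

def power_spectrum_2d(field, box_size=1.0, nbins=16):
    """Isotropically binned 2D power spectrum of a square field.
    Illustrative conventions: per-mode power, linear k bins, DC mode dropped."""
    n = field.shape[0]
    pk2d = np.abs(np.fft.fftn(field)) ** 2 / n**4
    kfreq = np.fft.fftfreq(n, d=box_size / n) * 2 * np.pi
    kx, ky = np.meshgrid(kfreq, kfreq, indexing="ij")
    kmag = np.sqrt(kx**2 + ky**2).ravel()
    power = pk2d.ravel()
    mask = kmag > 0                      # drop the k = 0 (mean) mode
    kmag, power = kmag[mask], power[mask]
    bins = np.linspace(kmag.min(), kmag.max(), nbins + 1)
    idx = np.clip(np.digitize(kmag, bins) - 1, 0, nbins - 1)
    kcen = 0.5 * (bins[:-1] + bins[1:])
    pk = np.array([power[idx == b].mean() if (idx == b).any() else 0.0
                   for b in range(nbins)])
    return kcen, pk

rng = np.random.default_rng(0)
field = rng.normal(size=(64, 64))        # mock projected field
k, pk = power_spectrum_2d(field, box_size=15.0, nbins=12)
print(k.shape, pk.shape)                 # (12,) (12,)
```

The reported <10% agreement corresponds to the ratio of such spectra, pk_predicted / pk_reference, over the resolved k range.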

EMBER (Malware)

  • Model Metrics: Test ROC AUC, TPR at fixed FPR, and confusion matrix-based metrics are reported. For default LightGBM, ROC AUC = 0.99911, with >92.9% TPR at FPR = 0.1%.
  • Use Cases: EMBER supports a broad spectrum of research lines including concept-drift evaluation, semi-supervised learning with unlabeled samples, adversarial robustness benchmarking, and model interpretability due to delivered raw features.
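TPR at a fixed FPR, the operationally relevant metric here, can be computed from raw scores by thresholding at the appropriate quantile of the benign-score distribution. This sketch uses synthetic scores rather than real model output.

```python
import numpy as np

def tpr_at_fpr(scores, labels, target_fpr=1e-3):
    """TPR at the loosest threshold whose FPR does not exceed target_fpr."""
    scores = np.asarray(scores, float)
    labels = np.asarray(labels, int)
    neg = np.sort(scores[labels == 0])[::-1]     # negatives, highest first
    k = int(np.floor(target_fpr * len(neg)))     # allowed false positives
    thresh = neg[k] if k < len(neg) else -np.inf
    fpr = (scores[labels == 0] > thresh).mean()  # predict malicious if > thresh
    tpr = (scores[labels == 1] > thresh).mean()
    return tpr, fpr

rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(0, 1, 10000),    # benign scores
                         rng.normal(3, 1, 10000)])   # malicious scores
labels = np.concatenate([np.zeros(10000, int), np.ones(10000, int)])
tpr, fpr = tpr_at_fpr(scores, labels, target_fpr=1e-3)
print(fpr <= 1e-3)  # True
```

Reporting TPR at FPR = 0.1% rather than accuracy reflects deployment reality: false alarms on benign software are far costlier than on a balanced benchmark.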

5. Access, Formats, and Open Data Ecosystem

Both dataset lines follow the open-release model described above: the cosmology frameworks publish their simulation-derived maps alongside reference training code, and the malware dataset distributes vectorized features, SHA-256 hash lists, and the complete feature-extraction and baseline-training scripts, so that results can be reproduced and extended end to end.

6. Integrative Significance and Prospective Directions

The EMBER dataset paradigm, with its highly annotated, comprehensive, scalable, open, and benchmark-centric releases, has established foundational benchmarks in both cosmological baryonic emulation and cybersecurity static analysis. EMBER-2's adoption of modulated convolutional architectures demonstrates field-level generative modeling validated against rigorous physical summary statistics, while EMBER (Malware) advances reproducible comparative evaluation for operational security applications. Future enhancements may include redshift- and physics-conditioned attention for EMBER-2, expansion to dynamic malware analysis, and tighter integration across synthetic and real domains, supported by the transparent, modular, and extensible infrastructure the series has established.
