EMBER2024: Expanded Malware Benchmark

Updated 25 January 2026
  • EMBER2024 is a large-scale, multi-format benchmark dataset for malware classification, offering extensive labeling and temporal uniformity.
  • It advances reproducibility with the EMBERv3 feature extraction suite, detailed PE analysis, and support for seven distinct classification tasks.
  • The challenge set rigorously evaluates classifiers against evasive malware samples, emphasizing low false-positive rates in realistic settings.

EMBER2024 is a large-scale, multi-platform benchmark dataset for holistic evaluation of malware classifiers. Developed in collaboration with the authors of EMBER2017 and EMBER2018, it responds to limitations of earlier public datasets by incorporating six file formats, expanded multi-label task support, an emphasis on temporal uniformity, and a challenge set of evasive malware undetected on initial antivirus scanning. By introducing the EMBERv3 feature extraction suite and comprehensive task labeling, EMBER2024 advances research reproducibility and enables systematic benchmarking across all major malware classification scenarios (Joyce et al., 5 Jun 2025).

1. Dataset Structure and Scope

EMBER2024 consists of 3,238,315 files (approximately 3.2 million), with a roughly even split between benign and malicious samples. The dataset was collected over 64 weeks under weekly quotas to ensure temporal uniformity. It encompasses six executable and document formats:

Format         Count (Train + Test + Challenge)
Win32 PE       1,923,225
Win64 PE         640,814
.NET (PE)        320,805
Android APK      256,256
Linux ELF         32,386
PDF Document      12,805

Files are distributed into three splits:

  • Training set: 2,626,000 files (weeks 1–52)
  • Test set: 606,000 files (weeks 53–64)
  • Challenge set: 6,315 evasive malware samples (see below)

Weekly quotas govern per-format and overall collection to ensure consistent distribution across time and file type. This structure improves upon EMBER2017 and EMBER2018, which focused exclusively on Windows PE files and narrower temporal spans.
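The chronological split rule above is simple to state in code. The sketch below is illustrative only: the `Sample` record and its field names are hypothetical, not the dataset's actual schema.

```python
from dataclasses import dataclass

# Hypothetical metadata record; field names are illustrative, not EMBER2024's schema.
@dataclass
class Sample:
    sha256: str
    week: int          # collection week, 1..64
    file_format: str   # e.g. "Win32 PE", "PDF Document"

def assign_split(sample: Sample) -> str:
    """Chronological split: weeks 1-52 -> train, weeks 53-64 -> test."""
    if 1 <= sample.week <= 52:
        return "train"
    if 53 <= sample.week <= 64:
        return "test"
    raise ValueError(f"week out of range: {sample.week}")
```

The challenge set is carved out separately by detection behavior rather than by week, so it is not part of this rule.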

2. Classification Task Expansion

EMBER2024 introduces label support for seven distinct malware classification tasks, a marked extension over EMBER2017 (binary detection only) and EMBER2018 (detection and limited family classification). The tasks and their respective labeling schema are as follows:

  • Malware Detection (binary): Classification of files as benign or malicious using thresholds of 0 versus ≥5 independent antivirus detections.
  • Family Classification: Labeling of 1,356,182 malicious files with "ClarAVy" family assignments spanning 6,787 families, enabling fine-grained taxonomy.
  • Behavior Identification (multi-label): Assignment of up to 118 behavior tags per file (e.g., "ransomware," "worm").
  • File Property Prediction (multi-label): 30 tags indicating properties such as "packed," "stripped," or platform-specific features.
  • Packer Identification (multi-label): Identification of use of one or more of 52 known packers (e.g., "upx," "themida").
  • Exploited Vulnerability Identification (multi-label): 293 CVE-style tags (e.g., "cve-2017-0144").
  • Threat Group Attribution (multi-label): Assignment to up to 43 threat groups (e.g., "APT28," "FIN7").

Each task can be evaluated independently or jointly. The extended, multi-label taxonomy increases benchmark realism and supports a diverse array of research questions in malware analytics.
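For the multi-label tasks, a file's tags are conventionally encoded as a binary indicator vector over the task's tag vocabulary. A minimal sketch, using a three-tag stand-in for the 118-tag behavior vocabulary (the real tag list ships with the dataset):

```python
# Illustrative three-tag vocabulary; the actual behavior task defines 118 tags.
BEHAVIOR_TAGS = ["ransomware", "worm", "keylogger"]
TAG_INDEX = {tag: i for i, tag in enumerate(BEHAVIOR_TAGS)}

def encode_tags(tags: list[str]) -> list[int]:
    """Return a 0/1 indicator vector over the task vocabulary; unknown tags are ignored."""
    vec = [0] * len(BEHAVIOR_TAGS)
    for t in tags:
        i = TAG_INDEX.get(t)
        if i is not None:
            vec[i] = 1
    return vec
```

The same encoding applies to the file-property (30 tags), packer (52), CVE (293), and threat-group (43) tasks, each with its own vocabulary.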

3. Evasive Malware Challenge Set

The EMBER2024 "challenge" set comprises 6,315 malicious files initially undetected by all tested antivirus products (0/≈70 detections on VirusTotal), but subsequently recognized as malicious (≥5 detections) upon re-scan at least 30 days later. Samples are selected across all 64 collection weeks and are filtered to ensure no overlap or near-duplicate (TLSH distance ≤ 30) with the train and test sets. The primary role of the challenge set is robust evaluation of classifiers against real-world evasive malware, with an emphasis on performance at low false-positive rates. This design addresses the practical importance of evaluating detection capabilities on samples specifically crafted or evolved to evade contemporary antivirus solutions.
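The selection criteria stated above can be expressed as a single predicate. This sketch assumes the detection counts and a precomputed minimum TLSH distance to the train/test corpus are already available; how those are obtained (VirusTotal queries, TLSH comparisons) is outside the snippet.

```python
def is_challenge_candidate(initial_detections: int,
                           rescan_detections: int,
                           rescan_delay_days: int,
                           min_tlsh_distance_to_corpus: int) -> bool:
    """Challenge-set rule described above: 0 detections on the initial scan,
    >= 5 detections on a rescan at least 30 days later, and no near-duplicate
    (TLSH distance <= 30) in the train or test sets."""
    return (initial_detections == 0
            and rescan_detections >= 5
            and rescan_delay_days >= 30
            and min_tlsh_distance_to_corpus > 30)
```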

4. EMBER Feature Version 3

EMBERv3, the latest feature extraction protocol, raises the raw feature vector dimension to $D = 2568$ (from 2381 in v2), with a truncated vector ($D_0 = 696$) for non-PE and unparsable PE files. Portable Executable (PE) features are extensively expanded:

  • PE COFF Header: Adds number_of_sections, sizeof_optional_header, number_of_symbols, pointer_to_symbol_table.
  • PE Optional Header: Incorporates 13 fields (e.g., sizes/pointers for code, headers, stack/heap, entrypoints, checksums).
  • Section Header: Adds physical and virtual size ratios and reduces the section name one-hot encoding.
  • General: file_size, file_entropy, magic_bytes for all formats.
  • Strings: 76 regex-derived features from URLs, APIs, cryptographic identifiers, and others.
  • DOS Header: All e_* fields (e_magic, e_lfanew, etc.).
  • Data Directories: (name, size, virtual address) for each directory.
  • Rich Header: Vectorized via hash binning ("hashing trick") applied to Rich entries.
  • Authenticode Signatures: num_certs, self_signed, chain_max_depth, parse errors, and timing deltas.
  • PE Parse Warnings: 88 binary flags summarizing pefile parsing anomalies.

Mathematical formulations illustrated in the dataset include:

  • Byte-histogram normalization: $h_i = \text{count}_i / \sum_j \text{count}_j$, $i = 0 \dots 255$.
  • Section-size ratios: $\rho_{\text{phys}} = \frac{\text{PhysicalSize}_{\text{section}}}{\text{FileSize}}$, $\rho_{\text{virt}} = \frac{\text{PhysicalSize}_{\text{section}}}{\text{VirtualSize}_{\text{section}}}$.
  • Rich entry hashing: $\text{bin} = H(\text{entry\_string}) \bmod R$.

Non-PE and broken PE files support feature extraction in the categories of general, byte_histogram, byte_entropy_histogram, and string features only. The final feature vector $x \in \mathbb{R}^D$ concatenates all normalized numeric and hashed-count features.
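The three formulations above can be sketched directly. This is not the EMBERv3 implementation: the hash function standing in for $H$ and the bin count $R$ are arbitrary choices for illustration.

```python
import hashlib
from collections import Counter

def byte_histogram(data: bytes) -> list[float]:
    """Normalized byte histogram: h_i = count_i / sum_j count_j, i = 0..255."""
    counts = Counter(data)
    total = len(data) or 1
    return [counts.get(i, 0) / total for i in range(256)]

def section_ratios(physical_size: int, virtual_size: int, file_size: int):
    """rho_phys = PhysicalSize/FileSize, rho_virt = PhysicalSize/VirtualSize."""
    rho_phys = physical_size / file_size if file_size else 0.0
    rho_virt = physical_size / virtual_size if virtual_size else 0.0
    return rho_phys, rho_virt

def rich_entry_bins(entries: list[str], num_bins: int = 16) -> list[int]:
    """Hashing trick over Rich header entry strings: bin = H(entry) mod R.
    MD5 stands in for the unspecified hash H; num_bins stands in for R."""
    bins = [0] * num_bins
    for e in entries:
        h = int.from_bytes(hashlib.md5(e.encode()).digest()[:8], "big")
        bins[h % num_bins] += 1
    return bins
```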

5. Evaluation Protocols and Reproducibility

Data splits are strictly chronological to model concept drift:

  • Train: Files from weeks 1–52 (2.626M files)
  • Test: Weeks 53–64 (606,000 files)
  • Challenge: 6,315 evasive malware samples, paired during evaluation with benign test files from the same format

Recommended evaluation metrics:

  • Malware Detection: ROC AUC, PR AUC, and true-positive rate at a specified low false-positive rate (FPR ≤ 0.01)
  • Multi-label Tasks: Macro/micro precision, recall, F1, and average AUC per label
  • Family Classification: Accuracy, weighted/macro precision/recall/F1
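The headline detection metric, TPR at a fixed low FPR, is easy to compute from scored predictions. A minimal sketch (it sweeps thresholds in score order and, for brevity, does not group tied scores):

```python
def tpr_at_fpr(labels: list[int], scores: list[float], max_fpr: float = 0.01) -> float:
    """True-positive rate at the most permissive threshold whose FPR <= max_fpr.
    labels: 1 = malicious, 0 = benign; scores: higher = more malicious."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        return 0.0
    tp = fp = 0
    best_tpr = 0.0
    # Sweep thresholds from the highest score downward.
    for score, label in sorted(zip(scores, labels), key=lambda p: -p[0]):
        if label == 1:
            tp += 1
        else:
            fp += 1
        if fp / n_neg <= max_fpr:
            best_tpr = max(best_tpr, tp / n_pos)
    return best_tpr
```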

Best-practice reproducibility guidelines include:

  • Use of the provided codebase for feature extraction, labeling, TLSH duplicate filtering, and LightGBM classifier training
  • Fixing random seeds, recording all hyperparameters
  • Enforcement of non-overlap/near-duplicate policy (TLSH ≤ 30) within splits
  • Chronological maintenance of splits and explicit pairing of challenge set samples with same-format benign files to manage domain shift
  • Documentation of the VirusTotal API version and rescan policy (minimum 30-day interval)
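Several of these guidelines amount to pinning a seed and logging the full run configuration. A minimal sketch, with illustrative field names (the hyperparameters shown are common LightGBM settings, not prescribed values):

```python
import json
import random

def record_run_config(path: str, seed: int, hyperparams: dict,
                      vt_api_version: str, rescan_days: int = 30) -> dict:
    """Fix the random seed and write everything needed to reproduce a run.
    Field names are illustrative, not a prescribed schema."""
    random.seed(seed)
    config = {
        "seed": seed,
        "hyperparams": hyperparams,
        "virustotal_api_version": vt_api_version,
        "rescan_interval_days": rescan_days,
        "tlsh_duplicate_threshold": 30,
    }
    with open(path, "w") as f:
        json.dump(config, f, indent=2, sort_keys=True)
    return config
```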

This methodological transparency and prescriptive guidance position EMBER2024 as a standard for rigorous, reproducible malware classification research.

6. Significance and Position in the Literature

EMBER2024 is the first benchmark to simultaneously enable:

  • Multi-format (Windows, Android, Linux, PDF) cross-section of malware
  • Full multi-task, multi-label evaluation across seven practically relevant classification settings
  • Evaluation on a curated, real-world evasive malware challenge set
  • Fine-grained, up-to-date feature extraction with full protocol transparency

Earlier EMBER editions were limited to PE files and binary or limited family classification. EMBER2024’s expanded formats and ground truth annotations facilitate research on generalization, adversarial robustness, concept drift, and the evaluation of training regimes under realistic adversarial conditions. Its design reflects both the current threat landscape and the evolving needs of the academic malware analysis community, forming a foundation for future advances in automated malware classification (Joyce et al., 5 Jun 2025).
