EMBER: An Open Dataset for Training Static PE Malware Machine Learning Models

Published 12 Apr 2018 in cs.CR | (1804.04637v2)

Abstract: This paper describes EMBER: a labeled benchmark dataset for training machine learning models to statically detect malicious Windows portable executable files. The dataset includes features extracted from 1.1M binary files: 900K training samples (300K malicious, 300K benign, 300K unlabeled) and 200K test samples (100K malicious, 100K benign). To accompany the dataset, we also release open source code for extracting features from additional binaries so that additional sample features can be appended to the dataset. This dataset fills a void in the information security machine learning community: a benign/malicious dataset that is large, open and general enough to cover several interesting use cases. We enumerate several use cases that we considered when structuring the dataset. Additionally, we demonstrate one use case wherein we compare a baseline gradient boosted decision tree model trained using LightGBM with default settings to MalConv, a recently published end-to-end (featureless) deep learning model for malware detection. Results show that even without hyper-parameter optimization, the baseline EMBER model outperforms MalConv. The authors hope that the dataset, code and baseline model provided by EMBER will help invigorate machine learning research for malware detection, in much the same way that benchmark datasets have advanced computer vision research.

Summary

  • The paper presents the EMBER dataset, a large open resource of features extracted from 1.1M Windows PE files (800K labeled, 300K unlabeled) for static malware detection.
  • It describes how raw features parsed from PE file structures are converted, partly through feature hashing, into numeric vectors suitable for machine learning models.
  • Experimental results show that a gradient-boosted decision tree baseline with default settings achieves a higher ROC AUC than MalConv, an end-to-end deep learning model.

EMBER: An Overview

The paper "EMBER: An Open Dataset for Training Static PE Malware Machine Learning Models" (1804.04637) introduces EMBER, an open benchmark dataset for training models that statically detect malicious Windows portable executable (PE) files. EMBER contains features extracted from 1.1 million binaries: 900K training samples (300K malicious, 300K benign, 300K unlabeled) and 200K test samples (100K malicious, 100K benign). The authors also release open source feature-extraction code, so features from additional binaries can be appended to the dataset.

Introduction and Background

Machine learning is well suited to automating data-driven tasks such as static malware detection. Public research in this area has nevertheless progressed slowly, because open, large-scale datasets comparable to those available for image recognition or sentiment analysis have been scarce. EMBER addresses this gap with a structured corpus of features derived from Windows PE files, giving researchers material to improve detection while navigating the legal and security challenges of working with malware.

The PE file format, ubiquitous in Microsoft Windows environments, is the basis for the dataset's features. The format comprises headers, sections, and the data needed for execution, all of which can be inspected without running the file, which is what static malware detection requires. Several earlier works have explored feature extraction from PE files; EMBER's contribution is a large, openly available dataset designed to accommodate diverse research use cases.

Figure 1: The 32-bit PE file structure. Creative commons image courtesy.

Dataset Structure and Methodology

The EMBER dataset is structured to facilitate a multitude of research approaches, from traditional machine learning model comparisons to the exploration of adversarial learning techniques. It consists of JSON files that detail each sample's features, including raw parsed data and format-agnostic attributes like byte histograms. These raw features are translated into model features using methods like feature hashing, which converts complex string representations into manageable numeric vectors for model training.
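To make the two ideas in this paragraph concrete, the following is a minimal sketch using only the Python standard library. It is not the code released with EMBER: the function names, the bin count of 16, the MD5-based hash, and the example import strings are all assumptions chosen for illustration. It shows a byte histogram, which needs no PE parsing, and the hashing trick, which maps a variable-length list of strings to a fixed-length numeric vector.

```python
import hashlib


def byte_histogram(data: bytes) -> list[float]:
    """Normalized count of each byte value (0-255); format-agnostic."""
    counts = [0] * 256
    for b in data:
        counts[b] += 1
    total = len(data) or 1  # avoid division by zero on empty input
    return [c / total for c in counts]


def hash_strings(tokens: list[str], n_bins: int = 16) -> list[float]:
    """Hashing trick: fold any number of strings into n_bins numeric slots."""
    vec = [0.0] * n_bins
    for tok in tokens:
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "little")
        index = h % n_bins                       # which slot the token lands in
        sign = 1.0 if (h >> 63) & 1 == 0 else -1.0  # signed update, so collisions can cancel
        vec[index] += sign
    return vec


# Hypothetical raw features for one sample (illustrative values only).
imports = [
    "kernel32.dll:CreateFileW",
    "kernel32.dll:WriteFile",
    "user32.dll:MessageBoxA",
]
hist = byte_histogram(b"MZ\x90\x00" * 4)
hashed = hash_strings(imports)

# Fixed-length model input regardless of file size or import count.
model_vector = hist + hashed
print(len(model_vector))  # 272 = 256 histogram bins + 16 hash bins
```

The point of the design is that the vector length depends only on the chosen bin counts, never on the sample, so files with three imports and files with three thousand produce inputs of the same shape.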

EMBER also supports temporal studies, because each sample carries a coarse time stamp, and semi-supervised learning, because the training set includes unlabeled samples. The time stamps make it possible to study concept drift, for example by training on older samples and testing on newer ones, while the unlabeled samples open the door to research on unsupervised strategies for malware detection.

Figure 2: Distribution of malicious, benign, and unlabeled samples in the training and test sets.

Figure 3: A temporal distribution of the dataset, available from chronology data in the metadata.
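A temporal study of the kind described above can be sketched as follows. This is an illustration under stated assumptions, not the dataset's documented schema: the field names `sha256`, `appeared`, and `label`, the month-string format, the use of -1 for unlabeled samples, and the three records themselves are all hypothetical.

```python
import json

# Hypothetical records, one JSON object per line (field names are assumptions).
raw_lines = [
    '{"sha256": "aa", "appeared": "2017-03", "label": 1}',
    '{"sha256": "bb", "appeared": "2017-09", "label": 0}',
    '{"sha256": "cc", "appeared": "2017-11", "label": 1}',
    '{"sha256": "dd", "appeared": "2017-05", "label": -1}',
]


def temporal_split(lines, cutoff):
    """Train on labeled samples seen before `cutoff`, test on later ones."""
    train, test, unlabeled = [], [], []
    for line in lines:
        rec = json.loads(line)
        if rec["label"] == -1:
            unlabeled.append(rec)       # kept aside for semi-supervised use
        elif rec["appeared"] < cutoff:  # "YYYY-MM" strings sort chronologically
            train.append(rec)
        else:
            test.append(rec)
    return train, test, unlabeled


train, test, unlabeled = temporal_split(raw_lines, cutoff="2017-10")
print(len(train), len(test), len(unlabeled))  # 2 1 1
```

Splitting by time rather than at random keeps newer samples out of training, so the measured error reflects how a model would degrade on files that appear after it was built.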

Experimental Results

The paper demonstrates the dataset's utility with a baseline model: gradient-boosted decision trees (GBDT) trained with LightGBM using default settings. Even without hyper-parameter optimization, this baseline achieves a higher ROC AUC than MalConv, a recently published end-to-end deep learning model that operates on raw bytes rather than engineered features. The result suggests that features built on domain knowledge of the PE format can outperform a model that must learn its representation from raw bytes alone.

The comparison is presented as ROC curves and illustrates how the dataset can serve as a benchmark for evaluating different machine learning models and architectures.

Figure 4: ROC curve with log scale for false positive rate (FPR).

Figure 5: Distribution of model test scores on the test set (note the logarithmic scale).
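The evaluation described in this section can be reproduced in miniature with the pure-Python sketch below. The labels and scores are made-up toy values, not results from the paper, and the helper names are hypothetical. The sketch computes ROC points, the area under the curve, and the detection rate at a fixed false positive rate, which is the quantity a log-scaled FPR axis is meant to make visible.

```python
def roc_points(labels, scores):
    """(fpr, tpr) pairs from sweeping the decision threshold high to low."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every sample tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Trapezoidal area under the ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def tpr_at_fpr(points, max_fpr):
    """Best detection rate achievable without exceeding max_fpr."""
    return max(tpr for fpr, tpr in points if fpr <= max_fpr)


# Toy data: 1 = malicious, 0 = benign; scores are hypothetical model outputs.
labels = [1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.95, 0.9, 0.85, 0.8, 0.4, 0.3, 0.2, 0.1]

pts = roc_points(labels, scores)
print(auc(pts))               # 0.75
print(tpr_at_fpr(pts, 0.0))   # 0.5
print(tpr_at_fpr(pts, 0.25))  # 0.75
```

Two models with nearly identical AUC can differ noticeably in detection rate at very low false positive rates, which is why reporting the latter alongside AUC is useful when comparing malware detectors.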

Conclusion

EMBER gives malware detection research a large, open, labeled dataset of the kind that was previously hard to obtain publicly. Its structure supports work on static analysis, interpretability, and adversarial robustness, and the accompanying feature-extraction code and baseline model give future studies a common starting point for comparison. The authors hope it will help invigorate machine learning research for malware detection in much the same way that benchmark datasets have advanced computer vision research.