Malware Genome Project Overview

Updated 30 December 2025

The Malware Genome Project is a comprehensive Android malware dataset offering detailed sample characteristics and behavioral profiling for systematic academic research.
It employs dynamic analysis and PCA-driven feature extraction to identify key runtime symptoms across distinct malware families.
The initiative supports a two-tier detection architecture that combines continuous symptom monitoring with specialized classifiers for targeted malware detection.

The Malware Genome Project is a reference dataset and research initiative central to empirical mobile malware studies, particularly in the Android ecosystem. Its principal aim is to aggregate, characterize, and disseminate real-world Android malware samples for systematic academic investigation, enabling reproducible analysis of behaviors, detection strategies, and forensic techniques. As a foundation for both static and dynamic analyses, it has informed numerous detection, classification, and behavioral modeling approaches since its initial release.

1. Dataset Structure and Acquisition

The Malware Genome Project corpus consists of malware samples collected predominantly from third-party Android app markets and honeypots. Manual reverse-engineering and heuristics informed sample labeling, enabling family-level categorization and attachment of metadata. The project released periodic dataset snapshots beginning in 2011, with each APK file accompanied by family identification and auxiliary properties as available.

Milosevic et al. (2015) analyzed a subset of 318 APK files spanning six malware families: GoldDream, Geinimi, BaseBridge, FakePlayer, jSMShider, and Pjapps. Each family has a distinct behavior profile, ranging from SMS theft (GoldDream, FakePlayer, jSMShider) and data exfiltration (Geinimi) to remote device control (BaseBridge, Pjapps).

2. Experimental Environment and Sample Execution Workflow

Samples are characterized in an emulated Android 4.0 (API level 14) environment created with the official Android SDK. Each APK is installed and executed, and system-level metrics are captured at runtime. No additional static preprocessing, such as code de-obfuscation or permission pruning, is performed at this stage (Milosevic et al., 2015).
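As a rough illustration of this workflow, the sketch below only builds the adb command lines for one sample run; it does not execute them. The source does not say how each app was launched, so starting it with `monkey` is an assumption, as are the emulator serial, the APK path, and the package name.

```python
from typing import List


def build_run_commands(apk_path: str, package: str,
                       serial: str = "emulator-5554") -> List[List[str]]:
    """Return adb command lines to install, launch, sample, and remove one APK.

    The serial, the launch method, and the final uninstall step are
    illustrative assumptions, not details taken from the cited work.
    """
    adb = ["adb", "-s", serial]
    return [
        adb + ["install", apk_path],
        # Assumed launch method: monkey with an event count of 1, restricted
        # to the package's launcher category, starts its main activity.
        adb + ["shell", "monkey", "-p", package,
               "-c", "android.intent.category.LAUNCHER", "1"],
        # Memory snapshot for the running package.
        adb + ["shell", "dumpsys", "meminfo", package],
        # Clean up so the next sample runs in isolation.
        adb + ["uninstall", package],
    ]


if __name__ == "__main__":
    # Hypothetical file and package names.
    for cmd in build_run_commands("sample.apk", "com.example.sample"):
        print(" ".join(cmd))
```

Returning argument lists rather than shell strings means they can be passed directly to `subprocess.run` without quoting concerns.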

Feature collection comprises:

CPU metrics: Extracted from Linux counters (e.g., /proc/stat), including total, user, and kernel CPU times as well as minor/major page faults.
Memory metrics: Collected via adb shell dumpsys meminfo <package_name>, recording proportional set size (Pss) for cursor/native memory segments, private/shared dirty pages, and Dalvik heap allocation and free statistics.
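The CPU-side counters listed above could be read along the following lines. This is a minimal sketch, not the authors' collection code: it assumes the aggregate `cpu` line of /proc/stat for CPU times and the per-process /proc/<pid>/stat file for page faults, and the sample strings are made up. Parsing `dumpsys meminfo` is left out because its text layout differs between Android versions.

```python
from typing import Dict


def parse_cpu_line(proc_stat_text: str) -> Dict[str, int]:
    """Parse the aggregate 'cpu' line of /proc/stat (values in clock ticks)."""
    for line in proc_stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            user, kernel = int(fields[1]), int(fields[3])
            # Sum user..steal; guest columns are skipped because the kernel
            # already counts guest time within user time.
            total = sum(int(v) for v in fields[1:9])
            return {"user": user, "kernel": kernel, "total": total}
    raise ValueError("no aggregate cpu line found")


def parse_page_faults(pid_stat_text: str) -> Dict[str, int]:
    """Extract minor/major page-fault counts from /proc/<pid>/stat."""
    # Field 2 (comm) is parenthesised and may contain spaces, so split
    # only the part after the last ')'.
    rest = pid_stat_text[pid_stat_text.rindex(")") + 1:].split()
    # rest[0] is field 3 (state); minflt is field 10, majflt is field 12.
    return {"minor_faults": int(rest[7]), "major_faults": int(rest[9])}


if __name__ == "__main__":
    # Made-up sample contents, for illustration only.
    stat = "cpu  4705 150 1120 16250 520 30 45 0 0 0\ncpu0 4705 150 1120 16250 520 30 45 0 0 0"
    pid_stat = "1234 (com.example app) S 1 1234 0 0 -1 4194560 2500 0 17 0 80 40 0 0 20 0 10 0 100 0 0"
    print(parse_cpu_line(stat))
    print(parse_page_faults(pid_stat))
```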

3. Methodology for Feature Extraction and Symptom Ranking

The analytical objective is to identify "symptoms"—runtime OS and process metrics—that are highly characteristic of specific malware families. To this end, Milosevic et al. employ Principal Component Analysis (PCA), using the Weka platform implementation, to transform raw metric vectors $x \in \mathbb{R}^n$ for each execution.

The PCA procedure comprises:

Computing the empirical covariance matrix $\Sigma$ from collected metrics.
Decomposing $\Sigma$ into eigenvalues and eigenvectors: $\Sigma = Q \Lambda Q^T$, with $\Lambda = \operatorname{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)$.
Retaining the top $k$ principal components such that the cumulative eigenvalue sum accounts for at least $95\%$ of total variance:

$\frac{\sum_{i=1}^k \lambda_i}{\sum_{i=1}^n \lambda_i} \geq 0.95$

Ranking original features according to their loading in the principal components.

Table I in (Milosevic et al., 2015) lists, for each malware family, the five most indicative features derived from this PCA ranking. For example, GoldDream is primarily distinguished by Page Major Faults and Cursor Pss.
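The PCA steps in this section can be sketched in plain Python as follows. This is an illustration, not the Weka implementation used in the cited work. The Jacobi eigen-solver is swapped in only to keep the example free of dependencies. The scoring rule (absolute loading weighted by eigenvalue over the retained components) is one plausible reading of "ranking by loading" and is an assumption. The feature names and data are invented.

```python
import math
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def covariance(rows: Sequence[Sequence[float]]) -> Matrix:
    """Empirical covariance matrix (runs in rows, features in columns)."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[sum((r[i] - mean[i]) * (r[j] - mean[j]) for r in rows) / (n - 1)
             for j in range(d)] for i in range(d)]


def jacobi_eigen(sym: Matrix, max_sweeps: int = 50) -> Tuple[List[float], Matrix]:
    """Eigen-decompose a symmetric matrix with cyclic Jacobi rotations.

    Returns (eigenvalues, Q); column i of Q is the eigenvector for
    eigenvalue i, so that sym = Q diag(eigenvalues) Q^T.
    """
    d = len(sym)
    a = [list(row) for row in sym]
    q = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    scale = sum(v * v for row in a for v in row) or 1.0
    for _ in range(max_sweeps):
        off = sum(a[i][j] ** 2 for i in range(d) for j in range(d) if i != j)
        if off <= 1e-24 * scale:
            break
        for p in range(d - 1):
            for r in range(p + 1, d):
                if a[p][r] == 0.0:
                    continue
                diff = a[r][r] - a[p][p]
                theta = math.pi / 4 if diff == 0.0 else 0.5 * math.atan(2 * a[p][r] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(d):  # A <- A G (mix columns p and r)
                    x, y = a[k][p], a[k][r]
                    a[k][p], a[k][r] = c * x - s * y, s * x + c * y
                for k in range(d):  # A <- G^T A (mix rows p and r)
                    x, y = a[p][k], a[r][k]
                    a[p][k], a[r][k] = c * x - s * y, s * x + c * y
                for k in range(d):  # Q <- Q G (accumulate eigenvectors)
                    x, y = q[k][p], q[k][r]
                    q[k][p], q[k][r] = c * x - s * y, s * x + c * y
    return [a[i][i] for i in range(d)], q


def rank_features(rows: Sequence[Sequence[float]], names: Sequence[str],
                  variance_kept: float = 0.95) -> Tuple[List[str], int]:
    """Rank features by loading on the components that keep >= 95% variance."""
    eigvals, q = jacobi_eigen(covariance(rows))
    order = sorted(range(len(eigvals)), key=lambda i: eigvals[i], reverse=True)
    total, cumulative, kept = sum(eigvals), 0.0, []
    for i in order:
        kept.append(i)
        cumulative += eigvals[i]
        if cumulative / total >= variance_kept:
            break
    # Assumed scoring rule: eigenvalue-weighted absolute loading.
    score = {name: sum(eigvals[i] * abs(q[j][i]) for i in kept)
             for j, name in enumerate(names)}
    return sorted(names, key=lambda nm: score[nm], reverse=True), len(kept)


if __name__ == "__main__":
    # Invented metric vectors: one row per execution.
    runs = [[1.0, 10.0, 0.5], [2.0, 30.0, 0.4], [3.0, 20.0, 0.6],
            [4.0, 50.0, 0.5], [5.0, 40.0, 0.5]]
    print(rank_features(runs, ["user_cpu", "major_faults", "cursor_pss"]))
```

Because the sketch works on the raw covariance matrix, as the steps above describe, features with large numeric ranges dominate the ranking; standardizing the metrics first would be a different design choice.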

4. Detection Architecture: General Practitioner and Specialist Detectors

Milosevic et al. propose a two-tiered detection architecture analogous to medical triage. The "general practitioner" (GP) module is a lightweight system monitor that continuously tracks the aforementioned symptoms on-device. Upon identification of family-specific symptom profiles, the GP activates specialized analysis by invoking "specialist" detectors—classifiers trained specifically for one malware family.

Specialist detectors (potentially SVMs or random forests) are described only conceptually: the poster (Milosevic et al., 2015) neither implements nor empirically evaluates them. Training specialists on feature subspaces tailored to each family's signature is left to subsequent work.
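The dispatch logic of the two tiers might look like the sketch below. Since the cited work does not implement this architecture, everything here is hypothetical: the class and method names, the threshold-based symptom matching, the numeric thresholds, and the stub specialist that stands in for a trained classifier.

```python
from typing import Callable, Dict, List

Metrics = Dict[str, float]
Specialist = Callable[[Metrics], bool]


class GeneralPractitioner:
    """Lightweight monitor that hands suspicious metric snapshots to specialists."""

    def __init__(self) -> None:
        self._profiles: Dict[str, Dict[str, float]] = {}
        self._specialists: Dict[str, Specialist] = {}

    def register(self, family: str, thresholds: Dict[str, float],
                 specialist: Specialist) -> None:
        """Attach a symptom profile and a specialist detector to a family."""
        self._profiles[family] = thresholds
        self._specialists[family] = specialist

    def suspected_families(self, metrics: Metrics) -> List[str]:
        """Families whose every symptom threshold is met (assumed rule)."""
        return [family for family, thresholds in self._profiles.items()
                if all(metrics.get(name, 0.0) >= limit
                       for name, limit in thresholds.items())]

    def examine(self, metrics: Metrics) -> Dict[str, bool]:
        """Run only the specialists whose family profile was triggered."""
        return {family: self._specialists[family](metrics)
                for family in self.suspected_families(metrics)}


def golddream_specialist_stub(metrics: Metrics) -> bool:
    # Placeholder rule standing in for a trained family-specific classifier.
    return metrics.get("page_major_faults", 0.0) > 100.0


if __name__ == "__main__":
    gp = GeneralPractitioner()
    # Symptom names follow the GoldDream example; the thresholds are invented.
    gp.register("GoldDream",
                {"page_major_faults": 50.0, "cursor_pss_kb": 20.0},
                golddream_specialist_stub)
    print(gp.examine({"page_major_faults": 120.0, "cursor_pss_kb": 35.0}))
    print(gp.examine({"page_major_faults": 5.0, "cursor_pss_kb": 35.0}))
```

Keeping the general practitioner to cheap threshold checks reflects the stated intent that it runs continuously on-device, while the costlier specialists run only when triggered.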

5. Evaluation Protocol and Reported Findings

The principal aim in (Milosevic et al., 2015) is symptom identification rather than classifier performance assessment. No accuracy, precision, recall, F₁-score, ROC curve, or confusion matrix data are presented, and no explicit evaluation formulas, such as $\text{Precision} = \frac{TP}{TP + FP}$, are included.
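For reference, the standard definitions behind the metrics named above can be written as a short helper. The counts in the usage lines are invented purely to exercise the function; they are not results from the cited work, which reports none.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Standard precision, recall, and F1 from confusion-matrix counts.

    Returns 0.0 for any metric whose denominator is zero.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


if __name__ == "__main__":
    # Invented counts for illustration only.
    p, r, f = precision_recall_f1(tp=8, fp=2, fn=4)
    print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```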

The only quantitative outcome reported is the set of per-family feature rankings produced by eigenvalue-based PCA. Each of the 318 samples is executed in isolation, and the most relevant runtime features are reported for each family.

6. Contextual Significance and Limitations

The Malware Genome Project is one of the largest publicly available Android malware corpora, facilitating both static (reverse engineering, bytecode analysis) and dynamic (behavioral, runtime symptom-based) studies. Its design enables comparative research and fosters reproducibility in malware detection system development.

Limitations include:

Temporal bias: The dataset reflects samples primarily from 2010–2012, limiting representativeness of recent malware families and attack trends.
Labeling heuristics: Family assignment via reverse engineering may not capture zero-day or morphologically evolved malware variants.
Class imbalance: Uneven family sample counts can induce imbalance in classifier training and evaluation, a challenge repeatedly noted in works referencing this corpus.

A plausible implication is that subsequent detection research should focus on extending the family coverage, refining labeling granularity, and mitigating the effects of sampling bias and imbalance.

7. Future Research Directions

Explicitly training and validating specialist detectors on the regions of feature space isolated by GP symptom monitoring remains an open task. Incorporating additional symptom modalities (battery usage, permission analysis, network behavior) is also proposed as future work. Long-term evolution tracking and dynamic update mechanisms for the dataset are needed to keep it effective against emergent mobile malware threats.

References

Milosevic et al. (2015). A general practitioner or a specialist for your infected smartphone?
