Is Function Similarity Over-Engineered? Building a Benchmark
Abstract: Binary analysis is a core component of many critical security tasks, including reverse engineering, malware analysis, and vulnerability detection. Manual analysis is often time-consuming, but identifying commonly used or previously seen functions can reduce the time it takes to understand a new file. However, given the complexity of assembly and the NP-hard nature of determining function equivalence, this task is extremely difficult. Common approaches often use sophisticated disassembly and decompilation tools, graph analysis, and other expensive pre-processing steps to perform function similarity searches over some corpus. In this work, we identify a number of discrepancies between the current research environment and the underlying application need. To remedy this, we build a new benchmark, REFuSE-Bench, for binary function similarity detection, consisting of high-quality datasets and tests that better reflect real-world use cases. In doing so, we address issues like data duplication and accurate labeling, experiment with real malware, and perform the first serious evaluation of ML binary function similarity models on Windows data. Our benchmark reveals that a new, simple baseline, one that looks at only the raw bytes of a function and requires no disassembly or other pre-processing, is able to achieve state-of-the-art performance in multiple settings. Our findings challenge the conventional assumption that complex models with highly engineered features are being used to their full potential, and demonstrate that simpler approaches can provide significant value.
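To make the idea of a function similarity search over raw bytes concrete, the following is a minimal sketch. It is not the baseline model evaluated in this work: it swaps in a plain byte-bigram histogram with cosine similarity as a stand-in, and the function names, corpus entries, and byte strings are all hypothetical, chosen only for illustration.

```python
import math
from collections import Counter


def byte_bigram_vector(func_bytes: bytes) -> Counter:
    """Count overlapping byte bigrams in a function's raw bytes (no disassembly)."""
    return Counter(zip(func_bytes, func_bytes[1:]))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[key] for key, count in a.items() if key in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_corpus(query: bytes, corpus: dict[str, bytes]) -> list[tuple[str, float]]:
    """Return corpus entries sorted from most to least similar to the query."""
    q = byte_bigram_vector(query)
    scored = [(name, cosine(q, byte_bigram_vector(body))) for name, body in corpus.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical corpus: made-up x86-64 byte sequences, not taken from any dataset.
corpus = {
    "func_a": bytes.fromhex("554889e5897dfc8b45fc83c0015dc3"),
    "func_b": bytes.fromhex("31c0c3"),
}

# Hypothetical query: func_a with a single immediate byte changed.
query = bytes.fromhex("554889e5897dfc8b45fc83c0025dc3")
ranked = rank_corpus(query, corpus)
```

Here `ranked` lists the corpus functions by descending similarity to the query. A learned model over raw bytes would replace the hand-built bigram vector with a trained embedding, but the retrieval step (embed, compare, rank) would keep the same shape.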