Is Function Similarity Over-Engineered? Building a Benchmark

Published 30 Oct 2024 in cs.LG and cs.CR | (2410.22677v1)

Abstract: Binary analysis is a core component of many critical security tasks, including reverse engineering, malware analysis, and vulnerability detection. Manual analysis is often time-consuming, but identifying commonly used or previously seen functions can reduce the time it takes to understand a new file. However, given the complexity of assembly, and the NP-hard nature of determining function equivalence, this task is extremely difficult. Common approaches often use sophisticated disassembly and decompilation tools, graph analysis, and other expensive pre-processing steps to perform function similarity searches over some corpus. In this work, we identify a number of discrepancies between the current research environment and the underlying application need. To remedy this, we build a new benchmark, REFuSE-Bench, for binary function similarity detection consisting of high-quality datasets and tests that better reflect real-world use cases. In doing so, we address issues like data duplication and accurate labeling, experiment with real malware, and perform the first serious evaluation of ML binary function similarity models on Windows data. Our benchmark reveals that a new, simple baseline, one which looks at only the raw bytes of a function and requires no disassembly or other pre-processing, is able to achieve state-of-the-art performance in multiple settings. Our findings challenge conventional assumptions that complex models with highly engineered features are being used to their full potential, and demonstrate that simpler approaches can provide significant value.
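
To make the idea of a similarity search over raw function bytes concrete, here is a minimal sketch. It is not the paper's baseline, which the abstract describes only as a model that looks at raw bytes. It swaps in a hand-written stand-in: hashed byte bigram counts compared by cosine similarity. The function names, the vector size, and the example byte strings are all assumptions made for illustration.

```python
import math
from collections import Counter


def byte_ngram_vector(func_bytes: bytes, n: int = 2, dims: int = 1024) -> list[float]:
    """Hash byte n-grams of a function into a fixed-size, L2-normalized vector.

    Illustrative stand-in for a learned embedding; no disassembly is used.
    """
    counts = Counter()
    for i in range(len(func_bytes) - n + 1):
        gram = func_bytes[i:i + n]
        counts[int.from_bytes(gram, "big") % dims] += 1
    vec = [0.0] * dims
    for idx, count in counts.items():
        vec[idx] = float(count)
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def cosine(a: list[float], b: list[float]) -> float:
    """Dot product of two already-normalized vectors."""
    return sum(x * y for x, y in zip(a, b))


def search(query: bytes, corpus: dict[str, bytes], k: int = 1) -> list[tuple[float, str]]:
    """Return the k corpus functions most similar to the query, best first."""
    q = byte_ngram_vector(query)
    scored = [(cosine(q, byte_ngram_vector(body)), name) for name, body in corpus.items()]
    return sorted(scored, reverse=True)[:k]


# Hypothetical corpus: two short x86-64 byte sequences made up for this example.
corpus = {
    "f_a": bytes.fromhex("554889e54883ec10c9c3"),
    "f_b": bytes.fromhex("31c0c3"),
}
# Query differs from f_a by a single immediate byte (0x10 -> 0x20).
query = bytes.fromhex("554889e54883ec20c9c3")
best_score, best_name = search(query, corpus)[0]
print(best_name)
```

A real system would replace `byte_ngram_vector` with a trained model and the linear scan in `search` with an approximate nearest-neighbor index. The sketch only shows the shape of the pipeline: bytes in, vector out, nearest neighbor returned.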