Enhancing Function Name Prediction using Votes-Based Name Tokenization and Multi-Task Learning

Published 15 May 2024 in cs.SE | (2405.09112v1)

Abstract: Descriptive function names give reverse engineers valuable insight, but they are absent from publicly released binaries. Recent data-driven machine learning approaches to binary function name prediction show promise. However, existing approaches struggle to capture function semantics across binaries compiled at different optimization levels and fail to preserve the meaning of labels in function names. We propose Epitome, a framework that enhances function name prediction using votes-based name tokenization and multi-task learning, tailored to binaries compiled at different optimization levels. Epitome learns comprehensive function semantics with a pre-trained assembly language model and a graph neural network, and incorporates a function semantics similarity prediction task to maximize the similarity of function semantics across compilation optimization levels. In addition, we present two data preprocessing methods to improve the comprehensibility of function names. We evaluate Epitome on 2,597,346 functions extracted from binaries compiled at 5 optimization levels (O0-Os) for 4 architectures (x64, x86, ARM, and MIPS). Epitome outperforms the state-of-the-art function name prediction tool by up to 44.34%, 64.16%, and 54.44% in precision, recall, and F1 score, respectively, while also exhibiting superior generalizability.
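
The abstract reports precision, recall, and F1 but does not say here how they are computed. As a rough illustration only, the sketch below assumes a token-level comparison between a predicted and a reference function name, which is a common convention in function name prediction work. The helpers `split_name` and `token_prf` are hypothetical names, and the underscore-based splitting is a naive stand-in, not the votes-based tokenization the paper proposes.

```python
from collections import Counter


def split_name(name):
    """Naive tokenizer (assumption): lowercase and split on underscores.

    This is NOT the paper's votes-based tokenization; it only serves
    to produce token lists for the metric sketch below.
    """
    return [tok for tok in name.lower().split("_") if tok]


def token_prf(predicted, reference):
    """Token-level precision, recall, and F1 for a single function name.

    Both arguments are lists of tokens. Overlap is counted as a
    multiset intersection, so repeated tokens are matched at most
    as many times as they occur in both lists.
    """
    overlap = sum((Counter(predicted) & Counter(reference)).values())
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical example names, not taken from the paper's dataset.
    p, r, f = token_prf(split_name("read_file_buffer"), split_name("read_file"))
    print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Under this assumed definition, a prediction that contains every reference token plus one extra token scores full recall but reduced precision, which is why the three figures in the abstract can differ from one another.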

