Enhancing Function Name Prediction using Votes-Based Name Tokenization and Multi-Task Learning
Abstract: Descriptive function names give reverse engineers valuable insight into a program, but they are absent from publicly released binaries. Recent data-driven machine learning approaches to binary function name prediction show promise. However, existing approaches struggle to capture function semantics across binaries compiled at different optimization levels, and they fail to preserve the meaning of labels in function names. We propose Epitome, a framework that enhances function name prediction using votes-based name tokenization and multi-task learning, specifically tailored to binaries compiled at different optimization levels. Epitome learns comprehensive function semantics with a pre-trained assembly LLM and a graph neural network, and it incorporates a function semantics similarity prediction task to maximize the similarity of function semantics across compilation optimization levels. In addition, we present two data preprocessing methods to improve the comprehensibility of function names. We evaluate Epitome on 2,597,346 functions extracted from binaries compiled at 5 optimization levels (O0-Os) for 4 architectures (x64, x86, ARM, and MIPS). Epitome outperforms the state-of-the-art function name prediction tool by up to 44.34%, 64.16%, and 54.44% in precision, recall, and F1 score, respectively, while also exhibiting superior generalizability.
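The abstract reports precision, recall, and F1 score but does not say how they are computed. The following is a minimal sketch of one common way to score a predicted function name against the ground truth, assuming the metrics are taken over name tokens. The helpers `split_name` and `token_prf` are hypothetical, and splitting on underscores is a simple stand-in, not Epitome's votes-based name tokenization.

```python
from collections import Counter


def split_name(name):
    """Hypothetical helper: split a snake_case name into lowercase tokens.

    This is a placeholder for illustration only; it is NOT the votes-based
    tokenization described in the abstract.
    """
    return [tok for tok in name.lower().split("_") if tok]


def token_prf(predicted, reference):
    """Return (precision, recall, F1) over two lists of name tokens.

    Assumption: a predicted token counts as correct when it also occurs in
    the reference name (multiset overlap, so duplicates are matched once).
    """
    pred_counts = Counter(predicted)
    ref_counts = Counter(reference)
    # Counter & Counter keeps the minimum count of each shared token.
    overlap = sum((pred_counts & ref_counts).values())
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(reference) if reference else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Usage: the prediction adds one extra token to a two-token reference name.
p, r, f = token_prf(split_name("read_file_buffer"), split_name("read_file"))
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

In this example the prediction covers both reference tokens but adds one that is not in the reference, so recall is perfect while precision drops to two thirds.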