
A Survey of Machine Learning for Big Code and Naturalness

Published 18 Sep 2017 in cs.SE, cs.LG, and cs.PL | (1709.06182v2)

Abstract: Research at the intersection of machine learning, programming languages, and software engineering has recently taken important steps in proposing learnable probabilistic models of source code that exploit code's abundance of patterns. In this article, we survey this work. We contrast programming languages against natural languages and discuss how these similarities and differences drive the design of probabilistic models. We present a taxonomy based on the underlying design principles of each model and use it to navigate the literature. Then, we review how researchers have adapted these models to application areas and discuss cross-cutting and application-specific challenges and opportunities.

Citations (806)

Summary

  • The paper surveys diverse machine learning models that exploit the naturalness of code to generate and analyze code structures.
  • It categorizes models into token-level, syntactic, and semantic approaches, emphasizing applications like code autocompletion, defect detection, and program synthesis.
  • The authors highlight future research challenges including bridging formal and continuous representations, refining evaluation metrics, and enhancing debugging tools.


The paper "A Survey of Machine Learning for Big Code and Naturalness" by Allamanis et al. is a comprehensive survey examining the intersection of machine learning, programming languages, and software engineering, with a focus on modeling source code using machine learning techniques. The survey explains how recent advances in machine learning can be leveraged to analyze and generate code, with the aim of improving software engineering tools by exploiting the repetitive and predictable nature of code, which the authors term the "naturalness" of code.

Overview of Probabilistic Models

The paper primarily discusses three categories of probabilistic models:

  1. Code-Generating Models: These models are designed to generate code by modeling the generation process of smaller code components, such as tokens or abstract syntax tree (AST) nodes. They further classify these into:
    • Token-level Models: Models that generate sequences of tokens using methods such as n-gram language models or recurrent neural networks (RNNs).
    • Syntactic Models: Models that generate tree structures representing the syntax of code, often using tree-based grammars or neural architectures.
    • Semantic Models: Models focusing on graph structures to capture more complex relationships within code, such as data flow graphs.
  2. Representational Models of Code: These models aim to learn abstract representations of code that can predict properties or features. They include structured prediction models, which capture interdependencies between different elements of the code, and distributed representations, often utilizing neural networks to encode semantic information.
  3. Pattern Mining Models: These unsupervised models discover recurrent patterns in code without explicit supervision, and are often used for tasks such as documentation inference and anomaly detection. Techniques include tree substitution grammars and graphical models.
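To make the token-level category above concrete, the following sketch trains a trigram language model over tokenized source code with additive smoothing. The tokenizer is omitted and the tiny corpus is a hypothetical stand-in; this illustrates the general n-gram idea, not any specific system from the survey.

```python
from collections import Counter

def train_trigram_model(token_streams):
    """Count trigram and bigram frequencies over tokenized source files."""
    trigrams, bigrams = Counter(), Counter()
    for tokens in token_streams:
        padded = ["<s>", "<s>"] + tokens + ["</s>"]
        for i in range(len(padded) - 2):
            bigrams[tuple(padded[i:i + 2])] += 1
            trigrams[tuple(padded[i:i + 3])] += 1
    return trigrams, bigrams

def trigram_prob(trigrams, bigrams, context, token, vocab_size, alpha=1.0):
    """Additive-smoothed estimate of P(token | two-token context)."""
    tri = trigrams[(context[0], context[1], token)]
    bi = bigrams[(context[0], context[1])]
    return (tri + alpha) / (bi + alpha * vocab_size)

# Toy "corpus" of tokenized code snippets (hypothetical tokenization).
corpus = [
    ["for", "i", "in", "range", "(", "n", ")", ":"],
    ["for", "x", "in", "range", "(", "n", ")", ":"],
]
tri, bi = train_trigram_model(corpus)
vocab = {t for stream in corpus for t in stream} | {"<s>", "</s>"}
# "(" is far more likely after "in range" than an arbitrary token.
p_seen = trigram_prob(tri, bi, ("in", "range"), "(", len(vocab))
p_unseen = trigram_prob(tri, bi, ("in", "range"), "n", len(vocab))
```

Because code is highly repetitive, even such a simple model assigns sharply higher probability to conventional continuations, which is the statistical regularity the "naturalness" hypothesis rests on.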

Key Applications

The survey explores a wide range of applications where probabilistic models of code can be beneficial:

  • Recommender Systems: These include systems for code autocompletion and suggestion tools, significantly enhancing development productivity by predicting the next likely code snippet based on context.
  • Inferring Coding Conventions: Machine learning models can automatically infer and enforce coding conventions, aiding in code consistency and readability.
  • Code Defects: Statistical models can identify anomalies in code that may indicate defects, leveraging the tendency of improbable, "unnatural" code to correlate with erroneous behavior.
  • Code Translation and Clones: Methods for translating code between programming languages and detecting code clones to foster reuse and reduce redundancy.
  • Code-to-Text and Text-to-Code: Techniques for converting natural language into source code (and vice versa), which is invaluable for documentation, code search, and even program synthesis.
  • Program Synthesis: The generation of code snippets or entire programs from specifications or examples, often using techniques from programming by example.
  • Program Analysis: Applying machine learning to inform static analysis tools, easing the detection of potential flaws and improving overall software quality.
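The defect-detection idea above, that statistically improbable code tends to correlate with bugs, can be sketched with a language model that scores each line by its average surprisal (bits per token) and flags the most surprising lines. The unigram model and toy corpus here are illustrative assumptions, not the survey's implementation:

```python
import math
from collections import Counter

def train_unigram(token_streams):
    """Unigram token counts over a project's tokenized source files."""
    counts = Counter(t for stream in token_streams for t in stream)
    return counts, sum(counts.values())

def line_surprisal(counts, total, tokens, alpha=1.0):
    """Average negative log2-probability of a line's tokens (smoothed)."""
    vocab_size = len(counts) + 1  # reserve one slot for unseen tokens
    bits = 0.0
    for t in tokens:
        p = (counts[t] + alpha) / (total + alpha * vocab_size)
        bits += -math.log2(p)
    return bits / max(len(tokens), 1)

# Hypothetical project corpus and two candidate lines to rank.
corpus = [["if", "x", "is", "None", ":"],
          ["if", "y", "is", "None", ":"]]
counts, total = train_unigram(corpus)
common = line_surprisal(counts, total, ["if", "x", "is", "None", ":"])
odd = line_surprisal(counts, total, ["if", "x", "==", "None", ":"])
# The unidiomatic "==" comparison scores as more surprising: odd > common.
```

Ranking lines by surprisal like this gives a cheap, language-agnostic signal that can be combined with conventional static analysis to prioritize review effort.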

Future Directions and Challenges

The paper also highlights several significant challenges and future directions for research:

  • Bridging Representations: There is a need to better integrate formal programming language representations with continuous representations used in machine learning.
  • Handling Data Sparsity and Compositionality: Code data is often sparse and highly compositional, requiring models that can generalize from limited data.
  • Evaluation Metrics: Developing better metrics for evaluating machine learning models of code to ensure they are both effective and practical.
  • Debugging and Traceability: Extending these models to aid in debugging tasks and improving traceability between different software artifacts.

Conclusion

This comprehensive survey underscores the potential of machine learning to revolutionize software engineering by leveraging the naturalness of code. Through a range of probabilistic models, applications, and an analysis of future challenges, the authors provide a robust framework for understanding how machine learning can contribute to creating more reliable and maintainable software systems. The ongoing and future research in this domain holds promise for bridging the gap between traditional software engineering methodologies and modern machine learning techniques, leading to innovative tools and practices in software development.
