
Structured Generative Models of Natural Source Code

Published 2 Jan 2014 in cs.PL, cs.LG, and stat.ML | (1401.0514v2)

Abstract: We study the problem of building generative models of natural source code (NSC); that is, source code written and understood by humans. Our primary contribution is to describe a family of generative models for NSC that have three key properties: First, they incorporate both sequential and hierarchical structure. Second, we learn a distributed representation of source code elements. Finally, they integrate closely with a compiler, which allows leveraging compiler logic and abstractions when building structure into the model. We also develop an extension that includes more complex structure, refining how the model generates identifier tokens based on what variables are currently in scope. Our models can be learned efficiently, and we show empirically that including appropriate structure greatly improves the models, measured by the probability of generating test programs.

Citations (169)

Summary

The paper "Structured Generative Models of Natural Source Code" by Chris J. Maddison and Daniel Tarlow introduces methodologies for understanding and generating Natural Source Code (NSC) with machine learning. The study combines ideas from probabilistic context-free grammars (PCFGs) and neural probabilistic language models to model source code, integrating compiler-derived reasoning to improve generative capacity.

Approach and Model Development

The authors propose Log-bilinear Tree-Traversal models (LTTs), which generate abstract syntax trees (ASTs) via a depth-first traversal, matching the inherently sequential and hierarchical structure of NSC. Traversal variables that are updated as the traversal proceeds allow LTTs to capture contextual dependencies beyond the reach of traditional PCFGs.
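
The core idea can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: the symbol vocabulary, embedding dimension, and parameter values are all hypothetical, and the context vector is formed by a plain sum of context representations where the paper learns a weighted combination.

```python
import math
import random

random.seed(0)

SYMBOLS = ["Block", "If", "Expr", "Ident", "END"]
DIM = 8

# Hypothetical toy parameters: a context representation (r), a prediction
# representation (q), and a bias for every symbol.
r = {s: [random.gauss(0, 0.1) for _ in range(DIM)] for s in SYMBOLS}
q = {s: [random.gauss(0, 0.1) for _ in range(DIM)] for s in SYMBOLS}
bias = {s: 0.0 for s in SYMBOLS}

def next_symbol_log_probs(context_symbols):
    """Log-bilinear step: score every candidate next symbol against a
    context vector built from the conditioning symbols, then normalize."""
    ctx = [0.0] * DIM
    for s in context_symbols:
        for i in range(DIM):
            ctx[i] += r[s][i]
    scores = {s: sum(c * w for c, w in zip(ctx, q[s])) + bias[s]
              for s in SYMBOLS}
    log_z = math.log(sum(math.exp(v) for v in scores.values()))
    return {s: v - log_z for s, v in scores.items()}

def tree_log_prob(node, parent="Block"):
    """Depth-first generation: score each node given its parent, recurse
    into the children in order, and close each child list with END."""
    lp = next_symbol_log_probs([parent])[node["sym"]]
    for child in node.get("children", []):
        lp += tree_log_prob(child, parent=node["sym"])
    lp += next_symbol_log_probs([node["sym"]])["END"]
    return lp

ast = {"sym": "If", "children": [{"sym": "Expr"}, {"sym": "Block"}]}
print(tree_log_prob(ast))  # total log-probability of the toy AST (negative)
```

Because every decision in the traversal is a normalized distribution, the log-probability of a whole program is simply the sum of the per-step log-probabilities, which is what makes training and evaluation of such models efficient.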

One refinement within LTTs is the addition of deterministic traversal variables that track variable scoping, capturing context-specific information such as which identifiers are currently in scope and when they were declared. This lets the model condition identifier generation on compiler-derived structural constraints, combining learned distributed representations with the hard rules a compiler already enforces.
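
The effect of scope tracking on identifier generation can be sketched as a softmax restricted to in-scope names. The embeddings, context vector, and variable names below are hypothetical, assumed only for illustration:

```python
import math

def identifier_log_probs(in_scope, embeddings, context_vec):
    """Softmax restricted to identifiers currently in scope; names that
    exist elsewhere but are out of scope get zero probability mass."""
    scores = {v: sum(c * e for c, e in zip(context_vec, embeddings[v]))
              for v in in_scope}
    log_z = math.log(sum(math.exp(s) for s in scores.values()))
    return {v: s - log_z for v, s in scores.items()}

# Hypothetical embeddings and context; "tmp" exists but is out of scope.
emb = {"x": [0.5, 0.1], "count": [0.2, 0.9], "tmp": [0.3, 0.3]}
dist = identifier_log_probs(["x", "count"], emb, [1.0, 0.0])
print(dist)  # only "x" and "count" receive probability mass
```

Restricting the support of the distribution this way is a large win in practice: the model never wastes probability on identifiers that would not even compile at that program point.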

Empirical Analysis

Empirical evaluation shows that LTTs achieve better predictive log-likelihoods than baseline models, including traditional N-gram models and plain PCFGs. The variants enhanced with scope awareness and deterministic traversal variables show the largest gains in log probability per token, indicating that they capture dependencies and constraints in NSC that the baselines miss.
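
The evaluation metric itself is straightforward: average negative log2-probability per token (bits per token), where lower means the model is less surprised by held-out code. The per-token probabilities below are invented solely to illustrate the comparison, not figures from the paper:

```python
import math

def bits_per_token(token_log_probs):
    """Average negative log2-probability per token over held-out code;
    lower is better."""
    return -sum(token_log_probs) / (len(token_log_probs) * math.log(2))

# Hypothetical per-token probabilities for a 10-token snippet.
ngram = [math.log(0.05)] * 10   # assumed baseline n-gram model
ltt = [math.log(0.20)] * 10     # assumed scope-aware LTT
print(bits_per_token(ngram), bits_per_token(ltt))
```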

The NSC corpus, drawn from TopCoder.com, is split into training, validation, and test sets by programmer, so the reported gains reflect generalization to code written by unseen authors. LTTs with scope modeling sharply reduce uncertainty in token generation, suggesting promising applications in code completion, automatic code summarization, and syntactic error correction.

Implications and Future Directions

The development of structured generative models marks a significant step for machine learning in software engineering. Beyond improved tooling such as smarter autocompletion, these models point toward tasks like automatic programming-language translation and program manipulation. By treating the naturalness of code as something to be learned, they lay a foundation for systematic NSC understanding.

Future work could investigate expanding scope modeling to include broader variable contexts, such as method calls and external libraries, or integrating semantic understanding to account for functional outcomes within NSC. The integration of modular scope and type inference further promises improvements in model accuracy and application versatility.

In conclusion, the structured generative approaches presented here offer a promising path for machine learning over programs, providing a statistically grounded treatment of source code that respects both its syntactic and semantic structure, and they stand to make substantial contributions to AI-driven development tools.
