JIT Defect Localization
- Just-In-Time Defect Localization is a method for pinpointing faulty code regions—such as functions, hunks, or lines—immediately after commits using historical change data.
- It employs advanced techniques including graph-based transformers, commit-token models, and spectrum-based analysis to rank defect risks with high precision.
- This approach enhances software quality by enabling real-time, actionable insights that reduce developer investigation effort, as demonstrated by improved F1, AUC, and Top-n accuracy metrics.
Just-In-Time Defect Localization (JIT-DL) is a technically rigorous approach for identifying and ranking defect-inducing code regions—such as functions, hunks, or lines—in software systems immediately as commits are made or optimization passes complete. JIT-DL extends the predictive focus of Just-In-Time Defect Prediction (JIT-DP) from commit-level risk to the problem of localizing the exact faulty code portions, enabling more granular, actionable software quality interventions. Recent JIT-DL methods comprise a spectrum of architectures, feature representations, and evaluation metrics tailored for high precision and efficiency in large codebases and dynamic compiler environments. Major recent contributions include the application of graph-based learning for function-level localization, commit-token models for line-level ranking, and directed program generation for spectrum-based fault localization within JIT compilers.
1. Formal Problem Definition and Objectives
JIT-DL targets the fine-grained identification of defect-inducing code, typically at the function, hunk, or line level, immediately after code changes are introduced. The localization is performed by learning discriminative models and/or statistical techniques using historical code changes and corresponding defect labels.
At the function-change level, the localization task is formalized as multiclass classification: given inputs $f_{\text{old}}$ (the function before the change) and $f_{\text{new}}$ (the function after the change), predict a label $y \in \{0, 1, \dots, K\}$, where $0$ represents clean changes and $1, \dots, K$ enumerate defect categories (Ni et al., 2022). The objective is to minimize the cross-entropy loss $\mathcal{L} = -\sum_{i} \log p\big(y_i \mid f_{\text{old}}^{(i)}, f_{\text{new}}^{(i)}\big)$ over labeled historical changes.
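This objective can be sketched in a few lines of stdlib Python; the class count, probability vectors, and labels below are illustrative, not values from the paper:

```python
import math

def cross_entropy(probs, label):
    """Negative log-likelihood of the true class under predicted probabilities."""
    return -math.log(probs[label])

def batch_loss(predictions, labels):
    """Mean cross-entropy over a batch of (probability vector, class index) pairs."""
    return sum(cross_entropy(p, y) for p, y in zip(predictions, labels)) / len(labels)

# Class 0 = clean change; classes 1..K = defect categories.
preds = [[0.7, 0.2, 0.1],   # model is confident the change is clean
         [0.1, 0.8, 0.1]]   # model is confident it is defect category 1
labels = [0, 1]
loss = batch_loss(preds, labels)
```

In practice the probabilities come from the softmax head of the localization model, and the loss is minimized by gradient descent rather than evaluated in isolation.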
At the JIT compiler level, the goal is to identify “suspicious” program entities (e.g., IR nodes, functions) likely responsible for a code-generation bug using analysis of test inputs (passing and failing variants) and spectrum-based fault localization. The optimization is to minimize overlap among failing test entities and maximize union among passing entities (Lim et al., 2023):
- Generate failing programs with minimal intersection of entity sets.
- Generate passing programs with maximal union of entity sets under high similarity to the seed program.
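The failing-set objective above can be approximated greedily: repeatedly pick the failing variant whose covered-entity set overlaps least (by Jaccard similarity) with those already chosen, shrinking the intersection that must contain the faulty entity. A minimal sketch, with hypothetical entity sets standing in for IR coverage:

```python
def jaccard(a, b):
    """Jaccard similarity of two entity sets."""
    return len(a & b) / len(a | b) if a | b else 1.0

def pick_diverse_failing(failing, k):
    """Greedily select k failing variants whose covered-entity sets overlap least."""
    chosen = [max(failing, key=len)]          # start from the largest coverage set
    while len(chosen) < k and len(chosen) < len(failing):
        rest = [s for s in failing if s not in chosen]
        # Take the candidate least similar to everything already chosen.
        chosen.append(min(rest, key=lambda s: max(jaccard(s, c) for c in chosen)))
    return chosen

# Entity sets covered by hypothetical failing variants (IR node ids).
failing = [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 6, 7}]
picked = pick_diverse_failing(failing, 2)
```

The symmetric passing-set objective (maximal union under similarity to the seed) follows the same greedy pattern with the selection criterion inverted.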
2. Methodologies and Feature Representations
JIT-DL employs a diversity of input representations and feature extraction methods, driven by the granularity of localization and the underlying model architecture.
Function-based localization. CompDefect builds on GraphCodeBERT, capturing semantic tokens, variable lists, and explicit data-flow graphs for both pre- and post-change versions. Each version is linearized as a sequence of code tokens followed by its variable list; the two sequences are concatenated and embedded for transformer-based modeling with graph-guided masked attention (Ni et al., 2022).
Commit and line-level ranking. JITLine uses a “bag-of-tokens” approach (alphanumeric tokens from changed lines, literals replaced by placeholders) together with generic commit metrics. Line-level localization derives from LIME feature importance weights: each changed line $l$ is scored as $\text{score}(l) = \sum_{t \in l} w_t$, the sum of the local LIME importances $w_t$ of its tokens (Pornprasit et al., 2021).
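The aggregation step is straightforward once per-token importances are available; a stdlib sketch, with hypothetical token weights standing in for a real LIME explanation:

```python
def line_scores(changed_lines, token_importance):
    """Score each changed line as the sum of LIME importances of its tokens
    (tokens absent from the explanation contribute 0)."""
    scores = {}
    for line_no, tokens in changed_lines.items():
        scores[line_no] = sum(token_importance.get(t, 0.0) for t in tokens)
    return scores

# Hypothetical LIME token weights for one commit (not from the paper).
weights = {"strcpy": 0.9, "buf": 0.4, "len": 0.1, "return": -0.2}
lines = {10: ["strcpy", "buf"], 11: ["return", "len"]}
ranked = sorted(line_scores(lines, weights).items(), key=lambda kv: -kv[1])
```

Sorting the scores descending yields the line-level ranking that JITLine presents to reviewers.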
JIT compiler fault localization. DPGen4JIT constructs program variants via AST-driven mutations, selects passing/failing sets based on structural similarity/difference, and uses spectrum-based measures (e.g., Ochiai formula) on execution coverage to rank suspicious entities (Lim et al., 2023).
3. Model Architectures and Algorithms
Graph-based neural models. CompDefect processes function changes via transformers initialized with GraphCodeBERT, outputs summary embeddings for the pre- and post-change versions, and uses a neural tensor network to explicitly encode their differences, followed by softmax classification (Ni et al., 2022).
RandomForest with interpretable local explanations. JITLine applies a RandomForest to commit-level vectors, integrated with class-imbalance handling (SMOTE). Line-level defect localization is achieved via LIME, yielding interpretable line rankings and effort-aware prioritization (Pornprasit et al., 2021).
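The SMOTE step can be illustrated with a simplified stdlib sketch that synthesizes minority-class points by linear interpolation. Note that real SMOTE interpolates between a sample and one of its k nearest neighbors; this sketch picks random minority pairs, and the feature vectors are illustrative:

```python
import random

def smote_like(minority, n_new, rng=random.Random(0)):
    """Create n_new synthetic minority samples by interpolating between
    two randomly chosen minority samples (simplified: no k-NN step)."""
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)
        lam = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([ai + lam * (bi - ai) for ai, bi in zip(a, b)])
    return synthetic

# Two-feature defect-commit vectors (hypothetical minority class).
minority = [[1.0, 0.0], [2.0, 1.0], [3.0, 0.5]]
extra = smote_like(minority, 4)
```

The synthetic points are convex combinations of existing minority samples, so they stay inside the minority region rather than duplicating exact points.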
Directed mutation and spectrum-based inference. For compiler IR, DPGen4JIT leverages systematic AST mutation and program generation, operationalizes test selection via Jaccard similarity, and applies statistical spectra to entity coverage via the Ochiai score:

$$\text{Ochiai}(e) = \frac{n_{ef}(e)}{\sqrt{n_f \cdot \big(n_{ef}(e) + n_{ep}(e)\big)}}$$

where $n_{ef}(e)$ and $n_{ep}(e)$ count the failing and passing tests covering entity $e$, and $n_f$ is the total number of failing tests.
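A minimal sketch of this scoring, with toy coverage sets (entity names and tests are hypothetical, not from the paper):

```python
import math

def ochiai(cov_fail, cov_pass, entities):
    """Ochiai suspiciousness: n_ef / sqrt(n_f * (n_ef + n_ep)) per entity,
    where n_ef/n_ep count failing/passing tests covering the entity."""
    n_f = len(cov_fail)
    scores = {}
    for e in entities:
        n_ef = sum(1 for t in cov_fail if e in t)
        n_ep = sum(1 for t in cov_pass if e in t)
        denom = math.sqrt(n_f * (n_ef + n_ep))
        scores[e] = n_ef / denom if denom else 0.0
    return scores

# Coverage of hypothetical IR entities by two failing and two passing tests.
fails = [{"A", "B"}, {"A", "C"}]
passes = [{"B", "C"}, {"C"}]
s = ochiai(fails, passes, {"A", "B", "C"})
ranking = sorted(s, key=s.get, reverse=True)
```

Entity "A" is covered by every failing test and no passing test, so it ranks first; this is exactly the intuition the directed program generation is designed to sharpen.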
IR visualization and reduction. Metro map visualization is accomplished by graph and hypergraph reduction followed by octilinear map embedding, supporting manual or automated bug localization within JIT optimization passes via node/phase intersection analysis (Lim et al., 2021).
4. Evaluation Metrics and Quantitative Results
JIT-DL systems are quantitatively assessed via metrics suited to the respective localization granularity, including precision, recall, $F_1$, area under the ROC curve (AUC), effort-aware statistics, and entity ranking measures:
| Metric | Definition/Scope | Notable Values |
|---|---|---|
| $F_1$ | $2PR/(P+R)$, standard localization | CompDefect: $0.679$ vs. DeepJIT $0.414$ (Ni et al., 2022) |
| AUC | ROC-curve area, binary discrimination | CompDefect: $0.785$ vs. CC2Vec $0.492$ (Ni et al., 2022) |
| Top-$n$ Accuracy | Fraction of bugs ranked in top $n$ entities | DPGen4JIT Top-1 and Top-20 (Lim et al., 2023) |
| PCI@20%LOC | Proportion of defects found in top 20% LOC | JITLine OpenStack: $0.56$; Qt: $0.70$ (Pornprasit et al., 2021) |
| Effort@20%Recall | Fraction of LOC needed for 20% recall | JITLine OpenStack: $0.04$; Qt: $0.02$ (Pornprasit et al., 2021) |
| Top-10 Line Accuracy | Fraction of fixed defective lines in top 10 | OpenStack: $0.70$; Qt: $0.50$ (Pornprasit et al., 2021) |
| Initial False Alarm (IFA) | # clean lines before first defect | OpenStack median: $0$; Qt: $1$ (Pornprasit et al., 2021) |
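The effort-aware metrics in the table reduce to simple computations over a suspiciousness-ranked line list; a stdlib sketch, with a hypothetical ranking (True marks a truly defective line):

```python
def effort_metrics(ranked_is_defective, effort_frac=0.2):
    """Given lines ranked by suspiciousness (True = actually defective),
    return (recall within the top effort_frac of LOC, initial false alarms)."""
    total_defective = sum(ranked_is_defective)
    cutoff = max(1, int(len(ranked_is_defective) * effort_frac))
    found = sum(ranked_is_defective[:cutoff])
    recall_at_effort = found / total_defective if total_defective else 0.0
    # IFA: number of clean lines inspected before the first defective one.
    ifa = (ranked_is_defective.index(True)
           if True in ranked_is_defective else len(ranked_is_defective))
    return recall_at_effort, ifa

# 10 ranked lines, 3 truly defective (illustrative, not from the papers).
ranking = [True, False, True, False, False, True, False, False, False, False]
recall20, ifa = effort_metrics(ranking)
```

Effort@20%Recall is the dual quantity: the smallest LOC fraction whose prefix of the ranking already covers 20% of the defective lines.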
JITLine exhibits significant improvements over CC2Vec, DeepJIT, and baseline n-gram models in both accuracy and cost-effectiveness, achieving much lower effort@recall and IFA. DPGen4JIT delivers higher top-n localization and a pronounced reduction of non-suspicious entities compared with random and single-failing benchmarks.
5. Comparative Analysis and Trade-offs
JIT-DL research demonstrates trade-offs among granularity, recall, runtime efficiency, and explainability.
- Granularity: Approaches such as CompDefect achieve function-level localization and multiclass defect categorization, outperforming commit-level models in actionable precision (Ni et al., 2022). JITLine yields line-level defect prioritization, promoting targeted code reviews (Pornprasit et al., 2021).
- Runtime and scalability: JITLine operates at a training cost at least $70\times$ lower than deep learning baselines, making it suitable for continuous integration. Metro map-based visualization for JIT compilers enables defect localization in minutes, supported by aggressive reduction of IR node counts (e.g., for V8 Bug 5129) (Lim et al., 2021).
- Model complexity: Bag-of-tokens and RandomForest architectures offer superior interpretability. Deep neural and graph-based models (GraphCodeBERT, tensor networks) require heavier computation but enable richer context and multiclass outputs.
- Integration: DPGen4JIT can be deployed as a test-generation front end for ongoing JIT-DL pipelines, while MetroSets visualization can be integrated with spectrum-based scoring for automated suspiciousness ranking (Lim et al., 2023, Lim et al., 2021).
6. Limitations and Future Directions
JIT-DL research faces several open challenges:
- Dataset and labeling constraints: Quality of defect localization is sensitive to program seed selection, defect labeling (e.g., SZZ for commit-level), and domain coverage—often limited to major open-source repositories.
- Semantic mutation and test coverage: For JIT compiler localization, enhanced target identification suggests moving beyond isolated node mutations to detecting combinatorial defect triggers.
- Generalizability: While DPGen4JIT and MetroSets focus on JIT compiler IR, the methods generalize to other language processors given AST-based grammars and bug-detection oracles.
- Automated scoring: Manual visualization-based bug ranking is currently a bottleneck and is being addressed by integrating spectrum-based suspiciousness scoring.
- Effort-aware actionability: Extending line-level localization to suggest direct fixes or refactorings is an open area (e.g., TimeLIME, counterfactual edits).
This suggests that progress in JIT-DL hinges on further integration of structure-aware learning, automated explanatory metrics, and domain-adaptive test generation.
7. Significance and Implications
JIT-DL methods offer substantial advances over traditional defect prediction paradigms by narrowing localization granularity, reducing developer investigation effort, and enabling real-time actionable insights. At commit, function, line, and IR-entity resolutions, models such as CompDefect, JITLine, DPGen4JIT, and MetroSets streamline both detection and diagnosis of software defects, particularly in modern, continuously-evolving systems (Ni et al., 2022, Pornprasit et al., 2021, Lim et al., 2023, Lim et al., 2021).
A plausible implication is that ongoing refinement of input representations, localized scoring, and test-generation procedures will further enhance defect localization fidelity, portability across programming languages, and relevance for industrial-scale deployment in both general software development and specialized compilation toolchains.