
Cross-Project Fault Prediction

Updated 9 February 2026
  • Cross-project fault prediction is the technique of using labeled defect data from multiple projects to predict defects in a target project despite differences in data distributions.
  • It employs methods such as relevancy filtering, transfer learning, and ensemble approaches to reconcile heterogeneous feature sets and improve prediction accuracy.
  • Effective application requires rigorous data cleansing, meta-learning for optimal algorithm selection, and proper cost-sensitive evaluation to ensure practical utility.

Cross-project fault prediction (CPFP), also referred to as cross-project defect prediction (CPDP), is the task of constructing and deploying predictive models that estimate defect-proneness in a software project (the target) by leveraging labeled data from one or more other projects (the sources). This methodology is especially significant when historical defect data is unavailable for the target project, for instance, in new or inactive projects. CPFP research is characterized by particular challenges related to data distribution differences, data quality, feature-set heterogeneity, and industrial applicability.

1. Formal Definition and Central Challenges

In CPFP, let the dataset be $DS = \{(x_i, y_i)\}_{i=1}^{M}$, where $x_i \in \mathbb{R}^{N-1}$ represents static code metrics and $y_i \in \{0,1\}$ the defect label ($1 =$ defective, $0 =$ clean). Unlike within-project defect prediction (WPDP), which assumes the training and test sets are drawn from the same (or similar) distribution, CPFP must contend with:

  • Covariate Shift: Discrepancy in the distribution of feature vectors between source and target projects ($P_S(X) \neq P_T(X)$).
  • Concept Drift: Instability or inconsistency in the mapping from features to labels across projects ($P_S(Y \mid X) \neq P_T(Y \mid X)$).
  • Class Imbalance: Datasets typically exhibit a small proportion of defect-prone modules, which leads to biased classifier behavior.
  • Heterogeneous Feature Sets: Source and target may use differing sets of metrics, precluding direct transfer unless reconciled (He et al., 2014).
  • Public Data Quality: Prevalence of duplicated or contradictory records ("identical" and "inconsistent" cases) in standard datasets, e.g., Jureczko and NASA (Sun et al., 2018).

These issues degrade the generalization ability of naïve models and motivate the development of data cleansing, relevancy filtering, transfer-learning, and robust evaluation techniques.
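
The covariate-shift condition above can be checked empirically before training. A minimal sketch, using a hand-rolled two-sample Kolmogorov-Smirnov statistic per metric column on synthetic source/target data (the 0.2 decision threshold and all data values are invented for illustration):

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

rng = np.random.default_rng(0)
source = rng.normal(loc=50, scale=10, size=(500, 2))  # source-project metrics
target = rng.normal(loc=80, scale=25, size=(200, 2))  # shifted target metrics

# One statistic per metric column; large values flag P_S(X) != P_T(X).
stats = [ks_statistic(source[:, j], target[:, j]) for j in range(source.shape[1])]
shifted = [s > 0.2 for s in stats]  # illustrative threshold, not a standard value
```

Metrics whose statistic exceeds the chosen threshold are candidates for transformation or for discarding dissimilar source instances.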

2. Data Quality Issues and Cleansing in CPFP

The reliability of CPFP models is directly influenced by the underlying data quality. In prominent datasets such as Jureczko, duplicated records ("identical cases") and contradictory records ("inconsistent cases") are widespread: 52 of 65 releases exhibited at least one identical pair and 28 at least one inconsistent pair, with some releases affected at rates well above 50%. Such anomalies distort both training and evaluation, artificially inflating or deflating reported accuracy and AUC.

A two-stage cleaning process is advised:

  1. Duplicate Removal: For all $i < j$, if $x_i = x_j$ and $y_i = y_j$, remove instance $j$.
  2. Inconsistency Removal: For remaining pairs $i < j$, if $x_i = x_j$ and $y_i \ne y_j$, remove both $i$ and $j$.

Applying these steps across all datasets can eliminate up to $10^4$ cases per release, purging misleading data that may otherwise bias model assessment (Sun et al., 2018).
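
The two-stage procedure above can be sketched in a few lines; the metric rows and labels below are toy values invented for illustration:

```python
import numpy as np

# Toy release: rows are metric vectors, y holds 0/1 defect labels.
X = np.array([[1, 2], [1, 2], [3, 4], [3, 4], [5, 6]])
y = np.array([ 1,      1,      0,      1,      0    ])

# Stage 1 -- duplicate removal: for i < j with x_i == x_j and y_i == y_j, drop j.
rows = [tuple(x) + (label,) for x, label in zip(X, y)]
seen, keep = set(), []
for idx, row in enumerate(rows):
    if row not in seen:
        seen.add(row)
        keep.append(idx)
X1, y1 = X[keep], y[keep]

# Stage 2 -- inconsistency removal: for remaining i < j with x_i == x_j and
# y_i != y_j, drop both i and j.
feats = [tuple(x) for x in X1]
label_sets = {}
for f, label in zip(feats, y1):
    label_sets.setdefault(f, set()).add(label)
mask = [len(label_sets[f]) == 1 for f in feats]
X_clean, y_clean = X1[mask], y1[mask]
```

In the toy data, the second `[1, 2]` row is removed as a duplicate, and both `[3, 4]` rows are removed as an inconsistent pair, leaving two clean instances.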

Data cleansing is shown to cause substantial performance shifts:

  • Random Forest + Global Filter: Average +19.9% F-Measure after cleaning; other filters and classifiers saw more modest but nontrivial changes.
  • AUC: Typically ±3% average change, with some releases experiencing swings >10% (Sun et al., 2018).

Uncleaned data may overstate (or understate) algorithm effectiveness and should therefore never be used without rigorous anomaly scans.

3. Representative Methodologies for Cross-Project Fault Prediction

CPFP research proposes a spectrum of methodologies, with core archetypes and corresponding representative approaches:

  • Relevancy Filtering: Filters source data to retain only instances similar to the target, typically via $k$-nearest neighbor techniques (e.g., the Burak filter) (Sun et al., 2018, He et al., 2016), clustering, or project-level similarity (Herbold, 2017).
  • Transfer Learning: Domain adaptation methods such as mean/median scaling, feature transformation (Transfer Component Analysis/TCA+), and instance reweighting (e.g., data gravitation) (Porto et al., 2018, Tong et al., 2019). Recent advances utilize bilevel optimization frameworks that configure both the transformation and the classifier (e.g., BiLO-CPDP, MBL-CPDP) (Li et al., 2020, Chen et al., 2024).
  • Data Cleansing: Systematic duplicate and inconsistency removal, particularly necessary for public datasets (see above).
  • Ensemble Learning: Hybrid-inducer ensembles (HIEL), bootstrap aggregation, and weighted majority voting strategies have been empirically shown to improve F-measure and cost effectiveness over single-algorithm approaches (B et al., 2022).
  • Resampling for Imbalance: Target-aware oversampling (TOMO), SMOTE, and diversity-based methods (MAHAKIL) are used to address skewed class distributions in source data, with validated improvements in recall and G-measure (Bennin et al., 2022, Tong et al., 2019).
  • Instance Weighting and Feature Importance: Algorithms such as FWTNB leverage maximal information coefficient (MIC) based feature weighting and local instance relevance to enhance transferability (Tong et al., 2019).
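
The relevancy-filtering archetype can be sketched in the style of the Burak filter: for each target instance, keep its $k$ nearest source instances, then train on the union. Data, dimensionality, and $k$ below are synthetic:

```python
import numpy as np

rng = np.random.default_rng(1)

# Labeled source instances and unlabeled target instances (synthetic metrics).
X_src = rng.normal(size=(200, 3))
y_src = rng.integers(0, 2, size=200)
X_tgt = rng.normal(size=(30, 3))

k = 10
selected = set()
for t in X_tgt:
    dists = np.linalg.norm(X_src - t, axis=1)  # Euclidean distance to each source row
    selected.update(np.argsort(dists)[:k])     # keep the k nearest source instances

idx = sorted(selected)
X_train, y_train = X_src[idx], y_src[idx]      # filtered cross-project training set
```

Any classifier trained on `(X_train, y_train)` then sees only source instances that resemble the target's metric profile.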

Methodological frameworks now routinely combine several of these components in robust, automated pipelines (Chen et al., 2024).
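
Among the transfer treatments listed above, mean/scale alignment is the simplest. A sketch, assuming per-domain z-score standardization as the alignment step (synthetic data):

```python
import numpy as np

rng = np.random.default_rng(3)
X_src = rng.normal(loc=50, scale=10, size=(300, 4))  # source-project metrics
X_tgt = rng.normal(loc=80, scale=25, size=(100, 4))  # target-project metrics

def zscore(X):
    # Standardize each domain by its own column statistics,
    # mapping both domains onto a common scale.
    return (X - X.mean(axis=0)) / X.std(axis=0)

X_src_al, X_tgt_al = zscore(X_src), zscore(X_tgt)
```

After alignment, both domains have zero mean and unit variance per metric, removing gross location/scale differences before a classifier is fit; feature-transformation methods such as TCA+ go further by learning a shared latent space.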

4. Meta-Learning and Automated Pipeline Selection

Given that no single CPFP method dominates universally (method rankings are highly dataset- and project-dependent), meta-learning has been proposed to guide selection of the most effective pipeline for a given prediction scenario. Meta-features (statistical summaries of datasets or project releases) are input to a meta-learner (commonly a random forest) that recommends among a short list of top-performing CPDP techniques (Porto et al., 2018). Such meta-learning approaches can reach ~53% accuracy in selecting the best method for new projects, an advantage over arbitrary or fixed choices, though the margin over the single best base-learner can be limited.
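
A minimal sketch of the meta-learning idea: summarize each release into meta-features and recommend the method that worked best on the most similar historical release. The literature typically uses a random forest as the meta-learner; for a dependency-free illustration, a nearest-neighbour recommender over synthetic history is substituted here, and the meta-feature set and method labels are invented:

```python
import numpy as np

rng = np.random.default_rng(2)

def meta_features(X, y):
    # Statistical summary of a release: size, dimensionality, metric moments,
    # and defect ratio. Real meta-feature sets are richer.
    return np.array([X.shape[0], X.shape[1], X.mean(), X.std(), y.mean()])

# Synthetic history: past releases, each with a known best-performing method
# (0 = relevancy filter, 1 = transfer learner, 2 = ensemble -- toy labels).
history, best = [], []
for _ in range(40):
    n = int(rng.integers(50, 300))
    X, y = rng.normal(size=(n, 4)), rng.integers(0, 2, size=n)
    history.append(meta_features(X, y))
    best.append(int(rng.integers(0, 3)))
H, best = np.array(history), np.array(best)

# Normalize meta-features, then recommend via the nearest historical release.
mu, sd = H.mean(axis=0), H.std(axis=0) + 1e-12
def recommend(X_new, y_new):
    z = (meta_features(X_new, y_new) - mu) / sd
    return int(best[np.argmin(np.linalg.norm((H - mu) / sd - z, axis=1))])

X_t, y_t = rng.normal(size=(120, 4)), rng.integers(0, 2, size=120)
choice = recommend(X_t, y_t)
```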

Recent developments have further automated model selection via bi-level or multi-objective optimization (BiLO-CPDP, MBL-CPDP). These frameworks search over large combinatorial spaces of feature selectors, transfer learners, classifiers, and hyperparameters to identify Pareto-optimal pipelines, demonstrating significant AUC and F-measure improvements over baselines (Li et al., 2020, Chen et al., 2024).

5. Evaluation Metrics, Cost Sensitivity, and Practical Impact

Standard evaluation metrics in CPFP include:

  • Discrimination: Area Under the ROC Curve (AUC), F-measure, G-measure, Matthews Correlation Coefficient (MCC), recall, and precision.
  • Effort- and Cost-Oriented: Notably, classic predictive performance may not correlate with cost efficiency. Cost assessments require metrics such as the Normalized Estimated Cost Metric ($NECM_{C_{ratio}}$), the percentage of defects found in the top $p\%$ of code ($Rel_{p\%}$), and AUCEC (area under the cost-effectiveness curve) (Herbold, 2018).

Benchmarking of 26 state-of-the-art CPDP methods shows that naïvely flagging all modules as defective often outperforms non-cost-sensitive machine-learned models in cost terms (Herbold, 2018), highlighting a critical misalignment between optimizing for AUC/F-measure and achieving practical reductions in inspection effort. Only approaches that explicitly integrate cost objectives (e.g., genetic programming for recall vs. LOC) can outperform the trivial all-defective baseline.
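
The $Rel_{p\%}$ idea and the trivial-baseline comparison can be sketched with toy module data (all LOC values, labels, and scores below are invented); the "all-defective" strategy is approximated by inspecting the smallest modules first, which is its most cost-effective ordering:

```python
import numpy as np

# Hypothetical modules: lines of code, true defect label, model risk score.
loc    = np.array([100, 500, 50, 200, 1000, 80, 300, 40])
defect = np.array([1,   0,   1,  0,   1,    0,  1,   0])
score  = np.array([0.9, 0.2, 0.8, 0.3, 0.4, 0.1, 0.7, 0.5])

def rel_at(p, order):
    """Fraction of all defects found when inspecting the top p% of total LOC
    in the given inspection order."""
    budget = p / 100 * loc.sum()
    cum_loc, found = 0.0, 0
    for i in order:
        cum_loc += loc[i]
        if cum_loc > budget:
            break
        found += defect[i]
    return found / defect.sum()

model_order   = np.argsort(-score)  # inspect highest-risk modules first
trivial_order = np.argsort(loc)     # "all defective": smallest modules first
rel_model   = rel_at(20, model_order)
rel_trivial = rel_at(20, trivial_order)
```

In this toy setting the model finds 75% of defects within 20% of the LOC versus 50% for the trivial ordering; the benchmarking result above shows that, on real data, non-cost-sensitive models often fail to achieve such an advantage.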

Organizations deploying CPFP must thus define clear objective functions (e.g., maximize defect detection subject to inspection budget) and ensure models are evaluated accordingly.

6. Industrial Applicability and Emerging Directions

CPFP methods are evolving in response to industrial requirements:

  • Interpretability: Rule-based and metric-thresholding approaches allow explicit mapping from code characteristics (e.g., cyclomatic complexity, function size) to actionable predictions, harmonizing with standards such as ISO 26262 and providing traceable gates for safety-critical domains (Luca et al., 6 Feb 2026). Threshold-based approaches can be directly integrated into SQA workflows with high-precision flags, avoiding the opaqueness of typical machine-learned predictors.
  • Inverse Defect Prediction (IDP): Rather than predicting defect-prone locations, IDP identifies methods with structural triviality and low historical defect risk, achieving high precision and significant fault-density reduction, especially in cross-project scenarios (Niedermayr et al., 2018, Niedermayr et al., 2018).
  • Privacy-Preserving Federated Learning: Federated CPFP enables training defect predictors without raw data sharing, using local heterogeneity-aware knowledge distillation and open-source distillation datasets (Wang et al., 2024). This addresses data privacy concerns and data heterogeneity, delivering state-of-the-art F1 and AUC with communication efficiency.
  • Concurrency-Fault Prediction: CPFP frameworks have been adapted to model concurrency-specific code structures and mutation-derived metrics, supporting defect prediction for multi-threaded software (Yu et al., 2018).
  • Deep Transfer Learning and GANs: Adversarial domain adaptation via GANs aligns source and target metric distributions, showing performance gains in moderately balanced target projects (Pal, 2021), although severe class imbalance remains challenging.

7. Current Limitations and Best Practices

While CPFP has developed into a mature research area, several limitations persist:

  • Heterogeneity in Evaluation: The diversity in datasets, metric sets, evaluation baselines, and case-study protocols impedes direct comparison and cumulative progress (Herbold, 2017).
  • Lack of Cost-Sensitivity: Many published models optimize standard metrics but not practical cost/effort objectives, leading to suboptimal outcomes for QA resource allocation (Herbold, 2018).
  • Metric Set Mismatches: Distribution characteristic-based mapping enables transfer when source and target metric sets differ, but this area remains less explored than the identical-metric case (He et al., 2014).
  • Class Imbalance Handling: Resampling methods improve recall and G-measure but increase false alarm rates; their utility depends on explicit trade-off preferences (Bennin et al., 2022, Tong et al., 2019).
  • Data Quality: Rigorous data cleansing is non-negotiable prior to any CPFP modeling (Sun et al., 2018).

Recommended practices include:

  • Automated data-quality scans before model building.
  • Relevancy filtering of source data by instance- or project-level similarity metrics.
  • Use of ensemble models with model/instance weighting where feasible.
  • Explicit cost-based evaluation and reporting, including baseline comparisons to "predict-all-defective".
  • Integration of interpretable features or thresholds when requirements demand transparency.
  • Comprehensive documentation and sharing of code, data splits, and hyperparameter settings to enable reproducibility and cumulative advancement.

References:

  • Sun et al., 2018
  • Luca et al., 6 Feb 2026
  • Chen et al., 2024
  • Li et al., 2020
  • Wang et al., 2024
  • Porto et al., 2018
  • B et al., 2022
  • Herbold, 2018
  • Herbold, 2017
  • Niedermayr et al., 2018
  • Niedermayr et al., 2018
  • Bennin et al., 2022
  • Tong et al., 2019
  • He et al., 2014
  • Yu et al., 2018
  • Pal, 2021
  • Haldar et al., 2023
