PFed-Signal: Federated ADR Signal Detection
- The paper demonstrates that PFed-Signal mitigates bias in ADR detection by integrating Pfed-Split data partitioning with Euclidean distance-based filtering.
- It employs a two-stage federated learning approach that cleans and partitions large FAERS datasets, enhancing classical metrics like ROR and PRR.
- The integrated transformer-based ADR predictor achieves high accuracy, F1, recall, and AUC, underscoring the framework's effectiveness on noisy data.
PFed-Signal is a federated learning-based adverse drug reaction (ADR) signal detection and prediction framework tailored for large, noisy pharmacovigilance datasets such as FAERS. It introduces a two-stage strategy: data partition and cleaning via Pfed-Split, and federated bias-aware ADR prediction via ADR-Signal, which collectively mitigate the confounding impact of biased records—such as duplicates, under-reporting, and label errors—on classical disproportionality metrics (reporting odds ratio, ROR; and proportional reporting ratio, PRR) and modern neural ADR predictors. The system formalizes a Euclidean distance–based criterion for identifying and purging locally biased data and deploys a federated transformer model architecture to maximize accuracy, F1, recall, and AUC on the cleaned distributed dataset (Li et al., 29 Dec 2025).
1. Data Partitioning and Pfed-Split Mechanism
Pfed-Split processes a monolithic FAERS dataset D by executing a multi-step pre-processing and splitting workflow:
- Cleaning: Deduplicate entries, clip numeric features to medically valid ranges (e.g., age ∈ [0,120]), and drop records with missing or invalid critical attributes.
- Random Partition: Uniformly split the cleaned dataset D_pre into mutually exclusive client splits Split_1, …, Split_n, ensuring Split_1 ∪ ⋯ ∪ Split_n = D_pre and Split_i ∩ Split_k = ∅ for i ≠ k.
- ADR-based Tables: Within each client i, generate ADR-specific tables AT^i_j for each ADR label j = 1, …, m.
Parameter choices for n (client count), feature cleaning thresholds, and random seeds are selected to balance privacy, statistical efficiency, and split reproducibility.
2. Federated Architecture, Signal Cleaning, and Euclidean Distance-Based Filtering
PFed-Signal employs a server–client federated learning pattern:
- Local Training: Each client i trains ADR-specific binary classifiers (parameters θ^i_j) on its local AT^i_j.
- Aggregation: The server aggregates local models to form global ADR-specific weights: θ_j = (1/n) Σ_{i=1..n} θ^i_j.
- Bias Detection via Euclidean Distance: To detect and remove bias, the server computes, for each client model, the Euclidean distance to the global weights: d^i_j = ‖θ^i_j − θ_j‖₂. Any θ^i_j with d^i_j > τ (where the threshold τ is typically set via cross-validation) is marked biased and excluded from further aggregation.
This mechanism addresses dataset noise at scale—a major limitation of pure statistical scoring (e.g., ROR, PRR) which is prone to bias-induced inflation.
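As an illustration, one aggregation round with distance-based filtering can be sketched in plain Python. The vector layout, the two-pass scheme (provisional average, then re-aggregation over unflagged clients), and the function names `aggregate_with_bias_filter` and `euclidean` are assumptions for exposition, not the paper's implementation:

```python
import math

def euclidean(a, b):
    """Euclidean (L2) distance between two flat parameter vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def aggregate_with_bias_filter(client_params, tau):
    """One aggregation round with Euclidean-distance bias filtering.

    client_params: list of flat parameter vectors, one per client.
    tau: distance threshold; clients farther than tau from the
         provisional global model are marked biased and excluded.
    Returns (global_params, kept_client_indices).
    """
    dim = len(client_params[0])
    # Provisional global model: plain average over all clients.
    provisional = [sum(p[k] for p in client_params) / len(client_params)
                   for k in range(dim)]
    # Flag clients whose local weights drift beyond tau.
    kept = [i for i, p in enumerate(client_params)
            if euclidean(p, provisional) <= tau]
    # Re-aggregate over the unflagged clients only.
    global_params = [sum(client_params[i][k] for i in kept) / len(kept)
                     for k in range(dim)]
    return global_params, kept
```

With three near-identical clients and one outlier, the outlier is excluded and the global model averages only the consistent clients.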
3. ADR Prediction Model: Integration of Cleaned Federated Data
The clean global training set is defined as D_global = ⋃_{i=1..n, j=1..m} { AT^i_j : d^i_j ≤ τ } and supports two downstream tasks:
- Calculation of Robust Signal Scores: ROR and PRR are recalculated on the cleaned global training set, yielding higher, more reliable values than those computed on the biased original dataset.
- Transformer-based ADR Classifier (ADR-Signal): A transformer model is trained on the cleaned global training set in a federated manner, allowing high-capacity clients to leverage the full data distribution while retaining privacy.
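The classical scores are computed from a per-(drug, ADR) 2×2 contingency table. A minimal sketch of the standard formulas follows; the cell naming (a, b, c, d) is the usual pharmacovigilance convention, not code from the paper:

```python
def ror(a, b, c, d):
    """Reporting odds ratio (ROR) from a 2x2 contingency table:
    a: reports with the target drug and the target ADR
    b: reports with the target drug, other ADRs
    c: reports with other drugs and the target ADR
    d: reports with other drugs, other ADRs
    ROR = (a/b) / (c/d) = (a*d) / (b*c)
    """
    return (a * d) / (b * c)

def prr(a, b, c, d):
    """Proportional reporting ratio (PRR) from the same 2x2 table:
    PRR = (a / (a+b)) / (c / (c+d))
    """
    return (a / (a + b)) / (c / (c + d))
```

For example, `ror(20, 80, 10, 890)` gives 22.25 and `prr(20, 80, 10, 890)` gives 18.0; removing biased (e.g., duplicated) reports changes these counts and hence the scores.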
4. Computational Workflow and Complexity
The cleaning and partitioning steps require O(|D|) operations. Each federated round has O(|AT^i_j| · d) local update cost per client and O(n · d) aggregation complexity for d-dimensional parameter vectors. Overall complexity is linear in dataset size and model dimension.
Pfed-Split’s preprocessing and partition strategy produces n × m ADR-based tables AT^i_j, enabling eventual bias-aware federated training and evaluation. Federated averaging converges under standard convexity and smoothness conditions at rate O(1/√T) over T rounds, or better when using momentum.
Pseudocode for Pfed-Split:
```
Algorithm Pfed-Split
Input:
  D            ← original FAERS records
  n            ← number of clients
  valid_ranges ← feature thresholds
Output:
  {AT^i_j}     # ADR-based tables per client

1. D_clean ← drop_duplicates(D)
2. for each numeric feature f in D_clean:
       clip f to valid_ranges[f]
3. D_pre ← delete_records_with_nulls(D_clean)
4. Randomly partition D_pre into Split_1, ..., Split_n
5. for i = 1..n:
       for j = 1..m:
           AT^i_j ← { r ∈ Split_i : ADR_label(r) == j }
6. return {AT^i_j : i=1..n, j=1..m}
```
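The pseudocode above can be sketched as runnable Python; the record schema (dicts with numeric features and an `adr` label field) and the seeded-shuffle partition are illustrative assumptions, not the paper's data model:

```python
import random

def pfed_split(records, n, valid_ranges, seed=0):
    """Minimal sketch of the Pfed-Split workflow.

    records: list of dicts with numeric features and an 'adr' label.
    n: number of clients.
    valid_ranges: {feature: (lo, hi)} clipping thresholds.
    Returns {(i, j): [records]} -- ADR-based table AT^i_j
    for client i and ADR label j.
    """
    # 1. Deduplicate (records serialized to a hashable key).
    seen, dedup = set(), []
    for r in records:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            dedup.append(dict(r))
    # 2. Clip numeric features to medically valid ranges.
    for r in dedup:
        for f, (lo, hi) in valid_ranges.items():
            if r.get(f) is not None:
                r[f] = min(max(r[f], lo), hi)
    # 3. Drop records with missing critical attributes.
    pre = [r for r in dedup if all(v is not None for v in r.values())]
    # 4. Random partition into n mutually exclusive splits.
    rng = random.Random(seed)
    rng.shuffle(pre)
    splits = [pre[i::n] for i in range(n)]
    # 5. Build ADR-based tables AT^i_j, keyed by (client, ADR label).
    tables = {}
    for i, split in enumerate(splits, start=1):
        for r in split:
            tables.setdefault((i, r["adr"]), []).append(r)
    return tables
```

Seeding the shuffle makes the split reproducible, matching the reproducibility requirement noted in Section 1.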
5. Empirical Evaluation and Signal Enhancement
Extensive empirical evaluation on FAERS demonstrates that, after Pfed-Split and federated Euclidean-based bias detection:
- Improved Metrics: ROR and PRR computed on the cleaned global training set exhibit higher values and less noise than on the original, uncleaned data.
- Superior Predictive Performance: On benchmark ADR signal prediction, the PFed-Signal framework achieves accuracy = 0.887, F1 = 0.890, recall = 0.913, and AUC = 0.957, all exceeding baselines (Li et al., 29 Dec 2025).
This suggests that federated bias mitigation is essential in ADR mining, where classical metrics and neural predictors otherwise remain vulnerable to data artifacts.
6. Parameterization, Practical Considerations, and Limitations
The parameter n controls split granularity: a higher n increases privacy but reduces the statistical power of each client partition. The threshold τ must be tuned via out-of-sample validation to balance over-filtering and under-filtering. The approach is computationally scalable, requiring only linear operations in both data and model size.
A plausible implication is that the methodology (partitioning, federated signal cleaning, and Euclidean filtering) may extend to other medical data-fusion settings facing similar biases. However, tuning of τ and careful preprocessing remain critical for achieving optimal performance and avoiding inadvertent exclusion of valid signal.
References:
- “PFed-Signal: An ADR Prediction Model based on Federated Learning” (Li et al., 29 Dec 2025)
- For conceptual contrast in split federated learning: “Flexible Personalized Split Federated Learning for On-Device Fine-Tuning of Foundation Models” (Yuan et al., 14 Aug 2025)