
PFed-Signal: Federated ADR Signal Detection

Updated 5 January 2026
  • The paper demonstrates that PFed-Signal mitigates bias in ADR detection by integrating Pfed-Split data partitioning with Euclidean distance-based filtering.
  • It employs a two-stage federated learning approach that cleans and partitions large FAERS datasets, enhancing classical metrics like ROR and PRR.
  • The integrated transformer-based ADR predictor achieves high accuracy, F1, recall, and AUC, underscoring the framework's effectiveness on noisy data.

PFed-Signal is a federated learning framework for adverse drug reaction (ADR) signal detection and prediction, tailored to large, noisy pharmacovigilance datasets such as FAERS. It introduces a two-stage strategy: data partitioning and cleaning via Pfed-Split, followed by federated bias-aware ADR prediction via ADR-Signal. Together, these stages mitigate the confounding impact of biased records, such as duplicates, under-reporting, and label errors, on classical disproportionality metrics (reporting odds ratio, ROR; proportional reporting ratio, PRR) and on modern neural ADR predictors. The system formalizes a Euclidean distance-based criterion for identifying and purging locally biased data, and deploys a federated transformer architecture to maximize accuracy, F1, recall, and AUC on the cleaned distributed dataset (Li et al., 29 Dec 2025).

1. Data Partitioning and Pfed-Split Mechanism

Pfed-Split processes a monolithic FAERS dataset

$$D = \{\, r_k = (x_k, y_k) \mid x_k \in \mathbb{R}^d,\ y_k \in \{1,\dots,m\} \,\}_{k=1}^N$$

by executing a multi-step pre-processing and splitting workflow:

  • Cleaning: Deduplicate entries, clip numeric features to medically valid ranges (e.g., age ∈ [0,120]), and drop records with missing or invalid critical attributes.
  • Random Partition: Uniformly split the cleaned dataset $D_{\mathrm{pre}}$ into $n$ mutually exclusive client splits $\{Split_i\}$, ensuring $D_{\mathrm{pre}} = \bigcup_{i=1}^n Split_i$.
  • ADR-based Tables: Within each client $i$, generate ADR-specific tables $AT^i_j = \{ (x, y) \in Split_i : y = j \}$ for $j = 1, \ldots, m$.

Parameter choices for $n$ (client count), feature cleaning thresholds, and random seeds are selected to balance privacy, statistical efficiency, and split reproducibility.

2. Federated Architecture, Signal Cleaning, and Euclidean Distance-Based Filtering

PFed-Signal employs a server–client federated learning pattern:

  • Local Training: Each client $i$ trains ADR-specific binary classifiers (parameters $w_{i,j}$) on its local $AT^i_j$.
  • Aggregation: The server aggregates local models to form global ADR-specific weights:

    $$\overline w_j = \sum_{i=1}^n \frac{|AT^i_j|}{\sum_{i'} |AT^{i'}_j|}\, w_{i,j}$$

  • Bias Detection via Euclidean Distance: To detect and remove bias, the server computes the Euclidean distance between each local model and the global aggregate:

    $$\Delta^i_j = \left\| w_{i,j} - \overline w_j \right\|_2 = \sqrt{ \sum_\ell \left( w_{i,j,\ell} - \overline w_{j,\ell} \right)^2 }$$

Any $AT^i_j$ with $\Delta^i_j > \epsilon$ (where $\epsilon$ is typically set via cross-validation, e.g., $\epsilon = 4$) is marked biased and excluded from further aggregation.
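The aggregation and filtering steps above can be sketched in NumPy. This is a minimal illustration of the weighted average and the $\ell_2$ filter, not the paper's implementation; the toy client weights, table sizes, and threshold are invented for demonstration:

```python
import numpy as np

def aggregate_and_filter(client_weights, table_sizes, eps):
    """Weighted-average the client models for one ADR class, then flag
    clients whose local weights drift more than `eps` (in L2 norm) from
    the global average as biased."""
    client_weights = np.asarray(client_weights, dtype=float)  # shape (n, p)
    table_sizes = np.asarray(table_sizes, dtype=float)        # |AT^i_j|
    frac = table_sizes / table_sizes.sum()
    w_bar = (frac[:, None] * client_weights).sum(axis=0)      # global w̄_j
    dists = np.linalg.norm(client_weights - w_bar, axis=1)    # Δ^i_j
    keep = dists <= eps                                       # unbiased clients
    return w_bar, dists, keep

# Toy example: three clients with 2-dimensional models; client 2 is an outlier.
w = [[1.0, 1.0], [1.2, 0.8], [9.0, 9.0]]
sizes = [100, 100, 10]
w_bar, dists, keep = aggregate_and_filter(w, sizes, eps=4.0)
# keep -> [True, True, False]: the outlier client's table is excluded.
```

Because the average is weighted by table size, a small outlier client shifts the global model only slightly but still lands far from it, which is exactly the case the distance test catches.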

This mechanism addresses dataset noise at scale, a major limitation of pure statistical scoring (e.g., ROR, PRR), which is prone to bias-induced inflation.

3. ADR Prediction Model: Integration of Cleaned Federated Data

The clean global training set is defined as

$$D_{\text{clean}} = \bigcup_{i,j:\ \Delta^i_j \le \epsilon} AT^i_j$$

and supports two downstream tasks:

  • Calculation of Robust Signal Scores: ROR and PRR are recalculated using $D_{\text{clean}}$, yielding higher, more reliable values than those computed on the biased original dataset.
  • Transformer-based ADR Classifier (ADR-Signal): A transformer model is trained on $D_{\text{clean}}$ in a federated manner, allowing high-capacity clients to leverage the full data distribution while retaining privacy.
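For reference, the recalculated ROR and PRR follow the standard pharmacovigilance 2×2 contingency-table definitions. The sketch below states those textbook formulas only; how the four counts are tallied from the cleaned tables is left abstract:

```python
def ror(a, b, c, d):
    """Reporting odds ratio from a 2x2 contingency table:
    a = reports with (drug, ADR), b = drug without the ADR,
    c = other drugs with the ADR, d = other drugs without it."""
    return (a * d) / (b * c)

def prr(a, b, c, d):
    """Proportional reporting ratio on the same table."""
    return (a / (a + b)) / (c / (c + d))

# Toy counts: ror(20, 80, 10, 890) -> 22.25, prr(20, 80, 10, 890) -> 18.0.
# Removing duplicate or mislabeled background reports shrinks b and c,
# which is why both scores rise on the cleaned data.
```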

4. Computational Workflow and Complexity

The cleaning and partitioning steps require $O(Nd)$ operations. Each federated round costs $O(|Split_i|\, d)$ per client for local updates, plus $O(np)$ for server aggregation of $p$-dimensional parameter vectors. Overall complexity is linear in dataset size and model dimension.

Pfed-Split’s preprocessing and partition strategy produces $n \times m$ tables, enabling bias-aware federated training and evaluation. Federated averaging converges at rate $O(1/T)$, or better with momentum, under standard convexity and smoothness conditions.

Pseudocode for Pfed-Split:

Algorithm Pfed-Split
Input:    D  original FAERS records
          n  number of clients
          valid_ranges  feature thresholds
Output:   {AT^i_j}  # ADR-based tables per client

1. D_clean ← drop_duplicates(D)
2. for each numeric feature f in D_clean:
       clip f to valid_ranges[f]
3. D_pre ← delete_records_with_nulls(D_clean)
4. Randomly partition D_pre into Split_1, ..., Split_n
5. for i = 1..n:
       for j = 1..m:
           AT^i_j ← { r ∈ Split_i : ADR_label(r) == j }
6. return {AT^i_j : i=1..n, j=1..m}
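A minimal runnable translation of this pseudocode follows. The record layout (dicts with numeric features plus an `adr` label) and helper structure are illustrative assumptions, not the paper's code; null records are checked before clipping to keep the comparison well-defined:

```python
import random
from collections import defaultdict

def pfed_split(records, n, valid_ranges, seed=0):
    """records: list of dicts of numeric features plus an 'adr' class label.
    Returns AT, where AT[i][j] is client i's table for ADR class j."""
    # Steps 1-3: deduplicate, drop records with nulls, clip numeric features.
    seen, clean = set(), []
    for r in records:
        key = tuple(sorted(r.items()))
        if key in seen:
            continue                      # drop exact duplicates
        seen.add(key)
        if any(v is None for v in r.values()):
            continue                      # drop records with missing attributes
        r = dict(r)
        for f, (lo, hi) in valid_ranges.items():
            r[f] = min(max(r[f], lo), hi)  # clip to medically valid range
        clean.append(r)
    # Step 4: random, mutually exclusive partition into n client splits.
    rng = random.Random(seed)
    rng.shuffle(clean)
    splits = [clean[i::n] for i in range(n)]
    # Step 5: ADR-specific tables AT^i_j per client.
    AT = [defaultdict(list) for _ in range(n)]
    for i, split in enumerate(splits):
        for r in split:
            AT[i][r["adr"]].append(r)
    return AT
```

For example, with `valid_ranges={"age": (0, 120)}`, a duplicate record and a null-aged record are dropped, and `age=150` is clipped to 120 before the per-client ADR tables are built.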

5. Empirical Evaluation and Signal Enhancement

Extensive empirical evaluation on FAERS demonstrates that, after Pfed-Split and federated Euclidean-based bias detection:

  • Improved Metrics: ROR and PRR computed on $D_{\text{clean}}$ exhibit higher values and less noise than on the original, uncleaned data.
  • Superior Predictive Performance: On benchmark ADR signal prediction, the PFed-Signal framework achieves accuracy = 0.887, F1 = 0.890, recall = 0.913, and AUC = 0.957, all exceeding baselines (Li et al., 29 Dec 2025).

This suggests that federated bias mitigation is essential in ADR mining, where classical metrics and neural predictors otherwise remain vulnerable to data artifacts.

6. Parameterization, Practical Considerations, and Limitations

The parameter $n$ controls split granularity: higher $n$ increases privacy but reduces the statistical power of each client partition. The threshold $\epsilon$ must be tuned via out-of-sample validation to balance over-filtering against under-filtering. The approach is computationally scalable, requiring only operations linear in both data and model size.
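The out-of-sample tuning of the threshold can be sketched as a simple grid search. The `filter_and_train` and `validate` callables and the toy score values below are purely illustrative stand-ins, not from the paper:

```python
def tune_epsilon(candidates, filter_and_train, validate):
    """Grid-search the bias threshold: for each candidate eps, build the
    filtered training set, fit a model, and keep the eps whose model
    scores best (e.g. F1) on held-out validation data."""
    return max(candidates, key=lambda eps: validate(filter_and_train(eps)))

# Toy stand-in: scores peak at eps = 4, with over-filtering below it
# (too much data discarded) and under-filtering above it (bias leaks in).
toy_scores = {1: 0.70, 2: 0.85, 4: 0.90, 8: 0.80}
best = tune_epsilon(toy_scores, lambda eps: eps, toy_scores.get)
# best -> 4
```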

A plausible implication is that the methodology of partitioning, federated signal cleaning, and Euclidean filtering may extend to other medical data fusion settings facing similar biases. However, tuning $\epsilon$ and careful preprocessing remain critical for achieving optimal performance and avoiding inadvertent exclusion of valid signal.

