
PFed-Signal: Federated ADR Signal Detection

Updated 5 January 2026
  • The paper demonstrates that PFed-Signal mitigates bias in ADR detection by integrating Pfed-Split data partitioning with Euclidean distance-based filtering.
  • It employs a two-stage federated learning approach that cleans and partitions large FAERS datasets, enhancing classical metrics like ROR and PRR.
  • The integrated transformer-based ADR predictor achieves high accuracy, F1, recall, and AUC, underscoring the framework's effectiveness on noisy data.

PFed-Signal is a federated learning framework for adverse drug reaction (ADR) signal detection and prediction, tailored to large, noisy pharmacovigilance datasets such as FAERS. It introduces a two-stage strategy: data partitioning and cleaning via Pfed-Split, followed by federated bias-aware ADR prediction via ADR-Signal. Together, these stages mitigate the confounding impact of biased records, such as duplicates, under-reporting, and label errors, on classical disproportionality metrics (reporting odds ratio, ROR; proportional reporting ratio, PRR) and on modern neural ADR predictors. The system formalizes a Euclidean distance-based criterion for identifying and purging locally biased data, and deploys a federated transformer architecture to maximize accuracy, F1, recall, and AUC on the cleaned distributed dataset (Li et al., 29 Dec 2025).

1. Data Partitioning and Pfed-Split Mechanism

Pfed-Split processes a monolithic FAERS dataset

$$D = \{\, r_k = (x_k, y_k) \mid x_k \in \mathbb{R}^d,\ y_k \in \{1,\dots,m\} \,\}_{k=1}^N$$

by executing a multi-step pre-processing and splitting workflow:

  • Cleaning: Deduplicate entries, clip numeric features to medically valid ranges (e.g., age ∈ [0,120]), and drop records with missing or invalid critical attributes.
  • Random Partition: Uniformly split the cleaned dataset $D_{\mathrm{pre}}$ into $n$ mutually exclusive client splits $\{Split_i\}$, ensuring $D_{\mathrm{pre}} = \bigcup_{i=1}^n Split_i$.
  • ADR-based Tables: Within each client $i$, generate ADR-specific tables $AT^i_j = \{ (x, y) \in Split_i : y = j \}$ for $j = 1, \ldots, m$.

Parameter choices for $n$ (client count), feature cleaning thresholds, and random seeds are selected to balance privacy, statistical efficiency, and split reproducibility.

2. Federated Architecture, Signal Cleaning, and Euclidean Distance-Based Filtering

PFed-Signal employs a server–client federated learning pattern:

  • Local Training: Each client $i$ trains ADR-specific binary classifiers (parameters $w_{i,j}$) on its local $AT^i_j$.
  • Aggregation: The server aggregates local models to form global ADR-specific weights:

    $$\overline w_j = \sum_{i=1}^n \frac{|AT^i_j|}{\sum_{i'} |AT^{i'}_j|}\, w_{i,j}$$

  • Bias Detection via Euclidean Distance: To detect and remove bias, the server computes the Euclidean distance between each local model and the global aggregate:

    $$\Delta^i_j = \left\| w_{i,j} - \overline w_j \right\|_2 = \sqrt{ \sum_\ell \left( w_{i,j,\ell} - \overline w_{j,\ell} \right)^2 }$$

Any $AT^i_j$ with $\Delta^i_j > \epsilon$ (where $\epsilon$ is typically set via cross-validation, e.g., $\epsilon = 4$) is marked biased and excluded from further aggregation.
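The aggregation and filtering steps above can be sketched in NumPy. This is a minimal illustration of the weighted average and the $\ell_2$ filter, not the paper's implementation; the toy client weights, table sizes, and threshold are invented for demonstration:

```python
import numpy as np

def aggregate_and_filter(client_weights, table_sizes, eps):
    """Weighted-average the client models for one ADR class, then flag
    clients whose local weights drift more than `eps` (in L2 norm) from
    the global average as biased."""
    client_weights = np.asarray(client_weights, dtype=float)  # shape (n, p)
    table_sizes = np.asarray(table_sizes, dtype=float)        # |AT^i_j|
    frac = table_sizes / table_sizes.sum()
    w_bar = (frac[:, None] * client_weights).sum(axis=0)      # global w̄_j
    dists = np.linalg.norm(client_weights - w_bar, axis=1)    # Δ^i_j
    keep = dists <= eps                                       # unbiased clients
    return w_bar, dists, keep

# Toy example: three clients with 2-dimensional models; client 2 is an outlier.
w = [[1.0, 1.0], [1.2, 0.8], [9.0, 9.0]]
sizes = [100, 100, 10]
w_bar, dists, keep = aggregate_and_filter(w, sizes, eps=4.0)
# keep -> [True, True, False]: the outlier client's table is excluded.
```

Because the average is weighted by table size, a small outlier client shifts the global model only slightly but still lands far from it, which is exactly the case the distance test catches.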

This mechanism addresses dataset noise at scale, a major limitation of pure statistical scoring (e.g., ROR, PRR), which is prone to bias-induced inflation.

3. ADR Prediction Model: Integration of Cleaned Federated Data

The clean global training set is defined as

$$D_{\text{clean}} = \bigcup_{i,j:\ \Delta^i_j \le \epsilon} AT^i_j$$

and supports two downstream tasks:

  • Calculation of Robust Signal Scores: ROR and PRR are recalculated using $D_{\text{clean}}$, yielding higher, more reliable values than those computed on the biased original dataset.
  • Transformer-based ADR Classifier (ADR-Signal): A transformer model is trained on $D_{\text{clean}}$ in a federated manner, allowing high-capacity clients to leverage the full data distribution while retaining privacy.
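For reference, the recalculated ROR and PRR follow the standard pharmacovigilance 2×2 contingency-table definitions. The sketch below states those textbook formulas only; how the four counts are tallied from the cleaned tables is left abstract:

```python
def ror(a, b, c, d):
    """Reporting odds ratio from a 2x2 contingency table:
    a = reports with (drug, ADR), b = drug without the ADR,
    c = other drugs with the ADR, d = other drugs without it."""
    return (a * d) / (b * c)

def prr(a, b, c, d):
    """Proportional reporting ratio on the same table."""
    return (a / (a + b)) / (c / (c + d))

# Toy counts: ror(20, 80, 10, 890) -> 22.25, prr(20, 80, 10, 890) -> 18.0.
# Removing duplicate or mislabeled background reports shrinks b and c,
# which is why both scores rise on the cleaned data.
```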

4. Computational Workflow and Complexity

The cleaning and partitioning steps require $O(Nd)$ operations. Each federated round costs $O(|Split_i|\, d)$ per client for local updates, plus $O(np)$ for server aggregation of $p$-dimensional parameter vectors. Overall complexity is linear in dataset size and model dimension.

Pfed-Split’s preprocessing and partition strategy produces $n \times m$ tables, enabling bias-aware federated training and evaluation. Federated averaging converges at rate $O(1/T)$, or better with momentum, under standard convexity and smoothness conditions.

Pseudocode for Pfed-Split:

Algorithm Pfed-Split
Input:    D  original FAERS records
          n  number of clients
          valid_ranges  feature thresholds
Output:   {AT^i_j}  # ADR-based tables per client

1. D_clean ← drop_duplicates(D)
2. for each numeric feature f in D_clean:
       clip f to valid_ranges[f]
3. D_pre ← delete_records_with_nulls(D_clean)
4. Randomly partition D_pre into Split_1, ..., Split_n
5. for i = 1..n:
       for j = 1..m:
           AT^i_j ← { r ∈ Split_i : ADR_label(r) == j }
6. return {AT^i_j : i=1..n, j=1..m}
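A minimal runnable translation of this pseudocode follows. The record layout (dicts with numeric features plus an `adr` label) and helper structure are illustrative assumptions, not the paper's code; null records are checked before clipping to keep the comparison well-defined:

```python
import random
from collections import defaultdict

def pfed_split(records, n, valid_ranges, seed=0):
    """records: list of dicts of numeric features plus an 'adr' class label.
    Returns AT, where AT[i][j] is client i's table for ADR class j."""
    # Steps 1-3: deduplicate, drop records with nulls, clip numeric features.
    seen, clean = set(), []
    for r in records:
        key = tuple(sorted(r.items()))
        if key in seen:
            continue                      # drop exact duplicates
        seen.add(key)
        if any(v is None for v in r.values()):
            continue                      # drop records with missing attributes
        r = dict(r)
        for f, (lo, hi) in valid_ranges.items():
            r[f] = min(max(r[f], lo), hi)  # clip to medically valid range
        clean.append(r)
    # Step 4: random, mutually exclusive partition into n client splits.
    rng = random.Random(seed)
    rng.shuffle(clean)
    splits = [clean[i::n] for i in range(n)]
    # Step 5: ADR-specific tables AT^i_j per client.
    AT = [defaultdict(list) for _ in range(n)]
    for i, split in enumerate(splits):
        for r in split:
            AT[i][r["adr"]].append(r)
    return AT
```

For example, with `valid_ranges={"age": (0, 120)}`, a duplicate record and a null-aged record are dropped, and `age=150` is clipped to 120 before the per-client ADR tables are built.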

5. Empirical Evaluation and Signal Enhancement

Extensive empirical evaluation on FAERS demonstrates that, after Pfed-Split and federated Euclidean-based bias detection:

  • Improved Metrics: ROR and PRR computed on $D_{\text{clean}}$ exhibit higher values and less noise than on the original, uncleaned data.
  • Superior Predictive Performance: On benchmark ADR signal prediction, the PFed-Signal framework achieves accuracy = 0.887, F1 = 0.890, recall = 0.913, and AUC = 0.957, all exceeding baselines (Li et al., 29 Dec 2025).

This suggests that federated bias mitigation is essential in ADR mining, where classical metrics and neural predictors otherwise remain vulnerable to data artifacts.

6. Parameterization, Practical Considerations, and Limitations

The parameter $n$ controls split granularity: higher $n$ increases privacy but reduces the statistical power of each client partition. The threshold $\epsilon$ must be tuned via out-of-sample validation to balance over-filtering against under-filtering. The approach is computationally scalable, requiring only operations linear in both data and model size.
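The out-of-sample tuning of the threshold can be sketched as a simple grid search. The `filter_and_train` and `validate` callables and the toy score values below are purely illustrative stand-ins, not from the paper:

```python
def tune_epsilon(candidates, filter_and_train, validate):
    """Grid-search the bias threshold: for each candidate eps, build the
    filtered training set, fit a model, and keep the eps whose model
    scores best (e.g. F1) on held-out validation data."""
    return max(candidates, key=lambda eps: validate(filter_and_train(eps)))

# Toy stand-in: scores peak at eps = 4, with over-filtering below it
# (too much data discarded) and under-filtering above it (bias leaks in).
toy_scores = {1: 0.70, 2: 0.85, 4: 0.90, 8: 0.80}
best = tune_epsilon(toy_scores, lambda eps: eps, toy_scores.get)
# best -> 4
```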

A plausible implication is that the methodology of partitioning, federated signal cleaning, and Euclidean filtering may extend to other medical data fusion settings facing similar biases. However, tuning $\epsilon$ and careful preprocessing remain critical for achieving optimal performance and avoiding inadvertent exclusion of valid signal.

