Entropy-Based Expertise Estimation
- Entropy-based expertise estimation integrates textual, profile, and citation signals to rank academic experts using unsupervised, probabilistic evidence weighting.
- The methodology quantifies uncertainty via Shannon's entropy and fuses heterogeneous sensor outputs with Dempster–Shafer theory for robust expert ranking.
- Empirical results on large academic datasets demonstrate its competitive performance compared to supervised learning-to-rank methods and traditional aggregation models.
Entropy-based expertise estimation is a principled approach for ranking individuals' expertise in response to a query, utilizing multisensor data fusion, Shannon's entropy, and the Dempster–Shafer theory of evidence. The methodology orchestrates heterogeneous signals—textual content analysis, profile metrics, and citation graph structure—without needing supervised training, resolving sensor disagreement via probabilistic uncertainty weighting. Empirical evaluation demonstrates its efficacy in ranking academic experts, with performance matching supervised learning-to-rank methods and exceeding traditional rank aggregation baselines (Moreira et al., 2013).
1. Multisensor Framework for Expertise Estimation
The framework formalizes three independent expertise estimators ("sensors") extracting diverse forms of evidence:
- Text Sensor: Assesses information retrieval (IR)-style relevance between candidate publications and the query. Extracted features include term frequency, inverse document frequency, BM25, Jaccard similarity, Okapi-BM25 over venues, among others.
- Profile Sensor: Quantifies the candidate's productivity and publication record. Events captured include total publications, journal counts, publication years (overall and query-specific), average publications per year, etc.
- Citation Sensor: Utilizes the citation graph to measure scientific impact. Features include total citations (overall and query-specific), average citations per year, h-index variants (h, g, a, e, contemporary, trend, individual), PageRank scores of candidate’s papers, and number of unique collaborators.
For each sensor $s$, scores for a candidate $c$ are computed across all event types $e \in E_s$, normalized to $[0, 1]$ via min-max normalization. A data-fusion algorithm (e.g., CombSUM) aggregates this per-sensor evidence:

$$\mathrm{CombSUM}_s(c) = \sum_{e \in E_s} \widehat{\mathrm{score}}_s(c, e)$$
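The per-event min–max normalization and CombSUM aggregation can be sketched as follows (a minimal NumPy illustration, not the authors' code; the names `minmax` and `combsum` are invented here):

```python
import numpy as np

def minmax(x):
    """Min-max normalize a score vector to [0, 1]; constant vectors map to 0."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    return (x - x.min()) / span if span > 0 else np.zeros_like(x)

def combsum(raw_scores):
    """raw_scores: (n_candidates, n_events) raw event scores for one sensor.
    Each event column is normalized independently, then summed per candidate."""
    normalized = np.column_stack([minmax(col) for col in raw_scores.T])
    return normalized.sum(axis=1)

# Toy example: 3 candidates, 2 events (e.g. a BM25 score and a citation count).
scores = np.array([[12.0, 3.0],
                   [ 4.0, 9.0],
                   [ 8.0, 7.0]])
print(combsum(scores))  # per-candidate fused evidence for this sensor
```

Normalizing each event column separately keeps events with large raw ranges (e.g. citation counts) from dominating the sum.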
2. Shannon's Entropy Quantification of Sensor Uncertainty
Each sensor’s uncertainty in distinguishing expertise is quantified through Shannon’s entropy. For sensor $s$, candidate set $C$, and event set $E_s$, define:
- $p_s(c, e) = \widehat{\mathrm{score}}_s(c, e) \big/ \sum_{c' \in C} \widehat{\mathrm{score}}_s(c', e)$ if $\sum_{c' \in C} \widehat{\mathrm{score}}_s(c', e) > 0$, else $0$
The entropy:

$$H_s = -\sum_{e \in E_s} \sum_{c \in C} p_s(c, e) \log_2 p_s(c, e)$$

Maximum possible entropy is $H_{\max} = |E_s| \log_2 |C|$. The normalized uncertainty weight for each sensor is:

$$w_s = \frac{H_s}{H_{\max}}$$

This weight reflects the sensor’s evidential ambiguity for candidate expertise assignments.
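The entropy weight can be sketched as follows (an illustrative implementation that treats each event column as a probability distribution over candidates; `sensor_uncertainty_weight` is a name invented here, and at least two candidates are assumed):

```python
import numpy as np

def sensor_uncertainty_weight(norm_scores):
    """norm_scores: (n_candidates, n_events) min-max-normalized scores for one
    sensor. Each event column is converted to a distribution over candidates;
    Shannon entropy is summed over events and divided by the maximum
    n_events * log2(n_candidates), yielding a weight w in [0, 1]."""
    n_cand, n_events = norm_scores.shape
    h = 0.0
    for col in norm_scores.T:
        total = col.sum()
        if total == 0:
            continue  # p(c, e) is defined as 0 when no candidate scores on e
        p = col / total
        p = p[p > 0]  # 0 * log(0) contributes nothing
        h += -(p * np.log2(p)).sum()
    return h / (n_events * np.log2(n_cand))
```

A sensor that scores all candidates identically has maximally ambiguous evidence (w = 1), while one that singles out exactly one candidate per event has w = 0.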
3. Dempster–Shafer Theory for Sensor Evidence Combination
Expertise evidences from different sensors may conflict in their candidate rank assignments. The combination protocol leverages the Dempster–Shafer framework:
- Define the frame of discernment $\Theta = C$, the set of all candidates.
- For each sensor $s$, assign a mass function $m_s$ over $2^{\Theta}$:
  - For singleton $\{c\}$: $m_s(\{c\}) = (1 - w_s)\, P_s(c)$, where $P_s(c)$ is the candidate’s normalized CombSUM score.
  - For ignorance $\Theta$: $m_s(\Theta) = w_s$.
  - Otherwise: $m_s(A) = 0$.

Normalization ensures $\sum_{A \subseteq \Theta} m_s(A) = 1$.
To combine two mass functions $m_1, m_2$, Dempster’s rule gives:
- $(m_1 \oplus m_2)(A) = \dfrac{1}{1 - K} \sum_{B \cap B' = A} m_1(B)\, m_2(B')$ for $A \neq \emptyset$, where the conflict mass is $K = \sum_{B \cap B' = \emptyset} m_1(B)\, m_2(B')$.

Shannon's entropy modulates the mass assigned to uncertainty ($m_s(\Theta) = w_s$), integrating each sensor’s confidence into the fusion.
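Because each sensor's focal elements are only the singletons and $\Theta$, Dempster's rule reduces to a simple closed form: conflict arises only between two different singletons, and a singleton survives intersection with itself or with $\Theta$. A sketch under that assumption (not the paper's code):

```python
import numpy as np

def dempster_combine(m1, theta1, m2, theta2):
    """Combine two mass functions whose focal elements are the singleton
    candidates (vectors m1, m2) and the whole frame Theta (scalars theta1,
    theta2). Returns the fused singleton masses and fused ignorance mass."""
    # Conflict K: mass falling on pairs of *different* singletons.
    K = np.outer(m1, m2).sum() - (m1 * m2).sum()
    norm = 1.0 - K
    # {c} results from {c}∩{c}, {c}∩Theta, or Theta∩{c}.
    fused = (m1 * m2 + m1 * theta2 + theta1 * m2) / norm
    fused_theta = (theta1 * theta2) / norm  # Theta survives only Theta∩Theta
    return fused, fused_theta

# Two sensors, two candidates, half of each sensor's mass on ignorance.
m, t = dempster_combine(np.array([0.3, 0.2]), 0.5,
                        np.array([0.4, 0.1]), 0.5)
print(m, t)  # fused masses still sum to 1 with the ignorance mass
```

The fused masses again sum to one, so the combination can be chained across all three sensors.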
4. Algorithmic Realization and Final Expertise Ranking
The procedure for query $q$ consists of:
- Candidate Retrieval: Assemble $C_q$, all authors with at least one document matching $q$’s terms.
- Sensor Processing (for each sensor $s$):
  - Compute raw event scores $\mathrm{score}_s(c, e)$ for all $c \in C_q$, $e \in E_s$.
  - Normalize each event’s scores to $[0, 1]$ by min–max.
  - Aggregate into $\mathrm{CombSUM}_s(c)$ per candidate.
  - Normalize CombSUM scores: $P_s(c) = \mathrm{CombSUM}_s(c) \big/ \sum_{c' \in C_q} \mathrm{CombSUM}_s(c')$.
  - Compute $w_s = H_s / H_{\max}$.
  - Assign $m_s(\{c\}) = (1 - w_s)\, P_s(c)$ and $m_s(\Theta) = w_s$.
- Sensor Fusion: Combine $m_{\text{text}} \oplus m_{\text{profile}} \oplus m_{\text{citation}}$ via Dempster’s rule.
- Ranking: The final expertise score for candidate $c$ is $m(\{c\}) = (m_{\text{text}} \oplus m_{\text{profile}} \oplus m_{\text{citation}})(\{c\})$. Rank candidates in descending order of this score.
This closed form yields the fused belief $m(\{c\})$, computed through successive Dempster–Shafer combinations.
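The steps above can be sketched end-to-end (a hedged illustration under the notation above, not the authors' implementation; helper names such as `sensor_masses` and `rank_experts` are invented, and tiny raw-score matrices stand in for the real textual, profile, and citation features):

```python
import numpy as np

def minmax(x):
    span = x.max() - x.min()
    return (x - x.min()) / span if span > 0 else np.zeros_like(x)

def sensor_masses(raw_scores):
    """raw_scores: (n_candidates, n_events) raw event scores for one sensor.
    Returns singleton masses m({c}) = (1 - w) * P(c) and ignorance m(Theta) = w,
    with w the normalized Shannon-entropy uncertainty weight."""
    norm = np.column_stack([minmax(col) for col in raw_scores.T])
    comb = norm.sum(axis=1)
    P = comb / comb.sum()                    # normalized CombSUM scores
    n_cand, n_events = norm.shape
    h = 0.0
    for col in norm.T:
        if col.sum() == 0:
            continue
        q = col / col.sum()
        q = q[q > 0]
        h += -(q * np.log2(q)).sum()
    w = h / (n_events * np.log2(n_cand))     # uncertainty weight in [0, 1]
    return (1.0 - w) * P, w

def dempster(m1, t1, m2, t2):
    """Dempster's rule for singleton-plus-Theta mass functions."""
    K = np.outer(m1, m2).sum() - (m1 * m2).sum()
    fused = (m1 * m2 + m1 * t2 + t1 * m2) / (1.0 - K)
    return fused, t1 * t2 / (1.0 - K)

def rank_experts(text_scores, profile_scores, citation_scores):
    """Fuse the three sensors and return candidate indices, best first."""
    m, t = sensor_masses(text_scores)
    for raw in (profile_scores, citation_scores):
        m2, t2 = sensor_masses(raw)
        m, t = dempster(m, t, m2, t2)
    return np.argsort(-m)

# Toy query with 3 candidates; columns are per-sensor events.
text     = np.array([[5., 1.], [2., 4.], [1., 1.]])
profile  = np.array([[3., 2.], [1., 1.], [2., 3.]])
citation = np.array([[10.], [2.], [5.]])
print(rank_experts(text, profile, citation))  # candidate indices, best first
```

Note how the profile sensor, whose two events give contradictory, high-entropy evidence, contributes more of its mass to $\Theta$ and therefore influences the final ranking less.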
5. Experimental Evaluation and Comparative Performance
Empirical assessment used two datasets:
- Proximity DBLP: 456,704 authors, 743,349 publications, 112,303 citations, no abstracts.
- Enriched DBLP (ArnetMiner): 1,033,050 authors, 1,632,440 publications, 2,327,450 citations, 653,514 abstracts.
Test queries comprised 13 Computer Science topics, with candidate pools of 400 authors per query and expert relevance judgments. Main metrics included Precision@$k$ ($k = 5, 10, 15, 20$) and Mean Average Precision (MAP), with statistical significance evaluated via a two-sided randomization test.
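For reference, the two metrics follow their standard definitions, sketched below; MAP is the mean of average precision over the 13 queries:

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked candidates that are relevant experts."""
    return sum(1 for c in ranked[:k] if c in relevant) / k

def average_precision(ranked, relevant):
    """Mean of precision@rank taken at each relevant candidate's position."""
    hits, total = 0, 0.0
    for i, c in enumerate(ranked, start=1):
        if c in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant)

ranked = ["a", "b", "c", "d", "e"]     # hypothetical ranking for one query
relevant = {"a", "c", "e"}             # hypothetical expert judgments
print(precision_at_k(ranked, relevant, 5))   # 0.6
print(average_precision(ranked, relevant))   # (1/1 + 2/3 + 3/5) / 3
```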
Key results:
| Dataset | Best Fusion | P@5 | MAP | Baseline MAP | Supervised MAP |
|---|---|---|---|---|---|
| Proximity DBLP | D-S + Condorcet | 0.7538 | 0.4905 | CombSUM 0.3027 | — |
| Proximity DBLP | Text + Citation | — | 0.5443 | — | — |
| Enriched DBLP | D-S + Condorcet | 0.6308 | 0.4055 | Condorcet 0.2773 | SVMmap 0.4068 |
| Enriched DBLP | Text + Profile | — | 0.4530 | Model 1 0.2715 | SVMrank 0.4289 |
CombSUM, Condorcet, and Balog’s expert finding baselines were outperformed by the proposed method, which matched the effectiveness of supervised SVMmap and SVMrank algorithms. Notably, Dempster–Shafer + entropy fusion did not require labeled training data.
6. Context, Implications, and Capabilities
The entropy-based multisensor expert estimation framework demonstrates that incorporating both uncertainty quantification and principled evidence fusion robustly resolves conflicting signals from heterogeneous academic data sources (Moreira et al., 2013). The assignment of entropy-weighted ignorance mass admits the intrinsic limitations or disagreement of each sensor into the final ranking, increasing reliability.
A plausible implication is that the method offers resilience against overfitting or adversarial profile skew, given its unsupervised nature and explicit accounting for uncertainty. Furthermore, the ability to aggregate diverse indicators (document relevance, citation influence, career productivity) in a mathematically coherent manner suggests generalizability to broader expert-finding contexts.
7. Related Methodologies and Distinctions
Compared to standard rank aggregation (CombSUM, Condorcet) and candidate/document-based probabilistic models (Balog et al.), entropy-based multisensor fusion uniquely incorporates sensor-level uncertainty via Shannon’s entropy and leverages Dempster–Shafer evidence theory for combination. Supervised learning-to-rank approaches (SVMmap, SVMrank) require labeled data, whereas the described methodology achieves comparable performance absent explicit relevance supervision.
This suggests that entropy-based expertise estimation can serve either as a standalone ranking mechanism where training data are limited, or as a complementary signal within ensemble expert finding systems.