
Entropy-Based Expertise Estimation

Updated 28 December 2025
  • Entropy-based expertise estimation integrates textual, profile, and citation signals to rank academic experts using unsupervised, probabilistic evidence weighting.
  • The methodology quantifies uncertainty via Shannon's entropy and fuses heterogeneous sensor outputs with Dempster–Shafer theory for robust expert ranking.
  • Empirical results on large academic datasets demonstrate its competitive performance compared to supervised learning-to-rank methods and traditional aggregation models.

Entropy-based expertise estimation is a principled approach for ranking individuals' expertise in response to a query, utilizing multisensor data fusion, Shannon's entropy, and the Dempster–Shafer theory of evidence. The methodology orchestrates heterogeneous signals—textual content analysis, profile metrics, and citation graph structure—without needing supervised training, resolving sensor disagreement via probabilistic uncertainty weighting. Empirical evaluation demonstrates its efficacy in ranking academic experts, with performance matching supervised learning-to-rank methods and exceeding traditional rank aggregation baselines (Moreira et al., 2013).

1. Multisensor Framework for Expertise Estimation

The framework formalizes three independent expertise estimators ("sensors") extracting diverse forms of evidence:

  • Text Sensor: Assesses information retrieval (IR)-style relevance between candidate publications and the query. Extracted features include term frequency, inverse document frequency, BM25, Jaccard similarity, Okapi-BM25 over venues, among others.
  • Profile Sensor: Quantifies the candidate's productivity and publication record. Events captured include total publications, journal counts, publication years (overall and query-specific), average publications per year, etc.
  • Citation Sensor: Utilizes the citation graph to measure scientific impact. Features include total citations (overall and query-specific), average citations per year, h-index variants (h, g, a, e, contemporary, trend, individual), PageRank scores of candidate’s papers, and number of unique collaborators.

For each sensor $S$, scores for a candidate $c$ are computed across all event types $e \in E_S$ and normalized to $[0,1]$ via min–max normalization. A data-fusion algorithm (e.g., CombSUM) then aggregates the per-sensor evidence:

$$\text{CombSUM}_S(c, q) = \sum_{e \in E_S} \text{score}_{e,\text{norm}}(c, q)$$
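As a concrete sketch, the per-sensor aggregation can be written as follows (illustrative Python; the function names, candidate names, and event scores are invented for the example, not taken from the paper):

```python
def min_max_normalize(scores):
    """Scale a {candidate: raw_score} map into [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:                       # no spread: nothing to normalize
        return {c: 0.0 for c in scores}
    return {c: (s - lo) / (hi - lo) for c, s in scores.items()}

def comb_sum(event_scores):
    """event_scores: {event: {candidate: raw_score}} for one sensor.
    Returns the summed normalized score per candidate (CombSUM)."""
    totals = {}
    for scores in event_scores.values():
        for c, s in min_max_normalize(scores).items():
            totals[c] = totals.get(c, 0.0) + s
    return totals

events = {
    "bm25":      {"alice": 12.0, "bob": 4.0, "carol": 0.0},
    "term_freq": {"alice": 30.0, "bob": 10.0, "carol": 5.0},
}
print(comb_sum(events))  # alice tops both events, so her total is 2.0
```

Normalizing per event before summing keeps any single high-magnitude feature (e.g., raw term frequency) from dominating the aggregate.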

2. Shannon's Entropy Quantification of Sensor Uncertainty

Each sensor’s uncertainty in distinguishing expertise is quantified through Shannon’s entropy. For sensor $S$, candidate set $A$, and event set $E_S$, define:

  • $\text{relevantEvent}(e, a) = 1$ if $\text{score}_e(a) > 0$, else $0$
  • $p(e,a) = \frac{\text{relevantEvent}(e,a)}{|A| \cdot |E_S|}$

The entropy:

$$H(S) = -\sum_{a \in A} \sum_{e \in E_S} p(e,a)\,\log_2 p(e,a)$$

The maximum possible entropy is $\max H(S) = \log_2(|A| \cdot |E_S|)$. The normalized uncertainty weight for each sensor is:

$$w_S = H(S) / \max H(S) \in [0,1]$$

This weight reflects the sensor’s evidential ambiguity for candidate expertise assignments.
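A minimal sketch of this weight computation (illustrative Python; the function and variable names are mine, not the paper's):

```python
import math

def sensor_weight(event_scores):
    """Entropy-based uncertainty weight w_S for one sensor.
    event_scores: {event: {candidate: raw_score}}. Returns w_S in [0, 1]."""
    candidates = {c for scores in event_scores.values() for c in scores}
    n_cells = len(candidates) * len(event_scores)   # |A| * |E_S|
    p = 1.0 / n_cells                               # p(e, a) for a relevant cell
    entropy = 0.0
    for scores in event_scores.values():
        for c in candidates:
            if scores.get(c, 0.0) > 0:              # relevantEvent(e, a) = 1
                entropy -= p * math.log2(p)
    return entropy / math.log2(n_cells)             # divide by max H(S)

events = {
    "bm25": {"a": 2.0, "b": 0.0, "c": 1.0},
    "tf":   {"a": 5.0, "b": 3.0, "c": 0.0},
}
print(sensor_weight(events))  # 4 of 6 (event, candidate) cells fire: 2/3
```

Note that under this definition the weight reduces to the fraction of (event, candidate) cells with positive scores, so a sensor that fires indiscriminately carries maximal ambiguity.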

3. Dempster–Shafer Theory for Sensor Evidence Combination

Expertise evidence from different sensors may conflict in its candidate rank assignments. The combination protocol leverages the Dempster–Shafer framework:

  • Define the frame of discernment $\Theta = \{\text{all candidates}\}$.
  • For each sensor $S$, assign a mass function $m_S$ over $2^\Theta$:
    • For each singleton $\{c\}$: $m_S(\{c\}) = \text{fusion}_{S,\text{norm}}(c)$
    • For ignorance $\Theta$: $m_S(\Theta) = w_S$
    • For all other $X \subseteq \Theta$: $m_S(X) = 0$

Normalization ensures $\sum_{c} m_S(\{c\}) + m_S(\Theta) = 1$.

To combine two mass functions $m_1$ and $m_2$:

  • $K = \sum_{B \cap C = \emptyset} m_1(B)\,m_2(C)$
  • $(m_1 \oplus m_2)(A) = \frac{1}{1-K} \sum_{B \cap C = A} m_1(B)\,m_2(C)$ for $A \neq \emptyset$; $(m_1 \oplus m_2)(\emptyset) = 0$

Shannon's entropy modulates the mass assigned to uncertainty ($m_S(\Theta) = w_S$), integrating each sensor’s confidence into the fusion.
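Because each $m_S$ is nonzero only on singletons and on $\Theta$, Dempster's rule simplifies considerably. A sketch under that restriction (illustrative Python; the representation, a pair of a singleton-mass dict and a $\Theta$ mass, is my own):

```python
def combine(m1, m2):
    """Dempster's rule for mass functions supported on singletons and Θ.
    Each argument is a pair ({candidate: mass}, theta_mass)."""
    s1, t1 = m1
    s2, t2 = m2
    cands = set(s1) | set(s2)
    total2 = sum(s2.values())
    # Conflict K: two *different* singletons intersect to the empty set.
    k = sum(s1.get(c, 0.0) * (total2 - s2.get(c, 0.0)) for c in cands)
    norm = 1.0 - k
    combined = {
        c: (s1.get(c, 0.0) * s2.get(c, 0.0)   # {c} ∩ {c} = {c}
            + s1.get(c, 0.0) * t2             # {c} ∩ Θ   = {c}
            + t1 * s2.get(c, 0.0)) / norm     # Θ   ∩ {c} = {c}
        for c in cands
    }
    return combined, t1 * t2 / norm           # Θ ∩ Θ = Θ

m1 = ({"a": 0.6, "b": 0.3}, 0.1)
m2 = ({"a": 0.5, "b": 0.2}, 0.3)
singles, theta = combine(m1, m2)
print(singles, theta)  # masses still sum to 1; agreement on "a" is reinforced
```

The normalization by $1-K$ redistributes the conflicting mass, so agreeing sensors reinforce a candidate while a high-uncertainty sensor (large $\Theta$ mass) defers to the others.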

4. Algorithmic Realization and Final Expertise Ranking

The procedure for query $q$ consists of:

  1. Candidate Retrieval: Assemble $A$, the set of all authors with at least one document matching the query terms.
  2. Sensor Processing:
    • Compute raw event scores $\text{score}_e(c)$ for all $e \in E_S$, $c \in A$.
    • Normalize each event’s scores to $[0,1]$ by min–max.
    • Aggregate into a CombSUM score per candidate.
    • Compute $H(S)$, $\max H(S)$, and $w_S$.
    • Normalize the CombSUM scores so that $\sum_c \text{CombSUM}_S(c) = 1 - w_S$.
    • Assign $m_S(\{c\}) = \text{CombSUM}_S(c)$ and $m_S(\Theta) = w_S$.
  3. Sensor Fusion:
    • Combine $m_{\text{Text}}$, $m_{\text{Profile}}$, and $m_{\text{Citation}}$ via Dempster’s rule: $m_{TP} = m_{\text{Text}} \oplus m_{\text{Profile}}$, then $m_{\text{final}} = m_{TP} \oplus m_{\text{Citation}}$.
  4. Ranking: The final expertise score for candidate $c$ is $m_{\text{final}}(\{c\})$. Rank candidates in descending order of this score.

This procedure yields the closed form $m_{\text{comb}}(\{c\}) = (m_1 \oplus \cdots \oplus m_n)(\{c\})$, computed through successive Dempster–Shafer combinations.
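The full pipeline can be sketched end-to-end on toy data (illustrative Python; sensor contents, event names, and scores are invented, and the Dempster combination is specialized to singleton-plus-$\Theta$ masses):

```python
import math
from functools import reduce

def sensor_mass(event_scores):
    """Step 2 for one sensor: produce masses m_S({c}) and m_S(Θ) = w_S."""
    cands = {c for s in event_scores.values() for c in s}
    n = len(cands) * len(event_scores)           # |A| * |E_S|
    fired = sum(1 for s in event_scores.values()
                for c in cands if s.get(c, 0.0) > 0)
    w = fired / n        # H(S)/max H(S) simplifies to this fraction
    totals = dict.fromkeys(cands, 0.0)           # min-max + CombSUM
    for s in event_scores.values():
        vals = [s.get(c, 0.0) for c in cands]
        lo, hi = min(vals), max(vals)
        for c in cands:
            totals[c] += 0.0 if hi == lo else (s.get(c, 0.0) - lo) / (hi - lo)
    z = sum(totals.values())
    return {c: (1 - w) * v / z for c, v in totals.items()}, w

def combine(m1, m2):
    """Dempster's rule for singleton-plus-Θ mass functions."""
    (s1, t1), (s2, t2) = m1, m2
    cands = set(s1) | set(s2)
    k = sum(s1.get(c, 0.0) * (sum(s2.values()) - s2.get(c, 0.0))
            for c in cands)
    singles = {c: (s1.get(c, 0.0) * s2.get(c, 0.0) + s1.get(c, 0.0) * t2
                   + t1 * s2.get(c, 0.0)) / (1 - k) for c in cands}
    return singles, t1 * t2 / (1 - k)

sensors = [                                      # Text, Profile, Citation
    {"bm25":  {"alice": 3.0,  "bob": 1.0, "carol": 0.0}},
    {"pubs":  {"alice": 10.0, "bob": 8.0, "carol": 0.0}},
    {"cites": {"alice": 50.0, "bob": 5.0, "carol": 0.0}},
]
final, _ = reduce(combine, (sensor_mass(s) for s in sensors))
print(sorted(final, key=final.get, reverse=True))  # ['alice', 'bob', 'carol']
```

With every sensor favoring the same candidate, the chained combination concentrates singleton mass on her while the per-sensor $\Theta$ masses shrink multiplicatively.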

5. Experimental Evaluation and Comparative Performance

Empirical assessment used two datasets:

  • Proximity DBLP: 456,704 authors, 743,349 publications, 112,303 citations, no abstracts.
  • Enriched DBLP (ArnetMiner): 1,033,050 authors, 1,632,440 publications, 2,327,450 citations, 653,514 abstracts.

Test queries comprised 13 Computer Science topics, with candidate pools of 400 authors per query and expert relevance judgments. Main metrics included Precision@$k$ ($k = 5, 10, 15, 20$) and Mean Average Precision (MAP), with statistical significance evaluated via a two-sided randomization test ($\alpha = 0.10$).

Key results:

| Dataset | Best Fusion | P@5 | MAP | Baseline (MAP) | Supervised (MAP) |
|---|---|---|---|---|---|
| Proximity DBLP | D-S + Condorcet | 0.7538 | 0.4905 | CombSUM 0.3027 | |
| | Text+Citation | | 0.5443 | | |
| Enriched DBLP | D-S + Condorcet | 0.6308 | 0.4055 | Condorcet 0.2773 | SVMmap 0.4068 |
| | Text+Profile | | 0.4530 | Model 1 0.2715 | SVMrank 0.4289 |

The proposed method outperformed the CombSUM, Condorcet, and Balog expert-finding baselines while matching the effectiveness of the supervised SVMmap and SVMrank algorithms. Notably, Dempster–Shafer + entropy fusion required no labeled training data.

6. Context, Implications, and Capabilities

The entropy-based multisensor expert estimation framework demonstrates that incorporating both uncertainty quantification and principled evidence fusion robustly resolves conflicting signals from heterogeneous academic data sources (Moreira et al., 2013). The assignment of entropy-weighted ignorance mass incorporates each sensor’s intrinsic limitations and disagreements into the final ranking, increasing reliability.

A plausible implication is that the method offers resilience against overfitting or adversarial profile skew, given its unsupervised nature and explicit accounting for uncertainty. Furthermore, the ability to aggregate diverse indicators (document relevance, citation influence, career productivity) in a mathematically coherent manner suggests generalizability to broader expert-finding contexts.

Compared to standard rank aggregation (CombSUM, Condorcet) and candidate/document-based probabilistic models (Balog et al.), entropy-based multisensor fusion uniquely incorporates sensor-level uncertainty via Shannon’s entropy and leverages Dempster–Shafer evidence theory for combination. Supervised learning-to-rank approaches (SVMmap, SVMrank) require labeled data, whereas the described methodology achieves comparable performance absent explicit relevance supervision.

This suggests that entropy-based expertise estimation can serve either as a standalone ranking mechanism where training data are limited, or as a complementary signal within ensemble expert finding systems.
