
SRank: Functional Overlap Reranking

Updated 29 January 2026
  • Functional Overlap Reranking (SRank) is a suite of post-processing algorithms that reorder candidate outputs by measuring overlaps in salient features such as named entities in QA or functional outputs in code generation.
  • The approach operates without retraining the base model by leveraging simple yet effective overlap metrics, applying named-entity intersection in QA and functional agreement across test inputs in code generation.
  • Experimental evaluations show that SRank achieves significant performance gains, including a pass@1 improvement of about 6.1% over other rerankers, while enhancing adversarial robustness in natural language and code tasks.

Functional Overlap Reranking (SRank) encompasses a family of model-agnostic post-processing algorithms that reorder candidate outputs—such as answer spans in adversarial question answering or clusters in neural code generation—based on quantified overlaps of salient features or behaviors. Most notably, SRank leverages either named-entity overlap between candidate sentences and questions in natural language QA (Majumder et al., 2021) or functional agreement between output clusters of sampled programs on test inputs in code generation (To et al., 2023). These approaches require no retraining or modification of the underlying generator models and achieve state-of-the-art improvements by exploiting distinct, often shallow, yet highly effective interaction signals among candidates.

1. Formal Definitions of SRank

SRank instantiates specific forms depending on the application domain:

a) Question Answering (QA) (Majumder et al., 2021):

Let $Q$ denote the input question, $C$ the context, and $M$ a span-based QA model. $M$ generates a set $A = \{ (s_1, e_1), ..., (s_n, e_n) \}$ of $n$ candidate answer spans, where each $(s_k, e_k)$ anchors an answer span $e_k$ in sentence $s_k$. For each candidate, a named-entity tagger extracts $\mathrm{NER}(Q)$ and $\mathrm{NER}(s_k)$. The reranking score is defined as

$$\mathrm{SRank}(Q, (s_k, e_k)) = |\mathrm{NER}(s_k) \cap \mathrm{NER}(Q)|.$$

Candidates are reranked in descending order of this intersection size; ties are broken by the original model order.
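
The scoring and tie-breaking rule above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `srank_qa`, the hand-written entity sets, and the example sentences are all hypothetical, and the entity sets are assumed to have been produced by some tagger beforehand.

```python
from typing import List, Set, Tuple

Candidate = Tuple[str, str]  # (sentence s_k, answer span e_k)


def srank_qa(question_entities: Set[str],
             candidates: List[Candidate],
             sentence_entities: List[Set[str]]) -> List[Candidate]:
    """Rerank candidates by |NER(s_k) ∩ NER(Q)|, largest first."""
    scores = [len(ents & question_entities) for ents in sentence_entities]
    # sorted() is stable, so candidates with equal scores keep the model's order.
    order = sorted(range(len(candidates)), key=lambda k: -scores[k])
    return [candidates[k] for k in order]


# Hypothetical data: entity sets are written by hand, not produced by a tagger.
question_entities = {"Tesla", "Prague"}
candidates = [
    ("Jeff Dean moved to Chicago in 1881.", "1881"),  # distractor, model rank 1
    ("Tesla moved to Prague in 1880.", "1880"),
    ("Tesla later left the city.", "the city"),
    ("Dean visited Prague once.", "once"),
]
sentence_entities = [
    {"Jeff Dean", "Chicago"},
    {"Tesla", "Prague"},
    {"Tesla"},
    {"Dean", "Prague"},
]
reranked = srank_qa(question_entities, candidates, sentence_entities)
```

In this toy case the distractor that the model ranked first drops to last place because its sentence shares no entity with the question, while the two candidates with one shared entity keep their original relative order.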

b) Neural Code Generation (To et al., 2023):

Given a natural-language programming specification $c$, a code LLM samples $N$ solutions $S = \{s_1, ..., s_N\}$ and $M$ test inputs $Z = \{z_1, ..., z_M\}$, and each solution is executed on each test input to obtain its outputs. Solutions are then clustered by behavior: each cluster $C_p$ is a set of code samples whose output vectors on $Z$ are identical. For $K$ clusters $C_1, ..., C_K$, the interaction matrix $I \in \mathbb{R}^{K \times K}$ records functional overlap:

$$I_{ij} = \frac{1}{M} \sum_{k=1}^{M} \mathbf{1}[o_{i,k} = o_{j,k}]$$

where $o_{i,k}$ is the (deterministic) output of all members of $C_i$ on $z_k$. The SRank score for cluster $C_i$ aggregates overlap-weighted cluster features $V_j$:

$$R_i = \sum_{j=1}^{K} I_{ij} V_j.$$

The highest-ranked cluster yields the solution returned.
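
A small worked example may help make the score concrete. The numbers are invented for illustration: assume $K = 3$ clusters and $M = 4$ test inputs, take $V_j$ to be the cluster size, and suppose $C_1$ (size 5) agrees with no other cluster on any input while $C_2$ (size 4) and $C_3$ (size 3) agree on 3 of the 4 inputs.

```latex
\begin{aligned}
I &= \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0.75 \\ 0 & 0.75 & 1 \end{pmatrix},
\qquad V = (5,\, 4,\, 3)^{\top} \\
R_1 &= 1 \cdot 5 + 0 \cdot 4 + 0 \cdot 3 = 5 \\
R_2 &= 0 \cdot 5 + 1 \cdot 4 + 0.75 \cdot 3 = 6.25 \\
R_3 &= 0 \cdot 5 + 0.75 \cdot 4 + 1 \cdot 3 = 6
\end{aligned}
```

Under these assumed values $C_2$ is ranked first even though $C_1$ is the largest cluster, because $C_2$ also receives support from the partially agreeing cluster $C_3$.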

2. Overlap Measures and Algorithms

a) Named-Entity Overlap (QA):

SRank exclusively uses the named-entity intersection:

$$\mathrm{overlap}_k = |\mathrm{NER}(s_k) \cap \mathrm{NER}(Q)|.$$

No additional weightings, normalizations, or alternative overlaps (n-gram, TF-IDF, embedding similarity) are used in the evaluated pipeline, though the framework is general (Majumder et al., 2021).

b) Functional Agreement (Code):

Clusters are defined by equivalence on output vectors across test cases. Pairwise overlap is the normalized count of test cases for which two clusters agree:

$$\mathrm{Overlap}(C_i, C_j) = \frac{1}{M} \sum_{k=1}^{M} \mathbf{1}[o_{i,k} = o_{j,k}]$$

leading to a matrix $I$ that models inter-cluster functional similarity.
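
As a minimal sketch of this pairwise measure, the function below builds the matrix from one output vector per cluster. The name `interaction_matrix` and the sample vectors are hypothetical, and outputs are assumed to be comparable with `==`.

```python
from typing import List, Sequence


def interaction_matrix(outputs: Sequence[Sequence[object]]) -> List[List[float]]:
    """outputs[i][k] is the output of cluster i on test input k."""
    num_tests = len(outputs[0])
    if num_tests == 0 or any(len(vec) != num_tests for vec in outputs):
        raise ValueError("every cluster needs the same non-zero number of outputs")
    return [
        # Fraction of test inputs on which clusters i and j produce equal outputs.
        [sum(a == b for a, b in zip(vec_i, vec_j)) / num_tests for vec_j in outputs]
        for vec_i in outputs
    ]


# Hypothetical output vectors for three clusters on four test inputs.
cluster_outputs = [
    [1, 2, 3, 4],
    [1, 2, 3, 5],
    [9, 9, 9, 9],
]
I = interaction_matrix(cluster_outputs)
```

The matrix is symmetric with ones on the diagonal, since every cluster agrees with itself on all test inputs.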

3. Pipelines and Implementation Details

a) QA Pipeline (Majumder et al., 2021):

  • Predict nn candidate spans with the base QA model.
  • For each, extract the containing sentence and run an NER tagger (AllenNLP’s ELMo-based CoNLL-2003).
  • Compute the set intersection $\mathrm{NER}(s_k) \cap \mathrm{NER}(Q)$ for each candidate.
  • Sort candidates by intersection size, then by original model rank.
  • Return the highest-ranked span.

Common settings: BiDAF and BERT as base models; beam size $n = 10$; max input 400 tokens; per-example reranking adds $\mathcal{O}(nL)$ NER overhead, negligible versus the model's forward pass.
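
The sentence-extraction and tagging steps of this pipeline might look roughly as follows. Everything here is a stand-in: the regex sentence splitter, the `toy_entities` function (capitalised tokens minus a small stop list), and the example context are assumptions for illustration only, and a real pipeline would call a trained NER tagger instead.

```python
import re
from typing import List, Set, Tuple

# Tiny stop list so that capitalised function words are not treated as entities.
STOP = {"when", "what", "who", "where", "which", "how", "the", "a", "an", "later"}


def split_sentences(context: str) -> List[Tuple[int, int]]:
    """Naive splitter: character ranges that end at '.', '!' or '?'."""
    spans, start = [], 0
    for match in re.finditer(r"[.!?](?:\s+|$)", context):
        spans.append((start, match.end()))
        start = match.end()
    if start < len(context):
        spans.append((start, len(context)))
    return spans


def containing_sentence(context: str, span_start: int) -> str:
    """Return the sentence that contains the character offset span_start."""
    for begin, end in split_sentences(context):
        if begin <= span_start < end:
            return context[begin:end].strip()
    raise ValueError("span start lies outside the context")


def toy_entities(text: str) -> Set[str]:
    """Stand-in for a real NER tagger (assumption): capitalised non-stop tokens."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    return {tok for tok in tokens if tok[0].isupper() and tok.lower() not in STOP}


context = "Tesla moved to Prague in 1880. Later Jeff Dean moved to Chicago in 1881."
question = "When did Tesla move to Prague?"
span_starts = [context.index("1881"), context.index("1880")]  # model order

sentences = [containing_sentence(context, s) for s in span_starts]
overlaps = [len(toy_entities(s) & toy_entities(question)) for s in sentences]
```

Sorting the candidates by `overlaps` in descending order, with ties left in model order, then yields the reranked list.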

b) Code Generation Pipeline (To et al., 2023):

  • Sample $N$ solutions and $M$ test inputs (e.g., $N = M = 100$).
  • Execute all $s_i$ on all $z_k$ to get output vectors.
  • Cluster solutions by identical output vectors across $Z$.
  • Construct the $K \times K$ interaction matrix $I$ of overlaps.
  • Define $V_j$ as a cluster feature (e.g., cluster size or pass rate).
  • Calculate $R_i = \sum_{j=1}^{K} I_{ij} V_j$, sort clusters, and extract a representative solution from the winner.

Computational cost is dominated by $O(NM)$ executions; $K \ll N$ in most cases.
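
The steps above can be sketched end to end on toy data. This is an illustration under several assumptions rather than the published implementation: sampled programs are modelled as plain Python callables, `run_safely` and `srank_code` are hypothetical names, $V_j$ is taken to be the cluster size, and real generated code would need sandboxed execution with timeouts.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple


def run_safely(program: Callable, test_input: object) -> str:
    """Run one program on one input; map exceptions to a marker string."""
    try:
        return repr(program(test_input))
    except Exception as exc:  # a crash still counts as observable behaviour
        return f"<error: {type(exc).__name__}>"


def srank_code(programs: Sequence[Callable],
               test_inputs: Sequence[object]) -> Tuple[int, List[float]]:
    """Return (index of a representative of the best cluster, cluster scores)."""
    # Execute all programs on all inputs and cluster by identical output vector.
    clusters: Dict[Tuple[str, ...], List[int]] = defaultdict(list)
    for idx, program in enumerate(programs):
        vector = tuple(run_safely(program, z) for z in test_inputs)
        clusters[vector].append(idx)

    vectors = list(clusters)
    sizes = [len(clusters[v]) for v in vectors]  # assumed feature: V_j = cluster size
    num_tests = len(test_inputs)

    # R_i = sum_j I_ij * V_j, with I_ij the fraction of inputs where i and j agree.
    scores = []
    for vec_i in vectors:
        score = 0.0
        for vec_j, size in zip(vectors, sizes):
            agreement = sum(a == b for a, b in zip(vec_i, vec_j)) / num_tests
            score += agreement * size
        scores.append(score)

    best = max(range(len(vectors)), key=scores.__getitem__)
    return clusters[vectors[best]][0], scores


# Hypothetical samples for the task "return the absolute value of x".
programs = [
    lambda x: abs(x),
    lambda x: x if x >= 0 else -x,
    lambda x: x,   # wrong for negative inputs
    lambda x: -x,  # wrong for positive inputs
]
best_index, scores = srank_code(programs, test_inputs=[-2, 0, 3])
```

Here the two correct samples form one cluster of size 2, and each buggy singleton cluster still agrees with it on two of the three inputs, so the correct cluster receives the highest score.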

4. Experimental Evaluation and Comparative Results

Code generation results, pass@1 (%); "-" marks a value missing from the source:

Model               Greedy   CodeT   Coder-Reviewer   SRank
WizardCoder34B      68.90    72.36   -                75.31
CodeGen2.5-Instr.   28.05    56.81   45.63            60.55
StarCoder           39.63    50.51   38.71            53.99
Codex002            47.00    65.80   66.90            69.66

SRank achieves an average pass@1 improvement of ≈6.1% over the CodeT and Coder-Reviewer rerankers.

QA Results (Majumder et al., 2021):

  • On Adversarial SQuAD (AddSent), BiDAF+SRank achieves 45.4 F1 / 38.0 EM (vs. vanilla BiDAF, 21.4 / 16.0, and BiDAF+SLN, 22.8 / 17.2).
  • For BERT, gains are similar (BERT+SRank: 61.2/53.6 vs. BERT+QAInfoMax, 41.8/37.2).
  • On clean data, SRank-only BERT can experience minor drops due to answer type mismatch but yields strong net improvements in adversarial settings.

5. Analysis, Robustness, and Practical Trade-offs

a) Adversarial Robustness (QA):

SRank exploits that adversarial distractors in context rarely share named entities with the question, allowing span filtering without any retraining. This model-agnostic and post-hoc property makes it applicable to any span-predicting reader.

b) Robustness (Code):

Functional overlap amplifies the consensus among correct solutions: genuinely correct clusters tend to agree on outputs, while “buggy” clusters disagree. Especially under small $N$ and $M$, this consensus effect outperforms single-cluster features and reduces noise in cluster scoring.

c) Limitations:

  • In QA, SRank may fail if all candidate sentences with shared entities lack the correct answer or if entity overlap does not imply answer type match.
  • In code, the approach assumes perfect clustering by exact outputs; richer semantic (canonical) clustering may further improve effectiveness.
  • Results are presented for English QA and Python code only; generalization to other languages or domains is not yet demonstrated.

6. Extensions, Limitations, and Future Perspectives

SRank’s design is agnostic to the underlying overlap metric and to the cluster features. While only named-entity and functional overlap are employed in the respective evaluated domains, the general framework could incorporate richer content matching (n-grams, TF-IDF, embeddings), additional verifiers, or LLM log-probabilities as cluster features. Scaling SRank to massive candidate sets may require approximations or subset sampling. Its strong empirical performance underscores both the potential of consensus-based reranking and possible vulnerabilities in current adversarial benchmarks, motivating future defenses premised on deeper semantic or pragmatic reasoning beyond entity or functional overlap (Majumder et al., 2021; To et al., 2023).
