SRank: Functional Overlap Reranking
- Functional Overlap Reranking (SRank) is a suite of post-processing algorithms that reorder candidate outputs by measuring overlaps in salient features such as named entities in QA or functional outputs in code generation.
- The approach operates without retraining the base model by leveraging simple yet effective overlap metrics, applying named-entity intersection in QA and functional agreement across test inputs in code generation.
- Experimental evaluations show that SRank achieves significant performance gains, including a pass@1 improvement of about 6.1% over other rerankers, while enhancing adversarial robustness in natural language and code tasks.
Functional Overlap Reranking (SRank) encompasses a family of model-agnostic post-processing algorithms that reorder candidate outputs—such as answer spans in adversarial question answering or clusters in neural code generation—based on quantified overlaps of salient features or behaviors. Most notably, SRank leverages either named-entity overlap between candidate sentences and questions in natural language QA (Majumder et al., 2021) or functional agreement between output clusters of sampled programs on test inputs in code generation (To et al., 2023). These approaches require no retraining or modification of the underlying generator models and achieve state-of-the-art improvements by exploiting distinct, often shallow, yet highly effective interaction signals among candidates.
1. Formal Definitions of SRank
SRank instantiates specific forms depending on the application domain:
a) Question Answering (QA) (Majumder et al., 2021):
Let $q$ denote the input question, $c$ the context, and $\mathcal{M}$ a span-based QA model. $\mathcal{M}$ generates a set of candidate answer spans $\{a_1, \dots, a_k\}$, where each $a_i$ anchors an answer span in sentence $s_i$ of $c$. For each candidate, a named-entity tagger extracts the entity sets $E(s_i)$ and $E(q)$. The reranking score is defined as
$$\mathrm{score}(a_i) = |E(s_i) \cap E(q)|.$$
Candidates are reranked in descending order of this intersection size; ties are broken by the original model order.
b) Neural Code Generation (To et al., 2023):
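Given a natural-language programming specification $x$, a code LLM samples $N$ solutions and $M$ test inputs $z_1, \dots, z_M$ with respective outputs. Solutions are clustered: $C_i$ is the set of code samples whose output vectors on $z_1, \dots, z_M$ are identical. For clusters $C_1, \dots, C_K$, the interaction matrix $I \in \mathbb{R}^{K \times K}$ records functional overlap:
$$I_{ij} = \frac{1}{M} \sum_{k=1}^{M} \mathbb{1}[o_{ik} = o_{jk}],$$
where $o_{ik}$ is the (deterministic) output of all members of $C_i$ on $z_k$. The SRank score for cluster $C_i$ aggregates overlap-weighted cluster features $V \in \mathbb{R}^{K}$:
$$R_i = \sum_{j=1}^{K} I_{ij} V_j, \qquad \text{i.e. } R = I \cdot V.$$
The solution returned is drawn from the highest-scoring cluster.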
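As an illustration, consider a small worked example with invented numbers (they are not taken from the cited papers). It assumes three clusters evaluated on four test inputs, with cluster size used as the cluster feature. Writing $I_{ij}$ for the fraction of test inputs on which clusters $i$ and $j$ produce the same output, and $V_j$ for the size of cluster $j$:

$$
\begin{aligned}
&\text{outputs: } C_1 = (1, 4, 9, 16), \quad C_2 = (1, 4, 9, 15), \quad C_3 = (2, 3, 5, 7), \qquad V = (3, 2, 4)^{\top} \\
&I = \begin{pmatrix} 1 & 0.75 & 0 \\ 0.75 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \qquad
R = I \cdot V = \begin{pmatrix} 3 + 1.5 \\ 2.25 + 2 \\ 4 \end{pmatrix} = \begin{pmatrix} 4.5 \\ 4.25 \\ 4 \end{pmatrix}
\end{aligned}
$$

Here $C_3$ is the largest cluster, but it agrees with no other cluster on any test input. $C_1$ and $C_2$ agree on three of four inputs and reinforce each other, so $C_1$ receives the highest score.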
2. Overlap Measures and Algorithms
a) Named-Entity Overlap (QA):
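SRank exclusively uses the size of the named-entity intersection between the sentence $s_i$ containing a candidate span and the question $q$:
$$\mathrm{overlap}(s_i, q) = |E(s_i) \cap E(q)|,$$
where $E(\cdot)$ denotes the set of named entities extracted by the tagger.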
No additional weightings, normalizations, or alternative overlaps (n-gram, TF-IDF, embedding similarity) are used in the evaluated pipeline, though the framework is general (Majumder et al., 2021).
b) Functional Agreement (Code):
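Clusters are defined by equivalence on output vectors across test cases. Pairwise overlap is the normalized count of test cases on which two clusters agree:
$$I_{ij} = \frac{1}{M} \sum_{k=1}^{M} \mathbb{1}[o_{ik} = o_{jk}],$$
where $o_{ik}$ is the output of cluster $i$ on the $k$-th of $M$ test inputs. For $K$ clusters this yields a $K \times K$ matrix that models inter-cluster functional similarity.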
3. Pipelines and Implementation Details
a) QA Pipeline (Majumder et al., 2021):
- Predict candidate spans with the base QA model.
- For each candidate, extract the containing sentence and run an NER tagger on it and on the question (AllenNLP’s ELMo-based tagger trained on CoNLL-2003).
- Compute the intersection of the sentence's and the question's entity sets for each candidate.
- Sort candidates by intersection size, then by original model rank.
- Return the highest-ranked span.
Common settings: BiDAF and BERT as base models, a fixed beam of candidate spans, and a maximum input length of 400 tokens. Per-example reranking adds only NER overhead, which is negligible compared with the model’s forward pass.
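The QA steps above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation. The function names, the `(span, sentence)` candidate format, and the `toy_entities` tagger are assumptions made for the example; `toy_entities` treats capitalized tokens as entities and stands in for a real NER model.

```python
from typing import Callable, List, Set, Tuple

# Hypothetical candidate format: (answer span text, sentence containing the span),
# listed in the base model's original rank order.
Candidate = Tuple[str, str]


def rerank_by_entity_overlap(
    question: str,
    candidates: List[Candidate],
    extract_entities: Callable[[str], Set[str]],
) -> List[Candidate]:
    """Sort candidates by |E(sentence) & E(question)|, descending."""
    q_entities = extract_entities(question)
    scores = [len(extract_entities(sent) & q_entities) for _, sent in candidates]
    # sorted() is stable, so candidates with equal scores keep the model's order.
    order = sorted(range(len(candidates)), key=lambda i: -scores[i])
    return [candidates[i] for i in order]


def toy_entities(text: str) -> Set[str]:
    """Stand-in tagger (assumption): every capitalized token counts as an entity."""
    tokens = (tok.strip(".,?!") for tok in text.split())
    return {tok for tok in tokens if tok[:1].isupper()}


question = "Which team did Peyton Manning lead to Super Bowl 50?"
candidates = [
    # The base model ranks an adversarial distractor sentence first.
    ("Chicago Bears", "Jeff Dean led the Chicago Bears to Champ Bowl 40."),
    ("Denver Broncos", "Peyton Manning led the Denver Broncos to Super Bowl 50."),
]
ranked = rerank_by_entity_overlap(question, candidates, toy_entities)
print(ranked[0][0])
```

In this toy case the distractor sentence shares one token-level "entity" with the question while the original sentence shares four, so the second candidate moves to the top. With a real tagger, `extract_entities` would wrap the tagger's output as a set of entity strings.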
b) Code Generation Pipeline (To et al., 2023):
- Sample $N$ solutions and $M$ test inputs from the code LLM.
- Execute all solutions on all test inputs to get output vectors.
- Cluster solutions by identical output vectors across the $M$ test inputs.
- Construct the $K \times K$ interaction matrix $I$ of pairwise cluster overlaps.
- Define $V$ as the cluster feature vector (e.g., cluster size or pass rate).
- Calculate $R = I \cdot V$, sort clusters by score, and extract a representative solution from the winner.
In most cases, computational cost is dominated by the $N \times M$ program executions.
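The clustering and scoring steps of the code pipeline can be sketched as follows. This is a hedged illustration under several assumptions: program execution has already happened and its results are passed in as `outputs`, cluster size is the cluster feature, the first member of the winning cluster serves as the representative, and score ties go to the cluster seen first. The function name `srank_select` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def srank_select(outputs: Sequence[Sequence[Hashable]]) -> Tuple[int, List[float]]:
    """Pick a solution by overlap-weighted cluster scoring.

    outputs[n][k] is the (precomputed) output of sampled solution n on test input k.
    Returns the index of the chosen solution and the score of every cluster,
    with clusters listed in order of first appearance.
    """
    if not outputs or not outputs[0]:
        raise ValueError("need at least one solution and one test input")

    # 1. Cluster solutions whose output vectors are identical.
    clusters: Dict[Tuple[Hashable, ...], List[int]] = defaultdict(list)
    for idx, vec in enumerate(outputs):
        clusters[tuple(vec)].append(idx)
    signatures = list(clusters)
    num_tests = len(signatures[0])

    # 2. Interaction matrix: fraction of test inputs on which two clusters agree.
    interaction = [
        [sum(a == b for a, b in zip(si, sj)) / num_tests for sj in signatures]
        for si in signatures
    ]

    # 3. Cluster feature (assumption for this sketch): cluster size.
    features = [len(clusters[sig]) for sig in signatures]

    # 4. Score each cluster as the overlap-weighted sum of features (R = I . V).
    scores = [
        sum(row[j] * features[j] for j in range(len(features)))
        for row in interaction
    ]

    # 5. Return the first member of the best cluster as its representative.
    best = max(range(len(scores)), key=scores.__getitem__)
    return clusters[signatures[best]][0], scores


# Toy data: 4 solutions agree on one behaviour, 3 on a second, 2 on a near-identical third.
outputs = [(2, 3, 5, 7)] * 4 + [(1, 4, 9, 16)] * 3 + [(1, 4, 9, 15)] * 2
chosen, scores = srank_select(outputs)
print(chosen, scores)
```

On this toy input the largest cluster (size 4) is isolated, while the clusters of size 3 and 2 agree on three of four test inputs, so a member of the size-3 cluster is selected. Swapping in another feature, such as a pass rate on generated test cases, only changes step 3.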
4. Experimental Evaluation and Comparative Results
Table 1: pass@1 (%) on HumanEval and MBPP-S (To et al., 2023)
| Model | Greedy | CodeT | Coder-Reviewer | SRank |
|---|---|---|---|---|
| WizardCoder34B | 68.90 | 72.36 | – | 75.31 |
| CodeGen2.5-Instr. | 28.05 | 56.81 | 45.63 | 60.55 |
| StarCoder | 39.63 | 50.51 | 38.71 | 53.99 |
| Codex002 | 47.00 | 65.80 | 66.90 | 69.66 |
SRank achieves an average pass@1 improvement of ≈6.1% over the CodeT and Coder-Reviewer rerankers.
QA Results (Majumder et al., 2021):
- On Adversarial SQuAD (AddSent), BiDAF+SRank achieves 45.4 F1 / 38.0 EM (vs. vanilla BiDAF, 21.4 / 16.0, and BiDAF+SLN, 22.8 / 17.2).
- For BERT, gains are similar (BERT+SRank: 61.2/53.6 vs. BERT+QAInfoMax, 41.8/37.2).
- On clean data, SRank-only BERT can experience minor drops due to answer type mismatch but yields strong net improvements in adversarial settings.
5. Analysis, Robustness, and Practical Trade-offs
a) Adversarial Robustness (QA):
SRank exploits that adversarial distractors in context rarely share named entities with the question, allowing span filtering without any retraining. This model-agnostic and post-hoc property makes it applicable to any span-predicting reader.
b) Robustness (Code):
Functional overlap amplifies the consensus among correct solutions: genuinely correct clusters tend to agree on outputs, while “buggy” clusters disagree. Especially when the number of sampled solutions $N$ and test inputs $M$ is small, this consensus effect outperforms single-cluster features and reduces noise in cluster scoring.
c) Limitations:
- In QA, SRank may fail if all candidate sentences with shared entities lack the correct answer or if entity overlap does not imply answer type match.
- In code, the approach assumes perfect clustering by exact outputs; richer semantic (canonical) clustering may further improve effectiveness.
- Results are presented for English QA and Python code only; generalization to other languages or domains is not yet demonstrated.
6. Extensions, Limitations, and Future Perspectives
SRank’s design is agnostic to the underlying overlap metric and to the cluster features. While only named-entity and functional overlap are employed in the respective evaluated domains, the general framework could incorporate richer content matching (n-grams, TF-IDF, embeddings), additional verifiers, or LLM log-probabilities as cluster features. Scaling SRank to massive candidate sets may require approximations or subset sampling. Its strong empirical performance underscores both the potential of consensus-based reranking and possible vulnerabilities in current adversarial benchmarks, which motivates future defenses built on deeper semantic or pragmatic reasoning beyond entity or functional overlap (Majumder et al., 2021; To et al., 2023).