
Unanimous Voting in CoT+NLI

Updated 21 January 2026
  • Unanimous Voting (CoT+NLI) is a protocol that combines Chain-of-Thought reasoning and Natural Language Inference to achieve consensus-based, high-precision factual validation.
  • It leverages multiple reasoning paths and strict agreement criteria to filter out hallucinations, ensuring robust and interpretable AI outputs in domains such as biomedical fact-checking.
  • The approach generalizes traditional majority voting by enforcing unanimity across diverse agents or reasoning chains, optimizing accuracy in critical decision-making tasks.

Unanimous Voting (CoT+NLI) refers to a family of protocols and aggregation schemes that combine Chain-of-Thought (CoT) reasoning and Natural Language Inference (NLI) with stringent agreement-based decision criteria. These methods aim to improve reliability and factual correctness in AI-generated inference—particularly in high-stakes fields like biomedical NLI and fact-checking of LLMs—by requiring that either multiple generated reasoning chains or multiple agents reach a strict consensus, optionally validated by NLI models. Unanimous Voting protocols generalize standard majority voting by enforcing maximal agreement between diverse problem-solving paths, thus filtering out hallucinations and spurious outputs and ensuring high-precision, interpretable results in NLI and related domains.

1. Principles and Formalism of Unanimous Voting

Unanimous Voting protocols require that all candidate solutions (chains, agents, or classifiers) agree before a claim is accepted as valid. In the context of CoT+NLI, this means each atomic fact or conclusion must be independently supported by both a chain-of-thought reasoning module and an NLI evaluator, or—when using multi-agent frameworks—by all participating agents, often as verified by NLI consistency scores.

The foundational aggregation rule for Unanimous Voting over two systems (e.g., CoT and NLI) for a given atomic fact $f$ is:

$$s_{\mathrm{UnVot}}(f) = s_{\mathrm{CoT}}(f) \times s_{\mathrm{NLI}}(f)$$

where $s_{\mathrm{CoT}}(f), s_{\mathrm{NLI}}(f) \in \{0,1\}$. Only if both modules independently assign support (1) is $f$ accepted as factual. If the modules disagree, or if either one refutes the fact, the system outputs non-factual (0) (Afzal et al., 2 Sep 2025).
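Because the scores are binary, the product rule is simply a logical AND, and it extends unchanged to more than two systems. The following is a minimal sketch of that rule; the function name `unanimous_vote` is illustrative and not taken from the cited work.

```python
from math import prod


def unanimous_vote(*scores: int) -> int:
    """Return 1 only if every binary score is 1 (product of 0/1 votes)."""
    if not scores:
        raise ValueError("at least one score is required")
    if any(s not in (0, 1) for s in scores):
        raise ValueError("scores must be 0 or 1")
    return prod(scores)


# Truth table for the two-system case (s_CoT, s_NLI).
for s_cot in (0, 1):
    for s_nli in (0, 1):
        print(s_cot, s_nli, unanimous_vote(s_cot, s_nli))
```

A single dissenting system is enough to reject the fact, which is what makes the rule favor precision over recall.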

Unanimous consensus in a multi-agent setting is reached if all $N$ agents jointly agree:

$$\sum_{i=1}^N \mathrm{agreed}_i = N \iff \forall\, i, j: S_{\mathrm{NLI}}(A_i, A_j) \ge \tau$$

where $A_i$ is agent $i$'s chain-of-thought and $S_{\mathrm{NLI}}$ is an entailment score between pairs of chains with threshold $\tau$ (Kaesberg et al., 26 Feb 2025).
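The pairwise condition can be sketched as a check over all ordered pairs of distinct chains, since entailment is generally not symmetric. In the sketch below, `is_unanimous` and `toy_score` are hypothetical names, and the token-overlap scorer is only a stand-in for a real NLI model.

```python
from itertools import permutations
from typing import Callable, Sequence


def is_unanimous(
    chains: Sequence[str],
    score_fn: Callable[[str, str], float],
    tau: float,
) -> bool:
    """True if score_fn(A_i, A_j) >= tau for every ordered pair with i != j."""
    return all(score_fn(a, b) >= tau for a, b in permutations(chains, 2))


def toy_score(a: str, b: str) -> float:
    """Stand-in for an NLI entailment score: Jaccard overlap of lowercase tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0


agreeing = ["the answer is b", "the answer is b", "the answer is b"]
split = ["the answer is b", "the answer is b", "the answer is c"]
print(is_unanimous(agreeing, toy_score, tau=0.8))
print(is_unanimous(split, toy_score, tau=0.8))
```

With a real model, `score_fn` would return the entailment probability of the second chain given the first; the threshold `tau` then controls how strict the agreement has to be.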

2. Self-Consistent Chain-of-Thought with Majority/Unanimous Voting

The FZI-WIM system at SemEval-2024 Task 2 exemplifies the application of majority (and, by extension, unanimous) voting in CoT-augmented biomedical NLI. Instead of single-path greedy decoding, the system generates $k$ diverse CoT reasoning chains per input via stochastic sampling, extracts each chain's terminal decision, and aggregates results via majority voting:

$$y^* = \arg\max_{y \in \mathcal{Y}} \sum_{i=1}^k \mathbf{1}\left(f(r_i) = y\right)$$

where $\mathcal{Y} = \{\mathrm{Entailment}, \mathrm{Contradiction}\}$, $r_1,\dots,r_k$ are the sampled chains, and $f(r_i)$ is the terminal decision extracted from chain $r_i$ (Liu et al., 2024).
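To make the difference between the two aggregation rules concrete, here is a small sketch that applies majority voting and its unanimous counterpart to the same list of extracted labels. The helper names are illustrative, and returning `None` on a tie or on disagreement is an assumption of this sketch rather than behavior described by the cited system.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(labels: Sequence[str]) -> Optional[str]:
    """Return the most frequent label, or None if the input is empty or tied."""
    ranked = Counter(labels).most_common(2)
    if not ranked:
        return None
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # top two labels are tied
    return ranked[0][0]


def unanimous_label(labels: Sequence[str]) -> Optional[str]:
    """Return the shared label only if every chain produced the same one."""
    distinct = set(labels)
    return next(iter(distinct)) if len(distinct) == 1 else None


votes = ["Entailment", "Entailment", "Contradiction"]
print(majority_vote(votes))    # majority accepts the 2-of-3 label
print(unanimous_label(votes))  # unanimity abstains because one chain dissents
```

Unanimity accepts a strict subset of what majority voting accepts, so it can only abstain more often, never less.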

This self-consistent CoT approach significantly improves upon greedy decoding in terms of F1, faithfulness, and consistency, demonstrating increased robustness through diversity of reasoning chains and aggregation.

3. FActBench: Unanimous Voting for Medical Fact-Checking

The Unanimous Voting mechanism ("UnVot") in FActBench aggregates fact-checking results from both CoT-prompted LLMs and domain-finetuned NLI models. For each atomic fact $f_i$, both modules issue a binary judgment:

  • CoT: $s_{\mathrm{CoT}}(f_i) = \begin{cases} 1 & \text{if } P_{\mathrm{CoT}}(\mathrm{supported} \mid f_i, E) \ge 0.5 \\ 0 & \text{otherwise} \end{cases}$
  • NLI: $s_{\mathrm{NLI}}(f_i) = \begin{cases} 1 & \text{if } P_{\mathrm{NLI}}(\mathrm{entailment} \mid f_i, E) \ge 0.5 \\ 0 & \text{otherwise} \end{cases}$

Only if both $s_{\mathrm{CoT}}(f_i)=1$ and $s_{\mathrm{NLI}}(f_i)=1$ is $f_i$ scored as factual ($s_{\mathrm{UnVot}}(f_i)=1$) (Afzal et al., 2 Sep 2025). Empirically, UnVot yields factuality scores most closely correlated with human domain expert ratings across summarization and generative QA tasks, outperforming pure CoT or pure NLI pipelines in precision and the degree of hallucination mitigation.
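The per-fact rule can be sketched as thresholding each module's probability at 0.5 and multiplying the resulting binary scores. The probabilities below are made-up toy values, and reporting a document-level score as the percentage of supported facts is an assumption of this sketch, not a definition taken from the cited paper.

```python
from typing import List, Sequence


def binarize(p: float, threshold: float = 0.5) -> int:
    """Map a support probability to a 0/1 judgment."""
    return 1 if p >= threshold else 0


def unvot_scores(
    p_cot: Sequence[float], p_nli: Sequence[float], threshold: float = 0.5
) -> List[int]:
    """Per-fact unanimous score: 1 only if both modules clear the threshold."""
    if len(p_cot) != len(p_nli):
        raise ValueError("both modules must score the same facts")
    return [binarize(c, threshold) * binarize(n, threshold)
            for c, n in zip(p_cot, p_nli)]


def factuality_percent(scores: Sequence[int]) -> float:
    """Assumed aggregate: share of atomic facts scored as factual, in percent."""
    if not scores:
        raise ValueError("no atomic facts to score")
    return 100.0 * sum(scores) / len(scores)


# Toy probabilities for four atomic facts (illustrative values only).
p_cot = [0.9, 0.8, 0.6, 0.3]
p_nli = [0.7, 0.4, 0.5, 0.9]
scores = unvot_scores(p_cot, p_nli)
print(scores, factuality_percent(scores))
```

The second and fourth facts are rejected because one module falls below the threshold, even though the other is confident.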

Table: FActBench Task-wise Factuality Scores

| Task    | Baseline | CoT*   | NLI*  | UnVot* | Human |
|---------|----------|--------|-------|--------|-------|
| Summ    | 54.81    | 96.87  | 85.41 | 83.45  | 84.0  |
| LaySumm | 52.50    | 97.60  | 91.09 | 88.94  | 88.7  |
| RAG(QA) | 38.43    | 100.00 | 83.04 | 83.04  | 87.3  |
| PureGen | 71.26    | 88.17  | 31.61 | 31.31  | 62.7  |

(*Intrinsic + extrinsic checks; see Afzal et al., 2 Sep 2025, Table 3.)
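One simple way to read the table is to compute, for each method, the mean absolute gap between its score and the human score across the four tasks. This is only an illustrative proxy and not the correlation analysis reported in the source; the numbers are copied from the table above.

```python
# Scores copied from the table; "Human" is the expert reference column.
ROWS = {
    "Summ":    {"Baseline": 54.81, "CoT": 96.87,  "NLI": 85.41, "UnVot": 83.45, "Human": 84.0},
    "LaySumm": {"Baseline": 52.50, "CoT": 97.60,  "NLI": 91.09, "UnVot": 88.94, "Human": 88.7},
    "RAG(QA)": {"Baseline": 38.43, "CoT": 100.00, "NLI": 83.04, "UnVot": 83.04, "Human": 87.3},
    "PureGen": {"Baseline": 71.26, "CoT": 88.17,  "NLI": 31.61, "UnVot": 31.31, "Human": 62.7},
}
METHODS = ("Baseline", "CoT", "NLI", "UnVot")


def mean_abs_gap(method: str) -> float:
    """Mean absolute difference between a method's score and the human score."""
    return sum(abs(r[method] - r["Human"]) for r in ROWS.values()) / len(ROWS)


gaps = {m: mean_abs_gap(m) for m in METHODS}
closest = min(gaps, key=gaps.get)
for m in METHODS:
    print(f"{m:8s} {gaps[m]:6.2f}")
print("closest to human on average:", closest)
```

On these numbers UnVot has the smallest mean gap, although on the PureGen row all three checked variants sit far from the human score.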

4. Multi-Agent Unanimous Voting Integrating CoT and NLI

Advanced protocols implement Unanimous Voting in multi-agent debate, enforcing 100% agreement among $N$ agents, typically using a combination of CoT generation and pairwise NLI validation (Kaesberg et al., 26 Feb 2025). In a standard protocol:

  • Each agent $i$ generates a chain $A_i$ via CoT.
  • Pairwise NLI consistency scores $S_{\mathrm{NLI}}(A_i, A_j)$ are computed.
  • If $S_{\mathrm{NLI}}(A_i, A_j) \ge \tau$ for a fixed threshold $\tau$ and all $i, j$, the agents are unanimous.
  • Otherwise, agents exchange and refine their chains, repeating for a bounded number of rounds.
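The steps above can be sketched as a bounded loop over debate rounds. Everything in this sketch is a toy stand-in: the agents are plain functions that adopt the most common peer answer, the scorer is exact string match rather than an NLI model, and letting every agent see the previous round's chains (rather than updates made within the same round) is an assumption of this sketch.

```python
from collections import Counter
from itertools import permutations
from typing import Callable, List, Sequence, Tuple

# An agent maps (question, peer chains from the previous round) to a new chain.
Agent = Callable[[str, Sequence[str]], str]


def debate(
    question: str,
    agents: Sequence[Agent],
    score_fn: Callable[[str, str], float],
    tau: float,
    max_rounds: int,
) -> Tuple[List[str], bool, int]:
    """Run rounds until all ordered pairs score >= tau or the budget is spent."""
    chains: List[str] = []
    for round_no in range(1, max_rounds + 1):
        peers = list(chains)  # snapshot of the previous round (empty in round 1)
        chains = [agent(question, peers) for agent in agents]
        if all(score_fn(a, b) >= tau for a, b in permutations(chains, 2)):
            return chains, True, round_no
    return chains, False, max_rounds


def make_agent(initial: str) -> Agent:
    """Toy agent: starts with a fixed answer, then adopts the most common peer answer."""
    def agent(question: str, peers: Sequence[str]) -> str:
        if not peers:
            return initial
        return Counter(peers).most_common(1)[0][0]
    return agent


def exact_match(a: str, b: str) -> float:
    """Toy stand-in for an NLI consistency score."""
    return 1.0 if a == b else 0.0


agents = [make_agent("B"), make_agent("B"), make_agent("C")]
chains, unanimous, rounds = debate("toy question", agents, exact_match, 0.5, 3)
print(chains, unanimous, rounds)
```

In this toy run the dissenting agent switches in the second round, so the loop stops early instead of using all three rounds.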

Empirical findings indicate that, for knowledge tasks, consensus protocols including unanimity improve performance by 2.8 percentage points on MMLU/GPQA-type benchmarks. $N=3$ agents typically suffice, with further gains for $N=5$ to $7$ (Kaesberg et al., 26 Feb 2025). Running more than a small number of rounds degrades performance, which underlines the need for tightly bounded consensus procedures.

5. Algorithmic and Operational Details

Pseudocode: FZI-WIM NLI Inference Pipeline

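# Assumed helpers, not defined here: fill, sample_chains, extract_label,
# greedy_decode_label, store.
# D: dataset, M: model, P: prompt template, k_initial / k_max: sampling budgets.
for premise, statement in D:
    prompt = fill(P, premise, statement)
    # Sample k_initial diverse CoT chains and drop exact duplicates (order kept).
    chains = list(dict.fromkeys(
        sample_chains(M, prompt, T=0.7, top_k=50, n=k_initial)))
    votes = {"Entailment": 0, "Contradiction": 0}
    for r in chains:
        votes[extract_label(r)] += 1
    # On a tie, keep sampling until the tie breaks or k_max chains are reached.
    while votes["Entailment"] == votes["Contradiction"] and len(chains) < k_max:
        more = sample_chains(M, prompt, T=0.7, top_k=50, n=k_initial)
        new_chains = [r for r in dict.fromkeys(more) if r not in chains]
        chains.extend(new_chains)
        for r in new_chains:
            votes[extract_label(r)] += 1
    if votes["Entailment"] != votes["Contradiction"]:
        y_star = max(votes, key=votes.get)       # majority label
    else:
        y_star = greedy_decode_label(M, prompt)  # still tied: greedy fallback
    store(y_star)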
(Liu et al., 2024)

Pseudocode: Multi-Agent CoT+NLI Unanimous Voting

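# Assumed helpers, not defined here: AGENT_PROMPT_CoT, AGENT_PROMPT_REFINE,
# S_nli, CONSOLIDATE.
def unanimous_debate(x, N, T_max, tau):
    A = [None] * N
    for rnd in range(1, T_max + 1):
        for i in range(N):
            if rnd == 1:
                A[i] = AGENT_PROMPT_CoT(x)        # independent first-round chain
            else:
                A[i] = AGENT_PROMPT_REFINE(x, A)  # refine after seeing all chains
        # Agent i agrees if its chain is NLI-consistent with every other chain.
        agreed = [all(S_nli(A[i], A[j]) >= tau for j in range(N) if j != i)
                  for i in range(N)]
        if sum(agreed) == N:
            break                                 # unanimous: stop early
    y_star = CONSOLIDATE(A)
    return y_star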
(Kaesberg et al., 26 Feb 2025)

6. Extension to Other Domains and Expected Benefits

Unanimous Voting with CoT+NLI is domain-agnostic and generalizes to domains such as law, finance, and STEM fields by adapting atomic fact extraction, evidence retrieval, NLI model pretraining, and CoT prompting to in-domain data (Afzal et al., 2 Sep 2025). Key benefits include:

  • Improved robustness to spurious high token-probability errors (resilient against LLM hallucinations and inconsistent paraphrase handling) (Liu et al., 2024).
  • Higher faithfulness and precision, closely tracking domain expert ratings (as measured on FActBench) (Afzal et al., 2 Sep 2025).
  • Tunable trade-offs: thresholds ($\tau_{\mathrm{CoT}}, \tau_{\mathrm{NLI}}$), number of sampled chains or agents ($k$, $N$), and consensus criteria allow balancing recall, precision, and resource consumption.
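The threshold trade-off in the last item can be illustrated with a small sketch: raising either threshold can only shrink the set of accepted facts, never grow it. The probabilities are made-up toy values and the function name `accepted` is hypothetical.

```python
from typing import Sequence, Set


def accepted(
    p_cot: Sequence[float],
    p_nli: Sequence[float],
    tau_cot: float,
    tau_nli: float,
) -> Set[int]:
    """Indices of facts that clear both module-specific thresholds."""
    return {
        i
        for i, (c, n) in enumerate(zip(p_cot, p_nli))
        if c >= tau_cot and n >= tau_nli
    }


# Toy support probabilities for four atomic facts (illustrative values only).
p_cot = [0.95, 0.70, 0.55, 0.20]
p_nli = [0.90, 0.60, 0.80, 0.90]

loose = accepted(p_cot, p_nli, tau_cot=0.5, tau_nli=0.5)
medium = accepted(p_cot, p_nli, tau_cot=0.6, tau_nli=0.6)
strict = accepted(p_cot, p_nli, tau_cot=0.75, tau_nli=0.75)
print(sorted(loose), sorted(medium), sorted(strict))
```

Because the accepted sets are nested, stricter thresholds can only lower recall; whether precision improves in exchange depends on how well the module probabilities are calibrated.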

A plausible implication is that the computational budget for full Unanimous Voting grows linearly with the number of CoT samples or participating agents, and quadratically with agent count for all-pair NLI validation, motivating selective pruning or fallback heuristics for large-scale deployment (Liu et al., 2024, Kaesberg et al., 26 Feb 2025).
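As a rough sketch of that scaling, the helpers below count generation calls and pairwise NLI comparisons per debate. The accounting is an assumption of this sketch: one generation per agent per round, and one NLI call per ordered pair of distinct agents (or per unordered pair if the scorer is treated as symmetric).

```python
def generation_calls(n_agents: int, rounds: int) -> int:
    """CoT generations: one per agent per round, so linear in both."""
    return n_agents * rounds


def nli_pairs(n_agents: int, ordered: bool = True) -> int:
    """Pairwise NLI checks per round: N(N-1) ordered, N(N-1)/2 unordered."""
    pairs = n_agents * (n_agents - 1)
    return pairs if ordered else pairs // 2


for n in (3, 5, 7):
    print(n, generation_calls(n, rounds=3), nli_pairs(n), nli_pairs(n, ordered=False))
```

Going from 3 to 7 agents a little more than doubles the generation calls per round but multiplies the ordered NLI comparisons by seven, which is where pruning or fallback heuristics would pay off.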

7. Connections to Classical Unanimous/Consensus Voting in Decision Theory

While the above instantiations focus on CoT and NLI within neural inference systems, Unanimous Voting also figures in classical decision protocols and optimization. For example, the Unanimous Vote problem (optimally determining a stopping rule for coin tosses) admits an exact $O(n \log n)$ solution and reveals insights about adaptivity gaps ($1.2 \pm o(1)$) between optimal adaptive and nonadaptive policies (Keles et al., 19 Oct 2025). While unrelated to NLI per se, this classical literature contextualizes the efficiency and optimality properties that modern CoT+NLI aggregation schemes aim to approximate in the domain of AI-driven fact verification and inference.


Unanimous Voting (CoT+NLI) thus provides a mathematically rigorous, empirically validated framework for high-precision aggregation of model outputs in complex reasoning and verification tasks, unifying strict consensus protocols with state-of-the-art fact-checking and collaborative inference methodologies (Liu et al., 2024, Afzal et al., 2 Sep 2025, Kaesberg et al., 26 Feb 2025).
