Unanimous Voting in CoT+NLI
- Unanimous Voting (CoT+NLI) is a protocol that combines Chain-of-Thought reasoning and Natural Language Inference to achieve consensus-based, high-precision factual validation.
- It leverages multiple reasoning paths and strict agreement criteria to filter out hallucinations, ensuring robust and interpretable AI outputs in domains such as biomedical fact-checking.
- The approach generalizes traditional majority voting by enforcing unanimity across diverse agents or reasoning chains, optimizing accuracy in critical decision-making tasks.
Unanimous Voting (CoT+NLI) refers to a family of protocols and aggregation schemes that combine Chain-of-Thought (CoT) reasoning and Natural Language Inference (NLI) with stringent agreement-based decision criteria. These methods aim to improve reliability and factual correctness in AI-generated inference—particularly in high-stakes fields like biomedical NLI and fact-checking of LLMs—by requiring that either multiple generated reasoning chains or multiple agents reach a strict consensus, optionally validated by NLI models. Unanimous Voting protocols generalize standard majority voting by enforcing maximal agreement between diverse problem-solving paths, thus filtering out hallucinations and spurious outputs and ensuring high-precision, interpretable results in NLI and related domains.
1. Principles and Formalism of Unanimous Voting
Unanimous Voting protocols require that all candidate solutions (chains, agents, or classifiers) agree before a claim is accepted as valid. In the context of CoT+NLI, this means each atomic fact or conclusion must be independently supported by both a chain-of-thought reasoning module and an NLI evaluator, or—when using multi-agent frameworks—by all participating agents, often as verified by NLI consistency scores.
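The foundational aggregation rule for Unanimous Voting over two systems (e.g., CoT and NLI) for a given atomic fact $f_i$ is:
$$s_{\mathrm{UnVot}}(f_i) = s_{\mathrm{CoT}}(f_i) \land s_{\mathrm{NLI}}(f_i),$$
where $s_{\mathrm{CoT}}(f_i), s_{\mathrm{NLI}}(f_i) \in \{0, 1\}$. Only if both modules independently assign support (1) is $f_i$ accepted as factual. If either module disagrees or refutes the fact, the system outputs non-factual (0) (Afzal et al., 2 Sep 2025).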
Unanimous consensus in a multi-agent setting is reached if all agents jointly agree:
$$S_{\mathrm{NLI}}(A_i, A_j) \ge \tau \quad \text{for all } i \neq j,$$
where $A_i$ is agent $i$'s chain-of-thought and $S_{\mathrm{NLI}}$ is an entailment score between pairs of chains with threshold $\tau$ (Kaesberg et al., 26 Feb 2025).
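The multi-agent criterion can be sketched as a check over a matrix of pairwise entailment scores. This is a minimal illustration, not the cited authors' implementation: the function name `is_unanimous`, the matrix layout, and the example scores and threshold are all assumptions made for the example.

```python
from typing import Sequence


def is_unanimous(scores: Sequence[Sequence[float]], tau: float) -> bool:
    """Return True if every ordered pair of distinct chains meets the threshold.

    scores[i][j] is an (assumed) entailment score of chain j given chain i.
    Entailment is directional, so both (i, j) and (j, i) are checked.
    """
    n = len(scores)
    return all(
        scores[i][j] >= tau
        for i in range(n)
        for j in range(n)
        if i != j
    )


# Illustrative scores for three agents (made-up numbers).
agree = [
    [1.0, 0.92, 0.88],
    [0.90, 1.0, 0.95],
    [0.87, 0.91, 1.0],
]
# Same matrix, except agent 2's chain is not entailed by agent 0's chain.
disagree = [row[:] for row in agree]
disagree[0][2] = 0.40

print(is_unanimous(agree, tau=0.8))     # True
print(is_unanimous(disagree, tau=0.8))  # False
```

A single low score in either direction is enough to break unanimity, which is what makes the rule stricter than a majority or an average over pairs.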
2. Self-Consistent Chain-of-Thought with Majority/Unanimous Voting
The FZI-WIM system at SemEval-2024 Task 2 exemplifies the application of majority (and, by extension, unanimous) voting in CoT-augmented biomedical NLI. Instead of single-path greedy decoding, the system generates $k$ diverse CoT reasoning chains per input via stochastic sampling, extracts each chain's terminal decision, and aggregates results via majority voting:
$$y^{*} = \arg\max_{y} \sum_{i=1}^{k} \mathbb{1}\bigl[\mathrm{label}(r_i) = y\bigr],$$
where $y \in \{\text{Entailment}, \text{Contradiction}\}$ and $r_1, \dots, r_k$ are sampled chains (Liu et al., 2024).
This self-consistent CoT approach significantly improves upon greedy decoding in terms of F1, faithfulness, and consistency, demonstrating increased robustness through diversity of reasoning chains and aggregation.
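To make the difference between the two aggregation rules concrete, the sketch below applies a majority vote and a unanimous vote to the same set of sampled decisions. The function names and the sample labels are assumptions for illustration. The unanimous variant illustrates the "by extension" case and is not claimed to be part of the FZI-WIM system.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(labels: Sequence[str]) -> Optional[str]:
    """Return the most frequent label, or None on a tie for first place."""
    counts = Counter(labels).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


def unanimous_vote(labels: Sequence[str]) -> Optional[str]:
    """Return the label only if every sampled chain produced it."""
    if labels and len(set(labels)) == 1:
        return labels[0]
    return None


# Terminal decisions extracted from five sampled chains (made-up example).
sampled = ["Entailment", "Entailment", "Contradiction", "Entailment", "Entailment"]

print(majority_vote(sampled))   # Entailment
print(unanimous_vote(sampled))  # None
```

Returning `None` leaves the abstention policy to the caller, for example falling back to greedy decoding or marking the claim as unverified.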
3. FActBench: Unanimous Voting for Medical Fact-Checking
The Unanimous Voting mechanism ("UnVot") in FActBench aggregates fact-checking results from both CoT-prompted LLMs and domain-finetuned NLI models. For each atomic fact $f_i$ and evidence $E$, both modules issue a binary judgment:
- CoT: $s_{\mathrm{CoT}}(f_i) = \begin{cases} 1 & \text{if } P_{\mathrm{CoT}}(\mathrm{supported} \mid f_i, E) \ge 0.5 \\ 0 & \text{otherwise} \end{cases}$
- NLI: $s_{\mathrm{NLI}}(f_i) = \begin{cases} 1 & \text{if } P_{\mathrm{NLI}}(\mathrm{entailment} \mid f_i, E) \ge 0.5 \\ 0 & \text{otherwise} \end{cases}$
Only if both $s_{\mathrm{CoT}}(f_i) = 1$ and $s_{\mathrm{NLI}}(f_i) = 1$ is $f_i$ scored as factual ($s_{\mathrm{UnVot}}(f_i) = 1$) (Afzal et al., 2 Sep 2025). Empirically, UnVot yields factuality scores most closely correlated with human domain expert ratings across summarization and generative QA tasks, outperforming pure CoT or pure NLI pipelines in precision and the degree of hallucination mitigation.
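The per-fact rule can be sketched in a few lines of Python. The function names, the probability values, and in particular the aggregation of per-fact decisions into a percentage are assumptions for illustration. The source only specifies the two thresholded judgments and their conjunction.

```python
from typing import List, Tuple


def unvot(p_cot: float, p_nli: float, threshold: float = 0.5) -> int:
    """Return 1 only if both modules clear the threshold, else 0."""
    s_cot = 1 if p_cot >= threshold else 0
    s_nli = 1 if p_nli >= threshold else 0
    return s_cot & s_nli


def factuality_score(facts: List[Tuple[float, float]]) -> float:
    """Percentage of atomic facts accepted by both modules (assumed aggregation)."""
    if not facts:
        return 0.0
    accepted = sum(unvot(p_cot, p_nli) for p_cot, p_nli in facts)
    return 100.0 * accepted / len(facts)


# (P_CoT(supported), P_NLI(entailment)) for four atomic facts; made-up values.
facts = [(0.91, 0.84), (0.77, 0.32), (0.45, 0.66), (0.50, 0.50)]

print([unvot(c, n) for c, n in facts])  # [1, 0, 0, 1]
print(factuality_score(facts))          # 50.0
```

Because the rule is a conjunction, the combined score can never exceed the lower of the two single-module scores on the same set of facts.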
Table: FActBench Task-wise Factuality Scores
| Task | Baseline | CoT* | NLI* | UnVot* | Human |
|---|---|---|---|---|---|
| Summ | 54.81 | 96.87 | 85.41 | 83.45 | 84.0 |
| LaySumm | 52.50 | 97.60 | 91.09 | 88.94 | 88.7 |
| RAG(QA) | 38.43 | 100.00 | 83.04 | 83.04 | 87.3 |
| PureGen | 71.26 | 88.17 | 31.61 | 31.31 | 62.7 |
(*Intrinsic + extrinsic checks; see Afzal et al., 2 Sep 2025, Table 3.)
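One rough way to read the table is to compute, for each method, the mean absolute gap to the human score across the four tasks. The sketch below does this using only the numbers in the table. The gap metric is an assumption chosen for illustration and is not the correlation analysis reported in the source.

```python
# Rows copied from the table above: (CoT, NLI, UnVot, Human).
scores = {
    "Summ":    (96.87, 85.41, 83.45, 84.0),
    "LaySumm": (97.60, 91.09, 88.94, 88.7),
    "RAG(QA)": (100.00, 83.04, 83.04, 87.3),
    "PureGen": (88.17, 31.61, 31.31, 62.7),
}
methods = ("CoT", "NLI", "UnVot")


def mean_abs_gap(method_index: int) -> float:
    """Mean absolute difference between a method's score and the human score."""
    gaps = [abs(row[method_index] - row[3]) for row in scores.values()]
    return sum(gaps) / len(gaps)


gaps = {name: round(mean_abs_gap(i), 1) for i, name in enumerate(methods)}
print(gaps)  # {'CoT': 15.0, 'NLI': 9.8, 'UnVot': 9.1}
```

On this measure UnVot sits closest to the human ratings overall, although the PureGen row accounts for most of the remaining gap for both NLI and UnVot.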
4. Multi-Agent Unanimous Voting Integrating CoT and NLI
Advanced protocols implement Unanimous Voting in multi-agent debate, enforcing 100% agreement among agents, typically using a combination of CoT generation and pairwise NLI validation (Kaesberg et al., 26 Feb 2025). In a standard protocol:
- Each agent $i$ generates a chain $A_i$ via CoT.
- Pairwise NLI consistency scores $S_{\mathrm{NLI}}(A_i, A_j)$ are computed.
- If $S_{\mathrm{NLI}}(A_i, A_j) \ge \tau$ for a fixed threshold $\tau$ and all $i \neq j$, agents are unanimous.
- Otherwise, agents exchange and refine their chains, repeating for a bounded number of rounds.
Empirical findings indicate that, for knowledge tasks, consensus protocols including unanimity improve performance (by 2.8 percentage points on MMLU/GPQA-type benchmarks), with N=3 agents typically sufficing and further gains reported for N=5 to 7 (Kaesberg et al., 26 Feb 2025). Running more than a small number of rounds degrades performance, which argues for tightly bounding the consensus loop.
5. Algorithmic and Operational Details
Pseudocode: FZI-WIM NLI Inference Pipeline
```
for each (premise, statement) in D:
    prompt = fill(P, premise, statement)
    chains = sample_chains(M, prompt, T=0.7, top_k=50, n=k_initial)
    chains = unique(chains)
    votes = {Entailment: 0, Contradiction: 0}
    for r in chains:
        y = extract_label(r)
        votes[y] += 1
    while tie(votes) and len(chains) < k_max:
        more = sample_chains(M, prompt, T=0.7, top_k=50, n=k_initial)
        new_chains = unique(set(more) - set(chains))
        chains.extend(new_chains)
        for r in new_chains:
            y = extract_label(r)
            votes[y] += 1
    if not tie(votes):
        y_star = argmax_y votes[y]
    else:
        y_star = greedy_decode_label(M, prompt)
    store(y_star)
```
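A runnable Python sketch of the same voting loop follows, with the model calls injected as plain callables so that it runs without an LLM. The helper names, the label-extraction heuristic, and the canned chains are assumptions for illustration. One deliberate difference from the pseudocode: the loop is bounded by the number of samples drawn rather than by the number of unique chains, so it terminates even if the sampler keeps returning duplicates.

```python
from collections import Counter
from typing import Callable, List

LABELS = ("Entailment", "Contradiction")


def extract_label(chain: str) -> str:
    """Take the last label mentioned in the chain as its terminal decision.

    If neither label appears, this heuristic defaults to the first label.
    """
    positions = {label: chain.rfind(label) for label in LABELS}
    return max(positions, key=positions.get)


def is_tie(votes: Counter) -> bool:
    return votes["Entailment"] == votes["Contradiction"]


def vote(
    sample_chains: Callable[[int], List[str]],
    greedy_label: Callable[[], str],
    k_initial: int = 10,
    k_max: int = 30,
) -> str:
    chains: List[str] = []
    votes: Counter = Counter()
    drawn = 0
    # Draw batches until the vote is decided or the sampling budget is spent.
    while drawn < k_max and (drawn == 0 or is_tie(votes)):
        batch = sample_chains(k_initial)
        drawn += k_initial
        for chain in dict.fromkeys(batch):  # de-duplicate, keep order
            if chain not in chains:
                chains.append(chain)
                votes[extract_label(chain)] += 1
    # Fall back to the greedy decision if the tie was never broken.
    return greedy_label() if is_tie(votes) else votes.most_common(1)[0][0]


# Canned chains standing in for sampled model outputs (made-up text).
canned = [
    "The trial enrolled adults only, so the statement holds. Entailment",
    "Dosage matches the premise. Entailment",
    "The premise reports no such outcome. Contradiction",
]
print(vote(lambda n: canned[:n], lambda: "Entailment", k_initial=3, k_max=9))  # Entailment
```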
Pseudocode: Multi-Agent CoT+NLI Unanimous Voting
```
for round in 1...T_max:
    for agent i in 1...N:
        if round == 1:
            A_i = AGENT_PROMPT_CoT(x)
        else:
            A_i = AGENT_PROMPT_REFINE(x, {A_1,...,A_N})
    for i in 1...N:
        agreed_i = all(S_nli(A_i, A_j) >= tau for all j != i)
    if sum(agreed_i) == N:
        break
y_star = CONSOLIDATE({A_i: i=1..N})
return y_star
```
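The debate loop can also be written as runnable Python with the agents and the NLI scorer passed in as callables. Everything named here (`unanimous_debate`, the `Agent` signature, the toy agents, and the exact-match scorer) is hypothetical and stands in for real LLM and NLI calls. The sketch returns the final chains and an agreement flag, and leaves the consolidation step to the caller.

```python
from typing import Callable, List, Optional, Tuple

# An agent maps (question, peers' previous chains or None) to a new chain.
Agent = Callable[[str, Optional[List[str]]], str]


def unanimous_debate(
    question: str,
    agents: List[Agent],
    nli_score: Callable[[str, str], float],
    tau: float = 0.8,
    max_rounds: int = 3,
) -> Tuple[List[str], bool]:
    """Run bounded rounds until every ordered pair of chains scores >= tau."""
    chains: List[str] = []
    agreed = False
    for round_index in range(max_rounds):
        previous = chains if round_index > 0 else None
        # Each agent answers from scratch in round 1, then refines given peers.
        chains = [agent(question, previous) for agent in agents]
        agreed = all(
            nli_score(a, b) >= tau
            for i, a in enumerate(chains)
            for j, b in enumerate(chains)
            if i != j
        )
        if agreed:
            break
    return chains, agreed


def stubborn(answer: str) -> Agent:
    """Toy agent that never changes its answer."""
    return lambda question, previous: answer


def follower(first: str) -> Agent:
    """Toy agent that adopts the first peer's previous answer once it sees one."""
    return lambda question, previous: previous[0] if previous else first


def exact_match(a: str, b: str) -> float:
    """Toy stand-in for an NLI scorer: 1.0 for identical chains, else 0.0."""
    return 1.0 if a == b else 0.0


chains, agreed = unanimous_debate(
    "Is the statement entailed?",
    [stubborn("yes"), follower("no"), follower("no")],
    exact_match,
)
print(chains, agreed)  # ['yes', 'yes', 'yes'] True
```

If the round budget runs out without agreement, the flag is `False` and the caller decides whether to abstain or fall back to another aggregation rule.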
6. Extension to Other Domains and Expected Benefits
Unanimous Voting with CoT+NLI is domain-agnostic and generalizes to domains such as law, finance, and STEM fields by adapting atomic fact extraction, evidence retrieval, NLI model pretraining, and CoT prompting to in-domain data (Afzal et al., 2 Sep 2025). Key benefits include:
- Improved robustness to spurious high token-probability errors (resilient against LLM hallucinations and inconsistent paraphrase handling) (Liu et al., 2024).
- Higher faithfulness and precision, closely tracking domain expert ratings (as measured on FActBench) (Afzal et al., 2 Sep 2025).
- Tunable trade-offs: thresholds ($\tau$), number of sampled chains or agents ($k$, $N$), and consensus criteria allow balancing recall, precision, and resource consumption.
A plausible implication is that the computational budget for full Unanimous Voting grows linearly with the number of CoT samples or participating agents, and quadratically with agent count for all-pair NLI validation, motivating selective pruning or fallback heuristics for large-scale deployment (Liu et al., 2024, Kaesberg et al., 26 Feb 2025).
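As a rough worked bound under stated assumptions: let $N$ be the number of agents, $T$ the maximum number of rounds, and $c_{\mathrm{CoT}}$ and $c_{\mathrm{NLI}}$ the per-call costs of one chain generation and one directional NLI comparison. These cost symbols are introduced here for illustration and do not come from the cited papers.

```latex
C_{\mathrm{total}} \;\le\; T \,\bigl( N\, c_{\mathrm{CoT}} + N(N-1)\, c_{\mathrm{NLI}} \bigr)

% Example with T = 3 rounds:
%   N = 3:  at most 3 * 3 = 9  generations and 3 * 3 * 2 = 18  NLI calls
%   N = 7:  at most 3 * 7 = 21 generations and 3 * 7 * 6 = 126 NLI calls
```

If the NLI score is treated as symmetric, only $N(N-1)/2$ comparisons per round are needed, which halves the quadratic term but does not change its growth rate.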
7. Connections to Classical Unanimous/Consensus Voting in Decision Theory
While the above instantiations focus on CoT and NLI within neural inference systems, Unanimous Voting also figures in classical decision protocols and optimization. For example, the Unanimous Vote problem, which asks for an optimal stopping rule for coin tosses, admits an exact solution and reveals insights about adaptivity gaps between optimal adaptive and nonadaptive policies (Keles et al., 19 Oct 2025). While unrelated to NLI per se, this classical literature contextualizes the efficiency and optimality properties that modern CoT+NLI aggregation schemes aim to approximate in the domain of AI-driven fact verification and inference.
Unanimous Voting (CoT+NLI) thus provides a mathematically rigorous, empirically validated framework for high-precision aggregation of model outputs in complex reasoning and verification tasks, unifying strict consensus protocols with state-of-the-art fact-checking and collaborative inference methodologies (Liu et al., 2024, Afzal et al., 2 Sep 2025, Kaesberg et al., 26 Feb 2025).