InFi-Checker: Certified & Interpretable AI Systems
Updated 17 January 2026
- InFi-Checker is a composite system that integrates certified DNN proof checking, mobile input filtering, and LLM fact-checking to enhance reliability in safety-critical AI systems.
- It employs rigorous methodologies such as exact real arithmetic with Imandra, depth-first proof traversal, and Siamese feature-embedding networks to balance accuracy and efficiency.
- Empirical benchmarks report up to 98% filtering rates, over 90% balanced accuracy in LLM fact-checking, and tractable performance even in full-checking regimes.
InFi-Checker is a composite designation that refers to certified, interpretable, and fine-grained checking systems across several domains, notably: deep neural network (DNN) verification, resource-efficient input filtering in mobile-centric inference, and factuality checking and error analysis for outputs of LLMs. These instantiations share methodological rigor, explicit performance metrics, and scalable implementations suitable for deployment in safety-critical and efficiency-sensitive AI environments.
1. Certified Proof Checking for DNN Verification
InFi-Checker for neural verification is implemented in the Imandra theorem prover and is designed to consume JSON-encoded proofs of UNSAT (unsatisfiability) produced by verifiers such as Marabou (Desmartin et al., 2023). Its principal components comprise a parameterized proof-tree datatype, a global tableau and bound vectors updated during traversal, and a list of piecewise-linear constraints (especially ReLU). The checker offers two orthogonal configuration choices: full versus partial theory-lemma checking, and alternate representations for vectors/matrices (native lists versus sparse maps), which together balance rigor and speed.
Exact real arithmetic is a foundational element, enabled by Imandra's built-in real type and OCaml's Zarith library. All arithmetic (additions, subtractions, scalar multiplications, and comparisons) is conducted at infinite precision, eliminating numerical instability and rounding-induced soundness violations. This supports reliable upper-bound computation for linear forms such as w⊺Ax, where all operations are exact over ℝ.
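The exact upper-bound computation can be illustrated with a minimal Python sketch (the actual checker works inside Imandra's logic over its built-in reals; here Python's `fractions.Fraction` stands in for exact rational arithmetic, and the function names are illustrative):

```python
from fractions import Fraction

def upper_bound(c, l, u):
    """Exact upper bound of the linear form c^T x over the box l <= x <= u:
    each coefficient takes its most favorable bound, and exact rational
    arithmetic guarantees the sign of the result is trustworthy."""
    return sum(ci * (ui if ci > 0 else li) for ci, li, ui in zip(c, l, u))

def check_contradiction(w, A, u, l):
    """Farkas-style leaf check: if upper(w^T A x) < 0 on the box, then no x
    with Ax = 0 can satisfy l <= x <= u (w^T A x would have to equal 0)."""
    # c = A^T w, computed exactly over the rationals
    c = [sum(wi * aij for wi, aij in zip(w, col)) for col in zip(*A)]
    return upper_bound(c, l, u) < 0
```

For example, with A = [[1, -1]], w = [1], x₁ ∈ [0, 1], and x₂ ∈ [2, 3], the exact upper bound of x₁ - x₂ is -1 < 0, certifying that x₁ = x₂ is infeasible on that box.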
The formal specification of soundness is stated both in Imandra's logic and in standard mathematical notation: if the contradiction test passes at every leaf (i.e., upper(w⊺Ax) < 0), then the original LP (Ax = 0, l ≤ x ≤ u) is infeasible:

$$\forall A \in \mathbb{R}^{m \times n},\ u, l \in \mathbb{R}^{n},\ w \in \mathbb{R}^{m}.\quad \mathrm{check\_contradiction}(w, A, u, l) \implies \neg \exists x \in \mathbb{R}^{n}.\ Ax = 0 \land l \le x \le u$$
The algorithmic workflow parses the proof object and performs a depth-first traversal of nodes and leaves. At nodes, splits and theory-lemmas are checked via recomputation and pattern-matching; at leaves, Farkas vectors provide contradiction certificates. All supporting linear algebraic properties are formally proved in the same environment for consistency.
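The traversal can be sketched schematically in Python (this is not the Imandra implementation; the `Leaf`/`Node` datatypes are simplified placeholders, and the bound tightening performed at splits is elided):

```python
from dataclasses import dataclass

@dataclass
class Leaf:
    farkas: list      # Farkas vector certifying a contradiction at this leaf

@dataclass
class Node:
    split: tuple      # case split (e.g., a ReLU phase), simplified here
    children: list    # one subproof per case

def check_proof(tree, check_leaf, check_split=lambda s: True):
    """Depth-first proof traversal: a node passes iff its split is justified
    and every child subproof passes; a leaf passes iff its Farkas vector
    certifies a contradiction."""
    if isinstance(tree, Leaf):
        return check_leaf(tree.farkas)
    return check_split(tree.split) and all(
        check_proof(child, check_leaf, check_split) for child in tree.children
    )
```

A single failing leaf or unjustified split rejects the whole proof, mirroring the all-or-nothing nature of a certified check.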
Empirical benchmarks on ACAS-Xu verification tasks indicate that InFi-Checker (in sparse/no-lemma mode) is approximately 2× slower than the original C++ checker, while fully rigorous checking can be up to 150× slower yet remains tractable (<40 min for the largest proofs) (Desmartin et al., 2023). No numerical instabilities were observed.
Performance overhead and the lack of support for richer activation functions remain open limitations. Ongoing work includes optimizations (AVL-based sparse maps, specialized matrix indexing), extension to mixed-integer proofs, and full integration into larger system-level proofs.
2. Resource-Efficient Input Filtering in Mobile-Centric Inference
InFi-Checker also refers to a systematized framework for end-to-end learning of input filters, improving resource efficiency in mobile AI inference workloads (Yuan et al., 2022). The theoretical foundation formalizes filterability using complexity measures (VC-dimension, Rademacher complexity), with explicit case analyses distinguishing when filtering is feasible.
An inference workload (X, Y, c, H, D, S) is evaluated for redundancy via f_h : Y → {0, 1}, labeling outputs as redundant or useful. The filter learning problem seeks a model g : X → Z (with hypothesis class G) satisfying g(x) ≈ f_h(h(x)) at low computational cost. Filtering rate r, accuracy Acc, and saved cost C_tot are central metrics. Validity conditions are Acc ≥ T_Acc and C_tot < C(h).
Case-based filterability analysis yields:
- Low-confidence classification: not filterable; R̂_S(G) ≥ R̂_S(H).
- Class-subset skip: filterable when |Y′| ≪ ℓ; R̂_S(G) ≤ R̂_S(H).
- Thresholded regression: filterable, with guaranteed lower complexity.
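The validity conditions above can be made concrete with a small sketch (the helper names and abstract cost units are illustrative, not the paper's API):

```python
def filter_metrics(preds, redundant, cost_filter, cost_model):
    """Filtering rate r, accuracy Acc, and total cost C_tot for a filter g.
    preds[i] = 1 if g flags input i as redundant (skippable);
    redundant[i] = 1 if f_h(h(x_i)) actually labels it redundant."""
    n = len(preds)
    r = sum(preds) / n                                    # fraction filtered
    acc = sum(p == t for p, t in zip(preds, redundant)) / n
    # every input pays the filter; only unfiltered inputs pay full inference
    c_tot = n * cost_filter + (n - sum(preds)) * cost_model
    return r, acc, c_tot

def is_valid_filter(preds, redundant, cost_filter, cost_model, t_acc):
    """Validity conditions: Acc >= T_Acc and C_tot < C(h), where C(h) is the
    cost of running the full model h on every input."""
    _, acc, c_tot = filter_metrics(preds, redundant, cost_filter, cost_model)
    return acc >= t_acc and c_tot < len(preds) * cost_model
```

A filter that is cheap but inaccurate fails the first condition; one that is accurate but nearly as expensive as h fails the second.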
InFi's metric-learning framework supports both SKIP and REUSE strategies using a Siamese feature-embedding network g_mod, modality-specific architectures, and contrastive/binary loss functions. Active online update mechanisms handle nonstationary input streams. Efficient implementations cover six modalities (text, image, video, audio, sensor, feature-map) with end-to-end differentiability.
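The contrastive objective for the Siamese embedding can be sketched as follows (a standard pairwise contrastive loss; the margin value and plain-Python rendering are illustrative, not InFi's exact formulation):

```python
import math

def contrastive_loss(z1, z2, same, margin=1.0):
    """Standard pairwise contrastive loss over Siamese embeddings:
    same=1 pulls a pair together (squared distance), while same=0 pushes
    it apart until the distance exceeds the margin."""
    d = math.dist(z1, z2)                 # Euclidean distance in Z
    return same * d ** 2 + (1 - same) * max(0.0, margin - d) ** 2
```

Pairs of inputs whose outputs share a redundancy label are pulled together in the embedding space; dissimilar pairs are pushed beyond the margin, after which they contribute no gradient.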
Empirical results demonstrate up to a 98% filtering rate, 8.5× throughput, and up to 95% bandwidth savings at >90% accuracy in video analytics tasks (Yuan et al., 2022). Practical guidance covers estimating filterability, selecting architectures, cross-validating operational parameters, and efficiency checks on target devices.
3. Factual Consistency and Fine-Grained Fact-Checking of LLM Outputs
Recent advances in InFi-Checker extend its scope to interpretable, fine-grained fact-checking for LLM-generated text, as described in InFi-Check and InFi-Check-FG benchmarks (Bai et al., 10 Jan 2026). The InFi-Checker model jointly retrieves explicit supporting evidence, classifies error types at sentence granularity, provides chain-of-thought justifications, and generates corrections. Data synthesis assembles claims grounded in curated corpora, attaches evidence and audits claims via multi-model and human verification, and synthesizes diverse error types—Predicate (PredE), Entity (EntE), Circumstance (CircE), Co-reference (CorefE), Discourse Link (LinkE), and Extrinsic (OutE).
The full structured target for model training comprises correct sentence, hallucinated version, evidence, error category, justification, and correction. Multitask learning combines cross-entropy classification over seven fine-grained classes and sequence-level generation of evidence/justification/correction. Backbone models include Llama-3.1-8B-Instruct and Qwen3-8B.
Experimental benchmarks show high balanced accuracy (BAcc: 90.9% for Llama, 92.3% for Qwen), and +27.6pp gain over GPT-4o, with error localization and sentence alignment ratio (SAR) rendering outputs substantially more interpretable (Bai et al., 10 Jan 2026). Generalization to out-of-distribution datasets and binary fact-check benchmarks is robust (Macro-F1 ≥83.7%).
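Balanced accuracy, the headline metric above, averages per-class recall so that rare error types count as much as common ones; a minimal reference implementation:

```python
def balanced_accuracy(y_true, y_pred):
    """Balanced accuracy (BAcc): the mean of per-class recall, so that
    rare error types weigh as much as frequent ones."""
    recalls = []
    for c in set(y_true):
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)
```

Under a skewed error-type distribution, plain accuracy can be inflated by the majority class; BAcc penalizes a model that ignores rare categories such as CorefE or LinkE.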
Fine-grained ablations reveal that removing any output element reduces BAcc drastically (to 20.6%). Cost efficiency is highlighted, with evaluation costing ~$4 on InFi-Check-FG, significantly lower than closed-source LLMs.

Limitations include sensitivity to initial LLM quality in data synthesis, taxonomy scope (style and grammar errors are excluded), and possible modularity improvement via retriever integration. Real-time deployment may benefit from classifier distillation.

4. Insights from Weakly Supervised Factuality Metrics

Techniques from WeCheck (Wu et al., 2022) are directly transferable to InFi-Checker's factuality evaluation. Two-step weakly supervised pipelines aggregate multiple noisy scorers (NLI, QA, and retrieval-based metrics) into soft labels via a generative labeling model p_θ, then train a compact encoder on these labels using noise-aware cross-entropy objectives. The generative model independently learns coverage and accuracy parameters for each weak label source, marginalizing to infer reliable per-example labels:

$$p_\theta(\tilde{A}, y) = p(y) \prod_{i=1}^{k} \left[ B_i \alpha_i \mathbf{1}(\tilde{A}_i = y) + B_i (1 - \alpha_i) \mathbf{1}(\tilde{A}_i = 1 - y) + (1 - B_i) \mathbf{1}(\tilde{A}_i = -1) \right]$$

Factuality-metric pre-training on NLI datasets (MultiNLI, Adversarial-NLI, Fever-NLI, LingNLI) is beneficial, attaining ROC AUC ~80.3 on the TRUE benchmark. WeCheck achieves 84.8 ROC AUC while being 20–30× faster than QA-based metrics (Wu et al., 2022). Recommendations for InFi-Checker include aggregation of heterogeneous weak sources, adaptive thresholding, and joint denoising.

5. Practical Considerations and Limitations

Performance overhead is notable in full-checking regimes, motivating further optimizations such as efficient sparse representations and custom matrix/list operations (Desmartin et al., 2023). In mobile-centric deployments, training latency of InFi modules varies by modality; image models use about 50 MB of memory, vector modalities less than 5 MB. Active online update improves sample efficiency in nonstationary contexts.

Interpretability is maximized in fine-grained fact-checking via explicit evidence attachment and justifications; however, coverage is limited to predefined taxonomies. Not all error modalities (e.g., rhetorical, grammatical) are handled, and system integration still relies on the underlying quality of synthetic data pipelines and LLM decoders.

A plausible implication is that, while InFi-Checker sets new standards for certified, interpretable, and efficient verification in AI systems, continued work on formal verification, extension to heterogeneous activation functions, and scalable deployment in real-time environments is warranted.

6. Summary Table: InFi-Checker Instantiations

| Application Domain | Key Methodology | Reported Metrics/Results |
| --- | --- | --- |
| DNN Verification | Certified UNSAT proof checking; infinite-precision arithmetic in Imandra (Desmartin et al., 2023) | Sparse/no-lemma: ~2× C++ checker time; full checking: up to 150×; exact reals, no instability |
| Input Filtering | End-to-end, modality-agnostic input-filter learning (Yuan et al., 2022) | Filtering rate up to 98%, 8.5× throughput, 95% bandwidth savings |
| LLM Fact-Checking | Joint evidence/error-type/correction classification (Bai et al., 10 Jan 2026) | BAcc 90.9% (Llama), 92.3% (Qwen); cost-efficient, interpretable |
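The weakly supervised labeling model p_θ described above can be sketched numerically (illustrative Python, not WeCheck's implementation; `alpha` and `beta` are the per-source accuracy and coverage parameters, and -1 marks abstention):

```python
def weak_label_likelihood(tilde_a, y, p_y, alpha, beta):
    """Joint likelihood p_theta(tilde_A, y) under the generative labeling
    model: source i abstains with probability 1 - beta[i]; when it votes,
    it agrees with the latent label y with probability alpha[i].
    tilde_a[i] is in {0, 1, -1}, with -1 marking abstention."""
    prob = p_y if y == 1 else 1.0 - p_y
    for a_i, al, be in zip(tilde_a, alpha, beta):
        if a_i == -1:
            prob *= 1.0 - be              # source abstained
        elif a_i == y:
            prob *= be * al               # source agrees with y
        else:
            prob *= be * (1.0 - al)       # source disagrees with y
    return prob

def soft_label(tilde_a, p_y, alpha, beta):
    """Marginalize over y to obtain the per-example soft label p(y=1 | tilde_A)
    used to train the compact encoder."""
    p1 = weak_label_likelihood(tilde_a, 1, p_y, alpha, beta)
    p0 = weak_label_likelihood(tilde_a, 0, p_y, alpha, beta)
    return p1 / (p0 + p1)
```

Two reliable sources voting "factual" while a third abstains yields a soft label close to 1, which the downstream encoder consumes via a noise-aware cross-entropy objective.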