
InFi-Checker: Certified & Interpretable AI Systems

Updated 17 January 2026
  • InFi-Checker is a composite designation spanning certified DNN proof checking, mobile input filtering, and LLM fact-checking, aimed at improving reliability in safety-critical AI systems.
  • It employs rigorous methodologies such as exact real arithmetic with Imandra, depth-first proof traversal, and Siamese feature-embedding networks to balance accuracy and efficiency.
  • Empirical benchmarks report up to 98% filtering rates, over 90% balanced accuracy in LLM fact-checking, and tractable performance even in full-checking regimes.

InFi-Checker is a composite designation that refers to certified, interpretable, and fine-grained checking systems across several domains, notably: deep neural network (DNN) verification, resource-efficient input filtering in mobile-centric inference, and factuality checking and error analysis for outputs of LLMs. These instantiations share methodological rigor, explicit performance metrics, and scalable implementations suitable for deployment in safety-critical and efficiency-sensitive AI environments.

1. Certified Proof Checking for DNN Verification

InFi-Checker for neural verification is implemented in the Imandra theorem prover and is designed to consume JSON-encoded proofs of UNSAT (unsatisfiability) produced by verifiers such as Marabou (Desmartin et al., 2023). Its principal components comprise a parameterized proof-tree datatype, global tableau and bound vectors updated during traversal, and a list of piecewise-linear constraints (especially ReLU). The checker offers two orthogonal configuration choices for balancing rigor and speed: full versus partial theory-lemma checking, and alternate vector/matrix representations (native lists vs. sparse maps).

Exact real arithmetic is a foundational element, enabled by Imandra's built-in real type and OCaml's Zarith library. All arithmetic (additions, subtractions, scalar multiplications, and comparisons) is conducted at infinite precision, eliminating numerical instability and rounding-induced soundness violations. This supports reliable upper-bound computation for linear forms such as w⊺Ax, where all operations are exact over ℝ.
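The effect of exact arithmetic can be illustrated in Python with the standard-library `fractions.Fraction` type. This is only a sketch of the idea; the actual checker works over Imandra's native reals backed by Zarith, and `upper_bound` here is an illustrative helper, not the checker's API:

```python
from fractions import Fraction

# Floating point accumulates rounding error:
assert 0.1 + 0.2 != 0.3

# Exact rational arithmetic does not:
assert Fraction(1, 10) + Fraction(2, 10) == Fraction(3, 10)

def upper_bound(w, l, u):
    """Exact upper bound of the linear form w.x over the box l <= x <= u:
    positive coefficients are pushed to the upper bound, negative to the lower."""
    return sum(wi * (ui if wi >= 0 else li) for wi, li, ui in zip(w, l, u))

w = [Fraction(1), Fraction(-2)]
l = [Fraction(0), Fraction(1)]
u = [Fraction(3), Fraction(5)]
assert upper_bound(w, l, u) == Fraction(1)  # 1*3 + (-2)*1 = 1
```

Because every intermediate value is a rational, the computed bound is exact, so a strict inequality against it is a sound conclusion rather than a numerical artifact.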

The formal specification of soundness is stated in Imandra's logic: if the contradiction test passes at every leaf (i.e., upper(w⊺Ax) < 0), then the original LP (Ax = 0, l ≤ x ≤ u) is infeasible:

∀ A ∈ ℝ^{m×n}, u, l ∈ ℝ^n, w ∈ ℝ^m. check_contradiction(w, A, u, l) ⟹ ¬∃ x ∈ ℝ^n. Ax = 0 ∧ l ≤ x ≤ u
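The leaf-level test can be sketched in Python with exact rationals; the function name follows the formula above, but the implementation is an illustrative reconstruction, not the Imandra code. The Farkas vector w combines the rows of A into one linear form whose exact upper bound over the box must be negative:

```python
from fractions import Fraction

def check_contradiction(w, A, u, l):
    """Farkas-style contradiction certificate. If the exact upper bound of
    (w^T A) x over the box l <= x <= u is negative, no x can satisfy
    A x = 0 together with the bounds, since w^T A x would have to equal 0."""
    m, n = len(A), len(A[0])
    # Combined row c = w^T A, computed exactly.
    c = [sum(w[i] * A[i][j] for i in range(m)) for j in range(n)]
    # Upper bound of c.x over the box: push each coefficient to its extreme.
    upper = sum(cj * (u[j] if cj >= 0 else l[j]) for j, cj in enumerate(c))
    return upper < 0
```

For example, with A = [[1, 1]], bounds 1 ≤ x ≤ 2 componentwise, and w = [-1], the combined form is -x1 - x2 with exact upper bound -2 < 0, certifying that Ax = 0 is infeasible under those bounds.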

The algorithmic workflow parses the proof object and performs a depth-first traversal of nodes and leaves. At nodes, splits and theory-lemmas are checked via recomputation and pattern-matching; at leaves, Farkas vectors provide contradiction certificates. All supporting linear algebraic properties are formally proved in the same environment for consistency.
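The traversal can be sketched as a recursive depth-first walk over a toy proof-tree datatype. All names here are illustrative stand-ins for the Imandra datatype, not its actual definition:

```python
from dataclasses import dataclass

@dataclass
class Leaf:
    farkas_vector: list  # contradiction certificate for this leaf's LP

@dataclass
class Node:
    split: str           # e.g. which ReLU phase was fixed at this node
    children: list       # sub-proofs, one per case of the split

def check_tree(tree, check_leaf, check_split):
    """Depth-first traversal: the split at every node must be well-formed,
    and every leaf's Farkas vector must certify infeasibility."""
    if isinstance(tree, Leaf):
        return check_leaf(tree.farkas_vector)
    return check_split(tree.split) and all(
        check_tree(child, check_leaf, check_split) for child in tree.children
    )

# A two-case split whose leaves both carry (placeholder) certificates:
proof = Node("relu_3_active", [Leaf([1]), Leaf([2])])
assert check_tree(proof, lambda v: len(v) > 0, lambda s: True)
```

The checker accepts the proof only if every branch of the case split closes, mirroring the requirement that UNSAT must hold in all ReLU phases explored by the verifier.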

Empirical benchmarks on ACAS-Xu verification tasks indicate that InFi-Checker (in sparse/no-lemma mode) is approximately 2× slower than the original C++ checker, but fully rigorous checking can be up to 150× slower while remaining tractable (<40 min for the largest proofs) (Desmartin et al., 2023). No numerical instabilities were observed.

Performance constraints remain, as does the lack of support for richer activation functions. Ongoing work includes optimizations (AVL-based sparse maps, specialized matrix indexing), extension to mixed-integer proofs, and full integration into larger system-level proofs.

2. Input Filtering for Mobile-Centric Inference

InFi-Checker also refers to a systematic framework for end-to-end learning of input filters, improving resource efficiency in mobile AI inference workloads (Yuan et al., 2022). The theoretical foundation rests on formalizing filterability using complexity measures (VC-dimension, Rademacher complexity), with explicit case analyses distinguishing when filtering is feasible.

An inference workload (𝒳, 𝒴, c, ℋ, D, S) is evaluated for redundancy via f_h : 𝒴 → {0, 1}, labeling outputs as redundant or useful. The filter learning problem seeks a model g : 𝒳 → ℤ (with hypothesis class 𝒢) satisfying g(x) ≈ f_h(h(x)) at low computational cost. Filtering rate r, accuracy Acc, and saved cost C_tot are central metrics. Validity conditions are Acc ≥ T_Acc and C_tot < C(h).
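These metrics can be sketched directly; the cost model below (every input pays the filter, only unfiltered inputs pay full inference) is an assumption for illustration, as is the per-input accounting of C(h):

```python
def filter_metrics(redundant, decisions, c_filter, c_model):
    """Filtering rate r, accuracy Acc, and total cost C_tot for a candidate
    filter g. redundant[i] is the ground truth f_h(h(x_i)) in {0, 1};
    decisions[i] is g(x_i). Every input pays the filter cost c_filter;
    only inputs the filter passes through pay the inference cost c_model."""
    n = len(redundant)
    passed = n - sum(decisions)
    r = sum(decisions) / n
    acc = sum(d == f for d, f in zip(decisions, redundant)) / n
    c_tot = n * c_filter + passed * c_model
    return r, acc, c_tot

r, acc, c_tot = filter_metrics([1, 1, 0, 0], [1, 0, 0, 0], c_filter=1, c_model=10)
# The filter is valid when acc >= T_Acc and c_tot is below the
# unfiltered cost n * C(h) (here 4 * 10 = 40).
assert c_tot < 4 * 10
```

In this toy run one of four inputs is filtered (r = 0.25) with one false negative (Acc = 0.75), and total cost 34 beats the unfiltered 40, so the filter saves cost despite its own overhead.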

Case-based filterability analysis yields:

  • Low-confidence classification: not filterable, R̂_S(𝒢) ≥ R̂_S(ℋ).
  • Class-subset skip: filterable when |𝒴′| ≪ ℓ, R̂_S(𝒢) ≤ R̂_S(ℋ).
  • Thresholded regression: filterable, with guaranteed lower complexity.

InFi's metric-learning framework supports both SKIP and REUSE strategies using a Siamese feature-embedding network g_mod, modality-specific architectures, and contrastive/binary loss functions. Active online update mechanisms are deployed for nonstationary input streams. Efficient implementation covers six modalities (text, image, video, audio, sensor, feature-map) with end-to-end differentiability.
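A toy rendering of the Siamese idea in pure Python: the single linear layer standing in for g_mod, the contrastive loss, and the REUSE distance threshold are all illustrative simplifications, not the paper's architecture:

```python
import math

def embed(x, W):
    """Shared Siamese branch: one linear layer as a toy stand-in for the
    modality-specific encoder (g_mod in the paper's notation)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def dist(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def contrastive_loss(x1, x2, same, W, margin=1.0):
    """Pull same-label pairs together; push different-label pairs
    at least `margin` apart in embedding space."""
    d = dist(embed(x1, W), embed(x2, W))
    return d ** 2 if same else max(0.0, margin - d) ** 2

def reuse(x, cache, W, threshold=0.5):
    """REUSE strategy: serve a cached result when a previously seen input
    embeds within `threshold` of the new one; otherwise run inference."""
    e = embed(x, W)
    return any(dist(e, embed(c, W)) < threshold for c in cache)
```

Both branches share the same weights W, which is what makes the network Siamese: the learned metric, not the classifier, decides whether an input is redundant.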

Empirical results demonstrate up to a 98% filtering rate, 8.5× throughput, and up to 95% bandwidth savings at >90% accuracy in video analytics tasks (Yuan et al., 2022). Practical guidance is provided for estimating filterability, selecting architectures, cross-validation of operational parameters, and efficiency checks on target devices.

3. Factual Consistency and Fine-Grained Fact-Checking of LLM Outputs

Recent advances in InFi-Checker extend its scope to interpretable, fine-grained fact-checking for LLM-generated text, as described in InFi-Check and InFi-Check-FG benchmarks (Bai et al., 10 Jan 2026). The InFi-Checker model jointly retrieves explicit supporting evidence, classifies error types at sentence granularity, provides chain-of-thought justifications, and generates corrections. Data synthesis assembles claims grounded in curated corpora, attaches evidence and audits claims via multi-model and human verification, and synthesizes diverse error types—Predicate (PredE), Entity (EntE), Circumstance (CircE), Co-reference (CorefE), Discourse Link (LinkE), and Extrinsic (OutE).

The full structured target for model training comprises correct sentence, hallucinated version, evidence, error category, justification, and correction. Multitask learning combines cross-entropy classification over seven fine-grained classes and sequence-level generation of evidence/justification/correction. Backbone models include Llama-3.1-8B-Instruct and Qwen3-8B.
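The structured target can be sketched as a record type. The field names and the seventh "NoE" (no error) class label are assumptions for illustration, not taken from the paper:

```python
from dataclasses import dataclass

# Six fine-grained error types from the benchmark, plus an assumed "NoE"
# (no error) label for correct sentences, giving seven classes.
CLASSES = ("PredE", "EntE", "CircE", "CorefE", "LinkE", "OutE", "NoE")

@dataclass
class FactCheckTarget:
    """One structured training target (field names are illustrative)."""
    sentence: str        # the (possibly hallucinated) sentence under check
    evidence: str        # retrieved supporting evidence
    error_type: str      # one of the seven fine-grained classes
    justification: str   # chain-of-thought rationale for the verdict
    correction: str      # corrected sentence ("" when error_type == "NoE")

    def __post_init__(self):
        if self.error_type not in CLASSES:
            raise ValueError(f"unknown error type: {self.error_type}")
```

Multitask training then applies cross-entropy over `error_type` and sequence-level generation losses over the evidence, justification, and correction fields jointly.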

Experimental benchmarks show high balanced accuracy (BAcc: 90.9% for Llama, 92.3% for Qwen) and a +27.6 pp gain over GPT-4o, with error localization and sentence alignment ratio (SAR) rendering outputs substantially more interpretable (Bai et al., 10 Jan 2026). Generalization to out-of-distribution datasets and binary fact-check benchmarks is robust (Macro-F1 ≥ 83.7%).

Fine-grained ablations reveal that removing any output element reduces BAcc drastically (to 20.6%). Cost efficiency is highlighted, with evaluation costing ~$4 on InFi-Check-FG, significantly lower than closed-source LLMs.

Limitations include sensitivity to initial LLM quality in data synthesis, taxonomy scope (excludes style/grammar), and possible modularity improvement via retriever integration. Real-time deployment may benefit from classifier distillation.

4. Insights from Weakly Supervised Factuality Metrics

Techniques from WeCheck (Wu et al., 2022) are directly transferable to InFi-Checker's factuality evaluation. Two-step weakly supervised pipelines aggregate multiple noisy scorers (NLI, QA, retrieval-based metrics) into soft labels via a generative labeling model p_θ, then train a compact encoder on these labels using noise-aware cross-entropy objectives. The generative model independently learns coverage (B_i) and accuracy (α_i) parameters of each weak label source, marginalizing to infer reliable per-example labels:

p_θ(Ã, y) = p(y) ∏_{i=1}^{k} [ B_i α_i 1(Ã_i = y) + B_i (1 − α_i) 1(Ã_i = 1 − y) + (1 − B_i) 1(Ã_i = −1) ]

Factuality-metric pre-training on NLI datasets (MultiNLI, Adversarial-NLI, Fever-NLI, LingNLI) is beneficial, attaining ROC AUC ~80.3 on the TRUE benchmark. WeCheck achieves 84.8 ROC AUC while being 20-30× faster than QA-based metrics (Wu et al., 2022). Recommendations for InFi-Checker include aggregation of heterogeneous weak sources, adaptive thresholding, and joint denoising.

5. Practical Considerations and Limitations

Performance overhead is notable in full-checking regimes, motivating further optimizations (e.g., efficient sparse representations, custom matrix/list operations) (Desmartin et al., 2023). In mobile-centric deployments, training latency of InFi modules varies by modality; image models use ~50 MB of memory, vector modalities <5 MB. Active online update improves sample efficiency in nonstationary contexts.

Interpretability is maximized in fine-grained fact-checking via explicit evidence attachment and justifications; however, coverage is limited to predefined taxonomies. Not all error modalities (e.g., rhetorical, grammatical) are handled, and system integration still relies on the underlying quality of synthetic data pipelines and LLM decoders.

A plausible implication is that, while InFi-Checker sets new standards for certified, interpretable, and efficient verification in AI systems, continued work on formal verification, extension to heterogeneous activation functions, and scalable deployment in real-time environments is warranted.

6. Summary Table: InFi-Checker Instantiations

| Application Domain | Key Methodology | Reported Metrics/Results |
|---|---|---|
| DNN Verification | Certified UNSAT proof checking; infinite-precision arithmetic in Imandra (Desmartin et al., 2023) | Sparse/no-lemma: ~2× C++ checker time; full: up to 150×; exact reals, no instability |
| Input Filtering | End-to-end, modality-agnostic input filter learning (Yuan et al., 2022) | Filtering rate up to 98%, 8.5× throughput, 95% bandwidth savings |
| LLM Fact-Checking | Joint evidence/error-type/correction classification (Bai et al., 10 Jan 2026) | BAcc 90.9% (Llama), 92.3% (Qwen); cost-efficient, interpretable |

This table organizes the central instantiations of InFi-Checker, synthesizing their methodologies and principal reported results. Domain-specific details, proofs, and pseudocode are detailed in the corresponding papers (Desmartin et al., 2023, Yuan et al., 2022, Wu et al., 2022, Bai et al., 10 Jan 2026).
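For concreteness, the WeCheck-style generative labeling model discussed in Section 4 can be sketched in Python. Variable names mirror the formula, with Ã_i = −1 encoding an abstaining weak source; this is an illustrative rendering of the stated probability model, not WeCheck's implementation:

```python
def joint_prob(weak, y, p_y, B, alpha):
    """p_theta(A~, y) = p(y) * prod_i [ B_i a_i 1(A~_i = y)
       + B_i (1 - a_i) 1(A~_i = 1 - y) + (1 - B_i) 1(A~_i = -1) ].
    weak[i] in {0, 1, -1}; B[i] is source i's coverage, alpha[i] its accuracy."""
    p = p_y[y]
    for a, b, acc in zip(weak, B, alpha):
        if a == -1:
            p *= 1 - b          # source abstained
        elif a == y:
            p *= b * acc        # source voted with the label
        else:
            p *= b * (1 - acc)  # source voted against it
    return p

def soft_label(weak, p_y, B, alpha):
    """Marginalize over y to infer the per-example soft label p(y=1 | A~)."""
    p0 = joint_prob(weak, 0, p_y, B, alpha)
    p1 = joint_prob(weak, 1, p_y, B, alpha)
    return p1 / (p0 + p1)

# Two always-covering sources of 90% accuracy agreeing on y=1 yield a
# confident soft label; two abstentions fall back to the prior.
p_y = {0: 0.5, 1: 0.5}
print(soft_label([1, 1], p_y, [1.0, 1.0], [0.9, 0.9]))
```

The compact encoder is then trained against these soft labels, which is what makes the pipeline robust to any single noisy scorer.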
