InFi-Checker: Certified & Interpretable AI Systems
Updated 17 January 2026
- InFi-Checker is a composite system that integrates certified DNN proof checking, mobile input filtering, and LLM fact-checking to enhance reliability in safety-critical AI systems.
- It employs rigorous methodologies such as exact real arithmetic with Imandra, depth-first proof traversal, and Siamese feature-embedding networks to balance accuracy and efficiency.
- Empirical benchmarks report up to 98% filtering rates, over 90% balanced accuracy in LLM fact-checking, and tractable performance even in full-checking regimes.
InFi-Checker is a composite designation that refers to certified, interpretable, and fine-grained checking systems across several domains, notably: deep neural network (DNN) verification, resource-efficient input filtering in mobile-centric inference, and factuality checking and error analysis for outputs of LLMs. These instantiations share methodological rigor, explicit performance metrics, and scalable implementations suitable for deployment in safety-critical and efficiency-sensitive AI environments.
1. Certified Proof Checking for DNN Verification
InFi-Checker for neural verification is implemented in the Imandra theorem prover and is designed to consume JSON-encoded proofs of UNSAT (unsatisfiability) produced by verifiers such as Marabou (Desmartin et al., 2023). Its principal components comprise a parameterized proof-tree datatype, a global tableau and bound vectors updated during traversal, and a list of piecewise-linear constraints (especially ReLU). The checker offers two orthogonal configuration choices: full versus partial theory-lemma checking, and alternate representations for vectors/matrices (native lists versus sparse maps), which together balance rigor and speed.
Exact real arithmetic is a foundational element, enabled by Imandra's built-in real type and OCaml's Zarith library. All arithmetic (additions, subtractions, scalar multiplications, and comparisons) is conducted at infinite precision, eliminating numerical instability and rounding-induced soundness violations. This supports reliable upper-bound computation for linear forms such as w⊺Ax, where all operations are exact over ℝ.
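The exact upper-bound computation can be illustrated with a minimal Python sketch (the actual checker works inside Imandra's logic over its built-in reals; here Python's `fractions.Fraction` stands in for exact rational arithmetic, and the function names are illustrative):

```python
from fractions import Fraction

def upper_bound(c, l, u):
    """Exact upper bound of the linear form c^T x over the box l <= x <= u:
    each coefficient takes its most favorable bound, and exact rational
    arithmetic guarantees the sign of the result is trustworthy."""
    return sum(ci * (ui if ci > 0 else li) for ci, li, ui in zip(c, l, u))

def check_contradiction(w, A, u, l):
    """Farkas-style leaf check: if upper(w^T A x) < 0 on the box, then no x
    with Ax = 0 can satisfy l <= x <= u (w^T A x would have to equal 0)."""
    # c = A^T w, computed exactly over the rationals
    c = [sum(wi * aij for wi, aij in zip(w, col)) for col in zip(*A)]
    return upper_bound(c, l, u) < 0
```

For example, with A = [[1, -1]], w = [1], x₁ ∈ [0, 1], and x₂ ∈ [2, 3], the exact upper bound of x₁ - x₂ is -1 < 0, certifying that x₁ = x₂ is infeasible on that box.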
The formal specification of soundness is stated both in Imandra's logic and in standard mathematical notation: if the contradiction test passes at every leaf (i.e., upper(w⊺Ax) < 0), then the original LP (Ax = 0, l ≤ x ≤ u) is infeasible:

$$\forall A \in \mathbb{R}^{m \times n},\ u, l \in \mathbb{R}^{n},\ w \in \mathbb{R}^{m}.\quad \mathrm{check\_contradiction}(w, A, u, l) \implies \neg \exists x \in \mathbb{R}^{n}.\ Ax = 0 \land l \le x \le u$$
The algorithmic workflow parses the proof object and performs a depth-first traversal of nodes and leaves. At nodes, splits and theory-lemmas are checked via recomputation and pattern-matching; at leaves, Farkas vectors provide contradiction certificates. All supporting linear algebraic properties are formally proved in the same environment for consistency.
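The traversal can be sketched schematically in Python (this is not the Imandra implementation; the `Leaf`/`Node` datatypes are simplified placeholders, and the bound tightening performed at splits is elided):

```python
from dataclasses import dataclass

@dataclass
class Leaf:
    farkas: list      # Farkas vector certifying a contradiction at this leaf

@dataclass
class Node:
    split: tuple      # case split (e.g., a ReLU phase), simplified here
    children: list    # one subproof per case

def check_proof(tree, check_leaf, check_split=lambda s: True):
    """Depth-first proof traversal: a node passes iff its split is justified
    and every child subproof passes; a leaf passes iff its Farkas vector
    certifies a contradiction."""
    if isinstance(tree, Leaf):
        return check_leaf(tree.farkas)
    return check_split(tree.split) and all(
        check_proof(child, check_leaf, check_split) for child in tree.children
    )
```

A single failing leaf or unjustified split rejects the whole proof, mirroring the all-or-nothing nature of a certified check.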
Empirical benchmarks on ACAS-Xu verification tasks indicate that InFi-Checker (in sparse/no-lemma mode) is approximately 2× slower than the original C++ checker, while fully rigorous checking can be up to 150× slower yet remains tractable (<40 min for the largest proofs) (Desmartin et al., 2023). No numerical instabilities were observed.
Performance overhead and the lack of support for richer activation functions remain open limitations. Ongoing work includes optimizations (AVL-based sparse maps, specialized matrix indexing), extension to mixed-integer proofs, and full integration into larger system-level proofs.
2. Resource-Efficient Input Filtering in Mobile-Centric Inference
InFi-Checker also refers to a systematized framework for end-to-end learning of input filters, improving resource efficiency in mobile AI inference workloads (Yuan et al., 2022). The theoretical foundation formalizes filterability using complexity measures (VC-dimension, Rademacher complexity), with explicit case analyses distinguishing when filtering is feasible.
An inference workload (X, Y, c, H, D, S) is evaluated for redundancy via f_h : Y → {0, 1}, labeling outputs as redundant or useful. The filter learning problem seeks a model g : X → Z (with hypothesis class G) satisfying g(x) ≈ f_h(h(x)) at low computational cost. Filtering rate r, accuracy Acc, and saved cost C_tot are central metrics. Validity conditions are Acc ≥ T_Acc and C_tot < C(h).
Case-based filterability analysis yields:
- Low-confidence classification: not filterable; R̂_S(G) ≥ R̂_S(H).
- Class-subset skip: filterable when |Y′| ≪ ℓ; R̂_S(G) ≤ R̂_S(H).
- Thresholded regression: filterable, with guaranteed lower complexity.
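The validity conditions above can be made concrete with a small sketch (the helper names and abstract cost units are illustrative, not the paper's API):

```python
def filter_metrics(preds, redundant, cost_filter, cost_model):
    """Filtering rate r, accuracy Acc, and total cost C_tot for a filter g.
    preds[i] = 1 if g flags input i as redundant (skippable);
    redundant[i] = 1 if f_h(h(x_i)) actually labels it redundant."""
    n = len(preds)
    r = sum(preds) / n                                    # fraction filtered
    acc = sum(p == t for p, t in zip(preds, redundant)) / n
    # every input pays the filter; only unfiltered inputs pay full inference
    c_tot = n * cost_filter + (n - sum(preds)) * cost_model
    return r, acc, c_tot

def is_valid_filter(preds, redundant, cost_filter, cost_model, t_acc):
    """Validity conditions: Acc >= T_Acc and C_tot < C(h), where C(h) is the
    cost of running the full model h on every input."""
    _, acc, c_tot = filter_metrics(preds, redundant, cost_filter, cost_model)
    return acc >= t_acc and c_tot < len(preds) * cost_model
```

A filter that is cheap but inaccurate fails the first condition; one that is accurate but nearly as expensive as h fails the second.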
InFi's metric-learning framework supports both SKIP and REUSE strategies using a Siamese feature-embedding network g_mod, modality-specific architectures, and contrastive/binary loss functions. Active online update mechanisms handle nonstationary input streams. Efficient implementations cover six modalities (text, image, video, audio, sensor, feature-map) with end-to-end differentiability.
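The contrastive objective for the Siamese embedding can be sketched as follows (a standard pairwise contrastive loss; the margin value and plain-Python rendering are illustrative, not InFi's exact formulation):

```python
import math

def contrastive_loss(z1, z2, same, margin=1.0):
    """Standard pairwise contrastive loss over Siamese embeddings:
    same=1 pulls a pair together (squared distance), while same=0 pushes
    it apart until the distance exceeds the margin."""
    d = math.dist(z1, z2)                 # Euclidean distance in Z
    return same * d ** 2 + (1 - same) * max(0.0, margin - d) ** 2
```

Pairs of inputs whose outputs share a redundancy label are pulled together in the embedding space; dissimilar pairs are pushed beyond the margin, after which they contribute no gradient.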
Empirical results demonstrate up to a 98% filtering rate, 8.5× throughput, and up to 95% bandwidth savings at >90% accuracy in video analytics tasks (Yuan et al., 2022). Practical guidance covers estimating filterability, selecting architectures, cross-validating operational parameters, and efficiency checks on target devices.
3. Factual Consistency and Fine-Grained Fact-Checking of LLM Outputs
Recent advances in InFi-Checker extend its scope to interpretable, fine-grained fact-checking for LLM-generated text, as described in InFi-Check and InFi-Check-FG benchmarks (Bai et al., 10 Jan 2026). The InFi-Checker model jointly retrieves explicit supporting evidence, classifies error types at sentence granularity, provides chain-of-thought justifications, and generates corrections. Data synthesis assembles claims grounded in curated corpora, attaches evidence and audits claims via multi-model and human verification, and synthesizes diverse error types—Predicate (PredE), Entity (EntE), Circumstance (CircE), Co-reference (CorefE), Discourse Link (LinkE), and Extrinsic (OutE).
The full structured target for model training comprises correct sentence, hallucinated version, evidence, error category, justification, and correction. Multitask learning combines cross-entropy classification over seven fine-grained classes and sequence-level generation of evidence/justification/correction. Backbone models include Llama-3.1-8B-Instruct and Qwen3-8B.
Experimental benchmarks show high balanced accuracy (BAcc: 90.9% for Llama, 92.3% for Qwen), and +27.6pp gain over GPT-4o, with error localization and sentence alignment ratio (SAR) rendering outputs substantially more interpretable (Bai et al., 10 Jan 2026). Generalization to out-of-distribution datasets and binary fact-check benchmarks is robust (Macro-F1 ≥83.7%).
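Balanced accuracy, the headline metric above, averages per-class recall so that rare error types count as much as common ones; a minimal reference implementation:

```python
def balanced_accuracy(y_true, y_pred):
    """Balanced accuracy (BAcc): the mean of per-class recall, so that
    rare error types weigh as much as frequent ones."""
    recalls = []
    for c in set(y_true):
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)
```

Under a skewed error-type distribution, plain accuracy can be inflated by the majority class; BAcc penalizes a model that ignores rare categories such as CorefE or LinkE.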
Fine-grained ablations reveal that removing any output element reduces BAcc drastically (to 20.6%). Cost efficiency is highlighted, with evaluation costing ~$4 on InFi-Check-FG, significantly lower than closed-source LLMs.

Limitations include sensitivity to initial LLM quality in data synthesis, taxonomy scope (style and grammar errors are excluded), and possible modularity improvement via retriever integration. Real-time deployment may benefit from classifier distillation.

4. Insights from Weakly Supervised Factuality Metrics

Techniques from WeCheck (Wu et al., 2022) are directly transferable to InFi-Checker's factuality evaluation. Two-step weakly supervised pipelines aggregate multiple noisy scorers (NLI, QA, and retrieval-based metrics) into soft labels via a generative labeling model p_θ, then train a compact encoder on these labels using noise-aware cross-entropy objectives. The generative model independently learns coverage and accuracy parameters for each weak label source, marginalizing to infer reliable per-example labels:

$$p_\theta(\tilde{A}, y) = p(y) \prod_{i=1}^{k} \left[ B_i \alpha_i \mathbf{1}(\tilde{A}_i = y) + B_i (1 - \alpha_i) \mathbf{1}(\tilde{A}_i = 1 - y) + (1 - B_i) \mathbf{1}(\tilde{A}_i = -1) \right]$$

Factuality-metric pre-training on NLI datasets (MultiNLI, Adversarial-NLI, Fever-NLI, LingNLI) is beneficial, attaining ROC AUC ~80.3 on the TRUE benchmark. WeCheck achieves 84.8 ROC AUC while being 20–30× faster than QA-based metrics (Wu et al., 2022). Recommendations for InFi-Checker include aggregation of heterogeneous weak sources, adaptive thresholding, and joint denoising.

5. Practical Considerations and Limitations

Performance overhead is notable in full-checking regimes, motivating further optimizations such as efficient sparse representations and custom matrix/list operations (Desmartin et al., 2023). In mobile-centric deployments, training latency of InFi modules varies by modality; image models use about 50 MB of memory, vector modalities less than 5 MB. Active online update improves sample efficiency in nonstationary contexts.

Interpretability is maximized in fine-grained fact-checking via explicit evidence attachment and justifications; however, coverage is limited to predefined taxonomies. Not all error modalities (e.g., rhetorical, grammatical) are handled, and system integration still relies on the underlying quality of synthetic data pipelines and LLM decoders.

A plausible implication is that, while InFi-Checker sets new standards for certified, interpretable, and efficient verification in AI systems, continued work on formal verification, extension to heterogeneous activation functions, and scalable deployment in real-time environments is warranted.

6. Summary Table: InFi-Checker Instantiations

| Application Domain | Key Methodology | Reported Metrics/Results |
| --- | --- | --- |
| DNN Verification | Certified UNSAT proof checking; infinite-precision arithmetic in Imandra (Desmartin et al., 2023) | Sparse/no-lemma: ~2× C++ checker time; full checking: up to 150×; exact reals, no instability |
| Input Filtering | End-to-end, modality-agnostic input-filter learning (Yuan et al., 2022) | Filtering rate up to 98%, 8.5× throughput, 95% bandwidth savings |
| LLM Fact-Checking | Joint evidence/error-type/correction classification (Bai et al., 10 Jan 2026) | BAcc 90.9% (Llama), 92.3% (Qwen); cost-efficient, interpretable |
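The weakly supervised labeling model p_θ described above can be sketched numerically (illustrative Python, not WeCheck's implementation; `alpha` and `beta` are the per-source accuracy and coverage parameters, and -1 marks abstention):

```python
def weak_label_likelihood(tilde_a, y, p_y, alpha, beta):
    """Joint likelihood p_theta(tilde_A, y) under the generative labeling
    model: source i abstains with probability 1 - beta[i]; when it votes,
    it agrees with the latent label y with probability alpha[i].
    tilde_a[i] is in {0, 1, -1}, with -1 marking abstention."""
    prob = p_y if y == 1 else 1.0 - p_y
    for a_i, al, be in zip(tilde_a, alpha, beta):
        if a_i == -1:
            prob *= 1.0 - be              # source abstained
        elif a_i == y:
            prob *= be * al               # source agrees with y
        else:
            prob *= be * (1.0 - al)       # source disagrees with y
    return prob

def soft_label(tilde_a, p_y, alpha, beta):
    """Marginalize over y to obtain the per-example soft label p(y=1 | tilde_A)
    used to train the compact encoder."""
    p1 = weak_label_likelihood(tilde_a, 1, p_y, alpha, beta)
    p0 = weak_label_likelihood(tilde_a, 0, p_y, alpha, beta)
    return p1 / (p0 + p1)
```

Two reliable sources voting "factual" while a third abstains yields a soft label close to 1, which the downstream encoder consumes via a noise-aware cross-entropy objective.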