
On the Foundations of Trustworthy Artificial Intelligence

Published 26 Mar 2026 in cs.AI and cs.CR | (2603.24904v1)

Abstract: We prove that platform-deterministic inference is necessary and sufficient for trustworthy AI. We formalize this as the Determinism Thesis and introduce trust entropy to quantify the cost of non-determinism, proving that verification failure probability equals 1 - 2^{-H_T} exactly. We prove a Determinism-Verification Collapse: verification under determinism requires O(1) hash comparison; without it, the verifier faces an intractable membership problem. IEEE 754 floating-point arithmetic fundamentally violates the determinism requirement. We resolve this by constructing a pure integer inference engine that achieves bitwise identical output across ARM and x86. In 82 cross-architecture tests on models up to 6.7B parameters, we observe zero hash mismatches. Four geographically distributed nodes produce identical outputs, verified by 356 on-chain attestation transactions. Every major trust property of AI systems (fairness, robustness, privacy, safety, alignment) presupposes platform determinism. Our system, 99,000 lines of Rust deployed across three continents, establishes that AI trust is a question of arithmetic.

Summary

  • The paper presents the Determinism Thesis, proving that non-deterministic inference violates verifiability, reproducibility, auditability, and certifiability.
  • The study identifies IEEE 754 floating-point arithmetic as a source of non-determinism and introduces an ARC engine using fixed-width integer arithmetic to ensure consistency.
  • Empirical evaluations demonstrate bitwise-identical outputs across various platforms with notable speedups, enabling efficient cryptographic attestation and scalable verification.

Determinism as the Foundation of Trustworthy AI

Determinism Thesis and Formal Foundations

The paper establishes the Determinism Thesis, asserting that platform-deterministic inference—where an AI model produces identical output for identical input across all hardware—is both necessary and sufficient for trustworthiness. The core properties of trustworthy AI, namely verifiability, reproducibility, auditability, and certifiability, are rigorously defined and proven to be violated by non-deterministic inference. Formal proofs demonstrate:

  • Necessity: Non-determinism invalidates all four trust properties, undermining external verification and regulatory certification.
  • Sufficiency: Platform-determinism enables efficient cryptographic attestation and hash-based verification, reducing verification complexity to O(1) regardless of model size or computational depth.

The operational metric, trust entropy (Rényi collision entropy over execution outputs), quantifies the probability of honest-party verification failure, establishing a precise link between non-determinism and trust. The paper shows that verification completeness is binary: any degree of non-determinism results in strictly positive false rejection rates.
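The entropy-to-failure identity can be sketched numerically. This is a minimal illustration, not the paper's code; `trust_entropy` and `verification_failure_prob` are our names for the quantities it defines:

```python
import math

def trust_entropy(output_probs):
    """Renyi collision entropy over execution outputs:
    H_T = -log2(sum_i p_i^2). Zero iff a single output has probability 1."""
    collision = sum(p * p for p in output_probs)
    return -math.log2(collision)

def verification_failure_prob(output_probs):
    """The paper's identity: P(honest verification fails) = 1 - 2^{-H_T},
    i.e. one minus the probability that two independent runs collide."""
    return 1.0 - 2.0 ** (-trust_entropy(output_probs))

# Deterministic system: one possible output, zero entropy, zero failure.
print(verification_failure_prob([1.0]))        # 0.0
# Two equally likely outputs: H_T = 1 bit, failure probability 0.5.
print(verification_failure_prob([0.5, 0.5]))   # 0.5
```

Any strictly positive trust entropy yields a strictly positive false-rejection rate, matching the binary completeness claim above.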

The Floating-Point Barrier

Experimental and theoretical analysis reveals that IEEE 754 floating-point arithmetic, ubiquitously used in neural network inference, fundamentally fails the determinism requirement. Floating-point addition is non-associative and hardware-dependent, causing output divergence across architectures (e.g., ARM NEON vs x86 AVX2) due to different parallel reduction trees. The divergence exponentially amplifies through deep residual and transformer networks, ultimately yielding entirely different token sequences from identical inputs and model parameters—a phenomenon confirmed in controlled cross-node inference experiments.
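The non-associativity is easy to reproduce in any IEEE 754 environment. The example below mimics two reduction trees over the same operands, standing in for the different SIMD groupings of NEON and AVX2:

```python
# Same four operands, two reduction orders. A sequential sum absorbs the
# small terms into the large one early; a pairwise tree cancels the large
# terms first and keeps the small ones.
vals = [1e16, 1.0, -1e16, 1.0]

left_to_right = ((vals[0] + vals[1]) + vals[2]) + vals[3]  # 1.0 is absorbed
pairwise      = (vals[0] + vals[2]) + (vals[1] + vals[3])  # large terms cancel

print(left_to_right)  # 1.0
print(pairwise)       # 2.0
```

A one-unit-in-the-last-place discrepancy like this, injected at every reduction in every layer, is what amplifies into divergent token sequences downstream.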

The impossibility theorem demonstrates that floating-point-based inference is inherently non-deterministic for any model with hardware-dependent reduction order, refuting all approaches based on approximate determinism or mitigations like constrained accumulation or thread pinning.

Platform-Deterministic Inference Architecture

To resolve the floating-point barrier, the ARC engine is constructed, implementing transformer inference using exclusively fixed-width integer arithmetic. Integer operations—by the algebraic properties of two's complement rings—are associative, distributive, and commutative, guaranteeing platform determinism. The architecture comprises:

  • INT8 weight quantization with per-row Q16 scale factors.
  • Q16 fixed-point activations, normalization, and nonlinearities (RMSNorm, SiLU, RoPE with fixed Q16 tables).
  • Full-precision KV caching for attention, preserving integrity across sequence positions.
  • Greedy argmax decoding, with optional deterministic sampling via cryptographically seeded PRNGs.
  • Data-parallel computation with fixed evaluation order, ensuring independence from scheduling or hardware differences.
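A minimal sketch of the integer arithmetic these components rely on, assuming Q16 fixed-point values and per-row INT8 scaling as described above. The function names and toy values are ours, not the ARC engine's code:

```python
Q = 16  # Q16 fixed point: a real x is stored as round(x * 2**16)

def to_q16(x: float) -> int:
    return round(x * (1 << Q))

def q16_mul(a: int, b: int) -> int:
    # Exact integer product, then an arithmetic right shift back to Q16.
    return (a * b) >> Q

def quantize_row(row):
    # Per-row scale chosen so the max magnitude maps onto the INT8 range.
    scale = max(abs(v) for v in row) / 127.0
    q = [max(-128, min(127, round(v / scale))) for v in row]
    return q, to_q16(scale)

weights_q8, scale_q16 = quantize_row([0.5, -1.27, 0.02])
act_q16 = to_q16(2.0)

# INT8 weights times a Q16 activation: integer accumulate, then one
# Q16 rescale by the row's scale factor. Every step is pure integer math.
acc = sum(w * act_q16 for w in weights_q8)
out_q16 = q16_mul(acc, scale_q16)
print(out_q16 / (1 << Q))  # close to (0.5 - 1.27 + 0.02) * 2 = -1.5
```

Because every intermediate is an integer with a fixed evaluation order, the result is bit-identical wherever standard two's complement arithmetic holds.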

Empirical evaluation across ARM and x86 platforms validates zero hash mismatches for Llama-2-7B and TinyLlama-1.1B over up to 1,024 generated tokens. Multi-node tests spanning three continents yield identical outputs, confirmed via over 350 on-chain attestation transactions.

Verification Complexity: Determinism-Verification Collapse

The paper proves a critical result: verification complexity collapses under determinism. For deterministic inference, verification is solely a matter of hash comparison after re-execution, independent of platform or implementation details. For non-deterministic inference, verifying a claimed output involves solving an intractable membership problem over the execution equivalence class, which grows combinatorially with model depth and dimension.
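The collapse reduces the verifier to a re-execution plus one comparison. A stdlib-only sketch (the paper's system uses BLAKE3; `sha256` here purely for illustration, and `run_model` is a stand-in for any pure deterministic inference function):

```python
import hashlib

def run_model(model_bytes: bytes, prompt: bytes) -> bytes:
    # Stand-in for deterministic inference: any pure function of its inputs.
    return hashlib.sha256(model_bytes + prompt).digest()

def attest(model: bytes, prompt: bytes) -> str:
    # The prover publishes H(y), a short fingerprint of the output.
    return hashlib.sha256(run_model(model, prompt)).hexdigest()

def verify(model: bytes, prompt: bytes, claimed_hash: str) -> bool:
    # Re-execute once, compare one hash: O(1) beyond the re-execution itself,
    # independent of platform or implementation details.
    return hashlib.sha256(run_model(model, prompt)).hexdigest() == claimed_hash

receipt = attest(b"weights-v1", b"hello")
print(verify(b"weights-v1", b"hello", receipt))  # True
print(verify(b"weights-v2", b"hello", receipt))  # False: model substitution
```

Without determinism, no single re-execution suffices: the verifier must instead decide membership in the set of all outputs a legitimate execution could have produced.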

The result supports a trust stack: determinism enables consensus, attestation, verification, and certification—the foundational computation layer for all higher-level trust properties (fairness auditing, robustness certification, privacy compliance, safety verification, and alignment checking).

Experimental Validation and Numerical Results

The ARC engine demonstrates:

  • Bitwise identical outputs across architectures for all evaluated sequence lengths and prompt classes.
  • Faster inference than optimized floating-point implementations on identical hardware (2.3x speedup on GPU, 1.26x speedup on CPU).
  • Scalable STARK proofs covering dense layer computations (up to 70B parameter models), with commitment sizes fixed at 152 bytes and proving times scaling linearly.
  • On-chain attestation and DAG consensus, facilitating real-time multi-node consensus over inference outputs.

Quality inspection shows factually correct, grammatically coherent outputs indistinguishable from floating-point inference for math, factual Q&A, code generation, and creative tasks. Perplexity scores exhibit quantization-induced degradation, but the calibration gap is shown to be a function of INT8 bit-width rather than determinism; mixed-precision schemes are expected to mitigate quality loss while retaining platform determinism.

Theoretical and Practical Implications

The Determinism Thesis implies that every independently verifiable trust property in AI presupposes platform-deterministic inference. It underpins verifiable external audits, robust cross-platform certification, reproducible science, consensus in multi-agent systems, and trustless model serving. The ongoing industry migration to integer quantization, motivated by efficiency, is constructing the infrastructure for trustworthy AI, as INT8 tensor cores inherently support deterministic operations.

From a theoretical perspective, the work parallels prior paradigm shifts in cryptography and distributed systems, replacing physical trust requirements with mathematically computable ones. Verification becomes a property of arithmetic, not of higher-level algorithmic or interpretability protocols. The elimination of floating-point-induced non-determinism closes the gap between AI inference and cryptographic verifiability.

Open Problems and Future Directions

Key unresolved areas include:

  • Extension of STARK proofs to cover the full inference pipeline (attention, normalization, activations) for end-to-end cryptographic attestation at scale.
  • Deterministic training, addressing stochastic and distributed non-determinism.
  • Precision allocation optimization for mixed-precision deterministic quantization.
  • Hardware-specific kernel optimization to leverage native integer units for single-digit millisecond inference latency.
  • Formal verification and standardization of integer inference specs for regulatory adoption.

Conclusion

The paper rigorously establishes the mathematical foundation for trustworthy AI, showing that determinism—not interpretability, alignment, or transparency—is the prerequisite for verifiable trust. Floating-point arithmetic, due to its lack of associativity, is fundamentally incompatible; integer arithmetic provides deterministic inference by algebraic necessity. The practical demonstration spans large-scale transformer models, cross-architecture validation, multi-node consensus, cryptographic proof integration, and performance superiority. The Determinism Thesis reframes AI trust as a function of mathematical structure, suggesting that progress in trustworthy AI rests on the substrate of deterministic computation, not on higher-level debate.

The results invite adoption of integer arithmetic architectures for all applications requiring trusted, reproducible, auditable, certified AI inference and open avenues for cryptographically sound attestation and decentralized verification at production scale.


Explain it Like I'm 14

What is this paper about?

This paper argues a simple but powerful idea: for AI to be truly trustworthy, it must be deterministic. That means if you give the same AI model the same input, it should give the exact same output every time, no matter which computer it runs on. The authors show this is both necessary and sufficient for trust. They also build and test a new inference engine that makes LLMs behave this way across different kinds of hardware.

What questions did the authors ask?

  • Do we need determinism to trust AI decisions, and is determinism enough to make AI verifiable and auditable?
  • Why do today’s AI systems sometimes give different answers on different computers?
  • Can we build a fast, large-scale AI engine that produces bit-for-bit identical results across different processors (like ARM and x86)?
  • How does non-determinism (tiny numerical differences) grow inside deep neural networks?
  • How hard is it to verify AI outputs when systems are deterministic versus non-deterministic?
  • How do other trust goals (fairness, robustness, privacy, safety, alignment) depend on determinism?

How did they study it?

The authors combined theory, engineering, and experiments:

  • They defined “platform-deterministic inference” (same model + same input → same output on any hardware) and “trustworthy AI” (verifiable, reproducible, auditable, and certifiable).
  • They introduced “trust entropy,” a way to measure how much outputs vary across hardware. If every machine gives the same answer, trust entropy is zero. They showed the chance that an honest verification fails equals 1 - 2^{-H_T}, so zero entropy means zero failure.
  • They proved a “Determinism–Verification Collapse”: if outputs are deterministic, verifying them is as simple as re-running the model and comparing a short hash (a digital fingerprint). If outputs can vary, verification becomes much harder because you must accept many possible “valid” answers.
  • They showed the main cause of variation is how computers handle decimals (floating-point numbers). Adding numbers with rounding isn’t perfectly consistent when done in different orders, and different chips group additions differently. This is like adding many small rounded numbers to a big number in different sequences: tiny rounding differences can add up to a different final result.
  • To fix this, they built the ARC engine that does all math using integers instead of floats. Integer math doesn’t have rounding problems like this and gives the same answer on any computer that uses standard two’s complement integers.
  • They tested their engine across different machines and across the globe. They also recorded results “on-chain” and used consensus methods so multiple nodes could agree on an AI’s answer. For smaller, portable checks, they generated compact STARK commitments (tiny cryptographic receipts) for parts of the computation.
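Why integer math behaves so differently can be shown in a few lines: wrapping 64-bit addition is an operation in a two's complement ring, so it is associative and commutative, and every summation order agrees bit-for-bit:

```python
import random

MASK = (1 << 64) - 1  # emulate a 64-bit machine word in Python

def wrap_add(a: int, b: int) -> int:
    # Modular (wrapping) addition, as on any two's complement machine.
    return (a + b) & MASK

vals = [random.getrandbits(64) for _ in range(1000)]

forward = 0
for v in vals:
    forward = wrap_add(forward, v)

shuffled = vals[:]
random.shuffle(shuffled)
reordered = 0
for v in shuffled:
    reordered = wrap_add(reordered, v)

print(forward == reordered)  # True: order never matters in a modular ring
```

Contrast this with floating-point sums, where reordering the same operands can change the result, as described above.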

Technical terms in everyday language:

  • Floating point vs. integer: Floating point is like working with decimals; small round-offs can differ by machine. Integer math is like working with whole numbers and consistent scaling, so every machine gets the same answer.
  • Hash: A short “fingerprint” of data. If two outputs have the same hash, they’re the same with overwhelming probability.
  • Consensus/DAG: A way for many computers to agree on a shared record, like everyone signing off that “we all got this exact answer.”
  • STARK: A kind of cryptographic proof that a computation was done correctly; think of it as a compact, checkable receipt.

What did they find, and why does it matter?

Main findings:

  • Determinism is necessary and sufficient for trustworthy AI. If outputs can change across hardware, you can’t reliably verify, reproduce, audit, or certify an AI’s behavior. If outputs are deterministic, you can do all four.
  • Floating-point math breaks determinism. Because addition with rounding isn’t associative, different processors (with different vector widths or threading) add things in different orders and get slightly different numbers. In deep models, these tiny differences can snowball and lead to different generated tokens.
  • Verification becomes easy under determinism. If the AI is deterministic, verifying an output can be as simple as re-running once and comparing one hash. If it isn’t, verification may require understanding or simulating the exact execution details of the other machine, which is impractical.
  • A working fix exists today. Their ARC engine uses integer arithmetic for inference and produced bit-for-bit identical outputs for large models like Llama‑2‑7B across ARM and x86 machines in 82 cross-architecture tests (up to 1,024 tokens), with zero mismatches.
  • Real-world consensus worked. Four nodes in different parts of the world independently ran the model and produced identical outputs, with 356 on-chain attestations to back it up.
  • It’s fast. On their hardware, the integer engine was faster than a floating-point backend while staying deterministic.

Why this matters:

  • If you want to trust an AI’s decision (say, a medical suggestion or a loan decision), you must be able to redo the same computation and get the same result. Otherwise, you can’t tell if a difference comes from cheating, a bug, or just hardware quirks.
  • Many trust goals depend on determinism. For example, to audit fairness, you must recreate the exact decision path; to certify safety, you must know it will behave identically on different devices.

What methods did they use to make AI deterministic?

Here are the key engineering ideas they used:

  • Integer-only math for the forward pass: quantized INT8 weights and fixed-point activations (Q16), with careful design to avoid overflow.
  • Integer versions of model parts like normalization, activations, and positional embeddings, using lookup tables and fixed procedures so every platform computes exactly the same numbers.
  • Deterministic token selection: greedy decoding or fixed-seed sampling so the same input always leads to the same next token.
  • Parallelism that doesn’t change results: independent pieces run in parallel but are combined in a fixed order.
  • Cross-node agreement: output hashes posted on-chain and confirmed via consensus; anyone can re-run the model to check.
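The deterministic token-selection idea can be sketched as seeding a PRNG from a hash of the model and input (the paper mentions ChaCha20 seeded by H(m|x); Python's built-in PRNG is used here purely for illustration, and all names are ours):

```python
import hashlib
import random

def deterministic_sample(model_id: bytes, prompt: bytes, weights):
    # Seed the PRNG from H(model | input): same inputs -> same seed -> same token.
    seed = hashlib.sha256(model_id + b"|" + prompt).digest()
    rng = random.Random(seed)
    total = sum(weights)
    r = rng.random() * total
    # Weighted choice over token ids, walked in a fixed order.
    for token_id, w in enumerate(weights):
        r -= w
        if r <= 0:
            return token_id
    return len(weights) - 1

a = deterministic_sample(b"m1", b"hello", [0.1, 0.7, 0.2])
b = deterministic_sample(b"m1", b"hello", [0.1, 0.7, 0.2])
print(a == b)  # True: "random" sampling, yet fully reproducible
```

The output still looks random across different prompts, but any verifier with the same model and input reproduces the identical token sequence.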

What’s the bigger impact?

  • A foundation for trustworthy AI: The authors show that fairness, robustness, privacy, safety, and alignment checks all assume determinism. Without it, those checks can’t be independently verified.
  • A practical path forward: Since many accelerators already support fast integer operations, moving AI inference to integer math can make systems both faster and more trustworthy.
  • Better oversight and certification: Regulators, companies, and users can verify AI outputs by re-running them and comparing hashes, or by using compact cryptographic receipts.
  • A shift in focus: The paper suggests that making AI trustworthy isn’t only about “alignment” or “interpretability.” It’s also about the arithmetic under the hood. Choosing the right math (deterministic integer inference) unlocks simple, reliable verification for everyone.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

The paper advances a strong determinism-centric thesis and a working integer inference engine, but it leaves several concrete issues unresolved. Future work could address the following:

  • Formal specification of the deterministic inference function:
    • No canonical, machine-readable specification of all numerical semantics (integer widths, fixed-point formats, rounding/shift rules, saturation vs wrap, dequantization, normalization, tie-breaking for argmax/top-k/top-p) is provided to enable independent re-implementation and certification.
  • Integer softmax and attention details:
    • The paper does not describe a bit-exact, integer implementation of softmax for attention (exp, max-shift, normalization), its error bounds, or how overflow/underflow is prevented in worst-case sequences.
  • Dynamic range and overflow guarantees:
    • Only partial back-of-the-envelope bounds are given (e.g., dot-product safety in 64-bit). A complete, model-agnostic proof that all intermediate values (matmuls, RMSNorm sums of squares, softmax accumulations, SiLU) cannot overflow under supported dimensions (e.g., d=8192–16384, L up to 80+) and context lengths is missing.
  • Deterministic RoPE tables:
    • RoPE sin/cos are computed via FP64 at load time and “empirically” match after Q16 rounding. There is no formal guarantee across platforms/libraries. A standardized distribution (or a bit-exact integer CORDIC generator) and proof of cross-platform identity is not yet provided.
  • Deterministic quantization pipeline:
    • The engine assumes INT8 weights with per-row scales but does not define a bit-exact, cross-toolchain quantization spec (tie-breaking on rounding, scale computation, clipping rules) to ensure deterministic model preparation across platforms and versions.
  • Tokenization determinism:
    • The paper does not analyze determinism of the tokenization and text normalization stack (regex/Unicode differences across OS/locale/library versions), which can break end-to-end reproducibility before arithmetic begins.
  • Completeness of determinism under parallelism:
    • Beyond independent attention heads, the paper does not audit all kernels for potential cross-thread reductions, races, or reordering (e.g., layernorm reductions, residual accumulations) that might reintroduce non-determinism under aggressive compiler/vectorizer settings.
  • Hardware and compiler assumptions:
    • The determinism claim assumes two’s complement integers and arithmetic right-shift semantics. It does not present conformance tests across diverse CPUs/GPUs/NPUs (including saturating or non-ARSH hardware), nor robustness to different compilers/optimization levels/drivers or shader backends.
  • GPU/accelerator portability:
    • Deterministic equivalence is shown for Apple M2 Ultra vs x86 CPU; the paper does not validate bitwise identity on NVIDIA/AMD GPUs, TPUs, or AI ASICs using integer tensor cores, nor across different graphics APIs/drivers (e.g., SPIR-V, CUDA, Metal).
  • Quality evaluation and trade-offs:
    • Model quality is insufficiently assessed. Reported perplexity is not comparable (Chat vs base). There is no systematic benchmark suite (e.g., MMLU, GSM8K, HumanEval, long-context tasks), no ablations for INT8 vs INT16/mixed-precision, and no quantification of accuracy–determinism trade-offs.
  • Long-context operation:
    • Experiments are limited to ≤1,024 tokens (and 7B/1.1B models). It remains unknown whether determinism and quality hold for 4k–32k contexts, rope scaling/extrapolation, and potential numeric issues (e.g., attention score range, KV cache growth).
  • Scope of the floating-point impossibility:
    • Theorem 9 assumes hardware-determined reduction order. The paper does not investigate whether a canonical, software-enforced FP reduction tree and fixed rounding modes (or software FP) could recover cross-platform determinism and with what performance penalty.
  • Verification complexity with execution traces:
    • The “Determinism-Verification Collapse” frames non-deterministic verification as a membership problem over combinatorially many outputs, but does not analyze schemes where the prover supplies an explicit execution trace/schedule (or a proof thereof). Formal lower/upper bounds with trace-witnesses are not provided.
  • Trust entropy (HT) measurement:
    • While defined theoretically, there is no methodology to estimate HT in practice (e.g., sampling across real hardware populations), no empirical measurements, and no guidance on how to manage HT under heterogeneous fleets.
  • STARK coverage and end-to-end proving:
    • Current proofs cover dense layers only. Constraints for attention, normalization, and activations are missing, as are performance projections for full-model proofs (proof size, proving/verification time, recursion strategies, verifier resource constraints).
  • Privacy and attestation security:
    • Publishing H(x) and H(y) can leak information via dictionary attacks. Protocols for input/output privacy (salts, commitments with hiding, ZK attestations) and an adversarial/economic analysis of the dispute mechanism and DAG consensus are not provided.
  • Side-channel and timing determinism:
    • The paper does not analyze whether the deterministic kernels are also resistant to timing/cache-based side channels in multi-tenant settings, or whether constant-time properties are needed for trustworthy deployment.
  • Deterministic randomness for stochastic features:
    • For applications requiring randomness (e.g., temperature sampling, differential privacy), the proposal suggests deterministic PRNG seeding. It does not specify how to integrate verifiable randomness (e.g., VRFs/beacons) while preserving both reproducibility and unpredictability, nor the implications for DP guarantees.
  • Training pipeline trust:
    • The work focuses on inference. Deterministic or verifiable training (data order, augmentation, optimizer states, randomness) and data lineage/provenance remain open for end-to-end AI system trust.
  • Robustness of integer wraparound semantics:
    • Although mathematically well-defined, integer wraparound may silently degrade model behavior. There is no empirical or theoretical assurance that wrap does not occur in practice (or bounds on how often/where), nor mitigation strategies (e.g., saturation, wider accumulators).
  • Formalization of Theorem 7 (trust dependency hierarchy):
    • The reductions from fairness/robustness/privacy/safety/alignment to determinism are argued informally and at decision-level granularity. Formal proofs, counterexamples, and clearer scope (e.g., population-level audits that may not require per-decision reproducibility) are absent.
  • Broader model classes and control flow:
    • The approach is validated on standard transformer LMs. It does not address architectures with dynamic control flow (e.g., MoE gating, sparsity, conditional computation), vision models, or multi-modal pipelines where determinism across heterogeneous operators is harder.
  • Standardization and certification path:
    • There is no proposed standard (spec/test suite/reference IR) for “platform-deterministic inference,” no conformance tests for vendors, and no roadmap for regulatory bodies to recognize and certify deterministic inference across hardware.
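To make the RoPE-table gap above concrete, here is a minimal sketch (our construction, not the paper's) of the load-time pipeline it describes: sin/cos computed in double precision, then rounded to Q16. Cross-platform identity then hinges on every libm returning identical doubles, which IEEE 754 does not mandate for transcendental functions:

```python
import math

def rope_table_q16(dim: int, base: float = 10000.0, positions: int = 4):
    """Generate Q16-rounded RoPE sin/cos entries from FP64 intermediates.

    Bit-exactness across platforms depends on math.sin/math.cos producing
    the same doubles everywhere -- the unproven assumption flagged above.
    """
    table = []
    for pos in range(positions):
        for i in range(0, dim, 2):
            theta = pos / (base ** (i / dim))
            table.append((round(math.sin(theta) * 65536),
                          round(math.cos(theta) * 65536)))
    return table

t = rope_table_q16(8)
print(t[0])  # position 0: (0, 65536), since sin(0) = 0 and cos(0) = 1
```

A standardized binary table distribution, or a bit-exact integer generator such as CORDIC, would remove the FP64 dependency entirely.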

Practical Applications

Immediate Applications

Below are concrete, deployable use cases enabled by the paper’s findings (platform-deterministic inference, trust entropy, hash-based verification, on-chain attestation, and the ARC integer engine). Each item includes sector mapping, potential tools/products/workflows, and key feasibility dependencies.

  • Deterministic, auditable AI decisions in regulated services
    • Sectors: finance, healthcare, public sector
    • Tools/workflows: ARC integer engine (INT8/Q16), BLAKE3 input/model/output hashes, deterministic PRNG (e.g., ChaCha20 seeded by H(m|x)) for reproducible sampling, attestation receipts attached to decisions
    • Applications: credit decisions, medical triage/decision-support, benefit eligibility rulings with cryptographically verifiable receipts that auditors can re-execute on any hardware
    • Assumptions/dependencies: model quality under INT8/Q16 is acceptable for the task; identical weight bytes (e.g., GGUF) available to auditors; organizations adopt fixed decode policies (greedy or deterministic PRNG seeding)
  • Compliance-grade reproducibility for internal audits and incident investigations
    • Sectors: enterprise software, fintech, medtech
    • Tools/workflows: “re-execute-to-verify” protocol using ARC; content-addressable logs keyed by H(m), H(x), H(y); deterministic replay of full traces
    • Applications: post-hoc incident reconstructions; root-cause analysis without platform confounds
    • Assumptions/dependencies: storage of prompts and model hashes; ability to share or escrow models for audit
  • Cross-platform certification transfer for safety testing
    • Sectors: robotics, avionics, medical devices, automotive
    • Tools/workflows: certify behavior once (on any platform) and transfer proofs via platform-deterministic engine; attach certs/receipts to firmware or model packages
    • Applications: reduce per-platform certification burden for the same deterministic model across ARM/x86/GPUs
    • Assumptions/dependencies: regulators accept platform-determinism as equivalence; deterministic inference remains within model performance requirements
  • Reproducible benchmarking and leaderboards
    • Sectors: academia, ML benchmarking bodies, open-model communities
    • Tools/workflows: publish H(m), H(dataset shards), fixed seeds; require zero hash mismatches across sites to accept results; measure “trust entropy” (HT) as a reported metric
    • Applications: exact replication of leaderboard runs independent of hardware
    • Assumptions/dependencies: datasets and models distributed with canonical hashes; community agreement on determinism requirements
  • Verifiable decentralized inference and compute marketplaces
    • Sectors: web3, cloud marketplaces, edge compute
    • Tools/workflows: DAG consensus for output hashes, economic bonds with challenge periods, on-chain InferenceAttestation (H(m), H(x), H(y)), optional Circle STARK commitment receipts for dense layers
    • Applications: marketplaces where clients pay only when outputs verify; trust-minimized multi-node inference
    • Assumptions/dependencies: chain costs and latency acceptable; light clients may still require STARK receipts until full proofs cover all layers
  • Content authenticity and provenance for AI-generated media
    • Sectors: media, publishing, enterprise content, C2PA ecosystems
    • Tools/workflows: append attestation receipts and hashes as provenance metadata; reproducible sampling ensures identical outputs from m and x; content-addressable storage via H(y)
    • Applications: prove a given content item was generated by a specific model+prompt; facilitate downstream accountability
    • Assumptions/dependencies: consuming platforms honor and persist provenance; deterministic sampling policy is fixed and disclosed
  • Deterministic CI/CD and regression testing for model-serving
    • Sectors: MLOps, software engineering
    • Tools/workflows: golden hash baselines for prompts; gate deployments on hash equality; detect regressions deterministically across build targets and hardware
    • Applications: reliable model upgrades and hotfixes without “it diverges on prod hardware” failures
    • Assumptions/dependencies: stable model artifacts; migrations maintain deterministic arithmetic and fixed evaluation order
  • Multi-cloud/high-availability inference with consensus on outputs
    • Sectors: cloud, SaaS
    • Tools/workflows: cross-region nodes compute and consensus on H(y); failover without changing outputs; single source of truth via DAG consensus
    • Applications: resilient serving with identical results from any region/vendor
    • Assumptions/dependencies: network latencies compatible with SLAs; shared model artifacts across regions
  • Model substitution and tampering detection
    • Sectors: security, compliance
    • Tools/workflows: verify H(m) prior to inference; compare H(y) against expected; alerts on any mismatch
    • Applications: detect model drift, poisoning, or unapproved hot-swap in production
    • Assumptions/dependencies: secure artifact management and attestation in the deployment pipeline
  • Fairness and privacy execution audits (per-decision)
    • Sectors: finance, HR tech, govtech
    • Tools/workflows: reproducible traces for feature influence and DP mechanism verification; compare audited run to attested run
    • Applications: case-level fairness audits and DP “was noise applied correctly” checks
    • Assumptions/dependencies: identical execution path and seed; auditors can access features and hashes; DP mechanism integrated in deterministic pipeline
  • Deterministic sampling for creative and enterprise workflows
    • Sectors: enterprise apps, creative tools
    • Tools/workflows: seed PRNG from H(m|x); guarantee identical drafts/replies for the same input and model
    • Applications: reproducible creative outputs for legal review, contract negotiation, customer communications
    • Assumptions/dependencies: acceptance of fixed seeding; disclosure of determinism policy to users
  • Content-addressable inference caching
    • Sectors: infrastructure, CDN
    • Tools/workflows: use (H(m), H(x)) → H(y) as cache key; cross-node reuse enabled by deterministic equality
    • Applications: reduce compute cost across fleets; cache hits validated by hash
    • Assumptions/dependencies: stable model artifacts; prompt canonicalization to ensure identical H(x)
  • Academic reproducibility packages
    • Sectors: academia, research labs
    • Tools/workflows: bundle GGUF weights, integer lookup tables (e.g., RoPE), seeds, code version; publish hashes in papers
    • Applications: exact replication of figures and ablations across labs
    • Assumptions/dependencies: willingness to release model artifacts or controlled-access escrow
  • On-device deterministic assistants with verifiable receipts
    • Sectors: mobile, IoT
    • Tools/workflows: ARC integer kernels on CPUs/NPUs/GPUs; attach receipts for critical interactions (e.g., health advice)
    • Applications: edge AI that remains auditable without server trust
    • Assumptions/dependencies: edge devices support two’s-complement integer ops; storage and UX for receipts
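Several of the workflows above share one mechanism: determinism makes (H(m), H(x)) a valid fleet-wide key for H(y). A hedged sketch of the content-addressable caching item (function names and the stand-in model are ours):

```python
import hashlib

cache = {}

def cached_infer(model: bytes, prompt: bytes, infer):
    # Determinism guarantees that any node computing this key gets the
    # same output bytes, so cache entries are reusable across the fleet.
    key = (hashlib.sha256(model).hexdigest(),
           hashlib.sha256(prompt).hexdigest())
    if key not in cache:
        cache[key] = infer(model, prompt)
    return cache[key]

def infer(model: bytes, prompt: bytes) -> bytes:
    # Stand-in deterministic model: any pure function of (model, prompt).
    return hashlib.sha256(model + prompt).digest()

y1 = cached_infer(b"weights", b"question", infer)
y2 = cached_infer(b"weights", b"question", infer)  # cache hit, identical bytes
print(y1 == y2)  # True
```

With non-deterministic inference this cache would be unsound, since a recomputed entry could legitimately differ from the stored one.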

Long-Term Applications

These use cases require further research, scaling, standardization, or ecosystem adoption (e.g., mixed-precision quality improvements, full-protocol proofs, regulatory integration).

  • Safety-critical certification regimes anchored in platform determinism
    • Sectors: automotive, avionics, medical devices, energy grid
    • Tools/workflows: regulatory standards mandating platform-deterministic inference for certified components; homologation once for all hardware
    • Dependencies: regulator consensus; standardized test suites; documented determinism proofs; quality parity via INT16/mixed-precision integer paths
  • End-to-end succinct proofs of full model inference
    • Sectors: web3, cross-chain verification, compliance
    • Tools/workflows: extend Circle STARKs (or similar) to cover attention, normalization, activations; generate layer-complete proofs with compact on-chain verification
    • Dependencies: scalable AIR/arithmetization, proof performance at 10B–70B scale, verifier adoption
  • Hardware and software standards for deterministic AI (IEEE/NIST/ISO)
    • Sectors: semiconductors, OS vendors, standards bodies
    • Tools/workflows: specifications for deterministic inference kernels, fixed reduction orders, integer-only APIs, standardized RoPE tables as binary artifacts
    • Dependencies: industry alignment; conformance test harnesses; updates to ML runtimes
  • Deterministic training and fine-tuning pipelines
    • Sectors: foundation model labs, enterprise fine-tuning
    • Tools/workflows: integer-friendly optimizers, deterministic data shuffling/seeding, quantization-aware training that preserves determinism
    • Dependencies: research on integer training stability; performance-competitive kernels; mitigations for nondeterministic I/O and parallelism
  • Enterprise and government procurement policies requiring determinism
    • Sectors: public procurement, regulated industries
    • Tools/workflows: RFP checklists for platform determinism; penalties for unverifiable outputs; mandatory HT = 0 (trust entropy) for decision-critical AI
    • Dependencies: policy uptake; practical evaluation procedures; compliance audit infrastructure
  • Federated/multi-party analytics with trust-minimized verification
    • Sectors: health networks, finance consortia, supply chains
    • Tools/workflows: cross-organization inference with DAG consensus on outputs; dispute windows; hashed commitments to inputs/outputs
    • Dependencies: governance agreements; privacy-preserving prompt/data hashing; interoperability of attestation formats
  • Liability and legal frameworks for cryptographically attested AI outputs
    • Sectors: legal services, insurance
    • Tools/workflows: make receipts admissible evidence; insurance underwriting conditioned on deterministic auditability
    • Dependencies: jurisprudence on digital attestations; standards for chain-of-custody of hashes and models
  • Deterministic accelerators and kernel ecosystems
    • Sectors: hardware vendors, compilers, ML frameworks
    • Tools/workflows: native INT8/INT16 tensor cores with fixed evaluation orders; deterministic WGSL/CUDA kernels; compiler passes ensuring associativity-preserving schedules
    • Dependencies: vendor roadmaps; performance parity; framework support (PyTorch/ONNX backends)
  • Trust entropy (HT) as a deployment and risk metric
    • Sectors: risk management, SRE, governance
    • Tools/workflows: measure HT across fleet/hardware; set SLAs and guardrails (e.g., require HT = 0 for decision-critical paths)
    • Dependencies: operational tooling to estimate HT; dashboards; policy mapping HT thresholds to allowed use
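Estimating HT operationally follows directly from its definition as the Rényi collision entropy H_T = -log2(Σ_i p_i^2), with verification failure probability 1 - 2^{-H_T}. A hedged sketch over observed per-node output hashes (function name hypothetical):

```python
import math
from collections import Counter

def trust_entropy(output_hashes: list[str]) -> float:
    """Rényi collision entropy H_T = -log2(sum_i p_i^2) over the empirical
    distribution of output hashes observed across a fleet."""
    n = len(output_hashes)
    collision_prob = sum((c / n) ** 2 for c in Counter(output_hashes).values())
    return -math.log2(collision_prob)

# A deterministic fleet: every node reports the same hash -> H_T = 0.
assert trust_entropy(["abc"] * 4) == 0.0

# Two equally likely outputs -> H_T = 1 bit; failure prob 1 - 2^(-H_T) = 0.5.
ht = trust_entropy(["abc", "abc", "def", "def"])
assert abs(ht - 1.0) < 1e-12
assert abs((1 - 2 ** -ht) - 0.5) < 1e-12
```

This empirical estimate only lower-bounds the true HT of the deployment (unobserved platforms may diverge), which is one reason the tooling dependency above is nontrivial.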
  • Deterministic A/B testing and model selection without platform confounds
    • Sectors: product analytics, growth teams
    • Tools/workflows: compare variants with identical seeds and hashes; ensure observed differences are model-driven, not hardware artifacts
    • Dependencies: org-wide adoption of deterministic serving; consistent prompt canonicalization
  • Verifiable autonomy “black box” recorders
    • Sectors: drones, robotics, industrial controls
    • Tools/workflows: log H(m), H(sensor input), H(action sequence) per cycle; dispute resolution by re-execution in simulations
    • Dependencies: real-time deterministic inference on-device; storage and secure time-stamping; regulator acceptance
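One way to sketch such a recorder is a per-cycle hash chain over H(m), H(sensor input), and H(action sequence); a hedged illustration (helper names hypothetical; stdlib BLAKE2b stands in for BLAKE3):

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.blake2b(data, digest_size=16).digest()  # BLAKE3 stand-in

def record_cycle(prev_head: bytes, model: bytes, sensors: bytes, actions: bytes) -> bytes:
    """Fold one control cycle into a hash chain: altering any past entry
    changes every subsequent head, making the log tamper-evident."""
    return h(prev_head + h(model) + h(sensors) + h(actions))

GENESIS = b"\x00" * 16
cycles = [(b"model-v1", b"lidar-frame-0", b"steer:+2"),
          (b"model-v1", b"lidar-frame-1", b"steer:0")]

head = GENESIS
for model, sensors, actions in cycles:
    head = record_cycle(head, model, sensors, actions)

# Deterministic replay of the same log reproduces the same chain head.
replay = GENESIS
for model, sensors, actions in cycles:
    replay = record_cycle(replay, model, sensors, actions)
assert replay == head
```

Dispute resolution then amounts to re-executing the logged cycles in simulation and comparing chain heads, which again presupposes deterministic inference on the device.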
  • Interoperable provenance across content platforms
    • Sectors: social media, news, creative suites
    • Tools/workflows: standardize embedding of H(m), H(x), H(y) in C2PA/XMP; cross-platform verification of AI-origin claims
    • Dependencies: broad platform support; UX patterns; privacy considerations for prompt disclosure
  • Mixed-precision deterministic inference to close quality gaps
    • Sectors: all inference-heavy applications
    • Tools/workflows: INT16 or hybrid INT8/INT16 schemes (e.g., INT16 for attention, INT8 for FFN); deterministic lookup tables with higher resolution
    • Dependencies: quantization research; memory/latency trade studies; compatibility with deterministic kernels
  • Determinism-first MLOps products
    • Sectors: DevOps/MLOps vendors
    • Tools/workflows: “determinism SLOs,” fleet-wide hash health checks, cross-arch diff tools, automatic dispute/replay systems, provenance-aware feature stores
    • Dependencies: market demand; integration with existing observability stacks; model distribution governance

Notes on feasibility dependencies that cut across items:

  • Arithmetic requirements: two’s complement integer arithmetic and fixed evaluation order on all target hardware.
  • Model performance: some tasks may require INT16/mixed precision to maintain quality; quantization-aware techniques may be needed.
  • Floating-point leak elimination: distribute precomputed RoPE tables as binary artifacts to avoid platform variance, or use deterministic integer generation of tables.
  • Verification cost: re-execution requires access to models and compute; succinct proofs are not yet end-to-end for full transformer inference.
  • Governance and policy: regulators/standards bodies must recognize platform determinism and hash-based verification as sufficient evidence.
  • Ecosystem alignment: adoption requires model format stability (e.g., GGUF), deterministic sampling policies, consistent prompt canonicalization, and secure artifact management.
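The arithmetic requirement above is exactly the paper's non-associativity point: IEEE 754 addition depends on evaluation order, while fixed-order integer summation does not. A minimal sketch, transposing the paper's FP32 example with 2^-24 to Python's FP64 floats with 2^-53:

```python
eps = 2.0 ** -53  # FP64 analogue of the paper's FP32 example with 2^-24

left_to_right = ((1.0 + eps) + eps) + eps  # each tiny addend rounds away
pairwise      = (1.0 + eps) + (eps + eps)  # tiny addends combine first

assert left_to_right == 1.0  # round-to-nearest-even discards each eps
assert pairwise > 1.0        # same operands, different grouping, different result

# Integer addition is associative: every summation order agrees exactly.
vals = [1, 2, 3, 4]
assert sum(vals) == (vals[0] + vals[1]) + (vals[2] + vals[3]) \
                 == ((vals[0] + vals[2]) + vals[1]) + vals[3]
```

This is why SIMD lane width (NEON vs AVX2 vs AVX-512) changes floating-point results but cannot change two's-complement integer results.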

Glossary

  • AIR (Algebraic Intermediate Representation): A constraint system for expressing computations in STARK proofs as low-degree polynomial relations over execution traces. "The AIR has 6 trace columns and 4 constraints of degree ≤ 2,"
  • ARC engine: A pure-integer neural network inference engine designed for cross-platform, bitwise-identical outputs. "We resolve the barrier by constructing the ARC engine, a pure integer arithmetic inference engine that achieves bitwise identical output across ARM and x86 architectures."
  • Argmax: The operation that selects the index of the maximum value, often used for greedy decoding in LLMs. "The ARC engine uses greedy decoding (argmax over logits) for token selection."
  • Attestation (cryptographic attestation): A published cryptographic commitment to the inputs and outputs of a computation to enable later verification. "Attestation. The prover computes y = f(m, x) and publishes the attestation Q = (H(m), H(x), H(y))."
  • AVX-512: An x86 SIMD instruction set extension providing 512-bit vector operations that affect floating-point reduction order. "512-bit AVX-512 (8 lanes): (w_1x_1 + ... + w_8x_8) + ..."
  • AVX2: An x86 SIMD instruction set extension providing 256-bit vector operations that can change accumulation order and rounding. "distributed across 128-bit NEON lanes on ARM versus 256-bit AVX2 lanes on x86,"
  • BLAKE3: A modern cryptographic hash function used for fast, collision-resistant hashing of models, inputs, and outputs. "Let H be a collision-resistant hash function (e.g., BLAKE3)."
  • Byzantine fault tolerance: A consensus approach that ensures system reliability even when some participants act maliciously or arbitrarily. "Lamport, Shostak, and Pease [12] replaced trusted intermediaries with Byzantine fault tolerance."
  • Catalan number: A combinatorial sequence counting distinct binary tree shapes, used here to bound the number of floating-point reduction trees. "C(n) is the n-th Catalan number"
  • ChaCha20: A stream cipher/PRNG that can be deterministically seeded to produce reproducible sampling in token generation. "deterministic PRNG (e.g., ChaCha20)"
  • Circle STARK: A specific STARK proof system/stack used to prove correctness of inference subcomputations with compact commitments. "We provide Circle STARK proofs [9], [10] of inference layer computations with 152-byte on-chain commitment receipts,"
  • Collision-resistant hash: A hash function property that makes finding two distinct inputs with the same digest computationally infeasible. "Let H be a collision-resistant hash function (e.g., BLAKE3)."
  • DAG consensus: A consensus protocol organizing blocks in a Directed Acyclic Graph to achieve fast finality, used for attestation transactions. "We demonstrate multi-node deterministic inference through DAG consensus"
  • Dequantization: The process of converting low-precision integer representations back to floating-point; eliminating it can improve speed. "due to the efficiency of native integer operations and the elimination of FP32 dequantization overhead."
  • Determinism Thesis: The claim that platform-deterministic inference is necessary and sufficient for trustworthy AI. "We propose and prove the Determinism Thesis:"
  • Determinism-Verification Collapse: The result that determinism reduces verification of computations to O(1) hash comparison, while non-determinism makes verification intractable. "We prove a Determinism-Verification Collapse:"
  • Differential privacy: A formal privacy framework ensuring computations do not reveal too much about any individual data point. "differential privacy mechanisms"
  • Execution equivalence class: The set of all outputs obtainable by valid executions of the same computation across platforms. "the execution equivalence class is: E(f, m, x) = {f_h(m, x) : h ∈ H}"
  • FP32: 32-bit IEEE 754 floating-point representation. "Let v = (1.0, 2^-24, 2^-24, 2^-24) in FP32."
  • FRI (proximity proof): A subprotocol in STARKs used to prove that a function is close to a low-degree polynomial. "FRI proximity proof, Merkle commitments, constraint evaluations"
  • GGUF: A model file format for LLMs used by llama.cpp and related tools. "Our engine loads any model distributed in the GGUF format (the standard interchange format used by llama.cpp [14] and the broader open-weight ecosystem)."
  • Greedy decoding: A decoding strategy that selects the highest-probability token at each step without sampling. "The ARC engine uses greedy decoding (argmax over logits) for token selection."
  • IEEE 754: The standard defining floating-point arithmetic behavior, including rounding and non-associativity. "IEEE 754 floating-point arithmetic [4] is deterministic for individual operations but not for sequences."
  • INT8: 8-bit signed integer precision used for quantized weights to reduce memory and improve deterministic performance. "Weights are stored as INT8 (1 byte per parameter)"
  • Kahan summation: A compensated summation algorithm that reduces floating-point error but does not guarantee determinism across platforms. "using Kahan summation"
  • KV cache: The storage of key and value vectors from prior tokens to accelerate transformer attention. "Key and value vectors are cached at full Q16 (i64) precision across all sequence positions."
  • Lipschitz constant: A bound on how much a function can stretch distances, used to analyze error/difference propagation. "each sub-layer g; has Lipschitz constant di"
  • Merkle commitments: Hash-based commitments to large datasets enabling succinct verification through Merkle tree roots. "FRI proximity proof, Merkle commitments, constraint evaluations"
  • Mersenne-31 field: A finite field defined by a Mersenne prime modulus used for efficient STARK arithmetic. "over the Mersenne-31 field"
  • Membership problem: The task of deciding whether a claimed output belongs to the set of valid outputs under some execution semantics. "requires solving an intractable membership problem over combinatorially many valid outputs."
  • Newton-Raphson inverse square root: An iterative method to compute 1/sqrt(x), implemented here in fixed-point integers. "then apply Newton-Raphson inverse square root entirely in integer arithmetic"
  • Non-associative (floating-point addition): The property that (a+b)+c may differ from a+(b+c) due to rounding in floating-point arithmetic. "IEEE 754 floating-point addition is non-associative"
  • Number Theoretic Transform (NTT): A discrete Fourier transform over finite fields, used to accelerate polynomial operations in proofs. "For layers exceeding the NTT (Number Theoretic Transform) trace size limit (~2^24 rows),"
  • Operational trust entropy: A Rényi collision entropy measure over cross-platform outputs quantifying non-determinism. "The operational trust entropy is the Rényi collision entropy:"
  • Perplexity (PPL): A measure of how well a probabilistic model predicts a sample, commonly used to evaluate LLMs. "We separately measure perplexity (PPL) on WikiText-2,"
  • Platform-Deterministic Inference: An inference property where identical inputs yield identical outputs across all hardware platforms. "Definition 2 (Platform-Deterministic Inference)."
  • PRNG: Pseudorandom number generator; when deterministically seeded, it enables reproducible stochastic sampling. "deterministic PRNG (e.g., ChaCha20)"
  • Rényi collision entropy: A specific entropy measure (order-2 Rényi) used here to define trust entropy over execution outputs. "trust entropy (Rényi collision entropy over execution outputs)"
  • Residual neural network: A neural architecture with skip connections x_{i+1} = x_i + g_i(x_i) that affect error accumulation. "Let f be a residual neural network of L blocks,"
  • RMSNorm: Root Mean Square Normalization, a normalization technique used in Llama-style models. "Llama-class models use RMSNorm:"
  • RoPE (Rotary Position Embedding): A positional encoding method that rotates feature pairs by position-dependent angles. "Rotary Position Embedding encodes position through rotation:"
  • Round-to-nearest-even: The IEEE 754 default rounding mode that rounds to the nearest representable value, with ties to even. "round-to-nearest-even rounds down"
  • Second-preimage resistance: A hash security property making it infeasible to find a different input mapping to a specific hash. "By second-preimage resistance of H (implied by collision resistance),"
  • SiLU: The Sigmoid Linear Unit activation function defined as x·σ(x). "The SiLU activation SiLU(x) = x·σ(x) is implemented via a 257-entry exponential lookup table"
  • SIMD: Single Instruction, Multiple Data; parallel execution over vector lanes affecting floating-point reduction order. "across SIMD lanes of different widths,"
  • Softmax: A function that converts scores to a probability distribution, sensitive to small numerical differences. "through nonlinear activations and softmax normalization,"
  • STARK: A family of transparent, scalable proofs (Succinct Transparent ARguments of Knowledge) for verifying computations. "a STARK [9]"
  • Trust Dependency Hierarchy: The assertion that various trust properties (fairness, robustness, privacy, safety, alignment) depend on determinism. "Theorem 7 (Trust Dependency Hierarchy)."
  • Trust entropy: The paper’s measure of non-determinism across platforms, defined via Rényi collision entropy. "We introduce trust entropy (Rényi collision entropy over execution outputs)"
  • Two's complement: The integer representation used by most CPUs; its ring properties underpin deterministic integer arithmetic. "two's complement integer arithmetic"
  • ULP (unit in the last place): The spacing between adjacent floating-point numbers at a given magnitude. "ULP (unit in the last place)"
  • Verification complexity: The computational effort required to verify an output belongs to the set of valid executions. "Definition 15 (Verification Complexity)."
  • WGSL: WebGPU Shading Language used to implement deterministic GPU compute shaders. "9 cross-platform WGSL compute shaders"
