On the Foundations of Trustworthy Artificial Intelligence
Abstract: We prove that platform-deterministic inference is necessary and sufficient for trustworthy AI. We formalize this as the Determinism Thesis and introduce trust entropy to quantify the cost of non-determinism, proving that verification failure probability equals 1 - 2^(-H_T) exactly. We prove a Determinism-Verification Collapse: verification under determinism requires O(1) hash comparison; without it, the verifier faces an intractable membership problem. IEEE 754 floating-point arithmetic fundamentally violates the determinism requirement. We resolve this by constructing a pure integer inference engine that achieves bitwise identical output across ARM and x86. In 82 cross-architecture tests on models up to 6.7B parameters, we observe zero hash mismatches. Four geographically distributed nodes produce identical outputs, verified by 356 on-chain attestation transactions. Every major trust property of AI systems (fairness, robustness, privacy, safety, alignment) presupposes platform determinism. Our system, 99,000 lines of Rust deployed across three continents, establishes that AI trust is a question of arithmetic.
Explain it Like I'm 14
What is this paper about?
This paper argues a simple but powerful idea: for AI to be truly trustworthy, it must be deterministic. That means if you give the same AI model the same input, it should give the exact same output every time, no matter which computer it runs on. The authors show this is both necessary and sufficient for trust. They also build and test a new inference engine that makes LLMs behave this way across different kinds of hardware.
What questions did the authors ask?
- Do we need determinism to trust AI decisions, and is determinism enough to make AI verifiable and auditable?
- Why do today’s AI systems sometimes give different answers on different computers?
- Can we build a fast, large-scale AI engine that produces bit-for-bit identical results across different processors (like ARM and x86)?
- How does non-determinism (tiny numerical differences) grow inside deep neural networks?
- How hard is it to verify AI outputs when systems are deterministic versus non-deterministic?
- How do other trust goals (fairness, robustness, privacy, safety, alignment) depend on determinism?
How did they study it?
The authors combined theory, engineering, and experiments:
- They defined “platform-deterministic inference” (same model + same input → same output on any hardware) and “trustworthy AI” (verifiable, reproducible, auditable, and certifiable).
- They introduced “trust entropy,” a way to measure how much outputs vary across hardware. If every machine gives the same answer, trust entropy is zero. They showed the chance that an honest verification fails equals 1 - 2^(-H_T), so zero entropy means zero failure.
- They proved a “Determinism–Verification Collapse”: if outputs are deterministic, verifying them is as simple as re-running the model and comparing a short hash (a digital fingerprint). If outputs can vary, verification becomes much harder because you must accept many possible “valid” answers.
- They showed the main cause of variation is how computers handle decimals (floating-point numbers). Adding numbers with rounding isn’t perfectly consistent when done in different orders, and different chips group additions differently. This is like adding many small rounded numbers to a big number in different sequences: tiny rounding differences can add up to a different final result.
- To fix this, they built the ARC engine that does all math using integers instead of floats. Integer math doesn’t have rounding problems like this and gives the same answer on any computer that uses standard two’s complement integers.
- They tested their engine across different machines and across the globe. They also recorded results “on-chain” and used consensus methods so multiple nodes could agree on an AI’s answer. For smaller, portable checks, they generated compact STARK commitments (tiny cryptographic receipts) for parts of the computation.
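The rounding-order effect described above can be reproduced in a few lines of Rust. The values 1.0 and 2^-24 mirror the paper's own FP32 example; the integer half is a sketch of why pure integer arithmetic avoids the problem, not the ARC engine's actual code:

```rust
fn main() {
    // FP32 addition is not associative: grouping changes the rounded result.
    let eps = (2.0f32).powi(-24);
    let left = ((1.0f32 + eps) + eps) + eps;  // each tiny eps is rounded away
    let right = 1.0f32 + ((eps + eps) + eps); // the eps terms accumulate first
    assert_ne!(left.to_bits(), right.to_bits());

    // Two's-complement integer addition is associative: any grouping
    // produces bit-identical results on any conforming machine.
    let w = [3i64, -7, 11, 5];
    let l = ((w[0] + w[1]) + w[2]) + w[3];
    let r = w[0] + ((w[1] + w[2]) + w[3]);
    assert_eq!(l, r);
}
```

Different SIMD widths (NEON vs AVX2) effectively choose different groupings, which is why the same model can emit different bits on different chips.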
Technical terms in everyday language:
- Floating point vs. integer: Floating point is like working with decimals; small round-offs can differ by machine. Integer math is like working with whole numbers and consistent scaling, so every machine gets the same answer.
- Hash: A short “fingerprint” of data. If two outputs have the same hash, they’re the same with overwhelming probability.
- Consensus/DAG: A way for many computers to agree on a shared record, like everyone signing off that “we all got this exact answer.”
- STARK: A kind of cryptographic proof that a computation was done correctly; think of it as a compact, checkable receipt.
What did they find, and why does it matter?
Main findings:
- Determinism is necessary and sufficient for trustworthy AI. If outputs can change across hardware, you can’t reliably verify, reproduce, audit, or certify an AI’s behavior. If outputs are deterministic, you can do all four.
- Floating-point math breaks determinism. Because addition with rounding isn’t associative, different processors (with different vector widths or threading) add things in different orders and get slightly different numbers. In deep models, these tiny differences can snowball and lead to different generated tokens.
- Verification becomes easy under determinism. If the AI is deterministic, verifying an output can be as simple as re-running once and comparing one hash. If it isn’t, verification may require understanding or simulating the exact execution details of the other machine, which is impractical.
- A working fix exists today. Their ARC engine uses integer arithmetic for inference and produced bit-for-bit identical outputs for large models like Llama‑2‑7B across ARM and x86 machines in 82 cross-architecture tests (up to 1,024 tokens), with zero mismatches.
- Real-world consensus worked. Four nodes in different parts of the world independently ran the model and produced identical outputs, with 356 on-chain attestations to back it up.
- It’s fast. On their hardware, the integer engine was faster than a floating-point backend while staying deterministic.
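The re-run-and-compare protocol can be sketched as below. The `digest` and `infer` functions are illustrative stand-ins: the paper uses BLAKE3 and a real transformer, whereas `DefaultHasher` is not collision-resistant and `infer` is a toy. The O(1)-comparison shape of verification is the point:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Stand-in digest; the paper uses BLAKE3. Any collision-resistant hash
// works for the protocol sketch, but DefaultHasher is NOT one.
fn digest(bytes: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    bytes.hash(&mut h);
    h.finish()
}

// Hypothetical deterministic model: same (weights, prompt) -> same bytes.
fn infer(model: &[u8], input: &[u8]) -> Vec<u8> {
    model.iter().zip(input.iter().cycle()).map(|(m, x)| m ^ x).collect()
}

// Verification under determinism: re-run once, compare a single hash.
fn verify(model: &[u8], input: &[u8], claimed_output_hash: u64) -> bool {
    digest(&infer(model, input)) == claimed_output_hash
}

fn main() {
    let (m, x) = (b"weights".as_slice(), b"prompt".as_slice());
    let attested = digest(&infer(m, x)); // prover publishes H(y)
    assert!(verify(m, x, attested));     // verifier re-executes and compares
    assert!(!verify(m, b"other prompt", attested));
}
```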
Why this matters:
- If you want to trust an AI’s decision (say, a medical suggestion or a loan decision), you must be able to redo the same computation and get the same result. Otherwise, you can’t tell if a difference comes from cheating, a bug, or just hardware quirks.
- Many trust goals depend on determinism. For example, to audit fairness, you must recreate the exact decision path; to certify safety, you must know it will behave identically on different devices.
What methods did they use to make AI deterministic?
The key engineering ideas:
- Integer-only math for the forward pass: quantized INT8 weights and fixed-point activations (Q16), with careful design to avoid overflow.
- Integer versions of model parts like normalization, activations, and positional embeddings, using lookup tables and fixed procedures so every platform computes exactly the same numbers.
- Deterministic token selection: greedy decoding or fixed-seed sampling so the same input always leads to the same next token.
- Parallelism that doesn’t change results: independent pieces run in parallel but are combined in a fixed order.
- Cross-node agreement: output hashes posted on-chain and confirmed via consensus; anyone can re-run the model to check.
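A minimal sketch of the first and third ideas, assuming the INT8-weight / Q16-activation formats the summary describes. The function names and the lowest-index tie-break are illustrative choices, not the engine's actual API:

```rust
// Q16 fixed point: real value v is stored as round(v * 2^16) in i64.
const Q: u32 = 16;

// Dot product of an INT8 weight row with Q16 activations.
// i64 accumulation in a fixed left-to-right order: every conforming
// two's-complement machine produces the identical bit pattern.
fn dot_int8_q16(weights: &[i8], acts_q16: &[i64], scale_q16: i64) -> i64 {
    let mut acc: i64 = 0;
    for (&w, &a) in weights.iter().zip(acts_q16) {
        acc += (w as i64) * a; // |w| <= 127, so i64 headroom is ample
    }
    // apply the per-row scale while staying in integers (widen to i128)
    ((acc as i128 * scale_q16 as i128) >> Q) as i64
}

// Greedy decoding: argmax over logits with a fixed tie-break
// (lowest index wins), so token selection is itself deterministic.
fn greedy_token(logits: &[i64]) -> usize {
    let mut best = 0;
    for i in 1..logits.len() {
        if logits[i] > logits[best] {
            best = i;
        }
    }
    best
}

fn main() {
    let w: [i8; 3] = [1, -2, 3];
    let a: [i64; 3] = [1 << Q, 2 << Q, 3 << Q]; // activations 1.0, 2.0, 3.0
    let y = dot_int8_q16(&w, &a, 1 << Q);       // per-row scale 1.0
    assert_eq!(y, 6 << Q);                      // 1*1 - 2*2 + 3*3 = 6
    assert_eq!(greedy_token(&[5, 9, 9, 2]), 1); // tie broken to lowest index
}
```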
What’s the bigger impact?
- A foundation for trustworthy AI: The authors show that fairness, robustness, privacy, safety, and alignment checks all assume determinism. Without it, those checks can’t be independently verified.
- A practical path forward: Since many accelerators already support fast integer operations, moving AI inference to integer math can make systems both faster and more trustworthy.
- Better oversight and certification: Regulators, companies, and users can verify AI outputs by re-running them and comparing hashes, or by using compact cryptographic receipts.
- A shift in focus: The paper suggests that making AI trustworthy isn’t only about “alignment” or “interpretability.” It’s also about the arithmetic under the hood. Choosing the right math (deterministic integer inference) unlocks simple, reliable verification for everyone.
Knowledge Gaps
The paper advances a strong determinism-centric thesis and a working integer inference engine, but it leaves several concrete issues unresolved. Future work could address the following:
- Formal specification of the deterministic inference function:
- No canonical, machine-readable specification of all numerical semantics (integer widths, fixed-point formats, rounding/shift rules, saturation vs wrap, dequantization, normalization, tie-breaking for argmax/top-k/top-p) is provided to enable independent re-implementation and certification.
- Integer softmax and attention details:
- The paper does not describe a bit-exact, integer implementation of softmax for attention (exp, max-shift, normalization), its error bounds, or how overflow/underflow is prevented in worst-case sequences.
- Dynamic range and overflow guarantees:
- Only partial back-of-the-envelope bounds are given (e.g., dot-product safety in 64-bit). A complete, model-agnostic proof that all intermediate values (matmuls, RMSNorm sums of squares, softmax accumulations, SiLU) cannot overflow under supported dimensions (e.g., d=8192–16384, L up to 80+) and context lengths is missing.
- Deterministic RoPE tables:
- RoPE sin/cos are computed via FP64 at load time and “empirically” match after Q16 rounding. There is no formal guarantee across platforms/libraries. A standardized distribution (or a bit-exact integer CORDIC generator) and proof of cross-platform identity is not yet provided.
- Deterministic quantization pipeline:
- The engine assumes INT8 weights with per-row scales but does not define a bit-exact, cross-toolchain quantization spec (tie-breaking on rounding, scale computation, clipping rules) to ensure deterministic model preparation across platforms and versions.
- Tokenization determinism:
- The paper does not analyze determinism of the tokenization and text normalization stack (regex/Unicode differences across OS/locale/library versions), which can break end-to-end reproducibility before arithmetic begins.
- Completeness of determinism under parallelism:
- Beyond independent attention heads, the paper does not audit all kernels for potential cross-thread reductions, races, or reordering (e.g., layernorm reductions, residual accumulations) that might reintroduce non-determinism under aggressive compiler/vectorizer settings.
- Hardware and compiler assumptions:
- The determinism claim assumes two’s complement integers and arithmetic right-shift semantics. It does not present conformance tests across diverse CPUs/GPUs/NPUs (including saturating or non-ARSH hardware), nor robustness to different compilers/optimization levels/drivers or shader backends.
- GPU/accelerator portability:
- Deterministic equivalence is shown for Apple M2 Ultra vs x86 CPU; it does not validate bitwise identity on NVIDIA/AMD GPUs, TPUs, or AI ASICs using integer tensor cores, nor across different graphics APIs/drivers (e.g., SPIR-V, CUDA, Metal).
- Quality evaluation and trade-offs:
- Model quality is insufficiently assessed. Reported perplexity is not comparable (Chat vs base). There is no systematic benchmark suite (e.g., MMLU, GSM8K, HumanEval, long-context tasks), no ablations for INT8 vs INT16/mixed-precision, and no quantification of accuracy–determinism trade-offs.
- Long-context operation:
- Experiments are limited to ≤1,024 tokens (and 7B/1.1B models). It remains unknown whether determinism and quality hold for 4k–32k contexts, rope scaling/extrapolation, and potential numeric issues (e.g., attention score range, KV cache growth).
- Scope of the floating-point impossibility:
- Theorem 9 assumes hardware-determined reduction order. The paper does not investigate whether a canonical, software-enforced FP reduction tree and fixed rounding modes (or software FP) could recover cross-platform determinism and with what performance penalty.
- Verification complexity with execution traces:
- The “Determinism-Verification Collapse” frames non-deterministic verification as a membership problem over combinatorially many outputs, but does not analyze schemes where the prover supplies an explicit execution trace/schedule (or a proof thereof). Formal lower/upper bounds with trace-witnesses are not provided.
- Trust entropy (HT) measurement:
- While defined theoretically, there is no methodology to estimate HT in practice (e.g., sampling across real hardware populations), no empirical measurements, and no guidance on how to manage HT under heterogeneous fleets.
- STARK coverage and end-to-end proving:
- Current proofs cover dense layers only. Constraints for attention, normalization, and activations are missing, as are performance projections for full-model proofs (proof size, proving/verification time, recursion strategies, verifier resource constraints).
- Privacy and attestation security:
- Publishing H(x) and H(y) can leak information via dictionary attacks. Protocols for input/output privacy (salts, commitments with hiding, ZK attestations) and an adversarial/economic analysis of the dispute mechanism and DAG consensus are not provided.
- Side-channel and timing determinism:
- The paper does not analyze whether the deterministic kernels are also resistant to timing/cache-based side channels in multi-tenant settings, or whether constant-time properties are needed for trustworthy deployment.
- Deterministic randomness for stochastic features:
- For applications requiring randomness (e.g., temperature sampling, differential privacy), the proposal suggests deterministic PRNG seeding. It does not specify how to integrate verifiable randomness (e.g., VRFs/beacons) while preserving both reproducibility and unpredictability, nor the implications for DP guarantees.
- Training pipeline trust:
- The work focuses on inference. Deterministic or verifiable training (data order, augmentation, optimizer states, randomness) and data lineage/provenance remain open for end-to-end AI system trust.
- Robustness of integer wraparound semantics:
- Although mathematically well-defined, integer wraparound may silently degrade model behavior. There is no empirical or theoretical assurance that wrap does not occur in practice (or bounds on how often/where), nor mitigation strategies (e.g., saturation, wider accumulators).
- Formalization of Theorem 7 (trust dependency hierarchy):
- The reductions from fairness/robustness/privacy/safety/alignment to determinism are argued informally and at decision-level granularity. Formal proofs, counterexamples, and clearer scope (e.g., population-level audits that may not require per-decision reproducibility) are absent.
- Broader model classes and control flow:
- The approach is validated on standard transformer LMs. It does not address architectures with dynamic control flow (e.g., MoE gating, sparsity, conditional computation), vision models, or multi-modal pipelines where determinism across heterogeneous operators is harder.
- Standardization and certification path:
- There is no proposed standard (spec/test suite/reference IR) for “platform-deterministic inference,” no conformance tests for vendors, and no roadmap for regulatory bodies to recognize and certify deterministic inference across hardware.
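For concreteness, the integer-softmax gap above can be illustrated with a max-shifted Q16 sketch. Note that building the exp values inline with `f64::exp`, as done here for brevity, is precisely the floating-point leak the gap flags; a real deployment would ship the table as a frozen binary artifact:

```rust
const Q: u32 = 16; // Q16 fixed point

// Fixed-point exp for non-positive Q16 inputs. Computed via f64::exp here
// for brevity; a deterministic deployment would replace this with a
// distributed, bit-fixed lookup table, exactly the gap flagged above.
fn exp_q16(x_q: i64) -> i64 {
    debug_assert!(x_q <= 0);
    ((x_q as f64 / (1i64 << Q) as f64).exp() * (1i64 << Q) as f64).round() as i64
}

// Max-shifted softmax: subtracting the max keeps every exp argument
// non-positive, bounding exp_q16 by 2^16 and preventing overflow.
fn softmax_q16(logits_q: &[i64]) -> Vec<i64> {
    let m = *logits_q.iter().max().unwrap();
    let e: Vec<i64> = logits_q.iter().map(|&x| exp_q16(x - m)).collect();
    let s: i64 = e.iter().sum();
    // Floor division behaves identically on every machine, so outputs
    // are bit-identical wherever the exp table is identical.
    e.iter().map(|&ei| (ei << Q) / s).collect()
}

fn main() {
    let p = softmax_q16(&[0, 0]); // equal logits -> probabilities 0.5, 0.5
    assert_eq!(p, vec![1 << (Q - 1), 1 << (Q - 1)]);
}
```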
Practical Applications
Immediate Applications
Below are concrete, deployable use cases enabled by the paper’s findings (platform-deterministic inference, trust entropy, hash-based verification, on-chain attestation, and the ARC integer engine). Each item includes sector mapping, potential tools/products/workflows, and key feasibility dependencies.
- Deterministic, auditable AI decisions in regulated services
- Sectors: finance, healthcare, public sector
- Tools/workflows: ARC integer engine (INT8/Q16), BLAKE3 input/model/output hashes, deterministic PRNG (e.g., ChaCha20 seeded by H(m|x)) for reproducible sampling, attestation receipts attached to decisions
- Applications: credit decisions, medical triage/decision-support, benefit eligibility rulings with cryptographically verifiable receipts that auditors can re-execute on any hardware
- Assumptions/dependencies: model quality under INT8/Q16 is acceptable for the task; identical weight bytes (e.g., GGUF) available to auditors; organizations adopt fixed decode policies (greedy or deterministic PRNG seeding)
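The reproducible-sampling workflow can be sketched as follows, using SplitMix64 as an illustrative stand-in for the ChaCha20 PRNG named above; the seeding and sampling details are assumptions for the sketch, not the paper's protocol:

```rust
// SplitMix64 as a stand-in for ChaCha20: any fixed PRNG works, provided
// every node derives the seed the same way (e.g., from H(m || x)).
struct Prng(u64);

impl Prng {
    fn from_seed(seed: u64) -> Self { Prng(seed) }

    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E3779B97F4A7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58476D1CE4E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D049BB133111EB);
        z ^ (z >> 31)
    }

    // Sample a token index in [0, n) deterministically.
    fn sample(&mut self, n: u64) -> u64 { self.next_u64() % n }
}

fn main() {
    let seed = 0xDEADBEEF; // in practice: derived from H(m || x)
    let a: Vec<u64> = { let mut r = Prng::from_seed(seed); (0..4).map(|_| r.sample(32000)).collect() };
    let b: Vec<u64> = { let mut r = Prng::from_seed(seed); (0..4).map(|_| r.sample(32000)).collect() };
    assert_eq!(a, b); // identical "random" streams on every node
}
```

Because the stream is a pure function of the seed, sampled outputs stay re-executable and hash-verifiable just like greedy ones.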
- Compliance-grade reproducibility for internal audits and incident investigations
- Sectors: enterprise software, fintech, medtech
- Tools/workflows: “re-execute-to-verify” protocol using ARC; content-addressable logs keyed by H(m), H(x), H(y); deterministic replay of full traces
- Applications: post-hoc incident reconstructions; root-cause analysis without platform confounds
- Assumptions/dependencies: storage of prompts and model hashes; ability to share or escrow models for audit
- Cross-platform certification transfer for safety testing
- Sectors: robotics, avionics, medical devices, automotive
- Tools/workflows: certify behavior once (on any platform) and transfer proofs via platform-deterministic engine; attach certs/receipts to firmware or model packages
- Applications: reduce per-platform certification burden for the same deterministic model across ARM/x86/GPUs
- Assumptions/dependencies: regulators accept platform-determinism as equivalence; deterministic inference remains within model performance requirements
- Reproducible benchmarking and leaderboards
- Sectors: academia, ML benchmarking bodies, open-model communities
- Tools/workflows: publish H(m), H(dataset shards), fixed seeds; require zero hash mismatches across sites to accept results; measure “trust entropy” (HT) as a reported metric
- Applications: exact replication of leaderboard runs independent of hardware
- Assumptions/dependencies: datasets and models distributed with canonical hashes; community agreement on determinism requirements
- Verifiable decentralized inference and compute marketplaces
- Sectors: web3, cloud marketplaces, edge compute
- Tools/workflows: DAG consensus for output hashes, economic bonds with challenge periods, on-chain InferenceAttestation (H(m), H(x), H(y)), optional Circle STARK commitment receipts for dense layers
- Applications: marketplaces where clients pay only when outputs verify; trust-minimized multi-node inference
- Assumptions/dependencies: chain costs and latency acceptable; light clients may still require STARK receipts until full proofs cover all layers
- Content authenticity and provenance for AI-generated media
- Sectors: media, publishing, enterprise content, C2PA ecosystems
- Tools/workflows: append attestation receipts and hashes as provenance metadata; reproducible sampling ensures identical outputs from m and x; content-addressable storage via H(y)
- Applications: prove a given content item was generated by a specific model+prompt; facilitate downstream accountability
- Assumptions/dependencies: consuming platforms honor and persist provenance; deterministic sampling policy is fixed and disclosed
- Deterministic CI/CD and regression testing for model-serving
- Sectors: MLOps, software engineering
- Tools/workflows: golden hash baselines for prompts; gate deployments on hash equality; detect regressions deterministically across build targets and hardware
- Applications: reliable model upgrades and hotfixes without “it diverges on prod hardware” failures
- Assumptions/dependencies: stable model artifacts; migrations maintain deterministic arithmetic and fixed evaluation order
- Multi-cloud/high-availability inference with consensus on outputs
- Sectors: cloud, SaaS
- Tools/workflows: cross-region nodes compute and consensus on H(y); failover without changing outputs; single source of truth via DAG consensus
- Applications: resilient serving with identical results from any region/vendor
- Assumptions/dependencies: network latencies compatible with SLAs; shared model artifacts across regions
- Model substitution and tampering detection
- Sectors: security, compliance
- Tools/workflows: verify H(m) prior to inference; compare H(y) against expected; alerts on any mismatch
- Applications: detect model drift, poisoning, or unapproved hot-swap in production
- Assumptions/dependencies: secure artifact management and attestation in the deployment pipeline
- Fairness and privacy execution audits (per-decision)
- Sectors: finance, HR tech, govtech
- Tools/workflows: reproducible traces for feature influence and DP mechanism verification; compare audited run to attested run
- Applications: case-level fairness audits and DP “was noise applied correctly” checks
- Assumptions/dependencies: identical execution path and seed; auditors can access features and hashes; DP mechanism integrated in deterministic pipeline
- Deterministic sampling for creative and enterprise workflows
- Sectors: enterprise apps, creative tools
- Tools/workflows: seed PRNG from H(m|x); guarantee identical drafts/replies for the same input and model
- Applications: reproducible creative outputs for legal review, contract negotiation, customer communications
- Assumptions/dependencies: acceptance of fixed seeding; disclosure of determinism policy to users
- Content-addressable inference caching
- Sectors: infrastructure, CDN
- Tools/workflows: use (H(m), H(x)) → H(y) as cache key; cross-node reuse enabled by deterministic equality
- Applications: reduce compute cost across fleets; cache hits validated by hash
- Assumptions/dependencies: stable model artifacts; prompt canonicalization to ensure identical H(x)
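A minimal cache sketch, with `u64` placeholders standing in for BLAKE3 digests and an assumed `get_or_compute` interface:

```rust
use std::collections::HashMap;

// Cache keyed by (H(m), H(x)). Deterministic inference makes the mapping
// to the output well-defined, so any node's result can be reused by any
// other node after a hash check.
struct InferenceCache {
    map: HashMap<(u64, u64), Vec<u8>>,
}

impl InferenceCache {
    fn new() -> Self { Self { map: HashMap::new() } }

    fn get_or_compute(&mut self, h_m: u64, h_x: u64, run: impl FnOnce() -> Vec<u8>) -> &Vec<u8> {
        // Only a cache miss pays for inference; hits are pure lookups.
        self.map.entry((h_m, h_x)).or_insert_with(run)
    }
}

fn main() {
    let mut cache = InferenceCache::new();
    let mut runs = 0;
    for _ in 0..3 {
        cache.get_or_compute(0xAB, 0xCD, || { runs += 1; vec![1, 2, 3] });
    }
    assert_eq!(runs, 1); // inference executed once, served from cache twice
}
```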
- Academic reproducibility packages
- Sectors: academia, research labs
- Tools/workflows: bundle GGUF weights, integer lookup tables (e.g., RoPE), seeds, code version; publish hashes in papers
- Applications: exact replication of figures and ablations across labs
- Assumptions/dependencies: willingness to release model artifacts or controlled-access escrow
- On-device deterministic assistants with verifiable receipts
- Sectors: mobile, IoT
- Tools/workflows: ARC integer kernels on CPUs/NPUs/GPUs; attach receipts for critical interactions (e.g., health advice)
- Applications: edge AI that remains auditable without server trust
- Assumptions/dependencies: edge devices support two’s-complement integer ops; storage and UX for receipts
Long-Term Applications
These use cases require further research, scaling, standardization, or ecosystem adoption (e.g., mixed-precision quality improvements, full-protocol proofs, regulatory integration).
- Safety-critical certification regimes anchored in platform determinism
- Sectors: automotive, avionics, medical devices, energy grid
- Tools/workflows: regulatory standards mandating platform-deterministic inference for certified components; homologation once for all hardware
- Dependencies: regulator consensus; standardized test suites; documented determinism proofs; quality parity via INT16/mixed-precision integer paths
- End-to-end succinct proofs of full model inference
- Sectors: web3, cross-chain verification, compliance
- Tools/workflows: extend Circle STARKs (or similar) to cover attention, normalization, activations; generate layer-complete proofs with compact on-chain verification
- Dependencies: scalable AIR/arithmetization, proof performance at 10B–70B scale, verifier adoption
- Hardware and software standards for deterministic AI (IEEE/NIST/ISO)
- Sectors: semiconductors, OS vendors, standards bodies
- Tools/workflows: specifications for deterministic inference kernels, fixed reduction orders, integer-only APIs, standardized RoPE tables as binary artifacts
- Dependencies: industry alignment; conformance test harnesses; updates to ML runtimes
- Deterministic training and fine-tuning pipelines
- Sectors: foundation model labs, enterprise fine-tuning
- Tools/workflows: integer-friendly optimizers, deterministic data shuffling/seeding, quantization-aware training that preserves determinism
- Dependencies: research on integer training stability; performance-competitive kernels; mitigations for nondeterministic I/O and parallelism
- Enterprise and government procurement policies requiring determinism
- Sectors: public procurement, regulated industries
- Tools/workflows: RFP checklists for platform determinism; penalties for unverifiable outputs; mandatory HT = 0 (trust entropy) for decision-critical AI
- Dependencies: policy uptake; practical evaluation procedures; compliance-audit infrastructure
- Federated/multi-party analytics with trust-minimized verification
- Sectors: health networks, finance consortia, supply chains
- Tools/workflows: cross-organization inference with DAG consensus on outputs; dispute windows; hashed commitments to inputs/outputs
- Dependencies: governance agreements; privacy-preserving prompt/data hashing; interoperability of attestation formats
- Liability and legal frameworks for cryptographically attested AI outputs
- Sectors: legal services, insurance
- Tools/workflows: make receipts admissible evidence; insurance underwriting conditioned on deterministic auditability
- Dependencies: jurisprudence on digital attestations; standards for chain-of-custody of hashes and models
- Deterministic accelerators and kernel ecosystems
- Sectors: hardware vendors, compilers, ML frameworks
- Tools/workflows: native INT8/INT16 tensor cores with fixed evaluation orders; deterministic WGSL/CUDA kernels; compiler passes ensuring associativity-preserving schedules
- Dependencies: vendor roadmaps; performance parity; framework support (PyTorch/ONNX backends)
- Trust entropy (HT) as a deployment and risk metric
- Sectors: risk management, SRE, governance
- Tools/workflows: measure HT across fleet/hardware; set SLAs and guardrails (e.g., deploy only HT = 0 configurations on decision-critical paths)
- Dependencies: operational tooling to estimate HT; dashboards; policy mapping HT thresholds to allowed use
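Estimating HT operationally could look like the following sketch, which computes the Shannon entropy of output hashes observed across a fleet together with the paper's failure probability 1 - 2^(-H_T); the fleet-sampling interface is an assumption:

```rust
use std::collections::HashMap;

// Empirical trust entropy: H_T = -sum_i p_i * log2(p_i), where p_i is the
// fraction of nodes producing the i-th distinct output hash.
// H_T = 0 iff every node agrees.
fn trust_entropy(output_hashes: &[u64]) -> f64 {
    let mut counts: HashMap<u64, usize> = HashMap::new();
    for &h in output_hashes {
        *counts.entry(h).or_insert(0) += 1;
    }
    let n = output_hashes.len() as f64;
    counts.values().map(|&c| { let p = c as f64 / n; -p * p.log2() }).sum()
}

// The paper's verification failure probability: 1 - 2^(-H_T).
fn failure_probability(h_t: f64) -> f64 {
    1.0 - 2f64.powf(-h_t)
}

fn main() {
    // Homogeneous fleet: every node reports the same output hash.
    assert_eq!(trust_entropy(&[7, 7, 7, 7]), 0.0);
    assert_eq!(failure_probability(0.0), 0.0);
    // Split fleet: two outputs at 50/50 -> one bit of trust entropy.
    assert!((trust_entropy(&[7, 7, 9, 9]) - 1.0).abs() < 1e-12);
}
```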
- Deterministic A/B testing and model selection without platform confounds
- Sectors: product analytics, growth teams
- Tools/workflows: compare variants with identical seeds and hashes; ensure observed differences are model-driven, not hardware artifacts
- Dependencies: org-wide adoption of deterministic serving; consistent prompt canonicalization
- Verifiable autonomy “black box” recorders
- Sectors: drones, robotics, industrial controls
- Tools/workflows: log H(m), H(sensor input), H(action sequence) per cycle; dispute resolution by re-execution in simulations
- Dependencies: real-time deterministic inference on-device; storage and secure time-stamping; regulator acceptance
- Interoperable provenance across content platforms
- Sectors: social media, news, creative suites
- Tools/workflows: standardize embedding of H(m), H(x), H(y) in C2PA/XMP; cross-platform verification of AI-origin claims
- Dependencies: broad platform support; UX patterns; privacy considerations for prompt disclosure
- Mixed-precision deterministic inference to close quality gaps
- Sectors: all inference-heavy applications
- Tools/workflows: INT16 or hybrid INT8/INT16 schemes (e.g., INT16 for attention, INT8 for FFN); deterministic lookup tables with higher resolution
- Dependencies: quantization research; memory/latency trade studies; compatibility with deterministic kernels
- Determinism-first MLOps products
- Sectors: DevOps/MLOps vendors
- Tools/workflows: “determinism SLOs,” fleet-wide hash health checks, cross-arch diff tools, automatic dispute/replay systems, provenance-aware feature stores
- Dependencies: market demand; integration with existing observability stacks; model distribution governance
Notes on feasibility dependencies that cut across items:
- Arithmetic requirements: two’s complement integer arithmetic and fixed evaluation order on all target hardware.
- Model performance: some tasks may require INT16/mixed precision to maintain quality; quantization-aware techniques may be needed.
- Floating-point leak elimination: distribute precomputed RoPE tables as binary artifacts to avoid platform variance, or use deterministic integer generation of tables.
- Verification cost: re-execution requires access to models and compute; succinct proofs are not yet end-to-end for full transformer inference.
- Governance and policy: regulators/standards bodies must recognize platform determinism and hash-based verification as sufficient evidence.
- Ecosystem alignment: adoption requires model format stability (e.g., GGUF), deterministic sampling policies, consistent prompt canonicalization, and secure artifact management.
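The RoPE-table point can be made concrete: generate the table once, pin its digest, and have every node reject any table that is not bit-exact. The generator and the `DefaultHasher` digest below are illustrative stand-ins (the paper hashes with BLAKE3):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const Q: u32 = 16;

// Build a Q16 RoPE-style table. The f64 trig here is the very platform
// variance the note above warns about, which is why the *bytes* of the
// table, not the generator, would be the distributed artifact.
fn build_table(dim: usize) -> Vec<i64> {
    (0..dim)
        .map(|i| ((i as f64 * 0.1).sin() * (1i64 << Q) as f64).round() as i64)
        .collect()
}

// Pin the artifact: every node checks the table digest before inference,
// so a mismatched table is caught up front rather than as output drift.
fn table_digest(table: &[i64]) -> u64 {
    let mut h = DefaultHasher::new();
    table.hash(&mut h);
    h.finish()
}

fn main() {
    let shipped = build_table(64); // in practice: loaded from a binary file
    let pinned = table_digest(&shipped);
    let local = build_table(64);
    assert_eq!(table_digest(&local), pinned); // accept only a bit-exact table
}
```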
Glossary
- AIR (Algebraic Intermediate Representation): A constraint system for expressing computations in STARK proofs as low-degree polynomial relations over execution traces. "The AIR has 6 trace columns and 4 constraints of degree ≤ 2,"
- ARC engine: A pure-integer neural network inference engine designed for cross-platform, bitwise-identical outputs. "We resolve the barrier by constructing the ARC engine, a pure integer arithmetic inference engine that achieves bitwise identical output across ARM and x86 architectures."
- Argmax: The operation that selects the index of the maximum value, often used for greedy decoding in LLMs. "The ARC engine uses greedy decoding (argmax over logits) for token selection."
- Attestation (cryptographic attestation): A published cryptographic commitment to the inputs and outputs of a computation to enable later verification. "Attestation. The prover computes y = f(m, x) and publishes the attestation Q = (H(m), H(x), H(y))."
- AVX-512: An x86 SIMD instruction set extension providing 512-bit vector operations that affect floating-point reduction order. "512-bit AVX-512 (8 lanes): (w1x1 + ... + w8x8) + ..."
- AVX2: An x86 SIMD instruction set extension providing 256-bit vector operations that can change accumulation order and rounding. "distributed across 128-bit NEON lanes on ARM versus 256-bit AVX2 lanes on x86,"
- BLAKE3: A modern cryptographic hash function used for fast, collision-resistant hashing of models, inputs, and outputs. "Let H be a collision-resistant hash function (e.g., BLAKE3)."
- Byzantine fault tolerance: A consensus approach that ensures system reliability even when some participants act maliciously or arbitrarily. "Lamport, Shostak, and Pease [12] replaced trusted intermediaries with Byzantine fault tolerance."
- Catalan number: A combinatorial sequence counting distinct binary tree shapes, used here to bound the number of floating-point reduction trees. "C(n) is the n-th Catalan number"
- ChaCha20: A stream cipher/PRNG that can be deterministically seeded to produce reproducible sampling in token generation. "deterministic PRNG (e.g., ChaCha20)"
- Circle STARK: A specific STARK proof system/stack used to prove correctness of inference subcomputations with compact commitments. "We provide Circle STARK proofs [9], [10] of inference layer computations with 152-byte on-chain commitment receipts,"
- Collision-resistant hash: A hash function property that makes finding two distinct inputs with the same digest computationally infeasible. "Let H be a collision-resistant hash function (e.g., BLAKE3)."
- DAG consensus: A consensus protocol organizing blocks in a Directed Acyclic Graph to achieve fast finality, used for attestation transactions. "We demonstrate multi-node deterministic inference through DAG consensus"
- Dequantization: The process of converting low-precision integer representations back to floating-point; eliminating it can improve speed. "due to the efficiency of native integer operations and the elimination of FP32 dequantization overhead."
- Determinism Thesis: The claim that platform-deterministic inference is necessary and sufficient for trustworthy AI. "We propose and prove the Determinism Thesis:"
- Determinism-Verification Collapse: The result that determinism reduces verification of computations to O(1) hash comparison, while non-determinism makes verification intractable. "We prove a Determinism-Verification Collapse:"
- Differential privacy: A formal privacy framework ensuring computations do not reveal too much about any individual data point. "differential privacy mechanisms"
- Execution equivalence class: The set of all outputs obtainable by valid executions of the same computation across platforms. "the execution equivalence class is: E(f, m, x) = {f_h(m, x) : h ∈ H}"
- FP32: 32-bit IEEE 754 floating-point representation. "Let v = (1.0, 2^-24, 2^-24, 2^-24) in FP32."
- FRI (proximity proof): A subprotocol in STARKs used to prove that a function is close to a low-degree polynomial. "FRI proximity proof, Merkle commitments, constraint evaluations"
- GGUF: A model file format for LLMs used by llama.cpp and related tools. "Our engine loads any model distributed in the GGUF format (the standard interchange format used by llama.cpp [14] and the broader open-weight ecosystem)."
- Greedy decoding: A decoding strategy that selects the highest-probability token at each step without sampling. "The ARC engine uses greedy decoding (argmax over logits) for token selection."
- IEEE 754: The standard defining floating-point arithmetic behavior, including rounding and non-associativity. "IEEE 754 floating-point arithmetic [4] is deterministic for individual operations but not for sequences."
- INT8: 8-bit signed integer precision used for quantized weights to reduce memory and improve deterministic performance. "Weights are stored as INT8 (1 byte per parameter)"
- Kahan summation: A compensated summation algorithm that reduces floating-point error but does not guarantee determinism across platforms. "using Kahan summation"
- KV cache: The storage of key and value vectors from prior tokens to accelerate transformer attention. "Key and value vectors are cached at full Q16 (i64) precision across all sequence positions."
- Lipschitz constant: A bound on how much a function can stretch distances, used to analyze error/difference propagation. "each sub-layer g_i has Lipschitz constant d_i"
- Merkle commitments: Hash-based commitments to large datasets enabling succinct verification through Merkle tree roots. "FRI proximity proof, Merkle commitments, constraint evaluations"
- Mersenne-31 field: A finite field defined by a Mersenne prime modulus used for efficient STARK arithmetic. "over the Mersenne-31 field"
- Membership problem: The task of deciding whether a claimed output belongs to the set of valid outputs under some execution semantics. "requires solving an intractable membership problem over combinatorially many valid outputs."
- Newton-Raphson inverse square root: An iterative method to compute 1/sqrt(x), implemented here in fixed-point integers. "then apply Newton-Raphson inverse square root entirely in integer arithmetic"
- Non-associative (floating-point addition): The property that (a+b)+c may differ from a+(b+c) due to rounding in floating-point arithmetic. "IEEE 754 floating-point addition is non-associative"
- Number Theoretic Transform (NTT): A discrete Fourier transform over finite fields, used to accelerate polynomial operations in proofs. "For layers exceeding the NTT (Number Theoretic Transform) trace size limit (~2^24 rows),"
- Operational trust entropy: A Rényi collision entropy measure over cross-platform outputs quantifying non-determinism. "The operational trust entropy is the Rényi collision entropy:"
- Perplexity (PPL): A measure of how well a probabilistic model predicts a sample, commonly used to evaluate LLMs. "We separately measure perplexity (PPL) on WikiText-2,"
- Platform-Deterministic Inference: An inference property where identical inputs yield identical outputs across all hardware platforms. "Definition 2 (Platform-Deterministic Inference)."
- PRNG: Pseudorandom number generator; when deterministically seeded, it enables reproducible stochastic sampling. "deterministic PRNG (e.g., ChaCha20)"
- Rényi collision entropy: A specific entropy measure (order-2 Rényi) used here to define trust entropy over execution outputs. "trust entropy (Rényi collision entropy over execution outputs)"
- Residual neural network: A neural architecture with skip connections x_{i+1} = x_i + g_i(x_i) that affect error accumulation. "Let f be a residual neural network of L blocks,"
- RMSNorm: Root Mean Square Normalization, a normalization technique used in Llama-style models. "Llama-class models use RMSNorm:"
- RoPE (Rotary Position Embedding): A positional encoding method that rotates feature pairs by position-dependent angles. "Rotary Position Embedding encodes position through rotation:"
- Round-to-nearest-even: The IEEE 754 default rounding mode that rounds to the nearest representable value, with ties to even. "round-to-nearest-even rounds down"
- Second-preimage resistance: A hash security property making it infeasible to find a different input mapping to a specific hash. "By second-preimage resistance of H (implied by collision resistance),"
- SiLU: The Sigmoid Linear Unit activation function defined as x·σ(x). "The SiLU activation SiLU(x) = x·σ(x) is implemented via a 257-entry exponential lookup table"
- SIMD: Single Instruction, Multiple Data; parallel execution over vector lanes affecting floating-point reduction order. "across SIMD lanes of different widths,"
- Softmax: A function that converts scores to a probability distribution, sensitive to small numerical differences. "through nonlinear activations and softmax normalization,"
- STARK: A family of transparent, scalable proofs (Scalable Transparent ARguments of Knowledge) for verifying computations. "a STARK [9]"
- Trust Dependency Hierarchy: The assertion that various trust properties (fairness, robustness, privacy, safety, alignment) depend on determinism. "Theorem 7 (Trust Dependency Hierarchy)."
- Trust entropy: The paper’s measure of non-determinism across platforms, defined via Rényi collision entropy. "We introduce trust entropy (Rényi collision entropy over execution outputs)"
- Two's complement: The integer representation used by most CPUs; its ring properties underpin deterministic integer arithmetic. "two's complement integer arithmetic"
- ULP (unit in the last place): The spacing between adjacent floating-point numbers at a given magnitude. "ULP (unit in the last place)"
- Verification complexity: The computational effort required to verify an output belongs to the set of valid executions. "Definition 15 (Verification Complexity)."
- WGSL: WebGPU Shading Language used to implement deterministic GPU compute shaders. "9 cross-platform WGSL compute shaders"
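The non-associativity, FP32, and round-to-nearest-even entries above can be demonstrated with a minimal sketch using the paper's own example vector v = (1.0, 2^-24, 2^-24, 2^-24). Python floats are binary64, so this sketch emulates single-precision rounding by round-tripping through the `struct` module:

```python
import struct

def f32(x):
    # round a Python float to the nearest IEEE 754 binary32 value
    return struct.unpack('f', struct.pack('f', x))[0]

def add32(a, b):
    # single-precision addition with round-to-nearest-even
    return f32(f32(a) + f32(b))

eps = 2.0 ** -24
# left-to-right: each 2^-24 is absorbed, since ties round to even
left = add32(add32(add32(1.0, eps), eps), eps)
# tree order: the small terms combine first and survive rounding
tree = add32(add32(1.0, eps), add32(eps, eps))
print(left, tree)  # left stays 1.0; tree exceeds 1.0
```

The two reduction orders give different answers for the same operands, which is exactly why SIMD lane width (128-bit NEON versus 256-bit AVX2) can change the result.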
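The Kahan summation entry notes that compensated summation reduces floating-point error without guaranteeing cross-platform determinism. A minimal sketch of the algorithm, with an input chosen so naive summation absorbs every small term:

```python
def kahan_sum(values):
    total, c = 0.0, 0.0          # running sum and compensation term
    for v in values:
        y = v - c                # apply the correction from last step
        t = total + y
        c = (t - total) - y      # recover the low-order bits lost in t
        total = t
    return total

vals = [1.0] + [1e-16] * 10
naive = sum(vals)                # each 1e-16 is absorbed into 1.0
compensated = kahan_sum(vals)    # close to the exact 1.0 + 1e-15
```

The compensated result is far more accurate, but it still depends on iteration order, so it cannot substitute for the integer arithmetic the paper uses.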
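The Catalan number entry bounds how many distinct reduction trees a floating-point sum admits: an n-term sum has C(n-1) parenthesizations. A short sketch of the closed form:

```python
from math import comb

def catalan(n):
    # C(n) = binom(2n, n) / (n + 1), always an exact integer
    return comb(2 * n, n) // (n + 1)

# number of distinct binary reduction trees for an n-term sum is
# C(n-1); each tree can round differently under IEEE 754
print([catalan(n) for n in range(6)])  # [1, 1, 2, 5, 14, 42]
```

The count grows as ~4^n/n^1.5, which is why the paper treats membership testing over all valid floating-point outputs as intractable.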
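The trust entropy and Rényi collision entropy entries can be made concrete with a small sketch. The output distribution below is hypothetical, chosen only to exercise the abstract's identity P(fail) = 1 - 2^(-H_T), which reduces to 1 minus the collision probability:

```python
from math import log2

def trust_entropy(probs):
    # order-2 Rényi (collision) entropy over execution outputs:
    # H_2 = -log2(sum_i p_i^2)
    collision = sum(p * p for p in probs)
    return -log2(collision)

def verification_failure(probs):
    # the abstract's identity: P(fail) = 1 - 2^{-H_T} = 1 - sum p_i^2
    return 1.0 - 2.0 ** (-trust_entropy(probs))

deterministic = [1.0]            # one output on every platform
nondet = [0.5, 0.25, 0.25]       # hypothetical platform-dependent outputs
print(verification_failure(deterministic))  # 0.0
```

A deterministic system has zero trust entropy and zero verification failure probability; any spread of outputs across platforms makes both strictly positive.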
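The Determinism-Verification Collapse and collision-resistant hash entries describe verification as a single hash comparison. A minimal sketch of that idea, using SHA-256 from the standard library as a stand-in for BLAKE3 (which needs a third-party package); the plain concatenation commitment here is an illustrative assumption, not the paper's exact attestation format:

```python
import hashlib

def attest(model_id: bytes, inp: bytes, output: bytes) -> str:
    # commitment H(m || x || y); sha256 stands in for BLAKE3
    return hashlib.sha256(model_id + inp + output).hexdigest()

def verify(expected: str, model_id: bytes, inp: bytes, output: bytes) -> bool:
    # under platform determinism, verification is one O(1) comparison
    return attest(model_id, inp, output) == expected

receipt = attest(b"model-v1", b"prompt", b"token-stream")
assert verify(receipt, b"model-v1", b"prompt", b"token-stream")
assert not verify(receipt, b"model-v1", b"prompt", b"token-strean")
```

Without determinism the verifier cannot precompute a single expected digest, and the comparison collapses into the intractable membership problem defined above.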
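The greedy decoding entry says the engine selects argmax over logits. A minimal sketch, with hypothetical logit values, showing why determinism matters here: a one-ULP numerical difference between platforms can flip which token wins:

```python
def greedy_decode(logits):
    # deterministic argmax: strict comparison breaks ties toward
    # the lowest token index
    best = 0
    for i, v in enumerate(logits):
        if v > logits[best]:
            best = i
    return best

a = [0.1, 0.7000000000000001, 0.7]   # hypothetical platform-A logits
b = [0.1, 0.7, 0.7000000000000001]   # platform B: a one-ULP swap
print(greedy_decode(a), greedy_decode(b))  # different tokens selected
```

Once the two platforms emit different tokens, every subsequent token diverges through the KV cache, so output hashes can never match.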