NExT-OMNI: Multimodal Biomedical & AI Framework
- NExT-OMNI names two independently developed frameworks: a multi-contrast biomedical imaging platform and a discrete flow matching (DFM) AI model for any-to-any cross-modal analysis.
- The imaging framework synchronously integrates several modalities (CT, MRI, PET, etc.) within a shared ROI, addressing motion and registration challenges for precise diagnostics.
- The AI framework employs modality-specific encoders and a transformer backbone to enable efficient cross-modal generation, retrieval, and unified representation.
NExT-OMNI refers to two independently developed frameworks at the frontier of biomedical imaging and foundation models for multimodal artificial intelligence. Both systems embody the principle of fusing multiple information channels—whether biophysical or data-driven—for higher-fidelity, synchronous, and unified understanding or generation. In biomedical imaging, NExT-OMNI (“next-generation omni-tomography”) integrates five or more modalities (e.g., CT, MRI, PET, SPECT, ultrasound, optical) around a shared interior region of interest (ROI), enabling simultaneous, multi-contrast volumetric imaging. In the context of machine learning, NExT-OMNI designates an omnimodal foundation model built upon discrete flow matching (DFM), capable of any-to-any cross-modal understanding, generation, and retrieval by leveraging a unified representation across text, image, audio, and video domains. Both instantiations resolve historical limitations associated with sequential acquisition, registration, or decoupled architectures and provide rigorous algorithmic and architectural innovations to support next-generation scientific and technological applications (1212.5579, Luo et al., 15 Oct 2025).
1. Fundamental Goals and Motivations
Biomedical NExT-OMNI
NExT-OMNI in biomedical imaging addresses the central challenge of capturing rapid, complex physiological dynamics by synchronously acquiring multi-contrast data of living subjects. It aims to:
- Close the gap between in vitro omics data and in vivo phenotype through high-throughput, multi-parametric tomographic measurements.
- Support systems-biology studies of transient molecular and cellular events in situ (e.g., cardiac cycles, tumor microenvironments).
- Enable personalized and preventive medicine by precisely characterizing tissue heterogeneity, plaque vulnerability, and therapy response—in real time, under a unified protocol.
Achieving these objectives requires overcoming motion artifacts, geometric misalignment, and inconsistent contrast by collecting diverse modality signals—such as attenuation, relaxation, molecular binding, and perfusion—in parallel within a common ROI (1212.5579).
AI NExT-OMNI
In artificial intelligence, NExT-OMNI defines a path toward universal omnimodal intelligence: a foundation model that natively ingests and generates across text, image, audio, and video. The motivation is to:
- Break free from fragmented, task-decoupled, or autoregressive (AR) pipelines, whose sequential logic and modality-specific design limit cross-domain fusion and response efficiency.
- Realize a compact, unified generator and understanding engine that performs any-to-any mapping and retrieval while supporting multi-turn, interleaved, and high-throughput scenarios across all major data modalities.
This is realized by adopting a discrete flow matching backbone, enabling both parallel decoding and bidirectional feature integration (Luo et al., 15 Oct 2025).
2. Enabling Theory and Mathematical Foundations
Interior Tomography for Biomedical NExT-OMNI
Central to biomedical NExT-OMNI is interior tomography, which demonstrates that the ROI can be exactly and stably reconstructed from projections truncated to that region, provided certain structural priors:
- Theorem 1 (Known Subregion): If the attenuation function μ is known on a subregion of the ROI, then μ on the entire ROI is uniquely and stably reconstructable from line integrals passing through the ROI alone.
- Theorem 2 (Piecewise Polynomial): If μ is a piecewise nth-order polynomial on the ROI, μ is recoverable exactly via high-order total variation or moment-based inversion using truncated data.
- Theorem 3 (General Modalities): The interior tomography principle generalizes from attenuation tomography (CT) to SPECT, phase-contrast CT, and localized MRI.
A key formula is the truncated Hilbert transform along a chord L through the ROI,
g(x) = (H_L μ)(x) = (1/π) p.v. ∫_L μ(y) / (x − y) dy,
with ROI-restricted filtered backprojection yielding exact reconstructions under the specified priors (1212.5579).
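As a numerical aside (a sketch, not the paper's reconstruction code), the Hilbert transform appearing in the chord formula can be applied discretely with an FFT multiplier; the interior problem is then the harder task of inverting this operator from ROI-truncated samples under one of the priors above:

```python
import numpy as np

def hilbert_transform(f):
    """Discrete Hilbert transform via the Fourier multiplier -i*sign(omega).

    In interior tomography, only samples of g = H_L(mu) on the ROI portion of
    the chord are measured; recovering mu requires inverting this truncated
    operator under a prior (known subregion or piecewise-polynomial structure).
    """
    F = np.fft.fft(f)
    mult = -1j * np.sign(np.fft.fftfreq(len(f)))
    return np.real(np.fft.ifft(F * mult))
```

For a full-period cosine this reproduces the classical identity H(cos) = sin, which is a quick sanity check on the sign convention.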
Discrete Flow Matching for AI NExT-OMNI
DFM introduces a time-indexed family of discrete distributions p_t over token sequences x, smoothly interpolating between a base distribution p_0 and the data distribution p_1. Transitions are parameterized either as mixture paths, p_t(x | x_1) = (1 − κ_t) p_0(x) + κ_t δ_{x_1}(x), or as metric-induced paths, the latter using distances (e.g., cosine) between token embeddings together with a temperature schedule β_t. The framework optimizes a kinetic-optimal velocity field u_t for transporting discrete probability mass along these marginals.
Training minimizes the expected cross-entropy for predicting the clean tokens x_1 from corrupted tokens x_t at randomly sampled times t, optionally regularized by modality-specific reconstruction losses to preserve fine-grained semantic structure:
L = E_{t, x_1, x_t} [ −log p_θ(x_1 | x_t) ] + Σ_m λ_m L_rec^(m),
where dynamic balancing of the loss weights λ_m is performed via GradNorm (Luo et al., 15 Oct 2025).
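The mixture-path corruption and the cross-entropy objective can be sketched numerically (an illustrative toy, not the released implementation; the schedule κ_t = t and the uniform base distribution are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
V = 16  # toy vocabulary size

def corrupt(x1, t, rng=rng):
    """Mixture-path corruption with assumed schedule kappa(t) = t: each token
    keeps its clean value with probability t, otherwise it is resampled from
    a uniform base distribution p_0."""
    keep = rng.random(x1.shape) < t
    base = rng.integers(0, V, size=x1.shape)
    return np.where(keep, x1, base)

def ce_loss(logits, x1):
    """Expected cross-entropy for predicting clean tokens x1 from corrupted x_t."""
    logp = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -logp[np.arange(len(x1)), x1].mean()
```

At t = 1 the corruption is the identity and the loss of a uniform predictor is log V, which matches the usual sanity checks for masked/discrete diffusion-style objectives.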
3. System Architectures
Biomedical Imaging System
The NExT-OMNI imaging system consists of:
- Stationary, Multi-Source CT Ring: 8–16 carbon-nanotube field-emission X-ray sources and matching photon-counting detectors form a non-rotating ring, supporting electromagnetic compatibility and rapid frame rates.
- Open MRI Magnet Rings: Dual, coaxial, C-shaped rings (0.5–1.0 T) establish a homogeneous field, accommodating simultaneous CT and gradient coil operations.
- Unified Calibration: All components operate in a synchronized, registered reference frame, with precise isocenter alignment and time-stamped acquisition across modalities.
- Integration Option: PET, SPECT, ultrasound, and optical detectors are insertable modules sharing the synchronization bus and common ROI.
- Typical Parameters: Source–isocenter distance 50 cm, detector radius 60 cm, RF coil inner diameter 70 cm.
Data acquisition is governed by a master clock (CT up to 100 ms frame rate, MRI with synchronized TR/TE), with gating to prevent mutual electromagnetic interference (1212.5579).
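The gating idea can be illustrated with a hypothetical scheduler (the timing model below — RF/gradient activity occupying the first rf_ms of each TR, fixed-length CT frames — is a deliberate simplification, not the actual control logic): CT exposures are placed only in the quiet portion of each MRI repetition period.

```python
def gated_ct_starts(tr_ms, rf_ms, frame_ms, total_ms):
    """Schedule CT exposure start times on the master clock so that no frame
    overlaps the first rf_ms of any MRI TR (the assumed RF/gradient window)."""
    starts = []
    t = 0.0
    while t + frame_ms <= total_ms:
        phase = t % tr_ms
        if phase < rf_ms:                  # inside the RF/gradient window
            t += rf_ms - phase             # wait until the window closes
        elif phase + frame_ms > tr_ms:     # frame would spill into the next TR
            t += tr_ms - phase
        else:
            starts.append(t)
            t += frame_ms
    return starts
```

With a 10 ms TR, a 4 ms RF window, and 3 ms CT frames, two frames fit per TR, and every scheduled frame stays entirely inside the quiet interval.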
Omnimodal Foundation Model
The NExT-OMNI AI model comprises:
- Modality Encoders (Warmup): Vision encoder initialized from CLIP-ViT-Large (VQVAE with 4×4096 codes), audio encoder from Whisper-Turbo (VQVAE with 2×2048 codes), both trained with reconstructive and semantic alignment objectives. This yields unified discrete codebooks for downstream tokenization.
- DFM-Repurposed Transformer Backbone: A 7B-parameter transformer derived from Qwen2.5, modified for DFM; quantized visual/audio tokens and unaltered text/video tokens are interleaved and projected into a common embedding space with full bidirectional attention.
- Modality-Specific Heads: Lightweight decoders per modality for translating codebook indices back to original data; stable next-token decoding is adopted over parallel alternatives.
- Dynamic-Length Generation and Adaptive Caching: Responses are block-padded, enabling flexible extension/truncation during inference. Caching strategies accelerate denoising and provide a 1.2× inference speedup over AR baselines (Luo et al., 15 Oct 2025).
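Block-padded dynamic-length responses can be sketched as follows (the block size of 4 and pad id of −1 are illustrative choices, not values from the paper):

```python
PAD = -1  # illustrative pad token id

def pad_to_block(tokens, block=4, pad=PAD):
    """Pad a response up to the next block boundary so its length can be
    extended or truncated in whole blocks during iterative denoising."""
    return tokens + [pad] * ((-len(tokens)) % block)

def trim_pads(tokens, pad=PAD):
    """Drop trailing pads to recover the variable-length response."""
    n = len(tokens)
    while n and tokens[n - 1] == pad:
        n -= 1
    return tokens[:n]
```

Padding to block boundaries is what lets a parallel decoder grow or shrink a response without re-tokenizing the whole sequence.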
4. Algorithms and Training Protocols
Joint Reconstruction in Biomedical NExT-OMNI
Reconstruction is formalized as a joint optimization:
min_{f_CT, f_MRI} ||A_CT f_CT − g_CT||² + ||A_MRI f_MRI − g_MRI||² + λ R(f_CT, f_MRI),
where A_CT and A_MRI are the forward operators, g_CT and g_MRI the measured data, and R enforces cross-modality coupling (e.g., joint dictionary-based sparse coding over co-registered patches).
Alternating minimization cycles through sparse coding, dictionary updates, and modality-wise reconstructions (including filtered backprojection and non-Cartesian gridding). Compressed sensing ensures robustness to few-view and undersampled regimes (1212.5579).
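A toy version of the alternating scheme (with a simple quadratic coupling standing in for joint dictionary-based sparse coding, and ridge solves standing in for the per-modality reconstructions) looks like:

```python
import numpy as np

def joint_reconstruct(A1, g1, A2, g2, lam=1.0, iters=200):
    """Alternating minimization of
        ||A1 f1 - g1||^2 + ||A2 f2 - g2||^2 + lam * ||f1 - f2||^2.
    The quadratic coupling is an illustrative stand-in for the cross-modality
    prior R; each substep is a closed-form regularized least-squares solve."""
    n = A1.shape[1]
    f1 = np.zeros(n)
    f2 = np.zeros(n)
    I = np.eye(n)
    for _ in range(iters):
        f1 = np.linalg.solve(A1.T @ A1 + lam * I, A1.T @ g1 + lam * f2)
        f2 = np.linalg.solve(A2.T @ A2 + lam * I, A2.T @ g2 + lam * f1)
    return f1, f2
```

When both measurement channels are consistent with the same underlying object, the coupled iteration converges to it, which is the toy analogue of voxel-registered multi-modal agreement.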
Multistage Training in AI NExT-OMNI
NExT-OMNI's pipeline advances in three stages:
Stage I: Pre-Training (PT)
- Data: 32M image-text pairs, 25M text-image, 16M audio-text, 6M text-audio, plus 4M text only.
- Hyperparameters: Encoders/decoders LR 2e-5, others 1e-4, AdamW optimizer.
Stage II: Continued PT (CPT)
- Data: Increased resolution (384²), 10M video seeds, 2M text-video, 12M audio, synthetic/real mixes.
- Context window and batch sizes expanded.
Stage III: SFT (Instruction Tuning)
- Data: 7.6M image generation instructions, 0.5M audio-text dialogues, 1.7M image dialogues, 0.9M video Q&A, additional multi-modal and reasoning data.
Curriculum resampling and contrastive caption filtering promote convergence. Standard augmentation techniques are employed (Luo et al., 15 Oct 2025).
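The GradNorm-based loss balancing mentioned in Section 2 can be caricatured as follows (a deliberately simplified sign-step variant; the published method differentiates through the weighted gradient norms, and alpha, lr, and the renormalization here are assumptions, not the paper's settings):

```python
import numpy as np

def gradnorm_update(weights, grad_norms, loss_ratios, alpha=1.5, lr=0.1):
    """One simplified GradNorm-style step: nudge each task weight so its
    gradient norm tracks the mean norm scaled by the task's relative inverse
    training rate r_i ** alpha, then renormalize weights to sum to #tasks."""
    w = np.asarray(weights, dtype=float)
    g = np.asarray(grad_norms, dtype=float)
    r = np.asarray(loss_ratios, dtype=float) / np.mean(loss_ratios)
    target = np.mean(g) * r ** alpha
    w = np.clip(w - lr * np.sign(g - target), 1e-3, None)
    return w * len(w) / w.sum()
```

A task whose gradients are larger than its target is down-weighted and vice versa, which is the qualitative behavior the multistage training relies on to keep modality losses comparable.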
5. Benchmark Results and Applications
Biomedical Imaging Performance
Representative outcomes include:
| Task | NExT-OMNI | CT Alone | MRI Alone |
|---|---|---|---|
| Min. fibrous cap detectability | 0.4 mm | 0.7 mm | 0.6 mm |
| Lipid core vol. estimation error | 5% | 12% | 10% |
| Sensitivity/specificity (plaque) | 94%/92% | 82%/85% | 86%/88% |
In tumor models:
- Dice coefficient for hypoxic segmentation: 0.89 (omni) vs. 0.74 (MRI only).
- Correlation of Ktrans (MRI) with iodine perfusion (CT): R=0.92 (omni) vs. 0.78 (separate scans).
- Detection sensitivity for resistant subvolumes: 95% (omni) vs. 70% (PET alone).
These improvements derive from simultaneous, voxel-registered, multi-parametric measurement (1212.5579).
AI Omnimodal Model Benchmarking
Table: Selected NExT-OMNI results (AR = autoregressive; DFM = discrete flow matching):
| Task | NExT-OMNI | Prior SOTA | Other Models |
|---|---|---|---|
| Omnimodal understanding (avg. accuracy) | 39.7% | OpenOmni: 36.5% | - |
| Multi-turn vision interaction (OpenING) | 55.0% | SEED-X: 50.2% | MMaDA: 47.7% |
| Speech QA (LLaMA/WebQA) | 62.0/47.4 | Stream-Omni: 60.3/46.3 | - |
| Cross-modal retrieval (Top-5, M-BEIR) | 32.9% | Bagel: 28.5% | Janus: 26.6% |
Ablations confirm that unified DFM boosts generation/retrieval and that combined dynamic generation and reconstruction terms yield the best balance on understanding, generation, and retrieval tasks (Luo et al., 15 Oct 2025).
6. Limitations and Prospective Directions
Biomedical Imaging
- Extension to PET/SPECT requires integrating stationary gamma detectors with interior tomography modeling.
- Real-time reconstruction speedups are contingent on advanced GPU solvers.
- Incorporation of multi-physics contrasts (e.g., photoacoustics) and robust EMI/cost mitigation must proceed for clinical translation.
- Regulatory and commercial deployment pathways remain open research questions (1212.5579).
Omnimodal Foundation Modeling
- All reported AI NExT-OMNI benchmarks are at 7B parameter scale (~2T tokens). Scaling to larger core models is an open area.
- Video generation is currently limited to clip lengths of ≤8 frames.
- Theoretical and empirical understanding of discrete-flow unification for complex multimodal reasoning tasks is an ongoing topic.
- Prospective extensions include adding modalities (3D, haptics), supporting vision-language-action scenarios, and advancing the theoretical substrate for discrete flows (Luo et al., 15 Oct 2025).
References:
- "Omni-tomography: Next-generation Biomedical Imaging" (1212.5579)
- "NExT-OMNI: Towards Any-to-Any Omnimodal Foundation Models with Discrete Flow Matching" (Luo et al., 15 Oct 2025)