
JD3P: Joint Discrete Denoising Diffusion Process

Updated 29 January 2026
  • The paper introduces JD3P, demonstrating a unified framework that extends diffusion-based generative models to jointly synthesize discrete and continuous data.
  • It details a dual-channel approach using Gaussian and Gaussian-softmax diffusions to handle heterogeneous data modalities with permutation invariance.
  • Empirical results show significant improvements in tasks like CAD sketch generation and vision-language-action modeling, enhancing fidelity and efficiency.

The Joint Discrete Denoising Diffusion Process (JD3P) is a family of probabilistic generative models that extend denoising diffusion processes to handle structured data comprising both discrete and continuous variables, or multiple interdependent discrete modalities. JD3P provides a unified, theoretically grounded, and highly practical framework for stochastic generation and modeling in domains requiring the synthesis or reconstruction of heterogeneous symbolic, parametric, and spatial data. Key innovations in JD3P include forward and reverse noising/denoising processes tailored to both categorical and continuous domains, permutation-invariant modeling, and mechanisms for joint multimodal co-refinement, as exemplified in state-of-the-art systems for computer-aided design (CAD) sketch generation (Chereddy et al., 15 Jul 2025), unified vision-language-action modeling (Chen et al., 3 Nov 2025), and discrete Markov frameworks (Campbell et al., 2022).

1. Mathematical Foundations of JD3P

JD3P generalizes denoising diffusion probabilistic models (DDPMs) to the joint generation of sequence elements, where each primitive is characterized by a tuple of continuous and discrete components. In the canonical SketchDNN formulation (Chereddy et al., 15 Jul 2025), a CAD primitive is defined as x_i = (b_i, c_i, p_i), with b_i a binary construction-aid flag, c_i a categorical class label (e.g., Line, Circle, Arc, Point), and p_i a vector of class-dependent continuous parameters. The forward diffusion process consists of two independent Markov chains:

  • Continuous channel: Each parameter vector p_t is diffused through a Gaussian kernel:

q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}\left(\mathbf{x}_t;\ \sqrt{\alpha_t}\,\mathbf{x}_{t-1},\ \beta_t I\right).

The closed-form marginal q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}\left(\sqrt{\bar\alpha_t}\,\mathbf{x}_0,\ (1-\bar\alpha_t) I\right), with \bar\alpha_t = \prod_{s \le t} \alpha_s, permits direct sampling of any noise level at an arbitrary timestep.

  • Discrete channel (Gaussian-Softmax diffusion): Discrete variables are diffused by adding Gaussian noise to the logit space of the one-hot encoding and projecting to the probability simplex via softmax at each step:

\mathbf{u}_t = \sqrt{\alpha_t}\,\log(\mathbf{y}_{t-1}) + \sqrt{1-\alpha_t}\,\varepsilon, \quad \varepsilon \sim \mathcal{N}(0, I),

\mathbf{y}_t = \mathrm{softmax}(\mathbf{u}_t).

As t \to T, the distribution approaches the uniform distribution on the probability simplex.
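As an illustration, the two forward channels can be sketched in a few lines of NumPy. This is a minimal sketch: the function names, the constant α-schedule value, and the log-floor applied to the one-hot input are assumptions for illustration, not details from the papers.

```python
import numpy as np

def gaussian_forward_marginal(x0, alpha_bar_t, rng):
    """Continuous channel: sample x_t ~ q(x_t | x_0) in closed form."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

def gaussian_softmax_step(y_prev, alpha_t, rng, log_floor=1e-20):
    """Discrete channel: one Gaussian-softmax step -- noise the
    log-probabilities, then project back onto the simplex."""
    eps = rng.standard_normal(y_prev.shape)
    u = np.sqrt(alpha_t) * np.log(y_prev + log_floor) + np.sqrt(1.0 - alpha_t) * eps
    u = u - u.max(axis=-1, keepdims=True)  # numerical stability before softmax
    e = np.exp(u)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
y = np.array([1.0, 0.0, 0.0, 0.0])  # one-hot class label (4 classes)
for _ in range(200):                # after many steps, y drifts toward uniform
    y = gaussian_softmax_step(y, alpha_t=0.98, rng=rng)
```

Note that every intermediate state of the discrete chain remains a valid point on the simplex, which is what allows the class probabilities to gate continuous parameter blocks during denoising.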

The model parameters are optimized by minimizing a loss combining mean-squared error (MSE) for continuous variables and cross-entropy for categorical variables, corresponding to a variational lower bound (ELBO) with distinct continuous and discrete terms.
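A minimal sketch of the hybrid objective, assuming noise-prediction targets for the continuous channel and class logits for the categorical channel; the weighting `lam` between the two terms is an assumption, not a value reported in the papers.

```python
import numpy as np

def jd3p_loss(pred_noise, true_noise, pred_logits, true_class, lam=1.0):
    """Hybrid objective: MSE on the continuous channel plus
    cross-entropy on the categorical channel."""
    mse = np.mean((pred_noise - true_noise) ** 2)
    # numerically stable log-softmax over the class logits
    z = pred_logits - pred_logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    ce = -np.mean(log_probs[np.arange(len(true_class)), true_class])
    return mse + lam * ce
```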

2. Joint Modeling of Multimodal and Heterogeneous Data

JD3P's formalism naturally supports data where each element or modality follows distinctive statistical laws. In SketchDNN (Chereddy et al., 15 Jul 2025), the challenge of heterogeneous parameterizations is solved by embedding all primitive types in a unified vector and gating each block via the predicted class probabilities (“superposition”). This superposition enables continuous gradients even for mixed-symbolic representations and improves the robustness and flexibility of learned models. For CAD sketches as unordered sets, forward and reverse chains are constructed to factorize over primitives, and transformer-based denoisers are implemented without positional encodings to enforce permutation invariance.
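The gating idea can be illustrated as follows; the block layout and names are hypothetical and stand in for SketchDNN's actual parameterization.

```python
import numpy as np

def superpose_params(class_probs, param_blocks):
    """Blend class-specific parameter blocks by the predicted class
    probabilities, so gradients flow even while the class is uncertain."""
    # class_probs: (K,) point on the simplex; param_blocks: (K, D)
    return class_probs @ param_blocks

probs = np.array([0.7, 0.2, 0.1])        # e.g. Line / Circle / Arc
blocks = np.array([[1.0, 0.0],           # per-class parameter vectors
                   [0.0, 1.0],
                   [0.5, 0.5]])
mixed = superpose_params(probs, blocks)  # soft mixture of all blocks
```

As the class distribution sharpens during denoising, the mixture collapses onto a single class's parameter block.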

In vision-language-action applications (Chen et al., 3 Nov 2025), JD3P is employed to jointly denoise actions and visual predictions by concatenating future vision and action tokens in a single discrete diffusion trajectory. The approach extends to tokenized multimodal sequences, where language, vision, and action blocks share the same transformer backbone under a hybrid attention mask structure.

3. Forward and Reverse Processes

JD3P’s forward (noising) processes are adapted to the underlying data: Gaussian for continuous variables and stochastic masking or noise-injection for discrete/categorical variables. Reverse processes are learned via deep neural networks, predicting either the denoised initial (clean) state or categorical probabilities. The analytic reverse step for Gaussian-Softmax diffusion matches the posterior of the forward process, ensuring consistent joint denoising.

SketchDNN reverse for continuous:

\mathbf{x}_{t-1} = \mu_\theta(\mathbf{x}_t, t) + \sigma_t\,\varepsilon,

with μθ\mu_\theta and σt\sigma_t derived from the forward kernel and the network’s denoised prediction.

SketchDNN reverse for discrete:

\mathbf{u}_{t-1} = \frac{\sqrt{\alpha_t}\,(1-\bar\alpha_{t-1})\log(\mathbf{y}_t) + \sqrt{\bar\alpha_{t-1}}\,(1-\alpha_t)\log(\hat{\mathbf{y}}_0^\theta)}{1-\bar\alpha_t} + \sigma_t\,\varepsilon,

followed by softmax projection.
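Both reverse updates can be sketched directly from the formulas above; `mu_theta` and `y0_hat` stand in for the network's denoised predictions, and the log-floor is a numerical-stability assumption.

```python
import numpy as np

def reverse_continuous(mu_theta, sigma_t, rng):
    """Ancestral reverse step for the continuous channel:
    x_{t-1} = mu_theta(x_t, t) + sigma_t * eps."""
    return mu_theta + sigma_t * rng.standard_normal(mu_theta.shape)

def reverse_discrete(y_t, y0_hat, alpha_t, abar_t, abar_prev, sigma_t, rng,
                     log_floor=1e-20):
    """Posterior-matched reverse step for the Gaussian-softmax channel,
    followed by the softmax projection."""
    num = (np.sqrt(alpha_t) * (1.0 - abar_prev) * np.log(y_t + log_floor)
           + np.sqrt(abar_prev) * (1.0 - alpha_t) * np.log(y0_hat + log_floor))
    u = num / (1.0 - abar_t) + sigma_t * rng.standard_normal(y_t.shape)
    u = u - u.max(axis=-1, keepdims=True)  # stability before softmax
    e = np.exp(u)
    return e / e.sum(axis=-1, keepdims=True)
```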

In discrete-only settings (Chen et al., 3 Nov 2025), the forward process replaces tokens with a mask symbol independently and reverses via conditional categorical logits. The loss is computed over masked positions via cross-entropy.
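A minimal sketch of this absorbing-state (masked) process; the mask sentinel and the per-position masking probability are assumptions for illustration (real systems reserve a vocabulary slot for the mask token and use a schedule over timesteps).

```python
import numpy as np

MASK = -1  # sentinel mask id (an assumption; see lead-in)

def mask_forward(tokens, mask_prob, rng):
    """Absorbing-state forward process: independently replace each token
    with the mask symbol with probability mask_prob."""
    keep = rng.random(tokens.shape) >= mask_prob
    return np.where(keep, tokens, MASK)

def masked_cross_entropy(logits, targets, noisy):
    """Cross-entropy computed only over the masked positions."""
    masked = noisy == MASK
    z = logits - logits.max(axis=-1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -np.mean(logp[masked, targets[masked]])
```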

4. Permutation Invariance and Structured Prediction

JD3P advances permutation-invariant modeling by constructing Markov chains and neural denoisers that do not depend on primitive order. In CAD sketch modeling (Chereddy et al., 15 Jul 2025), both forward and reverse chains are factorized across primitives, and transformers are stripped of positional encodings to maintain the necessary permutation equivariance. Empirical ablation studies confirm that this design decision is crucial: adding positional encodings slightly degrades performance, while replacing the Gaussian-Softmax superposition causes a substantial loss in both fidelity and likelihood.
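The effect of dropping positional encodings can be checked on a toy single-head attention layer: with no positional signal, permuting the input set permutes the output identically. This is a self-contained NumPy sketch, not the paper's architecture.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention with no positional encodings: permuting
    the input rows permutes the output rows identically (equivariance)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    A = np.exp(scores)
    A = A / A.sum(axis=-1, keepdims=True)  # row-wise softmax
    return A @ V

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))          # 5 primitives, 8-dim features
Wq, Wk, Wv = (rng.standard_normal((8, 8)) for _ in range(3))
perm = rng.permutation(5)
out_of_permuted = self_attention(X[perm], Wq, Wk, Wv)
permuted_out = self_attention(X, Wq, Wk, Wv)[perm]
```

Attending over permuted inputs and permuting the outputs give the same result, so a denoiser built from such layers treats a sketch as an unordered set of primitives.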

5. Practical Implementation and Algorithmic Schemes

JD3P models are trained via stochastic gradient descent using mini-batches and randomly sampled timesteps. Both SketchDNN and UD-VLA provide explicit pseudocode outlining forward diffusion, denoising network calls, computation of MSE and cross-entropy losses, and parameter update steps.

Sampling proceeds by initializing variables at the maximally noisy state and performing denoising steps iteratively. For multimodal discrete domains (Chen et al., 3 Nov 2025), inference-time techniques include prefix key–value caching, adaptive mask schedules, tempered sampling, and sub-vocabulary masking to enforce modality coherence during generation.
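The generic sampling loop can be sketched as follows; the `denoiser(x, t)` interface and the σ-schedule are assumptions standing in for the trained network and the papers' exact schedules.

```python
import numpy as np

def sample(denoiser, shape, alphas, rng):
    """Start from the maximally noisy state and iterate reverse steps;
    denoiser(x, t) returns the reverse-kernel mean."""
    x = rng.standard_normal(shape)  # maximally noisy continuous state
    for t in reversed(range(len(alphas))):
        sigma_t = np.sqrt(1.0 - alphas[t]) if t > 0 else 0.0  # no noise at t=0
        x = denoiser(x, t) + sigma_t * rng.standard_normal(shape)
    return x

rng = np.random.default_rng(0)
# a toy denoiser that shrinks toward zero stands in for the trained network
out = sample(lambda x, t: 0.5 * x, (4,), np.full(10, 0.99), rng)
```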

6. Empirical Results and Performance Benchmarks

On the SketchGraphs CAD dataset, JD3P delivers state-of-the-art results in both likelihood and sample quality: negative log-likelihood is reduced from 84.80 to 81.33 bits/sketch, and Fréchet Inception Distance from 16.04 to 7.80 relative to autoregressive baselines (Chereddy et al., 15 Jul 2025). Ablations establish the critical advantage of Gaussian-Softmax superposition and permutation-invariant denoising.

For unified vision-language-action systems (Chen et al., 3 Nov 2025), JD3P yields faster inference (4× faster than autoregressive baselines) and higher scores on embodied robotics benchmarks, with improvements in both average trajectory lengths and task success rates over competing architectures.

The general framework of JD3P is extensible to a variety of domains, including fully discrete spaces (Campbell et al., 2022), where continuous-time Markov chain (CTMC) analogues provide efficient training, sharp theoretical guarantees (explicit ELBOs and TV distance bounds), and powerful tau-leaping samplers.

Latent Discrete Diffusion Models (LDDMs) (Shariatian et al., 20 Oct 2025) further extend the concept by coupling discrete diffusion of tokens with continuous latent channels, capturing inter-token dependencies that are otherwise lost with factorized masked denoisers. This highlights JD3P’s flexibility in supporting both joint and sequential coupling of modalities, with empirical improvements in unconditional language modeling and structured prediction.

In summary, JD3P frameworks unify discrete and continuous diffusion trajectories, enabling joint, permutation-invariant, and cross-modal generative modeling with provable training objectives and scalable sampling algorithms. Their effectiveness is established across domains such as CAD synthesis, embodied vision-language-action, and generic discrete data generation.
