JD3P: Joint Discrete Denoising Diffusion
- The paper introduces JD3P, demonstrating a unified framework that extends diffusion-based generative models to jointly synthesize discrete and continuous data.
- It details a dual-channel approach using Gaussian and Gaussian-softmax diffusions to handle heterogeneous data modalities with permutation invariance.
- Empirical results show significant improvements in tasks like CAD sketch generation and vision-language-action modeling, enhancing fidelity and efficiency.
The Joint Discrete Denoising Diffusion Process (JD3P) is a family of probabilistic generative models that extend denoising diffusion processes to handle structured data comprising both discrete and continuous variables, or multiple interdependent discrete modalities. JD3P provides a unified, theoretically grounded, and highly practical framework for stochastic generation and modeling in domains requiring the synthesis or reconstruction of heterogeneous symbolic, parametric, and spatial data. Key innovations in JD3P include forward and reverse noising/denoising processes tailored to both categorical and continuous domains, permutation-invariant modeling, and mechanisms for joint multimodal co-refinement, as exemplified in state-of-the-art systems for computer-aided design (CAD) sketch generation (Chereddy et al., 15 Jul 2025), unified vision-language-action modeling (Chen et al., 3 Nov 2025), and discrete Markov frameworks (Campbell et al., 2022).
1. Mathematical Foundations of JD3P
JD3P generalizes denoising diffusion probabilistic models (DDPMs) to the joint generation of sequence elements where each primitive is characterized by a tuple of continuous and discrete components. In the canonical SketchDNN formulation (Chereddy et al., 15 Jul 2025), a CAD primitive is defined as a tuple $(b, c, \mathbf{p})$, with $b$ a binary construction-aid flag, $c$ a categorical class label (e.g., Line, Circle, Arc, Point), and $\mathbf{p}$ a vector of class-dependent continuous parameters. The forward diffusion process consists of two independent Markov chains:
- Continuous channel: each continuous parameter vector $\mathbf{p}_0$ is diffused through a Gaussian Markov chain $q(\mathbf{p}_t \mid \mathbf{p}_{t-1}) = \mathcal{N}(\mathbf{p}_t;\ \sqrt{1-\beta_t}\,\mathbf{p}_{t-1},\ \beta_t \mathbf{I})$. The closed-form marginal $q(\mathbf{p}_t \mid \mathbf{p}_0) = \mathcal{N}(\mathbf{p}_t;\ \sqrt{\bar{\alpha}_t}\,\mathbf{p}_0,\ (1-\bar{\alpha}_t)\mathbf{I})$, with $\bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s)$, permits direct sampling at arbitrary time steps.
- Discrete channel (Gaussian-Softmax diffusion): discrete variables are diffused by adding Gaussian noise to the logits $\mathbf{z}$ of the one-hot encoding and projecting to the probability simplex via softmax at each step: $q(\mathbf{z}_t \mid \mathbf{z}_{t-1}) = \mathcal{N}(\mathbf{z}_t;\ \sqrt{1-\beta_t}\,\mathbf{z}_{t-1},\ \beta_t \mathbf{I})$, with $\mathbf{x}_t = \operatorname{softmax}(\mathbf{z}_t)$. As $t \to T$, the distribution of $\mathbf{x}_t$ approaches the uniform distribution on the simplex.
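As an illustration, the two forward chains can be sketched in a few lines of NumPy. This is a minimal sketch: the linear beta schedule, the logit scale, and the dimensions below are illustrative assumptions, not settings from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # assumed linear schedule (illustrative)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def diffuse_continuous(p0, t):
    """Sample the closed-form marginal q(p_t | p_0) of the Gaussian channel."""
    eps = rng.standard_normal(p0.shape)
    return np.sqrt(alpha_bars[t]) * p0 + np.sqrt(1.0 - alpha_bars[t]) * eps

def diffuse_discrete(onehot, t, scale=6.0):
    """Gaussian-Softmax channel: noise the (scaled) one-hot logits, then
    project back onto the probability simplex with softmax."""
    z0 = scale * onehot                  # logit-space embedding of the class
    eps = rng.standard_normal(z0.shape)
    zt = np.sqrt(alpha_bars[t]) * z0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return softmax(zt)

p0 = np.array([0.3, -1.2, 0.8])          # continuous primitive parameters
c0 = np.eye(4)[2]                        # one-hot class label (4 classes)

pt = diffuse_continuous(p0, t=999)
ct = diffuse_discrete(c0, t=999)         # near-uniform on the simplex at t ~ T
```

At small `t` the class distribution stays peaked on the original label; at `t` near `T` it is close to uniform, matching the limiting behavior described above.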
The model parameters are optimized by minimizing a loss combining mean-squared error (MSE) for continuous variables and cross-entropy for categorical variables, corresponding to a variational lower bound (ELBO) with distinct continuous and discrete terms.
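A minimal sketch of such a hybrid objective follows; the weighting `lam` is a hypothetical knob introduced here for illustration, not a value from the paper.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def joint_loss(eps_pred, eps_true, class_logits, class_labels, lam=1.0):
    """Hybrid objective: MSE on the continuous noise prediction plus
    cross-entropy on the categorical prediction; lam (assumed) weights
    the two ELBO terms."""
    mse = np.mean((eps_pred - eps_true) ** 2)
    probs = softmax(class_logits)
    n = class_labels.shape[0]
    ce = -np.mean(np.log(probs[np.arange(n), class_labels] + 1e-12))
    return mse + lam * ce
```

In training, `eps_pred` and `class_logits` would come from the denoising network at a randomly sampled timestep; here they are free inputs so the loss can be inspected in isolation.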
2. Joint Modeling of Multimodal and Heterogeneous Data
JD3P's formalism naturally supports data where each element or modality follows distinctive statistical laws. In SketchDNN (Chereddy et al., 15 Jul 2025), the challenge of heterogeneous parameterizations is solved by embedding all primitive types in a unified vector and gating each block via the predicted class probabilities (“superposition”). This superposition enables continuous gradients even for mixed-symbolic representations and improves the robustness and flexibility of learned models. For CAD sketches as unordered sets, forward and reverse chains are constructed to factorize over primitives, and transformer-based denoisers are implemented without positional encodings to enforce permutation invariance.
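The gating idea can be sketched as follows. The class names match the primitive types named above, but the block layout and slice sizes are invented for illustration; the paper's actual parameterization may differ.

```python
import numpy as np

# Hypothetical layout: each class owns a slice of one unified parameter vector.
BLOCKS = {"Line": slice(0, 4), "Circle": slice(4, 7),
          "Arc": slice(7, 12), "Point": slice(12, 14)}
DIM = 14

def superpose(class_probs, block_params):
    """Gate each class's parameter block by its predicted class probability,
    so the mixed-symbolic representation stays differentiable: gradients flow
    into every block in proportion to the class posterior."""
    out = np.zeros(DIM)
    for k, (name, sl) in enumerate(BLOCKS.items()):
        out[sl] = class_probs[k] * block_params[name]
    return out
```

When the class posterior is one-hot, only that class's block survives; when it is soft (as during denoising), all blocks contribute, which is what keeps the representation continuous.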
In vision-language-action applications (Chen et al., 3 Nov 2025), JD3P is employed to jointly denoise actions and visual predictions by concatenating future vision and action tokens in a single discrete diffusion trajectory. The approach extends to tokenized multimodal sequences, where language, vision, and action blocks share the same transformer backbone under a hybrid attention mask structure.
3. Forward and Reverse Processes
JD3P’s forward (noising) processes are adapted to the underlying data: Gaussian for continuous variables and stochastic masking or noise-injection for discrete/categorical variables. Reverse processes are learned via deep neural networks, predicting either the denoised initial (clean) state or categorical probabilities. The analytic reverse step for Gaussian-Softmax diffusion matches the posterior of the forward process, ensuring consistent joint denoising.
SketchDNN reverse for continuous: $p_\theta(\mathbf{p}_{t-1} \mid \mathbf{p}_t) = \mathcal{N}(\mathbf{p}_{t-1};\ \mu_\theta(\mathbf{p}_t, t),\ \sigma_t^2 \mathbf{I})$, with $\mu_\theta$ and $\sigma_t^2$ derived from the forward kernel and the network's denoised prediction $\hat{\mathbf{p}}_0$.
SketchDNN reverse for discrete: the analogous Gaussian step in logit space, $p_\theta(\mathbf{z}_{t-1} \mid \mathbf{z}_t) = \mathcal{N}(\mathbf{z}_{t-1};\ \mu_\theta(\mathbf{z}_t, t),\ \sigma_t^2 \mathbf{I})$, followed by softmax projection.
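Both reverse updates can be sketched with the standard DDPM posterior coefficients. This is a sketch under that standard parameterization: `x0_pred` stands in for the learned network's denoised prediction, and the schedule is the same illustrative linear one used above.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 1000
betas = np.linspace(1e-4, 0.02, T)       # assumed schedule (illustrative)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def posterior_step(xt, x0_pred, t):
    """One reverse step using the standard DDPM posterior q(x_{t-1} | x_t, x0);
    mean and variance follow from the forward kernel and x0_pred."""
    ab_t = alpha_bars[t]
    ab_prev = alpha_bars[t - 1] if t > 0 else 1.0
    coef0 = np.sqrt(ab_prev) * betas[t] / (1.0 - ab_t)
    coeft = np.sqrt(alphas[t]) * (1.0 - ab_prev) / (1.0 - ab_t)
    mean = coef0 * x0_pred + coeft * xt
    var = betas[t] * (1.0 - ab_prev) / (1.0 - ab_t)
    return mean + np.sqrt(var) * rng.standard_normal(xt.shape)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Continuous channel: step directly in parameter space.
# Discrete channel: take the same Gaussian step in logit space, then project.
zt = rng.standard_normal(4)
z_prev = posterior_step(zt, x0_pred=np.array([6.0, 0.0, 0.0, 0.0]), t=500)
c_prev = softmax(z_prev)                 # back on the probability simplex
```

The same `posterior_step` serves both channels; only the softmax projection distinguishes the discrete update.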
In discrete-only settings (Chen et al., 3 Nov 2025), the forward process replaces tokens with a mask symbol independently and reverses via conditional categorical logits. The loss is computed over masked positions via cross-entropy.
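A minimal sketch of this absorbing (mask) forward process and the masked cross-entropy loss; the mask-token id, vocabulary size, and the use of a single masking probability `t` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
MASK = 0                                 # hypothetical mask-token id

def mask_forward(tokens, t):
    """Absorbing forward process: each token is independently replaced
    by MASK with probability t (t in [0, 1])."""
    drop = rng.random(tokens.shape) < t
    out = tokens.copy()
    out[drop] = MASK
    return out, drop

def masked_ce(logits, targets, masked):
    """Cross-entropy computed only over the masked positions."""
    z = logits - logits.max(axis=-1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    nll = -logp[np.arange(len(targets)), targets]
    return nll[masked].mean() if masked.any() else 0.0
```

Unmasked positions carry no loss signal: they are already observed, so the model is trained only to reconstruct what the forward process destroyed.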
4. Permutation Invariance and Structured Prediction
JD3P advances permutation-invariant modeling by constructing Markov chains and neural denoisers that do not depend on primitive order. In CAD sketch modeling (Chereddy et al., 15 Jul 2025), both forward and reverse chains are factorized across primitives, and transformers are stripped of positional encodings to maintain the necessary permutation equivariance. Empirical ablation studies support both design decisions: adding positional encodings degrades performance, while removing the class-superposition mechanism provided by Gaussian-Softmax diffusion causes a substantial loss in both fidelity and likelihood.
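The permutation property is easy to verify numerically: self-attention without positional encodings is permutation-equivariant, i.e., permuting the input set permutes the output identically. A toy single-head check in NumPy (weights and sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
d = 8
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attend(X):
    """Single-head self-attention with NO positional encodings: output is
    permutation-equivariant in the set of input primitives."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(d))    # attention over the unordered set
    return A @ V

X = rng.standard_normal((5, d))          # 5 unordered primitives
perm = rng.permutation(5)
# Permuting the inputs permutes the outputs identically:
assert np.allclose(attend(X[perm]), attend(X)[perm])
```

Adding positional encodings would break exactly this identity, which is why the denoisers described above omit them.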
5. Practical Implementation and Algorithmic Schemes
JD3P models are trained via stochastic gradient descent using mini-batches and randomly sampled timesteps. Both SketchDNN and UD-VLA provide explicit pseudocode outlining forward diffusion, denoising network calls, computation of MSE and cross-entropy losses, and parameter update steps.
Sampling proceeds by initializing variables at the maximally noisy state and performing denoising steps iteratively. For multimodal discrete domains (Chen et al., 3 Nov 2025), inference-time techniques include prefix key–value caching, adaptive mask schedules, tempered sampling, and sub-vocabulary masking to enforce modality coherence during generation.
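A toy confidence-based unmasking loop illustrates the shape of such samplers. The denoiser is replaced by random logits, and the fixed-budget schedule is a simple heuristic standing in for the adaptive schedules mentioned above, not the paper's exact method.

```python
import numpy as np

rng = np.random.default_rng(4)
MASK, V, L = 0, 16, 10                   # hypothetical vocab size / seq length

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dummy_logits(tokens):
    """Stand-in for the denoiser network (random logits for illustration)."""
    return rng.standard_normal((len(tokens), V))

def sample(steps=5):
    """Start fully masked; at each step, commit the highest-confidence
    predictions at still-masked positions."""
    toks = np.full(L, MASK)
    per_step = max(1, L // steps)
    while (toks == MASK).any():
        logits = dummy_logits(toks)
        logits[:, MASK] = -np.inf        # never emit the mask symbol itself
        probs = softmax(logits)
        conf, pred = probs.max(-1), probs.argmax(-1)
        conf[toks != MASK] = -np.inf     # only fill still-masked slots
        for i in np.argsort(conf)[::-1][:per_step]:
            if conf[i] == -np.inf:
                break
            toks[i] = pred[i]
    return toks
```

Prefix key–value caching and sub-vocabulary masking slot naturally into this loop: the former reuses attention state across the fixed prefix, the latter restricts `pred` to the tokens valid for the current modality block.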
6. Empirical Results and Performance Benchmarks
On the SketchGraphs CAD dataset, JD3P delivers state-of-the-art results in both likelihood and sample quality: negative log-likelihood is reduced from 84.80 to 81.33 bits/sketch, and Fréchet Inception Distance from 16.04 to 7.80 relative to autoregressive baselines (Chereddy et al., 15 Jul 2025). Ablations establish the critical advantage of Gaussian-Softmax superposition and permutation-invariant denoising.
For unified vision-language-action systems (Chen et al., 3 Nov 2025), JD3P yields faster inference (4× faster than autoregressive baselines) and higher scores on embodied robotics benchmarks, with improvements in both average trajectory lengths and task success rates over competing architectures.
7. Extensions and Related Frameworks
The general framework of JD3P is extensible to a variety of domains, including fully discrete spaces (Campbell et al., 2022), where continuous-time Markov chain (CTMC) analogues provide efficient training, sharp theoretical guarantees (explicit ELBOs and TV distance bounds), and powerful tau-leaping samplers.
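A crude single-particle sketch of a tau-leaping step on a toy 3-state CTMC; the rate matrix is invented, and real samplers in this setting operate on factorized, high-dimensional state spaces rather than one scalar state.

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy CTMC rate matrix (rows sum to zero; off-diagonals are jump rates).
R = np.array([[-1.0, 0.6, 0.4],
              [0.5, -1.2, 0.7],
              [0.3, 0.9, -1.2]])

def tau_leap(state, tau):
    """Tau-leaping: draw Poisson counts for each jump channel over a window
    of length tau, then apply one net transition (crude first-order scheme;
    the most-fired channel wins the leap)."""
    jumps = rng.poisson(np.maximum(R[state], 0.0) * tau)
    if jumps.sum() == 0:
        return state                     # no jump fired in this window
    return int(np.argmax(jumps))
```

With small `tau` this approaches exact CTMC simulation; larger `tau` trades accuracy for fewer network evaluations, which is the appeal of tau-leaping samplers in the cited framework.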
Latent Discrete Diffusion Models (LDDMs) (Shariatian et al., 20 Oct 2025) further extend the concept by coupling discrete diffusion of tokens with continuous latent channels, capturing inter-token dependencies that are otherwise lost with factorized masked denoisers. This highlights JD3P’s flexibility in supporting both joint and sequential coupling of modalities, with empirical improvements in unconditional language modeling and structured prediction.
In summary, JD3P frameworks unify discrete and continuous diffusion trajectories, enabling joint, permutation-invariant, and cross-modal generative modeling with provable training objectives and scalable sampling algorithms. Their effectiveness is established across domains such as CAD synthesis, embodied vision-language-action, and generic discrete data generation.