Spherical Flows for Sampling Categorical Data
Abstract: We study the problem of learning generative models for discrete sequences in a continuous embedding space. Whereas prior approaches typically operate in Euclidean space or on the probability simplex, we instead work on the sphere $\mathbb S^{d-1}$. There the von Mises-Fisher (vMF) distribution induces a natural noise process and admits a closed-form conditional score. The conditional velocity is in general intractable. Exploiting the radial symmetry of the vMF density we reduce the continuity equation on $\mathbb S^{d-1}$ to a scalar ODE in the cosine similarity, whose unique bounded solution determines the velocity. The marginal velocity and marginal score on $(\mathbb S^{d-1})^L$ both decompose into posterior-weighted tangent sums that differ only by per-token scalar weights. This gives access to both ODE and predictor-corrector (PC) sampling. The posterior is the only learned object, trained by a cross-entropy loss. Experiments compare the vMF path against geodesic and Euclidean alternatives. The combination of vMF and PC sampling significantly improves results on Sudoku and language modeling.
Explain it Like I'm 14
What this paper is about (big picture)
The paper introduces a new way to make computers generate sequences of symbols (like words in a sentence or digits in a Sudoku) all at once, instead of one-by-one. The key idea is to place each token’s “embedding” (its learned vector) on the surface of a high‑dimensional sphere, and then learn how to move points on that sphere from pure noise to a clean, meaningful sequence. This spherical approach uses a special “spotlight-shaped” noise called the von Mises–Fisher (vMF) distribution that fits naturally on a sphere.
What questions the authors wanted to answer
The authors set out to answer:
- Can we design a noise-and-denoise process that lives directly on a sphere, where token embeddings naturally sit?
- Can we make this process mathematically simple enough to compute fast and accurately?
- Will this spherical method produce better sequences (like correct Sudoku solutions or fluent text) than earlier methods that live in flat (Euclidean) space or use different spherical paths?
- Can we train just one thing (the per-position token probabilities) and reuse it to drive multiple samplers?
How they did it (ideas and methods in simple terms)
Think of each token (like a word or a Sudoku digit) as a point on the surface of a giant sphere (imagine a globe, but in many dimensions). To generate a full sequence, the method:
- Starts from a fully noisy state—points spread uniformly around the sphere.
- Gradually “unblurs” the points, nudging them toward the correct token embeddings.
Two ingredients power this:
- A noise path on the sphere:
- Geodesic path (slerp): the spherical version of a straight line between two points on the globe.
- vMF path: a “spotlight” distribution that focuses mass around a direction on the sphere, controlled by a concentration value (small = very blurry; large = tightly focused).
- A learned “guide” per position:
- The model learns, for each position in the sequence, how likely each token is (these are the per-position posteriors).
- Training uses a standard cross-entropy loss, teaching the model to predict these token probabilities from noisy inputs.
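The geodesic (slerp) path from a uniform noise point to a token embedding can be sketched as follows. This is a minimal illustration, not the paper's code: the helper names, the dimension 8, and the toy token embedding are all assumptions.

```python
import math
import random

def normalize(v):
    # Scale v to unit length so it lies on the sphere.
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]

def sample_uniform_sphere(d, rng):
    # Normalizing an isotropic Gaussian vector gives a uniform point on S^{d-1}.
    return normalize([rng.gauss(0.0, 1.0) for _ in range(d)])

def slerp(x0, x1, t):
    # Geodesic interpolation between unit vectors x0 and x1 for t in [0, 1].
    # Not defined for exactly antipodal points (angle pi), which random noise
    # hits with probability zero.
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(x0, x1))))
    theta = math.acos(dot)
    if theta < 1e-8:
        return list(x1)  # points coincide; nothing to interpolate
    s = math.sin(theta)
    a = math.sin((1.0 - t) * theta) / s
    b = math.sin(t * theta) / s
    return [a * p + b * q for p, q in zip(x0, x1)]

rng = random.Random(0)
noise = sample_uniform_sphere(8, rng)   # fully noisy state
token = normalize([1.0] + [0.0] * 7)    # hypothetical token embedding
midpoint = slerp(noise, token, 0.5)     # halfway along the great circle
```

Every interpolated point stays on the unit sphere, which is the property that distinguishes slerp from straight-line interpolation in Euclidean space.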
Why the vMF path is special:
- It fits the sphere naturally: its shape depends only on “how aligned” two vectors are (their cosine similarity).
- The authors show a neat math shortcut: a complicated “mass conservation” rule (the continuity equation that governs how probability moves) collapses into a simple one-dimensional ordinary differential equation (ODE) in terms of the cosine similarity. This makes the “push direction” (velocity) computable and efficient.
- In high dimensions, the vMF path reveals the target gradually in a well‑controlled way, which helps training and sampling.
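To see how "how aligned" depends on the concentration, one can compute the expected cosine similarity of a vMF sample with its mean direction, which is the Bessel ratio $I_{d/2}(\kappa)/I_{d/2-1}(\kappa)$. The sketch below uses the textbook power series for the modified Bessel function, summed in log space; the function names are assumptions, and the fixed number of series terms is only meant for moderate concentrations (roughly up to 100).

```python
import math

def log_bessel_i(nu, kappa, n_terms=400):
    # log I_nu(kappa) from the power series
    #   I_nu(x) = sum_m (x/2)^(2m+nu) / (m! * Gamma(m+nu+1)),
    # accumulated with a log-sum-exp for numerical stability. Requires kappa > 0.
    logs = [
        (2 * m + nu) * math.log(kappa / 2.0)
        - math.lgamma(m + 1.0)
        - math.lgamma(m + nu + 1.0)
        for m in range(n_terms)
    ]
    top = max(logs)
    return top + math.log(sum(math.exp(v - top) for v in logs))

def mean_cosine(d, kappa):
    # Expected cosine similarity E[mu . x] for x ~ vMF(mu, kappa) on S^{d-1}.
    if kappa == 0.0:
        return 0.0  # uniform distribution: no preferred direction
    nu = d / 2.0 - 1.0
    return math.exp(log_bessel_i(nu + 1.0, kappa) - log_bessel_i(nu, kappa))

# Same concentration, different dimensions: alignment drops as d grows.
for d in (3, 64, 768):
    print(d, mean_cosine(d, 50.0))
```

For a fixed concentration the expected alignment shrinks as the dimension grows, so a high-dimensional schedule has to push the concentration much higher before samples sit close to their target.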
Sampling strategies:
- ODE sampling (predictor): follow the learned velocity field—think of arrows on the sphere that push your points toward better tokens.
- Predictor–Corrector (PC): do a predictor step, then a few corrector steps that add a tiny bit of noise and use the “score” (a measure of where probability is increasing) to refine the state. This often improves quality.
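A single corrector step on the sphere might look like the sketch below. It is an assumption-laden illustration rather than the paper's sampler: it uses a first-order Langevin update with tangent-space noise followed by renormalization, and the toy score is that of a single vMF target with a made-up concentration of 20.

```python
import math
import random

def project_tangent(x, v):
    # Remove the component of v along x, leaving a vector tangent to the sphere at x.
    dot = sum(a * b for a, b in zip(x, v))
    return [b - dot * a for a, b in zip(x, v)]

def normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]

def langevin_corrector_step(x, score, step, rng):
    # x: unit vector; score: tangent vector at x; step: small positive step size.
    # Drift along the score, add tangent Gaussian noise, then map back to the
    # sphere by renormalizing (a simple retraction, assumed here).
    noise = project_tangent(x, [rng.gauss(0.0, 1.0) for _ in x])
    scale = math.sqrt(2.0 * step)
    moved = [xi + step * si + scale * zi for xi, si, zi in zip(x, score, noise)]
    return normalize(moved)

rng = random.Random(0)
mu = [1.0, 0.0, 0.0]   # hypothetical target direction
kappa = 20.0           # hypothetical concentration
x = [0.0, 0.0, 1.0]
for _ in range(50):
    # Score of a single vMF target: kappa times the tangent projection of mu.
    score = [kappa * c for c in project_tangent(x, mu)]
    x = langevin_corrector_step(x, score, 0.01, rng)
```

In a full predictor-corrector loop, a few such steps would follow each ODE step, with the score computed from the learned posteriors instead of a single fixed target.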
A key design: the only thing we learn is the per-position token probabilities. Once we have those, we can compute both the “velocity” for ODE steps and the “score” for corrector steps. This makes the system clean and flexible.
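The "posterior-weighted tangent sum" for one position can be sketched as below. The function name, the toy embeddings, and the logits are hypothetical. For the score, the standard vMF density gives every token the same weight $\kappa$; the velocity weights come from the paper's scalar ODE, which is not reproduced here, so they are left as an input.

```python
import math

def project_tangent(x, v):
    # Remove the component of v along x, leaving a tangent vector at x.
    dot = sum(a * b for a, b in zip(x, v))
    return [b - dot * a for a, b in zip(x, v)]

def softmax(logits):
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_tangent_sum(x, embeddings, posterior, weights):
    # sum_k posterior[k] * weights[k] * P_x(embeddings[k]) for a single position,
    # where P_x projects onto the tangent space at x.
    total = [0.0] * len(x)
    for w_k, p_k, c_k in zip(embeddings, posterior, weights):
        t_k = project_tangent(x, w_k)
        for i, t in enumerate(t_k):
            total[i] += p_k * c_k * t
    return total

# Toy setup: three token embeddings on S^2 and hypothetical network logits.
embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
x = [0.6, 0.0, 0.8]                      # current noisy state (unit norm)
posterior = softmax([2.0, 0.5, -1.0])    # stand-in for the learned posterior
kappa = 10.0                             # hypothetical current concentration
score = weighted_tangent_sum(x, embeddings, posterior, [kappa] * 3)
```

Swapping the constant weights for the velocity weights reuses the same posterior and the same projections, which is why one trained network can drive both the ODE and the corrector steps.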
What they found and why it matters
Main findings:
- The vMF spherical path combined with predictor–corrector sampling clearly outperforms other paths (including geodesic and Euclidean ones) on two tasks:
- Sudoku-Extreme: Higher fraction of valid solutions than baselines when using PC. The vMF+PC combo got the best results among tested methods.
- Language Modeling (LM1B): Much better generation perplexity (a standard measure of text quality; lower is better) with vMF+PC. Predictor–corrector brought large gains over using ODE alone.
- Geodesic paths (slerp) improved less, and Euclidean “variance-exploding” noise gained little from PC. “Variance-preserving” Euclidean noise benefited somewhat from PC, but not as much as vMF.
- vMF’s “spotlight” noise pairs especially well with PC because both the noisy inputs and token targets live on the sphere with the same norm. That means the network can focus on direction (which token it points to) rather than magnitude, helping it sharpen predictions.
Why this matters:
- Generating sequences all at once can be faster than the usual “one token at a time” approach.
- Better generation quality (more valid Sudokus, lower perplexity text) shows that spherical modeling is a strong alternative to flat-space methods for discrete data.
- The method is simple to train: learn per-position token probabilities with cross-entropy, then plug those into multiple samplers.
What this could lead to (impact and future use)
- Faster or more reliable sequence generation: Non‑autoregressive models like this can generate entire sequences in a handful of steps, which is promising for speed.
- Better handling of discrete data: Many tasks map naturally to tokens (language, code, puzzles). Modeling them on a sphere with vMF noise could improve quality across applications.
- Cleaner design: Because the only learned object is the token posterior, the same trained model can drive different samplers (ODE or predictor–corrector), making deployment flexible.
- Foundation for other manifolds: The math trick (reducing the sphere’s continuity equation to a simple 1D ODE using symmetry) suggests similar ideas could work on other curved spaces, opening doors to new generative models.
In short, the paper shows that “spherical flows” with vMF noise and predictor–corrector sampling are a powerful, practical way to generate discrete sequences, improving results on both puzzles and language.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a concise, actionable list of what the paper leaves unresolved, grouped by theme.
Theory and formal properties
- Absence of convergence guarantees from the cross-entropy–trained posterior to the true posterior under model misspecification and finite-capacity networks; no conditions are given under which the learned marginal flow matches its target in the limit.
- No end-to-end likelihood or consistency analysis connecting the cross-entropy objective to maximum-likelihood flow learning on $(\mathbb S^{d-1})^L$ (e.g., whether minimizing the training loss yields a flow that asymptotically matches the data distribution).
- Existence/uniqueness and regularity of the marginal flow on $(\mathbb S^{d-1})^L$ with learned posteriors are not analyzed (e.g., measurability and smoothness of the marginal velocity and score when the posterior is represented by a neural network).
- The reduction of the continuity equation to a 1D ODE (via radial symmetry) is specific to vMF; it is open whether similar reductions (and closed-form velocities) exist for other spherical families (e.g., Kent or Bingham) or mixtures thereof.
- No theoretical error analysis for numerical integration: discretization in $\kappa$-space (Euler steps) lacks global error bounds or stability results as a function of NFE, step size, and dimension $d$.
- Lack of analysis of the behavior of the conditional velocity near the poles beyond boundedness; stability and error accumulation near these singular points are not quantified for high $d$ and large $\kappa$.
Modeling and design choices
- The geodesic (slerp) path lacks a tractable Riemannian score; PC and SDE sampling cannot be used with it. Whether an analytical or numerically stable approximation to the score exists for geodesic paths remains open.
- The method learns only per-position marginals; joint posteriors across positions (which could capture long-range constraints and correlations more explicitly) are not modeled or approximated.
- The choice and impact of the embedding dimension $d$ on learning dynamics, capacity, and performance are not studied; no guidance is given on how to set $d$ for different vocabulary sizes and tasks.
- Jointly learned spherical embeddings w_k may suffer from degeneracies (collapse, poor separation, anisotropy); the paper does not propose or study geometric regularizers (e.g., packing/separation penalties) or identifiability constraints.
- The uniform-norm constraint on embeddings emphasizes direction only; how this affects calibration, confidence, and ambiguity (versus Euclidean embeddings where magnitude carries confidence) is not analyzed.
- Alternative spherical conditional paths (e.g., anisotropic Kent/Bingham, mixtures of vMF) that could encode directional preferences or multi-modality are not explored.
- Extending beyond the sphere to other manifolds (e.g., hyperbolic or product manifolds better aligned with lexical hierarchies) is not addressed.
Schedules, scores, and sampling
- Schedule selection is heuristic: how to choose the concentration schedule $\kappa_t$, or whether to learn schedules jointly with the model, is not addressed; no dimension-adaptive or data-adaptive schedule design is provided.
- Although the paper discretizes in concentration space, there is no study of schedule–discretization interactions (e.g., which grids minimize error for a given NFE).
- SDE sampling is only outlined; empirical comparisons of SDE vs ODE vs PC (and their computational budgets and stability) are missing.
- Predictor–corrector hyperparameters are tuned via grid searches without adaptive or theoretically grounded step-size selection (e.g., MALA on spheres, preconditioning, or acceptance-rate control).
- For VE, PC brings little improvement empirically; the geometric or analytical reasons (e.g., mismatch between VE noise geometry and spherical posterior geometry) are not investigated.
Computational and scaling aspects
- Inference cost scales with vocabulary size: each step needs posterior-weighted sums over all tokens (linear in the vocabulary size per step). The paper does not address scalability to very large vocabularies or longer sequences. Techniques such as top-$k$ pruning, ANN retrieval, factorized/clustered softmax, or caching are not explored.
- Numerical evaluation of the vMF quantities for high dimension and large concentration may require high-precision Bessel and normalization computations; robustness, speed, and approximation error (e.g., asymptotic expansions) are not benchmarked.
- Training/inference wall-clock times and memory footprints for vMF vs Euclidean baselines are not reported; it is unclear how the spherical approach scales in practice.
Experiments and evaluation
- Benchmark coverage is limited (Sudoku-Extreme and LM1B). Generality to other discrete domains (e.g., code, protein sequences, larger-context language tasks) is not evaluated.
- Sensitivity analyses are missing for (i) embedding dimension $d$, (ii) vocabulary size, (iii) , and (iv) number of PC steps; ablations disentangling these from architectural variables would aid reproducibility and design.
- The trade-off between generation entropy and perplexity (and its effect on sample diversity and quality) is handled via entropy floors but lacks a principled selection criterion or calibration analysis.
- For constrained generation (e.g., Sudoku), constraints are enforced only via pinning clue positions; integrating general constraints into the dynamics (e.g., projected Langevin on constraint sets or energy terms) is not explored.
- LM1B evaluation relies on GPT-2-large scoring and entropy; additional metrics (e.g., human judgments, diversity, repetition rates, calibration, and downstream task utility) would strengthen conclusions.
- Self-conditioning and other modern enhancements (e.g., classifier-free guidance analogues for flows on spheres) are not systematically studied across paths.
- The method does not compare against stronger recent discrete flow/diffusion baselines on LM1B under matched settings (e.g., with identical NFE, entropy, and conditioning schemes), limiting causal attribution.
Extensions and broader applicability
- Applicability to conditional generation with side information (beyond pinning) is not developed (e.g., how to condition on prompts, tags, or structured constraints within the flow on $(\mathbb S^{d-1})^L$).
- Handling OOV tokens, dynamic vocabularies, or subword merging/splitting within a spherical embedding framework remains unaddressed.
- Combining the spherical approach with autoregressive decoders (hybrid generation) to leverage both global consistency (flows) and local coherence (AR) is not investigated.
These gaps point to concrete next steps: derive scores for geodesic/no-score paths or propose approximations; learn schedules; scale inference via top-$k$/ANN; explore anisotropic spherical noise; add geometric regularization for embeddings; compare SDE/ODE/PC with theoretically grounded adaptivity; and broaden benchmarks and metrics.
Practical Applications
Immediate Applications
Below are concrete, deployable use cases that leverage the paper’s findings on spherical flows with von Mises–Fisher (vMF) noise, posterior-only learning via cross-entropy, and predictor–corrector (PC) sampling on the sphere.
- Software and AI tooling (Text and code generation)
- Use case: Non-autoregressive (parallel) draft generation for sentences, paragraphs, or code, followed by optional refinement.
- What’s enabled: Faster parallel sampling than left-to-right decoding; a “creativity knob” via the PC steps that trades off perplexity vs. entropy for controllable diversity.
- Potential products/workflows:
- An editor plugin that generates K diverse full-sentence candidates in one shot, with a user-facing control for “more accurate vs. more diverse” (mapped to corrector step count/step size).
- A coding assistant that proposes multiple function bodies at once, then refines selected candidates with quick Langevin “corrector” passes.
- Dependencies/assumptions: Requires training data and a DiT-style transformer backbone; NFE budget at inference (e.g., ~100 network calls) must fit latency targets; quality may still trail SOTA autoregressive LMs for complex tasks.
- Constrained generation and combinatorial reasoning
- Use case: Solving or generating constrained puzzles (e.g., Sudoku), timetables, or simple schedule assignments where some positions are pinned.
- What’s enabled: The method supports pinning constraints at every sampler step; PC improves solution validity substantially versus pure ODE.
- Sectors: Operations research, logistics, education (puzzle generation), software testing (test-suite synthesis under constraints).
- Potential products/workflows:
- A Sudoku solver/generator API using vMF+PC for high validity and controllable diversity in puzzles.
- A lightweight “constrained sequence generator” for timetable or roster prototypes where discrete slots must satisfy local constraints.
- Dependencies/assumptions: Must express constraints as position-level pinning or masking; scaling to large/global constraint sets may need task-specific conditioning schemes.
- Synthetic data generation for discrete sequences
- Use case: Generating synthetic categorical sequences (tokens, labels) for augmentation, benchmarking, or stress testing.
- Sectors:
- Healthcare: synthetic EHR event-type sequences for non-sensitive benchmarking.
- Finance: synthetic transaction-code sequences for pipeline testing.
- Education: tokenized question banks or practice items.
- Potential products/workflows:
- A “SphericalFlow” data generator library that emits categorical sequences with per-token entropy control.
- A CI pipeline tool that generates stress-test inputs (categorical fields) to test downstream systems.
- Dependencies/assumptions: Synthetic outputs should not be used for privacy-sensitive use cases without dedicated privacy mechanisms; model quality depends on training data coverage.
- Research prototyping for discrete generative modeling
- Use case: Rapid evaluation of flow-based text models without engineering bespoke discrete processes.
- What’s enabled: A single cross-entropy target that learns per-position posteriors; closed-form score and tractable velocity on the sphere support ODE, SDE, and PC from one model.
- Potential tools:
- A PyTorch/JAX module implementing vMF paths on the sphere with precomputed ψ̃_t lookup tables and tangent projections.
- Benchmarks comparing Euclidean (VP/VE) vs. spherical (vMF vs. geodesic) paths for discrete sequence tasks.
- Dependencies/assumptions: Accurate and stable numerical evaluation of Bessel ratios and ψ̃_t; appropriate schedule κ_t and embedding dimension selection.
- Education and consumer applications
- Use case: Puzzle apps (Sudoku), language practice (full-sentence suggestions), and form autocompletion for categorical fields under simple constraints.
- What’s enabled: One-shot generation of complete outputs with on-device or server inference; user-tunable diversity with PC.
- Potential products:
- A Sudoku app that generates new puzzles or fills incomplete boards with high validity.
- A writing assistant that proposes entire sentence alternatives with tone/creativity control (entropy proxy).
- Dependencies/assumptions: Model size vs. device constraints; inference-time compute for PC steps; proper safety guardrails for consumer text.
Long-Term Applications
These use cases require further research, scaling, integration, or development beyond the current results but are grounded in the paper’s methods (vMF paths, radial ODE reduction, posterior-only learning, and spherical PC sampling).
- High-quality non-autoregressive language generation at scale
- Use case: Matching or outperforming autoregressive LLMs in quality with faster, parallel decoding.
- What’s needed: Scaling to larger backbones, better conditioning (e.g., self-conditioning, distillation from AR LMs), advanced schedules κ_t, and more efficient PC/SDE schemes.
- Sectors: Software, cloud AI services, enterprise productivity.
- Dependencies/assumptions: Compute budgets for training; improved training objectives or alignment techniques; quality-parity targets with AR baselines.
- Hybrid inference pipelines (draft-then-refine)
- Use case: Use spherical flows to generate full-sequence drafts, then refine with AR decoders or task-specific rerankers.
- What’s needed: Distillation or reranking interfaces; latency-aware orchestration between parallel draft generation and AR refinement.
- Sectors: Software, media, customer support automation.
- Dependencies/assumptions: Integration with AR LMs; routing strategies to minimize latency; quality consistency across multiple candidates.
- Discrete biological sequence modeling
- Use case: Protein, peptide, or DNA sequence generation under structural/functional constraints using spherical flows for fast sampling and entropy control.
- Sectors: Healthcare, biotechnology, pharmaceuticals.
- What’s needed: Domain-specific vocabularies, conditioning on structure/assays, constraint encoders, and evaluation pipelines.
- Dependencies/assumptions: High-quality domain datasets; careful safety review for biosecurity; joint modeling of long-range dependencies and constraints.
- Robotics and planning (discrete action sequences)
- Use case: Generating action sequences or plans in discrete domains (e.g., symbolic planners, task-and-motion stacks) in one shot, with subsequent corrective steps.
- Sectors: Robotics, manufacturing.
- What’s needed: Task-conditional models, integration with MDP/RL frameworks, constraint-aware sampling (safety, feasibility).
- Dependencies/assumptions: Coupling to environment simulators; safety constraints embedded as pinning or guidance; validation loops for feasibility.
- Structured prediction in enterprise workflows
- Use case: Large, real-world scheduling, routing, or allocation with combinatorial constraints and domain rules.
- Sectors: Logistics, energy grid dispatch (discrete modes), public-sector operations.
- What’s needed: Constraint-programming interfaces to enforce global constraints; scalable conditioning mechanisms across long horizons; performance guarantees or repair heuristics.
- Dependencies/assumptions: Hybrid solvers that combine spherical flows with ILP/CP post-processing; fairness and regulatory compliance for allocations.
- Discrete multimodal generation
- Use case: Generating discrete components in multimodal systems (e.g., tokenized captions, segmentation maps, symbolic music) via spherical flows that harmonize with continuous components.
- Sectors: Media, creative tools, autonomous systems.
- What’s needed: Joint manifold design for mixed discrete–continuous outputs; cross-modal conditioning; efficient PC with multimodal noise processes.
- Dependencies/assumptions: Robust training across modalities; tooling for Brownian motion on product manifolds; effective entropy control across channels.
- Privacy-aware synthetic categorical data
- Use case: Generating useful synthetic records in regulated sectors.
- Sectors: Healthcare, finance, public policy.
- What’s needed: Differential privacy or other privacy-preserving training on top of the spherical flow; utility–privacy evaluations.
- Dependencies/assumptions: Strong privacy guarantees integrated into learning (not provided in the paper); governance frameworks for responsible use.
- Compression and accelerated decoding
- Use case: Leveraging non-AR parallel sampling for low-latency decoding or neural compression of text-like data.
- Sectors: Communications, on-device AI.
- What’s needed: Model compression, approximate PC schemes, early-exit mechanisms, and hardware-friendly implementations of spherical operations.
- Dependencies/assumptions: Efficient libraries for spherical geometry (projection, Brownian motion) on accelerators; acceptable rate–distortion trade-offs.
- Generalized manifold-based discrete modeling
- Use case: Extending the radial-symmetry reduction and vMF methodology to other manifolds (e.g., Stiefel, hyperbolic space) for structured symbol systems.
- Sectors: Academia, advanced ML research.
- What’s needed: New noise processes with tractable scores/velocities, analogous radial reductions, and stable numerical schemes.
- Dependencies/assumptions: Task-specific manifold choices; numerical stability for special functions; theoretical guarantees for continuity equations on new manifolds.
Notes on feasibility and implementation dependencies
- Numerical stability and tooling:
- Reliable evaluation of Bessel ratios A_d(κ) and schedule-independent ψ̃_t(s) (precomputable on a κ grid).
- Stable and efficient tangent projections and Brownian motion on spheres for PC steps.
- Training objective and data:
- Cross-entropy training to approximate per-position posteriors; sufficient data diversity is required.
- Hyperparameters and schedules:
- Concentration schedule κ_t and step discretization in “concentration space” strongly affect quality; PC step size needs tuning per task.
- Compute budget:
- Inference requires multiple network evaluations (NFE), often mitigated by parallelism but still subject to latency constraints.
- Quality vs. control:
- PC unlocks a practical control over entropy/diversity, but downstream tasks must calibrate this trade-off (e.g., validity for constraints, coherence for text).
Glossary
- adaLN conditioning: Adaptive LayerNorm conditioning mechanism used to inject conditioning signals into a transformer backbone. "All continuous methods share a DiT-style transformer \cite{peebles2023scalable} backbone with adaLN conditioning."
- Bessel ratio: The ratio of modified Bessel functions that gives the mean resultant length (expected cosine similarity) of a vMF distribution. "The mean resultant length of the vMF distribution is the Bessel ratio"
- Brownian motion (on the sphere): A stochastic process representing random motion constrained to the sphere, used in Langevin corrector steps. "where is Brownian motion on at position , independent across positions."
- Concentration parameter: In vMF, the parameter controlling how tightly the distribution concentrates around its mean direction. "von Mises--Fisher (vMF) family of distributions on indexed by a concentration parameter ."
- Continuity equation: A PDE expressing conservation of probability mass under a velocity field. "The pair satisfies the continuity equation"
- Cross-entropy loss: A training objective equal to the negative expected log-likelihood of the target distribution, minimized when the model matches the target. "We train with the cross-entropy loss"
- DiT-style transformer: A diffusion Transformer backbone architecture adapted for generative modeling. "All continuous methods share a DiT-style transformer \cite{peebles2023scalable} backbone with adaLN conditioning."
- Divergence (on the sphere): The differential operator measuring the net outflow of a vector field on the sphere. "the divergence is defined by"
- Extrinsic representations: Representing manifold points in the ambient Euclidean space (e.g., unit vectors in $\mathbb R^d$ for the sphere). "We work with extrinsic representations on "
- Flow of measures: A time-indexed family of probability distributions connected by a vector field or ODE/SDE dynamics. "We sample from $p_{\mathrm{data}}$ by constructing a flow of measures"
- Flux equation: A one-dimensional conservation law obtained from radially symmetric dynamics on the sphere. "the one-dimensional flux equation"
- Geodesic distance: The shortest-path distance along a manifold; on the sphere it is the arccosine of the inner product. "the geodesic distance between two points is given by"
- Geodesic interpolation: Interpolation along the manifold’s geodesic (shortest) path, the spherical analogue of linear interpolation. "The first is the geodesic interpolation, the analogue of linear interpolation in "
- Laplace-Beltrami operator: The intrinsic Laplacian on a Riemannian manifold, generalizing the Euclidean Laplacian. "and the Laplace-Beltrami operator becomes"
- Langevin dynamics: Stochastic dynamics combining gradient ascent on log-density with noise, used for corrector steps that preserve a target distribution. "The Langevin dynamics on read per position as"
- Marginal score: The gradient of the log marginal density with respect to the state; used in SDE/PC sampling. "The marginal velocity and the marginal score on both decompose"
- Marginal velocity: The velocity field obtained by averaging conditional velocities weighted by posteriors over latent tokens. "The marginal velocity and the marginal score on both decompose"
- Mean resultant length: For directional distributions, the expected cosine similarity with the mean direction; for vMF it equals the Bessel ratio. "The mean resultant length of the vMF distribution is the Bessel ratio"
- Modified Bessel function of the first kind: Special function appearing in the vMF normalization and moments. "is the normalization constant and is the modified Bessel function of the first kind and order ."
- Predictor-corrector sampling: A sampling scheme combining deterministic predictor steps with stochastic corrector steps (e.g., Langevin) to improve sample quality. "This gives access to both ODE and predictor-corrector (PC) sampling."
- Product manifold: The Cartesian product of manifolds, itself a manifold, used to model sequences as positions on $(\mathbb S^{d-1})^L$. "For paths of measures on the product manifold "
- Posterior: The conditional distribution over tokens given the current noisy state; used to weight velocities and scores. "The posterior is the only learned object"
- Radial symmetry: Dependence only on the inner product with a fixed direction (e.g., on the sphere). "A function is called radially symmetric around "
- Riemannian gradient: The gradient on a manifold, obtained by projecting the Euclidean gradient onto the tangent space. "Then the Riemannian gradient of on is"
- Riemannian manifold: A smooth manifold equipped with an inner product on each tangent space, enabling geometric calculus. "we embed the discrete data into a continuous Riemannian manifold"
- Riemannian volume measure: The natural volume measure induced by a Riemannian metric, used to define densities on manifolds. "(with respect to the Riemannian volume measure)"
- SDE sampling: Sampling via stochastic differential equations driven by the score of the log-density. "The score \eqref{eq:marginal_score_vmf} also makes SDE sampling available"
- Slerp (spherical linear interpolation): A geodesic interpolation method on the sphere that moves at constant speed. "This path is sometimes called spherical linear interpolation (slerp)."
- Tangent space: The linear space of directions at a manifold point; for the sphere, vectors orthogonal to the position vector. "The tangent space at a point is"
- Variance-exploding (VE): A noise schedule/path where the variance increases over time, e.g., adding growing Gaussian noise. "variance-exploding (VE), given as "
- Variance-preserving (VP): A noise schedule/path that keeps marginal variance constant (e.g., linear interpolation with fixed variance). "variance-preserving (VP) i.e the linear interpolation from \ref{bsp2}"
- von Mises–Fisher (vMF) distribution: A directional distribution on the sphere parameterized by a mean direction and concentration. "The von Mises-Fisher distribution with mean and concentration , denoted by , has the density"
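For reference, the standard textbook form of the vMF density on $\mathbb S^{d-1}$, its Riemannian score, and the Bessel ratio can be written as below. The notation ($\mu$ for the mean direction, $\kappa$ for the concentration, $C_d$ and $A_d$ for the normalizer and ratio) is assumed here and may differ from the paper's.

```latex
\begin{align*}
p(x \mid \mu, \kappa) &= C_d(\kappa)\, \exp\!\bigl(\kappa\, \mu^\top x\bigr),
  \qquad x \in \mathbb S^{d-1}, \\
C_d(\kappa) &= \frac{\kappa^{d/2-1}}{(2\pi)^{d/2}\, I_{d/2-1}(\kappa)}, \\
\nabla_{\mathbb S^{d-1}} \log p(x \mid \mu, \kappa)
  &= \kappa\, \bigl(\mu - (\mu^\top x)\, x\bigr), \\
A_d(\kappa) &= \frac{I_{d/2}(\kappa)}{I_{d/2-1}(\kappa)}
  = \mathbb E_{x \sim \mathrm{vMF}(\mu, \kappa)}\bigl[\mu^\top x\bigr].
\end{align*}
```

The density depends on $x$ only through $\mu^\top x$, which is the radial symmetry described in the glossary, and the score is the tangent projection of $\mu$ scaled by $\kappa$.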