Formal characterization of error propagation from dependency estimation to TV bound

Characterize how estimation errors in the learned dependency matrix \hat{\mathbf{D}}, relative to the true pairwise-dependency matrix \mathbf{D} defined by expected total-variation influences, affect the total variation distance between the model's joint conditional distribution P_\theta(Y_S | X, Y_U) and the factorized product distribution Q_\theta(Y_S | X, Y_U) under the DEMASK greedy subset selection algorithm. Specifically, derive bounds that translate prediction error in \hat{\mathbf{D}} into degradation of the guarantee on TV(P_\theta(Y_S | X, Y_U), Q_\theta(Y_S | X, Y_U)) that holds when \mathbf{D} is known.

Background

DEMASK’s theoretical guarantee (Theorem 1) proves that the total variation distance between the model’s joint conditional P_\theta(Y_S | X, Y_U) and the factorized approximation Q_\theta(Y_S | X, Y_U) is bounded by a threshold \tau when the true dependency matrix \mathbf{D} is used and a sub-additivity assumption holds.
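The excerpt does not restate Theorem 1's bound. One natural form, treating the entries \mathbf{D}_{ij} as pairwise TV influences and invoking the sub-additivity assumption over the selected set S (the precise statement in the paper may differ), is:

```latex
\mathrm{TV}\bigl(P_\theta(Y_S \mid X, Y_U),\, Q_\theta(Y_S \mid X, Y_U)\bigr)
  \;\le\; \sum_{\substack{i,j \in S \\ i \neq j}} \mathbf{D}_{ij}
  \;\le\; \tau,
```

where the second inequality is what greedy selection enforces when the true \mathbf{D} is available.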

In practice, DEMASK replaces \mathbf{D} with a learned estimator \hat{\mathbf{D}} derived from a single forward pass. The paper notes that prediction errors in \hat{\mathbf{D}} propagate to the TV bound but does not provide a formal analysis.
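The propagation can be illustrated numerically. The sketch below uses a hypothetical greedy rule (grow S while the estimated pairwise mass stays below \tau; DEMASK's exact selection rule is not specified in this excerpt) and checks the natural perturbation bound: if every entry of \hat{\mathbf{D}} is within \varepsilon of \mathbf{D}, the true pairwise mass over the selected set S exceeds \tau by at most |S|(|S|-1)\varepsilon.

```python
import random

def greedy_select(D_hat, tau):
    """Hypothetical greedy subset selection: grow S while the estimated
    pairwise-dependency mass sum_{i != j in S} D_hat[i][j] stays <= tau.
    (A stand-in for DEMASK's rule, which is not given in this excerpt.)"""
    S, mass = [], 0.0
    for i in range(len(D_hat)):
        extra = sum(D_hat[i][j] + D_hat[j][i] for j in S)
        if mass + extra <= tau:
            S.append(i)
            mass += extra
    return S

random.seed(0)
n, eps, tau = 12, 0.01, 0.3

# Synthetic "true" dependency matrix: nonnegative off-diagonal entries.
D_true = [[random.uniform(0.0, 0.05) if i != j else 0.0 for j in range(n)]
          for i in range(n)]
# Learned estimate with entrywise noise bounded by eps (clamping at 0
# preserves the entrywise error bound because D_true >= 0).
D_hat = [[max(0.0, D_true[i][j] + random.uniform(-eps, eps)) if i != j else 0.0
          for j in range(n)] for i in range(n)]

S = greedy_select(D_hat, tau)
true_mass = sum(D_true[a][b] for a in S for b in S if a != b)
k = len(S)

# Estimated mass over S is <= tau by the greedy invariant; each of the
# k*(k-1) ordered pairs contributes at most eps of additional error.
assert true_mass <= tau + k * (k - 1) * eps + 1e-12
print(f"|S|={k}, true pairwise mass={true_mass:.4f}, "
      f"degraded bound={tau + k*(k-1)*eps:.4f}")
```

A formal version would state this as a corollary: TV \le \tau + |S|(|S|-1)\varepsilon whenever \max_{i,j} |\hat{\mathbf{D}}_{ij} - \mathbf{D}_{ij}| \le \varepsilon, under the same sub-additivity assumption that underlies Theorem 1.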

A formal characterization would provide end-to-end theoretical robustness guarantees for DEMASK by quantifying how inaccuracies in dependency prediction impact the total variation guarantee and, consequently, the reliability of parallel decoding.

References

Theorem~\ref{thm:correctness} assumes access to the true dependency matrix $\mathbf{D}$, whereas our implementation uses a learned approximation $\hat{\mathbf{D}}$. Prediction errors propagate to the TV bound, though we have not formally characterized this relationship.

Dependency-Guided Parallel Decoding in Discrete Diffusion Language Models (2604.02560 - Ringel et al., 2 Apr 2026) in Limitations, Theory-practice gap in dependency estimation