Feature-encoder-free ImageNet training

Develop a Drifting Model training procedure that succeeds on ImageNet 256×256 without relying on an external feature encoder. The key is to construct a kernel or representation that effectively measures sample similarity directly in the generator's output space (either SD‑VAE latent space or pixel space), so that the drifting field can function without auxiliary features.
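The kernel the task calls for is left open; as one possible point of departure, a plain RBF kernel over flattened latents with a median-heuristic bandwidth is the kind of encoder-free similarity measure the drifting field would consume. This is a minimal sketch under assumed shapes; the function name, dimensions, and bandwidth choice are illustrative, not from the paper.

```python
import numpy as np

def rbf_kernel(x, y, bandwidth=None):
    """Pairwise RBF kernel between two batches of flattened samples.

    x: (n, d), y: (m, d) -- e.g., SD-VAE latents flattened to vectors
    (shapes here are hypothetical). If bandwidth is None, use the median
    heuristic on the pairwise squared distances.
    """
    # Squared Euclidean distances between all cross pairs: (n, m).
    d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    if bandwidth is None:
        # Median heuristic: a common default when no feature scale is known.
        bandwidth = np.sqrt(np.median(d2) + 1e-12)
    return np.exp(-d2 / (2 * bandwidth ** 2))

# Toy example: 8 "generated" and 8 "data" samples in a 4x4x4 latent space.
rng = np.random.default_rng(0)
gen = rng.normal(size=(8, 4 * 4 * 4))
data = rng.normal(size=(8, 4 * 4 * 4))
K = rbf_kernel(gen, data)  # (8, 8) similarity matrix, entries in (0, 1]
```

Whether such a raw-space kernel captures enough semantic similarity is exactly the open question: the paper's negative result suggests this baseline is insufficient on ImageNet, so the task is to find a kernel or lightweight representation that does better without a pre-trained encoder.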

Background

The paper’s best results rely on computing the drifting loss in a learned feature space (e.g., ResNet-style encoders pre-trained via MAE, MoCo, or SimCLR). These features improve the kernel’s ability to capture semantic similarity, which in turn stabilizes and strengthens training.

The authors report that removing the feature encoder caused training failures on ImageNet, suggesting that similarity measurement in raw latent or pixel space is currently inadequate. Resolving this limitation would broaden the method's applicability and simplify the training pipeline by removing the dependency on pre-trained encoders.

References

On the other hand, we report that we were unable to make our method work on ImageNet without a feature encoder. In this case, the kernel may fail to effectively describe similarity, even in the presence of a latent VAE. We leave further study of this limitation for future work.

Generative Modeling via Drifting  (2602.04770 - Deng et al., 4 Feb 2026) in ImageNet Experiments — Feature Space for Drifting