Distilling Datasets Into Less Than One Image
Abstract: Dataset distillation aims to compress a dataset into a much smaller one so that a model trained on the distilled dataset achieves high accuracy. Current methods frame this as maximizing the distilled classification accuracy for a budget of K distilled images-per-class, where K is a positive integer. In this paper, we push the boundaries of dataset distillation, compressing the dataset into less than an image-per-class. It is important to realize that the meaningful quantity is not the number of distilled images-per-class but the number of distilled pixels-per-dataset. We therefore propose Poster Dataset Distillation (PoDD), a new approach that distills the entire original dataset into a single poster. The poster approach motivates new technical solutions for creating training images and learnable labels. With less than an image-per-class, our method achieves performance comparable to or better than existing methods that use one image-per-class. Specifically, our method establishes a new state of the art on CIFAR-10, CIFAR-100, and CUB200 using as little as 0.3 images-per-class.
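To make the pixel-budget framing concrete, below is a minimal PyTorch-style sketch of the poster idea: a single learnable poster whose overlapping patches serve as the training images, each paired with a learnable soft label. Everything here is an illustrative assumption rather than the paper's implementation: the poster dimensions, stride, class layout, the toy student model, and the single cross-entropy step stand in for PoDD's actual objective and bi-level optimization.

```python
import torch
import torch.nn.functional as F

# Illustrative budget for CIFAR-100: a 96x320 poster holds 30720 pixels,
# which equals 0.3 images-per-class (0.3 * 100 classes * 32 * 32 pixels).
num_classes = 100
patch = 32    # each extracted training image is patch x patch
stride = 8    # neighboring patches overlap and share pixels -- the compression win

# One learnable poster for the whole dataset; its pixel count, not K, is the budget.
poster_h, poster_w = 96, 320                       # assumed poster dimensions
poster = torch.randn(3, poster_h, poster_w, requires_grad=True)

n_h = (poster_h - patch) // stride + 1
n_w = (poster_w - patch) // stride + 1
# One learnable soft label per extracted patch.
soft_labels = torch.randn(n_h * n_w, num_classes, requires_grad=True)

def poster_to_images(poster: torch.Tensor) -> torch.Tensor:
    """Cut overlapping patches out of the poster to form training images."""
    p = poster.unfold(1, patch, stride).unfold(2, patch, stride)  # (3, n_h, n_w, patch, patch)
    return p.permute(1, 2, 0, 3, 4).reshape(-1, 3, patch, patch)

# Sketch of gradient flow only: a real distillation objective (e.g., gradient
# or trajectory matching with an inner-loop student) would replace this
# single step, but either way the loss backpropagates through the patch
# extraction into the shared poster pixels and the soft labels.
opt = torch.optim.Adam([poster, soft_labels], lr=0.01)
student = torch.nn.Sequential(torch.nn.Flatten(),
                              torch.nn.Linear(3 * patch * patch, num_classes))

images = poster_to_images(poster)                  # (n_h * n_w, 3, patch, patch)
loss = F.cross_entropy(student(images), soft_labels.softmax(dim=-1))
loss.backward()                                    # gradients reach every overlapping pixel
opt.step()
```

Because patches overlap, each poster pixel participates in several training images, which is how the poster packs more than K images' worth of supervision into fewer than K images' worth of pixels.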