- The paper demonstrates that a CNN can learn new tasks incrementally while preserving performance on old tasks, without requiring access to the old tasks' training data.
- It employs knowledge distillation and joint optimization to balance acquiring new skills with retaining previous expertise.
- Experiments on datasets like ImageNet and PASCAL VOC show that LwF outperforms fine-tuning while closely matching joint training results.
Learning without Forgetting: A Synopsis
In the field of computer vision, it is often necessary for systems to incrementally acquire new skills without losing proficiency in previously learned ones. Traditional methods for achieving this, such as multitask learning or fine-tuning, usually necessitate access to all historical training data, which may be impractical due to storage considerations or proprietary constraints. "Learning without Forgetting" (LwF), proposed by Zhizhong Li and Derek Hoiem, addresses this challenge by enabling Convolutional Neural Networks (CNNs) to assimilate new tasks using only the training data for those tasks while preserving performance on previously learned tasks.
Key Contributions
- Need for Task Incrementality: LwF is motivated by practical applications where incremental learning is crucial—such as updating a robot's object recognition capabilities or expanding a safety system to detect additional hazards. The method is relevant for any scenario where retraining a model from scratch on aggregated data is infeasible.
- Learning with Limited Data: The authors challenge the assumption that training data for all tasks must be simultaneously available. By doing so, LwF is well-suited for dynamic environments where tasks are added sequentially, and access to past training data may be restricted due to privacy, cost, or logistical issues.
- Methodology: Before training begins, LwF records the current network's old-task responses on the new task's images. It then jointly updates the shared parameters θs, the old-task-specific parameters θo, and the new task-specific parameters θn using only the new task's data, penalizing drift from the recorded responses. This effectively prevents the model from 'forgetting' previous tasks, a failure mode known as catastrophic forgetting.
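The parameter partition described above can be sketched as follows. The layer sizes, the tiny two-head linear network, and all variable names are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: a shared trunk (theta_s), an old-task head
# (theta_o), and a freshly added new-task head (theta_n).
D_IN, D_FEAT, N_OLD, N_NEW = 8, 16, 5, 3

theta_s = rng.normal(scale=0.1, size=(D_IN, D_FEAT))   # shared parameters
theta_o = rng.normal(scale=0.1, size=(D_FEAT, N_OLD))  # old-task head
theta_n = rng.normal(scale=0.1, size=(D_FEAT, N_NEW))  # new-task head

def forward(x):
    """Shared trunk feeds both task-specific heads."""
    feat = np.maximum(x @ theta_s, 0.0)        # ReLU features
    return feat @ theta_o, feat @ theta_n      # old logits, new logits

# Before any update, record the old head's responses on the *new*
# task's images; these stand in for the unavailable old training data.
x_new = rng.normal(size=(4, D_IN))
recorded_old_logits, _ = forward(x_new)
```

During training, a distillation loss then keeps the old head's outputs on `x_new` close to `recorded_old_logits` while the new head fits the new labels.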
Experiments and Results
The authors evaluate the efficacy of LwF using various image classification benchmarks, demonstrating the method's ability to learn new tasks without significant degradation in performance on old tasks.
The experiments span multiple datasets including ImageNet, Places365, PASCAL VOC, and CUB, covering tasks of varying complexity and similarity.
The paper reports mean average precision (mAP) for PASCAL VOC and classification accuracy for the other datasets, analyzing performance on old and new tasks jointly.
- Comparison with Baselines:
LwF is rigorously compared against feature extraction, fine-tuning, fine-tuning only the task-specific layers (fine-tune FC), and joint training. Results show that LwF often outperforms these baselines on the new task and closely approaches joint training, even though joint training requires access to the old tasks' data and LwF does not.
Technical Insights
- Knowledge Distillation: LwF uses a knowledge distillation loss, in which the updated network learns to reproduce the old-task outputs recorded from its previous state, keeping responses for old tasks stable. The temperature of the softened softmax (the paper follows Hinton et al. in using T = 2) balances adaptation to the new task against retention of previous knowledge.
- Warm-up and Joint Optimization: The two-stage training involving a warm-up phase for the new task-specific parameters followed by joint optimization ensures efficient convergence and stability in performance across tasks.
- Implications of Training Data Quantity: Experiments show that LwF's advantages over fine-tuning and feature extraction become more pronounced with increasing new task data—indicating the method's robustness.
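The distillation term mentioned above can be sketched with a temperature-scaled softmax. This is a minimal sketch; the function names and the choice of toy inputs are assumptions for illustration:

```python
import numpy as np

def softmax_T(logits, T=2.0):
    """Temperature-scaled softmax; T > 1 softens the distribution."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(current_logits, recorded_logits, T=2.0):
    """Cross-entropy between softened current outputs and the
    responses recorded before training on the new task."""
    p_target = softmax_T(recorded_logits, T)
    p_current = softmax_T(current_logits, T)
    return -np.mean(np.sum(p_target * np.log(p_current + 1e-12), axis=-1))
```

The loss is minimized when the current old-task outputs match the recorded ones, so minimizing it alongside the new-task loss discourages the shared parameters from drifting in ways that hurt the old tasks.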
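The two-stage schedule can be summarized by which parameter groups receive gradient updates in each stage. The group names follow the paper's notation, but the function itself is an illustrative sketch:

```python
def trainable_params(stage):
    """Parameter groups updated in each stage of LwF training:
    a warm-up that trains only the new head, then joint optimization
    of all parameters (shared trunk, old heads, new head)."""
    if stage == "warm-up":
        return {"theta_n"}                        # new head only
    if stage == "joint":
        return {"theta_s", "theta_o", "theta_n"}  # all parameters
    raise ValueError(f"unknown stage: {stage}")
```

Warming up the randomly initialized new head first prevents its large early gradients from disturbing the shared representation before joint optimization begins.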
Future Directions
LwF has far-reaching implications:
- Broad Applicability: Beyond image classification, extending LwF to tasks such as semantic segmentation, object detection, and even domains outside computer vision can make AI models more versatile.
- Efficient Online Learning: Integrating LwF into online learning frameworks can enable AI systems to adapt in real-time, leveraging new information as it becomes available.
- Robustness: Exploring variants that incorporate small representative datasets from previously learned tasks can potentially enhance the robustness of the model against catastrophic forgetting.
Conclusion
Zhizhong Li and Derek Hoiem's "Learning without Forgetting" presents a significant step towards enabling CNNs to efficiently learn new tasks while preserving their performance on old tasks without requiring historical data. The method bridges a critical gap in incremental learning, paving the way for AI systems that continuously evolve and adapt in real-world settings.