Coop: Memory is not a Commodity

Published 1 Nov 2023 in cs.LG, cs.AI, and cs.SE | (2311.00591v1)

Abstract: Tensor rematerialization allows the training of deep neural networks (DNNs) under limited memory budgets by checkpointing the models and recomputing the evicted tensors as needed. However, the existing tensor rematerialization techniques overlook the memory system in deep learning frameworks and implicitly assume that free memory blocks at different addresses are identical. Under this flawed assumption, discontiguous tensors are evicted, among which some are not used to allocate the new tensor. This leads to severe memory fragmentation and increases the cost of potential rematerializations. To address this issue, we propose to evict tensors within a sliding window to ensure all evictions are contiguous and are immediately used. Furthermore, we proposed cheap tensor partitioning and recomputable in-place to further reduce the rematerialization cost by optimizing the tensor allocation. We named our method Coop as it is a co-optimization of tensor allocation and tensor rematerialization. We evaluated Coop on eight representative DNNs. The experimental results demonstrate that Coop achieves up to $2\times$ memory saving and hugely reduces compute overhead, search latency, and memory fragmentation compared to the state-of-the-art baselines.

Summary

The paper introduces Coop, which co-optimizes tensor allocation and rematerialization with a sliding window algorithm to reduce memory fragmentation.
It employs cheap tensor partitioning and recomputable in-place operations to lower compute overhead by up to 30% under strict memory ratios.
Coop outperforms state-of-the-art techniques in search latency and efficiency across models like GPT-3 and BERT Large.

Introduction

The paper "Coop: Memory is not a Commodity" (2311.00591) introduces Coop, a method that co-optimizes tensor allocation and tensor rematerialization in deep learning frameworks. Existing rematerialization techniques ignore the framework's memory system, so they evict tensors at scattered addresses, which fragments memory and raises recomputation cost. Coop instead uses a sliding window algorithm to evict contiguous tensors, and adds cheap tensor partitioning and recomputable in-place operations to further reduce memory use and compute overhead.

Figure 1: Comparison between DTR and Coop. DTR overlooks the underlying memory system, resulting in redundant evictions, whereas Coop optimizes tensor rematerialization and allocation.

Background and Motivation

Tensor rematerialization, also known as activation checkpointing, makes it possible to train large DNNs under a limited memory budget. Existing techniques treat memory as fungible, implicitly assuming that free blocks at different addresses are interchangeable. When tensors are evicted at non-contiguous addresses, some of the freed blocks cannot be used to allocate the new tensor, which fragments memory and triggers further evictions. Memory allocators in DNN frameworks, although simpler than their CPU counterparts, handle this case poorly. This motivates Coop's approach: place tensors so that evictions are contiguous and recomputation is minimized.
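The fragmentation problem described above can be illustrated with a toy pool. This is only a sketch: the block layout, the 4 MiB sizes, and the helper names `largest_free_run` and `total_free` are assumptions made for illustration, not part of the paper or of any framework's allocator.

```python
MiB = 1 << 20

def largest_free_run(blocks):
    """Return the largest contiguous free span in bytes.

    `blocks` is a list of (size_bytes, is_free) pairs in address order.
    """
    best = run = 0
    for size, is_free in blocks:
        run = run + size if is_free else 0
        best = max(best, run)
    return best

def total_free(blocks):
    """Return the total number of free bytes, ignoring where they are."""
    return sum(size for size, is_free in blocks if is_free)

# Hypothetical pool of four 4 MiB tensors in address order.
# Case 1: tensors 0 and 2 are evicted (freed blocks are not adjacent).
scattered = [(4 * MiB, True), (4 * MiB, False), (4 * MiB, True), (4 * MiB, False)]
# Case 2: tensors 0 and 1 are evicted (freed blocks are adjacent).
contiguous = [(4 * MiB, True), (4 * MiB, True), (4 * MiB, False), (4 * MiB, False)]

request = 8 * MiB
print(total_free(scattered) >= request, largest_free_run(scattered) >= request)
print(total_free(contiguous) >= request, largest_free_run(contiguous) >= request)
```

Both cases free the same number of bytes, but only the contiguous eviction yields a span large enough for the 8 MiB request. In the scattered case an eviction policy that counts bytes alone would have to keep evicting.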

Methodology

Coop implements three core modules designed to optimize both tensor allocation and rematerialization:

Sliding Window Algorithm: This finds a set of contiguous tensors to evict in a single pass, reducing the search time from $O(2^N)$ to $O(N)$. Because every evicted block is adjacent to the others and is used immediately by the new tensor, the repeated evictions caused by fragmentation are avoided.
Cheap Tensor Partitioning: Tensors are classified by computational cost density and allocated on opposite ends of the memory pool (Figure 2). This separation lets low-cost tensors be evicted together first, which reduces unnecessary recomputation of expensive tensors.
Recomputable In-place Operations: Inspired by functional in-place paradigms, Coop reuses the existing memory block for in-place operations instead of allocating a new one. This keeps parameters contiguous in memory and thus reduces fragmentation.
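The sliding window search from the first item can be sketched as a two-pointer scan over blocks in address order. This is a minimal sketch under stated assumptions, not the paper's implementation: the `(size, cost)` block model, the name `cheapest_window`, and the use of a plain summed cost (rather than whatever eviction heuristic Coop actually applies) are all hypothetical. Free blocks are modelled as cost 0.

```python
from typing import List, Optional, Tuple

# (size_bytes, recompute_cost); positive sizes, non-negative costs (assumed).
Block = Tuple[int, float]

def cheapest_window(blocks: List[Block], needed: int) -> Optional[Tuple[int, int, float]]:
    """Return (start, end_exclusive, cost) of the cheapest contiguous run of
    blocks whose sizes sum to at least `needed`, or None if no run fits."""
    if needed <= 0:
        raise ValueError("needed must be positive")
    best = None
    left = 0
    size = 0
    cost = 0.0
    for right, (b_size, b_cost) in enumerate(blocks):
        size += b_size
        cost += b_cost
        # Drop blocks from the left while the window still fits without them.
        while size - blocks[left][0] >= needed:
            size -= blocks[left][0]
            cost -= blocks[left][1]
            left += 1
        if size >= needed and (best is None or cost < best[2]):
            best = (left, right + 1, cost)
    return best

# Hypothetical pool in address order; the third block is already free.
blocks = [(4, 5.0), (2, 1.0), (2, 0.0), (4, 1.0), (4, 9.0)]
print(cheapest_window(blocks, 6))    # (2, 4, 1.0)
print(cheapest_window(blocks, 100))  # None
```

Each pointer only moves forward, so the scan is linear in the number of blocks. With non-negative costs, the shortest fitting window ending at a given block is also the cheapest one ending there, which is why a single pass is enough.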
Figure 2: Illustration of cheap tensor partitioning in Coop. Tensors are allocated from both sides of the memory pool to optimize eviction efficiency.
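Allocating from both sides of the pool can be sketched with a toy bump allocator. Everything here is an assumption for illustration: the class `TwoEndedPool`, the helper `is_cheap`, the density threshold, and the choice to put cheap tensors at the low end are hypothetical, and the sketch ignores freeing and reuse.

```python
class TwoEndedPool:
    """Toy allocator that grows from both ends of one address range."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.low = 0           # next free offset for cheap tensors
        self.high = capacity   # expensive tensors occupy [high, capacity)

    def alloc(self, size: int, cheap: bool) -> int:
        """Return the start offset of the new block, or raise MemoryError."""
        if size <= 0:
            raise ValueError("size must be positive")
        if self.high - self.low < size:
            raise MemoryError("pool exhausted")
        if cheap:
            offset = self.low
            self.low += size
        else:
            self.high -= size
            offset = self.high
        return offset

def is_cheap(recompute_cost: float, size: int, threshold: float) -> bool:
    """Classify by cost density (cost per byte); the threshold is hypothetical."""
    return recompute_cost / size <= threshold

pool = TwoEndedPool(100)
a = pool.alloc(10, cheap=is_cheap(1.0, 10, threshold=0.5))    # density 0.1, cheap
b = pool.alloc(20, cheap=is_cheap(40.0, 20, threshold=0.5))   # density 2.0, expensive
c = pool.alloc(5, cheap=True)
print(a, b, c)  # 0 80 10
```

Cheap tensors end up next to each other at one end, so a contiguous eviction window over that region touches only tensors that are inexpensive to recompute.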

Evaluation

Coop was evaluated against state-of-the-art methods, namely DTR and DTE, across eight DNNs, including GPT-3 and ResNet variants. The evaluations used multiple performance metrics such as compute overhead, search latency, and memory fragmentation rate:

Compute Overhead: Coop consistently showed lower overhead across all memory budgets, particularly excelling in environments with stringent memory constraints. For instance, Coop reduced overhead by 30% in training BERT Large under a 50% memory ratio compared to DTR and DTE.
Search Latency: Because the sliding window finds a contiguous eviction candidate in a single traversal, Coop significantly reduced search latency and remained consistently faster across models (Figure 3).
Memory Fragmentation: Coop kept fragmentation rates below 5% across models, with the lowest fragmentation observed when training BERT Large and BiLSTM.
Figure 3: Search latency of Coop, DTR, and DTE under different memory ratios. Coop maintains uniform low latency due to its efficient contiguous memory allocation strategy.

Conclusion

Coop improves tensor rematerialization by making it aware of the memory system in deep learning frameworks, which reduces memory fragmentation and lowers compute overhead. By co-optimizing tensor rematerialization and allocation, it uses memory more efficiently and makes large-scale models easier to train under limited memory. Future work could integrate Coop with heterogeneous memory systems or extend it to handle swapping in distributed training environments. Such extensions might further improve the efficiency of training large, dynamic DNN workloads.