
Cut Cross-Entropy for LLM Efficiency

Updated 7 January 2026
  • Cut Cross-Entropy is a novel approach that minimizes memory usage by streamlining cross-entropy computation in large-vocabulary neural language models.
  • It employs custom kernels and blockwise computations to avoid materializing full logit matrices, drastically reducing DRAM usage and accelerating training.
  • The method leverages softmax sparsity and gradient filtering to optimize backward passes while maintaining convergence and reducing compute overhead.

Cut Cross-Entropy (CCE) refers to a family of techniques for optimizing the computation of cross-entropy loss in large-vocabulary neural models, particularly LLMs. The defining feature of CCE is the reduction—often elimination—of the need to materialize the full logit matrix for all tokens and all vocabulary items in memory, addressing a dominant training-time memory bottleneck. Two lines of research employ the "Cut Cross-Entropy" phrase: a memory-efficient CCE algorithm for LLMs that avoids explicit full-matrix construction (Wijmans et al., 2024), and the ISBE approach arguing for the redundancy of explicit cross-entropy computation by leveraging the structural cancellation of gradients in classification models (Skarbek, 2023).

1. Classical Cross-Entropy Loss and Its Bottlenecks

Standard cross-entropy (CE) loss requires first transforming features into a logit matrix $z \in \mathbb{R}^{T \times V}$, where $T$ is the number of next-token predictions in a batch and $V$ is the vocabulary size. The cross-entropy over the batch is

$$L = -\sum_{t=1}^{T} \log \operatorname{softmax}_{y_t}(z_{t,\cdot}) = -\sum_{t=1}^{T} \left[ z_{t,y_t} - \log \sum_{i=1}^{V} \exp(z_{t,i}) \right]$$

This implementation entails:

  • Memory: $O(T \cdot V)$ words for storing all logits.
  • Compute: $O(T \cdot V \cdot D)$ FLOPs, where $D$ is the embedding dimension.

In practice, for large $V$, the memory footprint of materializing $z$ can surpass all other model components combined. For example, in the Gemma 2 model ($|V| = 256{,}128$, $T = 8{,}192$, $D = 2{,}304$), the loss computation alone previously consumed $\sim 24$ GB of GPU memory, or 28 GB including the classifier head (Wijmans et al., 2024).
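As a rough sanity check on these figures, the raw fp32 logit matrix alone for this configuration already occupies several gigabytes; the $\sim 24$ GB reported in the paper additionally covers intermediate copies retained for the backward pass. A minimal sketch:

```python
# Back-of-envelope size of the full logit matrix for the Gemma 2
# configuration quoted above. This counts only the raw fp32 logits;
# the ~24 GB figure in the text also includes intermediate buffers
# kept for the backward pass.
T = 8_192          # next-token predictions in the batch
V = 256_128        # vocabulary size
BYTES_FP32 = 4

logit_gib = T * V * BYTES_FP32 / 2**30
print(f"fp32 logits alone: {logit_gib:.1f} GiB")  # ~7.8 GiB before backward copies
```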

2. CCE Algorithmic Principle: Minimal Logit Materialization

The core insight behind the CCE algorithm (Wijmans et al., 2024) is that, for each token prediction $t$, only two scalars are required:

  • The logit for the correct token, $z_{t,y_t}$.
  • The normalization scalar $\log \sum_{i=1}^{V} \exp(z_{t,i})$.

CCE avoids global memory storage of the full $T \times V$ logit matrix by:

  • Computing $z_{t,y_t}$ via direct indexed matrix products, consuming only $O(T \cdot D)$ memory.
  • Computing the log-sum-exp across $V$ via streaming reduction in on-chip SRAM, never materializing the full logit vector in DRAM.

In implementation, custom kernels perform these operations in a blockwise fashion, maintaining only small intermediate buffers per CUDA block and relying on atomic operations to aggregate results. The forward and backward passes are handled by distinct kernels, with the forward pass using "indexed mat-mul" and "linear + log-sum-exp" operations and the backward pass recomputing the necessary blocks for efficient gradient updates.
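The same two-step forward structure can be sketched in plain NumPy (an illustration of the idea only, not the paper's fused Triton kernels; `cce_forward`, `block_size`, and the variable names here are ours):

```python
import numpy as np

# NumPy sketch of the CCE forward pass: the correct-token logit comes
# from an indexed product, and the log-sum-exp is accumulated over
# vocabulary blocks so that at most one (T, block_size) logit tile
# exists at a time, never the full (T, V) matrix.

def cce_forward(E, C, targets, block_size=1024):
    """E: (T, D) embeddings, C: (D, V) classifier, targets: (T,) ids."""
    T, D = E.shape
    V = C.shape[1]
    # Indexed mat-mul: only the target column of C per token, O(T*D).
    z_correct = np.einsum("td,dt->t", E, C[:, targets])
    # Streaming log-sum-exp over vocabulary blocks (running max + sum).
    m = np.full(T, -np.inf)          # running max per token
    s = np.zeros(T)                  # running sum of exp(z - m)
    for v0 in range(0, V, block_size):
        z_blk = E @ C[:, v0:v0 + block_size]      # one small tile only
        m_new = np.maximum(m, z_blk.max(axis=1))
        s = s * np.exp(m - m_new) + np.exp(z_blk - m_new[:, None]).sum(axis=1)
        m = m_new
    lse = m + np.log(s)
    return (lse - z_correct).mean()   # mean cross-entropy over the batch
```

Because the running maximum rescales the running sum at every block, the log-sum-exp is exact regardless of the block size, so the result matches the naive loss computed from the full logit matrix.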

3. Memory and Compute Complexity, and Empirical Results

CCE delivers a drastic reduction:

  • Memory: $O(T + V)$ global memory (for output and bookkeeping) rather than $O(T \cdot V)$ for logits.
  • Compute: identical $O(T \cdot V \cdot D)$ arithmetic to standard CE, but without global $T \times V$ tensor I/O.
  • On-chip block usage: $O(D \cdot (N_B + M_B))$, where $N_B$ and $M_B$ are the block sizes for tokens and vocabulary; intermediate blocks never spill to global memory.

Empirical results for Gemma 2 (2B parameters) on an A100 GPU (batch = 8,192, $|V| = 256{,}000$):

| Method              | Loss mem  | Loss ms | Grad mem  | Grad ms | Total mem | Total ms |
|---------------------|-----------|---------|-----------|---------|-----------|----------|
| Baseline PyTorch CE | 24,000 MB | 82 ms   | 16,000 MB | 121 ms  | 28,000 MB | 207 ms   |
| torch.compile       | 4,000 MB  | 49 ms   | 12,000 MB | 92 ms   | 16,000 MB | 143 ms   |
| TorchTune (8ch)     | 8,000 MB  | 55 ms   | 1,630 MB  | 115 ms  | 9,631 MB  | 170 ms   |
| Liger Kernels       | 1,474 MB  | 302 ms  | —         | —       | 1,474 MB  | 303 ms   |
| CCE (Ours)          | 1 MB      | 43 ms   | 1,163 MB  | 95 ms   | 1,164 MB  | 135 ms   |

These results reflect a reduction of over 99% in DRAM for the loss, with total classifier-head memory reduced from 28 GB to about 1 GB, most of which is now the gradient storage $\nabla C \in \mathbb{R}^{D \times V}$ in fp16/bf16 (Wijmans et al., 2024).
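That residual gigabyte is consistent with the size of the 16-bit classifier-head gradient alone, as a quick back-of-envelope check shows (the table's exact 1,163 MB reflects the benchmark's precise configuration):

```python
# Size of the classifier-head gradient grad_C in R^{D x V} at 16-bit
# precision, for the Gemma 2 dimensions quoted earlier. The result is
# close to, though not exactly, the ~1,163 MB gradient column above.
D, V = 2_304, 256_128
BYTES_BF16 = 2

grad_mb = D * V * BYTES_BF16 / 1e6   # decimal megabytes
print(f"grad_C: {grad_mb:.0f} MB")   # ~1180 MB
```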

4. Softmax Sparsity and Gradient Filtering Optimization

CCE exploits the sparsity inherent in the softmax output. In bfloat16, any entry $< \epsilon = 2^{-12}$ is numerically zero. Empirically, in Gemma 2, fewer than $0.02\%$ of vocabulary tokens per output have non-negligible softmax values. The backward kernel therefore skips entire blocks whose maximum is below $\epsilon$.

This "gradient filtering" yields a $3.5\times$ backward speedup with no measurable difference in final convergence. Additionally, vocabulary blocks are sorted by average logit per forward pass, promoting block-level sparsity and further reducing the number of nonzero blocks computed (Wijmans et al., 2024).
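The block-skipping rule can be illustrated with a small sketch (our simplification in NumPy; in the paper the filtering happens inside the backward kernel):

```python
import numpy as np

# Illustration of block-level gradient filtering: softmax probabilities
# below eps = 2**-12 are exactly zero in bfloat16, so any vocabulary
# block whose maximum probability is below eps contributes nothing to
# the backward pass and can be skipped entirely.

def filtered_blocks(probs, block_size=64, eps=2.0**-12):
    """Return indices of vocabulary blocks that must be processed."""
    V = probs.shape[-1]
    keep = []
    for b, v0 in enumerate(range(0, V, block_size)):
        if probs[..., v0:v0 + block_size].max() >= eps:
            keep.append(b)   # block has at least one non-negligible entry
    return keep
```

For a peaked softmax, almost every block falls below $\epsilon$ and is skipped, which is where the backward-pass speedup comes from.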

A plausible implication is that as vocabulary size increases, efficiency gains from blockwise gradient filtering become more pronounced, especially on architectures with limited DRAM bandwidth.

5. The ISBE Principle: Cross-Entropy Redundancy in Backpropagation

Independent of the above memory optimization, recent work argues that explicit cross-entropy computation is redundant for deep learning classifiers, denoting this approach ISBE (Inference–Score–Backprop–Error) (Skarbek, 2023). The observation is that, for softmax-CE blocks, the gradient with respect to the logits is

$$\frac{\partial L}{\partial z_j} = y_j - y^\circ_j$$

where $y = \operatorname{softmax}(z)$ and $y^\circ$ is the target probability vector. Thus, backpropagation through $\operatorname{softmax} + \mathrm{CE}$ is equivalent to simply passing $y - y^\circ$ backward, without computing the explicit cross-entropy scalar. This property generalizes to any activation $F$ with monotonicity and appropriate range, using the pointwise error $F(z) - y^\circ$.

ISBE prescribes:

  • Normalizing logits with a monotonic, bounded $F$ (e.g., sigmoid, tanh).
  • Backpropagating $F(z) - y^\circ$ directly as the error fed into the network, skipping the explicit entropy computation.
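The softmax case of the identity is easy to verify numerically; the sketch below (our construction) checks $\nabla_z L = \operatorname{softmax}(z) - y^\circ$ against a finite-difference gradient of the explicit loss:

```python
import numpy as np

# ISBE identity check: the gradient of L(z) = -log softmax(z)[k] with
# respect to the logits equals softmax(z) - y_target, so backprop can
# start from that error vector without ever forming the scalar loss.

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
z = rng.normal(size=10)
y_target = np.zeros(10)
y_target[3] = 1.0                    # one-hot label on class 3

isbe_grad = softmax(z) - y_target    # what ISBE passes backward

def loss(z):
    return -np.log(softmax(z)[3])    # explicit softmax + CE scalar

# Central finite differences on the explicit loss, for comparison.
h = 1e-6
fd_grad = np.array([(loss(z + h * e) - loss(z - h * e)) / (2 * h)
                    for e in np.eye(10)])
```

The two gradients agree to finite-difference precision, which is the sense in which the explicit cross-entropy scalar is redundant for backpropagation.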

Empirical results on MNIST show up to a $3\%$ reduction in backprop time, with no loss in accuracy and occasionally a slight improvement. ISBE's conditions for an exact match to cross-entropy gradients hold only for the actual softmax (possibly with bias/relocation); for other activations, MSE on normalized scores is employed as an effective surrogate (Skarbek, 2023).

6. Tradeoffs and Implementation Considerations

CCE delivers the following operational benefits (Wijmans et al., 2024):

  • Memory usage: Reduced by orders of magnitude (e.g., 24 GB to 1 MB for the loss).
  • Throughput: Equal or slightly better ($\approx 5\%$) than the best fused implementations, due to elimination of $T \times V$ tensor writes to DRAM.
  • Convergence: Indistinguishable from standard CE as verified on several LLMs.
  • Portability: Implemented in custom Triton kernels, but easily adapted to CUDA.

Tradeoffs include the need for custom kernels and per-forward-pass vocabulary sorting, the latter of which adds up to $+15\%$ latency if omitted. Gradient filtering is critical; omitting it increases backward time $3.4\times$ but does not affect accuracy.

ISBE's primary tradeoff is that, outside of true SoftMax (with relocation), it does not exactly replicate CE, and for non-standard activations, gradient correspondence is lost. Properly matching label encoding to the activation's range is essential to preserve gradient meaning and learning stability (Skarbek, 2023).

7. Context, Limitations, and Applicability

CCE is particularly advantageous for LLMs with very large output vocabularies, where memory bandwidth rather than compute is the critical bottleneck. Its applicability to hybrid architectural settings depends on the availability of on-chip SRAM and low-latency indexed matrix operations.

ISBE's technique aligns with modern autodifferentiation best practices, exploiting automatic multiplication by the activation's derivative; users need only supply error terms. However, the approach can fail for exotic normalization schemes that do not preserve the conditions for error gradient equivalence (Skarbek, 2023).

CCE and ISBE represent orthogonal but complementary advances: CCE addresses the resource scaling issue in cross-entropy computation for large-vocabulary LLMs (Wijmans et al., 2024), while ISBE foregrounds theoretical redundancies in loss computation for classifier networks (Skarbek, 2023). Both contribute to the ongoing re-examination of canonical deep learning pipelines for efficiency and theoretical foundation.
