
LTE: Learn-To-be-Efficient Algorithm

Updated 27 January 2026
  • LTE is a training algorithm that induces highly structured activation sparsity in LLM feedforward networks, balancing computational efficiency and task performance.
  • It employs a two-stage training process with soft routing followed by hard thresholding, achieving up to 2.6× FLOPs reduction and 25% latency reduction.
  • LTE supports both ReLU and soft-activation models and utilizes a custom sparse-MoE CUDA kernel for practical hardware acceleration.

Learn-To-be-Efficient (LTE) is a training algorithm for LLMs that induces highly structured activation sparsity in their feedforward networks (FFNs), enabling a superior trade-off between computational efficiency and task performance. Unlike prior approaches that rely on post-hoc sparsity or apply only to ReLU-based models, LTE works with both ReLU and non-ReLU activations: it trains the model together with a set of lightweight routers to activate minimal subsets of neurons in a hardware-efficient pattern. Empirically, LTE achieves FLOPs reductions of up to ~2.6× and up to 25% inference latency reduction at 50% sparsity for LLaMA2-7B, with negligible loss in accuracy (Zheng et al., 2024).

1. Joint Loss and Structured Sparsity Mechanism

LTE augments the standard task loss with a structured sparsity penalty. For each FFN of a pretrained LLM, neurons are clustered into $N$ balanced "experts" (typically in blocks of 32), and a trainable router $G_l(\cdot) \in \mathbb{R}^N$ at each layer $l$ governs expert activations. During Stage 1 training, the following loss is optimized:

$$\mathcal{L}_{\mathrm{s1}} = \mathcal{L}_{\mathrm{task}}(f(x), y) + \lambda_1\,\mathcal{L}_{\mathrm{efficiency}} + \lambda_2\,\mathcal{L}_{\mathrm{separability}}$$

Here:

  • $\mathcal{L}_{\mathrm{efficiency}} = \frac{1}{LN}\sum_{l=1}^{L}\sum_{i=1}^{N} |G_l(x)_i|^2$, which encourages routers to deactivate unnecessary experts;
  • $\mathcal{L}_{\mathrm{separability}} = \frac{1}{LN}\sum_{l=1}^{L}\sum_{i=1}^{N} \frac{1}{(G_l(x)_i - \tau)^2}$, which discourages router outputs near the gating threshold $\tau$, stabilizing subsequent hard masking.

This structured regularizer can be written as $\mathcal{R}_{\mathrm{sparsity}} = \mathcal{L}_{\mathrm{efficiency}} + \mathcal{L}_{\mathrm{separability}}$.
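The two regularizers follow directly from the formulas above. Below is a minimal NumPy sketch (an illustration, not the paper's implementation; `lte_regularizers` and its batching convention are assumptions made here for exposition):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lte_regularizers(router_logits, tau=0.5):
    """Compute LTE's two sparsity regularizers for one batch.

    router_logits: list of (batch, N) pre-sigmoid router outputs, one per layer.
    Returns (L_efficiency, L_separability), averaged over layers, experts,
    and the batch.  Note the separability term diverges if a score equals tau
    exactly, which is precisely what it penalizes.
    """
    L = len(router_logits)
    eff, sep = 0.0, 0.0
    for logits in router_logits:
        g = sigmoid(logits)                   # G_l(x) in [0, 1]^N
        eff += np.mean(g ** 2)                # pushes scores toward 0
        sep += np.mean(1.0 / (g - tau) ** 2)  # pushes scores away from tau
    return eff / L, sep / L
```

In training, the two returned scalars would be weighted by $\lambda_1$ and $\lambda_2$ and added to the task loss.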

2. Two-Stage Training Algorithm

LTE employs a two-stage training process to specialize routers and then freeze the expert selection mechanism:

  • Stage 1 (Soft Routing):
    • Routers are trained jointly with the model using the full loss above.
    • Soft expert scores from $G_l(x)$ (sigmoid-activated) weight expert contributions linearly in the FFN:

    $$\mathrm{FFN}_l(x_l) = \sum_{i=1}^{N} G_l(x_l)_i \cdot \mathrm{Expert}_i(x_l)$$

  • Stage 2 (Hard Thresholding):

    • Routers are frozen; for each router $G_l$, a binary mask is computed:

    $$\mathrm{expert\text{-}mask}(G(x)_i) = \mathbf{1}\{G(x)_i > \tau\}$$

    • Only experts above threshold contribute to the FFN; the model is fine-tuned with $\mathcal{L}_{\mathrm{task}}$ only, adapting weights to the new hard sparse structure.

Summarizing pseudocode for LTE:

Input: pretrained LLM f, #experts N, threshold τ (e.g. 0.5),
       λ1 (efficiency coeff), λ2 (separability coeff), 
       task data D

// Step 0: Group neurons into experts via balanced k-means
for each FFN layer l:
  cluster its W₁ columns into N groups of size d_FFN/N

// Attach a sigmoid router G_l: x↦sigmoid(x W_g^{(l)}) ∈ [0,1]^N

// Stage 1: joint model+router training (soft routing)
for epoch in 1…Epoch1:
  for (x,y) in D:
    FFN_l(x_l) = ∑_{i=1}^N G_l(x_l)_i · Expert_i(x_l)
    Compute full loss and backpropagate

// Stage 2: freeze routers, switch to hard thresholding
for epoch in 1…Epoch2:
  for (x,y) in D:
    for l:
      mask_i = 1 if G_l(x_l)_i > τ else 0
      FFN_l(x_l) = ∑_{i:mask_i=1} Expert_i(x_l)
    Compute L_task and update only f

Return fine-tuned sparse LLM f_LTE.
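The FFN forward pass in both stages can be made concrete with a small NumPy sketch (an illustration under assumed shapes; `lte_ffn` and the per-block gating are simplifications of the actual training code, not the paper's implementation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lte_ffn(x, W1, b1, W2, b2, Wg, tau=0.5, stage=1):
    """ReLU FFN with N experts as contiguous column blocks of W1.

    x: (d_model,), W1: (d_model, d_ffn), W2: (d_ffn, d_model),
    Wg: (d_model, N) router weights.  Stage 1 scales each expert's output
    by its soft router score; Stage 2 hard-drops experts with score <= tau.
    """
    N = Wg.shape[1]
    block = W1.shape[1] // N                 # expert width (e.g. 32)
    g = sigmoid(x @ Wg)                      # (N,) router scores
    h = np.maximum(x @ W1 + b1, 0.0)         # (d_ffn,) ReLU activations
    # Per-neuron gate: the owning expert's soft score (Stage 1)
    # or its binary mask (Stage 2).
    gate = g if stage == 1 else (g > tau).astype(h.dtype)
    h = h * np.repeat(gate, block)           # scale each expert's block
    return h @ W2 + b2
```

With router scores exactly at $\tau$, Stage 2 drops every expert and the FFN reduces to its output bias, which is the intended strict-inequality behavior of the mask.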

3. Architectural Adaptation for ReLU and Soft Activations

LTE supports both ReLU-based and soft-activation (e.g., SwiGLU) FFN variants:

  • ReLU (e.g., GPT-2/OPT):

$$h = xW_1 + b_1, \quad \mathrm{FFN}(x) = \mathrm{ReLU}(h)\,W_2 + b_2$$

Neurons are grouped via columns of $W_1$, with routers operating on the self-attention output $x$.
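The grouping step requires clusters of equal size so that every expert is the same width. A minimal stand-in for this balanced clustering is sketched below (the paper's exact balanced k-means may differ; `balanced_groups` is an illustrative greedy variant, assumed here for exposition):

```python
import numpy as np

def balanced_groups(W1, N, iters=10, seed=0):
    """Greedy balanced k-means over the columns of W1.

    Each column (neuron) is a point; clusters are capped at d_ffn/N members,
    so every expert ends up exactly the same width.  Returns a (d_ffn,)
    array mapping each neuron to one of N experts.
    """
    cols = W1.T                                  # (d_ffn, d_model) points
    d_ffn = cols.shape[0]
    cap = d_ffn // N                             # capacity per expert
    rng = np.random.default_rng(seed)
    centers = cols[rng.choice(d_ffn, N, replace=False)]
    assign = np.zeros(d_ffn, dtype=int)
    for _ in range(iters):
        dist = ((cols[:, None, :] - centers[None]) ** 2).sum(-1)  # (d_ffn, N)
        counts = np.zeros(N, dtype=int)
        # Assign points closest-first to their nearest non-full cluster;
        # total capacity equals the point count, so every cluster fills.
        for p in np.argsort(dist.min(axis=1)):
            for c in np.argsort(dist[p]):
                if counts[c] < cap:
                    assign[p], counts[c] = c, counts[c] + 1
                    break
        for c in range(N):
            centers[c] = cols[assign == c].mean(axis=0)
    return assign
```

After clustering, the columns of $W_1$ (and the matching rows of $W_2$) would be permuted so each expert's block is contiguous, which is what the inference kernel relies on.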

  • Soft-Activation (e.g., LLaMA’s SwiGLU):

$$\mathrm{SwiGLU}(x) = \left(\mathrm{Swish}(xW_A) \odot (xW_B)\right) W_2$$

Experts are formed by clustering columns of $W_A$; routers and training methodology are otherwise unchanged. No modification to forward/backward computation is necessary for non-ReLU activations.
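Under the same block conventions, a hard-masked (Stage 2) SwiGLU forward can be sketched as follows (illustrative only; `lte_swiglu_ffn` and the block layout are assumptions, not the paper's code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def swish(z):
    return z * sigmoid(z)

def lte_swiglu_ffn(x, WA, WB, W2, Wg, tau=0.5):
    """Hard-masked SwiGLU FFN (Stage 2).

    Experts are contiguous column blocks of WA; the matching columns of WB
    and rows of W2 belong to the same expert, so one per-expert mask
    suffices for all three matrices.
    """
    N = Wg.shape[1]
    block = WA.shape[1] // N
    keep = sigmoid(x @ Wg) > tau             # (N,) expert mask
    m = np.repeat(keep, block)               # per-neuron mask
    h = swish(x @ WA) * (x @ WB)             # SwiGLU inner activations
    return (h * m) @ W2
```

Because the mask multiplies the inner activations, the forward and backward computations are unchanged in form, matching the claim above that no special handling is needed for non-ReLU activations.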

4. Inference Kernel and Hardware Efficiency

Efficient exploitation of structured sparsity is realized through a custom sparse-MoE CUDA kernel:

  • Each expert block (32-wide) is packed contiguously in memory.

  • At inference, binary masks $\mathbf{1}[G_i > \tau]$ are used to select active experts.

  • A single fused kernel:

    1. Gathers active blocks into a small dense buffer.
    2. Performs a batched GEMM on the buffer.
    3. Writes output back to the original locations.
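The numerics of the gather-then-GEMM path can be checked in NumPy (this emulates only the arithmetic equivalence, not the fused CUDA kernel; `masked_ffn_gather` and the weight-side gather are illustrative simplifications):

```python
import numpy as np

def masked_ffn_gather(x, W1, W2, mask, block=32):
    """Emulate the fused kernel's math: gather the active expert blocks of
    W1/W2 into small dense buffers, then run one dense matmul.

    The result equals masking the full FFN, but only the active columns
    are ever touched, which is where the speedup comes from.
    """
    active = np.flatnonzero(mask)                          # active expert ids
    cols = (active[:, None] * block + np.arange(block)).ravel()
    W1_buf = W1[:, cols]                                   # gathered blocks
    W2_buf = W2[cols, :]
    h = np.maximum(x @ W1_buf, 0.0)                        # small dense GEMM
    return h @ W2_buf                                      # scatter-free here
```

Since ReLU acts elementwise and masked neurons contribute zero, gathering before the matmul and masking after it are mathematically identical; the kernel exploits this to skip inactive blocks entirely.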

Routers account for approximately 1% of FFN FLOPs; thresholding is a binary mask operation. Wall-clock throughput scales almost linearly with the fraction of blocks dropped. Empirical measurements indicate up to 2.6× FFN FLOPs reduction translates to ~2× end-to-end speedup on modern GPUs (Zheng et al., 2024).

5. Empirical Evaluation

Representative accuracy and efficiency results:

Table: Comparative GFLOPs/token Across Methods

Method               XSum    E2E     Wikitext
Full (dense)         12.06   12.06   12.06
KLA (oracle)         7.14    6.42    7.87
MoEfication (ReLU)   10.45   n/a     n/a
R-LLaMA+MoE          8.27    7.39    11.10
LTE (Ours)           5.38    4.65    6.59
  • NLU (RoBERTa-base on GLUE) at 90% FFN sparsity:

    • SST-2: 93.3% (vs. 93.5% dense)
    • MRPC F1: 88.1% (vs. 89.0%)
    • QNLI: 90.8% (vs. 91.0%)
    • MNLI-m: 90.1% (vs. 90.5%)
  • LLaMA-7B: Permitting up to 0.05 ROUGE-L drop on XSum/E2E and 0.5 PPL on WikiText, LTE achieves the lowest GFLOPs/token across tasks.

Trade-off curves (sparsity vs. accuracy) consistently show LTE dominating standard fine-tuning, MoEfication, and KLA oracle baselines, notably at high sparsity (80–95%).

6. Ablation Analysis and Best Practices

  • Two-stage training is critical; Stage 1 router training confers ≈10% higher accuracy at high sparsity (90%) compared to random or softmax top-$K$ routers.
  • $\lambda_1$ directly tunes average sparsity; $\lambda_1 \approx 0.1$–$0.3$ yields 80–90% sparsity with <1% accuracy degradation.
  • Layer-wise sparsity adapts: in RoBERTa-base on MRPC, lower layers retain ~20% active experts while upper layers drop ~80%; the reverse is observed in GPT-2 medium on WikiText.
  • $\tau = 0.5$ and $\lambda_2 = 0.5$ remain effective across experiments.

7. Context and Implications

LTE extends structured sparse Mixture-of-Experts (MoE) approaches to a broader class of LLMs, supporting both traditional ReLU and newer soft-gated architectures. The approach incurs minimal overhead—routers are small and computationally efficient—enabling significant inference acceleration without the post-hoc brittleness or limited applicability of earlier MoEfication techniques. The highly structured, hardware-compatible sparsity regime induced by LTE unlocks substantial FLOPs savings at marginal or negligible performance cost, with established generality across language understanding, generation, and instruction tuning tasks (Zheng et al., 2024).
