FourierMoE: Fourier Mixture-of-Experts Adaptation of Large Language Models

Published 2 Apr 2026 in cs.LG, cs.AI, cs.CL, and cs.DC | (2604.01762v1)

Abstract: Parameter-efficient fine-tuning (PEFT) has emerged as a crucial paradigm for adapting LLMs under constrained computational budgets. However, standard PEFT methods often struggle in multi-task fine-tuning settings, where diverse optimization objectives induce task interference and limited parameter budgets lead to representational deficiency. While recent approaches incorporate mixture-of-experts (MoE) to alleviate these issues, they predominantly operate in the spatial domain, which may introduce structural redundancy and parameter overhead. To overcome these limitations, we reformulate adaptation in the spectral domain. Our spectral analysis reveals that different tasks exhibit distinct frequency energy distributions, and that LLM layers display heterogeneous frequency sensitivities. Motivated by these insights, we propose FourierMoE, which integrates the MoE architecture with the inverse discrete Fourier transform (IDFT) for frequency-aware adaptation. Specifically, FourierMoE employs a frequency-adaptive router to dispatch tokens to experts specialized in distinct frequency bands. Each expert learns a set of conjugate-symmetric complex coefficients, preserving complete phase and amplitude information while theoretically guaranteeing lossless IDFT reconstruction into real-valued spatial weights. Extensive evaluations across 28 benchmarks, multiple model architectures, and scales demonstrate that FourierMoE consistently outperforms competitive baselines in both single-task and multi-task settings while using significantly fewer trainable parameters. These results highlight the promise of spectral-domain expert adaptation as an effective and parameter-efficient paradigm for LLM fine-tuning.


Authors (5)

Summary

The paper introduces FourierMoE, which leverages spectral parameterization and frequency-aware routing to achieve efficient multi-task LLM adaptation.
It demonstrates significant performance gains and parameter efficiency over existing PEFT and MoE methods across diverse NLP and CV benchmarks.
The study underscores the importance of frequency specialization, conjugate symmetry, and dynamic rank allocation to mitigate task interference in model adaptation.

FourierMoE: Fourier Mixture-of-Experts Adaptation of LLMs

Introduction and Motivation

Parameter-efficient fine-tuning (PEFT) of LLMs is fundamental for practical adaptation under constrained computational resources. While LoRA and related methods offer strong single-task performance, their applicability in multi-task fine-tuning is hampered by task interference and representational limitations, since a single parameter set cannot sufficiently capture the diverse objectives of heterogeneous tasks. Existing mixture-of-experts (MoE) approaches, when applied in the spatial domain, partially alleviate these issues via routing but incur structural redundancy, parameter inefficiency, and poor control over expert orthogonality.

This paper presents a paradigm shift by reformulating LLM adaptation in the spectral domain. Spectral analysis reveals that distinct tasks and LLM layers exhibit markedly different frequency energy profiles, motivating the need for frequency-specific adaptation rather than uniform spatial-domain modulation. This insight is operationalized via FourierMoE, a spectral MoPE architecture that leverages the inverse discrete Fourier transform (IDFT) and token-level, frequency-aware expert routing.

Methodology

FourierMoE integrates three core mechanisms: spectral parameterization, frequency-specialized experts, and a frequency-adaptive router. Each expert operates by learning a sparse set of conjugate-symmetric complex Fourier coefficients, supporting lossless reconstruction of real-valued weight updates via the IDFT. This formulation is strictly more expressive than prior real-only PEFT spectral adaptation methods.

Formally, let $W_0 \in \mathbb{R}^{M\times N}$ denote frozen pretrained weights, and $\Delta W$ the learnable update. $\Delta W$ is parameterized as the IDFT of a sparse spectral matrix $F \in \mathbb{C}^{M\times N}$. For each expert, only a small band-limited subset of coefficients is activated, with bands dynamically initialized by Gaussian kernels centered at different frequencies, enforcing spectral specialization and orthogonality.
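Written out, the parameterization above could take the following form. This is a sketch that assumes the standard 2D IDFT with $1/(MN)$ normalization and an unscaled additive update; the summary does not state the paper's exact normalization or any scaling factor, and the index set $\Omega_e$ (the active band of expert $e$) is notation introduced here for illustration.

$$
\Delta W[m,n] \;=\; \frac{1}{MN} \sum_{(u,v)\,\in\,\Omega_e} F[u,v]\, e^{2\pi i \left(\frac{um}{M} + \frac{vn}{N}\right)}, \qquad W \;=\; W_0 + \Delta W,
$$

with $F[u,v] = 0$ for every $(u,v) \notin \Omega_e$. Only the coefficients in $\Omega_e$ are trainable, so the parameter count scales with $|\Omega_e|$ rather than with $MN$.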

The routing mechanism $G(x)$ (parameterized as a lightweight neural module) dynamically assigns tokens to the top-$k$ experts, where routing is governed by the input and the frequency bands each expert covers. Crucially, spatial weight updates are reconstructed per expert and per layer, not per token, limiting reconstruction cost to the number of activated experts $k$.
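The top-$k$ dispatch described above can be illustrated with a minimal sketch. It assumes a plain linear router followed by a softmax over the selected experts; the summary does not specify the router's form or how frequency-band information enters it, so that part is omitted. The names `top_k_gate`, `route_tokens`, `router_W`, and `expert_updates` are hypothetical.

```python
import math

def top_k_gate(logits, k):
    """Return {expert_index: weight} for the k largest logits,
    with weights softmax-normalized over the selected experts only."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(logits[i] - peak) for i in top}
    z = sum(exps.values())
    return {i: e / z for i, e in exps.items()}

def matvec(W, x):
    """Multiply a list-of-lists matrix by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def route_tokens(tokens, router_W, expert_updates, k):
    """Apply the gated sum of expert updates to each token.

    expert_updates holds one already-reconstructed spatial matrix per
    expert (assumption: built once per layer, outside the token loop).
    """
    out_dim = len(expert_updates[0])
    outputs = []
    for x in tokens:
        gates = top_k_gate(matvec(router_W, x), k)
        y = [0.0] * out_dim
        for e, g in gates.items():
            for j, val in enumerate(matvec(expert_updates[e], x)):
                y[j] += g * val
        outputs.append(y)
    return outputs

# Toy example: 3 experts, input dimension 2, output dimension 2.
router_W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
expert_updates = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0], [1.0, 1.0]],
]
tokens = [[2.0, 0.0]]
outputs = route_tokens(tokens, router_W, expert_updates, k=1)
gates = top_k_gate([1.0, 3.0, 2.0, 0.0], 2)
```

In this sketch the per-token work is limited to the router and $k$ matrix-vector products, while the expert matrices themselves are passed in already reconstructed.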

Conjugate symmetry is strictly enforced on coefficients, following the property that the IDFT of a conjugate-symmetric spectrum yields a real matrix, thereby avoiding information loss associated with ad hoc truncation as in FourierFT and LFMA. Detailed theoretical analysis in the paper formalizes the phase-amplitude completeness and the necessity of this symmetry for exact and efficient adaptation.
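The symmetry property can be checked numerically on a tiny example. The sketch below is an illustration, not the paper's implementation: it uses a naive $O(M^2N^2)$ inverse DFT in place of an FFT routine, assumes $1/(MN)$ normalization, and the helper names `idft2`, `symmetric_spectrum`, and `max_imag` are hypothetical.

```python
import cmath

def idft2(F):
    """Naive 2D inverse DFT of an M x N list-of-lists (1/(M*N) normalization)."""
    M, N = len(F), len(F[0])
    out = [[0j] * N for _ in range(M)]
    for m in range(M):
        for n in range(N):
            total = 0j
            for u in range(M):
                for v in range(N):
                    if F[u][v] != 0:  # skip inactive coefficients
                        angle = 2 * cmath.pi * (u * m / M + v * n / N)
                        total += F[u][v] * cmath.exp(1j * angle)
            out[m][n] = total / (M * N)
    return out

def symmetric_spectrum(M, N, coeffs):
    """Place each coefficient c at (u, v) and conj(c) at the mirrored
    index (-u mod M, -v mod N), giving a conjugate-symmetric spectrum."""
    F = [[0j] * N for _ in range(M)]
    for (u, v), c in coeffs.items():
        mu, mv = (-u) % M, (-v) % N
        if (mu, mv) == (u, v):
            F[u][v] = complex(c.real, 0.0)  # self-mirrored entries must be real
        else:
            F[u][v] = c
            F[mu][mv] = c.conjugate()
    return F

def max_imag(W):
    """Largest absolute imaginary part in a complex matrix."""
    return max(abs(x.imag) for row in W for x in row)

M, N = 4, 6
coeffs = {(1, 2): 0.5 - 1.5j, (0, 1): 2.0 + 0.25j}

# With mirrored conjugates: imaginary parts are at floating-point noise level.
delta_w = idft2(symmetric_spectrum(M, N, coeffs))

# Without them: the reconstructed update is genuinely complex.
F_raw = [[0j] * N for _ in range(M)]
for (u, v), c in coeffs.items():
    F_raw[u][v] = c
delta_w_raw = idft2(F_raw)

print(max_imag(delta_w))
print(max_imag(delta_w_raw))
```

With the mirrored entries in place, the real part of the result can be used directly as the weight update; without them, discarding the imaginary part would throw away part of what the coefficients encode.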

Experimental Results

FourierMoE is evaluated on 28 benchmarks, spanning commonsense and math reasoning (LLaMA/Gemma/Qwen backbones), image classification (CLIP ViT-B/32), and NLU (RoBERTa-large/GLUE). It is compared against single-PEFT baselines (LoRA, rsLoRA, DoRA, PiSSA, MiLoRA, KaSA, LoRA-Dash, NEAT, TopLoRA, MELoRA, FourierFT) and MoPE baselines (MoLoRA, AdaMoLE, HydraLoRA, GOAT).

Across all domains:

Commonsense reasoning (LLaMA-2 7B, Gemma 7B): FourierMoE achieves the highest average accuracy (e.g., 83.95 on LLaMA-2, 88.19 on Gemma), outperforming GOAT by up to 1.22% while using $16\times$ fewer trainable parameters.
Math reasoning (LLaMA-3 8B, Qwen2.5-14B): FourierMoE exceeds the best single-PEFT baselines (e.g., DoRA) by 3.01% and the strongest MoPE by 3.61% on LLaMA-3, and reaches SOTA on Qwen2.5-14B.
Image classification (CLIP): FourierMoE notably outperforms both spatial and spectral baselines, with an average accuracy of 85.10 vs. 82.25 (FFT) and 84.4 (FFT MoE), with only 0.42% trainable parameters.
NLU (RoBERTa-large/GLUE): FourierMoE achieves the highest accuracy on six out of seven tasks, with a 91.40 average, a 1.89% improvement over MiLoRA and a 1.64% gain over GOAT, and it outperforms FourierFT under identical parameter constraints.

Ablation studies confirm the necessity of imaginary components, frequency band specialization, and conjugate symmetry: removing any one of them yields clear degradation. Scaling analysis shows that both the number of experts and the activated coefficient count per expert have optimal ranges, beyond which capacity saturates or overfitting emerges. FourierMoE consistently maintains performance across a range of expert and router hyperparameters.

Theoretical Insights

Spectral parameterization provides a strict upper bound on spatial rank proportional to the number of active frequency components. Unlike spatial low-rank updates, each spectral rank-1 update distributes information globally throughout $W$, facilitating both low-rank (compressive) and high-rank (expressive) updates conditioned on input demand.
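One way to see where such a bound comes from, assuming the standard 2D IDFT with $1/(MN)$ normalization (the summary does not give the paper's exact statement or constants): a single active coefficient $c$ at frequency $(u,v)$ contributes

$$
\Delta W_{(u,v)}[m,n] \;=\; \frac{c}{MN}\, e^{2\pi i\, um/M}\, e^{2\pi i\, vn/N} \;=\; \frac{c}{MN}\, a_u[m]\, b_v[n],
$$

which is an outer product of the vectors $a_u \in \mathbb{C}^{M}$ and $b_v \in \mathbb{C}^{N}$ and therefore has rank one. Summing over a set $\Omega$ of active coefficients gives

$$
\operatorname{rank}(\Delta W) \;\le\; \min\bigl(|\Omega|,\, M,\, N\bigr).
$$

A conjugate pair counts as two coefficients and yields a real component of rank at most two. Every entry of $a_u$ and $b_v$ has unit modulus, so each component touches all $MN$ positions of the weight matrix, which is the sense in which the update is global.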

FourierMoE’s router implements input-dependent dynamic rank allocation. For less complex tokens, low-bandwidth (low-rank) experts are selected; for complex tokens, high-bandwidth experts are dynamically engaged, providing greater representational flexibility compared to fixed-rank methods such as LoRA.

Furthermore, spectral specialization inherently encourages expert orthogonality, mitigating task interference—a principal cause of performance degradation in shared-parameter PEFT under multi-task objectives.

Practical Implications and Efficiency

FourierMoE offers a highly favorable tradeoff between adaptation capacity, parameter efficiency, and computational overhead. While introducing slightly higher training/inference latency than LoRA, it achieves stronger performance with significantly fewer trainable parameters and requires less memory than typical MoPE baselines. The design ensures the computational cost is dominated by the number of active experts $k$ (kept small), rendering it scalable for models with large expert pools.

Cross-modal generalizability is empirically demonstrated: FourierMoE efficiently adapts to both NLP and CV tasks using the same principles, and the spectral-domain adaptation is robust to diverse downstream settings.

Implications and Future Directions

FourierMoE advances the capability to perform robust, scalable, and efficient LLM adaptation under multi-task and cross-modal workloads. Its reliance on the spectral domain encourages more principled capacity management (via dynamic bandwidths and conjugate symmetry), fostering both theoretical rigor and empirical efficacy. Future advances may explore:

Learning adaptive frequency bands across layers for deeper expressivity;
Hierarchical spectral routing (e.g., cross-layer expert selection);
Extending spectral MoE approaches beyond adaptation—e.g., in structured pruning, knowledge distillation, or cross-modal transfer.

Investigation into spectral interventions for RLHF, alignment, or stochastic weight averaging may also be productive. More generally, the formal connection between spectral sparsity and spatial expressivity warrants further study, potentially yielding new avenues for compression, continual learning, and robust generalization.

Conclusion

FourierMoE demonstrates that frequency-aware parameter-efficient adaptation, via spectral specialization and MoE routing, offers substantial gains in both single-task and multi-task fine-tuning of LLMs and vision models. By combining expressivity, orthogonality, and parameter sparsity, FourierMoE redefines the design space for scalable adaptation, with immediate implications for the efficient deployment and continual evolution of foundation models in diverse real-world applications.

Reference: "FourierMoE: Fourier Mixture-of-Experts Adaptation of LLMs" (2604.01762).
