
Modular PEFT Reference Architecture

Updated 11 December 2025
  • Modular PEFT Reference Architecture is a design blueprint that incorporates lightweight, parameter-efficient modules into pre-trained models for specialized task adaptation.
  • It specifies precise insertion points, such as embedding and attention slots, ensuring systematic integration and interoperability of diverse adaptation modules.
  • The architecture enables effective trade-off analysis between adaptation power, resource efficiency, and training speed, supporting multi-domain and multimodal applications.

A Modular Parameter-Efficient Fine-Tuning (PEFT) Reference Architecture is a general structural blueprint for integrating compact, specialized adaptation modules into large pre-trained language models (PLMs) to enable efficient adaptation across tasks, domains, and modalities. This reference design specifies plug-and-play insertion of well-scoped PEFT modules at defined locations in the model, supporting compositionality, extensibility, and organized trade-off analysis between adaptation power and resource efficiency. The architecture underpins most state-of-the-art PEFT systems, including those tailored for multi-domain, multi-modal, federated, and mixture-of-experts (MoE) transformer frameworks (Seo et al., 9 Mar 2025, Sabry et al., 2023, Prottasha et al., 19 Apr 2025, Liu et al., 4 Aug 2025, Patel et al., 24 Jan 2025).

1. Core Principles and Architectural Scope

The modular PEFT reference architecture is defined by the following principles:

  • Separation of Backbone and Adaptation Modules: The pretrained backbone (PLM) remains frozen or partially frozen, providing the foundation for reasoning, while task- or domain-specific adaptation modules (“adapters”/“PEMs”) supply lightweight, trainable correction or control pathways (Sabry et al., 2023, Patel et al., 24 Jan 2025).
  • Named Insertion Points and Interfaces: Adaptation modules are inserted at precisely specified slots such as input embedding, attention projections (Q/K/V/O), feedforward sublayers (FFN), or post-layer normalization junctions. Each module exposes a documented interface specifying inputs, outputs, and parameter conventions (Sabry et al., 2023, Seo et al., 9 Mar 2025, Belanec et al., 2 Dec 2025); a concrete slot-injection sketch follows Table 1.
  • Composition and Extensibility: Multiple PEFT modules may coexist in parallel or sequentially, supporting additive, multiplicative, or router-mediated composition. Reuse and recombination across tasks or domains is a first-class guarantee (Patel et al., 24 Jan 2025, Sabry et al., 2023).
  • Parameter and Efficiency Accounting: The architecture enables a priori analysis of parameter count, memory, throughput, and efficiency trade-offs for any compatible PEFT method (Prottasha et al., 19 Apr 2025, Belanec et al., 2 Dec 2025).

A schematic of the modular architecture is given in Table 1.

| Block Type | PEFT Module Examples | Typical Slot |
| --- | --- | --- |
| Prompt/Prefix | Soft Prompt, Prefix Tuning | Embedding, Attention |
| Adapter | Houlsby, Compacter | MLP after attention/FFN |
| Reparameterization | LoRA, IA³ | Linear projections |
| MoE | MoFE, PERFT-Adapters | FFN/MoE block |
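
To make the slot contract concrete, the sketch below wraps the query/value projections of a PyTorch backbone with a LoRA-style low-rank residual. `LoRALinear`, `inject_lora`, and the `q_proj`/`v_proj` slot names are illustrative assumptions, not an API from the cited works.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank residual (LoRA-style)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                       # backbone stays frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no-op at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * ((x @ self.A.T) @ self.B.T)

def inject_lora(model: nn.Module, slots=("q_proj", "v_proj"), r: int = 8):
    """Walk the module tree and wrap the named projection slots."""
    for name, child in model.named_children():
        if name in slots and isinstance(child, nn.Linear):
            setattr(model, name, LoRALinear(child, r=r))
        else:
            inject_lora(child, slots, r)
```

After injection, only the A/B factors receive gradients; the named-slot convention is what makes the insertion systematic rather than ad hoc.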

2. Modular Components and Embedding Strategies

A modular PEFT system includes the following types of components:

  • Base Model: Frozen or partially trainable stack of layers (transformer encoder/decoder), responsible for all “standard” model computations (token embeddings, position encoding, MHSA, FFN blocks, residual connections, layer normalization). e.g., TinyLlama, BERT, ViT (Seo et al., 9 Mar 2025, Prottasha et al., 19 Apr 2025).
  • PEFT Modules: Small residual or multiplicative circuits (adapters, LoRA/IA³ projections, prompt vectors) inserted at predefined locations, uniquely identified by “module type” and potentially by domain/context label. Each PEFT module $m$ comprises parameters $\phi$ of dimensionality orders of magnitude smaller than the backbone’s $\theta$ (Sabry et al., 2023, Hadji-Kyriacou et al., 2023).
  • Gating/Router Systems (optional): For MoE-style or multi-module setups, a lightweight gating network or router computes dynamic mixture weights for each module or expert at inference time (Seo et al., 9 Mar 2025, Liu et al., 4 Aug 2025).

Integration points include parallel (additive to a hidden state or linear map), sequential (input/output chaining), or contextual (per-token, context-dependent adapters; Hadji-Kyriacou et al., 2023). Modules may be loaded, replaced, or composed at runtime, and their states versioned and indexed for task/domain management (Patel et al., 24 Jan 2025), as sketched below.
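
The runtime loading and versioning described above might be organized as in the following minimal sketch; `AdapterRegistry` and its methods are hypothetical, and only the pattern (adapter state dicts indexed by domain and version, swapped in with a non-strict load) is the point.

```python
from typing import Dict, Tuple
import torch.nn as nn

class AdapterRegistry:
    """Index adapter state dicts by (domain, version) for runtime swap-in."""
    def __init__(self):
        self._store: Dict[Tuple[str, int], dict] = {}

    def register(self, domain: str, version: int, state_dict: dict):
        # store a detached copy so later training does not mutate the registry
        self._store[(domain, version)] = {k: v.detach().clone()
                                          for k, v in state_dict.items()}

    def load_into(self, model: nn.Module, domain: str, version: int):
        # strict=False: the stored dict contains only adapter parameters
        model.load_state_dict(self._store[(domain, version)], strict=False)
```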

3. Formal Parameterization and Forward Pass Semantics

Let $x$ denote the input, $\theta$ the parameters of the (frozen) base model, and $\{\phi_i\}$ the adaptation modules plugged into slots $S_i$:

  • Standard Forward Layer:

For PLM layer $\ell$: $h_\ell = \mathrm{LayerNorm}(h_{\ell-1} + a_\ell + f_\ell)$, with $a_\ell$ the attention output and $f_\ell$ the FFN output.

  • PEFT-Enhanced Layer (residual additive):

$h_\ell = G_\ell(h_{\ell-1};\theta_\ell,\phi_\ell)$, with $h_\ell = \mathrm{LayerNorm}(h_{\ell-1} + a_\ell + f_\ell + \Delta h_\ell)$, $\Delta h_\ell = m_\ell(\cdot;\phi_\ell)$, and $\phi_\ell = \varnothing$ if no module is inserted (Sabry et al., 2023, Seo et al., 9 Mar 2025).

  • MoE and Router Example (MoFE):

Mixture over $K$ frozen experts $E_i$; the router computes gate weights $g_i$: $y(h) = \sum_{i=1}^{K} g_i(h)\, E_i(h)$, with $g = \operatorname{SoftmaxTop}_m(Vh)$, $V \in \mathbb{R}^{K \times d}$ (Seo et al., 9 Mar 2025).
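
A minimal PyTorch sketch of the gate above: the top-$m$ selection keeps the $m$ largest router logits and renormalizes them with a softmax; the `experts` list stands in for frozen fine-tuned experts.

```python
import torch
import torch.nn as nn

class TopMGate(nn.Module):
    """g = SoftmaxTop_m(V h): keep the m largest logits, zero out the rest."""
    def __init__(self, d: int, num_experts: int, m: int = 2):
        super().__init__()
        self.V = nn.Linear(d, num_experts, bias=False)
        self.m = m

    def forward(self, h):                                # h: (batch, d)
        logits = self.V(h)                               # (batch, K)
        topv, topi = logits.topk(self.m, dim=-1)
        gates = torch.zeros_like(logits)
        gates.scatter_(-1, topi, torch.softmax(topv, dim=-1))
        return gates                                     # rows sum to 1 over selected experts

def moe_forward(h, gate: TopMGate, experts: nn.ModuleList):
    g = gate(h)                                          # (batch, K)
    outs = torch.stack([E(h) for E in experts], dim=1)   # (batch, K, d)
    return (g.unsqueeze(-1) * outs).sum(dim=1)           # y(h) = sum_i g_i(h) E_i(h)
```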

For composition, multiple modules may be summed at each slot, $\Delta h_\ell = m^{(1)}_\ell(\cdot;\phi^{(1)}_\ell) + m^{(2)}_\ell(\cdot;\phi^{(2)}_\ell)$, or chained, $h_\ell \leftarrow h_\ell + m^{(1)}_\ell(h_\ell)$ followed by $h_\ell \leftarrow h_\ell + m^{(2)}_\ell(h_\ell)$ (Sabry et al., 2023, Patel et al., 24 Jan 2025).
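
The residual-additive layer and the summed composition can be sketched together, assuming the simplified single-LayerNorm layer of the formulas above; `attn`, `ffn`, and the module list are stand-ins, and an empty list recovers the $\phi_\ell = \varnothing$ case.

```python
import torch.nn as nn

class PEFTEnhancedLayer(nn.Module):
    """h_l = LayerNorm(h_{l-1} + a_l + f_l + sum_k m^(k)(h_{l-1}))."""
    def __init__(self, attn: nn.Module, ffn: nn.Module, d: int, peft_modules=()):
        super().__init__()
        self.attn, self.ffn = attn, ffn           # frozen backbone sublayers
        self.norm = nn.LayerNorm(d)
        self.peft = nn.ModuleList(peft_modules)   # empty list => no module inserted

    def forward(self, h):
        delta = sum(m(h) for m in self.peft) if len(self.peft) else 0.0
        return self.norm(h + self.attn(h) + self.ffn(h) + delta)
```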

Table 2 compares typical parameter costs.

| Method | Added params $\Delta\lvert\theta\rvert$ | Location |
| -------------- | ------------------------------ | ------------------ |
| Prompt Tuning | $n \cdot d$ | Input embedding |
| Prefix Tuning | $L \cdot 2nd$ | Attention |
| LoRA | $2rd$ | Linear projections |
| Adapter | $L \cdot 2 d\, d_h$ | FFN/Attention |
| MoFE ($K$ experts) | $K \cdot P_{ex}$ (frozen) | FFN/MoE block |
| PERFT | $M \cdot 2Dr + D \cdot M$ | MoE parallel |
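
As a worked example of Table 2, the snippet below evaluates several of the formulas for assumed sizes ($d = 4096$, $L = 32$ layers, $n = 20$ prompt tokens, LoRA rank $r = 8$, adapter bottleneck $d_h = 64$); the numbers are illustrative, not benchmarks from the cited papers.

```python
d, L, n, r, d_h = 4096, 32, 20, 8, 64

prompt_tuning = n * d              # 81,920
prefix_tuning = L * 2 * n * d      # 5,242,880
lora_per_proj = 2 * r * d          # 65,536 per adapted projection
adapter       = L * 2 * d * d_h    # 16,777,216

for name, p in [("prompt", prompt_tuning), ("prefix", prefix_tuning),
                ("lora/proj", lora_per_proj), ("adapter", adapter)]:
    print(f"{name:>10}: {p:,} added params")
```

Even the heaviest of these is well under 1% of a multi-billion-parameter backbone, which is the point of the accounting.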

4. Module Composition, Reusability, and Domain Generalization

A key property is composability: modular PEFT architectures support merging, interpolation, and weighted gating of independently fine-tuned modules:

  • Module Summation & Convex Combination:

For $N$ adaptation modules (e.g., one per domain): $\theta_C = \sum_{i=1}^{N} \lambda_i\,\theta_i$ with $\sum_i \lambda_i = 1$ (Patel et al., 24 Jan 2025); a merging sketch is given at the end of this section.

  • Block-wise, Element-wise Gating:

$\theta_C[j] = \sum_{i=1}^{N} g_i[j]\,\theta_i[j]$, with $g_i[j] \in [0,1]$.

  • Plug-and-Play Multi-domain Assembly:

Modular repositories track PEMs/versioned adapters by domain, base model checkpoint, and method. PEMs are dynamically composed at inference for composite tasks (Patel et al., 24 Jan 2025, Hadji-Kyriacou et al., 2023).

The compositional design allows the same backbone to power distinct tasks, multi-domain generalization, or federated updates via local adapters and central aggregation (Chua et al., 2023). A shared subspace structure across modules enables summation without additional fine-tuning while preserving each module’s directional biases.
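
A minimal sketch of the convex-combination merge above, operating directly on adapter state dicts; it assumes all modules share identical shapes (same method, same base checkpoint), which is what the shared-subspace argument requires.

```python
def merge_adapters(state_dicts, lambdas):
    """theta_C = sum_i lambda_i * theta_i, with sum_i lambda_i = 1."""
    assert abs(sum(lambdas) - 1.0) < 1e-6, "weights must form a convex combination"
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(lam * sd[key] for lam, sd in zip(lambdas, state_dicts))
    return merged

# e.g., an equal-weight merge of three hypothetical domain adapters:
# merged = merge_adapters([sd_news, sd_legal, sd_bio], [1/3, 1/3, 1/3])
```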

5. Efficiency Analysis and Trade-off Practices

The architecture supports systematic accounting of memory, parameter, and compute efficiency:

  • Parameter Count:

$P_{\mathrm{total}} = P_{\mathrm{base}} + \sum_{\ell \in S} |\phi_\ell|$

  • Memory Footprint:

$\Delta M \approx \Delta|\theta| \times w$, where $w$ is the parameter word size in bytes (e.g., 2 for FP16) (Sabry et al., 2023, Prottasha et al., 19 Apr 2025).

  • Training Speed:

Training time scales as $O(\text{forward/backward FLOPs in the PEFT modules})$. Many methods reduce training time by 50–70% or more relative to full fine-tuning, with only a ∼2–3 point accuracy drop; the MoFE results capture this trade-off directly (Seo et al., 9 Mar 2025).

  • Composite Efficiency Metric:

TPME: a weighted norm over {training time, trainable parameters, GPU memory} (Fu et al., 2024).
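
The bookkeeping above can be collapsed into a few helpers. The TPME weights and normalizers below are placeholders, since only the “weighted norm” form is specified here; the actual metric follows Fu et al. (2024).

```python
def added_memory_bytes(delta_params: int, wordsize: int = 2) -> int:
    """Delta M ~ Delta|theta| * wordsize (2 bytes for FP16, 4 for FP32)."""
    return delta_params * wordsize

def composite_efficiency(train_time_s, n_params, gpu_mem_gb,
                         weights=(1/3, 1/3, 1/3), norms=(3600.0, 1e7, 40.0)):
    """TPME-style weighted score over normalized {time, params, memory};
    lower is better. Weights and normalizers are illustrative placeholders."""
    vals = (train_time_s / norms[0], n_params / norms[1], gpu_mem_gb / norms[2])
    return sum(w * v for w, v in zip(weights, vals))

print(added_memory_bytes(5_242_880) / 2**20, "MiB for prefix tuning in FP16")
```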


6. Mixture-of-Experts, Contextual, Multimodal, and Federated Extensions

Advanced modular PEFT architectures generalize to:

  • MoE and Sparse Routing:

MoFE and PERFT instantiate mixtures of frozen (domain) experts routed by parameter-efficient gates. Adapter mixtures (PERFT) are more efficient in MoE LLMs than MoE-agnostic LoRA, especially with token-wise soft top-$K$ selection (Seo et al., 9 Mar 2025, Liu et al., 4 Aug 2025, Liu et al., 2024).

  • Context-Aware and Multi-Modal PEFT:

Context-PEFT injects parallel context-specific adapters for each token domain (modality, task, semantic role), replacing single-module updates with $C$-way selection (Hadji-Kyriacou et al., 2023); a per-token selection sketch is given at the end of this section.

  • Multi-modal/Decoupled Frameworks:

IISAN decouples adaptation into separate intra-modal and inter-modal towers, drastically reducing GPU/memory cost compared to embedded fusion (Fu et al., 2024).

  • Federated/Privacy-Preserving Patterns:

FedPEAT combines centralized backbone, distributed adapter fine-tuning, and optional emulation (distilled or compressed base models), orchestrated by RL-informed resource control (Chua et al., 2023).

These modular extensions share the backbone-PEFT interface, router/gating schema, and compositional layer, enabling seamless scaling across distributed, heterogeneous, or multi-functional environments.
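
A sketch of the $C$-way, per-token adapter selection in the spirit of Context-PEFT; the bottleneck-adapter form and the integer context labels are illustrative assumptions, not the cited paper’s exact design.

```python
import torch
import torch.nn as nn

class ContextAdapters(nn.Module):
    """One lightweight adapter per context; each token is routed by its context id."""
    def __init__(self, d: int, num_contexts: int, bottleneck: int = 32):
        super().__init__()
        self.adapters = nn.ModuleList(
            nn.Sequential(nn.Linear(d, bottleneck), nn.GELU(), nn.Linear(bottleneck, d))
            for _ in range(num_contexts)
        )

    def forward(self, h, ctx_ids):
        # h: (batch, seq, d); ctx_ids: (batch, seq) integer context labels
        delta = torch.zeros_like(h)
        for c, adapter in enumerate(self.adapters):
            mask = (ctx_ids == c).unsqueeze(-1)       # (batch, seq, 1)
            delta = delta + mask * adapter(h)         # apply adapter c to its tokens
        return h + delta                              # parallel residual update
```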

7. Implementation and Benchmarking Frameworks

Modern modular reference architectures (e.g., PEFT-Factory) formalize interfaces and workflows:

  • Core Modules: PEFT methods registry, dataset loaders, base model loader, metrics/evaluators (Belanec et al., 2 Dec 2025).
  • Interface Standards (a simplified sketch follows this list):
    • PeftConfig (hyperparameter dataclass)
    • BaseTuner (module/adapter instantiation and forward logic)
    • Registry/Plugin architecture for custom method addition
    • Command-line/YAML configuration for instantiation and reproducibility
  • Parameter/Memory Formulas:
    • Overhead: $N_{\mathrm{peft}} / N_{\mathrm{total}} \times 100\%$
    • Memory estimate: $M_{\mathrm{total}} + M_{\mathrm{peft}} \approx 4\,(N_{\mathrm{total}} + N_{\mathrm{peft}})$ bytes (FP32)
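
A schematic of the interface pattern, with the two formulas as helpers; `PeftConfigSketch`, `register_peft`, and `LoRATunerSketch` are simplified stand-ins, not the actual PEFT-Factory API.

```python
from dataclasses import dataclass

PEFT_REGISTRY = {}                       # name -> tuner class (plugin registration)

def register_peft(name):
    def deco(cls):
        PEFT_REGISTRY[name] = cls
        return cls
    return deco

@dataclass
class PeftConfigSketch:                  # simplified stand-in for a PeftConfig dataclass
    method: str = "lora"
    rank: int = 8
    target_slots: tuple = ("q_proj", "v_proj")

@register_peft("lora")
class LoRATunerSketch:                   # stand-in for a BaseTuner subclass
    def __init__(self, config: PeftConfigSketch):
        self.config = config

def overhead_pct(n_peft: int, n_total: int) -> float:
    return n_peft / n_total * 100.0      # N_peft / N_total * 100%

def fp32_memory_bytes(n_total: int, n_peft: int) -> int:
    return 4 * (n_total + n_peft)        # ~4 bytes per FP32 parameter
```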

The modular PEFT reference design thus guarantees, through clear definition of slots, module contracts, and efficiency metrics, a scalable and extensible substrate for future PEFT research and applications across evolving model and domain frontiers.
