
Multi-Prompt Architectures

Updated 17 February 2026
  • Multi-Prompt Architectures are compositional frameworks that use source-specific soft prompts to dynamically adapt transformer models to diverse tasks without re-tuning core parameters.
  • They enable scalable, privacy-preserving model construction by training each prompt independently on partitioned data, supporting federated, continual, and regulated learning environments.
  • Empirical results demonstrate near-paragon accuracy with efficient inference through modular prompt composition and structured attention, outperforming naive concatenation methods.

Multi-prompt architectures are compositional frameworks that enable a transformer-based model to process multiple, source-specific soft prompts simultaneously, allowing dynamic adaptation to varying task, domain, or user requirements at inference by on-the-fly prompt assembly. Each prompt is tuned on an isolated subset of data, and the architecture's core principle is modularity and composability—distinct prompts are trained independently and can be arbitrarily combined into a multi-prompt model for downstream prediction without re-tuning the backbone parameters or other prompts. This approach supports scalable, privacy-preserving, and customizable model construction, facilitating practical adaptation in federated, continual, or regulated learning environments.

1. Problem Setting and Model Structure

In the À-la-carte Prompt Tuning (APT) paradigm, a frozen vision transformer backbone $f_\theta$ (e.g., ViT-B/16 pretrained on a large dataset) forms the shared foundation. The dataset is partitioned into $n$ non-overlapping sources $\mathcal{D} = \{D_1, \ldots, D_n\}$, where each $D_i \subset \mathcal{X} \times \mathcal{Y}$. For each source $D_i$, a dedicated soft prompt $p^{(i)} \in \mathbb{R}^{L_p \times d}$ and a small classifier head $\mathrm{head}_i$ are learned through prompt tuning. These prompt modules are lightweight (typically one "prompt token" per layer plus a small set of learnable "memory tokens") and independent: training can be performed separately per $D_i$ on disjoint hardware or schedules.
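As a concrete sketch, each source's trainable state can be packaged as an independent module. The dimensions mirror ViT-B/16, but the four-memory-token choice and the 10-way head are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

D_MODEL = 768    # ViT-B/16 hidden width
N_LAYERS = 12    # ViT-B/16 depth

def init_prompt_module(rng, n_classes, n_mem=4):
    """Independently trainable state for one source D_i: a per-layer soft
    prompt token, per-layer memory tokens, and a small classifier head."""
    return {
        "prompt": 0.02 * rng.standard_normal((N_LAYERS, 1, D_MODEL)),
        "memory": 0.02 * rng.standard_normal((N_LAYERS, n_mem, D_MODEL)),
        "head":   0.02 * rng.standard_normal((D_MODEL, n_classes)),
    }

rng = np.random.default_rng(0)
modules = [init_prompt_module(rng, n_classes=10) for _ in range(3)]  # n = 3 sources
```

Because each module is a self-contained dictionary, it can be checkpointed, shipped, or deleted without touching the backbone or any other module.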

At inference, users select any subset $I \subseteq \{1, \ldots, n\}$ according to application needs or data access constraints. The corresponding prompts $\{p^{(i)} \mid i \in I\}$ are concatenated to form a composite input prompt, which is injected alongside the standard input tokens and forwarded through the backbone to obtain prediction logits from each participating prompt head.

2. Mathematical Formulation, Prompt Composition, and Attention Structure

Each prompt module and head is trained on its corresponding source $D_i$ by minimizing the standard supervised loss on that subset:

$$L_{D_i}(p^{(i)}, \mathrm{head}_i) = \sum_{(x, y) \in D_i} \ell\left(\mathrm{softmax}\big(\mathrm{head}_i(p_L^{(i)}(x))\big), y\right)$$

where the backbone layers recursively combine patch- and class-token embeddings $z_0(x)$ with the prompt $p^{(i)}$ as they pass through the $L$ transformer layers:

$$[z_L(x), p_L^{(i)}(x)] = F^L_\theta \circ \cdots \circ F^1_\theta([z_0(x), p^{(i)}])$$
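The per-source objective can be evaluated numerically once the backbone has produced the final prompt representation. In this sketch, a random NumPy batch stands in for the frozen backbone's output $p_L^{(i)}(x)$, and the tiny dimensions are illustrative assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def source_loss(p_L, head_w, labels):
    """Cross-entropy of head_i applied to the final prompt representation
    p_L^{(i)}(x) for a batch from source D_i. Shapes: p_L (B, d), head_w (d, C)."""
    probs = softmax(p_L @ head_w)
    return -np.log(probs[np.arange(len(labels)), labels]).mean()

# Toy batch standing in for backbone outputs on one source D_i.
rng = np.random.default_rng(0)
p_L = rng.standard_normal((8, 16))      # final-layer prompt token per example
head_w = rng.standard_normal((16, 5))   # 5-way head for this source
labels = rng.integers(0, 5, size=8)
loss = source_loss(p_L, head_w, labels)
```

Only `p_L`'s gradient path (through the prompt and memory tokens) and `head_w` would be updated during tuning; the backbone weights $\theta$ stay frozen.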

For a given subset $I=\{i_1,\ldots,i_k\}$ at inference, prompts are concatenated, $p^{(I)} = [p^{(i_1)} \,\|\, p^{(i_2)} \,\|\, \cdots \,\|\, p^{(i_k)}]$, and the attention mechanism is restructured to enforce modularity. Backbone tokens $z_\ell$ attend only to themselves, prompt tokens $p_\ell^{(i)}$ attend only to $z_\ell$ and their own per-layer memory tokens $m_\ell^{(i)}$, and memory tokens do not attend to anything. This structured attention mask ensures the independence and isolation of prompts at the attention level, eliminating representational interference.
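The structured mask can be built directly from the attention rules above. The token layout (backbone tokens first, then each prompt followed by its memory tokens) is an illustrative assumption; `M[q, k] = True` means query `q` may attend to key `k`:

```python
import numpy as np

def apt_attention_mask(n_backbone, prompt_sizes, mem_sizes):
    """Boolean attention mask for one composed layer.
    Layout: [backbone | prompt_1 | mem_1 | ... | prompt_k | mem_k]."""
    total = n_backbone + sum(prompt_sizes) + sum(mem_sizes)
    M = np.zeros((total, total), dtype=bool)
    # Backbone tokens attend only to backbone tokens.
    M[:n_backbone, :n_backbone] = True
    off = n_backbone
    for p, m in zip(prompt_sizes, mem_sizes):
        # Prompt tokens attend to backbone tokens and their own memory tokens.
        M[off:off + p, :n_backbone] = True
        M[off:off + p, off + p:off + p + m] = True
        # Memory-token rows stay all-False: memory tokens attend to nothing.
        off += p + m
    return M

# Two composed prompts, each with one prompt token and two memory tokens.
mask = apt_attention_mask(n_backbone=4, prompt_sizes=[1, 1], mem_sizes=[2, 2])
```

Note that no prompt's row ever reaches another prompt's columns, which is exactly the isolation property that lets independently trained prompts compose without interference.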

Output predictions are aggregated by simple averaging:

$$\hat{y}_I = \frac{1}{k} \sum_{i \in I} \mathrm{softmax}\big(\mathrm{head}_i(p_L^{(i)})\big)$$

An optional weighting scheme (APT-W) further adjusts this aggregation using similarity-based distances to source-specific prototypes.
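The aggregation step is a one-liner over per-prompt logits. The optional `weights` argument below stands in for the APT-W variant under the assumption that the prototype-distance weights have already been computed:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def aggregate_predictions(logits_per_prompt, weights=None):
    """Average per-prompt softmax outputs (uniform APT aggregation).
    If weights are given, use a normalized weighted average instead."""
    probs = np.stack([softmax(l) for l in logits_per_prompt])
    if weights is None:
        return probs.mean(axis=0)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return np.tensordot(w, probs, axes=1)

# Two prompt heads voting over three classes.
logits = [np.array([2.0, 0.5, -1.0]), np.array([0.0, 1.0, 0.0])]
y_hat = aggregate_predictions(logits)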

3. Training, Modularity, and Theoretical Guarantees

Prompt training in APT is strictly decoupled: each $(p^{(i)}, \mathrm{head}_i)$ is fit only on $D_i$ and can be checkpointed, updated, or discarded in isolation. No raw data is exchanged between sources. This à-la-carte composability guarantees that a composed multi-prompt model $f(x; I)$ depends strictly on $\bigcup_{i \in I} D_i$, supporting rigorous access control and privacy constraints.

Theoretical analysis demonstrates that, owing to structured attention and non-interacting prompt pathways, inference complexity scales as $O(N^2 + (N + d_{\text{mem}})|I|)$, significantly more efficient than traditional ensemble approaches, which incur $O(|I| N^2)$ computation.
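A back-of-envelope comparison makes the scaling claim concrete. Here $N = 197$ (ViT-B/16 patch plus class tokens at 224×224) and $d_{\text{mem}} = 4$ are illustrative assumptions:

```python
# Cost proportional to O(N^2 + (N + d_mem)|I|) for APT with structured
# attention versus O(|I| N^2) for a naive ensemble of |I| prompted models.
N, d_mem = 197, 4
for k in (1, 10, 100):
    apt_cost = N**2 + (N + d_mem) * k
    ensemble_cost = k * N**2
    print(f"|I|={k:4d}  APT={apt_cost:9d}  ensemble={ensemble_cost:9d}  "
          f"ratio~{ensemble_cost / apt_cost:.1f}x")
```

Because the quadratic term is paid once rather than once per prompt, the gap between the two widens roughly linearly in $|I|$.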

4. Empirical Results and Performance Analysis

Empirical validation across fine-grained visual recognition (multiple vision datasets), sharding/forgetting scenarios, continual learning benchmarks, and class/domain incremental learning confirms that APT multi-prompt models:

  • Achieve accuracy within $5\%$ of a "paragon" model trained on the union of the selected data sources.
  • Match or outperform naive concatenation or averaging approaches, which can degrade performance by over 10 points relative to the paragon, as documented in Table 1 and Fig. 2.
  • Support dynamic forgetting by deletion of a prompt and associated head, with only minor decreases in performance even after sequentially removing half the prompts.

On continual learning benchmarks such as Split CIFAR-100 and CORe50, APT achieves state-of-the-art results without replay buffers. Prompt tuning with memory tokens nearly matches full backbone fine-tuning and consistently outperforms head-only tuning in out-of-domain tasks.

5. Modularity, Privacy, and Practical Implications

A central advantage of APT is full modularity: each participant in a federated, organizational, or privacy-sensitive environment trains only on their own data $D_i$ and distributes only their $(p^{(i)}, \mathrm{head}_i)$ modules, never raw data. Model customization is immediate: users select which modules to compose at inference without needing to retrain global parameters. This property enables a variety of advanced use cases:

  • Privacy-preserving APIs: restrict model predictions to authorized datasets.
  • Dynamic unlearning: remove a data source from the model by deleting its prompt.
  • Versioned and customizable models: select subsets $I$ for on-demand specialization.
  • Large-scale model marketplaces: lightweight prompt "apps" composed per user or task.
  • Efficient storage and update: each prompt is $<0.06\%$ of the backbone parameters.

6. Limitations and Future Directions

While APT's strict independence between prompts provides strong modularity and practical efficiency, it also precludes prompt-to-prompt interaction at inference, which can limit synergy between sources and may cost a few percentage points on challenging out-of-domain generalization tasks. Model quality is contingent on the backbone pretraining: a misaligned or poorly pretrained encoder can degrade performance when many prompts are composed. Unlike classical ensembles, deep feature fusion across prompts is not realized beyond the scope of memory tokens.

A plausible implication is that hybrid architectures involving limited, learnable inter-prompt communication or group-wise adaptation could ameliorate these limitations, albeit at a modest cost to strict modularity.

7. Summary Table: Core APT Workflow

Design Step           | Description
Data partitioning     | Split the dataset into $n$ sources $\{D_i\}$
Prompt training       | Fit $p^{(i)}, \mathrm{head}_i$ separately for each $D_i$
Structured attention  | Masked attention per prompt, with dedicated memory tokens for each
Prompt composition    | Concatenate selected prompts $\{p^{(i)}\}$ at inference for $I \subseteq [n]$
Prediction ensembling | Aggregate per-prompt predictions, optionally with weighting
Dynamic (un)learning  | Add/remove $(p^{(i)}, \mathrm{head}_i)$ modules without further backbone adaptation

APT establishes a practical foundation for scalable, modular, and privacy-preserving model construction in vision transformers, with empirical robustness and theoretical guarantees for multi-prompt inference (Bowman et al., 2023).
