
Multi-Prompt Architectures

Updated 17 February 2026
  • Multi-Prompt Architectures are compositional frameworks that use source-specific soft prompts to dynamically adapt transformer models to diverse tasks without re-tuning core parameters.
  • They enable scalable, privacy-preserving model construction by training each prompt independently on partitioned data, supporting federated, continual, and regulated learning environments.
  • Empirical results demonstrate near-paragon accuracy with efficient inference through modular prompt composition and structured attention, outperforming naive concatenation methods.

Multi-prompt architectures are compositional frameworks that enable a transformer-based model to process multiple, source-specific soft prompts simultaneously, allowing dynamic adaptation to varying task, domain, or user requirements at inference by on-the-fly prompt assembly. Each prompt is tuned on an isolated subset of data, and the architecture's core principle is modularity and composability—distinct prompts are trained independently and can be arbitrarily combined into a multi-prompt model for downstream prediction without re-tuning the backbone parameters or other prompts. This approach supports scalable, privacy-preserving, and customizable model construction, facilitating practical adaptation in federated, continual, or regulated learning environments.

1. Problem Setting and Model Structure

In the À-la-carte Prompt Tuning (APT) paradigm, a frozen vision transformer backbone $f_\theta$ (e.g., ViT-B/16 pretrained on a large dataset) forms the shared foundation. The dataset is partitioned into $n$ non-overlapping sources $\mathcal{D} = \{D_1, \ldots, D_n\}$, where each $D_i \subset \mathcal{X} \times \mathcal{Y}$. For each source $D_i$, a dedicated soft prompt $p^{(i)} \in \mathbb{R}^{L_p \times d}$ and a small classifier head $\mathrm{head}_i$ are learned through prompt tuning. These prompt modules are lightweight (typically one "prompt token" per layer plus a small set of learnable "memory tokens") and independent: training can be performed separately per $D_i$ on disjoint hardware or schedules.
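As a concrete sketch, each source's trainable state can be packaged as an independent module. The dimensions mirror ViT-B/16, but the four-memory-token choice and the 10-way head are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

D_MODEL = 768    # ViT-B/16 hidden width
N_LAYERS = 12    # ViT-B/16 depth

def init_prompt_module(rng, n_classes, n_mem=4):
    """Independently trainable state for one source D_i: a per-layer soft
    prompt token, per-layer memory tokens, and a small classifier head."""
    return {
        "prompt": 0.02 * rng.standard_normal((N_LAYERS, 1, D_MODEL)),
        "memory": 0.02 * rng.standard_normal((N_LAYERS, n_mem, D_MODEL)),
        "head":   0.02 * rng.standard_normal((D_MODEL, n_classes)),
    }

rng = np.random.default_rng(0)
modules = [init_prompt_module(rng, n_classes=10) for _ in range(3)]  # n = 3 sources
```

Because each module is a self-contained dictionary, it can be checkpointed, shipped, or deleted without touching the backbone or any other module.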

At inference, users select any subset $I \subseteq \{1, \ldots, n\}$ according to application needs or data access constraints. The corresponding prompts $\{p^{(i)} \mid i \in I\}$ are concatenated to form a composite input prompt, which is injected alongside the standard input tokens and forwarded through the backbone to obtain prediction logits from each participating prompt head.

2. Mathematical Formulation, Prompt Composition, and Attention Structure

Each prompt module and head is trained on its corresponding source $D_i$ by minimizing the standard supervised loss on that subset:

$$L_{D_i}(p^{(i)}, \mathrm{head}_i) = \sum_{(x, y) \in D_i} \ell\left(\mathrm{softmax}\big(\mathrm{head}_i(p_L^{(i)}(x))\big), y\right)$$

where the backbone layers recursively combine patch- and class-token embeddings $z_0(x)$ with the prompt $p^{(i)}$ as they pass through the $L$ transformer layers:

$$[z_L(x), p_L^{(i)}(x)] = F^L_\theta \circ \cdots \circ F^1_\theta([z_0(x), p^{(i)}])$$
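The per-source objective can be evaluated numerically once the backbone has produced the final prompt representation. In this sketch, a random NumPy batch stands in for the frozen backbone's output $p_L^{(i)}(x)$, and the tiny dimensions are illustrative assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def source_loss(p_L, head_w, labels):
    """Cross-entropy of head_i applied to the final prompt representation
    p_L^{(i)}(x) for a batch from source D_i. Shapes: p_L (B, d), head_w (d, C)."""
    probs = softmax(p_L @ head_w)
    return -np.log(probs[np.arange(len(labels)), labels]).mean()

# Toy batch standing in for backbone outputs on one source D_i.
rng = np.random.default_rng(0)
p_L = rng.standard_normal((8, 16))      # final-layer prompt token per example
head_w = rng.standard_normal((16, 5))   # 5-way head for this source
labels = rng.integers(0, 5, size=8)
loss = source_loss(p_L, head_w, labels)
```

Only `p_L`'s gradient path (through the prompt and memory tokens) and `head_w` would be updated during tuning; the backbone weights $\theta$ stay frozen.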

For a given subset $I=\{i_1,\ldots,i_k\}$ at inference, prompts are concatenated, $p^{(I)} = [p^{(i_1)} \,\|\, p^{(i_2)} \,\|\, \cdots \,\|\, p^{(i_k)}]$, and the attention mechanism is restructured to enforce modularity. Backbone tokens $z_\ell$ attend only to themselves, prompt tokens $p_\ell^{(i)}$ attend only to $z_\ell$ and their own per-layer memory tokens $m_\ell^{(i)}$, and memory tokens do not attend to anything. This structured attention mask ensures the independence and isolation of prompts at the attention level, eliminating representational interference.
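The structured mask can be built directly from the attention rules above. The token layout (backbone tokens first, then each prompt followed by its memory tokens) is an illustrative assumption; `M[q, k] = True` means query `q` may attend to key `k`:

```python
import numpy as np

def apt_attention_mask(n_backbone, prompt_sizes, mem_sizes):
    """Boolean attention mask for one composed layer.
    Layout: [backbone | prompt_1 | mem_1 | ... | prompt_k | mem_k]."""
    total = n_backbone + sum(prompt_sizes) + sum(mem_sizes)
    M = np.zeros((total, total), dtype=bool)
    # Backbone tokens attend only to backbone tokens.
    M[:n_backbone, :n_backbone] = True
    off = n_backbone
    for p, m in zip(prompt_sizes, mem_sizes):
        # Prompt tokens attend to backbone tokens and their own memory tokens.
        M[off:off + p, :n_backbone] = True
        M[off:off + p, off + p:off + p + m] = True
        # Memory-token rows stay all-False: memory tokens attend to nothing.
        off += p + m
    return M

# Two composed prompts, each with one prompt token and two memory tokens.
mask = apt_attention_mask(n_backbone=4, prompt_sizes=[1, 1], mem_sizes=[2, 2])
```

Note that no prompt's row ever reaches another prompt's columns, which is exactly the isolation property that lets independently trained prompts compose without interference.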

Output predictions are aggregated by simple averaging:

$$\hat{y}_I = \frac{1}{k} \sum_{i \in I} \mathrm{softmax}\big(\mathrm{head}_i(p_L^{(i)})\big)$$

An optional weighting scheme (APT-W) further adjusts this aggregation using similarity-based distances to source-specific prototypes.
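The aggregation step is a one-liner over per-prompt logits. The optional `weights` argument below stands in for the APT-W variant under the assumption that the prototype-distance weights have already been computed:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def aggregate_predictions(logits_per_prompt, weights=None):
    """Average per-prompt softmax outputs (uniform APT aggregation).
    If weights are given, use a normalized weighted average instead."""
    probs = np.stack([softmax(l) for l in logits_per_prompt])
    if weights is None:
        return probs.mean(axis=0)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return np.tensordot(w, probs, axes=1)

# Two prompt heads voting over three classes.
logits = [np.array([2.0, 0.5, -1.0]), np.array([0.0, 1.0, 0.0])]
y_hat = aggregate_predictions(logits)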

3. Training, Modularity, and Theoretical Guarantees

Prompt training in APT is strictly decoupled: each $(p^{(i)}, \mathrm{head}_i)$ is fit only on $D_i$ and can be checkpointed, updated, or discarded in isolation. No raw data is exchanged between sources. This à-la-carte composability guarantees that a composed multi-prompt model $f(x; I)$ depends strictly on $\bigcup_{i \in I} D_i$, supporting rigorous access control and privacy constraints.

Theoretical analysis demonstrates that, owing to structured attention and non-interacting prompt pathways, inference complexity scales as $O(N^2 + (N + d_{\text{mem}})|I|)$, significantly more efficient than traditional ensemble approaches, which incur $O(|I| N^2)$ computation.
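A back-of-envelope comparison makes the scaling claim concrete. Here $N = 197$ (ViT-B/16 patch plus class tokens at 224×224) and $d_{\text{mem}} = 4$ are illustrative assumptions:

```python
# Cost proportional to O(N^2 + (N + d_mem)|I|) for APT with structured
# attention versus O(|I| N^2) for a naive ensemble of |I| prompted models.
N, d_mem = 197, 4
for k in (1, 10, 100):
    apt_cost = N**2 + (N + d_mem) * k
    ensemble_cost = k * N**2
    print(f"|I|={k:4d}  APT={apt_cost:9d}  ensemble={ensemble_cost:9d}  "
          f"ratio~{ensemble_cost / apt_cost:.1f}x")
```

Because the quadratic term is paid once rather than once per prompt, the gap between the two widens roughly linearly in $|I|$.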

4. Empirical Results and Performance Analysis

Empirical validation across fine-grained visual recognition (multiple vision datasets), sharding/forgetting scenarios, continual learning benchmarks, and class/domain incremental learning confirms that APT multi-prompt models:

  • Achieve accuracy within $5\%$ of a "paragon" model trained on the union of the selected data sources.
  • Match or outperform naive concatenation or averaging approaches, which can degrade performance by over 10 points relative to the paragon, as documented in Table 1 and Fig. 2.
  • Support dynamic forgetting by deletion of a prompt and associated head, with only minor decreases in performance even after sequentially removing half the prompts.

On continual learning benchmarks such as Split CIFAR-100 and CORe50, APT achieves state-of-the-art results without replay buffers. Prompt tuning with memory tokens nearly matches full backbone fine-tuning and consistently outperforms head-only tuning in out-of-domain tasks.

5. Modularity, Privacy, and Practical Implications

A central advantage of APT is full modularity: each participant in a federated, organizational, or privacy-sensitive environment trains only on their own data $D_i$ and distributes only their $(p^{(i)}, \mathrm{head}_i)$ modules, never raw data. Model customization is immediate: users select which modules to compose at inference without needing to retrain global parameters. This property enables a variety of advanced use cases:

  • Privacy-preserving APIs: restrict model predictions to authorized datasets.
  • Dynamic unlearning: remove a data source from the model by deleting its prompt.
  • Versioned and customizable models: select subsets $I$ for on-demand specialization.
  • Large-scale model marketplaces: lightweight prompt "apps" composed per user or task.
  • Efficient storage and update: each prompt is $<0.06\%$ of the backbone parameters.

6. Limitations and Future Directions

While APT's strict independence between prompts provides strong modularity and practical efficiency, it also precludes prompt-to-prompt interaction at inference, which can limit synergy between sources and may cost a few percentage points on challenging out-of-domain generalization tasks. Model quality is contingent on the backbone pretraining: a misaligned or poorly pretrained encoder can degrade performance when many prompts are composed. Unlike classical ensembles, deep feature fusion across prompts is not realized beyond the scope of memory tokens.

A plausible implication is that hybrid architectures involving limited, learnable inter-prompt communication or group-wise adaptation could ameliorate these limitations, albeit at a modest cost to strict modularity.

7. Summary Table: Core APT Workflow

Design Step           | Description
Data partitioning     | Split the dataset into $n$ sources $\{D_i\}$
Prompt training       | Fit $p^{(i)}, \mathrm{head}_i$ separately for each $D_i$
Structured attention  | Masked attention per prompt, with dedicated memory tokens for each
Prompt composition    | Concatenate selected prompts $\{p^{(i)}\}$ at inference for $I \subseteq [n]$
Prediction ensembling | Aggregate per-prompt predictions, optionally with weighting
Dynamic (un)learning  | Add/remove $(p^{(i)}, \mathrm{head}_i)$ modules without further backbone adaptation

APT establishes a practical foundation for scalable, modular, and privacy-preserving model construction in vision transformers, with empirical robustness and theoretical guarantees for multi-prompt inference (Bowman et al., 2023).
