
Continual Alignment for SAM (CA-SAM)

Updated 28 November 2025
  • The paper introduces CA-SAM, which utilizes a task-specific lightweight alignment layer and a VAE-based routing mechanism to adapt SAM for medical segmentation while preventing catastrophic forgetting.
  • CA-SAM achieves state-of-the-art segmentation performance with superior Avg-IoU and reduced GFLOPs compared to other continual learning methods.
  • Empirical evaluations on nine diverse medical imaging datasets demonstrate robust continual adaptation and nearly zero degradation on out-of-distribution tasks.

Continual Alignment for SAM (CA-SAM) is a continual learning paradigm designed to adapt the Segment Anything Model (SAM) to streaming medical image segmentation tasks while effectively mitigating catastrophic forgetting and maintaining state-of-the-art segmentation performance under strict parameter and computational constraints. CA-SAM is centered on the introduction of a lightweight, task-specific Alignment Layer and a VAE-based task routing mechanism, enabling robust continual adaptation across highly heterogeneous domains without replay or fine-tuning of the core SAM backbone (Wang et al., 21 Nov 2025).

1. Framework Architecture and Forward Pass

CA-SAM builds on a frozen SAM foundation, composed of a Vision Transformer (ViT)-based encoder $E(\cdot)$ and a mask decoder $D(\cdot)$, neither of which is updated during continual learning. For each incoming segmentation task $t$, a lightweight, trainable Alignment Layer $A_t(\cdot)$ is inserted between the encoder and decoder. Only $A_t$ is updated for task $t$; this decoupling preserves SAM’s strong zero-shot priors and computational efficiency. The forward pass for an image $I$ on task $t$ is $Z = E(I)$, $\tilde{Z} = A_t(Z)$, $\hat{y} = D(\tilde{Z})$. Each task $t$ also maintains a dedicated variational autoencoder $V_t$ that explicitly models and scores the distribution of encoder features, enabling accurate, automatic task routing at inference.
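The frozen-backbone forward pass can be sketched as below. The `TinyEncoder`-style stand-ins and the `CASAMForward` class are illustrative, not SAM's real modules; only the alignment layer carries trainable parameters.

```python
import torch
import torch.nn as nn

class CASAMForward(nn.Module):
    """Sketch: frozen encoder E and decoder D with a per-task alignment layer A_t."""
    def __init__(self, encoder: nn.Module, decoder: nn.Module,
                 align_layers: nn.ModuleDict):
        super().__init__()
        self.encoder, self.decoder = encoder, decoder
        self.align_layers = align_layers          # one A_t per task id
        for p in self.encoder.parameters():       # E and D stay frozen
            p.requires_grad_(False)
        for p in self.decoder.parameters():
            p.requires_grad_(False)

    def forward(self, image: torch.Tensor, task_id: str) -> torch.Tensor:
        z = self.encoder(image)                   # Z = E(I)
        z_tilde = self.align_layers[task_id](z)   # Z~ = A_t(Z)
        return self.decoder(z_tilde)              # y^ = D(Z~)

# Tiny stand-in modules so the sketch runs end to end.
enc = nn.Conv2d(3, 8, 3, padding=1)
dec = nn.Conv2d(8, 1, 1)
aligners = nn.ModuleDict({"task0": nn.Conv2d(8, 8, 1)})
model = CASAMForward(enc, dec, aligners)
mask = model(torch.randn(1, 3, 32, 32), "task0")  # (1, 1, 32, 32) mask logits
```

Because gradients flow only through `align_layers`, training a new task never perturbs the backbone or earlier tasks' layers.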

2. Mathematical Formulation of the Alignment Layer

The Alignment Layer $A_t(\cdot)$ is realized as a compact stack of Channel-Attention Residual Blocks (CAResBlocks). Each block operates on the $C$-dimensional encoder feature at every spatial location, applying a trainable transformation, channel attention that reweights feature channels, and a residual connection that refines the input feature; stacking these blocks yields the transformed features $\tilde{Z} = A_t(Z)$. This architectural minimalism yields effective feature alignment while drastically reducing the parameter and compute overhead normally associated with full SAM adaptation.
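A minimal sketch of such a block follows. The paper's exact formulation is not reproduced here; this version uses a common squeeze-and-excitation-style channel attention plus a residual connection as one plausible instantiation.

```python
import torch
import torch.nn as nn

class CAResBlock(nn.Module):
    """Sketch of a Channel-Attention Residual Block (illustrative, not the paper's exact block)."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.transform = nn.Conv2d(channels, channels, 3, padding=1)
        self.attn = nn.Sequential(                 # channel attention (SE-style)
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        h = self.transform(z)
        return z + h * self.attn(h)               # residual + channel reweighting

# An alignment layer A_t is then a compact stack of such blocks.
align_t = nn.Sequential(CAResBlock(8), CAResBlock(8))
out = align_t(torch.randn(2, 8, 16, 16))          # output shape matches input
```

The residual path means an untrained block starts near the identity, so the frozen decoder still receives usable features early in training.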

3. Continual Learning Protocol and Task Routing

CA-SAM operates in a pure continual learning setting: tasks $t = 1, \dots, T$ arrive as a sequence of datasets $\mathcal{D}_1, \dots, \mathcal{D}_T$ with no access to prior data (no replay). For each task:

  • A unique alignment layer $A_t$ and VAE $V_t$ are instantiated and trained (with the SAM encoder and decoder frozen).
  • The VAE models a global, attention-weighted summary $g$ of the encoder features, obtained by softmax attention pooling over spatial locations.
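The attention pooling can be sketched as follows. The exact scoring rule is an assumption here: each spatial location is scored by its feature norm, so salient locations dominate the pooled summary.

```python
import torch

def attention_pool(z: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Parameter-free softmax attention pooling over spatial locations.

    z: (C, H, W) encoder feature map -> returns a (C,) global vector g.
    The norm-based scoring is an illustrative assumption, not the paper's rule.
    """
    c, h, w = z.shape
    flat = z.reshape(c, h * w)                    # (C, HW)
    scores = flat.norm(dim=0) / tau               # one score per spatial location
    alpha = torch.softmax(scores, dim=0)          # attention weights sum to 1
    return flat @ alpha                           # weighted sum -> (C,)

g = attention_pool(torch.randn(8, 16, 16), tau=0.5)
```

A spatially constant feature map pools back to that same constant vector, since all locations receive equal weight.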

At inference, for a test image $I$:

  1. Compute the encoder features and their pooled global summary $g$.
  2. For each task $t$, evaluate $V_t$’s ELBO score $s_t$ on $g$.
  3. Select the best-scoring task $t^* = \arg\max_t s_t$. If $s_{t^*}$ passes the threshold $\delta_{t^*}$ calibrated during training, use $A_{t^*}$; otherwise, apply the identity alignment, reverting to pure zero-shot SAM.
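The routing rule in steps 1–3 can be sketched as below, with toy stand-in scorers in place of trained VAEs. All names are illustrative, and the acceptance direction of the threshold (higher score = more in-distribution) is an assumption.

```python
import torch

def route(g: torch.Tensor, vaes: dict, thresholds: dict):
    """Sketch of CA-SAM-style routing (illustrative names and conventions).

    vaes: task id -> callable returning a likelihood-style score for g.
    thresholds: task id -> calibrated acceptance threshold.
    Returns the selected task id, or None for the identity alignment
    (pure zero-shot SAM) when no task's scorer claims the input.
    """
    scores = {t: v(g) for t, v in vaes.items()}
    best = max(scores, key=scores.get)            # best-scoring task t*
    return best if scores[best] >= thresholds[best] else None

# Toy scorers: each "VAE" scores proximity to its task prototype.
protos = {"acdc": torch.zeros(4), "polyp": torch.ones(4) * 3}
vaes = {t: (lambda g, p=p: -torch.dist(g, p).item()) for t, p in protos.items()}
thresholds = {"acdc": -2.0, "polyp": -2.0}

picked = route(torch.zeros(4), vaes, thresholds)            # -> "acdc"
fallback = route(torch.full((4,), 10.0), vaes, thresholds)  # -> None (zero-shot)
```

The fallback branch is what lets CA-SAM revert to vanilla SAM on out-of-distribution inputs instead of forcing a wrong task's alignment.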

4. Training, Inference, and Implementation Details

The CA-SAM training and inference workflow is summarized as follows:

Pseudocode Outline

  • For each task $t$:
    • Initialize $A_t$ (CAResBlock stack) and $V_t$ (MLP encoder/decoder).
    • Train $A_t$ by minimizing a standard segmentation loss (pixel-wise cross-entropy or Dice) on $\mathcal{D}_t$, freezing the SAM encoder and decoder.
    • Train $V_t$ on attention-pooled encoder outputs.
    • Calibrate the routing threshold $\delta_t$ as the 97th-percentile ELBO from $k$-fold cross-validation over $\mathcal{D}_t$.
  • At inference, compute $s_t$ for all tasks $t$ and select the appropriate $A_{t^*}$ or the identity alignment, as described above.
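The threshold-calibration step can be sketched as below. It assumes the routing score is treated as a per-image loss (negative ELBO), so the 97th percentile of held-out scores accepts roughly 97% of in-distribution data; the paper's exact score convention may differ.

```python
import numpy as np

def calibrate_threshold(val_scores, pct: float = 97.0) -> float:
    """Calibrate a task's routing threshold from held-out VAE scores.

    Assumes val_scores are per-image negative-ELBO losses collected via
    k-fold cross-validation on the task's own data; at test time an image
    whose loss stays below this cutoff is routed to the task's A_t.
    """
    return float(np.percentile(np.asarray(val_scores), pct))

# Losses from (hypothetical) cross-validation folds, concatenated.
fold_losses = [np.random.default_rng(f).normal(1.0, 0.1, 200) for f in range(3)]
delta = calibrate_threshold(np.concatenate(fold_losses))
```

Because the cutoff comes from each task's own validation scores, no out-of-distribution data is needed to set it.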

Efficiency Benchmarks

  • Alignment Layer: 3.54 M trainable parameters (smaller than most adapters).
  • Training cost: 514 GFLOPs per image, a 25% reduction over other adapter schemes.
  • Plug-and-play: $A_t$ is inserted directly between encoder and decoder with no modification to SAM code.
  • Key hyperparameters: Adam optimizer with separate learning rates for $A_t$ and $V_t$; batch size 6; 24 epochs for $A_t$ and 10 epochs for $V_t$; attention-pooling temperature $\tau$.

5. Experimental Setup and Main Results

CA-SAM is evaluated on a nine-dataset medical continual segmentation benchmark, with tasks covering modalities such as MR, CT, histopathology, endoscopy, and dental X-rays.

Datasets (in task order):

ACDC, EBHI-SEG, 56Nx, DN, Polyp, MSD_Prostate, MSD_Spleen, Promise12, STS-2D

Metrics:

IoU, Boundary IoU (BIoU), Last-IoU (post-stream), Avg-IoU, FF-IoU (average forgetting), evaluated both per-task and across the full sequence; additional zero-shot metrics over five out-of-distribution datasets.
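As an illustration of the forgetting metric, an FF-IoU-style score can be computed from a task-accuracy matrix. The paper's exact definition is not given here, so this sketch uses the standard average-forgetting formula.

```python
import numpy as np

def avg_forgetting(iou_matrix: np.ndarray) -> float:
    """Average forgetting over a task stream (standard definition, assumed).

    iou_matrix[i, j] = IoU on task j after training through task i
    (lower triangle, j <= i). Forgetting for task j is its best earlier
    IoU minus its final IoU, averaged over all but the last task.
    """
    n = iou_matrix.shape[0]
    drops = [iou_matrix[:n - 1, j].max() - iou_matrix[n - 1, j]
             for j in range(n - 1)]
    return float(np.mean(drops))

# Hypothetical 3-task stream: each row is the evaluation after one more task.
m = np.array([[0.80, 0.00, 0.00],
              [0.78, 0.75, 0.00],
              [0.76, 0.74, 0.70]])
f = avg_forgetting(m)   # (0.80-0.76 + 0.75-0.74) / 2 = 0.025
```

A value near zero means later tasks barely eroded earlier tasks' IoU, which is the regime CA-SAM targets.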

Main Results:

  • Single-dataset adaptation: CA-SAM achieves 80.15% Avg-IoU and 66.52% Avg-BIoU with the lowest parameter and FLOP overhead.
  • Continual learning (exemplar-free): CA-SAM attains the best Last-IoU and Avg-IoU and the lowest FF-IoU, outperforming classical continual learning methods (e.g., LwF, EWC, ER, DER, L2P, MoDA) and rivaling upper-bound joint training.
  • Zero-shot: CA-SAM retains nearly all of original SAM’s IoU on unadapted domains, minimizing out-of-distribution (OOD) degradation.
Method           Params    GFLOPs   Avg-IoU   Avg-BIoU
SAM zero-shot    0 M       –        55.08%    37.67%
Decoder-tuning   4.06 M    669.8    70.40%    53.86%
HQ-SAM           5.14 M    678.9    72.91%    58.41%
SAMMed2D         13.31 M   728.2    75.17%    58.97%
CA-SAM           3.54 M    514.3    80.15%    66.52%

6. Ablation Studies and Analysis

Multiple ablation studies empirically isolate the core contributions of CA-SAM:

  • Feature alignment: After applying $A_t$, the total-variation and Jensen–Shannon divergences between the aligned feature distributions and those of full fine-tuning baselines drop substantially.
  • Block depth: Increasing the number of CAResBlocks in $A_t$ improves Avg-IoU.
  • Pooling mechanism: Parameter-free attention pooling achieves the highest task-wise IoU/BIoU compared to global average/mean pooling or flattening/CLS-token methods.
  • VAE loss coefficient: Performance is stable across a range of coefficient values, ensuring consistent segmentation and OOD accuracy.
  • Task order robustness: CA-SAM’s Last-IoU varies only marginally across three random task orders, while other continual learning baselines fluctuate by several points.
  • Routing threshold: The 97th ELBO percentile is empirically optimal for balancing seen-task IoU and OOD accuracy.
  • Visualization: t-SNE projections show that $A_t$ produces well-clustered feature manifolds per dataset.

7. Relation to Other SAM Continual Adaptation Methods

CA-SAM’s design is distinguished by minimalistic but highly effective domain alignment and a robust, probabilistic routing mechanism:

  • Unlike RegCL (Shu et al., 16 Jul 2025), which merges LoRA adapter parameters to contain model size at a potential cost of “washed-out” domain representations, CA-SAM retains explicit per-task alignment layers, ensuring task-optimal adaptation without interference.
  • Compared to MoDA (Yang et al., 2024) and GFT/GAT-based selector pools, CA-SAM uses a VAE-based attention-pooled feature router, which obviates the need for auxiliary tokens and memory banks and reduces training complexity.
  • A plausible implication is that, while CA-SAM incurs parameter cost linear in the number of tasks (one alignment layer and VAE per task), its per-task FLOP and memory efficiency, together with its ability to recover zero-shot behaviour when no task distribution matches, offers a favorable trade-off for multi-institutional medical segmentation where data privacy precludes joint training.

Taken together, CA-SAM constitutes a distinct approach within the broader landscape of continual SAM adaptation: it achieves superior continual segmentation accuracy, nearly eliminates catastrophic forgetting, and maintains OOD generalization with extremely modest additional computational requirements (Wang et al., 21 Nov 2025).
