Validity of PID estimates when using clustering/projection for open-ended LVLM outputs

Determine whether PID estimates for large vision-language models remain valid when open-ended answers must be discretized, either by manually clustering the answers or by training auxiliary projection heads that map model representations into predefined clusters. Specifically, establish whether the resulting PID components (redundancy, vision uniqueness, language uniqueness, and synergy) primarily reflect the model's original end-to-end behavior or the behavior introduced by the added clustering or projection mapping.

Background

The paper adapts the BATCH estimator for partial information decomposition (PID) to analyze large vision-language models (LVLMs) and focuses on multiple-choice VQA because the estimator requires a finite, discrete target space. For open-ended generation, the workaround would be to discretize outputs, either by clustering answers or by adding projection heads that map model representations to predefined clusters.
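For orientation, the four PID components are tied together by the standard consistency equations of partial information decomposition. The notation below is an assumption introduced here for illustration ($X_V$ for the vision input, $X_L$ for the language input, $Y$ for the discrete target, and $R$, $U_V$, $U_L$, $S$ for redundancy, vision uniqueness, language uniqueness, and synergy); the paper may use different symbols.

$$
\begin{aligned}
I(X_V, X_L; Y) &= R + U_V + U_L + S, \\
I(X_V; Y) &= R + U_V, \\
I(X_L; Y) &= R + U_L.
\end{aligned}
$$

Every term on the left is a mutual information with the target $Y$, so replacing $Y$ with a clustered or projected version of the model output changes the quantities being decomposed, not only how they are estimated.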

The authors caution that these added components make PID estimates sensitive to clustering and projection hyperparameters. This raises a validity concern: it is unclear whether PID values computed under such modifications capture the model's intrinsic end-to-end behavior or artifacts of the added mapping. Resolving this uncertainty is a prerequisite for extending PID analyses to open-ended tasks without confounding the measurement.
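The sensitivity can be illustrated with a minimal sketch. It does not implement BATCH or a full PID; it uses a plug-in estimate of ordinary mutual information, the quantity PID decomposes, as a simpler stand-in. The toy samples and the two clusterings (`fine` and `coarse`) are hypothetical and chosen only to show that the same model outputs yield different information estimates depending on how answers are grouped.

```python
from collections import Counter
from math import log2


def mutual_information(pairs):
    """Plug-in estimate of I(X; Y) in bits from a list of (x, y) samples."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    # Sum over observed cells: p(x, y) * log2( p(x, y) / (p(x) p(y)) ).
    return sum(
        (c / n) * log2(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )


def discretize(samples, clustering):
    """Replace each open-ended answer with its cluster label."""
    return [(x, clustering[y]) for x, y in samples]


# Hypothetical toy data: (input condition, open-ended answer).
samples = [
    ("cat_photo", "a cat"), ("cat_photo", "kitten"), ("cat_photo", "a cat"),
    ("dog_photo", "a dog"), ("dog_photo", "puppy"), ("dog_photo", "a dog"),
    ("bird_photo", "a bird"), ("bird_photo", "sparrow"), ("bird_photo", "a dog"),
]

# Two hypothetical clusterings of the same answers.
fine = {
    "a cat": "cat", "kitten": "cat",
    "a dog": "dog", "puppy": "dog",
    "a bird": "bird", "sparrow": "bird",
}
coarse = {
    "a cat": "pet", "kitten": "pet",
    "a dog": "pet", "puppy": "pet",
    "a bird": "wild", "sparrow": "wild",
}

mi_fine = mutual_information(discretize(samples, fine))
mi_coarse = mutual_information(discretize(samples, coarse))

print(f"I(X; fine clusters)   = {mi_fine:.3f} bits")
print(f"I(X; coarse clusters) = {mi_coarse:.3f} bits")
```

Because each coarse label is a function of a fine label, the data processing inequality guarantees that the coarse estimate cannot exceed the fine one. The model's answers are identical in both runs; only the added mapping differs, which is the confound the authors describe.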

References

This focus is a deliberate methodological choice: BATCH requires a finite y, thus the natural set of choices in MC-VQA (e.g., {A, B, C, D}) allows for a clean analysis while avoiding the noisy and potentially biased process of manually clustering open-ended answers, or training auxiliary projection heads to map LVLM representations into pre-defined clusters. Such additional components make PID estimates sensitive to clustering and projection hyperparameters, making it unclear whether the estimated quantities primarily reflect the LVLM's original end-to-end behavior or the added mapping, which is not how these models are typically used.

A Comprehensive Information-Decomposition Analysis of Large Vision-Language Models (2603.29676 - Xiu et al., 31 Mar 2026) in Section 3.2 (A PID Estimation Framework for LVLMs)