Validity of PID estimates when using clustering/projection for open-ended LVLM outputs
Determine whether the components of a partial information decomposition (PID) of a large vision-language model (redundancy, vision uniqueness, language uniqueness, and synergy) primarily reflect the model's original end-to-end behavior or the behavior introduced by an added mapping, in the case where open-ended answers are manually clustered or auxiliary projection heads are trained to map model representations into predefined clusters.
References
This focus is a deliberate methodological choice: BATCH requires a finite y, so the natural set of choices in MC-VQA (e.g., {A, B, C, D}) allows for a clean analysis while avoiding the noisy and potentially biased process of manually clustering open-ended answers or training auxiliary projection heads to map LVLM representations into pre-defined clusters. Such additional components, which are not part of how these models are typically used, make PID estimates sensitive to clustering and projection hyperparameters, so it is unclear whether the estimated quantities primarily reflect the LVLM's original end-to-end behavior or the added mapping.
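The sensitivity described above can be illustrated with a minimal sketch under strong assumptions. The toy distribution, the cluster mappings, and all function names below are hypothetical and are not taken from the referenced work. The sketch also does not use BATCH: it swaps in the simpler minimum-mutual-information (MMI) redundancy, R = min(I(X1;Y), I(X2;Y)), only because it can be computed in closed form. A binary "vision" variable x1 and a binary "language" variable x2 jointly determine one of four open-ended answers, and the same model behavior is then decomposed under three different two-cluster groupings of those answers.

```python
from collections import defaultdict
from math import log2


def mutual_information(pairs):
    """I(A;B) in bits from a dict {(a, b): probability}."""
    pa, pb = defaultdict(float), defaultdict(float)
    for (a, b), p in pairs.items():
        pa[a] += p
        pb[b] += p
    return sum(p * log2(p / (pa[a] * pb[b]))
               for (a, b), p in pairs.items() if p > 0)


def pid_mmi(joint):
    """Decompose {(x1, x2, y): prob} into (R, U1, U2, S).

    Assumption: redundancy is the MMI measure min(I(X1;Y), I(X2;Y)),
    used here for illustration only; it is not the BATCH estimator.
    """
    p1, p2, p12 = defaultdict(float), defaultdict(float), defaultdict(float)
    for (x1, x2, y), p in joint.items():
        p1[(x1, y)] += p
        p2[(x2, y)] += p
        p12[((x1, x2), y)] += p
    i1, i2, i12 = (mutual_information(d) for d in (p1, p2, p12))
    r = min(i1, i2)
    u1, u2 = i1 - r, i2 - r
    s = i12 - r - u1 - u2
    return tuple(round(v, 6) for v in (r, u1, u2, s))


def cluster(joint, mapping):
    """Relabel each answer y by its cluster id and merge probabilities."""
    out = defaultdict(float)
    for (x1, x2, y), p in joint.items():
        out[(x1, x2, mapping[y])] += p
    return dict(out)


# Hypothetical toy model: uniform, independent x1 and x2; answer y = 2*x1 + x2.
toy_joint = {(x1, x2, 2 * x1 + x2): 0.25 for x1 in (0, 1) for x2 in (0, 1)}

# Three hypothetical ways to group the four answers into two clusters.
clusterings = {
    "A": {0: 0, 1: 0, 2: 1, 3: 1},
    "B": {0: 0, 2: 0, 1: 1, 3: 1},
    "C": {0: 0, 3: 0, 1: 1, 2: 1},
}

results = {name: pid_mmi(cluster(toy_joint, m))
           for name, m in clusterings.items()}

for name, (r, u1, u2, s) in results.items():
    print(f"clustering {name}: R={r} U_vision={u1} U_language={u2} S={s}")
```

In this toy setting, grouping A attributes the full bit to vision uniqueness, grouping B attributes it to language uniqueness, and grouping C attributes it to synergy, even though the underlying input-to-answer behavior never changes. This is only a constructed example, but it shows how the choice of mapping alone can determine which PID component dominates.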