Do deeper nonlinear decoders outperform the temporal-attention MLP on TVSD?

Determine whether more complex nonlinear architectures, such as deeper recurrent, convolutional, or transformer-based spatiotemporal networks, can achieve higher decoding performance than the proposed temporal-attention multilayer perceptron (MLP). The task maps 200 ms windows of primate multi-unit activity from the THINGS Ventral Stream Spiking Dataset to semantic image embeddings (e.g., CLIP) for visual decoding.

Background

The paper evaluates a range of architectures (linear models, MLPs, LSTMs, temporal CNNs) for decoding semantic image embeddings from high-density primate intracortical recordings and reports that a lightweight temporal-attention MLP consistently yields the best retrieval performance.

Despite these findings, the authors explicitly note that it remains uncertain whether more complex nonlinear architectures could surpass the reported results, leaving open the question of whether additional architectural complexity would yield further gains on this task and dataset.
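To make the architecture under comparison concrete, the temporal-attention pooling idea can be sketched as below. This is a minimal illustrative sketch, not the paper's implementation: the bin count, channel count, and the choice of computing attention logits as a dot product with a learned vector are all assumptions for illustration.

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def temporal_attention_pool(window, attn_w):
    """Collapse a (T time bins x C channels) spike-count window into one
    C-dim vector via attention weights over time bins.

    window: list of T lists, each of length C (binned multi-unit activity)
    attn_w: list of C values; the attention logit for bin t is the dot
            product of bin t's activity with attn_w (illustrative choice)
    """
    logits = [sum(x * w for x, w in zip(bin_t, attn_w)) for bin_t in window]
    alphas = softmax(logits)  # attention distribution over time bins
    T, C = len(window), len(window[0])
    pooled = [sum(alphas[t] * window[t][c] for t in range(T))
              for c in range(C)]
    return pooled, alphas

# Toy example: 10 bins spanning a 200 ms window, 4 channels (hypothetical sizes).
random.seed(0)
T, C = 10, 4
window = [[random.random() for _ in range(C)] for _ in range(T)]
attn_w = [1.0, 0.5, -0.5, 0.0]
pooled, alphas = temporal_attention_pool(window, attn_w)
# alphas sum to 1; in the decoding setup, `pooled` would feed a small MLP
# that regresses onto the semantic embedding of the presented image.
```

The deeper alternatives raised by the open question (recurrent, convolutional, or transformer-based spatiotemporal networks) would replace this single attention-pooling step with stacked layers that model interactions across both time bins and channels.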

References

"Fourth, we cannot rule out the possibility that more complex, nonlinear architectures might achieve higher decoding performance."

Simple Models, Rich Representations: Visual Decoding from Primate Intracortical Neural Signals (2601.11108, Ciferri et al., 16 Jan 2026), Limitations subsection of the Discussion.