- The paper introduces SDSRL, which leverages kernel approximation and linear projection techniques to bridge the heterogeneous gap in multimodal data.
- The methodology involves non-linear lifting to Hilbert space followed by optimized coordinate descent for semantic projection, ensuring computational efficiency on large datasets.
- Experiments on the Wikipedia and NUS-WIDE datasets demonstrate SDSRL's superior retrieval performance and stable mapping of semantic features across modalities.
The paper "Learning Discriminative Representations for Semantic Cross Media Retrieval" proposes a novel approach to the heterogeneous-gap challenge in multimodal learning: finding a unified framework in which intrinsic semantic representations can be compared efficiently across modalities. It introduces Shared Discriminative Semantic Representation Learning (SDSRL), which performs explicit linear semantic projection in Hilbert space to carry out semantic cross-media retrieval.
Methodology
Semantic Representation Learning
The core idea of SDSRL is to map multimodal data into a shared semantic space by combining feature lifting via kernel approximation with a linear semantic down-projection. Raw data is first lifted non-linearly into a high-dimensional Hilbert space using kernel methods; this lifted representation captures the discriminative structures inherent in multimodal datasets. The approximation keeps the computation tractable and avoids the infinite-dimensional complexity typically associated with Hilbert spaces.
Linear Semantic Down Projection
Once data is represented in the high-dimensional space, the paper proposes a linear down-projection strategy: projection matrices are learned that map each modality into a low-dimensional semantic space. These matrices are optimized to preserve semantic correlations both within and between modalities, so that the intrinsic latent feature vectors span the semantic space effectively and support accurate cross-modal retrieval.
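The down-projection step can be sketched as a least-squares fit of lifted features to one-hot semantic label vectors. This is an illustrative stand-in for the paper's exact objective; the ridge regularizer, variable names, and dimensions below are all hypothetical:

```python
import numpy as np

def learn_projection(X, Y, lam=1e-2):
    """Learn a linear projection P mapping lifted features X (n x d)
    toward semantic targets Y (n x c) via ridge regression.
    An illustrative stand-in for the paper's objective, not its exact form."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)  # (d x c)

# Toy example: two modalities with their own lifted features,
# each projected into a common c-dimensional semantic space.
rng = np.random.default_rng(0)
n, d_img, d_txt, c = 100, 32, 20, 5
labels = rng.integers(0, c, size=n)
Y = np.eye(c)[labels]                      # one-hot semantic targets
X_img = rng.standard_normal((n, d_img))    # hypothetical lifted image features
X_txt = rng.standard_normal((n, d_txt))    # hypothetical lifted text features
P_img = learn_projection(X_img, Y)
P_txt = learn_projection(X_txt, Y)
S_img, S_txt = X_img @ P_img, X_txt @ P_txt  # comparable (n x c) embeddings
```

Because both modalities are regressed onto the same semantic targets, their projected embeddings land in a common space where rows can be compared directly.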
Implementation Details
Kernel Approximation
The paper applies the Nyström kernel approximation to map raw modality features into finite-dimensional vectors whose inner products approximate the kernel, so the lifted representations can be processed with ordinary linear operations.
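A minimal NumPy sketch of the Nyström idea, assuming an RBF kernel and the first few samples as landmark points (the paper's kernel choice and landmark selection may differ):

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.1):
    """Pairwise RBF kernel matrix between rows of A and rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def nystroem_map(X, landmarks, gamma=0.1):
    """Explicit finite-dimensional feature map whose inner products
    approximate the RBF kernel (Nystroem method)."""
    W = rbf_kernel(landmarks, landmarks, gamma)   # m x m landmark kernel
    C = rbf_kernel(X, landmarks, gamma)           # n x m cross-kernel
    # W^{-1/2} via eigendecomposition (W is symmetric PSD)
    vals, vecs = np.linalg.eigh(W)
    vals = np.clip(vals, 1e-12, None)
    W_inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T
    return C @ W_inv_sqrt                         # n x m lifted features

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 16))
Z = nystroem_map(X, X[:50])          # first 50 points as landmarks
K_approx = Z @ Z.T                   # approximates the 200 x 200 RBF kernel
```

The lifted features `Z` are finite-dimensional, so downstream projection learning reduces to linear algebra on an n x m matrix instead of operations in the full Hilbert space.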
Multimodal Projection Learning
For efficient learning, the paper uses a coordinate descent approach for multimodal projection learning. By iteratively optimizing individual entries of the projection matrices, the computational complexity stays manageable. The optimization is divided into sub-problems, solving intra-modal and inter-modal retrieval objectives independently before a final joint optimization. This stepwise approach keeps the implementation efficient on large-scale datasets.
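The coordinate-wise scheme can be illustrated on a plain least-squares projection objective; the closed-form single-entry update below is a generic coordinate-descent sketch, not the paper's exact sub-problem decomposition:

```python
import numpy as np

def coordinate_descent_ls(X, Y, iters=100):
    """Minimize ||X P - Y||_F^2 by cyclically updating one entry of P
    at a time, keeping the residual matrix R up to date.
    A generic illustration of coordinate descent, not the paper's updates."""
    n, d = X.shape
    c = Y.shape[1]
    P = np.zeros((d, c))
    R = Y - X @ P                      # residual, maintained incrementally
    col_sq = (X ** 2).sum(0)           # ||x_j||^2 for each column j
    for _ in range(iters):
        for j in range(d):
            xj = X[:, j]
            for k in range(c):
                old = P[j, k]
                # closed-form 1-D minimizer with all other entries fixed
                new = old + xj @ R[:, k] / col_sq[j]
                P[j, k] = new
                R[:, k] -= (new - old) * xj
    return P

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 8))
P_true = rng.standard_normal((8, 3))
P_hat = coordinate_descent_ls(X, X @ P_true)   # recovers P_true
```

Each single-entry update costs O(n), so a full sweep is O(n·d·c); maintaining the residual incrementally is what keeps the per-entry cost low.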
Experimental Evaluation
The SDSRL framework was tested on two publicly available datasets (Wikipedia and NUS-WIDE) against several state-of-the-art methods. Performance was evaluated using mean average precision (MAP) and precision-recall curves.
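MAP for label-based cross-modal retrieval can be computed as follows. This is the standard definition (a database item is relevant if it shares the query's semantic label), not code from the paper:

```python
import numpy as np

def average_precision(ranked_relevance):
    """AP for one query: mean of precision@k over the relevant ranks."""
    rel = np.asarray(ranked_relevance, dtype=float)
    if rel.sum() == 0:
        return 0.0
    hits = np.cumsum(rel)
    precision_at_k = hits / (np.arange(len(rel)) + 1)
    return float((precision_at_k * rel).sum() / rel.sum())

def mean_average_precision(S_query, S_db, y_query, y_db):
    """MAP for cross-modal retrieval: rank database items by cosine
    similarity in the shared semantic space, score each ranking by AP."""
    Q = S_query / np.linalg.norm(S_query, axis=1, keepdims=True)
    D = S_db / np.linalg.norm(S_db, axis=1, keepdims=True)
    sims = Q @ D.T
    aps = []
    for i in range(len(Q)):
        order = np.argsort(-sims[i])           # best-matching items first
        aps.append(average_precision(y_db[order] == y_query[i]))
    return float(np.mean(aps))
```

For example, a ranking with relevance pattern `[1, 0, 1]` gives AP = (1/1 + 2/3) / 2 = 5/6; averaging AP over all queries yields MAP.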
Results
The experiments demonstrated that SDSRL consistently outperformed competing methods across various semantic-space dimensions and feature types. Notably, SDSRL's performance stabilized once the semantic dimension reached roughly the number of semantic categories in the datasets.
Computational Efficiency
SDSRL's computational complexity allows it to scale well to large datasets. The kernel approximation keeps feature lifting fast, while the coordinate descent method for projection learning converges efficiently even with large feature sets.
Conclusion
The SDSRL methodology provides a flexible yet powerful framework for semantic cross-media retrieval, effectively bridging the heterogeneous gap between modalities. Its ability to generalize across different data types and achieve high performance in both inter-media and intra-media retrieval tasks illustrates its robustness and potential for practical applications in multimodal AI retrieval systems. Future developments could focus on extending SDSRL to accommodate more complex datasets and exploring non-linear projection strategies within the learned semantic space.