Virtual Fusion with Contrastive Learning for Single Sensor-based Activity Recognition

Published 1 Dec 2023 in cs.LG, cs.AI, and cs.CV | (2312.02185v1)

Abstract: Various types of sensors can be used for Human Activity Recognition (HAR), and each of them has different strengths and weaknesses. Sometimes a single sensor cannot fully observe the user's motions from its perspective, which causes wrong predictions. While sensor fusion provides more information for HAR, it comes with many inherent drawbacks like user privacy and acceptance, costly set-up, operation, and maintenance. To deal with this problem, we propose Virtual Fusion - a new method that takes advantage of unlabeled data from multiple time-synchronized sensors during training, but only needs one sensor for inference. Contrastive learning is adopted to exploit the correlation among sensors. Virtual Fusion gives significantly better accuracy than training with the same single sensor, and in some cases, it even surpasses actual fusion using multiple sensors at test time. We also extend this method to a more general version called Actual Fusion within Virtual Fusion (AFVF), which uses a subset of training sensors during inference. Our method achieves state-of-the-art accuracy and F1-score on UCI-HAR and PAMAP2 benchmark datasets. Implementation is available upon request.


Summary

  • The paper introduces Virtual Fusion, which uses contrastive learning to exploit correlations between multiple sensors during training while relying on a single sensor during inference.
  • It extends the approach with Actual Fusion within Virtual Fusion (AFVF), which performs inference with a subset of the training sensors, combining them through early or late fusion; late fusion performs better.
  • Experimental results on UCI-HAR and PAMAP2 benchmarks demonstrate improved accuracy and F1-score compared to traditional single-sensor methods, validating the framework's effectiveness.

Virtual Fusion for Activity Recognition

The paper "Virtual Fusion with Contrastive Learning for Single Sensor-based Activity Recognition" (2312.02185) introduces a novel approach to Human Activity Recognition (HAR) that leverages contrastive learning to exploit correlations between multiple sensors during training, while relying on only a single sensor during inference. This method addresses the limitations of traditional sensor fusion, which often entails significant costs and complexities related to setup, operation, and maintenance. The authors also present an extension of this approach named Actual Fusion within Virtual Fusion (AFVF), which allows for inference using a subset of the sensors used during training.

Problem Formulation and Approach

The core idea behind Virtual Fusion is to train models using both labeled and unlabeled data from multiple time-synchronized sensors. The labeled dataset, denoted $D_{lbl}$, consists of data-label pairs $(x_i^m, y_i)$, where $x_i^m$ is the data from sensor $m$ and $y_i$ is the corresponding activity label. The unlabeled dataset, $D_{ulb}$, contains data from multiple sensors without labels, represented as $x_i^m$. The method trains a classification model for each modality $m \in M_{cls}$, where $M_{cls}$ is the set of modalities used for classification. Each classification model consists of a feature extractor $f^m$ that maps the input $x^m$ to a latent feature vector $z^m$, and a classifier $c^m$ that maps $z^m$ to the predicted activity label $y$.

Figure 1: Overall training process of Virtual Fusion. Dotted lines are optional, depending on label availability.
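For illustration, one per-modality branch of this formulation can be sketched as below. This is a minimal PyTorch sketch, not the authors' architecture: the convolutional encoder and all layer sizes are placeholder assumptions.

```python
import torch
import torch.nn as nn

class ModalityModel(nn.Module):
    """One classification branch: feature extractor f^m followed by classifier c^m."""

    def __init__(self, in_channels: int, feat_dim: int = 128, num_classes: int = 6):
        super().__init__()
        # f^m: maps a sensor window x^m of shape (B, C, T) to a latent vector z^m of shape (B, feat_dim).
        self.extractor = nn.Sequential(
            nn.Conv1d(in_channels, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
            nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # c^m: maps z^m to activity logits.
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, x: torch.Tensor):
        z = self.extractor(x)
        return z, self.classifier(z)
```

During training, one such branch would be instantiated per sensor modality; at inference time only the branch for the deployed sensor is needed.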

The Virtual Fusion framework (Figure 1) employs a contrastive learning approach using a multi-view NT-Xent loss function to exploit the correlation between different sensor modalities. The NT-Xent loss, derived from the SimCLR framework, is used to maximize the similarity between feature vectors from different sensors that correspond to the same activity. For two modalities $m_1$ and $m_2$, the NT-Xent loss for the sample at index $i$ is defined as:

$$\ell(z^{m_1}_i, z^{m_2}_i) = -\log \frac{\exp(\text{sim}(z^{m_1}_i, z^{m_2}_i) / \tau)}{\sum_{j=1}^{B} \exp(\text{sim}(z^{m_1}_i, z^{m_2}_j) / \tau)},$$

where $\text{sim}$ is the cosine similarity function, $\tau$ is a temperature hyper-parameter, and $B$ is the mini-batch size.
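The two-modality term above can be implemented compactly. The sketch below assumes PyTorch and covers only a single modality pair; the paper's multi-view loss aggregates such terms over all sensor pairs, which is not shown here.

```python
import torch
import torch.nn.functional as F

def nt_xent_two_modalities(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """NT-Xent loss between two modalities for a time-synchronized mini-batch.

    z1, z2: (B, D) feature vectors; row i of z1 and row i of z2 come from the
    same window observed by two different sensors (the positive pair).
    """
    # Normalize so that the dot product equals cosine similarity.
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    # sim(z_i^{m1}, z_j^{m2}) / tau for all pairs (i, j) in the batch.
    logits = z1 @ z2.t() / tau                              # (B, B)
    targets = torch.arange(z1.size(0), device=z1.device)    # positives lie on the diagonal
    # Cross-entropy with the diagonal as the positive class reproduces
    # -log( exp(sim_ii / tau) / sum_j exp(sim_ij / tau) ), averaged over i.
    return F.cross_entropy(logits, targets)
```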

Actual Fusion within Virtual Fusion (AFVF)

The authors extend Virtual Fusion to AFVF, which enables inference using a subset of the sensors used during training. This is particularly useful in scenarios where certain sensors may not be available or practical during deployment. AFVF supports both early and late fusion techniques to combine data from multiple sensors. In early fusion, data from different sensors are fused at the data level, while in late fusion, features extracted from individual sensors are fused at the feature level.

Figure 2: Examples of AFVF that fuses 2 out of multiple modalities. The dotted line connections are only applicable if $m \in M_{lbl}$.

The authors found that late fusion generally yields better results than early fusion in AFVF due to its ability to capture more nuanced information from each sensor using dedicated feature extractors (Figure 2). They argue that the fused modality should be included in the contrastive loss computation to directly support the classification task. The fused feature vector $z^{fused}$ is computed as:

$$z^{fused} = \text{project}(\text{concatenate}(z^1, \dots, z^n)),$$

where $z^1, \dots, z^n$ are the feature vectors from the individual sensors, and "project" refers to a fully connected layer used as a projector.

Figure 3: Example of AFVF that fuses all modalities. Early fusion is not applicable.

In scenarios where all training sensors are available during testing, late AFVF is advantageous, as it allows for the fusion of all modalities (Figure 3). Early AFVF is not applicable in this case because it does not produce features for the source modalities.
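As a rough sketch of this late-fusion step, the module below concatenates per-sensor features and passes them through a fully connected projector, as the equation above describes. It assumes PyTorch; the class name `LateFusionProjector` and the feature dimensions are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class LateFusionProjector(nn.Module):
    """Late-fusion sketch: concatenate per-sensor features, then project them.

    feat_dims lists the feature dimension of each sensor branch; fused_dim is
    the dimension of the fused vector z^{fused}. Both are placeholder choices.
    """

    def __init__(self, feat_dims: list, fused_dim: int = 128):
        super().__init__()
        self.project = nn.Linear(sum(feat_dims), fused_dim)

    def forward(self, zs: list) -> torch.Tensor:
        # z^{fused} = project(concatenate(z^1, ..., z^n))
        return self.project(torch.cat(zs, dim=1))

# Example usage: fuse features from two hypothetical sensor branches.
# fuser = LateFusionProjector([128, 128], fused_dim=128)
# z_fused = fuser([z_accelerometer, z_skeleton])
```

The fused vector can then be fed to a classifier and, as the paper argues, included alongside the single-sensor features in the contrastive loss computation.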

Experimental Results

The authors conducted experiments on several benchmark datasets, including UCI-HAR and PAMAP2, to evaluate the performance of Virtual Fusion and AFVF. The results demonstrate that Virtual Fusion consistently outperforms single-sensor training, and in some cases, it even surpasses actual sensor fusion. Specifically, AFVF achieved state-of-the-art accuracy and F1-score on both benchmark datasets.

The authors also performed ablation studies to validate the design choices of Virtual Fusion, such as the use of a multi-view NT-Xent loss and the inclusion of the fused modality in the contrastive loss computation. The results of these studies support the effectiveness of the proposed approach.

Conclusion

The paper presents a compelling approach to HAR that addresses the limitations of traditional sensor fusion by leveraging contrastive learning and virtual fusion techniques. The proposed method offers increased flexibility in sensor selection and deployment, while achieving state-of-the-art performance on benchmark datasets.

The use of unlabeled multimodal data for representation learning is a promising avenue for future research, particularly given the relative ease and lower cost of collecting unlabeled data compared to labeled data. Future work could explore the use of domain adaptation or generalization techniques to further improve the performance of Virtual Fusion, as well as investigate the effects of the number of sensors and sensor characteristics on the method's accuracy.
