Peak Alignment of Gas Chromatography-Mass Spectrometry Data with Deep Learning

Published 2 Apr 2019 in cs.LG, cs.CV, and stat.ML | (1904.01205v3)

Abstract: We present ChromAlignNet, a deep learning model for alignment of peaks in Gas Chromatography-Mass Spectrometry (GC-MS) data. In GC-MS data, a compound's retention time (RT) may not stay fixed across multiple chromatograms. To use GC-MS data for biomarker discovery requires alignment of identical analyte's RT from different samples. Current methods of alignment are all based on a set of formal, mathematical rules. We present a solution to GC-MS alignment using deep learning neural networks, which are more adept at complex, fuzzy data sets. We tested our model on several GC-MS data sets of various complexities and analysed the alignment results quantitatively. We show the model has very good performance (AUC $\sim 1$ for simple data sets and AUC $\sim 0.85$ for very complex data sets). Further, our model easily outperforms existing algorithms on complex data sets. Compared with existing methods, ChromAlignNet is very easy to use as it requires no user input of reference chromatograms and parameters. This method can easily be adapted to other similar data such as those from liquid chromatography. The source code is written in Python and available online.

Summary

The paper presents ChromAlignNet, a deep learning model that uses a Siamese architecture for one-shot peak alignment in GC-MS data.
It integrates features from mass spectrum, peak profile, and chromatogram segments, achieving an AUC of nearly 1 on simple data sets.
The method outperforms traditional alignment approaches in handling complex retention time variations, though further work is needed to reduce false positives.

Introduction

The paper introduces ChromAlignNet, a deep learning model for peak alignment in Gas Chromatography-Mass Spectrometry (GC-MS) data. The problem it addresses is that a compound's retention time (RT) varies between samples, so peaks from the same analyte must be matched across chromatograms before GC-MS data can be used for biomarker discovery. Traditional alignment methods rely on fixed mathematical rules, which can be inadequate for complex, fuzzy metabolomics data. ChromAlignNet instead uses deep neural networks to decide which chromatographic peaks belong together, an approach the authors argue is better suited to such data.

Network Architecture

ChromAlignNet uses a Siamese neural network architecture of the kind used for one-shot learning, which suits the pairwise comparison of chromatographic peaks from different samples. The network consists of three Siamese sub-networks, each encoding a different feature of a peak: the mass spectrum at the peak maximum, the entire chromatogram segment, and the detailed peak profile. The outputs of the sub-networks are combined into a single probability that the two peaks should be aligned. These pairwise probabilities are then used to align peaks across multiple samples.
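The data flow of a Siamese comparison can be illustrated with a minimal, standard-library sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the single tanh layer per encoder, the feature lengths, the absolute-difference merge, and the names `make_encoder` and `alignment_probability`. The weights are random and untrained, so the output value itself is meaningless; the point is only that both peaks of a pair pass through the same encoders.

```python
import math
import random

def make_encoder(n_inputs, n_hidden, rng):
    """Return a tiny one-layer tanh encoder with fixed (untrained) random weights."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_inputs)] for _ in range(n_hidden)]

    def encode(x):
        return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in weights]

    return encode

def alignment_probability(peak_a, peak_b, encoders, out_weights, bias=0.0):
    """Siamese forward pass: the same encoder is applied to both peaks of a pair."""
    diffs = []
    for name, encode in encoders.items():
        enc_a, enc_b = encode(peak_a[name]), encode(peak_b[name])
        # Assumed merge step: element-wise absolute difference of the encodings.
        diffs.extend(abs(p - q) for p, q in zip(enc_a, enc_b))
    logit = bias + sum(w * d for w, d in zip(out_weights, diffs))
    return 1.0 / (1.0 + math.exp(-logit))  # sigmoid gives a value in (0, 1)

rng = random.Random(0)
# Hypothetical input lengths for the three feature types named in the text.
feature_sizes = {"mass_spectrum": 6, "peak_profile": 5, "chromatogram_segment": 8}
n_hidden = 4
encoders = {name: make_encoder(size, n_hidden, rng) for name, size in feature_sizes.items()}
out_weights = [rng.uniform(-1, 1) for _ in range(n_hidden * len(feature_sizes))]

def random_peak():
    """Build a placeholder peak with random feature vectors (not real GC-MS data)."""
    return {name: [rng.random() for _ in range(size)] for name, size in feature_sizes.items()}

peak_a, peak_b = random_peak(), random_peak()
p_ab = alignment_probability(peak_a, peak_b, encoders, out_weights)
p_ba = alignment_probability(peak_b, peak_a, encoders, out_weights)
print(p_ab, p_ba)
```

Because the encoders are shared and the merge uses an absolute difference, the result does not depend on the order of the two peaks, which is a convenient property for a pairwise alignment score.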

Methodology

Data Preprocessing: Peaks are detected automatically from GC-MS data, and features necessary for the network are extracted, including mass spectrum, peak profile, and chromatogram segment.
Training Process: Positive and negative pairs of peaks for training are generated from ambient air and human breath data sets. The network is trained by minimizing a composite loss function with the Adam optimizer, and validation is used to guard against overfitting.
Group Assignment: Pairwise alignment predictions are translated into complete chromatographic alignment using hierarchical clustering.
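The group assignment step can be sketched as follows. This is a minimal illustration, not the paper's implementation: the average linkage, the distance definition (1 minus the predicted probability), the 0.5 cut-off, the function name `cluster_peaks`, and the toy probability matrix are all assumptions.

```python
def cluster_peaks(prob, threshold=0.5):
    """Average-linkage agglomerative clustering on distance = 1 - probability.

    prob is a symmetric matrix (list of lists) of pairwise alignment
    probabilities. Returns a sorted list of groups of peak indices.
    """
    clusters = [[i] for i in range(len(prob))]

    def distance(c1, c2):
        total = sum(1.0 - prob[i][j] for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = distance(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d > threshold:
            break  # remaining clusters are too dissimilar to merge
        merged = sorted(clusters[a] + clusters[b])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return sorted(clusters)

# Toy pairwise probabilities for five peaks (illustrative values only).
prob = [
    [1.00, 0.90, 0.80, 0.10, 0.20],
    [0.90, 1.00, 0.85, 0.15, 0.10],
    [0.80, 0.85, 1.00, 0.20, 0.10],
    [0.10, 0.15, 0.20, 1.00, 0.95],
    [0.20, 0.10, 0.10, 0.95, 1.00],
]
groups = cluster_peaks(prob)
print(groups)  # [[0, 1, 2], [3, 4]]
```

Each resulting group is a set of peaks treated as the same analyte. Raising the threshold merges more aggressively, which trades missed alignments for false ones.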

Experimental Results

ChromAlignNet was tested on multiple data sets of varying complexity.

Performance Metrics: The model achieved an AUC close to 1 for simpler data sets and around 0.85 for more complex data sets, outperforming conventional methods.
Runtime: Prediction time scales to larger data sets, helped by parallel processing.
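For readers unfamiliar with the metric, AUC for a pairwise classifier can be computed as the probability that a randomly chosen true pair receives a higher score than a randomly chosen false pair. The sketch below uses that standard definition with toy labels and scores; the values and the function name `roc_auc` are illustrative and do not come from the paper.

```python
def roc_auc(labels, scores):
    """ROC AUC: chance that a positive pair outscores a negative pair (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative pair")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# 1 = the two peaks are the same analyte, 0 = they are not (toy data).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
auc = roc_auc(labels, scores)
print(round(auc, 3))  # 0.889
```

An AUC of 1 means every true pair is ranked above every false pair, and 0.5 corresponds to random ranking, which puts the reported values of about 1 and about 0.85 in context.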

Implementation and Comparison with Existing Methods

ChromAlignNet outperforms traditional algorithms such as COW and GCalignR on complex data, where RT shifts are larger than those methods typically handle. It requires no reference chromatograms and no user-supplied parameters, which simplifies use and improves robustness across diverse samples. However, its false positive rate is high, which leaves room for refinement.

Conclusion

ChromAlignNet applies deep learning to GC-MS alignment, handling complex RT variations with a tool that is easy to use and can be adapted to similar chromatographic data such as liquid chromatography. Because it works without reference chromatograms or user-set parameters, it is practical for laboratory use and could support metabolomics work in health diagnostics and research. Future work will focus on reducing false positives and optimizing the network components.