Machine learning and natural language processing models to predict the extent of food processing

Published 23 Dec 2024 in q-bio.BM and cs.LG | (2412.17217v1)

Abstract: The dramatic increase in consumption of ultra-processed food has been associated with numerous adverse health effects. Given the public health consequences linked to ultra-processed food consumption, it is highly relevant to build computational models to predict the processing of food products. We created a range of machine learning, deep learning, and NLP models to predict the extent of food processing by integrating the FNDDS dataset of food products and their nutrient profiles with their reported NOVA processing level. Starting with the full nutritional panel of 102 features, we further implemented coarse-graining of features to 65 and 13 nutrients by dropping flavonoids and then by considering the 13-nutrient panel of FDA, respectively. LGBM Classifier and Random Forest emerged as the best models for 102 and 65 nutrients, respectively, with F1-scores of 0.9411 and 0.9345 and MCCs of 0.8691 and 0.8543. For the 13-nutrient panel, Gradient Boost achieved the best F1-score of 0.9284 and MCC of 0.8425. We also implemented NLP-based models, which exhibited state-of-the-art performance. Besides distilling nutrients critical for model performance, we present a user-friendly web server for predicting processing level based on the nutrient panel of a food product: https://cosylab.iiitd.edu.in/food-processing/.


Summary

The paper develops and evaluates machine learning and natural language processing models to predict food processing levels using nutrient data and textual descriptions based on the NOVA classification system.
High-performing models such as LGBM reached F1-scores of up to 0.9583 when trained on nutrient data combined with GPT-2 embeddings, demonstrating the value of integrating NLP features.
The research provides a practical web server tool for researchers and policymakers to assess food processing levels and contributes to public health efforts against ultra-processed foods.

Predictive Modeling of Food Processing Levels Using Machine Learning and NLP Techniques

This paper addresses the need for computational models that predict the extent of food processing, motivated by growing health concerns about ultra-processed foods (UPFs). Using both machine learning (ML) and natural language processing (NLP) methods, it classifies foods according to the NOVA system, a widely used scheme that sorts foods into four groups by their degree of processing.

The models are trained on data from the Food and Nutrient Database for Dietary Studies (FNDDS), with each food labeled by its reported NOVA processing level. The first experiments use all 102 nutrient features; the panel is then coarse-grained to 65 nutrients by dropping flavonoids, and finally to the 13-nutrient FDA panel. Ensemble methods perform best throughout: LGBM on the full 102-nutrient panel (F1-score 0.9411, MCC 0.8691), Random Forest on 65 nutrients (F1-score 0.9345, MCC 0.8543), and Gradient Boost on 13 nutrients (F1-score 0.9284, MCC 0.8425).
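As a reference for how the two reported metrics can be computed for a four-class problem such as NOVA, here is a minimal standard-library sketch. It assumes macro-averaged F1 and the standard multi-class generalization of the Matthews correlation coefficient (MCC); the source does not say which F1 averaging the authors used. The labels and predictions are made-up toy values, not data from the paper.

```python
from collections import Counter
from math import sqrt


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro averaging is an assumption)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def multiclass_mcc(y_true, y_pred):
    """Multi-class Matthews correlation coefficient; returns 0.0 if undefined."""
    n_samples = len(y_true)
    n_correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_counts = Counter(y_true)   # how often each class truly occurs
    pred_counts = Counter(y_pred)   # how often each class is predicted
    labels = set(true_counts) | set(pred_counts)
    cov_tp = n_correct * n_samples - sum(true_counts[k] * pred_counts[k] for k in labels)
    cov_tt = n_samples ** 2 - sum(v * v for v in true_counts.values())
    cov_pp = n_samples ** 2 - sum(v * v for v in pred_counts.values())
    denom = sqrt(cov_tt * cov_pp)
    return cov_tp / denom if denom else 0.0


# Toy NOVA labels (1-4) and hypothetical predictions, for illustration only.
nova_true = [1, 1, 2, 3, 3, 4, 4, 4, 4, 4]
nova_pred = [1, 1, 2, 3, 4, 4, 4, 4, 4, 3]
print("macro F1:", round(macro_f1(nova_true, nova_pred), 4))
print("MCC:", round(multiclass_mcc(nova_true, nova_pred), 4))
```

MCC uses the whole confusion matrix, so it is less easily inflated by a dominant class than accuracy, which is presumably why it is reported next to F1 here.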

The study also adds NLP-based features: embeddings from pre-trained language models such as BERT, XLM-RoBERTa, and GPT-2 are computed for textual food descriptions, category names, and macro class data, and these improve model performance. Notably, combining the 13-nutrient panel with GPT-2 embeddings and an LGBM classifier yielded an F1-score of 0.9583 and an MCC of 0.9091, higher than any of the nutrient-only models.
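The idea of combining nutrient values with a text embedding can be sketched as a simple concatenation of the two vectors before they are passed to a classifier. The example below is only an illustration: `toy_text_embedding` is a hypothetical hashed bag-of-words stand-in, not GPT-2 or any other pre-trained model, and the nutrient values are placeholders rather than FNDDS data. The source does not describe how the authors pooled or concatenated their features.

```python
import zlib


def toy_text_embedding(text, dim=8):
    """Hypothetical stand-in for a language-model embedding.

    Hashes each whitespace token into one of `dim` buckets and L2-normalizes
    the counts. A real pipeline would use a pre-trained model instead.
    """
    vec = [0.0] * dim
    for token in text.lower().split():
        # crc32 is deterministic across runs, unlike Python's built-in hash().
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    norm = sum(v * v for v in vec) ** 0.5
    return [v / norm for v in vec] if norm else vec


def build_features(nutrients, description, dim=8):
    """Concatenate a nutrient vector with the text embedding of a description."""
    return list(nutrients) + toy_text_embedding(description, dim)


# Placeholder 13-value nutrient panel and an example description.
nutrient_panel = [0.0] * 13
features = build_features(nutrient_panel, "Yogurt, plain, whole milk")
print(len(features))  # 13 nutrient values followed by 8 embedding values
```

The resulting fixed-length vector can be fed to any tabular classifier, which is what makes this kind of feature fusion easy to combine with tree ensembles.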

To mitigate class imbalance and obtain more reliable performance estimates, the analysis uses the Synthetic Minority Oversampling Technique (SMOTE) together with stratified k-fold cross-validation. In addition, SHAP (SHapley Additive exPlanations) analysis quantifies the contribution of individual features, identifying 'Sodium', 'Energy', and specific fatty acids as consistent predictors across NOVA classes.
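The interplay between oversampling and stratified cross-validation can be sketched with the standard library alone. The code below is a simplified illustration, not the authors' pipeline: `smote_like` interpolates between a minority sample and one of its nearest same-class neighbours, which is the core idea of SMOTE, and `stratified_folds` deals the indices of each class round-robin into k folds. Applying the oversampling only to the training part of each fold is a common precaution assumed here; the source does not state where in the pipeline SMOTE was applied. All data are toy values.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, k, seed=0):
    """Split indices into k folds with roughly equal class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def smote_like(X, y, n_neighbors=3, seed=0):
    """Oversample each smaller class up to the largest class by interpolation."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    target = max(len(rows) for rows in by_class.values())
    X_out, y_out = list(X), list(y)
    for label, rows in by_class.items():
        if len(rows) < 2:
            continue  # a single sample has no neighbour to interpolate with
        for _ in range(target - len(rows)):
            base = rng.choice(rows)
            others = [r for r in rows if r is not base]
            nearest = sorted(others, key=lambda r: squared_distance(base, r))[:n_neighbors]
            neighbour = rng.choice(nearest)
            gap = rng.random()  # position along the segment base -> neighbour
            X_out.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
            y_out.append(label)
    return X_out, y_out


# Toy imbalanced data: nine samples of class 0, three of class 1.
X = [[float(i), float(2 * i)] for i in range(12)]
y = [0] * 9 + [1] * 3

for fold_number, test_idx in enumerate(stratified_folds(y, k=3)):
    held_out = set(test_idx)
    train_idx = [i for i in range(len(y)) if i not in held_out]
    # Oversample the training split only, so no synthetic point leaks into testing.
    X_train, y_train = smote_like([X[i] for i in train_idx], [y[i] for i in train_idx])
    print(fold_number, Counter(y_train), Counter(y[i] for i in test_idx))
```

With this toy data each held-out fold keeps the original 3:1 class ratio, while each training split is balanced to six samples per class before a model would be fitted.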

The results show that ML and NLP methods can accurately predict food processing levels, which supports public health initiatives aimed at reducing UPF intake and the associated risks, such as obesity and cardiovascular disease. Further research is needed to extend the dataset beyond the predominantly U.S.-based FNDDS so that the models generalize across diverse global dietary patterns. Richer NLP methods applied to textual food descriptions could further improve predictive power.

The authors also provide a user-friendly web server (https://cosylab.iiitd.edu.in/food-processing/) that predicts processing level from the nutrient panel of a food product, giving researchers and policymakers a practical tool. The work lays the groundwork for future studies that refine the algorithms and expand the datasets to keep pace with evolving dietary trends and health concerns.
