- The paper develops and evaluates machine learning and natural language processing models to predict food processing levels using nutrient data and textual descriptions based on the NOVA classification system.
- The best configuration, LGBM trained on 13 nutrient features together with GPT-2 embeddings, reached an F1-score of 0.9583, demonstrating the value of integrating NLP features.
- The research provides a practical web server tool for researchers and policymakers to assess food processing levels and contributes to public health efforts against ultra-processed foods.
Predictive Modeling of Food Processing Levels Using Machine Learning and NLP Techniques
The study addresses the need for computational models that can predict food processing levels, motivated by growing health concerns about ultra-processed foods (UPFs). Using both ML and NLP methods, it classifies foods according to the NOVA system, a widely used framework that sorts foods into four groups by degree of processing, from unprocessed or minimally processed foods to ultra-processed ones.
The research trains a variety of models on a dataset drawn from the Food and Nutrient Database for Dietary Studies (FNDDS), with each food labeled by its NOVA class. The first experiments use all 102 nutrient features from the FNDDS; later experiments reduce the feature set to 65 and then to 13 nutrients. In each case, ensemble methods such as LightGBM (LGBM), Random Forest, and Gradient Boosting consistently perform well. LGBM is the most effective model in the full 102-nutrient evaluation, with an F1-score of 0.9411 and a Matthews correlation coefficient (MCC) of 0.8691.
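Both reported metrics can be computed from a confusion matrix. The sketch below is a minimal illustration, not the authors' evaluation code: it assumes macro-averaged F1 (the summary does not say which averaging was used) and the standard multiclass generalization of MCC, and the 4x4 matrix holds made-up counts for four NOVA classes.

```python
import math

def macro_f1(cm):
    """Macro-averaged F1 from a square confusion matrix (rows = true, columns = predicted)."""
    n = len(cm)
    scores = []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp  # predicted k, true class differs
        fn = sum(cm[k]) - tp                       # true k, predicted class differs
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n

def multiclass_mcc(cm):
    """Multiclass Matthews correlation coefficient from a confusion matrix."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[k][k] for k in range(n))
    true_counts = [sum(cm[k]) for k in range(n)]
    pred_counts = [sum(cm[i][k] for i in range(n)) for k in range(n)]
    cov_tp = correct * total - sum(t * p for t, p in zip(true_counts, pred_counts))
    cov_tt = total * total - sum(t * t for t in true_counts)
    cov_pp = total * total - sum(p * p for p in pred_counts)
    denom = math.sqrt(cov_tt * cov_pp)
    return cov_tp / denom if denom else 0.0

# Illustrative counts only (not results from the paper).
toy_cm = [
    [50, 2, 1, 0],
    [3, 20, 2, 1],
    [1, 2, 30, 4],
    [0, 1, 5, 90],
]
print(f"macro F1 = {macro_f1(toy_cm):.4f}, MCC = {multiclass_mcc(toy_cm):.4f}")
```

MCC uses every cell of the confusion matrix, so it is less easily inflated by a dominant class than accuracy is. That makes it a useful companion to F1 when the classes are imbalanced.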
The study further enriches the feature representation with NLP models, using pre-trained embeddings from BERT, XLM-RoBERTa, and GPT-2. These embeddings encode textual food descriptions, category names, and macro class data, and adding them improves model accuracy. Notably, combining the 13-nutrient dataset with GPT-2 embeddings and LGBM yielded an F1-score of 0.9583 and an MCC of 0.9091, both higher than the 0.9411 and 0.8691 obtained by LGBM on the 102 nutrient features alone.
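One common way to combine the two kinds of input is to concatenate each food's nutrient values with the embedding vectors of its text fields, producing a single feature row for the classifier. The sketch below assumes this concatenation approach; the summary does not describe the authors' exact fusion step. `embed_text` is a hypothetical stand-in that hashes text into a small pseudo-vector so the example runs without a language model, and the nutrient values are toy numbers.

```python
import hashlib

EMBED_DIM = 8  # toy size; real pre-trained encoders emit far larger vectors

def embed_text(text, dim=EMBED_DIM):
    """Hypothetical stand-in for a pre-trained encoder (e.g. GPT-2).

    Returns a deterministic pseudo-vector with values in [0, 1); it carries
    no semantic meaning and only mimics the shape of a real embedding.
    """
    digest = hashlib.sha256(text.lower().encode("utf-8")).digest()
    return [digest[i] / 256 for i in range(dim)]

def build_row(nutrients, description, category):
    """Concatenate numeric nutrient values with the embedding of each text field."""
    return list(nutrients) + embed_text(description) + embed_text(category)

# Toy input: three nutrient values plus two text fields (illustrative only).
row = build_row([52.0, 0.3, 13.8], "Apple, raw", "Fruits")
print(len(row), row[:3])
```

In a real pipeline the stand-in would be replaced by vectors from the chosen pre-trained model, and the resulting rows would be passed to a tabular learner such as LGBM.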
The analysis also applies the Synthetic Minority Oversampling Technique (SMOTE) to mitigate class imbalance and uses stratified k-fold cross-validation to obtain more reliable performance estimates. Additionally, SHAP (SHapley Additive exPlanations) analysis quantifies the contribution of individual features, identifying 'Sodium', 'Energy', and specific fatty acids as consistent predictors across NOVA classes.
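The interaction between these two techniques can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the authors' pipeline: the data are synthetic, the helper names are hypothetical, and oversampling is applied only to the training portion of each fold (a common practice that keeps synthetic points out of the held-out data, though the summary does not say how the paper ordered these steps).

```python
import random
from collections import Counter, defaultdict

def stratified_folds(labels, n_splits, rng):
    """Assign sample indices to folds so each fold keeps roughly the class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % n_splits].append(idx)  # deal each class out round-robin
    return folds

def smote(X, y, k, rng):
    """Oversample each minority class up to the majority count by interpolating
    between a sample and one of its k nearest same-class neighbours."""
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        members = [x for x, lab in zip(X, y) if lab == label]
        if len(members) < 2:
            continue  # no neighbour to interpolate with
        for _ in range(target - count):
            base = rng.choice(members)
            neighbours = sorted(
                (m for m in members if m is not base),
                key=lambda m: sum((a - b) ** 2 for a, b in zip(base, m)),
            )[:k]
            other = rng.choice(neighbours)
            gap = rng.random()  # position along the segment base -> other
            X_out.append([a + gap * (b - a) for a, b in zip(base, other)])
            y_out.append(label)
    return X_out, y_out

# Synthetic, imbalanced toy data: 3 classes, 2 features (not from the paper).
rng = random.Random(0)
class_sizes = {0: 40, 1: 12, 2: 8}
X, y = [], []
for label, size in class_sizes.items():
    for _ in range(size):
        X.append([rng.gauss(3 * label, 1.0), rng.gauss(-3 * label, 1.0)])
        y.append(label)

folds = stratified_folds(y, n_splits=4, rng=rng)
for held_out in range(4):
    test_idx = set(folds[held_out])
    X_train = [x for i, x in enumerate(X) if i not in test_idx]
    y_train = [lab for i, lab in enumerate(y) if i not in test_idx]
    X_bal, y_bal = smote(X_train, y_train, k=5, rng=rng)  # training data only
    print(held_out, dict(Counter(y_train)), dict(Counter(y_bal)))
```

Each fold preserves the original class ratio, and SMOTE then balances only the training split, so the held-out fold still reflects the real class distribution.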
This investigation underscores the potential of ML and NLP technologies in accurately predicting food processing levels, contributing to public health initiatives aimed at reducing the intake of UPFs and associated health risks such as obesity and cardiovascular diseases. Further research is necessary to expand the dataset beyond the predominantly U.S.-based FNDDS data, allowing for more generalizable applications across diverse global dietary patterns. Moreover, the integration of robust NLP methodologies can deepen insights into textual diet descriptions, enhancing the predictive power of such models.
The development of a user-friendly web server accessible through the authors' designated platform marks a significant step toward practical application, providing researchers and policymakers with a valuable tool for assessing food processing levels based on nutrient data. This research lays the groundwork for future studies focusing on refining algorithms and expanding datasets to maintain pace with evolving dietary trends and health concerns globally.