- The paper presents a deep neural network approach that predicts in vitro characteristics of oral films and sustained-release tablets with over 80% accuracy.
- It introduces the MD-FIS algorithm to split small, imbalanced datasets into training, validation, and test sets, while the DNN itself extracts features automatically, without manual feature engineering.
- The study shows that DNNs outperform traditional methods like MLR, SVM, and RF, offering a scalable, data-driven alternative for drug formulation development.
This paper explores the application of deep learning to predict pharmaceutical formulations, aiming to replace the traditional, time-consuming, and expensive trial-and-error approach. The authors focus on predicting in vitro characteristics of two dosage forms: oral fast-disintegrating films (OFDF) and oral sustained-release matrix tablets (SRMT).
Problem Statement:
Traditional pharmaceutical formulation development relies heavily on the experience of individual scientists. This process is inefficient, costly, and makes it difficult to achieve optimal formulations. Machine learning offers a potential solution by enabling data-driven predictions based on existing experimental data. However, existing machine learning techniques often require significant domain expertise for feature extraction and may suffer from low accuracy due to limited data.
Proposed Solution:
The authors propose using deep learning, specifically deep neural networks (DNNs), to predict pharmaceutical formulations. Deep learning's advantage is its ability to automatically extract features from data, eliminating the need for manual feature engineering. The authors also address the challenge of small, imbalanced datasets common in pharmaceutical formulation by developing an automatic dataset selection algorithm called MD-FIS (Maximum Dissimilarity algorithm with Small group Filter and Representative Initial Set selection) to create training, validation, and test sets. Finally, they introduce pharmaceutically relevant evaluation criteria (similarity factor f2 for dissolution profiles and a tolerance-based accuracy for disintegration time) to assess model performance.
Methods:
- Data Collection: The study uses a dataset of 131 OFDF and 145 SRMT formulations extracted from Web of Science. The data include:
- Types and contents of drugs and excipients.
- Process parameters (e.g., weight, thickness, tensile strength).
- In vitro characteristics (disintegration time for OFDF and cumulative dissolution profiles for SRMT).
- Molecular descriptors to represent the properties of APIs.
- Data Preprocessing: Excipient types were encoded to numerical values, and API properties were described using nine molecular descriptors.
- Data Splitting: The data were split into training, validation, and test sets. A key contribution is the MD-FIS algorithm, which addresses the limitations of both random splitting and the original maximum dissimilarity algorithm on small, imbalanced datasets. MD-FIS works in three steps:
- Filters out small API groups to avoid bias.
- Selects a representative initial dataset by randomly generating 10,000 initial sets and choosing the one with the highest similarity to the remaining data.
- Iteratively picks data points with the maximum "cost," which balances dissimilarity to the initial set with similarity to other data within the same API group, preventing the selection of boundary or outlier data.
- Model Development: DNNs were trained using the DeepLearning4j framework. For OFDF, a 10-layer feed-forward network with 50 nodes per layer was used. For SRMT, a 9-layer network with 30 nodes per layer was used. Tanh and sigmoid were used as activation functions, and batch gradient descent with momentum was used for training.
- Comparison with Other Machine Learning Methods: Six conventional machine learning methods were used as benchmarks: multiple linear regression (MLR), partial least squares regression (PLSR), support vector machine (SVM), artificial neural networks (ANNs), random forest (RF), and k-nearest neighbors (k-NN). These models were trained using scikit-learn.
- Evaluation Metrics: Instead of relying solely on standard regression metrics such as the correlation coefficient and the coefficient of determination, the authors used:
- f2 Similarity Factor: For SRMT, this evaluates the similarity of predicted and experimental cumulative drug release curves. f2 >= 50 indicates a successful prediction.
- Disintegration Time Accuracy: For OFDF, this is the percentage of predictions whose absolute error between predicted and experimental disintegration time is less than or equal to 10 seconds.
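The two evaluation criteria above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes the standard definition of the similarity factor, f2 = 50 * log10(100 / sqrt(1 + mean squared difference)), with release values expressed in percent, and the function names (`f2_similarity`, `f2_accuracy`, `tolerance_accuracy`) are hypothetical.

```python
import math


def f2_similarity(reference, predicted):
    """Similarity factor f2 between two cumulative release profiles (values in %).

    Assumes the standard definition:
    f2 = 50 * log10(100 / sqrt(1 + mean((R_t - T_t)^2)))
    """
    if len(reference) != len(predicted) or len(reference) == 0:
        raise ValueError("profiles must be non-empty and of equal length")
    n = len(reference)
    mean_sq_diff = sum((r - p) ** 2 for r, p in zip(reference, predicted)) / n
    return 50.0 * math.log10(100.0 / math.sqrt(1.0 + mean_sq_diff))


def f2_accuracy(reference_profiles, predicted_profiles, threshold=50.0):
    """Fraction of predicted profiles with f2 >= threshold (50 in the summary above)."""
    hits = sum(
        f2_similarity(r, p) >= threshold
        for r, p in zip(reference_profiles, predicted_profiles)
    )
    return hits / len(reference_profiles)


def tolerance_accuracy(measured, predicted, tolerance=10.0):
    """Fraction of predictions whose absolute error is within the tolerance (seconds)."""
    hits = sum(abs(m - p) <= tolerance for m, p in zip(measured, predicted))
    return hits / len(measured)
```

Under this definition, identical profiles give f2 = 100, and a uniform 10-percentage-point deviation at every time point falls just below the threshold of 50, which is why f2 >= 50 is commonly read as "average difference of no more than about 10%".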
Results:
- Random data splitting produced highly variable accuracy with a low average, highlighting its unsuitability for this dataset.
- Manual data splitting required expert domain knowledge and was not scalable.
- The original maximum dissimilarity algorithm selected non-representative data, leading to poor performance.
- The MD-FIS algorithm significantly improved data splitting, yielding better prediction accuracies.
- The DNNs outperformed all other machine learning methods in predicting both OFDF disintegration time and SRMT dissolution profiles. Accuracies for both OFDF and SRMT were above 80%. The DNN's high accuracy on SRMT prediction demonstrates its ability to improve model accuracy in multi-label formulation prediction by leveraging the shared information among multiple tasks.
- Graphical representations of experimental vs. predicted values visually confirmed the high predictive performance of the DNNs.
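The data-splitting results above are easier to interpret with the baseline in view. The following is a minimal sketch of the plain maximum dissimilarity selection, not MD-FIS itself: it omits the small-group filter, the search for a representative initial set, and the within-group similarity term of the MD-FIS cost. Euclidean distance and the function name `max_dissimilarity_select` are assumptions made for illustration.

```python
import numpy as np


def max_dissimilarity_select(X, initial_idx, n_select):
    """Greedy maximin selection.

    Starting from the rows in initial_idx, repeatedly pick the candidate row
    whose nearest already-selected row is farthest away (Euclidean distance,
    an assumption of this sketch). Returns the picked row indices in order.
    """
    X = np.asarray(X, dtype=float)
    selected = list(initial_idx)
    candidates = [i for i in range(len(X)) if i not in set(selected)]
    picked = []
    for _ in range(min(n_select, len(candidates))):
        # Pairwise distances: one row per candidate, one column per selected point.
        diffs = X[candidates][:, None, :] - X[selected][None, :, :]
        dists = np.linalg.norm(diffs, axis=2)
        # Each candidate's distance to its nearest selected point.
        nearest = dists.min(axis=1)
        best = candidates[int(nearest.argmax())]
        picked.append(best)
        selected.append(best)
        candidates.remove(best)
    return picked


# Toy data: row 2 is far from everything else.
X = [[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [0.5, 0.0]]
print(max_dissimilarity_select(X, initial_idx=[0], n_select=2))
```

On the toy data, the isolated row is picked first. That is the behavior the summary attributes to the original algorithm (a tendency to select boundary or outlier points), and it is what the similarity term in the MD-FIS cost is meant to counteract.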
Conclusion:
The authors demonstrate that deep learning can predict in vitro characteristics of pharmaceutical formulations, overcoming the challenges of small, imbalanced datasets through the MD-FIS algorithm and pharmaceutically relevant evaluation metrics. The results suggest that deep learning predicts formulation properties more accurately than conventional machine learning methods and holds promise for accelerating drug product development, reducing costs, and enabling data-driven methodologies in pharmaceutical research. The authors suggest exploring additional techniques, such as transfer learning, to improve performance further.