Machine learning is increasingly used in biochemistry. However, in organic chemistry and other experiment-based fields, the data collected from real experiments are often insufficient, and the coronavirus disease (COVID-19) pandemic has made the situation worse. Such limited data can degrade model performance and hinder the development of a sound control strategy. This paper proposes a feasible machine learning solution to the problem of small sample size in the bio-polymerization process. To avoid overfitting, variational auto-encoder and generative adversarial network algorithms are used for data augmentation, and random forest and artificial neural network algorithms are used for modeling. The results show that data augmentation effectively improves the performance of the regression model. Among the machine learning models compared, the random forest model with data augmentation by the generative adversarial network achieved the best performance in predicting molecular weight, with an R of 0.94 on the training set and an R of 0.74 on the test set; the coefficient of determination of this model was 0.74.
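
The abstract reports both an R value and a coefficient of determination but does not say how either was computed. As a minimal sketch, assuming the standard definitions (Pearson correlation for R, and 1 - SS_res/SS_tot for the coefficient of determination), the two metrics can be computed as follows. The function names and the sample values are hypothetical and are not taken from the paper.

```python
import math

def pearson_r(y_true, y_pred):
    """Pearson correlation between observed and predicted values."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical values for illustration only (not data from the paper).
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [2.0, 3.0, 4.0, 5.0, 6.0]  # every prediction is offset by +1

print(pearson_r(observed, predicted))  # 1.0: perfectly correlated
print(r_squared(observed, predicted))  # 0.5: the constant offset is penalized
```

Under these definitions the two metrics need not agree: a systematic offset leaves the correlation untouched while lowering the coefficient of determination, so it helps to know which definition a reported figure uses.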

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC9357554
DOI: http://dx.doi.org/10.1016/j.ese.2022.100172
