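Objective: To build a machine learning model that accurately predicts the growth years of American ginseng by applying the random forest algorithm to its physicochemical properties, providing a tool for authenticating the age of American ginseng on the market. Methods: Based on 106 previously collected batches of 2- to 4-year-old American ginseng samples, nine physicochemical characteristics, including alcohol-soluble extractive content, ginsenoside Rb1 content, and length, were used as data features. The data set was randomly split into a training set and a validation set. A random forest model was built, with multiple linear regression as the control model, and each was trained and validated. The importance of all features was analyzed and the features were screened; the models were then rebuilt with the selected features and their accuracy was evaluated. Results: Preliminary modeling showed that the random forest model predicted more accurately than multiple linear regression. Feature importance analysis showed that five physicochemical properties (length, weight, alcohol-soluble extractive content, water-soluble extractive content, and ginsenoside Rb1 content) were of relatively high importance. Rebuilding the model with the selected features yielded an improved random forest model. The improved model was more accurate than the original model: on the validation set, the mean squared error was 0.017 and the coefficient of determination was 0.950, so it can be used to identify 2- to 4-year-old American ginseng. Conclusion: Based on American ginseng samples of different growth years from standardized cultivation in China, a mathematical-statistical method for determining growth years was established. The method is rapid, accurate, and reliable; it can serve as a basis for judging the growth years of American ginseng and offers a new approach to its quality evaluation.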
Objective: To authenticate the cultivation age of American ginseng (AG) using a random forest (RF) algorithm based on the physicochemical properties of AG. Methods: Nine physicochemical properties measured from 106 batches of AG samples aged 2 to 4 years constituted the data set. The features were the contents of five saponins (Rg1, Re, Rb1, Rd, and F11), the contents of alcohol-soluble and water-soluble extractives, and the length and weight of AG; these were used as the inputs of the machine learning model. The data were divided randomly into a training set and a validation set at a ratio of 4:1. RF was employed to build the machine learning model, while multivariate linear regression (MLR) was used as a benchmark algorithm. Impurity-based feature importances in RF and regression coefficients in MLR were calculated to rank the features. The most important features were selected as new inputs to build the modified models. Results: The preliminary results showed that RF performed better than MLR. Feature importance analysis indicated that five features (length, weight, content of water-soluble extractives, content of alcohol-soluble extractives, and Rb1 content) contributed most to the predictive models. After training on these five features, two modified models were obtained, which showed higher accuracy than the original models. The modified RF model outperformed the other models, with a mean squared error (MSE) of 0.017 and a coefficient of determination (R²) of 0.950 on the validation set, and was acceptable for authenticating the growth year of AG. Conclusion: The modified RF model built in this study is sufficiently accurate and can be used as a valuable tool to predict the cultivation age of AG.
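The two validation metrics reported above, MSE and R², and the 4:1 random split can be illustrated with a minimal sketch. This is not the authors' code: the helper names, the random seed, and the six example age values are hypothetical and chosen only to show how the quantities are computed. The batch count of 106 is the only number taken from the abstract.

```python
import random


def split_train_validation(samples, validation_fraction=0.2, seed=0):
    """Shuffle the samples and split them into training and validation subsets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_validation = round(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


def mean_squared_error(y_true, y_pred):
    """Average squared difference between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# 106 batch indices, as in the abstract; a 4:1 split keeps about 20% for validation.
batch_ids = list(range(106))
train_ids, validation_ids = split_train_validation(batch_ids)

# Hypothetical true and predicted cultivation ages (years), for illustration only.
true_age = [2, 2, 3, 3, 4, 4]
predicted_age = [2.1, 1.9, 3.2, 2.9, 3.8, 4.1]

mse_value = mean_squared_error(true_age, predicted_age)
r2_value = r_squared(true_age, predicted_age)
print(len(train_ids), len(validation_ids), mse_value, r2_value)
```

An MSE near zero and an R² near one both indicate that predicted ages track the true ages closely, which is the sense in which the validation values of 0.017 and 0.950 are read as acceptable.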