MeltingPointModel003

Modeling the Melting Point of Bergstrom's Set of Solid Drugs
Researcher: Andrew Lang

Introduction
We present a random forest melting point model of Bergstrom's [1] set of 277 solid drugs (ONSMP006) to compare with a recent paper by Deeb //et al.// [2]. Deeb //et al.// began with Bergstrom's set of 277 solid drugs and calculated 1497 descriptors using the proprietary programs HyperChem and Dragon. They then calculated multiple linear regression models using artificial neural networks and genetic algorithms for feature selection. The best models (best meaning the lowest RMSE of melting point prediction for the test set) had 5 descriptors (TestSet: R2: 0.556, RMSE: 36.376 °C) and 9 descriptors (TestSet: R2: 0.567, RMSE: 35.550 °C), respectively.

Method
The Bergstrom dataset (ONSMP006) of 277 compounds with 150 CDK descriptors was analysed for equivalent SMILES. Three pairs were found and removed (see table below), leaving 271 compounds.

|| name || SMILES || CSID || mpC ||
|| ibafloxacin || c1(c(c3c2c(c1)C(=O)C(=CN2C(CC3)C)C(=O)O)C)F || 64324 || 269 ||
|| hymecromone || CC1CCc2c(C)c(F)cc3c(=O)c(cn1c23)C(O)=O || 64324 || 194 ||
|| pentifylline || OC4CCN(CCCN2c1ccccc1Sc3ccc(C#N)cc23)CC4 || 4585 || 82 ||
|| perciyazine || N1(c2c(ccc(c2)C#N)Sc2c1cccc2)CCCN1CCC(O)CC1 || 4585 || 116 ||
|| mephenytoin || CCC1(NC(=O)N(C)C1=O)c2ccccc2 || 3920 || 136 ||
|| methetoin || CCC1(NC(=O)N(C)C1=O)c2ccccc2 || 3920 || 210 ||

The dataset was then compared with ONSMP013 to see if there were any discrepancies between the Bergstrom set and any of the other datasets. We found that 202 of the 271 compounds had melting point values in other datasets, and all 202 matches differed by less than 5 °C. Of these 202 matches, 179 were with the Karthikeyan dataset, which is notable because the Karthikeyan paper used the Bergstrom set as a test set.

The data was split into a [|training.xlsx] set of 185 compounds and a [|test.xlsx] set of 92 compounds. A random forest was created using the randomForest package (v.4.5-34) in R (v.2.11.0) using the following code:

code
# load the package that provides randomForest() and varImpPlot()
library(randomForest)
# load in data
setwd("C:/Users/alang/Documents/MyMesh/ONSC/CDKDescriptors/meltingpointmodel/20110315Bergstrom")
mydata = read.csv(file="training.csv",header=TRUE,row.names="molID")
# do random forest [randomForest 4.5-34]
mydata.rf <- randomForest(mpC ~ ., data = mydata,importance = TRUE)
varImpPlot(mydata.rf)
print(mydata.rf)

[output]
Call:
 randomForest(formula = mpC ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 50

          Mean of squared residuals: 2006.118
                    % Var explained: 36.29
[output]
code

The variable importance plot is shown below.
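The duplicate check described in the Method can be sketched in base R. This is only an illustration under a stated assumption: it uses the ChemSpider ID (CSID) as the key for "same structure", whereas the original analysis looked for equivalent SMILES, which would need a canonicalisation step not shown here. The data frame name `dups` and the variables `is_dup` and `mp_range` are hypothetical; the six rows are copied from the table of removed compounds.

```r
# Rows copied from the table of removed duplicate pairs.
dups <- data.frame(
  name = c("ibafloxacin", "hymecromone", "pentifylline",
           "perciyazine", "mephenytoin", "methetoin"),
  CSID = c(64324, 64324, 4585, 4585, 3920, 3920),
  mpC  = c(269, 194, 82, 116, 136, 210),
  stringsAsFactors = FALSE
)

# Assumption: a shared CSID marks two entries as the same structure.
# Flag both members of every pair, not just the second occurrence.
is_dup <- duplicated(dups$CSID) | duplicated(dups$CSID, fromLast = TRUE)

# Spread of reported melting points (°C) within each duplicated CSID.
mp_range <- sapply(split(dups$mpC[is_dup], dups$CSID[is_dup]),
                   function(x) max(x) - min(x))
print(mp_range)

# Keep only compounds whose structure appears once.
unique_only <- dups[!is_dup, ]
```

For the six rows above, the reported melting points within each pair differ by 34 to 75 °C, which illustrates why keeping either member of a pair would be hard to justify.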

Results
The random forest model was used to predict the melting points of the test set compounds, giving an R2 of 0.360, an AAE of 31.985 °C, and an RMSE of 40.873 °C. A plot of the predicted versus measured melting points can be seen below, with the training set in blue and the test set in red. Points are sized according to TopoPSA.
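The three statistics quoted above can be computed from vectors of measured and predicted melting points. The following is a minimal base R sketch, not the code used for this page: the helper name `mp_metrics` is hypothetical, R2 is assumed to be the squared Pearson correlation (the page does not state which definition was used), and the toy numbers are illustrative only.

```r
# Hypothetical helper: summary statistics for predicted vs measured values.
mp_metrics <- function(observed, predicted) {
  stopifnot(length(observed) == length(predicted))
  err <- predicted - observed
  c(R2   = cor(observed, predicted)^2,  # assumption: squared Pearson correlation
    AAE  = mean(abs(err)),              # average absolute error
    RMSE = sqrt(mean(err^2)))           # root mean squared error
}

# Toy values for illustration only; not taken from the Bergstrom data.
obs  <- c(100, 150, 200, 250)
pred <- c(110, 140, 210, 240)
m <- mp_metrics(obs, pred)
print(m)
```

With the fitted model, the predictions would come from something like `predict(mydata.rf, newdata = testdata)`, assuming the test set has been read into a data frame named `testdata` with the same descriptor columns. For comparison, the out-of-bag mean of squared residuals reported in the Method (2006.118) corresponds to an RMSE of about 44.8 °C on the training set.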

Conclusion
Random forest (and other non-linear) models typically perform better than linear models for predicting melting points [3]. The better performance of the models created by Deeb //et al.// (TestSet: R2: 0.567, RMSE: 35.550 °C) over the random forest model presented here (TestSet: R2: 0.360, RMSE: 40.873 °C) is therefore likely due to their use of a much more diverse set of descriptors: 1497 as compared to 150.
**[What kind of results do you get if you predict only the Bergstrom dataset using Model002? JCB]**
**[Running the Bergstrom dataset through Model002 gives an R2: 0.928, AAE: 11.428 °C, and RMSE: 14.673 °C. This is extremely good but we should note that the Bergstrom set was part of the training set for Model002. Even so, Deeb //et al.// never got the RMSE below 30 °C for any of their models. AL]**