Modeling the Melting Point of Bergstrom's Set of Solid Drugs

Researcher: Andrew Lang

Introduction

We present a random forest melting point model of Bergstrom's [1] set of 277 solid drugs (ONSMP006) for comparison with a recent paper by Deeb et al. [2]. Deeb et al. began with Bergstrom's set of 277 solid drugs and calculated 1497 descriptors using the proprietary programs HyperChem and Dragon. They then built multiple linear regression models, using artificial neural networks and genetic algorithms for feature selection. Their best models (judged by the lowest RMSE of melting point prediction for the test set) had 5 descriptors (test set: R2 = 0.556, RMSE = 36.376 °C) and 9 descriptors (test set: R2 = 0.567, RMSE = 35.550 °C).

Method

The Bergstrom dataset (ONSMP006) of 277 compounds with 150 CDK descriptors was checked for equivalent SMILES. Three pairs were found and removed (see table below), leaving 271 compounds. The dataset was then compared with ONSMP013 to see whether there were any discrepancies between the Bergstrom set and any of the other datasets. We found that 202 of the 271 compounds had melting point values in other datasets, and all 202 matches differed by less than 5 °C. Of these 202 matches, 179 were with the Karthikeyan dataset, which is notable because the Karthikeyan paper used the Bergstrom set as a test set.
name | SMILES | CSID | mpC
ibafloxacin | c1(c(c3c2c(c1)C(=O)C(=CN2C(CC3)C)C(=O)O)C)F | 64324 | 269
hymecromone | CC1CCc2c(C)c(F)cc3c(=O)c(cn1c23)C(O)=O | 64324 | 194
pentifylline | OC4CCN(CCCN2c1ccccc1Sc3ccc(C#N)cc23)CC4 | 4585 | 82
perciyazine | N1(c2c(ccc(c2)C#N)Sc2c1cccc2)CCCN1CCC(O)CC1 | 4585 | 116
mephenytoin | CCC1(NC(=O)N(C)C1=O)c2ccccc2 | 3920 | 136
methetoin | CCC1(NC(=O)N(C)C1=O)c2ccccc2 | 3920 | 210
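A duplicate check of this kind can be sketched in base R. The sketch assumes that two records count as equivalent when they share a ChemSpider ID (CSID), which is how the pairs in the table line up; the original comparison may instead have been made on canonicalised SMILES. The data frame name `bergstrom` is ours, and the last row is a hypothetical placeholder added only so that something survives the filter.

```r
# Illustrative records: the six rows from the table plus one
# hypothetical unique compound (placeholder values, not real data).
bergstrom <- data.frame(
  name = c("ibafloxacin", "hymecromone", "pentifylline", "perciyazine",
           "mephenytoin", "methetoin", "placeholder_unique"),
  CSID = c(64324, 64324, 4585, 4585, 3920, 3920, 0),
  mpC  = c(269, 194, 82, 116, 136, 210, NA),
  stringsAsFactors = FALSE
)

# Flag every record whose CSID occurs more than once
# (both members of each pair, not just the second one).
dup <- duplicated(bergstrom$CSID) | duplicated(bergstrom$CSID, fromLast = TRUE)

duplicates <- bergstrom[dup, ]    # the three pairs, six records
cleaned    <- bergstrom[!dup, ]   # records kept for modelling

print(duplicates)
print(nrow(cleaned))
```

Both members of each pair are dropped here, which is consistent with the count going from 277 to 271 compounds.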
The data were split into a training set (training.xlsx, 185 compounds) and a test set (test.xlsx, 92 compounds). A random forest was built with the randomForest package (v. 4.5-34) in R (v. 2.11.0) using the following code:
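## load the randomForest package [randomForest 4.5-34]
library(randomForest)
setwd("C:/Users/alang/Documents/MyMesh/ONSC/CDKDescriptors/meltingpointmodel/20110315Bergstrom")
## load in the training data; the molID column supplies the row names
mydata <- read.csv(file = "training.csv", header = TRUE, row.names = "molID")
## fit a regression random forest of mpC on all descriptors, keeping importance measures
mydata.rf <- randomForest(mpC ~ ., data = mydata, importance = TRUE)
## plot the variable importance and print the model summary
varImpPlot(mydata.rf)
print(mydata.rf)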
 
[output]
Call:
 randomForest(formula = mpC ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 50
 
          Mean of squared residuals: 2006.118
                    % Var explained: 36.29
[output]
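The two summary figures in the output are linked. As far as we understand the randomForest regression summary, "Mean of squared residuals" is the out-of-bag (OOB) mean squared error and "% Var explained" is 100 × (1 − MSE / Var(mpC)), so the output also implies an OOB RMSE and the spread of the training melting points. The sketch below only rearranges the printed numbers under that assumption; the variable names are ours.

```r
# Values copied from the printed randomForest summary.
oob_mse       <- 2006.118   # "Mean of squared residuals"
pct_explained <- 36.29      # "% Var explained"

# OOB root-mean-square error in degrees C.
oob_rmse <- sqrt(oob_mse)

# Assumption: pct_explained = 100 * (1 - oob_mse / var_y), hence
# var_y = oob_mse / (1 - pct_explained / 100).
var_y <- oob_mse / (1 - pct_explained / 100)
sd_y  <- sqrt(var_y)

cat(sprintf("OOB RMSE: %.1f C\n", oob_rmse))
cat(sprintf("Implied SD of training melting points: %.1f C\n", sd_y))
```

Under this reading the OOB RMSE is about 44.8 °C, against an implied standard deviation of about 56.1 °C for the training melting points.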
The variable importance plot is shown below.
RFDescriptorImportance.png

Results

The random forest model was used to predict the melting points of the test set compounds, giving an R2 of 0.360, an AAE of 31.985 °C, and an RMSE of 40.873 °C. A plot of predicted versus measured melting point is shown below, with the training set in blue and the test set in red. Points are sized according to TopoPSA.
RFObsPredicted.png
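The three statistics quoted above can be computed from observed and predicted melting points with a few lines of base R. The sketch assumes that R2 is the squared Pearson correlation between observed and predicted values (the text does not say which definition was used; 1 − SSE/SST is the other common choice), that AAE is the mean absolute error, and that RMSE is the root-mean-square error. The function name `mp_metrics` and the toy vectors are hypothetical.

```r
# Test-set statistics for a regression model.
# obs, pred: numeric vectors of observed and predicted melting points (deg C).
mp_metrics <- function(obs, pred) {
  stopifnot(length(obs) == length(pred))
  err <- pred - obs
  c(R2   = cor(obs, pred)^2,      # assumption: squared Pearson correlation
    AAE  = mean(abs(err)),        # average absolute error
    RMSE = sqrt(mean(err^2)))     # root-mean-square error
}

# Toy values for illustration only (not taken from the dataset).
obs  <- c(100, 150, 200, 250)
pred <- c(110, 140, 190, 260)
print(mp_metrics(obs, pred))
```

With a fitted model, `pred` would come from something like predict(mydata.rf, newdata = testdata), where `testdata` is a hypothetical data frame holding the test set, read in the same way as the training data.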

Conclusion

Random forest (and other non-linear) models typically perform better than linear models for predicting melting points [3]. The better performance of the models created by Deeb et al. (test set: R2 = 0.567, RMSE = 35.550 °C) over the random forest model presented here (test set: R2 = 0.360, RMSE = 40.873 °C) is therefore likely due to their use of a much more diverse set of descriptors: 1497 compared with 150.
[What kind of results do you get if you predict only the Bergstrom dataset using Model002? JCB]
[Running the Bergstrom dataset through Model002 gives an R2 of 0.928, an AAE of 11.428 °C, and an RMSE of 14.673 °C. This is extremely good, but we should note that the Bergstrom set was part of the training set for Model002. Even so, Deeb et al. never got the RMSE below 30 °C for any of their models. AL]

References

[1] Bergström C.A.; Norinder U.; Luthman K.; Artursson P. Molecular descriptors influencing melting point and their role in classification of solid drugs. J. Chem. Inf. Comput. Sci. 2003, 43(4), 1177-1185.
[2] Deeb O.; Goodarzi M.; Alfalah S. Prediction of melting point from drug-like compounds via QSPR methods. Mol. Phys. 2011, 109(4), 507-516.
[3] Hughes L.D.; Palmer D.S.; Nigsch F.; Mitchell J.B.O. Why Are Some Properties More Difficult To Predict than Others? A Study of QSPR Models of Solubility, Melting Point, and Log P. J. Chem. Inf. Model. 2008, 48, 220-232.