MeltingPointModel005

Random Forest Based Melting Point Prediction on a Highly Curated Double Validated Dataset II

**Researchers: Jean-Claude Bradley and Andrew Lang**

Introduction
Upon examining the outliers of the linear model in MPModel004, it was found that many of them may have been outliers due to incorrect ALogP predictions. What follows is a repeat of the procedure used in MPModel004, but without the descriptors ALogP and ALogp2.

Procedure
Using the same [|training set] (and [|test set]) as in MPModel004, but with ALogP and ALogp2 removed, we created a random forest model using the following code:

code
# do random forest [randomForest 4.5-34]
library(randomForest)
mydata = read.csv(file="20110729DVTrainingSet.csv", head=TRUE, row.names="molID")
mydata.rf <- randomForest(mpC ~ ., data = mydata, importance = TRUE)
print(mydata.rf)
code

[output]
Call:
 randomForest(formula = mpC ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 43

          Mean of squared residuals: 1536.594
                    % Var explained: 82.1
[output]
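The OOB RMSE quoted in this section follows directly from the mean of squared residuals in this output:

```r
# OOB RMSE is the square root of the OOB mean of squared residuals above
oob_rmse <- sqrt(1536.594)
round(oob_rmse, 2)  # 39.2 °C
```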

code
# get plot of importance - %IncMSE and IncNodePurity
varImpPlot(mydata.rf, main="Random Forest Variable Importance")
code

The random forest reports an OOB R2 of 0.82 and an OOB RMSE of 39.20 °C; the image below shows the importance of the descriptors. The model was then saved so that it can be used later:

code
saveRDS(mydata.rf, file = "ONSMPModel005")
code

The model is available for [|download] for batch melting point prediction under a CC0 license.

**Random Forest Model Results.** The random forest model had an AAE of 29.59 °C, an RMSE of 40.90 °C, a bias of 0.94 °C, and an R2 value of 0.82 when used to predict the melting points of the 500 test-set compounds, using the following code:

code
testdata = read.csv(file="20110729DVTestSet.csv", head=TRUE, row.names="molID")
test.predict <- predict(mydata.rf, testdata)
write.csv(test.predict, file = "RFTestSetPredict.csv")
code

The model had an AAE of 11.92 °C, an RMSE of 16.00 (39.20) °C, a bias of -0.11 °C, and an R2 value of 0.97 (0.82) when used to predict the melting points of the 2205 training-set compounds (OOB values given in parentheses), using the following code:

code
testdata = read.csv(file="20110729DVTrainingSet.csv", head=TRUE, row.names="molID")
test.predict <- predict(mydata.rf, testdata)
write.csv(test.predict, file = "RFTrainingSetPredict.csv")
code

The figures below show: a) predicted vs. experimental melting point for the training set, b) predicted vs. experimental melting point for the test set, and c) residuals vs. experimental melting point for the test set. The labels are ChemSpider IDs.
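The error statistics quoted in this section (AAE, RMSE, bias, R2) can be reproduced from the observed and predicted melting points with a small helper along these lines (the function and argument names are illustrative, not from the original code):

```r
# Sketch: error statistics from observed vs. predicted melting points.
# obs and pred are numeric vectors of the same length.
mp_stats <- function(obs, pred) {
  res <- pred - obs
  c(AAE  = mean(abs(res)),     # average absolute error
    RMSE = sqrt(mean(res^2)),  # root-mean-squared error
    Bias = mean(res),          # mean signed error
    R2   = cor(obs, pred)^2)   # squared Pearson correlation
}
```

For the test set above this would be something like `mp_stats(testdata$mpC, test.predict)`, assuming the test set retains its `mpC` column.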





**Linear model.** The descriptors indicated as important by the random forest were used to create linear models on the same training set, using the leaps package to select the best three and eight descriptors:

code
mydata = read.csv(file="20110729DVTrainingSet.csv", head=TRUE, row.names="molID")
library(leaps)
attach(mydata)
leaps <- regsubsets(mpC ~ nHBDon + TopoPSA + XLogP + SC.3 + WTPT.3 + WTPT.5 + ATSc1 + BCUTp.1l + nAtomP + nHBAcc + MDEC.12 + MDEC.22 + ATSm3 + BCUTw.1h + nAtomLC + VC.3 + ATSc4 + VPC.6 + BCUTp.1h + WTPT.4 + bpol + SP.7 + SPC.6 + ATSp5 + MDEC.23 + SCH.7 + nRotB + WTPT.2 + VP.2 + ATSm4 + fragC + MDEC.33 + VCH.7 + MW + SPC.4 + VCH.6 + ATSm5 + SCH.6 + ATSm1,
    nvmax=12, data=mydata, nbest=1, really.big=TRUE)
code

code
# models are ordered by the selection statistic - create plot
plot(leaps, scale="adjr2")
code
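With scale="adjr2" the plot orders the candidate subsets by adjusted R-squared. As a minimal, self-contained illustration of that selection statistic (using R's built-in mtcars data rather than the melting-point training set):

```r
# Sketch: comparing adjusted R^2 across models of different size -- the
# criterion used in the plot above -- on R's built-in mtcars data.
fit1 <- lm(mpg ~ wt, data = mtcars)
fit2 <- lm(mpg ~ wt + hp, data = mtcars)
adjr2 <- c(one = summary(fit1)$adj.r.squared,
           two = summary(fit2)$adj.r.squared)
adjr2  # the two-descriptor model has the higher adjusted R^2
```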

code
# summary of fit
fit <- lm(mpC ~ nHBDon + nAtomP + MW, data=mydata)
summary(fit)
code

[output]
Call:
lm(formula = mpC ~ nHBDon + nAtomP + MW, data = mydata)

Residuals:
     Min       1Q   Median       3Q      Max
-233.390  -39.888   -3.054   32.032  273.857

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -85.29220    3.31417  -25.74   <2e-16 ***
nHBDon       48.61551    1.61201   30.16   <2e-16 ***
nAtomP        8.96068    0.27938   32.07   <2e-16 ***
MW            0.29533    0.01788   16.52   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 58.18 on 2201 degrees of freedom
Multiple R-squared: 0.6063,    Adjusted R-squared: 0.6057
F-statistic: 1130 on 3 and 2201 DF,  p-value: < 2.2e-16
[output]

code
# summary of fit
fit <- lm(mpC ~ nHBDon + TopoPSA + nAtomP + nHBAcc + bpol + WTPT.2 + VP.2 + MW, data=mydata)
summary(fit)
code

[output]
Call:
lm(formula = mpC ~ nHBDon + TopoPSA + nAtomP + nHBAcc + bpol + WTPT.2 + VP.2 + MW, data = mydata)

Residuals:
     Min       1Q   Median       3Q      Max
-241.266  -35.625   -5.842   32.009  223.725

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -547.24397   31.48865 -17.379  < 2e-16 ***
nHBDon        31.80205    1.80572  17.612  < 2e-16 ***
TopoPSA        0.64370    0.06584   9.777  < 2e-16 ***
nAtomP         2.67930    0.40091   6.683 2.96e-11 ***
nHBAcc         8.88675    1.13111   7.857 6.12e-15 ***
bpol          -3.11410    0.21140 -14.731  < 2e-16 ***
WTPT.2       260.08996   17.42697  14.925  < 2e-16 ***
VP.2           9.99239    1.43245   6.976 4.01e-12 ***
MW             0.26072    0.02852   9.143  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 52.4 on 2196 degrees of freedom
Multiple R-squared: 0.6814,    Adjusted R-squared: 0.6802
F-statistic: 587.1 on 8 and 2196 DF,  p-value: < 2.2e-16
[output]

The image below shows the best n descriptors together with the adjusted R-squared. The 3-descriptor linear model is interesting in that it can predict melting points using three common, easy-to-calculate properties. On the 500-compound test set, the 3-descriptor linear model predicted melting points with an AAE of 47.55 °C, an RMSE of 62.27 °C, a bias of -2.24 °C, and an R2 of 0.59. The 8-descriptor model fared better, but not dramatically so, with an AAE of 42.31 °C, an RMSE of 56.17 °C, a bias of -0.61 °C, and an R2 of 0.67 on the test set.
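Written out, the 3-descriptor fit corresponds to the explicit equation mpC ≈ -85.29 + 48.62·nHBDon + 8.96·nAtomP + 0.295·MW, which can be applied directly (coefficients copied from the lm summary above; the function name is illustrative):

```r
# Sketch: the fitted 3-descriptor linear model as an explicit formula,
# using the coefficients from the summary output above.
mp_3desc <- function(nHBDon, nAtomP, MW) {
  -85.29220 + 48.61551 * nHBDon + 8.96068 * nAtomP + 0.29533 * MW
}
```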

Figures were created showing residuals for the 3-descriptor model and the 8-descriptor model. The labels are ChemSpider IDs.

Discussion
Removing ALogP and ALogp2 had little effect on the outliers in either the linear or the random forest model. Since the ALogP descriptors are known to sometimes give incorrect values, it was thought that removing them might improve the models. Given the above analysis, keeping them in the random forest model seems justified, but care should be taken when including them in linear models.