Random Forest Based Melting Point Prediction on a Highly Curated Double Validated Dataset II

Researchers: Jean-Claude Bradley and Andrew Lang

Introduction

Upon examining the outliers of the linear model in MPModel004 it was found that many were outliers possibly due to incorrect ALogP predictions. What follows is a repeat of the same procedure used in MPModel004 but without the descriptors ALogP or ALogp2.

Procedure

Using the same training set (and test set) as in MPModel004 but with ALogP and ALop2 removed we created a random forest model using the following code:
mydata = read.csv(file="20110729DVTrainingSet.csv",head=TRUE,row.names="molID")
## do random forest [randomForest 4.5-34]
mydata.rf <- randomForest(mpC ~ ., data = mydata,importance = TRUE)
print(mydata.rf)
 
[output]
Call:
 randomForest(formula = mpC ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 43
 
          Mean of squared residuals: 1536.594
                    % Var explained: 82.1
[output]
 
## get plot of importance - %IncMSE and IncNodePurity
varImpPlot(mydata.rf,main="Random Forest Variable Importance")
The RF reports an OOB R2 of 0.82 and an OOB RMSE of 39.20 °C with resulting image below showing the importance of the descriptors.
20110729RFVarImportance.png
The model was then saved so that it can be used later:
.saveRDS(mydata.rf, file = "ONSMPModel005")
The model is available for download to use for batch melting point prediction with a CC0 license.

Random Forest Model Results. The random forest model had an AAE of 29.59 °C, a RMSE of 40.90 °C, a Bias of 0.94 °C and an R2 value of 0.82 when used to predict the melting points of 500 test-set compounds, using the following code:
testdata = read.csv(file="20110729DVTestSet.csv",head=TRUE,row.names="molID")
test.predict <- predict(mydata.rf,testdata)
write.csv(test.predict, file = "RFTestSetPredict.csv")
The random forest model had an AAE of 11.92 °C, a RMSE of 16.00 (39.20) °C, a Bias of -0.11 °C and an R2 value of 0.97 (0.82) when used to predict the melting points of 2205 training-set compounds (with the OOB values given in parentheses), using the following code:
testdata = read.csv(file="20110729DVTrainingSet.csv",head=TRUE,row.names="molID")
test.predict <- predict(mydata.rf,testdata)
write.csv(test.predict, file = "RFTrainingSetPredict.csv")
The figures below show: a) Predicted melting point vs. Experimental melting point for the training set, b) Predicted melting point vs. Experimental melting point for the test set, and c) Residuals vs. Experimental melting point for the test set. The labels are ChemSpider IDs.
20110729Training.png

20110729Test.png

20110729TestResiduals2.png

Linear model. The descriptors indicated as important by the random forest were used to create linear models on the same training set of data using the following code - using the leaps package to select the best three and eight descriptors:
mydata = read.csv(file="20110729DVTrainingSet.csv",head=TRUE,row.names="molID")
library(leaps)
attach(mydata)
leaps<-regsubsets(mpC ~ nHBDon + TopoPSA + XLogP + SC.3 + WTPT.3  + WTPT.5 + ATSc1 + BCUTp.1l + nAtomP + nHBAcc +
MDEC.12 + MDEC.22 + ATSm3 +  BCUTw.1h + nAtomLC + VC.3 + ATSc4 + VPC.6 + BCUTp.1h + WTPT.4 + bpol + SP.7 + SPC.6 + ATSp5 + MDEC.23 +
SCH.7 + nRotB + WTPT.2 + VP.2 + ATSm4 + fragC + MDEC.33 + VCH.7 + MW + SPC.4 + VCH.6 + ATSm5 + SCH.6 + ATSm1,
nvmax=12,data=mydata,nbest=1,really.big=TRUE)
 
## models are ordered by the selection statistic - create plot.
plot(leaps,scale="adjr2")
 
fit <- lm(mpC ~ nHBDon + nAtomP + MW,data=mydata)
## summary of fit
summary(fit)
[output]
Call:
lm(formula = mpC ~ nHBDon + nAtomP + MW, data = mydata)
 
Residuals:
     Min       1Q   Median       3Q      Max
-233.390  -39.888   -3.054   32.032  273.857
 
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -85.29220    3.31417  -25.74   <2e-16 ***
nHBDon       48.61551    1.61201   30.16   <2e-16 ***
nAtomP        8.96068    0.27938   32.07   <2e-16 ***
MW            0.29533    0.01788   16.52   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 
Residual standard error: 58.18 on 2201 degrees of freedom
Multiple R-squared: 0.6063,     Adjusted R-squared: 0.6057
F-statistic:  1130 on 3 and 2201 DF,  p-value: < 2.2e-16
[output]
fit <- lm(mpC ~ nHBDon + TopoPSA +nAtomP + nHBAcc + bpol + WTPT.2 + VP.2 + MW,data=mydata)
## summary of fit
summary(fit)
[output]
Call:
lm(formula = mpC ~ nHBDon + TopoPSA + nAtomP + nHBAcc + bpol +
    WTPT.2 + VP.2 + MW, data = mydata)
 
Residuals:
     Min       1Q   Median       3Q      Max
-241.266  -35.625   -5.842   32.009  223.725
 
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -547.24397   31.48865 -17.379  < 2e-16 ***
nHBDon        31.80205    1.80572  17.612  < 2e-16 ***
TopoPSA        0.64370    0.06584   9.777  < 2e-16 ***
nAtomP         2.67930    0.40091   6.683 2.96e-11 ***
nHBAcc         8.88675    1.13111   7.857 6.12e-15 ***
bpol          -3.11410    0.21140 -14.731  < 2e-16 ***
WTPT.2       260.08996   17.42697  14.925  < 2e-16 ***
VP.2           9.99239    1.43245   6.976 4.01e-12 ***
MW             0.26072    0.02852   9.143  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 
Residual standard error: 52.4 on 2196 degrees of freedom
Multiple R-squared: 0.6814,     Adjusted R-squared: 0.6802
F-statistic: 587.1 on 8 and 2196 DF,  p-value: < 2.2e-16
[output]
The image below shows the best n descriptors together with the adjusted r-squared.
20110729Leaps.png
The 3-descriptor linear model is interesting that it can predict melting points using 3 common easy to calculate properties. On the 500 compound test-set, the 3-descriptor linear model predicted melting points with an AAE of 47.55 °C, a RMSE of 62.27 °C, a bias of -2.24 °C, and an R2 of 0.59. The 8-descriptor model faired better, but not excessively so, with an AAE of 42.31 °C, a RMSE of 56.17 °C, a bias of -0.61 °C, and an R2 of 0.67 on the test set.

Figures were created showing residuals for the 3-descriptor model and the 8-descriptor model. The labels are ChemSpider IDs.
20110729Test3DescResiduals.png
20110729Test8DescResiduals.png

Discussion

Removal of ALogP and ALogp2 has little impact on altering the outliers in either model; linear or random forest model. Since the ALogP descriptors are known to sometimes give incorrect values, it was thought that removing them could improve the models. With the above analysis, it seems that leaving them in the random forest model is justified but care should be taken when including them in linear models.