Random Forest Based Melting Point Prediction on a Highly Curated Double Validated Dataset II

Researchers: Jean-Claude Bradley and Andrew Lang

Introduction

Upon examining the outliers of the linear model in MPModel004, it was found that many were possibly outliers due to incorrect ALogP predictions. What follows is a repeat of the procedure used in MPModel004, but without the descriptors ALogP and ALogp2.

Procedure

Using the same training set (and test set) as in MPModel004, but with ALogP and ALogp2 removed, we created a random forest model using the following code:

library(randomForest)  ## randomForest 4.5-34
mydata <- read.csv(file="20110729DVTrainingSet.csv",header=TRUE,row.names="molID")
## do random forest
mydata.rf <- randomForest(mpC ~ ., data = mydata, importance = TRUE)
print(mydata.rf)
[output]
Call:
randomForest(formula = mpC ~ ., data = mydata, importance = TRUE)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 43
Mean of squared residuals: 1536.594
% Var explained: 82.1
[output]
## get plot of importance - %IncMSE and IncNodePurity
varImpPlot(mydata.rf,main="Random Forest Variable Importance")

The random forest reports an OOB R2 of 0.82 and an OOB RMSE of 39.20 °C; the resulting image below shows the importance of the descriptors.

The model was then saved so that it can be used later. It is available for download, under a CC0 license, for batch melting point prediction.
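A minimal sketch of saving and reloading the fitted forest in R (the file name here is hypothetical, not the one used in the notebook):

```r
## persist the fitted random forest so it can be reloaded later
save(mydata.rf, file = "20110729MPModel005rf.RData")

## later, in a fresh R session:
## load("20110729MPModel005rf.RData")
## newPredictions <- predict(mydata.rf, newdata = newCompounds)
```

Reloading with load() restores the object under its original name, mydata.rf, so downstream prediction scripts can use it directly.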

Random Forest Model Results

The random forest model had an AAE of 29.59 °C, a RMSE of 40.90 °C, a bias of 0.94 °C, and an R2 value of 0.82 when used to predict the melting points of the 500 test-set compounds, using the following code:

The random forest model had an AAE of 11.92 °C, a RMSE of 16.00 (39.20) °C, a bias of -0.11 °C, and an R2 value of 0.97 (0.82) when used to predict the melting points of the 2205 training-set compounds (with the OOB values given in parentheses), using the following code:
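A minimal sketch of how such test-set statistics (AAE, RMSE, bias, R2) can be computed in R, assuming a held-out CSV with the same descriptor columns as the training set (the file name is hypothetical):

```r
library(randomForest)

## load the held-out test set (hypothetical file name)
testdata  <- read.csv(file = "20110729DVTestSet.csv", header = TRUE,
                      row.names = "molID")
predicted <- predict(mydata.rf, newdata = testdata)
resid     <- predicted - testdata$mpC   # signed prediction errors

AAE  <- mean(abs(resid))                # average absolute error
RMSE <- sqrt(mean(resid^2))             # root mean squared error
Bias <- mean(resid)                     # mean signed error
R2   <- cor(predicted, testdata$mpC)^2  # squared Pearson correlation
```

The same statistics on the training set itself (predict(mydata.rf, newdata = mydata)) give the optimistic in-sample values, while omitting newdata returns the OOB predictions quoted in parentheses.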

The figures below show: a) Predicted melting point vs. Experimental melting point for the training set, b) Predicted melting point vs. Experimental melting point for the test set, and c) Residuals vs. Experimental melting point for the test set. The labels are ChemSpider IDs.

Linear Model

The descriptors indicated as important by the random forest were used to create linear models on the same training set using the following code, with the leaps package used to select the best three and eight descriptors:

The image below shows the best n descriptors together with the adjusted r-squared.
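A minimal sketch of best-subsets selection with the leaps package, under the assumption that mydata holds the training set with mpC as the response (exact arguments in the notebook may differ):

```r
library(leaps)

## exhaustive best-subsets search over the important descriptors;
## really.big = TRUE is required when there are many candidate variables
subsets <- regsubsets(mpC ~ ., data = mydata, nvmax = 8, really.big = TRUE)

## plot best n-descriptor models against adjusted r-squared
plot(subsets, scale = "adjr2")
summary(subsets)$adjr2
```

The best three- and eight-descriptor subsets identified this way can then be fit with lm() and evaluated on the test set.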

The 3-descriptor linear model is interesting in that it can predict melting points using three common, easily calculated properties. On the 500-compound test set, the 3-descriptor linear model predicted melting points with an AAE of 47.55 °C, a RMSE of 62.27 °C, a bias of -2.24 °C, and an R2 of 0.59. The 8-descriptor model fared better, but not excessively so, with an AAE of 42.31 °C, a RMSE of 56.17 °C, a bias of -0.61 °C, and an R2 of 0.67 on the test set.

Figures were created showing residuals for the 3-descriptor model and the 8-descriptor model. The labels are ChemSpider IDs.

Discussion

Removal of ALogP and ALogp2 had little impact on the outliers in either the linear or the random forest model. Since the ALogP descriptors are known to sometimes give incorrect values, it was thought that removing them could improve the models. Given the above analysis, leaving them in the random forest model seems justified, but care should be taken when including them in linear models.
