Random Forest Based Melting Point Prediction on a Highly Curated Double Validated Dataset III

Researchers: Jean-Claude Bradley, Andrew Lang, Antony Williams, et. al.

Introduction

A random forest melting point model (MPModel004) that predicts melting points from 2D structure (SMILES) has a test-set R2 of 0.82. The model seems to consistently underestimate the melting point of molecules with relatively high melting points. This may be due to a lack of information in the 2D descriptors. To see if this is the case, below we perform the same procedure as MPModel004 except now we include 3D descriptors. 3D descriptors related to symmetry have been used to predict or explain melting points (Abramowitz1990, Brown2000).

Procedure

Calculating the descriptors. Using the original dataset from MPModel004, we calculated 3D structures (adding hydrogens) using ChemAxon's molconvert, e.g.:
molconvert -3 sdf:H smiles.smi -o molecules.sdf
We ran the resulting sdf file through Rajarshi Guha's CDK Descriptor Calculator GUI(v 1.3.2) calculating all descriptors except Ionization Potential (which is broken). We then removed descriptors with a lot of NA entries (Wgamma1.unity, Wgamma2.unity, Wgamma3.unity, WG.unity, Kier3, Weta1.unity, WD.unity, HybRatio, Weta2.unity), and descriptors with less than 10 non-zero entries (nR, nN, nD, nC, nF, nQ, nE, nH, nI, nP, nL, nK, nM, nS, nT, nY, nV, nW, khs.sLi, khs.ssBe, khs.ssssBe, khs.ssBH, khs.sssB, khs.ssssB, khs.sNH3, khs.ssNH2, khs.dNH, khs.sssNH, khs.ssssN, khs.sSiH3, khs.ssSiH2, khs.sssSiH, khs.sPH2, khs.ssPH, khs.sssP, khs.sssssP, khs.dssS, khs.sGeH3, khs.ssGeH2, khs.sssGeH, khs.ssssGe, khs.sAsH2, khs.ssAsH, khs.sssAs, khs.sssdAs, khs.sssssAs, khs.sSeH, khs.dSe, khs.ssSe, khs.aaSe, khs.dssSe, khs.ddssSe, khs.sSnH3, khs.ssSnH2, khs.sssSnH, khs.ssssSn, khs.sPbH3, khs.ssPbH2, khs.sssPbH, khs.ssssPb). Two compounds were also removed from the dataset: methane had NA for topoShape, and octaphenylcyclotetrasiloxane because it again failed to generate any descriptors.

Feature selection. As in MPModel004, we performed feature selection using the following code (caret package):
## load in data
mydata = read.csv(file="20110730DV3DCombinedWithConstantDescriptorsRemoved.csv",head=TRUE,row.names="molID")
## correlation matrix
cor.mat = cor(mydata)
## find correlation r > 0.95
findCorrelation(cor.mat, cutoff = .95, verbose = TRUE)
[output]
[1] 102 129 217 103  73 104 110 225 130 174 126 215 131 111 105  67  53 127 216 223 128  68 133 106  69  71 108 107
119 115 114 175  39  46 11  55 202  14 178 22 122  47  24  15  37  38  21 209 207  58  59 100   6  90  84
[output]
The recommended columns (SP-0, GRAVH-1, WTPT-1, SP-1, nB, SP-2, VP-0, Zagreb, GRAVH-2, Kier1, GRAV-1, VABC, GRAVH-3, VP-1, SP-3, ATSp1, apol, GRAV-2, MW, WPOL, GRAV-3, ATSp2, GRAV-5, SP-4, ATSp3, ATSp5, SP-6, SP-5, SPC-5, VP-5, VP-4, Kier2, fragC, WT.unity, PPSA-2, nAromBond, MOMI-X, PNSA-2, LOBMAX, FNSA-1, VPC-5, WA.unity, FNSA-3, PNSA-3, RHSA, RPSA, FPSA-3, PetitjeanNumber, MOMI-YZ, ATSc2, ATSc3, VC-5, BCUTc-1l, VCH-4, SCH-3) were deleted, leaving a total of 170 descriptors.

The resulting spreadsheet of 2704 melting points with 170 2D & 3D CDK descriptors was then split into two sets: a test set of 500 compounds and a training set of 2204 compounds. The test set and training set contain the same compounds as in MPModel004 for comparisons sake. The traing set is smaller by one compound because methane was removed in this analysis because it failed to generate the topoShape 3D descriptor.

Building the Models

Random Forest. The randomForest package (v 4.5-34) in R was used to build the random forest models using the following code:
mydata = read.csv(file="20110730DV3DTrainingSet.csv",head=TRUE,row.names="molID")
## do random forest [randomForest 4.5-34]
mydata.rf <- randomForest(mpC ~ ., data = mydata,importance = TRUE)
print(mydata.rf)
 
[output]
Call:
 randomForest(formula = mpC ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 56
 
          Mean of squared residuals: 1515.904
                    % Var explained: 82.29
[output]
 
## get plot of importance - %IncMSE and IncNodePurity
varImpPlot(mydata.rf,main="Random Forest Variable Importance")
The RF reports an OOB R2 of 0.82 and an OOB RMSE of 38.93 °C with resulting image below showing the importance of the descriptors. The image points to the number of hydrogen bond donors (nHBDon) and topological polar surface area as the most important physiochemical properties for melting point prediction.
201107303DRFVarImp.png
The model was then saved so that it could be deployed as a web service using the following code:
.saveRDS(mydata.rf, file = "ONSMPModel006")
The model is available for download to use for batch melting point prediction with a CC0 license. Note that you must convert your structure to 3D sdf format and calculate all descriptors (except Ionization Potential), including the 3D descriptors.

Random Forest Model Results. The random forest model had an AAE of 29.52 °C, a RMSE of 40.79 °C, a Bias of 0.64 °C and an R2 value of 0.83 when used to predict the melting points of 500 test-set compounds, using the following code:
testdata = read.csv(file="20110730DV3DTestSet.csv",head=TRUE,row.names="molID")
test.predict <- predict(mydata.rf,testdata)
write.csv(test.predict, file = "RFTestSetPredict.csv")
The random forest model had an AAE of 11.54 °C, a RMSE of 15.57 (38.93) °C, a Bias of -0.13 °C and an R2 value of 0.97 (0.82) when used to predict the melting points of 2204 training-set compounds (with the OOB values given in parentheses), using the following code:
testdata = read.csv(file="20110730DV3DTrainingSet.csv",head=TRUE,row.names="molID")
test.predict <- predict(mydata.rf,testdata)
write.csv(test.predict, file = "RFTrainingSetPredict.csv")
The figures below show: a) Predicted melting point vs. Experimental melting point for the training set, b) Predicted melting point vs. Experimental melting point for the test set, and c) Residuals vs. Experimental melting point for the test set. The labels are ChemSpider IDs.
201107303DRFTraining.png
201107303DRFTest.png
201107303DRFTestResiduals.png

Linear model. The descriptors indicated as important by the random forest were used to create a linear model on the same training set of data using the following code - using the leaps package to select the best 5 descriptors:
mydata = read.csv(file="20110730DV3DTrainingSet.csv",head=TRUE,row.names="molID")
library(leaps)
attach(mydata)
leaps<-regsubsets(mpC ~ nHBDon + TopoPSA + FNSA.2 + WTPT.5 + nAtomP + XLogP + ATSp4 + BCUTp.1l + MOMI.XY + WTPT.3 + GRAV.6 + TPSA +
RPCG + nHBAcc + GRAV.4 + MOMI.Z + SC.3 + nAtomLC + Wnu1.unity + RNCG + WTPT.2 + SCH.7 + SP.7 + ALogP + FMF + nRotB + THSA +
Wnu2.unity + ATSc1 + BCUTp.1h + SPC.6 + MDEC.33 + VCH.7 + ATSm3 + WNSA.2 + DPSA.2 + ATSm2 + ATSm4 + MDEC.23,
nvmax=12,data=mydata,nbest=1,really.big=TRUE)
 
## models are ordered by the selection statistic - create plot.
plot(leaps,scale="adjr2")
 
fit <- lm(mpC ~ nHBDon +TopoPSA + WTPT.3 + GRAV.6 +  FMF,data=mydata)
## summary of fit
summary(fit)
[output]
Call:
lm(formula = mpC ~ nHBDon + TopoPSA + WTPT.3 + GRAV.6 + FMF,
    data = mydata)
 
Residuals:
     Min       1Q   Median       3Q      Max
-235.082  -35.751   -4.526   30.746  203.201
 
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -223.98738    6.90093 -32.458  < 2e-16 ***
nHBDon        34.27893    1.78262  19.230  < 2e-16 ***
TopoPSA        1.24969    0.07394  16.902  < 2e-16 ***
WTPT.3        -3.33092    0.41333  -8.059 1.25e-15 ***
GRAV.6        17.03089    0.72415  23.518  < 2e-16 ***
FMF          138.86038    6.66903  20.822  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 
Residual standard error: 52.25 on 2198 degrees of freedom
Multiple R-squared: 0.6819,     Adjusted R-squared: 0.6812
F-statistic: 942.2 on 5 and 2198 DF,  p-value: < 2.2e-16
[output]
The image below shows the best n descriptors together with the adjusted r-squared.
201107303DLeaps.png
The 5-descriptor linear model was used to predict the melting point values of the 2204 training set with an AAE of 40.59 °C, a RMSE of 52.18 °C, a bias of 0.00 °C, and an R2 of 0.68. The linear model was used to predict the melting points of the 500 compound test-set with an AAE of 43.15 °C, a RMSE of 57.01 °C, a bias of -1.61 °C, and an R2 of 0.66. Figures were again created showing a) Predicted melting point vs. Experimental melting point for the training set, b) Predicted melting point vs. Experimental melting point for the test set, and c) Residuals vs. Experimental melting point for the test set. The labels are ChemSpider IDs.
201107303DLinearTraining.png
201107303DLinearTest.png
201107303DLinearTestResiduals.png

Comparison

The table below shows a comparison between MPModel004 (2D descriptors only) and the model presented here, MPModel006 (2D+3D descriptors). Both models perform poorly for compounds with melting points greater then 200C as compared to compounds with melting points between -200C and 200C. The addition of 3D descriptors from the CDK has little effect for the random forest. They do improve the linear model.
Model
TestSet Range
AAE
RMSE
R2
Bias
004 RF
All
29.51
40.91
0.82
0.97
006 RF
All
29.52
40.79
0.83
0.64
004 RF
-200 - 200
25.61
33.05
0.85
-4.95
006 RF
-200 - 200
25.62
32.98
0.85
-5.30
004 RF
200 - 450
84.69
99.48
-2.59
84.69
006 RF
200 - 450
84.63
99.13
-2.57
84.63
004 Linear(7)
ALL
44.09
58.19
0.65
-0.50
006 Linear(5)
ALL
43.15
57.01
0.66
-1.61
004 Linear(7)
-200 - 200
39.76
50.87
0.64
-7.87
006 Linear(5)
-200 - 200
38.55
49.13
0.66
-9.32
004 Linear(7)
200 - 450
105.42
121.15
-4.33
103.83
006 Linear(5)
200 - 450
108.26
122.81
-4.47
107.52

Conclusion

Including 3D descriptors from the CDK gives a slight but not significant improvement in the random forest model. This matches the experience of Bergstrom, Karthikeyan, Nigsch et. al. (Nigsch2006) and Hughes et. al. (Hughes2008). The test-set bias statistic did improve significantly, it is smaller 0.64 as compared to 0.97 from MPModel004, but AAE, RMSE, and R2 are only slightly better, if at all, for the model including 3D descriptors.

For the linear model we see that the addition of 3D descriptors improves the test-set predictive ability of the model: MPModel006 5-descriptor linear model R2 0.66 compared to MPModel004 7-descriptor linear model R2 0.65. Improvement can also be seen in comparing the 'leaps' plots, MPModel004 leveling off at around 0.7 whereas MPModel006 levels of at around 0.73. The new 3D descriptor that appears in the linear model is GRAV.6 which is a descriptor characterizing the mass distribution of the molecule. This descriptors works well in combination with FMF, the molecular complexity of the molecule in terms of its Murcko framework.

[Although the addition of 3D descriptors did not improve the predictive ability for the bulk of molecules, it may be that 3D factors are only important for a minority of molecules. It would be interesting to do a run specifically on a set of molecules known to be affected by symmetry or trans, meta or para dicarboxylic acids (the prototypical case of fumaric and maleic acids). A convenient dataset might be the one used by Abramowitz1990. JCB]