Random Forest Based Melting Point Prediction

Researchers: Jean-Claude Bradley, Andrew Lang, Tony Williams, et. al.

Introduction

Two recent random forest melting point models were created on a highly curated, double plus validated subset (ONSMP029) of a dataset of 20152 compounds (ONSMP028). These two models,
MPModel004 2D Descriptors and MPModel006 2D+3D Descriptors, built on a training set of 2205 compounds, had an AAE of 29.51 °C, a RMSE of 40.91 °C, a Bias of 0.97 °C and an R2 value of 0.82; and an AAE of 29.52 °C, a RMSE of 40.79 °C, a Bias of 0.64 °C and an R2 value of 0.83, respectively, when used to predict the melting points of 500 test-set compounds. Here we present a melting point model MPModel007, beginning with the entire dataset of 20152 compounds.

Procedure

The dataset (ONSMP028) was scanned for salts (SMILES with periods) and metals; finding 219 compounds. These compounds were removed to create ONSMP030 which was uploaded to the wiki. From ONSMP030 we removed all multiple measurements with ranges greater than 10 °C. The SMILES of the remaining compounds in the dataset were run through Rajarshi Guha's CDK Descriptor Calculator GUI (v1.3.2) to generate all 2D CDK descriptors except Ionization Potential (which is broken).

We removed compounds that failed to generate some or all CDK descriptors, descriptors Kier3 and HybRation as they contained multiple NA entries, and descriptors with less than 50 non-zero entries (khs.sLi, khs.ssBe, khs.ssssBe, khs.ssBH, khs.ssssB, khs.sNH3, khs.ssNH2, khs.sssNH, khs.ssssN, khs.sSiH3, khs.ssSiH2, khs.sssSiH, khs.sPH2, khs.ssPH, khs.sssssP, khs.dssS, khs.sGeH3, khs.ssGeH2, khs.sssGeH, khs.ssssGe, khs.sAsH2, khs.ssAsH, khs.sssAs, khs.sssdAs, khs.sssssAs, khs.sSeH, khs.dSe, khs.ssSe, khs.aaSe, khs.dssSe, khs.ddssSe, khs.sSnH3, khs.ssSnH2, khs.sssSnH, khs.ssssSn, khs.sPbH3, khs.ssPbH2, khs.sssPbH, khs.ssssPb). This left a dataset of 19515 compounds and 167 descriptors.

Feature selection. The following code was used to find collinear descriptors (using the caret package):
## load in data
mydata = read.csv(file="20110803WithmolIDFeatureSelected.csv",head=TRUE,row.names="molID")
## correlation matrix
cor.mat = cor(mydata)
## find correlation r > 0.95
findCorrelation(cor.mat, cutoff = .95, verbose = TRUE)
[output]
 [1] 168  64  65 166  33  27 160  28  66  70  62  29  63 158  30  71  13  67  68 159  73  79  75  83  14   7  88  18  17  44
[output]
The recommended columns (Zagreb, SP-2, SP-3, WPOL, nB, ATSp1, ATSp2, WTPT-1, SP-4, VP-0, SP-0, ATSp3, SP-1, VABC, ATSp4, VP-1, apol, SP-5, SP-6, MW, VP-3, SPC-5, VP-5, VPC-6, naAromAtom, BCUTc-1l, khs.sssB, ATSc2, ATSc1, SCH-3) were deleted, leaving a total of 137 descriptors.

The resulting dataset of 19515 compounds with 137 descriptors was then split into a training-set of 16015 compounds and a test-set of 3500 compounds. The test-set here contains the 500 compound test-set of both previous models as a subset.

Visualizing the Chemical Space. The entire dataset of 19515 melting point measurements contains compounds with melting points ranging from -205 °C to 562 °C (with a mean of 106.1 °C) whereas the double+ validated dataset has melting points ranging from -187.75 °C to 437.65 °C (with a mean of 60.4 °C). Principal component analysis was performed using the program R (v 2.13.0) with the following code:
mydata = read.csv(file="20110803combined.csv",head=TRUE,row.names="molID")
data <- mydata[, 3:140] ## Don't include CSID or DV (DV flag) in pca, but do include mpC
pc1 <- prcomp(data, scale. = T)
x <- pc1$x
summary(pc1)
 
[output]
Importance of components:
                          PC1     PC2     PC3     PC4     PC5     PC6     PC7     PC8     PC9    PC10    PC11    PC12
Standard deviation     5.3040 3.13111 2.89099 2.69493 2.32362 2.07313 1.99967 1.88602 1.80153 1.73814 1.67723 1.59750
Proportion of Variance 0.2039 0.07104 0.06056 0.05263 0.03912 0.03114 0.02898 0.02578 0.02352 0.02189 0.02038 0.01849
Cumulative Proportion  0.2039 0.27490 0.33547 0.38809 0.42722 0.45836 0.48734 0.51311 0.53663 0.55853 0.57891 0.59740
[output]
 
colvec <- mydata$DV
plot(x[, 1], x[, 2], pch = 20, col = colvec+2,xlim=c(-10,50),ylim=c(-15,15))
legend(30, 15, c("Single measurement","Double+ validated"), cex=1, col = c(2, 3),pch=20)
From the resulting figure below, of the first two principal components, we see that: 1. The entire datasets comprises a much larger chemical space than the double+ validated data set by itself. The bottom leg [0,25]x[-10,-5] being comprised of mainly the Karthikeyan data, see the PCA from MPModel001.
20110803PCA.png

Building the Models

Random Forest. The randomForest package (v 4.5-34) in R was used to build the random forest models using the following code:
mydata = read.csv(file="20110803TrainingSet.csv",head=TRUE,row.names="molID")
## do random forest [randomForest 4.5-34]
mydata.rf <- randomForest(mpC ~ ., data = mydata,importance = TRUE)
print(mydata.rf)
 
[output]
Call:
 randomForest(formula = mpC ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 45
 
          Mean of squared residuals: 1665.557
                    % Var explained: 80.03
[output]
 
## get plot of importance - %IncMSE and IncNodePurity
varImpPlot(mydata.rf,main="Random Forest Variable Importance")
The RF reports an OOB R2 of 0.80 and an OOB RMSE of 40.81 °C with resulting image below showing the importance of the descriptors. The image points to both the number of hydrogen bond donors (nHBDon) and the topological polar surface area (TopoPSA) as the most important physiochemical properties for melting point prediction as found in all previous analyses.
20110803RFVarImp.png
The model was then saved so that it could be deployed as a web service using the following code:
.saveRDS(mydata.rf, file = "ONSMPModel007")
The model is available for download to use for batch melting point prediction with a CC0 license.

Random Forest Model Results.The random forest model had an AAE of 29.36 °C, a RMSE of 40.18 °C, a Bias of 0.58 °C and an R2 value of 0.81 when used to predict the melting points of 3500 test-set compounds and an AAE of 26.62 °C, a RMSE of 36.35 °C, a Bias of 1.63 °C and an R2 value of 0.86 when used to predict the 500 compound test-set from melting point models 004 and 006, using the following code:
testdata = read.csv(file="20110803TestSet.csv",head=TRUE,row.names="molID")
test.predict <- predict(mydata.rf,testdata)
write.csv(test.predict, file = "RFTestSetPredict.csv")
The random forest model had an AAE of 12.08 °C, a RMSE of 16.73 (40.81) °C, a Bias of -0.07 °C and an R2 value of 0.97 (0.80) when used to predict the melting points of 16015 training-set compounds (with the OOB values given in parentheses), using the following code:
testdata = read.csv(file="20110803TrainingSet.csv",head=TRUE,row.names="molID")
test.predict <- predict(mydata.rf,testdata)
write.csv(test.predict, file = "RFTrainingSetPredict.csv")
The figures below show: a) Predicted melting point vs. Experimental melting point for the training set coloured by residual, b) Predicted melting point vs. Experimental melting point for the 3500 compound test-set with the test-set from models 004 and 006 coloured red, and c) Residuals vs. Experimental melting point for the 3500 compound test-set with the test-set from models 004 and 006 coloured red. The labels are ChemSpider IDs.
20110803Training.png
20110803Test.png
20110803TestResiduals.png

Linear model. The descriptors indicated as important by the random forest were used to create a linear model on the same training set of data using the following code - using the leaps package to select the best 7 descriptors:
## load in data
mydata = read.csv(file="20110803TrainingSet.csv",head=TRUE,row.names="molID")
attach(mydata)
leaps<-regsubsets(mpC ~ nHBDon + TopoPSA + nAtomP + BCUTp.1l + XLogP + WTPT.5 + ATSc3 + BCUTw.1h + ALogP + ATSc5 + BCUTp.1h +
WTPT.3 + WTPT.4 + nRotB + MDEC.22 + ALogp2 + ATSc4 + AMR + khs.ssCH2 + MDEC.33 + BCUTw.1l + WTPT.2 + khs.ssO + bpol +
MDEC.23 + C1SP3 + VPC.5 + khs.dssC + VP.6 + SPC.6 + SCH.7 + fragC + SPC.4 + SP.7 + nHBAcc + VCH.7 + SC.3 + ATSm4 +
SCH.6 + FMF + MDEO.11 + ATSm2 + ATSm3 + ATSp5 + VCH.6,nvmax=12,data=mydata,nbest=3,really.big=TRUE)
## models are ordered by the selection statistic - create plot.
plot(leaps,scale="adjr2")
 
fit <- lm(mpC ~ nHBDon + TopoPSA + nAtomP + XLogP + WTPT.2 + khs.ssO + ATSm4,data=mydata)
summary(fit)
[output]
Call:
lm(formula = mpC ~ nHBDon + TopoPSA + nAtomP + XLogP + WTPT.2 +
    khs.ssO + ATSm4, data = mydata)
 
Residuals:
    Min      1Q  Median      3Q     Max
-486.90  -39.06   -4.94   35.77  462.92
 
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -748.48164   13.05738  -57.32   <2e-16 ***
nHBDon        17.62427    0.57229   30.80   <2e-16 ***
TopoPSA        0.46929    0.01998   23.49   <2e-16 ***
nAtomP         3.61419    0.10552   34.25   <2e-16 ***
XLogP         -5.24138    0.21690  -24.16   <2e-16 ***
WTPT.2       399.76211    6.97494   57.31   <2e-16 ***
khs.ssO      -14.73832    0.52748  -27.94   <2e-16 ***
ATSm4          0.75946    0.02146   35.38   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 
Residual standard error: 59.4 on 16007 degrees of freedom
Multiple R-squared: 0.5772,     Adjusted R-squared: 0.577
F-statistic:  3121 on 7 and 16007 DF,  p-value: < 2.2e-16
[output]
The image below shows the best n descriptors together with the adjusted r-squared.
20110803Leaps.png

Discussion

Neither the random forest model nor the linear model perform quite as well as their respective models from MPModel004, though the RF model does predict the melting points of the 500 compound test-set from MPModel004 with similar (slightly better) accuracy. This, together with the PCA, suggests that either the entire dataset (since it is more diverse) is harder to model or that the data in the entire dataset is not of as high a quality as the double+ validated dataset. Most likely both of these are true to some degree.