MeltingPointModel007

Random Forest Based Melting Point Prediction

**Researchers:** Jean-Claude Bradley, Andrew Lang, Tony Williams, //et al.//

Introduction
Two recent random forest melting point models were built on a highly curated, double+ validated subset (ONSMP029) of a dataset of 20152 compounds (ONSMP028). These two models, MPModel004 (2D descriptors) and MPModel006 (2D+3D descriptors), were trained on 2205 compounds. When used to predict the melting points of 500 test-set compounds, MPModel004 had an AAE of 29.51 °C, an RMSE of 40.91 °C, a Bias of 0.97 °C and an R2 value of 0.82, while MPModel006 had an AAE of 29.52 °C, an RMSE of 40.79 °C, a Bias of 0.64 °C and an R2 value of 0.83. Here we present a new melting point model, MPModel007, beginning with the entire dataset of 20152 compounds.
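Model quality throughout is reported as AAE, RMSE, Bias and R2. For reference, these statistics can be computed from paired experimental and predicted melting points as in the following sketch (illustrative Python only; the analysis itself was done in R, the function name is ours, and the sign convention for Bias — predicted minus experimental — is an assumption):

```python
import math

def regression_metrics(actual, predicted):
    """Return (AAE, RMSE, bias, R2) for paired observations."""
    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]      # predicted - experimental
    aae = sum(abs(e) for e in errors) / n                    # average absolute error
    rmse = math.sqrt(sum(e * e for e in errors) / n)         # root mean squared error
    bias = sum(errors) / n                                   # mean signed error
    mean_a = sum(actual) / n
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    r2 = 1 - sum(e * e for e in errors) / ss_tot             # coefficient of determination
    return aae, rmse, bias, r2
```

Note that R2 here is the coefficient of determination (1 − SS_res/SS_tot); some tools instead report the squared Pearson correlation, which can differ for biased predictions.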

Procedure
The dataset (ONSMP028) was scanned for salts (SMILES containing periods) and for compounds containing metals; these compounds were removed to create ONSMP030, which was uploaded to the wiki. From ONSMP030 we then removed all compounds whose multiple measurements spanned a range greater than 10 °C. The SMILES of the remaining compounds were run through Rajarshi Guha's CDK Descriptor Calculator GUI (v1.3.2) to generate all 2D CDK descriptors except Ionization Potential (which is broken).
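The salt/metal screen can be sketched as follows (an illustrative Python sketch only: the function name and the metal list are ours, the list is deliberately not exhaustive, and the original curation was not necessarily done with a script like this):

```python
# A salt or mixture appears in SMILES as disconnected components joined by
# a period; metals appear inside bracket atoms, e.g. [Na+].
METALS = {"Li", "Na", "K", "Mg", "Ca", "Al", "Zn", "Fe", "Cu", "Sn", "Pb", "Hg"}  # illustrative subset

def keep_compound(smiles):
    """Return True if the SMILES is a single-component, metal-free record."""
    if "." in smiles:                       # disconnected components -> salt/mixture
        return False
    # crude check: a bracket atom opening with a metal symbol
    return not any("[" + m in smiles for m in METALS)
```

A robust implementation would parse the SMILES with a cheminformatics toolkit rather than match substrings, but the sketch captures the two filtering rules stated above.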

We removed the descriptors Kier3 and HybRatio, as they contained multiple NA entries, as well as all descriptors with fewer than 50 non-zero entries (khs.sLi, khs.ssBe, khs.ssssBe, khs.ssBH, khs.ssssB, khs.sNH3, khs.ssNH2, khs.sssNH, khs.ssssN, khs.sSiH3, khs.ssSiH2, khs.sssSiH, khs.sPH2, khs.ssPH, khs.sssssP, khs.dssS, khs.sGeH3, khs.ssGeH2, khs.sssGeH, khs.ssssGe, khs.sAsH2, khs.ssAsH, khs.sssAs, khs.sssdAs, khs.sssssAs, khs.sSeH, khs.dSe, khs.ssSe, khs.aaSe, khs.dssSe, khs.ddssSe, khs.sSnH3, khs.ssSnH2, khs.sssSnH, khs.ssssSn, khs.sPbH3, khs.ssPbH2, khs.sssPbH, khs.ssssPb). This left a dataset of 19515 compounds and 167 descriptors.
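The descriptor-pruning step above amounts to two column filters: drop any column containing NA values, and drop any column with fewer than 50 non-zero entries. A minimal sketch (illustrative Python; the table representation and function name are ours):

```python
def prune_descriptors(table, min_nonzero=50):
    """table: dict mapping descriptor name -> list of values (None = NA).
    Returns the columns surviving both filters."""
    kept = {}
    for name, col in table.items():
        if any(v is None for v in col):
            continue                                   # NA entries (e.g. Kier3, HybRatio)
        if sum(1 for v in col if v != 0) < min_nonzero:
            continue                                   # near-empty khs.* fragment counts
        kept[name] = col
    return kept
```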

**Feature selection.** The following code was used to find collinear descriptors (using the caret package):

code
# 1) load in data
mydata = read.csv(file="20110803WithmolIDFeatureSelected.csv", head=TRUE, row.names="molID")
# 2) correlation matrix
cor.mat = cor(mydata)
# 3) find correlations r > 0.95
findCorrelation(cor.mat, cutoff = .95, verbose = TRUE)
code

[output]
 [1] 168  64  65 166  33  27 160  28  66  70  62  29  63 158  30  71  13  67  68 159  73  79  75  83  14   7  88  18  17  44
[output]

The recommended columns (Zagreb, SP-2, SP-3, WPOL, nB, ATSp1, ATSp2, WTPT-1, SP-4, VP-0, SP-0, ATSp3, SP-1, VABC, ATSp4, VP-1, apol, SP-5, SP-6, MW, VP-3, SPC-5, VP-5, VPC-6, naAromAtom, BCUTc-1l, khs.sssB, ATSc2, ATSc1, SCH-3) were deleted, leaving a total of 137 descriptors.
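The collinearity filter (caret's findCorrelation with a 0.95 cutoff) can be approximated as follows. This is an illustrative Python sketch of the idea, not caret's exact algorithm: for each pair of descriptors whose absolute correlation exceeds the cutoff, the one with the larger mean absolute correlation against all other descriptors is flagged for removal.

```python
def find_correlated(names, cor, cutoff=0.95):
    """Greedy approximation of caret::findCorrelation.
    names: descriptor names; cor: symmetric correlation matrix (list of lists).
    Returns the names recommended for deletion."""
    n = len(names)
    # mean absolute correlation of each variable with all the others
    mean_abs = [sum(abs(cor[i][j]) for j in range(n) if j != i) / (n - 1)
                for i in range(n)]
    drop = set()
    for i in range(n):
        for j in range(i + 1, n):
            if i in drop or j in drop:
                continue
            if abs(cor[i][j]) > cutoff:
                # remove the member of the pair that is more correlated overall
                drop.add(i if mean_abs[i] >= mean_abs[j] else j)
    return [names[k] for k in sorted(drop)]
```

For two nearly collinear descriptors, the one that also correlates more strongly with the rest of the pool is the one deleted, which is why broad size-related descriptors (MW, apol, Zagreb, ...) dominate the deletion list above.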

The resulting dataset of 19515 compounds with 137 descriptors was then split into a training-set of 16015 compounds and a test-set of 3500 compounds. The test-set here contains the 500 compound test-set of both previous models as a subset.
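A random partition of this kind can be sketched as below (illustrative Python only; the seed and the exact procedure used for the original 16015/3500 split are not recorded here, and the function name is ours):

```python
import random

def split_dataset(ids, n_test=3500, seed=42):
    """Shuffle compound IDs and split off a test set of size n_test.
    Returns (training_ids, test_ids)."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    shuffled = list(ids)
    rng.shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]
```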

**Visualizing the Chemical Space.** The entire dataset of 19515 melting point measurements contains compounds with melting points ranging from -205 °C to 562 °C (with a mean of 106.1 °C), whereas the double+ validated dataset has melting points ranging from -187.75 °C to 437.65 °C (with a mean of 60.4 °C). Principal component analysis was performed using the program R (v 2.13.0) with the following code:

code
mydata = read.csv(file="20110803combined.csv", head=TRUE, row.names="molID")
data <- mydata[, 3:140] ## don't include CSID or DV (the DV flag) in the PCA, but do include mpC
pc1 <- prcomp(data, scale. = T)
x <- pc1$x
summary(pc1)
code

[output]
Importance of components:
                          PC1     PC2     PC3     PC4     PC5     PC6     PC7     PC8     PC9    PC10    PC11    PC12
Standard deviation     5.3040 3.13111 2.89099 2.69493 2.32362 2.07313 1.99967 1.88602 1.80153 1.73814 1.67723 1.59750
Proportion of Variance 0.2039 0.07104 0.06056 0.05263 0.03912 0.03114 0.02898 0.02578 0.02352 0.02189 0.02038 0.01849
Cumulative Proportion  0.2039 0.27490 0.33547 0.38809 0.42722 0.45836 0.48734 0.51311 0.53663 0.55853 0.57891 0.59740
[output]

code
colvec <- mydata$DV
plot(x[, 1], x[, 2], pch = 20, col = colvec+2, xlim=c(-10,50), ylim=c(-15,15))
legend(30, 15, c("Single measurement","Double+ validated"), cex=1, col = c(2, 3), pch=20)
code

From the resulting figure below, of the first two principal components, we see that the entire dataset comprises a much larger chemical space than the double+ validated dataset by itself. The bottom leg, [0,25]x[-10,-5], is comprised mainly of the Karthikeyan data; see the PCA from MPModel001.

Building the Models
**Random Forest.** The randomForest package (v 4.5-34) in R was used to build the random forest model using the following code:

code
# 1) do random forest [randomForest 4.5-34]
mydata = read.csv(file="20110803TrainingSet.csv", head=TRUE, row.names="molID")
mydata.rf <- randomForest(mpC ~ ., data = mydata, importance = TRUE)
print(mydata.rf)
code

[output]
Call:
 randomForest(formula = mpC ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 45

          Mean of squared residuals: 1665.557
                    % Var explained: 80.03
[output]

code
# 2) get plot of importance - %IncMSE and IncNodePurity
varImpPlot(mydata.rf, main="Random Forest Variable Importance")
code

The RF reports an OOB R2 of 0.80 and an OOB RMSE of 40.81 °C, with the resulting image below showing the importance of the descriptors. The image points to both the number of hydrogen bond donors (nHBDon) and the topological polar surface area (TopoPSA) as the most important physicochemical properties for melting point prediction, as found in all previous analyses. The model was then saved so that it could be deployed as a web service using the following code:

code
saveRDS(mydata.rf, file = "ONSMPModel007")
code

The model is available for download to use for batch melting point prediction under a CC0 license.

**Random Forest Model Results.** The random forest model had an AAE of 29.36 °C, an RMSE of 40.18 °C, a Bias of 0.58 °C and an R2 value of 0.81 when used to predict the melting points of the 3500 test-set compounds, and an AAE of 26.62 °C, an RMSE of 36.35 °C, a Bias of 1.63 °C and an R2 value of 0.86 when used to predict the 500 compound test-set from melting point models 004 and 006, using the following code:

code
testdata = read.csv(file="20110803TestSet.csv", head=TRUE, row.names="molID")
test.predict <- predict(mydata.rf, testdata)
write.csv(test.predict, file = "RFTestSetPredict.csv")
code

The random forest model had an AAE of 12.08 °C, an RMSE of 16.73 (40.81) °C, a Bias of -0.07 °C and an R2 value of 0.97 (0.80) when used to predict the melting points of the 16015 training-set compounds (with the OOB values given in parentheses), using the following code:

code
testdata = read.csv(file="20110803TrainingSet.csv", head=TRUE, row.names="molID")
test.predict <- predict(mydata.rf, testdata)
write.csv(test.predict, file = "RFTrainingSetPredict.csv")
code

The figures below show: a) predicted vs. experimental melting point for the training set, coloured by residual; b) predicted vs. experimental melting point for the 3500 compound test-set, with the test-set from models 004 and 006 coloured red; and c) residuals vs. experimental melting point for the 3500 compound test-set, with the test-set from models 004 and 006 coloured red. The labels are ChemSpider IDs.

**Linear model.** The descriptors indicated as important by the random forest were used to create a linear model on the same training set of data, using the leaps package to select the best 7 descriptors:

code
# 1) load in data
mydata = read.csv(file="20110803TrainingSet.csv", head=TRUE, row.names="molID")
attach(mydata)
leaps <- regsubsets(mpC ~ nHBDon + TopoPSA + nAtomP + BCUTp.1l + XLogP + WTPT.5 + ATSc3 + BCUTw.1h + ALogP + ATSc5 + BCUTp.1h + WTPT.3 + WTPT.4 + nRotB + MDEC.22 + ALogp2 + ATSc4 + AMR + khs.ssCH2 + MDEC.33 + BCUTw.1l + WTPT.2 + khs.ssO + bpol + MDEC.23 + C1SP3 + VPC.5 + khs.dssC + VP.6 + SPC.6 + SCH.7 + fragC + SPC.4 + SP.7 + nHBAcc + VCH.7 + SC.3 + ATSm4 + SCH.6 + FMF + MDEO.11 + ATSm2 + ATSm3 + ATSp5 + VCH.6, nvmax=12, data=mydata, nbest=3, really.big=TRUE)
# 2) models are ordered by the selection statistic - create plot
plot(leaps, scale="adjr2")
code

code
fit <- lm(mpC ~ nHBDon + TopoPSA + nAtomP + XLogP + WTPT.2 + khs.ssO + ATSm4, data=mydata)
summary(fit)
code

[output]
Call:
lm(formula = mpC ~ nHBDon + TopoPSA + nAtomP + XLogP + WTPT.2 + khs.ssO + ATSm4, data = mydata)

Residuals:
     Min       1Q   Median       3Q      Max
 -486.90   -39.06    -4.94    35.77   462.92

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -748.48164   13.05738  -57.32   <2e-16 ***
nHBDon        17.62427    0.57229   30.80   <2e-16 ***
TopoPSA        0.46929    0.01998   23.49   <2e-16 ***
nAtomP         3.61419    0.10552   34.25   <2e-16 ***
XLogP         -5.24138    0.21690  -24.16   <2e-16 ***
WTPT.2       399.76211    6.97494   57.31   <2e-16 ***
khs.ssO      -14.73832    0.52748  -27.94   <2e-16 ***
ATSm4          0.75946    0.02146   35.38   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 59.4 on 16007 degrees of freedom
Multiple R-squared: 0.5772,    Adjusted R-squared: 0.577
F-statistic:  3121 on 7 and 16007 DF,  p-value: < 2.2e-16
[output]

The image below shows the best n descriptors together with the adjusted R-squared.

Discussion
Neither the random forest model nor the linear model performs quite as well as its respective counterpart from MPModel004, though the RF model does predict the melting points of the 500 compound test-set from MPModel004 with similar (slightly better) accuracy. This, together with the PCA, suggests either that the entire dataset (being more diverse) is harder to model, or that its data are not of as high quality as the double+ validated dataset. Most likely both are true to some degree.