MeltingPointModel004

Random Forest Based Melting Point Prediction on a Highly Curated Double Validated Dataset

 * Researchers: Jean-Claude Bradley, Andrew Lang, Antony Williams, //et. al.//**

Introduction
Melting points have been collected from several Open Data sources. After several rounds of curation, we have found errors in all sources - both structural and incorrect mp values. This has led us to believe that the difficulty in modeling melting points has been due, at least in part, to inconsistencies in the underlying data. To this end we have taken a subset of 2706, at least doubly validated, melting points from our already highly curated set of 20,155 melting points, to use as the data for modeling. Using this data we constructed ....

Procedure

 * The data.** We took a subset of (ONSMP028) where we have at least two (non-identical) values that differ by at most 5C. On this set we performed another round of manual curation, removing all compounds that had at least one chiral center, possessed cis/trans isomerism, were inorganic or a salt. This left us with a dataset of 2706 compounds (ONSMP029).

code mydata = read.csv(file="20110727DVwithMolIDFailedRemoved.csv",head=TRUE,row.names="molID") cor.mat = cor(mydata) findCorrelation(cor.mat, cutoff = .95, verbose = TRUE) [output] [1] 63 164  32 156  64  61 162  26  27  65  28  69  22  29  66  67 124  70 154  12  78  74  73  13 125  81  59   7  49  43  17  18 [output] code The recommended columns (SP-2, Zagreb, nB, WTPT-1, SP-3, SP-0, WPOL, ATSp1, ATSp2, SP-4, ATSp3, VP-0, ATSm2, ATSp4, SP-5, SP-6, Kier1, VP-1, VABC, apol, SPC-5, VP-5, VP-4, naAromAtom, Kier2, VPC-5, VC-5, BCUTc-1h, VCH-4, SCH-3, ATSc2, ATSc3) were deleted, leaving a total of 132 descriptors.
 * Calculating descriptors and feature selection.** Open CDK [1] descriptors were calculated using the CDK Descriptor Calculator GUI (v 1.3.2) [2]. All descriptors were calculated except: Charged Partial Surface Area (under electronic), Ionization Potential (under electronic), Amino Acid Count (under protein), WHIM (under hybrid), and all geometrical descriptors. Descriptors with constant entries (khs.sLi, khs.ssBe, khs.ssssBe, khs.ssBH, khs.ssssB, khs.sNH3, khs.ssNH2, khs.sssNH, khs.sSiH3, khs.sPH2, khs.ssPH, khs.sssssP, khs.sGeH3, khs.ssGeH2, khs.sssGeH, khs.ssssGe, khs.sAsH2, khs.ssAsH, khs.sssAs, khs.sssdAs, khs.sssssAs, khs.sSeH, khs.dSe, khs.ssSe, khs.aaSe, khs.dssSe, khs.ddssSe, khs.sSnH3, khs.ssSnH2, khs.sssSnH, khs.ssssSn, khs.sPbH3, khs.ssPbH2, khs.sssPbH, khs.ssssPb), descriptors with less than 10 non-zero entries: khs.sssB (5), khs.dNH (2), khs.ssssN (1), khs.ssSiH2 (1), khs.sssSiH (1), khs.sssP (8), khs.dssS (4); and Kier3 and HybRatio (several entries with 'NA') were also removed. The following code was used to find collinear descriptors (using the caret package):
 * 1) load in data
 * 1) correlation matrix
 * 1) find correlation r > 0.95

The following compound was removed from the dataset because it failed to generate any descriptors:
 * name || SMILES || mpC || CSID || NA ||
 * [|octaphenylcyclotetrasiloxane] || c1ccc(cc1)[Si]2(O[Si](O[Si](O[Si](O2)(c3ccccc3)c4ccccc4)(c5ccccc5)c6ccccc6)(c7ccccc7)c8ccccc8)c9ccccc9 || 200.75 || 61642 || CDK Fail ||

The resulting spreadsheet of [|2705 melting points with 132 CDK] descriptors was then randomly split into two sets: a [|test set] of 500 compounds and a [|training set] of 2205 compounds.

Building the Models
code mydata = read.csv(file="20110728DVTrainingSet.csv",head=TRUE,row.names="molID") mydata.rf <- randomForest(mpC ~ ., data = mydata,importance = TRUE) print(mydata.rf)
 * Random Forest.** The randomForest package (v 4.5-34) in R was used to build the random forest models using the following code:
 * 1) do random forest [randomForest 4.5-34]

[output] Call: randomForest(formula = mpC ~ ., data = mydata, importance = TRUE) Type of random forest: regression Number of trees: 500 No. of variables tried at each split: 44

Mean of squared residuals: 1529.134 % Var explained: 82.18 [output]

varImpPlot(mydata.rf,main="Random Forest Variable Importance") code The RF reports an OOB R2 of 0.82 and an OOB RMSE of 39.10 °C with resulting image below showing the importance of the descriptors. The image points to both the number of hydrogen bond donors (nHBDon) and the topological polar surface area (TopoPSA) as the most important physiochemical properties for melting point prediction. This confirms results from previous analyses. The model was then saved so that it could be deployed as a web service using the following code: code .saveRDS(mydata.rf, file = "ONSMPModel004")
 * 1) get plot of importance - %IncMSE and IncNodePurity

code The model is available for [|download] to use for batch melting point prediction with a CC0 license.

code testdata = read.csv(file="20110728DVTestSet.csv",head=TRUE,row.names="molID") test.predict <- predict(mydata.rf,testdata) write.csv(test.predict, file = "RFTestSetPredict.csv") code The random forest model had an AAE of 11.88 °C, a RMSE of 15.98 (39.10) °C, a Bias of -0.05 °C and an R2 value of 0.97 (0.82) when used to predict the melting points of 2205 training-set compounds (with the OOB values given in parentheses), using the following code: code testdata = read.csv(file="20110728DVTrainingSet.csv",head=TRUE,row.names="molID") test.predict <- predict(mydata.rf,testdata) write.csv(test.predict, file = "RFTrainingSetPredict.csv") code ====The figures below show: a) Predicted melting point vs. Experimental melting point for the training set, b) Predicted melting point vs. Experimental melting point for the test set, and c) Residuals vs. Experimental melting point for the test set. The labels are ChemSpider IDs.====
 * Random Forest Model Results.** The random forest model had an AAE of 29.51 °C, a RMSE of 40.91 °C, a Bias of 0.97 °C and an R2 value of 0.82 when used to predict the melting points of 500 test-set compounds, using the following code:





code mydata = read.csv(file="20110728DVTrainingSet.csv",head=TRUE,row.names="molID") library(leaps) attach(mydata) leaps<-regsubsets(mpC ~ nHBDon + TopoPSA + WTPT.3 + ATSc1 + XLogP + SC.3 + nHBAcc + WTPT.5 + BCUTp.1l + nAtomLC + nAtomP + ALogP + MDEC.12 + BCUTp.1h + AMR + ATSc4 + WTPT.4 + SP.7 + bpol + ALogp2 + ATSm3 + BCUTw.1h + VP.2 + MDEC.22 + VP.6 + VC.3 + nRotB + SPC.6 + WTPT.2 + SPC.4 + fragC + SCH.7 + MDEC.33 + ATSm4 + VCH.7 + ATSp5 + MW + BCUTp.1h + SPC.4 + ATSm5 + VCH.6 + MDEC.23, nvmax=12,data=mydata,nbest=1,really.big=TRUE)
 * Linear model.** The descriptors indicated as important by the random forest were used to create a linear model on the same training set of data using the following code - using the leaps package to select the best seven descriptors:
 * 1) load in data

plot(leaps,scale="adjr2")
 * 1) models are ordered by the selection statistic - create plot.

fit <- lm(mpC ~ nHBDon + TopoPSA + nAtomP + ALogp2 + nRotB + WTPT.2 + MW,data=mydata) summary(fit)
 * 1) summary of fit

[output] Residuals: Min      1Q   Median       3Q      Max -238.659 -37.292   -5.167   33.530  228.906

Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -544.07449  31.20233 -17.437  < 2e-16 *** nHBDon       34.20127    1.78728  19.136  < 2e-16 *** TopoPSA       0.99515    0.05823  17.089  < 2e-16 *** nAtomP        3.30729    0.39990   8.270 2.29e-16 *** ALogp2        1.97586    0.22617   8.736  < 2e-16 *** nRotB        -6.72046    0.47912 -14.027  < 2e-16 *** WTPT.2      251.61827   17.20702  14.623  < 2e-16 *** MW            0.29366    0.02266  12.959  < 2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 52.77 on 2197 degrees of freedom Multiple R-squared: 0.6768,    Adjusted R-squared: 0.6757 F-statistic: 657.1 on 7 and 2197 DF, p-value: < 2.2e-16 [output] code The image below shows the best n descriptors together with the adjusted r-squared.

The 7-descriptor linear model was used to predict the melting point values of the 2205 training set with an AAE of 41.47 °C, a RMSE of 52.67 °C, a bias of -0.00 °C, and an R2 of 0.68. The linear model was used to predict the melting points of the 500 compound test-set with an AAE of 44.09 °C, a RMSE of 58.19 °C, a bias of -0.50 °C, and an R2 of 0.65. Figures were again created showing a) Predicted melting point vs. Experimental melting point for the training set, b) Predicted melting point vs. Experimental melting point for the test set, and c) Residuals vs. Experimental melting point for the test set. The labels are ChemSpider IDs.







Discussion
With a highly curated set of diverse melting points a predictive RF model using only 2D CDK descriptors with a Test-set R2 of 0.82 was built. This out performs all other models in the literature, even those using more sophisticated modeling techniques and propriety descriptors. This suggests that the data used to build previous models was of a lower quality. Also, examining the outliers, we saw a suspected pattern of errors in the CDK descriptor ALogP and thus ALogp2 causing certain compounds to be outliers. These errors had matched our experience and thus we decided to repeat the above analysis, see MPModel005, without these descriptors. Results from MPModel005 suggest that most compounds that are outliers are true outliers and not outliers resulting from the CDK's poor ALogP prediction.

Acknowledgements
We'd like to thank Rajarshi Guha and Noel O'Boyle for many helpful discussions.