Random Forest Based Melting Point Prediction on a Highly Curated Double Validated Dataset

Researchers: Jean-Claude Bradley, Andrew Lang, Antony Williams, et. al.


Melting points have been collected from several Open Data sources. After several rounds of curation, we have found errors in all sources - both structural and incorrect mp values. This has led us to believe that the difficulty in modeling melting points has been due, at least in part, to inconsistencies in the underlying data. To this end we have taken a subset of 2706, at least doubly validated, melting points from our already highly curated set of 20,155 melting points, to use as the data for modeling. Using this data we constructed ....


The data. We took a subset of (ONSMP028) where we have at least two (non-identical) values that differ by at most 5C. On this set we performed another round of manual curation, removing all compounds that had at least one chiral center, possessed cis/trans isomerism, were inorganic or a salt. This left us with a dataset of 2706 compounds (ONSMP029).

Calculating descriptors and feature selection. Open CDK [1] descriptors were calculated using the CDK Descriptor Calculator GUI (v 1.3.2) [2]. All descriptors were calculated except: Charged Partial Surface Area (under electronic), Ionization Potential (under electronic), Amino Acid Count (under protein), WHIM (under hybrid), and all geometrical descriptors. Descriptors with constant entries (khs.sLi, khs.ssBe, khs.ssssBe, khs.ssBH, khs.ssssB, khs.sNH3, khs.ssNH2, khs.sssNH, khs.sSiH3, khs.sPH2, khs.ssPH, khs.sssssP, khs.sGeH3, khs.ssGeH2, khs.sssGeH, khs.ssssGe, khs.sAsH2, khs.ssAsH, khs.sssAs, khs.sssdAs, khs.sssssAs, khs.sSeH, khs.dSe, khs.ssSe, khs.aaSe, khs.dssSe, khs.ddssSe, khs.sSnH3, khs.ssSnH2, khs.sssSnH, khs.ssssSn, khs.sPbH3, khs.ssPbH2, khs.sssPbH, khs.ssssPb), descriptors with less than 10 non-zero entries: khs.sssB (5), khs.dNH (2), khs.ssssN (1), khs.ssSiH2 (1), khs.sssSiH (1), khs.sssP (8), khs.dssS (4); and Kier3 and HybRatio (several entries with 'NA') were also removed. The following code was used to find collinear descriptors (using the caret package):
## load in data
mydata = read.csv(file="20110727DVwithMolIDFailedRemoved.csv",head=TRUE,row.names="molID")
## correlation matrix
cor.mat = cor(mydata)
## find correlation r > 0.95
findCorrelation(cor.mat, cutoff = .95, verbose = TRUE)
 [1]  63 164  32 156  64  61 162  26  27  65  28  69  22  29  66  67 124  70 154  12  78  74  73  13 125  81  59   7  49  43  17  18
The recommended columns (SP-2, Zagreb, nB, WTPT-1, SP-3, SP-0, WPOL, ATSp1, ATSp2, SP-4, ATSp3, VP-0, ATSm2, ATSp4, SP-5, SP-6, Kier1, VP-1, VABC, apol, SPC-5, VP-5, VP-4, naAromAtom, Kier2, VPC-5, VC-5, BCUTc-1h, VCH-4, SCH-3, ATSc2, ATSc3) were deleted, leaving a total of 132 descriptors.

The following compound was removed from the dataset because it failed to generate any descriptors:
CDK Fail

The resulting spreadsheet of 2705 melting points with 132 CDK descriptors was then randomly split into two sets: a test set of 500 compounds and a training set of 2205 compounds.

Building the Models

Random Forest. The randomForest package (v 4.5-34) in R was used to build the random forest models using the following code:
mydata = read.csv(file="20110728DVTrainingSet.csv",head=TRUE,row.names="molID")
## do random forest [randomForest 4.5-34]
mydata.rf <- randomForest(mpC ~ ., data = mydata,importance = TRUE)
 randomForest(formula = mpC ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 44
          Mean of squared residuals: 1529.134
                    % Var explained: 82.18
## get plot of importance - %IncMSE and IncNodePurity
varImpPlot(mydata.rf,main="Random Forest Variable Importance")
The RF reports an OOB R2 of 0.82 and an OOB RMSE of 39.10 °C with resulting image below showing the importance of the descriptors. The image points to both the number of hydrogen bond donors (nHBDon) and the topological polar surface area (TopoPSA) as the most important physiochemical properties for melting point prediction. This confirms results from previous analyses.
The model was then saved so that it could be deployed as a web service using the following code:
.saveRDS(mydata.rf, file = "ONSMPModel004")
The model is available for download to use for batch melting point prediction with a CC0 license.

Random Forest Model Results. The random forest model had an AAE of 29.51 °C, a RMSE of 40.91 °C, a Bias of 0.97 °C and an R2 value of 0.82 when used to predict the melting points of 500 test-set compounds, using the following code:
testdata = read.csv(file="20110728DVTestSet.csv",head=TRUE,row.names="molID")
test.predict <- predict(mydata.rf,testdata)
write.csv(test.predict, file = "RFTestSetPredict.csv")
The random forest model had an AAE of 11.88 °C, a RMSE of 15.98 (39.10) °C, a Bias of -0.05 °C and an R2 value of 0.97 (0.82) when used to predict the melting points of 2205 training-set compounds (with the OOB values given in parentheses), using the following code:
testdata = read.csv(file="20110728DVTrainingSet.csv",head=TRUE,row.names="molID")
test.predict <- predict(mydata.rf,testdata)
write.csv(test.predict, file = "RFTrainingSetPredict.csv")

The figures below show: a) Predicted melting point vs. Experimental melting point for the training set, b) Predicted melting point vs. Experimental melting point for the test set, and c) Residuals vs. Experimental melting point for the test set. The labels are ChemSpider IDs.




Linear model. The descriptors indicated as important by the random forest were used to create a linear model on the same training set of data using the following code - using the leaps package to select the best seven descriptors:
## load in data
mydata = read.csv(file="20110728DVTrainingSet.csv",head=TRUE,row.names="molID")
leaps<-regsubsets(mpC ~ nHBDon + TopoPSA + WTPT.3 + ATSc1 + XLogP + SC.3 + nHBAcc + WTPT.5 + BCUTp.1l + nAtomLC + nAtomP + ALogP
+ MDEC.12 + BCUTp.1h + AMR + ATSc4 + WTPT.4 + SP.7 + bpol + ALogp2 + ATSm3 + BCUTw.1h + VP.2 + MDEC.22 + VP.6 + VC.3 + nRotB
+ SPC.6 + WTPT.2 + SPC.4 + fragC + SCH.7 + MDEC.33 + ATSm4 + VCH.7 + ATSp5 + MW + BCUTp.1h + SPC.4 + ATSm5 + VCH.6 + MDEC.23,
## models are ordered by the selection statistic - create plot.
fit <- lm(mpC ~ nHBDon + TopoPSA + nAtomP + ALogp2 + nRotB + WTPT.2 + MW,data=mydata)
## summary of fit
     Min       1Q   Median       3Q      Max
-238.659  -37.292   -5.167   33.530  228.906
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -544.07449   31.20233 -17.437  < 2e-16 ***
nHBDon        34.20127    1.78728  19.136  < 2e-16 ***
TopoPSA        0.99515    0.05823  17.089  < 2e-16 ***
nAtomP         3.30729    0.39990   8.270 2.29e-16 ***
ALogp2         1.97586    0.22617   8.736  < 2e-16 ***
nRotB         -6.72046    0.47912 -14.027  < 2e-16 ***
WTPT.2       251.61827   17.20702  14.623  < 2e-16 ***
MW             0.29366    0.02266  12.959  < 2e-16 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 52.77 on 2197 degrees of freedom
Multiple R-squared: 0.6768,     Adjusted R-squared: 0.6757
F-statistic: 657.1 on 7 and 2197 DF,  p-value: < 2.2e-16
The image below shows the best n descriptors together with the adjusted r-squared.

The 7-descriptor linear model was used to predict the melting point values of the 2205 training set with an AAE of 41.47 °C, a RMSE of 52.67 °C, a bias of -0.00 °C, and an R2 of 0.68. The linear model was used to predict the melting points of the 500 compound test-set with an AAE of 44.09 °C, a RMSE of 58.19 °C, a bias of -0.50 °C, and an R2 of 0.65. Figures were again created showing a) Predicted melting point vs. Experimental melting point for the training set, b) Predicted melting point vs. Experimental melting point for the test set, and c) Residuals vs. Experimental melting point for the test set. The labels are ChemSpider IDs.





With a highly curated set of diverse melting points a predictive RF model using only 2D CDK descriptors with a Test-set R2 of 0.82 was built. This out performs all other models in the literature, even those using more sophisticated modeling techniques and propriety descriptors. This suggests that the data used to build previous models was of a lower quality. Also, examining the outliers, we saw a suspected pattern of errors in the CDK descriptor ALogP and thus ALogp2 causing certain compounds to be outliers. These errors had matched our experience and thus we decided to repeat the above analysis, see MPModel005, without these descriptors. Results from MPModel005 suggest that most compounds that are outliers are true outliers and not outliers resulting from the CDK's poor ALogP prediction.


[1] Steinbeck C.;Han Y; Kuhn S.; Horlacher O.; Luttmann E. and Willighagen E. The Chemistry Development Kit (CDK): An Open-Source Java Library for Chemo- and Bioinformatics. J. Chem. Inf. Comput. Sci., 2003, 43 (2), pp 493–500 DOI: 10.1021/ci025584y
[2] Guha R. CDK Descriptor Calculator GUI (v 1.3.2) [http://www.rguha.net/code/java/cdkdesc.html]


We'd like to thank Rajarshi Guha and Noel O'Boyle for many helpful discussions.