Modeling Melting Points Using OpenBabel

Researchers: Jean-Claude Bradley, Andrew Lang, Tony Williams, et al.

Introduction

Fragment-based models are, in general, less accurate than more sophisticated nonlinear models, such as random forests. However, because of their speed, fragment-based models are useful for approximating properties across large compound libraries. They are also smaller, easier to deploy, and offer better chemical interpretability than nonlinear models. Inspired by a blog post by OpenBabel developer Noel O'Boyle on how to deploy SMARTS fragment-based models using OpenBabel, we developed an open (CC0) fragment-based melting point model that can be incorporated into OpenBabel. The model achieved R2: 0.61, AAE: 43.6 °C, and RMSE: 58.1 °C when used to predict the melting points of a 3500 compound test set. While these results are comparable to a previously published, non-reproducible fragment-based model built on the problematic PHYSPROP database (659 compound test set, R2: 0.62, AAE: 48.9, RMSE: 48.8) [1], they pale in comparison to our current nonlinear models, which have an OOB R2: 0.83 and RMSE: 38.1.

Procedure

Lacking a dedicated open fragment calculator for QSPR modeling, we decided to use OpenBabel itself as the modeling platform. We began by collecting a set of fragments as the union of the fragments from the fragment-based models currently deployed in OpenBabel (psa, logp, mr). We then used OpenBabel to calculate the fragment counts for the same 16015 compound training set used in MPModel007: we first assigned a value of 1 to every descriptor to get a total count, and then, for each descriptor in turn, set its value to zero (with all the rest set to one). Subtracting the two totals gives the count for that descriptor, and repeating this 167 times gives all the needed fragment counts. We had to do it this way because in OpenBabel, "Where several SMARTS strings assign values to the same atom, only the final assignment is retained." [2] (training set with fragment counts)
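The differencing step described above can be sketched as follows. This is a minimal Python illustration with made-up values (the real counts were produced by 167 separate OpenBabel runs over the training set):

```python
# Sketch of the descriptor-count differencing trick (hypothetical values).
# total: descriptor value with every fragment's contribution set to 1,
#   i.e. the total number of fragment matches in the molecule.
# with_i_zero[frag]: descriptor value re-computed with that one fragment's
#   contribution set to 0 and all others left at 1.
total = 9                                        # e.g. nine matches in all
with_i_zero = {"[OH]-*": 7, "[CH3]c": 9}         # hypothetical re-run values

# The count of fragment i is the drop caused by zeroing out its contribution:
counts = {frag: total - v for frag, v in with_i_zero.items()}
print(counts)   # {'[OH]-*': 2, '[CH3]c': 0}
```

One subtraction per fragment recovers per-fragment counts despite OpenBabel retaining only the final assignment when several SMARTS match the same atom.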

We removed the surprisingly large number of fragments with counts less than 5 ([#1], [#50], [O][#14], [O][#33], [O][#50], [O]C=[#7], [O]C=S, [*], [#9-*], [#17-*], [#35-*], [#53-*], [#53+*], [Fe,Cu,Zn,Tc,Cd,Pt,Au,Hg], [CH4], [CH2](C)C, [CH](C)(C)C, [CH3][O,N,F,Cl,Br,#15,#16,#53;!a], [CH2X4][O,N,F,Cl,Br,#15,#16,#53;!a], [CHX4][O,N,F,Cl,Br,#15,#16,#53;!a], [CH3]c, [CH3][a!#6], [CH2X4]a, [CHX4]a, [CX4][!#6;!#7;!#8;!#9;!#15;!#16;!#17;!#35;!#53;!#1], [N]#*, [N](=*)#*, [NH]1-*-*1, [N+](-*)(-*)(-*)-*, [N+](-*)(-*)=*, [N+](-*)#*, [NH+](-*)(-*)-*, [NH+](-*)=*, [NH2+](-*)-*, [NH2+]=*, [NH3+]-*, [n](:*):*, [n](:*)(:*):*, [n](-*)(:*):*, [n](=*)(:*):*, [nH](:*):*, [n+](:*)(:*):*, [n+](-*)(:*):*, [nH+](:*):*, [N+0](=a)a, [N+0](a)(a)a, [NH3+*], [NH2+*], [NH+*], [n+*], [NH0+*](=A)(A)a, [NH0+*](=[#6])=[#7], [N+*]#A, [N-*], [N+*](=[N-*])=N, [#8], [O](-*)-*, [O]1-*-*1, [OH]-*, [O-]-*, [o](:*):*, [OH2], [O]=[#8], [OX1-*][#7], [OX1-*][#16], [OX1-*][#15;#33;#43;#53], [O]=[CH2], [O]=[CX2]=O, [O]=C([a!#6])[a!#6], [O-1]C(=O), [#15], [P](-*)=*, [S](-*)-*, [S]=*, [S](-*)(-*)=*, [S](-*)(-*)(=*)=*, [SH]-*, [s](:*):*, [s](=*)(:*):*, [S-*], [S+*]). Note: many of these fragments may individually match more than five times, but they end up with counts of less than five because of the way OpenBabel assigns counts when multiple fragments match the same atom.

We removed the four correlated descriptors ([C]=c, [O]=[#7], [O]=c, [PH](-*)(-*)=*) using the caret package and the following R code:
## load required package
library(caret)
## load in data
mydata = read.csv(file="TrainingWithDescriptorsReadyForFeatureSelection.csv", header=TRUE, row.names="molID")
## correlation matrix
cor.mat = cor(mydata)
## find descriptors with correlation r > 0.95
findCorrelation(cor.mat, cutoff = .95, verbose = TRUE)
[output]
[1] 74 75 50 84
[output]
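The r > 0.95 filter that caret's findCorrelation applies can be sketched directly. This toy Python version uses a naive "drop the second member of each correlated pair" rule; findCorrelation itself uses a smarter mean-correlation heuristic to decide which column to remove:

```python
# Naive sketch of removing one member of each highly correlated column pair
# (|r| > 0.95). caret::findCorrelation chooses which to keep more carefully.
import math

def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

cols = {                      # toy fragment-count columns
    "tf_a": [0, 1, 2, 3, 4],
    "tf_b": [0, 2, 4, 6, 8],  # perfectly correlated with tf_a
    "tf_c": [5, 1, 4, 2, 3],
}
names = list(cols)
to_drop = set()
for i, a in enumerate(names):
    for b in names[i + 1:]:
        if b not in to_drop and abs(pearson(cols[a], cols[b])) > 0.95:
            to_drop.add(b)
print(sorted(to_drop))   # ['tf_b']
```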

Again using R, we developed a simple linear model and performed further feature selection to attain the fragment coefficients.
## load in data
mydata = read.csv(file="TrainingWithDescriptorsReadyForLinearRegression.csv", header=TRUE, row.names="molID")
fit <- lm(mpC ~ 0 + tf1 + tf2 + tf4 + tf5 + tf6 + tf7 + tf9 + tf10 + tf11 + tf12 + tf14 + tf17 + tf18 + tf20 + tf22 + tf23 + tf25 +
tf26 + tf27 + tf28 + tf34 + tf36 + tf40 + tf44 + tf45 + tf46 + tf47 + tf48 + tf49 + tf50 + tf55 + tf56 + tf57 + tf58 + tf59 + tf60 +
tf61 + tf62 + tf63 + tf64 + tf65 + tf66 + tf67 + tf68 + tf69 + tf70 + tf71 + tf72 + tf73 + tf76 + tf77 + tf78 + tf79 + tf81 + tf83 +
tf84 + tf86 + tf87 + tf104 + tf105 + tf106 + tf109 + tf113 + tf115 + tf124 + tf128 + tf129 + tf131 + tf132 + tf133 + tf134 + tf135 +
tf142 + tf143 + tf144 + tf147 + tf148 + tf149 + tf153 + tf155 + tf164 + tf167, data=mydata)
 
## step-wise feature selection to remove non-significant features
## (stepAIC is provided by the MASS package)
library(MASS)
bestfit <- stepAIC(fit, upper=~., lower=~1, direction="both")
 
## output results
summary(bestfit)
 
[output]
Call:
lm(formula = mpC ~ tf2 + tf4 + tf5 + tf9 + tf10 + tf11 + tf12 +
    tf14 + tf17 + tf20 + tf22 + tf23 + tf25 + tf26 + tf28 + tf34 +
    tf36 + tf40 + tf44 + tf45 + tf46 + tf47 + tf48 + tf50 + tf55 +
    tf56 + tf57 + tf58 + tf59 + tf60 + tf61 + tf62 + tf63 + tf64 +
    tf65 + tf66 + tf67 + tf68 + tf69 + tf70 + tf71 + tf72 + tf73 +
    tf76 + tf78 + tf79 + tf81 + tf84 + tf86 + tf87 + tf104 +
    tf106 + tf109 + tf113 + tf115 + tf124 + tf131 + tf132 + tf144 +
    tf147 + tf148 + tf155 + tf164 - 1, data = mydata)
 
Residuals:
    Min      1Q  Median      3Q     Max
-784.85  -32.22    0.41   35.55  351.06
 
Coefficients:
      Estimate Std. Error t value Pr(>|t|)
tf2   -11.2891     0.3411 -33.094  < 2e-16 ***
tf4    -8.0352     5.1829  -1.550 0.121079
tf5   -36.0534     5.0458  -7.145 9.37e-13 ***
tf9    32.1543     2.2267  14.440  < 2e-16 ***
tf10    7.3289     0.8070   9.082  < 2e-16 ***
tf11   22.9132     1.3487  16.989  < 2e-16 ***
tf12   48.0277     2.6471  18.143  < 2e-16 ***
tf14   26.5074     4.4937   5.899 3.74e-09 ***
tf17   42.6169     4.3065   9.896  < 2e-16 ***
tf20   44.4787     1.3334  33.356  < 2e-16 ***
tf22   24.8303    13.6538   1.819 0.068998 .
tf23   40.0567     7.7142   5.193 2.10e-07 ***
tf25  -15.2212     0.5951 -25.577  < 2e-16 ***
tf26   -8.2654     0.9197  -8.987  < 2e-16 ***
tf28   14.7222     4.9566   2.970 0.002980 **
tf34   17.7551     2.5995   6.830 8.79e-12 ***
tf36   34.2188     8.6157   3.972 7.17e-05 ***
tf40   40.6006     1.1303  35.920  < 2e-16 ***
tf44   32.9369     1.2416  26.527  < 2e-16 ***
tf45   20.7267     0.7192  28.818  < 2e-16 ***
tf46  -14.8687     2.5568  -5.815 6.17e-09 ***
tf47   15.5201     0.8502  18.255  < 2e-16 ***
tf48   18.9400     0.9590  19.750  < 2e-16 ***
tf50    8.0103     1.6218   4.939 7.93e-07 ***
tf55   28.8598     1.7874  16.147  < 2e-16 ***
tf56   27.9572     2.0982  13.325  < 2e-16 ***
tf57   19.6289     1.1059  17.750  < 2e-16 ***
tf58   25.3437     1.1923  21.257  < 2e-16 ***
tf59   26.4790     1.1996  22.073  < 2e-16 ***
tf60   10.0405     5.5401   1.812 0.069954 .
tf61   11.6743     0.3470  33.641  < 2e-16 ***
tf62   22.2839     0.6271  35.536  < 2e-16 ***
tf63   18.3672     0.8946  20.530  < 2e-16 ***
tf64   17.1777     0.5595  30.702  < 2e-16 ***
tf65   20.7165     1.1950  17.335  < 2e-16 ***
tf66   18.4476     0.7396  24.942  < 2e-16 ***
tf67   17.1867     2.1395   8.033 1.02e-15 ***
tf68   31.1712     6.9949   4.456 8.40e-06 ***
tf69   43.4894     9.0487   4.806 1.55e-06 ***
tf70   35.4438     2.0479  17.307  < 2e-16 ***
tf71   18.5955     3.2410   5.738 9.78e-09 ***
tf72   16.8983     8.4073   2.010 0.044454 *
tf73   26.0819     1.6811  15.515  < 2e-16 ***
tf76   21.9141     0.7073  30.982  < 2e-16 ***
tf78   13.9251     1.2672  10.989  < 2e-16 ***
tf79    7.7621     1.4669   5.291 1.23e-07 ***
tf81   19.4670     1.5937  12.215  < 2e-16 ***
tf84   -4.9750     2.4064  -2.067 0.038711 *
tf86  -22.2462     7.8444  -2.836 0.004575 **
tf87  -21.3490     4.7744  -4.472 7.82e-06 ***
tf104 -34.4811     4.7166  -7.311 2.79e-13 ***
tf106 -32.8638    16.7665  -1.960 0.050003 .
tf109  13.9377     2.2851   6.099 1.09e-09 ***
tf113  15.8908     0.6518  24.380  < 2e-16 ***
tf115  96.6851    11.5045   8.404  < 2e-16 ***
tf124  10.3334     0.8373  12.341  < 2e-16 ***
tf131  -8.9495     0.7989 -11.202  < 2e-16 ***
tf132  -6.6031     1.9248  -3.431 0.000604 ***
tf144 -27.6702    14.3434  -1.929 0.053732 .
tf147   8.5440     3.4518   2.475 0.013324 *
tf148   4.8169     2.8596   1.684 0.092105 .
tf155  11.6943     4.6059   2.539 0.011127 *
tf164   8.9702     1.5402   5.824 5.86e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 
Residual standard error: 57.25 on 15952 degrees of freedom
Multiple R-squared: 0.836,      Adjusted R-squared: 0.8353
F-statistic:  1290 on 63 and 15952 DF,  p-value: < 2.2e-16
[output]

Following the instructions on the OpenBabel plugin development page [2], we deployed the melting point model as a plugin by creating the appropriate text file (mpC.txt) and updating the plugindefines.txt file in the data folder. We then used OpenBabel to calculate the melting points of the 3500 compound test set, obtaining R2: 0.61, AAE: 43.6, and RMSE: 58.1, illustrated below (test set with descriptive statistics).
[Figure: predicted vs. measured melting points for the 3500 compound test set (20120706predictedvsmeasured.png)]
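As a rough sketch of the deployment step, the mpC.txt data file pairs each SMARTS fragment with its fitted coefficient, one per line, following the convention of OpenBabel's existing group-contribution data files (logp.txt, psa.txt, mr.txt). The entries below are purely illustrative; the SMARTS/coefficient pairings are not the fitted model:

```
;mpC.txt -- melting point group contributions (illustrative entries only)
[CH3]c	-11.3
[OH]-*	40.6
[n](:*):*	22.3
```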

Results

We successfully created an open fragment-based model that can easily be added to OpenBabel. We used R to create the linear model and OpenBabel as the fragment calculator for an open data set of melting points. The test-set statistics (R2: 0.61, AAE: 43.6, RMSE: 58.1) can likely be improved by further curation of the open melting point data (or by using our less diverse but higher quality doubleplusgood dataset), by introducing new fragments, by optimizing the fragment order, and by using more sophisticated feature selection and linear modeling techniques than those used here.
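The reported test-set statistics follow their standard definitions. As a quick reference, a minimal Python sketch with toy predicted/measured values (not our data):

```python
# Toy illustration of the reported statistics: R2, AAE (mean absolute
# error), and RMSE (root-mean-square error).
measured  = [10.0, 50.0, 120.0, 200.0]
predicted = [20.0, 40.0, 130.0, 190.0]

n = len(measured)
mean_m = sum(measured) / n
ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
ss_tot = sum((m - mean_m) ** 2 for m in measured)

r2   = 1 - ss_res / ss_tot                                        # coefficient of determination
aae  = sum(abs(m - p) for m, p in zip(measured, predicted)) / n   # average absolute error
rmse = (ss_res / n) ** 0.5                                        # root-mean-square error
print(round(r2, 3), aae, rmse)   # 0.981 10.0 10.0
```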

References

1. Clark, M. Generalized Fragment-Substructure Based Property Prediction Method. J. Chem. Inf. Model. 2005, 45, 30-38.
2. OpenBabel plugin development documentation: http://open-babel.readthedocs.org/en/latest/WritePlugins/AddNewDescriptor.html