General Solute Model: Solubility in Methanol (Last Updated: June 8, 2010)

Researcher: Andrew Lang
Data is from the Open Notebook Science Challenge: SolubilitiesSum2010-05-17.xls

Since some molecules have multiple measured solubilities, repeated measurements were aggregated by taking the mean value. All solutes with measured solubility values in methanol that are solid at room temperature were included. Inorganics and entries marked DONOTUSE were excluded. 2,6-Dichlorobenzaldehyde, 2-chloro-5-nitrobenzaldehyde, and 4-nitrobenzaldehyde were excluded because they react with methanol to form a hemiacetal. Meloxicam, piroxicam, riboflavin, luteolin, imidacloprid, and temazepam were excluded because the VCClogs prediction failed for them.
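
The aggregation and exclusion steps described above can be sketched as follows. This is a minimal illustration in Python, not the procedure actually used for this page: the record layout (solute, value, flag) and the example values are hypothetical and are not taken from the ONS spreadsheet.

```python
from collections import defaultdict
from statistics import mean


def aggregate_by_mean(records, excluded=()):
    """Return one mean value per solute, skipping flagged or excluded entries."""
    grouped = defaultdict(list)
    for solute, value, flag in records:
        # Drop entries marked DONOTUSE and solutes on the exclusion list.
        if flag == "DONOTUSE" or solute in excluded:
            continue
        grouped[solute].append(value)
    return {solute: mean(values) for solute, values in grouped.items()}


# Hypothetical records: (solute name, measured value, flag).
records = [
    ("solute A", 0.50, ""),
    ("solute A", 0.70, ""),
    ("solute A", 9.99, "DONOTUSE"),
    ("solute B", 1.20, ""),
    ("4-nitrobenzaldehyde", 0.30, ""),
]
averaged = aggregate_by_mean(records, excluded={"4-nitrobenzaldehyde"})
```

Here `averaged` holds one entry each for "solute A" (the mean of its two usable measurements) and "solute B"; the flagged measurement and the excluded aldehyde do not contribute.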

This left 131 solutes (14 aldehydes, 3 amines, 47 carboxylic acids, 50 non-Ugi related, 17 Ugi Products): AllData.xlsx

The descriptors were calculated using Bioclipse version 2.4.0.RC1; only CDK REST descriptors were included. The CDK failed to calculate descriptors for urea, so it was removed from the data set, leaving 130 solutes. A random forest was built in R to determine the 30 most important descriptors.
importance.png


Models were built using linear regression with forward stepwise selection of descriptors: data with descriptors
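
As an illustration of forward stepwise selection, the sketch below greedily adds whichever descriptor lowers the residual sum of squares the most. It is written in Python rather than R, and it makes assumptions the page does not confirm: the stopping rule is a fixed number of terms (`max_terms`), whereas the actual analysis may have used AIC or a significance threshold, and the toy descriptors `x1`, `x2`, `x3` are hypothetical. The least-squares fit uses the normal equations and assumes the descriptors are not collinear.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def residual_sum_of_squares(columns, y):
    """Fit y ~ intercept + columns by least squares and return the RSS."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(design[0])
    xtx = [[sum(row[p] * row[q] for row in design) for q in range(k)]
           for p in range(k)]
    xty = [sum(row[p] * yi for row, yi in zip(design, y)) for p in range(k)]
    beta = solve_linear_system(xtx, xty)
    return sum((yi - sum(b * v for b, v in zip(beta, row))) ** 2
               for row, yi in zip(design, y))


def forward_stepwise(descriptors, y, max_terms):
    """Add, one at a time, the descriptor that reduces the RSS the most."""
    selected = []
    remaining = list(descriptors)
    while remaining and len(selected) < max_terms:
        best = min(
            remaining,
            key=lambda name: residual_sum_of_squares(
                [descriptors[n] for n in selected + [name]], y),
        )
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy data: y depends on x1 and x2 only.
toy = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "x2": [1.0, 0.0, 2.0, 0.0, 3.0, 0.0],
    "x3": [5.0, 3.0, 4.0, 1.0, 2.0, 2.0],
}
y = [3.0 * a + 0.5 * b for a, b in zip(toy["x1"], toy["x2"])]
selected = forward_stepwise(toy, y, max_terms=2)
```

On this toy data the procedure picks `x1` first and `x2` second, since those two descriptors reproduce `y` exactly.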

The summary of the 11-descriptor model is below:
Residuals:
     Min       1Q   Median       3Q      Max
-1.74056 -0.25084  0.02844  0.31104  1.14220
 
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.2098705  0.2244443   5.391 3.65e-07 ***
VCCalogp     0.2945028  0.0684732   4.301 3.52e-05 ***
VCClogs      0.2377867  0.0850861   2.795 0.006066 **
nAtomP      -0.0565363  0.0152603  -3.705 0.000323 ***
SC.3         1.1757485  0.2494116   4.714 6.70e-06 ***
SP.7        -0.7449714  0.1545837  -4.819 4.33e-06 ***
VC.3        -1.2766976  0.3222959  -3.961 0.000128 ***
ATSm3       -0.0714896  0.0167511  -4.268 4.01e-05 ***
ATSc2       -1.9143793  0.6141099  -3.117 0.002293 **
ATSp4        0.0009114  0.0002112   4.315 3.34e-05 ***
nHBAcc      -0.1388953  0.0589180  -2.357 0.020048 *
nHBDon      -0.1411440  0.0734206  -1.922 0.056966 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 
Residual standard error: 0.5349 on 118 degrees of freedom
Multiple R-squared: 0.6899,     Adjusted R-squared: 0.661
F-statistic: 23.86 on 11 and 118 DF,  p-value: < 2.2e-16
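
The summary statistics above can be cross-checked against one another. The sketch below (Python, standard formulas for a linear model with an intercept) recovers the number of observations from the residual degrees of freedom and recomputes the adjusted R-squared and the F-statistic from the reported multiple R-squared. The function names are illustrative only.

```python
def adjusted_r_squared(r2, n_obs, n_predictors):
    """Adjusted R-squared for a linear model with an intercept."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


def f_statistic(r2, n_obs, n_predictors):
    """Overall F-statistic computed from R-squared."""
    return (r2 / n_predictors) / ((1.0 - r2) / (n_obs - n_predictors - 1))


n_predictors = 11
residual_df = 118
# Observations = residual df + predictors + intercept.
n_obs = residual_df + n_predictors + 1

adj_r2 = adjusted_r_squared(0.6899, n_obs, n_predictors)
f_value = f_statistic(0.6899, n_obs, n_predictors)
```

This gives 130 observations, which matches the 131 selected solutes minus urea, and an adjusted R-squared that rounds to the reported 0.661. The recomputed F-statistic agrees with the reported 23.86 to within the rounding of R-squared.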

The actual vs. predicted values can be found in this summary spreadsheet; see the figure below.

model.png

Conclusion

The above model can be used to make predictions for compounds with unknown methanol solubility. It is better than both the original model and the previous model, in which the descriptors were calculated using a version of Bioclipse that had a bug that converted all SMILES to upper case before descriptor calculation.
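
A minimal sketch of how the fitted model could be applied to a new compound is given below, in Python. The coefficients are copied from the summary table above; the `predict` function and the example descriptor values are hypothetical. The prediction is in whatever solubility units the model was fitted on, which this page does not state, and the descriptors must be calculated the same way as for the training set.

```python
# Coefficient estimates copied from the 11-descriptor model summary.
COEFFICIENTS = {
    "(Intercept)": 1.2098705,
    "VCCalogp": 0.2945028,
    "VCClogs": 0.2377867,
    "nAtomP": -0.0565363,
    "SC.3": 1.1757485,
    "SP.7": -0.7449714,
    "VC.3": -1.2766976,
    "ATSm3": -0.0714896,
    "ATSc2": -1.9143793,
    "ATSp4": 0.0009114,
    "nHBAcc": -0.1388953,
    "nHBDon": -0.1411440,
}


def predict(descriptor_values):
    """Return intercept + sum(coefficient * descriptor) for one compound."""
    names = [name for name in COEFFICIENTS if name != "(Intercept)"]
    missing = [name for name in names if name not in descriptor_values]
    if missing:
        raise ValueError("missing descriptors: " + ", ".join(missing))
    return COEFFICIENTS["(Intercept)"] + sum(
        COEFFICIENTS[name] * descriptor_values[name] for name in names
    )


# Hypothetical descriptor values, chosen only to exercise the function.
all_zero = {name: 0.0 for name in COEFFICIENTS if name != "(Intercept)"}
baseline = predict(all_zero)
shifted = predict({**all_zero, "VCCalogp": 1.0})
```

With every descriptor set to zero the prediction equals the intercept, and raising `VCCalogp` by one unit raises the prediction by that descriptor's coefficient.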