Solubility of Carboxylic Acids in Methanol (Last Updated: June 11, 2010)

Researcher: Andrew Lang
Data is from the Open Notebook Science Challenge: SolubilitiesSum2010-05-17.xls

Because some molecules have multiple measured solubilities, repeated measurements were aggregated by taking the mean value. All solutes that are solid at room temperature and have measured solubility values in methanol were included. Inorganics and entries marked DONOTUSE were excluded.
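The filtering and averaging step described above can be sketched as follows. This is a minimal illustration in Python, not the procedure actually used to prepare the spreadsheet. The record layout, the solute labels, and the numeric values are all hypothetical placeholders, not measurements from the Challenge data.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (solute, solvent, solubility, flag).
# Labels and values are placeholders, not real measurements.
records = [
    ("solute A", "methanol", 2.0, ""),
    ("solute A", "methanol", 3.0, ""),
    ("solute A", "methanol", 9.0, "DONOTUSE"),  # flagged entry, dropped
    ("solute B", "methanol", 4.0, ""),
    ("solute A", "ethanol", 1.0, ""),           # other solvent, dropped
]

def mean_solubility(rows, solvent="methanol"):
    """Average repeated measurements per solute for one solvent,
    skipping rows flagged DONOTUSE."""
    grouped = defaultdict(list)
    for solute, row_solvent, value, flag in rows:
        if row_solvent != solvent or flag == "DONOTUSE":
            continue
        grouped[solute].append(value)
    return {solute: mean(values) for solute, values in grouped.items()}

averaged = mean_solubility(records)
print(averaged)
```

Each solute ends up with a single value, so a solute measured many times does not carry more weight in the regression than one measured once.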

This left 47 solutes: AllDataWithDescriptors.xlsx

The descriptors were calculated using Bioclipse version 2.4.0.RC1. Only CDK REST descriptors were included. A random forest model was built in R to identify the 30 most important descriptors.

importance.png

Models were built in R using linear regression with forward stepwise selection of descriptors, and each model was evaluated with 10-fold cross-validation.
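To show what a 10-fold split looks like for a data set of this size, here is a small Python sketch. It is not the R code used for this page. The function name `k_fold_splits` and the seed are assumptions made for illustration, and the sketch only produces the train/test index sets; fitting the regression on each training set is left out.

```python
import random

def k_fold_splits(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # fixed seed keeps the split repeatable
    folds = [indices[i::k] for i in range(k)]  # k folds of nearly equal size
    for i, test in enumerate(folds):
        # Training set is every fold except the one held out for testing.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_splits(47, k=10))
fold_sizes = sorted(len(test) for _, test in splits)
print(fold_sizes)
```

With 47 solutes and 10 folds, seven folds hold 5 solutes and three hold 4, so each model in the cross-validation is trained on 42 or 43 solutes and tested on the rest.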

10-foldcv.png

The summary of the 12-descriptor model is below:
Residuals:
    Min      1Q  Median      3Q     Max
-0.5428 -0.1653 -0.0197  0.2331  0.4971
 
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.42744    0.30432    4.69  4.3e-05 ***
VCCalogp     0.42593    0.06449    6.60  1.4e-07 ***
VCClogs     -0.30819    0.14562   -2.12  0.04171 *
nAtomP      -0.17012    0.02573   -6.61  1.4e-07 ***
ATSm2        0.24124    0.10687    2.26  0.03053 *
ATSm3       -0.14630    0.05001   -2.93  0.00609 **
ATSc2       -3.57948    0.99193   -3.61  0.00098 ***
ATSp1       -0.04060    0.00758   -5.36  5.9e-06 ***
ATSp2        0.03552    0.00669    5.31  6.8e-06 ***
AMR          0.03140    0.01047    3.00  0.00504 **
C2SP3        0.14643    0.05700    2.57  0.01476 *
SP.4        -0.63930    0.26188   -2.44  0.02000 *
VP.6        -4.95101    0.88413   -5.60  2.9e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 
Residual standard error: 0.322 on 34 degrees of freedom
Multiple R-squared: 0.875,      Adjusted R-squared: 0.831
F-statistic: 19.9 on 12 and 34 DF,  p-value: 6.14e-12
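The summary statistics above are related by standard formulas, and the reported numbers can be checked against each other. The Python sketch below recomputes the adjusted R-squared, the F-statistic, and one t value from the rounded figures in the output. The variable names are my own, and small differences from the printed values are expected because the inputs are already rounded.

```python
n_obs = 47       # solutes in the data set
n_pred = 12      # descriptors in the model
r_squared = 0.875

# Residual degrees of freedom: observations minus predictors minus intercept.
resid_df = n_obs - n_pred - 1

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / resid_df

# Overall F-statistic on (n_pred, resid_df) degrees of freedom.
f_stat = (r_squared / n_pred) / ((1 - r_squared) / resid_df)

# Each t value is the estimate divided by its standard error (intercept shown).
t_intercept = 1.42744 / 0.30432

print(resid_df, round(adj_r_squared, 3), round(f_stat, 1), round(t_intercept, 2))
```

The residual degrees of freedom come out to 34 and the adjusted R-squared to 0.831, matching the output. The recomputed F-statistic is about 19.8 rather than 19.9, which is consistent with R-squared having been rounded to three decimals.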
 

Actual vs. predicted values can be found in this summary spreadsheet; see the figure below.

carboxmodel.png

Conclusion

The above model can be used to make predictions for carboxylic acids with unknown methanol solubility.
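As a rough illustration of how such a prediction would be made, the Python sketch below applies the fitted coefficients from the model summary to a set of descriptor values. The function name `predict_response` is an assumption, and the descriptor values in the example are arbitrary placeholders, not those of a real compound. The result is in whatever units the response variable was fitted in, which this page does not state.

```python
# Coefficients copied from the 12-descriptor model summary.
INTERCEPT = 1.42744
COEFFICIENTS = {
    "VCCalogp": 0.42593,
    "VCClogs": -0.30819,
    "nAtomP": -0.17012,
    "ATSm2": 0.24124,
    "ATSm3": -0.14630,
    "ATSc2": -3.57948,
    "ATSp1": -0.04060,
    "ATSp2": 0.03552,
    "AMR": 0.03140,
    "C2SP3": 0.14643,
    "SP.4": -0.63930,
    "VP.6": -4.95101,
}

def predict_response(descriptors):
    """Linear prediction: intercept plus the sum of coefficient * descriptor.
    Raises KeyError if any of the 12 descriptors is missing."""
    return INTERCEPT + sum(
        coef * descriptors[name] for name, coef in COEFFICIENTS.items()
    )

# Placeholder input: every descriptor zero except VCCalogp (not a real compound).
example = {name: 0.0 for name in COEFFICIENTS}
example["VCCalogp"] = 1.0
print(round(predict_response(example), 5))
```

Predictions are most trustworthy for compounds whose descriptor values fall within the range covered by the 47 solutes used to fit the model.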