SolubilityModel004

General Solute Model: Solubility in Methanol (Last Updated: June 8, 2010)
Researcher: Andrew Lang
Data is from the Open Notebook Science Challenge: [|SolubilitiesSum2010-05-17.xls]

Since some molecules have multiple measured solubilities, repeated measurements were aggregated by taking the mean value. Included all solutes with measured solubility values in methanol that are solid at room temperature. Excluded inorganics and entries marked with DONOTUSE. Excluded 2,6-dichlorobenzaldehyde, 2-chloro-5-nitrobenzaldehyde, and 4-nitrobenzaldehyde because they react with methanol to form a hemiacetal. Excluded meloxicam, piroxicam, riboflavin, luteolin, imidacloprid, and temazepam because the VCClogs prediction failed for them.
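The aggregation and two of the exclusion rules above can be sketched as follows. This is a minimal illustration in Python, not the procedure actually used for this model: the record fields (`solute`, `solubility`, `flag`) and the example values are hypothetical, and the filters for inorganics and for compounds that are not solid at room temperature are left out.

```python
from collections import defaultdict
from statistics import mean

# Exclusion lists taken from the text above.
REACTS_WITH_METHANOL = {
    "2,6-dichlorobenzaldehyde",
    "2-chloro-5-nitrobenzaldehyde",
    "4-nitrobenzaldehyde",
}
VCCLOGS_FAILED = {
    "meloxicam", "piroxicam", "riboflavin",
    "luteolin", "imidacloprid", "temazepam",
}


def aggregate_solubilities(records):
    """Return {solute: mean solubility}, skipping excluded entries.

    The record layout is an assumption made for this sketch.
    """
    grouped = defaultdict(list)
    for rec in records:
        name = rec["solute"]
        if rec.get("flag") == "DONOTUSE":
            continue  # entry marked as unusable
        if name in REACTS_WITH_METHANOL or name in VCCLOGS_FAILED:
            continue  # excluded by name
        grouped[name].append(rec["solubility"])
    return {name: mean(values) for name, values in grouped.items()}


# Placeholder records with made-up values, NOT measured data.
example_records = [
    {"solute": "solute_a", "solubility": 2.0},
    {"solute": "solute_a", "solubility": 3.0},
    {"solute": "solute_b", "solubility": 1.0},
    {"solute": "solute_b", "solubility": 9.0, "flag": "DONOTUSE"},
    {"solute": "piroxicam", "solubility": 0.5},
]

means = aggregate_solubilities(example_records)
print(means)
```

Here the two `solute_a` measurements collapse to one mean, the flagged `solute_b` entry is ignored, and piroxicam is dropped by name.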

This left 131 solutes (14 aldehydes, 3 amines, 47 carboxylic acids, 50 non-Ugi related, 17 Ugi Products): [|AllData.xlsx]

The descriptors were calculated using Bioclipse Version: 2.4.0.RC1. Included CDK REST descriptors only. The CDK failed to calculate descriptors for urea, so it was removed from the data set, leaving 130 solutes. A random forest was built in R to identify the 30 most important descriptors.

Models were built using linear regression with forward stepwise selection of descriptors: [|data with descriptors]
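Forward stepwise selection can be sketched as below. The selection criterion and stopping rule used for this model are not stated on this page, so the sketch assumes the simplest variant: at each step, add the descriptor that most reduces the residual sum of squares, and stop at a fixed number of terms. It is written in plain Python rather than R, and the function names and toy data are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    Assumes a is non-singular (no perfectly collinear descriptors).
    """
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / m[i][i]
    return x


def rss(columns, y):
    """Residual sum of squares of a least-squares fit with intercept."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(design[0])
    xtx = [[sum(row[p] * row[q] for row in design) for q in range(k)]
           for p in range(k)]
    xty = [sum(design[i][p] * y[i] for i in range(n)) for p in range(k)]
    beta = solve_linear(xtx, xty)  # normal equations
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in design]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))


def forward_stepwise(descriptors, y, max_terms):
    """Greedily pick up to max_terms descriptor names by lowest RSS."""
    selected, remaining = [], list(descriptors)
    while remaining and len(selected) < max_terms:
        best = min(
            remaining,
            key=lambda name: rss(
                [descriptors[d] for d in selected + [name]], y),
        )
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (made up): y depends on "a" and "b" only.
toy = {
    "a": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [1.0, 0.0, 1.0, 0.0, 2.0, 1.0],
    "c": [3.0, 1.0, 4.0, 1.0, 5.0, 9.0],
}
toy_y = [1 + 2 * a + 0.5 * b for a, b in zip(toy["a"], toy["b"])]
chosen = forward_stepwise(toy, toy_y, max_terms=2)
print(chosen)
```

On the toy data the procedure picks `a` first and `b` second, and never selects the irrelevant column `c`.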

The summary of the 11-descriptor model is below:
code
Residuals:
     Min       1Q   Median       3Q      Max
-1.74056 -0.25084  0.02844  0.31104  1.14220

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.2098705  0.2244443   5.391 3.65e-07 ***
VCCalogp     0.2945028  0.0684732   4.301 3.52e-05 ***
VCClogs      0.2377867  0.0850861   2.795 0.006066 **
nAtomP      -0.0565363  0.0152603  -3.705 0.000323 ***
SC.3         1.1757485  0.2494116   4.714 6.70e-06 ***
SP.7        -0.7449714  0.1545837  -4.819 4.33e-06 ***
VC.3        -1.2766976  0.3222959  -3.961 0.000128 ***
ATSm3       -0.0714896  0.0167511  -4.268 4.01e-05 ***
ATSc2       -1.9143793  0.6141099  -3.117 0.002293 **
ATSp4        0.0009114  0.0002112   4.315 3.34e-05 ***
nHBAcc      -0.1388953  0.0589180  -2.357 0.020048 *
nHBDon      -0.1411440  0.0734206  -1.922 0.056966 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.5349 on 118 degrees of freedom
Multiple R-squared: 0.6899,    Adjusted R-squared: 0.661
F-statistic: 23.86 on 11 and 118 DF,  p-value: < 2.2e-16
code
The results of actual vs. predicted can be found in this [|summary spreadsheet], see figure below.
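As a quick consistency check, the reported fit statistics can be recomputed from one another. The sketch below uses only numbers from the summary and the standard formulas for adjusted R-squared and the F-statistic; the variable names are illustrative.

```python
# Values copied from the model summary.
r_squared = 0.6899
n_descriptors = 11
residual_df = 118

# Residual df = observations - (descriptors + intercept).
n_observations = residual_df + n_descriptors + 1

# Adjusted R-squared penalises the number of fitted terms.
adjusted_r_squared = 1 - (1 - r_squared) * (n_observations - 1) / residual_df

# F-statistic for the overall regression.
f_statistic = (r_squared / n_descriptors) / ((1 - r_squared) / residual_df)

print(n_observations)
print(round(adjusted_r_squared, 3))
print(round(f_statistic, 2))
```

The implied number of observations is 130, which matches the 131 solutes listed above minus urea. The recomputed adjusted R-squared and F-statistic agree with the reported values to within the rounding of the printed R-squared.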



Conclusion
The above model can be used to make predictions for compounds with unknown methanol solubility. It performs better than both the [|original model] and the previous model, whose descriptors were calculated with a version of Bioclipse that had a bug: it converted all SMILES to upper case before descriptor calculation.
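Applying the model amounts to evaluating the fitted linear equation. The sketch below copies the coefficients from the summary above; the function name `predict` and the input layout are hypothetical, the descriptor values must come from the same Bioclipse/CDK calculation used to build the model, and the units of the predicted value follow those of the response in the training data, which this page does not state.

```python
# Coefficients copied from the 11-descriptor model summary.
INTERCEPT = 1.2098705
COEFFICIENTS = {
    "VCCalogp": 0.2945028,
    "VCClogs": 0.2377867,
    "nAtomP": -0.0565363,
    "SC.3": 1.1757485,
    "SP.7": -0.7449714,
    "VC.3": -1.2766976,
    "ATSm3": -0.0714896,
    "ATSc2": -1.9143793,
    "ATSp4": 0.0009114,
    "nHBAcc": -0.1388953,
    "nHBDon": -0.1411440,
}


def predict(descriptors):
    """Evaluate intercept + sum(coefficient * descriptor value).

    `descriptors` maps each descriptor name to its numeric value.
    """
    missing = set(COEFFICIENTS) - set(descriptors)
    if missing:
        raise ValueError(f"missing descriptors: {sorted(missing)}")
    return INTERCEPT + sum(
        coef * descriptors[name] for name, coef in COEFFICIENTS.items()
    )


# Sanity check with all descriptors set to zero (not a real compound):
# the prediction reduces to the intercept.
baseline = predict(dict.fromkeys(COEFFICIENTS, 0.0))
print(baseline)
```

Raising an error on a missing descriptor, rather than silently treating it as zero, guards against predictions for compounds where a descriptor calculation failed, as happened with VCClogs for several solutes.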