Linear regression model for solubility in methanol [01-14-2009].

Since some molecules had multiple measured solubilities, repeated measurements were aggregated by taking the mean value.
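The averaging step can be sketched as follows. This is a minimal illustration, not the script actually used: the function name `aggregate_by_mean` is hypothetical and the solubility values are made up.

```python
from collections import defaultdict
from statistics import mean

# Made-up (SMILES, solubility) rows for illustration only;
# these numbers are NOT taken from the dataset.
measurements = [
    ("c1ccccc1", 2.0),
    ("c1ccccc1", 4.0),
    ("c1ccc2ccccc2c1", 1.5),
]

def aggregate_by_mean(rows):
    """Collapse repeated measurements of one molecule into their mean."""
    grouped = defaultdict(list)
    for smiles, value in rows:
        grouped[smiles].append(value)
    return {smiles: mean(values) for smiles, values in grouped.items()}

averaged = aggregate_by_mean(measurements)
```

Each molecule then contributes exactly one row to the regression, regardless of how many times it was measured.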

Excluded O=CO (formic acid) and O=C(O)C (acetic acid), since their mean solubilities put them at the high end of the solubility histograms. Models that did include them also consistently flagged both as outliers by Cook's distance. Also excluded inorganics and entries marked DONOTUSE. The resulting training set contained 57 molecules.
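For reference, the form of the Cook's distance calculation is sketched below for the simplest case, a least-squares fit with an intercept and one descriptor. This is a simplification: the models here used several descriptors, the data are made up, and the function name `cooks_distances` is hypothetical.

```python
def cooks_distances(x, y):
    """Cook's distance for each point of a one-descriptor OLS fit."""
    n = len(x)
    p = 2  # fitted parameters: intercept and slope
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s2 = sum(e * e for e in residuals) / (n - p)  # residual variance
    # Leverage (hat matrix diagonal) for simple linear regression.
    leverage = [1.0 / n + (xi - x_mean) ** 2 / sxx for xi in x]
    return [
        (e * e / (p * s2)) * h / (1.0 - h) ** 2
        for e, h in zip(residuals, leverage)
    ]

# Made-up data: the last point is deliberately far off the trend.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 6.9, 14.0]
distances = cooks_distances(x, y)
most_influential = max(range(len(distances)), key=distances.__getitem__)
```

A point with both a large residual and high leverage, like the last one here, gets the largest distance.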

Calculated 195 constitutional and topological descriptors using CDK 1.2.x and reduced them to a pool of 44 descriptors using r^2 (correlation) and constant-value tests. The 44-descriptor pool is available as methanol2.csv (includes the averaged solubilities). Performed brute-force descriptor selection on the 44-member pool, looking at all 2-, 3-, 4- and 5-descriptor subsets and identifying the models with the lowest RMSE. Settled on a 4-descriptor model, since it had the most significant coefficients and the model F-value started decreasing when I moved to 5-descriptor models.
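The brute-force search can be sketched as below: fit an ordinary least-squares model on every descriptor subset of a given size and keep the one with the lowest RMSE. This is a toy sketch under assumptions: the helper names (`solve`, `fit_rmse`, `best_subset`) are hypothetical, the data are made up, and RMSE is taken here as the root of the mean squared residual on the fitting data.

```python
from itertools import combinations
from math import comb, sqrt

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(map(float, row)) + [float(b[i])] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def fit_rmse(X, y, cols):
    """Fit OLS (with intercept) on the chosen columns; return the RMSE."""
    rows = [[1.0] + [row[c] for c in cols] for row in X]
    p = len(cols) + 1
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    beta = solve(xtx, xty)
    resid = [yi - sum(b * v for b, v in zip(beta, r)) for r, yi in zip(rows, y)]
    return sqrt(sum(e * e for e in resid) / len(y))

def best_subset(X, y, size):
    """Return the column subset of the given size with the lowest RMSE."""
    candidates = combinations(range(len(X[0])), size)
    return min(candidates, key=lambda cols: fit_rmse(X, y, cols))

# Made-up pool of 4 descriptors; y depends only on columns 0 and 2.
X = [[1, 0, 2, 5], [2, 1, 0, 3], [3, 0, 1, 4],
     [4, 1, 3, 1], [5, 0, 0, 2], [6, 1, 2, 6]]
y = [3 + 2 * row[0] - row[2] for row in X]
best = best_subset(X, y, 2)

# Number of models in a full 2- to 5-descriptor search over 44 descriptors.
n_models = sum(comb(44, k) for k in range(2, 6))
```

With a 44-descriptor pool, `n_models` works out to 1,235,949 fits, which is why exhaustive search is feasible here but would not be much beyond 5 descriptors.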

The summary of the 4-descriptor model is below:
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  12.3804     0.7817  15.837  < 2e-16 ***
nAromBond    -0.4998     0.1014  -4.929 8.82e-06 ***
C2SP3        -0.6229     0.1195  -5.213 3.26e-06 ***
nHBAcc       -1.3857     0.2462  -5.628 7.39e-07 ***
khs.ddsN     -5.1486     1.4927  -3.449  0.00112 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
 
Residual standard error: 2.459 on 52 degrees of freedom
Multiple R-squared: 0.6593,    Adjusted R-squared: 0.6331
F-statistic: 25.16 on 4 and 52 DF,  p-value: 1.258e-11
Since this model was fit to the entire dataset, I performed leave-one-out (LOO) cross-validation, giving q^2 = 0.58. The LOO results also showed that going from 4 to 5 descriptors widened the gap between the model r^2 and q^2, which justifies stopping at 4 descriptors.
solmodel002.png
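A minimal sketch of the LOO q^2 calculation follows. It assumes q^2 = 1 - PRESS / SS_tot, with SS_tot taken around the mean of all observations (other conventions exist). The helper names are hypothetical and the data are made up.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(map(float, row)) + [float(b[i])] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def ols_fit(X, y):
    """Return [intercept, b1, b2, ...] from the normal equations."""
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)

def predict(beta, x):
    return beta[0] + sum(b * v for b, v in zip(beta[1:], x))

def ss_total(y):
    y_mean = sum(y) / len(y)
    return sum((v - y_mean) ** 2 for v in y)

def r_squared(X, y):
    beta = ols_fit(X, y)
    rss = sum((yi - predict(beta, xi)) ** 2 for xi, yi in zip(X, y))
    return 1.0 - rss / ss_total(y)

def loo_q2(X, y):
    """Refit with each molecule left out, then predict the held-out one."""
    press = 0.0
    for i in range(len(y)):
        beta = ols_fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - predict(beta, X[i])) ** 2
    return 1.0 - press / ss_total(y)

# Made-up data with two descriptors and some noise.
X = [[1, 0], [2, 1], [3, 0], [4, 1], [5, 0], [6, 1], [7, 0], [8, 1]]
y = [3.2, 4.1, 7.0, 8.3, 10.9, 12.2, 15.1, 15.8]
r2 = r_squared(X, y)
q2 = loo_q2(X, y)
```

For least-squares fits q^2 can never exceed r^2 under this definition, so the size of the gap between the two is the useful diagnostic.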

TSET/PSET Split


In the interest of completeness, I split the 57 molecules into a training set (45 molecules) and a prediction set (12 molecules), rebuilt the 4-descriptor model on the training set, and used it to predict the prediction set. The summary for the model is:
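How the 45/12 split was chosen is not recorded here. The sketch below assumes a simple seeded random split; the function name `split_indices` and the seed are hypothetical.

```python
import random

def split_indices(n_total, n_prediction, seed=0):
    """Randomly partition row indices into training and prediction sets."""
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)  # seeded, so the split is repeatable
    prediction = sorted(indices[:n_prediction])
    training = sorted(indices[n_prediction:])
    return training, prediction

# 57 molecules in total, 12 held out for prediction (assumed random).
tset, pset = split_indices(57, 12)
```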
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  12.5514     0.8176  15.351  < 2e-16 ***
nAromBond    -0.4889     0.1068  -4.576 4.52e-05 ***
C2SP3        -0.6640     0.1262  -5.261 5.14e-06 ***
nHBAcc       -1.2661     0.2529  -5.007 1.16e-05 ***
khs.ddsN     -5.3677     1.4721  -3.646 0.000759 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
 
Residual standard error: 2.396 on 40 degrees of freedom
Multiple R-squared: 0.7089,    Adjusted R-squared: 0.6798
F-statistic: 24.35 on 4 and 40 DF,  p-value: 2.902e-10
solmodel002-split.png
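As a sanity check, the adjusted R-squared and F-statistic in both summaries can be recomputed from the reported R-squared, the number of molecules n, and the number of descriptors k. The sketch below uses the standard formulas; only the function names are introduced here.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for n observations and k descriptors (plus intercept)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

def f_statistic(r2, n, k):
    """Overall F-statistic on k and n - k - 1 degrees of freedom."""
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))

# Values copied from the two model summaries.
full_adj = adjusted_r2(0.6593, n=57, k=4)
full_f = f_statistic(0.6593, n=57, k=4)
split_adj = adjusted_r2(0.7089, n=45, k=4)
split_f = f_statistic(0.7089, n=45, k=4)

print(round(full_adj, 4), round(full_f, 2))
print(round(split_adj, 4), round(split_f, 2))
```

Both sets of values agree with the reported ones to within the rounding of R-squared, and the residual degrees of freedom (52 and 40) match n - k - 1.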

Summary


Overall, this is not a very good model (full-data R^2 = 0.66, LOO q^2 = 0.58). Refitting with the same descriptors using robust regression (rlm in the MASS package) confirmed the low quality of the model, and also supported the choice of the 4-descriptor model. This analysis should be followed by a PLS-based interpretation. In this case the descriptors are a little more informative, but since they are all count-based descriptors, the applicability domain of this model is probably limited.
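To make the applicability-domain concern concrete, the sketch below applies the full-data coefficients to a descriptor vector and flags any descriptor outside the range seen in training. The coefficients are copied from the 4-descriptor summary; the training ranges, the example molecule, and the function names are hypothetical placeholders, since the real ranges would have to be read from methanol2.csv.

```python
# Coefficients from the full-data 4-descriptor model summary.
INTERCEPT = 12.3804
COEFFS = {
    "nAromBond": -0.4998,
    "C2SP3": -0.6229,
    "nHBAcc": -1.3857,
    "khs.ddsN": -5.1486,
}

# HYPOTHETICAL (min, max) descriptor ranges of the training set;
# placeholders only, not values taken from the data.
TRAINING_RANGES = {
    "nAromBond": (0, 12),
    "C2SP3": (0, 6),
    "nHBAcc": (0, 4),
    "khs.ddsN": (0, 1),
}

def predict_solubility(descriptors):
    """Linear model: intercept plus coefficient times descriptor count."""
    return INTERCEPT + sum(COEFFS[name] * descriptors[name] for name in COEFFS)

def outside_domain(descriptors):
    """List descriptors whose value falls outside the training range."""
    return [
        name
        for name, (low, high) in TRAINING_RANGES.items()
        if not low <= descriptors[name] <= high
    ]

# Hypothetical molecule: six aromatic bonds, all other counts zero.
example = {"nAromBond": 6, "C2SP3": 0, "nHBAcc": 0, "khs.ddsN": 0}
prediction = predict_solubility(example)
flags = outside_domain(example)
```

Because every coefficient is negative and every descriptor is a count, the model can only lower its prediction from the intercept, which is one reason extrapolating to molecules with larger counts than the training set is risky.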