Linear regression model for solubility in methanol [01-13-2009].

Since some molecules had multiple measured solubilities, repeated measurements were aggregated by taking the mean value.
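
The averaging step can be sketched as follows. This is a minimal illustration, not the code used for the analysis: the function name `aggregate_by_mean` and the example values are hypothetical, and measurements are assumed to arrive as (SMILES, solubility) pairs.

```python
from collections import defaultdict
from statistics import mean


def aggregate_by_mean(measurements):
    """Collapse repeated (smiles, solubility) pairs to one mean value per molecule."""
    grouped = defaultdict(list)
    for smiles, value in measurements:
        grouped[smiles].append(value)
    return {smiles: mean(values) for smiles, values in grouped.items()}


# Made-up values for illustration only; they are not taken from the dataset.
example = [("CCO", 1.0), ("CCO", 3.0), ("c1ccccc1", 2.5)]
averaged = aggregate_by_mean(example)
```

Molecules with a single measurement pass through unchanged, so the output has exactly one row per unique structure.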

Excluded the values for O=CO and O=C(O)C (formic and acetic acid), since their mean solubilities lay at the extreme high end of the solubility histogram. They were also consistently flagged as outliers (via Cook's distance) in models that did include them. The resulting training set contained 63 molecules.

Calculated 195 constitutional and topological descriptors using the CDK 1.2.x and reduced them to a pool of 44 descriptors using r^2 (correlation) and constant-value tests. The 44-descriptor pool is available as methanol.csv (includes the averaged solubilities). Performed brute-force descriptor selection on the 44-member pool, examining all 2-, 3-, 4- and 5-descriptor subsets and identifying the models with the lowest RMSE. Settled on a 3-descriptor model, since it had the most significant coefficients and the model F-value started decreasing on moving to 4- and 5-descriptor models.
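
The brute-force search described above can be sketched in plain Python. This is an illustrative reimplementation under stated assumptions, not the original procedure: all function names are hypothetical, the fit is ordinary least squares via the normal equations, RMSE is computed on the fitting data, and the toy descriptors `a`, `b`, `c` are invented for the example.

```python
from itertools import combinations
from math import sqrt


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    Assumes a is non-singular (no constant or perfectly collinear columns).
    """
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ols(rows, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    X = [[1.0] + list(r) for r in rows]
    p = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(p)] for i in range(p)]
    xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(p)]
    return solve(xtx, xty)


def rmse(rows, y, coef):
    """Root-mean-square error of the fitted model on the given rows."""
    preds = [coef[0] + sum(c * v for c, v in zip(coef[1:], r)) for r in rows]
    return sqrt(sum((yi - pi) ** 2 for yi, pi in zip(y, preds)) / len(y))


def best_subset(descriptors, y, size):
    """Return (rmse, names) for the lowest-RMSE descriptor subset of a given size."""
    best = None
    for names in combinations(sorted(descriptors), size):
        rows = list(zip(*(descriptors[name] for name in names)))
        score = rmse(rows, y, fit_ols(rows, y))
        if best is None or score < best[0]:
            best = (score, names)
    return best


# Toy data: y depends exactly on a and b, so that pair should be selected.
descriptors = {
    "a": [0, 1, 2, 3, 4, 5],
    "b": [1, 0, 2, 1, 3, 0],
    "c": [2, 5, 1, 4, 0, 3],
}
y = [1 + 2 * a - b for a, b in zip(descriptors["a"], descriptors["b"])]
score, names = best_subset(descriptors, y, 2)
```

For a 44-descriptor pool the search covers 946 pairs, 13,244 triples, 135,751 quadruples and 1,086,008 quintuples, which is why exhaustive enumeration is still feasible at these subset sizes.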

The summary of the 3-descriptor model is below:
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   9.6137     0.8360  11.500  < 2e-16 ***
WTPT.4       -0.3482     0.1213  -2.871  0.00567 **
C2SP3        -0.5882     0.1299  -4.529 2.93e-05 ***
WTPT.5       -0.7652     0.2469  -3.100  0.00297 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
 
Residual standard error: 3.089 on 59 degrees of freedom
Multiple R-squared: 0.4536,    Adjusted R-squared: 0.4258
F-statistic: 16.32 on 3 and 59 DF,  p-value: 7.712e-08
Since this model was fit to the entire dataset, I performed leave-one-out (LOO) cross-validation, giving q^2 = 0.38. The LOO results also showed that going from 3 to 4 or 5 descriptors widened the gap between the model r^2 and q^2, which justifies stopping at 3 descriptors.
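
For reference, a minimal sketch of the LOO q^2 calculation follows. The entry does not state which q^2 formula was used, so this assumes the common definition q^2 = 1 - PRESS / SS_total with the total sum of squares taken about the mean of all observations. To keep it short the sketch uses a single-descriptor line fit; the function names and data are hypothetical.

```python
from statistics import mean


def fit_line(xs, ys):
    """Closed-form simple linear regression; returns (intercept, slope)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def total_ss(ys):
    """Sum of squared deviations of ys about their mean."""
    y_bar = mean(ys)
    return sum((y - y_bar) ** 2 for y in ys)


def r2(xs, ys):
    """Ordinary r^2 of the model fit to all the data."""
    intercept, slope = fit_line(xs, ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return 1.0 - ss_res / total_ss(ys)


def loo_q2(xs, ys):
    """q^2 = 1 - PRESS / SS_total, refitting with each point held out in turn."""
    press = 0.0
    for i in range(len(ys)):
        intercept, slope = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        press += (ys[i] - (intercept + slope * xs[i])) ** 2
    return 1.0 - press / total_ss(ys)


# Made-up data for illustration only.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.1, 0.9, 2.3, 2.8, 4.2, 4.9]
gap = r2(xs, ys) - loo_q2(xs, ys)
```

For a least-squares fit the held-out residuals are never smaller in magnitude than the fitted residuals, so q^2 cannot exceed r^2; a growing gap as descriptors are added is the overfitting signal used above.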

solmodel001.png
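
As a sanity check on the summary of the full-data model above, the adjusted R-squared and F-statistic can be recomputed from the printed R-squared, the number of molecules (63) and the number of descriptors (3). The helper names are hypothetical; the formulas are the standard ones for a linear model with an intercept.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for n observations and p descriptors (plus an intercept)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


def f_statistic(r2, n, p):
    """Overall F-statistic on p and n - p - 1 degrees of freedom."""
    return (r2 / p) / ((1.0 - r2) / (n - p - 1))


# Inputs taken from the printed summary: R^2 = 0.4536, 63 molecules, 3 descriptors.
full_adj = adjusted_r2(0.4536, 63, 3)
full_f = f_statistic(0.4536, 63, 3)
```

Both values agree with the reported 0.4258 and 16.32 to within the rounding of the printed R-squared.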


TSET/PSET Split


For completeness, I split the 63 molecules into a training set (50 molecules) and a prediction set (13 molecules), rebuilt the 3-descriptor model on the training set and used it to predict the prediction set. The summary for the model is:
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   9.9929     0.8442  11.838 1.46e-15 ***
WTPT.4       -0.3860     0.1160  -3.329  0.00172 **
C2SP3        -0.6322     0.1202  -5.260 3.66e-06 ***
WTPT.5       -0.6153     0.2166  -2.841  0.00668 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
 
Residual standard error: 2.578 on 46 degrees of freedom
Multiple R-squared: 0.5576,    Adjusted R-squared: 0.5288
F-statistic: 19.33 on 3 and 46 DF,  p-value: 2.973e-08
Though this model is better than the original model (built on all the data), that is down to a lucky split: other TSET/PSET splits lead to different (poorer) models. A plot of predicted versus observed values for the training and prediction sets using this model is given below.

solmodel001-split.png
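
The sensitivity to the choice of split can be checked by repeating the 50/13 split many times and comparing prediction-set errors. The sketch below is only an illustration under stated assumptions: it uses synthetic stand-in data and a single-descriptor line fit rather than the real descriptors, and every name in it is hypothetical.

```python
import random
from math import sqrt
from statistics import mean


def fit_line(xs, ys):
    """Closed-form simple linear regression; returns (intercept, slope)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def prediction_rmse(train, test):
    """Fit on the training pairs and return the RMSE on the prediction pairs."""
    intercept, slope = fit_line([x for x, _ in train], [y for _, y in train])
    errors = [(y - (intercept + slope * x)) ** 2 for x, y in test]
    return sqrt(mean(errors))


def split_rmses(data, n_train, n_repeats, seed=0):
    """Prediction-set RMSE for repeated random training/prediction splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_repeats):
        shuffled = list(data)
        rng.shuffle(shuffled)
        scores.append(prediction_rmse(shuffled[:n_train], shuffled[n_train:]))
    return scores


# Synthetic stand-in data (63 points); not the methanol solubility set.
noise = random.Random(1)
data = [(i / 10, 2.0 * i / 10 + noise.gauss(0.0, 1.0)) for i in range(63)]
scores = split_rmses(data, n_train=50, n_repeats=20)
spread = max(scores) - min(scores)
```

With only 13 molecules in the prediction set, the spread between the best and worst split gives a rough sense of how much of any single split's performance is luck.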

Summary


Overall, not a very good model. Refitting with these descriptors using robust regression (rlm in the MASS package) confirmed the low quality of the model. Robust regression also suggested that a 3-descriptor model would be better than a 4- or 5-descriptor model.

This analysis should be followed by a PLS-based interpretation, though given the nature of the descriptors this may not be very informative. Off the top of my head, the descriptors mainly encode molecular size and branching.