Nonlinear regression model for solubility of Ugi reactants in any solvent [Updated: 09-30-2009].

Since some molecules had multiple measured solubilities, repeated measurements were aggregated by taking the mean value.
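The aggregation step can be sketched as follows. This is only an illustration: the function name, the tuple layout, and the choice to group by (compound, solvent) pair are assumptions, and the numbers are made-up placeholders rather than values from the dataset.

```python
from collections import defaultdict
from statistics import mean


def aggregate_mean(measurements):
    """Collapse repeated measurements to one mean value per (compound, solvent).

    `measurements` is an iterable of (compound, solvent, solubility) tuples.
    Grouping by compound AND solvent is an assumption about how the
    repeats were keyed.
    """
    grouped = defaultdict(list)
    for compound, solvent, solubility in measurements:
        grouped[(compound, solvent)].append(solubility)
    return {key: mean(values) for key, values in grouped.items()}


# Placeholder values for illustration only (not real measurements).
measurements = [
    ("compound A", "methanol", 2.5),
    ("compound A", "methanol", 2.7),
    ("compound A", "toluene", 0.9),
]
aggregated = aggregate_mean(measurements)
```

Each (compound, solvent) pair then contributes exactly one data point to the regression, so heavily re-measured compounds do not get extra weight.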

Included all compounds marked amine, isonitrile, aldehyde, or carboxylic acid that are solid at room temperature in solvents: acetonitrile, chloroform, DMSO, ethanol, methanol, THF, and toluene. Excluded inorganics and entries marked with DONOTUSE.

Calculated descriptors using the CDK Descriptor Calculator (v0.94) for both the compounds and the solvents. Added dipole moment and dielectric constant as descriptors for the solvents. On the suggestion of Rajarshi Guha and Egon Willighagen, part of the data was reserved for testing and only 200 data points were used for modeling. An improved method of descriptor selection was also used, but descriptor selection remains the area in which a more sophisticated method would most improve the model.
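Reserving part of the data for testing could look like the sketch below. How the 200 modeling points were actually chosen is not stated here, so the random shuffle, the seed, and the function name are assumptions. The total of 244 rows is a placeholder taken from the count quoted in the Summary.

```python
import random


def split_train_test(rows, n_train, seed=0):
    """Randomly split rows into a modeling set and a held-out test set.

    A seeded shuffle is an assumption; the original selection procedure
    is not described in the text.
    """
    if n_train > len(rows):
        raise ValueError("n_train exceeds the number of available rows")
    rng = random.Random(seed)
    shuffled = list(rows)  # copy so the caller's data is left untouched
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


# Integers stand in for data rows; 244 is a placeholder total.
rows = list(range(244))
train, test = split_train_test(rows, n_train=200)
```

The model is then fitted on `train` only, and `test` is used to see how well the predictions hold up on data the fit never saw.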

The summary of the 10-descriptor model is below (using 200 data points):
Coefficients:
                           Estimate Std.Error TStat     PValue
(Intercept)              -133.454   27.0555  -4.9326  1.7724*10^-6 ***
1/AMR                     -31.4476   4.4002  -7.1468  < 10^-9      ***
1/Kier1                    36.6713   8.6481   4.2404  0.000035     ***
XLogP                       1.1049   0.1676   6.5922  < 10^-9      ***
BCUTc1                     -7.4035   3.3478  -2.2115  0.0282       ***
ATSc1                       6.0309   1.4934   4.0383  0.000078     ***
apol                       -0.1731   0.0476  -3.6359  0.000357     ***
apol * Solvent DC          -0.0053   0.0012  -4.3173  0.000025     ***
bpol * Solvent DC           0.0065   0.0017   3.6788  0.000305     ***
Solvent BCUTw1             11.6778   2.2714   5.1412  6.7852*10^-7 ***
Solvent TopoPSA             0.1071   0.0265   4.0423  0.000077     ***
---
Multiple R-squared: 0.5303,    Adjusted R-squared: 0.5054
p-value: 0 (less than 10^-9)

tsetpset.png
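The summary figures can be cross-checked with two textbook formulas. This assumes the table uses the usual definitions, which the text does not state explicitly: TStat = Estimate / Std.Error, and adjusted R-squared = 1 - (1 - R^2)(n - 1)/(n - p - 1), with n data points and p descriptors. The function names are illustrative.

```python
def adjusted_r_squared(r_squared, n_points, n_descriptors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r_squared) * (n_points - 1) / (n_points - n_descriptors - 1)


def t_statistic(estimate, std_error):
    """t statistic of a coefficient: the estimate divided by its standard error."""
    return estimate / std_error


# Values copied from the 200-point summary above.
print(round(adjusted_r_squared(0.5303, 200, 10), 4))  # 0.5054
print(round(t_statistic(-133.454, 27.0555), 4))       # -4.9326
```

Both results match the reported adjusted R-squared and the intercept's TStat. The other rows agree only up to rounding, since the printed estimates and standard errors are themselves rounded.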

Using all the data, the 10-descriptor model is as follows:
Coefficients:
                           Estimate Std.Error TStat     PValue
(Intercept)              -118.344   24.9484  -4.7435  3.6537*10^-6 ***
1/AMR                     -32.0374   4.1611  -7.6993  < 10^-9      ***
1/Kier1                    32.8133   7.7248   4.2478  0.000031     ***
XLogP                       1.1597   0.1511   7.6732  < 10^-9      ***
BCUTc1                     -5.8371   2.9327  -1.9903  0.0477       ***
ATSc1                       6.2259   1.3878   4.4862  0.000011     ***
apol                       -0.2016   0.0423  -4.7632  3.3435*10^-6 ***
apol * Solvent DC          -0.0055   0.0011  -4.9320  1.5436*10^-6 ***
bpol * Solvent DC           0.0068   0.0017   4.1215  0.000052     ***
Solvent BCUTw1             10.3717   2.0891   4.9646  1.3267*10^-6 ***
Solvent TopoPSA             0.1138   0.0238   4.7751  3.1681*10^-6 ***
---
Multiple R-squared: 0.5153,    Adjusted R-squared: 0.4946
p-value: 0 (less than 10^-9)
pic2.png
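To show how the coefficient table turns into a prediction, here is a minimal sketch that evaluates the all-data model for one compound/solvent pair. It rests on several assumptions: "1/AMR" and "1/Kier1" are read as reciprocals of the descriptors, "Solvent DC" as the solvent dielectric constant, and the starred rows as products of two descriptors. The function and argument names are hypothetical, the input values are placeholders rather than real CDK output, and the units or transform of the predicted solubility are not stated in this section.

```python
# Coefficients copied from the all-data 10-descriptor summary above.
COEFFICIENTS = {
    "(Intercept)": -118.344,
    "1/AMR": -32.0374,
    "1/Kier1": 32.8133,
    "XLogP": 1.1597,
    "BCUTc1": -5.8371,
    "ATSc1": 6.2259,
    "apol": -0.2016,
    "apol * Solvent DC": -0.0055,
    "bpol * Solvent DC": 0.0068,
    "Solvent BCUTw1": 10.3717,
    "Solvent TopoPSA": 0.1138,
}


def predict(amr, kier1, xlogp, bcutc1, atsc1, apol, bpol,
            solvent_dc, solvent_bcutw1, solvent_topopsa):
    """Evaluate the fitted model for one compound in one solvent."""
    terms = {
        "(Intercept)": 1.0,
        "1/AMR": 1.0 / amr,            # assumed reciprocal of AMR
        "1/Kier1": 1.0 / kier1,        # assumed reciprocal of Kier1
        "XLogP": xlogp,
        "BCUTc1": bcutc1,
        "ATSc1": atsc1,
        "apol": apol,
        "apol * Solvent DC": apol * solvent_dc,
        "bpol * Solvent DC": bpol * solvent_dc,
        "Solvent BCUTw1": solvent_bcutw1,
        "Solvent TopoPSA": solvent_topopsa,
    }
    return sum(COEFFICIENTS[name] * value for name, value in terms.items())


# Placeholder descriptor values for illustration only.
example = dict(amr=40.0, kier1=8.0, xlogp=1.5, bcutc1=-0.3, atsc1=0.2,
               apol=20.0, bpol=10.0, solvent_dc=30.0,
               solvent_bcutw1=12.0, solvent_topopsa=20.0)
prediction = predict(**example)
```

The model is "nonlinear" only in the descriptors (reciprocals and products); it is still linear in the coefficients, which is why a plain weighted sum is enough to evaluate it.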

The above model, though still not brilliant, is an improvement over the original model (below), at least when comparing the R-squared values (0.5153 versus 0.4660). It is interesting to note that several descriptors re-appear with similar coefficients, namely 1/AMR, apol, ATSc1, and XLogP, even though a different method of selecting descriptors was used (building up to 10 descriptors using combinations of subsets of descriptors). Again, the weakest area in the development of this model is the method of descriptor selection. It is also interesting to note that all the descriptors involving the solvents changed. Finally, with the data color-coded, it is easy to see that the model is based mainly on aldehydes and carboxylic acids.



Original Model: Nonlinear regression model for solubility of Ugi reactants in any solvent [09-17-2009].

Since some molecules had multiple measured solubilities, repeated measurements were aggregated by taking the mean value.

Included all compounds marked amine, isonitrile, aldehyde, or carboxylic acid that are solid at room temperature in solvents: acetonitrile, chloroform, DMSO, ethanol, methanol, THF, and toluene. Excluded inorganics and entries marked with DONOTUSE.

Calculated descriptors using the CDK Descriptor Calculator (v0.94) for both the compounds and the solvents. Added dipole moment and dielectric constant as descriptors for the solvents. Performed brute-force descriptor selection until all descriptors in the model had a p-value less than 0.02. This is an area in which a more sophisticated method for descriptor selection should result in a better model.

The summary of the 11-descriptor model is below (using entire dataset):
Coefficients:
                           Estimate Std.Error TStat     PValue
(Intercept)                 7.1795   0.8660   8.2901  < 10^-9      ***
Kier2                      -0.4307   0.2155  -1.9987  0.0468       ***
1/AMR                     -28.2691   4.4520  -6.3497  1.1104*10^-9 ***
apol                       -0.3336   0.0414  -8.0572  < 10^-9      ***
ATSc1                       6.4693   1.4212   4.5519  8.5471*10^-6 ***
nRotB                       0.3754   0.0803   4.6701  5.0757*10^-6 ***
TopoPSA                    -0.0347   0.0076  -4.5367  9.1357*10^-6 ***
XLogP                       0.9463   0.1644   5.7563  2.6830*10^-8 ***
apol*solvent dipole moment -0.0176   0.0071  -2.4881  0.0135       ***
solvent dielectric constant 0.0466   0.0132   3.5333  0.0005       ***
solvent ALogP              -0.5740   0.1931  -2.9723  0.0033       ***
solvent AMR                 0.1219   0.0266   4.5875  7.3148*10^-6 ***
---
Multiple R-squared: 0.4660,    Adjusted R-squared: 0.4432
p-value: 0 (less than 10^-9)

model3.png

Summary

The model is limited but can produce a solubility prediction for any compound in any solvent. Because it is based on Ugi reaction components, it will work best for aldehydes, amines, and carboxylic acids; isonitriles are poorly represented in the dataset (5 out of 244 values).