Abraham descriptors: S is the solute dipolarity/polarizability. A is the solute overall (summation) hydrogen bond acidity. B is the solute overall (summation) hydrogen bond basicity.

Method

Duplicates (4-fluorobenzoic acid, 4-hydroxybenzaldehyde, benzoic acid, chloroacetic acid, ibuprofen, phenanthrene, phenylacetic acid, and salicylic acid) from the combined data of 252 molecules were removed, keeping the literature values over the calculated ones, leaving a dataset of 244 molecules. 2D CDK descriptors were calculated using the CDK Descriptor Calculator GUI (v 1.0.5): S, A, B.
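The duplicate handling described above can be sketched in a few lines of Python. This is an illustration only, not the procedure actually used on the spreadsheets: the function name is hypothetical and the numbers are placeholders, not real descriptor values.

```python
def merge_preferring_literature(literature, calculated):
    """Combine two name -> value mappings, keeping the literature value on conflict."""
    merged = dict(calculated)
    merged.update(literature)  # literature entries overwrite calculated ones
    return merged

# Placeholder values, chosen only to make the two sources distinguishable.
literature = {"benzoic acid": 1.0, "phenanthrene": 2.0, "compound X": 3.0}
calculated = {"benzoic acid": 10.0, "phenanthrene": 20.0, "compound Y": 40.0}

merged = merge_preferring_literature(literature, calculated)
duplicates = sorted(set(literature) & set(calculated))

print(len(literature) + len(calculated), "rows ->", len(merged), "unique compounds")
print("duplicates resolved in favour of literature:", duplicates)
```

The same bookkeeping explains the counts in the text: 252 combined rows minus 8 duplicated compounds leaves 244.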

Random forest models were created using R 2.11.0 (RCode.txt) and the randomForest package (v4.5-26) to determine descriptor importance. Obvious outliers (mandelic acid, cinnamic acid, and 4-phenylbutyric acid) were identified by hand and removed from the datasets for S and B, leaving 241 molecules. For A, rows with zero hydrogen bond donors were removed, since A = 0 by default for compounds without any hydrogen bond donors[2]. Obvious outliers (mandelic acid, cinnamic acid, 4-phenylbutyric acid, 3,4,5-trihydroxybenzoic acid, and dimethylbenzolsulfonamide) were then identified by hand and also removed from the dataset for A, leaving 135 molecules.

These refined datasets were used to create random forest models in R one more time to verify descriptor importance; see Figures 1 to 3 (in the order S, A and B).

Figure 1: Importance of CDK descriptors for modeling Abraham descriptor S.

Figure 2: Importance of CDK descriptors for modeling Abraham descriptor A.

Figure 3: Importance of CDK descriptors for modeling Abraham descriptor B.

Results

Models were built using linear regression with forward stepwise selection of descriptors.
The summary of the five descriptor model for S is presented below:

Residuals:
Min 1Q Median 3Q Max
-0.86957 -0.11209 -0.02619 0.07700 1.34817
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.251345 0.028807 8.725 4.97e-16 ***
nAromBond 0.077760 0.005485 14.178 < 2e-16 ***
WTPT.5 0.031074 0.009780 3.177 0.00169 **
WTPT.3 0.029599 0.009041 3.274 0.00122 **
khs.dO 0.323095 0.042312 7.636 5.62e-13 ***
khs.aasC -0.056589 0.026289 -2.153 0.03237 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2428 on 235 degrees of freedom
Multiple R-squared: 0.8384, Adjusted R-squared: 0.835
F-statistic: 243.9 on 5 and 235 DF, p-value: < 2.2e-16
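
To show how a fitted model of this kind is applied, here is a minimal Python sketch that evaluates the S model as an intercept plus a coefficient-weighted sum of CDK descriptor values. The coefficients are copied from the summary above; the `predict` helper and the descriptor values in `example` are hypothetical and do not describe any real compound.

```python
# Coefficients copied from the S model summary above.
S_MODEL = {
    "(Intercept)": 0.251345,
    "nAromBond": 0.077760,
    "WTPT.5": 0.031074,
    "WTPT.3": 0.029599,
    "khs.dO": 0.323095,
    "khs.aasC": -0.056589,
}

def predict(model, descriptors):
    """Return the intercept plus the sum of coefficient * descriptor value."""
    total = model["(Intercept)"]
    for name, coefficient in model.items():
        if name != "(Intercept)":
            total += coefficient * descriptors[name]
    return total

# Hypothetical descriptor values, for illustration only.
example = {"nAromBond": 6, "WTPT.5": 0.0, "WTPT.3": 2.0, "khs.dO": 1, "khs.aasC": 1}
print(round(predict(S_MODEL, example), 4))
```

The same helper works for the A and B models if their coefficient tables are supplied in the same dictionary form.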

The summary of the six descriptor model for A is presented below:

Residuals:
Min 1Q Median 3Q Max
-0.42759 -0.07995 0.02353 0.09046 0.32265
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.13404 0.02700 4.965 2.15e-06 ***
ATSc1 0.68977 0.24984 2.761 0.00661 **
khs.sOH 0.16872 0.02367 7.127 6.67e-11 ***
khs.dO 0.17197 0.03979 4.321 3.09e-05 ***
khs.ssO -0.25111 0.05803 -4.327 3.01e-05 ***
VPC.6 0.13847 0.03232 4.284 3.58e-05 ***
SPC.5 -0.12334 0.02194 -5.621 1.14e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.155 on 128 degrees of freedom
Multiple R-squared: 0.6429, Adjusted R-squared: 0.6262
F-statistic: 38.41 on 6 and 128 DF, p-value: < 2.2e-16

The summary of the five descriptor model for B is presented below:

Residuals:
Min 1Q Median 3Q Max
-0.560315 -0.109938 0.003439 0.091864 0.706056
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.25350 2.121e-02 11.952 < 2e-16 ***
nHBAcc 0.09006 1.981e-02 4.547 9.28e-06 ***
WTPT.5 0.06196 9.272e-03 6.683 2.15e-10 ***
WTPT.3 -0.02439 8.074e-03 -3.021 0.00284 **
ATSc1 0.93550 2.010e-01 4.654 5.81e-06 ***
ATSp5 0.0001995 1.489e-05 13.399 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1694 on 206 degrees of freedom
Multiple R-squared: 0.8367, Adjusted R-squared: 0.8327
F-statistic: 211.1 on 5 and 206 DF, p-value: < 2.2e-16
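
As a consistency check on the three summaries, the adjusted R-squared can be recomputed from the multiple R-squared using the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1). In this Python sketch, the number of observations n is an inference rather than a reported value: it is recovered as residual degrees of freedom + predictors + 1.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Standard adjusted R-squared for a linear model with an intercept."""
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

# (multiple R-squared, number of predictors, residual degrees of freedom),
# all read from the model summaries.
summaries = {
    "S": (0.8384, 5, 235),
    "A": (0.6429, 6, 128),
    "B": (0.8367, 5, 206),
}

results = {}
for name, (r2, p, residual_df) in summaries.items():
    n = residual_df + p + 1  # observations implied by the degrees of freedom
    results[name] = (n, round(adjusted_r_squared(r2, n, p), 4))
    print(name, "n =", n, "adjusted R-squared =", results[name][1])
```

The recomputed values agree with the adjusted R-squared figures printed in the summaries (0.835, 0.6262 and 0.8327).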

Three-fold cross-validation was performed on the models to check the models' predictive abilities.
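
For readers unfamiliar with the technique, the following Python sketch shows one way three-fold cross-validation can be carried out. It is not the R code used here. The source does not say how the folds were assigned, so the shuffled split is an assumption, and a single-descriptor least-squares line on synthetic data stands in for the multi-descriptor models.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - slope * mean_x, slope

def cross_validated_rmsd(xs, ys, k=3, seed=0):
    """Fit on k-1 folds, predict the held-out fold, pool the squared errors."""
    squared_errors = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        squared_errors.extend((ys[i] - (a + b * xs[i])) ** 2 for i in held_out)
    return (sum(squared_errors) / len(squared_errors)) ** 0.5

# Synthetic, exactly linear data: the held-out error should be essentially zero.
xs = [float(i) for i in range(12)]
ys = [0.25 + 0.08 * x for x in xs]
print(cross_validated_rmsd(xs, ys))
```

Every compound is predicted exactly once by a model that never saw it, so the pooled error estimates performance on new structures.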

Figure 5: Model for descriptor S. Data sized by nAromBond and coloured by khs.dO.

Figure 6: Model for descriptor A. Data sized by ATSc1 and coloured by khs.sOH.

Figure 7: Model for descriptor B. Data sized by nHBAcc and coloured by ATSp5.

Using Model001 for ab initio solubility prediction

Using Model001 and the Abraham technique for predicting solubilities in 70+ organic solvents[1], it is possible to predict solubility values from chemical structure (SMILES). Using web services (CDK, provided by Egon Willighagen, and ChemSpider) to calculate CDK descriptors, molar refractivity, and molar volume, it is possible to estimate all five Abraham descriptors E, S, A, B, and V. Using these values together with a water solubility prediction (from VCC labs), we can predict the solubility in over 70 solvents. To test the utility of this method, we compared the predicted solubilities to the measured solubilities of over 1,000 compounds from the ONS Challenge. The results showed an r-squared value of 0.32396 and an RMSD of 2.0759 M.
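
The calculation can be sketched as follows, assuming the usual form of the Abraham approach: the log solubility in an organic solvent is the log water solubility plus the water-to-solvent log partition coefficient, log P = c + e·E + s·S + a·A + b·B + v·V. The solvent coefficients and solute descriptors in this Python sketch are hypothetical placeholders, not published values, and the function names are illustrative.

```python
def log_partition(solvent, solute):
    """Abraham linear free energy relationship for water-to-solvent log P."""
    return (solvent["c"]
            + solvent["e"] * solute["E"]
            + solvent["s"] * solute["S"]
            + solvent["a"] * solute["A"]
            + solvent["b"] * solute["B"]
            + solvent["v"] * solute["V"])

def log_solubility(log_s_water, solvent, solute):
    """log10 solubility in the solvent = log10 water solubility + log10 P."""
    return log_s_water + log_partition(solvent, solute)

# Hypothetical solvent coefficients and solute descriptors (placeholders only).
solvent = {"c": 0.2, "e": 0.4, "s": -1.0, "a": 0.1, "b": -3.5, "v": 4.0}
solute = {"E": 0.7, "S": 0.9, "A": 0.6, "B": 0.4, "V": 0.9}

predicted = log_solubility(-1.5, solvent, solute)
print("predicted log10 solubility:", round(predicted, 2))
```

In the workflow described above, S, A and B would come from the three regression models, E and V from the molar refractivity and molar volume services, and the water solubility from the external prediction.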

Conclusion

Three models have been determined that can be used to predict Abraham descriptors S, A and B from descriptors supplied by the CDK descriptors web service; the descriptors are described on the CDK Descriptors Names page.

The models for S (adj. r-squared: 0.835, RMSD: 0.2398) and B (adj. r-squared: 0.8327, RMSD: 0.1670) are clearly better than that for A (adj. r-squared: 0.6262, RMSD: 0.1509). This can be seen from the 3-fold cross-validation images, the observed vs. predicted values, and the adjusted r-squared values. These results can be compared with those of Jover[3], who performed a similar 5-descriptor regression using non-open descriptors, obtaining r-squared values for S, A and B of 0.868, 0.873 and 0.758 respectively. The higher r-squared value of his model for Abraham descriptor A is likely due to his retaining all compounds in the regression analysis, even those without hydrogen bond donors. Even though Jover used different descriptors to build his models, it is interesting to note that his model for S contains the nAromBond descriptor and his model for B contains nHBAcc (though this is as expected).
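
The RMSD values quoted above appear consistent with rescaling each model's residual standard error (RSE) from the regression summaries: RSE divides the residual sum of squares by the residual degrees of freedom, while RMSD divides by the number of observations, so RMSD = RSE × sqrt(df / n). This Python sketch checks that relationship. Taking n as the residual degrees of freedom plus the number of fitted coefficients is an assumption, not a value stated in the source.

```python
import math

def rmsd_from_rse(rse, residual_df, n_obs):
    """Convert a residual standard error into a root-mean-square deviation."""
    return rse * math.sqrt(residual_df / n_obs)

# (residual standard error, residual df, number of fitted coefficients
# including the intercept), read from the model summaries.
models = {"S": (0.2428, 235, 6), "A": (0.155, 128, 7), "B": (0.1694, 206, 6)}

rmsd = {}
for name, (rse, residual_df, n_coefficients) in models.items():
    n_obs = residual_df + n_coefficients
    rmsd[name] = round(rmsd_from_rse(rse, residual_df, n_obs), 4)
    print(name, "n =", n_obs, "RMSD =", rmsd[name])
```

Note that A has the lowest RMSD of the three even though its adjusted r-squared is the lowest, which is consistent with the A dataset spanning a narrower range of values.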

References

[1] M.H. Abraham; et al. (2009). Prediction of Solubility of Drugs and Other Compounds in Organic Solvents. Journal of Pharmaceutical Sciences. DOI: 10.1002/jps.21922
[2] M.H. Abraham; P.L. Grellier; D.V. Prior; P.P. Duce; J.J. Morris and P.J. Taylor. J. Chem. Soc. Perkin Trans 2. (1989) 699. All compounds with unactive hydrogen taken as zero.
[3] J. Jover; R. Bosque and J. Sales. Determination of Abraham Solute Parameters from Molecular Structure. J. Chem. Inf. Comput. Sci. (2004). 44. 1098-1106

## Predicting Abraham Descriptors S, A and B

Researcher: Andrew Lang

Data used in modeling is from the spreadsheet of compounds with known Abraham descriptors (live) gathered from the literature (KnownAbrahamDescriptorsJul19-2010.xls) and from descriptors calculated from solubility values from the Open Notebook Science Challenge using the method of Abraham et al.[1] (ONSCJul19-2010.xlsx).

## Results

The summary of results and the compounds used in the analysis, together with their descriptors, observed values, and predicted values, are available for download (SCombinedDataWithModel.xlsx, ACombinedDataWithModel.xlsx and BCombinedDataWithModel.xlsx) and are illustrated graphically in Figures 5 to 7.
