Predicting Abraham Descriptors S, A and B

Researcher: Andrew Lang
Data used in modeling come from two sources: the spreadsheet of compounds with known Abraham descriptors (live) gathered from the literature, KnownAbrahamDescriptorsJul19-2010.xls, and descriptors calculated from solubility values from the Open Notebook Science Challenge using the method of Abraham et al.[1], ONSCJul19-2010.xlsx.

Abraham descriptors
S is the solute dipolarity/polarizability.
A is the solute overall (summation) hydrogen bond acidity.
B is the solute overall (summation) hydrogen bond basicity.

Method

The combined data contained 252 molecules. Eight duplicates (4-fluorobenzoic acid, 4-hydroxybenzaldehyde, benzoic acid, chloroacetic acid, ibuprofen, phenanthrene, phenylacetic acid, and salicylic acid) were removed, keeping the literature values over the calculated ones, leaving a dataset of 244 molecules. 2D CDK descriptors were calculated using the CDK Descriptor Calculator GUI (v 1.0.5): S, A, B.
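The duplicate-handling rule described here can be sketched in Python as a dictionary merge. This is a minimal illustration, not the procedure actually applied to the spreadsheets: the function name, the record layout, and the compound names and values are all hypothetical placeholders.

```python
def merge_preferring_literature(literature, calculated):
    """Combine two {compound: record} mappings; literature wins on duplicates."""
    merged = dict(calculated)   # start from the calculated (ONS Challenge) records
    merged.update(literature)   # literature records overwrite any duplicate keys
    return merged

# Placeholder records, for illustration only (not values from the spreadsheets).
literature = {"compound_a": {"S": 1.0, "source": "literature"},
              "compound_b": {"S": 2.0, "source": "literature"}}
calculated = {"compound_a": {"S": 1.5, "source": "calculated"},
              "compound_c": {"S": 3.0, "source": "calculated"}}

combined = merge_preferring_literature(literature, calculated)
duplicates = sorted(set(literature) & set(calculated))
print(len(combined), duplicates)
```

Keying on a canonical identifier rather than a free-text name would make duplicate detection more robust, but that choice is outside what the text specifies.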

Random forest models were created using R 2.11.0 (RCode.txt) and the Random Forest package (v4.5-26) to determine descriptor importance. Obvious outliers (mandelic acid, cinnamic acid, and 4-phenylbutyric acid) were identified by hand and removed from the datasets for S and B, leaving 241 molecules. For A, rows with zero hydrogen bond donors were removed first, since A = 0 by default for compounds without any hydrogen bond donors[2]. Obvious outliers (mandelic acid, cinnamic acid, 4-phenylbutyric acid, 3,4,5-trihydroxybenzoic acid, and dimethylbenzolsulfonamide) were then identified by hand and removed, leaving 135 molecules in the dataset for A.
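The refinement of the dataset for A can be sketched as a simple filter. This is an illustration under assumptions: the row layout is hypothetical, and the column name nHBDon (the CDK hydrogen bond donor count) is assumed to be the column used to detect compounds without donors; the text does not say which column was used. The example rows are placeholders.

```python
# Outliers named in the text for the A dataset.
A_OUTLIERS = {
    "mandelic acid",
    "cinnamic acid",
    "4-phenylbutyric acid",
    "3,4,5-trihydroxybenzoic acid",
    "dimethylbenzolsulfonamide",
}

def refine_for_a(rows, donor_key="nHBDon"):
    """Keep rows with at least one hydrogen bond donor that are not listed outliers."""
    return [row for row in rows
            if row[donor_key] > 0 and row["name"] not in A_OUTLIERS]

# Placeholder rows, for illustration only.
rows = [
    {"name": "compound_a", "nHBDon": 0},     # no donors: A = 0 by default, so dropped
    {"name": "compound_b", "nHBDon": 2},     # kept
    {"name": "mandelic acid", "nHBDon": 2},  # listed outlier, so dropped
]
kept = refine_for_a(rows)
print([row["name"] for row in kept])
```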

These refined datasets were used to create random forest models in R one more time to verify descriptor importance; see the figures below (in the order S, A and B).
SRFDescriptorImportance.png
Figure 1: Importance of CDK descriptors for modeling Abraham descriptor S.


ARFDescriptorImportance.png
Figure 2: Importance of CDK descriptors for modeling Abraham descriptor A.


BRFDescriptorImportance.png
Figure 3: Importance of CDK descriptors for modeling Abraham descriptor B.


Results

Models were built using linear regression with forward stepwise selection of descriptors.
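As a rough illustration of forward stepwise selection, the sketch below greedily adds, at each step, the descriptor that most reduces the residual sum of squares of an ordinary least-squares fit. It is not the R procedure used here: the selection criterion (RSS), the fixed number of terms, the function names, and the toy data are all assumptions made for the example.

```python
def residual_sum_of_squares(columns, y):
    """Fit y ~ intercept + columns by least squares and return the RSS."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    # Augmented normal equations: (X^T X | X^T y).
    M = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(p)]
         + [sum(X[i][a] * y[i] for i in range(n))] for a in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(p):
            if r != c:
                factor = M[r][c] / M[c][c]
                M[r] = [mr - factor * mc for mr, mc in zip(M[r], M[c])]
    beta = [M[r][p] / M[r][r] for r in range(p)]
    return sum((y[i] - sum(b * x for b, x in zip(beta, X[i]))) ** 2
               for i in range(n))

def forward_stepwise(descriptors, y, max_terms):
    """Greedily pick up to max_terms descriptor names, lowest RSS first."""
    chosen = []
    while len(chosen) < max_terms:
        candidates = [name for name in descriptors if name not in chosen]
        if not candidates:
            break
        best = min(candidates, key=lambda name: residual_sum_of_squares(
            [descriptors[c] for c in chosen + [name]], y))
        chosen.append(best)
    return chosen

# Toy data: y depends exactly on x1 and x2; x3 is unrelated.
toy = {"x1": [0, 1, 2, 3, 4, 5], "x2": [1, 0, 1, 0, 1, 0], "x3": [3, 1, 4, 1, 5, 9]}
y = [1 + 2 * a + 0.5 * b for a, b in zip(toy["x1"], toy["x2"])]
selected = forward_stepwise(toy, y, max_terms=2)
print(selected)
```

A production implementation would normally stop on a statistical criterion (for example AIC or a p-value threshold) rather than a fixed term count, and would use a numerically more robust solver than the normal equations.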
The summary of the five-descriptor model for S is presented below:
Residuals:
     Min       1Q   Median       3Q      Max
-0.86957 -0.11209 -0.02619  0.07700  1.34817
 
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.251345   0.028807   8.725 4.97e-16 ***
nAromBond    0.077760   0.005485  14.178  < 2e-16 ***
WTPT.5       0.031074   0.009780   3.177  0.00169 **
WTPT.3       0.029599   0.009041   3.274  0.00122 **
khs.dO       0.323095   0.042312   7.636 5.62e-13 ***
khs.aasC    -0.056589   0.026289  -2.153  0.03237 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 
Residual standard error: 0.2428 on 235 degrees of freedom
Multiple R-squared: 0.8384,     Adjusted R-squared: 0.835
F-statistic: 243.9 on 5 and 235 DF,  p-value: < 2.2e-16
The summary of the six-descriptor model for A is presented below:
Residuals:
     Min       1Q   Median       3Q      Max
-0.42759 -0.07995  0.02353  0.09046  0.32265
 
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.13404    0.02700   4.965 2.15e-06 ***
ATSc1        0.68977    0.24984   2.761  0.00661 **
khs.sOH      0.16872    0.02367   7.127 6.67e-11 ***
khs.dO       0.17197    0.03979   4.321 3.09e-05 ***
khs.ssO     -0.25111    0.05803  -4.327 3.01e-05 ***
VPC.6        0.13847    0.03232   4.284 3.58e-05 ***
SPC.5       -0.12334    0.02194  -5.621 1.14e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 
Residual standard error: 0.155 on 128 degrees of freedom
Multiple R-squared: 0.6429,     Adjusted R-squared: 0.6262
F-statistic: 38.41 on 6 and 128 DF,  p-value: < 2.2e-16
 
The summary of the five-descriptor model for B is presented below:
Residuals:
      Min        1Q    Median        3Q       Max
-0.560315 -0.109938  0.003439  0.091864  0.706056
 
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.25350    2.121e-02  11.952  < 2e-16 ***
nHBAcc       0.09006    1.981e-02   4.547 9.28e-06 ***
WTPT.5       0.06196    9.272e-03   6.683 2.15e-10 ***
WTPT.3      -0.02439    8.074e-03  -3.021  0.00284 **
ATSc1        0.93550    2.010e-01   4.654 5.81e-06 ***
ATSp5        0.0001995  1.489e-05  13.399  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 
Residual standard error: 0.1694 on 206 degrees of freedom
Multiple R-squared: 0.8367,     Adjusted R-squared: 0.8327
F-statistic: 211.1 on 5 and 206 DF,  p-value: < 2.2e-16
Three-fold cross-validation was performed on the models to check their predictive abilities.
S3-foldCVAll.png
Figure 4: Three-fold cross-validation.
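For readers unfamiliar with the procedure, a three-fold split can be sketched as follows: the rows are shuffled, divided into three folds, and each fold is held out once while a model is fitted on the other two. The function name, the seed, and the way folds are assigned are assumptions for illustration; the actual split used for Figure 4 is not described in the text.

```python
import random

def k_fold_indices(n, k=3, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)       # reproducible shuffle
    folds = [idx[i::k] for i in range(k)]  # every k-th shuffled index, starting at i
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test

# 241 rows, matching the size stated for the refined S dataset.
splits = list(k_fold_indices(241, k=3))
print([len(test) for _, test in splits])
```

Each row appears in exactly one test fold, so every prediction in the cross-validation comes from a model that never saw that row during fitting.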


The summary of results, compounds used in the analysis together with descriptors, observed values, and predicted values, are available for download (SCombinedDataWithModel.xlsx, ACombinedDataWithModel.xlsx and BCombinedDataWithModel.xlsx) and are illustrated graphically below:
Spredictedvsobserved.png
Figure 5: Model for descriptor S. Data sized by nAromBond and coloured by khs.dO


Aobsvspred.png
Figure 6: Model for descriptor A. Data sized by ATSc1 and coloured by khs.sOH.


Bobsvspred.png
Figure 7: Model for descriptor B. Data sized by nHBAcc and coloured by ATSp5.



Using Model001 for ab initio solubility prediction

Using Model001 and the Abraham technique for predicting solubilities in 70+ organic solvents[1], it is possible to predict solubility values from chemical structure (SMILES). Web services (CDK, provided by Egon Willighagen, and ChemSpider) are used to calculate CDK descriptors, molar refractivity, and molar volume, from which all five Abraham descriptors E, S, A, B, and V can be estimated. Combining these values with a water solubility prediction (from VCC labs) gives predicted solubilities in over 70 solvents. To test the utility of this method, we compared the predicted solubilities to the measured solubilities of over 1,000 compounds from the ONS Challenge. The results showed an r-squared value of 0.32396 and an RMSD of 2.0759 M.
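The final step of this pipeline can be sketched with the usual form of the Abraham water-to-solvent relation, log10(Ss/Sw) = c + eE + sS + aA + bB + vV, where Ss and Sw are the solubilities in the solvent and in water. The sketch assumes this form is the one applied here; the function name and data layout are hypothetical, and the solvent coefficients and solute descriptors below are placeholder numbers, not values from [1] or from any real compound.

```python
def log_solvent_solubility(log_sw, solute, solvent):
    """Return log10(Ss) from log10(Sw) via log10(Ss/Sw) = c + eE + sS + aA + bB + vV."""
    return (log_sw
            + solvent["c"]
            + solvent["e"] * solute["E"]
            + solvent["s"] * solute["S"]
            + solvent["a"] * solute["A"]
            + solvent["b"] * solute["B"]
            + solvent["v"] * solute["V"])

# Placeholder numbers for illustration only (not fitted coefficients).
placeholder_solvent = {"c": 0.1, "e": 0.2, "s": 0.3, "a": 0.4, "b": 0.5, "v": 0.6}
placeholder_solute = {"E": 1.0, "S": 1.0, "A": 1.0, "B": 1.0, "V": 1.0}

log_ss = log_solvent_solubility(-1.0, placeholder_solute, placeholder_solvent)
print(round(log_ss, 3))
```

In this scheme S, A and B would come from the models on this page, while E, V, and the water solubility come from the other services named in the text.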

Conclusion

Three models have been determined that can be used to predict Abraham descriptors S, A and B from CDK descriptors. These descriptors can be obtained from the CDK descriptors web service and are described on the CDK Descriptors Names page.
S = 0.251345 + 0.077760 nAromBond + 0.031074 WTPT.5 + 0.029599 WTPT.3 + 0.323095 khs.dO - 0.056589 khs.aasC
(Adjusted R-squared: 0.835)
 
A = 0.13404 + 0.68977 ATSc1 + 0.16872 khs.sOH + 0.17197 khs.dO - 0.25111 khs.ssO + 0.13847 VPC.6 - 0.12334 SPC.5
(Adjusted R-squared: 0.6262)
 
B = 0.25350 + 0.09006 nHBAcc + 0.06196 WTPT.5 - 0.02439 WTPT.3 + 0.93550 ATSc1 + 0.0001995 ATSp5
(Adjusted R-squared: 0.8327)
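The three equations can be applied programmatically once the CDK descriptors are available. The sketch below stores the coefficients exactly as printed in the model summaries; the function name and the dictionary layout are illustrative choices, and the example descriptor values are placeholders rather than values for a real compound.

```python
S_MODEL = {"(Intercept)": 0.251345, "nAromBond": 0.077760, "WTPT.5": 0.031074,
           "WTPT.3": 0.029599, "khs.dO": 0.323095, "khs.aasC": -0.056589}
A_MODEL = {"(Intercept)": 0.13404, "ATSc1": 0.68977, "khs.sOH": 0.16872,
           "khs.dO": 0.17197, "khs.ssO": -0.25111, "VPC.6": 0.13847,
           "SPC.5": -0.12334}
B_MODEL = {"(Intercept)": 0.25350, "nHBAcc": 0.09006, "WTPT.5": 0.06196,
           "WTPT.3": -0.02439, "ATSc1": 0.93550, "ATSp5": 0.0001995}

def predict(model, descriptors):
    """Evaluate intercept + sum(coefficient * descriptor value)."""
    return model["(Intercept)"] + sum(
        coef * descriptors[name]
        for name, coef in model.items() if name != "(Intercept)")

# Placeholder descriptor values, for illustration only.
example = {"nAromBond": 6, "WTPT.5": 0.0, "WTPT.3": 0.0, "khs.dO": 0, "khs.aasC": 0}
s_example = predict(S_MODEL, example)
print(round(s_example, 6))
```

For compounds without hydrogen bond donors, the Method section sets A = 0 rather than applying the A equation, so a caller would need to handle that case before using A_MODEL.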
The models for S (adj. r-squared: 0.835, RMSD: 0.2398) and B (adj. r-squared: 0.8327, RMSD: 0.1670) are clearly better than that for A (adj. r-squared: 0.6262, RMSD: 0.1509). This can be seen from the 3-fold cross-validation images, the observed vs. predicted plots, and the adjusted r-squared values. These results can be compared with those of Jover[3], who performed a similar 5-descriptor regression using non-open descriptors and obtained r-squared values for S, A and B of 0.868, 0.873 and 0.758 respectively. The higher r-squared value of his model for Abraham descriptor A is likely due to his retaining all compounds in the regression analysis, even those without hydrogen bond donors. Even though Jover used different descriptors to build his models, it is interesting to note that his model for S contains the nAromBond descriptor and his model for B contains nHBAcc (though the latter is as expected).
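The RMSD values quoted here appear consistent with the residual standard errors in the model summaries, using RMSD = RSE * sqrt(df / n), where df is the residual degrees of freedom and n = df + (number of fitted coefficients, including the intercept). That this is how the RMSD values were obtained is an assumption; the sketch below only checks the arithmetic against the printed summaries.

```python
import math

def rmsd_from_rse(rse, df_resid, n_coefficients):
    """Convert a residual standard error to an RMSD over all n rows."""
    n = df_resid + n_coefficients  # rows used in the fit
    return rse * math.sqrt(df_resid / n)

# (RSE, residual df, coefficients incl. intercept) from the summaries above.
summaries = {"S": (0.2428, 235, 6), "A": (0.155, 128, 7), "B": (0.1694, 206, 6)}
rmsd = {name: rmsd_from_rse(*values) for name, values in summaries.items()}
for name, value in rmsd.items():
    print(name, round(value, 4))
```

Because the RSE divides the residual sum of squares by df while the RMSD divides by n, the RMSD is always slightly smaller, and the gap grows with the number of fitted coefficients.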

References

[1] M.H. Abraham; et al. (2009). Prediction of Solubility of Drugs and Other Compounds in Organic Solvents. Journal of Pharmaceutical Sciences. DOI: 10.1002/jps.21922
[2] M.H. Abraham; P.L. Grellier; D.V. Prior; P.P. Duce; J.J. Morris and P.J. Taylor. J. Chem. Soc. Perkin Trans. 2 (1989) 699. All compounds without active hydrogen are taken as zero.
[3] J. Jover; R. Bosque and J. Sales. Determination of Abraham Solute Parameters from Molecular Structure. J. Chem. Inf. Comput. Sci. (2004). 44. 1098-1106