AbrahamDescriptorsModel001

Predicting Abraham Descriptors S, A and B
Researcher: Andrew Lang

Data used in modeling come from two sources: the spreadsheet of compounds with known Abraham descriptors (live) gathered from the literature [|KnownAbrahamDescriptorsJul19-2010.xls], and descriptors calculated from solubility values from the Open Notebook Science Challenge using the method of Abraham //et al.//[1] [|ONSCJul19-2010.xlsx].

**Abraham descriptors**
S is the solute **dipolarity/polarizability**.
A is the solute **overall (summation) hydrogen bond acidity**.
B is the solute **overall (summation) hydrogen bond basicity**.

Method
Duplicates (4-fluorobenzoic acid, 4-hydroxybenzaldehyde, benzoic acid, chloroacetic acid, ibuprofen, phenanthrene, phenylacetic acid, and salicylic acid) from the combined data of 252 molecules were removed, keeping the literature values over the calculated ones, leaving a dataset of 244 molecules. 2D CDK descriptors were calculated using the CDK Descriptor Calculator GUI (v 1.0.5): [|S], [|A], [|B].
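The duplicate-handling rule described above (keep the literature value when a compound appears in both sources) can be sketched in Python as follows. This is only an illustration: the record layout, the `literature` and `calculated` labels, and the numeric values are assumptions, not the contents of the actual spreadsheets.

```python
from typing import Dict, List, Tuple

# Each record: (compound name, source label, (S, A, B)).
# The labels "literature" and "calculated" are assumed for this sketch.
Record = Tuple[str, str, Tuple[float, float, float]]


def merge_keep_literature(records: List[Record]) -> Dict[str, Record]:
    """Keep one record per compound, preferring literature values."""
    merged: Dict[str, Record] = {}
    for record in records:
        name, source, _ = record
        key = name.strip().lower()
        existing = merged.get(key)
        # Add unseen compounds; let a literature entry replace a calculated one.
        if existing is None or (existing[1] != "literature" and source == "literature"):
            merged[key] = record
    return merged


# Illustrative values only, not taken from the spreadsheets.
records = [
    ("benzoic acid", "calculated", (0.95, 0.60, 0.42)),
    ("benzoic acid", "literature", (0.90, 0.59, 0.40)),
    ("phenanthrene", "literature", (1.29, 0.00, 0.29)),
]
merged = merge_keep_literature(records)
print(len(merged), merged["benzoic acid"][1])
```

Keying on a normalized name is the simplest choice for a sketch; matching on a structure identifier such as SMILES would be more robust if names are spelled inconsistently.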

Random forest models were created using R 2.11.0 ([|RCode.txt]) and the Random Forest package (v4.5-26) to determine descriptor importance. Obvious outliers (mandelic acid, cinnamic acid, and 4-phenylbutyric acid) were identified by hand and removed from the datasets for S and B, leaving 241 molecules. For A, rows with zero hydrogen bond donors were removed, since A = 0 by default for compounds without any hydrogen bond donors[2]. Obvious outliers (mandelic acid, cinnamic acid, 4-phenylbutyric acid, 3,4,5-trihydroxybenzoic acid, and dimethylbenzolsulfonamide) were also identified by hand and removed, leaving 135 molecules for A.
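As a rough illustration of how the dataset for A is refined, the Python sketch below drops the hand-identified outliers and every compound with no hydrogen bond donors. The row layout and the `nHBDon` column name for the donor count are assumptions made for this example.

```python
from typing import Dict, List

# Outliers named in the text for the A dataset.
A_OUTLIERS = {
    "mandelic acid",
    "cinnamic acid",
    "4-phenylbutyric acid",
    "3,4,5-trihydroxybenzoic acid",
    "dimethylbenzolsulfonamide",
}


def refine_for_a(rows: List[Dict]) -> List[Dict]:
    """Drop hand-picked outliers and compounds with no hydrogen bond donors."""
    return [
        row for row in rows
        if row["name"].lower() not in A_OUTLIERS
        and row["nHBDon"] > 0  # assumed column name for the donor count
    ]


# Hypothetical rows for illustration.
rows = [
    {"name": "benzoic acid", "nHBDon": 1},
    {"name": "phenanthrene", "nHBDon": 0},   # no donors, so A = 0 by default
    {"name": "mandelic acid", "nHBDon": 2},  # hand-identified outlier
]
refined = refine_for_a(rows)
print([row["name"] for row in refined])
```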

These refined datasets were used to create random forest models in R once more to verify descriptor importance; see the figures below (in the order S, A and B).





Results
Models were built using linear regression with forward stepwise selection of descriptors. The summary of the five-descriptor model for S is presented below:
[[code]]
Residuals:
     Min       1Q   Median       3Q      Max
-0.86957 -0.11209 -0.02619  0.07700  1.34817

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.251345   0.028807   8.725 4.97e-16 ***
nAromBond    0.077760   0.005485  14.178  < 2e-16 ***
WTPT.5       0.031074   0.009780   3.177  0.00169 **
WTPT.3       0.029599   0.009041   3.274  0.00122 **
khs.dO       0.323095   0.042312   7.636 5.62e-13 ***
khs.aasC    -0.056589   0.026289  -2.153  0.03237 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2428 on 235 degrees of freedom
Multiple R-squared: 0.8384,    Adjusted R-squared: 0.835
F-statistic: 243.9 on 5 and 235 DF,  p-value: < 2.2e-16
[[code]]
The summary of the six-descriptor model for A is presented below:
[[code]]
Residuals:
     Min       1Q   Median       3Q      Max
-0.42759 -0.07995  0.02353  0.09046  0.32265

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.13404    0.02700   4.965 2.15e-06 ***
ATSc1        0.68977    0.24984   2.761  0.00661 **
khs.sOH      0.16872    0.02367   7.127 6.67e-11 ***
khs.dO       0.17197    0.03979   4.321 3.09e-05 ***
khs.ssO     -0.25111    0.05803  -4.327 3.01e-05 ***
VPC.6        0.13847    0.03232   4.284 3.58e-05 ***
SPC.5       -0.12334    0.02194  -5.621 1.14e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.155 on 128 degrees of freedom
Multiple R-squared: 0.6429,    Adjusted R-squared: 0.6262
F-statistic: 38.41 on 6 and 128 DF,  p-value: < 2.2e-16
[[code]]
The summary of the five-descriptor model for B is presented below:
[[code]]
Residuals:
      Min        1Q    Median        3Q       Max
-0.560315 -0.109938  0.003439  0.091864  0.706056

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.25350    2.121e-02  11.952  < 2e-16 ***
nHBAcc       0.09006    1.981e-02   4.547 9.28e-06 ***
WTPT.5       0.06196    9.272e-03   6.683 2.15e-10 ***
WTPT.3      -0.02439    8.074e-03  -3.021  0.00284 **
ATSc1        0.93550    2.010e-01   4.654 5.81e-06 ***
ATSp5        0.0001995  1.489e-05  13.399  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1694 on 206 degrees of freedom
Multiple R-squared: 0.8367,    Adjusted R-squared: 0.8327
F-statistic: 211.1 on 5 and 206 DF,  p-value: < 2.2e-16
[[code]]
Three-fold cross validation was performed on the models to check their predictive abilities.
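To make the cross-validation step concrete, here is a minimal pure-Python sketch of three-fold cross validation for an ordinary least-squares model. It is not the R code used for the analysis: the fold assignment (shuffle, then take every third index), the helper names, and the synthetic data are all assumptions chosen so that the example is self-contained.

```python
import random
from typing import List, Sequence


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_ols(xs: Sequence[Sequence[float]], ys: Sequence[float]) -> List[float]:
    """Least-squares coefficients [intercept, b1, ...] via the normal equations."""
    design = [[1.0] + list(x) for x in xs]
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(design, ys)) for i in range(p)]
    return solve(xtx, xty)


def predict(coef: Sequence[float], x: Sequence[float]) -> float:
    return coef[0] + sum(c * v for c, v in zip(coef[1:], x))


def k_fold_rmsd(xs, ys, k: int = 3, seed: int = 0) -> float:
    """Root-mean-square deviation of held-out predictions over k folds."""
    idx = list(range(len(ys)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    sq_err = 0.0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        coef = fit_ols([xs[i] for i in train], [ys[i] for i in train])
        sq_err += sum((ys[i] - predict(coef, xs[i])) ** 2 for i in fold)
    return (sq_err / len(ys)) ** 0.5


# Synthetic, noise-free data: y = 0.25 + 0.08*x1 + 0.3*x2 (not real descriptors).
xs = [(float(i % 7), float(i % 3)) for i in range(30)]
ys = [0.25 + 0.08 * x1 + 0.3 * x2 for x1, x2 in xs]
print(k_fold_rmsd(xs, ys))
```

Because the synthetic data are exactly linear, the held-out error is essentially zero. With real descriptor data, a held-out RMSD well above the in-sample residual error would suggest overfitting.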

The summary of results, compounds used in the analysis together with descriptors, observed values, and predicted values, are available for download ([|SCombinedDataWithModel.xlsx], [|ACombinedDataWithModel.xlsx] and [|BCombinedDataWithModel.xlsx]) and are illustrated graphically below:





**Using Model001 for ab initio solubility prediction**
Using Model001 together with the Abraham technique for predicting solubilities in 70+ organic solvents[1], it is possible to predict solubility values directly from chemical structure (SMILES). Webservices (CDK, provided by Egon Willighagen, and ChemSpider) are used to calculate CDK descriptors, molar refractivity, and molar volume, from which all five Abraham descriptors E, S, A, B, and V can be estimated. Combining these values with a water solubility prediction (from VCC labs), we can predict the solubility in over 70 solvents. To test the utility of this method, we compared the predicted solubilities to the measured solubilities of [|over 1,000 compounds from the ONS Challenge]. The results showed an r-squared value of 0.32396 and an RMSD of 2.0759 M.
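The prediction step can be sketched in Python under the assumption that it follows the usual form of the Abraham solvation equation, log10(S_solvent / S_water) = c + eE + sS + aA + bB + vV. The class names, the solute values, and in particular the solvent coefficients below are hypothetical placeholders, not published coefficients for any real solvent.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Solute:
    E: float
    S: float
    A: float
    B: float
    V: float
    log_s_water: float  # log10 of the predicted molar water solubility


@dataclass
class SolventCoefficients:
    c: float
    e: float
    s: float
    a: float
    b: float
    v: float


def log_solubility(solute: Solute, solvent: SolventCoefficients) -> float:
    """log10 molar solubility in the solvent, assuming the Abraham equation form."""
    log_ratio = (
        solvent.c
        + solvent.e * solute.E
        + solvent.s * solute.S
        + solvent.a * solute.A
        + solvent.b * solute.B
        + solvent.v * solute.V
    )
    return solute.log_s_water + log_ratio


def rmsd(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Root-mean-square deviation between observed and predicted values."""
    n = len(observed)
    return (sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n) ** 0.5


# Hypothetical placeholder values, for illustration only.
solute = Solute(E=0.7, S=0.9, A=0.6, B=0.4, V=0.93, log_s_water=-1.5)
solvent = SolventCoefficients(c=0.2, e=0.4, s=-1.0, a=0.1, b=-3.5, v=4.0)
print(log_solubility(solute, solvent))
```

Working in log10 units keeps the calculation additive; converting back to a molar solubility is a single `10 ** value` step.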

Conclusion
Three models have been determined that can be used to predict the Abraham descriptors S, A and B from descriptors calculated by the CDK descriptors webservice; descriptions of the descriptors can be found at the CDK Descriptors Names page.
[[code]]
S = 0.251345 + 0.077760 nAromBond + 0.031074 WTPT.5 + 0.029599 WTPT.3 + 0.323095 khs.dO - 0.056589 khs.aasC (Adjusted R-squared: 0.835)

A = 0.13404 + 0.68977 ATSc1 + 0.16872 khs.sOH + 0.17197 khs.dO - 0.25111 khs.ssO + 0.13847 VPC.6 - 0.12334 SPC.5 (Adjusted R-squared: 0.6262)

B = 0.25350 + 0.09006 nHBAcc + 0.06196 WTPT.5 - 0.02439 WTPT.3 + 0.93550 ATSc1 + 0.0001995 ATSp5 (Adjusted R-squared: 0.8327)
[[code]]
The models for S (adj. r-squared: 0.835, RMSD: 0.2398) and B (adj. r-squared: 0.8327, RMSD: 0.1670) are clearly better than that for A (adj. r-squared: 0.6262, RMSD: 0.1509). This can be seen from the 3-fold cross-validation images, the observed vs. predicted values, and the adjusted r-squared values. These results can be compared with those of Jover[3], who fitted similar five-descriptor regression models using non-open descriptors and obtained r-squared values for S, A and B of 0.868, 0.873 and 0.758 respectively. The higher r-squared value of his model for A is likely due to his keeping all compounds in the regression analysis, including those without hydrogen bond donors. Although Jover used different descriptors to build his models, it is interesting to note that his model for S contains the nAromBond descriptor and his model for B contains nHBAcc (though the latter is as expected).
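The three regression equations can be applied as in the following minimal Python sketch. The coefficients are those reported above; the `predict_sab` helper, the dictionary layout, and the example descriptor values are assumptions made for illustration, and in practice the descriptor values would come from the CDK descriptors webservice. Following the convention that A = 0 for compounds without hydrogen bond donors, the sketch returns A = 0 when the donor count is zero.

```python
from typing import Dict, Tuple

S_MODEL = {
    "(Intercept)": 0.251345, "nAromBond": 0.077760, "WTPT.5": 0.031074,
    "WTPT.3": 0.029599, "khs.dO": 0.323095, "khs.aasC": -0.056589,
}
A_MODEL = {
    "(Intercept)": 0.13404, "ATSc1": 0.68977, "khs.sOH": 0.16872,
    "khs.dO": 0.17197, "khs.ssO": -0.25111, "VPC.6": 0.13847, "SPC.5": -0.12334,
}
B_MODEL = {
    "(Intercept)": 0.25350, "nHBAcc": 0.09006, "WTPT.5": 0.06196,
    "WTPT.3": -0.02439, "ATSc1": 0.93550, "ATSp5": 0.0001995,
}


def linear_model(coefs: Dict[str, float], descriptors: Dict[str, float]) -> float:
    """Intercept plus the weighted sum of the named descriptors."""
    return coefs["(Intercept)"] + sum(
        weight * descriptors[name]
        for name, weight in coefs.items()
        if name != "(Intercept)"
    )


def predict_sab(descriptors: Dict[str, float], n_hbd: int) -> Tuple[float, float, float]:
    """Predict (S, A, B); A is 0 for compounds with no hydrogen bond donors."""
    s = linear_model(S_MODEL, descriptors)
    a = 0.0 if n_hbd == 0 else linear_model(A_MODEL, descriptors)
    b = linear_model(B_MODEL, descriptors)
    return s, a, b


# Hypothetical descriptor values for illustration, not a real molecule.
names = set(S_MODEL) | set(A_MODEL) | set(B_MODEL)
names.discard("(Intercept)")
descriptors = {name: 0.0 for name in names}
descriptors.update({"nAromBond": 6.0, "khs.sOH": 1.0, "nHBAcc": 2.0})

print(predict_sab(descriptors, n_hbd=1))
print(predict_sab(descriptors, n_hbd=0))
```

Keeping the coefficients in dictionaries keyed by descriptor name makes it easy to check them against the model summaries and to swap in a refitted model later.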