
Modeling Solvent Abraham Coefficients - Model001a

Researcher
Jesse Patsolic

Objective

To model the Abraham solvent coefficients c, e, s, a, b, v from chemical structure using CDK descriptors. Model001a and Model001b are performed by different researchers but use the same data, procedure and techniques.

Introduction

The Abraham solvent coefficients c, e, s, a, b, v are determined by fitting experimental partition coefficients and solubility measurements to the Abraham general solvation model equations. Recent work has shown that the solvent Abraham coefficient c, the regression intercept, has physical–chemical meaning, being related to the van der Waals volume. [1] This means that all the coefficients have physical–chemical meaning, and thus it may be possible to model them directly from structure. In fact, this has already been done using fragments for a limited chemical space of alkane solvents containing hydroxyl, ether, ester and/or ketone functional groups. [2] Here we present a general model for predicting solvent Abraham coefficients from structure using open CDK descriptors.
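
For reference, the Abraham general solvation model for transfer of a solute between two condensed phases is commonly written as shown below. The solute descriptor symbols E, S, A, B and V do not appear in the text above; they are included here as the usual notation, and this is only a summary of the equation form, not of the fitting procedure used for the database.

```latex
\log P = c + e\,E + s\,S + a\,A + b\,B + v\,V
```

In this notation the lowercase letters are the solvent coefficients modeled on this page, and the uppercase letters are properties of the solute (excess molar refraction, dipolarity/polarizability, hydrogen-bond acidity, hydrogen-bond basicity and McGowan volume, respectively).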

Procedure

The up-to-date (as of January 21, 2012) solvent coefficients for 78 organic solvents were taken from the ONSChallenge solvent database.

A CSV file of descriptors was generated using the CDK Descriptor Calculator GUI (v1.3.2) for all 2D descriptors except the following: Ionization Potential, Charged Partial Surface Areas, Protein, Geometrical, and WHIM. The option 'Add Explicit H' was selected.
The descriptors Kier3 and HybRatio were deleted from the CSV file because they contained multiple "NA" entries.
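
The deletion of descriptors with "NA" entries could also be done in R rather than by hand. The following is a minimal sketch, assuming the descriptor file has been read into a data frame; the helper name drop_na_columns and the toy values are hypothetical and only illustrate the step.

```r
# Hypothetical helper: drop every column that contains at least one NA
# and report which columns were removed.
drop_na_columns <- function(df) {
  has_na <- vapply(df, function(col) any(is.na(col)), logical(1))
  list(data = df[, !has_na, drop = FALSE], dropped = names(df)[has_na])
}

# Toy stand-in for the descriptor CSV (all values are invented).
toy <- data.frame(
  molID    = c("m1", "m2", "m3"),
  TopoPSA  = c(20.2, 0.0, 17.1),
  Kier3    = c(NA, 1.5, NA),
  HybRatio = c(0.5, NA, 1.0),
  stringsAsFactors = FALSE
)

cleaned <- drop_na_columns(toy)
print(cleaned$dropped)      # columns removed because of NA entries
print(names(cleaned$data))  # columns kept for modeling
```

Doing this in code keeps a record of exactly which descriptors were removed, which is useful when a second researcher repeats the procedure.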

Building the Models

Random Forest. The randomForest package (v4.6-6) in R (v2.14.0) was used to build the random forest models in the following sections of code.

Abraham Solvent Coefficient c

The predictive model of the Abraham solvent coefficient c was obtained with the following code:
## Load data (row names are taken from the molID column)
> mydata <- read.csv(file="cwithdescriptorsreadyformodeling.csv",header=TRUE,row.names="molID")
 
## Load randomForest package (v. 4.6-6)
> require(randomForest)
 
## Do randomForest
> mydata.rf <- randomForest(c ~ ., data = mydata,importance = TRUE)
 
## Print results to screen
> print(mydata.rf)
 
[output]
Call:
 randomForest(formula = c ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 68
 
          Mean of squared residuals: 0.0239344
                    % Var explained: 26.3
[output]
 
## get plot of descriptor importance
> varImpPlot(mydata.rf,main="Random Forest Variable Importance")
 
[output] ## see following image titled "Random Forest Variable Importance"
 
## Write predicted coefficient values out as a .csv file.
## Note: predict() is called on the training data, so these are in-sample predictions.
> test.predict <- predict(mydata.rf,mydata)
> write.csv(test.predict, file = "RFTestPredict_c.csv")

The varImpPlot() command generates the following image, which shows the Topological Polar Surface Area (TopoPSA) and XLogP as the most important physicochemical properties for the prediction of the Abraham solvent coefficient c.

jlp_varImpPlot_c.png


The figure below was generated with Tableau Public (v. 7.0) from the file c_predictedwithdescriptorsforTP_jlp.csv, and it shows the predicted Abraham solvent coefficient c vs. the experimental value for c. The data points are colored from red to green by Topological Polar Surface Area.

cVSc-predictedTopoPSA.png
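
Because predict() is applied to the same solvents used to train the forest, a predicted vs. experimental plot built from those values will generally look tighter than the out-of-bag "% Var explained" suggests. The sketch below illustrates the difference on synthetic data; the data frame toy and its columns are invented, and it relies on the documented behaviour that predict() on a randomForest object without new data returns the out-of-bag predictions.

```r
library(randomForest)

set.seed(1)  # make the synthetic example reproducible

# Synthetic stand-in for the 78-solvent table; d1..d3 are invented
# descriptors and y is an invented response.
n <- 78
toy <- data.frame(d1 = rnorm(n), d2 = rnorm(n), d3 = rnorm(n))
toy$y <- 0.3 * toy$d1 + rnorm(n, sd = 0.2)

rf <- randomForest(y ~ ., data = toy)

in_sample <- predict(rf, toy)  # rows were available to the trees during training
oob       <- predict(rf)       # no new data supplied: out-of-bag predictions

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
rmse_in  <- rmse(toy$y, in_sample)
rmse_oob <- rmse(toy$y, oob)
cat("in-sample RMSE:  ", rmse_in, "\n")
cat("out-of-bag RMSE: ", rmse_oob, "\n")
```

Plotting the out-of-bag predictions against the experimental values would give a picture that is more consistent with the printed model statistics.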

Abraham Solvent Coefficient s

The predictive model of the Abraham solvent coefficient s was obtained with the following code:
## Load randomForest package
> require(randomForest)
 
## Load data
> mydata <- read.csv(file="swithdescriptorsreadyformodeling.csv",header=TRUE,row.names="molID")

## Do randomForest (v. 4.6-6)
> mydata.rf <- randomForest(s ~ ., data = mydata,importance = TRUE)

## Write predicted coefficient values to a .csv file
> test.predict <- predict(mydata.rf,mydata)
> write.csv(test.predict, file = "RFTestPredict_s.csv")
 
## Print results to screen
> print(mydata.rf)
 
[output]
Call:
 randomForest(formula = s ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 68
 
          Mean of squared residuals: 0.1079902
                    % Var explained: 76.2
[output]
 
 
## Generate plot of descriptor importance
> varImpPlot(mydata.rf,main="Random Forest Variable Importance")
 
[output]  ## see following image

The varImpPlot() command generates the following image. The Moreau-Broto autocorrelation descriptor (ATSc1) is shown to be the most important physicochemical property for the prediction of the Abraham solvent coefficient s.

jlp_varImpPlot_s.png
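
The ranking shown in the importance plot can also be read out numerically with the importance() function of the randomForest package. The following is a sketch on synthetic data, not the solvent data; the descriptor names d1..d3 are invented, and type = 1 is assumed to select the permutation-based %IncMSE measure for a regression forest built with importance = TRUE.

```r
library(randomForest)

set.seed(42)  # make the synthetic example reproducible

# Synthetic data in which the response depends strongly on d1 only.
n <- 78
toy <- data.frame(d1 = rnorm(n), d2 = rnorm(n), d3 = rnorm(n))
toy$s <- 2 * toy$d1 + rnorm(n, sd = 0.1)

rf <- randomForest(s ~ ., data = toy, importance = TRUE)

# One-column matrix of %IncMSE values, one row per descriptor.
imp <- importance(rf, type = 1)

# Sort descriptors from most to least important.
ranked <- imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE]
print(head(ranked, 10))

top_descriptor <- rownames(ranked)[1]
cat("Top descriptor:", top_descriptor, "\n")
```

A numeric table like this makes it easier to compare the top descriptors between Model001a and Model001b than reading them off two images.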


The figure below was generated with Tableau Public (v. 7.0) with the file s_predictedwithdescriptorsforTP_jlp.csv, and it shows the predicted Abraham solvent coefficient s vs. the experimental value for s. The data is colored from red to green by the value of the ATSc1 descriptor.

sVSs-predictedATSc1.png


Abraham Solvent Coefficient a

The randomForest package was again used to predict the Abraham solvent coefficient a, with the following R code:
## Load randomForest package
> require(randomForest)
 
## Load data
> mydata <- read.csv(file = "awithdescriptorsreadyformodeling.csv",header=TRUE,row.names="molID")

## Do randomForest (v. 4.6-6)
> mydata.rf <- randomForest(a ~ ., data = mydata,importance=TRUE)

## Write predicted coefficient values to a .csv file
> test.predict <- predict(mydata.rf,mydata)
> write.csv(test.predict, file="RFTestPredict_a.csv")
 
## Print results to screen
> print(mydata.rf)
 
[output]
Call:
 randomForest(formula = a ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 68
 
          Mean of squared residuals: 0.2825547
                    % Var explained: 89.98
[output]
 
 
## Generate plot of descriptor importance
> varImpPlot(mydata.rf, main="Random Forest Variable Importance")
 
[output] ## see following image

The varImpPlot command generated the following image. The HBondAcceptorCount (nHBAcc) is shown as the most important descriptor for the prediction of the Abraham solvent coefficient a.


jlp_varImpPlot_a.png

Using Tableau Public (v. 7.0), with the file a_predictedwithdescriptorsforTP_jlp.csv, the experimental vs. the predicted values of the Abraham solvent coefficient a were plotted and colored based on the nHBAcc descriptor.
aVsa-predicted_nHBAcc.png


Abraham Solvent Coefficient b


The predictive model for the Abraham solvent coefficient b was run in R as follows:

## Load randomForest
> require(randomForest)
 
## Load data
> mydata <- read.csv(file="bwithdescriptorsreadyformodeling.csv",header=TRUE,row.names="molID")

## Do randomForest (v. 4.6-6)
> mydata.rf <- randomForest(b ~ ., data = mydata,importance=TRUE)

## Write predicted coefficient values to a .csv file
> test.predict <- predict(mydata.rf,mydata)
> write.csv(test.predict, file="RFTestPredict_b.csv")
 
## Print results to screen
> print(mydata.rf)
 
[output]
Call:
 randomForest(formula = b ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 68
 
          Mean of squared residuals: 0.2331113
                    % Var explained: 45.49
[output]
 
## Generate plot of descriptor importance
> varImpPlot(mydata.rf, main="Random Forest Variable Importance")
 
[output]  ## see following image
 

Output from the randomForest varImpPlot() command shows the khs.sOH descriptor as the most important descriptor related to the Abraham solvent coefficient b.


jlp_varImpPlot_b.png

Using Tableau Public (v. 7.0), with the file b_predictedwithdescriptorsforTP_jlp.csv, the experimental vs. the predicted values of the Abraham solvent coefficient b were plotted and colored based on the khs.sOH descriptor.

bVSb-predicted_khs.sOH.png



Abraham Solvent Coefficient e


The predictive model for the Abraham solvent coefficient e was run with the following R code:

## Load randomForest
> require(randomForest)
 
## Load data
> mydata <- read.csv(file="ewithdescriptorsreadyformodeling.csv",header=TRUE,row.names="molID")

## Do randomForest (v. 4.6-6)
> mydata.rf <- randomForest(e ~ ., data = mydata,importance = TRUE)

## Write predicted coefficient values to a .csv file
> test.predict <- predict(mydata.rf,mydata)
> write.csv(test.predict, file = "RFTestPredict_e.csv")
 
## Print results to screen
> print(mydata.rf)
 
[output]
 
Call:
 randomForest(formula = e ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 68
 
          Mean of squared residuals: 0.03533318
                    % Var explained: 22.63
[output]
 
## Generate plot of descriptor importance
> varImpPlot(mydata.rf, main="Random Forest Variable Importance")
 
[output] ## see following image
The varImpPlot() command generates the following plot, which shows the descriptor ALogp2 as the most important physicochemical property for the prediction of the Abraham solvent coefficient e.

jlp_varImpPlot_e.png


The following plot gives the experimental versus the predicted values for the Abraham solvent coefficient e from the data file e_predictedwithdescriptorsforTP_jlp.csv. The data points are colored according to the value of the ALogp2 descriptor.

eVSe-predicted_ALogp2.png


Abraham Solvent Coefficient v


The R code for the predictive model for the Abraham solvent coefficient v is as follows:

## Load randomForest package (v. 4.6-6)
> require(randomForest)
 
## Load data
> mydata <- read.csv(file="vwithdescriptorsreadyformodeling.csv",header=TRUE,row.names="molID")

## Do randomForest
> mydata.rf <- randomForest(v ~ ., data = mydata,importance =TRUE)

## Write predicted coefficient values to a .csv file
> test.predict <- predict(mydata.rf,mydata)
> write.csv(test.predict,file="RFTestPredict_v.csv")
 
## Print results to screen
> print(mydata.rf)
 
[output]
Call:
 randomForest(formula = v ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 68
 
          Mean of squared residuals: 0.08940691
                    % Var explained: 57.01
[output]
 
## Generate plot of descriptor importance
> varImpPlot(mydata.rf, main="Random Forest Variable Importance")
 
[output] ## see following image

The following image was generated by the randomForest varImpPlot() command and shows BCUTp.1l as the most important descriptor according to %IncMSE.

jlp_varImpPlot_v.png

The following plot gives the experimental versus the predicted values for the Abraham solvent coefficient v from the data file v_withdescriptorsforTP.csv. The data points are colored according to the value of the BCUTp.1l descriptor.

vVSv-predicted__BCUTp-1l.png

Discussion


The following table is a summary of the predictive models of each of the Abraham solvent coefficients.

| Solvent Coefficient | % Var Explained | varImpPlot() Descriptor |
|---|---|---|
| c | 26.3 | TopoPSA |
| s | 76.2 | ATSc1 |
| a | 89.98 | nHBAcc |
| b | 45.49 | khs.sOH |
| e | 22.63 | ALogp2 |
| v | 57.01 | BCUTp.1l |
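
The "% Var explained" values in the table come from the printed randomForest output. As far as I understand the package, this figure is 100 × (1 − MSE / Var(y)), where the MSE is computed from out-of-bag predictions and the variance uses an n (not n − 1) denominator. The sketch below reproduces that calculation in base R; the function name pct_var_explained and the toy numbers are hypothetical.

```r
# Hypothetical helper: percent of variance explained by a set of predictions,
# using the n-denominator variance of the observed values.
pct_var_explained <- function(observed, predicted) {
  mse       <- mean((observed - predicted)^2)
  total_var <- mean((observed - mean(observed))^2)
  100 * (1 - mse / total_var)
}

# Toy values (invented) to show the behaviour at three reference points.
observed <- c(1, 2, 3, 4)

perfect  <- pct_var_explained(observed, observed)
baseline <- pct_var_explained(observed, rep(mean(observed), 4))
partial  <- pct_var_explained(observed, c(1.5, 2.5, 2.5, 3.5))

cat("perfect predictions:   ", perfect, "\n")
cat("predicting the mean:   ", baseline, "\n")
cat("intermediate predictions:", partial, "\n")
```

On this scale a model that does no better than predicting the mean coefficient scores 0, which gives some context for the values reported for c and e.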

Thus, of the six coefficients modeled, the predictive model for the Abraham solvent coefficient a yielded the best "percent Var explained", 89.98. The coefficient a is a measure of the hydrogen-bond basicity of the phase, which corresponds well to the descriptor nHBAcc, the count of hydrogen bond acceptors. The predictive model for s yielded a "percent Var explained" of 76.2. The coefficient b is a measure of the hydrogen-bond acidity of the phase; its corresponding descriptor from the varImpPlot() command is khs.sOH, which counts occurrences of the hydroxyl (sOH) E-state fragment, and its model gave a "percent Var explained" of 45.49. The "percent Var explained" for the coefficient v is 57.01; the v coefficient is a measure of the hydrophobicity of the phase, which reflects cavity formation and dispersion interactions. [3], [4]
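
The six models follow the same steps, so the summary table could be assembled in one pass instead of being copied from six console transcripts. The following is a sketch under stated assumptions: summarise_model is a hypothetical helper, the demonstration uses synthetic data, and the "% Var explained" is taken from the last element of the rsq component of the fitted object.

```r
library(randomForest)

# Hypothetical helper: fit one forest and return one row of the summary table.
summarise_model <- function(data, response) {
  rf  <- randomForest(reformulate(".", response = response),
                      data = data, importance = TRUE)
  imp <- importance(rf, type = 1)  # %IncMSE per descriptor
  data.frame(
    coefficient       = response,
    pct_var_explained = round(100 * rf$rsq[rf$ntree], 2),
    top_descriptor    = rownames(imp)[which.max(imp[, 1])],
    stringsAsFactors  = FALSE
  )
}

# Synthetic demonstration: "a" depends on d1 and "b" depends on d3.
set.seed(7)
n <- 78
descr <- data.frame(d1 = rnorm(n), d2 = rnorm(n), d3 = rnorm(n))
toy_a <- cbind(descr, a = 2 * descr$d1 + rnorm(n, sd = 0.1))
toy_b <- cbind(descr, b = 2 * descr$d3 + rnorm(n, sd = 0.1))

summary_table <- rbind(summarise_model(toy_a, "a"),
                       summarise_model(toy_b, "b"))
print(summary_table)

# With the real files, each call would look like:
# mydata <- read.csv("awithdescriptorsreadyformodeling.csv",
#                    header = TRUE, row.names = "molID")
# summarise_model(mydata, "a")
```

Because random forests are stochastic, calling set.seed() before each fit would be needed for the table to be reproduced exactly by another researcher.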

Conclusion

[Good work. Give a summary of the results '% var explained', maybe in a table, and any other conclusions you'd like to draw. Then read [2], especially about what the coefficients c,s,e,a,b,v are meant to mean, and discuss if the CDK descriptors picked up make sense: http://pele.farmbio.uu.se/nightly-1.4.x/dnames.html -AL]
[What are your thoughts about whether this model could be used by chemists to plan reactions? Solute Model001 has not proven to be practically usable for applications like the Solvent Selector. Do you think this model is good enough, and if not, what strategy will you use to improve this model? JCB]
[Since this model is about estimating the values for solvent Abraham descriptors, the most important aspect of performance is how it does predicting solubilities for solutes with well-defined Abraham descriptors. A straightforward way to assess this would be to generate a fourth column in the Solvent Selector for cinnamic acid (which has lots of experimental measurements and good predictions for almost all solvents listed under the AD measured column). I could then give you feedback from a synthetic chemist's perspective about how useful your model is likely to be. JCB]
[We can do this for solvent model002 and compare to Acree's analysis in [2]. -AL]
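
The assessment suggested in the comments amounts to inserting predicted solvent coefficients into the Abraham equation together with known solute descriptors. A minimal sketch of that calculation follows; the function name abraham_log_p is hypothetical, and every number below is an invented placeholder, not a measured or predicted value for any real solvent or solute.

```r
# Hypothetical helper: evaluate log P = c + eE + sS + aA + bB + vV.
# 'coef' holds solvent coefficients, 'solute' holds solute descriptors.
abraham_log_p <- function(coef, solute) {
  unname(coef["c"] +
         coef["e"] * solute["E"] +
         coef["s"] * solute["S"] +
         coef["a"] * solute["A"] +
         coef["b"] * solute["B"] +
         coef["v"] * solute["V"])
}

# Placeholder inputs (invented for illustration only).
solvent_coef <- c(c = 0.2, e = 0.1, s = -0.5, a = 0.3, b = -3.0, v = 4.0)
solute_desc  <- c(E = 1.0, S = 1.0, A = 0.5, B = 0.5, V = 1.0)

log_p <- abraham_log_p(solvent_coef, solute_desc)
cat("log P from placeholder inputs:", log_p, "\n")
```

Running the same calculation once with the database coefficients and once with the random forest predictions would show directly how much the modeling error changes the predicted value for a given solute.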

References

[1] Paul C.M. van Noort. Solvation thermodynamics and the physical–chemical meaning of the constant in Abraham solvation equations. Chemosphere (2011), doi:10.1016/j.chemosphere.2011.11.073
[2] Laura M. Grubbs, Mariam Saifullah, Nohelli E. De La Rosa, Shulin Ye, Sai S. Achi, William E. Acree Jr., Michael H. Abraham. Mathematical correlations for describing solute transfer into functionalized alkane solvents containing hydroxyl, ether, ester or ketone solvents. Fluid Phase Equilibria 298 (2010) 48–53, doi:10.1016/j.fluid.2010.07.007
[3] Jesús Jover, Ramón Bosque, and Joaquim Sales. Determination of Abraham Solute Parameters from Molecular Structure. J. Chem. Inf. Comput. Sci. 2004, 44, 1098-1106.
[4] http://pele.farmbio.uu.se/nightly-1.2.x/dnames.html