Modeling Solvent Abraham Coefficients - Model001b

Researcher
Lori Fielding
[Good work, please remove all spans from the wikitext -AL]

Objective

To model the Abraham solvent coefficients c, e, s, a, b and v from chemical structure using CDK descriptors. Model001a and Model001b were built by different researchers but use the same data, procedure and techniques.

Introduction

The Abraham solvent coefficients c, e, s, a, b, v are determined by fitting experimental partition coefficients and solubility measurements to the Abraham general solvation model equations. Recent work has shown that the solvent coefficient c, the regression intercept, has physical–chemical meaning and is related to the van der Waals volume. [1] All six coefficients therefore have physical–chemical meaning, so it may be possible to model them directly from structure. This has already been done with fragment-based methods for a limited chemical space of alkane solvents containing hydroxyl, ether, ester and/or ketone functional groups. [2] Here we present a general model for predicting solvent Abraham coefficients from structure using open CDK descriptors.
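
For reference, the general solvation model for transfer between two condensed phases is usually written in the form below. E, S, A, B and V are the solute descriptors, and the lowercase letters are the solvent coefficients modeled on this page. The notation follows the standard Abraham literature and is not defined elsewhere in this notebook, so it is given here only as a reminder.

```latex
\log P = c + e\,E + s\,S + a\,A + b\,B + v\,V
```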

Procedure

The up-to-date (as of January 21, 2012) solvent coefficients for 78 organic solvents were taken from the ONSChallenge solvent database.

A CSV file of descriptors was generated with the CDK Descriptor Calculator GUI (v1.3.2). All 2D descriptors were calculated, with the following options deselected: Ionization Potential, Charged Partial Surface Areas, Protein, Geometrical, and WHIM. The option 'Add Explicit H' was selected.
The descriptors Kier3 and HybRatio were deleted from the CSV file because they contained multiple "NA" entries.
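
The column removal was done by hand in the CSV file, but the same step could be scripted. The following is a minimal sketch in base R on an invented toy table. Only the column names Kier3, HybRatio, XLogP, TopoPSA and molID come from this page. The values and the object names desc, has_na and desc_clean are made up for illustration.

```r
# Toy descriptor table; the values are invented for illustration only.
desc <- data.frame(
  molID    = c("m1", "m2", "m3"),
  XLogP    = c(0.5, 1.2, -0.3),
  Kier3    = c(NA, 2.1, NA),
  HybRatio = c(0.4, NA, 0.7),
  TopoPSA  = c(20.2, 0.0, 37.3)
)

# Flag every column that contains at least one missing value.
has_na <- colSums(is.na(desc)) > 0

# Keep only the complete columns; drop = FALSE preserves the data-frame shape.
desc_clean <- desc[, !has_na, drop = FALSE]

print(names(desc)[has_na])   # columns that were dropped
print(names(desc_clean))     # columns that remain
```

By default read.csv() reads the string "NA" as a missing value, so a descriptor file loaded that way could be filtered in the same manner.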

Random Forest

Abraham Solvent Coefficient c

The randomForest package (v4.6-6) in R (v2.14.0) was used to build the random forest model for the prediction of the Abraham solvent coefficient c using the following code:

## Load data
> mydata <- read.csv(file="cwithdescriptorsreadyformodeling.csv", header=TRUE, row.names="molID")
 
## Load randomForest package (v. 4.6-6)
> require(randomForest)
 
## Do randomForest
> mydata.rf <- randomForest(c ~ ., data = mydata,importance = TRUE)
 
## Print results
> print(mydata.rf)
 
[output]
Call:
 randomForest(formula = c ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 68
 
          Mean of squared residuals: 0.02395482
                    % Var explained: 26.23
[output]
 
## Write predicted values of coefficient c out as a .csv file.
> test.predict <- predict(mydata.rf,mydata)
> write.csv(test.predict, file ="RFTestPredict_c.csv")
 
## Get plot of descriptor importance
> varImpPlot(mydata.rf,main="Random Forest Variable Importance")
 
[output] ## see following image: "Random Forest Variable Importance"
 
 
 
The varImpPlot() command generates the following image, which shows that the Topological Polar Surface Area (TopoPSA) and XLogP are the most important descriptors for the prediction of the Abraham solvent coefficient c.


varImpPlot_c.png
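
The ranking shown in the plot can also be read off numerically with importance() from the randomForest package. The sketch below uses an invented toy data set, not the solvent data. The names toy, x1, x2, x3, y and ranked are hypothetical. By construction only x1 carries signal, so it should rank first.

```r
library(randomForest)

# Synthetic data: y depends on x1 only; x2 and x3 are pure noise.
set.seed(1)
n <- 60
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- 2 * toy$x1 + rnorm(n, sd = 0.1)

# importance = TRUE is needed for the permutation-based measure.
rf <- randomForest(y ~ ., data = toy, importance = TRUE)

# type = 1 selects the permutation importance (%IncMSE for regression).
imp <- importance(rf, type = 1)

# Sort descriptors from most to least important.
ranked <- imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE]
print(ranked)
```

Applied to mydata.rf, the same two lines would give the numbers behind the plotted ranking.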

The following chart was generated with Tableau Public (v. 7.0) using the file cwithdescriptorsforTP.csv, and plots the predicted Abraham solvent coefficient c against the experimental value of c. The points are colored from red to green by Topological Polar Surface Area.

c_tp.png
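
When reading a predicted vs. experimental chart, it helps to know which kind of prediction is plotted. In randomForest, predict() called with the training data returns in-sample predictions, while predict() called without new data returns the out-of-bag predictions. The out-of-bag predictions are the basis of the "Mean of squared residuals" and "% Var explained" figures in the printed output. The sketch below contrasts the two on synthetic data. All names and values in it are invented, and only the row count of 78 mirrors the solvent set.

```r
library(randomForest)

set.seed(42)
n <- 78  # same number of rows as the solvent set; the values are synthetic
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
toy$y <- toy$x1 + 0.5 * toy$x2 + rnorm(n, sd = 0.5)

rf <- randomForest(y ~ ., data = toy)

# In-sample: each row is predicted by trees that were trained on it.
pred_in <- predict(rf, toy)
# Out-of-bag: each row is predicted only by trees that did not see it.
pred_oob <- predict(rf)

# Coefficient of determination for observed vs. predicted values.
r2 <- function(obs, pred) 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
r2_in <- r2(toy$y, pred_in)
r2_oob <- r2(toy$y, pred_oob)
print(c(in_sample = r2_in, out_of_bag = r2_oob))
```

The in-sample figure is normally the more optimistic of the two. A chart built from predict(mydata.rf, mydata) may therefore look tighter than the reported % Var explained would suggest.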

Abraham Solvent Coefficient e

The randomForest package (v4.6-6) in R (v2.14.0) was used to build the random forest model for the prediction of the Abraham solvent coefficient e using the following code:

## Load data
> mydata <- read.csv(file="ewithdescriptorsreadyformodeling.csv", header=TRUE, row.names="molID")
 
## Load randomForest package (v. 4.6-6)
> require(randomForest)
 
## Do randomForest
> mydata.rf <- randomForest(e ~ ., data = mydata,importance = TRUE)
 
## Print results
> print(mydata.rf)
 
[output]
Call:
 randomForest(formula = e ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 68
 
          Mean of squared residuals: 0.03508388
                    % Var explained: 23.17
[output]
 
## Write predicted values of coefficient e out as a .csv file.
> test.predict <- predict(mydata.rf,mydata)
> write.csv(test.predict, file ="RFTestPredict_e.csv")
 
## Get plot of descriptor importance
> varImpPlot(mydata.rf,main="Random Forest Variable Importance")
 
[output] ## see following image: "Random Forest Variable Importance"
 
 
 
The varImpPlot() command generates the following image, which shows that MDEC.12 and XLogP are the most important descriptors for the prediction of the Abraham solvent coefficient e.

varImpPlot_e.png

The following chart was generated with Tableau Public (v. 7.0) using the file ewithdescriptorsforTP.csv, and plots the predicted Abraham solvent coefficient e against the experimental value of e. The points are colored from red to green by MDEC.12.

e_tp.png
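
Every forest on this page reports 68 variables tried at each split. The calls do not set mtry, so this is presumably the package default, which for regression is the number of predictors divided by three, rounded down. Under that assumption, the short base-R sketch below works backwards to the descriptor counts that would give 68. The function name default_mtry and the search range are illustrative choices.

```r
# Regression default in randomForest: max(floor(p / 3), 1) for p predictors.
default_mtry <- function(p) max(floor(p / 3), 1)

# Search a plausible range of descriptor counts for those that give 68.
candidates <- 150:250
consistent <- candidates[sapply(candidates, default_mtry) == 68]
print(consistent)
```

If the default was indeed used, the modeling files held between 204 and 206 descriptor columns after Kier3 and HybRatio were removed.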

Abraham Solvent Coefficient s

The randomForest package (v4.6-6) in R (v2.14.0) was used to build the random forest model for the prediction of the Abraham solvent coefficient s using the following code:

## Load data
> mydata <- read.csv(file="swithdescriptorsreadyformodeling.csv", header=TRUE, row.names="molID")
 
## Load randomForest package (v. 4.6-6)
> require(randomForest)
 
## Do randomForest
> mydata.rf <- randomForest(s ~ ., data = mydata,importance = TRUE)
 
## Print results
> print(mydata.rf)
 
[output]
Call:
 randomForest(formula = s ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 68
 
          Mean of squared residuals: 0.110901
                    % Var explained: 75.56
[output]
 
## Write predicted values of coefficient s out as a .csv file.
> test.predict <- predict(mydata.rf,mydata)
> write.csv(test.predict, file ="RFTestPredict_s.csv")
 
## Get plot of descriptor importance
> varImpPlot(mydata.rf,main="Random Forest Variable Importance")
 
[output] ## see following image: "Random Forest Variable Importance"
 
 
 
The varImpPlot() command generates the following image, which shows that nAtomP and ATSc1 are the most important descriptors for the prediction of the Abraham solvent coefficient s.

varImpPlot_s.png

The following chart was generated with Tableau Public (v. 7.0) using the file swithdescriptorsfortp.csv, and plots the predicted Abraham solvent coefficient s against the experimental value of s. The points are colored from red to green by nAtomP.

s_tp.png


Abraham Solvent Coefficient a

The randomForest package (v4.6-6) in R (v2.14.0) was used to build the random forest model for the prediction of the Abraham solvent coefficient a using the following code:

## Load data
> mydata <- read.csv(file="awithdescriptorsreadyformodeling.csv", header=TRUE, row.names="molID")
 
## Load randomForest package (v. 4.6-6)
> require(randomForest)
 
## Do randomForest
> mydata.rf <- randomForest(a ~ ., data = mydata,importance = TRUE)
 
## Print results
> print(mydata.rf)
 
[output]
Call:
 randomForest(formula = a ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 68
 
          Mean of squared residuals: 0.2675513
                    % Var explained: 90.51
[output]
 
## Write predicted values of coefficient a out as a .csv file.
> test.predict <- predict(mydata.rf,mydata)
> write.csv(test.predict, file ="RFTestPredict_a.csv")
 
## Get plot of descriptor importance
> varImpPlot(mydata.rf,main="Random Forest Variable Importance")
 
[output] ## see following image: "Random Forest Variable Importance"
 
 
 
The varImpPlot() command generates the following image, which shows that nHBAcc is the most important descriptor for the prediction of the Abraham solvent coefficient a.

varImpPlot_a.png

The following chart was generated with Tableau Public (v. 7.0) using the file awithdescriptorsforTP.csv, and plots the predicted Abraham solvent coefficient a against the experimental value of a. The points are colored from red to green by nHBAcc.

a_tp.png


Abraham Solvent Coefficient b

The randomForest package (v4.6-6) in R (v2.14.0) was used to build the random forest model for the prediction of the Abraham solvent coefficient b using the following code:

## Load data
> mydata <- read.csv(file="bwithdescriptorsreadyformodeling.csv", header=TRUE, row.names="molID")
 
## Load randomForest package (v. 4.6-6)
> require(randomForest)
 
## Do randomForest
> mydata.rf <- randomForest(b ~ ., data = mydata,importance = TRUE)
 
## Print results
> print(mydata.rf)
 
[output]
Call:
 randomForest(formula = b ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 68
 
          Mean of squared residuals: 0.2219764
                    % Var explained: 48.1
[output]
 
## Write predicted values of coefficient b out as a .csv file.
> test.predict <- predict(mydata.rf,mydata)
> write.csv(test.predict, file ="RFTestPredict_b.csv")
 
## Get plot of descriptor importance
> varImpPlot(mydata.rf,main="Random Forest Variable Importance")
 
[output] ## see following image: "Random Forest Variable Importance"
 
 
 
The varImpPlot() command generates the following image, which shows that khs-sOH is the most important descriptor for the prediction of the Abraham solvent coefficient b.

varImpPlot_b.png

The following chart was generated with Tableau Public (v. 7.0) using the file bwithdescriptorsforTP.csv, and plots the predicted Abraham solvent coefficient b against the experimental value of b. The points are colored from red to green by khs-sOH.

b_tp.png
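
The same fit is repeated for each coefficient with only the response column changed, so the step could be wrapped in a small helper. The following is a sketch under stated assumptions and not the procedure used on this page. The function fit_coefficient and its seed argument are hypothetical, and the toy table stands in for one of the "...readyformodeling.csv" files. The original runs did not set a seed, so their exact numbers would not be reproduced by a rerun.

```r
library(randomForest)

# Hypothetical helper: grow a regression forest for one response column.
fit_coefficient <- function(data, response, seed = 2012) {
  set.seed(seed)  # assumption: a fixed seed, which the original runs did not set
  predictors <- data[, setdiff(names(data), response), drop = FALSE]
  randomForest(x = predictors, y = data[[response]], importance = TRUE)
}

# Synthetic stand-in for a table of descriptors plus one coefficient column.
set.seed(7)
toy <- data.frame(d1 = rnorm(40), d2 = rnorm(40), d3 = rnorm(40))
toy$b <- toy$d1 - toy$d3 + rnorm(40, sd = 0.2)

model_1 <- fit_coefficient(toy, "b")
model_2 <- fit_coefficient(toy, "b")

# With the same seed the two forests give the same out-of-bag predictions.
print(identical(model_1$predicted, model_2$predicted))
```

Fixing the seed makes the reported mean of squared residuals and % Var explained repeatable, which would make a comparison between Model001a and Model001b easier to interpret.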



Abraham Solvent Coefficient v

The randomForest package (v4.6-6) in R (v2.14.0) was used to build the random forest model for the prediction of the Abraham solvent coefficient v using the following code:

## Load data
> mydata <- read.csv(file="vwithdescriptorsreadyformodeling.csv", header=TRUE, row.names="molID")
 
## Load randomForest package (v. 4.6-6)
> require(randomForest)
 
## Do randomForest
> mydata.rf <- randomForest(v ~ ., data = mydata,importance = TRUE)
 
## Print results
> print(mydata.rf)
 
[output]
Call:
 randomForest(formula = v ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 68
 
          Mean of squared residuals: 0.08849263
                    % Var explained: 57.45
[output]
 
## Write predicted values of coefficient v out as a .csv file.
> test.predict <- predict(mydata.rf,mydata)
> write.csv(test.predict, file ="RFTestPredict_v.csv")
 
## Get plot of descriptor importance
> varImpPlot(mydata.rf,main="Random Forest Variable Importance")
 
[output] ## see following image: "Random Forest Variable Importance"
 
 
 
The varImpPlot() command generates the following image, which shows that XLogP is the most important descriptor for the prediction of the Abraham solvent coefficient v.

varImpPlot_v.png

The following chart was generated with Tableau Public (v. 7.0) using the file , and plots the predicted Abraham solvent coefficient v against the experimental value of v. The points are colored from red to green by XLogP.

v_tp.png
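
To compare the six models side by side, the figures printed by print(mydata.rf) in each section can be collected into one table. The base-R sketch below copies those reported values and adds two derived columns. The RMSE is the square root of the reported mean of squared residuals. The implied variance assumes that % Var explained is computed as 100 * (1 - MSE / variance). The column names are illustrative.

```r
# Values copied from the six print(mydata.rf) outputs on this page.
results <- data.frame(
  coefficient   = c("c", "e", "s", "a", "b", "v"),
  mse           = c(0.02395482, 0.03508388, 0.110901,
                    0.2675513, 0.2219764, 0.08849263),
  pct_explained = c(26.23, 23.17, 75.56, 90.51, 48.1, 57.45)
)

# Root mean squared error, in the same units as the coefficient itself.
results$rmse <- sqrt(results$mse)

# Variance of the experimental values implied by the two reported numbers,
# assuming pct_explained = 100 * (1 - mse / variance).
results$implied_var <- results$mse / (1 - results$pct_explained / 100)

# Sort from best to worst explained.
results <- results[order(results$pct_explained, decreasing = TRUE), ]
print(results, digits = 3, row.names = FALSE)
```

The table shows that coefficient a has the largest mean of squared residuals and also the highest % Var explained. That combination is what one would expect if the experimental a values are spread over a much wider range than the other coefficients.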

Discussion

Log


References

[1] Paul C.M. van Noort. Solvation thermodynamics and the physical–chemical meaning of the constant in Abraham solvation equations. Chemosphere (2011), doi:10.1016/j.chemosphere.2011.11.073
[2] Laura M. Grubbs, Mariam Saifullah, Nohelli E. De La Rosa, Shulin Ye, Sai S. Achi, William E. Acree Jr., Michael H. Abraham. Mathematical correlations for describing solute transfer into functionalized alkane solvents containing hydroxyl, ether, ester or ketone solvents. Fluid Phase Equilibria 298 (2010) 48–53, doi:10.1016/j.fluid.2010.07.007