To model the Abraham solute coefficients E, S, A, B, V, L from chemical structure using CDK descriptors. Model002a and Model002b are performed by different researchers but use the same data, procedure and techniques.

CDK descriptors were calculated for 2,481 out of 2,486 solutes from the 20120226AcreeSMILESreduced.csv using Rajarshi Guha's CDK Descriptor Calculator. All 2D descriptors were calculated except the following: Ionization Potential, Charged Partial Surface Areas, Protein, Geometrical, and WHIM. The option 'Add Explicit H' was selected.

The CDK Descriptor Calculator encountered errors when generating the descriptors for the complete SMILES csv file. SMILES numbers 1651-1656 were the cause of the errors:

| SMILES number | Name | SMILES |
| --- | --- | --- |
| 1651 | tetramethyltin | `C[Sn](C)(C)C` |
| 1652 | tetraethyltin | `CC[Sn](CC)(CC)CC` |
| 1653 | tetraethyllead | `CC[Pb](CC)(CC)CC` |
| 1654 | ferrocene | `[cH-]1cccc1.[cH-]1cccc1.[Fe+2]` |
| 1655 | germanium tetrachloride | `Cl[Ge](Cl)(Cl)Cl` |
| 1656 | methyl mercuric (II) chloride | `[Cl-].[Hg+]C` |

These were removed to create the reduced csv file linked to above, and the rest of the descriptors were generated.

The descriptors Kier3 and HybRatio were removed because they contained "NA" entries.

The generated descriptors were merged with the Abraham solute coefficients, SMILES, and names in the following file: 20120228AcreeSMILESreducedCDKoutREADYforModeling.csv

This file was then split into one file for each Abraham descriptor to be modeled.
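
As an illustration of the three steps above (dropping descriptors with "NA" entries, merging with the coefficients, and splitting per coefficient), here is a minimal base R sketch. The toy data frames, their values, and all variable names are hypothetical stand-ins rather than the actual files, and the original work may have carried out these steps with other tools.

```r
# Hypothetical toy stand-ins for the CDK output and the Abraham coefficient table.
descriptors <- data.frame(
  molID    = c(1, 2, 3),
  nAtomP   = c(0, 6, 2),
  Kier3    = c(1.2, NA, 0.8),   # contains NA, so it will be dropped
  HybRatio = c(NA, 0.5, 1.0),   # contains NA, so it will be dropped
  VABC     = c(55.1, 98.7, 72.4)
)
coefficients <- data.frame(
  molID = c(1, 2, 3),
  E     = c(0.10, 0.61, 0.25),
  S     = c(0.20, 0.52, 0.40)
)

# Keep only the columns that have no missing entries.
complete_cols <- colSums(is.na(descriptors)) == 0
descriptors_clean <- descriptors[, complete_cols, drop = FALSE]

# Merge descriptors with the coefficients on the shared identifier.
merged <- merge(coefficients, descriptors_clean, by = "molID")

# Split into one modeling table per coefficient: the target plus all descriptors.
targets <- c("E", "S")
descriptor_names <- setdiff(names(descriptors_clean), "molID")
per_target <- lapply(targets, function(target) {
  merged[, c("molID", target, descriptor_names)]
})
names(per_target) <- targets
```

Each element of `per_target` could then be written out with `write.csv()` to give one file per coefficient.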

[Lori: model with this exact same file; you do not have to go through the curation independently. -AL]

Building the Models

Random Forest. The randomForest package (v4.6-6) in R (v2.14.0) was used to build the random forest models in the following sections of code.

Abraham Solute Coefficient E

All the Abraham descriptors except E were removed from the 20120228AcreeSMILESreducedCDKoutREADYforModeling.csv file. Unavailable values for the descriptor E were labeled as "-123"; these rows had to be removed from the file with the following Tcl scripts. CSVreplaceCommas.tcl replaces the commas in the name field with "|". Remove123.tcl removes the rows where the value for E is "-123", stores the remaining rows in the outfile, and writes the deleted rows into the delfile. CSVreplaceBars.tcl replaces the "|" in the name field with ",". The output of the run through the Tcl scripts is 20120301EwithDescriptors_123removed_withNames.csv.

The SMILES and name fields will be added back after modeling.
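
For readers who prefer to stay in R, the same filtering can be sketched without the Tcl scripts. This is an alternative under stated assumptions, not the procedure actually used: it assumes names containing commas are quoted in the csv, and the inline table, its values, and the variable names are made up for illustration.

```r
# Hypothetical toy table; quoted names may contain commas, so no "|" substitution is needed.
csv_text <- 'molID,name,SMILES,E,nAtomP
1,"1,2-dichloroethane",ClCCCl,0.4,0
2,benzene,c1ccccc1,0.6,6
3,"2,2-dimethylbutane",CCC(C)(C)C,-123,0'

raw <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = FALSE)

# Separate rows with a real E value from rows carrying the -123 placeholder.
has_value <- raw$E != -123
kept    <- raw[has_value, ]    # plays the role of the "outfile"
deleted <- raw[!has_value, ]   # plays the role of the "delfile"

# Drop the identifier text fields before modeling and use molID as row names.
modeling <- kept[, setdiff(names(kept), c("name", "SMILES"))]
rownames(modeling) <- modeling$molID
modeling$molID <- NULL
```

Keeping `kept` around makes it straightforward to add the SMILES and name fields back after modeling, since its rows line up with those of `modeling`.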

    > require(randomForest)
    Loading required package: randomForest
    randomForest 4.6-6
    Type rfNews() to see new features/changes/bug fixes.
    > # Read input file
    > mydata <- read.csv(file="20120301EwithDescriptorsReadyforModeling.csv", head=TRUE, row.names="molID")
    > # Do randomForest
    > mydata.rf <- randomForest(E ~ ., data=mydata, importance=TRUE)
    > test.predict <- predict(mydata.rf, mydata)
    > # Write out results
    > write.csv(test.predict, file="RFTestPredict_E.csv")
    > varImpPlot(mydata.rf, main="Random Forest Variable Importance: E")
    [output] # see following graph
    > print(mydata.rf)
    Call:
    randomForest(formula = E ~ ., data = mydata, importance = TRUE)
    Type of random forest: regression
    Number of trees: 500
    No. of variables tried at each split: 68
    Mean of squared residuals: 0.03384625
    % Var explained: 94.59
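
The summary lines printed for E can be related to each other with a little arithmetic. The randomForest documentation describes "% Var explained" for regression as a pseudo R-squared, 1 - mse / Var(y), and gives the default number of variables tried at each split as p/3 (rounded down) for regression. The sketch below only rearranges those relations using the printed numbers; the variable names are illustrative, and the derived values are approximate because the printout is rounded.

```r
# Values copied from the printed model summary for E.
mse_oob <- 0.03384625   # "Mean of squared residuals"
pct_var <- 94.59        # "% Var explained"
mtry    <- 68           # "No. of variables tried at each split"

# pct_var = 100 * (1 - mse_oob / var_y)  =>  var_y = mse_oob / (1 - pct_var / 100)
implied_var_y <- mse_oob / (1 - pct_var / 100)

# Default regression mtry is max(floor(p / 3), 1), so p lies in [3 * mtry, 3 * mtry + 2].
p_range <- c(3 * mtry, 3 * mtry + 2)
```

Under these assumptions the variance of E in the modeling set is roughly 0.63, and the model saw between 204 and 206 descriptors.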

The output from the varImpPlot() command in R yields the following graph.

Plotting the known values against the predicted values with Tableau Public gives the following plot. The points are colored according to the khs.sF descriptor.

The data sheet was used as input to R and the randomForest package with the following code:

    > require(randomForest)
    Loading required package: randomForest
    randomForest 4.6-6
    Type rfNews() to see new features/changes/bug fixes.
    > mydata <- read.csv(file="20120302SwithDescriptorsREadyforModeling.csv", head=TRUE, row.names="molID")
    > mydata.rf <- randomForest(S ~ ., data=mydata, importance=TRUE)
    > varImpPlot(mydata.rf, main="Random Forest Variable Importance: S")
    [output] ## see following graph
    > test.predict <- predict(mydata.rf, mydata)
    > write.csv(test.predict, file="RFTestPredict_S.csv")
    > print(mydata.rf)
    Call:
    randomForest(formula = S ~ ., data = mydata, importance = TRUE)
    Type of random forest: regression
    Number of trees: 500
    No. of variables tried at each split: 68
    Mean of squared residuals: 0.1196888
    % Var explained: 77.04

The varImpPlot() command gives the following graph of variable importance:

Plotting the known values versus the predicted values for the Abraham solute descriptor S with Tableau Public yields the following scatter plot. The points are colored according to the nAtomP descriptor.

The data sheet was input and run through the randomForest package with the following R code:

    > require(randomForest)
    > mydata <- read.csv(file="20120302AwithDescriptorsReadyforModeling.csv", head=TRUE, row.names="molID")
    > mydata.rf <- randomForest(A ~ ., data=mydata, importance=TRUE)
    > test.predict <- predict(mydata.rf, mydata)
    > write.csv(test.predict, file="20120303RFTestPredict_A.csv")
    > varImpPlot(mydata.rf, main="Random Forest Variable Importance: A")
    [output] # see following graph
    > print(mydata.rf)
    Call:
    randomForest(formula = A ~ ., data = mydata, importance = TRUE)
    Type of random forest: regression
    Number of trees: 500
    No. of variables tried at each split: 68
    Mean of squared residuals: 0.03444421
    % Var explained: 64.9

The varImpPlot() command gave the following graph of variable importance.

Plotting the known values versus the predicted values for the Abraham solute descriptor A with Tableau Public yields the following plot. The points are colored according to the khs.aasN descriptor.

The data sheet was input and run through the randomForest package with the following R code:

    > require(randomForest)
    Loading required package: randomForest
    randomForest 4.6-6
    Type rfNews() to see new features/changes/bug fixes.
    > ## read in data
    > mydata <- read.csv(file="20120302BwithDescriptorsReadyforModeling.csv", head=TRUE, row.names="molID")
    > ## Do randomForest
    > mydata.rf <- randomForest(B ~ ., data=mydata, importance=TRUE)
    > test.predict <- predict(mydata.rf, mydata)
    > ## Write output data
    > write.csv(test.predict, file="RFTestPredict_B.csv")
    > varImpPlot(mydata.rf, main="Random Forest Variable Importance: B")
    [output] ## see following graph
    > print(mydata.rf)
    Call:
    randomForest(formula = B ~ ., data = mydata, importance = TRUE)
    Type of random forest: regression
    Number of trees: 500
    No. of variables tried at each split: 68
    Mean of squared residuals: 0.02822268
    % Var explained: 88.62

The varImpPlot() command gave the following graph of variable importance.

Plotting the known values versus the predicted values for the Abraham solute descriptor B with Tableau Public yields the following plot. The points are colored according to the nHBAcc descriptor.

The data sheet was used as input to R and the randomForest package with the following code:

    > require(randomForest)
    Loading required package: randomForest
    randomForest 4.6-6
    Type rfNews() to see new features/changes/bug fixes.
    > ## Read in data
    > mydata <- read.csv(file="20120303VwithDescriptorsReadyforModeling.csv", head=TRUE, row.names="molID")
    > ## Do randomForest
    > mydata.rf <- randomForest(V ~ ., data=mydata, importance=TRUE)
    > test.predict <- predict(mydata.rf, mydata)
    > ## Write output data
    > write.csv(test.predict, file="RFTestPredict_V.csv")
    > varImpPlot(mydata.rf, main="Random Forest Variable Importance: V")
    [output] ## see following graph
    > print(mydata.rf)
    Call:
    randomForest(formula = V ~ ., data = mydata, importance = TRUE)
    Type of random forest: regression
    Number of trees: 500
    No. of variables tried at each split: 68
    Mean of squared residuals: 0.02013338
    % Var explained: 95.36

The varImpPlot() command gives the following graph of variable importance:

Plotting the known values versus the predicted values for the Abraham solute descriptor V with Tableau Public yields the following plot. The points are colored according to the VABC descriptor.

The data sheet was used as input to R and the randomForest package with the following code:

    > # Load randomForest
    > require(randomForest)
    > # Read in data
    > mydata <- read.csv(file="20120304LwithDescriptorsReadyforModeling.csv", head=TRUE, row.names="molID")
    > # Do randomForest
    > mydata.rf <- randomForest(L ~ ., data=mydata, importance=TRUE)
    > test.predict <- predict(mydata.rf, mydata)
    > # Write output
    > write.csv(test.predict, file="RFTestPredict_L.csv")
    > varImpPlot(mydata.rf, main="Random Forest Variable Importance: L")
    [output] # see following graph
    > print(mydata.rf)
    Call:
    randomForest(formula = L ~ ., data = mydata, importance = TRUE)
    Type of random forest: regression
    Number of trees: 500
    No. of variables tried at each split: 68
    Mean of squared residuals: 0.5755685
    % Var explained: 93.96

The varImpPlot() command gives the following graph of variable importance:

Plotting the known values versus the predicted values for the Abraham solute descriptor L with Tableau Public yields the following plot. The points are colored according to the ATSp1 descriptor.

Results

The following table summarizes the results that R produced, listing each Abraham solute descriptor with its % Var explained, its meaning, and its most important descriptor according to varImpPlot():

| Solute Coefficient | % Var Explained | Meaning [1],[2] | varImpPlot() Descriptor |
| --- | --- | --- | --- |
| E | 94.59% | excess molar refraction | khs.sF |
| S | 77.04% | dipolarity/polarizability | nAtomP |
| A | 64.9% | hydrogen-bond acidity | khs.aasN |
| B | 88.62% | hydrogen-bond basicity | nHBAcc |
| V | 95.36% | McGowan characteristic volume | VABC |
| L | 93.96% | logarithm of the gas to n-hexadecane partition coefficient at 298 K | ATSp1 |

To compare the results for S, A, and B from Model001 with these results, the out-of-bag (OOB) value that randomForest reports (% Var explained) must be converted to an R2 value.

Using the formula for R2,

$$R^2 = 1 - \frac{\sum_i (y_i - f_i)^2}{\sum_i (y_i - \bar{y})^2}$$

where $y_i$ are the observed values, $f_i$ are the predicted values, and $\bar{y}$ is the mean of the observed values of the datasets for the Abraham Solute Descriptors. The R2 values were calculated with R2_SAB.zip and are compared to the multiple R2 values obtained in Model001 in the following table.
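
The calculation described above can be sketched in a few lines of base R. The contents of R2_SAB.zip are not shown here, so this is an assumed implementation of the standard coefficient of determination using the observed and predicted values; the function name `r_squared` and the toy vectors are hypothetical.

```r
# Coefficient of determination from observed values y_i and predicted values f_i.
r_squared <- function(observed, predicted) {
  stopifnot(length(observed) == length(predicted))
  ss_res <- sum((observed - predicted)^2)        # residual sum of squares
  ss_tot <- sum((observed - mean(observed))^2)   # total sum of squares
  1 - ss_res / ss_tot
}

# Made-up values standing in for one coefficient's known and predicted columns.
observed  <- c(0.2, 0.5, 0.9, 1.4)
predicted <- c(0.25, 0.45, 1.00, 1.30)
r2 <- r_squared(observed, predicted)
```

With the files produced earlier, `predicted` would come from the relevant RFTestPredict csv and `observed` from the matching modeling file. Values computed from `predict(mydata.rf, mydata)` describe the fit on the training rows, whereas "% Var explained" is based on out-of-bag predictions, so the two numbers are not expected to match.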

| Solute Coefficient | R2 Model002a | R2 Model001 |
| --- | --- | --- |
| S | 0.9574 | 0.8384 |
| A | 0.9329 | 0.6429 |
| B | 0.9809 | 0.8367 |

References

[1] Paul C.M. van Noort. Solvation thermodynamics and the physical–chemical meaning of the constant in Abraham solvation equations. Chemosphere (2011), doi:10.1016/j.chemosphere.2011.11.073

[2] Laura M. Grubbs, Mariam Saifullah, Nohelli E. De La Rosa, Shulin Ye, Sai S. Achi, William E. Acree Jr., Michael H. Abraham. Mathematical correlations for describing solute transfer into functionalized alkane solvents containing hydroxyl, ether, ester or ketone solvents. Fluid Phase Equilibria 298 (2010) 48–53, doi:10.1016/j.fluid.2010.07.00

## Predicting Abraham Descriptors - Model002a

Researchers: Jesse Patsolic and Matthew Wilson

## Objective

To model the Abraham solute coefficients E, S, A, B, V, L from chemical structure using CDK descriptors. Model002a and Model002b are performed by different researchers but use the same data, procedure and techniques.

## Procedure

The up-to-date (as of February 23, 2012) 20120223AcreeCompilationDescriptorsAddedToWebService.csv was used as the initial data set. Using a Tcl script, CSVduplicates.tcl, and with help from Matthew Wilson, the duplicate entries were deleted while merging any missing descriptor entries, giving 20120223AcreeCompilationDescriptorsAddedToWebService_DuplicateFree.csv.
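
To make the "delete duplicates while merging missing entries" idea concrete, here is a small base R sketch. It is not a port of CSVduplicates.tcl, whose logic is not shown; it assumes duplicates are identified by SMILES and that "-123" marks a missing value, and the toy data and names are hypothetical.

```r
# Hypothetical toy compilation: ethanol appears twice, each row missing a different value.
compilation <- data.frame(
  SMILES = c("CCO", "CCO", "c1ccccc1"),
  E      = c(0.25, -123, 0.61),
  S      = c(-123, 0.42, 0.52),
  stringsAsFactors = FALSE
)
missing_code <- -123

# Within a group of duplicates, take the first value that is not the placeholder.
first_known <- function(values) {
  known <- values[values != missing_code]
  if (length(known) > 0) known[1] else missing_code
}

# Collapse to one row per SMILES, filling gaps from whichever duplicate has the value.
deduplicated <- aggregate(cbind(E, S) ~ SMILES, data = compilation, FUN = first_known)
```

A real merge would also need a rule for duplicates that disagree on a measured value; this sketch simply keeps the first one it encounters.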

CDK descriptors were calculated for 2,481 out of 2,486 solutes from the 20120226AcreeSMILESreduced.csv using Rajarshi Guha's CDK Descriptor Calculator. All 2D descriptors were calculated except the following: Ionization Potential, Charged Partial Surface Areas, Protein, Geometrical, and WHIM. The option 'Add Explicit H' was selected.

The CDK Descriptor Calculator encountered errors when generating the descriptors for the complete SMILES csv file. SMILES numbers 1651-1656 were the cause of the errors.

These were removed to create the reduced csv file linked to above, and the rest of the descriptors were generated.

The descriptors Kier3 and HybRatio were removed because they contained "NA" entries.

The generated descriptors were merged with the Abraham solute coefficients, SMILES, and names in the following file: 20120228AcreeSMILESreducedCDKoutREADYforModeling.csv

This file is then split up for each Abraham descriptor to be modeled.

[Lori: model with this exact same file; you do not have to go through the curation independently. -AL]

## Building the Models

Random Forest. The randomForest package (v4.6-6) in R (v2.14.0) was used to build the random forest models in the following sections of code.

## Abraham Solute Coefficient E

All the Abraham descriptors except E were removed from the 20120228AcreeSMILESreducedCDKoutREADYforModeling.csv file. Unavailable values for the descriptor E were labeled as "-123"; these rows had to be removed from the file with the following Tcl scripts. CSVreplaceCommas.tcl replaces the commas in the name field with "|". Remove123.tcl removes the rows where the value for E is "-123", stores the remaining rows in the outfile, and writes the deleted rows into the delfile. CSVreplaceBars.tcl replaces the "|" in the name field with ",". The output of the run through the Tcl scripts is 20120301EwithDescriptors_123removed_withNames.csv.

After using these scripts, the SMILES, name, and csid fields are removed and the csv file, 20120301EwithDescriptorsReadyforModeling.csv, is ready to be modeled with R.

The SMILES and name fields will be added back after modeling.

The output from the varImpPlot() command in R yields the following graph.

Plotting the known values against the predicted values with Tableau Public gives the following plot. The points are colored according to the khs.sF descriptor.

## Abraham Solute Coefficient S

The same procedure as for E was used to generate 20120302SwithDescriptors_123removed_withNames.csv. The names and SMILES fields were removed and the resulting file was used as input for R.

The data sheet was used as input to R and the randomForest package with the following code:

The varImpPlot() command gives the following graph of variable importance:

Plotting the known values versus the predicted values for the Abraham solute descriptor S with Tableau Public yields the following scatter plot. The points are colored according to the nAtomP descriptor.

## Abraham Solute Coefficient A

The same procedure was used to generate 20120311AwithDescriptors_123removed_withNames.csv. The names and SMILES fields were removed and the resulting file was used as input for R.

The data sheet was input and run through the randomForest package with the following R code:

The varImpPlot() command gave the following graph of variable importance.

Plotting the known values versus the predicted values for the Abraham solute descriptor A with Tableau Public yields the following plot. The points are colored according to the khs.aasN descriptor.

## Abraham Solute Coefficient B

The same procedure was used to generate 20120302BwithDescriptors_123removed_withNames.csv. The names and SMILES fields were removed and the resulting file was used as input for R.

The data sheet was input and run through the randomForest package with the following R code:

The varImpPlot() command gave the following graph of variable importance.

Plotting the known values versus the predicted values for the Abraham solute descriptor B with Tableau Public yields the following plot. The points are colored according to the nHBAcc descriptor.

## Abraham Solute Coefficient V

The same procedure was used to generate 20120303VwithDescriptors_123removed_withNames.csv. The names and SMILES fields were removed and the resulting file was used as input for R.

The data sheet was used as input to R and the randomForest package with the following code:

The varImpPlot() command gives the following graph of variable importance:

Plotting the known values versus the predicted values for the Abraham solute descriptor V with Tableau Public yields the following plot. The points are colored according to the VABC descriptor.

## Abraham Solute Coefficient L

The same procedure was used to generate 20120304LwithDescriptors_123removed_withNames.csv. The names and SMILES fields were removed and the resulting file was used as input for R.

The data sheet was used as input to R and the randomForest package with the following code:

The varImpPlot() command gives the following graph of variable importance:

Plotting the known values versus the predicted values for the Abraham solute descriptor L with Tableau Public yields the following plot. The points are colored according to the ATSp1 descriptor.

## Results

The following table summarizes the results that R produced, listing each Abraham solute descriptor with its % Var explained and its most important descriptor:

To compare the results for S, A, and B from Model001 with these results, the out-of-bag (OOB) value that randomForest reports (% Var explained) must be converted to an R2 value.

Using the formula for R2,

$$R^2 = 1 - \frac{\sum_i (y_i - f_i)^2}{\sum_i (y_i - \bar{y})^2}$$

where $y_i$ are the observed values, $f_i$ are the predicted values, and $\bar{y}$ is the mean of the observed values of the datasets for the Abraham Solute Descriptors. The R2 values were calculated with R2_SAB.zip and are compared to the multiple R2 values obtained in Model001 in the following table.

## References

[1] Paul C.M. van Noort. Solvation thermodynamics and the physical–chemical meaning of the constant in Abraham solvation equations. Chemosphere (2011), doi:10.1016/j.chemosphere.2011.11.073

[2] Laura M. Grubbs, Mariam Saifullah, Nohelli E. De La Rosa, Shulin Ye, Sai S. Achi, William E. Acree Jr., Michael H. Abraham. Mathematical correlations for describing solute transfer into functionalized alkane solvents containing hydroxyl, ether, ester or ketone solvents. Fluid Phase Equilibria 298 (2010) 48–53, doi:10.1016/j.fluid.2010.07.00