Predicting Abraham Descriptors - Model002a


Researchers
Jesse Patsolic and Matthew Wilson

Objective

To model the Abraham solute coefficients E, S, A, B, V, and L from chemical structure using CDK descriptors. Model002a and Model002b were carried out by different researchers but use the same data, procedure, and techniques.

Procedure

The up-to-date (as of February 23, 2012) file 20120223AcreeCompilationDescriptorsAddedToWebService.csv was used as the initial data set.
With help from Matthew Wilson, a TCL script, CSVduplicates.tcl, was used to delete the duplicate entries while merging in any missing descriptor entries, producing 20120223AcreeCompilationDescriptorsAddedToWebService_DuplicateFree.csv.
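
The duplicate-merging step can be illustrated with a minimal R sketch. This is not the CSVduplicates.tcl script itself; it assumes that duplicates are identified by a SMILES column and that missing descriptor entries are coded as NA, and the data frame, its values, and the helper names first_known and merge_duplicates are hypothetical.

```r
# Hypothetical toy data: two rows for the same solute, each missing a different value.
solutes <- data.frame(
  SMILES = c("CCO", "CCO", "CCC"),
  E = c(0.246, NA, 0.000),
  S = c(NA, 0.42, 0.000),
  stringsAsFactors = FALSE
)

# Return the first non-missing value in a vector (NA if all are missing).
first_known <- function(x) {
  known <- x[!is.na(x)]
  if (length(known) > 0) known[1] else NA
}

# Collapse duplicates: one row per key value, filling gaps from the duplicate rows.
merge_duplicates <- function(df, key) {
  groups <- split(df, df[[key]])
  merged <- lapply(groups, function(g) {
    as.data.frame(lapply(g, first_known), stringsAsFactors = FALSE)
  })
  out <- do.call(rbind, merged)
  rownames(out) <- NULL
  out
}

deduped <- merge_duplicates(solutes, "SMILES")
print(deduped)
```

Taking the first known value is only one possible merge rule; conflicting values between duplicate rows would need to be checked by hand.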

CDK Descriptors were calculated for 2,481 out of 2,486 solutes from the 20120226AcreeSMILESreduced.csv using Rajarshi Guha's CDK Descriptor Calculator. All 2D descriptors were calculated except the following: Ionization Potential, Charged Partial Surface Areas, Protein, Geometrical, and WHIM. The option 'Add Explicit H' was selected.

The CDK Descriptor Calculator came across errors when generating the descriptors for the complete SMILES csv file. SMILES numbers 1651-1656 were the cause of the errors.

No.   Name                            SMILES
1651  tetramethyltin                  C[Sn](C)(C)C
1652  tetraethyltin                   CC[Sn](CC)(CC)CC
1653  tetraethyllead                  CC[Pb](CC)(CC)CC
1654  ferrocene                       [cH-]1cccc1.[cH-]1cccc1.[Fe+2]
1655  germanium tetrachloride         Cl[Ge](Cl)(Cl)Cl
1656  methyl mercuric (II) chloride   [Cl-].[Hg+]C

These were removed to create the reduced csv file linked to above, and the rest of the descriptors were generated.

The descriptors Kier3 and HybRatio were removed due to "NA" entries.
The generated descriptors were merged with the Abraham solute coefficients, SMILES, and names in the following file: 20120228AcreeSMILESreducedCDKoutREADYforModeling.csv.
This file is then split up for each Abraham descriptor to be modeled.
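
As a rough illustration of these two steps (dropping descriptor columns that contain NA entries, then splitting the merged file into one table per Abraham descriptor), here is a minimal R sketch. The toy data frame, its values, and the names merged, clean, and per_target are hypothetical; only two targets are shown for brevity.

```r
# Hypothetical merged table: a row key, two targets, three CDK-style descriptors.
merged <- data.frame(
  molID = 1:3,
  E = c(0.2, 0.6, 0.3),
  S = c(0.4, 0.5, 0.1),
  nAtomP = c(2, 6, 0),
  Kier3 = c(1.5, NA, 2.0),
  VABC = c(50.1, 90.3, 70.2)
)
targets <- c("E", "S")

# Drop any column that contains at least one NA entry.
has_na <- vapply(merged, function(col) any(is.na(col)), logical(1))
clean <- merged[, !has_na, drop = FALSE]

# Build one table per target: keep molID, that target, and all descriptors,
# and leave out the other targets.
per_target <- lapply(targets, function(target) {
  other_targets <- setdiff(targets, target)
  clean[, setdiff(names(clean), other_targets), drop = FALSE]
})
names(per_target) <- targets

print(names(per_target$E))
```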

[Lori: model with this exact same file; you do not have to go through the curation independently. -AL]

Building the Models


Random Forest. The randomForest package (v4.6-6) in R (v2.14.0) was used to build the random forest models in the following sections of code.

Abraham Solute Coefficient E

All of the Abraham solute coefficients except E were removed from the 20120228AcreeSMILESreducedCDKoutREADYforModeling.csv file. The unavailable values for the descriptor E were labeled as "-123"; these rows had to be parsed out of the file with the following TCL scripts, run in order:
CSVreplaceCommas.tcl replaces the commas in the name field with "|".
Remove123.tcl removes the rows where the value for E is "-123", writing the remaining rows to the outfile and the deleted rows to the delfile.
CSVreplaceBars.tcl replaces the "|" in the name field with ",".
The output of the run through the TCL scripts is 20120301EwithDescriptors_123removed_withNames.csv.

After running these scripts, the SMILES, name, and csid fields were removed, and the resulting csv file, 20120301EwithDescriptorsReadyforModeling.csv, is ready to be modeled with R.

The SMILES and name fields will be added back after modeling.
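
For readers who want to reproduce the "-123" filtering without the TCL scripts, the same idea can be sketched directly in R. This is an alternative illustration, not the procedure used here. It assumes the name field is quoted in the csv, so that commas inside names need no "|" substitution; the rows, values, and variable names are hypothetical.

```r
# Hypothetical rows; one quoted name contains a comma, and one E value is missing (-123).
csv_text <- 'molID,name,SMILES,E,nAtomP
1,"benzene",c1ccccc1,0.61,6
2,"1,2-dichloroethane",ClCCCl,0.416,0
3,"example solute",CCO,-123,0'

all_rows <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = FALSE)

# Separate rows with a known E value from rows carrying the missing-value code.
missing_code <- -123
kept    <- all_rows[all_rows$E != missing_code, ]  # counterpart of the outfile
deleted <- all_rows[all_rows$E == missing_code, ]  # counterpart of the delfile

# Drop the identifier fields before modeling; molID stays as the row key.
model_input <- kept[, setdiff(names(kept), c("name", "SMILES")), drop = FALSE]
print(model_input)
```

Keeping the deleted rows in a separate object mirrors the delfile and makes it easy to check how many solutes were excluded for each coefficient.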

> require(randomForest)
Loading required package: randomForest
randomForest 4.6-6
Type rfNews() to see new features/changes/bug fixes.
 
# Read input file
> mydata=read.csv(file="20120301EwithDescriptorsReadyforModeling.csv",head=TRUE,row.names="molID")
 
# Do randomForest
> mydata.rf <- randomForest(E ~., data = mydata,importance=TRUE)
> test.predict <- predict(mydata.rf,mydata)
 
# Write out results
> write.csv(test.predict, file="RFTestPredict_E.csv")
> varImpPlot(mydata.rf,main = "Random Forest Variable Importance: E")
 
[output]  #see following graph
 
> print(mydata.rf)
 
Call:
 randomForest(formula = E ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 68
 
          Mean of squared residuals: 0.03384625
                    % Var explained: 94.59
 

The output from the varImpPlot() command in R yields the following graph.

varImpPlot_Ejlp.png

Plotting the known values against the predicted values with Tableau Public gives the following plot. The points are colored according to the khs.sF descriptor.

EvE-predicted_khs_sF.png

Abraham Solute Coefficient S

The same procedure used for E was used to generate 20120302SwithDescriptors_123removed_withNames.csv. The name and SMILES fields were removed and the resulting file was used as input for R.

The data sheet was used as input to R and the randomForest package with the following code:
> require(randomForest)
Loading required package: randomForest
randomForest 4.6-6
Type rfNews() to see new features/changes/bug fixes.
> mydata=read.csv(file="20120302SwithDescriptorsREadyforModeling.csv",head=TRUE,row.names = "molID")
> mydata.rf <- randomForest(S ~., data = mydata,importance = TRUE)
> varImpPlot(mydata.rf, main= "Random Forest Variable Importance: S")
 
[output] ## see following graph
 
> test.predict <- predict(mydata.rf, mydata)
> write.csv(test.predict, file="RFTestPredict_S.csv")
> print(mydata.rf)
 
Call:
 randomForest(formula = S ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 68
 
          Mean of squared residuals: 0.1196888
                    % Var explained: 77.04
 

The varImpPlot() command gives the following graph of variable importance:
20120302varImpPlot_S.png

Plotting the known values versus the predicted values for the Abraham solute descriptor S with Tableau Public yields the following scatter plot. The points are colored according to the nAtomP descriptor.

20120302SvsS-predicted.png

Abraham Solute Coefficient A

The same procedure was used to generate 20120311AwithDescriptors_123removed_withNames.csv. The name and SMILES fields were removed and the resulting file was used as input for R.

The data sheet was input and run through the randomForest package with the following R code:

> require(randomForest)
> mydata = read.csv(file="20120302AwithDescriptorsReadyforModeling.csv",head=TRUE,row.names="molID")
> mydata.rf <- randomForest(A ~., data = mydata, importance = TRUE)
> test.predict <- predict(mydata.rf,mydata)
> write.csv(test.predict, file = "20120303RFTestPredict_A.csv")
> varImpPlot(mydata.rf,main="Random Forest Variable Importance: A")
 
[output] # see following graph
 
> print(mydata.rf)
 
Call:
 randomForest(formula = A ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 68
 
          Mean of squared residuals: 0.03444421
                    % Var explained: 64.9

The varImpPlot() command gave the following graph of variable importance.

varImpPlot_a.png


Plotting the known values versus the predicted values for the Abraham solute descriptor A with Tableau Public yields the following plot. The points are colored according to the khs.aasN descriptor.

20120303AvsA-predicted.png



Abraham Solute Coefficient B

The same procedure was used to generate 20120302BwithDescriptors_123removed_withNames.csv. The name and SMILES fields were removed and the resulting file was used as input for R.

The data sheet was input and run through the randomForest package with the following R code:

> require(randomForest)
Loading required package: randomForest
randomForest 4.6-6
Type rfNews() to see new features/changes/bug fixes.
 
## read in data
> mydata=read.csv(file="20120302BwithDescriptorsReadyforModeling.csv",head=TRUE,row.names="molID")
 
## Do randomForest
> mydata.rf <- randomForest(B ~., data = mydata,importance=TRUE)
> test.predict <- predict(mydata.rf,mydata)
 
## Write output data
> write.csv(test.predict, file="RFTestPredict_B.csv")
> varImpPlot(mydata.rf,main = "Random Forest Variable Importance: B")
 
[output] ## see following graph
 
> print(mydata.rf)
 
Call:
 randomForest(formula = B ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 68
 
          Mean of squared residuals: 0.02822268
                    % Var explained: 88.62

The varImpPlot() command gave the following graph of variable importance.

20120303varImpPlot_B.png

Plotting the known values versus the predicted values for the Abraham solute descriptor B with Tableau Public yields the following plot. The points are colored according to the nHBAcc descriptor.

20120303BvsB-predicted.png



Abraham Solute Coefficient V


The same procedure was used to generate 20120303VwithDescriptors_123removed_withNames.csv. The name and SMILES fields were removed and the resulting file was used as input for R.

The data sheet was used as input to R and the randomForest package with the following code:

> require(randomForest)
Loading required package: randomForest
randomForest 4.6-6
Type rfNews() to see new features/changes/bug fixes.
 
## Read in data
> mydata=read.csv(file="20120303VwithDescriptorsReadyforModeling.csv",head=TRUE,row.names="molID")
 
## Do randomForest
> mydata.rf <- randomForest(V ~., data = mydata,importance=TRUE)
> test.predict <- predict(mydata.rf,mydata)
 
## Write output data
> write.csv(test.predict, file="RFTestPredict_V.csv")
> varImpPlot(mydata.rf,main = "Random Forest Variable Importance: V")
 
[output]  ## see following graph
 
> print(mydata.rf)
 
Call:
 randomForest(formula = V ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 68
 
          Mean of squared residuals: 0.02013338
                    % Var explained: 95.36
 

The varImpPlot() command gives the following graph of variable importance:

varImpPlot_v.png

Plotting the known values versus the predicted values for the Abraham solute descriptor V with Tableau Public yields the following plot. The points are colored according to the VABC descriptor.

20120304VvsV-predicted.png


Abraham Solute Coefficient L


The same procedure was used to generate 20120304LwithDescriptors_123removed_withNames.csv. The name and SMILES fields were removed and the resulting file was used as input for R.

The data sheet was used as input to R and the randomForest package with the following code:

# Load randomForest
> require(randomForest)
 
# Read in data
> mydata=read.csv(file="20120304LwithDescriptorsReadyforModeling.csv",head=TRUE,row.names="molID")
 
# Do randomForest
> mydata.rf <- randomForest(L ~., data = mydata,importance=TRUE)
> test.predict <- predict(mydata.rf,mydata)
 
# Write output
> write.csv(test.predict, file="RFTestPredict_L.csv")
> varImpPlot(mydata.rf,main = "Random Forest Variable Importance: L")
 
 [output]  # see following graph
 
> print(mydata.rf)
 
Call:
 randomForest(formula = L ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 68
 
          Mean of squared residuals: 0.5755685
                    % Var explained: 93.96
 
 

The varImpPlot() command gives the following graph of variable importance:

varImpPlot_L.png


Plotting the known values versus the predicted values for the Abraham solute descriptor L with Tableau Public yields the following plot. The points are colored according to the ATSp1 descriptor.

20120304LvsL-predicted.png

Results


The following table summarizes the results that R produced, listing each Abraham solute descriptor along with its % Var explained and its most important descriptor:

Solute Coefficient  % Var Explained  Meaning [1],[2]                                                            varImpPlot() Descriptor
E                   94.59%           excess molar refraction                                                    khs.sF
S                   77.04%           dipolarity/polarizability                                                  nAtomP
A                   64.9%            hydrogen-bond acidity                                                      khs.aasN
B                   88.62%           hydrogen-bond basicity                                                     nHBAcc
V                   95.36%           McGowan characteristic volume                                              VABC
L                   93.96%           logarithm of the solute gas-to-n-hexadecane partition coefficient at 298 K  ATSp1


To compare the results for S, A, and B from Model001 with the results obtained here, the out-of-bag (OOB) value for R2 (% Var explained) must be converted.

Using the formula for R2,


\[ R^2 = 1 - \frac{\sum_i (y_i - f_i)^2}{\sum_i (y_i - \bar{y})^2} \]

where y_i are the observed values, f_i are the predicted values, and \bar{y} is the mean of the observed values in the datasets for the Abraham solute descriptors. The R2 values were calculated with R2_SAB.zip and are compared to the multiple R2 values obtained in Model001 in the following table.
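
The R2 calculation described above can be sketched in a few lines of R. This is an illustrative sketch of the standard coefficient of determination, not the contents of R2_SAB.zip; the function name r_squared and the example vectors are hypothetical.

```r
# Coefficient of determination from observed (y) and predicted (f) values.
r_squared <- function(y, f) {
  stopifnot(length(y) == length(f))
  ss_res <- sum((y - f)^2)        # residual sum of squares
  ss_tot <- sum((y - mean(y))^2)  # total sum of squares about the mean
  1 - ss_res / ss_tot
}

# Hypothetical observed and predicted values, for illustration only.
observed  <- c(1, 2, 3, 4)
predicted <- c(1.1, 1.9, 3.2, 3.8)
print(r_squared(observed, predicted))
```

Note that predict(mydata.rf, mydata), as used in the R sessions above, returns predictions for the same solutes the forest was trained on. An R2 computed from such predictions measures goodness of fit, is not an out-of-bag estimate, and will generally be higher than the % Var explained reported by print(mydata.rf).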

Solute Coefficient  R2 Model002a  R2 Model001
S                   0.9574        0.8384
A                   0.9329        0.6429
B                   0.9809        0.8367

References

[1] Paul C.M. van Noort. Solvation thermodynamics and the physical–chemical meaning of the constant in Abraham solvation equations. Chemosphere (2011), doi:10.1016/j.chemosphere.2011.11.073
[2] Laura M. Grubbs, Mariam Saifullah, Nohelli E. De La Rosa, Shulin Ye, Sai S. Achi, William E. Acree Jr., Michael H. Abraham. Mathematical correlations for describing solute transfer into functionalized alkane solvents containing hydroxyl, ether, ester or ketone solvents. Fluid Phase Equilibria 298 (2010) 48–53, doi:10.1016/j.fluid.2010.07.00