Predicting Abraham Descriptors - Model002b

Researcher
Lori Fielding

Objective

To model the Abraham solute coefficients E, S, A, B, V, and L from chemical structure using CDK descriptors. Model002a and Model002b were carried out by different researchers but use the same data, procedure, and techniques.


Procedure

The up-to-date (as of February 23, 2012) 20120223AcreeCompilationDescriptorsAddedToWebService.csv was used as the initial data set.
With help from Matthew Wilson, a TCL script, CSVduplicates.tcl, was used to delete the duplicate entries while merging any missing descriptor entries, giving 20120223AcreeCompilationDescriptorsAddedToWebService_DuplicateFree.csv.
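The idea behind the duplicate removal can be sketched in R. This is not CSVduplicates.tcl itself, only a minimal illustration that assumes duplicates are identified by an identical SMILES string and that a missing descriptor is stored as NA. The data frame, its column names, and its values are made up for the example.

```r
# Hypothetical miniature of the compilation: duplicate rows carry
# complementary descriptor values, with NA where a value is missing.
solutes <- data.frame(
  SMILES = c("CCO", "CCO", "c1ccccc1", "CCCC"),
  E = c(0.246, NA, 0.610, 0.000),
  S = c(NA, 0.42, 0.52, 0.00),
  stringsAsFactors = FALSE
)

# Return the first non-missing value in a vector (NA if all are missing).
first_known <- function(x) {
  known <- x[!is.na(x)]
  if (length(known) > 0) known[1] else NA
}

# Collapse rows sharing the same key into one row, filling each gap
# with the first value found among the duplicates.
merge_duplicates <- function(df, key) {
  groups <- split(df, df[[key]])
  merged <- lapply(groups, function(g) {
    as.data.frame(lapply(g, first_known), stringsAsFactors = FALSE)
  })
  out <- do.call(rbind, merged)
  rownames(out) <- NULL
  out
}

deduplicated <- merge_duplicates(solutes, "SMILES")
print(deduplicated)
```

Taking the first known value is a simplification; if duplicate rows disagree on a descriptor, a real script would need a rule for choosing between them.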

CDK descriptors were calculated for 2,481 out of 2,486 solutes from 20120226AcreeSMILESreduced.csv using Rajarshi Guha's CDK Descriptor Calculator. All 2D descriptors were calculated except the following: Ionization Potential, Charged Partial Surface Areas, Protein, Geometrical, and WHIM. The option 'Add Explicit H' was selected.

The CDK Descriptor Calculator reported errors when generating the descriptors for the complete SMILES csv file. SMILES numbers 1651-1656 were the cause of the errors:
No.    Name                             SMILES
1651   tetramethyltin                   C[Sn](C)(C)C
1652   tetraethyltin                    CC[Sn](CC)(CC)CC
1653   tetraethyllead                   CC[Pb](CC)(CC)CC
1654   ferrocene                        [cH-]1cccc1.[cH-]1cccc1.[Fe+2]
1655   germanium tetrachloride          Cl[Ge](Cl)(Cl)Cl
1656   methyl mercuric (II) chloride    [Cl-].[Hg+]C

These six entries were removed to create the reduced csv file, 20120226AcreeSMILESreduced.csv, and the descriptors were then generated for the remaining solutes.
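Removing the failing entries by number is a simple row filter. A minimal sketch in R follows, assuming the SMILES file has been read into a data frame with an identifier column. The column names "id", "name", and "smiles" are assumptions, and rows 1650 and 1657 are placeholders, not entries from the actual file.

```r
# The six entries reported as causing calculator errors.
failed_ids <- 1651:1656

# Hypothetical excerpt of the SMILES file (placeholder rows 1650 and 1657).
smiles <- data.frame(
  id = c(1650, 1651, 1654, 1656, 1657),
  name = c("placeholder A", "tetramethyltin", "ferrocene",
           "methyl mercuric (II) chloride", "placeholder B"),
  smiles = c("CCO", "C[Sn](C)(C)C", "[cH-]1cccc1.[cH-]1cccc1.[Fe+2]",
             "[Cl-].[Hg+]C", "CCCC"),
  stringsAsFactors = FALSE
)

# Keep only the rows whose id is not in the failing set.
reduced <- smiles[!(smiles$id %in% failed_ids), ]
print(reduced$id)
```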

The descriptors Kier3 and HybRatio were removed because they contained "NA" entries.
The generated descriptors were merged with the Abraham solute coefficients, SMILES, and names in the following file: 20120228AcreeSMILESreducedCDKoutREADYforModeling.csv
This file was then split up so that each Abraham descriptor could be modeled separately.
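Dropping descriptor columns that contain NA entries, as was done for Kier3 and HybRatio, can also be done in R. The sketch below assumes the calculator output has been loaded into a data frame; the values are invented, and TopoPSA stands in for any complete column.

```r
# Hypothetical descriptor table; the values are made up for illustration.
descriptors <- data.frame(
  molID = 1:3,
  Kier3 = c(1.2, NA, 0.8),
  HybRatio = c(NA, 0.5, 1.0),
  TopoPSA = c(20.2, 0.0, 37.3)
)

# Flag every column that has at least one missing value.
has_na <- vapply(descriptors, function(col) any(is.na(col)), logical(1))

# Keep only the complete columns.
complete <- descriptors[, !has_na, drop = FALSE]
print(names(descriptors)[has_na])
print(names(complete))
```

Removing whole columns keeps every solute in the data set at the cost of two descriptors; removing the affected rows instead would keep the descriptors but shrink the training set.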

Building the Models


Random Forest. The randomForest package (v4.6-6) in R (v2.14.0) was used to build the random forest models in the following sections of code.

Abraham Solute Coefficient E

All the Abraham descriptors except E were removed from the 20120228AcreeSMILESreducedCDKoutREADYforModeling.csv file. The unavailable values for the descriptor E were labeled as "-123"; these rows had to be removed from the file with the following TCL scripts from Model002a, run in this order:
CSVreplaceCommas.tcl replaces the commas in the name field with "|".
Remove123.tcl removes the rows where the value for E is "-123", stores the result in the outfile, and writes the deleted rows into the delfile.
CSVreplaceBars.tcl replaces the "|" in the name field with ",".
The output of the run through the TCL scripts is 20120311EwithDescriptors_123removed_withNames.csv.
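The row filtering performed by Remove123.tcl can be illustrated in R. This is a sketch of the same idea, not the TCL script, and it assumes the data are already in a data frame, so the comma-replacement steps are not needed. The rows and column names are made up.

```r
# Hypothetical excerpt: -123 marks an unavailable value of E.
coefficients <- data.frame(
  molID = 1:4,
  name = c("ethanol", "1,2-dichloroethane", "benzene", "butane"),
  E = c(0.25, -123, 0.61, -123),
  stringsAsFactors = FALSE
)

missing_code <- -123
is_missing <- coefficients$E == missing_code

# Rows kept for modeling (the "outfile") and rows set aside (the "delfile").
kept <- coefficients[!is_missing, ]
deleted <- coefficients[is_missing, ]
print(kept)
print(deleted)
```

Keeping the deleted rows in a separate table, as the TCL script does with its delfile, makes it easy to check later which solutes were excluded from each model.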

After running these scripts, the SMILES, name, and csid fields were removed, leaving the csv file 20120311EwithDescriptorsReadyforModeling.csv ready to be modeled with R.
The SMILES, name, and csid fields will be added back after modeling.


> require(randomForest)
Loading required package: randomForest
randomForest 4.6-6
Type rfNews() to see new features/changes/bug fixes.
 
# Read input file
> mydata=read.csv(file="20120311EwithDescriptorsReadyforModeling.csv",head=TRUE,row.names="molID")
 
# Do randomForest
> mydata.rf <- randomForest(E ~., data = mydata,importance=TRUE)
> test.predict <- predict(mydata.rf,mydata)
 
# Write out results
> write.csv(test.predict, file="RFTestPredict_E.csv")
> varImpPlot(mydata.rf,main = "Random Forest Variable Importance: E")
 
[output]  #see following graph
 
> print(mydata.rf)
 
Call:
 randomForest(formula = E ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 68
 
          Mean of squared residuals: 0.03477053
                    % Var explained: 94.44
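
The two summary figures printed above are related by a short calculation. As far as I understand the randomForest package, for regression the mean of squared residuals is computed from the out-of-bag predictions, and "% Var explained" is 100 * (1 - MSE / variance of the response), with the variance taken with divisor n. The default number of variables tried at each split is max(floor(p/3), 1) for p predictors. The sketch below uses base R only; the observed and predicted values are made up and are not taken from the E data set.

```r
# Made-up observed values and out-of-bag style predictions.
observed <- c(0.10, 0.45, 0.80, 1.20, 1.65)
predicted <- c(0.15, 0.40, 0.90, 1.10, 1.60)

n <- length(observed)
mse <- mean((observed - predicted)^2)

# Variance of the response with divisor n rather than n - 1.
response_variance <- var(observed) * (n - 1) / n
pct_var_explained <- 100 * (1 - mse / response_variance)
print(mse)
print(pct_var_explained)

# Default mtry for regression: max(floor(p / 3), 1).
default_mtry <- function(p) max(floor(p / 3), 1)

# Which predictor counts are consistent with the reported value of 68?
predictor_counts <- 200:210
mtry_values <- vapply(predictor_counts, default_mtry, numeric(1))
consistent_p <- predictor_counts[mtry_values == 68]
print(consistent_p)
```

Under these assumptions, the reported 68 variables per split implies that between 204 and 206 descriptors were used as predictors.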
 

The output from the varImpPlot() command in R yields the following graph.

20120311varImpPlot_E.png

Plotting the known values against the predicted values with Tableau Public gives the following plot. The points are colored according to the TopoPSA descriptor.

002b_E.png

Abraham Solute Coefficient S
The same procedure as for E was used to generate 20120311SwithDescriptors_123removed_withNames.csv. The names and SMILES fields were removed, and the resulting file was used as input for R.

The data sheet was used as input to R and the randomForest package with the following code:

> require(randomForest)
Loading required package: randomForest
randomForest 4.6-6
Type rfNews() to see new features/changes/bug fixes.
> mydata=read.csv(file="20120311SwithDescriptorsReadyforModeling.csv",head=TRUE,row.names="molID")
> mydata.rf <- randomForest(S ~., data = mydata,importance=TRUE)
> test.predict <- predict(mydata.rf,mydata)
> write.csv(test.predict, file="RFTestPredict_S.csv")
> varImpPlot(mydata.rf,main = "Random Forest Variable Importance: S")
 
[output] ## see following graph
 
> print(mydata.rf)
 
Call:
 randomForest(formula = S ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 68
 
          Mean of squared residuals: 0.1216817
                    % Var explained: 76.66

The varImpPlot() command gives the following graph of variable importance:

20120311varImpPlot_S.png

Plotting the known values versus the predicted values for the Abraham solute descriptor S with Tableau Public yields the following scatter plot. The points are colored according to the nAtomP descriptor.

002b_S.png

Abraham Solute Coefficient A
The same procedure was used to generate 20120311AwithDescriptors_123removed_withNames.csv. The names and SMILES fields were removed, and the resulting file was used as input for R.

The data sheet was input and run through the randomForest package with the following R code:

> require(randomForest)
Loading required package: randomForest
randomForest 4.6-6
Type rfNews() to see new features/changes/bug fixes.
> mydata=read.csv(file="20120311AwithDescriptorsReadyforModeling.csv",head=TRUE,row.names="molID")
> mydata.rf <- randomForest(A ~., data = mydata,importance=TRUE)
> test.predict <- predict(mydata.rf,mydata)
> write.csv(test.predict, file="RFTestPredict_A.csv")
> varImpPlot(mydata.rf,main = "Random Forest Variable Importance: A")
 
[output] ## see following graph
 
> print(mydata.rf)
 
Call:
 randomForest(formula = A ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 68
 
          Mean of squared residuals: 0.03489871
                    % Var explained: 64.44

The varImpPlot() command gives the following graph of variable importance:

20120311varImpPlot_A.png

Plotting the known values versus the predicted values for the Abraham solute descriptor A with Tableau Public yields the following plot. The points are colored according to the khs.aasN descriptor.

002b_A.png



Abraham Solute Coefficient B
The same procedure was used to generate 20120311BwithDescriptors_123removed_withNames.csv. The names and SMILES fields were removed, and the resulting file was used as input for R.

The data sheet was input and run through the randomForest package with the following R code:

> require(randomForest)
Loading required package: randomForest
randomForest 4.6-6
Type rfNews() to see new features/changes/bug fixes.
> mydata=read.csv(file="20120311BwithDescriptorsReadyforModeling.csv",head=TRUE,row.names="molID")
> mydata.rf <- randomForest(B ~., data = mydata,importance=TRUE)
> test.predict <- predict(mydata.rf,mydata)
> write.csv(test.predict, file="RFTestPredict_B.csv")
> varImpPlot(mydata.rf,main = "Random Forest Variable Importance: B")
 
[output] ## see following graph
 
> print(mydata.rf)
 
Call:
 randomForest(formula = B ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 68
 
          Mean of squared residuals: 0.02826724
                    % Var explained: 88.6

The varImpPlot() command gives the following graph of variable importance:

20120311varImpPlot_B.png

Plotting the known values versus the predicted values for the Abraham solute descriptor B with Tableau Public yields the following plot. The points are colored according to the nHBAcc descriptor.

002b_B.png

Abraham Solute Coefficient V

The same procedure was used to generate 20120311VwithDescriptors_123removed_withNames.csv. The names and SMILES fields were removed, and the resulting file was used as input for R.

The data sheet was used as input to R and the randomForest package with the following code:

> require(randomForest)
Loading required package: randomForest
randomForest 4.6-6
Type rfNews() to see new features/changes/bug fixes.
> mydata=read.csv(file="20120311VwithDescriptorsReadyforModeling.csv",head=TRUE,row.names="molID")
> mydata.rf <- randomForest(V ~., data = mydata,importance=TRUE)
> test.predict <- predict(mydata.rf,mydata)
> write.csv(test.predict, file="RFTestPredict_V.csv")
> varImpPlot(mydata.rf,main = "Random Forest Variable Importance: V")
 
[output] ## see following graph
 
> print(mydata.rf)
 
Call:
 randomForest(formula = V ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 68
 
          Mean of squared residuals: 0.02026566
                    % Var explained: 95.33

The varImpPlot() command gives the following graph of variable importance:

20120311varImpPlot_V.png

Plotting the known values versus the predicted values for the Abraham solute descriptor V with Tableau Public yields the following plot. The points are colored according to the VABC descriptor.

002b_V.png

Abraham Solute Coefficient L

The same procedure was used to generate 20120311LwithDescriptors_123removed_withNames.csv. The names and SMILES fields were removed, and the resulting file was used as input for R.

The data sheet was used as input to R and the randomForest package with the following code:

> require(randomForest)
Loading required package: randomForest
randomForest 4.6-6
Type rfNews() to see new features/changes/bug fixes.
> mydata=read.csv(file="20120311LwithDescriptorsReadyforModeling.csv",head=TRUE,row.names="molID")
> mydata.rf <- randomForest(L ~., data = mydata,importance=TRUE)
> test.predict <- predict(mydata.rf,mydata)
> write.csv(test.predict, file="RFTestPredict_L.csv")
> varImpPlot(mydata.rf,main = "Random Forest Variable Importance: L")
 
[output] ## see following graph
 
> print(mydata.rf)
 
Call:
 randomForest(formula = L ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 68
 
          Mean of squared residuals: 0.5894518
                    % Var explained: 93.81

The varImpPlot() command gives the following graph of variable importance:

20120311varImpPlot_L.png

Plotting the known values versus the predicted values for the Abraham solute descriptor L with Tableau Public yields the following plot. The points are colored according to the ATSp1 descriptor.

002b_L.png

[Write a conclusion and summarize all the numbers in a table -AL]
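
As a starting point for the summary table, the figures printed by print(mydata.rf) in the six sections above can be collected in R. The sketch below only tabulates the reported numbers and ranks the models by % Var explained; it is not the conclusion itself. The column names and the added RMSE column are my own choices.

```r
# Values copied from the print(mydata.rf) output for each descriptor.
results <- data.frame(
  descriptor = c("E", "S", "A", "B", "V", "L"),
  mse = c(0.03477053, 0.1216817, 0.03489871,
          0.02826724, 0.02026566, 0.5894518),
  pct_var_explained = c(94.44, 76.66, 64.44, 88.6, 95.33, 93.81),
  stringsAsFactors = FALSE
)

# Root mean squared error, in the units of each descriptor.
results$rmse <- sqrt(results$mse)

# Rank the six models from most to least variance explained.
ranked <- results[order(-results$pct_var_explained), ]
print(ranked, row.names = FALSE)
```

The mean squared residuals are not directly comparable across descriptors because each descriptor has its own scale, which is why the ranking uses % Var explained.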