
Predicting Abraham Descriptors - Model002b
Lori Fielding
**Researcher**

Objective
To model the Abraham solute coefficients E, S, A, B, V, and L from chemical structure using CDK descriptors. Model002a and Model002b were performed by different researchers but use the same data, procedure, and techniques.

Procedure
The up-to-date (as of February 23, 2012) file 20120223AcreeCompilationDescriptorsAddedToWebService.csv was used as the initial data set. Using a Tcl script, CSVduplicates.tcl, with help from Matthew Wilson, the duplicate entries were deleted while merging any missing descriptor entries. The result is 20120223AcreeCompilationDescriptorsAddedToWebService_DuplicateFree.csv.
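The idea behind deleting duplicates while merging missing entries can be sketched in R. CSVduplicates.tcl itself is not reproduced on this page, so this is only an illustration under stated assumptions: the key column `SMILES`, the helper name `merge_duplicates`, and the toy values are all hypothetical and are not taken from the original script or data.

```r
# Hypothetical helper: collapse rows that share a key, keeping the first
# non-missing value found in each column.
merge_duplicates <- function(df, key) {
  first_known <- function(x) {
    known <- x[!is.na(x)]
    if (length(known) > 0) known[1] else x[1]
  }
  groups <- split(df, df[[key]])
  merged <- lapply(groups, function(g) {
    as.data.frame(lapply(g, first_known), stringsAsFactors = FALSE)
  })
  out <- do.call(rbind, merged)
  rownames(out) <- NULL
  out
}

# Toy data (illustrative values): one solute appears twice with complementary gaps.
solutes <- data.frame(
  SMILES = c("c1ccccc1", "CCO", "c1ccccc1"),
  E = c(0.61, 0.25, NA),
  S = c(NA, 0.42, 0.52),
  stringsAsFactors = FALSE
)
deduplicated <- merge_duplicates(solutes, "SMILES")
print(deduplicated)
```

Each duplicated solute ends up as a single row in which a value missing from one copy is filled from the other.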

CDK descriptors were calculated for 2,481 out of 2,486 solutes from the file 20120226AcreeSMILESreduced.csv using Rajarshi Guha's CDK Descriptor Calculator. All 2D descriptors were calculated except the following: Ionization Potential, Charged Partial Surface Areas, Protein, Geometrical, and WHIM. The option 'Add Explicit H' was selected.

The CDK Descriptor Calculator encountered errors when generating the descriptors for the complete SMILES csv file. SMILES numbers 1651-1656 were the cause of the errors.
|| 1651 || tetramethyltin || C[Sn](C)(C)C ||
|| 1652 || tetraethyltin || CC[Sn](CC)(CC)CC ||
|| 1653 || tetraethyllead || CC[Pb](CC)(CC)CC ||
|| 1654 || ferrocene || [cH-]1cccc1.[cH-]1cccc1.[Fe+2] ||
|| 1655 || germanium tetrachloride || Cl[Ge](Cl)(Cl)Cl ||
|| 1656 || methyl mercuric (II) chloride || [Cl-].[Hg+]C ||

These were removed to create the reduced csv file linked to above, and the rest of the descriptors were generated.
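Removing the failing entries by number could look like the following in R. This is a minimal sketch: the column names `number` and `SMILES` and the neighbouring toy rows are assumptions, and the page does not say which tool was actually used for this step.

```r
# Toy stand-in for the SMILES file; column names and the rows numbered
# 1650 and 1657 are made up for illustration.
smiles <- data.frame(
  number = c(1650, 1651, 1654, 1656, 1657),
  SMILES = c("CCO", "C[Sn](C)(C)C", "[cH-]1cccc1.[cH-]1cccc1.[Fe+2]",
             "[Cl-].[Hg+]C", "CCC"),
  stringsAsFactors = FALSE
)

# SMILES numbers reported as causing errors in the descriptor calculator.
failing <- 1651:1656

reduced <- smiles[!(smiles$number %in% failing), ]
removed <- smiles[smiles$number %in% failing, ]
print(reduced)
```

Keeping the removed rows in their own table makes it easy to report which solutes were left out of the models.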

The descriptors Kier3 and HybRatio were removed due to "NA" entries. The generated descriptors were merged with the Abraham solute coefficients, SMILES, and names in the following file: 20120228AcreeSMILESreducedCDKoutREADYforModeling.csv. This file was then split up for each Abraham descriptor to be modeled.
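A minimal R sketch of this clean-up and merge is given below. It assumes that both tables share a `molID` column, which is an assumption borrowed from the `row.names="molID"` argument used later on this page. All values are made up.

```r
# Toy descriptor table (made-up values); Kier3 and HybRatio contain NA.
descriptors <- data.frame(
  molID = 1:3,
  Kier3 = c(1.2, NA, 0.8),
  HybRatio = c(NA, 0.5, 0.3),
  TopoPSA = c(20.2, 0.0, 37.3)
)

# Toy solute table; the real file also carries SMILES and names.
solutes <- data.frame(molID = 1:3, E = c(0.25, 0.61, 0.80))

# Drop every descriptor column that has at least one missing value.
has_na <- vapply(descriptors, function(col) any(is.na(col)), logical(1))
complete_descriptors <- descriptors[, !has_na, drop = FALSE]

# Join the remaining descriptors to the solute coefficients.
ready <- merge(solutes, complete_descriptors, by = "molID")
print(ready)
```

`read.csv` treats the string "NA" as a missing value by default, so columns with "NA" entries in a csv file would be caught by the same test.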

Building the Models

**Random Forest.** The randomForest package (v4.6-6) in R (v2.14.0) was used to build the random forest models in the following sections of code.

Abraham Solute Coefficient E
All the Abraham descriptors except E were removed from the 20120228AcreeSMILESreducedCDKoutREADYforModeling.csv file. The unavailable values for the descriptor E were labeled as "-123". These rows had to be filtered out of the file with the following Tcl scripts from Model002a, run in this order. CSVreplaceCommas.tcl replaces the commas in the name field with "|". Remove123.tcl removes the rows where the value for E is "-123", stores the result in the outfile, and writes the deleted rows into the delfile. CSVreplaceBars.tcl replaces the "|" in the name field with ",". The output of the run through the Tcl scripts is 20120311EwithDescriptors_123removed_withNames.csv.
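The same filtering can be sketched in R. The Tcl scripts are not reproduced on this page, so this is not a translation of them. The sketch assumes that the name field is quoted in the csv file, in which case `read.csv` keeps embedded commas intact and the comma-to-bar swap is not needed. The toy rows and output file names are hypothetical.

```r
# Toy CSV text standing in for the E modeling file (made-up values).
csv_text <- 'molID,name,E,TopoPSA
1,"1,2-dichloroethane",0.42,0
2,"ethanol",-123,20.23
3,"benzene",0.61,0'

all_rows <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Rows flagged with -123 have no experimental value for E.
missing_e <- all_rows$E == -123
kept <- all_rows[!missing_e, ]
deleted <- all_rows[missing_e, ]

# Counterparts of the outfile and delfile; these file names are placeholders.
write.csv(kept, file.path(tempdir(), "E_kept.csv"), row.names = FALSE)
write.csv(deleted, file.path(tempdir(), "E_deleted.csv"), row.names = FALSE)
print(kept)
```

If the name field is not quoted in the source file, the bar substitution described above is still required before the file can be parsed column by column.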

After running these scripts, the SMILES, name, and csid fields were removed, leaving the csv file 20120311EwithDescriptorsReadyforModeling.csv ready to be modeled with R. The SMILES, name, and csid fields are added back after modeling.
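One way to set the identifier fields aside and re-attach them afterwards is to key both tables on `molID`. The sketch below is only an illustration: the toy values, the csid placeholders, and the `predicted_E` column name are assumptions.

```r
# Toy table (made-up values, placeholder csid numbers).
full <- data.frame(
  molID = c(11, 12, 13),
  SMILES = c("CCO", "c1ccccc1", "CCC"),
  name = c("ethanol", "benzene", "propane"),
  csid = c(101, 102, 103),
  E = c(0.25, 0.61, 0.00),
  TopoPSA = c(20.23, 0, 0),
  stringsAsFactors = FALSE
)

# Split into identifier columns and the numeric modeling table.
id_cols <- c("SMILES", "name", "csid")
identifiers <- full[, c("molID", id_cols)]
modeling <- full[, setdiff(names(full), id_cols)]

# Stand-in for the model output: one prediction per molID.
predictions <- data.frame(molID = modeling$molID,
                          predicted_E = c(0.30, 0.60, 0.05))

# Re-attach the identifiers to the predictions.
annotated <- merge(identifiers, predictions, by = "molID")
print(annotated)
```

Keeping `molID` in both tables means the rows can be matched again even if their order changes during modeling.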

code format="rsplus"
> require(randomForest)
Loading required package: randomForest
randomForest 4.6-6
Type rfNews() to see new features/changes/bug fixes.
> # Read input file
> mydata=read.csv(file="20120311EwithDescriptorsReadyforModeling.csv",head=TRUE,row.names="molID")
> # Do randomForest
> mydata.rf <- randomForest(E ~., data = mydata,importance=TRUE)
> test.predict <- predict(mydata.rf,mydata)
> # Write out results
> write.csv(test.predict, file="RFTestPredict_E.csv")
> varImpPlot(mydata.rf,main = "Random Forest Variable Importance: E")
[output] ## see following graph
> print(mydata.rf)

Call:
 randomForest(formula = E ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 68

          Mean of squared residuals: 0.03477053
                    % Var explained: 94.44
code
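For reading the printout: in randomForest regression, "Mean of squared residuals" is computed from out-of-bag predictions, and "% Var explained" is a pseudo R-squared, 1 - MSE / Var(y), with the variance taken over n rather than n - 1. The values written by `predict(mydata.rf, mydata)` are predictions on the training data itself, so they will generally sit closer to the known values than the out-of-bag statistics suggest. The sketch below reproduces the formula on made-up numbers; the function name `pseudo_r2` is hypothetical.

```r
# Pseudo R-squared in the form reported by randomForest for regression:
# 1 - MSE / (variance of y with n in the denominator).
pseudo_r2 <- function(observed, predicted) {
  mse <- mean((observed - predicted)^2)
  n <- length(observed)
  1 - mse / (var(observed) * (n - 1) / n)
}

# Toy numbers (made up) to show the calculation.
observed <- c(0.2, 0.5, 0.9, 1.4)
predicted <- c(0.3, 0.4, 1.0, 1.3)
print(round(100 * pseudo_r2(observed, predicted), 2))

# Variance of E implied by the two statistics printed for the E model.
implied_variance <- 0.03477053 / (1 - 94.44 / 100)
print(round(implied_variance, 3))
```

The default number of variables tried at each split for regression is floor(p/3), so the reported value of 68 is consistent with 204 to 206 predictor columns.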

The output from the varImpPlot command in R yields the following graph.



Plotting the known values against the predicted values with Tableau Public gives the following plot. The points are colored according to the TopoPSA descriptor.



Abraham Solute Coefficient S
The same procedure used for E was used to generate 20120311SwithDescriptors_123removed_withNames.csv. The names and SMILES fields were removed and the resulting file was used as input for R.

The data sheet was used as input to R and the randomForest package with the following code:

code format="rsplus"
> require(randomForest)
Loading required package: randomForest
randomForest 4.6-6
Type rfNews() to see new features/changes/bug fixes.
> mydata=read.csv(file="20120311SwithDescriptorsReadyforModeling.csv",head=TRUE,row.names="molID")
> mydata.rf <- randomForest(S ~., data = mydata,importance=TRUE)
> test.predict <- predict(mydata.rf,mydata)
> write.csv(test.predict, file="RFTestPredict_S.csv")
> varImpPlot(mydata.rf,main = "Random Forest Variable Importance: S")
[output] ## see following graph
> print(mydata.rf)

Call:
 randomForest(formula = S ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 68

          Mean of squared residuals: 0.1216817
                    % Var explained: 76.66
code

The varImpPlot command gives the following graph of variable importance:



Plotting the known values versus the predicted values for the Abraham solute descriptor S with Tableau Public yields the following scatter plot. The points are colored according to the nAtomP descriptor.



Abraham Solute Coefficient A
The same procedure was used to generate 20120311AwithDescriptors_123removed_withNames.csv. The names and SMILES fields were removed and the resulting file was used as input for R.

The data sheet was input and run through the randomForest package with the following R code:

code format="rsplus"
> require(randomForest)
Loading required package: randomForest
randomForest 4.6-6
Type rfNews() to see new features/changes/bug fixes.
> mydata=read.csv(file="20120311AwithDescriptorsReadyforModeling.csv",head=TRUE,row.names="molID")
> mydata.rf <- randomForest(A ~., data = mydata,importance=TRUE)
> test.predict <- predict(mydata.rf,mydata)
> write.csv(test.predict, file="RFTestPredict_A.csv")
> varImpPlot(mydata.rf,main = "Random Forest Variable Importance: A")
[output] ## see following graph
> print(mydata.rf)

Call:
 randomForest(formula = A ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 68

          Mean of squared residuals: 0.03489871
                    % Var explained: 64.44
code

The varImpPlot command gives the following graph of variable importance:



Plotting the known values versus the predicted values for the Abraham solute descriptor A with Tableau Public yields the following plot. The points are colored according to the khs.aasN descriptor.



Abraham Solute Coefficient B
The same procedure was used to generate 20120311BwithDescriptors_123removed_withNames.csv. The names and SMILES fields were removed and the resulting file was used as input for R.

The data sheet was input and run through the randomForest package with the following R code:

code format="rsplus"
> require(randomForest)
Loading required package: randomForest
randomForest 4.6-6
Type rfNews() to see new features/changes/bug fixes.
> mydata=read.csv(file="20120311BwithDescriptorsReadyforModeling.csv",head=TRUE,row.names="molID")
> mydata.rf <- randomForest(B ~., data = mydata,importance=TRUE)
> test.predict <- predict(mydata.rf,mydata)
> write.csv(test.predict, file="RFTestPredict_B.csv")
> varImpPlot(mydata.rf,main = "Random Forest Variable Importance: B")
[output] ## see following graph
> print(mydata.rf)

Call:
 randomForest(formula = B ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 68

          Mean of squared residuals: 0.02826724
                    % Var explained: 88.6
code

The varImpPlot command gives the following graph of variable importance:



Plotting the known values versus the predicted values for the Abraham solute descriptor B with Tableau Public yields the following plot. The points are colored according to the nHBAcc descriptor.




Abraham Solute Coefficient V

The same procedure was used to generate 20120311VwithDescriptors_123removed_withNames.csv. The names and SMILES fields were removed and the resulting file was used as input for R.

The data sheet was used as input to R and the randomForest package with the following code:

code format="rsplus"
> require(randomForest)
Loading required package: randomForest
randomForest 4.6-6
Type rfNews() to see new features/changes/bug fixes.
> mydata=read.csv(file="20120311VwithDescriptorsReadyforModeling.csv",head=TRUE,row.names="molID")
> mydata.rf <- randomForest(V ~., data = mydata,importance=TRUE)
> test.predict <- predict(mydata.rf,mydata)
> write.csv(test.predict, file="RFTestPredict_V.csv")
> varImpPlot(mydata.rf,main = "Random Forest Variable Importance: V")
[output] ## see following graph
> print(mydata.rf)

Call:
 randomForest(formula = V ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 68

          Mean of squared residuals: 0.02026566
                    % Var explained: 95.33
code

The varImpPlot command gives the following graph of variable importance:



Plotting the known values versus the predicted values for the Abraham solute descriptor V with Tableau Public yields the following plot. The points are colored according to the VABC descriptor.




Abraham Solute Coefficient L

The same procedure was used to generate 20120311LwithDescriptors_123removed_withNames.csv. The names and SMILES fields were removed and the resulting file was used as input for R.

The data sheet was used as input to R and the randomForest package with the following code:

code format="rsplus"
> require(randomForest)
Loading required package: randomForest
randomForest 4.6-6
Type rfNews() to see new features/changes/bug fixes.
> # The read.csv step is missing from the recorded session; the file name below is
> # assumed to follow the pattern used for E, S, A, B, and V
> mydata=read.csv(file="20120311LwithDescriptorsReadyforModeling.csv",head=TRUE,row.names="molID")
> mydata.rf <- randomForest(L ~., data = mydata,importance=TRUE)
> test.predict <- predict(mydata.rf,mydata)
> write.csv(test.predict, file="RFTestPredict_L.csv")
> varImpPlot(mydata.rf,main = "Random Forest Variable Importance: L")
[output] ## see following graph
> print(mydata.rf)

Call:
 randomForest(formula = L ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 68

          Mean of squared residuals: 0.5894518
                    % Var explained: 93.81
code

The varImpPlot command gives the following graph of variable importance:



Plotting the known values versus the predicted values for the Abraham solute descriptor L with Tableau Public yields the following plot. The points are colored according to the ATSp1 descriptor.
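The six runs differ only in the response name and the file names, so they could be driven from one loop. The following is a sketch rather than the procedure actually used. The input name for L is assumed to follow the pattern of the other five files, and the fitting step runs only if the randomForest package and the input file are both present.

```r
descriptors <- c("E", "S", "A", "B", "V", "L")

# Per-descriptor file names, following the naming pattern used on this page.
runs <- data.frame(
  descriptor = descriptors,
  input = sprintf("20120311%swithDescriptorsReadyforModeling.csv", descriptors),
  output = sprintf("RFTestPredict_%s.csv", descriptors),
  stringsAsFactors = FALSE
)

# One formula per descriptor, e.g. E ~ .
formulas <- lapply(descriptors, function(d) reformulate(".", response = d))
names(formulas) <- descriptors

for (i in seq_len(nrow(runs))) {
  # Fit only where the package and the input file are both available.
  if (requireNamespace("randomForest", quietly = TRUE) && file.exists(runs$input[i])) {
    mydata <- read.csv(runs$input[i], header = TRUE, row.names = "molID")
    fit <- randomForest::randomForest(formulas[[i]], data = mydata, importance = TRUE)
    write.csv(predict(fit, mydata), file = runs$output[i])
  }
}
print(runs)
```

Driving all six models from one table keeps the settings identical across descriptors and avoids copy-and-paste slips between sections.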




**[Write a conclusion and summarize all the numbers in a table -AL]**
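The statistics printed by R for the six models can be gathered into one table. The sketch below uses only the values reported on this page. The column names are assumptions, and the RMSE column is derived here from the reported mean of squared residuals rather than taken from the R output.

```r
# Values copied from the print(mydata.rf) output of each model.
summary_table <- data.frame(
  descriptor = c("E", "S", "A", "B", "V", "L"),
  mse = c(0.03477053, 0.1216817, 0.03489871, 0.02826724, 0.02026566, 0.5894518),
  pct_var_explained = c(94.44, 76.66, 64.44, 88.6, 95.33, 93.81),
  stringsAsFactors = FALSE
)

# Root mean squared error, in the units of each descriptor.
summary_table$rmse <- round(sqrt(summary_table$mse), 3)

# Sort from the best-explained descriptor to the worst.
ranked <- summary_table[order(-summary_table$pct_var_explained), ]
print(ranked, row.names = FALSE)
```

Sorted this way, V and E are the best-explained descriptors and A is the least well explained, according to the reported "% Var explained" values.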