
Predicting Abraham Descriptors - Model002a
Jesse Patsolic and Matthew Wilson
 * **Researchers**

Objective
To model the Abraham solute coefficients E, S, A, B, V, and L from chemical structure using CDK descriptors. Model002a and Model002b were performed by different researchers but use the same data, procedure, and techniques.

Procedure
The up-to-date (as of February 23, 2012) [|20120223AcreeCompilationDescriptorsAddedToWebService.csv] was used as the initial data set. Using a TCL script, [|CSVduplicates.tcl], written with help from Matthew Wilson, duplicate entries were deleted while any missing descriptor entries were merged in, producing [|20120223AcreeCompilationDescriptorsAddedToWebService_DuplicateFree.csv].
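The duplicate-merging step can be sketched in Python (the actual implementation is the TCL script linked above; the field names and rows below are made up for illustration):

```python
from collections import OrderedDict

def merge_duplicates(rows, key="SMILES"):
    """Collapse duplicate rows on `key`, filling empty descriptor
    fields in the first copy from later copies (illustrative sketch)."""
    merged = OrderedDict()
    for row in rows:
        k = row[key]
        if k not in merged:
            merged[k] = dict(row)
        else:
            # keep the first copy, but fill in any fields it was missing
            for field, value in row.items():
                if not merged[k].get(field) and value:
                    merged[k][field] = value
    return list(merged.values())

rows = [
    {"SMILES": "CCO", "E": "0.25", "S": ""},
    {"SMILES": "CCO", "E": "", "S": "0.42"},  # duplicate carrying the missing S
    {"SMILES": "C",   "E": "0.00", "S": "0.00"},
]
print(merge_duplicates(rows))  # two rows; the CCO entry now has both E and S
```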

CDK descriptors were calculated for 2,481 of the 2,486 solutes in [|20120226AcreeSMILESreduced.csv] using Rajarshi Guha's CDK Descriptor Calculator. All 2D descriptors were calculated except the following classes: Ionization Potential, Charged Partial Surface Areas, Protein, Geometrical, and WHIM. The option 'Add Explicit H' was selected.

The CDK Descriptor Calculator encountered errors when generating descriptors for the complete SMILES csv file. SMILES entries 1651-1656 were the cause of the errors:
 * 1651 || tetramethyltin || C[Sn](C)(C)C ||
 * 1652 || tetraethyltin || CC[Sn](CC)(CC)CC ||
 * 1653 || tetraethyllead || CC[Pb](CC)(CC)CC ||
 * 1654 || ferrocene || [cH-]1cccc1.[cH-]1cccc1.[Fe+2] ||
 * 1655 || germanium tetrachloride || Cl[Ge](Cl)(Cl)Cl ||
 * 1656 || methyl mercuric (II) chloride || [Cl-].[Hg+]C ||

These were removed to create the reduced csv file linked to above, and the rest of the descriptors were generated.

The descriptors Kier3 and HybRatio were removed due to "NA" entries. The generated descriptors were merged with the Abraham solute coefficients, SMILES, and names in the following file: [|20120228AcreeSMILESreducedCDKoutREADYforModeling.csv]. This file was then split up for each Abraham descriptor to be modeled.
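The column removal can be sketched in Python; the descriptor names below match the text, but the values are made up for illustration:

```python
def drop_na_columns(header, rows, na_token="NA"):
    """Return header/rows with every column that contains na_token removed."""
    bad = {i for i in range(len(header))
           if any(row[i] == na_token for row in rows)}
    keep = [i for i in range(len(header)) if i not in bad]
    new_header = [header[i] for i in keep]
    new_rows = [[row[i] for i in keep] for row in rows]
    return new_header, new_rows

header = ["SMILES", "Kier3", "nAtom"]
rows = [["CCO", "NA", "9"],   # "NA" entry forces the Kier3 column out
        ["C", "0.5", "5"]]
h, r = drop_na_columns(header, rows)
print(h)  # Kier3 dropped
```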


 * **[Lori: model with this exact same file - you do not have to go through the curation independently -AL]**

Building the Models

 * **Random Forest.** The randomForest package (v4.6-6) in R (v2.14.0) was used to build the random forest models in the following sections of code.

Abraham Solute Coefficient E
All the Abraham coefficients except E were removed from the 20120228AcreeSMILESreducedCDKoutREADYforModeling.csv file. Missing values for the coefficient E were labeled "-123"; these had to be parsed out of the file with the following TCL scripts: [|CSVreplaceCommas.tcl] replaces the commas in the name field with "|"; [|Remove123.tcl] removes the rows where the value for E is "-123", storing the kept rows in the outfile and writing the deleted rows to the delfile; [|CSVreplaceBars.tcl] replaces the "|" in the name field with ",". The output of the run through the TCL scripts is [|20120301EwithDescriptors_123removed_withNames.csv].
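The effect of this three-script pipeline can be sketched in Python. Note that Python's csv module parses quoted commas in the name field directly, so the bar-substitution workaround is not needed; the sample rows are illustrative:

```python
import csv, io

def remove_missing(in_text, column="E", missing="-123"):
    """Split csv rows into (kept, deleted) on a missing-value marker,
    mirroring the outfile/delfile split of the TCL scripts (sketch)."""
    reader = csv.DictReader(io.StringIO(in_text))
    kept, deleted = [], []
    for row in reader:
        # csv handles the quoted comma in the name field for us
        (deleted if row[column] == missing else kept).append(row)
    return kept, deleted

sample = 'name,E\n"2,3-dimethylbutane",0.00\nunknown,-123\n'
kept, deleted = remove_missing(sample)
print(len(kept), len(deleted))  # one row kept, one deleted
```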

After running these scripts, the SMILES, name, and csid fields are removed, and the resulting csv file, [|20120301EwithDescriptorsReadyforModeling.csv], is ready to be modeled with R.

The SMILES and name fields are added back after modeling.

code format="rsplus"
> require(randomForest)
Loading required package: randomForest
randomForest 4.6-6
Type rfNews() to see new features/changes/bug fixes.

# 1) Read input file
> mydata=read.csv(file="20120301EwithDescriptorsReadyforModeling.csv",head=TRUE,row.names="molID")

# 2) Do randomForest
> mydata.rf <- randomForest(E ~., data = mydata,importance=TRUE)
> test.predict <- predict(mydata.rf,mydata)

# 3) Write out results
> write.csv(test.predict, file="RFTestPredict_E.csv")
> varImpPlot(mydata.rf,main = "Random Forest Variable Importance: E")
[output] # see following graph

> print(mydata.rf)

Call:
 randomForest(formula = E ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 68

          Mean of squared residuals: 0.03384625
                    % Var explained: 94.59
code

The output from the varImpPlot command in R yields the following graph.



Plotting the known values against the predicted values with Tableau Public gives the following plot. The points are colored according to the khs.sF descriptor.



Abraham Solute Coefficient S
The same procedure used for E was followed to generate [|20120302SwithDescriptors_123removed_withNames.csv]. The name and SMILES fields were removed and the resulting file was used as input for R.

The data sheet was used as input to R and the randomForest package with the following code:

code format="rsplus"
> require(randomForest)
Loading required package: randomForest
randomForest 4.6-6
Type rfNews() to see new features/changes/bug fixes.

> mydata=read.csv(file="20120302SwithDescriptorsREadyforModeling.csv",head=TRUE,row.names = "molID")
> mydata.rf <- randomForest(S ~., data = mydata,importance = TRUE)
> varImpPlot(mydata.rf, main= "Random Forest Variable Importance: S")
[output] ## see following graph

> test.predict <- predict(mydata.rf, mydata)
> write.csv(test.predict, file="RFTestPredict_S.csv")
> print(mydata.rf)

Call:
 randomForest(formula = S ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 68

          Mean of squared residuals: 0.1196888
                    % Var explained: 77.04
code

The varImpPlot command gives the following graph of variable importance:

Plotting the known values versus the predicted values for the Abraham solute descriptor S with Tableau Public yields the following scatter plot. The points are colored according to the nAtomP descriptor.



Abraham Solute Coefficient A
The same procedure was used to generate [|20120311AwithDescriptors_123removed_withNames.csv]. The name and SMILES fields were removed to use the file as input for R.

The data sheet was input and run through the randomForest package with the following R code:

code format="rsplus"
> require(randomForest)
> mydata = read.csv(file="20120302AwithDescriptorsReadyforModeling.csv",head=TRUE,row.names="molID")
> mydata.rf <- randomForest(A ~., data = mydata, importance = TRUE)
> test.predict <- predict(mydata.rf,mydata)
> write.csv(test.predict, file = "20120303RFTestPredict_A.csv")
> varImpPlot(mydata.rf,main="Random Forest Variable Importance: A")
[output] # see following graph

> print(mydata.rf)

Call:
 randomForest(formula = A ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 68

          Mean of squared residuals: 0.03444421
                    % Var explained: 64.9
code

The varImpPlot command gave the following graph of variable importance.



Plotting the known values versus the predicted values for the Abraham solute descriptor A with Tableau Public yields the following plot. The points are colored according to the khs.aasN descriptor.



Abraham Solute Coefficient B
The same procedure was used to generate [|20120302BwithDescriptors_123removed_withNames.csv]. The name and SMILES fields were removed and the resulting file was used as input for R.

The data sheet was input and run through the randomForest package with the following R code:

code format="rsplus"
> require(randomForest)
Loading required package: randomForest
randomForest 4.6-6
Type rfNews() to see new features/changes/bug fixes.

# 1) Read in data
> mydata=read.csv(file="20120302BwithDescriptorsReadyforModeling.csv",head=TRUE,row.names="molID")

# 2) Do randomForest
> mydata.rf <- randomForest(B ~., data = mydata,importance=TRUE)
> test.predict <- predict(mydata.rf,mydata)

# 3) Write output data
> write.csv(test.predict, file="RFTestPredict_B.csv")
> varImpPlot(mydata.rf,main = "Random Forest Variable Importance: B")
[output] ## see following graph

> print(mydata.rf)

Call:
 randomForest(formula = B ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 68

          Mean of squared residuals: 0.02822268
                    % Var explained: 88.62
code

The varImpPlot command gave the following graph of variable importance.



Plotting the known values versus the predicted values for the Abraham solute descriptor B with Tableau Public yields the following plot. The points are colored according to the nHBAcc descriptor.



Abraham Solute Coefficient V
The same procedure was used to generate [|20120303VwithDescriptors_123removed_withNames.csv]. The name and SMILES fields were removed and the resulting file was used as input for R.

The data sheet was used as input to R and the randomForest package with the following code:

code format="rsplus"
> require(randomForest)
Loading required package: randomForest
randomForest 4.6-6
Type rfNews() to see new features/changes/bug fixes.

# 1) Read in data
> mydata=read.csv(file="20120303VwithDescriptorsReadyforModeling.csv",head=TRUE,row.names="molID")

# 2) Do randomForest
> mydata.rf <- randomForest(V ~., data = mydata,importance=TRUE)
> test.predict <- predict(mydata.rf,mydata)

# 3) Write output data
> write.csv(test.predict, file="RFTestPredict_V.csv")
> varImpPlot(mydata.rf,main = "Random Forest Variable Importance: V")
[output] ## see following graph

> print(mydata.rf)

Call:
 randomForest(formula = V ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 68

          Mean of squared residuals: 0.02013338
                    % Var explained: 95.36
code

The varImpPlot command gives the following graph of variable importance:



Plotting the known values versus the predicted values for the Abraham solute descriptor V with Tableau Public yields the following plot. The points are colored according to the VABC descriptor.



Abraham Solute Coefficient L
The same procedure was used to generate [|20120304LwithDescriptors_123removed_withNames.csv]. The name and SMILES fields were removed and the resulting file was used as input for R.

The data sheet was used as input to R and the randomForest package with the following code:

code format="rsplus"
# 1) Load randomForest
> require(randomForest)

# 2) Read in data
> mydata=read.csv(file="20120304LwithDescriptorsReadyforModeling.csv",head=TRUE,row.names="molID")

# 3) Do randomForest
> mydata.rf <- randomForest(L ~., data = mydata,importance=TRUE)
> test.predict <- predict(mydata.rf,mydata)

# 4) Write output
> write.csv(test.predict, file="RFTestPredict_L.csv")
> varImpPlot(mydata.rf,main = "Random Forest Variable Importance: L")
[output] # see following graph

> print(mydata.rf)

Call:
 randomForest(formula = L ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 68

          Mean of squared residuals: 0.5755685
                    % Var explained: 93.96
code

The varImpPlot command gives the following graph of variable importance:



Plotting the known values versus the predicted values for the Abraham solute descriptor L with Tableau Public yields the following plot. The points are colored according to the ATSp1 descriptor.



Results
In summary, the following table lists each Abraham solute coefficient along with its % Var explained and most important descriptor from the R output:


 * Solute Coefficient || % Var Explained || Meaning [1],[2] || varImpPlot Descriptor ||
 * E || 94.59% || excess molar refraction || khs.sF ||
 * S || 77.04% || dipolarity/polarizability || nAtomP ||
 * A || 64.9% || hydrogen-bond acidity || khs.aasN ||
 * B || 88.62% || hydrogen-bond basicity || nHBAcc ||
 * V || 95.36% || McGowan characteristic volume || VABC ||
 * L || 93.96% || logarithm of the gas-to-hexadecane partition coefficient at 298 K || ATSp1 ||

To compare the results for S, A, and B from Model001 to these results for S, A, and B, the out-of-bag (OOB) value for R2 (% Var explained) must be converted.

Using the formula for R2,

R^2 = 1 - [ Σ_i (y_i - f_i)^2 ] / [ Σ_i (y_i - ȳ)^2 ],   with ȳ the mean of the y_i,

where y_i are the observed values and f_i are the predicted values of the datasets for the Abraham Solute Descriptors. The R2 values were calculated with [|R2_SAB.zip] and are compared to the multiple R2 values obtained in Model001 in the following table.
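This conversion amounts to computing R2 directly from the observed and predicted values (as [|R2_SAB.zip] does); a minimal Python sketch with made-up values, not the actual datasets:

```python
def r_squared(observed, predicted):
    """R^2 = 1 - SS_res / SS_tot for observed y_i and predicted f_i."""
    mean_y = sum(observed) / len(observed)
    ss_res = sum((y - f) ** 2 for y, f in zip(observed, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in observed)
    return 1 - ss_res / ss_tot

# illustrative values only
y = [0.1, 0.4, 0.9, 1.3]
f = [0.12, 0.38, 0.95, 1.25]
print(round(r_squared(y, f), 4))  # → 0.9932
```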


 * Solute Coefficient || R2 Model002a || R2 Model001 ||
 * S || 0.9574 || 0.8384 ||
 * A || 0.9329 || 0.6429 ||
 * B || 0.9809 || 0.8367 ||