ASM002f

First, download the dataset and open it in a [|spreadsheet editor]. Next, copy the SMILES column and save it to a new text document. Open the [|CDK gui] and calculate the descriptors with the 'add explicit H' option enabled and all 2-D descriptors selected except CPSA and IP, saving the output file as comma delimited.

Then open the original dataset, copy the logS column (be sure to remove the extraneous words from its header), and paste it into the new CSV file produced by the CDK gui.


 * Note: I actually did not do this in a spreadsheet editor because it rounded some numbers, but this should not cause a problem.
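The copy-and-paste steps above can also be done without a spreadsheet editor, which avoids the rounding mentioned in the note. The following is a minimal Python sketch, not the attached script: the helper names, the column names `SMILES` and `logS`, and any file names other than "smile_description.csv" are assumptions for illustration. It also assumes the CDK output keeps the molecules in the same order as the input.

```python
import csv


def read_column(csv_path, column):
    """Return one named column of a CSV file as a list of strings."""
    with open(csv_path, newline="") as src:
        return [row[column] for row in csv.DictReader(src)]


def write_lines(values, out_path):
    """Write values to a text file, one per line (e.g. SMILES for the CDK gui)."""
    with open(out_path, "w") as dst:
        for value in values:
            dst.write(value + "\n")


def append_column(descriptor_path, values, header, out_path):
    """Append a column to an existing CSV file and save the result.

    Values are kept as strings, so no number is rounded along the way.
    """
    with open(descriptor_path, newline="") as src:
        rows = list(csv.reader(src))
    if len(rows) - 1 != len(values):
        raise ValueError("descriptor file and new column differ in length")
    rows[0].append(header)
    for row, value in zip(rows[1:], values):
        row.append(value)
    with open(out_path, "w", newline="") as dst:
        csv.writer(dst).writerows(rows)
```

A possible use, with hypothetical input names: `write_lines(read_column("dataset.csv", "SMILES"), "smiles.txt")` before running the CDK gui, then `append_column("cdk_output.csv", read_column("dataset.csv", "logS"), "logS", "smile_description.csv")` afterwards.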

Have [|python], [|rpy], and [|GNU R] installed and added to your default PATH, then run the attached Python script.
 * Note: the Python script assumes that the input file is called "smile_description.csv" (without the quotes).

The script removes any column that has 3 or more entries of 'NA' or '0', then proceeds to remove any row that has an 'NA'. It then passes the cleaned data to R via [|rpy] so that the [|caret library] can delete the duplicated columns.
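The column and row filtering described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the attached script: it counts 'NA' and zero entries together against the threshold of 3, treats any numeric zero (such as '0.0') as '0', and never drops the `Title` and `logS` columns. The function names and the `protected` parameter are hypothetical.

```python
def is_flagged(cell):
    """True for 'NA' or for a numeric zero such as '0' or '0.0'."""
    if cell == "NA":
        return True
    try:
        return float(cell) == 0.0
    except ValueError:
        return False


def clean(rows, threshold=3, protected=("Title", "logS")):
    """Filter a table given as a list of rows, with the header first.

    Step 1: drop every unprotected column with `threshold` or more
            flagged entries ('NA' or zero).
    Step 2: drop every remaining row that still contains an 'NA'.
    """
    header, body = rows[0], rows[1:]
    keep = [
        i for i, name in enumerate(header)
        if name in protected
        or sum(is_flagged(row[i]) for row in body) < threshold
    ]
    kept_rows = [[row[i] for i in keep] for row in body]
    kept_rows = [row for row in kept_rows if "NA" not in row]
    return [[header[i] for i in keep]] + kept_rows
```

Dropping the sparse columns before the 'NA' rows matters: done in the other order, a single mostly empty descriptor could remove most of the molecules.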

This leaves 77 descriptors and 2273 molecules.
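In the original workflow the duplicate-column step is done by caret in R, and this page does not say which caret function the script calls. As a rough stand-in only, the Python sketch below drops columns whose values are exactly identical to an earlier column. That may well be narrower than what caret did, and the function name is hypothetical.

```python
def drop_duplicate_columns(rows):
    """Remove columns whose values repeat an earlier column exactly.

    `rows` is a list of rows with the header first. Cells are compared
    as strings, so '1' and '1.0' count as different values.
    Returns the filtered table and the names of the dropped columns.
    """
    header, body = rows[0], rows[1:]
    seen = set()
    keep, dropped = [], []
    for i, name in enumerate(header):
        values = tuple(row[i] for row in body)
        if values in seen:
            dropped.append(name)
        else:
            seen.add(values)
            keep.append(i)
    return [[row[i] for i in keep] for row in rows], dropped
```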

Afterwards it drops to a [|shell], executes R from the command line, and passes the de-duplicated file through the [|randomForest] model. The resulting output:
 * Note: the reason I didn't pass this step through [|rpy] is that when I did, rather than outputting the r-squared value, it output the entire tree-building process used to create that value. I'm not sure why.

> library("randomForest")
> mydata = read.csv(file="cleaned_removeDuplicates.csv",head=TRUE,row.names="Title")
> mydata.rf <- randomForest(logS ~ ., data = mydata,importance = TRUE)
> print(mydata.rf)

Call:
randomForest(formula = logS ~ ., data = mydata, importance = TRUE)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 25
Mean of squared residuals: 0.7168829
% Var explained: 83.63
>
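The shell step above can be sketched in Python roughly as follows. This is not the attached script: it assumes Python 3.7 or later, that `R` is on the PATH, and that feeding the commands to `R --no-save --quiet` on standard input is acceptable. The helper names `run_r` and `parse_fit` are hypothetical. Only the printed summary is captured, and the two reported numbers are pulled out with regular expressions.

```python
import re
import subprocess

# The same R commands as in the session shown above.
R_COMMANDS = """library("randomForest")
mydata = read.csv(file="cleaned_removeDuplicates.csv", head=TRUE, row.names="Title")
mydata.rf <- randomForest(logS ~ ., data = mydata, importance = TRUE)
print(mydata.rf)
"""


def run_r(commands=R_COMMANDS):
    """Feed commands to command-line R on stdin and return what it printed."""
    result = subprocess.run(
        ["R", "--no-save", "--quiet"],
        input=commands,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def parse_fit(output):
    """Extract (mean of squared residuals, % var explained) from the summary."""
    mse = re.search(r"Mean of squared residuals:\s*([-+0-9.eE]+)", output)
    var = re.search(r"% Var explained:\s*([-+0-9.eE]+)", output)
    if mse is None or var is None:
        raise ValueError("randomForest summary not found in R output")
    return float(mse.group(1)), float(var.group(1))
```

With R and the randomForest package installed, `parse_fit(run_r())` would return the two values as floats.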

Full output of the script (also attached for further research):

delete column 11 titled Wlambda1.unity
delete column 12 titled Wlambda2.unity
delete column 13 titled Wlambda3.unity
delete column 14 titled Wnu1.unity
delete column 15 titled Wnu2.unity
delete column 16 titled Wgamma1.unity
delete column 17 titled Wgamma2.unity
delete column 18 titled Wgamma3.unity
delete column 19 titled Weta1.unity
delete column 20 titled Weta2.unity
delete column 21 titled Weta3.unity
delete column 22 titled WT.unity
delete column 23 titled WA.unity
delete column 24 titled WV.unity
delete column 25 titled WK.unity
delete column 26 titled WG.unity
delete column 27 titled WD.unity
delete column 28 titled nAcid
delete column 30 titled naAromAtom
delete column 31 titled nAromBond
delete column 48 titled nBase
delete column 51 titled C1SP1
delete column 52 titled C2SP1
delete column 53 titled C1SP2
delete column 54 titled C2SP2
delete column 55 titled C3SP2
delete column 56 titled C1SP3
delete column 57 titled C2SP3
delete column 58 titled C3SP3
delete column 59 titled C4SP3
delete column 102 titled nHBDon
delete column 103 titled nHBAcc
delete column 105 titled khs.sLi
delete column 106 titled khs.ssBe
delete column 107 titled khs.ssssBe
delete column 108 titled khs.ssBH
delete column 109 titled khs.sssB
delete column 110 titled khs.ssssB
delete column 111 titled khs.sCH3
delete column 112 titled khs.dCH2
delete column 113 titled khs.ssCH2
delete column 114 titled khs.tCH
delete column 115 titled khs.dsCH
delete column 116 titled khs.aaCH
delete column 117 titled khs.sssCH
delete column 118 titled khs.ddC
delete column 119 titled khs.tsC
delete column 120 titled khs.dssC
delete column 121 titled khs.aasC
delete column 122 titled khs.aaaC
delete column 123 titled khs.ssssC
delete column 124 titled khs.sNH3
delete column 125 titled khs.sNH2
delete column 126 titled khs.ssNH2
delete column 127 titled khs.dNH
delete column 128 titled khs.ssNH
delete column 129 titled khs.aaNH
delete column 130 titled khs.tN
delete column 131 titled khs.sssNH
delete column 132 titled khs.dsN
delete column 133 titled khs.aaN
delete column 134 titled khs.sssN
delete column 135 titled khs.ddsN
delete column 136 titled khs.aasN
delete column 137 titled khs.ssssN
delete column 138 titled khs.sOH
delete column 139 titled khs.dO
delete column 140 titled khs.ssO
delete column 141 titled khs.aaO
delete column 142 titled khs.sF
delete column 143 titled khs.sSiH3
delete column 144 titled khs.ssSiH2
delete column 145 titled khs.sssSiH
delete column 146 titled khs.ssssSi
delete column 147 titled khs.sPH2
delete column 148 titled khs.ssPH
delete column 149 titled khs.sssP
delete column 150 titled khs.dsssP
delete column 151 titled khs.sssssP
delete column 152 titled khs.sSH
delete column 153 titled khs.dS
delete column 154 titled khs.ssS
delete column 155 titled khs.aaS
delete column 156 titled khs.dssS
delete column 157 titled khs.ddssS
delete column 158 titled khs.sCl
delete column 159 titled khs.sGeH3
delete column 160 titled khs.ssGeH2
delete column 161 titled khs.sssGeH
delete column 162 titled khs.ssssGe
delete column 163 titled khs.sAsH2
delete column 164 titled khs.ssAsH
delete column 165 titled khs.sssAs
delete column 166 titled khs.sssdAs
delete column 167 titled khs.sssssAs
delete column 168 titled khs.sSeH
delete column 169 titled khs.dSe
delete column 170 titled khs.ssSe
delete column 171 titled khs.aaSe
delete column 172 titled khs.dssSe
delete column 173 titled khs.ddssSe
delete column 174 titled khs.sBr
delete column 175 titled khs.sSnH3
delete column 176 titled khs.ssSnH2
delete column 177 titled khs.sssSnH
delete column 178 titled khs.ssssSn
delete column 179 titled khs.sI
delete column 180 titled khs.sPbH3
delete column 181 titled khs.ssPbH2
delete column 182 titled khs.sssPbH
delete column 183 titled khs.ssssPb
delete column 186 titled Kier3
delete column 187 titled nAtomLC
delete column 188 titled nAtomP
delete column 189 titled LipinskiFailures
delete column 190 titled nAtomLAC
delete column 212 titled nRotB

R's caret output = [11, 23, 24, 25, 26, 28, 30, 31, 38, 48, 49, 50, 51, 52, 53, 54, 56, 57, 58, 59, 60, 61, 62, 64, 65, 67, 68, 98, 100, 106, 108]

delete column 11 titled apol as duplicate
delete column 23 titled ATSp1 as duplicate
delete column 24 titled ATSp2 as duplicate
delete column 25 titled ATSp3 as duplicate
delete column 26 titled ATSp4 as duplicate
delete column 28 titled nB as duplicate
delete column 30 titled SCH-3 as duplicate
delete column 31 titled SCH-4 as duplicate
delete column 38 titled VCH-6 as duplicate
delete column 48 titled SP-0 as duplicate
delete column 49 titled SP-1 as duplicate
delete column 50 titled SP-2 as duplicate
delete column 51 titled SP-3 as duplicate
delete column 52 titled SP-4 as duplicate
delete column 53 titled SP-5 as duplicate
delete column 54 titled SP-6 as duplicate
delete column 56 titled VP-0 as duplicate
delete column 57 titled VP-1 as duplicate
delete column 58 titled VP-2 as duplicate
delete column 59 titled VP-3 as duplicate
delete column 60 titled VP-4 as duplicate
delete column 61 titled VP-5 as duplicate
delete column 62 titled VP-6 as duplicate
delete column 64 titled SPC-4 as duplicate
delete column 65 titled SPC-5 as duplicate
delete column 67 titled VPC-4 as duplicate
delete column 68 titled VPC-5 as duplicate
delete column 98 titled VABC as duplicate
delete column 100 titled WTPT-1 as duplicate
delete column 106 titled WPOL as duplicate
delete column 108 titled Zagreb as duplicate

leaves 77 descriptors, 2273 molecules

RANDOM FOREST

R version 2.15.2 (2012-10-26) -- "Trick or Treat"
Copyright (C) 2012 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library("randomForest")
> mydata = read.csv(file="cleaned_removeDuplicates.csv",head=TRUE,row.names="Title")
> mydata.rf <- randomForest(logS ~ ., data = mydata,importance = TRUE)
> print(mydata.rf)

Call:
randomForest(formula = logS ~ ., data = mydata, importance = TRUE)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 25

Mean of squared residuals: 0.7168829
% Var explained: 83.63
>