ASM002e

**Aqueous Solubility Model 002e**
Researcher: Katie Crosby

**Objective**
To create a general, open model that predicts the aqueous solubility of organic compounds using open data and open descriptors.

Procedure
**Calculating Descriptors**
Using Rajarshi Guha's CDK Descriptor Calculator GUI (v 1.3.7), I calculated all 2D CDK descriptors for the training set, with the 'add explicit H' option selected and comma-delimited output.

The resulting file was:

**Feature Selection**
Descriptors with fewer than 3 non-zero entries were removed: khs.sLi, khs.ssBe, khs.ssssBe, khs.ssBH, khs.sssB, khs.ssssB, khs.ssNH2, khs.sssNH, khs.ssssN, khs.sSiH3, khs.ssSiH2, khs.sssSiH, khs.ssssSi, khs.sPH2, khs.ssPH, khs.sssP, khs.sssssP, khs.sGeH3, khs.ssGeH2, khs.sssGeH, khs.ssssGe, khs.sAsH2, khs.ssAsH, khs.sssAs, khs.sssdAs, khs.sssssAs, khs.sSeH, khs.dSe, khs.ssSe, khs.aaSe, khs.dssSe, khs.ddssSe, khs.sSnH3, khs.ssSnH2, khs.sssSnH, khs.ssssSn, khs.sPbH3, khs.ssPbH2, khs.sssPbH, khs.ssssPb

Kier3 was also removed because its column contained NA values.
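The two removal rules above (too few non-zero entries, and NA values in a column) could be applied in base R roughly as follows. This is only a sketch on a made-up table: the column names `d_common`, `d_rare` and `d_with_na` are hypothetical stand-ins, and I am not claiming this is how the original filtering was actually done.

```r
# Toy descriptor table (hypothetical names and values, for illustration only).
desc <- data.frame(
  d_common  = c(1, 0, 2, 3, 1),          # 4 non-zero entries, no NA
  d_rare    = c(0, 0, 0, 1, 0),          # only 1 non-zero entry
  d_with_na = c(1.2, NA, 0.8, 2.1, 1.5)  # contains an NA
)

# Count non-zero entries per column, ignoring NA values.
nonzero.counts <- colSums(desc != 0, na.rm = TRUE)

# Flag columns that contain at least one NA.
has.na <- colSums(is.na(desc)) > 0

# Keep columns with at least 3 non-zero entries and no NA values.
keep <- nonzero.counts >= 3 & !has.na
filtered <- desc[, keep, drop = FALSE]

print(names(filtered))
```

Using `drop = FALSE` keeps the result a data frame even if only one column survives.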

The resulting file is:

Further feature selection was performed on the remaining set of 2273 molecules with 140 descriptors by using the caret package in R:

__Code:__

library("caret")
setwd("C:/Users/Katie/Documents/fall2012")
mydata = read.csv(file="smiles1.csv", header=TRUE, row.names="molID")
cor.mat = cor(mydata)
findCorrelation(cor.mat, cutoff = .95, verbose = TRUE)

__output:__
[1] 139 47 140 14 13 15 46 60 54 16 61 55 48 56 53 133 49 50 52
[20] 44 132 57 45 58 63 64 34 27 26

The 29 descriptors that caret recommended for removal were dropped: ATSp2, ATSp3, ATSp4, ATSp5, SCH-3, SCH-4, VCH-6, SP-0, SP-1, SP-2, SP-3, SP-4, SP-5, SP-6, VP-0, VP-1, VP-2, VP-3, VP-4, VP-5, VP-6, SPC-4, SPC-5, VPC-4, VPC-5, VABC, WTPT-1, WPOL, Zagreb
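The numbers printed by findCorrelation are column positions in cor.mat, so they have to be mapped back to descriptor names before removal. The base-R sketch below shows that mapping on a toy data set; the columns `x1`, `x2`, `x3` and the hard-coded `drop.idx` are assumptions for illustration, and the pair listing does not reproduce caret's rule for choosing which member of a correlated pair to drop.

```r
# Toy data (hypothetical): x2 is almost a copy of x1, x3 is independent.
set.seed(1)
x1 <- rnorm(50)
toy <- data.frame(x1 = x1, x2 = x1 + rnorm(50, sd = 0.01), x3 = rnorm(50))
cor.mat <- cor(toy)

# List each descriptor pair whose absolute correlation exceeds the cutoff.
# upper.tri() avoids the diagonal and counting each pair twice.
high <- which(abs(cor.mat) > 0.95 & upper.tri(cor.mat), arr.ind = TRUE)
high.pairs <- data.frame(
  first  = rownames(cor.mat)[high[, "row"]],
  second = colnames(cor.mat)[high[, "col"]]
)
print(high.pairs)

# Stand-in for the index vector returned by a correlation filter.
drop.idx <- 2

# Translate column positions into names, then drop those columns by name.
drop.names <- colnames(cor.mat)[drop.idx]
reduced <- toy[, setdiff(colnames(toy), drop.names), drop = FALSE]
print(colnames(reduced))
```

Dropping by name rather than by position avoids mistakes if the column order changes between files.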

This left 111 descriptors. 'logS' was added from the original data set file for the next step in the calculation. The resulting file is:

**Building the Model**
An initial set of 1000 randomly selected molecules was used to build a random forest model in R with the following code:
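How the 1000 molecules were drawn is not recorded here. One reproducible way to take such a random subset in base R is sketched below; the seed, the stand-in data frame `full` and its values are all hypothetical.

```r
# Stand-in for the full table of 2273 molecules (values are random placeholders).
set.seed(2012)  # hypothetical seed, fixed only so the draw can be repeated
full <- data.frame(logS = rnorm(2273),
                   row.names = paste0("mol", 1:2273))

# Draw 1000 distinct row positions without replacement.
train.idx <- sample(nrow(full), 1000)

# Split into the selected molecules and the remainder.
train <- full[train.idx, , drop = FALSE]
rest  <- full[-train.idx, , drop = FALSE]

print(c(nrow(train), nrow(rest)))
```

Keeping the remaining molecules in a separate table makes them available later as an independent check on the model.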

__Code:__

1) do random forest [randomForest 4.6-6]
library("randomForest")
setwd("C:/Users/Katie/Documents/fall2012")
mydata = read.csv(file="smiles2.csv", header=TRUE, row.names="molID")
mydata.rf <- randomForest(logS ~ ., data = mydata, importance = TRUE)
print(mydata.rf)

__output:__
Call: randomForest(formula = logS ~ ., data = mydata, importance = TRUE)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 37
Mean of squared residuals: 0.7435326
% Var explained: 83.02
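A few of the numbers in this output can be checked by hand. For regression, randomForest tries floor(p/3) variables at each split by default, which gives 37 for the 111 descriptors. The mean of squared residuals is computed on out-of-bag predictions, and "% Var explained" is approximately 100 * (1 - MSE / variance of logS). The short calculation below only rearranges the printed values; the variable names are my own.

```r
# Default number of variables tried per split for regression: floor(p / 3).
p <- 111
mtry.default <- max(floor(p / 3), 1)

# Values copied from the printed output.
mse     <- 0.7435326
pct.var <- 83.02

# Root mean squared error, in logS units.
rmse <- sqrt(mse)

# Variance of logS implied by the two printed values:
# pct.var / 100 = 1 - mse / var  =>  var = mse / (1 - pct.var / 100)
implied.var <- mse / (1 - pct.var / 100)

print(c(mtry = mtry.default, rmse = rmse, implied.var = implied.var))
```

The out-of-bag error therefore corresponds to roughly 0.86 log units, against a logS variance of about 4.38 in this set.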