Aqueous Model 002c

Researcher: Jessica Fitzgerald

Procedure

Calculating Descriptors
Using Rajarshi Guha's CDK Descriptor Calculator, all 2D CDK descriptors were calculated (i.e. protein and geometrical descriptors were excluded). CPSA, IP, bpol, and WHIM descriptors were also excluded. The descriptors were calculated with the 'add explicit H' option enabled.

Feature Selection
Removed descriptors with less than or equal to 3 non-zero entries:
khs.sLi, khs.ssBe, khs.ssssBe, khs.ssBH, khs.sssB, khs.ssssB, khs.ssNH2, khs.sssNH, khs.ssssN, khs.sSiH3, khs.ssSiH2, khs.sssSiH, khs.ssssSi, khs.sPH2, khs.ssPH, khs.sssP, khs.sssssP, khs.sGeH3, khs.ssGeH2, khs.sssGeH, khs.ssssGe, khs.sAsH2, khs.ssAsH, khs.sssAs, khs.sssdAs, khs.sssssAs, khs.sSeH, khs.dSe, khs.ssSe, khs.aaSe, khs.dssSe, khs.ddssSe, khs.sSnH3, khs.ssSnH2, khs.sssSnH, khs.ssssSn, khs.sPbH3, khs.ssPbH2, khs.sssPbH, khs.ssssPb
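
The sparsity filter above could be scripted along the following lines. This is only a sketch on a toy table: the data frame `descriptors`, its values, and the variable names are made up for illustration and are not the actual data set or the procedure that was used.

```r
# Toy descriptor table; column values are invented for illustration.
descriptors <- data.frame(
  nAtom   = c(12, 8, 15, 9, 20),   # 5 non-zero entries
  khs.sLi = c(0, 0, 1, 0, 0),      # 1 non-zero entry
  khs.dSe = c(0, 2, 0, 1, 3),      # 3 non-zero entries
  nRotB   = c(1, 0, 4, 2, 5)       # 4 non-zero entries
)

# Count the non-zero entries in each column (NA is not counted as non-zero).
nonzero_counts <- colSums(descriptors != 0, na.rm = TRUE)

# Descriptors with 3 or fewer non-zero entries are flagged for removal.
sparse   <- names(nonzero_counts)[nonzero_counts <= 3]
filtered <- descriptors[, nonzero_counts > 3, drop = FALSE]

print(sparse)
print(colnames(filtered))
```

Using `drop = FALSE` keeps the result a data frame even if only one descriptor survives the filter.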

Molecule 166 and the descriptors Kier3 and HybRatio were also removed because they contained multiple N/A values.
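
A check for missing values of this kind might look like the sketch below. The toy table, the molecule labels, and the rule used here (drop columns with more than one N/A, then drop any rows that still contain an N/A) are assumptions for illustration; the notebook does not record exactly how the N/As were located.

```r
# Toy table with missing values; names and numbers are illustrative only.
descriptors <- data.frame(
  Kier3    = c(1.2, NA, NA, 0.8),
  HybRatio = c(NA, 0.5, NA, 0.3),
  MW       = c(180.2, 46.1, NA, 78.1),
  nAtom    = c(21, 9, 14, 12),
  row.names = c("mol1", "mol2", "mol3", "mol4")
)

# Count missing values per descriptor and per molecule.
na_per_column <- colSums(is.na(descriptors))
na_per_row    <- rowSums(is.na(descriptors))

# Assumed rule: drop descriptors with more than one N/A ...
drop_cols <- names(na_per_column)[na_per_column > 1]
cleaned   <- descriptors[, !(colnames(descriptors) %in% drop_cols), drop = FALSE]

# ... then drop any molecule that still has an N/A.
cleaned <- cleaned[complete.cases(cleaned), , drop = FALSE]

print(drop_cols)
print(rownames(cleaned))
```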

This resulted in 167 descriptors with 2841 molecules.

Next, the R library 'caret' was used to identify highly correlated columns (r > 0.95) so that redundant descriptors could be deleted. The code is as follows:

> library("caret")
> ## load in data
> mydata = read.csv(file = "/Users/jessicafitzgerald/Desktop/DataSet002WithDeletedCells.csv", head=TRUE, row.names = "molID")
> ## correlation matrix
> cor.mat=cor(mydata)
> ## find correlation r > .95
> findCorrelation(cor.mat, cutoff = .95, verbose = TRUE)

[output]

[1] 166 164 63 62 27 26 64 32 28 158 60 29 61 65 68 70 69 156 76 66 77 71 12 157 72 73 74 79 80 50 13 43 47

[output]

Descriptors removed following the caret recommendations: nAcid, Apol, ATSm5, ATSp1, ATSp2, ATSp3, nBase, SCH-3, SCH-7, VCH-5, VC-6, SP-0, SP-1, SP-2, SP-3, SP-4, VP-7, SP-7, VP-0, VP-1, VP-2, VP-3, VP-4, VP-5, SPC-4, SP-5, SPC-6, VPC-4, VAdjMat, VABC, MW, WPATH, XLogP

The final set consisted of 134 descriptors and 2841 molecules.
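
The indices returned by findCorrelation refer to column positions in the correlation matrix, so they have to be mapped back to descriptor names before the columns are dropped; the 33 indices listed above account for the reduction from 167 to 134 descriptors. A minimal sketch of that mapping follows, using a toy table and a hypothetical index vector `drop_idx` rather than the real data.

```r
# Toy descriptor table; column names d1..d4 are placeholders.
descriptors <- data.frame(
  d1 = c(1, 2, 3),
  d2 = c(2, 4, 6),
  d3 = c(5, 1, 9),
  d4 = c(0, 7, 2)
)

# Hypothetical index vector in the form findCorrelation() returns:
# column positions in the matrix that was passed to cor().
drop_idx <- c(2)

# Look up the names before dropping, so the removal can be documented.
dropped_names <- colnames(descriptors)[drop_idx]

# Guard against an empty index vector: descriptors[, -integer(0)]
# would select zero columns rather than all of them.
if (length(drop_idx) > 0) {
  reduced <- descriptors[, -drop_idx, drop = FALSE]
} else {
  reduced <- descriptors
}

print(dropped_names)
print(colnames(reduced))
```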

Data--Building the Model

A random forest model was built in R from an initial set of 1000 molecules with the following code:

> library("randomForest")
> mydata = read.csv(file="/Users/jessicafitzgerald/Documents/Summer Online '12/12-13 Semester V/Math Modelling/DataSet002Complete.csv", head=TRUE, row.names="molID")
> ## do random Forest
> mydata.rf <- randomForest(logS ~ ., data = mydata, importance = TRUE)
> print(mydata.rf)

[output]

Call:
randomForest(formula = logS ~ ., data = mydata, importance = TRUE)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 44
Mean of squared residuals: 0.615971
% Var explained: 85.39

[output]
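
The printed summary can be sanity-checked with a little arithmetic. The sketch below assumes the package defaults were in effect (for regression, randomForest tries floor(p/3) variables at each split) and that "% Var explained" is computed as 100 * (1 - MSE / Var(logS)); the variable names are introduced here for illustration only.

```r
# Values copied from the summary above.
n_descriptors     <- 134
mse               <- 0.615971   # mean of squared residuals
pct_var_explained <- 85.39

# Default mtry for regression: a third of the predictors, at least 1.
default_mtry <- max(floor(n_descriptors / 3), 1)

# Root-mean-square error in logS units.
rmse <- sqrt(mse)

# Variance of logS implied by the two reported figures, assuming
# % Var explained = 100 * (1 - MSE / Var(logS)).
implied_var <- mse / (1 - pct_var_explained / 100)

print(default_mtry)
print(rmse)
print(implied_var)
```

The default mtry of 44 matches the "No. of variables tried at each split" line, which is consistent with 134 descriptors having been used as predictors. The RMSE works out to roughly 0.78 log units.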