ASM002c

**Aqueous Model 002c**

**Researcher:** Jessica Fitzgerald

Procedure

**Calculating Descriptors**

Using Rajarshi Guha's CDK Descriptor Calculator, all 2D CDK descriptors were calculated (i.e. protein and geometrical descriptors were excluded). CPSA, IP, bpol, and WHIM descriptors were also excluded. The descriptors were calculated using the 'add explicit H' option.

**Feature Selection**

Descriptors with three or fewer non-zero entries were removed: khs.sLi, khs.ssBe, khs.ssssBe, khs.ssBH, khs.sssB, khs.ssssB, khs.ssNH2, khs.sssNH, khs.ssssN, khs.sSiH3, khs.ssSiH2, khs.sssSiH, khs.ssssSi, khs.sPH2, khs.ssPH, khs.sssP, khs.sssssP, khs.sGeH3, khs.ssGeH2, khs.sssGeH, khs.ssssGe, khs.sAsH2, khs.ssAsH, khs.sssAs, khs.sssdAs, khs.sssssAs, khs.sSeH, khs.dSe, khs.ssSe, khs.aaSe, khs.dssSe, khs.ddssSe, khs.sSnH3, khs.ssSnH2, khs.sssSnH, khs.ssssSn, khs.sPbH3, khs.ssPbH2, khs.sssPbH, khs.ssssPb

Molecule 166 and the descriptors Kier3 and Hybratio were also removed because they contained multiple N/As.

This resulted in 167 descriptors with 2841 molecules.
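The two descriptor filters described above can be sketched in base R as follows. This is only an illustration on a made-up data frame, not the code that was actually run: the data frame `toy`, its values, and the choice of column names are assumptions, and the sketch drops columns only (it does not reproduce the removal of molecule 166).

```r
# Hypothetical toy descriptor table (all values are made up for illustration)
toy <- data.frame(
  khs.sLi = c(0, 0, 0, 0, 1, 0),            # 1 non-zero entry   -> removed
  nAtom   = c(5, 8, 3, 9, 4, 7),            # 6 non-zero entries -> kept
  Kier3   = c(1.2, NA, 0.7, NA, 2.1, 0.9),  # contains N/As      -> removed
  ALogP   = c(0.5, 1.1, 0, 2.3, 0, 1.8)     # 4 non-zero entries -> kept
)

# Count the non-zero entries in each column, ignoring N/As
nonzero <- colSums(toy != 0, na.rm = TRUE)

# Keep only descriptors with more than 3 non-zero entries
filtered <- toy[, nonzero > 3, drop = FALSE]

# Drop descriptors that still contain N/As
filtered <- filtered[, colSums(is.na(filtered)) == 0, drop = FALSE]

print(names(filtered))
```

Using `drop = FALSE` keeps the result a data frame even if only one descriptor survives a filter.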

Next, the R package 'caret' was used to identify columns that are highly correlated with one another (r > 0.95), so that the redundant descriptors could be deleted. The code is as follows:

> library("caret")
> ## load in data
> mydata = read.csv(file = "/Users/jessicafitzgerald/Desktop/DataSet002WithDeletedCells.csv", head=TRUE, row.names = "molID")
> ## correlation matrix
> cor.mat = cor(mydata)
> ## find correlation r > 0.95
> findCorrelation(cor.mat, cutoff = 0.95, verbose = TRUE)

[output]

 [1] 166 164 63 62 27 26 64 32 28 158 60 29 61 65 68 70 69 156 76 66 77 71 12 157 72 73 74 79 80 50 13 43 47

[output]

Descriptors removed following the caret recommendations: nAcid, Apol, ATSm5, ATSp1, ATSp2, ATSp3, nBase, SCH-3, SCH-7, VCH-5, VC-6, SP-0, SP-1, SP-2, SP-3, SP-4, VP-7, SP-7, VP-0, VP-1, VP-2, VP-3, VP-4, VP-5, SPC-4, SP-5, SPC-6, VPC-4, VAdjMat, VABC, MW, WPATH, XLogP

After removing these 33 descriptors, the final set consisted of 134 descriptors and 2841 molecules.
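The step from column indices to descriptor names can be sketched in base R as shown below. This is a simplified stand-in, not caret's algorithm: it drops the second member of every pair with |r| above the cutoff, whereas `findCorrelation` chooses which member to drop by comparing mean absolute correlations. The data frame `toy` and its columns `x1`, `x2`, `x3` are hypothetical.

```r
# Hypothetical toy data: x2 is almost a copy of x1, x3 is unrelated
set.seed(1)
x1 <- rnorm(50)
toy <- data.frame(
  x1 = x1,
  x2 = x1 + rnorm(50, sd = 0.01),
  x3 = rnorm(50)
)

cor.mat <- cor(toy)

# Flag each pair once (upper triangle) whose |r| exceeds the cutoff
high <- which(abs(cor.mat) > 0.95 & upper.tri(cor.mat), arr.ind = TRUE)

# Simplification (assumption): drop the second member of each flagged pair
drop.idx <- unique(high[, "col"])

# Translate the column indices into descriptor names
drop.names <- colnames(cor.mat)[drop.idx]
print(drop.names)

# Remove the flagged columns by name rather than by position
reduced <- toy[, setdiff(names(toy), drop.names), drop = FALSE]
print(names(reduced))
```

Removing columns by name rather than by index avoids mistakes if the column order of the data file changes between steps.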

Data--Building the Model
Using the R package 'randomForest', an initial set of 1000 molecules was used to build a random forest model with the following code:

> library("randomForest")
> mydata = read.csv(file="/Users/jessicafitzgerald/Documents/Summer Online '12/12-13 Semester V/Math Modelling/DataSet002Complete.csv", head=TRUE, row.names="molID")
> ## do random Forest
> mydata.rf <- randomForest(logS ~ ., data = mydata, importance = TRUE)
> print(mydata.rf)

[output]

Call:
randomForest(formula = logS ~ ., data = mydata, importance = TRUE)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 44
Mean of squared residuals: 0.615971
% Var explained: 85.39

[output]
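The two summary statistics in the output are related. For a regression forest, randomForest reports "% Var explained" as 100 × (1 − MSE / Var(logS)), where the MSE is the mean of squared out-of-bag residuals. A minimal sketch of what the printed values imply, using only the two numbers reported above (the variable names are illustrative, and the results are subject to the rounding of the printed values):

```r
# Values copied from the printed model summary
mse     <- 0.615971   # Mean of squared residuals
pct.var <- 85.39      # % Var explained

# % Var explained = 100 * (1 - mse / var.y), so solve for var.y
var.y <- mse / (1 - pct.var / 100)

# Root mean squared residual, in the same units as logS
rmse <- sqrt(mse)

cat(sprintf("Implied variance of logS: %.2f\n", var.y))
cat(sprintf("Root mean squared residual: %.2f log units\n", rmse))
```

The root mean squared residual is often easier to interpret than the MSE because it is expressed in log units of solubility.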