ASM002a

Aqueous Solubility Model 002a

Researchers: Daryl Charron

Procedure
* Calculating Descriptors

Using Rajarshi Guha's CDK Descriptor Calculator GUI (v 1.3.7), we calculated all 2D CDK descriptors (i.e. no protein or geometrical descriptors) except CPSA, IP, and WHIM for the training set, using the 'add explicit H' option.

* Feature Selection

Removed descriptors with five or fewer non-zero entries: khs.sLi, khs.ssBe, khs.ssssBe, khs.ssBH, khs.sssB, khs.ssssB, khs.ddC, khs.sNH3, khs.ssNH2, khs.sssNH, khs.ssssN, khs.sSiH3, khs.ssSiH2, khs.sssSiH, khs.ssssSi, khs.sPH2, khs.ssPH, khs.sssP, khs.sssssP, khs.sGeH3, khs.ssGeH2, khs.sssGeH, khs.ssssGe, khs.sAsH2, khs.ssAsH, khs.sssAs, khs.sssdAs, khs.sssssAs, khs.sSeH, khs.dSe, khs.ssSe, khs.aaSe, khs.dssSe, khs.ddssSe, khs.sSnH3, khs.ssSnH2, khs.sssSnH, khs.ssssSn, khs.sPbH3, khs.ssPbH2, khs.sssPbH, khs.ssssPb.

Removed descriptor Kier3 due to excessive NAs.
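The two filters described here (descriptors with five or fewer non-zero entries, and descriptors with too many NAs) could be sketched in base R roughly as follows. This is only an illustration: the toy table `desc`, its values, and the 0.5 NA-fraction cutoff are assumptions, since the notebook does not state how "excessive NAs" was defined or how the filtering was actually carried out.

```r
# Toy descriptor table (hypothetical values, 20 molecules) standing in for the CDK output.
desc <- data.frame(
  khs.sLi = c(rep(0, 17), 1, 1, 1),               # only 3 non-zero entries
  nAtom   = seq(5, 24),                           # all entries non-zero
  Kier3   = c(rep(NA, 12), seq(0.5, 4, by = 0.5)),# 12 of 20 entries missing
  TopoPSA = c(0, seq(10, 100, length.out = 19))   # 19 non-zero entries
)

# Count non-zero entries per descriptor (NAs are ignored in the count).
nonzero <- colSums(desc != 0, na.rm = TRUE)
sparse  <- names(nonzero)[nonzero <= 5]

# Fraction of missing values per descriptor; the 0.5 cutoff is an assumption.
na.frac   <- colMeans(is.na(desc))
mostly.na <- names(na.frac)[na.frac > 0.5]

# Drop every flagged descriptor and keep the rest.
flagged  <- union(sparse, mostly.na)
filtered <- desc[, !(names(desc) %in% flagged), drop = FALSE]

print(flagged)
print(names(filtered))
```

On the toy table this flags `khs.sLi` as sparse and `Kier3` as mostly missing, leaving `nAtom` and `TopoPSA`.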

This resulted in 2273 molecules with 165 descriptors.

    library("caret")
    ## load in data
    mydata = read.csv(file="cdkout.csv", header=TRUE, row.names="molID")
    ## correlation matrix
    cor.mat = cor(mydata)
    ## find descriptors with pairwise correlation |r| > 0.95
    findCorrelation(cor.mat, cutoff = .95, verbose = TRUE)

[Output]

     [1] 165 163  64  27  63  26  28  71  29  65  32  70  69 157  61  72 155  62  77
    [20]  12  66  78  67  73  74  75  80  81  51  13  44  43

The caret-recommended descriptors were removed: nAcid, apol, ATSm5, ATSp1, ATSp2, ATSp3, nBase, C4SP3, SCH-3, VCH-5, VC-6, SP-0, SP-1, SP-2, SP-3, SP-4, SP-5, SP-7, VP-0, VP-1, VP-2, VP-3, VP-4, VP-5, VP-7, SPC-4, SPC-6, VPC-4, VAdjMat, MW, WPATH, XLogP
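To make the correlation step more concrete, here is a small base-R sketch of how column indices flagged by a correlation filter map back to descriptor names and get dropped. It is a simplified stand-in, not caret's algorithm: for each highly correlated pair it drops the second column, whereas `findCorrelation` removes the member of the pair with the larger mean absolute correlation. The toy data frame `toy` and its column names are made up for illustration; the vector `reported` is copied from the output above.

```r
set.seed(1)

# Toy data: column b is almost a copy of column a; c and d are independent noise.
x   <- rnorm(50)
toy <- data.frame(a = x, b = x + rnorm(50, sd = 0.01), c = rnorm(50), d = rnorm(50))

# Absolute pairwise correlations; look only at the upper triangle so each pair is seen once.
cor.mat <- cor(toy)
high    <- which(abs(cor.mat) > 0.95 & upper.tri(cor.mat), arr.ind = TRUE)

# Simplification (assumption): drop the second member of every flagged pair.
drop.idx <- unique(as.integer(high[, "col"]))
keep.idx <- setdiff(seq_along(toy), drop.idx)
reduced  <- toy[, keep.idx, drop = FALSE]

print(names(toy)[drop.idx])   # descriptors that would be removed
print(names(reduced))         # descriptors that remain

# Consistency check on the indices reported above: 32 removed from 165.
reported <- c(165, 163, 64, 27, 63, 26, 28, 71, 29, 65, 32, 70, 69, 157, 61, 72,
              155, 62, 77, 12, 66, 78, 67, 73, 74, 75, 80, 81, 51, 13, 44, 43)
remaining <- 165 - length(reported)
print(remaining)
```

The last lines only count the indices listed in the output: 32 flagged columns out of 165 leaves 133, which matches the number of descriptor names in the removal list.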

This left us with a final set of 2273 molecules and 133 descriptors.

* 1) do random forest [randomForest 4.6-6]

    library("randomForest")
    mydata = read.csv(file="cdkout.csv", header=TRUE, row.names="molID")
    mydata.rf <- randomForest(logS ~ ., data = mydata, importance = TRUE)
    print(mydata.rf)

[Output]

    Call:
     randomForest(formula = logS ~ ., data = mydata, importance = TRUE)
                   Type of random forest: regression
                         Number of trees: 500
    No. of variables tried at each split: 44

              Mean of squared residuals: 0.706776
                        % Var explained: 83.86
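As a reading aid for the printed summary, the following base-R arithmetic relates the reported numbers to one another. It assumes the randomForest defaults for regression: `mtry` is the number of predictors divided by three (rounded down), the mean of squared residuals is the out-of-bag estimate, and "% Var explained" is 100 × (1 − MSE / variance of logS). The variable names are illustrative only, and the implied variance is back-calculated from the rounded output, not taken from the data.

```r
# Values copied from the notebook text and the printed model summary.
n.descriptors <- 133
mse           <- 0.706776
var.explained <- 83.86

# Default mtry for regression forests: floor(p / 3), at least 1.
mtry.default <- max(floor(n.descriptors / 3), 1)

# Root mean squared error in logS units (about 0.84).
rmse <- sqrt(mse)

# Variance of logS implied by the two reported figures (approximate, from rounded output).
implied.var <- mse / (1 - var.explained / 100)

print(mtry.default)
print(round(rmse, 2))
print(round(implied.var, 2))
```

The default `mtry` works out to 44, in agreement with the "No. of variables tried at each split" line, which is consistent with the model having been fit on the 133 retained descriptors.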