Aqueous Solubility Model 002a
Researchers: Daryl Charron

Procedure

Calculating Descriptors
Using Rajarshi Guha's CDK Descriptor Calculator GUI (v 1.3.7), we calculated all 2D CDK descriptors (i.e., no protein or geometrical descriptors) except CPSA, IP, and WHIM for the training set, with the 'add explicit H' option enabled.

Feature Selection
Removed descriptors with five or fewer non-zero entries: khs.sLi, khs.ssBe, khs.ssssBe, khs.ssBH, khs.sssB, khs.ssssB, khs.ddC, khs.sNH3, khs.ssNH2, khs.sssNH, khs.ssssN, khs.sSiH3, khs.ssSiH2, khs.sssSiH, khs.ssssSi, khs.sPH2, khs.ssPH, khs.sssP, khs.sssssP, khs.sGeH3, khs.ssGeH2, khs.sssGeH, khs.ssssGe, khs.sAsH2, khs.ssAsH, khs.sssAs, khs.sssdAs, khs.sssssAs, khs.sSeH, khs.dSe, khs.ssSe, khs.aaSe, khs.dssSe, khs.ddssSe, khs.sSnH3, khs.ssSnH2, khs.sssSnH, khs.ssssSn, khs.sPbH3, khs.ssPbH2, khs.sssPbH, khs.ssssPb.
Removed descriptor Kier3 due to excessive NAs.
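
The two filters above can be sketched in base R as follows. This is only an illustration on a made-up toy table: the column values, the object names (desc, min_nonzero, max_na_frac), and in particular the NA cutoff are assumptions, since the procedure only says "excessive NAs" without giving a threshold.

```r
# Toy descriptor table; all values are invented for illustration only.
desc <- data.frame(
  nAtom   = c(12, 8, 15, 9, 20, 11, 7, 14, 10, 13),
  khs.sLi = c(0, 0, 0, 0, 0, 0, 0, 1, 0, 0),               # one non-zero entry
  Kier3   = c(NA, 1.1, NA, 1.2, NA, 0.9, 0.8, NA, 1.5, 2.0), # 40% missing
  nRotB   = c(1, 0, 3, 2, 5, 1, 0, 4, 2, 0)
)

min_nonzero <- 5     # from the text: drop if 5 or fewer non-zero entries
max_na_frac <- 0.25  # ASSUMED cutoff; the text does not state one

# Count non-zero entries per column (NAs ignored) and the NA fraction.
nonzero_count <- colSums(desc != 0, na.rm = TRUE)
na_frac       <- colMeans(is.na(desc))

sparse    <- names(desc)[nonzero_count <= min_nonzero]
mostly_na <- names(desc)[na_frac > max_na_frac]

# Keep every column that was not flagged by either filter.
filtered <- desc[, setdiff(names(desc), c(sparse, mostly_na)), drop = FALSE]

print(sparse)
print(mostly_na)
print(names(filtered))
```

On the real descriptor table, the same two counts could be inspected first to choose a sensible NA cutoff before dropping anything.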

This resulted in 2273 molecules with 165 descriptors. We then used the caret package to flag descriptors with pairwise correlation above 0.95:
> library("caret")
> ## load in data
> mydata = read.csv(file = "cdkout.csv", header = TRUE, row.names = "molID")
> ## pairwise correlation matrix of all columns
> cor.mat = cor(mydata)
> ## flag columns to drop so that no pairwise |r| exceeds 0.95
> findCorrelation(cor.mat, cutoff = .95, verbose = TRUE)
[Output]
[1] 165 163 64 27 63 26 28 71 29 65 32 70 69 157 61 72 155 62 77
[20] 12 66 78 67 73 74 75 80 81 51 13 44 43
[Output]
The caret-recommended descriptors were removed: nAcid, apol, ATSm5, ATSp1, ATSp2, ATSp3, nBase, C4SP3, SCH-3, VCH-5, VC-6, SP-0, SP-1, SP-2, SP-3, SP-4, SP-5, SP-7, VP-0, VP-1, VP-2, VP-3, VP-4, VP-5, VP-7, SPC-4, SPC-6, VPC-4, VAdjMat, MW, WPATH, XLogP
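
findCorrelation returns column indices, so they have to be mapped back to descriptor names before the columns are dropped. The sketch below shows that mapping on invented toy data using base R only. Note that the flagging rule here (drop the later column of any pair with |r| above the cutoff) is a simplified stand-in and not caret's own selection heuristic; the names toy, drop.idx, drop.names, and reduced are hypothetical.

```r
set.seed(1)
n <- 50
a <- rnorm(n)

# Toy data: d2 is a near-duplicate of d1; d3 and d4 are independent noise.
toy <- data.frame(
  d1 = a,
  d2 = a + rnorm(n, sd = 0.01),
  d3 = rnorm(n),
  d4 = rnorm(n)
)

cor.mat <- cor(toy)

# Simplified stand-in for findCorrelation: flag column j if it has
# |r| > 0.95 with any earlier column i < j (upper triangle only).
high     <- abs(cor.mat) > 0.95 & upper.tri(cor.mat)
drop.idx <- which(colSums(high) > 0)

# Map the indices back to descriptor names, then drop by name.
# Dropping by name is also safe when nothing was flagged.
drop.names <- colnames(toy)[drop.idx]
reduced    <- toy[, setdiff(colnames(toy), drop.names), drop = FALSE]

print(drop.names)
print(colnames(reduced))
```

With the real data, colnames(mydata)[idx] applied to the index vector printed above would give the list of removed descriptor names.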


This left us with a final set of 2273 molecules and 133 descriptors, which we used to build a random forest model:
library("randomForest")
mydata = read.csv(file = "cdkout.csv", header = TRUE, row.names = "molID")
## do random forest [randomForest 4.6-6]
mydata.rf <- randomForest(logS ~ ., data = mydata, importance = TRUE)
print(mydata.rf)
[output]
Call:
randomForest(formula = logS ~ ., data = mydata, importance = TRUE)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 44
 
Mean of squared residuals: 0.706776
% Var explained: 83.86
[output]
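
As a quick consistency check on the printed summary, the reported numbers can be related to each other in a few lines of base R. This sketch assumes two things that are not stated in the procedure: that the regression default for mtry is max(floor(p / 3), 1), and that "% Var explained" is reported as 100 * (1 - MSE / Var(logS)). The variable names are hypothetical, and because the percentage is rounded, the implied variance is only approximate.

```r
# Values copied from the procedure and the printed model summary.
p       <- 133       # number of descriptors used as predictors
mse     <- 0.706776  # "Mean of squared residuals"
pct_var <- 83.86     # "% Var explained" (rounded to two decimals)

# ASSUMED default for regression forests: one third of the predictors.
mtry_default <- max(floor(p / 3), 1)
print(mtry_default)  # 44, matching "No. of variables tried at each split"

# ASSUMED relation: pct_var = 100 * (1 - mse / var_logS).
# Solving for the variance of logS implied by the two reported values.
implied_var <- mse / (1 - pct_var / 100)
print(round(implied_var, 2))

# Root mean squared error in logS units, for easier interpretation.
rmse <- sqrt(mse)
print(round(rmse, 2))
```

The match between floor(133 / 3) and the reported 44 suggests the model was fit on the 133-descriptor set with the default mtry.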