Aqueous Solubility Model ASM002d
Researchers: Rebecca Giese

Objective: to create a general open model that predicts the aqueous solubility of organic compounds using open data and open descriptors.

Using Rajarshi Guha's CDK, all the 2D CDK descriptors except CPSA and IP were calculated for the training set, with the explicit 'H' option.

Feature Selection
Descriptors with fewer than 7 non-zero entries were removed. The following entries were removed: khs.sLi, khs.ssBe, khs.ssssBe, khs.ssBH, khs.sssB, khs.ssssB, khs.sCH3, khs.ddC, khs.sNH3, khs.ssNH2, khs.sssNH, khs.ssss, khs.sSiH3, khs.sssSiH, khs.ssSiH2, khs.sPH2, khs.ssPH, khs.sssP, khs.sssssP, khs.sSH, khs.sGeH3, khs.ssGeH2, khs.sssGeH, khs.ssssGe, khs.sAsH2, khs.ssAsH, khs.sssAs, khs.sssdAs, khs.sssssAs, khs.sSeH, khs.dSe, khs.ssSe, khs.aaSe, khs.dssSe, khs.ddssSe, khs.sSnH3, khs.ssSnH2, khs.sssSnH, khs.ssssSn, khs.sPbH3, khs.ssPbH2, khs.sssPbH, khs.ssssPb

The following descriptor was removed because of an excess of NA values: Kier3
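
The two filters above can be sketched in a few lines of Python. This is only an illustration, not the script used for this model: the function name filter_descriptors, the dict-of-lists table layout, and the NA threshold max_na_fraction are all assumptions (the notebook does not state how many NA values counted as an excess), and the toy values are made up.

```python
import math

def is_na(value):
    """Treat None and NaN as missing (NA) entries."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def filter_descriptors(table, min_nonzero=7, max_na_fraction=0.05):
    """Split descriptor columns into kept, too sparse, and too many NA.

    table maps a descriptor name to its list of values, one per molecule.
    max_na_fraction is a hypothetical threshold chosen for illustration.
    """
    kept, dropped_sparse, dropped_na = {}, [], []
    for name, values in table.items():
        present = [v for v in values if not is_na(v)]
        nonzero = sum(1 for v in present if v != 0)
        na_fraction = (len(values) - len(present)) / len(values)
        if nonzero < min_nonzero:
            dropped_sparse.append(name)   # fewer than 7 non-zero entries
        elif na_fraction > max_na_fraction:
            dropped_na.append(name)       # excess of NA values
        else:
            kept[name] = values
    return kept, dropped_sparse, dropped_na

# Toy table with made-up values for 10 molecules.
toy = {
    "khs.sLi": [0] * 10,
    "Kier3": [1.2, None, None, None, 2.0, 1.1, 0.9, 1.5, 2.2, 1.3],
    "MW": [100.0 + i for i in range(10)],
}
kept, dropped_sparse, dropped_na = filter_descriptors(toy)
print(sorted(kept), dropped_sparse, dropped_na)
```

The sparse check runs first, so a column that is both sparse and full of NA values is reported only once, under the sparse list.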

At this point there are 2273 molecules and 164 descriptors.

Further feature selection was conducted using caret, which gave the following output:
[1] 164 162 64 27 63 26 28 71 29 65 32 70 156 69 61 72 154 62 77 12 66 78 67 73 74 75 80 81 51 13 44 43

Following this recommendation, the following entries were deleted: XLogP, WPath, WTPT2, MW, VPC-5, VPC-4, SPC-5, SPC-6, VP-7, VP-6, VP-5, VP-4, SPC-4, VP-3, VP-2, VP-1, SP-7, SP-6, SP-5, SP-4, SP-3, SP-2, SP-1, SP-0, VCH-7, SCH-4, SCH-5, nB, ATSp4, ATSp3, ATSp2, ATSp1, nAromBond, naAromAtom

This left 2273 molecules and 132 descriptors.
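
The notebook does not say which caret function produced the index list. A vector of column indices is what a pairwise-correlation filter such as caret's findCorrelation returns, so the sketch below assumes that kind of filter. It is a simplified greedy version in plain Python, not caret's implementation: the names pearson and find_correlated, the cutoff of 0.9, and the toy columns are all hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def find_correlated(columns, cutoff=0.9):
    """Return column names to drop so no remaining pair has |r| > cutoff.

    For the most correlated remaining pair, the member with the larger
    mean absolute correlation to the other remaining columns is dropped.
    """
    names = list(columns)
    r = {a: {b: abs(pearson(columns[a], columns[b])) for b in names if b != a}
         for a in names}
    remaining, dropped = set(names), []
    while len(remaining) > 1:
        top, a, b = max((r[a][b], a, b)
                        for a in remaining for b in remaining if a < b)
        if top <= cutoff:
            break

        def mean_corr(x):
            others = [r[x][y] for y in remaining if y != x]
            return sum(others) / len(others)

        victim = a if mean_corr(a) >= mean_corr(b) else b
        remaining.discard(victim)
        dropped.append(victim)
    return dropped

# Toy columns: "a" and "b" are nearly collinear, "c" is not.
toy = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.1, 3.9, 6.2, 8.0, 9.9],
    "c": [3.0, 1.0, 4.0, 1.0, 5.0],
}
dropped = find_correlated(toy)
print(dropped)
```

On the toy data one of the two collinear columns is dropped and "c" is kept. The cutoff controls how aggressive the filter is, so the number of descriptors removed depends directly on that choice.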

Finally, a Random Forest model gave the following results:
Number of trees: 500
No. of variables tried at each split: 43
Mean of squared residuals: 0.6882564
% Var explained: 84.28
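
The two error figures above are linked. Assuming the summary follows the usual definition, % Var explained = 100 * (1 - MSE / Var(y)), the reported values imply a root-mean-square error and a variance for the solubility values in the training set. The sketch below only rearranges the reported numbers; the definition is an assumption, and the variable names are illustrative.

```python
import math

mse = 0.6882564   # "Mean of squared residuals" from the results above
pct_var = 84.28   # "% Var explained" from the results above

# Root-mean-square error, in the same units as the modeled solubility value.
rmse = math.sqrt(mse)

# Assumed definition: pct_var = 100 * (1 - mse / var_y), solved for var_y.
implied_var_y = mse / (1 - pct_var / 100)

print(f"RMSE: {rmse:.3f}")
print(f"Implied variance of the response: {implied_var_y:.2f}")
```

Reading the error as an RMSE makes it easier to compare against the spread of the measured solubility values than the squared figure alone.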