Aqueous Solubility Model 002e
Researcher: Katie Crosby

Objective: to create a general, open model that predicts the aqueous solubility of organic compounds using open data and open descriptors.


Calculating Descriptors
Using Rajarshi Guha's CDK Descriptor Calculator GUI (v 1.3.7), I calculated all 2D CDK descriptors for the training set, with the 'add explicit H' option selected and comma-delimited output.

The resulting file was:

Feature Selection
Descriptors with fewer than 3 non-zero entries were removed: khs.sLi, khs.ssBe, khs.ssssBe, khs.ssBH, khs.sssB, khs.ssssB, khs.ssNH2, khs.sssNH, khs.ssssN, khs.sSiH3, khs.ssSiH2, khs.sssSiH, khs.ssssSi, khs.sPH2, khs.ssPH, khs.sssP, khs.sssssP, khs.sGeH3, khs.ssGeH2, khs.sssGeH, khs.ssssGe, khs.sAsH2, khs.ssAsH, khs.sssAs, khs.sssdAs, khs.sssssAs, khs.sSeH, khs.dSe, khs.ssSe, khs.aaSe, khs.dssSe, khs.ddssSe, khs.sSnH3, khs.ssSnH2, khs.sssSnH, khs.ssssSn, khs.sPbH3, khs.ssPbH2, khs.sssPbH, khs.ssssPb

Kier3 was also removed because its column contained NA values.
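
The two filters above (too few non-zero entries, any NA values) can be sketched in base R as follows. This is only an illustration under assumptions: the toy table and its column names are made up, and the notebook does not record how the removal was actually carried out.

```r
# Toy descriptor table; the columns are illustrative stand-ins,
# not the real CDK output.
desc <- data.frame(
  nAtom   = c(5, 8, 12, 7, 9),
  khs.sLi = c(0, 0, 0, 1, 0),          # only one non-zero entry
  shapeNA = c(1.2, NA, 0.8, 1.1, NA),  # contains missing values
  nRing   = c(0, 1, 2, 0, 1)
)

# Number of non-zero entries in each column (NA entries are not counted).
nonzero <- colSums(desc != 0, na.rm = TRUE)

# TRUE for every column that contains at least one NA.
has.na <- colSums(is.na(desc)) > 0

# Keep columns with at least 3 non-zero entries and no NA values.
keep <- nonzero >= 3 & !has.na
filtered <- desc[, keep, drop = FALSE]

print(names(filtered))
```

On a real descriptor file the same two tests would be applied to the data frame returned by read.csv, after setting aside any non-descriptor columns.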

The resulting file is:

Further feature selection was performed on the remaining set of 2273 molecules with 140 descriptors by using the caret package in R:


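library(caret)  # provides findCorrelation()
mydata = read.csv(file="smiles1.csv",header=TRUE,row.names="molID")
cor.mat = cor(mydata)  # pairwise correlations between descriptors
# returns the indices of the columns recommended for removal
findCorrelation(cor.mat, cutoff = .95, verbose = TRUE)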

[1] 139 47 140 14 13 15 46 60 54 16 61 55 48 56 53 133 49 50 52
[20] 44 132 57 45 58 63 64 34 27 26

The caret-recommended descriptors were removed: ATSp2, ATSp3, ATSp4, ATSp5, SCH-3, SCH-4, VCH-6, SP-0, SP-1, SP-2, SP-3, SP-4, SP-5, SP-6, VP-0, VP-1, VP-2, VP-3, VP-4, VP-5, VP-6, SPC-4, SPC-5, VPC-4, VPC-5, VABC, WTPT-1, WPOL, Zagreb
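
findCorrelation returns column indices rather than a reduced table, so the removal is a separate step. A minimal sketch of that step on made-up data is shown here; the data frame toy and its columns a, b and c are hypothetical, and the notebook does not say whether the columns were dropped in R or by hand.

```r
library(caret)

set.seed(1)  # arbitrary seed so the toy data are reproducible
toy <- data.frame(a = rnorm(50))
toy$b <- toy$a + rnorm(50, sd = 0.01)  # near-duplicate of column a
toy$c <- rnorm(50)                     # unrelated column

# Indices of columns to drop so that no remaining pair
# has an absolute correlation above the cutoff.
drop.idx <- findCorrelation(cor(toy), cutoff = 0.95)

# Guard against an empty result: toy[, -integer(0)] would drop every column.
reduced <- if (length(drop.idx) > 0) toy[, -drop.idx, drop = FALSE] else toy

print(names(toy)[drop.idx])  # names of the removed columns
print(names(reduced))        # names of the columns that remain
```

Printing the names alongside the indices makes it easier to check the removed set against a list like the one above.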

This left 111 descriptors. 'logS' was added from the original data set file for the next step in the calculation. The resulting file is:

Building the Model
An initial set of 1000 randomly selected molecules was used to build a random forest model in R with the following code:


library(randomForest)
mydata = read.csv(file="smiles2.csv",header=TRUE,row.names="molID")
# do random forest [randomForest 4.6-6]
mydata.rf <- randomForest(logS ~ ., data = mydata,importance = TRUE)

randomForest(formula = logS ~ ., data = mydata, importance = TRUE)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 37
Mean of squared residuals: 0.7435326
% Var explained: 83.02
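
For readers who want to reproduce this kind of run, here is a hedged sketch of drawing a random training subset and reading the summary numbers back from the fitted object. The data are synthetic, and the seed, subset size and column names x1, x2, x3 are assumptions made for illustration; none of them come from the notebook. The "No. of variables tried at each split" value of 37 reported above matches randomForest's default for regression, floor(p/3), with p = 111 descriptors.

```r
library(randomForest)

set.seed(42)  # arbitrary seed so the random subset is reproducible
n <- 300
toy <- data.frame(x1 = runif(n), x2 = runif(n), x3 = runif(n))
toy$logS <- 3 * toy$x1 - 2 * toy$x2 + rnorm(n, sd = 0.2)  # synthetic response

# Draw a random training subset (100 rows here; the notebook used 1000 molecules).
train.idx <- sample(nrow(toy), 100)
train <- toy[train.idx, ]

fit <- randomForest(logS ~ ., data = train, importance = TRUE)

# Out-of-bag statistics after the last tree; these are the values
# printed as "Mean of squared residuals" and "% Var explained" (x 100).
oob.mse <- fit$mse[fit$ntree]
oob.rsq <- fit$rsq[fit$ntree]

# "% Var explained" is 1 - MSE / Var(y), with the variance taken over n rather than n - 1.
n.train <- nrow(train)
rsq.check <- 1 - oob.mse / (var(train$logS) * (n.train - 1) / n.train)

print(c(mse = oob.mse, rsq = oob.rsq, rsq.check = rsq.check, mtry = fit$mtry))
```

Because both numbers are out-of-bag estimates, they describe performance on molecules left out of each tree, not on an independent test set.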