ASM001

Aqueous Solubility Model 001

**Researchers: Jean-Claude Bradley, Andrew Lang, and Antony Williams**

Objective
To create a general open model predicting the aqueous solubility of organic compounds using open data and open descriptors.

Data Collection and Filtering
Our starting point is the large set (57857 compounds) of open aqueous solubility data provided by Rajarshi Guha //et al.// [1]. We then filtered the dataset to make it more suitable for QSPR modeling.
**Training Set**
**Filter 1: Converting to canonical SMILES.** The SMILES in the original dataset were 'washed' in MOE (v2008.10), and upon inspection we noticed that a significant number were charge imbalanced. We therefore downloaded the original SMILES from PubChem, losing 4 compounds (not returned by PubChem) and leaving 57853 compounds. These SMILES were then canonicalised using OpenBabel 2.3.0.
**Filter 2: Removing salts.** All compounds with a period in their SMILES (1884 compounds) were removed as salts, leaving 55969 compounds. These salts were not apparent in the original dataset.
**Filter 3: Assigning ChemSpider IDs (CSID).** Antony Williams screened the compounds against the ChemSpider database, assigning ChemSpider IDs to those compounds that existed in ChemSpider. Compounds without a match were removed, leaving 55094 compounds.
**Filter 4: Removing estimates.** We removed all entries marked "Measured solubility is greater than 75% dose concentration, actual solubility may be higher" and all entries marked "Below LOQ". This left a //**training/test dataset**// of 35551 compounds - [[file:AqueousDataset001.xlsx|Aqueous Solubility Dataset 001]].
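Filters 2 and 4 are simple row filters, and a minimal sketch of how they could be expressed in R is given here. The data frame, its column names (`smiles`, `flag`), and its values are hypothetical and made up for illustration; they do not come from the original dataset. The salt test relies on the fact that a period in SMILES separates disconnected components.

```r
# Toy stand-in for the solubility table. Column names and values are
# hypothetical, not taken from the original dataset.
toy <- data.frame(
  smiles = c("CCO", "[Na+].[Cl-]", "c1ccccc1", "CC(=O)O"),
  flag   = c("", "", "Below LOQ", ""),
  stringsAsFactors = FALSE
)

# Filter 2: a period in SMILES separates disconnected components,
# so its presence is used as the salt indicator.
is.salt <- grepl(".", toy$smiles, fixed = TRUE)

# Filter 4: drop entries whose flag marks the value as an estimate.
estimate.flags <- c(
  "Measured solubility is greater than 75% dose concentration, actual solubility may be higher",
  "Below LOQ"
)
is.estimate <- toy$flag %in% estimate.flags

filtered <- toy[!is.salt & !is.estimate, ]
print(filtered$smiles)
```

Note that `fixed = TRUE` matters: without it, `"."` is a regular expression that matches any character, and every row would be flagged as a salt.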

**Test Set 1**
We took the aqueous solubility data collected from the literature by Wang //et al.// (dataset of 3636 compounds) [2] and combined it with the aqueous solubility data collected as part of the Open Notebook Science Challenge (dataset of 252 compounds). Removing compounds without ChemSpider IDs, filtering for duplicates, and averaging multiple values for the same compound gave us a //**test set**// of 2841 compounds. Three compounds (antipyrine, acrolein, tetrahydrofuran) were removed because of inconsistent measurements.
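The merging step (combine two sources, drop compounds without a CSID, average repeated measurements) can be sketched in base R as follows. The data frames `wang` and `ons`, their column names, and all values are invented for illustration only; the actual files may be laid out differently.

```r
# Hypothetical miniature versions of the two sources (made-up values).
wang <- data.frame(csid = c(10, 11, NA), logS = c(-1.0, -2.5, -0.3))
ons  <- data.frame(csid = c(10, 12),     logS = c(-1.2, -3.0))

# Stack the two sources and drop compounds without a ChemSpider ID.
combined <- rbind(wang, ons)
combined <- combined[!is.na(combined$csid), ]

# One row per compound: replicate measurements are averaged.
test.set <- aggregate(logS ~ csid, data = combined, FUN = mean)
print(test.set)
```

In this toy example compound 10 appears in both sources, so its two values are averaged into a single row. Averaging hides disagreement between sources, which is presumably why compounds with clearly inconsistent measurements were removed rather than averaged.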

**Test Set 2**
A set of 28 compounds provided by Hewitt //et al.// [3] in their "solubility challenge", which compared the predictive ability of several commercial aqueous solubility models, was used as a second //**test set**// in order to compare our model's performance directly with that of the models tested by Hewitt //et al.//

Procedure
**Calculating Descriptors**
Using Rajarshi Guha's CDK Descriptor Calculator GUI (v 1.3.4) with the 'add explicit H' option, we calculated all 2D CDK descriptors (i.e. no protein or geometrical descriptors) for the training set, except CPSA, IP, WHIM, and bpol (not available in 1.3.4).

**Feature Selection**
Descriptors with fewer than 36 non-zero entries were removed: khs.sLi, khs.ssBe, khs.ssssBe, khs.ssBH, khs.sssB, khs.ssssB, khs.ddC, khs.sNH3, khs.ssNH2, khs.sssNH, khs.ssssN, khs.sSiH3, khs.ssSiH2, khs.sssSiH, khs.ssssSi, khs.sPH2, khs.ssPH, khs.sssP, khs.sssssP, khs.sSH, khs.dssS, khs.sGeH3, khs.ssGeH2, khs.sssGeH, khs.ssssGe, khs.sAsH2, khs.ssAsH, khs.sssAs, khs.sssdAs, khs.sssssAs, khs.sSeH, khs.dSe, khs.ssSe, khs.aaSe, khs.dssSe, khs.ddssSe, khs.sSnH3, khs.ssSnH2, khs.sssSnH, khs.ssssSn, khs.sI, khs.sPbH3, khs.ssPbH2, khs.sssPbH, khs.ssssPb
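A sparse-descriptor filter of this kind amounts to counting non-zero entries per column. The following is a minimal sketch on made-up data: the descriptor values are invented, and the threshold is lowered to 5 because the toy table has only 100 rows (the page used 36 on roughly 35.5k molecules).

```r
# Toy descriptor table with 100 molecules (all values are made up).
desc <- data.frame(
  nAtom   = rep(c(12, 20, 33, 8), 25),   # never zero
  khs.sLi = c(1, rep(0, 99)),            # non-zero for one molecule only
  khs.sCl = rep(c(1, 0, 0, 0), 25)       # non-zero for 25 molecules
)

# Threshold for this toy example only; 36 was used on the full dataset.
min.nonzero <- 5

# Count non-zero entries per descriptor and keep the informative ones.
nonzero.counts <- colSums(desc != 0, na.rm = TRUE)
kept <- desc[, nonzero.counts >= min.nonzero, drop = FALSE]

print(nonzero.counts)
print(names(kept))
```

Descriptors that are zero for almost every molecule carry very little signal for a model and mostly add noise, which is the motivation for dropping them before any correlation-based selection.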

Compounds with multiple NA values (descriptor calculation failures) were removed. Their CSIDs are: 4797610, 2104719, 2104724, 4807115, 4750678, 4674166, 4806481, 4809281, 4783859, 4716159, 4713927, 520280, 17322037, 604145
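Rows with descriptor failures can be located by counting NA values per molecule. The sketch below assumes that "multiple NA" means more than one NA in a row; that cutoff, the column names, and the values are all assumptions made for illustration.

```r
# Toy descriptor table; molID values and descriptor values are made up.
desc <- data.frame(
  molID   = c(101, 102, 103, 104),
  ALogP   = c(1.2, NA, 0.4, NA),
  TopoPSA = c(20.2, NA, 37.3, 63.6),
  nRotB   = c(1, NA, 0, 4)
)

# Number of missing descriptor values per molecule.
na.per.row <- rowSums(is.na(desc))

# Assumption: a molecule counts as a descriptor failure when it has
# more than one NA.
failed  <- desc$molID[na.per.row > 1]
cleaned <- desc[na.per.row <= 1, ]

print(failed)
```

Under this assumption molecule 104, which has a single NA, stays in `cleaned`; any remaining NA values would still need to be handled before calling `cor()` or fitting a model.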

Further feature selection was performed on the remaining set of 35537 molecules with 162 descriptors by using the caret package in R:
[[code]]
library("caret")
#load in data
mydata = read.csv(file="AqueousDataset001WithDescriptorsReadyForR.csv",header=TRUE,row.names="molID")
#correlation matrix
cor.mat = cor(mydata)
#find correlation r > 0.95
findCorrelation(cor.mat, cutoff = .95, verbose = TRUE)
[output]
[1] 162 62 160 63 68 32 60 154 152 27 26 69 12 28 61 64 29 15 71 65 77 159 66 72 73 80 74 13 16 48 94 33 42
[output]
[[code]]
The 33 caret-recommended descriptors were removed: apol, naAromAtom, nAtom, ATSc1, ATSp1, ATSp2, ATSp3, ATSp4, nB, C1SP1, SCH-3, VCH-4, SP-0, SP-1, SP-2, SP-3, SP-4, SP-5, SP-6, VP-0, VP-1, VP-3, VP-4, VP-5, VP-6, SPC-5, VPC-5, khs.tsC, VABC, WTPT-1, WPATH, WPOL, Zagreb
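To illustrate the idea behind a pairwise correlation filter, here is a simplified base-R stand-in on toy data. It is not caret's implementation: `drop_correlated` is a hypothetical helper written for this sketch, and it only follows the general idea of removing, from each highly correlated pair, the descriptor with the larger mean absolute correlation.

```r
# Toy data: x2 is nearly a copy of x1, x3 is independent noise.
set.seed(1)
x1 <- rnorm(200)
toy <- data.frame(
  x1 = x1,
  x2 = x1 + rnorm(200, sd = 0.01),
  x3 = rnorm(200)
)
cor.mat <- cor(toy)

# Hypothetical helper (not part of caret): greedily drop one member of
# the most correlated pair until no |r| exceeds the cutoff.
drop_correlated <- function(cm, cutoff = 0.95) {
  cm <- abs(cm)
  diag(cm) <- 0
  dropped <- character(0)
  while (nrow(cm) > 1 && max(cm) > cutoff) {
    # Row/column position of the most correlated pair.
    pair <- which(cm == max(cm), arr.ind = TRUE)[1, ]
    # Drop the member with the larger mean absolute correlation.
    mean.cor <- rowMeans(cm[pair, , drop = FALSE])
    victim <- rownames(cm)[pair[which.max(mean.cor)]]
    dropped <- c(dropped, victim)
    keep <- setdiff(rownames(cm), victim)
    cm <- cm[keep, keep, drop = FALSE]
  }
  dropped
}

to.drop <- drop_correlated(cor.mat, cutoff = 0.95)
reduced <- toy[, setdiff(names(toy), to.drop), drop = FALSE]
print(names(reduced))
```

With caret itself, the vector returned by `findCorrelation` holds column indices of the correlation matrix, so the corresponding descriptor names can be read off with `colnames(cor.mat)[idx]` before those columns are removed.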

This left us with a final set of 35537 molecules and 129 descriptors.

**Building the Model**
An initial set of 1000 randomly selected molecules was used to build a random forest model in R with the following code:
[[code]]
library("randomForest")
mydata = read.csv(file="AqueousDataset001TrainingSetReadyForRsmall.csv",header=TRUE,row.names="molID")
#do random forest [randomForest 4.6-6]
mydata.rf <- randomForest(logS ~ ., data = mydata,importance = TRUE)
print(mydata.rf)
[output]
Call:
 randomForest(formula = logS ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 43

          Mean of squared residuals: 0.3071348
                    % Var explained: 29.47
[output]
[[code]]
Such a low OOB R2 value (0.29) was surprising, so a second randomly selected set of 1000 molecules from the original dataset, using the provided descriptors, was run through R. This run had an OOB R2 value of 0.31. Looking at the data in its entirety (see histogram), we see a sharp cutoff of values at logS = -0.8. This, together with the low R2 values, suggests that the dataset is not useful for modeling experimental aqueous solubility. Instead, it is useful for classification models, especially binary ones with two bins, "low solubility" (logS < -1.4) and "high solubility" (logS > -1.4), as has been done successfully previously. [1, 4]
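The two-bin labelling described above can be sketched in a few lines of R. The logS values are made up, and the treatment of a value exactly equal to -1.4 is an assumption: the text gives strict inequalities on both sides, so here such a value is placed in the "high" bin.

```r
# Made-up logS values for illustration.
logS <- c(-3.2, -1.4, -0.9, -2.0, -0.8)

# Cutoff between the "low" and "high" solubility bins.
threshold <- -1.4

# Assumption: a value exactly at the threshold goes to the "high" bin.
sol.class <- factor(
  ifelse(logS < threshold, "low", "high"),
  levels = c("low", "high")
)

print(table(sol.class))
```

A factor response like `sol.class` is what a classifier expects; for example, `randomForest` fits a classification forest rather than a regression forest when the response is a factor.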

Conclusion
The aqueous solubility dataset is useful for building binary classification models but is not a good basis for general (regression) modeling of aqueous solubility.