Aqueous Solubility Model 001

Researchers: Jean-Claude Bradley, Andrew Lang, and Antony Williams


To create a general open model predicting the aqueous solubility of organic compounds using open data and open descriptors.

Data Collection and Filtering

Training Set
Our starting point is the large set (57857 compounds) of open aqueous solubility data provided by Rajarshi Guha et al. [1]. We then filtered the dataset to make it more suitable for QSPR modeling.
Filter 1: Converting to canonical SMILES. The SMILES in the original dataset had been 'washed' in MOE (v2008.10), and on inspection we noticed that a significant number were charge-imbalanced. We therefore downloaded the original SMILES from PubChem, losing 4 compounds (not returned by PubChem) and leaving 57853 compounds. These SMILES were then canonicalised using OpenBabel 2.3.0.
Filter 2: Removing salts. All compounds with a period in their SMILES (1884 compounds) were removed as salts, leaving 55969 compounds. These salts were not apparent in the original dataset.
Filter 3: Assigning ChemSpider IDs (CSID). Antony Williams screened the compounds against the ChemSpider database, assigning ChemSpider IDs to those compounds that existed in ChemSpider. Compounds without a match were dropped, leaving 55094 compounds.
Filter 4: Removing Estimates. We removed all entries marked "Measured solubility is greater than 75% dose concentration, actual solubility may be higher" and all entries marked "Below LOQ". This left a training/test dataset of 35551 compounds - Aqueous Solubility Dataset 001.
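Filters 2 and 4 are plain row filters, and a minimal sketch of them in R might look like the following. The column names `smiles` and `comment` and the toy rows are assumptions made for illustration; they are not the layout of the actual dataset.

```r
## Toy data frame; column names and values are hypothetical
sol <- data.frame(
  smiles  = c("CCO", "[Na+].[Cl-]", "c1ccccc1", "CC(=O)O"),
  comment = c("", "", "Below LOQ", ""),
  stringsAsFactors = FALSE
)

## Filter 2: a period in a SMILES string separates disconnected components,
## so it is used here as the marker for salts
is.salt <- grepl(".", sol$smiles, fixed = TRUE)

## Filter 4: drop entries flagged as estimates rather than measurements
estimate.flags <- c(
  "Measured solubility is greater than 75% dose concentration, actual solubility may be higher",
  "Below LOQ"
)
is.estimate <- sol$comment %in% estimate.flags

filtered <- sol[!is.salt & !is.estimate, ]
```

Because the period test matches any multi-component SMILES, it also removes mixtures and solvates along with the salts.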

Test Set 1
Test Set 1 combines the aqueous solubility data collected from the literature by Wang et al. (dataset - 3636 compounds) [2] with the aqueous solubility data collected as part of the open notebook science challenge (dataset - 252 compounds). Removing compounds without ChemSpider IDs and filtering for duplicates (averaging multiple values for the same compound) gives a test set of 2841 compounds - Aqueous Solubility Dataset 002. Three compounds were removed because of inconsistent measurements: antipyrine, acrolein, and tetrahydrofuran.
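The combine, filter and average steps can be sketched in R as follows. This assumes each source is a data frame with the hypothetical columns `CSID` and `logS`; the values are made up.

```r
## Toy inputs; CSIDs and logS values are made up for illustration
wang <- data.frame(CSID = c(101, 102, NA, 104), logS = c(-1.0, -2.5, -0.3, -3.0))
ons  <- data.frame(CSID = c(102, 105),          logS = c(-2.7, -0.9))

## Combine the two sources
combined <- rbind(wang, ons)

## Remove compounds without a ChemSpider ID
combined <- combined[!is.na(combined$CSID), ]

## Collapse duplicates: one row per CSID, averaging repeated measurements
test.set <- aggregate(logS ~ CSID, data = combined, FUN = mean)
```

In this toy example, CSID 102 appears in both sources and ends up as a single row holding the mean of its two values.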

Test Set 2
A set of 28 compounds, provided by Hewitt et al. [3] in their "solubility challenge" comparing the predictive ability of several commercial aqueous solubility models, was used as a second test set in order to compare our model's performance directly with the performance of the models tested by Hewitt et al. - Aqueous Solubility Dataset 003.


Calculating Descriptors
Using Rajarshi Guha's CDK Descriptor Calculator GUI (v 1.3.4) we calculated all 2D CDK descriptors (i.e. no protein or geometrical descriptors) except CPSA, IP, WHIM, and bpol (not available in 1.3.4) for the training set using the 'add explicit H' option.

Feature Selection
Removed descriptors with fewer than 36 non-zero entries: khs.sLi, khs.ssBe, khs.ssssBe, khs.ssBH, khs.sssB, khs.ssssB, khs.ddC, khs.sNH3, khs.ssNH2, khs.sssNH, khs.ssssN, khs.sSiH3, khs.ssSiH2, khs.sssSiH, khs.ssssSi, khs.sPH2, khs.ssPH, khs.sssP, khs.sssssP, khs.sSH, khs.dssS, khs.sGeH3, khs.ssGeH2, khs.sssGeH, khs.ssssGe, khs.sAsH2, khs.ssAsH, khs.sssAs, khs.sssdAs, khs.sssssAs, khs.sSeH, khs.dSe, khs.ssSe, khs.aaSe, khs.dssSe, khs.ddssSe, khs.sSnH3, khs.ssSnH2, khs.sssSnH, khs.ssssSn, khs.sI, khs.sPbH3, khs.ssPbH2, khs.sssPbH, khs.ssssPb
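A minimal sketch of this sparsity filter in R, assuming the descriptors sit in a data frame with one column per descriptor. The toy table and the lowered threshold are illustrative only.

```r
## Toy descriptor table; names and values are made up for illustration
desc <- data.frame(
  nAtom   = c(12, 8, 15, 9, 20),
  khs.sLi = c(0, 0, 0, 0, 0),
  khs.sOH = c(1, 0, 2, 0, 1)
)

## Threshold on non-zero entries: 36 in the text, 2 here to suit the toy data
min.nonzero <- 2

## Count non-zero entries per column (missing values are not counted)
nonzero <- colSums(desc != 0, na.rm = TRUE)

## Names of the sparse descriptors, then the table without them
sparse    <- names(nonzero)[nonzero < min.nonzero]
desc.kept <- desc[, !(names(desc) %in% sparse), drop = FALSE]
```

Selecting by name rather than by negative index keeps the result correct even when no descriptor falls below the threshold.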

Removed compounds with multiple NA values (descriptor calculation failure), CSIDs: 4797610, 2104719, 2104724, 4807115, 4750678, 4674166, 4806481, 4809281, 4783859, 4716159, 4713927, 520280, 17322037, 604145
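One way to find such compounds is to count missing descriptor values per row. The sketch below assumes a descriptor table with CSIDs as row names, and it reads "multiple" as "more than one". Both are assumptions, and the descriptor values are made up.

```r
## Toy descriptor table keyed by (hypothetical) CSID; all values are made up
desc <- data.frame(
  ALogP  = c(1.2, NA, 0.4, 2.1),
  TPSA   = c(20.2, NA, 37.3, NA),
  nHBDon = c(1, 0, 2, 1),
  row.names = c("1001", "1002", "1003", "1004")
)

## Number of missing descriptor values per compound
na.count <- rowSums(is.na(desc))

## Compounds treated as descriptor failures: more than one NA (assumed reading)
failed <- rownames(desc)[na.count > 1]

## Keep the remaining compounds
desc.clean <- desc[na.count <= 1, , drop = FALSE]
```

Any NA left in the table propagates through `cor()` under its default settings, so rows such as "1004" in this toy example would still need attention before the correlation step.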

Further feature selection was performed on the remaining set of 35537 molecules with 162 descriptors by using the caret package in R:
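## load caret, which provides findCorrelation()
library(caret)
## load in data (one row per molecule, indexed by molID)
mydata = read.csv(file="AqueousDataset001WithDescriptorsReadyForR.csv",header=TRUE,row.names="molID")
## correlation matrix of all columns
cor.mat = cor(mydata)
## list the column indices recommended for removal at pairwise |r| > 0.95
findCorrelation(cor.mat, cutoff = .95, verbose = TRUE)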
 [1] 162  62 160  63  68  32  60 154 152  27  26  69  12  28  61  64  29  15  71  65  77 159  66  72  73  80  74  13  16  48  94 33  42
The caret-recommended descriptors were removed: apol, naAromAtom, nAtom, ATSc1, ATSp1, ATSp2, ATSp3, ATSp4, nB, C1SP1, SCH-3, VCH-4, SP-0, SP-1, SP-2, SP-3, SP-4, SP-5, SP-6, VP-0, VP-1, VP-3, VP-4, VP-5, VP-6, SPC-5, VPC-5, khs.tsC, VABC, WTPT-1, WPATH, WPOL, Zagreb

This left us with a final set of 35537 molecules and 129 descriptors.
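The indices returned by findCorrelation() still have to be applied to the data frame. A minimal sketch of that step on synthetic data follows; the toy columns `x1`, `x2`, `x3` are assumptions, and this is not the code used for the actual dataset.

```r
library(caret)

## Toy data: x2 is nearly a copy of x1, x3 is independent
set.seed(1)
x1  <- rnorm(200)
toy <- data.frame(x1 = x1, x2 = x1 + rnorm(200, sd = 0.01), x3 = rnorm(200))

## Column indices recommended for removal at pairwise |r| > 0.95
drop.idx <- findCorrelation(cor(toy), cutoff = 0.95)

## Guard against the empty case: toy[, -integer(0)] would drop every column
reduced <- if (length(drop.idx) > 0) toy[, -drop.idx, drop = FALSE] else toy
```

If the data frame also holds the response (logS), it would normally be left out of the correlation matrix so that it cannot be selected for removal.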

Building the Model
An initial set of 1000 randomly selected molecules was used to build a random forest model in R with the following code:
## load the randomForest package [randomForest 4.6-6]
library(randomForest)
mydata = read.csv(file="AqueousDataset001TrainingSetReadyForRsmall.csv",header=TRUE,row.names="molID")
## fit a regression forest of logS on all descriptors
mydata.rf <- randomForest(logS ~ ., data = mydata,importance = TRUE)
## print the model summary
print(mydata.rf)
Call:
 randomForest(formula = logS ~ ., data = mydata, importance = TRUE)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 43
Mean of squared residuals: 0.3071348
% Var explained: 29.47
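The two numbers at the end of the printed summary are linked. The randomForest package reports "% Var explained" as 100 * (1 - MSE / Var(logS)), using the out-of-bag MSE and the variance of the training responses. The sketch below works backwards from the printed values to the implied spread of logS in the 1000-molecule sample. It is plain arithmetic on the printed output and involves no new data.

```r
## Values copied from the printed model summary
mse      <- 0.3071348   # mean of squared (out-of-bag) residuals
var.expl <- 29.47       # % Var explained

## Out-of-bag R2 as a fraction
r2 <- var.expl / 100

## Rearranging R2 = 1 - MSE / Var(logS) gives the variance of logS
var.logS <- mse / (1 - r2)
sd.logS  <- sqrt(var.logS)

## Typical out-of-bag prediction error, in log units
rmse <- sqrt(mse)
```

Under these assumptions the standard deviation of logS in the sample comes out at roughly 0.66 log units, against an out-of-bag RMSE of roughly 0.55.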
Such a low OOB R2 value (0.29) was surprising, so a second randomly selected set of 1000 molecules from the original dataset using the provided descriptors was run through R. This run had an OOB R2 value of 0.31. Looking at the data in its entirety (see histogram below), we see a sharp cutoff of values at logS = -0.8. This, together with the low R2 values, suggests that the dataset is not useful for modeling experimental aqueous solubility. Instead, it is useful for classification models, especially binary ones with two bins, "low solubility" (logS < -1.4) and "high solubility" (logS > -1.4), as has been done successfully before [1, 4].
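Turning logS into the two bins described above takes one line in R. The sketch below uses made-up logS values. The text leaves the case logS = -1.4 unspecified, so placing it in the "high" bin is an assumption.

```r
## Toy logS values; made up for illustration
logS <- c(-3.2, -1.9, -1.4, -1.1, -0.8)

## Cutoff from the text
cutoff <- -1.4

## Assumption: a value exactly at the cutoff goes in the "high" bin
sol.class <- factor(ifelse(logS < cutoff, "low", "high"),
                    levels = c("low", "high"))

## Class counts, useful for checking how balanced the two bins are
table(sol.class)
```

The resulting factor can be used as the response of a classifier in place of the continuous logS column.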


The aqueous solubility dataset is useful for binary classification models but is not a good set of data for general modeling of aqueous solubility.


1. Guha R et al. Exploratory analysis of kinetic solubility measurements of a small molecule library. Bioorganic & Medicinal Chemistry 2011, 19 (13), 4127–4134 (original data)
2. Wang J, Hou T, and Xu X. Aqueous Solubility Prediction Based on Weighted Atom Type Counts and Solvent Accessible Surface Areas. J. Chem. Inf. Model. 2009, 49, 571–581 doi:10.1021/ci800406y
3. Hewitt M et al. In Silico Prediction of Aqueous Solubility: The Solubility Challenge. J. Chem. Inf. Model. 2009, 49, 2572–2587 doi:10.1021/ci900286s
4. Cheng T, Li Q, Wang Y, and Bryant SH. Binary Classification of Aqueous Solubility Using Support Vector Machines with Reduction and Recombination Feature Selection. J. Chem. Inf. Model. 2011, 51, 229–236 doi:10.1021/ci100364a