General Solubility Model

Lori Fielding and Jesse Patsolic


The objective is to model solubilities based on calculated CDK descriptors.


The data from 20120410ONSCSolubilityData.xls was used as the initial dataset.

Preparing the Data

The file had some HTML tags that were accidentally copied over from web-based data.
All instances of "<wbr />" were found and removed.

A problem arose with the backslash ("\") characters in the SMILES, which were being treated as escape characters. The data was being manipulated with Tcl scripts on a machine running OS X, which might have contributed to the problem. For example, the SMILES for meloxicam, O=C\2c1c(cccc1)S(=O)(=O)N(C/2=C(\O)Nc3ncc(s3)C)C, would appear as O=C inside the spreadsheet, cut off at the first backslash. Copying and pasting seemed to be the cause.
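
The two cleanup issues above can be illustrated with a short base R sketch. It is not the procedure actually used (that relied on Tcl scripts and the spreadsheet); the cell values and variable names are hypothetical. It strips literal "<wbr />" tags and checks that a SMILES string containing backslashes survives a round trip through a plain-text file.

```r
# Hypothetical cell values standing in for spreadsheet contents.
cells <- c("2-methyl<wbr />propan-1-ol", "ethanol")

# fixed = TRUE matches the tag literally instead of as a regular expression.
cells_clean <- gsub("<wbr />", "", cells, fixed = TRUE)

# In R source code a literal backslash is written as "\\".
meloxicam <- "O=C\\2c1c(cccc1)S(=O)(=O)N(C/2=C(\\O)Nc3ncc(s3)C)C"

# Round-trip the SMILES through a plain-text file rather than the clipboard.
smi_file <- tempfile(fileext = ".smi")
writeLines(meloxicam, smi_file)
smiles_back <- readLines(smi_file)

# TRUE when nothing was lost, backslashes included.
round_trip_ok <- identical(smiles_back, meloxicam)

# Count the backslashes so the value can be compared before and after a copy step.
n_backslash <- lengths(regmatches(smiles_back,
                                  gregexpr("\\", smiles_back, fixed = TRUE)))
```

Comparing the backslash count (or the string length) before and after each manipulation step is one simple way to spot SMILES that were silently truncated.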

It was required that each solvent have at least seven measurements.
Summary of all measurements (Andrew Lang) gives a list of each solvent followed by its number of measurements.

All of the rows with water or mixed solvents were removed.
A list of the solvents with at least seven measurements was created (AtLeastSeven.csv) and used, with CSVremove.tcl, to filter out the solvents with fewer.
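
The filtering rule can be sketched in base R as follows. This is only an illustration, not CSVremove.tcl: the miniature dataset and its column names are hypothetical, and it assumes the counts are taken after the water rows are dropped (the notes do not say which came first).

```r
# Hypothetical miniature dataset; column names are assumptions.
measurements <- data.frame(
  solute  = paste0("solute", 1:17),
  solvent = c(rep("ethanol", 8), rep("toluene", 7), "water", "hexane"),
  stringsAsFactors = FALSE
)

# Drop water rows (mixed solvents would need a similar rule).
measurements <- measurements[measurements$solvent != "water", ]

# Count measurements per solvent and keep the names with at least seven.
counts <- table(measurements$solvent)
at_least_seven <- names(counts)[counts >= 7]

# Keep only rows whose solvent is on that list.
filtered <- measurements[measurements$solvent %in% at_least_seven, ]
```

Here at_least_seven plays the role of AtLeastSeven.csv, and the final subsetting step plays the role of the Tcl filter.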

20120410ONSCSolubilities.xml contains data about the solvents in 20120410ONSCSolubilityData.xls. From the XML file the list of liquid solvents was parsed out with the following R script:

# R version 2.14.1 (2011-12-22)
# Copyright (C) 2011 The R Foundation for Statistical Computing
# ISBN 3-900051-07-0
# Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
# Jesse L. Patsolic
# MAT 429-97
# 04/19/12
# Under advisement of Dr. Andrew Lang
# Package: XML
# Version: 3.9-4
library(XML)
f <- "20120410ONSCSolubilities.xml"
doc <- xmlParse(f)
r <- xmlRoot(doc)
# Select every <m:properties> node in the document
all <- getNodeSet(doc, "//m:properties")
sol <- list(Solute="solute")
liq <- list(Liquid="liquid")
for (i in seq_along(all)) {
    # Convert the i-th properties node to an R list
    b <- xmlToList(all[[i]])
    # (the remainder of the loop body is not reproduced on this page)
}

The filtered solubility data are in 20120425solids.csv.

From this file, the solute and solvent SMILES were pulled out and saved as text in separate .smi files: solutes.smi and solvent.smi.
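
A minimal base R sketch of this extraction step is shown below. The column names soluteSMILES and solventSMILES and the example rows are hypothetical, and the files are written to a temporary directory. One SMILES is written per line, in row order, so that line n of each file refers to the same measurement.

```r
# Hypothetical stand-in for 20120425solids.csv; column names are assumptions.
solids <- data.frame(
  soluteSMILES  = c("c1ccccc1C(=O)O", "OC(=O)c1ccccc1O"),
  solventSMILES = c("CCO", "Cc1ccccc1"),
  stringsAsFactors = FALSE
)

# Write to a temporary directory for the purposes of this example.
out_dir      <- tempdir()
solute_file  <- file.path(out_dir, "solutes.smi")
solvent_file <- file.path(out_dir, "solvent.smi")

# One SMILES per line, keeping the original row order in both files.
writeLines(solids$soluteSMILES,  solute_file)
writeLines(solids$solventSMILES, solvent_file)
```

Keeping the two files row-aligned matters later, when the solute and solvent descriptor tables are combined row by row.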

CDK Descriptors

These files were used as input to Rajarshi Guha's CDK Descriptor Calculator.

The option "add explicit H" was selected.
The following descriptors were removed from the calculation:
Charged Partial Surface Area
Ionization Potential
Protein (all)
Geometrical (all)
CHI Chain indices
CHI Path indices
CHI Cluster indices
CHI Path-Cluster indices
Moreau-Broto Auto-correlation (polarizability)
VABC Volume

The calculator produced two CSV files, which were then checked for "NA" values. The Kier3 column was removed, along with rows 2024, 2403, and 2788.
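
One way to locate "NA" values of this kind is sketched below in base R. The descriptor values are made up, and the rule used here (drop a column when more than half of its entries are NA, then drop the remaining incomplete rows) is an assumption for illustration; the notes do not state how the column and rows were chosen.

```r
# Hypothetical descriptor table; values are made up.
desc <- data.frame(
  Kier1   = c(1.2, 3.4, 2.2, 5.0),
  Kier3   = c(NA, NA, 0.7, NA),
  TopoPSA = c(20.2, NA, 37.3, 63.6)
)

# Number of NA values in each column.
na_per_column <- colSums(is.na(desc))

# Assumed rule: drop columns in which more than half the values are NA.
keep_cols <- na_per_column <= 0.5 * nrow(desc)
desc_cols <- desc[, keep_cols, drop = FALSE]

# Rows that still contain an NA, and the table with them removed.
bad_rows   <- which(!complete.cases(desc_cols))
desc_clean <- desc_cols[complete.cases(desc_cols), , drop = FALSE]
```

If a row is dropped from one descriptor file, the same row presumably has to be dropped from the other file (and from the response) so that the tables stay aligned.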

Table Generation

The two generated CSV files were combined, along with their product, in the following format using ColumnProduct.tcl:

The molar concentration was prepended to the output file, yielding 20120428solute_solvent_productsREADYforMODELING.csv.
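
The table layout can be sketched in base R as follows. This is not ColumnProduct.tcl. It assumes that "product" means the element-wise product of each solute descriptor with the matching solvent descriptor, and that the molar concentration is prepended as the first column. The "solute" suffix follows the TopoPSAsolute name used later in these notes; the "solvent" and "product" suffixes and all values are hypothetical.

```r
# Hypothetical descriptor tables with matching columns and aligned rows.
solute  <- data.frame(ALogP = c(1.5, 2.0),  TopoPSA = c(37.3, 20.2))
solvent <- data.frame(ALogP = c(-0.1, 2.7), TopoPSA = c(20.2, 0.0))
molarconcentration <- c(0.85, 0.12)

# The element-wise product only makes sense if both tables line up.
stopifnot(identical(names(solute), names(solvent)),
          nrow(solute) == nrow(solvent))

# Element-wise product of corresponding descriptor columns.
base_names <- names(solute)
product <- solute * solvent

# Suffix the column names so the three blocks can be told apart.
names(product) <- paste0(base_names, "product")
names(solute)  <- paste0(base_names, "solute")
names(solvent) <- paste0(base_names, "solvent")

# Response first, then solute, solvent, and product descriptors.
modeling <- cbind(molarconcentration = molarconcentration,
                  solute, solvent, product)
```

With this layout the response sits in the first column, so a formula such as molarconcentration ~ . uses every descriptor column as a predictor.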


The molar concentration was then modeled in R 2.14.1 with the randomForest package (version 4.6-6).

# R version 2.14.1 (2011-12-22)
# Copyright (C) 2011 The R Foundation for Statistical Computing
# ISBN 3-900051-07-0
# Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
# Jesse L. Patsolic
# MAT 429-97
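# Under advisement of Dr. Andrew Lang
library(randomForest)
# Loading required package: randomForest
# randomForest 4.6-6
# Type rfNews() to see new features/changes/bug fixes.
# The first line of the following call is missing from this page; the function
# (read.table) and the file name (taken from the previous step) are assumptions.
mydata <- read.table("20120428solute_solvent_productsREADYforMODELING.csv",
                     header=TRUE,row.names="molID",sep="\t")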
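# Fit a regression forest on all descriptor columns, keeping importance measures
mydata.rf <- randomForest(molarconcentration ~ ., data = mydata, importance = TRUE)
# Predicting on the training data itself gives full-forest, not out-of-bag, predictions
test.predict <- predict(mydata.rf, mydata)
write.csv(test.predict, file = "RFTestPredict_Solubility.csv")
varImpPlot(mydata.rf, main = "Random Forest Variable Importance: Molar Concentration")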
# see following graph
# output
randomForest(formula = molarconcentration ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 162
          Mean of squared residuals: 0.5387622
                    % Var explained: 71.51
# timing, in seconds
   user  system elapsed
466.367   3.833 555.596
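
Two of the reported numbers can be cross-checked with simple arithmetic. For regression, randomForest defaults to trying max(floor(p/3), 1) variables at each split, where p is the number of predictors, and its "% Var explained" is a pseudo R-squared of the form 1 - MSE / Var(y), based on out-of-bag predictions. The sketch below assumes the default mtry was used (the call shown passes no mtry argument); the variable names are hypothetical.

```r
# Default number of variables tried per split for a regression forest.
default_mtry <- function(p) max(floor(p / 3), 1)

# Predictor counts that would give the reported value of 162.
candidates   <- 480:495
p_consistent <- candidates[sapply(candidates, default_mtry) == 162]

# Reported out-of-bag figures from the output above.
mse     <- 0.5387622
pct_var <- 71.51

# Response variance implied by pseudo R-squared = 1 - mse / var(y).
implied_var <- mse / (1 - pct_var / 100)
```

Under these assumptions the model would have had between 486 and 488 predictor columns, which is consistent with three roughly equal blocks of descriptors (solute, solvent, and product).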

The following is the plot generated by the varImpPlot command.

Plotting the Results

The predicted and experimental molar concentrations, along with some of the top descriptors identified by varImpPlot(), were combined into 20120428solute_solvent_productsFORtp.csv.

The experimental molar concentration was plotted against the predicted values with Tableau Public (v. 7.0). Point size corresponds to the absolute error, and the points are colored according to the TopoPSAsolute descriptor.
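
The absolute error used for the point sizes can be computed as in the base R sketch below. The plot itself was made in Tableau Public, so this is only a rough stand-in; the data values, column names, and size scaling are hypothetical, and the coloring by TopoPSAsolute is omitted.

```r
# Hypothetical experimental and predicted molar concentrations.
results <- data.frame(
  experimental = c(0.85, 0.12, 1.40),
  predicted    = c(0.80, 0.30, 1.10)
)

# Absolute error of each prediction.
results$abs_error <- abs(results$experimental - results$predicted)

# Draw to a temporary PDF so the example runs without a display.
plot_file <- tempfile(fileext = ".pdf")
pdf(plot_file)
plot(results$predicted, results$experimental,
     cex  = 1 + 3 * results$abs_error,   # larger points = larger error
     xlab = "Predicted molar concentration",
     ylab = "Experimental molar concentration")
abline(0, 1, lty = 2)                    # line of perfect agreement
dev.off()
```

Points far from the dashed line are the ones drawn largest, which makes the poorly predicted measurements easy to pick out.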