General Solubility Model

Researchers: Lori Fielding and Jesse Patsolic

Objective
To model solubilities based on calculated CDK descriptors.

Procedure
The data from [|20120410ONSCSolubilityData.xls] was used as the initial dataset.

Preparing the Data
The file contained some HTML remnants that had been accidentally copied over from web-based data. Stray " " sequences were found and removed.

A problem arose with backslashes ("\") in the SMILES, which were being treated as escape characters. The data was being manipulated with Tcl scripts on a machine running OS X, which might have contributed to the problem. For example, the SMILES for meloxicam, O=C\2c1c(cccc1)S(=O)(=O)N(C/2=C(\O)Nc3ncc(s3)C)C, would appear as O=C inside the spreadsheet. Copying and pasting seemed to be the cause.
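One way to check whether backslashes survive a given step is to round-trip a known SMILES through a file and compare. The following is a minimal sketch in R, not the procedure used here (the original work used Tcl scripts). The temporary file, the column names, and the helper `n_backslashes` are hypothetical.

```r
# Meloxicam SMILES from the text. In R source each "\" is written as "\\".
smiles <- "O=C\\2c1c(cccc1)S(=O)(=O)N(C/2=C(\\O)Nc3ncc(s3)C)C"

# Hypothetical helper: count literal backslashes in a string.
n_backslashes <- function(s) {
  lengths(regmatches(s, gregexpr("\\", s, fixed = TRUE)))
}

# Write the string to a tab-separated temporary file without quoting.
tmp <- tempfile(fileext = ".tsv")
write.table(data.frame(name = "meloxicam", smiles = smiles),
            file = tmp, sep = "\t", quote = FALSE, row.names = FALSE)

# Read it back with quoting, comments, and escape processing disabled,
# so that characters such as "\" and "#" are taken literally.
back <- read.delim(tmp, stringsAsFactors = FALSE, quote = "",
                   comment.char = "", allowEscapes = FALSE)

# TRUE when the SMILES survived the round trip unchanged.
roundtrip_ok <- identical(back$smiles, smiles)
```

If `roundtrip_ok` is `FALSE`, or `n_backslashes(back$smiles)` is lower than expected, the step being tested is altering the SMILES.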

It was required that each solvent have at least seven measurements. [|Summary of all measurements] (Andrew Lang) gives a list of each solvent followed by its number of measurements.

All rows with water or mixed solvents were removed. A list of the solvents with at least seven measurements was created, [|AtLeastSeven.csv], and used with [|CSVremove.tcl] to filter out the solvents with fewer measurements.
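The filtering step can be sketched in R as follows. This is an illustration on toy data, not a port of CSVremove.tcl. The column names `solute` and `solvent` and the contents of `excluded` are assumptions.

```r
# Toy measurements; the real column names in the spreadsheet may differ.
measurements <- data.frame(
  solute  = paste0("s", 1:12),
  solvent = c(rep("ethanol", 7), rep("water", 2), rep("toluene", 3)),
  stringsAsFactors = FALSE
)

min_count <- 7            # threshold stated in the text
excluded  <- c("water")   # mixed solvents would be listed here as well

# Step 1: drop rows for excluded solvents.
kept <- measurements[!(measurements$solvent %in% excluded), ]

# Step 2: count measurements per remaining solvent.
counts <- table(kept$solvent)

# Step 3: keep only solvents that meet the threshold.
at_least_seven <- names(counts)[counts >= min_count]
filtered <- kept[kept$solvent %in% at_least_seven, ]
```

With this toy input only ethanol has seven measurements, so `filtered` holds its seven rows.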

[|20120410ONSCSolubilities.xml] contains data about the solvents in 20120410ONSCSolubilityData.xls. From the XML file the list of liquid solvents was parsed out with the following R script:

code format="rsplus"
# R version 2.14.1 (2011-12-22)
# Copyright (C) 2011 The R Foundation for Statistical Computing
# ISBN 3-900051-07-0
# Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
# Jesse L. Patsolic studiojlp@oru.edu
# MAT 429-97
# 04/19/12
# Under advisement of Dr. Andrew Lang

library(XML)
# Package: XML
# Version: 3.9-4

f <- "20120410ONSCSolubilities.xml"

# Parse the XML file and get its root node
doc <- xmlParse(f)
r <- xmlRoot(doc)

# Select every m:properties node (one per measurement)
all <- getNodeSet(doc, "//m:properties")

# Not used below
sol <- list(Solute="solute")
liq <- list(Liquid="liquid")

# Append the Solute and Liquid fields of each node to sol.csv and liq.csv
for (i in seq_along(all)) {
  b <- xmlToList(all[[i]])
  write(b$Solute, file="sol.csv", sep="\n", append=TRUE)
  write(b$Liquid, file="liq.csv", sep="\n", append=TRUE)
}
code

The following contains the filtered solubility data: [|20120425solids.csv]

From this file, the solute and solvent SMILES were pulled out and saved as text in separate .smi files: [|solutes.smi] and [|solvent.smi].

CDK Descriptors
These files were used as input to Rajarshi Guha's CDK Descriptor Calculator.

The option "add explicit H" was selected. The following descriptors were removed from the calculation:
 * Charged Partial Surface Area
 * Ionization Potential
 * Protein (all)
 * Geometrical (all)
 * WHIM
 * CHI Chain indices
 * CHI Path indices
 * CHI Cluster indices
 * CHI Path-Cluster indices
 * Moreau-Broto Auto-correlation (polarizability)
 * VABC Volume

The calculation produced two CSV files, which were then checked for "NA" values. The Kier3 column was removed, along with rows 2024, 2403, and 2788.
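A check of this kind can be sketched in R. The toy data frame and the rule for choosing which columns to drop (more than half the values missing) are assumptions for illustration. The text does not state how the Kier3 column and the three rows were selected.

```r
# Toy descriptor table standing in for the calculator's CSV output.
desc <- data.frame(
  ALogP   = c(1.2, 0.4, NA, 2.1),
  Kier3   = c(NA_real_, NA_real_, NA_real_, NA_real_),
  TopoPSA = c(20.2, 37.3, 46.5, 9.2)
)

# Number of NA values in each column.
na_per_column <- colSums(is.na(desc))

# Hypothetical rule: drop columns that are NA in more than half the rows.
drop_cols <- names(na_per_column)[na_per_column > nrow(desc) / 2]
desc <- desc[, !(names(desc) %in% drop_cols), drop = FALSE]

# Rows that still contain an NA after the column drop.
bad_rows <- which(!complete.cases(desc))

# Keep only the complete rows.
clean <- desc[complete.cases(desc), , drop = FALSE]
```

Dropping a mostly empty column first keeps more rows than removing every row that has any NA.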

Table Generation
The two CSV files were combined, together with the product of each solute descriptor and the matching solvent descriptor, using [|ColumnProduct.tcl]. The resulting columns have the following format:
|| ALogPsolute || ... || ALogPsolvent || ... || ALogPsolute_ALogPsolvent ||
|| x || ... || y || ... || (x*y) ||

The molar concentration was prepended to the output as the first column, yielding the file [|20120428solute_solvent_productsREADYforMODELING.csv].
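The table layout can be sketched in R on toy values. This is not a port of ColumnProduct.tcl. It assumes that both descriptor tables have the same columns in the same order, that rows are already paired, and that each product column multiplies a solute descriptor by the same descriptor of the solvent, as the header format suggests. All values are made up.

```r
# Toy descriptor tables; row i of each table belongs to measurement i.
solute  <- data.frame(ALogP = c(2, -1),    TopoPSA = c(20, 40))
solvent <- data.frame(ALogP = c(0.5, 0.5), TopoPSA = c(10, 10))
stopifnot(identical(names(solute), names(solvent)),
          nrow(solute) == nrow(solvent))

desc_names <- names(solute)

# Element-wise product of matching descriptor columns.
products <- solute * solvent
names(products) <- paste0(desc_names, "solute_", desc_names, "solvent")

# Suffix the original columns so the three groups stay distinguishable.
names(solute)  <- paste0(desc_names, "solute")
names(solvent) <- paste0(desc_names, "solvent")

# Prepend the response (made-up values) as the first column.
molarconcentration <- c(0.5, 1.2)
modeling <- cbind(molarconcentration = molarconcentration,
                  solute, solvent, products)
```

With d descriptors per molecule the modeling table has 3d predictor columns plus the response.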

Modeling
The molar concentration was then modeled in R 2.14.1 with the randomForest package (version 4.6-6).

code format="rsplus"
# R version 2.14.1 (2011-12-22)
# Copyright (C) 2011 The R Foundation for Statistical Computing
# ISBN 3-900051-07-0
# Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
# Jesse L. Patsolic studiojlp@oru.edu
# MAT 429-97
# Under advisement of Dr. Andrew Lang

require(randomForest)
# > Loading required package: randomForest
# > randomForest 4.6-6
# > Type rfNews() to see new features/changes/bug fixes.

setwd("/Users/StudioJLP/Documents/ORU/SP12/MAT429_97Chem/General+Solubility+Model")

# The file is tab-separated despite its .csv extension
mydata <- read.csv(file="20120428solute_solvent_productsREADYforMODELING.csv",
                   header=TRUE, row.names="molID", sep="\t")

# Fit a regression forest on all descriptor columns
mydata.rf <- randomForest(molarconcentration ~ ., data = mydata, importance=TRUE)

# Predictions on the training data itself
# (predict(mydata.rf) with no new data would return out-of-bag predictions)
test.predict <- predict(mydata.rf, mydata)
write.csv(test.predict, file="RFTestPredict_Solubility.csv")

varImpPlot(mydata.rf, main = "Random Forest Variable Importance: Molar Concentration")
# see following graph

print(mydata.rf)
# output
#
# Call:
#  randomForest(formula = molarconcentration ~ ., data = mydata, importance = TRUE)
#                Type of random forest: regression
#                      Number of trees: 500
# No. of variables tried at each split: 162
#
#           Mean of squared residuals: 0.5387622
#                     % Var explained: 71.51

proc.time()
#    user  system elapsed
# 466.367   3.833 555.596
code
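The printed summary can be unpacked with a little arithmetic. The sketch below assumes the package's usual conventions: that "% Var explained" is 100 * (1 - MSE / variance of the response), computed from out-of-bag predictions, and that the regression default for the number of variables tried at each split is floor(p / 3) for p predictors. The derived numbers are estimates read back from the printed output, not values reported in the original run.

```r
# Values copied from the printed model summary.
mse     <- 0.5387622   # mean of squared residuals
pct_var <- 71.51       # % Var explained
mtry    <- 162         # variables tried at each split

# Root-mean-square error, in the units of the response.
rmse <- sqrt(mse)

# Variance of the response implied by the two reported figures.
implied_var <- mse / (1 - pct_var / 100)

# Range of predictor counts p consistent with mtry = floor(p / 3).
p_min <- mtry * 3
p_max <- mtry * 3 + 2
```

Under these assumptions the RMSE is about 0.73, the implied response variance is about 1.89, and the table had between 486 and 488 predictor columns, which is consistent with three equal groups of 162 columns (solute, solvent, and product descriptors).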

The following is the plot generated by the varImpPlot command.

Plotting the Results
The predicted and experimental values of the molar concentration were combined with some of the top descriptors from the varImpPlot output into [|20120428solute_solvent_productsFORtp.csv].

The experimental molar concentration was plotted against the predicted values with Tableau Public (v. 7.0). Point size corresponds to the absolute error, and points are colored according to the TopoPSAsolute descriptor.
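The absolute error used for the point sizes is the magnitude of the difference between the experimental and predicted values. A minimal sketch in R, with made-up numbers and hypothetical column names:

```r
# Toy experimental and predicted molar concentrations (made-up values).
results <- data.frame(
  experimental = c(0.5, 1.0, 2.0),
  predicted    = c(0.75, 1.0, 1.5)
)

# Absolute error per measurement; larger values would get larger points.
results$abs_error <- abs(results$experimental - results$predicted)

# Mean absolute error over all measurements.
mae <- mean(results$abs_error)
```

The same column could be passed to a plotting tool as the size variable.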