AbrahamSolventsModel002

CDK-Based Solvent Abraham Coefficient Models

 * Researcher:** Andrew Lang

Objective
To create two separate random forest models for each of the Abraham model solvent coefficients c, e, s, a, b, and v. One model for a general diverse set of solvents (after outliers have been removed) and one model for a limited set of solvents. All content, models and data are released as CC0 - the default license for all ONS work.

Procedure
code mydata = read.csv(file="20120126SolventsWithDescriptorsForPCA.csv",head=TRUE,row.names="molID")
 * Model002a:** Our previous analysis reduced our set of [|78 solvents with CDK-descriptors], calculated using Rajarshi Guha's CDK Descriptor Calculator, to one of [|63 solvents] by eliminating four solvents that lay outside of the main chemical space (tributyl phosphate, isopropyl myristate, hexadecane, and octadecanol), 7 solvents with c-coefficient less than -0.058 (DMF, dimethylacetamide, ethylene glycol, nitrobenzene, DMSO, iodobenzene, and formamide) under the assumption that solvents should not have negative values for c, [1] and 4 additional solvents identified as consensus (over all coefficients) model outliers (trifluoroethanol, carbon disulfide, nitromethane, and N-formylmorpholine).
 * Model002a-Chemical Space:** The chemical space shown below was created using Tableau Public v7.0 with principal components calculated using the following code in R 2.13.0:
 * 1) load in data

data <- mydata[, 1:129] ## Don't include molID in pca pc1 <- prcomp(data, scale. = T) x <- pc1$x summary(pc1)

[output] Importance of components: PC1   PC2     PC3 Standard deviation    6.2811 4.3032 3.47059 Proportion of Variance 0.3058 0.1436 0.09337 Cumulative Proportion 0.3058 0.4494 0.54275 [output] code The colours in the above chemical space correspond to the third principal component. code library(randomForest) mydata = read.csv(file="20120126SolventsWithDescriptorsc.csv",head=TRUE,row.names="molID") mydata.rf <- randomForest(c ~ ., data = mydata,importance = TRUE) varImpPlot(mydata.rf) print(mydata.rf) [output] Call: randomForest(formula = c ~ ., data = mydata, importance = TRUE) Type of random forest: regression Number of trees: 500 No. of variables tried at each split: 43
 * Model002a-RFc:** A random forest model for the c-coefficient was created using the following R code:
 * 1) load in data
 * 1) do random forest [randomForest 4.5-34]
 * 1) Show important descriptors
 * 1) Summary of random forest model

Mean of squared residuals: 0.006481098 % Var explained: 44.65 [output] test.predict <- predict(mydata.rf,mydata) write.csv(test.predict, file = "RFTestPredict.csv") code The variable importance can be seen in the following plot (nHBDon, WTPT.2, etc.) The model was exported using the following code and is available for download: [|crfmodel] code .saveRDS(mydata.rf, file = "crfmodel") code A plot of predicted values vs. measured values is shown below coloured from green to red by AE. A QSARDB model was created (annotated example) through the command line ([|clog.txt]) as is also available for download: [|AbrahamSolventc.qdb.zip] The training set R2 for the random forest model 0.9083 (OOB R2 0.4465 - approximate test-set R2).
 * 1) predict using the random forest model
 * 1) write the predictions to the working directory
 * 1) serialize(mydata.rf, model)

code library(randomForest) mydata = read.csv(file="20120126SolventsWithDescriptorse.csv",head=TRUE,row.names="molID") mydata.rf <- randomForest(e ~ ., data = mydata,importance = TRUE) varImpPlot(mydata.rf) print(mydata.rf) [output] Call: randomForest(formula = e ~ ., data = mydata, importance = TRUE) Type of random forest: regression Number of trees: 500 No. of variables tried at each split: 43
 * Model002a-RFe:** A random forest model for the e-coefficient was created using the following R code:
 * 1) load in data
 * 1) do random forest [randomForest 4.6-6]
 * 1) Show important descriptors
 * 1) Summary of random forest model

Mean of squared residuals: 0.01895079 % Var explained: 40.21 [output] test.predict <- predict(mydata.rf,mydata) write.csv(test.predict, file = "RFTestPredict.csv") code The training-set R2 = 0.9019, OOB R2 = 0.4021. The variable importance can be seen in the following plot (XLogP, ATSc1, etc.) The model was exported using the following code and is available for download: [|erfmodel] code .saveRDS(mydata.rf, file = "erfmodel") code A plot of predicted values vs. measured values is shown below coloured from green to red by AE. A QSARDB model was created through the command line and is also available for download: [|AbrahamSolvente.qdb.zip] The training set R2 value for the random forest model 0.9019 (OOB R2 0.4021 - approximate test-set R2).
 * 1) predict using the random forest model
 * 1) write the predictions to the working directory
 * 1) serialize(mydata.rf, model)

code library(randomForest) mydata = read.csv(file="20120126SolventsWithDescriptorss.csv",head=TRUE,row.names="molID") mydata.rf <- randomForest(s ~ ., data = mydata,importance = TRUE) varImpPlot(mydata.rf) print(mydata.rf) [output] Call: randomForest(formula = s ~ ., data = mydata, importance = TRUE) Type of random forest: regression Number of trees: 500 No. of variables tried at each split: 43
 * Model002a-RFs:** A random forest model for the s-coefficient was created using the following R code:
 * 1) load in data
 * 1) do random forest [randomForest 4.6-6]
 * 1) Show important descriptors
 * 1) Summary of random forest model

Mean of squared residuals: 0.07565735 % Var explained: 81.47 [output] test.predict <- predict(mydata.rf,mydata) write.csv(test.predict, file = "RFTestPredict.csv") code The training-set R2 = 0.9700, OOB R2 = 0.8147. The variable importance can be seen in the following plot (ATSc1, nAtomP, etc.) The model was exported using the following code and is available for download: [|srfmodel] code .saveRDS(mydata.rf, file = "srfmodel") code A plot of predicted values vs. measured values is shown below coloured from green to red by AE. A QSARDB model was created through the command line and is also available for download: [|AbrahamSolvents.qdb.zip]
 * 1) predict using the random forest model
 * 1) write the predictions to the working directory
 * 1) serialize(mydata.rf, model)

code library(randomForest) mydata = read.csv(file="20120126SolventsWithDescriptorsa.csv",head=TRUE,row.names="molID") mydata.rf <- randomForest(a ~ ., data = mydata,importance = TRUE) varImpPlot(mydata.rf) print(mydata.rf) [output] Call: randomForest(formula = a ~ ., data = mydata, importance = TRUE) Type of random forest: regression Number of trees: 500 No. of variables tried at each split: 43
 * Model002a-RFa:** A random forest model for the a-coefficient was created using the following R code:
 * 1) load in data
 * 1) do random forest [randomForest 4.6-6]
 * 1) Show important descriptors
 * 1) Summary of random forest model

Mean of squared residuals: 0.1258657 % Var explained: 95.6 [output] test.predict <- predict(mydata.rf,mydata) write.csv(test.predict, file = "RFTestPredict.csv") code The training-set R2 = 0.9924, OOB R2 = 0.9560. The variable importance can be seen in the following plot (nHBAcc, TopoPSA, WTPT.4, etc.) The model was exported using the following code and is available for download: [|arfmodel] code .saveRDS(mydata.rf, file = "arfmodel") code A plot of predicted values vs. measured values is shown below coloured from green to red by AE. A QSARDB model was created through the command line and is also available for download: [|AbrahamSolventa.qdb.zip]
 * 1) predict using the random forest model
 * 1) write the predictions to the working directory
 * 1) serialize(mydata.rf, model)

code library(randomForest) mydata = read.csv(file="20120126SolventsWithDescriptorsb.csv",head=TRUE,row.names="molID") mydata.rf <- randomForest(b ~ ., data = mydata,importance = TRUE) varImpPlot(mydata.rf) print(mydata.rf) [output] Call: randomForest(formula = b ~ ., data = mydata, importance = TRUE) Type of random forest: regression Number of trees: 500 No. of variables tried at each split: 43
 * Model002a-RFb:** A random forest model for the b-coefficient was created using the following R code:
 * 1) load in data
 * 1) do random forest [randomForest 4.6-6]
 * 1) Show important descriptors
 * 1) Summary of random forest model

Mean of squared residuals: 0.05920697 % Var explained: 74.64 [output] test.predict <- predict(mydata.rf,mydata) write.csv(test.predict, file = "RFTestPredict.csv") code The training-set R2 = 0.9517, OOB R2 = 0.7464. The variable importance can be seen in the following plot (khs.sOH, nHBDon, etc.) The model was exported using the following code and is available for download: [|brfmodel] code .saveRDS(mydata.rf, file = "brfmodel") code A plot of predicted values vs. measured values is shown below coloured from green to red by AE. A QSARDB model was created through the command line and is also available for download: [|AbrahamSolventb.qdb.zip]
 * 1) predict using the random forest model
 * 1) write the predictions to the working directory
 * 1) serialize(mydata.rf, model)

code library(randomForest) mydata = read.csv(file="20120126SolventsWithDescriptorsv.csv",head=TRUE,row.names="molID") mydata.rf <- randomForest(v ~ ., data = mydata,importance = TRUE) varImpPlot(mydata.rf) print(mydata.rf) [output] Call: randomForest(formula = v ~ ., data = mydata, importance = TRUE) Type of random forest: regression Number of trees: 500 No. of variables tried at each split: 43
 * Model002a-RFv:** A random forest model for the v-coefficient was created using the following R code:
 * 1) load in data
 * 1) do random forest [randomForest 4.6-6]
 * 1) Show important descriptors
 * 1) Summary of random forest model

Mean of squared residuals: 0.03566637 % Var explained: 68.16 [output] test.predict <- predict(mydata.rf,mydata) write.csv(test.predict, file = "RFTestPredict.csv") code The training-set R2 = 0.9483, OOB R2 = 0.6816. The variable importance can be seen in the following plot (TopoPSA, XLogP, etc.) The model was exported using the following code and is available for download: [|vrfmodel] code .saveRDS(mydata.rf, file = "vrfmodel") code A plot of predicted values vs. measured values is shown below coloured from green to red by AE. A QSARDB model was created through the command line and is also available for download: [|AbrahamSolventv.qdb.zip]
 * 1) predict using the random forest model
 * 1) write the predictions to the working directory
 * 1) serialize(mydata.rf, model)

Results - Model002a (General)
The following table summarizes the results, including the Training Set R2 (R2) and the Out of Bag R2 (OOB R2) which is a good estimate of utility. The models were used to predict the parameters of all the solvents in the test-set and a consensus measure of accuracy was attained by summing the squares of the standardized absolute errors across all parameters. Using this measure we identify the model outliers to be acetonitrile and sulfolane. The chart below, showing the consensus error for each solvent, also gives information as to how the models perform on the different classes on solvents. The models perform better for compounds that only contain carbon (non-aliphatic rings), oxygen and hydrogen atoms.
 * parameter || R2 || OOB R2 || meaning[1,2] || ImpVar ||
 * c || 0.91 || 0.45 || log of the solvent–water ratio of the volume packing densities || nHBDon,WTPT.2 ||
 * e || 0.90 || 0.40 || general solute-solvent dispersion interactions || XLogP, ATSc1 ||
 * s || 0.97 || 0.81 || dipole-dipole and dipole-induced interactions || ATSc1, nAtomP ||
 * a || 0.99 || 0.96 || hydrogen-bond basicity || nHBAcc, TopoPSA, WTPT.4 ||
 * b || 0.95 || 0.75 || hydrogen-bond acidity || khs.sOH, nHBDon ||
 * v || 0.95 || 0.68 || associated with cavity formation || TopoPSA, XLogP ||

Predicting the Abraham parameters for new solvents
1863 compounds with a melting point between -100 C and 0 C were taken from the most recent Open melting point dataset (ONSMP031). All CDK descriptors (except CPSA, Ionization Potential, WHIM, all Protein descriptors, and all Geometrical descriptors) were calculated using Rajarshi Guha's CDK Descriptor Calculator v1.3.2. The descriptors Kier3 and HybRatio were removed post-calculation due to a significant number of NA values. The [|solvent parameters for 1863 compounds] were then calculated using the random forest models above, see the following R code: code library(randomForest) mytestdata = read.csv(file="cdk.csv",head=TRUE,row.names="molID")
 * 1) load in the data

mydata.rf <- .readRDS(file="crfmodel") test.predict <- predict(mydata.rf,mytestdata) write.csv(test.predict, file = "RFTestPredictc.csv")
 * 1) load in the models and output the results to csv files

mydata.rf <- .readRDS(file="erfmodel") test.predict <- predict(mydata.rf,mytestdata) write.csv(test.predict, file = "RFTestPredicte.csv")

mydata.rf <- .readRDS(file="srfmodel") test.predict <- predict(mydata.rf,mytestdata) write.csv(test.predict, file = "RFTestPredicts.csv")

mydata.rf <- .readRDS(file="arfmodel") test.predict <- predict(mydata.rf,mytestdata) write.csv(test.predict, file = "RFTestPredicta.csv")

mydata.rf <- .readRDS(file="brfmodel") test.predict <- predict(mydata.rf,mytestdata) write.csv(test.predict, file = "RFTestPredictb.csv")

mydata.rf <- .readRDS(file="vrfmodel") test.predict <- predict(mydata.rf,mytestdata) write.csv(test.predict, file = "RFTestPredictv.csv") code The Abraham parameters were then used to predict the 25C-solubility of trans-cinnamic acid in all 1863 solvents using the latest Abraham descriptors for trans-cinnamic acid.

Model002b
Here we limit our models to solvents that only contain carbon, oxygen, and hydrogen atoms. We started with the original list of 78 solvents and then removed solvents that were not comprised of only carbon, oxygen, and hydrogen atoms (1,2-dichloroethane, 1-chlorobutane, acetonitrile, benzonitrile, bromobenzene, carbon disulfide, carbon tetrachloride, chlorobenzene, chloroform, dibutylformamide, dichloromethane, diethylacetamide, dimethylacetamide, DMF, DMSO, fluorobenzene, formamide, iodobenzene, N-ethylacetamide, N-ethylformamide, N-formylmorpholine, nitrobenzene, nitromethane, N-methyl-2-piperidone, N-methylacetamide, N-methylformamide, N-methylpyrrolidinone, sulfolane, tributyl phosphate, trifluoroethanol), This left us with 48 solvents.


 * Removing Outliers:** Some of the 48 solvents were removed as outliers previously but are considered here, including solvents with negative c-parameter, thus before modeling we proceed by removing a new set of outliers exclusively identified step-wise using our consensus measure outlined above listed with the most suspect parameter: ethylene glycol (c = 2.729), isopropyl myristate (c = -0.605), propylene carbonate (v = 3.412), octadecanol (e = 0.148), methylcyclohexane (e = 0.782), cyclohexane (v = 4.577), cyclohexanone (e = 0.225), methanol (v = 3.549). This left us with 40 solvents.

The chart below, showing the consensus error for each solvent, also gives information as to how the models perform on the different classes on solvents within the set. The models seems to perform worse, in general, for compounds with aliphatic rings. The predicted Abraham parameters were compared to those from model002a and it was [|found] that no statistical improvement was gained when restricting to a smaller subset. Thus since model002a is built using a more diverse set of solvents, it is recommended as the model of choice to predict the Abraham parameters for new solvents.
 * Model002b-RandomForest:** Random forest models were created for the Abraham solvent parameters as above. The following table summarizes the results, including the Training Set R2 (R2) and the Out of Bag R2 (OOB R2) which is a good estimate of utility.
 * parameter || R2 || OOB R2 || meaning[1,2] || ImpVar ||
 * c || 0.93 || 0.61 || log of the solvent–water ratio of the volume packing densities || khs.sOH, nHBDon ||
 * e || 0.94 || 0.65 || general solute-solvent dispersion interactions || MDEC.12, nAtomLAC ||
 * s || 0.97 || 0.81 || dipole-dipole and dipole-induced interactions || nAtomLAC, ATSc1 ||
 * a || 0.99 || 0.98 || hydrogen-bond basicity || nHBDon, ATSc2 ||
 * b || 0.98 || 0.90 || hydrogen-bond acidity || khs.sOH, nHBDon ||
 * v || 0.96 || 0.77 || associated with cavity formation || TopoPSA, BCUTp.1l ||

Classifying Solvents
We have derived models for predicting the Abraham solvent parameters that have OOB R2-values ranging from 0.40 to 0.96. Using these models we provide a combined list of [|1939 solvents with Abraham parameters] (88 measured-20120218 and 1851 predicted) partitioned into 12 groups using K-means clustering: code mydata = read.csv(file="AbrahamParametersForAllSolventsForR.csv",head=TRUE,row.names="solvent") mydata <- scale(mydata) # standardize variables fit <- kmeans(mydata, 12) # 12 cluster solution aggregate(mydata,by=list(fit$cluster),FUN=mean) [output] Group.1         c          e          s          a           b            v 1        1 -3.4447087 -2.3802005  1.0588094  1.2687462  8.35950545 -8.264504875 2       2 -0.2213855  1.6799541 -1.7406426 -1.4034535 -0.52068584  0.959811199 3        3  1.1138946  0.5469179 -1.2799698 -1.0889780 -0.51144540  0.751917778 4        4  0.9063351 -0.6136814  0.1136884 -1.0827809  0.58785776  0.492690575 5        5 -0.4784631  0.5757637  0.2347302  0.5634954 -0.62430692 -0.125921445 6        6 -0.8509431 -0.5903600  1.0320664  0.7625482 -0.14621129 -0.499160233 7        7  0.5982105 -0.9070429  0.3900724  1.0393694  1.49439191 -1.001666347 8        8 -0.6185940  0.3117337 -0.6860866  0.9540825  1.08859538 -0.009583815 9        9  1.1098023 -1.1295906  0.8859178  0.7341573  0.02253894 -0.907135912 10      10 -1.0171923 -0.3653057  0.8084089 -0.6995847 -0.09943499  0.545896977 11      11 -0.8630286  0.5534365 -0.2131106 -1.0294510 -0.22548842  0.789781060 12      12  0.6925955 -0.5640462  0.4181038  0.6016715 -0.56390282 -0.150832117 [output] mydata <- data.frame(mydata, fit$cluster) write.csv(mydata, file = "clusters.csv") code The 12 groups have the following statistics: Looking at the above table, we see two interesting results: 1. Group one contains no predicted solvents, where all the other groups have over 100. This means the solvents of group one are atypical/special. 2. Group five has 245 predicted solvents but no solvents with measured Abraham parameters. This suggests that a whole entire region of the chemical space has yet to be explored. The predicted solvents of group five a typified by compounds like: ethyl undec-10-enoate, trielaidoylglycerol, octyl acetate, aramite, beta-ionone, beta-butoxyethyl phthalate, 6-methylheptyl 2-(2,4-dichlorophenoxy)propanoate, 1,1,5-Trimethyl-3,3-bis[(2-methyl-2-propanyl)peroxy]cyclohexane. The compounds of group five are most similar to those of group group six, except in that they have a much larger values for 'e'.
 * 1) K-Means Cluster Analysis
 * 1) get cluster means
 * 1) append cluster assignment
 * Group || Measured Solvents || Number Of Predicted Solvents || Durand Classification[3] ||
 * 1 || water, ethanol/water(10:90)vol, ethanol/water(20:80)vol, ethanol/water(30:70)vol, ethanol/water(40:60)vol, ethanol/water(50:50)vol, ethanol/water(60:40)vol, ethylene glycol, formamide, trifluoroethanol || 0 || polar structured (75%), organic acidic compounds (25%) TFE ||
 * 2 || carbon disulfide, cyclohexane, decane, dodecane, hexadecane, methylcyclohexane, nonane, octane, undecane || 262 || apolar (100%) ||
 * 3 || 2,2,4-trimethylpentane, carbon tetrachloride, dibutyl ether, heptane, hexane, pentane, p-xylene || 106 || apolar (86%), aprotic dipolar (14%) dibutyl ether ||
 * 4 || 1,2-dichloroethane, 1-chlorobutane, chloroform, dichloromethane, fluorobenzene || 146 || assymetric halogenated hydrocarbon (73%), apolar (27%) 1-chlorobutane, 1,1,1-trichloroethane, tetrachloroethylene ||
 * 5 || - || 245 || weak electron pair donor bases (80%), aprotic dipolar (20%) quinoline ||
 * 6 || 1,4-dioxane, benzonitrile, cyclohexanone, dimethyl acetamide, DMF, DMSO, nitromethane, N-methyl-2-piperidone, N-methylacetamide, N, methylpyrrolidinone, propylene carbonate, sulfolane || 216 || aprotic dipolar (44%), aprotic highly diploar (33%), other (23%) ||
 * 7 || 2-butanol, 2-methyl-1-propanol, 2-methyl-2-propanol, ethanol, ethanol/water(70:30)vol, ethanol/water(80:20)vol, ethanol/water(90:10)vol, methanol, N-methylformamide, tributyl phosphate || 154 || polar protic (53%), amphiprotic (27%), other (20%) ||
 * 8 || 1-butanol, 1-decanol, 1-heptanol, 1-hexanol, 1-octanol, 1-pentanol, 1-propanol, 2-pentanol, 2-propanol, 3-methyl-1-butanol, N-formylmorpholine, octadecanol || 110 || amphiprotic (80%), other (20%) PEG, phenol, 3-methylphenol ||
 * 9 || acetone, acetonitrile, butanone, dibutylformamide, diethyl acetamide, methyl acetate, N-ethylacetamide, N-ethylformamide || 158 || aprotic highly dipolar (63%), aprotic dipolar (37%) ||
 * 10 || bromobenzene, iodobenzene, nitrobenzene || 117 || aprotic dipolar (43%), apolar (43%), other (14%) ||
 * 11 || benzene, chlorobenzene, ethylbenzene, isopropyl myristate, m-xylene, o-xylene, toluene || 111 || apolar (88%), strong electron pair donor bases (12%) tributylamine ||
 * 12 || butyl acetate, diethyl ether, ethyl acetate, methyl tert-butyl ether, THF || 226 || aprotic dipolar (82%), weak electron pair donor bases (18%) ||

Classifying the groups in classical terms (polar, apolar, etc) by comparing the groups to those derived by Durand //et. al.// [3], [|we see] a good correlation between the 12 groups and classical solvent groupings with group 1 being polar, groups 2, 3, and 11 being apolar, group 4 being assymetric halgonated hydrocarbons, group 5 being weak electron pair donor bases, groups 6, 9, and 12 being dipolar, group 7 being polar protic, group 8 being amphiprotic; and group 10 being a mixture of aprotic dipolar and apolar compounds but on closer examination, we see that group 10 consists entirely of substituted benzene rings. Perfect correlation cannot be expected because our partition consists of 12 groups whereas Durand's consists of 10. Also, our method of partitioning may focus on slightly different structure properties, group 10 for example which consists entirely of substituted benzene rings, yet Durand splits the group into two.

Using principal component analysis on the full set of solvents we can see the groupings in chemical space. This graph is interactive - hover over the data point to see info - select multiple classes using ctrl-click. We see that distinct lines are hard to draw and compounds situated on the edge of two or more groups will share similarities to surrounding compounds, even if they are from different groups. media type="custom" key="12565024"