CDK-Based Solvent Abraham Coefficient Models

Researcher: Andrew Lang

Objective

To create two separate random forest models for each of the Abraham model solvent coefficients c, e, s, a, b, and v. One model for a general diverse set of solvents (after outliers have been removed) and one model for a limited set of solvents. All content, models and data are released as CC0 - the default license for all ONS work.

Procedure

Model002a: Our previous analysis reduced our set of 78 solvents with CDK-descriptors, calculated using Rajarshi Guha's CDK Descriptor Calculator, to one of 63 solvents by eliminating four solvents that lay outside of the main chemical space (tributyl phosphate, isopropyl myristate, hexadecane, and octadecanol), 7 solvents with c-coefficient less than -0.058 (DMF, dimethylacetamide, ethylene glycol, nitrobenzene, DMSO, iodobenzene, and formamide) under the assumption that solvents should not have negative values for c, [1] and 4 additional solvents identified as consensus (over all coefficients) model outliers (trifluoroethanol, carbon disulfide, nitromethane, and N-formylmorpholine).
Model002a-Chemical Space: The chemical space shown below was created using Tableau Public v7.0 with principal components calculated using the following code in R 2.13.0:
## load in data
mydata = read.csv(file="20120126SolventsWithDescriptorsForPCA.csv",head=TRUE,row.names="molID")
 
data <- mydata[, 1:129] ## Don't include molID in pca
pc1 <- prcomp(data, scale. = T)
x <- pc1$x
summary(pc1)
 
[output]
Importance of components:
                          PC1    PC2     PC3
Standard deviation     6.2811 4.3032 3.47059
Proportion of Variance 0.3058 0.1436 0.09337
Cumulative Proportion  0.3058 0.4494 0.54275
[output]
63PCA.png
The colours in the above chemical space correspond to the third principal component.
Model002a-RFc: A random forest model for the c-coefficient was created using the following R code:
library(randomForest)
## load in data
mydata = read.csv(file="20120126SolventsWithDescriptorsc.csv",head=TRUE,row.names="molID")
## do random forest [randomForest 4.5-34]
mydata.rf <- randomForest(c ~ ., data = mydata,importance = TRUE)
## Show important descriptors
varImpPlot(mydata.rf)
## Summary of random forest model
print(mydata.rf)
[output]
Call:
 randomForest(formula = c ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 43
 
          Mean of squared residuals: 0.006481098
                    % Var explained: 44.65
[output]
## predict using the random forest model
test.predict <- predict(mydata.rf,mydata)
## write the predictions to the working directory
write.csv(test.predict, file = "RFTestPredict.csv")
The variable importance can be seen in the following plot (nHBDon, WTPT.2, etc.)
c-varimp.png
The model was exported using the following code and is available for download: crfmodel
## serialize(mydata.rf, model)
.saveRDS(mydata.rf, file = "crfmodel")
A plot of predicted values vs. measured values is shown below coloured from green to red by AE.
c-results.png
A QSARDB model was created (annotated example) through the command line (clog.txt) as is also available for download: AbrahamSolventc.qdb.zip
The training set R2 for the random forest model 0.9083 (OOB R2 0.4465 - approximate test-set R2).

Model002a-RFe: A random forest model for the e-coefficient was created using the following R code:
library(randomForest)
## load in data
mydata = read.csv(file="20120126SolventsWithDescriptorse.csv",head=TRUE,row.names="molID")
## do random forest [randomForest 4.6-6]
mydata.rf <- randomForest(e ~ ., data = mydata,importance = TRUE)
## Show important descriptors
varImpPlot(mydata.rf)
## Summary of random forest model
print(mydata.rf)
[output]
Call:
 randomForest(formula = e ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 43
 
          Mean of squared residuals: 0.01895079
                    % Var explained: 40.21
[output]
## predict using the random forest model
test.predict <- predict(mydata.rf,mydata)
## write the predictions to the working directory
write.csv(test.predict, file = "RFTestPredict.csv")
The training-set R2 = 0.9019, OOB R2 = 0.4021. The variable importance can be seen in the following plot (XLogP, ATSc1, etc.)
e-varimp.png
The model was exported using the following code and is available for download: erfmodel
## serialize(mydata.rf, model)
.saveRDS(mydata.rf, file = "erfmodel")
A plot of predicted values vs. measured values is shown below coloured from green to red by AE.
e-results.png
A QSARDB model was created through the command line and is also available for download: AbrahamSolvente.qdb.zip
The training set R2 value for the random forest model 0.9019 (OOB R2 0.4021 - approximate test-set R2).

Model002a-RFs: A random forest model for the s-coefficient was created using the following R code:
library(randomForest)
## load in data
mydata = read.csv(file="20120126SolventsWithDescriptorss.csv",head=TRUE,row.names="molID")
## do random forest [randomForest 4.6-6]
mydata.rf <- randomForest(s ~ ., data = mydata,importance = TRUE)
## Show important descriptors
varImpPlot(mydata.rf)
## Summary of random forest model
print(mydata.rf)
[output]
Call:
 randomForest(formula = s ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 43
 
          Mean of squared residuals: 0.07565735
                    % Var explained: 81.47
[output]
## predict using the random forest model
test.predict <- predict(mydata.rf,mydata)
## write the predictions to the working directory
write.csv(test.predict, file = "RFTestPredict.csv")
The training-set R2 = 0.9700, OOB R2 = 0.8147. The variable importance can be seen in the following plot (ATSc1, nAtomP, etc.)
s-varimp.png
The model was exported using the following code and is available for download: srfmodel
## serialize(mydata.rf, model)
.saveRDS(mydata.rf, file = "srfmodel")
A plot of predicted values vs. measured values is shown below coloured from green to red by AE.
s-results.png
A QSARDB model was created through the command line and is also available for download: AbrahamSolvents.qdb.zip

Model002a-RFa: A random forest model for the a-coefficient was created using the following R code:
library(randomForest)
## load in data
mydata = read.csv(file="20120126SolventsWithDescriptorsa.csv",head=TRUE,row.names="molID")
## do random forest [randomForest 4.6-6]
mydata.rf <- randomForest(a ~ ., data = mydata,importance = TRUE)
## Show important descriptors
varImpPlot(mydata.rf)
## Summary of random forest model
print(mydata.rf)
[output]
Call:
 randomForest(formula = a ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 43
 
          Mean of squared residuals: 0.1258657
                    % Var explained: 95.6
[output]
## predict using the random forest model
test.predict <- predict(mydata.rf,mydata)
## write the predictions to the working directory
write.csv(test.predict, file = "RFTestPredict.csv")
The training-set R2 = 0.9924, OOB R2 = 0.9560. The variable importance can be seen in the following plot (nHBAcc, TopoPSA, WTPT.4, etc.)
a-varimp.png
The model was exported using the following code and is available for download: arfmodel
## serialize(mydata.rf, model)
.saveRDS(mydata.rf, file = "arfmodel")
A plot of predicted values vs. measured values is shown below coloured from green to red by AE.
a-results.png
A QSARDB model was created through the command line and is also available for download: AbrahamSolventa.qdb.zip

Model002a-RFb: A random forest model for the b-coefficient was created using the following R code:
library(randomForest)
## load in data
mydata = read.csv(file="20120126SolventsWithDescriptorsb.csv",head=TRUE,row.names="molID")
## do random forest [randomForest 4.6-6]
mydata.rf <- randomForest(b ~ ., data = mydata,importance = TRUE)
## Show important descriptors
varImpPlot(mydata.rf)
## Summary of random forest model
print(mydata.rf)
[output]
Call:
 randomForest(formula = b ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 43
 
          Mean of squared residuals: 0.05920697
                    % Var explained: 74.64
[output]
## predict using the random forest model
test.predict <- predict(mydata.rf,mydata)
## write the predictions to the working directory
write.csv(test.predict, file = "RFTestPredict.csv")
The training-set R2 = 0.9517, OOB R2 = 0.7464. The variable importance can be seen in the following plot (khs.sOH, nHBDon, etc.)
b-varimp.png
The model was exported using the following code and is available for download: brfmodel
## serialize(mydata.rf, model)
.saveRDS(mydata.rf, file = "brfmodel")
A plot of predicted values vs. measured values is shown below coloured from green to red by AE.
b-results.png
A QSARDB model was created through the command line and is also available for download: AbrahamSolventb.qdb.zip

Model002a-RFv: A random forest model for the v-coefficient was created using the following R code:
library(randomForest)
## load in data
mydata = read.csv(file="20120126SolventsWithDescriptorsv.csv",head=TRUE,row.names="molID")
## do random forest [randomForest 4.6-6]
mydata.rf <- randomForest(v ~ ., data = mydata,importance = TRUE)
## Show important descriptors
varImpPlot(mydata.rf)
## Summary of random forest model
print(mydata.rf)
[output]
Call:
 randomForest(formula = v ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 43
 
          Mean of squared residuals: 0.03566637
                    % Var explained: 68.16
[output]
## predict using the random forest model
test.predict <- predict(mydata.rf,mydata)
## write the predictions to the working directory
write.csv(test.predict, file = "RFTestPredict.csv")
The training-set R2 = 0.9483, OOB R2 = 0.6816. The variable importance can be seen in the following plot (TopoPSA, XLogP, etc.)
v-varimp.png
The model was exported using the following code and is available for download: vrfmodel
## serialize(mydata.rf, model)
.saveRDS(mydata.rf, file = "vrfmodel")
A plot of predicted values vs. measured values is shown below coloured from green to red by AE.
v-results.png
A QSARDB model was created through the command line and is also available for download: AbrahamSolventv.qdb.zip

Results - Model002a (General)

The following table summarizes the results, including the Training Set R2 (R2) and the Out of Bag R2 (OOB R2) which is a good estimate of utility.
parameter
R2
OOB R2
meaning[1,2]
ImpVar
c
0.91
0.45
log of the solvent–water ratio of the volume packing densities
nHBDon,WTPT.2
e
0.90
0.40
general solute-solvent dispersion interactions
XLogP, ATSc1
s
0.97
0.81
dipole-dipole and dipole-induced interactions
ATSc1, nAtomP
a
0.99
0.96
hydrogen-bond basicity
nHBAcc, TopoPSA, WTPT.4
b
0.95
0.75
hydrogen-bond acidity
khs.sOH, nHBDon
v
0.95
0.68
associated with cavity formation
TopoPSA, XLogP
The models were used to predict the parameters of all the solvents in the test-set and a consensus measure of accuracy was attained by summing the squares of the standardized absolute errors across all parameters. Using this measure we identify the model outliers to be acetonitrile and sulfolane. The chart below, showing the consensus error for each solvent, also gives information as to how the models perform on the different classes on solvents. The models perform better for compounds that only contain carbon (non-aliphatic rings), oxygen and hydrogen atoms.
FinalResults.png

Predicting the Abraham parameters for new solvents

1863 compounds with a melting point between -100 C and 0 C were taken from the most recent Open melting point dataset (ONSMP031). All CDK descriptors (except CPSA, Ionization Potential, WHIM, all Protein descriptors, and all Geometrical descriptors) were calculated using Rajarshi Guha's CDK Descriptor Calculator v1.3.2. The descriptors Kier3 and HybRatio were removed post-calculation due to a significant number of NA values. The solvent parameters for 1863 compounds were then calculated using the random forest models above, see the following R code:
library(randomForest)
## load in the data
mytestdata = read.csv(file="cdk.csv",head=TRUE,row.names="molID")
 
## load in the models and output the results to csv files
mydata.rf <- .readRDS(file="crfmodel")
test.predict <- predict(mydata.rf,mytestdata)
write.csv(test.predict, file = "RFTestPredictc.csv")
 
mydata.rf <- .readRDS(file="erfmodel")
test.predict <- predict(mydata.rf,mytestdata)
write.csv(test.predict, file = "RFTestPredicte.csv")
 
mydata.rf <- .readRDS(file="srfmodel")
test.predict <- predict(mydata.rf,mytestdata)
write.csv(test.predict, file = "RFTestPredicts.csv")
 
mydata.rf <- .readRDS(file="arfmodel")
test.predict <- predict(mydata.rf,mytestdata)
write.csv(test.predict, file = "RFTestPredicta.csv")
 
mydata.rf <- .readRDS(file="brfmodel")
test.predict <- predict(mydata.rf,mytestdata)
write.csv(test.predict, file = "RFTestPredictb.csv")
 
mydata.rf <- .readRDS(file="vrfmodel")
test.predict <- predict(mydata.rf,mytestdata)
write.csv(test.predict, file = "RFTestPredictv.csv")
The Abraham parameters were then used to predict the 25C-solubility of trans-cinnamic acid in all 1863 solvents using the latest Abraham descriptors for trans-cinnamic acid.

Model002b

Here we limit our models to solvents that only contain carbon, oxygen, and hydrogen atoms. We started with the original list of 78 solvents and then removed solvents that were not comprised of only carbon, oxygen, and hydrogen atoms (1,2-dichloroethane, 1-chlorobutane, acetonitrile, benzonitrile, bromobenzene, carbon disulfide, carbon tetrachloride, chlorobenzene, chloroform, dibutylformamide, dichloromethane, diethylacetamide, dimethylacetamide, DMF, DMSO, fluorobenzene, formamide, iodobenzene, N-ethylacetamide, N-ethylformamide, N-formylmorpholine, nitrobenzene, nitromethane, N-methyl-2-piperidone, N-methylacetamide, N-methylformamide, N-methylpyrrolidinone, sulfolane, tributyl phosphate, trifluoroethanol), This left us with 48 solvents.

Removing Outliers: Some of the 48 solvents were removed as outliers previously but are considered here, including solvents with negative c-parameter, thus before modeling we proceed by removing a new set of outliers exclusively identified step-wise using our consensus measure outlined above listed with the most suspect parameter: ethylene glycol (c = 2.729), isopropyl myristate (c = -0.605), propylene carbonate (v = 3.412), octadecanol (e = 0.148), methylcyclohexane (e = 0.782), cyclohexane (v = 4.577), cyclohexanone (e = 0.225), methanol (v = 3.549). This left us with 40 solvents.

Model002b-RandomForest: Random forest models were created for the Abraham solvent parameters as above. The following table summarizes the results, including the Training Set R2 (R2) and the Out of Bag R2 (OOB R2) which is a good estimate of utility.
parameter
R2
OOB R2
meaning[1,2]
ImpVar
c
0.93
0.61
log of the solvent–water ratio of the volume packing densities
khs.sOH, nHBDon
e
0.94
0.65
general solute-solvent dispersion interactions
MDEC.12, nAtomLAC
s
0.97
0.81
dipole-dipole and dipole-induced interactions
nAtomLAC, ATSc1
a
0.99
0.98
hydrogen-bond basicity
nHBDon, ATSc2
b
0.98
0.90
hydrogen-bond acidity
khs.sOH, nHBDon
v
0.96
0.77
associated with cavity formation
TopoPSA, BCUTp.1l
The chart below, showing the consensus error for each solvent, also gives information as to how the models perform on the different classes on solvents within the set. The models seems to perform worse, in general, for compounds with aliphatic rings.
COHRootSum2.png
The predicted Abraham parameters were compared to those from model002a and it was found that no statistical improvement was gained when restricting to a smaller subset. Thus since model002a is built using a more diverse set of solvents, it is recommended as the model of choice to predict the Abraham parameters for new solvents.

Classifying Solvents

We have derived models for predicting the Abraham solvent parameters that have OOB R2-values ranging from 0.40 to 0.96. Using these models we provide a combined list of 1939 solvents with Abraham parameters (88 measured-20120218 and 1851 predicted) partitioned into 12 groups using K-means clustering:
mydata = read.csv(file="AbrahamParametersForAllSolventsForR.csv",head=TRUE,row.names="solvent")
mydata <- scale(mydata) # standardize variables
# K-Means Cluster Analysis
fit <- kmeans(mydata, 12) # 12 cluster solution
# get cluster means
aggregate(mydata,by=list(fit$cluster),FUN=mean)
[output]
   Group.1          c          e          s          a           b            v
1        1 -3.4447087 -2.3802005  1.0588094  1.2687462  8.35950545 -8.264504875
2        2 -0.2213855  1.6799541 -1.7406426 -1.4034535 -0.52068584  0.959811199
3        3  1.1138946  0.5469179 -1.2799698 -1.0889780 -0.51144540  0.751917778
4        4  0.9063351 -0.6136814  0.1136884 -1.0827809  0.58785776  0.492690575
5        5 -0.4784631  0.5757637  0.2347302  0.5634954 -0.62430692 -0.125921445
6        6 -0.8509431 -0.5903600  1.0320664  0.7625482 -0.14621129 -0.499160233
7        7  0.5982105 -0.9070429  0.3900724  1.0393694  1.49439191 -1.001666347
8        8 -0.6185940  0.3117337 -0.6860866  0.9540825  1.08859538 -0.009583815
9        9  1.1098023 -1.1295906  0.8859178  0.7341573  0.02253894 -0.907135912
10      10 -1.0171923 -0.3653057  0.8084089 -0.6995847 -0.09943499  0.545896977
11      11 -0.8630286  0.5534365 -0.2131106 -1.0294510 -0.22548842  0.789781060
12      12  0.6925955 -0.5640462  0.4181038  0.6016715 -0.56390282 -0.150832117
[output]
# append cluster assignment
mydata <- data.frame(mydata, fit$cluster)
write.csv(mydata, file = "clusters.csv")
The 12 groups have the following statistics:
Group
Measured Solvents
Number Of Predicted Solvents
Durand Classification[3]
1
water, ethanol/water(10:90)vol, ethanol/water(20:80)vol, ethanol/water(30:70)vol, ethanol/water(40:60)vol, ethanol/water(50:50)vol, ethanol/water(60:40)vol, ethylene glycol, formamide, trifluoroethanol
0
polar structured (75%), organic acidic compounds (25%) TFE
2
carbon disulfide, cyclohexane, decane, dodecane, hexadecane, methylcyclohexane, nonane, octane, undecane
262
apolar (100%)
3
2,2,4-trimethylpentane, carbon tetrachloride, dibutyl ether, heptane, hexane, pentane, p-xylene
106
apolar (86%), aprotic dipolar (14%) dibutyl ether
4
1,2-dichloroethane, 1-chlorobutane, chloroform, dichloromethane, fluorobenzene
146
assymetric halogenated hydrocarbon (73%), apolar (27%) 1-chlorobutane, 1,1,1-trichloroethane, tetrachloroethylene
5
-
245
weak electron pair donor bases (80%), aprotic dipolar (20%) quinoline
6
1,4-dioxane, benzonitrile, cyclohexanone, dimethyl acetamide, DMF, DMSO, nitromethane, N-methyl-2-piperidone, N-methylacetamide, N, methylpyrrolidinone, propylene carbonate, sulfolane
216
aprotic dipolar (44%), aprotic highly diploar (33%), other (23%)
7
2-butanol, 2-methyl-1-propanol, 2-methyl-2-propanol, ethanol, ethanol/water(70:30)vol, ethanol/water(80:20)vol, ethanol/water(90:10)vol, methanol, N-methylformamide, tributyl phosphate
154
polar protic (53%), amphiprotic (27%), other (20%)
8
1-butanol, 1-decanol, 1-heptanol, 1-hexanol, 1-octanol, 1-pentanol, 1-propanol, 2-pentanol, 2-propanol, 3-methyl-1-butanol, N-formylmorpholine, octadecanol
110
amphiprotic (80%), other (20%) PEG, phenol, 3-methylphenol
9
acetone, acetonitrile, butanone, dibutylformamide, diethyl acetamide, methyl acetate, N-ethylacetamide, N-ethylformamide
158
aprotic highly dipolar (63%), aprotic dipolar (37%)
10
bromobenzene, iodobenzene, nitrobenzene
117
aprotic dipolar (43%), apolar (43%), other (14%)
11
benzene, chlorobenzene, ethylbenzene, isopropyl myristate, m-xylene, o-xylene, toluene
111
apolar (88%), strong electron pair donor bases (12%) tributylamine
12
butyl acetate, diethyl ether, ethyl acetate, methyl tert-butyl ether, THF
226
aprotic dipolar (82%), weak electron pair donor bases (18%)
Looking at the above table, we see two interesting results: 1. Group one contains no predicted solvents, where all the other groups have over 100. This means the solvents of group one are atypical/special. 2. Group five has 245 predicted solvents but no solvents with measured Abraham parameters. This suggests that a whole entire region of the chemical space has yet to be explored. The predicted solvents of group five a typified by compounds like: ethyl undec-10-enoate, trielaidoylglycerol, octyl acetate, aramite, beta-ionone, beta-butoxyethyl phthalate, 6-methylheptyl 2-(2,4-dichlorophenoxy)propanoate, 1,1,5-Trimethyl-3,3-bis[(2-methyl-2-propanyl)peroxy]cyclohexane. The compounds of group five are most similar to those of group group six, except in that they have a much larger values for 'e'.

Classifying the groups in classical terms (polar, apolar, etc) by comparing the groups to those derived by Durand et. al. [3], we see a good correlation between the 12 groups and classical solvent groupings with group 1 being polar, groups 2, 3, and 11 being apolar, group 4 being assymetric halgonated hydrocarbons, group 5 being weak electron pair donor bases, groups 6, 9, and 12 being dipolar, group 7 being polar protic, group 8 being amphiprotic; and group 10 being a mixture of aprotic dipolar and apolar compounds but on closer examination, we see that group 10 consists entirely of substituted benzene rings. Perfect correlation cannot be expected because our partition consists of 12 groups whereas Durand's consists of 10. Also, our method of partitioning may focus on slightly different structure properties, group 10 for example which consists entirely of substituted benzene rings, yet Durand splits the group into two.

Using principal component analysis on the full set of solvents we can see the groupings in chemical space. This graph is interactive - hover over the data point to see info - select multiple classes using ctrl-click. We see that distinct lines are hard to draw and compounds situated on the edge of two or more groups will share similarities to surrounding compounds, even if they are from different groups.



References

[1] Paul C.M. van Noort. Solvation thermodynamics and the physical–chemical meaning of the constant in Abraham solvation equations. Chemosphere (2011), doi:10.1016/j.chemosphere.2011.11.073
[2] Laura M. Grubbs, Mariam Saifullah, Nohelli E. De La Rosa, Shulin Ye, Sai S. Achi, William E. Acree Jr., Michael H. Abraham. Mathematical correlations for describing solute transfer into functionalized alkane solvents containing hydroxyl, ether, ester or ketone solvents. Fluid Phase Equilibria 298 (2010) 48–53, doi:10.1016/j.fluid.2010.07.00
[3] Morgan Durand,Valrie Molinier, Werner Kunz, and Jean-Marie Aubry. Classification of Organic Solvents Revisited by Using the COSMO-RS
Approach. Chem. Eur. J. (2011), doi:10.1002/chem.201001743