Objective

To model the solvent coefficients c, e, s, a, b, and v of the Abraham general solvation equations in order to identify outliers. The goal is to identify outliers that correspond to solvents whose coefficients need updating. This will aid in prioritizing solubility/logP measurement experiments. Also, once these solvents have been identified, we can justifiably remove them from both solvent and solute modeling efforts until their coefficients are recalculated. The models presented here are complementary to both model001a and model001b, which were performed on the full dataset of 78 up-to-date (as of January 21, 2012) solvent coefficients for organic solvents taken from the ONSChallenge solvent database (Acree, Bradley, Lang).
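For reference, the six coefficients modelled here enter the Abraham general solvation equation for water-to-solvent partitioning in the form below. This is the standard textbook statement of the equation, not something quoted from this page: P is the partition coefficient of a solute between water and the solvent, and E, S, A, B, and V are the solute descriptors paired with the solvent coefficients e, s, a, b, and v.

```latex
\log P = c + e\,E + s\,S + a\,A + b\,B + v\,V
```

In this form c is the regression intercept, which is why its sign and physical meaning are discussed separately from the other five coefficients.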

Procedure

CDK descriptors were calculated for all 78 solvents using Rajarshi Guha's CDK Descriptor Calculator. The option to add explicit hydrogens was selected and all descriptors were calculated except the following: Charged Partial Surface Area, Ionization Potential, WHIM, and all protein and geometrical descriptors. The following descriptors were removed because they had 'NA' entries for some solvents: Kier3, HybRatio. This left a dataset of 78 solvents with 206 descriptors.

By outlier, we mean those solvents that are model outliers yet are chemically similar to most of the other solvents. Solvents that are chemically dissimilar to most of the others will likely be identified as model outliers simply because they are chemically dissimilar, and not necessarily because their coefficients need updating. To this end, we begin by removing solvents that are outside the main chemical space.
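The descriptor clean-up described above (dropping descriptors with 'NA' entries) and the zero-variation filter applied in the chemical space step can be sketched in base R as follows. This is a minimal illustration on a made-up table, not the code actually used: the data frame `desc` and all of its values are hypothetical, and only the descriptor name Kier3 comes from this page. Constant columns have to go before the PCA in any case, because `prcomp` with `scale. = TRUE` cannot rescale a zero-variance column.

```r
## Hypothetical toy descriptor table (values are invented for illustration).
desc <- data.frame(
  nAtom = c(6, 9, 12, 5),
  Kier3 = c(1.2, NA, 2.0, 0.8),    # contains a missing value
  nAcid = c(0, 0, 0, 0),           # zero variation
  XLogP = c(0.3, 1.1, 2.4, -0.2)
)

## Drop descriptors that have an NA entry for any solvent.
has_na <- colSums(is.na(desc)) > 0
desc <- desc[, !has_na, drop = FALSE]

## Drop descriptors that take a single value across all solvents.
is_constant <- vapply(desc, function(col) length(unique(col)) == 1, logical(1))
desc <- desc[, !is_constant, drop = FALSE]

print(names(desc))
```

Filtering on the number of unique values rather than on all-zero columns also catches descriptors that are constant but non-zero, which would break the scaling step in the same way.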

Chemical Space: Solvents that are chemical outliers were identified and removed from the database. When modeled, they would likely be flagged as model outliers primarily because they are chemically different from the majority of solvents, which is not the type of outlier we are trying to identify here (a model outlier that is chemically similar to the majority of solvents in the database). Descriptors that had zero variation (all zeros) were removed from the database, leaving 78 solvents with 135 descriptors. The following code was then used to analyse the chemical space.

## load in data
mydata <- read.csv(file = "20120121SolventsWithDescriptorsForPCA.csv", header = TRUE, row.names = "molID")
data <- mydata[, 2:137] ## Don't include molID or DV (DV flag) in pca
## PCA on descriptors scaled to unit variance
pc1 <- prcomp(data, scale. = TRUE)
x <- pc1$x
summary(pc1)
[output]
Importance of components:
                          PC1    PC2     PC3
Standard deviation     6.6771 3.9981 3.44361
Proportion of Variance 0.3278 0.1175 0.08719
Cumulative Proportion  0.3278 0.4454 0.53255
[output]
## colour the points by the DV flag and plot PC1 against PC2
colvec <- mydata$DV
plot(x[, 1], x[, 2], pch = 20, col = colvec + 2, xlim = c(-15, 30), ylim = c(-10, 15))
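The proportions of variance in the output above can be checked by hand. With `scale. = TRUE` every column has unit variance, so the total variance equals the number of columns in the PCA and each proportion is the squared standard deviation divided by that count. The sketch below assumes that 136 columns entered the PCA, which is what the selection `2:137` implies; that is one more than the 135 descriptors quoted in the text, and the reason for the difference is not stated on this page.

```r
## Standard deviations reported by summary(pc1) above.
sdev <- c(PC1 = 6.6771, PC2 = 3.9981, PC3 = 3.44361)

## Assumption: the columns selected by 2:137 all entered the PCA,
## so the total variance of the scaled data is 136.
n_columns <- length(2:137)

prop_var <- sdev^2 / n_columns   # proportion of variance per component
cum_var  <- cumsum(prop_var)     # cumulative proportion

print(round(prop_var, 4))
print(round(cum_var, 4))
```

With 136 columns the computed values agree with the reported proportions (0.3278, 0.1175, 0.0872) to the printed precision, whereas dividing by 135 does not reproduce them.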

The chemical outliers were identified as tributyl phosphate, isopropyl myristate (c = -0.605), hexadecane, and octadecanol (c = -0.096), as illustrated below with the points coloured by PC3 (red: negative, blue: positive).

Modeling: Random forest models were created for all solvent coefficients using R 2.13.0, using code similar to that below.

library(randomForest)
## load in data
mydata <- read.csv(file = "20120121SolventsWithDescriptorsc.csv", header = TRUE, row.names = "molID")
## do random forest [randomForest 4.5-34]
mydata.rf <- randomForest(c ~ ., data = mydata, importance = TRUE)
## Show important descriptors
varImpPlot(mydata.rf)
## Summary of random forest model
print(mydata.rf)
## predict using the random forest model (in-sample: the training data is reused)
test.predict <- predict(mydata.rf, mydata)
## write the predictions to the working directory
write.csv(test.predict, file = "RFTestPredict.csv")

Outliers Method 1: The absolute error for each solvent coefficient was standardized by dividing by the standard deviation. The resulting standardized deviations were squared and summed across all coefficients and the solvent with the highest sum was taken to be the top outlier. This outlier was then removed from the dataset and the procedure was performed again to identify the next outlier, and so on. The top four outliers were found to be, in this order: DMF (c = -0.305), trifluoroethanol, carbon disulfide, and ethylene glycol (c = -0.270).
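The scoring step of Method 1 can be sketched in base R as below. This is an illustration under stated assumptions rather than the code that was run: the solvent names and values are hypothetical, only two coefficients are shown, and each error is divided by the standard deviation of the observed values of that coefficient, which is one reading of "the standard deviation" in the description above.

```r
## Hypothetical toy data: rows are solvents, columns are coefficients.
observed <- data.frame(
  c = c(0.20, 0.25, 0.22, -0.30),
  e = c(0.10, 0.12, 0.11, 0.13),
  row.names = c("solventA", "solventB", "solventC", "solventD")
)
predicted <- data.frame(
  c = c(0.21, 0.24, 0.22, 0.20),
  e = c(0.11, 0.12, 0.10, 0.12),
  row.names = row.names(observed)
)

## Sum of squared standardized errors per solvent.
## Assumption: errors are scaled by the standard deviation of the
## observed values of each coefficient.
outlier_score <- function(observed, predicted) {
  err <- abs(as.matrix(observed) - as.matrix(predicted))
  sds <- apply(as.matrix(observed), 2, sd)
  z <- sweep(err, 2, sds, "/")   # standardize column by column
  rowSums(z^2)
}

scores <- outlier_score(observed, predicted)
top_outlier <- names(which.max(scores))

print(sort(scores, decreasing = TRUE))
print(top_outlier)
```

In the stepwise procedure the top-scoring solvent would be dropped, the random forest models refitted on the remaining solvents, and the scores recomputed before the next outlier is chosen.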

A balance has to be struck between stepwise removal of outliers and maintaining enough solvents to build general models. With the 70 remaining solvents, the top 15 outliers were determined to be: formamide (c = -0.171), dimethylacetamide (c = -0.271), N-formylmorpholine (c = -0.032), DMSO (c = -0.194), carbon tetrachloride, dibutylformamide, sulfolane, acetonitrile, diethylacetamide, nitromethane, chloroform, nitrobenzene (c = -0.196), methylcyclohexane, iodobenzene (c = -0.192), and cyclohexane.

We note here that of the 13 solvents with negative c coefficients, 10 have been identified as outliers. The three not identified as outliers have three of the four least negative values for c (1-decanol: c = -0.058, 1-octanol: c = -0.034, bromobenzene: c = -0.017). This seems to support van Noort's analysis that the constant c has a volume-related physical-chemical meaning and so should never be negative. [1]

Results Method 1: The following table shows the R2 values (as percentages) for each coefficient at each stage of curation. Stage 1: original dataset (78 solvents). Stage 2: dataset after the 4 chemical space outliers were removed (74 solvents). Stage 3: dataset after the 4 model outliers were removed (70 solvents).

| Stage | 1 | 2 | 3 |
|-------|----|----|----|
| c-R2 | 26 | 43 | 46 |
| e-R2 | 21 | 28 | 34 |
| s-R2 | 77 | 80 | 83 |
| a-R2 | 90 | 91 | 93 |
| b-R2 | 48 | 47 | 65 |
| v-R2 | 55 | 55 | 65 |

Outliers Method 2: After removing the 4 chemical space outliers (tributyl phosphate, isopropyl myristate, hexadecane, and octadecanol) as done above, we removed the 7 solvents with a c-coefficient less than -0.058 (DMF, dimethylacetamide, ethylene glycol, nitrobenzene, DMSO, iodobenzene, and formamide) under the assumption that solvents should not have negative values for c. Four additional model outliers were then identified using the method described above: trifluoroethanol, carbon disulfide, nitromethane, and N-formylmorpholine (c = -0.032).
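The threshold step of Method 2 can be reproduced from the c values quoted on this page. The sketch below is only a bookkeeping check: it uses the 13 negative c-coefficients listed above, drops the chemical space outliers first, and then applies the strict cut-off c < -0.058. The variable names are illustrative.

```r
## Negative c-coefficients quoted on this page.
neg_c <- c(
  "isopropyl myristate" = -0.605, "octadecanol" = -0.096,
  "DMF" = -0.305, "ethylene glycol" = -0.270, "formamide" = -0.171,
  "dimethylacetamide" = -0.271, "N-formylmorpholine" = -0.032,
  "DMSO" = -0.194, "nitrobenzene" = -0.196, "iodobenzene" = -0.192,
  "1-decanol" = -0.058, "1-octanol" = -0.034, "bromobenzene" = -0.017
)

## The chemical space outliers are removed before the threshold is applied.
chem_outliers <- c("tributyl phosphate", "isopropyl myristate",
                   "hexadecane", "octadecanol")
remaining <- neg_c[!(names(neg_c) %in% chem_outliers)]

## Strict inequality: 1-decanol (c = -0.058) is kept.
removed <- names(remaining[remaining < -0.058])

print(removed)
## Solvents left after both removal steps, starting from 78.
print(78 - length(chem_outliers) - length(removed))
```

The cut-off removes exactly the seven solvents named above and leaves 67 solvents, matching stage 3 of the Method 2 results.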

After this stepwise removal of outliers we ran the regression one last time and noted the next 8 outliers: sulfolane, acetonitrile, dibutylformamide, carbon tetrachloride, methylcyclohexane, propylene carbonate, chloroform, and diethylacetamide.

Results Method 2: The following table shows the R2 values (as percentages) for each coefficient at each stage of curation. Stage 1: original dataset (78 solvents). Stage 2: dataset after the 4 chemical space outliers were removed (74 solvents). Stage 3: dataset after the 7 additional solvents with large negative c-coefficients were removed (67 solvents). Stage 4: dataset after the 4 model outliers were removed (63 solvents).

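| Stage | 1 | 2 | 3 | 4 |
|-------|----|----|----|----|
| c-R2 | 26 | 43 | 32 | 47 |
| e-R2 | 21 | 28 | 29 | 41 |
| s-R2 | 77 | 80 | 77 | 81 |
| a-R2 | 90 | 91 | 93 | 95 |
| b-R2 | 48 | 47 | 46 | 74 |
| v-R2 | 55 | 55 | 61 | 69 |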

Conclusion and Recommendations

1. The models for coefficient e have R2 values less than 0.5, even after the top outliers have been removed. This suggests that the solute descriptor E, which is sometimes derived from structure instead of by regression, may benefit from being derived via regression. This in turn should improve the accuracy of the e-coefficient.
2. Since the regression intercept c has historically not been set to zero, it is assumed to pick up all solute-solvent interactions not described by the other solvent coefficients. If van Noort [1] is correct in assigning a primarily volume-related physical-chemical meaning to c, then c should not be negative. This is interesting because, like e, the c-coefficient is hard to model, with R2 values less than 0.5 even after the top outliers are removed. Thus either c must be taken as zero for all solvents (with the assumption that all solute-solvent interactions can be encoded in the other coefficients), or c should be set to zero and the other coefficients recalculated for those solvents that initially give a negative value for c.
3. The following four solvents have coefficients that should be treated as suspect and need to be prioritized for re-evaluation: trifluoroethanol, carbon disulfide, nitromethane, and N-formylmorpholine (c = -0.032).

References

[1] Paul C.M. van Noort. Solvation thermodynamics and the physical–chemical meaning of the constant in Abraham solvation equations. Chemosphere (2011), doi:10.1016/j.chemosphere.2011.11.073

## Abraham Model Solvent Coefficients - Model001c

Researcher: Andrew Lang