AbrahamSolventsModel001a


**[Please remove all span tags and be consistent with header levels by using the Wikitext editor JCB]**

Modeling Solvent Abraham Coefficients - Model001a
Jesse Patsolic
**Researcher**

Objective
To model the Abraham solvent coefficients c, e, s, a, b, v from chemical structure using CDK descriptors. Model001a and Model001b are performed by different researchers but use the same data, procedure and techniques.

Introduction
The Abraham solvent coefficients c, e, s, a, b, v are determined by fitting experimental partition coefficients and solubility measurements to the Abraham general solvation model equations. Recent work has shown that the solvent Abraham coefficient c, the regression intercept, has physical–chemical meaning, being related to the van der Waals volume. [1] This means that all the coefficients have physical–chemical meaning, and thus it may be possible to model them directly from structure. In fact, this has already been done using fragments for a limited chemical space of alkane solvents containing hydroxyl, ether, ester and/or ketone functional groups. [2] Here we present a general model for predicting solvent Abraham coefficients from structure using open CDK descriptors.

Procedure
The up-to-date (as of January 21, 2012) [|solvent coefficients for 78 organic solvents] were taken from the ONSChallenge solvent database.

A CSV file of descriptors was generated using the [|CDK Descriptor Calculator GUI] (v1.3.2) for all 2D descriptors except the following: Ionization Potential, Charged Partial Surface Areas, Protein, Geometrical, or WHIM. The option 'Add Explicit H' was selected. The descriptors Kier3 and HybRatio were deleted from the CSV file due to multiple "NA" entries.
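The column clean-up described above can be sketched in base R. This is only an illustration: the helper name `drop_na_columns` and the toy values are assumptions, and the original page does not say how the two columns were actually removed (it may have been done by hand in a spreadsheet).

```r
# Hypothetical helper: drop every descriptor column that contains at least one NA.
drop_na_columns <- function(df) {
  has_na <- vapply(df, function(col) any(is.na(col)), logical(1))
  df[, !has_na, drop = FALSE]
}

# Toy stand-in for the CDK descriptor table (values are made up for illustration).
toy <- data.frame(
  molID    = c("m1", "m2", "m3"),
  TopoPSA  = c(20.2, 0.0, 17.1),
  Kier3    = c(NA, 1.5, NA),
  HybRatio = c(0.5, NA, 1.0),
  stringsAsFactors = FALSE
)

cleaned <- drop_na_columns(toy)
print(names(cleaned))
```

Dropping whole columns keeps all solvents in the training set, at the cost of losing any descriptor that cannot be computed for every molecule.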

Building the Models

**Random Forest.** The randomForest package (v4.6-6) in R (v2.14.0) was used to build the random forest models in the following sections of code.
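The calls in the following sections rely on the package defaults, and every printed summary reports 500 trees and 68 variables tried at each split. For regression, randomForest sets the default `mtry` to `max(floor(p / 3), 1)`, where `p` is the number of predictor columns. The page does not state how many descriptors were used, so the sketch below only infers which values of `p` are consistent with the reported 68; the helper name `default_mtry` is an assumption made for illustration.

```r
# Default mtry used by randomForest for regression: max(floor(p / 3), 1),
# where p is the number of predictor (descriptor) columns.
default_mtry <- function(p) max(floor(p / 3), 1)

# Predictor counts consistent with the reported
# "No. of variables tried at each split: 68".
candidates <- 1:400
consistent_p <- candidates[vapply(candidates, default_mtry, numeric(1)) == 68]
print(consistent_p)
```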

Abraham Solvent Coefficient c
The predictive model of the Abraham solvent coefficient **c** was obtained with the following code:
code format="rsplus"
# Load data
mydata <- read.csv(file = "cwithdescriptorsreadyformodeling.csv", header = TRUE, row.names = "molID")

# Load randomForest package (v. 4.6-6)
library(randomForest)

# Fit the random forest regression of c on all descriptors
mydata.rf <- randomForest(c ~ ., data = mydata, importance = TRUE)

# Print results to screen
print(mydata.rf)
## Call:
##  randomForest(formula = c ~ ., data = mydata, importance = TRUE)
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 68
##
##           Mean of squared residuals: 0.0239344
##                     % Var explained: 26.3

# Plot descriptor importance
varImpPlot(mydata.rf, main = "Random Forest Variable Importance")
## see following image titled "Random Forest Variable Importance"

# Write predicted values of c to a .csv file
test.predict <- predict(mydata.rf, mydata)
write.csv(test.predict, file = "RFTestPredict_c.csv")
code

The varImpPlot command generates the following image, which shows the Topological Polar Surface Area (TopoPSA) and XLogP as the most important physicochemical properties for the prediction of the Abraham solvent coefficient **c**.



The figure below was generated with Tableau Public (v. 7.0) with the file [|c_predictedwithdescriptorsforTP_jlp.csv], and it shows the predicted Abraham solvent coefficient **c** vs. the experimental value for **c**. The data is colored from red to green by Topological Polar Surface Area.
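One point worth keeping in mind when reading the predicted vs. experimental plots: `predict(mydata.rf, mydata)` predicts the same rows the forest was trained on, whereas the "% Var explained" printed by the package is based on out-of-bag (OOB) predictions, which `predict(mydata.rf)` returns when no new data is supplied. The sketch below illustrates the difference on synthetic data. The data frame `toy`, its columns, and the seed are assumptions made for illustration and are not the study data.

```r
library(randomForest)

set.seed(42)  # arbitrary seed, only so the sketch is repeatable

# Synthetic stand-in for a coefficient/descriptor table (not the study data).
n <- 78
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- 0.5 * toy$x1 - 0.3 * toy$x2 + rnorm(n, sd = 0.5)

toy.rf <- randomForest(y ~ ., data = toy, importance = TRUE)

# Predictions on the training rows themselves, as in predict(mydata.rf, mydata).
insample_pred <- predict(toy.rf, toy)

# Out-of-bag predictions: each row is predicted only by trees that did not see it.
oob_pred <- predict(toy.rf)

mse <- function(obs, pred) mean((obs - pred)^2)
cat("in-sample MSE:", mse(toy$y, insample_pred), "\n")
cat("OOB MSE:      ", mse(toy$y, oob_pred), "\n")
```

The in-sample error is typically much smaller than the OOB error, so a plot built from in-sample predictions can look better than the printed "% Var explained" would suggest.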



Abraham Solvent Coefficient s
The predictive model of the Abraham solvent coefficient **s** was obtained with the following code:
code format="rsplus"
# Load randomForest package (v. 4.6-6)
library(randomForest)

# Load data
mydata <- read.csv(file = "swithdescriptorsreadyformodeling.csv", header = TRUE, row.names = "molID")

# Fit the random forest regression of s on all descriptors
mydata.rf <- randomForest(s ~ ., data = mydata, importance = TRUE)

# Write predicted values of s to a .csv file
test.predict <- predict(mydata.rf, mydata)
write.csv(test.predict, file = "RFTestPredict_s.csv")

# Print results to screen
print(mydata.rf)
## Call:
##  randomForest(formula = s ~ ., data = mydata, importance = TRUE)
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 68
##
##           Mean of squared residuals: 0.1079902
##                     % Var explained: 76.2

# Generate plot of descriptor importance
varImpPlot(mydata.rf, main = "Random Forest Variable Importance")
## see following image
code

The varImpPlot command generates the following image. The Moreau-Broto autocorrelation (ATSc1) descriptor is shown to be the most important physicochemical property for the prediction of the Abraham solvent coefficient s.



The figure below was generated with Tableau Public (v. 7.0) with the file [|s_predictedwithdescriptorsforTP_jlp.csv], and it shows the predicted Abraham solvent coefficient s vs. the experimental value for s. The data is colored from red to green by the value of the ATSc1 descriptor.



Abraham Solvent Coefficient a
The randomForest package was again used to predict the Abraham solvent coefficient a, with the following R code:
code format="rsplus"
# Load randomForest package (v. 4.6-6)
library(randomForest)

# Load data
mydata <- read.csv(file = "awithdescriptorsreadyformodeling.csv", header = TRUE, row.names = "molID")

# Fit the random forest regression of a on all descriptors
mydata.rf <- randomForest(a ~ ., data = mydata, importance = TRUE)

# Write predicted values of a to a .csv file
test.predict <- predict(mydata.rf, mydata)
write.csv(test.predict, file = "RFTestPredict_a.csv")

# Print results to screen
print(mydata.rf)
## Call:
##  randomForest(formula = a ~ ., data = mydata, importance = TRUE)
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 68
##
##           Mean of squared residuals: 0.2825547
##                     % Var explained: 89.98

# Generate plot of descriptor importance
varImpPlot(mydata.rf, main = "Random Forest Variable Importance")
## see following image
code

The varImpPlot command generated the following image. The HBondAcceptorCount (nHBAcc) is shown as the most important descriptor for the prediction of the Abraham solvent coefficient a.



Using Tableau Public (v. 7.0), with the file [|a_predictedwithdescriptorsforTP_jlp.csv], the experimental vs. the predicted values of the Abraham solvent coefficient a were plotted and colored based on the nHBAcc descriptor.

Abraham Solvent Coefficient b
The predictive model for the Abraham solvent coefficient b was run in R as follows:

code format="rsplus"
# Load randomForest package (v. 4.6-6)
library(randomForest)

# Load data
mydata <- read.csv(file = "bwithdescriptorsreadyformodeling.csv", header = TRUE, row.names = "molID")

# Fit the random forest regression of b on all descriptors
mydata.rf <- randomForest(b ~ ., data = mydata, importance = TRUE)

# Write predicted values of b to a .csv file
test.predict <- predict(mydata.rf, mydata)
write.csv(test.predict, file = "RFTestPredict_b.csv")

# Print results to screen
print(mydata.rf)
## Call:
##  randomForest(formula = b ~ ., data = mydata, importance = TRUE)
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 68
##
##           Mean of squared residuals: 0.2331113
##                     % Var explained: 45.49

# Generate plot of descriptor importance
varImpPlot(mydata.rf, main = "Random Forest Variable Importance")
## see following image
code

Output from the randomForest varImpPlot command shows the khs.sOH descriptor as the most important descriptor related to the Abraham solvent coefficient b.



Using Tableau Public (v. 7.0), with the file [|b_predictedwithdescriptorsforTP_jlp.csv], the experimental vs. the predicted values of the Abraham solvent coefficient b were plotted and colored based on the khs.sOH descriptor.



Abraham Solvent Coefficient e
The predictive model for the Abraham solvent coefficient e was built with the following R code:

code format="rsplus"
# Load randomForest package (v. 4.6-6)
library(randomForest)

# Load data
mydata <- read.csv(file = "ewithdescriptorsreadyformodeling.csv", header = TRUE, row.names = "molID")

# Fit the random forest regression of e on all descriptors
mydata.rf <- randomForest(e ~ ., data = mydata, importance = TRUE)

# Write predicted values of e to a .csv file
test.predict <- predict(mydata.rf, mydata)
write.csv(test.predict, file = "RFTestPredict_e.csv")

# Print results to screen
print(mydata.rf)
## Call:
##  randomForest(formula = e ~ ., data = mydata, importance = TRUE)
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 68
##
##           Mean of squared residuals: 0.03533318
##                     % Var explained: 22.63

# Generate plot of descriptor importance
varImpPlot(mydata.rf, main = "Random Forest Variable Importance")
## see following image
code

The varImpPlot command generates the following plot, which shows the ALogp2 descriptor as the most important physicochemical property for the prediction of the Abraham solvent coefficient e.



The following plot gives the experimental versus the predicted values for the Abraham solvent coefficient e from the data file [|e_predictedwithdescriptorsforTP_jlp.csv]. The data points are colored according to the value of the ALogp2 descriptor.



Abraham Solvent Coefficient v
The R code for the predictive model for the Abraham solvent coefficient v is as follows:

code format="rsplus"
# Load randomForest package (v. 4.6-6)
library(randomForest)

# Load data
mydata <- read.csv(file = "vwithdescriptorsreadyformodeling.csv", header = TRUE, row.names = "molID")

# Fit the random forest regression of v on all descriptors
mydata.rf <- randomForest(v ~ ., data = mydata, importance = TRUE)

# Write predicted values of v to a .csv file
test.predict <- predict(mydata.rf, mydata)
write.csv(test.predict, file = "RFTestPredict_v.csv")

# Print results to screen
print(mydata.rf)
## Call:
##  randomForest(formula = v ~ ., data = mydata, importance = TRUE)
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 68
##
##           Mean of squared residuals: 0.08940691
##                     % Var explained: 57.01

# Generate plot of descriptor importance
varImpPlot(mydata.rf, main = "Random Forest Variable Importance")
## see following image
code

The following image was generated by the randomForest varImpPlot command and shows BCUTp.1l as the most important descriptor according to %IncMSE.
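The ranking shown in a varImpPlot image can also be read out as numbers with the package's `importance()` function, where `type = 1` selects the permutation measure that is reported as %IncMSE for regression forests. The sketch below uses synthetic data in which `x1` is constructed to matter most. The data frame `toy`, its columns, and the seed are assumptions for illustration only.

```r
library(randomForest)

set.seed(1)  # arbitrary seed for a repeatable sketch

# Synthetic stand-in data (not the study data); x1 is built to matter most.
n <- 78
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- 2 * toy$x1 + rnorm(n, sd = 0.2)

toy.rf <- randomForest(y ~ ., data = toy, importance = TRUE)

# type = 1 selects the permutation measure, reported as %IncMSE for regression.
inc_mse <- importance(toy.rf, type = 1)

# Sort descriptors from most to least important.
ranked <- inc_mse[order(inc_mse[, 1], decreasing = TRUE), , drop = FALSE]
print(ranked)
```

Applied to `mydata.rf`, the same two lines would give a sortable table of descriptors, which may be easier to compare across the six coefficients than the plots alone.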



The following plot gives the experimental versus the predicted values for the Abraham solvent coefficient v from the data file v_withdescriptorsforTP.csv. The data points are colored according to the value of the BCUTp.1l descriptor.



Discussion
The following table is a summary of the predictive models of each of the Abraham solvent coefficients.


||~ Solvent Coefficient ||~ % Var Explained ||~ varImpPlot Descriptor ||
|| c || 26.3 || TopoPSA ||
|| s || 76.2 || ATSc1 ||
|| a || 89.98 || nHBAcc ||
|| b || 45.49 || khs.sOH ||
|| e || 22.63 || ALogp2 ||
|| v || 57.01 || BCUTp.1l ||

Thus, of the six coefficients modeled, the predictive model for the Abraham solvent coefficient **a** yielded the best "% Var explained", 89.98. The coefficient **a** is an evaluation of the basicity of the hydrogen bonds of the phase, which corresponds well to the descriptor nHBAcc, the total number of hydrogen bond acceptors. The predictive model for **s** yielded a "% Var explained" of 76.2. The coefficient **b** is an evaluation of the acidity of the hydrogen bonds of the phase; its corresponding descriptor from the varImpPlot command is khs.sOH, which counts occurrences of the hydroxyl (sOH) E-state fragment. The "% Var explained" for the coefficient **v** is 57.01; **v** measures the hydrophobicity of the phase, which in turn reflects the forces of cavitation and dispersion interactions. [3], [4]
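Because the six coefficients span different numeric ranges, the "% Var explained" values are easier to interpret next to the error in each coefficient's own units. The sketch below takes only the numbers printed in the model summaries above and derives a root-mean-square error and the variance of the experimental values implied by them. It relies on the relation used by randomForest, % Var explained = 100 * (1 - MSE / variance), with the variance computed using a denominator of n. The column names are chosen here for illustration.

```r
# Values copied from the print(mydata.rf) outputs reported above.
results <- data.frame(
  coefficient  = c("c", "s", "a", "b", "e", "v"),
  oob_mse      = c(0.0239344, 0.1079902, 0.2825547, 0.2331113, 0.03533318, 0.08940691),
  pct_var_expl = c(26.3, 76.2, 89.98, 45.49, 22.63, 57.01),
  stringsAsFactors = FALSE
)

# Typical out-of-bag error in the units of each coefficient.
results$oob_rmse <- sqrt(results$oob_mse)

# Variance of the experimental values implied by the two printed numbers,
# since % Var explained = 100 * (1 - MSE / variance).
results$implied_var <- results$oob_mse / (1 - results$pct_var_expl / 100)

# Show the models from best to worst % Var explained.
print(results[order(results$pct_var_expl, decreasing = TRUE), ])
```

Read this way, a low "% Var explained" for a coefficient such as c or e can coexist with a small absolute error when the experimental values themselves vary little across the solvents.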

Conclusion

**[Good work. Give a summary of the results ('% var explained'), maybe in a table, and any other conclusions you'd like to draw. Then read [2], especially about what the coefficients c, s, e, a, b, v are meant to mean, and discuss if the CDK descriptors picked up make sense: http://pele.farmbio.uu.se/nightly-1.4.x/dnames.html -AL]**
**[What are your thoughts about whether this model could be used by chemists to plan reactions? Solute Model001 has not proven to be practically usable for applications like the Solvent Selector. Do you think this model is good enough, and if not, what strategy will you use to improve this model? JCB]**
**[Since this model is about estimating the values for solvent Abraham descriptors, the most important aspect of performance is how it does predicting solubilities for solutes with well-defined Abraham descriptors. A straightforward way to assess this would be to generate a fourth column in the Solvent Selector for cinnamic acid (which has lots of experimental measurements and good predictions for almost all solvents listed under the AD measured column). I could then give you feedback from a synthetic chemist's perspective about how useful your model is likely to be. JCB]**
**[We can do this for solvent model002 and compare to Acree's analysis in [2] -AL]**