AbrahamSolventsModel001b

Modeling Solvent Abraham Coefficients - Model001b
Lori Fielding
Researcher

Objective
To model the Abraham solvent coefficients c, e, s, a, b, and v from chemical structure using CDK descriptors. Model001a and Model001b were carried out by different researchers but use the same data, procedure, and techniques.

Introduction
The Abraham solvent coefficients c, e, s, a, b, and v are determined by fitting experimental partition coefficients and solubility measurements to the Abraham general solvation model equations. Recent work has shown that the solvent coefficient c, the regression intercept, has physical–chemical meaning, being related to the van der Waals volume. [1] This means that all the coefficients have physical–chemical meaning, and thus it may be possible to model them directly from structure. In fact, this has already been done using fragments for a limited chemical space of alkane solvents containing hydroxyl, ether, ester and/or ketone functional groups. [2] Here we present a general model for predicting solvent Abraham coefficients from structure using open CDK descriptors.
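For orientation, the general solvation equation in which these six solvent coefficients appear is commonly written as below. This form is not quoted on this page; it is the usual textbook expression for condensed-phase processes, with the solute property (for example, the logarithm of a partition coefficient) written here as log P, and E, S, A, B, V denoting the solute descriptors (excess molar refraction, dipolarity/polarizability, overall hydrogen-bond acidity, overall hydrogen-bond basicity, and McGowan characteristic volume).

```latex
\log P = c + e\,E + s\,S + a\,A + b\,B + v\,V
```

Each solvent is characterized by its own set (c, e, s, a, b, v), which is why the sections that follow build one model per coefficient.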

Procedure
The up-to-date (as of January 21, 2012) [|solvent coefficients for 78 organic solvents] were taken from the ONSChallenge solvent database.

A CSV file of descriptors was generated using the [|CDK Descriptor Calculator GUI] (v1.3.2) for all 2D descriptors, with the following options deselected: Ionization Potential, Charged Partial Surface Areas, Protein, Geometrical, and WHIM. The option 'Add Explicit H' was additionally selected. The descriptors Kier3 and HybRatio were deleted from the CSV file because they contained multiple "NA" entries.
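The descriptor clean-up described above was done by editing the CSV file. As a minimal sketch of how the same step could be scripted in R, the block below finds and drops every column that contains an NA. The toy data frame, its values, and the variable names are made up for illustration; only the descriptor names Kier3 and HybRatio come from this page.

```r
# Hypothetical toy descriptor table (values are invented for illustration).
descriptors <- data.frame(
  molID    = c("solvent1", "solvent2", "solvent3"),
  XLogP    = c(0.5, 1.2, 2.3),
  Kier3    = c(NA, 1.1, NA),
  HybRatio = c(NA, NA, 0.4),
  TopoPSA  = c(20.2, 0.0, 9.2),
  stringsAsFactors = FALSE
)

# Count missing values per column and list the affected descriptors.
na_counts  <- colSums(is.na(descriptors))
na_columns <- names(na_counts[na_counts > 0])

# Keep only the columns without any missing value.
cleaned <- descriptors[, na_counts == 0, drop = FALSE]

print(na_columns)
print(names(cleaned))
```

When a real descriptor file is loaded with read.csv, the literal string "NA" is treated as a missing value by default, so the same two lines would apply to the loaded data frame.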

Abraham Solvent Coefficient c
The randomForest package (v4.6-6) in R (v2.14.0) was used to build the random forest model for the prediction of the Abraham solvent coefficient c using the following code:

[[code format="rsplus"]]
> mydata = read.csv(file="cwithdescriptorsreadyformodeling.csv",header=TRUE,row.names="molID")
## Load data

> require(randomForest)
## Load randomForest package (v. 4.6-6)

> mydata.rf <- randomForest(c ~ ., data = mydata,importance = TRUE)
## Do randomForest

> print(mydata.rf)
## Print results

[output]
Call:
 randomForest(formula = c ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 68

          Mean of squared residuals: 0.02395482
                    % Var explained: 26.23
[output]

> test.predict <- predict(mydata.rf,mydata)
> write.csv(test.predict, file ="RFTestPredict_c.csv")
## Write predicted coefficient values out as a .csv file.

> varImpPlot(mydata.rf,main="Random Forest Variable Importance")
## Get plot of descriptor importance

[output] ## see following image: "Random Forest Variable Importance"
[[code]]

The varImpPlot command generates the following image, which shows that Topological Polar Surface Area (TopoPSA) and XLogP are the most important physicochemical properties for predicting the Abraham solvent coefficient c.
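The importance ranking can also be read numerically rather than from the plot. The following is a minimal sketch on synthetic data, not the solvent data set: the variables x1, x2, x3, y and the object names are hypothetical, and the response is constructed to depend on x1 only so that the expected ranking is obvious.

```r
library(randomForest)

set.seed(1)
n <- 78  # same number of rows as the solvent table, otherwise arbitrary

# Synthetic predictors; only x1 carries signal.
toy <- data.frame(x1 = runif(n), x2 = runif(n), x3 = runif(n))
toy$y <- 3 * toy$x1 + rnorm(n, sd = 0.1)

toy.rf <- randomForest(y ~ ., data = toy, importance = TRUE)

# importance() returns a matrix with columns %IncMSE and IncNodePurity
# for a regression forest fitted with importance = TRUE.
imp    <- importance(toy.rf)
ranked <- sort(imp[, "%IncMSE"], decreasing = TRUE)

print(ranked)
```

Applied to mydata.rf, the same two lines would list the descriptors in the order shown in the left-hand panel of the varImpPlot output.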



The following chart was generated with Tableau Public (v. 7.0) using the file [|cwithdescriptorsforTP.csv] and plots the predicted Abraham solvent coefficient c against the experimental value for c. The data points are colored from red to green by Topological Polar Surface Area.
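One point worth keeping in mind when reading predicted-versus-experimental charts: predict(mydata.rf, mydata) passes the training rows back through the forest, whereas the "% Var explained" value printed by randomForest is based on out-of-bag (OOB) predictions. If a chart is built from the in-sample values, it can look tighter than the printed statistics suggest. The sketch below illustrates the difference on synthetic data; all variable names and the data themselves are hypothetical.

```r
library(randomForest)

set.seed(42)
n <- 78

# Synthetic data: a weak signal in x1 plus noise.
toy <- data.frame(x1 = runif(n), x2 = runif(n), x3 = runif(n))
toy$y <- toy$x1 + rnorm(n, sd = 0.3)

toy.rf <- randomForest(y ~ ., data = toy)

oob_pred      <- predict(toy.rf)       # no newdata: out-of-bag predictions
insample_pred <- predict(toy.rf, toy)  # training rows passed back in

mse_oob      <- mean((toy$y - oob_pred)^2)
mse_insample <- mean((toy$y - insample_pred)^2)

# Pseudo R-squared in the form randomForest reports as "% Var explained".
pct_var <- 100 * (1 - mse_oob / (var(toy$y) * (n - 1) / n))

print(c(mse_oob = mse_oob, mse_insample = mse_insample, pct_var = pct_var))
```

Writing predict(mydata.rf) instead of predict(mydata.rf, mydata) to the CSV file would give predicted values that are consistent with the printed mean of squared residuals.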



Abraham Solvent Coefficient e
The randomForest package (v4.6-6) in R (v2.14.0) was used to build the random forest model for the prediction of the Abraham solvent coefficient e using the following code:

[[code format="rsplus"]]
> mydata = read.csv(file="ewithdescriptorsreadyformodeling.csv",header=TRUE,row.names="molID")
## Load data

> require(randomForest)
## Load randomForest package (v. 4.6-6)

> mydata.rf <- randomForest(e ~ ., data = mydata,importance = TRUE)
## Do randomForest

> print(mydata.rf)
## Print results

[output]
Call:
 randomForest(formula = e ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 68

          Mean of squared residuals: 0.03508388
                    % Var explained: 23.17
[output]

> test.predict <- predict(mydata.rf,mydata)
> write.csv(test.predict, file ="RFTestPredict_e.csv")
## Write predicted coefficient values out as a .csv file.

> varImpPlot(mydata.rf,main="Random Forest Variable Importance")
## Get plot of descriptor importance

[output] ## see following image: "Random Forest Variable Importance"
[[code]]

The varImpPlot command generates the following image, which shows that MDEC.12 and XLogP are the most important physicochemical properties for predicting the Abraham solvent coefficient e.



The following chart was generated with Tableau Public (v. 7.0) using the file [|ewithdescriptorsforTP.csv] and plots the predicted Abraham solvent coefficient e against the experimental value for e. The data points are colored from red to green by MDEC.12.



Abraham Solvent Coefficient s
The randomForest package (v4.6-6) in R (v2.14.0) was used to build the random forest model for the prediction of the Abraham solvent coefficient s using the following code:

[[code format="rsplus"]]
> mydata = read.csv(file="swithdescriptorsreadyformodeling.csv",header=TRUE,row.names="molID")
## Load data

> require(randomForest)
## Load randomForest package (v. 4.6-6)

> mydata.rf <- randomForest(s ~ ., data = mydata,importance = TRUE)
## Do randomForest

> print(mydata.rf)
## Print results

[output]
Call:
 randomForest(formula = s ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 68

          Mean of squared residuals: 0.110901
                    % Var explained: 75.56
[output]

> test.predict <- predict(mydata.rf,mydata)
> write.csv(test.predict, file ="RFTestPredict_s.csv")
## Write predicted coefficient values out as a .csv file.

> varImpPlot(mydata.rf,main="Random Forest Variable Importance")
## Get plot of descriptor importance

[output] ## see following image: "Random Forest Variable Importance"
[[code]]

The varImpPlot command generates the following image, which shows that nAtomP and ATSc1 are the most important physicochemical properties for predicting the Abraham solvent coefficient s.



The following chart was generated with Tableau Public (v. 7.0) using the file [|swithdescriptorsfortp.csv] and plots the predicted Abraham solvent coefficient s against the experimental value for s. The data points are colored from red to green by nAtomP.



Abraham Solvent Coefficient a
The randomForest package (v4.6-6) in R (v2.14.0) was used to build the random forest model for the prediction of the Abraham solvent coefficient a using the following code:

[[code format="rsplus"]]
> mydata = read.csv(file="awithdescriptorsreadyformodeling.csv",header=TRUE,row.names="molID")
## Load data

> require(randomForest)
## Load randomForest package (v. 4.6-6)

> mydata.rf <- randomForest(a ~ ., data = mydata,importance = TRUE)
## Do randomForest

> print(mydata.rf)
## Print results

[output]
Call:
 randomForest(formula = a ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 68

          Mean of squared residuals: 0.2675513
                    % Var explained: 90.51
[output]

> test.predict <- predict(mydata.rf,mydata)
> write.csv(test.predict, file ="RFTestPredict_a.csv")
## Write predicted coefficient values out as a .csv file.

> varImpPlot(mydata.rf,main="Random Forest Variable Importance")
## Get plot of descriptor importance

[output] ## see following image: "Random Forest Variable Importance"
[[code]]

The varImpPlot command generates the following image, which shows that nHBAcc is the most important physicochemical property for predicting the Abraham solvent coefficient a.



The following chart was generated with Tableau Public (v. 7.0) using the file [|awithdescriptorsforTP.csv] and plots the predicted Abraham solvent coefficient a against the experimental value for a. The data points are colored from red to green by nHBAcc.

Abraham Solvent Coefficient b
The randomForest package (v4.6-6) in R (v2.14.0) was used to build the random forest model for the prediction of the Abraham solvent coefficient b using the following code:

[[code format="rsplus"]]
> mydata = read.csv(file="bwithdescriptorsreadyformodeling.csv",header=TRUE,row.names="molID")
## Load data

> require(randomForest)
## Load randomForest package (v. 4.6-6)

> mydata.rf <- randomForest(b ~ ., data = mydata,importance = TRUE)
## Do randomForest

> print(mydata.rf)
## Print results

[output]
Call:
 randomForest(formula = b ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 68

          Mean of squared residuals: 0.2219764
                    % Var explained: 48.1
[output]

> test.predict <- predict(mydata.rf,mydata)
> write.csv(test.predict, file ="RFTestPredict_b.csv")
## Write predicted coefficient values out as a .csv file.

> varImpPlot(mydata.rf,main="Random Forest Variable Importance")
## Get plot of descriptor importance

[output] ## see following image: "Random Forest Variable Importance"
[[code]]

The varImpPlot command generates the following image, which shows that khs-sOH is the most important physicochemical property for predicting the Abraham solvent coefficient b.



The following chart was generated with Tableau Public (v. 7.0) using the file [|bwithdescriptorsforTP.csv] and plots the predicted Abraham solvent coefficient b against the experimental value for b. The data points are colored from red to green by khs-sOH.



Abraham Solvent Coefficient v
The randomForest package (v4.6-6) in R (v2.14.0) was used to build the random forest model for the prediction of the Abraham solvent coefficient v using the following code:

[[code format="rsplus"]]
> mydata = read.csv(file="vwithdescriptorsreadyformodeling.csv",header=TRUE,row.names="molID")
## Load data

> require(randomForest)
## Load randomForest package (v. 4.6-6)

> mydata.rf <- randomForest(v ~ ., data = mydata,importance = TRUE)
## Do randomForest

> print(mydata.rf)
## Print results

[output]
Call:
 randomForest(formula = v ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 68

          Mean of squared residuals: 0.08849263
                    % Var explained: 57.45
[output]

> test.predict <- predict(mydata.rf,mydata)
> write.csv(test.predict, file ="RFTestPredict_v.csv")
## Write predicted coefficient values out as a .csv file.

> varImpPlot(mydata.rf,main="Random Forest Variable Importance")
## Get plot of descriptor importance

[output] ## see following image: "Random Forest Variable Importance"
[[code]]

The varImpPlot command generates the following image, which shows that XLogP is the most important physicochemical property for predicting the Abraham solvent coefficient v.



The following chart was generated with Tableau Public (v. 7.0) and plots the predicted Abraham solvent coefficient v against the experimental value for v. The data points are colored from red to green by XLogP.
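For a side-by-side view, the statistics printed by print(mydata.rf) in the six sections above can be collected into one table. The sketch below only re-enters the numbers reported on this page and sorts them; the data frame and its column names are illustrative, and the RMSE column is simply the square root of the reported mean of squared residuals.

```r
# Values transcribed from the randomForest output of each section above.
results <- data.frame(
  coefficient   = c("c", "e", "s", "a", "b", "v"),
  mse           = c(0.02395482, 0.03508388, 0.110901,
                    0.2675513, 0.2219764, 0.08849263),
  var_explained = c(26.23, 23.17, 75.56, 90.51, 48.1, 57.45),
  stringsAsFactors = FALSE
)

# Root-mean-square error, in the units of each coefficient.
results$rmse <- sqrt(results$mse)

# Rank the six models by "% Var explained", best first.
ranked <- results[order(results$var_explained, decreasing = TRUE), ]

print(ranked, row.names = FALSE)
```

Sorted this way, the models for a and s explain the most variance, while those for c and e explain the least.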