AbrahamDescriptorsModel003

Predicting Abraham Descriptors - model003

 * Researcher: Andrew Lang**

Data
The original data for these models was collected from the literature, calculated using the method of Abraham[1] with solubility data from the ONSChallenge, and in large part donated by William Acree Jr. and is the largest Open Data collection of Abraham descriptors currently available. The data used here is a refined version of the data used in model002 with four additional compounds removed: rifapentine, 2-naphthoic acid, and paclitaxel due to unreasonable (a big gap between these three compounds and the rest when measured from the centre of the chemical space) Abraham descriptors; and COS due to an incorrect SMILES. The salt phenyl mercuric (II) chloride (molID 1615) was also removed. The number of hydrogen bond donors for several azoles was fixed manually (0 to 1) - molID: 1502, 1812, 1498, 1843, 1942, 1510, 1954. The value for the V descriptor was also corrected for five compounds with SMILES that contained a # and all descriptors with constant zero values (khs.xxx) were deleted, leaving a of Abraham descriptors for 2475 compounds with 171 2D CDK descriptors, summarized in the table below.

Not all compounds had values for all descriptors and thus the dataset was split into 6 for modeling purposes, eliminating compounds which did not have values for that descriptor - marked as '-123'. The summary statistics for each descriptor are listed in the table below. Note: In general, we take A to be zero when the number of hydrogen bond donors is zero[2]. There are two main exceptions to this general rule: N-oxides[3] and highly halogenated compounds: "...in the case of the N-oxides, the reason for the nonzero A value results from the fact that the N-O bond is dative. The bond is sometimes written as N+=O-, and in the case of the N-oxides the A value could be correlated with the R resonance parameter...For some of the highly halogenated compounds, the hydrogen atom might exhibit a slight acidic character. Chloroform has a small nonzero A value, and I would expect difluorochloromethane and trichloromethane to behave somewhat similarly." - William Acree Jr. [personal correspondence]
 * descriptor || n || min || max || average || stddev ||
 * E || 2144 || -1.137 || 5.691 || 0.884 || 0.779 ||
 * S || 2216 || -0.881 || 6.463 || 0.996 || 0.669 ||
 * A || 2216 || 0.000 || 2.190 || 0.174 || 0.290 ||
 * B || 2202 || 0.000 || 4.350 || 0.487 || 0.478 ||
 * V || 2216 || 0.167 || 6.390 || 1.299 || 0.639 ||
 * L || 1973 || -0.836 || 28.940 || 5.954 || 3.087 ||

Modeling
Random forest models were created using R 2.13.0 and the Random Forest 4.6-6 package. code require(randomForest)
 * E: The Excess Molar Refractivity.**
 * 1) load in the Random Forest package

mydata=read.csv(file="20120528CuratedDataE.csv",head=TRUE,row.names="molID")
 * 1) read in data

mydata.rf <- randomForest(E ~., data = mydata,importance=TRUE)
 * 1) perform randomForest

print(mydata.rf) [output] Call: randomForest(formula = E ~ ., data = mydata, importance = TRUE) Type of random forest: regression Number of trees: 500 No. of variables tried at each split: 57
 * 1) view results

Mean of squared residuals: 0.02386857 % Var explained: 96.07 [output]

varImpPlot(mydata.rf,main = "Random Forest Variable Importance: E")
 * 1) show variable importance

.saveRDS(mydata.rf, file = "AD003E")
 * 1) save model

test.predict <- predict(mydata.rf,mydata)
 * 1) predict values

write.csv(test.predict, file="RFTestPredictE.csv") code The Random Forest model for E is available for and use under a CC0 license.
 * 1) write output data

code require(randomForest)
 * S: Diploarity/Polarizability**
 * 1) load in the Random Forest package

mydata=read.csv(file="20120528CuratedDataS.csv",head=TRUE,row.names="molID")
 * 1) read in data

mydata.rf <- randomForest(S ~., data = mydata,importance=TRUE)
 * 1) perform randomForest

print(mydata.rf) [output] Call: randomForest(formula = S ~ ., data = mydata, importance = TRUE) Type of random forest: regression Number of trees: 500 No. of variables tried at each split: 57
 * 1) view results

Mean of squared residuals: 0.07893424 % Var explained: 82.34 [output]

varImpPlot(mydata.rf,main = "Random Forest Variable Importance: S")
 * 1) show variable importance

.saveRDS(mydata.rf, file = "AD003S")
 * 1) save model

test.predict <- predict(mydata.rf,mydata)
 * 1) predict values

write.csv(test.predict, file="RFTestPredictS.csv") code The Random Forest model for S is available for and use under a CC0 license.
 * 1) write output data

code mydata=read.csv(file="20120528CuratedDataA.csv",head=TRUE,row.names="molID")
 * A: Hydrogen Bond Acidity**

mydata.rf <- randomForest(A ~., data = mydata,importance=TRUE)
 * 1) perform randomForest

print(mydata.rf) [output] Call: randomForest(formula = A ~ ., data = mydata, importance = TRUE) Type of random forest: regression Number of trees: 500 No. of variables tried at each split: 57
 * 1) view results

Mean of squared residuals: 0.02133837 % Var explained: 74.65 [output]

varImpPlot(mydata.rf,main = "Random Forest Variable Importance: A")
 * 1) show variable importance

.saveRDS(mydata.rf, file = "AD003A")
 * 1) save model

test.predict <- predict(mydata.rf,mydata)
 * 1) predict values

write.csv(test.predict, file="RFTestPredictA.csv") code The Random Forest model for A is available for and use under a CC0 license.
 * 1) write output data

code mydata=read.csv(file="20120528CuratedDataB.csv",head=TRUE,row.names="molID")
 * B: Hydrogen Bond Basicity**
 * 1) read in data

mydata.rf <- randomForest(B ~., data = mydata,importance=TRUE)
 * 1) perform randomForest

print(mydata.rf) [output] Call: randomForest(formula = B ~ ., data = mydata, importance = TRUE) Type of random forest: regression Number of trees: 500 No. of variables tried at each split: 57
 * 1) view results

Mean of squared residuals: 0.01977166 % Var explained: 91.33 [output]

varImpPlot(mydata.rf,main = "Random Forest Variable Importance: B")
 * 1) show variable importance

.saveRDS(mydata.rf, file = "AD003B")
 * 1) save model

test.predict <- predict(mydata.rf,mydata)
 * 1) predict values

write.csv(test.predict, file="RFTestPredictB.csv") code The Random Forest model for B is available for and use under a CC0 license.
 * 1) write output data

code mydata=read.csv(file="20120528CuratedDataV.csv",head=TRUE,row.names="molID")
 * V: McGowan Volume**
 * 1) read in data

mydata.rf <- randomForest(V ~., data = mydata,importance=TRUE)
 * 1) perform randomForest

print(mydata.rf) [output] Call: randomForest(formula = V ~ ., data = mydata, importance = TRUE) Type of random forest: regression Number of trees: 500 No. of variables tried at each split: 57
 * 1) view results

Mean of squared residuals: 0.008249509 % Var explained: 97.98 [output]

varImpPlot(mydata.rf,main = "Random Forest Variable Importance: V")
 * 1) show variable importance

.saveRDS(mydata.rf, file = "AD003V")
 * 1) save model

test.predict <- predict(mydata.rf,mydata)
 * 1) predict values

write.csv(test.predict, file="RFTestPredictV.csv") code The Random Forest model for V is available for and use under a CC0 license.
 * 1) write output data

code mydata=read.csv(file="20120528CuratedDataL.csv",head=TRUE,row.names="molID")
 * L: The logarithm of the solute gas phase dimensionless Ostwald partition coefficient into hexadecane at 298 K**
 * 1) read in data

mydata.rf <- randomForest(L ~., data = mydata,importance=TRUE)
 * 1) perform randomForest

print(mydata.rf) [output] Call: randomForest(formula = L ~ ., data = mydata, importance = TRUE) Type of random forest: regression Number of trees: 500 No. of variables tried at each split: 57
 * 1) view results

Mean of squared residuals: 0.3858601 % Var explained: 95.95 [output]

varImpPlot(mydata.rf,main = "Random Forest Variable Importance: L")
 * 1) show variable importance

.saveRDS(mydata.rf, file = "AD003L")
 * 1) save model

test.predict <- predict(mydata.rf,mydata)
 * 1) predict values

write.csv(test.predict, file="RFTestPredictL.csv") code The Random Forest model for L is available for and use under a CC0 license.
 * 1) write output data

code cd C:\alang\share\MyMesh\ONSC\qsardb
 * Deploying the models using QsarDB.** Parallel QDB format archives were created using the same data and variations of the following code. In a CMD window.

java -Xms512M -Xmx1024M -cp conversion-toolkit-r541.jar org.qsardb.conversion.SpreadsheetConverter --id D --smiles B --name A --properties C --source C:/alang/share/MyMesh/ONSC/qsardb/originaldata/AbrahamSolutes/E/E.csv --target C:/alang/share/MyMesh/ONSC/qsardb/AbrahamSoluteE

[prompt] Id (default: 'column_C'): E Name (default: 'Column C'): E Value format pattern (default: as-is): 0.### [prompt]

java -cp prediction-toolkit-r541.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/AbrahamSoluteE add-cdk

java -Xms512M -Xmx1024M -cp prediction-toolkit-r541.jar org.qsardb.prediction.DescriptorCalculator --dir C:/alang/share/MyMesh/ONSC/qsardb/AbrahamSoluteE

java -cp prediction-toolkit-r541.jar org.qsardb.prediction.ModelRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/AbrahamSoluteE add --id rf --name "Random forest regression" --property-id E

java -cp prediction-toolkit-r541.jar org.qsardb.prediction.ModelRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/AbrahamSoluteE attach-rds --id rf

java -cp prediction-toolkit-r541.jar org.qsardb.prediction.PredictionRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/AbrahamSoluteE add --id rf-training --name "Random forest regression (training)" --model-id rf

java -cp prediction-toolkit-r541.jar org.qsardb.prediction.PredictionRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/AbrahamSoluteE attach-values --id rf-training code In R, we then used variations of the following code: code suppressMessages(library("randomForest"))

qdbDir = "C:/alang/share/MyMesh/ONSC/qsardb/AbrahamSoluteE"

propertyId = 'E'

descriptorIdList = c('ALogP', 'ALogp2', 'AMR', 'BCUTw-1l', 'BCUTw-1h', 'BCUTc-1l', 'BCUTc-1h', 'BCUTp-1l', 'BCUTp-1h', 'fragC', 'nAcid', 'apol', 'naAromAtom', 'nAromBond', 'nAtom', 'ATSc1', 'ATSc2', 'ATSc3', 'ATSc4', 'ATSc5', 'ATSm1', 'ATSm2', 'ATSm3', 'ATSm4', 'ATSm5', 'ATSp1', 'ATSp2', 'ATSp3', 'ATSp4', 'ATSp5', 'nBase', 'nB', 'bpol', 'C1SP1', 'C2SP1', 'C1SP2', 'C2SP2', 'C3SP2', 'C1SP3', 'C2SP3', 'C3SP3', 'C4SP3', 'SCH-3', 'SCH-4', 'SCH-5', 'SCH-6', 'SCH-7', 'VCH-3', 'VCH-4', 'VCH-5', 'VCH-6', 'VCH-7', 'SC-3', 'SC-4', 'SC-5', 'SC-6', 'VC-3', 'VC-4', 'VC-5', 'VC-6', 'SP-0', 'SP-1', 'SP-2', 'SP-3', 'SP-4', 'SP-5', 'SP-6', 'SP-7', 'VP-0', 'VP-1', 'VP-2', 'VP-3', 'VP-4', 'VP-5', 'VP-6', 'VP-7', 'SPC-4', 'SPC-5', 'SPC-6', 'VPC-4', 'VPC-5', 'VPC-6', 'ECCEN', 'FMF', 'nHBDon', 'nHBAcc', 'khs.sCH3', 'khs.dCH2', 'khs.ssCH2', 'khs.tCH', 'khs.dsCH', 'khs.aaCH', 'khs.sssCH', 'khs.ddC', 'khs.tsC', 'khs.dssC', 'khs.aasC', 'khs.aaaC', 'khs.ssssC', 'khs.sNH2', 'khs.dNH', 'khs.ssNH', 'khs.aaNH', 'khs.tN', 'khs.dsN', 'khs.aaN', 'khs.sssN', 'khs.ddsN', 'khs.aasN', 'khs.ssssN', 'khs.sOH', 'khs.dO', 'khs.ssO', 'khs.aaO', 'khs.sF', 'khs.sSiH3', 'khs.ssSiH2', 'khs.sssSiH', 'khs.ssssSi', 'khs.sssP', 'khs.dsssP', 'khs.sSH', 'khs.dS', 'khs.ssS', 'khs.aaS', 'khs.dssS', 'khs.ddssS', 'khs.sCl', 'khs.sBr', 'khs.sI', 'Kier1', 'Kier2', 'nAtomLC', 'nAtomP', 'LipinskiFailures', 'nAtomLAC', 'MLogP', 'MDEC-11', 'MDEC-12', 'MDEC-13', 'MDEC-14', 'MDEC-22', 'MDEC-23', 'MDEC-24', 'MDEC-33', 'MDEC-34', 'MDEC-44', 'MDEO-11', 'MDEO-12', 'MDEO-22', 'MDEN-11', 'MDEN-12', 'MDEN-13', 'MDEN-22', 'MDEN-23', 'MDEN-33', 'PetitjeanNumber', 'nRotB', 'TopoPSA', 'VAdjMat', 'VABC', 'MW', 'WTPT-1', 'WTPT-2', 'WTPT-3', 'WTPT-4', 'WTPT-5', 'WPATH', 'WPOL', 'XLogP', 'Zagreb')

loadValues = function(path, id){ result = read.table(path, header = TRUE, sep = "\t", na.strings = "N/A") result = na.omit(result) names(result) = c('Id', gsub("-", "_", x = id)) return (result) }

loadPropertyValues = function(id){ return (loadValues(paste(sep = "/", qdbDir, "properties", id, "values"), id)) }

loadDescriptorValues = function(id){ return (loadValues(paste(sep = "/", qdbDir, "descriptors", id, "values"), id)) }

rfdata = loadPropertyValues(propertyId)

for(descriptorId in descriptorIdList){ print (descriptorId) rfdata = merge(rfdata, loadDescriptorValues(descriptorId), by = 'Id') }

compoundIds = rfdata$Id

rfdata$Id = NULL

rfmodel = randomForest(formula = E ~ ., data = rfdata) print(rfmodel)

object = list

object$propertyId = propertyId object$getPropertyId = function(self){ return (self$propertyId) }

object$descriptorIdList = descriptorIdList object$getDescriptorIdList = function(self){ return (self$descriptorIdList) }

object$rfmodel = rfmodel object$evaluate = function(self, values){ suppressMessages(require("randomForest"))

descriptorIdList = self$getDescriptorIdList(self) descriptorIdList = sapply(descriptorIdList, function(x) gsub("-", "_", x))

newrfdata = data.frame(c = NA) for(i in 1:length(descriptorIdList)){ newrfdata[descriptorIdList[i]] = values[i] }

return (predict(self$rfmodel, newdata = newrfdata)) }

.saveRDS(file = paste(sep = "/", qdbDir, "models/rf/rds"), object)

rfvalues = predict(rfmodel, rfdata) predictedValues = data.frame(compoundIds, rfvalues) write.table(predictedValues, file = paste(sep = "/", qdbDir, "predictions/rf-training/values"), col.names = c("csid", "E"), row.names = FALSE, quote = FALSE, sep = "\t") code We then copied the datasets for the various descriptors into the QDB folder and zipped it up ready for deployment. Once zipped the archives were tested locally using, in the CMD window, variations of: code java -cp prediction-toolkit-r541.jar org.qsardb.prediction.SMILESPredictor --archive AbrahamSoluteE.qdb.zip --format "0.000" --smiles "c1ccc(cc1)O" code

Results
We modeled all Abraham descriptors: E, S, A, B, V, L with the following results: The models are superior to all previously published models both here and in the literature. The most difficult descriptor to model is A (as found in all previous analyses), the hydrogen bond acidity. This is due in part to the convention of taking A to be zero when the number of hydrogen bond donors is zero (except in certain exceptions) and requiring A to be non-negative, even when linear regression of known solubilities leads to a negative value for A. We model V, even though it can be derived exactly from structure for most compounds, for completeness sake.
 * descriptor || n || OOB-R2 || OOB-RMSE ||
 * E || 2144 || 0.96 || 0.154 ||
 * S || 2216 || 0.82 || 0.281 ||
 * A || 2216 || 0.75 || 0.146 ||
 * B || 2202 || 0.91 || 0.141 ||
 * V || 2216 || 0.98 || 0.091 ||
 * L || 1973 || 0.96 || 0.621 ||

Looking at the top 20 outliers for each model, we find the following compounds may have suspect descriptors. Compounds in the top twenty outliers for multiple descriptors: iodine, sucrose, digitoxin, 1,19-divinyldodecylmethyldecasiloxane, 1,3,5-tri(trifluoromethyl)benzene, riboflavin, 1-hydroxypyridinium, 1-H-tetrazole-1-acetic acid, 3-nitro-1,2,4-triazol-5-one, cimetidine, decafluorobiphenyl, hesperetin, isonicotinic acid, luteolin, mandelic acid, methyl p-aminobenzoate, phenobarbital, phloretin, rutin Compounds that are the top outliers for only a certain descriptor: perfluoreicosane (L).