Predicting Abraham Descriptors - model003

Researcher: Andrew Lang

Data

The original data for these models was collected from the literature, calculated using the method of Abraham[1] with solubility data from the ONSChallenge, and in large part donated by William Acree Jr. and is the largest Open Data collection of Abraham descriptors currently available. The data used here is a refined version of the data used in model002 with four additional compounds removed: rifapentine, 2-naphthoic acid, and paclitaxel due to unreasonable (a big gap between these three compounds and the rest when measured from the centre of the chemical space) Abraham descriptors; and COS due to an incorrect SMILES. The salt phenyl mercuric (II) chloride (molID 1615) was also removed. The number of hydrogen bond donors for several azoles was fixed manually (0 to 1) - molID: 1502, 1812, 1498, 1843, 1942, 1510, 1954. The value for the V descriptor was also corrected for five compounds with SMILES that contained a # and all descriptors with constant zero values (khs.xxx) were deleted, leaving a dataset of Abraham descriptors for 2475 compounds with 171 2D CDK descriptors, summarized in the table below.

Not all compounds had values for all descriptors and thus the dataset was split into 6 for modeling purposes, eliminating compounds which did not have values for that descriptor - marked as '-123'. The summary statistics for each descriptor are listed in the table below.
descriptor
n
min
max
average
stddev
E
2144
-1.137
5.691
0.884
0.779
S
2216
-0.881
6.463
0.996
0.669
A
2216
0.000
2.190
0.174
0.290
B
2202
0.000
4.350
0.487
0.478
V
2216
0.167
6.390
1.299
0.639
L
1973
-0.836
28.940
5.954
3.087
Note: In general, we take A to be zero when the number of hydrogen bond donors is zero[2]. There are two main exceptions to this general rule: N-oxides[3] and highly halogenated compounds: "...in the case of the N-oxides, the reason for the nonzero A value results from the fact that the N-O bond is dative. The bond is sometimes written as N+=O-, and in the case of the N-oxides the A value could be correlated with the R resonance parameter...For some of the highly halogenated compounds, the hydrogen atom might exhibit a slight acidic character. Chloroform has a small nonzero A value, and I would expect difluorochloromethane and trichloromethane to behave somewhat similarly." - William Acree Jr. [personal correspondence]

Modeling

Random forest models were created using R 2.13.0 and the Random Forest 4.6-6 package.
E: The Excess Molar Refractivity.
## load in the Random Forest package
require(randomForest)
 
## read in data
mydata=read.csv(file="20120528CuratedDataE.csv",head=TRUE,row.names="molID")
 
## perform randomForest
mydata.rf <- randomForest(E ~., data = mydata,importance=TRUE)
 
## view results
print(mydata.rf)
[output]
Call:
 randomForest(formula = E ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 57
 
          Mean of squared residuals: 0.02386857
                    % Var explained: 96.07
[output]
 
## show variable importance
varImpPlot(mydata.rf,main = "Random Forest Variable Importance: E")
 
## save model
.saveRDS(mydata.rf, file = "AD003E")
 
## predict values
test.predict <- predict(mydata.rf,mydata)
 
## write output data
write.csv(test.predict, file="RFTestPredictE.csv")
AD003VarImpE.png
AD003TPE.png
The Random Forest model for E is available for download and use under a CC0 license.

S: Diploarity/Polarizability
## load in the Random Forest package
require(randomForest)
 
## read in data
mydata=read.csv(file="20120528CuratedDataS.csv",head=TRUE,row.names="molID")
 
## perform randomForest
mydata.rf <- randomForest(S ~., data = mydata,importance=TRUE)
 
## view results
print(mydata.rf)
[output]
Call:
 randomForest(formula = S ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 57
 
          Mean of squared residuals: 0.07893424
                    % Var explained: 82.34
[output]
 
## show variable importance
varImpPlot(mydata.rf,main = "Random Forest Variable Importance: S")
 
## save model
.saveRDS(mydata.rf, file = "AD003S")
 
## predict values
test.predict <- predict(mydata.rf,mydata)
 
## write output data
write.csv(test.predict, file="RFTestPredictS.csv")
AD003VarImpS.png
AD003TPS.png
The Random Forest model for S is available for download and use under a CC0 license.

A: Hydrogen Bond Acidity
mydata=read.csv(file="20120528CuratedDataA.csv",head=TRUE,row.names="molID")
 
## perform randomForest
mydata.rf <- randomForest(A ~., data = mydata,importance=TRUE)
 
## view results
print(mydata.rf)
[output]
Call:
 randomForest(formula = A ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 57
 
          Mean of squared residuals: 0.02133837
                    % Var explained: 74.65
[output]
 
## show variable importance
varImpPlot(mydata.rf,main = "Random Forest Variable Importance: A")
 
## save model
.saveRDS(mydata.rf, file = "AD003A")
 
## predict values
test.predict <- predict(mydata.rf,mydata)
 
## write output data
write.csv(test.predict, file="RFTestPredictA.csv")
AD003VarImpA.png
AD003TPA.png
The Random Forest model for A is available for download and use under a CC0 license.

B: Hydrogen Bond Basicity
## read in data
mydata=read.csv(file="20120528CuratedDataB.csv",head=TRUE,row.names="molID")
 
## perform randomForest
mydata.rf <- randomForest(B ~., data = mydata,importance=TRUE)
 
## view results
print(mydata.rf)
[output]
Call:
 randomForest(formula = B ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 57
 
          Mean of squared residuals: 0.01977166
                    % Var explained: 91.33
[output]
 
## show variable importance
varImpPlot(mydata.rf,main = "Random Forest Variable Importance: B")
 
## save model
.saveRDS(mydata.rf, file = "AD003B")
 
## predict values
test.predict <- predict(mydata.rf,mydata)
 
## write output data
write.csv(test.predict, file="RFTestPredictB.csv")
AD003VarImpB.png
AD003TPB.png
The Random Forest model for B is available for download and use under a CC0 license.

V: McGowan Volume
## read in data
mydata=read.csv(file="20120528CuratedDataV.csv",head=TRUE,row.names="molID")
 
## perform randomForest
mydata.rf <- randomForest(V ~., data = mydata,importance=TRUE)
 
## view results
print(mydata.rf)
[output]
Call:
 randomForest(formula = V ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 57
 
          Mean of squared residuals: 0.008249509
                    % Var explained: 97.98
[output]
 
## show variable importance
varImpPlot(mydata.rf,main = "Random Forest Variable Importance: V")
 
## save model
.saveRDS(mydata.rf, file = "AD003V")
 
## predict values
test.predict <- predict(mydata.rf,mydata)
 
## write output data
write.csv(test.predict, file="RFTestPredictV.csv")
AD003VarImpV.png
AD003TPV.png
The Random Forest model for V is available for download and use under a CC0 license.

L: The logarithm of the solute gas phase dimensionless Ostwald partition coefficient into hexadecane at 298 K
## read in data
mydata=read.csv(file="20120528CuratedDataL.csv",head=TRUE,row.names="molID")
 
## perform randomForest
mydata.rf <- randomForest(L ~., data = mydata,importance=TRUE)
 
## view results
print(mydata.rf)
[output]
Call:
 randomForest(formula = L ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 57
 
          Mean of squared residuals: 0.3858601
                    % Var explained: 95.95
[output]
 
## show variable importance
varImpPlot(mydata.rf,main = "Random Forest Variable Importance: L")
 
## save model
.saveRDS(mydata.rf, file = "AD003L")
 
## predict values
test.predict <- predict(mydata.rf,mydata)
 
## write output data
write.csv(test.predict, file="RFTestPredictL.csv")
AD003VarImpL.png
AD003TPL.png
The Random Forest model for L is available for download and use under a CC0 license.

Deploying the models using QsarDB. Parallel QDB format archives were created using the same data and variations of the following code. In a CMD window.
cd C:\alang\share\MyMesh\ONSC\qsardb
 
java -Xms512M -Xmx1024M -cp conversion-toolkit-r541.jar org.qsardb.conversion.SpreadsheetConverter
--id D --smiles B --name A --properties C --source C:/alang/share/MyMesh/ONSC/qsardb/originaldata/AbrahamSolutes/E/E.csv
--target C:/alang/share/MyMesh/ONSC/qsardb/AbrahamSoluteE
 
[prompt]
Id (default: 'column_C'): E
Name (default: 'Column C'): E
Value format pattern (default: as-is): 0.###
[prompt]
 
java -cp prediction-toolkit-r541.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/AbrahamSoluteE add-cdk
 
java -Xms512M -Xmx1024M -cp prediction-toolkit-r541.jar org.qsardb.prediction.DescriptorCalculator
--dir C:/alang/share/MyMesh/ONSC/qsardb/AbrahamSoluteE
 
java -cp prediction-toolkit-r541.jar org.qsardb.prediction.ModelRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/AbrahamSoluteE add --id rf --name "Random forest regression" --property-id E
 
java -cp prediction-toolkit-r541.jar org.qsardb.prediction.ModelRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/AbrahamSoluteE attach-rds --id rf
 
java -cp prediction-toolkit-r541.jar org.qsardb.prediction.PredictionRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/AbrahamSoluteE add --id rf-training --name "Random forest regression (training)" --model-id rf
 
java -cp prediction-toolkit-r541.jar org.qsardb.prediction.PredictionRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/AbrahamSoluteE attach-values --id rf-training
In R, we then used variations of the following code:
suppressMessages(library("randomForest"))
 
qdbDir = "C:/alang/share/MyMesh/ONSC/qsardb/AbrahamSoluteE"
 
propertyId = 'E'
 
descriptorIdList = c('ALogP', 'ALogp2', 'AMR', 'BCUTw-1l', 'BCUTw-1h', 'BCUTc-1l', 'BCUTc-1h', 'BCUTp-1l', 'BCUTp-1h', 'fragC', 'nAcid',
 'apol', 'naAromAtom', 'nAromBond', 'nAtom', 'ATSc1', 'ATSc2', 'ATSc3', 'ATSc4', 'ATSc5', 'ATSm1', 'ATSm2', 'ATSm3', 'ATSm4', 'ATSm5',
 'ATSp1', 'ATSp2', 'ATSp3', 'ATSp4', 'ATSp5', 'nBase', 'nB', 'bpol', 'C1SP1', 'C2SP1', 'C1SP2', 'C2SP2', 'C3SP2', 'C1SP3', 'C2SP3',
 'C3SP3', 'C4SP3', 'SCH-3', 'SCH-4', 'SCH-5', 'SCH-6', 'SCH-7', 'VCH-3', 'VCH-4', 'VCH-5', 'VCH-6', 'VCH-7', 'SC-3', 'SC-4', 'SC-5',
 'SC-6', 'VC-3', 'VC-4', 'VC-5', 'VC-6', 'SP-0', 'SP-1', 'SP-2', 'SP-3', 'SP-4', 'SP-5', 'SP-6', 'SP-7', 'VP-0', 'VP-1', 'VP-2', 'VP-3',
 'VP-4', 'VP-5', 'VP-6', 'VP-7', 'SPC-4', 'SPC-5', 'SPC-6', 'VPC-4', 'VPC-5', 'VPC-6', 'ECCEN', 'FMF', 'nHBDon', 'nHBAcc', 'khs.sCH3',
 'khs.dCH2', 'khs.ssCH2', 'khs.tCH', 'khs.dsCH', 'khs.aaCH', 'khs.sssCH', 'khs.ddC', 'khs.tsC', 'khs.dssC', 'khs.aasC', 'khs.aaaC',
 'khs.ssssC', 'khs.sNH2', 'khs.dNH', 'khs.ssNH', 'khs.aaNH', 'khs.tN', 'khs.dsN', 'khs.aaN', 'khs.sssN', 'khs.ddsN', 'khs.aasN',
 'khs.ssssN', 'khs.sOH', 'khs.dO', 'khs.ssO', 'khs.aaO', 'khs.sF', 'khs.sSiH3', 'khs.ssSiH2', 'khs.sssSiH', 'khs.ssssSi', 'khs.sssP',
 'khs.dsssP', 'khs.sSH', 'khs.dS', 'khs.ssS', 'khs.aaS', 'khs.dssS', 'khs.ddssS', 'khs.sCl', 'khs.sBr', 'khs.sI', 'Kier1', 'Kier2',
 'nAtomLC', 'nAtomP', 'LipinskiFailures', 'nAtomLAC', 'MLogP', 'MDEC-11', 'MDEC-12', 'MDEC-13', 'MDEC-14', 'MDEC-22', 'MDEC-23',
 'MDEC-24', 'MDEC-33', 'MDEC-34', 'MDEC-44', 'MDEO-11', 'MDEO-12', 'MDEO-22', 'MDEN-11', 'MDEN-12', 'MDEN-13', 'MDEN-22', 'MDEN-23',
 'MDEN-33', 'PetitjeanNumber', 'nRotB', 'TopoPSA', 'VAdjMat', 'VABC', 'MW', 'WTPT-1', 'WTPT-2', 'WTPT-3', 'WTPT-4', 'WTPT-5', 'WPATH',
 'WPOL', 'XLogP', 'Zagreb')
 
loadValues = function(path, id){
        result = read.table(path, header = TRUE, sep = "\t", na.strings = "N/A")
        result = na.omit(result)
        names(result) = c('Id', gsub("-", "_", x = id))
        return (result)
}
 
loadPropertyValues = function(id){
        return (loadValues(paste(sep = "/", qdbDir, "properties", id, "values"), id))
}
 
loadDescriptorValues = function(id){
        return (loadValues(paste(sep = "/", qdbDir, "descriptors", id, "values"), id))
}
 
rfdata = loadPropertyValues(propertyId)
 
for(descriptorId in descriptorIdList){
        print (descriptorId)
        rfdata = merge(rfdata, loadDescriptorValues(descriptorId), by = 'Id')
}
 
compoundIds = rfdata$Id
 
rfdata$Id = NULL
 
rfmodel = randomForest(formula = E ~ ., data = rfdata)
print(rfmodel)
 
 
object = list()
 
object$propertyId = propertyId
object$getPropertyId = function(self){
        return (self$propertyId)
}
 
object$descriptorIdList = descriptorIdList
object$getDescriptorIdList = function(self){
        return (self$descriptorIdList)
}
 
object$rfmodel = rfmodel
object$evaluate = function(self, values){
        suppressMessages(require("randomForest"))
 
        descriptorIdList = self$getDescriptorIdList(self)
        descriptorIdList = sapply(descriptorIdList, function(x) gsub("-", "_", x))
 
        newrfdata = data.frame(c = NA)
        for(i in 1:length(descriptorIdList)){
                newrfdata[descriptorIdList[i]] = values[i]
        }
 
        return (predict(self$rfmodel, newdata = newrfdata))
}
 
.saveRDS(file = paste(sep = "/", qdbDir, "models/rf/rds"), object)
 
 
rfvalues = predict(rfmodel, rfdata)
predictedValues = data.frame(compoundIds, rfvalues)
write.table(predictedValues, file = paste(sep = "/", qdbDir, "predictions/rf-training/values"),
col.names = c("csid", "E"), row.names = FALSE, quote = FALSE, sep = "\t")
We then copied the datasets for the various descriptors into the QDB folder and zipped it up ready for deployment. Once zipped the archives were tested locally using, in the CMD window, variations of:
java -cp prediction-toolkit-r541.jar org.qsardb.prediction.SMILESPredictor
--archive AbrahamSoluteE.qdb.zip --format "0.000" --smiles "c1ccc(cc1)O"

Results

We modeled all Abraham descriptors: E, S, A, B, V, L with the following results:
descriptor
n
OOB-R2
OOB-RMSE
E
2144
0.96
0.154
S
2216
0.82
0.281
A
2216
0.75
0.146
B
2202
0.91
0.141
V
2216
0.98
0.091
L
1973
0.96
0.621
The models are superior to all previously published models both here and in the literature. The most difficult descriptor to model is A (as found in all previous analyses), the hydrogen bond acidity. This is due in part to the convention of taking A to be zero when the number of hydrogen bond donors is zero (except in certain exceptions) and requiring A to be non-negative, even when linear regression of known solubilities leads to a negative value for A. We model V, even though it can be derived exactly from structure for most compounds, for completeness sake.

Looking at the top 20 outliers for each model, we find the following compounds may have suspect descriptors.
Compounds in the top twenty outliers for multiple descriptors: iodine, sucrose, digitoxin, 1,19-divinyldodecylmethyldecasiloxane, 1,3,5-tri(trifluoromethyl)benzene, riboflavin, 1-hydroxypyridinium, 1-H-tetrazole-1-acetic acid, 3-nitro-1,2,4-triazol-5-one, cimetidine, decafluorobiphenyl, hesperetin, isonicotinic acid, luteolin, mandelic acid, methyl p-aminobenzoate, phenobarbital, phloretin, rutin
Compounds that are the top outliers for only a certain descriptor: perfluoreicosane (L).

References

[1] Abraham MH, et al. 2009. Prediction of Solubility of Drugs and Other Compounds in Organic Solvents. Journal of Pharmaceutical Sciences. DOI: 10.1002/jps.21922
[2] M.H. Abraham, et al. 1989. 699. All compounds with unactive hydrogen taken as zero.
[3] Abraham MH, et al. 2011. The lipophilicity and hydrogen bond strength of pyridine-N-oxides andprotonated pyridine-N-oxides. DOI: 10.1039/c0nj00893a