A random forest melting point model from 19,410 open data melting points

Researcher

Andrew S.I.D. Lang

Objective

To create a Random Forest-based melting point model and distribute it as a web service using the QsarDB open digital repository.

Background

Our goal is to create an Open (CC0) melting point model using Open Data and Open Descriptors (CDK), following a transparent, reproducible, and open procedure. One problem we have encountered with our previous models is deployment. One recent solution for deploying models (of all types) in the open is the QsarDB open digital repository, developed by Villu Ruusmann.

Procedure

Data Collection and Curation. The dataset (ONSMP032) was created from the live Open Melting Point Data (20120516). The following criteria were used to remove compounds: reported melting point range greater than 15 °C, mp > 400 °C, mp < -200 °C, compound contains Si, compound contains [2H] or [3H]. The SMILES of the remaining compounds in the dataset were run through Rajarshi Guha's CDK Descriptor Calculator GUI (v1.3.2) to generate 2D descriptors. The option to add explicit hydrogens was selected and all descriptors were calculated except the following: Charged Partial Surface Area, Ionization Potential, WHIM, and all protein and geometrical descriptors. We removed compounds that failed to generate some or all CDK descriptors (mostly compounds containing B or P). The descriptors Kier3 and HybRatio were removed as they contained multiple NA entries. Descriptors with all-zero entries (all of them khs descriptors) were removed. This left a dataset of 19,410 compounds and 172 descriptors (ONSMP033).
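For illustration, the curation criteria amount to a simple filter; a minimal R sketch follows (the data frame mpdata and its columns mpC, mprange, and SMILES are assumed names, since the actual curation was done on the spreadsheet):
## hypothetical curation sketch: mpdata is assumed to hold the raw data with
## columns mpC (melting point), mprange (reported range), and SMILES
keep = with(mpdata,
        mprange <= 15 & mpC <= 400 & mpC >= -200 &
        !grepl("Si", SMILES, fixed = TRUE) &      # remove Si-containing compounds
        !grepl("[2H]", SMILES, fixed = TRUE) &    # remove deuterated compounds
        !grepl("[3H]", SMILES, fixed = TRUE))     # remove tritiated compounds
mpdata = mpdata[keep, ]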

Summary Statistics. Excel and Tableau Public were used to generate the following summary statistics and visualizations for the data (ONSMP033). 91% of the 19,410 compounds have a MW between 25 and 450 and a melting point between -75 °C and 290 °C.

        mean  median   min   max  stddev
MW       238     214    16  1000     106
mpC      107     108  -199   389      91

(MW in g/mol; mpC in °C)
MWHistogram.png
mpCHistogram.png
VisualizingDataset.png
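The summary statistics can also be reproduced in R as a quick check (a sketch, reading the curated file used in the modeling step below):
## reproduce the summary statistics (sketch)
mydata = read.csv(file = "CuratedDataReadyForRF.csv", header = TRUE, row.names = "molID")
stats = function(x) c(mean = mean(x), median = median(x), min = min(x), max = max(x), stddev = sd(x))
round(rbind(MW = stats(mydata$MW), mpC = stats(mydata$mpC)))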
Random Forest. The randomForest package (v 4.5-34) in R was used to build the random forest models using the following code:
library(randomForest)
## load the curated data
mydata = read.csv(file = "CuratedDataReadyForRF.csv", header = TRUE, row.names = "molID")
## do random forest [randomForest 4.5-34]
mydata.rf <- randomForest(mpC ~ ., data = mydata, importance = TRUE)
print(mydata.rf)
 
[output]
Call:
 randomForest(formula = mpC ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 57
 
          Mean of squared residuals: 1552.277
                    % Var explained: 81.11
[output]
 
## get plot of importance - %IncMSE and IncNodePurity
varImpPlot(mydata.rf,main="Random Forest Variable Importance")
The RF reports an OOB R2 of 0.81 and an OOB RMSE of 39.40 °C; the resulting image below shows the importance of the descriptors. It points to both the number of hydrogen bond donors (nHBDon) and the topological polar surface area (TopoPSA) as the most important physicochemical properties for melting point prediction, consistent with all our previous analyses.
20120516RFVARIMP.png
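The OOB RMSE follows directly from the reported mean of squared residuals, or equivalently from the OOB predictions themselves (a quick check):
## OOB RMSE from the reported mean of squared residuals
sqrt(1552.277)                                   # 39.40
## predict() with no newdata returns OOB predictions for a randomForest object
sqrt(mean((mydata$mpC - predict(mydata.rf))^2))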
The model was then saved so that it could be deployed as a web service using the following code:
saveRDS(mydata.rf, file = "ONSMPModel009")
The model is available for download to use for batch melting point prediction with a CC0 license.
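For batch prediction, the downloaded model can be loaded back into R and applied to a table of CDK descriptors (a sketch; the input file name is hypothetical):
## batch prediction with the saved model (sketch)
mpModel = readRDS("ONSMPModel009")
## hypothetical input: CDK 2D descriptors for new compounds, same 172 columns
newdesc = read.csv(file = "NewCompoundDescriptors.csv", header = TRUE, row.names = "molID")
predict(mpModel, newdata = newdesc)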

The RF model was used to predict the melting point of all 19,410 compounds. The absolute error (AE) was then calculated and added to a spreadsheet for later analysis. Compounds with high AE values either have an incorrect measured melting point or are for some reason resistant to modeling using 2D CDK descriptors.
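In R this amounts to the following (a minimal sketch; the output file name is an assumption):
## absolute error (AE) of the model predictions for all 19,410 compounds
predicted = predict(mydata.rf, mydata)
AE = abs(mydata$mpC - predicted)
write.csv(data.frame(molID = rownames(mydata), mpC = mydata$mpC, predicted, AE),
          file = "AbsoluteErrors.csv", row.names = FALSE)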

QDB Format Archive. A parallel QDB format archive was created from the same data using the following commands in a CMD window (^ marks line continuations; the whole process took over 6 hours to complete):
cd C:\alang\share\MyMesh\ONSC\qsardb
 
java -Xms512M -Xmx1024M -cp conversion-toolkit-r541.jar org.qsardb.conversion.SpreadsheetConverter ^
  --id D --smiles B --name A --properties C --source C:/alang/share/MyMesh/ONSC/qsardb/originaldata/meltingpoints/mpC.csv ^
  --target C:/alang/share/MyMesh/ONSC/qsardb/meltingpoint
 
----
Prompted:
 
Id (default: 'column_C'): mpC
Name (default: 'Column C'): mpC
Value format pattern (default: as-is): 0.#
----
 
java -cp prediction-toolkit-r541.jar org.qsardb.prediction.DescriptorRegistryManager ^
  --dir C:/alang/share/MyMesh/ONSC/qsardb/meltingpoint add-cdk
 
java -Xms512M -Xmx1024M -cp prediction-toolkit-r541.jar org.qsardb.prediction.DescriptorCalculator ^
  --dir C:/alang/share/MyMesh/ONSC/qsardb/meltingpoint
 
java -cp prediction-toolkit-r541.jar org.qsardb.prediction.ModelRegistryManager ^
  --dir C:/alang/share/MyMesh/ONSC/qsardb/meltingpoint add --id rf --name "Random forest regression" --property-id mpC
 
java -cp prediction-toolkit-r541.jar org.qsardb.prediction.ModelRegistryManager ^
  --dir C:/alang/share/MyMesh/ONSC/qsardb/meltingpoint attach-rds --id rf
 
java -cp prediction-toolkit-r541.jar org.qsardb.prediction.PredictionRegistryManager ^
  --dir C:/alang/share/MyMesh/ONSC/qsardb/meltingpoint add --id rf-training --name "Random forest regression (training)" --model-id rf
 
java -cp prediction-toolkit-r541.jar org.qsardb.prediction.PredictionRegistryManager ^
  --dir C:/alang/share/MyMesh/ONSC/qsardb/meltingpoint attach-values --id rf-training
Then in R, the following commands created and exported the RF model to the QDB archive directories:
suppressMessages(library("randomForest"))
 
qdbDir = "C:/alang/share/MyMesh/ONSC/qsardb/meltingpoint"
 
propertyId = 'mpC'
 
descriptorIdList = c('ALogP', 'ALogp2', 'AMR', 'BCUTw-1l', 'BCUTw-1h', 'BCUTc-1l', 'BCUTc-1h', 'BCUTp-1l', 'BCUTp-1h', 'fragC', 'nAcid',
'apol', 'naAromAtom', 'nAromBond', 'nAtom', 'ATSc1', 'ATSc2', 'ATSc3', 'ATSc4', 'ATSc5', 'ATSm1', 'ATSm2', 'ATSm3', 'ATSm4', 'ATSm5',
'ATSp1', 'ATSp2', 'ATSp3', 'ATSp4', 'ATSp5', 'nBase', 'nB', 'bpol', 'C1SP1', 'C2SP1', 'C1SP2', 'C2SP2', 'C3SP2', 'C1SP3', 'C2SP3', 'C3SP3',
'C4SP3', 'SCH-3', 'SCH-4', 'SCH-5', 'SCH-6', 'SCH-7', 'VCH-3', 'VCH-4', 'VCH-5', 'VCH-6', 'VCH-7', 'SC-3', 'SC-4', 'SC-5', 'SC-6', 'VC-3',
'VC-4', 'VC-5', 'VC-6', 'SP-0', 'SP-1', 'SP-2', 'SP-3', 'SP-4', 'SP-5', 'SP-6', 'SP-7', 'VP-0', 'VP-1', 'VP-2', 'VP-3', 'VP-4', 'VP-5',
'VP-6', 'VP-7', 'SPC-4', 'SPC-5', 'SPC-6', 'VPC-4', 'VPC-5', 'VPC-6', 'ECCEN', 'FMF', 'nHBDon', 'nHBAcc', 'khs.ssBH', 'khs.sssB', 'khs.sCH3',
'khs.dCH2', 'khs.ssCH2', 'khs.tCH', 'khs.dsCH', 'khs.aaCH', 'khs.sssCH', 'khs.ddC', 'khs.tsC', 'khs.dssC', 'khs.aasC', 'khs.aaaC', 'khs.ssssC',
'khs.sNH3', 'khs.sNH2', 'khs.ssNH2', 'khs.dNH', 'khs.ssNH', 'khs.aaNH', 'khs.tN', 'khs.sssNH', 'khs.dsN', 'khs.aaN', 'khs.sssN', 'khs.ddsN',
'khs.aasN', 'khs.ssssN', 'khs.sOH', 'khs.dO', 'khs.ssO', 'khs.aaO', 'khs.sF', 'khs.sssP', 'khs.dsssP', 'khs.sSH', 'khs.dS', 'khs.ssS',
'khs.aaS', 'khs.dssS', 'khs.ddssS', 'khs.sCl', 'khs.sBr', 'khs.sI', 'Kier1', 'Kier2', 'nAtomLC', 'nAtomP', 'LipinskiFailures', 'nAtomLAC',
'MLogP', 'MDEC-11', 'MDEC-12', 'MDEC-13', 'MDEC-14', 'MDEC-22', 'MDEC-23', 'MDEC-24', 'MDEC-33', 'MDEC-34', 'MDEC-44', 'MDEO-11', 'MDEO-12',
'MDEO-22', 'MDEN-11', 'MDEN-12', 'MDEN-13', 'MDEN-22', 'MDEN-23', 'MDEN-33', 'PetitjeanNumber', 'nRotB', 'TopoPSA', 'VAdjMat', 'VABC', 'MW',
'WTPT-1', 'WTPT-2', 'WTPT-3', 'WTPT-4', 'WTPT-5', 'WPATH', 'WPOL', 'XLogP', 'Zagreb')
 
loadValues = function(path, id){
        result = read.table(path, header = TRUE, sep = "\t", na.strings = "N/A")
        result = na.omit(result)
        names(result) = c('Id', gsub("-", "_", x = id))
        return (result)
}
 
loadPropertyValues = function(id){
        return (loadValues(paste(sep = "/", qdbDir, "properties", id, "values"), id))
}
 
loadDescriptorValues = function(id){
        return (loadValues(paste(sep = "/", qdbDir, "descriptors", id, "values"), id))
}
 
rfdata = loadPropertyValues(propertyId)
 
for(descriptorId in descriptorIdList){
        print (descriptorId)
        rfdata = merge(rfdata, loadDescriptorValues(descriptorId), by = 'Id')
}
 
compoundIds = rfdata$Id
 
rfdata$Id = NULL
 
rfmodel = randomForest(formula = mpC ~ ., data = rfdata)
print(rfmodel)
 
 
object = list()
 
object$propertyId = propertyId
object$getPropertyId = function(self){
        return (self$propertyId)
}
 
object$descriptorIdList = descriptorIdList
object$getDescriptorIdList = function(self){
        return (self$descriptorIdList)
}
 
object$rfmodel = rfmodel
object$evaluate = function(self, values){
        suppressMessages(require("randomForest"))
 
        descriptorIdList = self$getDescriptorIdList(self)
        descriptorIdList = sapply(descriptorIdList, function(x) gsub("-", "_", x))
 
        newrfdata = data.frame(c = NA)  # one-row data frame; the dummy 'c' column is ignored by predict
        for(i in 1:length(descriptorIdList)){
                newrfdata[descriptorIdList[i]] = values[i]
        }
 
        return (predict(self$rfmodel, newdata = newrfdata))
}
 
saveRDS(object, file = paste(sep = "/", qdbDir, "models/rf/rds"))
 
 
rfvalues = predict(rfmodel, rfdata)
predictedValues = data.frame(compoundIds, rfvalues)
write.table(predictedValues, file = paste(sep = "/", qdbDir, "predictions/rf-training/values"), col.names = c("csid", "mpC"),
row.names = FALSE, quote = FALSE, sep = "\t")
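As a sanity check (a hypothetical test, not part of the original workflow), the evaluate wrapper can be compared against a direct prediction before the object is deployed:
## hypothetical sanity check of the evaluate wrapper
testValues = as.numeric(rfdata[1, gsub("-", "_", descriptorIdList)])
object$evaluate(object, testValues)   # should match predict(rfmodel, rfdata[1, ])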
The final archive was compressed as meltingpoint.qdb.zip. It can be used from the command line using:
java -cp prediction-toolkit-r541.jar org.qsardb.prediction.SMILESPredictor --archive meltingpoint.qdb.zip --format "0.0" --smiles "c1ccc(cc1)O"
At 90 MB, this model is too large to deploy as a web service, so we will attempt to reduce its size by reducing the number of descriptors.

Feature selection. We began by removing highly correlated descriptors using the following code and the caret package (v 5.15-023) in R 2.15.0:
library(caret)
## load in data
mydata = read.csv(file = "CuratedDataReadyForRF.csv", header = TRUE, row.names = "molID")
## correlation matrix
cor.mat = cor(mydata)
## find correlation r > 0.90
findCorrelation(cor.mat, cutoff = .90, verbose = TRUE)
[output]
 [1] 172  63  64 170  26  27  32 164  65  28  69  61  71  62  70  29 162  30  66  72  67  12 163 132  73  15
[27]  78  74  79  83  75  33  80  81  47 133  55  52  13  14   6  88  17  16 108  43
[output]
These descriptors (Zagreb, SP-2, SP-3, WPOL, ATSp1, ATSp2, nB, WTPT-1, SP-4, ATSp3, VP-0, SP-0, VP-2, SP-1, VP-1, ATSp4, VABC, ATSp5, SP-5, VP-3, SP-6, apol, MW, Kier1, VP-4, nAtom, SPC-5, VP-5, SPC-6, ECCEN, VP-6, bpol, VPC-4, VPC-5, SCH-7, Kier2, SC-5, VCH-7, naAromAtom, nAromBond, BCUTc-1l, khs.sssB, ATSc2, ATSc1, khs.tN, SCH-3) were removed, leaving 126 descriptors.
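The bookkeeping from indices back to descriptor names looks like this (a sketch):
## map indices from findCorrelation back to names and drop those columns
highCor = findCorrelation(cor.mat, cutoff = 0.90)
colnames(mydata)[highCor]     # the 46 descriptor names listed above
mydata = mydata[, -highCor]   # leaves mpC plus 126 descriptors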

After the highly correlated descriptors were removed, we used a random forest model to identify the important descriptors:
mydata = read.csv(file = "CuratedDataFeatureSelectedReadyForRF.csv", header = TRUE, row.names = "molID")
## do random forest [randomForest 4.6-6]
mydata.rf <- randomForest(mpC ~ ., data = mydata,importance = TRUE)
print(mydata.rf)
 
[output]
Call:
 randomForest(formula = mpC ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 42
 
          Mean of squared residuals: 1542.097
                    % Var explained: 81.24
[output]
 
## get plot of importance - %IncMSE and IncNodePurity
varImpPlot(mydata.rf,main="Random Forest Variable Importance")
20120607VarImp.png
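A selection along these lines extracts the most important descriptors from the importance measures (a hypothetical sketch; the exact cutoff for "important" is a judgment call made from the plot):
## pick the top descriptors by %IncMSE (hypothetical selection rule)
imp = importance(mydata.rf, type = 1)                              # %IncMSE
topNames = rownames(imp)[order(imp[, 1], decreasing = TRUE)][1:42]
mydata.imp = mydata[, c("mpC", topNames)]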
Keeping only the important descriptors left just 42. We again created a random forest model using only these descriptors:
mydata = read.csv(file = "CuratedDataOnlyImpForRF.csv", header = TRUE, row.names = "molID")
## do random forest [randomForest 4.6-6]
mydata.rf <- randomForest(mpC ~ ., data = mydata,importance = TRUE)
print(mydata.rf)
 
[output]
Call:
 randomForest(formula = mpC ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 14
 
          Mean of squared residuals: 1599.981
                    % Var explained: 80.53
[output]
 
## get plot of importance - %IncMSE and IncNodePurity
varImpPlot(mydata.rf,main="Random Forest Variable Importance")
 
## save the model
saveRDS(mydata.rf, file = "rfmodel42descriptors")
20120607VarImpOnlyImp.png

Results

The model developed here (MPModel009) is far superior in utility to any model previously published here or in the literature. Unfortunately, at 90 MB the QDB archive is too large to deploy as a web service (at the time of writing, the maximum size is limited to 10 MB). However, both the serialized RF model and the QDB archive can be used offline. A second model using only 42 feature-selected descriptors was built in the hope of reducing the size of the random forest model. Upon serialization, this model was about the same size as the full model (in fact slightly larger), so this technique cannot be used to reduce the size of the model. Interestingly, the accuracy of the model did not decrease dramatically when the number of descriptors was reduced from 172 to 42.
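The size comparison can be checked directly on disk (a quick sketch; a forest's serialized size is driven mainly by the number and size of its trees, which is why dropping descriptors barely changes it):
## compare serialized model sizes in MB (sketch)
file.info("ONSMPModel009")$size / 1024^2          # full 172-descriptor model
file.info("rfmodel42descriptors")$size / 1024^2   # 42-descriptor model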