MeltingPointModel009

=A random forest melting point model from 19,410 open data melting points=

Researcher
Andrew S.I.D. Lang

Objective
To create a random-forest-based melting point model and distribute it as a web service using the QsarDB open digital repository.

Background
Our goal is to create an open, CC0 melting point model using Open Data and Open Descriptors (CDK), under a transparent, reproducible, and open procedure. One problem we have encountered with our previous models is deployment. One recent solution for deploying models (of all types) in the open is the QsarDB open digital repository, developed by Villu Ruusmann.

Procedure

 * **Data Collection and Curation.** The dataset (ONSMP032) was created from the live Open Melting Point Data (20120516). The following criteria were used to remove compounds: range greater than 15 °C, mp > 400 °C, mp < -200 °C, compound contains Si, compound contains [2H] or [3H]. The SMILES of the remaining compounds in the dataset were run through Rajarshi Guha's CDK Descriptor Calculator GUI (v1.3.2) to generate 2D descriptors. The option to add explicit hydrogens was selected and all descriptors were calculated except the following: Charged Partial Surface Area, Ionization Potential, WHIM, and all protein and geometrical descriptors. We removed compounds that failed to generate some or all CDK descriptors (mostly compounds containing B or P). The descriptors Kier3 and HybRatio were removed as they contained multiple NA entries. Descriptors with all-zero entries (all khs) were removed. This left a dataset of 19,410 compounds and 172 descriptors (ONSMP033).
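The curation criteria above amount to a simple pass over the raw melting point records. The sketch below illustrates the filters in Python; the record field names (`smiles`, `mp_low`, `mp_high`) and the substring checks are illustrative assumptions, not the actual curation script used for ONSMP032:

```python
# Hypothetical sketch of the curation filters described above.
# Field names are illustrative, not those of the actual spreadsheet.

def keep(record):
    """Return True if a melting point record survives curation."""
    if record["mp_high"] - record["mp_low"] > 15:        # range greater than 15 C
        return False
    mp = (record["mp_low"] + record["mp_high"]) / 2.0
    if mp > 400 or mp < -200:                            # out-of-range melting points
        return False
    smiles = record["smiles"]
    if "Si" in smiles:                                   # compound contains Si
        return False
    if "[2H]" in smiles or "[3H]" in smiles:             # deuterium or tritium
        return False
    return True

records = [
    {"smiles": "c1ccccc1O", "mp_low": 40.5, "mp_high": 43.0},       # phenol: kept
    {"smiles": "C[Si](C)(C)C", "mp_low": -99.0, "mp_high": -99.0},  # contains Si: removed
]
curated = [r for r in records if keep(r)]
```

A real implementation would parse the SMILES properly (e.g. with a cheminformatics toolkit) rather than rely on substring matching.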

 * **Summary Statistics.** Excel and Tableau Public were used to generate the following summary statistics and visualizations for the data (ONSMP033). 91% of the 19,410 compounds have a MW between 25 and 450 and an mpC between -75 °C and 290 °C.

|| || mean || median || min || max || stddev ||
|| MW || 238 || 214 || 16 || 1000 || 106 ||
|| mpC || 107 || 108 || -199 || 389 || 91 ||

 * **Random Forest.** The randomForest package (v4.5-34) in R was used to build the random forest model using the following code:

code
# 1) do random forest [randomForest 4.5-34]
mydata = read.csv(file="CuratedDataReadyForRF.csv", head=TRUE, row.names="molID")
mydata.rf <- randomForest(mpC ~ ., data = mydata, importance = TRUE)
print(mydata.rf)

# 2) get plot of importance - %IncMSE and IncNodePurity
varImpPlot(mydata.rf, main="Random Forest Variable Importance")
code

[output]
Call:
 randomForest(formula = mpC ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 57

          Mean of squared residuals: 1552.277
                    % Var explained: 81.11
[output]

The RF reports an OOB R2 of 0.81 and an OOB RMSE of 39.40 °C, with the resulting image below showing the importance of the descriptors. The plot points to both the number of hydrogen bond donors (nHBDon) and the topological polar surface area (TopoPSA) as the most important physicochemical properties for melting point prediction, as found in all previous analyses. The model was then saved so that it could be deployed as a web service using the following code:

code
saveRDS(mydata.rf, file = "ONSMPModel009")
code

The model is available for download for batch melting point prediction under a CC0 license.

The RF model was used to predict the melting point of all 19,410 compounds. The AE (absolute error) was then calculated and recorded for later analysis. Those compounds with high AE values either have an incorrect measured melting point or are for some reason resistant to modeling using 2D CDK descriptors.
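As a sanity check on the figures quoted above, the OOB RMSE follows directly from the reported mean of squared residuals, and the AE for a compound is just the absolute residual. A minimal Python check (the measured/predicted pair is illustrative):

```python
import math

mse = 1552.277                  # "Mean of squared residuals" from the RF output
rmse = math.sqrt(mse)           # OOB RMSE
r2 = 81.11 / 100                # "% Var explained" as a fraction

print(round(rmse, 2))           # 39.4, matching the 39.40 C quoted above
print(round(r2, 2))             # 0.81

# Absolute error for a single compound (illustrative values):
measured, predicted = 107.0, 95.5
ae = abs(predicted - measured)  # 11.5 C
```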

 * **QDB Format Archive.** A parallel QDB format archive was created using the same data and the following code in a CMD window (it took over 6 hours to complete):

code
cd C:\alang\share\MyMesh\ONSC\qsardb

java -Xms512M -Xmx1024M -cp conversion-toolkit-r541.jar org.qsardb.conversion.SpreadsheetConverter --id D --smiles B --name A --properties C --source C:/alang/share/MyMesh/ONSC/qsardb/originaldata/meltingpoints/mpC.csv --target C:/alang/share/MyMesh/ONSC/qsardb/meltingpoint
code

When prompted:

code
Id (default: 'column_C'): mpC
Name (default: 'Column C'): mpC
Value format pattern (default: as-is): 0.#
code

code
java -cp prediction-toolkit-r541.jar org.qsardb.prediction.DescriptorRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/meltingpoint add-cdk

java -Xms512M -Xmx1024M -cp prediction-toolkit-r541.jar org.qsardb.prediction.DescriptorCalculator --dir C:/alang/share/MyMesh/ONSC/qsardb/meltingpoint

java -cp prediction-toolkit-r541.jar org.qsardb.prediction.ModelRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/meltingpoint add --id rf --name "Random forest regression" --property-id mpC

java -cp prediction-toolkit-r541.jar org.qsardb.prediction.ModelRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/meltingpoint attach-rds --id rf

java -cp prediction-toolkit-r541.jar org.qsardb.prediction.PredictionRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/meltingpoint add --id rf-training --name "Random forest regression (training)" --model-id rf

java -cp prediction-toolkit-r541.jar org.qsardb.prediction.PredictionRegistryManager --dir C:/alang/share/MyMesh/ONSC/qsardb/meltingpoint attach-values --id rf-training
code

Then in R, the following commands created and exported the RF model to the QDB archive directories:

code
suppressMessages(library("randomForest"))

qdbDir = "C:/alang/share/MyMesh/ONSC/qsardb/meltingpoint"

propertyId = 'mpC'

descriptorIdList = c('ALogP', 'ALogp2', 'AMR', 'BCUTw-1l', 'BCUTw-1h', 'BCUTc-1l', 'BCUTc-1h', 'BCUTp-1l', 'BCUTp-1h', 'fragC', 'nAcid', 'apol', 'naAromAtom', 'nAromBond', 'nAtom', 'ATSc1', 'ATSc2', 'ATSc3', 'ATSc4', 'ATSc5', 'ATSm1', 'ATSm2', 'ATSm3', 'ATSm4', 'ATSm5', 'ATSp1', 'ATSp2', 'ATSp3', 'ATSp4', 'ATSp5', 'nBase', 'nB', 'bpol', 'C1SP1', 'C2SP1', 'C1SP2', 'C2SP2', 'C3SP2', 'C1SP3', 'C2SP3', 'C3SP3', 'C4SP3', 'SCH-3', 'SCH-4', 'SCH-5', 'SCH-6', 'SCH-7', 'VCH-3', 'VCH-4', 'VCH-5', 'VCH-6', 'VCH-7', 'SC-3', 'SC-4', 'SC-5', 'SC-6', 'VC-3', 'VC-4', 'VC-5', 'VC-6', 'SP-0', 'SP-1', 'SP-2', 'SP-3', 'SP-4', 'SP-5', 'SP-6', 'SP-7', 'VP-0', 'VP-1', 'VP-2', 'VP-3', 'VP-4', 'VP-5', 'VP-6', 'VP-7', 'SPC-4', 'SPC-5', 'SPC-6', 'VPC-4', 'VPC-5', 'VPC-6', 'ECCEN', 'FMF', 'nHBDon', 'nHBAcc', 'khs.ssBH', 'khs.sssB', 'khs.sCH3', 'khs.dCH2', 'khs.ssCH2', 'khs.tCH', 'khs.dsCH', 'khs.aaCH', 'khs.sssCH', 'khs.ddC', 'khs.tsC', 'khs.dssC', 'khs.aasC', 'khs.aaaC', 'khs.ssssC', 'khs.sNH3', 'khs.sNH2', 'khs.ssNH2', 'khs.dNH', 'khs.ssNH', 'khs.aaNH', 'khs.tN', 'khs.sssNH', 'khs.dsN', 'khs.aaN', 'khs.sssN', 'khs.ddsN', 'khs.aasN', 'khs.ssssN', 'khs.sOH', 'khs.dO', 'khs.ssO', 'khs.aaO', 'khs.sF', 'khs.sssP', 'khs.dsssP', 'khs.sSH', 'khs.dS', 'khs.ssS', 'khs.aaS', 'khs.dssS', 'khs.ddssS', 'khs.sCl', 'khs.sBr', 'khs.sI', 'Kier1', 'Kier2', 'nAtomLC', 'nAtomP', 'LipinskiFailures', 'nAtomLAC', 'MLogP', 'MDEC-11', 'MDEC-12', 'MDEC-13', 'MDEC-14', 'MDEC-22', 'MDEC-23', 'MDEC-24', 'MDEC-33', 'MDEC-34', 'MDEC-44', 'MDEO-11', 'MDEO-12', 'MDEO-22', 'MDEN-11', 'MDEN-12', 'MDEN-13', 'MDEN-22', 'MDEN-23', 'MDEN-33', 'PetitjeanNumber', 'nRotB', 'TopoPSA', 'VAdjMat', 'VABC', 'MW', 'WTPT-1', 'WTPT-2', 'WTPT-3', 'WTPT-4', 'WTPT-5', 'WPATH', 'WPOL', 'XLogP', 'Zagreb')

loadValues = function(path, id){
	result = read.table(path, header = TRUE, sep = "\t", na.strings = "N/A")
	result = na.omit(result)
	names(result) = c('Id', gsub("-", "_", x = id))
	return (result)
}

loadPropertyValues = function(id){
	return (loadValues(paste(sep = "/", qdbDir, "properties", id, "values"), id))
}

loadDescriptorValues = function(id){
	return (loadValues(paste(sep = "/", qdbDir, "descriptors", id, "values"), id))
}

rfdata = loadPropertyValues(propertyId)

for(descriptorId in descriptorIdList){
	print (descriptorId)
	rfdata = merge(rfdata, loadDescriptorValues(descriptorId), by = 'Id')
}

compoundIds = rfdata$Id

rfdata$Id = NULL

rfmodel = randomForest(formula = mpC ~ ., data = rfdata)
print(rfmodel)

object = list()

object$propertyId = propertyId
object$getPropertyId = function(self){
	return (self$propertyId)
}

object$descriptorIdList = descriptorIdList
object$getDescriptorIdList = function(self){
	return (self$descriptorIdList)
}

object$rfmodel = rfmodel

object$evaluate = function(self, values){
	suppressMessages(require("randomForest"))

	descriptorIdList = self$getDescriptorIdList(self)
	descriptorIdList = sapply(descriptorIdList, function(x) gsub("-", "_", x))

	newrfdata = data.frame(c = NA)
	for(i in 1:length(descriptorIdList)){
		newrfdata[descriptorIdList[i]] = values[i]
	}

	return (predict(self$rfmodel, newdata = newrfdata))
}

saveRDS(object, file = paste(sep = "/", qdbDir, "models/rf/rds"))

rfvalues = predict(rfmodel, rfdata)

predictedValues = data.frame(compoundIds, rfvalues)

write.table(predictedValues, file = paste(sep = "/", qdbDir, "predictions/rf-training/values"), col.names = c("csid", "mpC"), row.names = FALSE, quote = FALSE, sep = "\t")
code

The final archive was compressed as meltingpoint.qdb.zip. It can be used for prediction from the command line:

code
java -cp prediction-toolkit-r541.jar org.qsardb.prediction.SMILESPredictor --archive meltingpoint.qdb.zip --format "0.0" --smiles "c1ccc(cc1)O"
code

At 90 MB, this model is too large to deploy as a web service, so we will attempt to reduce its size by reducing the number of descriptors.

 * **Feature selection.** We began by removing highly correlated descriptors using the following code and the caret package (v5.15-023) in R 2.15.0:

code
# 1) load in data
mydata = read.csv(file="CuratedDataReadyForRF.csv", head=TRUE, row.names="molID")

# 2) correlation matrix
cor.mat = cor(mydata)

# 3) find correlation r > 0.90
findCorrelation(cor.mat, cutoff = .90, verbose = TRUE)
code

[output]
 [1] 172  63  64 170  26  27  32 164  65  28  69  61  71  62  70  29 162  30  66  72  67  12 163 132  73  15
[27]  78  74  79  83  75  33  80  81  47 133  55  52  13  14   6  88  17  16 108  43
[output]

These descriptors (Zagreb, SP-2, SP-3, WPOL, ATSp1, ATSp2, nB, WTPT-1, SP-4, ATSp3, VP-0, SP-0, VP-2, SP-1, VP-1, ATSp4, VABC, ATSp5, SP-5, VP-3, SP-6, apol, MW, Kier1, VP-4, nAtom, SPC-5, VP-5, SPC-6, ECCEN, VP-6, bpol, VPC-4, VPC-5, SCH-7, Kier2, SC-5, VCH-7, naAromAtom, nAromBond, BCUTc-1l, khs.sssB, ATSc2, ATSc1, khs.tN, SCH-3) were removed, leaving 126 descriptors.
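For intuition, caret's findCorrelation flags, for each pair of variables correlated above the cutoff, the one with the larger mean absolute correlation. A simplified numpy sketch of that idea (not the caret source, and run on a toy matrix rather than ONSMP033):

```python
import numpy as np

def find_correlation(X, cutoff=0.90):
    """Simplified analogue of caret::findCorrelation: for each column pair
    with |r| > cutoff, flag the column with the larger mean absolute
    correlation. Returns the column indices to drop."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    n = corr.shape[0]
    drop = set()
    for i in range(n):
        for j in range(i + 1, n):
            if i in drop or j in drop:
                continue
            if corr[i, j] > cutoff:
                drop.add(i if corr[i].mean() > corr[j].mean() else j)
    return sorted(drop)

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = a + rng.normal(scale=0.01, size=200)   # nearly identical to a: |r| > 0.90
c = rng.normal(size=200)
X = np.column_stack([a, b, c])
print(find_correlation(X))                 # one of columns 0/1 is flagged
```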

After the highly correlated descriptors were removed, we used a random forest model to identify the important descriptors:

code
# 1) do random forest [randomForest 4.6-6]
mydata = read.csv(file="CuratedDataFeatureSelectedReadyForRF.csv", head=TRUE, row.names="molID")
mydata.rf <- randomForest(mpC ~ ., data = mydata, importance = TRUE)
print(mydata.rf)

# 2) get plot of importance - %IncMSE and IncNodePurity
varImpPlot(mydata.rf, main="Random Forest Variable Importance")
code

[output]
Call:
 randomForest(formula = mpC ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 42

          Mean of squared residuals: 1542.097
                    % Var explained: 81.24
[output]

Keeping only the important descriptors left 42 descriptors. We again created a random forest model using only these descriptors:

code
# 1) do random forest [randomForest 4.6-6]
mydata = read.csv(file="CuratedDataOnlyImpForRF.csv", head=TRUE, row.names="molID")
mydata.rf <- randomForest(mpC ~ ., data = mydata, importance = TRUE)
print(mydata.rf)

# 2) get plot of importance - %IncMSE and IncNodePurity
varImpPlot(mydata.rf, main="Random Forest Variable Importance")

# 3) save the model
saveRDS(mydata.rf, file = "rfmodel42descriptors")
code

[output]
Call:
 randomForest(formula = mpC ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 14

          Mean of squared residuals: 1599.981
                    % Var explained: 80.53
[output]
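The importance-based pruning step above (keep only the descriptors that rank highly in the importance plot, then refit) can be mirrored in Python with scikit-learn's feature_importances_. This is an illustrative sketch on synthetic data, not the R workflow itself:

```python
# Illustrative analogue of importance-based descriptor pruning,
# using scikit-learn on synthetic data instead of randomForest in R.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Toy stand-in for the descriptor matrix: 20 features, 5 informative.
X, y = make_regression(n_samples=300, n_features=20, n_informative=5,
                       noise=1.0, random_state=0)

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Keep the top-ranked features, analogous to keeping the 42 most
# important of the 126 decorrelated descriptors.
k = 5
keep = np.argsort(rf.feature_importances_)[::-1][:k]
X_reduced = X[:, keep]

# Refit on the reduced descriptor set.
rf_small = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_reduced, y)
```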

Results
The model developed here (MPModel009) is far superior in utility to any previously published model, both here and in the literature. Unfortunately, at 90 MB the QDB archive is too large to deploy as a web service (at the time of writing, the maximum size is limited to 10 MB). However, both the serialized RF model and the QDB archive can be used offline. A second model using only 42 feature-selected descriptors was built in the hope of reducing the size of the random forest model. Upon serialization (saving), this model was about the same size as the full model (in fact slightly larger), so this technique cannot be used to reduce the model's size. Interestingly, the accuracy of the model did not decrease dramatically when the number of descriptors was reduced from 172 to 42.