Using QsarDB to create a melting point model web service

Researcher

Andrew S.I.D. Lang

Objective

To create a relatively small (compared with our other melting point models) Random Forest-based melting point model and distribute it as a web service using the QsarDB Open Digital Repository.

Background

Our goal is to create an Open CC0 melting point model using Open Data and Open Descriptors (CDK), following a transparent, reproducible, and open procedure. One recent solution for deploying models (of all types) in the open is the QsarDB open digital repository, developed by Villu Ruusmann. A previous analysis (MPModel009) produced a successful model, but the resulting QDB archive was too large to deploy. Here we develop a model on a smaller (but highly curated) dataset and perform feature selection to reduce the number of descriptors used, with the goal of creating a good model of reasonable size: less than 10MB, the current limit of the QsarDB repository.

Procedure

Data Collection and Curation. We began with the doubleplusgood melting point dataset (ONSMP029) of 2706 highly curated, double+ validated (range: 0.1-5 °C) unique compounds that have no chiral centers and do not exhibit cis/trans isomerism. From this set we removed coronene and octaphenylcyclotetrasiloxane, as they are obvious outliers of the chemical space. For the remaining 2704 compounds, we generated all CDK descriptors except CPSA, IP, WHIM, all protein descriptors, and all geometrical descriptors. We then removed HybRatio and Kier3 due to multiple NA entries, and all khs.xxx descriptors with fewer than 27 (1%) non-zero values, leaving 161 descriptors.
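The descriptor pre-filtering described above can be sketched in R as follows. This is only an illustration on an invented toy table: the helper name filterDescriptors and its min.nonzero argument are hypothetical, and the helper drops any column containing an NA, which is stricter than the "multiple NA entries" criterion used here. For the real dataset the threshold would be min.nonzero = 27.

```r
# Toy descriptor table; the values are made up for illustration.
desc <- data.frame(
  nHBDon   = c(1, 0, 2, 1, 0),
  khs.sLi  = c(0, 0, 0, 0, 0),        # never non-zero
  khs.sOH  = c(1, 0, 2, 1, 0),
  HybRatio = c(0.5, NA, NA, 0.2, 1)   # contains NA entries
)

# Hypothetical helper: drop columns containing NA, then drop khs.* columns
# with fewer than `min.nonzero` non-zero entries.
filterDescriptors <- function(x, min.nonzero) {
  has.na <- sapply(x, function(col) any(is.na(col)))
  x <- x[, !has.na, drop = FALSE]
  is.khs <- grepl("^khs\\.", names(x))
  nonzero <- sapply(x, function(col) sum(col != 0))
  sparse <- is.khs & nonzero < min.nonzero
  x[, !sparse, drop = FALSE]
}

filtered <- filterDescriptors(desc, min.nonzero = 1)
print(names(filtered))
```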

Feature Selection. While Random Forest models have no problems with highly correlated variables, using highly correlated variables can skew variable importance measures. We therefore used the caret package for R to remove the 43 descriptors flagged as highly correlated, using a pairwise |r| > 0.90 cutoff (BCUTc-1h, apol, naAromAtom, nAromBond, nAtom, ATSc2, ATSc3, ATSm2, ATSp1, ATSp2, ATSp3, ATSp4, ATSp5, nB, C1SP1, SCH-3, VCH-4, VC-5, SP-0, SP-1, SP-2, SP-3, SP-4, SP-5, SP-6, VP-0, VP-1, VP-2, VP-3, VP-4, VP-5, VP-7, SPC-5, SPC-6, VPC-5, ECCEN, Kier1, Kier2, VABC, WTPT-1, WPATH, WPOL, Zagreb), found using the following code:
library("caret")
## load in data
mydata = read.csv(file="20120607DoubleValidatedReadyForFeatureSelection.csv",header=TRUE,row.names="molID")
## correlation matrix
cor.mat = cor(mydata)
## find descriptors with absolute pairwise correlation |r| > 0.90 (returns column indices)
findCorrelation(cor.mat, cutoff = .90, verbose = TRUE)
[output]
7 12 13 14 15 17 18 22 26 27 28 29 30 32 34 43 49 59 61 62 63 64 65 66 67 69 70 71 72 73 74 76 78 79 81 83 121 122 151 153 158 159 161
[output]
This leaves a dataset of 2704 compounds with 118 descriptors ready for modeling.
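Dropping the flagged columns by index can be sketched as follows. This is a simplified base R stand-in on invented toy data, not the caret algorithm: caret's findCorrelation chooses which member of a correlated pair to drop using mean absolute correlations, whereas this sketch simply flags the later column of each pair. The variable names are hypothetical.

```r
# Toy data: x2 is almost a copy of x1, x3 is independent noise.
set.seed(1)
x1 <- rnorm(100)
toy <- data.frame(x1 = x1, x2 = x1 + rnorm(100, sd = 0.01), x3 = rnorm(100))

cor.mat <- cor(toy)

# Simplified stand-in for findCorrelation: for every pair with |r| above
# the cutoff, flag the later column of the pair.
high <- which(abs(cor.mat) > 0.90 & upper.tri(cor.mat), arr.ind = TRUE)
drop.idx <- unique(high[, 2])

# Guard against an empty index vector, which would otherwise drop every column.
reduced <- if (length(drop.idx) > 0) toy[, -drop.idx, drop = FALSE] else toy
print(names(reduced))
```

With the real data, the index vector printed by findCorrelation would take the place of drop.idx.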

Modeling. A random forest was created and serialized using the following code:
library("randomForest")
mydata = read.csv(file="20120607DoubleValidatedReadyRF.csv",header=TRUE,row.names="molID")
## do random forest [randomForest 4.5-34]
mydata.rf <- randomForest(mpC ~ ., data = mydata,importance = TRUE)
print(mydata.rf)
[output]
Call:
 randomForest(formula = mpC ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 39
 
          Mean of squared residuals: 1451.971
                    % Var explained: 83.34
[output]
## get variable importance plot
varImpPlot(mydata.rf,main="Random Forest Variable Importance")
The RF reports an OOB R2 of 0.83 and an OOB RMSE of 38.1 °C. The variable importance plot below points to both the number of hydrogen bond donors (nHBDon) and the topological polar surface area (TopoPSA) as the most important physicochemical properties for melting point prediction, as found in all previous analyses.
20120609VarImp.png
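The OOB statistics quoted above follow directly from the printed randomForest summary. The short calculation below is a sketch of that arithmetic; the implied standard deviation assumes that "% Var explained" is computed as 100 × (1 − MSE / variance of the response), which is an assumption made here and not something stated in this notebook.

```r
# Values copied from the printed randomForest summary.
oob.mse <- 1451.971        # "Mean of squared residuals"
pct.var <- 83.34           # "% Var explained"

oob.rmse <- sqrt(oob.mse)  # root mean squared error, in degrees C
oob.r2 <- pct.var / 100    # pseudo R-squared

# Variance of the measured melting points implied by the two numbers,
# under the assumption stated above.
implied.var <- oob.mse / (1 - oob.r2)

cat(sprintf("OOB RMSE = %.1f C, OOB R2 = %.2f, implied SD of mpC = %.1f C\n",
            oob.rmse, oob.r2, sqrt(implied.var)))
```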
The model was then saved so that it could be deployed as a web service using the following code:
saveRDS(mydata.rf, file = "ONSMPModel010")
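Because archive size is the constraint driving this analysis, it can be useful to check how large a serialized object is. The sketch below uses a small linear model as a stand-in for the random forest; the helper name rdsSizeMB is hypothetical, and the size of the RDS file is only a rough proxy for the size of the final QDB archive.

```r
# Toy stand-in for the fitted model; any R object can be serialized this way.
toy.model <- lm(dist ~ speed, data = cars)

# Hypothetical helper: serialize to a temporary file and report its size in MB.
rdsSizeMB <- function(object) {
  path <- tempfile(fileext = ".rds")
  on.exit(unlink(path))
  saveRDS(object, file = path)
  file.info(path)$size / 1024^2
}

size.mb <- rdsSizeMB(toy.model)
cat(sprintf("Serialized size: %.3f MB (limit assumed here: 10 MB)\n", size.mb))
print(size.mb < 10)
```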
The model is available for download under a CC0 license for batch melting point prediction. The model was then used to predict the melting points of the training set in order to identify possible errors in the dataset and compounds that are difficult to model using current 2D CDK descriptors (such as coronene). The dataset is available for perusal.
training.predict <- predict(mydata.rf,mydata)
write.csv(training.predict, file = "RFTrainingSetPredict.csv")
Plotting the predicted versus measured melting point values using Tableau Public, we see that the melting point of compounds tends to increase with larger TopoPSA (colour) and nHBDon (size) with the top outliers being: cyanic iodide, 2,6-dimethoxy-p-benzoquinone, 2-methyl-4-nitro-1h-imidazole, isophthalic_acid, 2,2,3,3-tetramethylbutane, 2-(1,3-thiazol-4-yl)-1h-benzimidazole, 2-mercaptobenzimidazole, p-quaterphenyl, 4,4'-dihydroxybiphenyl, 2,4-hexadiyne.
20120609PredictedvsMeasured.png
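One way to list such outliers directly in R is to rank compounds by absolute prediction error. The sketch below assumes that ranking criterion (the list above was read from the Tableau plot) and uses invented values; the helper name topOutliers is hypothetical.

```r
# Toy measured and predicted melting points (degrees C); values are invented.
measured  <- c(a = 80, b = 150, c = 25, d = 210, e = 101)
predicted <- c(a = 85, b = 95, c = 30, d = 205, e = 160)

# Hypothetical helper: return the n compounds with the largest absolute error.
topOutliers <- function(measured, predicted, n = 10) {
  abs.err <- abs(predicted - measured)
  ranked <- sort(abs.err, decreasing = TRUE)
  head(ranked, n)
}

print(topOutliers(measured, predicted, n = 2))
```

With the real data, measured would be mydata$mpC and predicted would be training.predict, both indexed by molID.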

QDB Format Archive. A parallel QDB format archive was created using the same data and the following code in a CMD window:
cd C:\alang\share\MyMesh\ONSC\qsardb
 
java -Xms512M -Xmx1024M -cp conversion-toolkit-r595.jar org.qsardb.conversion.SpreadsheetConverter
--id D --smiles B --name A --properties C --source C:/alang/share/MyMesh/ONSC/qsardb/originaldata/meltingpoints/ONSMP010.csv
--target C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010
 
[prompt]
Id (default: 'column_C'): mpC
Name (default: 'Column C'): mpC
Value format pattern (default: as-is): 0.#
[prompt]
 
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 add-cdk
 
java -Xms512M -Xmx1024M -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorCalculator
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010
 
java -Xms512M -Xmx1024M -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 purge
 
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id BCUTc-1h
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id apol
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id naAromAtom
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id nAromBond
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id nAtom
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id ATSc2
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id ATSc3
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id ATSm2
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id ATSp1
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id ATSp2
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id ATSp3
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id ATSp4
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id ATSp5
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id nB
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id C1SP1
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id SCH-3
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id VCH-4
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id VC-5
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id SP-0
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id SP-1
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id SP-2
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id SP-3
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id SP-4
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id SP-5
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id SP-6
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id VP-0
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id VP-1
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id VP-2
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id VP-3
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id VP-4
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id VP-5
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id VP-7
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id SPC-5
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id SPC-6
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id VPC-5
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id ECCEN
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id HybRatio
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.sLi
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.ssBe
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.ssssBe
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.ssBH
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.sssB
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.ssssB
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.tCH
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.ddC
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.sNH3
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.ssNH2
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.dNH
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.sssNH
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.ssssN
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.sSiH3
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.ssSiH2
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.sssSiH
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.sPH2
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.ssPH
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.sssP
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.dsssP
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.sssssP
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.dssS
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.sGeH3
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.ssGeH2
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.sssGeH
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.ssssGe
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.sAsH2
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.ssAsH
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.sssAs
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.sssdAs
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.sssssAs
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.sSeH
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.dSe
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.ssSe
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.aaSe
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.dssSe
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.ddssSe
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.sSnH3
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.ssSnH2
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.sssSnH
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.ssssSn
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.sPbH3
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.ssPbH2
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.sssPbH
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id khs.ssssPb
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id Kier1
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id Kier2
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id Kier3
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id VABC
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id WTPT-1
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id WPATH
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id WPOL
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 remove --id Zagreb
 
 
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.ModelRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 add --id rf --name "Random forest regression" --property-id mpC
 
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.ModelRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 attach-rds --id rf
 
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.PredictionRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 add --id rf-training --name "Random forest regression (training)" --model-id rf
 
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.PredictionRegistryManager
--dir C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010 attach-values --id rf-training
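The long run of near-identical remove calls above could also be generated from a vector of descriptor IDs. The sketch below only builds and prints the command lines, mirroring the calls shown above; the variable names are hypothetical and only three of the removed IDs are listed.

```r
# A few of the descriptor IDs removed above; extend the vector as needed.
removeIds <- c("BCUTc-1h", "apol", "naAromAtom")

qdbDir <- "C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010"

# Build one command line per descriptor ID (sprintf recycles qdbDir).
commands <- sprintf(
  "java -cp prediction-toolkit-r595.jar org.qsardb.prediction.DescriptorRegistryManager --dir %s remove --id %s",
  qdbDir, removeIds
)

# Print only; each element could be passed to system() to actually run it.
writeLines(commands)
```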
The Random Forest for the QDB archive was then created in R using:
suppressMessages(library("randomForest"))
 
qdbDir = "C:/alang/share/MyMesh/ONSC/qsardb/ONSMP010"
 
propertyId = 'mpC'
 
descriptorIdList = c('ALogP', 'ALogp2', 'AMR', 'BCUTw-1l', 'BCUTw-1h', 'BCUTc-1l', 'BCUTp-1l', 'BCUTp-1h', 'fragC', 'nAcid', 'ATSc1',
'ATSc4', 'ATSc5', 'ATSm1', 'ATSm3', 'ATSm4', 'ATSm5', 'nBase', 'bpol', 'C2SP1', 'C1SP2', 'C2SP2', 'C3SP2', 'C1SP3', 'C2SP3', 'C3SP3',
'C4SP3', 'SCH-4', 'SCH-5', 'SCH-6', 'SCH-7', 'VCH-3', 'VCH-5', 'VCH-6', 'VCH-7', 'SC-3', 'SC-4', 'SC-5', 'SC-6', 'VC-3', 'VC-4', 'VC-6',
'SP-7', 'VP-6', 'SPC-4', 'VPC-4', 'VPC-6', 'FMF', 'nHBDon', 'nHBAcc', 'khs.sCH3', 'khs.dCH2', 'khs.ssCH2', 'khs.dsCH', 'khs.aaCH',
'khs.sssCH', 'khs.tsC', 'khs.dssC', 'khs.aasC', 'khs.aaaC', 'khs.ssssC', 'khs.sNH2', 'khs.ssNH', 'khs.aaNH', 'khs.tN', 'khs.dsN',
'khs.aaN', 'khs.sssN', 'khs.ddsN', 'khs.aasN', 'khs.sOH', 'khs.dO', 'khs.ssO', 'khs.aaO', 'khs.sF', 'khs.ssssSi', 'khs.sSH', 'khs.dS',
'khs.ssS', 'khs.aaS', 'khs.ddssS', 'khs.sCl', 'khs.sBr', 'khs.sI', 'nAtomLC', 'nAtomP', 'LipinskiFailures', 'nAtomLAC', 'MLogP', 'MDEC-11',
'MDEC-12', 'MDEC-13', 'MDEC-14', 'MDEC-22', 'MDEC-23', 'MDEC-24', 'MDEC-33', 'MDEC-34', 'MDEC-44', 'MDEO-11', 'MDEO-12', 'MDEO-22',
'MDEN-11', 'MDEN-12', 'MDEN-13', 'MDEN-22', 'MDEN-23', 'MDEN-33', 'PetitjeanNumber', 'nRotB', 'TopoPSA', 'VAdjMat', 'MW', 'WTPT-2',
'WTPT-3', 'WTPT-4', 'WTPT-5', 'XLogP')
 
loadValues = function(path, id){
        result = read.table(path, header = TRUE, sep = "\t", na.strings = "N/A")
        result = na.omit(result)
        names(result) = c('Id', gsub("-", "_", x = id))
        return (result)
}
 
loadPropertyValues = function(id){
        return (loadValues(paste(sep = "/", qdbDir, "properties", id, "values"), id))
}
 
loadDescriptorValues = function(id){
        return (loadValues(paste(sep = "/", qdbDir, "descriptors", id, "values"), id))
}
 
rfdata = loadPropertyValues(propertyId)
 
for(descriptorId in descriptorIdList){
        print (descriptorId)
        rfdata = merge(rfdata, loadDescriptorValues(descriptorId), by = 'Id')
}
 
compoundIds = rfdata$Id
 
rfdata$Id = NULL
 
rfmodel = randomForest(formula = mpC ~ ., data = rfdata)
print(rfmodel)
 
 
object = list()
 
object$propertyId = propertyId
object$getPropertyId = function(self){
        return (self$propertyId)
}
 
object$descriptorIdList = descriptorIdList
object$getDescriptorIdList = function(self){
        return (self$descriptorIdList)
}
 
object$rfmodel = rfmodel
object$evaluate = function(self, values){
        suppressMessages(require("randomForest"))
 
        descriptorIdList = self$getDescriptorIdList(self)
        descriptorIdList = sapply(descriptorIdList, function(x) gsub("-", "_", x))
 
        newrfdata = data.frame(c = NA)
        for(i in 1:length(descriptorIdList)){
                newrfdata[descriptorIdList[i]] = values[i]
        }
 
        return (predict(self$rfmodel, newdata = newrfdata))
}
 
saveRDS(file = paste(sep = "/", qdbDir, "models/rf/rds"), object)
 
 
rfvalues = predict(rfmodel, rfdata)
predictedValues = data.frame(compoundIds, rfvalues)
write.table(predictedValues, file = paste(sep = "/", qdbDir, "predictions/rf-training/values"), col.names = c("csid", "mpC"),
row.names = FALSE, quote = FALSE, sep = "\t")
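The wrapper object saved above bundles the model with an evaluate function that takes the object itself as its first argument. The sketch below illustrates that pattern with a round trip through an RDS file. It is a simplified stand-in, not the code used for the archive: the model is a small linear model on invented data, the descriptor IDs are chosen only to show the "-" to "_" renaming, and the element name model is hypothetical.

```r
# Toy model standing in for the random forest; data are invented.
descriptorIdList <- c("SP-7", "nHBDon")
toy <- data.frame(mpC = c(10, 20, 30, 45),
                  SP_7 = c(1, 2, 3, 4),
                  nHBDon = c(0, 1, 0, 2))

object <- list()
object$descriptorIdList <- descriptorIdList
object$model <- lm(mpC ~ ., data = toy)
object$evaluate <- function(self, values) {
  # Same ID renaming as above: "-" is replaced so names are valid in a formula.
  ids <- gsub("-", "_", self$descriptorIdList)
  newdata <- as.data.frame(setNames(as.list(values), ids))
  unname(predict(self$model, newdata = newdata))
}

# Round trip through an RDS file, then evaluate one compound.
path <- tempfile(fileext = ".rds")
saveRDS(object, file = path)
restored <- readRDS(path)
prediction <- restored$evaluate(restored, c(2.5, 1))
print(prediction)
```

Passing the object to its own function keeps everything needed for prediction inside a single serialized file.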
The QDB archive was then zipped and tested using:
java -cp prediction-toolkit-r595.jar org.qsardb.prediction.SMILESPredictor --archive ONSMP010.qdb.zip --format "0.0" --smiles "c1ccc(cc1)O"
The QDB archive was then deployed to the QsarDB Open Digital Repository for use as a web service.

Results

A melting point model (OOB R2 = 0.83, OOB RMSE = 38.1 °C) using Open descriptors and Open data was developed and deployed on the QsarDB Open Digital Repository, where it can be used as a web service.