MeltingPointModel002

Random Forest Based Melting Point Prediction

Researcher: Andrew Lang

Introduction
A melting point web service is provided, based on a random forest model built from open descriptors from the Chemistry Development Kit (CDK) and an Open Data dataset of 12634 melting points (ONSMP013). The model has an internal average absolute error (AAE) of 11.597 °C, a root-mean-square error (RMSE) of 38.786 °C, and an R² value of 0.789. The model was trained on the entire dataset, as a previous analysis suggests that the likelihood of over-fitting is minimal.
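For reference, the three statistics quoted above can be computed as in the following minimal R sketch. The helper names and the five toy values are invented for illustration and are not taken from ONSMP013. The R² form used here, 1 - MSE/Var(y), follows the "pseudo R-squared" described in the randomForest documentation; taking the variance over n rather than n - 1 is an assumption of this sketch.

```r
# Illustrative helpers (hypothetical names), base R only.
aae  <- function(obs, pred) mean(abs(obs - pred))        # average absolute error
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))   # root-mean-square error
r2   <- function(obs, pred) {
  # Pseudo R-squared: 1 - MSE / Var(y), variance taken over n (assumption)
  1 - mean((obs - pred)^2) / mean((obs - mean(obs))^2)
}

# Toy melting points in degrees C (invented, not from the dataset)
observed  <- c(10, 50, 120, 200, 80)
predicted <- c(20, 45, 110, 190, 95)

toy_aae  <- aae(observed, predicted)
toy_rmse <- rmse(observed, predicted)
toy_r2   <- r2(observed, predicted)
print(c(AAE = toy_aae, RMSE = toy_rmse, R2 = toy_r2))

# Cross-check against the printed model summary in the Procedure section:
# the square root of the mean of squared residuals is the RMSE.
reported_rmse <- sqrt(1504.319)
print(round(reported_rmse, 3))
```

The last two lines recover the quoted RMSE of about 38.786 °C from the "Mean of squared residuals" value, and the reported 78.85 % variance explained corresponds to the quoted R² of roughly 0.79.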

Procedure
The random forest model was created in R (v.2.11.0) using the randomForest package (v.4.5-34) and the following code:

    library(randomForest)
    # Descriptor table: one row per compound, indexed by molID; mpC is the response
    mydata <- read.csv(file = "20110303combinedPrepped.csv", header = TRUE, row.names = "molID")
    # Regress mpC on all remaining columns and keep the variable-importance measures
    mydata.rf <- randomForest(mpC ~ ., data = mydata, importance = TRUE)
    varImpPlot(mydata.rf)
    print(mydata.rf)

Output:

    Call:
     randomForest(formula = mpC ~ ., data = mydata, importance = TRUE)
                   Type of random forest: regression
                         Number of trees: 500
    No. of variables tried at each split: 50

              Mean of squared residuals: 1504.319
                        % Var explained: 78.85

The importance of each descriptor was calculated; see figure 1, below. The model was then saved so that it could be deployed as a web service using the following code:

    .saveRDS(mydata.rf, file = "rfmodel")

This model is available for download (44.8 MB; license: CC0) or for use as a web service.
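To illustrate how a saved model of this kind might be reloaded and used for prediction, here is a minimal sketch on a small synthetic data set. The columns d1, d2 and d3, the coefficients, and the temporary file are all invented stand-ins, not the CDK descriptors or the published "rfmodel" file. It uses saveRDS() and readRDS(), the public equivalents (from R 2.13.0 onward) of the internal .saveRDS() and .readRDS() available in R 2.11.0.

```r
library(randomForest)

set.seed(1)
# Synthetic stand-in for the descriptor table (invented columns and values)
toy <- data.frame(d1 = runif(60), d2 = runif(60), d3 = runif(60))
toy$mpC <- 100 * toy$d1 - 40 * toy$d2 + rnorm(60, sd = 5)

# Small forest so the example runs quickly; the page's model used the default 500 trees
toy.rf <- randomForest(mpC ~ ., data = toy, ntree = 50, importance = TRUE)

# Save to a temporary file and reload, as a deployment script might do
model_file <- tempfile(fileext = ".rds")
saveRDS(toy.rf, file = model_file)
reloaded <- readRDS(model_file)

# New compounds must supply the same descriptor columns used in training
new_compounds <- data.frame(d1 = c(0.2, 0.9), d2 = c(0.5, 0.1), d3 = c(0.3, 0.7))
predicted_mpC <- predict(reloaded, newdata = new_compounds)
print(predicted_mpC)

# Numeric importance table behind the kind of plot varImpPlot() draws
print(importance(reloaded))
```

With the real model, the new_compounds data frame would instead hold the CDK descriptor values calculated for each query structure, with column names matching those in the training file.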

Predicted vs. Observed Melting Points
To visualize the accuracy of the model, we plotted the predicted melting point against the observed melting point for all 12634 compounds; see figure 2, below. The compounds are coloured according to topological polar surface area and sized according to the number of hydrogen bond donors, the two most 'important' descriptors.
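The tool used to draw the original figure is not stated. As one possible way to produce a comparable plot, the base R sketch below colours points by polar surface area and sizes them by hydrogen bond donor count. The data are synthetic, and the column names TopoPSA and nHBDon are assumed here for the two descriptors.

```r
set.seed(2)
n <- 200
# Synthetic stand-in data (invented); TopoPSA and nHBDon are assumed column names
toy <- data.frame(
  observed = runif(n, -50, 300),
  TopoPSA  = runif(n, 0, 150),
  nHBDon   = sample(0:4, n, replace = TRUE)
)
toy$predicted <- toy$observed + rnorm(n, sd = 30)

# Map polar surface area onto a 10-step colour ramp, donor count onto point size
ramp   <- colorRampPalette(c("blue", "red"))(10)
colour <- ramp[cut(toy$TopoPSA, breaks = 10, labels = FALSE)]
size   <- 0.5 + 0.4 * toy$nHBDon

plot_file <- tempfile(fileext = ".pdf")
pdf(plot_file, width = 6, height = 6)
plot(toy$observed, toy$predicted, col = colour, cex = size, pch = 16,
     xlab = "Observed melting point (deg C)",
     ylab = "Predicted melting point (deg C)")
abline(0, 1, lty = 2)  # dashed line of perfect agreement
dev.off()
```

Points lying close to the dashed line are well predicted; systematic drift away from it at one end of the colour or size scale would suggest that the model performs differently for highly polar or strongly hydrogen-bonding compounds.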

Conclusion
An open, transparent, free-to-use, and accurate melting point model has been provided, either for download or for use via a web service that calculates melting points from SMILES.

Acknowledgments
Deployment of the model was achieved using PHP, R, and the CDK Descriptor Calculator GUI (v.1.1.1) with helpful advice from Rajarshi Guha.