Random Forest Based Melting Point Prediction

Researcher: Andrew Lang


A melting point web service is provided, based on a random forest model built from open descriptors calculated with the Chemistry Development Kit (CDK) and an Open Data dataset of 12634 melting points (ONSMP013). The model has an internal AAE (average absolute error) of 11.597 °C, an RMSE of 38.786 °C, and an R² value of 0.789. The model was trained on the entire dataset because a previous analysis suggests that the likelihood of over-fitting is minimal.
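For readers who want to reproduce statistics of this kind, the following is a minimal sketch of how AAE, RMSE, and R² can be computed from paired observed and predicted values. The helper name regression_metrics and the toy numbers are assumptions made for illustration; they are not taken from ONSMP013, and R² is computed here with one common definition (1 minus the mean squared error divided by the variance of the observed values).

```r
# Hypothetical helper (not part of the original analysis): error metrics
# for a regression model, given observed and predicted values.
regression_metrics <- function(observed, predicted) {
  residuals <- observed - predicted
  mse  <- mean(residuals^2)                       # mean of squared residuals
  aae  <- mean(abs(residuals))                    # average absolute error
  rmse <- sqrt(mse)                               # root mean squared error
  # One common definition of R^2: 1 - MSE / variance of the observations
  r2   <- 1 - mse / mean((observed - mean(observed))^2)
  c(AAE = aae, RMSE = rmse, R2 = r2)
}

# Toy values invented for illustration only
observed  <- c(10, 20, 30, 40)
predicted <- c(12, 18, 33, 39)

metrics <- regression_metrics(observed, predicted)
print(metrics)
```

As a consistency check on the figures quoted above, the RMSE of 38.786 °C is the square root of the "Mean of squared residuals" value (1504.319) that randomForest prints for the fitted model.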


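The random forest model was created in R (v.2.11.0) using the randomForest package (v.4.5-34) and the following code:
library(randomForest)  # provides randomForest()
mydata <- read.csv(file="20110303combinedPrepped.csv", header=TRUE, row.names="molID")  # one row per molecule, indexed by molID
mydata.rf <- randomForest(mpC ~ ., data = mydata, importance = TRUE)  # regress mpC on all descriptors
print(mydata.rf)  # summary of the fitted model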
 randomForest(formula = mpC ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 50
          Mean of squared residuals: 1504.319
                    % Var explained: 78.85
The importance of the descriptors was calculated; see Figure 1 below.
Figure 1. The importance of descriptors for modeling melting points.
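As a rough illustration of how descriptor importance can be read from a forest fitted with importance = TRUE, the sketch below uses the randomForest functions importance() and varImpPlot() on a small synthetic dataset. The data frame toy and its columns x1, x2, and y are invented stand-ins, not the melting point descriptors, and this is not necessarily how Figure 1 was produced.

```r
library(randomForest)

set.seed(1)  # make the toy example reproducible

# Synthetic stand-in data (assumption): x1 drives the response, x2 is noise
toy <- data.frame(x1 = runif(200), x2 = runif(200))
toy$y <- 10 * toy$x1 + rnorm(200, sd = 0.5)

# importance = TRUE asks the forest to record permutation importance
toy.rf <- randomForest(y ~ ., data = toy, importance = TRUE, ntree = 200)

# One row per descriptor; for regression the columns are
# %IncMSE (permutation importance) and IncNodePurity
imp <- importance(toy.rf)
print(imp[order(imp[, "%IncMSE"], decreasing = TRUE), , drop = FALSE])

# Dot chart of both importance measures
varImpPlot(toy.rf)
```

On the real model, the same calls would be applied to mydata.rf in place of toy.rf.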

The model was then saved so that it could be deployed as a web service using the following code:
.saveRDS(mydata.rf, file = "rfmodel")
The model is available for download (44.8 MB, license: CC0) or for use as a web service.
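A downloaded model can be restored and used for prediction along the following lines. This is a minimal sketch under stated assumptions: the toy forest, the temporary file path, and the new_compounds data frame are invented for illustration, and a real prediction would need a data frame containing the same CDK descriptor columns used for training. The save call above uses .saveRDS, the name available in the R version used; from R 2.13.0 onward the public functions are saveRDS() and readRDS(), which the sketch uses.

```r
library(randomForest)  # needed so predict() dispatches to the forest method

set.seed(1)

# Synthetic stand-in for the descriptor table (assumption)
toy <- data.frame(x1 = runif(100), x2 = runif(100))
toy$y <- 10 * toy$x1 + rnorm(100, sd = 0.5)
toy.rf <- randomForest(y ~ ., data = toy, ntree = 100)

# Save and restore the fitted model (temporary file used as a placeholder path)
model_file <- tempfile(fileext = ".rds")
saveRDS(toy.rf, file = model_file)
restored <- readRDS(model_file)

# Predict for new rows that carry the same predictor columns as the training data
new_compounds <- data.frame(x1 = c(0.1, 0.9), x2 = c(0.5, 0.5))
pred <- predict(restored, newdata = new_compounds)
print(pred)
```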

Predicted vs. Observed Melting Points

To visualize the accuracy of the model, we plotted the predicted melting point against the observed melting point for all 12634 compounds (see Figure 2 below). The compounds are coloured according to topological polar surface area and sized according to the number of hydrogen bond donors, the two most 'important' descriptors.
Figure 2. Predicted vs. Observed melting points


An open, transparent, free-to-use, and accurate melting point model has been provided, either for download or for use via a web service in which melting points are calculated from SMILES.


Deployment of the model was achieved using PHP, R, and the CDK Descriptor Calculator GUI (v.1.1.1) with helpful advice from Rajarshi Guha.