Melting Point Model 001

Researcher: Andrew Lang

Abstract

We present two general melting point models created using open descriptors from the Chemistry Development Kit (CDK) and an Open Data dataset of 12634 melting points. The first model, a random forest model, using 150 2D descriptors and trained using 10000 compounds had an AAE of 29.701 °C, a RMSE of 31.294 °C, and an R2 value of 0.784 when used to predict the melting points of 2634 test-set compounds. The random forest was then used to select the descriptors for a linear model, with regression done using the same training set. The linear model had an AAE of 41.786 °C, a RMSE of 53.553 °C, and an R2 value of 0.600 on the same test set of 2634 compounds.

Introduction

Two previous melting point models have used the Open Data Karthikeyan dataset (ONSMP003) of 4173 compounds using propriety Molecular Operating Environment (MOE) descriptors. Karthikeyan et. al. [1] modeled the melting point for this data using an artificial neural network with a reported test-set RMSE of 50.4 °C with an R2 value of 0.65. O'Boyle et. al. [2], using the same set of data created a partial least squares model (test-set RMSE of 46.6°C and R2 value of 0.51) and a support vector machine model (test-set RMSE of 45.1°C and R2 of 0.54) with feature selection achieved with an artificial ant colony. While the Karthikeyan dataset (ONSMP003) is a great resource, it contains some erroneous data [3], and though large, only contains compounds with melting points between 14 °C and 392.5 °C.

Procedure

Collecting and Curating the Data. (Access to all datasets available here) We began by combining and the recently released [4] Alfa Aesar Open Data melting point dataset (ONSMP002), the Karthikeyan dataset (ONSMP003), and two smaller sets of open melting point data: one from Bergstrom et. al. [5] (ONSMP006) and one from Bradley et. al. [6] (ONSMP011). From this dataset 171 compounds whose SMILES contained at least one period (salts) were removed. Examining the data for duplicate compound measurements, we noted several compounds with multiple measurements whose melting point values differed by more than 10 °C. If there was no way to distinguish the more likely accurate answer (for example by finding supporting independent literature values), then both entries were removed. All compounds with multiple measurements, but with melting point differences less than or equal to 10 °C, were removed and replaced by a single averaged value. Finally, those compounds that failed to generate CDK descriptor values were also removed (10 compounds), see table 1 below. All removed entries were added to a single spreadsheet (ONSMP012), leaving a highly curated dataset of 12634 unique melting points (ONSMP013).

Calculating and Selecting Descriptors. Open CDK [7] descriptors were calculated using the CDK Descriptor Calculator GUI (v 1.1.1) [8]. All descriptors were calculated except Charged Partial Surface Area, Ionization Potential, Amino Acid Count, and all geometrical descriptors. Descriptors with constant entries and Kier3 (a lot of entries with 'NA') were also removed, leaving a total of 150 descriptors. The following compounds, as noted above, were removed because they had 'NA' for several descriptor values:
name
SMILES
mpC
CSID
diphenylphosphine oxide
c1ccc(cc1)P(=O)c2ccccc2
53.5
222625
periodic acid
OI(=O)(=O)=O
128
58684
benzeneseleninic anhydride
c1ccc(cc1)[Se](=O)O[Se](=O)c2ccccc2
164
78710
hydroxy(tosyloxy)iodobenzene
Cc1ccc(cc1)S(=O)(=O)OI(c2ccccc2)O
133.5
288160
(diphenylmethylidene)(triphenyl)-lambda~5~-phosphane
[P+](c1ccccc1)(c1ccccc1)(c1ccccc1)C(=C=1C=CC=CC=1)c1ccccc1
170
436055
acetyl(2,3,5,6-tetramethylphenyl)mercury
[Hg](c1c(C)c(C)cc(C)c1C)C(=O)
158
2035308
4-chlorobenzeneseleninic acid
c1cc(ccc1Cl)[Se](=O)O
184.5
4256138
phenylphosphinic acid
c1ccc(cc1)P(=O)O
84
10449255
(3Z)-N-hydroxy-2-(pyridin-2-yl)-3H-indol-3-imine 1-oxide
[O-][N+]=1c2ccccc2C(=NO)C=1c1[nH0]cccc1
216
10482712
(methanolato)(2,3,5,6-tetramethylphenyl)mercury
[Hg](OC)c1c(C)c(C)cc(C)c1C
158
10484121
Table 1. The ten compounds that the CDK failed to generate descriptors for.

Visualizing the Chemical Space. The entire dataset of 12634 melting point measurements contains compounds with melting points ranging from -170 °C to 485 °C (with a mean of 110.2 °C). The dataset was randomly split into a training set of 10000 compounds and a test set of 2634 compounds and a principal component analysis was performed using the program R (v 2.11.0) with the following code:
mydata = read.csv(file="20110303combined.csv",head=TRUE,row.names="molID")
data <- mydata[, 3:153] ## Don't include Karthikeyan or Test/Training in pca, but do include MP
pc1 <- prcomp(data, scale. = T)
x <- pc1$x
summary(pc1)
 
[output]
Importance of components:
                         PC1    PC2    PC3    PC4    PC5    PC6    PC7    PC8
Standard deviation     6.692 3.1436 2.7531 2.4615 2.2749 2.1594 1.9839 1.8855
Proportion of Variance 0.297 0.0654 0.0502 0.0401 0.0343 0.0309 0.0261 0.0235
Cumulative Proportion  0.297 0.3620 0.4122 0.4523 0.4866 0.5175 0.5435 0.5671
[output]
 
colvec <- mydata$Karthikeyan
plot(x[, 1], x[, 2], pch = 20, col = colvec+2,xlim=c(-15,45),ylim=c(-15,30))
legend(-15, 30, c("Alfa Aesar+","Karthikeyan"), cex=1, col = c(2, 3),pch=20)
colvec <- mydata$Test
plot(x[, 1], x[, 2], pch = 20, col = colvec+2,xlim=c(-15,45),ylim=c(-15,30))
legend(-15, 30, c("Alfa Aesar+","Karthikeyan"), cex=1, col = c(2, 3),pch=20)
From the resulting figures below, of the first two principal components, we see that: 1. The entire datasets comprises a much larger chemical space than the Karthikeyan data set by itself. 2. The test set and training set cover comparable regions of the chemical space, though the test set has no compounds with PC1 greater than 30 or PC2 greater then 20.
pcaCombinedKarthikeyan.png
Figure 1. First two principal components of the entire dataset, comparing the Karthikeyan dataset to the rest of the data.


pcaCombinedTestTraining.png
Figure 2. First two principal components showing the distribution of the training and test sets.

Building the Models. The randomForest package (v 4.5-34) in R was used to build the random forest models using the following code:
mydata = read.csv(file="20110303training.csv",head=TRUE,row.names="molID")
## do random forest [randomForest 4.5-34]
mydata.rf <- randomForest(mpC ~ ., data = mydata,importance = TRUE)
print(mydata.rf)
 
[output]
Call:
 randomForest(formula = mpC ~ ., data = mydata, importance = TRUE)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 50
 
          Mean of squared residuals: 1569.888
                    % Var explained: 77.89
[output]
 
## get plot of importance - %IncMSE and IncNodePurity
varImpPlot(mydata.rf,main="Random Forest Variable Importance")
The resulting image, showing the importance of the descriptors, points to both the number of hydrogen bond donors (nHBDon) and the topological polar surface area (TopoPSA) as important physiochemical properties for melting point prediction. This confirms the previous analyses of Karthikeyan et. el. [1], Hughes et. al. [9], and Bergstom et. al. [10].
RFTraining.png
Figure 3. The importance of descriptors in the random forest model.

The descriptors indicated as important by the random forest were used to create a linear model on the same training set of data using the following code:
## do linear regression
fit <- lm(mpC ~ nHBDon+TopoPSA+WTPT.5+ATSc3+ATSc2+nAtomP+XLogP+WTPT.4+bpol+ATSc1+BCUTp.1l+WTPT.3+nAtomLC+ATSc4+khs.ssO
+ATSc5+nRotB+ALogP+BCUTp.1h+BCUTw.1h+VC.5+BCUTw.1l+SCH.7+VC.3+khs.ssCH2+fragC+ALogp2+AMR+C1SP3+ATSm1+SPC.6+SP.6+SPC.4
+SP.4+ATSp3+SP.7+SPC.5+SP.2+SP.3+SP.5+WTPT.2+SC.3+ATSp4+ATSp2+VCH.7,data=mydata)
## summary of fit
summary(fit)
 
[output]
Residuals:
     Min       1Q   Median       3Q      Max
-347.615  -34.588   -4.265   31.316  254.289
Residual standard error: 51.81 on 9954 degrees of freedom
Multiple R-squared: 0.6236,     Adjusted R-squared: 0.6219
F-statistic: 366.4 on 45 and 9954 DF,  p-value: < 2.2e-16
[output]
 
library(leaps)
attach(mydata)
leaps<-regsubsets(mpC ~ nHBDon+TopoPSA+WTPT.5+ATSc3+ATSc2+nAtomP+XLogP+WTPT.4+bpol+ATSc1+BCUTp.1l+WTPT.3+nAtomLC+ATSc4
+khs.ssO+ATSc5+nRotB+ALogP+BCUTp.1h+BCUTw.1h+VC.5+BCUTw.1l+SCH.7+VC.3+khs.ssCH2+fragC+ALogp2+AMR+C1SP3+ATSm1+SPC.6+SP.6
+SPC.4+SP.4+ATSp3+SP.7+SPC.5+SP.2+SP.3+SP.5+WTPT.2+SC.3+ATSp4+ATSp2+VCH.7,nvmax=12,data=mydata,nbest=1,really.big=TRUE)
# models are ordered by the selection statistic.
plot(leaps,scale="adjr2")
The leaps package was used to identify the top seven descriptors using exhaustive search subset selection, see figure 4, that were in turn used to create a final linear model presented below.
leaps.png
Figure 4. Exhaustive subset selection showing the top sets of descriptors by adjusted R2.

Results

Melting Point Prediction. As expected, the best model for melting point prediction was the random forest model. The random forest model had an AAE of 29.701 °C, a RMSE of 31.294 °C, and an R2 value of 0.784 when used to predict the melting points of 2634 test-set compounds, using the following code:
testdata = read.csv(file="20110303test.csv",head=TRUE,row.names="molID")
test.predict <- predict(mydata.rf,testdata)
write.csv(test.predict, file = "RFTestPredict.csv")

The linear model with an AAE of 41.786 °C, a RMSE of 53.553 °C, and an R2 value of 0.600 on the same test set of 2634 compounds was created in R using the following code:
fit <- lm(mpC ~ nHBDon+TopoPSA+nAtomP+nRotB+ATSm1+SPC.5+WTPT.2,data=mydata)
summary(fit)
 
[output]
Residuals:
     Min       1Q   Median       3Q      Max
-393.741  -36.037   -5.096   32.261  256.297
 
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -631.33825   17.94974  -35.17   <2e-16 ***
nHBDon        27.74523    0.69868   39.71   <2e-16 ***
TopoPSA        0.71073    0.02302   30.88   <2e-16 ***
nAtomP         2.98697    0.12512   23.87   <2e-16 ***
nRotB         -3.97845    0.19347  -20.56   <2e-16 ***
ATSm1          0.29843    0.02132   14.00   <2e-16 ***
SPC.5          4.09519    0.19386   21.12   <2e-16 ***
WTPT.2       327.15838    9.42256   34.72   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 
Residual standard error: 54.26 on 9992 degrees of freedom
Multiple R-squared: 0.5857,     Adjusted R-squared: 0.5854
F-statistic:  2018 on 7 and 9992 DF,  p-value: < 2.2e-16
[output]
A plot of predicted melting point against measured melting point is shown below, see figures 5, 6, and 7.
TestTrainingBothCropped.png
Figure 5. Predicted melting point versus observed melting point. Training set is red, test set is green. Kathikeyan data are plotted as squares.

TestTrainingJustTestCropped.png
Figure 6. Predicted melting point versus observed melting point with the test set data highlighted. The Karthikeyan data are plotted as squares.

TestTrainingKarthikeyanCropped.png
Figure 7. Predicted melting point versus observed melting point. Training set is red, test set is green. Kathikeyan data are plotted as squares and are highlighted.

Outliers. Taking the intersection of the top twenty outliers from the test-set predictions of both the random forest model and the linear model, we highlight nine compounds whose melting point prediction failed dramatically for both models. These compounds are listed below in table 2 and with structures below that. All the compounds listed below, except mol9104, are outliers due to the melting point being under-predicted.
molID
CSID
Karthikeyan
nHBDon
ATSm1
SPC-5
nAtomP
nRotB
TopoPSA
WTPT-2
mpC
Predicted
AE
mol10675
4534702
yes
0
19.28
2.61
16
3
17.07
2.02
314.5
92.5
222.0
mol5415
144284
no
2
18.42
2.00
3
4
63.32
1.83
274.0
73.5
200.5
mol708
713442
no
0
18.00
4.34
18
0
0.00
2.13
341.0
142.9
198.1
mol9104
2034700
yes
2
37.43
7.85
11
2
87.72
2.02
20.5
215.3
194.8
mol2494
19099
no
2
20.23
1.98
11
2
102.84
1.92
357.5
165.4
192.1
mol12602
23873916
yes
0
27.82
8.41
18
6
77.32
2.02
342.0
157.7
184.3
mol1337
8883
no
0
10.00
2.82
0
0
0.00
2.14
267.0
83.9
183.1
mol4731
83963
yes
0
102.52
3.10
6
3
0.00
2.01
239.0
76.8
162.2
mol3952
68803
no
0
11.55
1.55
3
5
26.30
1.85
150.5
-8.9
159.4




mol10675
mol5415
mol708



mol9104
mol2494
mol12602



mol1337
mol4731
mol3952

Conclusion

We have shown that it is possible to build melting point models using Open Data and Open CDK descriptors that may be used to predict melting points from structure with reasonable accuracy - Random Forest model on test set data: AAE of 29.701 °C, a RMSE of 31.294 °C, and an R2 value of 0.784. The sharp drop in R2 (R2 value of 0.600) when a linear model was used, seems to give weight to the suggestion that "success of these [nonlinear] methods supports the conclusion that nonlinear methods more effectively model melting point." [9]. The linear model does however give extra weight to the suggestion that the number of hydrogen bond donors and the topological polar surface area are important physiochemical properties for modeling melting points in general.

References

[1] Karthikeyan M.; Glen R.C.; and Bender A. General melting point prediction based on a diverse compound dataset and artificial neural networks. J. Chem. Inf. Model.; 2005; 45(3); 581-590
[2] O'Boyle N. M.; Palmer D. S.; Nigsch F.; and Mitchell J. B. Simultaneous feature selection and parameter optimisation using an artificial ant colony: case study of melting point prediction. (2008) Chemistry Central journal 2 p. 21
[3] Bradley J-C. Validating Melting Point Data from Alfa Aesar, EPI and MDPI. (2011) [http://usefulchem.blogspot.com/2011/03/validating-melting-point-data-from-alfa.html]
[4] Bradley J-C. Alfa Aesar melting point data now openly available. (2011) [http://usefulchem.blogspot.com/2011/02/alfa-aesar-melting-point-data-now.html]
[5] Bergstrom, C. A.; Norinder, U.; Luthman, K.; Artursson, P. Molecular descriptors influencing melting point and their role inclassification of solid drugs. J. Chem. Inf. Comput. Sci. 2003, 43, 1177-1185.
[6] Bradley J-C. et. al. (2010-) Crowdsourced melting point data. [http://usefulchem.blogspot.com/2011/01/chemical-information-validation-results.html]
[7] Steinbeck C.;Han Y; Kuhn S.; Horlacher O.; Luttmann E. and Willighagen E. The Chemistry Development Kit (CDK): An Open-Source Java Library for Chemo- and Bioinformatics. J. Chem. Inf. Comput. Sci., 2003, 43 (2), pp 493–500 DOI: 10.1021/ci025584y
[8] Guha R. CDK Descriptor Calculator GUI (v 1.1.1) [http://www.rguha.net/code/java/cdkdesc.html]
[9] Hughes L.D.; Palmer D.S.; Nigsch F., and Mitchell J.B.O. Why Are Some Properties More Difficult To Predict than Others? A Study of QSPR Models of Solubility, Melting Point, and Log P. J. Chem. Inf. Model. 2008, 48, 220-232
[10] Bergström C.A.; Norinder U.; Luthman K.; Artursson P. Molecular descriptors influencing melting point and their role in classification of solid drugs. J Chem Inf Comput Sci. 2003 Jul-Aug;43(4):1177-85.