MeltingPointModel001

Melting Point Model 001
Researcher: Andrew Lang

Abstract
We present two general melting point models created using open descriptors from the Chemistry Development Kit (CDK) and an Open Data dataset of 12634 melting points. The first model, a random forest model, using 150 2D descriptors and trained using 10000 compounds had an AAE of 29.701 °C, a RMSE of 31.294 °C, and an R2 value of 0.784 when used to predict the melting points of 2634 test-set compounds. The random forest was then used to select the descriptors for a linear model, with regression done using the same training set. The linear model had an AAE of 41.786 °C, a RMSE of 53.553 °C, and an R2 value of 0.600 on the same test set of 2634 compounds.

Introduction
Two previous melting point models have used the Open Data Karthikeyan dataset (ONSMP003) of 4173 compounds using propriety Molecular Operating Environment (MOE) descriptors. Karthikeyan //et. al.// [1] modeled the melting point for this data using an artificial neural network with a reported test-set RMSE of 50.4 °C with an R2 value of 0.65. O'Boyle //et. al.// [2], using the same set of data created a partial least squares model (test-set RMSE of 46.6°C and R2 value of 0.51) and a support vector machine model (test-set RMSE of 45.1°C and R2 of 0.54) with feature selection achieved with an artificial ant colony. While the Karthikeyan dataset (ONSMP003) is a great resource, it contains some erroneous data [3], and though large, only contains compounds with melting points between 14 °C and 392.5 °C.

Procedure

 * Collecting and Curating the Data.** (Access to all datasets available here) We began by combining and the recently released [4] Alfa Aesar Open Data melting point dataset (ONSMP002), the Karthikeyan dataset (ONSMP003), and two smaller sets of open melting point data: one from Bergstrom //et. al.// [5] (ONSMP006) and one from Bradley //et. al.// [6] (ONSMP011). From this dataset 171 compounds whose SMILES contained at least one period (salts) were removed. Examining the data for duplicate compound measurements, we noted several compounds with multiple measurements whose melting point values differed by more than 10 °C. If there was no way to distinguish the more likely accurate answer (for example by finding supporting independent literature values), then both entries were removed. All compounds with multiple measurements, but with melting point differences less than or equal to 10 °C, were removed and replaced by a single averaged value. Finally, those compounds that failed to generate CDK descriptor values were also removed (10 compounds), see table 1 below. All removed entries were added to a single spreadsheet (ONSMP012), leaving a highly curated dataset of 12634 unique melting points (ONSMP013).

Table 1. The ten compounds that the CDK failed to generate descriptors for.
 * Calculating and Selecting Descriptors.** Open CDK [7] descriptors were calculated using the CDK Descriptor Calculator GUI (v 1.1.1) [8]. All descriptors were calculated except Charged Partial Surface Area, Ionization Potential, Amino Acid Count, and all geometrical descriptors. Descriptors with constant entries and Kier3 (a lot of entries with 'NA') were also removed, leaving a total of 150 descriptors. The following compounds, as noted above, were removed because they had 'NA' for several descriptor values:
 * name || SMILES || mpC || CSID ||
 * diphenylphosphine oxide || c1ccc(cc1)P(=O)c2ccccc2 || 53.5 || 222625 ||
 * periodic acid || OI(=O)(=O)=O || 128 || 58684 ||
 * benzeneseleninic anhydride || c1ccc(cc1)[Se](=O)O[Se](=O)c2ccccc2 || 164 || 78710 ||
 * hydroxy(tosyloxy)iodobenzene || Cc1ccc(cc1)S(=O)(=O)OI(c2ccccc2)O || 133.5 || 288160 ||
 * (diphenylmethylidene)(triphenyl)-lambda~5~-phosphane || [P+](c1ccccc1)(c1ccccc1)(c1ccccc1)C(=C=1C=CC=CC=1)c1ccccc1 || 170 || 436055 ||
 * acetyl(2,3,5,6-tetramethylphenyl)mercury || [Hg](c1c(C)c(C)cc(C)c1C)C(=O) || 158 || 2035308 ||
 * 4-chlorobenzeneseleninic acid || c1cc(ccc1Cl)[Se](=O)O || 184.5 || 4256138 ||
 * phenylphosphinic acid || c1ccc(cc1)P(=O)O || 84 || 10449255 ||
 * (3Z)-N-hydroxy-2-(pyridin-2-yl)-3H-indol-3-imine 1-oxide || [O-][N+]=1c2ccccc2C(=NO)C=1c1[nH0]cccc1 || 216 || 10482712 ||
 * (methanolato)(2,3,5,6-tetramethylphenyl)mercury || [Hg](OC)c1c(C)c(C)cc(C)c1C || 158 || 10484121 ||

code mydata = read.csv(file="20110303combined.csv",head=TRUE,row.names="molID") data <- mydata[, 3:153] ## Don't include Karthikeyan or Test/Training in pca, but do include MP pc1 <- prcomp(data, scale. = T) x <- pc1$x summary(pc1)
 * Visualizing the Chemical Space.** The entire dataset of 12634 melting point measurements contains compounds with melting points ranging from -170 °C to 485 °C (with a mean of 110.2 °C). The dataset was randomly split into a training set of 10000 compounds and a test set of 2634 compounds and a principal component analysis was performed using the program R (v 2.11.0) with the following code:

[output] Importance of components: PC1   PC2    PC3    PC4    PC5    PC6    PC7    PC8 Standard deviation    6.692 3.1436 2.7531 2.4615 2.2749 2.1594 1.9839 1.8855 Proportion of Variance 0.297 0.0654 0.0502 0.0401 0.0343 0.0309 0.0261 0.0235 Cumulative Proportion 0.297 0.3620 0.4122 0.4523 0.4866 0.5175 0.5435 0.5671 [output]

colvec <- mydata$Karthikeyan plot(x[, 1], x[, 2], pch = 20, col = colvec+2,xlim=c(-15,45),ylim=c(-15,30)) legend(-15, 30, c("Alfa Aesar+","Karthikeyan"), cex=1, col = c(2, 3),pch=20) colvec <- mydata$Test plot(x[, 1], x[, 2], pch = 20, col = colvec+2,xlim=c(-15,45),ylim=c(-15,30)) legend(-15, 30, c("Alfa Aesar+","Karthikeyan"), cex=1, col = c(2, 3),pch=20) code From the resulting figures below, of the first two principal components, we see that: 1. The entire datasets comprises a much larger chemical space than the Karthikeyan data set by itself. 2. The test set and training set cover comparable regions of the chemical space, though the test set has no compounds with PC1 greater than 30 or PC2 greater then 20.

code mydata = read.csv(file="20110303training.csv",head=TRUE,row.names="molID") mydata.rf <- randomForest(mpC ~ ., data = mydata,importance = TRUE) print(mydata.rf)
 * Building the Models.** The randomForest package (v 4.5-34) in R was used to build the random forest models using the following code:
 * 1) do random forest [randomForest 4.5-34]

[output] Call: randomForest(formula = mpC ~ ., data = mydata, importance = TRUE) Type of random forest: regression Number of trees: 500 No. of variables tried at each split: 50

Mean of squared residuals: 1569.888 % Var explained: 77.89 [output]

varImpPlot(mydata.rf,main="Random Forest Variable Importance") code The resulting image, showing the importance of the descriptors, points to both the number of hydrogen bond donors (nHBDon) and the topological polar surface area (TopoPSA) as important physiochemical properties for melting point prediction. This confirms the previous analyses of Karthikeyan //et. el.// [1], Hughes //et. al.// [9], and Bergstom //et. al.// [10]. The descriptors indicated as important by the random forest were used to create a linear model on the same training set of data using the following code: code fit <- lm(mpC ~ nHBDon+TopoPSA+WTPT.5+ATSc3+ATSc2+nAtomP+XLogP+WTPT.4+bpol+ATSc1+BCUTp.1l+WTPT.3+nAtomLC+ATSc4+khs.ssO +ATSc5+nRotB+ALogP+BCUTp.1h+BCUTw.1h+VC.5+BCUTw.1l+SCH.7+VC.3+khs.ssCH2+fragC+ALogp2+AMR+C1SP3+ATSm1+SPC.6+SP.6+SPC.4 +SP.4+ATSp3+SP.7+SPC.5+SP.2+SP.3+SP.5+WTPT.2+SC.3+ATSp4+ATSp2+VCH.7,data=mydata) summary(fit)
 * 1) get plot of importance - %IncMSE and IncNodePurity
 * 1) do linear regression
 * 1) summary of fit

[output] Residuals: Min      1Q   Median       3Q      Max -347.615 -34.588   -4.265   31.316  254.289 Residual standard error: 51.81 on 9954 degrees of freedom Multiple R-squared: 0.6236,    Adjusted R-squared: 0.6219 F-statistic: 366.4 on 45 and 9954 DF, p-value: < 2.2e-16 [output]

library(leaps) attach(mydata) leaps<-regsubsets(mpC ~ nHBDon+TopoPSA+WTPT.5+ATSc3+ATSc2+nAtomP+XLogP+WTPT.4+bpol+ATSc1+BCUTp.1l+WTPT.3+nAtomLC+ATSc4 +khs.ssO+ATSc5+nRotB+ALogP+BCUTp.1h+BCUTw.1h+VC.5+BCUTw.1l+SCH.7+VC.3+khs.ssCH2+fragC+ALogp2+AMR+C1SP3+ATSm1+SPC.6+SP.6 +SPC.4+SP.4+ATSp3+SP.7+SPC.5+SP.2+SP.3+SP.5+WTPT.2+SC.3+ATSp4+ATSp2+VCH.7,nvmax=12,data=mydata,nbest=1,really.big=TRUE) plot(leaps,scale="adjr2") code The leaps package was used to identify the top seven descriptors using exhaustive search subset selection, see figure 4, that were in turn used to create a final linear model presented below.
 * 1) models are ordered by the selection statistic.

Results
code testdata = read.csv(file="20110303test.csv",head=TRUE,row.names="molID") test.predict <- predict(mydata.rf,testdata) write.csv(test.predict, file = "RFTestPredict.csv") code
 * Melting Point Prediction.** As expected, the best model for melting point prediction was the random forest model. The random forest model had an AAE of 29.701 °C, a RMSE of 31.294 °C, and an R2 value of 0.784 when used to predict the melting points of 2634 test-set compounds, using the following code:

The linear model with an AAE of 41.786 °C, a RMSE of 53.553 °C, and an R2 value of 0.600 on the same test set of 2634 compounds was created in R using the following code: code fit <- lm(mpC ~ nHBDon+TopoPSA+nAtomP+nRotB+ATSm1+SPC.5+WTPT.2,data=mydata) summary(fit)

[output] Residuals: Min      1Q   Median       3Q      Max -393.741 -36.037   -5.096   32.261  256.297

Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -631.33825  17.94974  -35.17   <2e-16 *** nHBDon       27.74523    0.69868   39.71   <2e-16 *** TopoPSA       0.71073    0.02302   30.88   <2e-16 *** nAtomP        2.98697    0.12512   23.87   <2e-16 *** nRotB        -3.97845    0.19347  -20.56   <2e-16 *** ATSm1         0.29843    0.02132   14.00   <2e-16 *** SPC.5         4.09519    0.19386   21.12   <2e-16 *** WTPT.2      327.15838    9.42256   34.72   <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 54.26 on 9992 degrees of freedom Multiple R-squared: 0.5857,    Adjusted R-squared: 0.5854 F-statistic: 2018 on 7 and 9992 DF,  p-value: < 2.2e-16 [output] code A plot of predicted melting point against measured melting point is shown below, see figures 5, 6, and 7.
 * Outliers.** Taking the intersection of the top twenty outliers from the test-set predictions of both the random forest model and the linear model, we highlight nine compounds whose melting point prediction failed dramatically for both models. These compounds are listed below in table 2 and with structures below that. All the compounds listed below, except mol9104, are outliers due to the melting point being under-predicted.
 * molID || CSID || Karthikeyan || nHBDon || ATSm1 || SPC-5 || nAtomP || nRotB || TopoPSA || WTPT-2 || mpC || Predicted || AE ||
 * mol10675 || 4534702 || yes || 0 || 19.28 || 2.61 || 16 || 3 || 17.07 || 2.02 || 314.5 || 92.5 || 222.0 ||
 * mol5415 || 144284 || no || 2 || 18.42 || 2.00 || 3 || 4 || 63.32 || 1.83 || 274.0 || 73.5 || 200.5 ||
 * mol708 || 713442 || no || 0 || 18.00 || 4.34 || 18 || 0 || 0.00 || 2.13 || 341.0 || 142.9 || 198.1 ||
 * mol9104 || 2034700 || yes || 2 || 37.43 || 7.85 || 11 || 2 || 87.72 || 2.02 || 20.5 || 215.3 || 194.8 ||
 * mol2494 || 19099 || no || 2 || 20.23 || 1.98 || 11 || 2 || 102.84 || 1.92 || 357.5 || 165.4 || 192.1 ||
 * mol12602 || 23873916 || yes || 0 || 27.82 || 8.41 || 18 || 6 || 77.32 || 2.02 || 342.0 || 157.7 || 184.3 ||
 * mol1337 || 8883 || no || 0 || 10.00 || 2.82 || 0 || 0 || 0.00 || 2.14 || 267.0 || 83.9 || 183.1 ||
 * mol4731 || 83963 || yes || 0 || 102.52 || 3.10 || 6 || 3 || 0.00 || 2.01 || 239.0 || 76.8 || 162.2 ||
 * mol3952 || 68803 || no || 0 || 11.55 || 1.55 || 3 || 5 || 26.30 || 1.85 || 150.5 || -8.9 || 159.4 ||


 * media type="custom" key="8606132" || media type="custom" key="8606138" || media type="custom" key="8606144" ||
 * mol10675 || mol5415 || mol708 ||
 * media type="custom" key="8606156" || media type="custom" key="8606170" || media type="custom" key="8606178" ||
 * mol9104 || mol2494 || mol12602 ||
 * media type="custom" key="8606194" || media type="custom" key="8606196" || media type="custom" key="8606202" ||
 * mol1337 || mol4731 || mol3952 ||

Conclusion
We have shown that it is possible to build melting point models using Open Data and Open CDK descriptors that may be used to predict melting points from structure with reasonable accuracy - Random Forest model on test set data: AAE of 29.701 °C, a RMSE of 31.294 °C, and an R2 value of 0.784. The sharp drop in R2 (R2 value of 0.600) when a linear model was used, seems to give weight to the suggestion that "success of these [nonlinear] methods supports the conclusion that nonlinear methods more effectively model melting point." [9]. The linear model does however give extra weight to the suggestion that the number of hydrogen bond donors and the topological polar surface area are important physiochemical properties for modeling melting points in general.