Melting Point Model Testing I

Researchers: Jean-Claude Bradley and Andrew Lang

Introduction

An early melting point model (MPModel002) could distinguish between ortho and para isomers but not between cis and trans isomers. Here we present the predictive ability of our current RF models (004, 006, and 007) on ortho vs. para and cis vs. trans compounds.

Procedure

The molecules selected for testing were run through Rajarshi Guha's CDK Descriptor Calculator using SMILES for models 004 and 007 and SDF created from SMILES with ChemAxon's molconvert for model 006. The resulting CDK descriptors were then passed through the models themselves to generate predicted melting point values which were compared to experimental values.

Ortho vs. Para. Random Forest. To test the ability of our models to distinguish between ortho and para substituted molecules, we selected 1,2-dichlorobenzene (CSID: 13837988) and 1,4-dichlorobenzene (CSID: 13866817), which are seemingly very similar compounds but have very different experimental melting points: -17.23 °C and 53.01 °C respectively.
compound              experimental (°C)   004 (°C)   006 (°C)   007 (°C)
1,2-dichlorobenzene   -17.23              -11.13     -13.29     -13.27
1,4-dichlorobenzene    53.01               33.62      36.79      39.55
We see that all three models are able to distinguish between ortho and para, at least in this case. To see how the models distinguish between the molecules, we looked for the CDK descriptors that differ in value between the two molecules. The classes of 2D descriptors that differ are: BCUT (eigenvalue-based descriptors noted for their utility in chemical diversity, described by Pearlman et al.); AutoCorrelations (charge, mass, and polarizability); Kier and Hall chi cluster indices, path indices, and path cluster indices; ECCEN (a topological descriptor combining distance and adjacency information); MDE (molecular distance edge descriptors for C, N and O); PetitjeanNumber (the Petitjean number of a molecule); WTPT (weighted path descriptors described by Randic, which characterize molecular branching); and WPATH (the Wiener path number).

It seems that 2D descriptors are enough to distinguish between ortho and para molecules. There are, however, additional 3D descriptors that differ between the two molecules: CPSA (descriptors combining surface area and partial charge information); WHIM (holistic descriptors described by Todeschini et al.); GRAV (descriptors characterizing the mass distribution of the molecule); LOBMIN (minimum length to breadth ratio); MOMI (the principal moments of inertia, the ratios of the principal moments, and the radius of gyration); and the Petitjean shape indices. Yet the addition of 3D descriptors only slightly improves the prediction accuracy of the model (comparing 006 to 004).
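The descriptor comparison described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the procedure actually used: the function name differing_descriptors and the tolerances are assumptions, and the descriptor values below are placeholders rather than real CDK output (only the descriptor names come from the text).

```python
import math

def differing_descriptors(desc_a, desc_b, rel_tol=1e-9, abs_tol=1e-12):
    """Return {name: (value_a, value_b)} for shared descriptors whose values differ."""
    diffs = {}
    for name in sorted(desc_a.keys() & desc_b.keys()):
        a, b = desc_a[name], desc_b[name]
        # Two NaN values (e.g. failed calculations) are treated as "no difference".
        both_nan = math.isnan(a) and math.isnan(b)
        if not both_nan and not math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol):
            diffs[name] = (a, b)
    return diffs

# Placeholder values for illustration only; NOT actual CDK descriptor output.
ortho = {"MW": 10.0, "nRotB": 0.0, "WPATH": 3.0, "PetitjeanNumber": 0.4}
para = {"MW": 10.0, "nRotB": 0.0, "WPATH": 4.0, "PetitjeanNumber": 0.2}

changed = differing_descriptors(ortho, para)
print(sorted(changed))
```

With real data, the two dictionaries would be filled from the rows of the descriptor calculator output for the two molecules, and the keys of the result could then be grouped by descriptor class.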

Ortho vs. Para. Linear. To see if we could identify the descriptors that allow the models to distinguish between ortho and para, we calculated the predicted melting point using the linear model from MPModel004. The predicted value for 1,2-dichlorobenzene is 11.88 °C and the predicted value for 1,4-dichlorobenzene is 11.86 °C. The descriptors used in the linear model (ALogp2, nHBDon, nAtomP, nRotB, TopoPSA, MW, WTPT-2) are all identical for both molecules except WTPT-2, which differs by only 0.000072566. Thus the linear model (with the descriptors found in MPModel004) is practically incapable of distinguishing between the ortho and para cases. This means that the descriptors needed to distinguish between ortho and para are not highly significant in and of themselves in determining the melting point in general, and that a non-linear model (or a linear model with more descriptors) is needed to pick up the difference in melting points between ortho and para. An interesting paper by Katritzky et al. that focuses on ortho, meta, and para substituted benzenes gives a general linear model but does not examine whether the linear model actually distinguishes between the three cases.
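The reason a linear model cannot separate the two isomers can be made concrete: if only one descriptor differs, the gap between the two predictions is that descriptor's coefficient times the descriptor difference. The sketch below illustrates this under stated assumptions. The intercept, coefficients, and descriptor values are hypothetical and are not those of MPModel004; only the WTPT-2 difference (0.000072566) is taken from the text, and mol_a and mol_b are neutral labels that do not say which isomer has the larger value.

```python
import math

def linear_prediction(intercept, coefs, descriptors):
    """Evaluate y = intercept + sum(coef_i * x_i) over the model's descriptors."""
    return intercept + sum(coefs[name] * descriptors[name] for name in coefs)

# Hypothetical model parameters, for illustration only.
intercept = 5.0
coefs = {"MW": 0.1, "nRotB": -2.0, "WTPT-2": 3.0}

WTPT2_GAP = 0.000072566  # WTPT-2 difference reported in the text

# Hypothetical descriptor values: identical except for WTPT-2.
mol_a = {"MW": 10.0, "nRotB": 0.0, "WTPT-2": 2.0}
mol_b = {"MW": 10.0, "nRotB": 0.0, "WTPT-2": 2.0 + WTPT2_GAP}

gap = linear_prediction(intercept, coefs, mol_b) - linear_prediction(intercept, coefs, mol_a)
expected = coefs["WTPT-2"] * WTPT2_GAP

# The prediction gap is just the WTPT-2 coefficient times the descriptor gap.
print(math.isclose(gap, expected, rel_tol=1e-6))
```

Because the descriptor gap is so small, the coefficient on WTPT-2 would have to be enormous for such a model to reproduce the roughly 70 °C experimental difference between the two isomers.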
Cis vs. Trans. To test the ability of our models to distinguish between cis and trans molecules, we selected maleic acid (CSID: 392248) and fumaric acid (CSID: 10197150). Previous models (MPModel002) were unable to distinguish between these two compounds and gave the same predicted melting point for both, even though the experimental melting points differ greatly due to hydrogen bonding: 134.47 °C for maleic acid as compared to 287 °C for fumaric acid (Wikipedia value; other values range between 230 dec. and 297.5).
compound       experimental (°C)   004 (°C)   006 (°C)   007 (°C)
maleic acid    134.47              148.33     142.56     140.99
fumaric acid   287                 148.33     144.89     140.99
We see that there is no difference at all between the values predicted for the two acids by models 004 and 007. These are both 2D descriptor models, and the 2D CDK descriptors for maleic acid and fumaric acid are identical. When 2D+3D descriptors are used (MPModel006), the predicted melting point of fumaric acid is slightly higher than that of maleic acid (144.89 °C vs. 142.56 °C). There are 59 3D descriptors that differ between the two molecules, yet the model gets nowhere near the experimental melting point of fumaric acid. The reasons for this could be:

1. There is currently no 3D descriptor in the CDK that contains the information necessary to explain the difference in hydrogen bonding between cis and trans molecules.

2. A model built exclusively with cis and trans molecules might perform very well, and the poor performance seen here is due to such molecules being only a tiny percentage of the training set. The information may be there but may not have been incorporated into the model to the degree necessary to give accurate predictions for cis/trans pairs. This is quite likely, as all models seem to favour maleic acid, where hydrogen bonding does not unusually change the experimental melting point.
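To put numbers on the comparisons made in both tests, the absolute prediction errors can be tabulated directly from the experimental and predicted values reported above. The following is a minimal sketch; the names DATA, MODELS, and absolute_errors are illustrative, and the fumaric acid error uses the 287 °C Wikipedia value quoted in the text.

```python
# Experimental and predicted melting points (°C), copied from the tables above.
DATA = {
    "1,2-dichlorobenzene": {"experimental": -17.23, "004": -11.13, "006": -13.29, "007": -13.27},
    "1,4-dichlorobenzene": {"experimental": 53.01, "004": 33.62, "006": 36.79, "007": 39.55},
    "maleic acid": {"experimental": 134.47, "004": 148.33, "006": 142.56, "007": 140.99},
    "fumaric acid": {"experimental": 287.0, "004": 148.33, "006": 144.89, "007": 140.99},
}
MODELS = ("004", "006", "007")

def absolute_errors(row):
    """Absolute error |predicted - experimental| for each model, rounded to 0.01 °C."""
    return {m: round(abs(row[m] - row["experimental"]), 2) for m in MODELS}

errors = {name: absolute_errors(row) for name, row in DATA.items()}

# Print one row per compound, one column per model.
print(f"{'compound':22s}" + "".join(f"{m:>10s}" for m in MODELS))
for name, errs in errors.items():
    print(f"{name:22s}" + "".join(f"{errs[m]:10.2f}" for m in MODELS))
```

On these four compounds the errors for the dichlorobenzenes and maleic acid stay below 20 °C for every model, while the errors for fumaric acid are all close to 140 °C, which is the cis/trans failure discussed above.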