EPISuite

Comparing EPI Suite MP Prediction to MPModel002
Researcher: Andrew Lang

Introduction
The Estimation Program Interface Suite, or EPI Suite™, is a copyrighted program provided by the EPA that is free for individuals to use [1]. The EPI Suite contains a model, MPBPWIN™, that can be used to predict melting points from SMILES. MPBPWIN™ works by reporting a weighted average of two melting point estimation methods: the Joback group contribution method [2, 3] and the Gold and Ogle equation [4] suggested by Lyman [5]:

    Tm = 0.5839 Tb

where Tm is the melting point and Tb is the boiling point. The weights used in the reported average depend upon structure (details can be found in the EPI Suite help file). MPBPWIN™ has a reported R-Squared value of 0.63 with an AAE of 48.6 °C when tested on a diverse set of 10051 compounds from the PHYSPROP database [6].
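As a rough illustration of the Gold and Ogle relation and the averaging step, a minimal Python sketch follows. It assumes temperatures in kelvin. The function names and the default weight of 0.5 are hypothetical placeholders, not MPBPWIN™ internals; the actual structure-dependent weights are described only in the EPI Suite help file.

```python
def gold_ogle_tm(tb_kelvin):
    """Gold and Ogle estimate: Tm = 0.5839 * Tb (assumed to be in kelvin)."""
    return 0.5839 * tb_kelvin


def weighted_mp(tm_joback, tm_gold_ogle, w_joback=0.5):
    """Weighted average of two melting point estimates.

    w_joback is a hypothetical placeholder weight; the real weights
    depend on structure and are not reproduced here.
    """
    if not 0.0 <= w_joback <= 1.0:
        raise ValueError("w_joback must lie between 0 and 1")
    return w_joback * tm_joback + (1.0 - w_joback) * tm_gold_ogle


# Example: a compound with an estimated boiling point of 500 K and a
# made-up Joback melting point estimate of 300 K.
tm_go = gold_ogle_tm(500.0)
tm_avg = weighted_mp(300.0, tm_go)
```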

Our goal here is to compare the accuracy of MPBPWIN™ with our @MeltingPointModel002 when used to predict the melting points of compounds from ONSMP014 - a collection of 1070 compounds extracted from DrugBank.

Method
We took ONSMP014 (1070 compounds) and curated it as follows:
1. We removed all entries without SMILES (leaving 1006 compounds).
2. We removed all entries with blank melting points (leaving 995 compounds).
3. We removed all entries with melting points listed with ">" or "<" (leaving 958 compounds).
4. We removed all entries without a predicted LogS; these compounds were mainly metals and salts (leaving 937 compounds).
5. We removed all entries whose MP was listed as "boiling point", "decomposes", or "salt"; all compounds with a "." in their SMILES were also removed (leaving 880 compounds).
6. All remaining melting point ranges were averaged.
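The curation steps above can be sketched in Python as follows. This is only an illustration of the filtering logic, not the procedure actually used on the spreadsheet: the field names `smiles`, `mp`, `logs`, and `mp_c` are hypothetical, and melting points are assumed to be stored as text in °C.

```python
import re

# Labels that mark an entry as not having a usable melting point (step 5).
EXCLUDED_LABELS = ("boiling point", "decomposes", "salt")

_NUMBER = r"-?\d+(?:\.\d+)?"
_RANGE = re.compile(rf"^\s*({_NUMBER})\s*-\s*({_NUMBER})\s*$")


def parse_mp(text):
    """Return one melting point value, averaging 'low-high' ranges (step 6).

    Raises ValueError for text that is neither a number nor a range.
    """
    match = _RANGE.match(text)
    if match:
        low, high = float(match.group(1)), float(match.group(2))
        return (low + high) / 2.0
    return float(text)


def curate(entries):
    """Apply curation steps 1-6 to a list of dicts (hypothetical field names)."""
    kept = []
    for entry in entries:
        smiles = (entry.get("smiles") or "").strip()
        mp = (entry.get("mp") or "").strip()
        if not smiles:                      # step 1: no SMILES
            continue
        if not mp:                          # step 2: blank melting point
            continue
        if ">" in mp or "<" in mp:          # step 3: bounded values
            continue
        if entry.get("logs") is None:       # step 4: no predicted LogS
            continue
        if any(label in mp.lower() for label in EXCLUDED_LABELS):
            continue                        # step 5: non-melting-point labels
        if "." in smiles:                   # step 5: multi-component SMILES
            continue
        kept.append({**entry, "mp_c": parse_mp(mp)})
    return kept
```

For example, an entry with `mp` set to "196-198" would be kept with `mp_c` equal to 197.0, while an entry with SMILES `[Na+].[Cl-]` would be dropped at step 5.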

We then took this file and calculated CDK descriptors using Rajarshi Guha's CDK Descriptor Calculator (v 1.1). All descriptors were calculated except Charged Partial Surface Area, Ionization Potential, Amino Acid Count, and all geometrical descriptors. Kier3 was removed post calculation as it contained several 'NA' values. The CDK Descriptor Calculator failed to calculate descriptors for two compounds (which were removed from the dataset leaving 878 compounds):
 * DB02845 || Methylphosphinic Acid || C[P@@H](O)=O ||
 * DB00534 || Chlormerodrin || COC(CNC(N)=O)C[Hg]Cl ||

MPBPWIN™ was then used to calculate predicted melting points for the remaining 878 compounds. Experimentally measured melting points were also recorded when found [6]. Three compounds failed to generate predicted melting point values (they were removed from the dataset leaving 875 compounds):
 * DB00369 || Cidofovir || NC1=NC(=O)N(C[C@@H](CO)OCP(O)(O)=O)C=C1 ||
 * DB00733 || Pralidoxime || CN1C=CC=C\C1=C/[NH+]=O ||
 * DB02671 || 1-Methylimidazole || CN1C=C[NH+]=C1 ||

One entry, Succimer (OC(=O)C(S)C(S)C(O)=O), had an EPI Suite experimental melting point value of 19619-8; this was interpreted as 196-198 and changed to 197 accordingly. The resulting spreadsheet of 875 compounds was saved as ONSMP015 and uploaded to the wiki.

In an effort to get a relatively accurate set of melting points, 313 entries were removed, saved, and uploaded to the wiki as ONSMP016. These were the entries where the DrugBank measured value and the EPI Suite measured value differed by more than 10 °C, together with the entries that had no EPI Suite measured value. The Succimer entry was removed for consistency. This left a final dataset of 562 compounds with measured melting points from both DrugBank and EPI Suite that agreed within 10 °C. Melting point predictions were calculated and recorded for these 562 compounds using @MeltingPointModel002 and can be found in ONSMP017.
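The agreement filter described above can be sketched as below. The field names `mp_drugbank_c`, `mp_epi_c`, and `mp_measured_c` are hypothetical, and the averaging of the two measured values is included because the Results section uses that average as the measured melting point.

```python
def split_by_agreement(entries, tolerance_c=10.0):
    """Split entries into (agreed, removed) lists.

    An entry is removed when it has no EPI Suite measured value or when
    the two measured values differ by more than tolerance_c (in °C).
    Field names are hypothetical.
    """
    agreed, removed = [], []
    for entry in entries:
        drugbank = entry["mp_drugbank_c"]
        epi = entry.get("mp_epi_c")
        if epi is None or abs(drugbank - epi) > tolerance_c:
            removed.append(entry)
        else:
            # Measured value taken as the average of the two sources.
            measured = (drugbank + epi) / 2.0
            agreed.append({**entry, "mp_measured_c": measured})
    return agreed, removed
```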

Results
The measured melting point values from ONSMP017, taken to be the average of the DrugBank and EPI Suite measured values, were compared to the EPI Suite predicted values with the following results (see figure 1):

    R-Squared: 0.3380
    AAE: 59.198 °C
    RMSE: 75.069 °C

In figure 1, notice how the EPI Suite caps the predicted melting point at 350 °C. According to the documentation, this is by design.

The EPI Suite R-Squared value of 0.338 is considerably lower than the value of 0.63 reported when it was used to predict the melting points of test set compounds from the PHYSPROP database. The AAE of 59.198 °C is also higher than the 48.6 °C reported for the same PHYSPROP test set.

The measured melting point values from ONSMP017, taken to be the average of the DrugBank and EPI Suite measured values, were compared to the MeltingPointModel002 predicted values with the following results (see figure 2):

    R-Squared: 0.7548
    AAE: 32.470 °C
    RMSE: 45.690 °C

The R-Squared value here (0.7548) is comparable to the R-Squared value for MeltingPointModel002 (0.7885) when used to predict the melting points of the 12634 compounds of ONSMP013, the set upon which it was trained.
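For readers who want to recompute the three statistics from ONSMP017, a minimal sketch is given below. It assumes R-Squared means the coefficient of determination, 1 - SS_res/SS_tot; the page does not state which definition was used, and the function name is hypothetical. AAE is the mean absolute error and RMSE the root mean squared error, both in the units of the inputs.

```python
import math


def regression_metrics(measured, predicted):
    """Return R-Squared, AAE, and RMSE for paired measured/predicted values.

    R-Squared is computed as 1 - SS_res/SS_tot (an assumed definition).
    """
    if len(measured) != len(predicted) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(measured)
    residuals = [m - p for m, p in zip(measured, predicted)]
    mean_measured = sum(measured) / n
    ss_res = sum(r * r for r in residuals)
    ss_tot = sum((m - mean_measured) ** 2 for m in measured)
    if ss_tot == 0:
        raise ValueError("measured values must not all be identical")
    return {
        "r_squared": 1.0 - ss_res / ss_tot,
        "aae": sum(abs(r) for r in residuals) / n,
        "rmse": math.sqrt(ss_res / n),
    }


# Toy data, not taken from ONSMP017.
stats = regression_metrics([100.0, 150.0, 200.0], [110.0, 150.0, 190.0])
```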

Discussion
Ten percent (62 out of 624) of the measured melting point values returned by EPI Suite differed from the DrugBank measured value by more than 10 °C. An earlier analysis on a different dataset [7] found that 10% of the EPI Suite measured values differed from the Alfa Aesar measured values by more than 5 °C. Upon investigation, many of these discrepancies point strongly to errors within the EPI dataset of measured values, but analysis of the full dataset is needed to confirm this.

When comparing the predictive power of EPI Suite (R2: 0.338) and MeltingPointModel002 (R2: 0.755), it is clear that MeltingPointModel002 is superior in this case.