Alfa Aesar and Karthikeyan Comparison

Researcher: Andrew Lang

Purpose

During curation of the Karthikeyan[1] dataset, before calculating CDK descriptors, several hundred compounds were discarded due to errors in their SMILES. Also, Noel O'Boyle et. al.[2] used the original Karthikeyan dataset and descriptors to construct a "PLS model with 68 descriptors which has an RMSE on an external test set of 46.6°C and R2 of 0.51." Whilst melting points are difficult to model, it was a concern to us that the Karthikeyan dataset may not be of the same 'quality' of the Alfa Aesar dataset; so we decided to create preliminary models on the two sets individually and compare the results.

Procedure

The entire dataset (with salts removed) of 13,264 compounds was run through the CDK Descriptor Calculator (v1.1.1) calculating all CDK descriptors except Charged Partial Surface Area, Ionization Potential, Amino Acid Count, and all geometrical descriptors. Descriptors with all 'NA' and Kier3 (a lot of entries with 'NA') were removed by hand. The following compounds were also removed because they also had 'NA' for several descriptors:
name
SMILES
mpC
CSID
diphenylphosphine oxide
c1ccc(cc1)P(=O)c2ccccc2
53.5
222625
periodic acid
OI(=O)(=O)=O
128
58684
benzeneseleninic anhydride
c1ccc(cc1)[Se](=O)O[Se](=O)c2ccccc2
164
78710
hydroxy(tosyloxy)iodobenzene
Cc1ccc(cc1)S(=O)(=O)OI(c2ccccc2)O
133.5
288160
(diphenylmethylidene)(triphenyl)-lambda~5~-phosphane
[P+](c1ccccc1)(c1ccccc1)(c1ccccc1)C(=C=1C=CC=CC=1)c1ccccc1
170
436055
acetyl(2,3,5,6-tetramethylphenyl)mercury
[Hg](c1c(C)c(C)cc(C)c1C)C(=O)
158
2035308
4-chlorobenzeneseleninic acid
c1cc(ccc1Cl)[Se](=O)O
184.5
4256138
phenylphosphinic acid
c1ccc(cc1)P(=O)O
84
10449255
(3Z)-N-hydroxy-2-(pyridin-2-yl)-3H-indol-3-imine 1-oxide
[O-][N+]=1c2ccccc2C(=NO)C=1c1[nH0]cccc1
216
10482712
(methanolato)(2,3,5,6-tetramethylphenyl)mercury
[Hg](OC)c1c(C)c(C)cc(C)c1C
158
10484121
The dataset was then split into two. One set consisting of Karthikeyan compounds, and the other set, everything else, primarily Alfa Aesar values.

Alfa Aesar

The dataset consisting primarily of Alfa Aesar melting points was loaded into R version 2.11.0. A random forest (randomForest 4.5-34: mydata.rf <- randomForest(mpC ~ ., data = mydata, keep.forest=FALSE,importance = TRUE,mtry=30,ntree=200)) was created to identify important descriptors, see figure 1.
AARandomForest.png
Figure 1. Random Forest for Alfa Aesar dataset.

Out of all these descriptors, the top seven descriptors were identified using leaps (nbest=1,really.big=TRUE), see figure 2.
AAleaps.png
Figure 2: Using leaps to identify the model descriptors.


These descriptors were used to create a model (fit <- lm(mpC ~ nHBDon + TopoPSA + nAtomP + ATSm1 + WTPT.2 + SCH.7 + khs.ssO, data=mydata)):
Residuals:
     Min       1Q   Median       3Q      Max
-332.296  -36.372   -4.219   31.947  293.879
 
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -515.68308   19.98649  -25.80   <2e-16 ***
nHBDon        29.82426    0.80388   37.10   <2e-16 ***
TopoPSA        0.85743    0.02823   30.37   <2e-16 ***
nAtomP         4.84429    0.17608   27.51   <2e-16 ***
ATSm1          0.36665    0.02079   17.63   <2e-16 ***
WTPT.2       252.02838   10.66507   23.63   <2e-16 ***
SCH.7         55.49989    2.93595   18.90   <2e-16 ***
khs.ssO      -12.20814    0.78862  -15.48   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 
Residual standard error: 53.82 on 9336 degrees of freedom
Multiple R-squared: 0.5715,     Adjusted R-squared: 0.5712
F-statistic:  1779 on 7 and 9336 DF,  p-value: < 2.2e-16

Karthikeyan

The curated Karthikeyan dataset of melting points was loaded into R version 2.11.0. A random forest (randomForest 4.5-34: mydata.rf <- randomForest(mpC ~ ., data = mydata, keep.forest=FALSE,importance = TRUE,mtry=30,ntree=200)) was created to identify important descriptors, see figure 3.
RandomForest.png
Figure 3. Random Forest for Karthikeyan dataset.

Out of all these descriptors, the top eight descriptors were identified using leaps(nbest=1,really.big=TRUE), see figure 4.
leaps.png
Figure 4. Top 8 descriptors found using leaps.

These descriptors were used to create a model (fit <- lm(mpC ~ nHBDon + TopoPSA + nAtomP + nRotB + ATSc2 + khs.ssO + SP.6 +khs.ssNH, data=mydata)):
Residuals:
    Min      1Q  Median      3Q     Max
-181.29  -32.54   -3.21   28.58  216.38
 
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  72.7366     2.2352   32.54   <2e-16 ***
nHBDon        8.1823     1.0914    7.50    8e-14 ***
TopoPSA       0.5131     0.0279   18.38   <2e-16 ***
nAtomP        1.7892     0.1241   14.42   <2e-16 ***
nRotB        -6.7954     0.3219  -21.11   <2e-16 ***
ATSc2       -81.3403     8.8544   -9.19   <2e-16 ***
khs.ssO     -15.3980     1.2521  -12.30   <2e-16 ***
SP.6         11.9262     0.4578   26.05   <2e-16 ***
khs.ssNH     13.9132     1.5327    9.08   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 
Residual standard error: 47.3 on 3901 degrees of freedom
Multiple R-squared: 0.449,      Adjusted R-squared: 0.448
F-statistic:  397 on 8 and 3901 DF,  p-value: <2e-16

Principal Component Analysis

Another way to compare the sets directly is to use principal components analysis. This was done in R using the following code:
pc1 <- prcomp(data, scale. = T)
x <- pc1$x
summary(pc1)
 
Importance of components:
                         PC1    PC2    PC3    PC4    PC5    PC6    PC7    PC8   PC9   PC10   PC11   PC12   PC13
Standard deviation     6.717 3.1516 2.7335 2.4382 2.2815 2.1583 1.9765 1.8830 1.823 1.7591 1.6844 1.5272 1.4895
Proportion of Variance 0.299 0.0658 0.0495 0.0394 0.0345 0.0308 0.0259 0.0235 0.022 0.0205 0.0188 0.0154 0.0147
Cumulative Proportion  0.299 0.3646 0.4141 0.4535 0.4879 0.5188 0.5446 0.5681 0.590 0.6106 0.6294 0.6449 0.6596
 
colvec <- mydata$Karthikeyan
plot(x[, 1], x[, 2], pch = 20, col = colvec+1,xlim=c(-30,12),ylim=c(-20,10))
legend(2, 10, c("Alfa Aesar+","Karthikeyan"), cex=1, col=colvec+1,pch=20)
The PCA shows that the Karthikeyan dataset and the Alfa Aesar+ dataset do not represent the same chemical space. The Karthikeyan compounds have a larger mean melting point temperature than the Alfa Aesar+ compounds but they are also more diverse in general, see figure 5.
pcascatter.png

Discussion and Conclusion

On comparing the two models, we see the the Alfa Aesar (AA) set has a higher R-squared value:
AA 7-descriptor model Adjusted R-squared: 0.5712 (Note: this is a better R2 value than O'Boyle et. al. found for the Karthikeyan dataset using more sophisticated methods)
K 8-descriptor model Adjusted R-squared: 0.4480
We also see that many (but not all) of the same descriptors appear in both models: nHBDon, TopoPSA, nAtomP, and khs.ssO.

Together with the PCA, this suggests that the Alfa Aesar dataset is easier to model than the Karthikeyan dataset but that both sets should be used to cover a larger chemical space. Outliers should be curated to see if they are true outliers or just plain wrong experimental values.

References

[1] Karthikeyan, M.; Glen, R.C.; Bender, A. General melting point prediction based on a diverse compound dataset and artificial neural networks. J. Chem. Inf. Model.; 2005; 45(3); 581-590
[2] Simultaneous feature selection and parameter optimisation using an artificial ant colony: case study of melting point prediction. Noel M O'Boyle, David S Palmer, Florian Nigsch, John Bo Mitchell (2008) Chemistry Central journal 2 p. 21