Alfa+Aesar+and+Karthikeyan

Alfa Aesar and Karthikeyan Comparison
Researcher: Andrew Lang

Purpose
During curation of the Karthikeyan[1] dataset, before calculating CDK descriptors, several hundred compounds were discarded due to errors in their SMILES. Also, Noel O'Boyle //et. al.//[2] used the original Karthikeyan dataset and descriptors to construct a "PLS model with 68 descriptors which has an RMSE on an external test set of 46.6°C and R2 of 0.51." Whilst melting points are difficult to model, it was a concern to us that the Karthikeyan dataset may not be of the same 'quality' of the Alfa Aesar dataset; so we decided to create preliminary models on the two sets individually and compare the results.

Procedure
The [|entire dataset] (with salts removed) of 13,264 compounds was run through the CDK Descriptor Calculator (v1.1.1) calculating all CDK descriptors except Charged Partial Surface Area, Ionization Potential, Amino Acid Count, and all geometrical descriptors. Descriptors with all 'NA' and Kier3 (a lot of entries with 'NA') were removed by hand. The following compounds were also removed because they also had 'NA' for several descriptors: The dataset was then split into two. One set consisting of Karthikeyan compounds, and the other set, everything else, primarily Alfa Aesar values.
 * name || SMILES || mpC || CSID ||
 * diphenylphosphine oxide || c1ccc(cc1)P(=O)c2ccccc2 || 53.5 || 222625 ||
 * periodic acid || OI(=O)(=O)=O || 128 || 58684 ||
 * benzeneseleninic anhydride || c1ccc(cc1)[Se](=O)O[Se](=O)c2ccccc2 || 164 || 78710 ||
 * hydroxy(tosyloxy)iodobenzene || Cc1ccc(cc1)S(=O)(=O)OI(c2ccccc2)O || 133.5 || 288160 ||
 * (diphenylmethylidene)(triphenyl)-lambda~5~-phosphane || [P+](c1ccccc1)(c1ccccc1)(c1ccccc1)C(=C=1C=CC=CC=1)c1ccccc1 || 170 || 436055 ||
 * acetyl(2,3,5,6-tetramethylphenyl)mercury || [Hg](c1c(C)c(C)cc(C)c1C)C(=O) || 158 || 2035308 ||
 * 4-chlorobenzeneseleninic acid || c1cc(ccc1Cl)[Se](=O)O || 184.5 || 4256138 ||
 * phenylphosphinic acid || c1ccc(cc1)P(=O)O || 84 || 10449255 ||
 * (3Z)-N-hydroxy-2-(pyridin-2-yl)-3H-indol-3-imine 1-oxide || [O-][N+]=1c2ccccc2C(=NO)C=1c1[nH0]cccc1 || 216 || 10482712 ||
 * (methanolato)(2,3,5,6-tetramethylphenyl)mercury || [Hg](OC)c1c(C)c(C)cc(C)c1C || 158 || 10484121 ||

Alfa Aesar
The dataset consisting primarily of Alfa Aesar melting points was loaded into R version 2.11.0. A random forest (randomForest 4.5-34: mydata.rf <- randomForest(mpC ~ ., data = mydata, keep.forest=FALSE,importance = TRUE,mtry=30,ntree=200)) was created to identify important descriptors, see figure 1. Out of all these descriptors, the top seven descriptors were identified using [|leaps] (nbest=1,really.big=TRUE), see figure 2.

These descriptors were used to create a model (fit <- lm(mpC ~ nHBDon + TopoPSA + nAtomP + ATSm1 + WTPT.2 + SCH.7 + khs.ssO, data=mydata)): code Residuals: Min      1Q   Median       3Q      Max -332.296 -36.372   -4.219   31.947  293.879

Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -515.68308  19.98649  -25.80   <2e-16 *** nHBDon       29.82426    0.80388   37.10   <2e-16 *** TopoPSA       0.85743    0.02823   30.37   <2e-16 *** nAtomP        4.84429    0.17608   27.51   <2e-16 *** ATSm1         0.36665    0.02079   17.63   <2e-16 *** WTPT.2      252.02838   10.66507   23.63   <2e-16 *** SCH.7        55.49989    2.93595   18.90   <2e-16 *** khs.ssO     -12.20814    0.78862  -15.48   <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 53.82 on 9336 degrees of freedom Multiple R-squared: 0.5715,    Adjusted R-squared: 0.5712 F-statistic: 1779 on 7 and 9336 DF,  p-value: < 2.2e-16 code

Karthikeyan
The curated Karthikeyan dataset of melting points was loaded into R version 2.11.0. A random forest (randomForest 4.5-34: mydata.rf <- randomForest(mpC ~ ., data = mydata, keep.forest=FALSE,importance = TRUE,mtry=30,ntree=200)) was created to identify important descriptors, see figure 3. Out of all these descriptors, the top eight descriptors were identified using leaps(nbest=1,really.big=TRUE), see figure 4. These descriptors were used to create a model (fit <- lm(mpC ~ nHBDon + TopoPSA + nAtomP + nRotB + ATSc2 + khs.ssO + SP.6 +khs.ssNH, data=mydata)): code Residuals: Min     1Q  Median      3Q     Max -181.29 -32.54   -3.21   28.58  216.38

Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 72.7366     2.2352   32.54   <2e-16 *** nHBDon       8.1823     1.0914    7.50    8e-14 *** TopoPSA      0.5131     0.0279   18.38   <2e-16 *** nAtomP       1.7892     0.1241   14.42   <2e-16 *** nRotB       -6.7954     0.3219  -21.11   <2e-16 *** ATSc2      -81.3403     8.8544   -9.19   <2e-16 *** khs.ssO    -15.3980     1.2521  -12.30   <2e-16 *** SP.6        11.9262     0.4578   26.05   <2e-16 *** khs.ssNH    13.9132     1.5327    9.08   <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 47.3 on 3901 degrees of freedom Multiple R-squared: 0.449,     Adjusted R-squared: 0.448 F-statistic: 397 on 8 and 3901 DF,  p-value: <2e-16 code

Principal Component Analysis
Another way to compare the sets directly is to use principal components analysis. This was done in R using the following code: code pc1 <- prcomp(data, scale. = T) x <- pc1$x summary(pc1)

Importance of components: PC1   PC2    PC3    PC4    PC5    PC6    PC7    PC8   PC9   PC10   PC11   PC12   PC13 Standard deviation    6.717 3.1516 2.7335 2.4382 2.2815 2.1583 1.9765 1.8830 1.823 1.7591 1.6844 1.5272 1.4895 Proportion of Variance 0.299 0.0658 0.0495 0.0394 0.0345 0.0308 0.0259 0.0235 0.022 0.0205 0.0188 0.0154 0.0147 Cumulative Proportion 0.299 0.3646 0.4141 0.4535 0.4879 0.5188 0.5446 0.5681 0.590 0.6106 0.6294 0.6449 0.6596

colvec <- mydata$Karthikeyan plot(x[, 1], x[, 2], pch = 20, col = colvec+1,xlim=c(-30,12),ylim=c(-20,10)) legend(2, 10, c("Alfa Aesar+","Karthikeyan"), cex=1, col=colvec+1,pch=20) code The PCA shows that the Karthikeyan dataset and the Alfa Aesar+ dataset do not represent the same chemical space. The Karthikeyan compounds have a larger mean melting point temperature than the Alfa Aesar+ compounds but they are also more diverse in general, see figure 5.

Discussion and Conclusion
On comparing the two models, we see the the Alfa Aesar (AA) set has a higher R-squared value: AA 7-descriptor model Adjusted R-squared: 0.5712 (Note: this is a better R2 value than O'Boyle //et. al.// found for the Karthikeyan dataset using more sophisticated methods) K 8-descriptor model Adjusted R-squared: 0.4480 We also see that many (but not all) of the same descriptors appear in both models: nHBDon, TopoPSA, nAtomP, and khs.ssO.

Together with the PCA, this suggests that the Alfa Aesar dataset is easier to model than the Karthikeyan dataset but that both sets should be used to cover a larger chemical space. Outliers should be curated to see if they are true outliers or just plain wrong experimental values.