Using OCHEM to model melting points

Researcher

Andrew S.I.D. Lang

Objective

To model melting points using the default settings on OCHEM (ochem.eu).

Background

OCHEM [1] is a web-based modeling service that allows users to upload data and build models using a variety of descriptors and methods. Some methods are still in beta, but a dedicated team is actively developing the platform. Models that users create do not appear in the list of public models until they have been approved, which happens after they have appeared in a peer-reviewed publication. However, models that you create can be accessed by direct link and used to create web services.

We will use OCHEM to create and publish a model (and web service) using the same datasets that were used in MeltingPointModel007.

Procedure

The first three attempts to create a model using CDK descriptors and R-based random forests failed. This is likely due to the R-based random forest method being 'experimental'. It is our hope that this will be fixed in the future because using R-based random forests is our preferred non-linear modeling technique.

The first successful model, ONSMP008a, was created using the default settings (ANN with E-State descriptors), except that we chose bagging as the validation method on the advice of Igor Tetko (personal correspondence), a developer of OCHEM who has used it to publish his logP and logS model, ALogPS 3.0.

A second model, ONSMP008b, was created using CDK descriptors and the default settings, except that bagging was used for validation and CORINA for 3D coordinate generation.
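
As an illustration of the bagging validation chosen for both models, here is a minimal Python sketch of bootstrap aggregation with out-of-bag predictions. It is not OCHEM's implementation: the base learner is a one-variable least-squares line standing in for the ANN, the function names (fit_line, bagging_oob_predictions) are hypothetical, the number of bagged models is an arbitrary illustrative value, and the data are made up.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x (a stand-in for the ANN)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        # Degenerate bootstrap sample: fall back to a constant model.
        return mean_y, 0.0
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return mean_y - slope * mean_x, slope


def bagging_oob_predictions(xs, ys, n_models=64, seed=0):
    """Average, for each sample, the predictions of the models that did
    not see that sample during training (its out-of-bag predictions)."""
    rng = random.Random(seed)
    n = len(xs)
    sums = [0.0] * n
    counts = [0] * n
    for _ in range(n_models):
        # Bootstrap: draw n training indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        in_bag = set(idx)
        a, b = fit_line([xs[i] for i in idx], [ys[i] for i in idx])
        for i in range(n):
            if i not in in_bag:
                sums[i] += a + b * xs[i]
                counts[i] += 1
    # None marks a sample that was in every bag (no out-of-bag estimate).
    return [s / c if c else None for s, c in zip(sums, counts)]


# Made-up, perfectly linear data (assumption: for illustration only).
xs = [float(i) for i in range(20)]
ys = [2.0 * x + 5.0 for x in xs]
oob = bagging_oob_predictions(xs, ys)
```

Because each sample is predicted only by models that never trained on it, the out-of-bag predictions give a validation estimate without setting aside a separate hold-out set.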

Results

A comparison of the models, summarized in the table below, shows that ONSMP008b, created using ANN instead of RF, has a higher AAE than ONSMP007 on both the training set (34.38 vs 12.08) and the test set (33.90 vs 29.36). This matches the findings of O'Boyle et al. [2], whose best model was also RF-based.

| model | descriptors | method | TrainingSetR2 | TrainingSetRMSE | TrainingSetAAE | TestSetR2 | TestSetRMSE | TestSetAAE |
|---|---|---|---|---|---|---|---|---|
| ONSMP008a | EState+ALogPS | ANN | 0.72 | 51.11 | 38.14 | 0.73 | 50.74 | 37.58 |
| ONSMP008b | CDK | ANN | 0.75 | 45.63 | 34.38 | 0.76 | 45.69 | 33.90 |
| ONSMP007 | CDK | RF | 0.97 | 16.73 | 12.08 | 0.81 | 40.18 | 29.36 |
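
For readers who want to reproduce statistics of the kind reported above from their own predictions, the following Python sketch computes R2, RMSE and AAE (average absolute error). It assumes R2 means the squared Pearson correlation between experimental and predicted values; the exact definition used by OCHEM or in ONSMP007 is not stated here, so treat that choice as an assumption. The function names and the example values are made up for illustration.

```python
import math


def rmse(observed, predicted):
    """Root-mean-square error between observed and predicted values."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def aae(observed, predicted):
    """Average absolute error between observed and predicted values."""
    n = len(observed)
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / n


def r_squared(observed, predicted):
    """Squared Pearson correlation (assumed definition of R2)."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov ** 2 / (var_o * var_p)


# Made-up melting points, not taken from any dataset in this notebook.
observed = [100.0, 150.0, 200.0, 250.0]
predicted = [110.0, 140.0, 210.0, 240.0]

print("R2  :", round(r_squared(observed, predicted), 2))
print("RMSE:", round(rmse(observed, predicted), 2))
print("AAE :", round(aae(observed, predicted), 2))
```

RMSE penalizes large individual errors more heavily than AAE does, so RMSE is never smaller than AAE for the same predictions, which is the pattern seen in every row of the table.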


Conclusion

OCHEM is a good tool for creating and distributing models; however, it is still in the development stage and crashes often during model creation. Of the approaches that ran successfully on OCHEM, ANN with CDK descriptors and bagging seems to be the best technique at this stage.
[Attempting to export model ONSMP008a as an Excel file does not allow exporting more than 100 SMILES (the limit for each month). This makes it effectively useless for sharing chemical information publicly. JCB]

References

[1] Sushko I, Novotarskyi S, Körner R, Pandey AK, Rupp M, Teetz W, Brandmaier S, Abdelaziz A, Prokopenko VV, Tanchuk VY, Todeschini R, Varnek A, Marcou G, Ertl P, Potemkin V, Grishina M, Gasteiger J, Schwab C, Baskin II, Palyulin VA, Radchenko EV, Welsh WJ, Kholodovych V, Chekmarev D, Cherkasov A, Aires-de-Sousa J, Zhang QY, Bender A, Nigsch F, Patiny L, Williams A, Tkachenko V, Tetko IV. Online chemical modeling environment (OCHEM): web platform for data storage, model development and publishing of chemical information. J Comput Aided Mol Des. 2011; 25(6):533-54.

[2] O'Boyle NM, Palmer DS, Nigsch F, Mitchell JBO. Simultaneous feature selection and parameter optimisation using an artificial ant colony: case study of melting point prediction. Chemistry Central Journal. 2008; 2:21. doi:10.1186/1752-153X-2-21