MeltingPointModel008

=Using OCHEM to model melting points=

Researcher
Andrew S.I.D. Lang

Objective
Use the default settings on OCHEM (ochem.eu) to model melting points.

Background
OCHEM [1] is a web-based modeling service that allows users to upload data and build models using a variety of descriptors and methods. Some methods are still in beta, but a dedicated team is actively developing the service. Models do not appear in the list of public models until they have been approved, which happens after they have appeared in a peer-reviewed publication. However, models that you create can be accessed by direct link and used to create web services.

We will use OCHEM to create and publish a model (and web service) using the same datasets as those used in MeltingPointModel007.

Procedure
The first three attempts to create a model using CDK descriptors and R-based random forests failed. This is likely due to the R-based random forest method being 'experimental'. It is our hope that this will be fixed in the future because using R-based random forests is our preferred non-linear modeling technique.

A first successful model, ONSMP008a, was created using all the default settings (ANN and E-State descriptors), except that we chose bagging as the validation method on the advice of Igor Tetko (personal correspondence), a developer of OCHEM who has used OCHEM to publish his logP and logS model, ALogPS 3.0.

A second model, ONSMP008b, was created using CDK descriptors and the default settings, except for bagging as the validation method and CORINA for 3D coordinate generation.
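The idea behind bagging as a validation method can be illustrated with a minimal Python sketch. This is a generic illustration, not OCHEM's implementation, which is not described here: many models are trained on bootstrap samples of the data, and each compound is predicted only by the models that did not see it during training (its "out-of-bag" models). The simple least-squares line stands in for the real learner (ANN), and the function names `fit_line` and `out_of_bag_predictions`, the number of bags, and the seed are all assumptions made for the example.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Least-squares line through the points; a stand-in for the real learner."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        # All x values identical: fall back to predicting the mean.
        return 0.0, my
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def out_of_bag_predictions(xs, ys, n_bags=64, seed=0):
    """Return one out-of-bag prediction per sample (None if never out of bag)."""
    rng = random.Random(seed)
    n = len(xs)
    collected = [[] for _ in range(n)]
    for _ in range(n_bags):
        # Bootstrap sample: draw n indices with replacement.
        bag = [rng.randrange(n) for _ in range(n)]
        in_bag = set(bag)
        slope, intercept = fit_line([xs[i] for i in bag], [ys[i] for i in bag])
        # Only samples the model has not seen contribute to validation.
        for i in range(n):
            if i not in in_bag:
                collected[i].append(slope * xs[i] + intercept)
    return [mean(p) if p else None for p in collected]


if __name__ == "__main__":
    xs = list(range(10))
    ys = [2 * x + 1 for x in xs]
    print(out_of_bag_predictions(xs, ys))
```

The out-of-bag predictions can then be compared with the measured values to estimate prediction error without setting aside a separate validation set.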

Results
Comparing the models, summarized in the table below, we see that ONSMP008b, created using ANN instead of RF, has a higher test-set AAE (33.90) than ONSMP007 (29.36). This matches the findings of O'Boyle //et al// [2], whose best model was also RF-based.
|| Model || Descriptors || Method || Training set R2 || Training set RMSE || Training set AAE || Test set R2 || Test set RMSE || Test set AAE ||
|| ONSMP008a || EState+ALogPS || ANN || 0.72 || 51.11 || 38.14 || 0.73 || 50.74 || 37.58 ||
|| ONSMP008b || CDK || ANN || 0.75 || 45.63 || 34.38 || 0.76 || 45.69 || 33.90 ||
|| ONSMP007 || CDK || RF || 0.97 || 16.73 || 12.08 || 0.81 || 40.18 || 29.36 ||
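For readers unfamiliar with the statistics in the table, the following Python sketch shows how RMSE, AAE (average absolute error), and R2 are commonly computed from measured and predicted values. It is an illustration under stated assumptions: R2 is taken here to be the coefficient of determination, which may differ from the definition used by OCHEM, and the sample values are invented for the example and are not taken from any of the models above.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def aae(observed, predicted):
    """Average absolute error."""
    n = len(observed)
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / n


def r_squared(observed, predicted):
    """Coefficient of determination (assumed definition of R2)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Invented values, for illustration only.
    observed = [10.0, 20.0, 30.0, 40.0]
    predicted = [12.0, 18.0, 33.0, 39.0]
    print(rmse(observed, predicted))
    print(aae(observed, predicted))
    print(r_squared(observed, predicted))
```

Because RMSE squares each error before averaging, it penalizes large deviations more heavily than AAE, which is why RMSE is the larger of the two in every row of the table.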

Conclusion
OCHEM is a good tool for creating and distributing models; however, it is still in the development stage and crashes often during model creation. Within OCHEM, using ANN with CDK descriptors and bagging seems to be the best technique at this stage.
**[Exporting model ONSMP008a as an Excel file is limited to 100 SMILES per month. This makes the export effectively useless for sharing chemical information publicly. JCB]**