COSMO-RS solubility prediction of ONS Challenge dataset

Researchers

Jean-Claude Bradley (Drexel U); Andrew Lang (ORU) and Andreas Klamt (COSMOlogic GmbH)

Objective

To compare the COSMO-RS [COSMOtherm_BP-TZVP_C21_01_10] solubility model against an Abraham descriptor based model (Model001), with a special emphasis on solutes with melting points near room temperature.

Background

Piperonal melts at 36 °C and has a solubility of 0.15 M in hexane at room temperature. In this case the Abraham models are in much better agreement (0.17 M and 0.51 M). COSMO-RS is also in agreement at 0.07 M.
Cyclohexanecarboxylic acid melts at 31 °C and appears essentially miscible with hexane even down to -8.5 °C (EXP184). Abraham Model001 performs very poorly in this case and predicts a solubility of 0.03 M. COSMO-RS also predicts a low solubility of 0.0078 M.

Results

Log Solubilities

The spreadsheet of all common compounds between COSMO-RS and model001 with uncapped values was taken and an additional column was added for LogS values for model001 (last row RMSE). This allows for evaluation, plotting and analysis using log solubility predictions. The same was done for a comparison of Ugi Products (last row RMSE). The model001 training set was removed leaving 303 compounds (last row RMSE), 88 of which are carboxylic acids.

RMSE (log solubilities)

COSMO-RS: 1.80 Test Set - 1.56 All Compounds (1.55 Ugi Products Only)
model001: 1.58 Test Set - 1.26 All Compounds (1.36 Ugi Products Only)
Note: the values below 0.01 M were actually reported as unmeasurably small in the notebook but were assigned arbitrary nonzero values in the sheet. If these values are removed, both models do much better, with an RMSE of 0.41 (model001) and 0.58 (COSMO-RS). The COSMO-RS predictions are expected to improve further (the expected RMSE is about 0.3 log units) if the dimerization cases (carboxylic acids in nonpolar solvents) are also excluded.
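The RMSE figures above, with and without the entries measured below 0.01 M, can be reproduced along the following lines. This is a minimal sketch, not the spreadsheet calculation actually used: the function name rmse_log10, its min_measured_M parameter, and the three data points are all made up for illustration.

```python
import math

def rmse_log10(measured_M, predicted_M, min_measured_M=None):
    """RMSE between log10(predicted) and log10(measured), both in mol/L.

    If min_measured_M is given, pairs whose measured value is below that
    threshold are skipped (e.g. the entries reported as unmeasurably small).
    """
    squared_errors = []
    for measured, predicted in zip(measured_M, predicted_M):
        if min_measured_M is not None and measured < min_measured_M:
            continue
        squared_errors.append((math.log10(predicted) - math.log10(measured)) ** 2)
    if not squared_errors:
        raise ValueError("no data points left after filtering")
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up illustrative values (mol/L), NOT taken from the ONS spreadsheets.
measured = [1.0, 0.1, 0.001]
predicted = [10.0, 0.1, 0.1]

rmse_all = rmse_log10(measured, predicted)
rmse_filtered = rmse_log10(measured, predicted, min_measured_M=0.01)
print(f"all points: {rmse_all:.2f}, measured >= 0.01 M only: {rmse_filtered:.2f}")
```

Because the error is taken in log units, a single entry with an arbitrary placeholder value far below the prediction can dominate the total, which is why removing those entries changes the RMSE so much.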

Selected Graphs

The following graphs use log solubility values. The first shows the solubility of acetylsalicylic acid in various solvents. The solubility predictions in methyl tert-butyl ether and hexane are interesting. In methyl tert-butyl ether, Model001 predicts a much lower value than that measured and COSMO-RS predicts a much higher value than measured, even taking trends into account (the same is true for diethyl ether and dibutyl ether). In hexane both models predict a much lower solubility than that measured. Both these values would benefit from verification and have been added to the dosol spreadsheet.
Figure: acetylsalicylic acid (acetylsalicylicacid.png)

Absolute Solubilities

A COSMO-RS spreadsheet (final row RMSE values) was created by removing a few entries where the solute reacts with the solvent from the COSMO-RS predictions for the ONSC data. A new column was added listing the model001 predictions. Entries were capped at 9M and given the value "NA" where there were COSMO-RS predictions but not model001 predictions (due to lack of Abraham coefficients for several solvents). Two new spreadsheets were created by removing entries with NA for model001, allowing direct comparison between the COSMO-RS predictions and the model001 predictions, both for all common compounds and for Ugi products only.
Plots were created using Tableau Public 5.1 starting with a plot comparing Abraham Model001 with COSMO-RS showing all solute/solvent combinations.
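The spreadsheet preparation described above (cap at 9 M, mark missing model001 predictions as "NA", then drop the NA rows for the direct comparison) can be sketched as follows. The row layout, the function names, and the example values are hypothetical, and the sketch assumes the 9 M cap applies to both prediction columns, which the text does not state explicitly.

```python
CAP_M = 9.0  # cap stated in the text (mol/L)

def cap(value_M):
    """Limit a predicted solubility to the 9 M cap."""
    return min(value_M, CAP_M)

def common_rows(rows):
    """Keep only rows that have both predictions, with each prediction capped.

    Each row is (solute, solvent, cosmo_rs_M, model001_M), where model001_M
    is the string "NA" when no model001 prediction exists.
    """
    kept = []
    for solute, solvent, cosmo_rs_M, model001_M in rows:
        if model001_M == "NA":
            continue  # no Abraham coefficients for this solvent
        kept.append((solute, solvent, cap(cosmo_rs_M), cap(model001_M)))
    return kept

# Hypothetical rows for illustration only; not values from the ONSC data.
rows = [
    ("soluteA", "methanol", 12.0, 3.0),
    ("soluteA", "solventX", 0.5, "NA"),
    ("soluteB", "hexane", 0.2, 15.0),
]

common = common_rows(rows)
for row in common:
    print(row)
```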

RMSE (M)

COSMO-RS: 4.38 (3.05 Ugi Products Only). Note that the absolute solubilities will not be very accurate due to missing dG_fus information. But the relative solubilities of the same compound in different solvents should be reasonably good.[AK]
model001: 2.51 (2.74 Ugi Products Only)
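The distinction between relative and absolute solubilities noted above can be illustrated numerically. For one solute at a fixed temperature, dG_fus is the same in every solvent, so leaving it out shifts all of that solute's log solubilities by the same constant. A constant shift changes the RMSE but leaves the Pearson correlation unchanged. The sketch below uses made-up log10 solubilities and a hypothetical offset; none of the numbers come from the ONSC data.

```python
import math

def rmse(a, b):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def pearson_r(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Made-up log10 solubilities of ONE solute in four solvents.
log_s_measured = [-1.0, -0.5, 0.0, 0.3]
log_s_relative = [0.1, 0.4, 1.1, 1.2]  # predictions without a dG_fus term

# Hypothetical constant shift standing in for the missing dG_fus term.
offset = -1.0
log_s_shifted = [value + offset for value in log_s_relative]

rmse_before = rmse(log_s_relative, log_s_measured)
rmse_after = rmse(log_s_shifted, log_s_measured)
r_before = pearson_r(log_s_relative, log_s_measured)
r_after = pearson_r(log_s_shifted, log_s_measured)
print(f"RMSE before/after shift: {rmse_before:.3f} / {rmse_after:.3f}")
print(f"Pearson r before/after shift: {r_before:.3f} / {r_after:.3f}")
```

This only holds per solute: when many solutes are pooled, each has its own offset, so both RMSE and correlation over the pooled set are affected.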

Selected Graphs (M on log scale)

As seen in the RMSEs, model001 performs better than COSMO-RS for absolute solubility prediction, especially for relatively small carboxylic acids (the primary type of compound used to train model001), though the relative solubilities of COSMO-RS are in general good and absolute solubility prediction can be improved with the addition of dG_fus information.
Figure: diphenylacetic acid (diphenylaceticacid.png)

Interestingly, the accuracy of COSMO-RS improves for Ugi products whereas model001 does worse. COSMO-RS does a particularly good job of predicting the absolute solubilities for Ugi product 176C as compared to model001, see below. [Wow that is quite interesting JCB]
Figure: Ugi176C (Ugi176C.png)

TODO

  1. Obtain the solubility of cyclohexanecarboxylic acid in at least 3 more solvents to determine Abraham descriptors.
  2. Obtain temperature curve for the solubility of cyclohexanecarboxylic acid in hexane.
  3. Obtain enthalpy of fusion for piperonal and cyclohexanecarboxylic acid.
  4. Generate predicted temperature curves for the solubility of cyclohexanecarboxylic acid and piperonal in hexane.
  5. Measure the solubility of acetylsalicylic acid in methyl tert-butyl ether and hexane.
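
One simple starting point for the predicted temperature curves in item 4 is the ideal-solubility relation, ln x = -(dH_fus/R)(1/T - 1/Tm), which neglects the heat-capacity change on melting and sets the activity coefficient to 1. The sketch below is an assumption-laden illustration, not the COSMO-RS or Abraham calculation: the melting point of piperonal (36 °C) is taken from the Background section, but the enthalpy of fusion is a placeholder, since obtaining the real value is still item 3 of this list. The result is a mole fraction and is solvent independent; a solvent-specific curve would additionally need an activity coefficient from a model.

```python
import math

R = 8.314462618  # gas constant, J/(mol K)

def ideal_mole_fraction(T_K, Tm_K, dH_fus_J_per_mol):
    """Ideal mole-fraction solubility of a solid at temperature T_K.

    Uses ln x = -(dH_fus / R) * (1/T - 1/Tm), neglecting the heat-capacity
    change on melting. At or above the melting point the solute is a liquid
    and the ideal mixture is treated as fully miscible (x = 1).
    """
    if T_K >= Tm_K:
        return 1.0
    return math.exp(-dH_fus_J_per_mol / R * (1.0 / T_K - 1.0 / Tm_K))

TM_PIPERONAL_K = 36.0 + 273.15   # melting point from the Background section
DH_FUS_PLACEHOLDER = 20000.0     # J/mol, PLACEHOLDER: not a measured value

# Ideal solubility from 0 to 35 degrees C in 5 degree steps.
curve = [
    (t_c, ideal_mole_fraction(t_c + 273.15, TM_PIPERONAL_K, DH_FUS_PLACEHOLDER))
    for t_c in range(0, 40, 5)
]
for t_c, x in curve:
    print(f"{t_c:3d} C  x_ideal = {x:.3f}")
```

For a solute melting this close to room temperature, the curve rises steeply towards x = 1 near the melting point, which is one reason small errors in the fusion data matter so much for compounds such as piperonal and cyclohexanecarboxylic acid.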

Email Conversation as Part of the Scientific Record

From: Christoph Loschen <loschen@cosmologic.de>
To: Jean-Claude Bradley <bradlejc@drexel.edu>
Date: Tue, Sep 3, 2013 at 5:56 AM

I have just seen among the very nice work on the ONS wikipage the page comparing COSMO-RS with the Abrahams model.
Please let me add a few comments: The COSMO-RS results which are shown there are only relative, thus it is not really fair
computing an RMSE and then comparing with a method which gives absolute values, rather a correlation coefficient should be used.
Though its stated there that the free energy of fusion is missing which would turn the relative into absolute predictions (basically by shifting the results), the comparison is in my opinion rather measleading. For example recomputing the acetylic salicylic example with some reasonable DGfus value yields a RMSE=0.5, which is clearly better than the Abrahams-Model.
Perhaps you add a some comments to the page making this a bit clearer.

Best regards,
Christoph

From: Jean-Claude Bradley <jeanclaude.bradley@gmail.com>
To: Andrew Lang <asidlang@gmail.com>
Cc: "Acree, Bill" <Bill.Acree@unt.edu>
Date: Tue, Sep 3, 2013 at 12:12 PM

Andy - I don't know if you got a copy of this but does he have a point? I'm copying Bill too
Jean-Claude

From: Andrew Lang <asidlang@gmail.com>
To: Jean-Claude Bradley <jeanclaude.bradley@gmail.com>
Cc: "Acree, Bill" <Bill.Acree@unt.edu>
Date: Tue, Sep 3, 2013 at 12:55 PM
He has some good points. I'd like to add his comments to the page (http://onschallenge.wikispaces.com/COSMO-RS) if possible.

Some points I'd make are:

1. Trying to do a comparison of COSMO-RS without having to have dG_fus information was part of the point, because the goal was to predict the solubility of virtual compounds. If you do have dG_fus information, I believe COSMO-RS does very well.
2. Our model001 (244 data points: http://onschallenge.wikispaces.com/AbrahamDescriptorsModel001) is now superseded by model003 (2475 data points: http://onschallenge.wikispaces.com/AbrahamDescriptorsModel003)

Andy

From: Jean-Claude Bradley <jeanclaude.bradley@gmail.com>
To: Andrew Lang <asidlang@gmail.com>
Cc: "Acree, Bill" <Bill.Acree@unt.edu>
Date: Tue, Sep 3, 2013 at 1:13 PM
Andy - those are great points - depending on dG_fus info for virtual molecules is probably no simple model in itself.
Also awesome to update on model-003.
The reality is that you can never truly compare apples to apples with any of these models - best we can do is point out the strengths and weaknesses and why a chemist would choose one model over another in a given objective

From: Jean-Claude Bradley <jeanclaude.bradley@gmail.com>
To: Christoph Loschen <loschen@cosmologic.de>
Cc: Andrew Lang <asidlang@gmail.com>, "Acree, Bill" <Bill.Acree@unt.edu>
Date: Tue, Sep 3, 2013 at 7:53 PM
Christoph - I discussed the situation with Andrew Lang who is responsible for the construction of the models. Here is what he suggests:
He has some good points. I'd like to add his comments to the page (http://onschallenge.wikispaces.com/COSMO-RS) if possible.

Some points I'd make are:

1. Trying to do a comparison of COSMO-RS without having to have dG_fus information was part of the point, because the goal was to predict the solubility of virtual compounds. If you do have dG_fus information, I believe COSMO-RS does very well.
2. Our model001 (244 data points: http://onschallenge.wikispaces.com/AbrahamDescriptorsModel001) is now superseded by model003 (2475 data points: http://onschallenge.wikispaces.com/AbrahamDescriptorsModel003)

If that satisfactory?
Jean-Claude

From: loschen <loschen@cosmologic.de>
To: Jean-Claude Bradley <jeanclaude.bradley@gmail.com>
Cc: "Acree, Bill" <Bill.Acree@unt.edu>, Andrew Lang <asidlang@gmail.com>
Dear Jean-Claude,

many thanks for your reply. An additional, explaining comment at the COSMO-RS wiki should be fine.
Without dgfus COSMO-RS can not be used for absolute predictions, for these cases either experimental information
(melting point + Hfus or a reference solvent) is necessary. Alternatively a QSPR estimation either directly for DGfus or separately for Tm and Hfus has to be used with a somewhat reduced accuracy of course.

Best regards,
Christoph

From: Jean-Claude Bradley <jeanclaude.bradley@gmail.com>
To: loschen <loschen@cosmologic.de>
Cc: "Acree, Bill" <Bill.Acree@unt.edu>, Andrew Lang <asidlang@gmail.com>
Date: Wed, Sep 4, 2013 at 9:45 PM
Christoph I'm glad we could be of help in clarifying the info - feel free to contact us again as the software and strategies evolve

Jean-Claude

From: Christoph Loschen <loschen@cosmologic.de>
To: Jean-Claude Bradley <jeanclaude.bradley@gmail.com>
Cc: "Acree, Bill" <Bill.Acree@unt.edu>, Andrew Lang <asidlang@gmail.com>
Date: Fri, Sep 20, 2013 at 4:15 AM
Dear Jean-Claude,

ok, just to let you know, we have recently published a new approach for taking into account the missing
DGfus values using 1-3 reference solubilities that does not need QM computations and compared it with the popular NRTL-SAC method:
http://pubs.acs.org/doi/abs/10.1021/ie3023675

Best regards,
Christoph

From: Jean-Claude Bradley <jeanclaude.bradley@gmail.com>
To: Christoph Loschen <loschen@cosmologic.de>
Cc: "Acree, Bill" <Bill.Acree@unt.edu>, Andrew Lang <asidlang@gmail.com>
Date: Sun, Sep 29, 2013 at 5:14 PM
Christoph - thanks - a great addition to solubility prediction!