Prediction and Hypothesis Formulation using DMax Chemistry Assistant - DMax001 & DMax002
Researcher: Andrew Lang

We report on the use of PharmaDM's DMax Chemistry Assistant to formulate hypotheses about solubility in methanol. The data come from the Open Notebook Solubility Challenge. Summary data for methanol were collected on 2010-08-31 (live data); solutes that react with methanol and solutes that are liquid at 25 °C were removed. This left 146 solubility values, which we sorted alphabetically.

DMax requires sdf format for molecules as input, so we converted from SMILES to 3D sdf using OpenBabel 2.2.3 via the command line and added key values d1-d146 manually (data in sdf format).
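
The key-tagging step could also be scripted rather than done by hand. The following is a minimal sketch, not the procedure actually used: it assumes the data field is called `key` (a hypothetical name; the field name in the original file is not stated) and that every record ends with the standard `$$$$` terminator. The conversion itself would be along the lines of `babel -ismi methanol.smi -osdf methanol.sdf --gen3d`, where the file names are placeholders and the `--gen3d` option should be checked against the installed OpenBabel version.

```python
def add_keys(sdf_text, field="key", prefix="d"):
    """Add a numbered data item (d1, d2, ...) to every record of an SDF string.

    `field` and `prefix` are illustrative defaults, not names taken from DMax.
    """
    out = []
    count = 0
    for line in sdf_text.splitlines():
        if line.strip() == "$$$$":
            # End of a record: insert the data item just before the terminator.
            count += 1
            out.append(f"> <{field}>")
            out.append(f"{prefix}{count}")
            out.append("")  # a blank line closes an SDF data item
        out.append(line)
    return "\n".join(out) + "\n"
```

Called on the text of the converted file, `add_keys` returns the same records in the same order, each carrying its own key value.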

DMax001

The sdf file was run through DMax using the default settings, with only the solubility values as observed values for hypothesis creation. This resulted in a DMax Chemistry Assistant model (MethanolPrediction001.dca) that can be used to predict solubility in methanol (RMSE: 2.26, Correlation: 0.34, Rank Correlation: 0.62); see Figure 1 for the results.
DMAXmodel001.png
Figure 1. Predicted vs. Measured Solubility in Methanol - DMax001
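
For readers who want to compute the same kind of statistics on their own predicted and measured values, here is a minimal sketch. It assumes that "Correlation" means the Pearson coefficient and "Rank Correlation" means the Spearman coefficient; how DMax computes its reported figures is not documented here, so this is an interpretation, not a reproduction.

```python
import math

def rmse(pred, obs):
    """Root-mean-square error between predicted and observed values."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs))

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))
```

A rank correlation well above the plain correlation, as reported for this model, is what one would expect when the predictions order the compounds reasonably well but miss the measured values by a wide margin for some of them.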

In addition to building models that you can share, DMax automatically formulates hypotheses about the data. For the data above, it hypothesizes (only showing hypotheses where the p-value <= 0.25):
For "Why High M?":
The compound contains a carboxylic acid and the compound contains an aliphatic chain. (p-value: 0.25)
 
For "Why Low M?":
The compound contains an amine. (p-value: 0.03)
The compound contains a non-aromatic ring. (p-value: 0.15)
The compound contains a non-aromatic 6-ring. (p-value: 0.17)
The compound contains a methyl group. (p-value: 0.17)
The compound contains a hetero-aromatic ring. (p-value: 0.17)

DMax002

As well as having its own set of molecular descriptors (electron flow, element, moiety, and structure relationship), DMax allows you to upload your own descriptors to include in the hypothesis creation stage. To see how this works, we created CDK descriptors by running the sdf file through Rajarshi Guha's CDK Descriptor Calculator GUI (v 1.1.1). We deleted columns that had any "NA" entries and columns with integer entries whose sum was less than three. This left 176 descriptors.
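
The two column-deletion rules can be written down as a short filter. This is a sketch under assumptions, not the procedure we followed: it takes the descriptor table as a header list plus rows of strings (for example, as read from a CSV export), and it treats a column as "integer" only if every entry parses as an integer.

```python
def is_int(text):
    """True if the string parses as an integer (so '1.0' does not count)."""
    try:
        int(text)
        return True
    except ValueError:
        return False

def filter_descriptors(header, rows):
    """Drop columns containing 'NA' and integer columns whose sum is below 3.

    header: list of descriptor names; rows: one list of strings per molecule.
    Returns the reduced header and rows.
    """
    keep = []
    for c in range(len(header)):
        column = [row[c] for row in rows]
        if any(value.strip() == "NA" for value in column):
            continue  # rule 1: any missing value removes the column
        if all(is_int(v) for v in column) and sum(int(v) for v in column) < 3:
            continue  # rule 2: near-empty count columns carry little signal
        keep.append(c)
    return [header[c] for c in keep], [[row[c] for c in keep] for row in rows]
```

The second rule removes count-type descriptors that are non-zero for only one or two molecules, which are too sparse to support a hypothesis.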

We re-ran the sdf file through DMax, again using the default settings, but this time we combined the CDK descriptors with DMax's built-in molecular descriptors. Doing so resulted in a second DMax Chemistry Assistant model (MethanolPrediction002.dca) that can be used to predict solubility in methanol (RMSE: 1.76, Correlation: 0.64, Rank Correlation: 0.68); see Figure 2 for the results.
DMAXmodel002.png
Figure 2. Predicted vs. Measured Solubility in Methanol - DMax002

The hypotheses now include CDK descriptors (only showing hypotheses where the p-value <= 0.15):
For "Why High M?":
VP-5 < 0.25 and geomShape > 0.59 (p-value: 0.01)
WPTP-1 < 22.11 and XLogP > 0.52 (p-value: 0.03)
 
For "Why Low M?":
The compound contains an amine. (p-value: 0.03)
WSNA-2 < -93.32 and LOBMAX < 1.82 and PPSA-3 > 23.67 (p-value: 0.04)
PNSA-2 < -221.74 (p-value: 0.05)
The compound contains a non-aromatic ring. (p-value: 0.15)
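
The descriptor-based hypotheses above are conjunctions of threshold tests, so they are easy to check against a compound's descriptor values. The sketch below is our own illustration and not part of DMax: `parse_rule` and `matches` are hypothetical helper names, and only the `<` and `>` comparisons that appear in the lists above are handled.

```python
import operator

# Only the comparison operators that occur in the hypotheses listed above.
OPS = {"<": operator.lt, ">": operator.gt}

def parse_rule(text):
    """Split 'A < 1.0 and B > 2.0' into a list of (name, comparison, threshold)."""
    conditions = []
    for part in text.split(" and "):
        name, op, value = part.split()
        conditions.append((name, OPS[op], float(value)))
    return conditions

def matches(rule, descriptors):
    """True if every condition of the rule holds for the given descriptor values."""
    return all(op(descriptors[name], threshold) for name, op, threshold in rule)

# Usage with made-up descriptor values (not taken from the data set):
rule = parse_rule("VP-5 < 0.25 and geomShape > 0.59")
print(matches(rule, {"VP-5": 0.10, "geomShape": 0.70}))
```

Applying each rule to the full descriptor table shows which compounds a hypothesis covers, which is a quick way to inspect it against the measured solubilities.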

Conclusion

DMax Chemistry Assistant can be used to create simple solubility models that can be shared and used to make predictions. However, the real value of DMax Chemistry Assistant lies in its ability to quickly and automatically formulate statistically significant scientific hypotheses that best match your data.