A Quick Rule of Thumb for the Domain of Applicability of ADModel003

Researcher: Andrew Lang

Objective

To provide a simple rule of thumb for the domain of applicability of Abraham Descriptor Model 003 (ADModel003).

Procedure

Starting with the same dataset as ADModel003, we kept only molecules that had measured values for all five Abraham descriptors: E, S, A, B, and V. We then calculated predicted values for all the descriptors using the RF models from ADModel003. Since the model predicts five different descriptors, each with a different standard deviation, we first divided all measured and predicted values for each descriptor by the standard deviation of the measured values for that descriptor, and then calculated the straight-line (linear) distance in 5D-space between the measured and the predicted descriptors. See this spreadsheet for calculation details. This gives a good measure of the prediction error over all descriptors for each molecule, referred to below as the 5D-error.
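The 5D-error calculation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `five_d_errors`, the dictionary layout, and the toy numbers are hypothetical, and the sample standard deviation is assumed because the text does not say whether the sample or population form was used in the spreadsheet.

```python
import math
import statistics

DESCRIPTORS = ("E", "S", "A", "B", "V")

def five_d_errors(measured, predicted):
    """Return the scaled 5D distance between measured and predicted
    descriptors for each molecule.

    measured, predicted: lists of dicts keyed by descriptor name
    (a hypothetical layout chosen for this sketch).
    """
    # Standard deviation of the MEASURED values, one per descriptor.
    # Assumption: sample standard deviation (n - 1 in the denominator).
    sd = {d: statistics.stdev([m[d] for m in measured]) for d in DESCRIPTORS}

    errors = []
    for m, p in zip(measured, predicted):
        # Dividing both values by sd and then subtracting is the same as
        # dividing their difference by sd.
        squared = sum(((m[d] - p[d]) / sd[d]) ** 2 for d in DESCRIPTORS)
        errors.append(math.sqrt(squared))
    return errors

# Toy data (not from the ADModel003 dataset): every descriptor takes the
# values 0, 1, 2 across three molecules, so each standard deviation is 1.
measured = [{d: float(i) for d in DESCRIPTORS} for i in range(3)]
predicted = [dict(m) for m in measured]
predicted[1]["E"] += 1.0   # one descriptor off by one standard deviation
predicted[2]["E"] += 3.0   # two descriptors off by 3 and 4 standard deviations
predicted[2]["S"] += 4.0

errors = five_d_errors(measured, predicted)
print(errors)
```

Scaling by the standard deviation keeps a descriptor with a wide numerical range from dominating the distance.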

The DMax Chemistry Assistant was used in a similar manner to an early exploration of methanol solubility data in order to find relationships that explain high and low values for the 5D-error. DMax automatically finds "scientific hypotheses that best match measurements of activity (or any other observable property) of small molecules. It also makes a statistical estimate of the confidence you can have in each hypothesis."[1]

Results

The results of the DMax run are presented in the tables below:
Why High Error?

hypothesis | p-value
XLogP < -0.03 | 9.27E-7
MLogP < 1.63 AND TopoPSA > 31.8 | 1.26E-6
The compound contains a phenol | 3.30E-7
The compound contains a heteroatom | 6.97E-4
A 5-ring is connected to a general functional group by a single bond | 3.54E-3

Why Low Error?

hypothesis | p-value
TopoPSA < 1.62 AND MLogP > 1.74 AND AMR > 8.25 | 1.93E-7
TopoPSA < 26.59 | 1.35E-5

With these results we see that whether you get a large or small error depends significantly on the polar surface area (TopoPSA) and the logarithm of the 1-octanol/water partition coefficient (logP), with TopoPSA being the more significant of the two. This was confirmed by creating linear models of the 5D-error versus TopoPSA and versus XLogP, with corresponding R2 values of 0.5045 and 0.1257 respectively.
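For readers who want to repeat the R2 comparison on their own data, a minimal sketch follows. It assumes an ordinary least-squares line with an intercept, for which R2 equals the squared Pearson correlation. The text does not state which tool fitted the original models, and the function name `r_squared` and the toy numbers are hypothetical.

```python
def r_squared(x, y):
    """R2 of a one-variable least-squares fit of y on x (with intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # For simple linear regression, R2 is the squared correlation coefficient.
    return sxy ** 2 / (sxx * syy)

# Toy numbers only, not the ADModel003 data.
perfect = r_squared([0, 1, 2, 3], [1, 3, 5, 7])   # exactly linear
noisy = r_squared([0, 1, 2, 3], [0, 1, 0, 1])     # weak trend
print(perfect, noisy)
```

Running the same function once with TopoPSA as `x` and once with XLogP as `x`, with the 5D-error as `y` in both cases, would give the two values being compared.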

Plotting a chemical space using Tableau Public with XLogP and TopoPSA as the x and y coordinates and coloring by 5D-error (red = bad), we see that certain regions of the chemical space correspond to, on average, high errors whereas other regions correspond to, on average, low errors.
[Figure: 20121115ChemicalSpace.png. Chemical space with XLogP on the x-axis and TopoPSA on the y-axis, colored by 5D-error.]
By analyzing the regions from the DMax hypotheses and the geometry of the above figure, we arrive at a quick rule of thumb for the domain of applicability of ADModel003: molecules that satisfy all of the following conditions will, on average, have significantly better predictions than those that do not:

XLogP > 0,
TopoPSA < 45, and
TopoPSA/(10 - XLogP) < 5.
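The three conditions above translate directly into a small check. The sketch below is an illustration, and the function name `in_domain` is hypothetical. One assumption is labeled in the code: the third condition is written in multiplied form, TopoPSA < 5 * (10 - XLogP), which is equivalent for XLogP < 10 and avoids a division by zero at XLogP = 10, a case the rule as stated does not cover.

```python
def in_domain(xlogp, topo_psa):
    """Return True if a molecule falls inside the rule-of-thumb domain
    of applicability."""
    return (
        xlogp > 0
        and topo_psa < 45
        # Assumption: multiplied form of TopoPSA / (10 - XLogP) < 5,
        # identical to the stated rule whenever XLogP < 10.
        and topo_psa < 5 * (10 - xlogp)
    )

# Caffeine, using the TopoPSA and XLogP values from the Example section.
caffeine_inside = in_domain(xlogp=-0.625, topo_psa=58.44)
print(caffeine_inside)  # False: XLogP is below 0 and TopoPSA is above 45
```

A molecule with, say, XLogP = 2 and TopoPSA = 20 satisfies all three conditions and would be reported as inside the domain.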

Using this rule of thumb on the original dataset we have the following results:
Location | Average 5D-error | n
Anywhere | 0.179 | 2144
Inside Domain | 0.110 | 1539
Outside Domain | 0.356 | 605
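As a quick consistency check on these figures, the overall average should equal the count-weighted mean of the inside and outside averages. The short sketch below uses only the numbers reported above; the variable names are illustrative.

```python
# (average 5D-error, n) as reported in the results above
inside = (0.110, 1539)
outside = (0.356, 605)

n_total = inside[1] + outside[1]
overall = (inside[0] * inside[1] + outside[0] * outside[1]) / n_total
ratio = outside[0] / inside[0]

print(n_total, round(overall, 3))  # 2144 0.179
print(round(ratio, 1))             # 3.2
```

The weighted mean reproduces the reported overall value of 0.179, and the average error outside the domain is roughly three times the average error inside it.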

Example
caffeine
SMILES: O=C2N(c1ncn(c1C(=O)N2C)C)C
CSID: 2424
TopoPSA: 58.44
XLogP: -0.625
Inside DOA? NO!
5D-error: 1.49

Conclusion

While our model is currently the best available (as of 2012-11-15), it can give large errors for certain molecules. Care should be taken when using predicted values for compounds outside the domain of applicability discussed above. Even inside the domain, some additional considerations may be needed. For example, DMax suggests that the model does better with molecules that contain no heteroatoms. In particular, there seems to be an issue with compounds that contain a phenol and, to a lesser though still significant extent, with compounds that contain a 5-ring connected to a general functional group via a single bond.

References

1. DMax Chemistry Assistant: http://dtai.cs.kuleuven.be/dmax/