Solubility Analyses

See Predictive Solubility for links to individual predictive models. See http://ibmlc2.chem.uga.edu/sparc/ for the SPARC online predictor.

The initial analysis is performed solvent by solvent. The distribution of concentrations, grouped by solvent, is shown below.

Clearly, the distributions are not normal, which suggests that linear regression may not be the best approach. Furthermore, the datasets (i.e., compounds tested in a given solvent) are still quite small. Methanol and ethanol are the largest groups, with just under 30 observations each (excluding duplicates). For reliable least squares models, I think we'll need something on the order of 50 unique compounds for any given solvent. At this point, models won't be built for chloroform, acetonitrile and toluene.

Structure Diversity
Next we consider the diversity of structures, again grouped by solvent. For this, we evaluate CDK extended hashed fingerprints (1024 bit). The average Tanimoto similarity over the whole dataset (i.e., all solvents) is 0.27, which is pretty low. We can also cluster the molecules in a given solvent based on their Tanimoto similarities. The dendrograms are shown below.
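As a rough illustration of the similarity measure used here, the following is a minimal Python sketch of the Tanimoto coefficient and its average over all pairs. It assumes each fingerprint is available as the set of its on-bit positions; the three bit sets are made-up placeholders, not CDK output, and the function names are hypothetical.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 1.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def mean_pairwise_tanimoto(fingerprints):
    """Average Tanimoto similarity over all unordered pairs of fingerprints."""
    pairs = list(combinations(fingerprints, 2))
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)


# Hypothetical on-bit sets standing in for 1024-bit hashed fingerprints
example_fps = [
    {1, 5, 9, 200},
    {1, 5, 77, 300},
    {9, 200, 512, 1023},
]

print(mean_pairwise_tanimoto(example_fps))
```

With real data, the bit sets would come from whatever toolkit generates the fingerprints; only the arithmetic is shown here.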



The numbers on the leaves correspond to the serial numbers of molecules within a given solvent. What is interesting to note is the skewed nature of some of the dendrograms. While the methanol and ethanol results are not too skewed, it looks like the chloroform and acetonitrile results tend to come from structurally similar series.
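To make the idea of a skewed dendrogram concrete, here is a small sketch of agglomerative clustering on a distance matrix (distance = 1 - Tanimoto similarity). The linkage method used for the dendrograms is not stated on this page, so average linkage is an assumption, and the 4 x 4 matrix is a made-up example in which molecule 3 is dissimilar to everything else.

```python
def average_linkage(distance):
    """Agglomerative clustering with average linkage.

    `distance` is a symmetric matrix (list of lists). Returns the merges as
    (cluster_id_a, cluster_id_b, distance) tuples; new clusters receive ids
    n, n + 1, ... in the order they are created.
    """
    n = len(distance)
    clusters = {i: [i] for i in range(n)}
    merges = []
    next_id = n
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for x in range(len(ids)):
            for y in range(x + 1, len(ids)):
                a, b = clusters[ids[x]], clusters[ids[y]]
                # Average distance between all cross-cluster pairs
                d = sum(distance[i][j] for i in a for j in b) / (len(a) * len(b))
                if best is None or d < best[0]:
                    best = (d, ids[x], ids[y])
        d, i, j = best
        clusters[next_id] = clusters.pop(i) + clusters.pop(j)
        merges.append((i, j, d))
        next_id += 1
    return merges


# Hypothetical distances (1 - Tanimoto) for four molecules
example_distance = [
    [0.00, 0.20, 0.30, 0.90],
    [0.20, 0.00, 0.25, 0.85],
    [0.30, 0.25, 0.00, 0.80],
    [0.90, 0.85, 0.80, 0.00],
]

for a, b, d in average_linkage(example_distance):
    print(a, b, round(d, 3))
```

A molecule that only joins the tree in the final merge, at a large distance, is the kind of isolated leaf discussed for the individual solvents.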

Consider chloroform. The serial numbers in the clustering correspond to the following SMILES:

     [,1] [,2]
[1,] "1"  "O=Cc1cc(ccc1Cl)[N+]([O-])=O"
[2,] "2"  "O=Cc1c(Cl)cccc1Cl"
[3,] "3"  "COc1cc(ccc1OC)C=O"
[4,] "4"  "O=Cc1ccc(Cl)cc1"
[5,] "5"  "O=Cc1ccc(N(C)C)cc1"
[6,] "6"  "O=Cc1ccc(O)cc1"
[7,] "7"  "O=[N+]([O-])c1ccc(C=O)cc1"
[8,] "8"  "O=S(=O)(c1ccc(cc1)C)C[N+]#[C-]"

So compound 8 is pretty much on its own, most likely because it is the only sulfonyl compound. It might therefore be good to expand the chloroform results by adding some more sulfonyl compounds. Similarly, 5 lies on its own by virtue of being the only amine, whereas there are several chlorine-substituted benzaldehydes. Testing some more amine-substituted benzaldehydes might be useful here.

For the case of methanol, you can see a branch composed of 7, 26, 6 and 29:

     [,1] [,2]
[1,] "7"  "O=C(O)CCCCCCC"
[2,] "26" "Oc1c(cccc1OC)C=O"
[3,] "6"  "O=C(O)CC(O)(C(=O)O)CC(=O)O"
[4,] "29" "c1ccccc1C(=O)O"

In this case, would it be possible to test some of the non-aromatic aldehydes? Right now, 20 of the 29 compounds tested in methanol are aromatic. Nine are straight-chain, and there are no cyclic alkanes (or derivatives). So it would be useful to test non-aromatic acyclic and cyclic compounds, if possible.
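A tally like the one above can be approximated directly from the SMILES strings. The sketch below is a crude heuristic, not a substitute for proper aromaticity perception in a cheminformatics toolkit: it assumes aromatic atoms are written in lowercase (as in the SMILES listed on this page) and will miss Kekulé-form input. The function name is hypothetical; the four SMILES are the methanol branch shown above.

```python
import re
from collections import Counter

# Lowercase atom symbols that SMILES uses for aromatic atoms
AROMATIC_ATOM = re.compile(r"[bcnops]")


def looks_aromatic(smiles):
    """Heuristic: True if the SMILES contains a lowercase (aromatic) atom."""
    # Drop bracket atoms and two-letter halogens so that the 'l' in Cl or
    # the 'r' in Br is never mistaken for an aromatic atom symbol.
    stripped = re.sub(r"\[[^\]]*\]", "", smiles)
    stripped = stripped.replace("Cl", "").replace("Br", "")
    return AROMATIC_ATOM.search(stripped) is not None


# Serial number -> SMILES for the methanol branch discussed above
methanol_branch = {
    "7": "O=C(O)CCCCCCC",
    "26": "Oc1c(cccc1OC)C=O",
    "6": "O=C(O)CC(O)(C(=O)O)CC(=O)O",
    "29": "c1ccccc1C(=O)O",
}

counts = Counter(
    "aromatic" if looks_aromatic(smiles) else "non-aromatic"
    for smiles in methanol_branch.values()
)
print(counts)
```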

I'm surprised that compounds 1 and 16 show up together in a branch, since they are BrCCCCCCCC(=O)O (a brominated straight-chain carboxylic acid) and O=Cc1cc(ccc1Cl)[N+]([O-])=O (a chloro-nitrobenzaldehyde) respectively.

Chemical Spaces
We can also look at the distribution of compounds by embedding them in a chemical space. Since one of the goals is to use (some of) the data for drug discovery related projects, I considered a chemical space defined by the following descriptors: ALogP, number of rotatable bonds, TPSA and molecular weight. Note that this choice is arbitrary, and results can (and will) differ based on the choice of chemical space.

Given these descriptors, we can then consider all molecules tested, perform a PCA and plot the first two components. I've annotated the plot with the structures of various compounds of interest. The red points are what I'd consider outliers; they appear to be straight chains or PAHs. Based on discussion with JCB, the experimental procedure is easier for aromatics than for straight-chain compounds. The point of this plot is to highlight the areas of chemical space (i.e., types of chemical compounds) that are explored or unexplored.
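For readers who want to see what the projection involves, here is a minimal pure-Python sketch of a two-component PCA on a four-descriptor table. Several things are assumptions rather than a record of what was done for the plot: the descriptors are scaled to unit variance before the PCA, the eigenvectors are found by power iteration with deflation, and the descriptor values are placeholders invented for illustration, not values computed for any tested compound.

```python
import math


def standardize(rows):
    """Centre each column and scale it to unit sample variance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [
        math.sqrt(sum((x - m) ** 2 for x in c) / (len(c) - 1))
        for c, m in zip(cols, means)
    ]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]


def correlation_matrix(z):
    """Covariance of standardized data, i.e. the correlation matrix."""
    n, p = len(z), len(z[0])
    return [
        [sum(r[i] * r[j] for r in z) / (n - 1) for j in range(p)]
        for i in range(p)
    ]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def dominant_eigenpair(matrix, iterations=500):
    """Largest eigenvalue and its unit eigenvector via power iteration."""
    p = len(matrix)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [dot(row, v) for row in matrix]
        norm = math.sqrt(dot(w, w))
        v = [x / norm for x in w]
    eigenvalue = dot(v, [dot(row, v) for row in matrix])
    return eigenvalue, v


def deflate(matrix, eigenvalue, v):
    """Remove the component already found so the next one becomes dominant."""
    p = len(matrix)
    return [
        [matrix[i][j] - eigenvalue * v[i] * v[j] for j in range(p)]
        for i in range(p)
    ]


def first_two_components(rows):
    z = standardize(rows)
    c = correlation_matrix(z)
    l1, v1 = dominant_eigenpair(c)
    l2, v2 = dominant_eigenpair(deflate(c, l1, v1))
    scores = [(dot(r, v1), dot(r, v2)) for r in z]
    return (l1, v1), (l2, v2), scores


# Placeholder rows: ALogP, rotatable bonds, TPSA, molecular weight
descriptors = [
    [1.2, 1, 17.1, 140.6],
    [2.9, 6, 37.3, 144.2],
    [0.4, 5, 132.1, 192.1],
    [1.9, 1, 37.3, 122.1],
    [3.1, 7, 37.3, 223.1],
    [1.5, 2, 35.5, 166.2],
]

(l1, v1), (l2, v2), scores = first_two_components(descriptors)
for pc1, pc2 in scores:
    print(round(pc1, 3), round(pc2, 3))
```

Each printed pair is one molecule's position in the plane of the first two components. Scaling matters here because molecular weight and TPSA are numerically much larger than ALogP and would otherwise dominate the components.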