ONSCpaper1

[|3000 words 50 refs max] =Crowdsourcing Solubility Measurements and Solvent Selection for Ugi Reactions using Open Notebook Science=

Abstract
We report on the crowdsourcing of non-aqueous solubility measurements in a transparent process where the laboratory notebooks and all associated raw data are made available to the public in near real time. The technologies used to store and process the information are free hosted services such as blogs, wikis and Google spreadsheets, which minimizes the entry barrier and allows easy copying of the infrastructure. Use of such vehicles has resulted in high page rankings in common search engines, which increases the probability that those looking for specific solubility data will discover it, even if previously unaware of the project. In addition, specialized browser-based search tools are provided for more targeted and sophisticated searching. The organization of information in this way has been convenient for distributed automation of parts of the scientific process. Such expeditious public sharing of experimental progress is also conducive to rapid collaboration. As an example, we demonstrate the application of solubility data in this database to the design of optimal conditions for Ugi reactions.
 * [Alternate abstract: The Open Notebook Science Challenge is a crowdsourcing project with the goal of developing an Open Data database of non-aqueous solubility measurements. Using Open Notebook Science, where laboratory notebooks and all associated raw data are made available to the public in near real time, allows users to interrogate the data at multiple levels of detail. This has been made possible by storing the experimental data and lab notebooks in open formats on free hosted services such as blogs, wikis and Google spreadsheets. Use of such platforms has resulted in high page rankings in search engines, which increases the probability that those looking for specific solubility data or collaboration opportunities will discover them, even if previously unaware of the project. In addition, community built browser based custom search tools are provided for more targeted and sophisticated searching. The organization of information in this way has also been convenient for distributed automation of parts of the scientific process, such as automatic solubility measurement using NMR spectroscopy, data curation, experiment prioritization, and real-time solubility modeling using the General Abraham Solvation Model. The database is now extensive, containing over 1,100 measurements, and the general chemistry community will find it useful for many purposes. As an example of the usefulness of the data, we report on the application of the solubility data for optimal solvent selection for Ugi reactions. -AL]**

Introduction
In academic research, chromatography has evolved as a general solution to the purification of products.[1] It is typically possible to find a solvent system and stationary phase that will allow isolation of a desired component. Despite the obvious utility of this approach, there are drawbacks. Chromatography is resource intensive, whether run manually or by using automation like HPLC. It also does not scale easily. Running exploratory experiments at a small scale may prove to be too expensive from a resource and time perspective if large quantities of product are later required.

Occasionally a desired product will crystallize from a reaction mixture **[or upon recrystallization of crude products.? -AL]** Chemists don't routinely design for this outcome, because with chromatography, it is almost always possible to obtain product separation. However, the ideal situation is for the product to crystallize, especially from the perspective of scaling up with low resource demands.

If the solubility of compounds in common organic solvents were made readily available, either as experimentally determined measurements or as predictions from useful models, chemists could use this information to select reaction conditions that maximize the likelihood of product crystallization. Even for reactions that do not generate crystalline products, knowledge of reactant solubility over a wide range of solvents can be helpful in selecting reaction conditions.

Chemistry and Web 2.0
Traditionally, access to most of the properties of chemical compounds has been limited to the literature, requiring paid subscriptions and expensive databases. However, it is becoming increasingly common to find useful properties freely available on the internet. Services such as ChemSpider[2] and the chemistry infoboxes on Wikipedia[3] are now providing basic compound properties such as boiling point, melting point, molecular weight and density. These databases make good use of Web 2.0 functionalities, essentially enabling everyone who has expertise and knowledge to contribute new information or curate what they come across. **[...knowledge to curate and contribute new data. -AL]**

The types of data associated with a given compound in these databases continue to expand regularly. Currently**[For example -AL]**, researchers can upload experimental spectra on ChemSpider. Free options for finding NMR, IR and UV spectra have always been rather limited. The online Sigma-Aldrich catalog is a common source for spectral information (as well as density when unavailable elsewhere). However, spectra in that catalog are in PDF format and cannot be expanded for clarity. The files submitted to ChemSpider are usually in the open JCAMP-DX format and are interactively viewable over common browsers via the Open Source JSpecView software.[4]

Curiously, solubility data for non-aqueous solvents are not readily available from the Web 2.0 chemistry world. This is somewhat surprising since virtually all organic chemistry reactions require a solvent and the measurement of solubility is not much more difficult than that of a boiling point. There are scattered reports in the literature, and the Beilstein CrossFire database[5] can be searched by solubility to identify these articles. However, even when data are available there, chemists often report solubility in common solvents in non-quantitative terms such as "freely soluble" or "sparingly soluble."

Clearly it would be convenient to search for non-aqueous solubility using public interfaces that make use of Web 2.0 practices. By adopting an Open Data strategy, many such interfaces can be constructed and used by anyone. In this article, we describe how to create a collaborative platform to share and analyze data using mainly free hosted services on the web.

The Open Notebook Science Challenge
In September of 2008 the Open Notebook Science Challenge was launched.[6] The project invited anyone to contribute non-aqueous solubility measurements. In November, Submeta sponsored ten $500 awards for participating students in the US and the UK, issued once per month.[7] The Nature Publishing Group sponsored the Challenge with a one-year subscription to Nature magazine for the first 3 winners.[8] Sigma-Aldrich has also donated chemicals.[9] The Challenge judges originate from multiple disciplines: an organic chemist (JCB), a mathematician (AL), a biochemist (CN), a molecular biologist (BH), a computational chemist (RG) and an NMR expert (AW). Both graduate and undergraduate students from the University of Southampton, Drexel University, Syracuse University and Oral Roberts University have won.

The Challenge requires participants to record their experiments using Open Notebook Science (ONS), a term introduced in 2006 to reflect the complete public sharing of a laboratory notebook in as close to real time as possible.[10] The ONS UsefulChem project was used as a model to build the infrastructure.[11] A common public wiki (hosted on Wikispaces), shared between the participants, serves as the platform to record the laboratory notebook pages. Calculations are typically stored on public Google Spreadsheets to enable others to verify calculations and assumptions. When appropriate, images are stored on Flickr or uploaded directly to the wiki. Spectra (typically NMR) are posted on a server in JCAMP-DX format so that they can be queried interactively (e.g. to zoom or integrate) in a web browser using the Java application JSpecView.[4]
 * Figure1 : ONS Challenge Workflow Summary**

Over the course of the first 18 months of the ONS Challenge, 697 measurements were recorded from experiments performed by students as part of the project. An additional 859 measurements from the literature were added to a common solubility summary spreadsheet.[12] After flagging likely erroneous data, ignoring measurements not made at room temperature and averaging the remaining duplicates, a total of 1120 unique solute/solvent combinations remain.[13] The solvents with the most unique compound measurements were methanol (184), ethanol (81), THF (72), toluene (49) and acetonitrile (40). Aldehydes and carboxylic acids made up the majority of solutes. Since the values are stored in a public Google Spreadsheet, a convenient API is available for custom services to query. In addition, the data with relevant links are made available on ChemSpider **[check with Tony for completeness]** and selected Wikipedia pages. Compilations of the solubilities are also available in a book format, where different editions are associated with specific snapshot archives of the entire project.[14]
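The curation steps described above (dropping flagged values, ignoring measurements not made at room temperature, and averaging duplicates) can be sketched in a few lines. This is only an illustration under stated assumptions: the field names (`solute`, `solvent`, `conc_M`, `temp_c`, `flag`) and the 20 to 25 °C window taken as "room temperature" are hypothetical and do not come from the summary spreadsheet itself.

```python
from collections import defaultdict
from statistics import mean


def unique_solubilities(rows, room_temp_c=(20.0, 25.0)):
    """Average the usable measurements for each (solute, solvent) pair.

    `rows` is a list of dicts with hypothetical keys: solute, solvent,
    conc_M, temp_c and an optional flag. The temperature window is an
    assumption made for this sketch.
    """
    low, high = room_temp_c
    grouped = defaultdict(list)
    for row in rows:
        if row.get("flag") == "DONOTUSE":
            continue  # measurement judged to be in error
        if not (low <= row["temp_c"] <= high):
            continue  # not a room temperature measurement
        grouped[(row["solute"], row["solvent"])].append(row["conc_M"])
    # One averaged value per unique solute/solvent combination
    return {pair: mean(values) for pair, values in grouped.items()}


# Placeholder rows for illustration only; these are not project data.
demo_rows = [
    {"solute": "solute A", "solvent": "methanol", "conc_M": 1.0, "temp_c": 22.0},
    {"solute": "solute A", "solvent": "methanol", "conc_M": 3.0, "temp_c": 23.0},
    {"solute": "solute A", "solvent": "methanol", "conc_M": 9.0, "temp_c": 22.0,
     "flag": "DONOTUSE"},
    {"solute": "solute A", "solvent": "methanol", "conc_M": 5.0, "temp_c": 50.0},
]
demo_unique = unique_solubilities(demo_rows)
```

Keeping the flagged rows in the input and filtering them at query time, rather than deleting them, mirrors the idea that problematic results remain available to anyone who wishes to inspect them.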

After examining the results within the context of identical or similar measurements, values that deviated substantially were investigated. Having access to the detailed log of each experiment was invaluable in assessing the likely validity of a data point. For example, after the calculations had been checked for errors, an experiment with short or unreported mixing times would be given less credence than a thorough report with extensive mixing to ensure saturation. Similarly, when results conflicted, a direct technique like NMR would be expected to provide more reliable numbers than a determination by evaporation, where the solute may have partially evaporated. Measurements judged to be in error were flagged as "DONOTUSE" and a note was recorded in the Solubility Summary Spreadsheet to indicate the reason.[12] That way, others who wish to examine problematic results and judge for themselves may do so easily.

Measuring solubility using NMR spectroscopy
The most common technique used in the ONS Challenge at the time of writing involves NMR measurement. Initially, an internal reference was added to the supernatant of a saturated solution, similar to the method reported by Lin et al.[15] However, the requirement for volume measurement of both the reference compound and the supernatant was later eliminated by considering the solvent itself as the internal reference. In order to do this, two new assumptions need to be introduced. First, we assume that the volumes of solvent and solute are additive in order to convert molar ratios to molarities. Second, we must estimate the density of solid solutes, a property generally not available experimentally. Fortunately, a service for predicting densities is freely available from ChemSpider.[16]
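The conversion from an NMR molar ratio to a molarity under the additive-volume assumption can be illustrated with a short calculation. This is a sketch of the arithmetic implied by the two assumptions above, not the project's actual spreadsheet formula; the function and parameter names are hypothetical, densities are taken in g/mL and molecular weights in g/mol.

```python
def mole_ratio(integral_solute, n_h_solute, integral_solvent, n_h_solvent):
    """Moles of solute per mole of solvent from two integrals, each
    normalized by the number of hydrogens contributing to its peak."""
    return (integral_solute / n_h_solute) / (integral_solvent / n_h_solvent)


def molarity_from_ratio(ratio, mw_solute, density_solute,
                        mw_solvent, density_solvent):
    """Solute concentration in mol/L, assuming additive volumes.

    Works per mole of solvent: the solution volume is taken as the
    volume of one mole of solvent plus the volume of `ratio` moles of
    solute (the density of a solid solute is typically a predicted value).
    """
    vol_solvent_ml = mw_solvent / density_solvent        # mL per mol of solvent
    vol_solute_ml = ratio * mw_solute / density_solute   # mL of dissolved solute
    return 1000.0 * ratio / (vol_solvent_ml + vol_solute_ml)


# Round placeholder numbers chosen to make the arithmetic easy to follow.
demo_ratio = mole_ratio(3.0, 1, 90.0, 3)                         # 0.1 mol/mol
demo_molarity = molarity_from_ratio(demo_ratio, 100.0, 1.0, 50.0, 1.0)
```

With these placeholder inputs the solution volume per mole of solvent is 50 mL + 10 mL, so the concentration is 0.1 mol in 60 mL, or about 1.67 M. Neglecting the solute volume would instead give 2.0 M, which shows why the solute density matters for concentrated solutions.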

One advantage of NMR compared to the previous techniques is that calibration curves are not strictly required, assuming that the integration values are proportional to the quantities of the materials. To ensure this, we have recently modified the default NMR acquisition parameters to d1=50s to allow hydrogens more time to relax. In the past, hydrogens on groups known to relax more quickly (such as methyls) were selected for integration. However, some solutes (such as aromatic aldehydes) do not have groups with known rapid relaxation times, and in these cases artificially low values may result from using default settings for routine H NMR.

An excellent example of the benefit of using NMR to measure solubility occurred during the routine measurement of the solubility of 4-nitrobenzaldehyde in methanol. Over several hours about half the solute was converted into a hemiacetal, a transformation that was missed in a prior report of this solubility measurement using GLC.[17] Price recently reported this finding via a similar H NMR investigation.[18] This conversion was also found for other aromatic aldehydes bearing electron-withdrawing groups in alcoholic solvents.[19]

Of all the advantages, perhaps the greatest in using NMR to measure solubility is the elimination of otherwise nearly intractable errors such as adding the wrong solute or solvent. In our Semi-Automated Measurement of Solubility (SAMS) procedure[20], the NMR spectra of the saturated solutions are exported from the instrument in the open JCAMP-DX format then uploaded to a server where they can be queried dynamically from the public internet. Solvent and solute peaks are identified manually using JSpecView then the ppm values bounding the solvent and solute peaks are entered into a Google spreadsheet. A spreadsheet cell then queries a webservice which downloads the spectrum, parses the data, calculates the integral under each peak (accounting for baseline drift) and reports the integral values back to the spreadsheet. These values are then used to calculate the concentration of the solute in the solvent. Once calculated, the integral values for specific ppm ranges are stored in a database, which allows the webservice to recall values quickly rather than recalculate them each time the spreadsheet is opened. The use of a system such as this, heavily dependent upon**[... I wouldn't say 'heavily dependent upon' - sounds vulnerable -AL]** public spreadsheets, makes computational or transcription errors much easier to discover by anyone. This is especially important for crowdsourcing projects where participating students may have vastly different competencies and experience.
 * [Figure 2 SAMS spreadsheet and show calling webservice][Need integration workflow graphic -AL]**

Quality Control and Recommendation Systems for solubility measurements
As measurements are recorded from a heterogeneous group of participants using different methods, it is possible that some experimental mistakes will be made. By providing intuitive tools to compare results, outliers can be quickly identified and the experimental logs can be perused to identify the source of the discrepancy. As mentioned above, measurements that are clearly in error are marked "DONOTUSE" and do not appear in routine query results.

As the project has grown, manual inspection to look for errors or to decide which new measurements to prioritize has become more challenging. To address this, we have started to add automated and semi-automated processes that flag potential errors and request new solubility measurements. For example, an Outlier Bot queries the database periodically and flags measurements where the ratio of the standard deviation to the mean or the Grubbs' test statistic is beyond a threshold value. Such flagged measurements are added to the DoSol Google Spreadsheet[21], where a list of pending requests for solubility measurements is maintained, with a record of who (or what) made the request and the reason why. A priority column allows for adjusting the order in which the requests will be processed in each laboratory according to resources available and urgency. Thus, even though requests may be made by anyone, ultimately the project managers control the order in which experiments are done.
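The two statistics mentioned for the Outlier Bot can be computed as follows. This is a minimal sketch rather than the bot's implementation: the threshold values are hypothetical, and a fixed cutoff for the Grubbs' statistic is a simplification, since the formal test compares the statistic against a critical value that depends on the number of measurements and the chosen significance level.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Ratio of the sample standard deviation to the mean."""
    return stdev(values) / mean(values)


def grubbs_statistic(values):
    """Largest absolute deviation from the mean, in units of the
    sample standard deviation."""
    m = mean(values)
    s = stdev(values)
    return max(abs(v - m) for v in values) / s


def needs_review(values, cv_max=0.5, g_max=2.0):
    """Flag a set of repeat measurements for one solute/solvent pair.

    cv_max and g_max are placeholder thresholds for this sketch.
    """
    if len(values) < 2 or mean(values) == 0 or stdev(values) == 0:
        return False  # not enough spread information to judge
    return (coefficient_of_variation(values) > cv_max
            or grubbs_statistic(values) > g_max)


# Placeholder concentrations (M), for illustration only.
demo_values = [1.0, 1.0, 1.0, 5.0]
demo_cv = coefficient_of_variation(demo_values)
demo_g = grubbs_statistic(demo_values)
demo_flag = needs_review(demo_values)
```

For the placeholder values the mean and the standard deviation are both 2.0, so the ratio is 1.0 and the set is flagged. A flagged set would then become a request in a queue such as DoSol, where a person decides whether and when to repeat the measurement.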

Engineering the outcome of a Ugi reaction using solubility data
In the ONS Challenge, priority was given to the measurement of the solubility of solid aldehydes, carboxylic acids, primary amines and isonitriles. These are the components used in the **[4-component? -AL]** Ugi reaction, which we have used in our design of novel anti-malarial agents.[22] It was observed that occasionally the Ugi product would precipitate in pure form from the reaction mixture.[23] As described above, such an outcome is very beneficial to efficiently and cheaply produce desired products at larger scales.

In some cases, not all of the starting materials were sufficiently soluble to make stock solutions. It was sometimes possible to add the reactants in solid form and continue mixing until reaction with the other starting materials generated soluble intermediates. Fortunately, in some cases the final Ugi product only precipitated after enough of the starting materials were consumed to produce a clear solution for a brief window of time. However, this especially fortuitous sequence of events may not take place for other reactions involving sparingly soluble starting materials. For example, if the Ugi product precipitates before all of the starting materials are brought into solution, there will never be a point where a clear solution is obtained. Also, this method does not lend itself easily to automation, where a liquid handler is generally used to dispense reagents.[23]

In these cases, knowledge of the solubility of the starting materials and Ugi products could be used to select suitable solvents. For example, consider the Ugi reaction depicted in Figure 3.[24] As shown in Figure 4, the aldehyde component phenanthrene-9-carboxaldehyde **(3)** suffers from very low solubility in most common solvents, including methanol, often the first choice for a Ugi reaction.[25] An investigation of the solubility data in the ONS Challenge database as it existed in Feb 2010 [10] allowed for an evaluation of other solvents. Liquid reagents such as furfurylamine (2) and n-butylisonitrile (4) generally show good solubility in most solvents, and phenylacetic acid (1) proved to be very soluble in most solvents as well. Thus the challenge consisted of finding a solvent with a reasonably high solubility for phenanthrene-9-carboxaldehyde (3) and a low solubility for the Ugi product (5).

The problem of solvent selection is depicted in Figure 6. High-boiling solvents (>100 °C) are excluded to enable easy drying of the product. The measured solubilities of (3) and (5) in the remaining solvents are represented as bars. Thresholds for the minimum solubility of (3) and the maximum solubility of (5) are set at 0.3 M and 0.03 M respectively. Solubility measurements that fall in the desired range are colored green and the others are red. From this analysis only benzene satisfies the criteria for both reactant and product.
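The selection logic described above amounts to a simple filter, sketched here. The thresholds (0.3 M, 0.03 M and 100 °C) are those given in the text; everything else is an assumption for illustration, including the field names, the choice to treat values exactly at a threshold as acceptable, and the placeholder solvents, whose numbers are invented and are not measurements from the database.

```python
def select_solvents(solvents, min_reactant_M=0.3, max_product_M=0.03,
                    max_bp_c=100.0):
    """Return the names of solvents that dissolve the limiting reactant
    well, dissolve the product poorly, and boil at or below max_bp_c.

    `solvents` maps a name to a dict with hypothetical keys:
    bp_c, reactant_M and product_M.
    """
    chosen = []
    for name, props in solvents.items():
        if props["bp_c"] > max_bp_c:
            continue  # high-boiling: product would be hard to dry
        if (props["reactant_M"] >= min_reactant_M
                and props["product_M"] <= max_product_M):
            chosen.append(name)
    return sorted(chosen)


# Invented placeholder values, for illustration only.
demo_solvents = {
    "solvent_A": {"bp_c": 80.0, "reactant_M": 0.40, "product_M": 0.02},
    "solvent_B": {"bp_c": 65.0, "reactant_M": 0.05, "product_M": 0.01},
    "solvent_C": {"bp_c": 77.0, "reactant_M": 0.50, "product_M": 0.10},
    "solvent_D": {"bp_c": 153.0, "reactant_M": 0.60, "product_M": 0.01},
}
demo_choice = select_solvents(demo_solvents)
```

In this placeholder set only `solvent_A` passes all three criteria: `solvent_B` dissolves too little reactant, `solvent_C` dissolves too much product, and `solvent_D` is excluded by its boiling point. Exposing the thresholds as parameters makes it easy to see how sensitive the shortlist is to the chosen cutoffs.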


 * Figure 3: Ugi reaction**

[fix according to text] When the reaction was carried out in benzene the product did precipitate and was isolated in x% yield [11]. Precipitation was much slower than in methanol, which is consistent with the known accelerating property of protic solvents in Ugi reactions.[5] Measuring the concentration of the Ugi product in the reaction mixture and in the washings revealed that more of it was dissolving than in measurements with (5) by itself. By adding the other reactants alone or in combination, it was found that the presence of phenylacetic acid increased the amount of (5) that would dissolve. This co-solute effect was not observed in methanol, which suggests that hydrogen bonding between (1) and (5) may be responsible for the increased solubility in benzene.[12] The effect is problematic because it reduces the yield of the precipitate obtained directly from the reaction mixture after washing. It also suggests more broadly, for many other Ugi reactions, that the presence of a protic solvent is preferable not only to accelerate the reaction but also to favor higher yields in a precipitation strategy. It may be that optimal mixtures of protic and aprotic solvents can be found that will minimize the co-solute effect while providing sufficient solubility of reactants such as (3).
 * Figure 6: Measured solubility data for phenanthrene-9-carboxaldehyde and Ugi product 5.**

The use of modeling to select a solvent
In the previous example, experimental solubility measurements were used to guide solvent selection. Although successful, it would be beneficial to be able to screen solvents without having to perform solubility measurements for each one. Using a technique developed by Abraham[13] we have calculated the predicted solubilities of both (3) and (5) in x solvents.[10] This method can be applied to any solute provided that experimental values are available in several solvents. Figure 7 depicts the predicted suitability of each solvent using the same criteria as above and excluding ketones, which are likely to compete with the aldehydes in Ugi reactions. Based on the information available in the February 2010 archive, this approach flagged ethyl acetate as a suitable solvent. In practice, the use of ethyl acetate yielded (5) in x% yield as a precipitate.
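For readers unfamiliar with this kind of model, the following sketch shows how a solubility prediction can be obtained from the commonly published form of the Abraham general solvation equation, log P = c + eE + sS + aA + bB + vV, where P is the water-to-solvent partition coefficient, the lowercase terms are solvent coefficients and the uppercase terms are solute descriptors. The text above does not spell out the equation, so its exact form as used in the project, the relation log S(solvent) = log S(water) + log P, and every numerical value below are assumptions made for illustration; the coefficients and descriptors are placeholders, not fitted values.

```python
def abraham_log_p(coeffs, descriptors):
    """log10 of the water-to-solvent partition coefficient:
    c + e*E + s*S + a*A + b*B + v*V."""
    return (coeffs["c"]
            + coeffs["e"] * descriptors["E"]
            + coeffs["s"] * descriptors["S"]
            + coeffs["a"] * descriptors["A"]
            + coeffs["b"] * descriptors["B"]
            + coeffs["v"] * descriptors["V"])


def predicted_solubility(log_s_water, coeffs, descriptors):
    """Predicted molar solubility in the organic solvent, assuming
    log10 S(solvent) = log10 S(water) + log10 P."""
    return 10 ** (log_s_water + abraham_log_p(coeffs, descriptors))


# Placeholder solvent coefficients and solute descriptors (not real values).
demo_coeffs = {"c": 0.2, "e": 0.1, "s": -0.5, "a": 0.0, "b": -3.0, "v": 4.0}
demo_descriptors = {"E": 1.0, "S": 1.0, "A": 0.0, "B": 0.5, "V": 1.0}
demo_log_p = abraham_log_p(demo_coeffs, demo_descriptors)
demo_solubility = predicted_solubility(-3.3, demo_coeffs, demo_descriptors)
```

In practice the solute descriptors are obtained by regression against measured solubilities in solvents whose coefficients are known, which is why experimental values in several solvents are needed before predictions can be made for the remaining ones.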


 * Figure 7: Predicted solubility data for phenanthrene-9-carboxaldehyde and Ugi product 5.**

Experimental
Synthesis of Ugi product (5) in methanol, benzene and ethyl acetate.

Conclusion
We have demonstrated that the measurement of non-aqueous solubility data can be crowdsourced, stored and disseminated using free hosted Web 2.0 technologies. By using an Open Notebook Science strategy,