DC-Exp-001

Dissertation, Chemistry, Experiment 001 (DC-Exp-001) //Don Pellegrino (don@drexel.edu)// //Drexel University, Philadelphia, PA// //December 15, 2010//

All files from this experiment are included in the attached zip.

**Objective (required)** The objective of this experiment is to create a social compound view. This view will enable exploration of the relationships between researchers and the compounds they have worked with. The relationships are established from open notebook science reaction records.

**Procedure** **Materials** The Reaction Attempts page links to the reaction attempts data.
 * Live Google Spreadsheet [|ReactionAttempts].
 * Live Google Spreadsheet [|RXIDsReactionAttempts].

**RXIDsReactionAttempts** Live spreadsheet data is available in the [|RXIDsReactionAttempts] Google Spreadsheet. This spreadsheet includes a "ReactionID" key, the name of the researcher, the name of the solvent, and the reaction type. It does not include the names or identifiers of the compounds.

**ReactionAttempts** The [|ReactionAttempts] Google Spreadsheet lists one row for each compound involved in a reaction. Compounds have multiple identifiers including "CompoundName," "CSID," and "SMILES."

The ReactionID column provides a way to join RXIDsReactionAttempts with ReactionAttempts. Such a join will allow for the integration of the name of the researcher with the name of the compound.

Note that the Reaction Attempts Wikispaces page links to the live spreadsheets using link names that do not match the names of the spreadsheets in Google Spreadsheets. This should be discussed with Bradley.

**Data Integration** Live versions of the RXIDsReactionAttempts and ReactionAttempts Google Spreadsheets were downloaded as Microsoft Excel spreadsheets to the project DC-Exp-01 folder.

The sheets from the two workbooks were combined into a single workbook. A VLOOKUP function was used to find the researcher names for each compound (CSID) in the data. A quick manual inspection of the results shows that some researcher cells hold multiple names, separated by a forward slash character. Additional processing will be required to break these apart.
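The join and the name splitting described above could also be done outside Excel. The following is a minimal Python sketch, not the procedure used in this experiment. The column names "Researcher" and "CSID", the reaction "EX1", the CSID "1001", and the names "Researcher A" and "Researcher B" are hypothetical placeholders; only the ReactionID key and the slash-separated names come from the description above.

```python
import csv
import io

# Hypothetical sample rows. The column names "Researcher" and "CSID" are
# assumptions modeled on the spreadsheet descriptions; "EX1", "1001",
# "Researcher A" and "Researcher B" are placeholders, not real records.
RXIDS_CSV = """ReactionID,Researcher
DSp35,Dustin Sprouse
DSp36-1,Dustin Sprouse
EX1,Researcher A/Researcher B
"""

ATTEMPTS_CSV = """ReactionID,CSID
DSp35,7146
DSp36-1,7146
EX1,1001
"""


def load_rows(text):
    """Parse CSV text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def join_researchers(rxid_rows, attempt_rows):
    """Return (CSID, researcher, ReactionID) tuples, one per individual name."""
    # Lookup table from ReactionID to the raw researcher cell (VLOOKUP analogue).
    researcher_by_rxid = {row["ReactionID"]: row["Researcher"] for row in rxid_rows}
    edges = []
    for row in attempt_rows:
        raw = researcher_by_rxid.get(row["ReactionID"])
        if raw is None:
            continue  # no matching reaction record; skip rather than guess
        # Break apart cells that hold several names separated by "/".
        for name in raw.split("/"):
            name = name.strip()
            if name:
                edges.append((row["CSID"], name, row["ReactionID"]))
    return edges


edges = join_researchers(load_rows(RXIDS_CSV), load_rows(ATTEMPTS_CSV))
for edge in edges:
    print(edge)
```

Splitting on the slash yields one record per person, which supports a per-person view. Keeping the raw cell instead would preserve the collaboration as a single node.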

Maybe weight each edge by the number of ReactionIDs that establish the connection.

Maybe direct the edges with reactants on the left and product on the right.

Show not just the CSID or name but also the structure. Also add a pop-up for the researcher with a picture.

Follow-up work could be to look at labs as well as people. The lab can be extracted by looking at the prefix of the reaction ID.

**Gephi** **//Source and Target Tuples must be unique//** Loading the DC-Exp-001:"DC-Exp-001 Edges" worksheet into Gephi via CSV export did not complete fully. Although the Gephi Import CSV dialog did not report any errors, a manual sampling revealed missing records. The Context tab in Gephi reports only 738 nodes and 1024 edges, although the edge worksheet includes 3941 records. It seems that Gephi will not load multiple edges between the same source and target. For example, Dustin Sprouse worked on molecule 7146 in at least reactions DSp35 and DSp36-1, but only the first edge (DSp35) was loaded into Gephi. Due to this limitation, it seems that edge weighting will be necessary to account for a researcher working with the same molecule in multiple experiments. Still, clustering should be accurate in Gephi even if not all edge records are processed, since each molecule / researcher relationship will still be represented.
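One way to keep the information carried by the dropped records is to collapse repeated source / target pairs into a single edge before export, with a weight equal to the number of distinct ReactionIDs. The sketch below illustrates that idea under assumptions. The tuple layout is hypothetical, and the third record ("1001", "Researcher A", "EX1") is a placeholder rather than real data.

```python
from collections import Counter

# (Source CSID, Target researcher, ReactionID) records. The first two rows
# mirror the Dustin Sprouse example; the third is a hypothetical placeholder.
records = [
    ("7146", "Dustin Sprouse", "DSp35"),
    ("7146", "Dustin Sprouse", "DSp36-1"),
    ("1001", "Researcher A", "EX1"),
]


def weighted_edges(rows):
    """Collapse duplicate (source, target) pairs into one weighted edge each."""
    # De-duplicate first so a ReactionID repeated for the same pair counts once.
    unique_rows = set(rows)
    counts = Counter((source, target) for source, target, _ in unique_rows)
    return sorted((source, target, weight) for (source, target), weight in counts.items())


collapsed = weighted_edges(records)
for source, target, weight in collapsed:
    print(source, target, weight)
```

With this input the Dustin Sprouse / 7146 pair becomes one edge with weight 2, so each source and target tuple is unique while the repeat count is retained.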

The following fields need to be constructed for an import via the Gephi, Data Laboratory, Data Table, Edges, Import CSV dialog.
 * Source = The molecule (CSID)
 * Target = The researcher
 * Type = Undirected
 * Label = Name of the molecule and researcher.
 * Weight
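The field list above can be turned into a CSV file with a short script. This is a minimal sketch under assumptions: the label format (compound name, a slash, researcher name) and the compound name "Compound A" are hypothetical, and the input tuple layout is chosen only for illustration.

```python
import csv
import io


def edges_to_csv(edges):
    """Render (csid, researcher, compound_name, weight) tuples as edge CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # Header uses the five fields listed above.
    writer.writerow(["Source", "Target", "Type", "Label", "Weight"])
    for csid, researcher, compound_name, weight in edges:
        # Assumed label format: molecule name and researcher name together.
        label = f"{compound_name} / {researcher}"
        writer.writerow([csid, researcher, "Undirected", label, weight])
    return buffer.getvalue()


# "Compound A" is a hypothetical placeholder name for CSID 7146.
sample = [("7146", "Dustin Sprouse", "Compound A", 2)]
csv_text = edges_to_csv(sample)
print(csv_text)
```

Using the `csv` module rather than string concatenation means names that contain commas or quotes are escaped correctly.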

Additional columns can be added to annotate the graph:
 * Reaction Identifier (edge)
 * Reaction Type (edge)
 * Compound Type (node = CSID)
 * Compound Name (node = CSID)
Adding the Type=Undirected column changes the Context report to 743 nodes and 1029 edges in an undirected graph.

**Results** Gephi was used to calculate the following statistics on the graph.

**Degree Report** Average Degree: 2.7750677506775068

**Graph Distance Report**
**Parameters:** Network Interpretation: undirected
**Results:**
 * Diameter: 6
 * Radius: 1
 * Average Path length: 2.828575232814054
 * Number of shortest paths: 510708

**Graph Density** 0.004
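The degree and density figures can be cross-checked from the node and edge counts using the standard formulas for a simple undirected graph: average degree 2E/N and density 2E/(N(N-1)). The sketch below assumes these are the definitions Gephi applies. With 738 nodes and 1024 edges it reproduces the reported average degree and a density that rounds to 0.004. The later counts of 743 nodes and 1029 edges would instead give an average degree of about 2.770.

```python
def average_degree(node_count, edge_count):
    """Average degree of an undirected graph: each edge touches two nodes."""
    return 2 * edge_count / node_count


def density(node_count, edge_count):
    """Fraction of possible undirected edges that are present."""
    return 2 * edge_count / (node_count * (node_count - 1))


# Counts from the Gephi Context tab after the first CSV load.
avg = average_degree(738, 1024)
dens = density(738, 1024)
print(avg)
print(round(dens, 3))
```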

**Modularity Report**
**Parameters:** Randomize: On
**Results:**
 * Modularity: 0.4653148651123047
 * Number of Communities: 29

**Connected Components Report**
**Parameters:** Network Interpretation: undirected
**Results:** Weakly Connected Components: 4
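For an undirected graph, weakly connected components are simply the connected components, so the count can be checked independently of Gephi with a breadth-first search over the edge list. The sketch below illustrates the technique on a hypothetical toy graph. The node names are placeholders, and this is not the algorithm Gephi is known to use.

```python
from collections import defaultdict, deque


def count_components(edges):
    """Count connected components of an undirected graph given as (a, b) pairs."""
    adjacency = defaultdict(set)
    for source, target in edges:
        adjacency[source].add(target)
        adjacency[target].add(source)

    seen = set()
    components = 0
    for start in list(adjacency):
        if start in seen:
            continue
        # Each unseen node starts a new component; BFS marks everything reachable.
        components += 1
        seen.add(start)
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
    return components


# Hypothetical toy graph: two compounds share Researcher A, one belongs to B.
toy_edges = [("c1", "Researcher A"), ("c2", "Researcher A"), ("c3", "Researcher B")]
component_count = count_components(toy_edges)
print(component_count)
```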

Figure 1: Overview Graph

The overview shows a primary cluster of Ugi reactions centered around Khalid Mirza. There are also three disconnected clusters. There are four small, loosely connected clusters. These are likely artifacts of the procedure used here.

**Disconnected Clusters** Figure 2: A disconnected Khalid Mirza - Marshall Moritz cluster.

Figure 2: A disconnected Dustin Sprouse cluster.

Figure 3: A Sebastian Petrik cluster.

**Loosely Connected Clusters**

Figure 4: David Bulger cluster.

Figure 5: Khalid Mirza - Aneh cluster.

Figure 6: Marshall Moritz cluster.

Figure 7: James Giammarco - Jessica Colditz and David Bulger - Khalid Mirza connections group.

Figure 8: Michael Wolfle cluster.

Figure 9: t-butyl isocyanide (CSID 22045) connections. The connections are highlighted in black with the fuller graph shown in lighter gray.

**Discussion** Compound [|80987] is the sole compound linking the Synaptic Leap notebook with the UsefulChem notebook. The validity of this link between the two notebooks is confirmed by checking the Reaction Attempts Explorer for aminoacetaldehyde dimethyl acetal. This link can also be viewed on the [|Reaction Attempt Advanced Search].

**Conclusion** Next steps need to include weighting the edges, accounting for all edges, and splitting / disambiguating the names in the researcher field.

Another action might be to draw a model of the data that is in use. Note that some researchers (Michael Wolfle) may have used multiple notebook systems (Synaptic Leap SE or Our Experiment OE).

A critical task is to identify the overlap between different notebook collections or research groups.

Add the [|Open Notebook Science Solubility Challenge] data to the linkages.

Parsing out the names for collaborations will make a per-person analysis easier. However, the current approach handles collaboration-centric analysis well.

Perform a temporal analysis. The RSS feeds on the Wiki should support this.