Dissertation, Chemistry, Experiment 001 (DC-Exp-001)
Don Pellegrino (don@drexel.edu)
Drexel University, Philadelphia, PA
December 15, 2010


All files from this experiment are included in the attached zip.

Objective (required)
The objective of this experiment is to create a social compound view. This view will enable exploration of the relationships between researchers and the compounds they have worked with. The relationships are established from open notebook science reaction records.

Procedure
Materials
The Reaction Attempts page links to the reaction attempts data.

RXIDsReactionAttempts
Live spreadsheet data is available in the RXIDsReactionAttempts Google Spreadsheet. This spreadsheet includes a "ReactionID" key, the name of the researcher, the name of the solvent, and the reaction type. It does not include the names or identifiers of the compounds.

ReactionAttempts
The ReactionAttempts Google Spreadsheet lists one row for each compound involved in a reaction. Compounds have multiple identifiers including "CompoundName," "CSID," and "SMILES."

The ReactionID column provides a way to join RXIDsReactionAttempts with ReactionAttempts. Such a join will allow for the integration of the name of the researcher with the name of the compound.

Note that the Reaction Attempts Wikispaces page links to the live spreadsheets using names for the links that do not match the names of the spreadsheets in Google Spreadsheets. This should be discussed with Bradley.

Data Integration
Live versions of the RXIDsReactionAttempts and ReactionAttempts Google Spreadsheets were downloaded as Microsoft Excel spreadsheets to the project DC-Exp-001 folder.

The sheets from the two workbooks were combined into a single workbook. A VLOOKUP function was used to find the researcher names for each compound (CSID) in the data. A quick manual inspection of the results shows that some researcher cells contain multiple names separated by a forward slash character. Additional processing will be required to split these apart.
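The join and the name splitting described above could also be scripted instead of done with VLOOKUP. The following is a minimal sketch, not the procedure used in this experiment. The "ReactionID" and "CSID" column names come from the spreadsheets; the "Researcher" column name, the function name, and the "HYPO-1" rows are assumptions made for illustration.

```python
def join_researchers(rxid_rows, compound_rows, researcher_col="Researcher"):
    """Attach individual researcher names to compound rows via ReactionID."""
    # Map each ReactionID to its list of names, splitting cells on "/".
    names_by_reaction = {}
    for row in rxid_rows:
        names = [n.strip() for n in row[researcher_col].split("/")]
        names_by_reaction[row["ReactionID"]] = [n for n in names if n]

    joined = []
    for row in compound_rows:
        # Compound rows whose ReactionID has no match are skipped.
        for name in names_by_reaction.get(row["ReactionID"], []):
            joined.append(
                {"ReactionID": row["ReactionID"], "CSID": row["CSID"], "Researcher": name}
            )
    return joined


rxids = [
    {"ReactionID": "DSp35", "Researcher": "Dustin Sprouse"},
    # Hypothetical row showing a slash-separated researcher cell.
    {"ReactionID": "HYPO-1", "Researcher": "Researcher A/Researcher B"},
]
compounds = [
    {"ReactionID": "DSp35", "CSID": "7146"},
    {"ReactionID": "HYPO-1", "CSID": "0000"},  # hypothetical compound row
]
pairs = join_researchers(rxids, compounds)
```

Each slash-separated cell yields one output row per name, so a compound handled by two collaborators appears once for each of them.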

Maybe weight the edge by the number of ReactionIDs that establish the connection.

Maybe direct the edges with reactants on the left and product on the right.

Show not just the CSID or name but also the structure. Also a pop-up for the researcher with picture.

Follow-up work could be to look at labs as well as people. The lab can be extracted by looking at the prefix of the reaction ID.

Gephi
Source and Target Tuples must be unique
Loading the DC-Exp-001:"DC-Exp-001 Edges" worksheet into Gephi via CSV export did not complete fully. Although the Gephi Import CSV dialog did not report any errors, a manual sampling revealed missing records. The Context tab in Gephi reports only 738 nodes and 1024 edges, although the edge worksheet includes 3941 records. It seems that Gephi will not load multiple edges between the same source and target. For example, Dustin Sprouse worked on molecule 7146 in at least reactions DSp35 and DSp36-1. Only the first edge (DSp35) was loaded into Gephi. Due to this limitation, it seems that edge weighting will be necessary to account for a researcher working with the same molecule in multiple experiments. Still, clustering should be accurate in Gephi even if not all edge records are processed, since each molecule / researcher relationship will still be represented.
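One way to make each source and target tuple unique is to collapse repeated pairs into a single edge whose weight is the number of reaction records behind it. The sketch below is illustrative only; the function name and the dictionary layout of the records are assumptions, while the Dustin Sprouse example is taken from the observation above.

```python
from collections import Counter


def weight_edges(edge_records):
    """Count how many reaction records share each (Source, Target) pair."""
    return Counter((rec["Source"], rec["Target"]) for rec in edge_records)


records = [
    {"Source": "7146", "Target": "Dustin Sprouse", "ReactionID": "DSp35"},
    {"Source": "7146", "Target": "Dustin Sprouse", "ReactionID": "DSp36-1"},
]
weights = weight_edges(records)
```

Here the two reactions become one edge between molecule 7146 and Dustin Sprouse with a weight of 2, so no record is silently dropped on import.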

The following fields need to be constructed for an import via the Gephi, Data Laboratory, Data Table, Edges, Import CSV dialog.
    • Source = The molecule (CSID)
    • Target = The researcher
    • Type = Undirected
    • Label = Name of the molecule and researcher.
    • Weight
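A minimal sketch of writing these fields to a CSV file follows. It assumes the edges have already been reduced to unique (CSID, researcher) pairs with a weight. The function name, the label format, and the "compound 7146" placeholder name are assumptions made for illustration; only the five column names come from the list above.

```python
import csv
import io

FIELDS = ["Source", "Target", "Type", "Label", "Weight"]


def edges_to_csv(weighted_edges, compound_names):
    """Serialize {(csid, researcher): weight} as CSV text with the fields above."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    for (csid, researcher), weight in sorted(weighted_edges.items()):
        # Fall back to the CSID when no compound name is known.
        name = compound_names.get(csid, csid)
        writer.writerow(
            {
                "Source": csid,
                "Target": researcher,
                "Type": "Undirected",
                "Label": f"{name} - {researcher}",
                "Weight": weight,
            }
        )
    return buffer.getvalue()


# "compound 7146" is a placeholder, not the real compound name.
csv_text = edges_to_csv({("7146", "Dustin Sprouse"): 2}, {"7146": "compound 7146"})
```

The resulting text can be saved to a file and loaded through the Import CSV dialog described above.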

Additional columns can be added to annotate the graph:
    • Reaction Identifier (edge)
    • Reaction Type (edge)
    • Compound Type (node = CSID)
    • Compound Name (node = CSID)

Adding Type=Undirected changes the Context report to 743 nodes, 1029 edges, undirected graph.

Results
Gephi was used to calculate the following statistics on the graph.

Degree Report
Average Degree: 2.7750677506775068

Graph Distance Report
Parameters:
Network Interpretation: undirected
Results:
Diameter: 6
Radius: 1
Average Path length: 2.828575232814054
Number of shortest paths: 510708

Graph Density
0.004
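As a cross-check on the degree and density figures, both can be recomputed from the node and edge counts using the standard formulas for a simple undirected graph: average degree 2E/N and density 2E/(N(N-1)). The sketch below uses the two sets of counts from the Context tab. The reported average degree appears to match the 738 node, 1024 edge load (2 x 1024 / 738), not the 743 node, 1029 edge load. The function and variable names are illustrative.

```python
def average_degree(nodes, edges):
    """Average degree of an undirected graph: 2E / N."""
    return 2 * edges / nodes


def density(nodes, edges):
    """Density of a simple undirected graph: 2E / (N * (N - 1))."""
    return 2 * edges / (nodes * (nodes - 1))


# Counts from the Context tab before and after adding Type=Undirected.
avg_first = average_degree(738, 1024)
avg_second = average_degree(743, 1029)
density_first = density(738, 1024)
density_second = density(743, 1029)
```

Both loads give a density that rounds to 0.004, so the density figure alone does not indicate which load the statistics were computed on.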

Modularity Report
Parameters:
Randomize: On
Results:
Modularity: 0.4653148651123047
Number of Communities: 29

Connected Components Report
Parameters:
Network Interpretation: undirected
Results:
Weakly Connected Components: 4


Overview.PNG
Figure 1: Overview Graph

The overview shows a primary cluster of Ugi reactions centered around Khalid Mirza. There are also three disconnected clusters. There are four small loosely connected clusters. These are likely artifacts of the procedure used here.

Disconnected Clusters

Khalid_Mirza_-_Marshal_Moritz_Cluster.PNG
Figure 2: A disconnected Khalid Mirza - Marshall Moritz cluster.

Dustin_Sprouse_Cluster.PNG
Figure 3: A disconnected Dustin Sprouse cluster.

Sebastian_Petrik_Cluster.PNG
Figure 4: A Sebastian Petrik cluster.

Loosely Connected Clusters

David_Bulger_Cluster.PNG
Figure 5: David Bulger cluster.

Khalid_Mirza_-_Aneh_Cluster.PNG
Figure 6: Khalid Mirza - Aneh cluster.

Marshall_Moritz_Cluster.PNG
Figure 7: Marshall Moritz cluster.

James_Giammarco_-_Jessica_Colditz_and_David_Bulger_-_Khalid_Mirza_Connections.PNG
Figure 8: James Giammarco - Jessica Colditz and David Bulger - Khalid Mirza connections group.

Michael_Wolfle_Cluster.PNG
Figure 9: Michael Wolfle cluster.

t-butyl_isocyanide_22045_Connections.PNG
Figure 10: t-butyl isocyanide (CSID 22045) connections. The connections are highlighted in black with the fuller graph shown in lighter gray.

Discussion
Compound 80987 is the sole compound linking the Synaptic Leap notebook with the UsefulChem notebook. That this is a valid link between the two notebooks is confirmed by checking the Reaction Attempts Explorer [http://showme.physics.drexel.edu/onsc/reactionattempts/] for aminoacetaldehyde dimethyl acetal. This link can also be viewed on the Reaction Attempt Advanced Search.

Conclusion
Next steps need to include weighting the edges, accounting for all edges, and splitting / disambiguating the names in the researcher field.

Another action might be to draw a model of the data that is in use. Note that some researchers (Michael Wolfle) may have used multiple notebook systems (Synaptic Leap SE or Our Experiment OE).

A critical question is how much the different notebook collections or research groups overlap.
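One simple way to approach this is to collect the set of CSIDs seen in each notebook and intersect the sets pairwise. The following is a sketch under stated assumptions: how a reaction is assigned to a notebook is not defined here, the function name is illustrative, and the "placeholder" identifiers are made up. Only CSID 80987 as a link between Synaptic Leap and UsefulChem comes from the Discussion.

```python
from itertools import combinations


def shared_compounds(compounds_by_notebook):
    """Return the CSIDs common to each pair of notebooks that share any."""
    overlaps = {}
    for a, b in combinations(sorted(compounds_by_notebook), 2):
        common = compounds_by_notebook[a] & compounds_by_notebook[b]
        if common:
            overlaps[(a, b)] = common
    return overlaps


notebooks = {
    # "placeholder-1" and "placeholder-2" are made-up identifiers.
    "Synaptic Leap": {"80987", "placeholder-1"},
    "UsefulChem": {"80987", "placeholder-2"},
}
overlap = shared_compounds(notebooks)
```

The same function would work for research groups if the dictionary is keyed by group instead of notebook.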

Add the Open Notebook Science Solubility Challenge data to the linkages.

Parsing out the individual names from collaborations will make a per-person analysis easier. However, the current approach handles collaboration-centric analysis well.

Perform a temporal analysis. The RSS feeds on the Wiki should support this.