Using Cytoscape with ArangoDB

In this tutorial, we visualize the data of a graph stored in ArangoDB to get a human-readable overview.

Such an overview often helps to get a general understanding of data that was not created artificially, or of a third-party dataset that we did not design ourselves.

The dataset

In this tutorial we work with a third-party dataset created by Marius Bäsler for his master's thesis.[1]
His goal is to find the origins of parasitism with the help of GloBI's interaction database.
This data dump can be downloaded here.
The dataset describes several organisms that live in either a symbiotic or a parasitic relationship with one another.

In order to import the dataset we can just restore it into a running ArangoDB with arangorestore:

arangorestore --input-directory /path/to/extracted/dump

After this command has succeeded, you will end up with two collections:

nodes_otl_sub: a document collection containing species, genera and families.
edges_otl_sub: an edge collection, where each edge defines a relation between two nodes.

Now we have the dataset in ArangoDB and are ready to go.

Data Normalization

The goal is to export the data in XGMML format, which can be read by Cytoscape, the tool we want to use to visualize the data.
Unfortunately, this format requires that all attributes of a vertex are strings.
So we need to normalize our dataset first and convert all vertex attributes to strings.

Furthermore, each document needs to have identical attributes, which is also done by this step.

NOTE: This step requires some computation and does not scale well to larger datasets. If you are in this situation and need guidance, please contact us on Slack; we can help you out there.

In order to do this normalization we are going to execute the following AQL:

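// Step 1: collect the distinct attribute names used by any node
// (the second argument `true` skips system attributes such as _key).
LET attrs = (
  FOR node IN nodes_otl_sub
    FOR x IN ATTRIBUTES(node, true)
      RETURN DISTINCT x
)

// Step 2: rebuild every node with all attributes converted to strings.
FOR node IN nodes_otl_sub
  LET newNode = ZIP(attrs, (
    FOR attr IN attrs RETURN TO_STRING(node[attr])
  ))
  UPDATE node WITH newNode IN nodes_otl_sub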

In the first step, this AQL query collects a distinct list of the attributes available in the dataset.
In the second step, it iterates over all nodes.

For each node, it creates a new node in which every attribute is replaced with the TO_STRING variant of its value.
Note: if an attribute is not set on a node, the empty string is stored for it.
The query then updates the document in the collection with the new node.

Once this query has succeeded, all vertices have all attributes, and all of them are of type string.
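To make the effect of the query concrete, the same two steps can be mimicked in plain Python on a few in-memory documents. This is only an illustrative sketch: the sample documents, their attribute names, and the helpers `to_string` and `normalize` are made up for this example. `to_string` only approximates AQL's TO_STRING (null becomes the empty string, booleans become true/false, arrays and objects are JSON-encoded), and number formatting may differ from ArangoDB in edge cases.

```python
import json


def to_string(value):
    """Rough stand-in for AQL's TO_STRING (an approximation for illustration)."""
    if value is None:
        return ""  # missing or null attributes become the empty string
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, str):
        return value
    if isinstance(value, (list, dict)):
        return json.dumps(value, separators=(",", ":"))
    return str(value)


def normalize(docs):
    """Give every document the same attributes, all with string values."""
    # Step 1: distinct attribute names, skipping system attributes like _key.
    attrs = sorted({key for doc in docs for key in doc if not key.startswith("_")})
    # Step 2: merge the stringified attributes into each document,
    # keeping the system attributes untouched (similar to UPDATE).
    return [{**doc, **{a: to_string(doc.get(a)) for a in attrs}} for doc in docs]


# Hypothetical sample documents; the real dataset has different attributes.
nodes = [
    {"_key": "1", "name": "Felis catus", "rank": "species", "parasite": False},
    {"_key": "2", "name": "Toxoplasma", "rank": "genus", "hosts": 3},
]

normalized = normalize(nodes)
for node in normalized:
    print(node)
```

After normalization, both sample documents carry the same set of attributes, and attributes that were missing on one of them show up as empty strings.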
Now we are ready for the export.

Exporting the data

To visualize the data, we need it in XGMML format.
To transform the dataset into this format, we use the arangoexport tool.

$> arangoexport --help
Usage: arangoexport []
 
Section 'global options' (Global configuration)
  --collection                restrict to collection name (can be specified multiple times) (default: )
  --configuration                the configuration file or 'none' (default: "")
  --fields                       comma separated list of fileds to export into a csv file (default: "")
  --graph-name                   name of a graph to export (default: "")
  --output-directory             output directory (default: "/home/mchacki/devel/export")
  --overwrite                   overwrite data in output directory (default: false)
  --progress                    show progress (default: true)
  --type                         type of export. possible values: "csv", "json", "jsonl", "xgmml", "xml"
                                         (default: "json")
  --version                     reports the version and exits (default: false)
  --xgmml-label-attribute        specify document attribute that will be the xgmml label (default:
                                         "label")
  --xgmml-label-only            export only xgmml label (default: false)

This tool natively supports the XGMML format, so using it is rather straightforward.
For this export, we need to name the collections we want to export, in our case nodes_otl_sub and edges_otl_sub.

We also need to specify xgmml as the export type.
For easier visualization, we give the graph the name otl.

Finally, XGMML allows defining one attribute as the label.
We select the name attribute for this tutorial.
In total, our call looks like this:

$> arangoexport --collection nodes_otl_sub --collection edges_otl_sub --type xgmml --graph-name otl --xgmml-label-attribute name

This produces the following output:

Connected to ArangoDB 'http+tcp://127.0.0.1:8529', version 3.2.0, database: '_system', username: 'root'
# Export graph with collections nodes_otl_sub, edges_otl_sub as 'otl'
# Exporting collection 'nodes_otl_sub'...
# Exporting collection 'edges_otl_sub'...
Processed 2 collection(s), wrote 128432121 byte(s), 176 HTTP request(s)
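For repeated or scripted exports, the same call can be assembled programmatically. The following is a minimal sketch, assuming arangoexport is on the PATH and the server is reachable with default connection settings. The helper name `build_export_command` is made up for this example; it only uses options that appear in the help output above.

```python
import shlex


def build_export_command(collections, graph_name, label_attribute, export_type="xgmml"):
    """Assemble the arangoexport argument list used in this tutorial."""
    cmd = ["arangoexport"]
    for name in collections:
        # --collection may be given multiple times, once per collection.
        cmd += ["--collection", name]
    cmd += [
        "--type", export_type,
        "--graph-name", graph_name,
        "--xgmml-label-attribute", label_attribute,
    ]
    return cmd


command = build_export_command(["nodes_otl_sub", "edges_otl_sub"], "otl", "name")
print(shlex.join(command))

# To actually run the export (requires a running ArangoDB instance):
#   import subprocess
#   subprocess.run(command, check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems if a collection or graph name ever contains unusual characters.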

After the export has succeeded, the output directory contains a file named otl.xgmml.

This file is the XGMML representation of our dataset.
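Before loading a file of this size into Cytoscape, it can be useful to check that it contains the expected number of nodes and edges. The sketch below streams through an XGMML document with Python's standard library and counts both element types. The helper names and the tiny hand-written sample document are assumptions made for illustration; the exact attributes in a real export may differ.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag):
    """Drop a '{namespace}' prefix from an element tag, if present."""
    return tag.rsplit("}", 1)[-1]


def summarize_xgmml(source):
    """Count node and edge elements while streaming through an XGMML document.

    `source` may be a file name or a binary file object.
    """
    counts = Counter()
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name in ("node", "edge"):
            counts[name] += 1
            elem.clear()  # release attribute children of large exports early
    return counts


# Hand-written miniature sample, not taken from the real export.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<graph label="otl" xmlns="http://www.cs.rpi.edu/XGMML">
  <node label="Felis catus" id="nodes_otl_sub/1">
    <att name="rank" type="string" value="species"/>
  </node>
  <node label="Toxoplasma gondii" id="nodes_otl_sub/2"/>
  <edge label="" source="nodes_otl_sub/2" target="nodes_otl_sub/1"/>
</graph>"""

counts = summarize_xgmml(io.BytesIO(SAMPLE))
print(dict(counts))

# For the real export, pass the path to the file instead, for example:
#   summarize_xgmml("export/otl.xgmml")  # path is an assumption
```

Streaming with `iterparse` avoids building the whole tree at once, which matters for an export of more than a hundred megabytes.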

Data visualisation

In order to visualize and analyze the dataset, please download Cytoscape. For details about Cytoscape, please refer to its website. For this tutorial, we are just going to use it as a visualization tool.

Cytoscape: import xgmml file

Cytoscape: apply organic layout

Cytoscape: graph overview

Cytoscape: part of the graph zoomed in

Feel free to explore the graph yourself.

[1] The present graph is part of Marius Bäsler’s master thesis (Bäsler 2017 – https://github.com/majuss/globi-parasites). He’s trying to find the origins of parasitism with the help of the OpenTreeOfLife (Hinchliff et al. 2014 – doi: 10.1073/pnas.1423041112) and GlobalBioticInteractions (Poelen et al., 2014 – https://doi.org/10.1016/j.ecoinf.2014.08.005).