Pregel Community Detection Tutorial
Community structures are quite common in real networks. For example, social networks include community groups (the origin of the term, in fact) based on common locations, hobbies, occupation, etc.
Finding an underlying community structure in a network, if it exists, is important for a number of reasons. Communities allow us to create a large-scale map of a network, since individual communities act like meta-nodes in the network, which makes its study easier. [7]
At ArangoDB we recently integrated community detection algorithms into our Pregel-based distributed bulk graph processing subsystem. This enables you to easily use your existing graph data in many different applications. This tutorial is designed to run on the Community Edition of ArangoDB.
Creating the ArangoDB Graph
The data we are going to use is the Pokec social network, available from the Stanford Network Analysis Project (SNAP). Pokec is the most popular online social network in Slovakia; we will use it to detect communities in the graph. Since it is a social network, we assume it contains an underlying community structure, which we can discover through one of our algorithms.
As a first step, you should start the ArangoDB cluster, for example with the ArangoDB Starter.
The next step is to create the collections and the named graph. In the arangosh prompt you can paste in the following commands:
```js
var graph_module = require("@arangodb/general-graph");
var graph = graph_module._create("pokec");

db._create("profiles", {numberOfShards: 4});
graph._addVertexCollection("profiles");

db._createEdgeCollection("relations", {
  numberOfShards: 4,
  replicationFactor: 1,
  shardKeys: ["vertex"],
  distributeShardsLike: "profiles"
});
var rel = graph_module._relation("relations", ["profiles"], ["profiles"]);
graph._extendEdgeDefinitions(rel);
```
Preparing and importing the graph data
You can run these bash commands to download the data directly:
```bash
curl -OL https://snap.stanford.edu/data/soc-pokec-profiles.txt.gz
curl -OL https://snap.stanford.edu/data/soc-pokec-relationships.txt.gz
```
Now we can extract both files and transform them into a format which can be imported by ArangoDB's Community Edition. Our goal is to get tab-separated CSV files, so be careful when copying these commands. The commands assume you are on a Linux system, but you can also execute them on Windows with Cygwin.
First, we create a CSV file containing all user profiles by running these bash commands:
```bash
echo -e '_key\tpublic\tcompletion_percentage\tgender\tregion\tlast_login\tregistration\tAGE\tbody\tI_am_working_in_field\tspoken_languages\thobbies\tI_most_enjoy_good_food\tpets\tbody_type\tmy_eyesight\teye_color\thair_color\thair_type\tcompleted_level_of_education\tfavourite_color\trelation_to_smoking\trelation_to_alcohol\tsign_in_zodiac\ton_pokec_i_am_looking_for\tlove_is_for_me\trelation_to_casual_sex\tmy_partner_should_be\tmarital_status\tchildren\trelation_to_children\tI_like_movies\tI_like_watching_movie\tI_like_music\tI_mostly_like_listening_to_music\tthe_idea_of_good_evening\tI_like_specialties_from_kitchen\tfun\tI_am_going_to_concerts\tmy_active_sports\tmy_passive_sports\tprofession\tI_like_books\tlife_style\tmusic\tcars\tpolitics\trelationships\tart_culture\thobbies_interests\tscience_technologies\tcomputers_internet\teducation\tsport\tmovies\ttravelling\thealth\tcompanies_brands\tmore' > soc-pokec-profiles-arangodb.txt
gunzip < soc-pokec-profiles.txt.gz | sed -e 's/null//g' -e 's~^~P~' -e 's~ $~~' >> soc-pokec-profiles-arangodb.txt
```
Next, we apply the same treatment to the relations file. Maybe go and get a coffee now; this might take a few minutes.
```bash
echo -e '_from\t_to\tvertex' > soc-pokec-relationships-arangodb.txt
gzip -dc soc-pokec-relationships.txt.gz | awk -F"\t" '{print "profiles/P" $1 "\tprofiles/P" $2 "\tP" $1}' >> soc-pokec-relationships-arangodb.txt
```
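If you want to see exactly what these pipelines produce before running them, here is a small sketch in plain JavaScript that mirrors the two per-line transformations (`toProfileLine` and `toEdgeLine` are hypothetical helper names, not part of any tool used above):

```javascript
// Mirrors the profile pipeline: sed -e 's/null//g' -e 's~^~P~' -e 's~ $~~'
// i.e. drop the literal "null" placeholders, prefix the numeric user id
// with "P" so it becomes a valid _key, and trim a trailing space.
function toProfileLine(line) {
  return ("P" + line.replace(/null/g, "")).replace(/ $/, "");
}

// Mirrors the relationship pipeline:
// awk -F"\t" '{print "profiles/P" $1 "\tprofiles/P" $2 "\tP" $1}'
function toEdgeLine(line) {
  const [from, to] = line.split("\t");
  return "profiles/P" + from + "\tprofiles/P" + to + "\tP" + from;
}
```

For example, the relationship line `1<TAB>13` becomes `profiles/P1<TAB>profiles/P13<TAB>P1`. Note how the `P` prefix matches the `_key` prefix given to the profiles, and how repeating the source id in the `vertex` column gives every edge the same shard key value as the profile it starts at — this is why the `relations` collection was created with `shardKeys: ["vertex"]` and `distributeShardsLike: "profiles"`.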
Now that the data is ready, you can import it into your ArangoDB instance. Adjust the --server.endpoint option as necessary:
```bash
arangoimp -c none --server.endpoint http+tcp://[::1]:8530 --type tsv --collection profiles --file soc-pokec-profiles-arangodb.txt
arangoimp -c none --server.endpoint http+tcp://[::1]:8530 --type tsv --collection relations --file soc-pokec-relationships-arangodb.txt
```
Running the algorithms
Now that you have imported the data, we can start working with it. Currently, we support three different community detection algorithms: Label Propagation (LP), a version of Speaker-Listener Label Propagation (SLPA), and Disassortative Degree Mixing and Information Diffusion (DMID).
These algorithms have different purposes: LP can recognize distinct communities in a graph and is very cheap in terms of memory consumption.
SLPA and DMID, on the other hand, are designed to detect overlapping communities. Neither assumes a fixed number of communities; both discover the communities on the fly.
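To build intuition for what LP computes, here is a deliberately tiny, self-contained sketch of label propagation in plain JavaScript. This is only an illustration of the core idea, not ArangoDB's Pregel implementation: every vertex starts in its own community and repeatedly adopts the label most common among its neighbours (with a deterministic tie-break here, where the real algorithm typically breaks ties randomly):

```javascript
// Toy label propagation on an in-memory adjacency list.
// Sketch only -- not ArangoDB's distributed Pregel implementation.
function labelPropagation(adj, maxIter = 20) {
  const labels = {};
  for (const v of Object.keys(adj)) labels[v] = v; // each vertex starts alone
  for (let iter = 0; iter < maxIter; iter++) {
    let changed = false;
    for (const v of Object.keys(adj)) {
      if (adj[v].length === 0) continue;
      // count the labels currently held by v's neighbours
      const counts = {};
      for (const n of adj[v]) counts[labels[n]] = (counts[labels[n]] || 0) + 1;
      const max = Math.max(...Object.values(counts));
      const candidates = Object.keys(counts).filter((l) => counts[l] === max);
      // keep the current label on a tie; otherwise adopt a most frequent one
      if (!candidates.includes(labels[v])) {
        labels[v] = candidates.sort().pop(); // deterministic tie-break
        changed = true;
      }
    }
    if (!changed) break; // stable labelling reached
  }
  return labels;
}

// Two triangles joined by a single bridge edge (c-d):
const adj = {
  a: ["b", "c"], b: ["a", "c"], c: ["a", "b", "d"],
  d: ["c", "e", "f"], e: ["d", "f"], f: ["d", "e"],
};
// labelPropagation(adj) puts {a, b, c} in one community and {d, e, f} in another.
```

In the Pregel version the same idea runs vertex-centrically: each vertex sends its label to its neighbours in every superstep, and the `maxGSS` parameter used below bounds the number of global supersteps.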
Now, in arangosh, execute the LP algorithm:
```js
var pregel = require("@arangodb/pregel");
var handle = pregel.start("labelpropagation", "pokec", {maxGSS: 250, resultField: "community"});

// check the status periodically for completion
pregel.status(handle);
```
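Since the status call returns immediately, a script has to poll it until the job is done. A generic sketch of such a polling loop, assuming the status object carries a `state` field as ArangoDB's Pregel status does (`waitUntilDone` is a hypothetical helper; the pause between polls is left out here, and in arangosh you could add one with `require("internal").wait(1)`):

```javascript
// Hypothetical helper: call a status function until it reports "done".
// In arangosh, getStatus would be () => pregel.status(handle).
function waitUntilDone(getStatus, maxPolls = 600) {
  for (let i = 0; i < maxPolls; i++) {
    const status = getStatus();
    if (status.state === "done") return status;
    // a real script would sleep here between polls
  }
  throw new Error("Pregel job did not finish within " + maxPolls + " polls");
}
```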
Similarly, you can also execute SLPA with the maxCommunities parameter set to 1 to get a result similar to LP's.
```js
var pregel = require("@arangodb/pregel");
var handle = pregel.start("slpa", "pokec", {maxGSS: 100, resultField: "community", maxCommunities: 1});

// check the status periodically for completion
pregel.status(handle);
```
Once the LP algorithm is finished, you can work further with this dataset, e.g. to leverage SmartGraphs. A tutorial on how to “smartify” your dataset can be found in the Smartifier Tutorial.