Pregel Community Detection Tutorial

Community structures are quite common in real networks. For example, social networks include community groups (the origin of the term, in fact) based on common locations, hobbies, occupation, etc.

Finding an underlying community structure in a network, if it exists, is important for a number of reasons. Communities allow us to create a large-scale map of a network, since individual communities act like meta-nodes in the network, which makes its study easier. [7]

At ArangoDB we recently integrated community detection algorithms into our Pregel-based distributed bulk graph processing subsystem. This enables you to easily use your existing graph data in many different applications. This tutorial is designed to run on the Community Edition of ArangoDB.

Creating the ArangoDB Graph

The data we are going to use is the Pokec social network, available from the Stanford Network Analysis Project. Pokec is the most popular online social network in Slovakia, and we will use it to detect communities in the graph. Since it is a social network, we assume it contains an underlying community structure, which we can discover through one of our algorithms.

As a first step, you should start the ArangoDB cluster, for example with the ArangoDB Starter.
The next step is to create the collections and the ArangoDB named graph. At the arangosh prompt you can paste in the following commands:

var graph_module = require("@arangodb/general-graph");

var graph = graph_module._create("pokec");
db._create("profiles", {numberOfShards: 4});
graph._addVertexCollection("profiles");

db._createEdgeCollection("relations", {
  numberOfShards: 4,
  replicationFactor: 1,
  shardKeys: ["vertex"],
  distributeShardsLike: "profiles"
});

var rel = graph_module._relation("relations", ["profiles"], ["profiles"]);
graph._extendEdgeDefinitions(rel);

Preparing and importing the graph data

You can run these bash commands to directly download the data

curl -OL https://snap.stanford.edu/data/soc-pokec-profiles.txt.gz
 
curl -OL https://snap.stanford.edu/data/soc-pokec-relationships.txt.gz

Now we can extract both files and transform them into a format that can be imported by ArangoDB's Community Edition. Our goal is to get tab-separated CSV files, so be careful when copying these commands. These commands assume you are on a Linux system, but you can also execute them on Windows with Cygwin.

First, we create a CSV file containing all user profiles by running these bash commands:

echo -e '_key\tpublic\tcompletion_percentage\tgender\tregion\tlast_login\tregistration\tAGE\tbody\tI_am_working_in_field\tspoken_languages\thobbies\tI_most_enjoy_good_food\tpets\tbody_type\tmy_eyesight\teye_color\thair_color\thair_type\tcompleted_level_of_education\tfavourite_color\trelation_to_smoking\trelation_to_alcohol\tsign_in_zodiac\ton_pokec_i_am_looking_for\tlove_is_for_me\trelation_to_casual_sex\tmy_partner_should_be\tmarital_status\tchildren\trelation_to_children\tI_like_movies\tI_like_watching_movie\tI_like_music\tI_mostly_like_listening_to_music\tthe_idea_of_good_evening\tI_like_specialties_from_kitchen\tfun\tI_am_going_to_concerts\tmy_active_sports\tmy_passive_sports\tprofession\tI_like_books\tlife_style\tmusic\tcars\tpolitics\trelationships\tart_culture\thobbies_interests\tscience_technologies\tcomputers_internet\teducation\tsport\tmovies\ttravelling\thealth\tcompanies_brands\tmore' > soc-pokec-profiles-arangodb.txt
 
gunzip < soc-pokec-profiles.txt.gz | sed -e 's/null//g' -e 's~^~P~' -e 's~ $~~' >> soc-pokec-profiles-arangodb.txt 
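To make the sed pipeline above concrete, here is a small JavaScript sketch of the per-line transformation it performs (the sample line is invented for illustration): "null" placeholders are blanked out, the numeric user id is prefixed with P so it becomes a valid document _key, and the trailing space is dropped.

```javascript
// Sketch of the sed pipeline, in the same order as the -e expressions.
// Fields are tab-separated; the first field is the numeric user id.
function cleanProfileLine(line) {
  return line
    .replace(/null/g, "")  // s/null//g   -> blank out "null" placeholders
    .replace(/^/, "P")     // s~^~P~      -> prefix the id for the _key
    .replace(/ $/, "");    // s~ $~~      -> drop the trailing space
}

// Invented sample line: id 42, a "null" field, one more field.
const cleaned = cleanProfileLine("42\tnull\tok ");
```

The P prefix matters because the first column is imported as the document _key, and the edge file below must construct matching profiles/P… ids.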

Next we take the relations file and do the same thing. You might want to go get a coffee now; this can take a few minutes.

echo -e '_from\t_to\tvertex' > soc-pokec-relationships-arangodb.txt
 
gzip -dc soc-pokec-relationships.txt.gz | awk -F"\t" '{print "profiles/P" $1 "\tprofiles/P" $2 "\tP" $1}' >> soc-pokec-relationships-arangodb.txt
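The awk one-liner turns each tab-separated id pair into an edge row with _from, _to, and the vertex shard key, all pointing into the profiles collection. A JavaScript equivalent of that per-line transformation (the ids here are invented for illustration):

```javascript
// Sketch of the awk transformation: "from<TAB>to" becomes an edge row
// referencing profiles/P<id> documents, plus a "vertex" shard key column
// (the shard key repeats _from's key so edges live with their source vertex).
function relationLine(line) {
  const [from, to] = line.split("\t");
  return `profiles/P${from}\tprofiles/P${to}\tP${from}`;
}

const edgeRow = relationLine("1\t13");
```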

Now that you have the data ready for import, you can import it into your ArangoDB instance. Adjust the --server.endpoint option as necessary:

arangoimp -c none --server.endpoint http+tcp://[::1]:8530 --type tsv --collection profiles --file soc-pokec-profiles-arangodb.txt

arangoimp -c none --server.endpoint http+tcp://[::1]:8530 --type tsv --collection relations --file soc-pokec-relationships-arangodb.txt

Running the algorithms

Now that you have imported the data we can start working with it. Currently, we support three different community detection algorithms: Label Propagation (LP), a version of Speaker-Listener Label Propagation (SLPA) and Disassortative Degree Mixing and Information Diffusion (DMID).

These algorithms have different purposes: LP can recognize distinct communities in a graph and is very cheap in terms of memory consumption.

SLPA and DMID, on the other hand, are designed to detect overlapping communities. Neither assumes a fixed number of communities, and both can discover the communities on the fly.
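To give an intuition for what LP computes: every vertex starts in its own community and repeatedly adopts the label that is most frequent among its neighbours, until the labels stabilize. Here is a minimal, self-contained JavaScript sketch of that idea on a toy graph (this is an illustration only, not the actual Pregel implementation; tie-breaking and scheduling details differ):

```javascript
// Toy synchronous label propagation: two triangles joined by one edge.
// Each vertex starts with its own id as label and repeatedly adopts the
// most frequent label among its neighbours (ties broken by smallest label).
const neighbours = [
  [1, 2], [0, 2], [0, 1, 3],  // triangle A: vertices 0, 1, 2
  [2, 4, 5], [3, 5], [3, 4],  // triangle B: vertices 3, 4, 5
];
let labels = neighbours.map((_, i) => i);

for (let step = 0; step < 50; step++) {
  const next = neighbours.map(nbrs => {
    // Count neighbour labels and pick the most frequent one.
    const counts = new Map();
    for (const n of nbrs) counts.set(labels[n], (counts.get(labels[n]) || 0) + 1);
    let best = null, bestCount = -1;
    for (const [label, count] of counts) {
      if (count > bestCount || (count === bestCount && label < best)) {
        best = label; bestCount = count;
      }
    }
    return best;
  });
  if (next.every((l, i) => l === labels[i])) break;  // converged
  labels = next;
}
// The two triangles end up with two distinct labels: one community each.
```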

Now, in arangosh, execute the LP algorithm:

var pregel = require("@arangodb/pregel");

var handle = pregel.start("labelpropagation", "pokec", {maxGSS: 250, resultField: "community"});

// check the status periodically for completion
pregel.status(handle);

Similarly, you can execute SLPA with the maxCommunities parameter set to 1 to get a result similar to LP's.

var pregel = require("@arangodb/pregel");

var handle = pregel.start("slpa", "pokec", {maxGSS: 100, resultField: "community", maxCommunities: 1});

// check the status periodically for completion
pregel.status(handle);
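After the run completes, each profile document carries its community value in the attribute named by resultField. As a rough illustration of how you might summarize that result, here is a plain-JavaScript aggregation over invented sample documents (in practice you would run the equivalent AQL COLLECT over the live profiles collection):

```javascript
// Invented sample of what profile documents look like after the run;
// "community" is the resultField written by the Pregel job.
const profiles = [
  { _key: "P1", community: 7 },
  { _key: "P2", community: 7 },
  { _key: "P3", community: 42 },
];

// Count members per community (what AQL COLLECT ... WITH COUNT INTO does).
const sizes = {};
for (const p of profiles) {
  sizes[p.community] = (sizes[p.community] || 0) + 1;
}
```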

Once the LP algorithm has finished, you can work further with this dataset, e.g. to leverage SmartGraphs. A tutorial on how to “smartify” your dataset can be found in the Smartifier Tutorial.