InfoCamere investigation on graph databases: exploring relationships among Italian companies with ArangoDB.
The Italian Chambers of Commerce are public bodies entrusted with serving and promoting Italian businesses through over 300 branch offices located throughout the country. InfoCamere helps them pursue their goals in the interest of the business community. On behalf of the Chambers’ System, InfoCamere plays a key role in implementing the Italian Digital Agenda with respect to the digital transformation of the national productive system, focusing especially on supporting the digitalization of SMEs.
Luca Sinico (Software Developer, InfoCamere)
Overview of the work
InfoCamere began an investigation into graph databases during the second half of 2016. The goals were to study the principal characteristics of the technology; to compare, both conceptually and in terms of performance, some of the products available on the market with one another and with a relational solution; and to assess whether a graph database could be adopted for some InfoCamere applications.
The work is based on a dataset extracted from the Italian Business Register that contains data about equity participations among enterprises. The nodes of the graph may be natural persons or companies, and they carry data such as name, the company’s share capital, registration country, and a unique fiscal identifier. The edges of the graph represent the equity participations among them.
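As a rough illustration of the data model just described, the sketch below represents nodes and participation edges as plain Python data classes. All field names (`fiscal_id`, `kind`, `share_pct`, and so on) are assumptions made for the example; the actual schema of the dataset is not given here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Node:
    """A natural person or a company (field names are hypothetical)."""
    fiscal_id: str                         # unique fiscal identifier, used as key
    kind: str                              # "person" or "company"
    name: str
    country: str = "IT"                    # registration country
    share_capital: Optional[float] = None  # meaningful for companies only


@dataclass(frozen=True)
class Participation:
    """A directed edge: `owner` holds equity in `owned`."""
    owner: str        # fiscal_id of the associate
    owned: str        # fiscal_id of the participated company
    share_pct: float  # hypothetical edge attribute


@dataclass
class EquityGraph:
    nodes: Dict[str, Node] = field(default_factory=dict)
    edges: List[Participation] = field(default_factory=list)

    def add_node(self, node: Node) -> None:
        self.nodes[node.fiscal_id] = node

    def add_participation(self, owner: str, owned: str, share_pct: float) -> None:
        # Refuse edges whose endpoints are not known nodes.
        if owner not in self.nodes or owned not in self.nodes:
            raise KeyError("both endpoints must be added as nodes first")
        self.edges.append(Participation(owner, owned, share_pct))
```

A graph is then built by adding nodes first and participations afterwards, which mirrors the usual two-step import (vertices, then edges) of a graph DBMS.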
During our work we examined the two principal graph data models: “property graph” and “RDF”. Although RDF (a W3C standard) is a useful way to implement Linked Data and the Semantic Web, and although it organizes data as a graph, we found that the property graph model (a sort of “industry standard”) better meets our requirements. In particular, it allows attributes to be defined on edges, whereas RDF does not allow this directly. In addition, the standard query language for RDF (SPARQL) shows some limitations compared with the query languages typically provided by DBMSs that support the property graph model. Two quick examples are the lack of a shortest-path function and the lack of a way to express a maximum depth for variable-length path searches.
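The difference in how the two models handle edge attributes can be sketched with plain Python structures. In the property graph the attribute sits on the edge itself; with RDF triples, one common workaround is to introduce an intermediate node that stands for the relationship. The identifiers and the `ex:` vocabulary below are invented for the example and are not taken from the dataset.

```python
# Property graph: the edge is a record and carries its own attributes.
property_graph_edge = {
    "from": "company/A",
    "to": "company/B",
    "label": "participates_in",
    "share_pct": 30.0,  # hypothetical attribute stored directly on the edge
}

# RDF: a statement is a (subject, predicate, object) triple with no room
# for extra attributes, so the participation becomes a node of its own.
rdf_triples = [
    ("ex:participation1", "ex:owner", "ex:companyA"),
    ("ex:participation1", "ex:owned", "ex:companyB"),
    ("ex:participation1", "ex:sharePct", "30.0"),
]


def edge_attribute_from_triples(triples, owner, owned, predicate):
    """Look up an attribute of the owner -> owned relationship.

    Two steps are needed: find the intermediate node(s) for the owner,
    then read the requested predicate from the one that matches `owned`.
    """
    candidates = {s for s, p, o in triples if p == "ex:owner" and o == owner}
    for node in candidates:
        facts = {p: o for s, p, o in triples if s == node}
        if facts.get("ex:owned") == owned:
            return facts.get(predicate)
    return None
```

Reading the share from the property graph edge is a single field access, while the triple version needs the extra hop through `ex:participation1`.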
The flow through which the data reach the graph database is described in Figure 2. Starting from the complex relational database that stores the Italian Business Register, title searches are continuously generated in response to user requests or update operations. The title searches hold aggregated data, obtained by combining records from different tables, that are useful for some applications. For this reason, these data are stored in a second relational database that supports those applications. Since this second relational database is mainly focused on equity participations among companies, the graph database takes its data from it.
We designed the queries so that they could be used by applications working on this kind of dataset, and also so that they would put some stress on the DBMSs. In particular, we developed some standard queries and some more specific ones.
Given a particular company, identified by its “fiscal ID”, we ask for its associates, its participations, or both at once, limiting the search to a single level of depth. We also ask for both the direct and indirect participations of a company (and similarly for its associates), which corresponds to exploring the graph without depth limits. Furthermore, since the dataset is a graph (and not a “simple” tree), there may be multiple paths between two companies. This led us to ask for the complete list of directed paths that connect two companies or, alternatively, for just the shortest one. We also ask for the common participations (or associates) of two companies. The graph nature of the dataset also gave rise to two additional queries: the first returns, together with each participated node retrieved, the minimum depth at which it was reached; the second counts, for each depth level, the companies that are associates of the given node, without counting any of them more than once.
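The queries described above can be sketched on a small in-memory graph. This is only an illustration of their logic in plain Python, not the AQL, Cypher, or OrientDB SQL used in the study; the toy data and all function names are hypothetical.

```python
from collections import deque

# Hypothetical toy data: owner -> companies in which it holds shares.
PARTICIPATIONS = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D"],
    "D": ["E"],
    "E": [],
}


def invert(graph):
    """Reverse every edge: company -> its direct associates (owners)."""
    inverse = {node: [] for node in graph}
    for owner, owned_list in graph.items():
        for owned in owned_list:
            inverse.setdefault(owned, []).append(owner)
    return inverse


def min_depths(graph, start, max_depth=None):
    """Breadth-first search: each reachable node -> minimum depth reached."""
    depths = {}
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if max_depth is not None and depth >= max_depth:
            continue  # do not expand beyond the requested depth
        for nxt in graph.get(node, []):
            if nxt not in seen:  # first visit is the shallowest one in BFS
                seen.add(nxt)
                depths[nxt] = depth + 1
                queue.append((nxt, depth + 1))
    return depths


def count_per_level(graph, start):
    """Count nodes per depth level, each node counted once (min depth)."""
    counts = {}
    for depth in min_depths(graph, start).values():
        counts[depth] = counts.get(depth, 0) + 1
    return counts


def all_paths(graph, source, target, path=None):
    """Every directed simple path from source to target (depth-first)."""
    path = (path or []) + [source]
    if source == target:
        return [path]
    found = []
    for nxt in graph.get(source, []):
        if nxt not in path:  # skip nodes already on the path (cycles)
            found.extend(all_paths(graph, nxt, target, path))
    return found


def shortest_path(graph, source, target):
    """One shortest directed path (fewest edges), or None if unreachable."""
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nxt in graph.get(node, []):
            if nxt not in parents:
                parents[nxt] = node
                queue.append(nxt)
    return None


def common_reachable(graph, first, second):
    """Nodes reachable from both starting nodes."""
    return set(min_depths(graph, first)) & set(min_depths(graph, second))
```

Direct participations correspond to `min_depths(PARTICIPATIONS, "A", max_depth=1)`, indirect ones to the same call without `max_depth`, and the associate-side variants to the same functions applied to `invert(PARTICIPATIONS)`. Graph DBMSs expose these traversals as query-language constructs, so none of this bookkeeping has to be written by hand.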
The queries developed may be helpful both for investigation purposes and for a better data exploration experience.
Comparison of graph databases with a relational one
We imported the dataset into three of the best-known graph databases: ArangoDB v3.0.10, Neo4j v3.0.6, and OrientDB v2.2.11 (all Community Editions). We also imported it into a well-known relational database, PostgreSQL v9.6.1. The choice of relational product is not critical, because performance is mainly determined by the expressive power of SQL. These products were installed on a virtual server with modest resources, so that the results can be useful for other companies with similar hardware. For each kind of query, we selected three nodes of the graph that represent three different loads for the DBMSs: one node represents a lightweight case, with few returned results or shallow exploration depths; one represents an intermediate case; and one represents a heavy case. We executed each query more than once, so that we could also study the performance difference between a cold cache and a well-warmed one.
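A repeated-run measurement of this kind can be sketched as follows. This is a minimal harness, not the tooling used in the study: `run_query` is a hypothetical callable that executes one query against one DBMS, and the first run only counts as "cold" under the assumption that the DBMS was restarted (or its caches cleared) beforehand.

```python
import statistics
import time


def time_runs(run_query, repetitions=5):
    """Execute `run_query` several times and return each wall-clock time."""
    if repetitions < 2:
        raise ValueError("need at least one cold run and one warm run")
    timings = []
    for _ in range(repetitions):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings):
    """Separate the first (assumed cold) run from the later, warm ones."""
    return {
        "cold_s": timings[0],
        "warm_median_s": statistics.median(timings[1:]),
    }
```

Running the harness once per DBMS and per load case (lightweight, intermediate, heavy) yields a small table of cold and warm timings that can be compared across products.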
Since no standard query language for graph databases currently exists, each graph DBMS provides its own. This motivated us to assess the expressiveness and ease of use of the various query languages.
In brief, the results we gathered are the following:
- Graph databases provide purpose-built query languages that greatly help in expressing graph traversal queries and in tackling some of the typical computational problems of the field. The same queries are hard to implement efficiently in SQL or with the help of stored procedures.
- While the relational database performs well for simpler queries, the three graph databases analyzed generally outperform it, typically by one or two orders of magnitude, in the heavyweight cases of the graph exploration queries, i.e. those with large numbers of nodes to analyze and many levels to traverse.
- ArangoDB showed good import and query performance, especially for lightweight and intermediate workloads.
- One point to watch with the ArangoDB version tested is that it is quite RAM-greedy. However, ArangoDB claims to have solved this “issue” with its new 3.2 release and the new RocksDB storage engine.
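To give an idea of what a traversal looks like on the relational side (the first point of the list above), the sketch below computes direct and indirect participations with a recursive common table expression. SQLite is used here only because it ships with Python; the study used PostgreSQL, which also supports `WITH RECURSIVE`. The table name, column names, and data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE participation (owner TEXT, owned TEXT)")
conn.executemany(
    "INSERT INTO participation VALUES (?, ?)",
    [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"), ("D", "E")],
)

# Recursive CTE: start from the direct participations, then keep joining
# the edge table to itself. UNION removes duplicate (company, depth) rows,
# and the depth bound guarantees termination even if the data has cycles.
QUERY = """
WITH RECURSIVE reach(company, depth) AS (
    SELECT owned, 1 FROM participation WHERE owner = ?
    UNION
    SELECT p.owned, r.depth + 1
    FROM participation AS p
    JOIN reach AS r ON p.owner = r.company
    WHERE r.depth < ?
)
SELECT company, MIN(depth) AS min_depth
FROM reach
GROUP BY company
ORDER BY min_depth, company
"""


def indirect_participations(connection, fiscal_id, max_depth=10):
    """Return (company, minimum depth) rows reachable from `fiscal_id`."""
    return connection.execute(QUERY, (fiscal_id, max_depth)).fetchall()
```

Even in this small case the query has to carry the depth counter, the deduplication, and the termination condition explicitly, which is the kind of bookkeeping that graph query languages take care of themselves.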
Because of the good feedback received during the study, its good performance at both import and execution time, its good documentation, its ease of use, and its licensing prices, ArangoDB showed good potential for adoption in some InfoCamere applications. In fact, we decided to use it for a demo application that we are currently developing.
Some additional details about the comparison work may be found here.