Meta-Research Engine at UAB powered by ArangoDB
Chris Rocco, Systems Developer, The University of Alabama at Birmingham
About the Project:
Meta-Research Engine is a software system with the purpose of facilitating the mass encoding of scientific research studies. Define a research project, upload a set of papers under the domain of interest, and invite encoders to collaborate in extracting their information. At the end of a project, you are left with a reliable data set encompassing the contents of many research papers.
- Flexible Project Definitions: Capture all of the data you are interested in without compromise. Give precedence to more important questions without having to worry about overwhelming the encoder.
- Accurate, Accessible Data: The power of Meta-Research Engine resides in its robust data model. Capture dynamic, hierarchical data in a way that is easily queried using a variety of technologies and techniques that have only become available in recent years.
- Intuitive User Interface: Big data is confusing. The user interface abstracts many of its complexities, making it simple enough for a 6th grader to encode high-level research studies.
- User Conflict Resolution: We double-code papers to ensure data integrity, but more often than not, these papers aren’t coded exactly the same. Meta-Research Engine facilitates the resolution of conflicts between encoders of the same paper.
- Data Management, Modeling, and Infrastructure
- User Management
- The Web Portal
- Conflict Detection, Analysis, and Resolution
We have moved to a new database management system – ArangoDB. ArangoDB is a multi-model DBMS and has several essential advantages over a relational DBMS like MySQL.
A multi-model database allows us to define complex hierarchical relationships that can be queried efficiently. ArangoDB’s native query language, AQL, allows for complex aggregations that often eliminate the need to use an external programming language to analyze data, saving time and complexity.
In a document store like ArangoDB, everything is stored as a JSON document, making the data easy to import, export, convert, and back up. What began as a compromise, storing an encoding as JSON directly in a table column, has become a powerful data structure capable of answering complex questions about the data.
We have modeled all of our data as a graph, where the nodes are objects (usually nouns), and the edges connect those objects together (usually verbs). For example, a student (node) is enrolled in (edge) a class (node) that a teacher (node) teaches (edge).
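As a rough illustration of this model, the sketch below writes the student/class/teacher example as plain Python dictionaries. ArangoDB edge documents reference their endpoints through `_from` and `_to` attributes; everything else here (the `users` and `classes` collection names, the keys, and the helper functions) is an assumption made for the example, not the project's actual schema.

```python
# Vertices: one document per object, keyed by an ArangoDB-style "collection/key" ID.
vertices = {
    "users/teacher1": {"role": "teacher", "name": "Example Teacher"},
    "users/student1": {"role": "student", "name": "Example Student"},
    "classes/class1": {"name": "Example Class"},
}

# Edges: each edge document points from one vertex to another via _from and _to.
enrolled_in = [{"_from": "users/student1", "_to": "classes/class1"}]
teaches = [{"_from": "users/teacher1", "_to": "classes/class1"}]


def outbound(edges, start):
    """Return the IDs reached by following edges that start at `start`."""
    return [e["_to"] for e in edges if e["_from"] == start]


def inbound(edges, end):
    """Return the IDs of vertices that have an edge pointing at `end`."""
    return [e["_from"] for e in edges if e["_to"] == end]


classes_taught = outbound(teaches, "users/teacher1")
students_in_class = inbound(enrolled_in, "classes/class1")
```

Following an edge in either direction is the basic building block that the graph queries described later rely on.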
Using these new tools, we have refactored the database to allow for multiple paradigms of encoding studies. Here we introduce a new concept – a research project.
A research project has some number of domains linked to it, each of which has an arbitrary number of variables and subdomains. Domains are now purely an organizational structure for grouping related variables.
This gives us a great deal of modularity, allowing for adding and dropping domains and fields without compromising the front-end.
Every time the front-end initializes an encoding of a paper, it queries the database for the research project’s ‘structure’, which is a hierarchical object (compiled directly from the database) that is comparable to the encoding format from the previous version of the project.
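To make the idea of a compiled structure concrete, here is a minimal sketch that builds a nested object from flat parent-to-child links. The function `build_structure`, the tuple-based edge lists, and the project, domain, and variable names are all hypothetical; in the real system this compilation happens inside the database.

```python
def build_structure(project, domain_edges, variable_edges):
    """Compile a nested structure from flat (parent, child) link lists.

    domain_edges:   (parent_id, domain_id) pairs; the parent is the project
                    or another domain
    variable_edges: (domain_id, variable_name) pairs
    """
    def expand(domain_id):
        # Recursively attach the variables and subdomains of one domain.
        return {
            "domain": domain_id,
            "variables": [v for d, v in variable_edges if d == domain_id],
            "subdomains": [expand(c) for p, c in domain_edges if p == domain_id],
        }

    return {
        "project": project,
        "domains": [expand(c) for p, c in domain_edges if p == project],
    }


# Hypothetical project with one top-level domain and one subdomain.
domain_edges = [("demo_project", "diet"), ("diet", "diet_composition")]
variable_edges = [("diet", "diet_type"), ("diet_composition", "fat_percent")]

structure = build_structure("demo_project", domain_edges, variable_edges)
```

Because the structure is derived from links each time it is requested, adding or dropping a domain in this sketch amounts to adding or removing a pair from the lists.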
The following graph showcases this structure for the Big Data research study:
Encodings are lean. We don’t store structure metadata, which eliminates redundancy and reduces storage costs. For comparison, the dump of all tables in the old database except for submissions (encodings) takes up 637 KB. The submissions table consumes 72 MB.
The new model maintains scope (constant vs study branch) information on a per-variable basis. This more accurately describes papers and prevents the user from having to input a constant multiple times, simply because another variable in their domain varied in the experiment.
A Use-Case for Storing Data Graphically
Let’s say that, given a teacher, we want a list of every paper that any of their students has encoded. We would start at that teacher’s node and follow every teaches edge to some number of classes. From each of those classes, the query would follow the inbound enrolled_in edges to some number of students, then continue to each student’s assignments, and finally follow assignment_of edges to a list of papers. The following is a graph representation of the traversal just described, starting at the teacher LaRhonda Brown. The center of the graph is her class, with each of her students and their assignments and papers branching out.
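The traversal just described can be sketched in plain Python over in-memory edge lists. This is not AQL and not the project's query; it only mirrors the hops. The `teaches`, `enrolled_in`, and `assignment_of` names come from the text, while the `assigned_to` edge, the edge directions for assignments, and all IDs are assumptions.

```python
def outbound(edges, start):
    """IDs reached by following (source, target) edges away from `start`."""
    return [dst for src, dst in edges if src == start]


def inbound(edges, end):
    """IDs of vertices whose edges point at `end`."""
    return [src for src, dst in edges if dst == end]


def papers_for_teacher(teacher, teaches, enrolled_in, assigned_to, assignment_of):
    """Teacher -> classes -> students -> assignments -> papers."""
    papers = set()
    for cls in outbound(teaches, teacher):
        for student in inbound(enrolled_in, cls):
            # Assumed direction: assignment --assigned_to--> student.
            for assignment in inbound(assigned_to, student):
                # Assumed direction: assignment --assignment_of--> paper.
                papers.update(outbound(assignment_of, assignment))
    return sorted(papers)


# Hypothetical data: one teacher, one class, two students, three assignments.
teaches = [("teacher1", "class1")]
enrolled_in = [("student1", "class1"), ("student2", "class1")]
assigned_to = [("a1", "student1"), ("a2", "student1"), ("a3", "student2")]
assignment_of = [("a1", "paper1"), ("a2", "paper2"), ("a3", "paper1")]

papers = papers_for_teacher("teacher1", teaches, enrolled_in, assigned_to, assignment_of)
```

Collecting results in a set means a paper assigned to several students appears only once in the final list.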
Some user management functionality that we previously decided to compromise on will become trivial, including post-registration class enrollment, enrollment in multiple classes, and multiple instructors per class. In addition, once-difficult statistics are now much easier to generate, such as average completion by student, class, or teacher.
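As a small illustration of the kind of statistic mentioned here, the sketch below averages completion over a list of assignment records grouped by an arbitrary key. The record fields (`student`, `class`, `completion`) and the function name are hypothetical, and the real system would more likely compute this with a database query than in application code.

```python
from collections import defaultdict


def average_completion(assignments, key):
    """Average the 'completion' field of assignment records, grouped by `key`."""
    totals = defaultdict(lambda: [0.0, 0])  # group -> [sum, count]
    for record in assignments:
        totals[record[key]][0] += record["completion"]
        totals[record[key]][1] += 1
    return {group: total / count for group, (total, count) in totals.items()}


# Hypothetical assignment records; completion is a fraction between 0 and 1.
assignments = [
    {"student": "s1", "class": "c1", "completion": 1.0},
    {"student": "s1", "class": "c1", "completion": 0.5},
    {"student": "s2", "class": "c1", "completion": 0.0},
]

by_student = average_completion(assignments, "student")
by_class = average_completion(assignments, "class")
```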
The Web Portal
The re-visited web portal takes full advantage of the back-end upgrades.
A Stateless Approach
The project as a whole has been made completely stateless, meaning the backend and frontend share nothing but a public API. What we now have is a web service with a powerful infrastructure and a website that is simply a consumer.
This carries some major advantages. The web service can be developed independently of any of its consumers. We can also easily introduce new consumers, such as an administrative interface or even a mobile application.
A conditional, dynamic rendering strategy enables real-time manipulation of this complex data model on the frontend. Domains might render in multiple domains, as variables move independently. Branches are self-managing components and can be edited individually.
Completion assessment is now super accurate, and even easier!
The study structure’s metadata is mapped to the lean encoding via AngularJS HTML component directives. These intelligent UI components allow for the seamless integration of even the most advanced UI tools.
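A minimal sketch of how structure metadata and a lean encoding might be combined to assess completion follows. It is written in Python rather than as an AngularJS directive, and the data layout, the rule for what counts as "answered", and all names are assumptions for illustration.

```python
def collect_variables(domain):
    """List every variable name in a domain and, recursively, its subdomains."""
    names = list(domain["variables"])
    for sub in domain["subdomains"]:
        names.extend(collect_variables(sub))
    return names


def is_answered(encoding, name):
    """Assumed rule: answered if set as a constant or set in every branch."""
    if encoding["constants"].get(name) is not None:
        return True
    branches = encoding["branches"]
    return bool(branches) and all(b.get(name) is not None for b in branches)


def completion(structure, encoding):
    """Return (fraction answered, list of unanswered variable names)."""
    required = [n for d in structure["domains"] for n in collect_variables(d)]
    missing = [n for n in required if not is_answered(encoding, n)]
    fraction = 1.0 if not required else (len(required) - len(missing)) / len(required)
    return fraction, missing


# Hypothetical structure (metadata) and lean encoding (values only).
structure = {"domains": [{
    "domain": "diet",
    "variables": ["diet_type", "fat_percent"],
    "subdomains": [{"domain": "housing",
                    "variables": ["cage_size", "temperature"],
                    "subdomains": []}],
}]}
encoding = {
    "constants": {"cage_size": "large"},
    "branches": [{"diet_type": "High Fat Diet"},
                 {"diet_type": "Low Fat Diet", "fat_percent": 10}],
}

fraction, missing = completion(structure, encoding)
```

Because the list of required variables comes from the structure rather than from the encoding itself, the encoding can stay lean while completion is still measured against the full project definition.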
The web service’s public API implements the Google OAuth2 authentication protocol, which issues authentication tokens to clients and validates them server-side.
This new infrastructure gives rise to a powerful new conflict resolution system. We have overcome many of the difficulties in modeling the complex relationships required for conflict analysis. Conflict detection runs every time an assignment marked ‘done’ is saved, and proceeds as follows:
- Look at all of the users who have been assigned to this paper, and their encodings
- Identify structural conflicts based on the number of reported study arms
  - 3 branches → User A
  - 4 branches → User C
  - 5 branches → User B
- Identify scope conflicts
  - Constant → User B
  - Varying → Users A and C
- Identify value conflicts based on the reported value
  - High Fat Diet → Users A and B
  - Low Fat Diet → User C
- More than one unique response constitutes a conflict
- A conflict report containing the discrepancies and the users involved is then generated, to be consumed by the resolution application.
With this approach, we gain the power to perform conflict analysis on an arbitrary number of encodings. Even better, we allow co-existing conflict types. For example, instead of stopping after finding a structure conflict, we can still identify scope conflicts and value conflicts between variables of the same scope.
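The steps above can be sketched as follows. This is a simplified illustration, not the project's implementation: the encoding layout, the function names, and the choice to compare a varying variable as a tuple of per-branch values are all assumptions.

```python
from collections import defaultdict


def group_users(responses):
    """Group users by response; more than one unique response is a conflict."""
    groups = defaultdict(list)
    for user, response in sorted(responses.items()):
        groups[response].append(user)
    return dict(groups) if len(groups) > 1 else None


def scope_of(encoding, variable):
    """Return 'constant', 'varying', or None if the variable was not encoded."""
    if variable in encoding["constants"]:
        return "constant"
    if any(variable in branch for branch in encoding["branches"]):
        return "varying"
    return None


def reported_value(encoding, variable):
    """Constant value, or a tuple of per-branch values if the variable varies."""
    if variable in encoding["constants"]:
        return encoding["constants"][variable]
    return tuple(branch.get(variable) for branch in encoding["branches"])


def detect_conflicts(encodings, variables):
    """encodings maps user -> {'constants': {...}, 'branches': [{...}, ...]}."""
    report = {"structure": None, "scope": {}, "value": {}}
    # Structural conflict: users disagree on the number of study branches.
    report["structure"] = group_users(
        {user: len(enc["branches"]) for user, enc in encodings.items()})
    for var in variables:
        scopes = {user: scope_of(enc, var) for user, enc in encodings.items()}
        scope_conflict = group_users(scopes)
        if scope_conflict:
            report["scope"][var] = scope_conflict
        # Value conflicts are compared only among users who chose the same scope.
        by_scope = defaultdict(dict)
        for user, scope in scopes.items():
            if scope is not None:
                by_scope[scope][user] = reported_value(encodings[user], var)
        for scope, values in by_scope.items():
            value_conflict = group_users(values)
            if value_conflict:
                report["value"].setdefault(var, {})[scope] = value_conflict
    return report


# Hypothetical encodings of the same paper by three users.
two_arms = [{"diet": "High Fat Diet"}, {"diet": "Low Fat Diet"}]
encodings = {
    "A": {"constants": {"species": "mouse"}, "branches": two_arms},
    "B": {"constants": {"species": "mouse", "diet": "High Fat Diet"}, "branches": [{}]},
    "C": {"constants": {"species": "rat"}, "branches": two_arms},
}
report = detect_conflicts(encodings, ["species", "diet"])
```

In this example the report contains a structural conflict, a scope conflict on `diet`, and a value conflict on `species` at the same time, since detection does not stop at the first kind of conflict it finds.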
The ability to capture all of this information is a product of an accurate data model.
Big thanks to Chris Rocco for investing time to write this use case!