Schema Validation in 3.7

Open In Colab

Being schema-less and allowing for flexible documents is one of the big advantages of ArangoDB. But sometimes there are use cases where we have a fixed schema. With the upcoming release 3.7, ArangoDB provides users with the means to validate the structure of documents using JSON Schema (draft-4).

The validation can be configured at the collection level and with different strictness levels, allowing users to import unclean data first and improve it later.

The rest of this post walks through a concrete example. You can click the "Open In Colab" button to try it yourself, or follow the static output shown throughout the post.

First things first, we need to install and import the necessary packages. This notebook also creates a temporary database running on ArangoDB Oasis.

If you run this notebook, you will be provided with the hostname, username, and password of a fully functional ArangoDB deployment. These credentials, along with the deployment itself, will be automatically deleted.

In [79]:
%%capture
!git clone -b oasisConnector --single-branch https://github.com/cw00dw0rd/ArangoNotebooks.git
!rsync -av ArangoNotebooks/ ./ --exclude=.git
!pip3 install pyarango
!pip3 install "python-arango>=5.0"
In [80]:
import json
import requests
import sys
import pprint
import oasis

from os import path
from pyArango.connection import *
from pyArango.collection import Collection, Edges, Field
from pyArango.graph import Graph, EdgeDefinition
from pyArango.collection import BulkOperation as BulkOperation

pp = pprint.PrettyPrinter()

# Retrieve tmp credentials from ArangoDB Tutorial Service
login = oasis.getTempCredentials(tutorialName='schemaValidation37', tempURL='https://tutorials.arangodb.cloud:8529/_db/_system/tutorialDB/tutorialDB')

## Connect to the temp database
conn = oasis.connect(login)
db = conn[login["dbName"]] 

def cleanupCollections(db):
  try:
    db['Customers'].delete()
  except Exception:  # the collection may not exist yet
    pass
  db.reload()

# Cleanup (just in case the example is rerun)
cleanupCollections(db)

# Print the temporary credentials; run the notebook to generate your own
pp.pprint(login)
Requesting new temp credentials.
Temp database ready to use.
{'dbName': 'TUT74fbdhqtzjtp4j3yk293h',
 'hostname': 'tutorials.arangodb.cloud',
 'password': 'TUT868nebwahwwozqdi75z0eb',
 'port': 8529,
 'username': 'TUThf37jys3b5d8xxeaepo9hl'}

Assume you have a collection containing customers and leads. For now, you have only names and email addresses for most of the customers, as shown below. Here we add some customer information, noting that the email for James is missing. We follow this up with an AQL query that returns all of the customers in our newly created Customers collection.

In [81]:
# Create Customer Collection
collection = db.createCollection(name="Customers")

# insert some documents
docs = []
doc = collection.createDocument()
doc["firstName"] = "James"
doc["lastName"] = "Cole"
docs.append(doc)

doc = collection.createDocument()
doc["firstName"] = "Claudius"
doc["lastName"] = "Weinberger"
doc["email"] = "[email protected]"
docs.append(doc)
collection.bulkSave(docs)

# Check customers
print("Check Customers")
aql = """
  FOR customer in Customers
    return customer
  """
queryResult = db.AQLQuery(aql)
for customer in queryResult:
   print(customer)
Check Customers
ArangoDoc '_id: Customers/312077378, _key: 312077378, _rev: _a5RAMz----': <store: {'firstName': 'James', 'lastName': 'Cole'}>
ArangoDoc '_id: Customers/312077379, _key: 312077379, _rev: _a5RAMz---A': <store: {'email': '[email protected]', 'firstName': 'Claudius', 'lastName': 'Weinberger'}>
In [82]:
# Drop the collection for the next example.
db["Customers"].delete() # drop
db.reloadCollections() 

Next, we will add validation for this collection. This is done by providing a well-formed object, such as the one below, to the schema attribute of the collection properties. You could also add it from arangosh with the db.Customers.properties({ schema: <object> }) command.
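
For an existing collection, the same object can presumably also be sent over ArangoDB's HTTP API instead of arangosh. The sketch below only assembles a request for PUT /_api/collection/{name}/properties and does not send anything. The helper name build_schema_request, the base URL, and the database name are illustrative assumptions; the "schema" key matches the collection properties printed later in this post.

```python
import json


def build_schema_request(base_url, db_name, collection_name, schema):
    """Return (method, url, body) for updating a collection's schema.

    Hypothetical helper for illustration: it only builds the request,
    nothing is sent to a server.
    """
    url = (
        f"{base_url.rstrip('/')}/_db/{db_name}"
        f"/_api/collection/{collection_name}/properties"
    )
    body = json.dumps({"schema": schema})
    return "PUT", url, body


# Shortened validation object, used here only as sample input
example_schema = {
    "rule": {"type": "object", "required": ["firstName", "lastName", "email"]},
    "level": "moderate",
    "message": "Customer Schema Validation Failed.",
}

# Assumed local deployment and database; replace with your own values
method, url, body = build_schema_request(
    "https://localhost:8529", "_system", "Customers", example_schema
)
print(method, url)
print(body)
```

The resulting values could then be passed to requests.request(method, url, data=body, auth=(username, password)) using the credentials of your own deployment.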

The following example validation object sets a message that is displayed when validation fails. The level "moderate" allows us to keep working with old documents that do not conform to the schema: such documents can still be updated, but no new invalid documents can be added, and valid documents cannot be changed into invalid ones. The last attribute, which is required, is the rule attribute containing a valid JSON Schema description.
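
To make the level semantics concrete, here is a small stand-alone model of when a write is accepted. It is a simplified sketch, not ArangoDB's actual implementation: the function name is_write_accepted and its parameters are made up for illustration, and the levels "none" and "new" are not covered in this post but are included as they are described in the ArangoDB documentation.

```python
def is_write_accepted(level, new_doc_valid, is_insert, old_doc_valid=True):
    """Simplified model of the validation levels (illustration only).

    level          -- "none", "new", "moderate" or "strict"
    new_doc_valid  -- does the document being written match the rule?
    is_insert      -- True for inserts, False for updates/replacements
    old_doc_valid  -- for modifications: did the stored document match the rule?
    """
    if level == "none":
        return True  # validation is switched off
    if level == "new":
        # Only newly inserted documents are checked
        return new_doc_valid if is_insert else True
    if level == "moderate":
        if is_insert:
            return new_doc_valid
        # Modifying a document that was already invalid is tolerated
        return new_doc_valid or not old_doc_valid
    if level == "strict":
        return new_doc_valid  # no exceptions are made
    raise ValueError(f"unknown level: {level}")


# Inserting an invalid document is rejected
print(is_write_accepted("moderate", new_doc_valid=False, is_insert=True))   # False
# Updating an old, already invalid document is still possible
print(is_write_accepted("moderate", new_doc_valid=False, is_insert=False,
                        old_doc_valid=False))                               # True
# Turning a valid document into an invalid one is rejected
print(is_write_accepted("moderate", new_doc_valid=False, is_insert=False,
                        old_doc_valid=True))                                # False
# With "strict", the old state of the document does not matter
print(is_write_accepted("strict", new_doc_valid=False, is_insert=False,
                        old_doc_valid=False))                               # False
```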

In [83]:
# Example schema validation object
schema = {
    "rule" : {
      "type" : "object",
      "properties": {
        "firstName": {
            "type": "string",
        },
        "lastName": {
            "type": "string",
        },
        "email": {
            "type": "string",
        },
    },
    "required" : ["firstName", "lastName", "email"],
  },
  "level": "moderate",
  "message": "Customer Schema Validation Failed."
}


# Recreate Customers collection, now with moderate schema validation
collection = db.createCollection(
        name = "Customers",
        schema = schema
    )
In [84]:
# Confirm validation has been added by checking collection properties
collectionProperties = collection.properties()
pp.pprint(collectionProperties)
{'cacheEnabled': False,
 'code': 200,
 'distributeShardsLike': '_graphs',
 'error': False,
 'globallyUniqueId': 'c312077380/',
 'id': '312077380',
 'isDisjoint': False,
 'isSmart': False,
 'isSmartChild': False,
 'isSystem': False,
 'keyOptions': {'allowUserKeys': True, 'type': 'traditional'},
 'minReplicationFactor': 1,
 'name': 'Customers',
 'numberOfShards': 1,
 'replicationFactor': 3,
 'schema': {'level': 'moderate',
            'message': 'Customer Schema Validation Failed.',
            'rule': {'properties': {'email': {'type': 'string'},
                                    'firstName': {'type': 'string'},
                                    'lastName': {'type': 'string'}},
                     'required': ['firstName', 'lastName', 'email'],
                     'type': 'object'}},
 'shardKeys': ['_key'],
 'shardingStrategy': 'hash',
 'status': 3,
 'statusString': 'loaded',
 'type': 2,
 'waitForSync': False,
 'writeConcern': 1}

If we now try to insert a document in which a name or the email is not a string, or which lacks one of the required attributes, the insert will fail with the message specified in the validation object.
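
The checks described above can be approximated locally before sending anything to the database. The following is a minimal sketch that covers only the "required" keyword and "type": "string"; it is not a full JSON Schema validator and not what ArangoDB runs internally. The function name check_customer, the sample documents, and the example.com addresses are made up for illustration.

```python
def check_customer(doc, rule):
    """Return a list of problems; an empty list means the document passes.

    Illustration only: handles just "required" and "type": "string".
    """
    problems = []
    for name in rule.get("required", []):
        if name not in doc:
            problems.append(f"missing required attribute '{name}'")
    for name, spec in rule.get("properties", {}).items():
        if name in doc and spec.get("type") == "string":
            if not isinstance(doc[name], str):
                problems.append(f"attribute '{name}' is not a string")
    return problems


# Same rule as in the "moderate" validation object of this post
rule = {
    "type": "object",
    "properties": {
        "firstName": {"type": "string"},
        "lastName": {"type": "string"},
        "email": {"type": "string"},
    },
    "required": ["firstName", "lastName", "email"],
}

# Hypothetical sample documents
candidates = [
    {"firstName": "James", "lastName": "Cole"},                          # no email
    {"firstName": "Jane", "lastName": 42, "email": "jane@example.com"},  # wrong type
    {"firstName": "John", "lastName": "Doe", "email": "john@example.com"},
]

for doc in candidates:
    print(doc, "->", check_customer(doc, rule) or "ok")
```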

In [85]:
#  Try to insert the same documents, now that the James document doesn't match the schema
docs = []
doc = collection.createDocument()
doc["firstName"] = "James"
doc["lastName"] = "Cole"
docs.append(doc)

doc = collection.createDocument()
doc["firstName"] = "Claudius"
doc["lastName"] = "Weinberger"
doc["email"] = "[email protected]"
docs.append(doc)

try:
  collection.bulkSave(docs)
except Exception as exc:
  print("The following exception is due to attempting to insert a document that doesn't match the schema.")
  print(exc)

print()
print("Here are the documents that were successfully inserted: ")
queryResult = db.AQLQuery(aql)
for customer in queryResult:
  print(customer)
The following exception is due to attempting to insert a document that doesn't match the schema.
1 documents could not be created. Errors: {'error': False, 'created': 1, 'errors': 1, 'empty': 0, 'updated': 0, 'ignored': 0}

Here are the documents that were successfully inserted: 
ArangoDoc '_id: Customers/310078789, _key: 310078789, _rev: _a5RAOOu--_': <store: {'email': '[email protected]', 'firstName': 'Claudius', 'lastName': 'Weinberger'}>

To make the validation more stringent, you can change the validation level to strict and extend the schema with some additional requirements. The validation object could then look like this:

In [86]:
db["Customers"].delete() # Drop the collection
db.reloadCollections() # Driver method to reload the available collections on the database

schema = {
  "message" : "Customer Validation Failed",
  "level" : "strict",
  "rule" : {
    "type" : "object",
    "properties" : {
      "firstName" : {
        "type" : "string",
        "minLength" : 2,
        "maxLength" : 20
      },
      "lastName" : {
        "type" : "string",
        "minLength" : 2,
        "maxLength" : 20
      },
      "email" : {
        "type" : "string",
        "minLength" : 5,
        "maxLength" : 20
      },
      "type" : {
        "enum" : [
          "lead",
          "customer",
          "enterprise"
        ]
      }
    },
    "required" : [
      "firstName",
      "lastName",
      "email"
    ],
    "additionalProperties" : False
  }
}

By setting the level to "strict", all inserted or changed documents have to match the schema, and the schema itself has become more rigorous as well. The names and email now have length requirements in addition to being required. Furthermore, there is an optional customer type property that must match one of the three given values. Additional properties are not allowed, which is controlled by setting additionalProperties to false in the schema definition.
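
As a rough illustration of what the stricter rule rejects, the sketch below re-implements just the keywords used in this schema (required, type string, minLength, maxLength, enum, and additionalProperties) in plain Python. It is an approximation for experimenting locally, not a complete JSON Schema validator and not ArangoDB's implementation; the function name check_strict, the sample documents, and the example.com addresses are hypothetical.

```python
def check_strict(doc, rule):
    """Return a list of problems; an empty list means the document passes."""
    problems = []
    properties = rule.get("properties", {})

    # Every required attribute must be present
    for name in rule.get("required", []):
        if name not in doc:
            problems.append(f"missing required attribute '{name}'")

    # With additionalProperties set to False, unknown attributes are rejected
    if rule.get("additionalProperties") is False:
        for name in doc:
            if name not in properties:
                problems.append(f"unexpected attribute '{name}'")

    # Per-attribute checks: string type, length limits, allowed values
    for name, spec in properties.items():
        if name not in doc:
            continue
        value = doc[name]
        if spec.get("type") == "string":
            if not isinstance(value, str):
                problems.append(f"attribute '{name}' is not a string")
                continue
            if "minLength" in spec and len(value) < spec["minLength"]:
                problems.append(f"attribute '{name}' is too short")
            if "maxLength" in spec and len(value) > spec["maxLength"]:
                problems.append(f"attribute '{name}' is too long")
        if "enum" in spec and value not in spec["enum"]:
            problems.append(f"attribute '{name}' has a value outside the enum")
    return problems


# Same rule as in the "strict" validation object of this post
strict_rule = {
    "type": "object",
    "properties": {
        "firstName": {"type": "string", "minLength": 2, "maxLength": 20},
        "lastName": {"type": "string", "minLength": 2, "maxLength": 20},
        "email": {"type": "string", "minLength": 5, "maxLength": 20},
        "type": {"enum": ["lead", "customer", "enterprise"]},
    },
    "required": ["firstName", "lastName", "email"],
    "additionalProperties": False,
}

# Hypothetical sample documents
samples = [
    {"firstName": "J", "lastName": "Cole", "email": "j@example.com"},
    {"firstName": "James", "lastName": "Cole", "email": "james@example.com",
     "type": "partner"},
    {"firstName": "James", "lastName": "Cole", "email": "james@example.com",
     "phone": "555-0100"},
    {"firstName": "James", "lastName": "Cole", "email": "james@example.com",
     "type": "lead"},
]

for doc in samples:
    print(check_strict(doc, strict_rule) or "ok")
```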

In [87]:
print()
print("Recreate Collection with validation rule")
collection = db.createCollection(
        name = "Customers",
        schema = schema
    )

#  Try to insert same documents
docs = []
doc = collection.createDocument()
doc["firstName"] = "James"
doc["lastName"] = "Cole"
# Note we are missing the required email attribute
docs.append(doc)

doc = collection.createDocument()
doc["firstName"] = "Claudius"
doc["lastName"] = "Weinberger"
doc["email"] = "[email protected]"
docs.append(doc)

try:
  collection.bulkSave(docs)
except Exception as exc:
  print("Expected exception, as only one of the docs conforms to the validation rule.")
  print(exc)

# Check customers
print()
print("Checking customers added with validation")
aql = """
  FOR customer in Customers
    return customer
  """
queryResult = db.AQLQuery(aql)
for customer in queryResult:
  print(customer)


# Next Steps
print()
print("If you are running this notebook in Google Colab, use these credentials to access the ArangoDB Web UI at:")
print("https://"+login["hostname"]+":"+str(login["port"]))
print("Username: " + login["username"])
print("Password: " + login["password"])
Recreate Collection with validation rule
Expected exception, as only one of the docs conforms to the validation rule.
1 documents could not be created. Errors: {'error': False, 'created': 1, 'errors': 1, 'empty': 0, 'updated': 0, 'ignored': 0}

Checking customers added with validation
ArangoDoc '_id: Customers/314071743, _key: 314071743, _rev: _a5RAPh2--_': <store: {'email': '[email protected]', 'firstName': 'Claudius', 'lastName': 'Weinberger'}>

If you are running this notebook in Google Colab, use these credentials to access the ArangoDB Web UI at:
https://tutorials.arangodb.cloud:8529
Username: TUThf37jys3b5d8xxeaepo9hl
Password: TUT868nebwahwwozqdi75z0eb

If you would like to dive deeper into this example, use the temporary database credentials printed above to access the ArangoDB Web UI and continue exploring.

Otherwise, if you have not run the notebook in Colab yet, click the Open in Colab button at the top of the page.

Please keep in mind that this database is temporary and will be automatically deleted. If you would like to have a permanent deployment to continue exploring 3.7 with ArangoDB Oasis, sign up for free!

If you would like to continue exploring ArangoDB and all of the new features of 3.7 locally instead, you can download the beta here.