Insert with Overwrite Option
With ArangoDB 3.7 there are a number of improvements to insert operations as well as the addition of some new functionality such as UPSERT and REPSERT. This post will detail some of these changes along with example in an interactive notebook.
Insert, Replace, Upsert¶
With a database one sometimes wants to insert data which might be stored already. Instead of re-inserting the data again in case it is already present, we would like to update it. In this tutorial we want to take a look at the different options to achieve this with ArangoDB.
%%capture
!git clone -b oasisConnector --single-branch https://github.com/cw00dw0rd/ArangoNotebooks.git
!rsync -av ArangoNotebooks/ ./ --exclude=.git
!pip3 install pyarango
!pip3 install "python-arango>=5.0"
import json
import requests
import sys
import pprint
import oasis
import time
from pyArango.connection import *
from pyArango.collection import Collection, Edges, Field
from pyArango.collection import BulkOperation as BulkOperation
First, create a temporary database:
pp = pprint.PrettyPrinter()
## Retrieve tmp credentials from ArangoDB Tutorial Service
login = oasis.getTempCredentials(tutorialName='Upsert37', tempURL='https://tutorials.arangodb.cloud:8529/_db/_system/tutorialDB/tutorialDB')
## Connect to the temp database
conn = oasis.connect(login)
db = conn[login["dbName"]]
pp.pprint(login)
UPSERT¶
Let us create a sample collection and convenience print function:
db.createCollection(name="Websites")
all_query = """
FOR doc in Websites
RETURN doc
"""
def print_all():
query_result = db.AQLQuery(all_query, rawResults=True)
for doc in query_result:
print(doc)
print()
print_all()
Let us define and run an UPSERT query for the first time.
upsert_query = """
UPSERT { page: "index.html" }
INSERT { page: "index.html", status: "inserted", value: 1 }
UPDATE { status: "updated" }
IN Websites
"""
db.AQLQuery(upsert_query)
print_all()
When running the query for the first time we should see 'status': 'inserted'
Let us run the same query again:
db.AQLQuery(upsert_query)
print_all()
This time we should see 'status': 'updated'
but all the other fields unchanged.
Let us examine the UPSERT statement a bit further:
UPSERT { page: "index.html" }
INSERT { page: "index.html", status: "inserted", value: 1 }
UPDATE { status: "updated" }
IN Websites
The statement comes with three parts:
- UPSERT specifying the search criteria
- INSERT specifying the complete doc which should be inserted if target doc not present
- UPDATE specifying the fields which should be updated in case the target doc is already present
UPDATE
only adds/modifies the specified document attributes in the existing document.
Instead of an UPDATE
part one can also specify a REPLACE
part, specifying the full document that is used as a replacement.
upsert_replace_query = """
UPSERT { page: "index.html" }
INSERT { page: "index.html", status: "inserted", value: 1 }
REPLACE { page: "index.html", status: "replaced", value: 1 }
IN Websites
"""
db.AQLQuery(upsert_replace_query)
print_all()
Now we should be seeing 'status': 'replaced'
.
We can also leverage the existing attribute values using the pseudovariable OLD.
upsert_old_query = """
UPSERT { page: "index.html" }
INSERT { page: "index.html", status: "inserted", value: 1 }
UPDATE { page: "index.html", status: "updated", prev_status: OLD.status, value: OLD.value + 1 }
IN Websites
"""
db.AQLQuery(upsert_old_query)
print_all()
This should return 'status': 'updated', 'prev_status': 'replaced'
UPSERT Performance and Caveats¶
The lookup attribute(s) from the search expression should be indexed in order to improve UPSERT performance Ideally, the search expression contains the shard key, as this allows the lookup to be restricted to a single shard
The lookup and the insert/update/replace parts are executed non- atomically. That means if multiple UPSERT queries run concurrently, they may all determine that the target document does not exist and then create it multiple times!!
Note that due to the non-atomicity of the lookup and insert/update/replace, even with a unique index there may be duplicate key errors or conflicts. But if they occur, the application/client code can execute the same query again
To prevent this from happening, one should a unique index to the lookup attribute(s). Note that in the cluster a unique index can only be created if it is equal to the shard key attribute of the collection or at least contains it as a part.
An alternative to making an UPSERT statement work atomically is
to use the exclusive
option to limit write concurrency for this collection to 1, which helps avoiding conflicts but is bad for throughput!
upsert_exclusive_query = """
UPSERT { page: "index.html" }
INSERT { page: "index.html", status: "inserted", value: 1 }
UPDATE { page: "index.html", status: "updated", prev_status: OLD.status, value: OLD.value + 1 }
IN Websites
OPTIONS { exclusive: true }
"""
db.AQLQuery(upsert_exclusive_query)
print_all()
REPSERT¶
An alternative to UPSERT is to use a REPSERT, which is a special extension for the INSERT command It completely replaces a document, identified by _key, in case it already exists. Note, running the query for the first time will create a second document it has a new _key. Feel free to change the value and rerun!
repsert_query = """
INSERT { _key: "index.html", value: 1 }
IN Websites
OPTIONS { overwrite: true }
"""
db.AQLQuery(repsert_query)
print_all()
REPSERT is an atomic operation so it fixes both the atomicity/concurrency problem that exists with UPSERT. REPSERT is restricted to _key lookups and is thus very fast.
A unique index on _key is always present, so REPSERT doesn't require setting up additional indexes.
REPSERTs can only be used when the _key
attribute value is known or can be created in an unambiguous fashion by the application.
Values for _key
values have some character and length restrictions, but alphanumeric keys work well.
The underlying collection must be sharded by _key
.
Overwrite Modes¶
To further control the behavior of INSERT on primary index unique constraint violations, overwriteMode
can be specified as follows:
- replace: full replacement of document
- ignore: if a document with the specified _key value exists already, nothing will be done and no write operation will be carried out. The insert operation will return success in this case.
- update: if a document with the specified _key value exists already, it will be patched (partially updated) with the specified document value.
- conflict: if a document with the specified _key value exists already, return a unique constraint violation error so that the insert operation fails. This is also the default behavior in case the overwrite mode is not set, and the overwrite flag is false or not set either.
Note: The overwriteMode
option is only available from 3.7 upwards.
repsert_ignore_query = """
INSERT { _key: "index.html", value: 1 }
IN Websites
OPTIONS { overwriteMode: "ignore" }
"""
db.AQLQuery(repsert_ignore_query)
print_all()
Tradeoffs¶
UPSERT
+ documents can be identified by any field (not just _key)
− performance implications when using a non-indexed attributes for lookup and hence might require an addtional index
REPSERT
+ as documents have to be identified by _key, there is already a primary index guaranteeing performance and uniqueness
− documents have to be identified by _key
− cannot refer to OLD document version
Cleanup¶
### Delete collections
db.dropAllCollections()
db.reload()
Please, keep in mind that the database used in this example is temporary and will be automatically deleted. If you would like to have a permanent deployment with ArangoDB Oasis, sign up for free!
If you would like to continue exploring ArangoDB and all of the new features of 3.7, you can download the beta here.