Repair Jobs

distributeShardsLike

Before versions 3.2.12 and 3.3.4, there was a bug in collection creation which could violate the guarantee that a collection's shards are distributed across the DBServers exactly like the shards of the prototype collection named in its distributeShardsLike setting.

Please read everything carefully before using this API!

There is a job that can restore this property safely. However, while the job is running,

  • the replicationFactor must not be changed for any affected collection or prototype collection (i.e., any collection set in distributeShardsLike, including SmartGraphs),
  • nor may shards of any of those prototypes be moved,
  • and no DBServer should be shut down during the repairs.

Also, only one repair job should run at any given time. Failure to meet those requirements will generally cause the job to abort, after which it can be restarted safely. However, changing the replicationFactor during repairs may leave the cluster in a state that is not repairable without manual intervention!

Shutting down the coordinator which executes the job will abort it, but it can safely be restarted on another coordinator. However, there may still be a shard move ongoing even after the job stopped. If the job is started again before the move is finished, repairing the affected collection will fail, but the repair can be restarted safely.

If the replicationFactor of any affected collection equals the total number of DBServers, the repairs might abort. In this case, it is necessary to reduce the replicationFactor by one (or to add a DBServer). The job will not do that automatically.

Generally, the job will abort if any of its assumptions fail, at the start or during the repairs. It can be started again and will resume from the current state.

Testing with GET /_admin/repair/distributeShardsLike

Using GET will not trigger any repairs; it only calculates and returns the operations that would be necessary to repair the cluster. This way, you can also check whether there is anything to repair.

$ wget -qSO - http://localhost:8529/_admin/repair/distributeShardsLike | jq .
  HTTP/1.1 200 OK
  X-Content-Type-Options: nosniff
  Server: ArangoDB
  Connection: Keep-Alive
  Content-Type: application/json; charset=utf-8
  Content-Length: 53
{
  "error": false,
  "code": 200,
  "message": "Nothing to do."
}

In the example above, all collections with distributeShardsLike have their shards distributed correctly. If something is broken, the response looks like this:

{
  "error": false,
  "code": 200,
  "collections": {
    "_system/someCollection": {
      "PlannedOperations": [
        {
          "BeginRepairsOperation": {
            "database": "_system",
            "collection": "someCollection",
            "distributeShardsLike": "aPrototypeCollection",
            "renameDistributeShardsLike": true,
            "replicationFactor": 4
          }
        },
        {
          "MoveShardOperation": {
            "database": "_system",
            "collection": "someCollection",
            "shard": "s2000109",
            "from": "PRMR-6b8c84be-1e80-4085-9065-177c6e31a702",
            "to": "PRMR-d3e62c96-c3f7-4766-bac6-f3bf8026f59a",
            "isLeader": false
          }
        },
        {
          "MoveShardOperation": {
            "database": "_system",
            "collection": "someCollection",
            "shard": "s2000109",
            "from": "PRMR-ee3d7af6-1fbf-4ab7-bfd1-56d0a1c1c9b9",
            "to": "PRMR-6b8c84be-1e80-4085-9065-177c6e31a702",
            "isLeader": true
          }
        },
        {
          "FixServerOrderOperation": {
            "database": "_system",
            "collection": "someCollection",
            "distributeShardsLike": "aPrototypeCollection",
            "shard": "s2000109",
            "distributeShardsLikeShard": "s2000092",
            "leader": "PRMR-6b8c84be-1e80-4085-9065-177c6e31a702",
            "followers": [
              "PRMR-99c2ac17-f417-4710-82aa-8350417dd089",
              "PRMR-3b0b85de-882b-4eb2-bbf2-ef1018bdc81e",
              "PRMR-d3e62c96-c3f7-4766-bac6-f3bf8026f59a"
            ],
            "distributeShardsLikeFollowers": [
              "PRMR-d3e62c96-c3f7-4766-bac6-f3bf8026f59a",
              "PRMR-99c2ac17-f417-4710-82aa-8350417dd089",
              "PRMR-3b0b85de-882b-4eb2-bbf2-ef1018bdc81e"
            ]
          }
        },
        {
          "FinishRepairsOperation": {
            "database": "_system",
            "collection": "someCollection",
            "distributeShardsLike": "aPrototypeCollection",
            "shards": [
              {
                "shard": "s2000109",
                "protoShard": "s2000092",
                "dbServers": [
                  "PRMR-6b8c84be-1e80-4085-9065-177c6e31a702",
                  "PRMR-d3e62c96-c3f7-4766-bac6-f3bf8026f59a",
                  "PRMR-99c2ac17-f417-4710-82aa-8350417dd089",
                  "PRMR-3b0b85de-882b-4eb2-bbf2-ef1018bdc81e"
                ]
              },
              {
                "shard": "s2000110",
                "protoShard": "s2000093",
                "dbServers": [
                  "PRMR-d3e62c96-c3f7-4766-bac6-f3bf8026f59a",
                  "PRMR-ee3d7af6-1fbf-4ab7-bfd1-56d0a1c1c9b9",
                  "PRMR-6b8c84be-1e80-4085-9065-177c6e31a702",
                  "PRMR-99c2ac17-f417-4710-82aa-8350417dd089"
                ]
              },
[...]
            ]
          }
        }
      ],
      "error": false
    }
  }
}

If something is to be repaired, the response will have the property collections with an entry <db>/<collection> for each collection that has to be repaired. Each collection also has a separate error property, which will be true if and only if an error occurred for this collection (and false otherwise). If error is true, the properties errorNum and errorMessage will also be set, and in some cases errorDetails as well, with additional information on how to handle a specific error.
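
For a quick overview of only the collections that reported an error, the response can be filtered with jq. This is just a small sketch, assuming the response shape shown above:

$ wget -qSO - http://localhost:8529/_admin/repair/distributeShardsLike \
    | jq '.collections // {} | with_entries(select(.value.error))'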

Repairing with POST /_admin/repair/distributeShardsLike

As this job possibly has to move a lot of data around, it can take a while, depending on the size of the affected collections. It should therefore not be called synchronously, but only via Async Results: i.e., set the header x-arango-async: store to put the job into the background and fetch its results later. Otherwise the request will most probably result in a timeout and the response will be lost! The job would still continue unless the coordinator is stopped, but there would be no way to find out whether it is still running, nor to get any success or error information afterwards.

Starting the job in background can be done like so:

$ wget --method=POST --header='x-arango-async: store' -qSO - http://localhost:8529/_admin/repair/distributeShardsLike 
  HTTP/1.1 202 Accepted
  X-Content-Type-Options: nosniff
  X-Arango-Async-Id: 152223973119118
  Server: ArangoDB
  Connection: Keep-Alive
  Content-Type: text/plain; charset=utf-8
  Content-Length: 0

This line is of notable importance:

  X-Arango-Async-Id: 152223973119118

as it contains the job id, which can be used to fetch the state and results of the job later. A GET on /_api/job/pending or /_api/job/done lists the ids of jobs that are pending or done, respectively.
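
For example, listing the pending jobs (the returned ids will of course differ):

$ wget -qSO - http://localhost:8529/_api/job/pending | jq .
[
  "152223973119118"
]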

The GET variant for testing can be run in the background in the same way, as shown below.
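
For example, the dry run described above can be started as a background job like this:

$ wget --header='x-arango-async: store' -qSO - http://localhost:8529/_admin/repair/distributeShardsLike

Its result can then be fetched via the job API, as described next.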

The job API must be used to fetch the state and results. It will return a 204 while the job is running. The actual response is returned only once; after that, the job is deleted and the API will return a 404. It is therefore recommended to write the response directly to a file for later inspection. Fetching the result is done by calling /_api/job via PUT:

$ wget --method=PUT -qSO - http://localhost:8529/_api/job/152223973119118 | jq .
  HTTP/1.1 200 OK
  X-Content-Type-Options: nosniff
  X-Arango-Async-Id: 152223973119118
  Server: ArangoDB
  Connection: Keep-Alive
  Content-Type: application/json; charset=utf-8
  Content-Length: 53
{
  "error": false,
  "code": 200,
  "message": "Nothing to do."
}
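
Since the result is delivered only once, it can be safer to write the response to a file instead of the terminal. A variant of the call above (the filename is just an example):

$ wget --method=PUT -qO repair-result.json http://localhost:8529/_api/job/152223973119118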

The final response will look like the response of the GET call. If an error occurred, the response should contain details on how to proceed. If in doubt, ask us on Slack: arangodb.com/community/