How to Monitor ArangoDB using collectd, Prometheus and Grafana
Information on how to set up a monitoring system for ArangoDB (standalone or cluster)
Introduction
ArangoDB exposes several statistics via HTTP/JSON APIs. Once collected, stored and visualized, these statistics can be used to monitor ArangoDB.
In this article we present an ArangoDB monitoring approach for Linux that uses the tools collectd, Prometheus and Grafana. We start with an overview of how to install and configure the needed tools. Then we walk you through the steps required to get some data through the pipeline and visualize it, followed by a more complete example. Finally, we provide an example for monitoring the health of an ArangoDB Cluster.
Required Software Tools and Components
The following is the list of tools used in this setup:
- ArangoDB
- collectd
- Prometheus
- Grafana
The data flows between these tools as follows:
- collectd fetches data from ArangoDB, using its curl_json plugin
- Prometheus fetches data from collectd, which presents it via its write_prometheus plugin (available since collectd version 5.7)
- Grafana queries Prometheus to visualize the data
Installing the software
We assume you have already installed ArangoDB.
For this setup to work, you need at least one instance of collectd. Please use version 5.7 or higher, so that the required write_prometheus plugin is included. You may prefer to install collectd on every server in your setup, as it can feed a lot of valuable information about those systems into your Prometheus database, such as CPU, memory or disk usage, which complements the data from ArangoDB nicely. However, one installation suffices to get the information provided by ArangoDB, and you may want to start with that.
Finally, you need to install Prometheus and Grafana.
Basic configuration
In the following examples, we use these names for the different installations:
- coordinator.arangodb.local for one ArangoDB coordinator
- collectd.local for your collectd instance
- prometheus.local for your Prometheus instance
These may also be installed on the same machine. Just replace the names used here with the actual names (or plain IP addresses) of your installations.
collectd
Assuming you are using a default collectd installation, /etc/collectd/collectd.conf should already contain the following lines, which include additional *.conf files from the directory /etc/collectd/collectd.conf.d:

    <Include "/etc/collectd/collectd.conf.d">
        Filter "*.conf"
    </Include>

You may want to set or add a line that specifies the time interval, in seconds, after which collectd fetches another set of data:

    Interval 60

However, this can also be set for each plugin separately.
Now add the file /etc/collectd/collectd.conf.d/write_prometheus.conf to configure the write_prometheus plugin, with the following content:

    # Configure a prometheus endpoint
    LoadPlugin write_prometheus

    <Plugin "write_prometheus">
        Port "9103"
    </Plugin>

After (re)starting collectd, the Prometheus interface should already be available. To check that it works, open the address http://collectd.local:9103/metrics in your browser. Do not forget to replace collectd.local with your actual collectd server. You should see something like this:

    # HELP collectd_df_df_complex write_prometheus plugin: 'df' Type: 'df_complex', Dstype: 'gauge', Dsname: 'value'
    # TYPE collectd_df_df_complex gauge
    collectd_df_df_complex{df="etc-hostname",type="free",instance="3c77f4c05a29"} 377251528704 1518599082748
    collectd_df_df_complex{df="etc-hostname",type="reserved",instance="3c77f4c05a29"} 23305961472 1518599082748
    collectd_df_df_complex{df="etc-hostname",type="used",instance="3c77f4c05a29"} 57782255616 1518599082748
    ...

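The same check can be scripted instead of done in a browser. The following is a minimal sketch, not part of the original setup: it parses the text format shown above with two regular expressions that cover only the simple lines seen in this article (a name, optional labels, a value and an optional timestamp). The names parse_metrics and SAMPLE are made up for this illustration.

```python
import re

# Assumption: lines look like the sample above:
#   name{label="value",...} value [timestamp]
LINE_RE = re.compile(
    r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+(\d+))?$'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples, skipping comments."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line, or a HELP/TYPE comment
        match = LINE_RE.match(line)
        if match is None:
            continue  # ignore lines this simple parser does not understand
        name, raw_labels, value, _timestamp = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Excerpt of the sample output shown above.
SAMPLE = """\
# HELP collectd_df_df_complex write_prometheus plugin: 'df' Type: 'df_complex', Dstype: 'gauge', Dsname: 'value'
# TYPE collectd_df_df_complex gauge
collectd_df_df_complex{df="etc-hostname",type="free",instance="3c77f4c05a29"} 377251528704 1518599082748
collectd_df_df_complex{df="etc-hostname",type="used",instance="3c77f4c05a29"} 57782255616 1518599082748
"""

for name, labels, value in parse_metrics(SAMPLE):
    print(name, labels["type"], value)
```

To run it against a live endpoint, the text could be fetched with urllib.request.urlopen("http://collectd.local:9103/metrics").read().decode() and passed to parse_metrics.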
Now we are ready to connect collectd to Prometheus.
Prometheus
A minimal working configuration file /etc/prometheus/prometheus.yml looks like this:

    scrape_configs:
      - job_name: node
        static_configs:
          - targets:
              - 'collectd.local:9103'

In case you already have a configuration file, you only need to add the line - 'collectd.local:9103' to the targets of an existing job (node in this example), or add your own job. You may also add multiple targets here if you chose to install multiple collectd instances. Later you will be able to tell the metrics of the different targets apart, because Prometheus enriches your time series with the labels instance="collectd.local:9103" and job="node".
You may also want to configure how often Prometheus fetches data from collectd in /etc/prometheus/prometheus.yml (taking into account the Interval setting of collectd as well):

    global:
      scrape_interval: 60s

The default setting for scrape_interval is 1m. More information can be found in the Prometheus documentation on configuration.
After (re)starting Prometheus, visit http://prometheus.local:9090/targets in your browser. There should be a table node containing your endpoint, and its State should be UP: this means Prometheus is already scraping data from your collectd instance. It may take a minute (depending on the scrape_interval you have used) until the status changes from UNKNOWN to UP.
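The target state can also be read from the Prometheus HTTP API at /api/v1/targets. The sketch below is one possible way to do that with the Python standard library; the function names and the hand-written sample payload are assumptions made for this illustration, and only the fields used here (labels, health) are shown.

```python
import json
import urllib.request


def fetch_targets(base_url="http://prometheus.local:9090"):
    """Fetch the target list from the Prometheus HTTP API (needs network)."""
    with urllib.request.urlopen(base_url + "/api/v1/targets") as response:
        return json.load(response)


def target_health(payload, job="node"):
    """Map each target instance of the given job to its health string."""
    return {
        target["labels"]["instance"]: target["health"]
        for target in payload["data"]["activeTargets"]
        if target["labels"].get("job") == job
    }


# Hand-written sample in the shape of an /api/v1/targets response.
SAMPLE_PAYLOAD = {
    "status": "success",
    "data": {
        "activeTargets": [
            {
                "labels": {"instance": "collectd.local:9103", "job": "node"},
                "scrapeUrl": "http://collectd.local:9103/metrics",
                "health": "up",
            }
        ],
        "droppedTargets": [],
    },
}

print(target_health(SAMPLE_PAYLOAD))
```

Against a running server, target_health(fetch_targets()) should report "up" for collectd.local:9103 once the first scrape has succeeded.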
Prometheus is now set up.
Grafana
After logging in to your Grafana installation, you should arrive at the Home Dashboard, where there is a link to Create your first data source. Alternatively, navigate to Configuration → Data sources and from there to Add data source.
Fill out the field Name for your Prometheus data source (choose freely). You probably want to check the box Default to set it as your default data source. As Type, choose Prometheus.
Add your Prometheus server under HTTP → URL: http://prometheus.local:9090.
Finally, click on Save & Test. If everything is configured correctly, you should get the message Data source is working.
Step-by-step example: Adding data to the pipeline
In this example, we add two metrics to our setup:
- The total physical memory in the ArangoDB Cluster (the sum of the physical memory of all Coordinators)
- The total resident set size, i.e. the amount of memory used by the ArangoDB instances
Other metrics can be added the same way.
Initial configuration of collectd / curl_json
This step has to be done only once. You can extend the configuration later as needed.
Add a config file for the curl_json collectd plugin, /etc/collectd/collectd.conf.d/curl_json.conf:

    LoadPlugin curl_json
    TypesDB "/etc/collectd/arangodb_types.db"

    <Plugin curl_json>
        # Interval 60
        <URL "http://coordinator.arangodb.local:8529/_admin/aardvark/statistics/coordshort">
            # Instance "arango_coordshort"

            # Set your authentication to Aardvark here:
            User "root"
            Password ""

            # IMPORTANT: Add <Key> blocks here! The configuration file will not be valid
            # until there is at least one <Key> block.
        </URL>
    </Plugin>

Optionally, you may override the Interval setting to specify how often, in seconds, curl_json should fetch data from ArangoDB. Please note that a very low value may generate load and therefore reduce the performance of the database.
Also optionally, you may add an Instance parameter. If you set it, for example to arango_coordshort, the label curl_json="arango_coordshort" will be added to all metrics configured in the <URL> block. Otherwise, the label curl_json="default" will be used.
You have to configure the credentials User and Password that you use to log in to http://coordinator.arangodb.local:8529/.
Also, please create the file /etc/collectd/arangodb_types.db. It may initially be empty.
Getting data from ArangoDB to collectd with curl_json
The URL http://coordinator.arangodb.local:8529/_admin/aardvark/statistics/coordshort may be visited with a browser to get an overview of the available data. The response looks something like this:

    {
        "enabled": true,
        "data": {
            "http": {
                ...
            },
            "times": [
                ...
            ],
            "physicalMemory": 101083078656,
            "residentSizeCurrent": 818298880,
            ...
        }
    }

So the data we’re looking for is available under data/physicalMemory and data/residentSizeCurrent, respectively. These need to be added to the curl_json configuration above.
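To make the path notation concrete, here is a small sketch of how a slash-separated path such as data/physicalMemory selects a value in the JSON document. It only illustrates the idea and is not how curl_json is implemented; the function name resolve_key and the numbers in the times list are made up.

```python
def resolve_key(document, key):
    """Follow a slash-separated path through nested dicts and lists."""
    node = document
    for part in key.split("/"):
        if isinstance(node, list):
            node = node[int(part)]  # a numeric path component indexes a list
        else:
            node = node[part]
    return node


# Shortened version of the response shown above.
SAMPLE = {
    "enabled": True,
    "data": {
        "times": [1518609091, 1518609151],  # made-up values for illustration
        "physicalMemory": 101083078656,
        "residentSizeCurrent": 818298880,
    },
}

print(resolve_key(SAMPLE, "data/physicalMemory"))
print(resolve_key(SAMPLE, "data/residentSizeCurrent"))
```

A path that ends in a number, such as data/times/0, picks a single element out of an array in the same way.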
First we add two new types to /etc/collectd/arangodb_types.db:

    coordshort_physicalMemory       value:GAUGE:U:U
    coordshort_residentSizeCurrent  value:GAUGE:U:U

Using these types, curl_json will use the names collectd_curl_json_coordshort_physicalMemory and collectd_curl_json_coordshort_residentSizeCurrent for the metrics. You may choose your own names for the types. If you just use builtin types (e.g. gauge) instead, all data will be fed into the same metric (e.g. collectd_curl_json_gauge) and can only be told apart using labels.
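The naming scheme can be summarized in a few lines of code. This sketch is inferred from the sample output in this article and only covers single-value GAUGE types without a per-key Instance setting; it is not taken from the collectd sources, and the function name render_series is hypothetical.

```python
def render_series(plugin, type_name, key, host, plugin_instance="default"):
    """Build the series name and labels as they appear in the sample output.

    Assumptions (taken from the examples in this article only):
    - the metric name is collectd_<plugin>_<type>
    - the plugin instance becomes a label named after the plugin
    - the key path, with "/" replaced by "-", becomes the "type" label
    - the collectd host name becomes the "instance" label
    """
    name = f"collectd_{plugin}_{type_name}"
    labels = [
        (plugin, plugin_instance),
        ("type", key.replace("/", "-")),
        ("instance", host),
    ]
    label_text = ",".join(f'{label}="{value}"' for label, value in labels)
    return f"{name}{{{label_text}}}"


print(render_series("curl_json", "coordshort_physicalMemory",
                    "data/physicalMemory", "3c77f4c05a29"))
```

With a builtin type such as gauge, every key would map to the same name and only the type label would differ, which is why custom types are used here.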
Now replace the lines

    # IMPORTANT: Add <Key> blocks here! The configuration file will not be valid
    # until there is at least one <Key> block.

in the <URL> block of /etc/collectd/collectd.conf.d/curl_json.conf with:

    <Key "data/physicalMemory">
        Type "coordshort_physicalMemory"
    </Key>

    <Key "data/residentSizeCurrent">
        Type "coordshort_residentSizeCurrent"
    </Key>

The Key is the path to the data in the JSON document above, while the Type is the one we added to /etc/collectd/arangodb_types.db.
After a restart of collectd and a minute (or whatever Interval is configured) of waiting, lines similar to the following should appear in the endpoint of write_prometheus (http://collectd.local:9103/metrics):

    # HELP collectd_curl_json_coordshort_physicalMemory write_prometheus plugin: 'curl_json' Type: 'coordshort_physicalMemory', Dstype: 'gauge', Dsname: 'value'
    # TYPE collectd_curl_json_coordshort_physicalMemory gauge
    collectd_curl_json_coordshort_physicalMemory{curl_json="default",type="data-physicalMemory",instance="3c77f4c05a29"} 101083078656 1518609151473
    # HELP collectd_curl_json_coordshort_residentSizeCurrent write_prometheus plugin: 'curl_json' Type: 'coordshort_residentSizeCurrent', Dstype: 'gauge', Dsname: 'value'
    # TYPE collectd_curl_json_coordshort_residentSizeCurrent gauge
    collectd_curl_json_coordshort_residentSizeCurrent{curl_json="default",type="data-residentSizeCurrent",instance="3c77f4c05a29"} 810180608 1518609151473

A minute or so later (depending on scrape_interval), the first values should arrive in Prometheus. This can be checked by executing, for example, the query collectd_curl_json_coordshort_physicalMemory in the Prometheus GUI under Graph. It should yield some results in either the Console or the Graph tab. If the message No datapoints found. appears, the metrics haven’t been scraped (yet).
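The same query can be sent to the Prometheus HTTP API at /api/v1/query. The following sketch uses only the Python standard library; the function names and the hand-written sample response are assumptions for this illustration. An empty result corresponds to the No datapoints found. message in the GUI.

```python
import json
import urllib.parse
import urllib.request


def instant_query(expr, base_url="http://prometheus.local:9090"):
    """Run an instant query against the Prometheus HTTP API (needs network)."""
    url = base_url + "/api/v1/query?" + urllib.parse.urlencode({"query": expr})
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def latest_values(payload):
    """Return {instance label: value} from an instant-vector response."""
    if payload.get("status") != "success":
        raise ValueError("query failed: %r" % payload.get("error"))
    return {
        series["metric"].get("instance", ""): float(series["value"][1])
        for series in payload["data"]["result"]
    }


# Hand-written sample in the shape of an /api/v1/query response.
SAMPLE_RESPONSE = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {
                    "__name__": "collectd_curl_json_coordshort_physicalMemory",
                    "instance": "collectd.local:9103",
                    "job": "node",
                },
                "value": [1518609151.473, "101083078656"],
            }
        ],
    },
}

print(latest_values(SAMPLE_RESPONSE))
```

With a running server, latest_values(instant_query("collectd_curl_json_coordshort_physicalMemory")) would return the current value per scrape target.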
Creating a graph in Grafana
Now that the metrics on physical memory and resident set size, named collectd_curl_json_coordshort_physicalMemory and collectd_curl_json_coordshort_residentSizeCurrent, respectively, have arrived in Prometheus, graphs to visualize them can be added in Grafana.
First, create a new dashboard (unless you created one already): either click on Create your first dashboard on Grafana’s Home Dashboard, or navigate to Create → Dashboard. You have to save all changes made to a dashboard explicitly, either by pressing Ctrl+S or by clicking on the floppy disk symbol in the upper right.
Then, a New panel dialog should be open. You can add more panels to the dashboard with the Add panel button in the upper right. Select the Graph visualization.
Navigate to Panel title and Edit.
In the General tab, you can set the panel’s Title; e.g. ArangoDB cluster: total memory. In the Metrics tab, set query A to collectd_curl_json_coordshort_physicalMemory and set the Legend format to Physical memory. Now add another query B, set it to collectd_curl_json_coordshort_residentSizeCurrent and its Legend format to Resident set size. Switch to the Axes tab, and set Left Y’s Unit to Data (IEC) → bytes. Close the panel by clicking on the X to the right.
If you are satisfied with the result, do not forget to save the dashboard!
More complete configurations
Make sure the file /etc/collectd/arangodb_types.db contains the following lines (the first two are already present if you followed the step-by-step example):

    coordshort_physicalMemory            value:GAUGE:U:U
    coordshort_residentSizeCurrent       value:GAUGE:U:U
    coordshort_clientConnectionsCurrent  value:GAUGE:U:U
    coordshort_bytesSentPerSecond        value:GAUGE:U:U
    coordshort_bytesReceivedPerSecond    value:GAUGE:U:U
    coordshort_avgRequestTime            value:GAUGE:U:U
    coordshort_http_requestsPerSecond    value:GAUGE:U:U
    coordshort_http_optionsPerSecond     value:GAUGE:U:U
    coordshort_http_putsPerSecond        value:GAUGE:U:U
    coordshort_http_headsPerSecond       value:GAUGE:U:U
    coordshort_http_postsPerSecond       value:GAUGE:U:U
    coordshort_http_getsPerSecond        value:GAUGE:U:U
    coordshort_http_deletesPerSecond     value:GAUGE:U:U
    coordshort_http_othersPerSecond      value:GAUGE:U:U
    coordshort_http_patchesPerSecond     value:GAUGE:U:U

and that the <URL> block in /etc/collectd/collectd.conf.d/curl_json.conf contains the following <Key> blocks:

    <Key "data/physicalMemory">
        Type "coordshort_physicalMemory"
    </Key>
    <Key "data/residentSizeCurrent">
        Type "coordshort_residentSizeCurrent"
    </Key>
    <Key "data/clientConnectionsCurrent">
        Type "coordshort_clientConnectionsCurrent"
    </Key>
    <Key "data/bytesSentPerSecond/0">
        Type "coordshort_bytesSentPerSecond"
    </Key>
    <Key "data/bytesReceivedPerSecond/0">
        Type "coordshort_bytesReceivedPerSecond"
    </Key>
    <Key "data/avgRequestTime/0">
        Type "coordshort_avgRequestTime"
    </Key>
    <Key "data/http/optionsPerSecond/0">
        Instance "OPTION"
        Type "coordshort_http_requestsPerSecond"
    </Key>
    <Key "data/http/putsPerSecond/0">
        Instance "PUT"
        Type "coordshort_http_requestsPerSecond"
    </Key>
    <Key "data/http/headsPerSecond/0">
        Instance "HEAD"
        Type "coordshort_http_requestsPerSecond"
    </Key>
    <Key "data/http/postsPerSecond/0">
        Instance "POST"
        Type "coordshort_http_requestsPerSecond"
    </Key>
    <Key "data/http/getsPerSecond/0">
        Instance "GET"
        Type "coordshort_http_requestsPerSecond"
    </Key>
    <Key "data/http/deletesPerSecond/0">
        Instance "DELETE"
        Type "coordshort_http_requestsPerSecond"
    </Key>
    <Key "data/http/othersPerSecond/0">
        Instance "other"
        Type "coordshort_http_requestsPerSecond"
    </Key>
    <Key "data/http/patchesPerSecond/0">
        Instance "PATCH"
        Type "coordshort_http_requestsPerSecond"
    </Key>

Then restart collectd.
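The HTTP request keys above all follow one pattern: each method gets its own Instance but shares the type coordshort_http_requestsPerSecond, so the methods end up as labels of a single metric. As an illustration only, a short script could generate such blocks instead of writing them by hand; the helper name key_block and the mapping HTTP_KEYS are made up for this sketch.

```python
def key_block(path, type_name, instance=None):
    """Render one <Key> block for the curl_json plugin configuration."""
    lines = [f'<Key "{path}">']
    if instance is not None:
        lines.append(f'    Instance "{instance}"')
    lines.append(f'    Type "{type_name}"')
    lines.append("</Key>")
    return "\n".join(lines)


# Instance name -> JSON path, copied from the configuration above.
HTTP_KEYS = {
    "OPTION": "data/http/optionsPerSecond/0",
    "PUT": "data/http/putsPerSecond/0",
    "HEAD": "data/http/headsPerSecond/0",
    "POST": "data/http/postsPerSecond/0",
    "GET": "data/http/getsPerSecond/0",
    "DELETE": "data/http/deletesPerSecond/0",
    "other": "data/http/othersPerSecond/0",
    "PATCH": "data/http/patchesPerSecond/0",
}

for instance, path in HTTP_KEYS.items():
    print(key_block(path, "coordshort_http_requestsPerSecond", instance))
```

The printed blocks can be pasted into the <URL> block; nothing in this sketch talks to collectd itself.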
Grafana dashboard
In the Grafana GUI, navigate to Create → Import and paste the following JSON to get a dashboard with some cluster graphs. You only need to select your data source to configure it. The dashboard was created with Grafana 4.6.3, the current stable version at the time of writing this article. If there are problems importing it, check your version first.
{
"__inputs": [
{
"name": "DS_PROMETHEUS",
"label": "Prometheus",
"description": "",
"type": "datasource",
"pluginId": "prometheus",
"pluginName": "Prometheus"
}
],
"__requires": [
{
"type": "grafana",
"id": "grafana",
"name": "Grafana",
"version": "4.6.3"
},
{
"type": "panel",
"id": "graph",
"name": "Graph",
"version": ""
},
{
"type": "datasource",
"id": "prometheus",
"name": "Prometheus",
"version": "1.0.0"
},
{
"type": "panel",
"id": "singlestat",
"name": "Singlestat",
"version": ""
}
],
"annotations": {
"list": [
{
"builtIn": 1,
"datasource": "-- Grafana --",
"enable": true,
"hide": true,
"iconColor": "rgba(0, 211, 255, 1)",
"name": "Annotations & Alerts",
"type": "dashboard"
}
]
},
"editable": true,
"gnetId": null,
"graphTooltip": 0,
"hideControls": false,
"id": null,
"links": [],
"rows": [
{
"collapse": false,
"height": 283,
"panels": [
{
"aliasColors": {},
"bars": true,
"dashLength": 10,
"dashes": false,
"datasource": "${DS_PROMETHEUS}",
"fill": 1,
"id": 1,
"legend": {
"avg": false,
"current": false,
"max": false,
"min": false,
"show": true,
"total": false,
"values": false
},
"lines": false,
"linewidth": 1,
"links": [],
"nullPointMode": "null",
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"seriesOverrides": [],
"spaceLength": 10,
"span": 5,
"stack": true,
"steppedLine": false,
"targets": [
{
"expr": "collectd_curl_json_coordshort_http_requestsPerSecond",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "{{type}}s per second",
"refId": "A"
}
],
"thresholds": [],
"timeFrom": null,
"timeShift": null,
"title": "ArangoDB cluster: HTTP requests per second",
"tooltip": {
"shared": true,
"sort": 0,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
},
{
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
}
]
},
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "${DS_PROMETHEUS}",
"fill": 1,
"id": 2,
"legend": {
"avg": false,
"current": false,
"max": false,
"min": false,
"show": true,
"total": false,
"values": false
},
"lines": true,
"linewidth": 1,
"links": [],
"nullPointMode": "null",
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"seriesOverrides": [],
"spaceLength": 10,
"span": 5,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "collectd_curl_json_coordshort_bytesSentPerSecond",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "Bytes sent per second",
"refId": "A"
},
{
"expr": "collectd_curl_json_coordshort_bytesReceivedPerSecond",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "Bytes received per second",
"refId": "B"
}
],
"thresholds": [],
"timeFrom": null,
"timeShift": null,
"title": "ArangoDB cluster: network throughput",
"tooltip": {
"shared": true,
"sort": 0,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"format": "Bps",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
},
{
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
}
]
},
{
"cacheTimeout": null,
"colorBackground": false,
"colorValue": false,
"colors": [
"#299c46",
"rgba(237, 129, 40, 0.89)",
"#d44a3a"
],
"datasource": "${DS_PROMETHEUS}",
"format": "none",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": false,
"thresholdLabels": false,
"thresholdMarkers": true
},
"id": 3,
"interval": null,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"nullText": null,
"postfix": "",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"span": 2,
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": true
},
"tableColumn": "",
"targets": [
{
"expr": "collectd_curl_json_coordshort_clientConnectionsCurrent",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "Client connections",
"refId": "A"
}
],
"thresholds": "",
"title": "Client connections",
"type": "singlestat",
"valueFontSize": "80%",
"valueMaps": [
{
"op": "=",
"text": "N/A",
"value": "null"
}
],
"valueName": "current"
}
],
"repeat": null,
"repeatIteration": null,
"repeatRowId": null,
"showTitle": false,
"title": "Dashboard Row",
"titleSize": "h6"
},
{
"collapse": false,
"height": 308,
"panels": [
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "${DS_PROMETHEUS}",
"fill": 1,
"id": 4,
"legend": {
"avg": false,
"current": false,
"max": false,
"min": false,
"show": true,
"total": false,
"values": false
},
"lines": true,
"linewidth": 1,
"links": [],
"nullPointMode": "null",
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"seriesOverrides": [],
"spaceLength": 10,
"span": 6,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "collectd_curl_json_coordshort_avgRequestTime",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "average request time",
"refId": "A"
}
],
"thresholds": [],
"timeFrom": null,
"timeShift": null,
"title": "ArangoDB cluster: Request duration",
"tooltip": {
"shared": true,
"sort": 0,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"format": "s",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
},
{
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
}
]
},
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "${DS_PROMETHEUS}",
"fill": 1,
"id": 5,
"legend": {
"avg": false,
"current": false,
"max": false,
"min": false,
"show": true,
"total": false,
"values": false
},
"lines": true,
"linewidth": 1,
"links": [],
"nullPointMode": "null",
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"seriesOverrides": [],
"spaceLength": 10,
"span": 6,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "collectd_curl_json_coordshort_physicalMemory",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "Physical memory",
"refId": "A"
},
{
"expr": "collectd_curl_json_coordshort_residentSizeCurrent",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "Resident set size",
"refId": "B"
}
],
"thresholds": [],
"timeFrom": null,
"timeShift": null,
"title": "ArangoDB cluster: total memory",
"tooltip": {
"shared": true,
"sort": 0,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"format": "bytes",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
},
{
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
}
]
}
],
"repeat": null,
"repeatIteration": null,
"repeatRowId": null,
"showTitle": false,
"title": "Dashboard Row",
"titleSize": "h6"
}
],
"schemaVersion": 14,
"style": "dark",
"tags": [],
"templating": {
"list": []
},
"time": {
"from": "now-6h",
"to": "now"
},
"timepicker": {
"refresh_intervals": [
"5s",
"10s",
"30s",
"1m",
"5m",
"15m",
"30m",
"1h",
"2h",
"1d"
],
"time_options": [
"5m",
"15m",
"1h",
"6h",
"12h",
"24h",
"2d",
"7d",
"30d"
]
},
"timezone": "",
"title": "ArangoDB cluster",
"version": 2
}
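Before importing, it can be useful to check which queries a dashboard file relies on, so that you know which metrics must exist in Prometheus. The following sketch walks the rows, panels and targets structure used by the dashboard above; the function name panel_queries is made up, and the embedded excerpt is a trimmed-down stand-in for the full JSON.

```python
import json


def panel_queries(dashboard):
    """Return (panel title, expression) pairs for every query in every row."""
    pairs = []
    for row in dashboard.get("rows", []):
        for panel in row.get("panels", []):
            for target in panel.get("targets", []):
                pairs.append((panel["title"], target["expr"]))
    return pairs


# Trimmed-down excerpt of the dashboard above, just enough to run the sketch.
EXCERPT = json.loads("""
{
  "title": "ArangoDB cluster",
  "rows": [
    {
      "panels": [
        {
          "title": "Client connections",
          "targets": [
            {"expr": "collectd_curl_json_coordshort_clientConnectionsCurrent"}
          ]
        }
      ]
    }
  ]
}
""")

for title, expr in panel_queries(EXCERPT):
    print(f"{title}: {expr}")
```

If the full JSON is saved to a file, json.load on that file yields the dictionary to pass to panel_queries; json.load also fails early if the pasted text is not valid JSON.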
Adding ArangoDB Cluster Health info to collectd/Prometheus/Grafana
To perform this step we assume you already have a working setup of ArangoDB, collectd, Prometheus and Grafana (see previous sections).
The Cluster Health information that is used to show the number of Coordinators and DBServers on the Dashboard of the ArangoDB Web Interface is available as JSON via HTTP, but is not suitable for direct consumption with the curl_json plugin in collectd. However, it is possible to get around this limitation using the exec plugin and a small script.
Requirements
The packages curl and jq need to be installed on your system.
Adding and configuring the plugin in collectd
Create the following bash script as /etc/collectd/arango_cluster_health.plugin.bash:

    #!/bin/bash

    # collectd sets these variables for exec plugins; fall back to defaults otherwise
    HOSTNAME="${COLLECTD_HOSTNAME:-$(hostname -f)}"
    INTERVAL="${COLLECTD_INTERVAL:-60}"

    ARANGO_HEALTH_URL="$1"
    ARANGO_USER="$2"
    ARANGO_PASSWORD="$3"

    # Both curl and jq are required
    if ! which curl jq > /dev/null
    then
        exit 1
    fi

    while sleep "$INTERVAL"
    do
        JSON="$(curl -s -u "$ARANGO_USER":"$ARANGO_PASSWORD" "$ARANGO_HEALTH_URL")"
        # Skip this round if the request failed
        if [ $? -ne 0 ]
        then
            continue
        fi

        # Count all servers per role, and those reported as GOOD
        TOTAL_COORDINATORS="$(jq '.Health | map(select(.Role == "Coordinator")) | length' <<<"$JSON")"
        GOOD_COORDINATORS="$(jq '.Health | map(select(.Role == "Coordinator" and .Status == "GOOD")) | length' <<<"$JSON")"
        TOTAL_DBSERVERS="$(jq '.Health | map(select(.Role == "DBServer")) | length' <<<"$JSON")"
        GOOD_DBSERVERS="$(jq '.Health | map(select(.Role == "DBServer" and .Status == "GOOD")) | length' <<<"$JSON")"

        # Hand the values to collectd via its plain text protocol
        cat <<COLLECTD
    PUTVAL "$HOSTNAME/exec-arangodb/health_coordinatorsTotal" interval=$INTERVAL N:$TOTAL_COORDINATORS
    PUTVAL "$HOSTNAME/exec-arangodb/health_coordinatorsGood" interval=$INTERVAL N:$GOOD_COORDINATORS
    PUTVAL "$HOSTNAME/exec-arangodb/health_dbserversTotal" interval=$INTERVAL N:$TOTAL_DBSERVERS
    PUTVAL "$HOSTNAME/exec-arangodb/health_dbserversGood" interval=$INTERVAL N:$GOOD_DBSERVERS
    COLLECTD

    done

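To see what the jq filters in the script compute, the counting logic can be restated in a few lines of Python. This is only a sketch for understanding and testing the logic offline, not a replacement for the script; the function names are made up, and the sample document uses placeholder server IDs and a placeholder FAILED status.

```python
def count_servers(health_doc, role, status=None):
    """Count servers with the given Role (and Status, if one is given)."""
    return sum(
        1
        for server in health_doc["Health"].values()
        if server.get("Role") == role
        and (status is None or server.get("Status") == status)
    )


def putval_lines(health_doc, hostname, interval):
    """Build the four PUTVAL lines that the bash script prints."""
    values = {
        "health_coordinatorsTotal": count_servers(health_doc, "Coordinator"),
        "health_coordinatorsGood": count_servers(health_doc, "Coordinator", "GOOD"),
        "health_dbserversTotal": count_servers(health_doc, "DBServer"),
        "health_dbserversGood": count_servers(health_doc, "DBServer", "GOOD"),
    }
    return [
        f'PUTVAL "{hostname}/exec-arangodb/{name}" interval={interval} N:{value}'
        for name, value in values.items()
    ]


# Hypothetical response; server IDs and the "FAILED" status are placeholders.
SAMPLE_HEALTH = {
    "Health": {
        "CRDN-0001": {"Role": "Coordinator", "Status": "GOOD"},
        "CRDN-0002": {"Role": "Coordinator", "Status": "FAILED"},
        "PRMR-0001": {"Role": "DBServer", "Status": "GOOD"},
        "AGNT-0001": {"Role": "Agent", "Status": "GOOD"},
    }
}

print("\n".join(putval_lines(SAMPLE_HEALTH, "collectd.local", 60)))
```

Servers with other roles, such as the Agent entry in the sample, are ignored by both the script and this sketch.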
Make the script above executable:

    $ chmod +x /etc/collectd/arango_cluster_health.plugin.bash

Add the following types to the types database /etc/collectd/arangodb_types.db:

    health_coordinatorsTotal  value:GAUGE:U:U
    health_coordinatorsGood   value:GAUGE:U:U
    health_dbserversTotal     value:GAUGE:U:U
    health_dbserversGood      value:GAUGE:U:U

Register the script with the exec plugin by creating the file /etc/collectd/collectd.conf.d/exec.conf:

    LoadPlugin exec

    <Plugin exec>
        Exec "nobody:nogroup" "/etc/collectd/arango_cluster_health.plugin.bash" "http://coordinator.arangodb.local:8529/_admin/cluster/health"
    </Plugin>

The address coordinator.arangodb.local:8529 needs to be set to a coordinator of the Cluster to monitor. If needed, username and password can be provided in the URL for HTTP basic auth, i.e. replace http://coordinator.arangodb.local:8529 with http://USERNAME:PASSWORD@coordinator.arangodb.local:8529. The script also reads a username and a password from its second and third arguments, which can be appended to the Exec line. Either way, note that the password can be read by other users on the same system using ps. User and group (nobody and nogroup) can be chosen freely, as long as they have permission to execute the script /etc/collectd/arango_cluster_health.plugin.bash.
Adding useful dashboards
The following JSON document can be added to the rows array of the Grafana dashboard example shared above.
{
"collapse": false,
"height": 120,
"panels": [
{
"cacheTimeout": null,
"colorBackground": false,
"colorValue": false,
"colors": [
"#299c46",
"rgba(237, 129, 40, 0.89)",
"#d44a3a"
],
"datasource": "${DS_PROMETHEUS}",
"format": "none",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": false,
"thresholdLabels": false,
"thresholdMarkers": true
},
"id": 6,
"interval": null,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"nullText": null,
"postfix": "",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"span": 1,
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"tableColumn": "",
"targets": [
{
"expr": "collectd_exec_health_coordinatorsTotal",
"format": "time_series",
"intervalFactor": 2,
"refId": "A"
}
],
"thresholds": "",
"title": "Coordinators",
"type": "singlestat",
"valueFontSize": "80%",
"valueMaps": [
{
"op": "=",
"text": "N/A",
"value": "null"
}
],
"valueName": "current"
},
{
"cacheTimeout": null,
"colorBackground": true,
"colorValue": false,
"colors": [
"#299c46",
"#bf1b00",
"#bf1b00"
],
"datasource": "${DS_PROMETHEUS}",
"format": "none",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": false,
"thresholdLabels": false,
"thresholdMarkers": true
},
"id": 7,
"interval": null,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"nullText": null,
"postfix": "",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"span": 1,
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"tableColumn": "",
"targets": [
{
"expr": "collectd_exec_health_coordinatorsTotal - collectd_exec_health_coordinatorsGood",
"format": "time_series",
"intervalFactor": 2,
"refId": "A"
}
],
"thresholds": "1,2",
"title": "Coordinators down",
"type": "singlestat",
"valueFontSize": "80%",
"valueMaps": [
{
"op": "=",
"text": "N/A",
"value": "null"
}
],
"valueName": "current"
},
{
"cacheTimeout": null,
"colorBackground": false,
"colorValue": false,
"colors": [
"#299c46",
"rgba(237, 129, 40, 0.89)",
"#d44a3a"
],
"datasource": "${DS_PROMETHEUS}",
"format": "none",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": false,
"thresholdLabels": false,
"thresholdMarkers": true
},
"id": 8,
"interval": null,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"nullText": null,
"postfix": "",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"span": 1,
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"tableColumn": "",
"targets": [
{
"expr": "collectd_exec_health_dbserversTotal",
"format": "time_series",
"intervalFactor": 2,
"refId": "A"
}
],
"thresholds": "",
"title": "DBServers",
"type": "singlestat",
"valueFontSize": "80%",
"valueMaps": [
{
"op": "=",
"text": "N/A",
"value": "null"
}
],
"valueName": "current"
},
{
"cacheTimeout": null,
"colorBackground": true,
"colorValue": false,
"colors": [
"#299c46",
"#bf1b00",
"#bf1b00"
],
"datasource": "${DS_PROMETHEUS}",
"format": "none",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": false,
"thresholdLabels": false,
"thresholdMarkers": true
},
"id": 9,
"interval": null,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"nullText": null,
"postfix": "",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"span": 1,
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"tableColumn": "",
"targets": [
{
"expr": "collectd_exec_health_dbserversTotal - collectd_exec_health_dbserversGood",
"format": "time_series",
"intervalFactor": 2,
"refId": "A"
}
],
"thresholds": "1,2",
"title": "DBServers down",
"type": "singlestat",
"valueFontSize": "80%",
"valueMaps": [
{
"op": "=",
"text": "N/A",
"value": "null"
}
],
"valueName": "current"
}
],
"repeat": null,
"repeatIteration": null,
"repeatRowId": null,
"showTitle": false,
"title": "Dashboard Row",
"titleSize": "h6"
}
],
Alternatively, you can add them manually by adding panels of type Singlestat. Add one each for the total number of Coordinators and DBServers, using the metrics collectd_exec_health_coordinatorsTotal and collectd_exec_health_dbserversTotal, respectively. Go to the Options tab, and under Value, set Stat to Current. Then, add one each for the number of faulty Coordinators and DBServers.
As queries, use collectd_exec_health_coordinatorsTotal - collectd_exec_health_coordinatorsGood and collectd_exec_health_dbserversTotal - collectd_exec_health_dbserversGood, respectively.
Under Options, also set Stat to Current. Check the box Coloring → Background, set Thresholds to 1,1 and choose an all-clear color (e.g. green) as the first color and a warning color (e.g. red) as the second and third. That way, as soon as one server goes down, the panel turns red.
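The effect of two thresholds and three colors can be sketched as a simple lookup: values below the first threshold get the first color, values from the second threshold upwards get the third. This is an illustration of the behavior described above, not Grafana's own code; the function name panel_color is made up.

```python
from bisect import bisect_right


def panel_color(value, thresholds, colors):
    """Pick a color: one more color than thresholds, thresholds ascending."""
    # bisect_right counts how many thresholds are less than or equal to value.
    return colors[bisect_right(thresholds, value)]


COLORS = ["green", "red", "red"]

for servers_down in (0, 1, 2):
    print(servers_down, panel_color(servers_down, [1, 1], COLORS))
```

With thresholds 1,1 the middle color is never shown for whole numbers, so the panel switches directly from the all-clear color to the warning color once one server is down.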
The following is a screenshot of a possible Grafana dashboard:
