Alerting with Nagios
OpenTSDB is great, but it's not (yet) a full monitoring platform. Now that you have a bunch of metrics in OpenTSDB, you want to start sending alerts when those metrics cross a threshold. It's easy!
In the tools directory is a Python script, check_tsd. This script queries OpenTSDB and returns Nagios-compatible output that gives you an OK/WARNING/CRITICAL state.
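To make the OK/WARNING/CRITICAL mapping concrete, here is a minimal sketch of how values could be turned into a Nagios state. It is not the code from check_tsd: the classify helper is hypothetical, and two behaviours are assumptions of this sketch (a single violating data point is enough to trip a threshold, and an empty result maps to UNKNOWN). The exit codes 0 to 3 are the standard Nagios plugin return codes, and the comparison names match the values accepted by --method.

```python
import operator
import sys

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}

# Comparison names mirror the values accepted by --method.
COMPARATORS = {
    "gt": operator.gt, "ge": operator.ge,
    "lt": operator.lt, "le": operator.le,
    "eq": operator.eq, "ne": operator.ne,
}


def classify(values, method, warning=None, critical=None):
    """Hypothetical helper: map data point values to a Nagios exit code.

    Assumptions of this sketch: one violating value is enough to trip a
    threshold, and an empty list of values is reported as UNKNOWN.
    """
    if not values:
        return UNKNOWN
    compare = COMPARATORS[method]
    # Check the critical threshold first so it wins over warning.
    if critical is not None and any(compare(v, critical) for v in values):
        return CRITICAL
    if warning is not None and any(compare(v, warning) for v in values):
        return WARNING
    return OK


if __name__ == "__main__":
    code = classify([0.4, 1.5, 0.9], "gt", warning=1, critical=2)
    print(LABELS[code])
    sys.exit(code)  # Nagios reads the state from the process exit code.
```

Nagios only looks at the exit code and the first line of output, so a plugin of this shape can be tested by hand from a shell before it is wired into a command definition.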
Parameters
Options:
  -h, --help            show this help message and exit
  -H HOST, --host=HOST  Hostname to use to connect to the TSD.
  -p PORT, --port=PORT  Port to connect to the TSD instance on.
  -m METRIC, --metric=METRIC
                        Metric to query.
  -t TAG, --tag=TAG     Tags to filter the metric on.
  -d SECONDS, --duration=SECONDS
                        How far back to look for data. Default 600s.
  -D METHOD, --downsample=METHOD
                        Downsample function, e.g. one of avg, min, sum, max ... etc
  -W SECONDS, --downsample-window=SECONDS
                        Window size over which to downsample.
  -a METHOD, --aggregator=METHOD
                        Aggregation method: avg, min, sum (default), max .. etc
  -x METHOD, --method=METHOD
                        Comparison method: gt, ge, lt, le, eq, ne.
  -r, --rate            Use rate value as comparison operand.
  -w THRESHOLD, --warning=THRESHOLD
                        Threshold for warning. Uses the comparison method.
  -c THRESHOLD, --critical=THRESHOLD
                        Threshold for critical. Uses the comparison method.
  -v, --verbose         Be more verbose.
  -T SECONDS, --timeout=SECONDS
                        How long to wait for the response from TSD.
  -E, --no-result-ok    Return OK when TSD query returns no result.
  -I SECONDS, --ignore-recent=SECONDS
                        Ignore data points that are that recent.
  -P PERCENT, --percent-over=PERCENT
                        Only alarm if PERCENT of the data points violate the threshold.
  -N UTC, --now=UTC     Set unix timestamp for "now", for testing
  -S, --ssl             Make queries to OpenTSDB via SSL (https)
For a complete list of downsample & aggregation modes, see http://opentsdb.net/docs/build/html/user_guide/query/aggregators.html#available-aggregators
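The --percent-over option is easier to reason about with a small worked example. The sketch below is only an illustration of the idea, not the logic from check_tsd: the function names are hypothetical, and it assumes that the alarm fires when the share of violating data points is at least PERCENT (whether the real script uses a strict or inclusive comparison is not stated here).

```python
import operator


def violation_percent(values, threshold, compare):
    """Percentage of values for which compare(value, threshold) is true."""
    if not values:
        return 0.0
    bad = sum(1 for v in values if compare(v, threshold))
    return 100.0 * bad / len(values)


def should_alarm(values, threshold, compare, percent_over=0.0):
    """Hypothetical check: alarm only if enough data points violate.

    Assumption of this sketch: at least one point must violate, and the
    violating share must be greater than or equal to percent_over.
    """
    percent = violation_percent(values, threshold, compare)
    return percent > 0.0 and percent >= percent_over


if __name__ == "__main__":
    samples = [1, 2, 3, 4]  # two of the four values are greater than 2
    print(violation_percent(samples, 2, operator.gt))
    print(should_alarm(samples, 2, operator.gt, percent_over=50))
    print(should_alarm(samples, 2, operator.gt, percent_over=75))
```

With a setting like this, a single spike in an otherwise healthy window does not page anyone, which is usually the reason to reach for -P.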
Nagios Setup
Drop the script into your Nagios path and set up a command like this:
define command{
        command_name    check_tsd
        command_line    $USER1$/check_tsd -H $HOSTADDRESS$ $ARG1$
}
Then define a host in Nagios for your TSD server(s). You can give it a check_command that is guaranteed to always return something if the backend is healthy.
define host{
        host_name       tsd
        address         tsd
        check_command   check_tsd!-d 60 -m rate:tsd.rpc.received -t type=put -x lt -c 1
        [...]
}
Then define some service checks for the things you want to monitor.
define service{
        host_name               tsd
        service_description     Apache too many internal errors
        check_command           check_tsd!-d 300 -m rate:apache.stats.hits -t status=500 -w 1 -c 2
        [...]
}
Testing
If you want to test your parameters against a specific point in time, you can use the --now <UTC> parameter to specify an explicit Unix timestamp, which is used as the current timestamp instead of the actual current time. If set, the script will fetch data starting at UTC - duration and ending at UTC.
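As a rough illustration of the window arithmetic, the sketch below selects the data points that fall between UTC - duration and UTC, and also drops points newer than an --ignore-recent style cutoff. The select_points helper and the sample timestamps are hypothetical, and the inclusive boundaries are an assumption of this sketch rather than documented behaviour.

```python
def select_points(points, now, duration=600, ignore_recent=0):
    """Hypothetical helper: keep (timestamp, value) pairs inside the window.

    The window runs from now - duration to now. Points newer than
    now - ignore_recent are dropped. Both boundaries are treated as
    inclusive, which is an assumption of this sketch.
    """
    start = now - duration
    cutoff = now - ignore_recent
    return [(ts, value) for ts, value in points if start <= ts <= cutoff]


if __name__ == "__main__":
    now = 1000000  # explicit "now", as passed with --now
    points = [
        (999300, 1.0),  # older than now - 600, outside the window
        (999400, 2.0),  # exactly at the start of the window
        (999900, 3.0),
        (999990, 4.0),  # only 10 seconds old
    ]
    print(select_points(points, now))
    print(select_points(points, now, ignore_recent=30))
```

Pinning "now" this way makes a check reproducible: the same timestamp and duration always cover the same stretch of data, so a threshold can be tuned against a known incident.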
To see the values retrieved, and potentially ignored (due to duration), use the --verbose option.
© 2010–2016 The OpenTSDB Authors
Licensed under the GNU LGPLv2.1+ and GPLv3+ licenses.
http://opentsdb.net/docs/build/html/user_guide/utilities/nagios.html