Self-hosted distributed uptime/website monitoring with coordinated notifications

sgheghele · April 2021

I'm trying to make good use of some NAT VPS I have. I would like to deploy an uptime monitor system that handles the various locations in an intelligent way, like Hetrixtools does.

In short: criteria like "notify me only if 3 locations see the target as offline".

I am aware of SmokePing, and to be fair it handles the primary/secondary nodes pretty well, but to my knowledge you get an e-mail every time any single location sees the target as down.
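
To make the "3 locations" criterion concrete, here is a minimal sketch in plain Python. It is not taken from SmokePing, Hetrixtools or any other tool mentioned in this thread; the class name `QuorumAlerter`, its fields and the message strings are all hypothetical. It only emits a message when the number of locations reporting the target as down crosses the threshold, and again when it drops back below it, rather than on every single failed check.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class QuorumAlerter:
    """Hypothetical helper: alert only when enough locations agree."""
    threshold: int = 3                      # locations that must see "down"
    down_locations: Set[str] = field(default_factory=set)
    alerting: bool = False                  # True while an alert is active

    def report(self, location: str, is_up: bool) -> Optional[str]:
        """Record one check result; return a message only on a state change."""
        if is_up:
            self.down_locations.discard(location)
        else:
            self.down_locations.add(location)

        quorum_down = len(self.down_locations) >= self.threshold
        if quorum_down and not self.alerting:
            self.alerting = True
            return f"DOWN: seen offline from {len(self.down_locations)} locations"
        if not quorum_down and self.alerting:
            self.alerting = False
            return "RECOVERED"
        return None  # no state change, so no notification
```

Each node would call `report()` (directly or via whatever transport you choose) and a notification is sent only when the return value is not `None`.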

rcy026 · April 2021

You could probably accomplish something like that with Icinga and zones, agents and escalations. It would not be a plug and play install, but with some configuration I think it should be doable.

Falzo · April 2021

https://github.com/Ne00n/Night-Sky

Mason · April 2021

@Falzo said:
https://github.com/Ne00n/Night-Sky

Is Night-Sky distributed? I thought it was all done from a single node, but maybe I have my wires crossed

Falzo · April 2021

@Mason said:

@Falzo said:
https://github.com/Ne00n/Night-Sky

Is Night-Sky distributed? I thought it was all done from a single node, but maybe I have my wires crossed

probably depends on how you define 'distributed' - but night-sky runs on a master server and you can add remote slaves for the actual checking. it's documented in @Neoon-esque fashion, so not for the faint of heart, but definitely worth looking at, and at the very least an ideal starting point for what I assume OP is looking for ;-)

tetech · April 2021

The way I've been doing this for several years is with naemon and mod_gearman. Distribute the mod_gearman workers on your spare nodes.

It isn't a perfect system because all the checks are put into a queue and the next available worker grabs the next one. This means that the checks are distributed, but most likely not evenly. There are a few ways to work around this, e.g. write a gearman "stub" that takes the checks from the service queue and puts them into specific queues for each worker on a round-robin basis; or use HAProxy with round-robin load balancing.
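
The round-robin "stub" idea can be illustrated without any gearman code at all. The sketch below is plain Python and does not use the mod_gearman or Gearman APIs; the function name `distribute_round_robin` and the worker names are hypothetical. It only shows the assignment logic: take checks from one shared list and place them into per-worker queues in strict rotation, so each worker gets an even share.

```python
from collections import deque
from itertools import cycle


def distribute_round_robin(checks, worker_names):
    """Assign checks to per-worker queues in strict rotation.

    Hypothetical illustration only: a real stub would read from the
    shared service queue and submit to each worker's own queue.
    """
    queues = {name: deque() for name in worker_names}
    targets = cycle(worker_names)          # w1, w2, ..., w1, w2, ...
    for check in checks:
        queues[next(targets)].append(check)
    return queues


if __name__ == "__main__":
    result = distribute_round_robin(
        ["check-a", "check-b", "check-c", "check-d", "check-e"],
        ["worker-1", "worker-2"],
    )
    for name, queue in result.items():
        print(name, list(queue))
```

With five checks and two workers, the first worker ends up with three checks and the second with two, which is as even as the split can be.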

In the past few weeks I've been investigating how to improve this. I'm currently deciding whether to have the checkers periodically push results and do some consolidation outside naemon, so that the naemon checker simply pulls an up/down value from a web site (like the UptimeRobot API), or whether to switch from naemon to prometheus.
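
For the first option (checkers push results, consolidation happens outside naemon), a rough sketch of the consolidation step might look like this. Everything here is an assumption for illustration: the class name `ResultStore`, the 120-second freshness window, the threshold of 3, and the extra "unknown" state for when no recent results exist. The point is that the monitoring system would then only need to pull the single value returned by `status()`.

```python
import time


class ResultStore:
    """Hypothetical store for results pushed by remote checkers."""

    def __init__(self, max_age_seconds=120, down_threshold=3):
        self.max_age_seconds = max_age_seconds
        self.down_threshold = down_threshold
        self._results = {}  # (target, location) -> (is_up, timestamp)

    def push(self, target, location, is_up, now=None):
        """Called by a checker; keeps only the latest result per location."""
        ts = time.time() if now is None else now
        self._results[(target, location)] = (is_up, ts)

    def status(self, target, now=None):
        """Consolidate fresh results into 'up', 'down' or 'unknown'."""
        now = time.time() if now is None else now
        fresh = [
            is_up
            for (t, _location), (is_up, ts) in self._results.items()
            if t == target and now - ts <= self.max_age_seconds
        ]
        if not fresh:
            return "unknown"   # no checker has reported recently
        down_count = sum(1 for is_up in fresh if not is_up)
        return "down" if down_count >= self.down_threshold else "up"
```

Ignoring stale results matters here because a checker that has itself gone offline would otherwise keep voting with its last known result.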

sgheghele · April 2021

Thanks for the input so far, I am learning about new technology already!
