Self-hosted distributed uptime/website monitoring with coordinated notifications

I'm trying to make good use of some NAT VPSes I have. I would like to deploy an uptime monitoring system that coordinates checks from the various locations in an intelligent way, like HetrixTools does.

In a nutshell: I want criteria like "notify me only if 3 locations see the target as offline".
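To make that criterion concrete, here is a minimal sketch of the quorum rule in Python. It is not tied to any particular monitoring tool; the `ProbeResult` type, the `should_notify` function, and the default threshold of 3 are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ProbeResult:
    """One check result from one location (hypothetical structure)."""
    location: str
    is_up: bool


def should_notify(results: Iterable[ProbeResult], quorum: int = 3) -> bool:
    """Return True when at least `quorum` locations report the target as down."""
    down_count = sum(1 for r in results if not r.is_up)
    return down_count >= quorum


if __name__ == "__main__":
    results = [
        ProbeResult("nat-vps-1", False),
        ProbeResult("nat-vps-2", False),
        ProbeResult("nat-vps-3", True),
        ProbeResult("nat-vps-4", False),
    ]
    # Three of the four locations see the target as down.
    print(should_notify(results))
```

A single notifier running this rule over the collected results sends one alert per outage, instead of one e-mail per location.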

I am aware of SmokePing, and to be fair it handles the primary/secondary nodes pretty well but, to my knowledge, you get an e-mail every time any single location sees the target as down.

Comments

  • You could probably accomplish something like that with Icinga, using zones, agents and escalations. It would not be a plug-and-play install, but with some configuration I think it should be doable.

  • Mason (Administrator, OG)

    Is Night-Sky distributed? I thought it was all done from a single node, but maybe I have my wires crossed


  • @Mason said:

    Is Night-Sky distributed? I thought it was all done from a single node, but maybe I have my wires crossed

    Probably depends on how you define 'distributed', but Night-Sky runs on a master server and you can add remote slaves for the actual checking. It's @Neoon'esce documented, so not for the faint of heart, but definitely worth looking at, and at the very least an ideal starting point for what I assume OP is looking for ;-)

  • The way I've been doing this for several years is with Naemon and mod_gearman: distribute the mod_gearman workers across your spare nodes.

    It isn't a perfect system, because all the checks are put into a single queue and the next available worker grabs the next check. This means that the checks are distributed, but most likely not evenly. There are a few ways to work around this, e.g. write a Gearman "stub" that takes the checks from the service queue and puts them into specific queues for each worker on a round-robin basis, or use HAProxy with round-robin load balancing.

    In the past few weeks I've been investigating how to improve this. I'm currently deciding between two options: have the checkers periodically push results and do some consolidation outside Naemon, so that the Naemon check simply pulls an up/down value from a web site (like the UptimeRobot API), or switch from Naemon to Prometheus.
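The round-robin "stub" idea mentioned in this comment could look roughly like the sketch below. It only illustrates the assignment logic in plain Python and does not talk to Gearman at all; the function name `assign_round_robin`, the worker names, and the check names are hypothetical.

```python
from itertools import cycle
from typing import Dict, List, Sequence


def assign_round_robin(checks: Sequence[str], workers: Sequence[str]) -> Dict[str, List[str]]:
    """Spread checks over per-worker queues, one worker after another in turn."""
    if not workers:
        raise ValueError("at least one worker is required")
    queues: Dict[str, List[str]] = {worker: [] for worker in workers}
    # cycle() repeats the worker list endlessly; zip() stops once checks run out.
    for check, worker in zip(checks, cycle(workers)):
        queues[worker].append(check)
    return queues


if __name__ == "__main__":
    checks = ["http-site-a", "http-site-b", "ping-host-c", "http-site-d", "ping-host-e"]
    workers = ["worker-1", "worker-2"]
    for worker, assigned in assign_round_robin(checks, workers).items():
        print(worker, assigned)
```

In a real setup, each per-worker list would correspond to a dedicated queue that only that worker consumes, which is what evens out the load compared with a single shared queue.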

  • Thanks for the input so far, I'm already learning about new technologies!
