Between now and the end of Passover, I’m sharing excerpts from my book “The Four Questions Every Monitoring Engineer is Asked”. It blends wisdom and themes from Passover with common questions (and their answers) heard when you are setting up and running monitoring and observability solutions.
Here are some things to have in place that will let you know when all is not puppy dogs and rainbows:
Alerts that give you a view of the health of your environment
Under the heading of “who watches the watchmen,” any moderately sophisticated or mission-critical monitoring environment should have internal and external checks to help ensure that things are working well. These include:
A non-critical alert that notifies the monitoring team if data has not been collected for a specific node in X minutes.
A non-critical alert that notifies when a polling server/collector hasn’t written any data to the database in X minutes.
Treating your monitoring solution like an application that needs monitoring, because, you know, it does! The trick is making the case, usually to the bean counters, that you can’t monitor the monitoring solution from within itself. Either run a second instance of the same tool, or use a different solution entirely, including an open source one.
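The staleness checks above boil down to comparing a “last seen” timestamp against a threshold. Here’s a minimal sketch in Python; the node names, the 10-minute threshold, and the `find_stale_nodes` helper are all hypothetical, not part of any particular monitoring tool:

```python
import time

STALE_MINUTES = 10  # the "X minutes" from the alert definitions above


def find_stale_nodes(last_collected, now=None, stale_minutes=STALE_MINUTES):
    """Return nodes whose most recent data point is older than the threshold.

    last_collected maps node name -> UNIX timestamp of its latest sample.
    The same pattern works for pollers/collectors: replace nodes with
    collector names and timestamps of their last database write.
    """
    now = now if now is not None else time.time()
    cutoff = now - stale_minutes * 60
    return [node for node, ts in sorted(last_collected.items()) if ts < cutoff]


# Example: "db01" last reported 15 minutes ago, "web01" one minute ago.
now = time.time()
stale = find_stale_nodes({"db01": now - 15 * 60, "web01": now - 60}, now=now)
# stale == ["db01"] -> raise a non-critical alert for the monitoring team
```

The important design choice is that this check runs somewhere other than the system it is watching, for exactly the reasons given above.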
Have a way to test individual aspects of the alert stream
You know that awful, sinking feeling you get when you realize that no alerts have been going out because one piece of the alerting infrastructure failed on you? I know. No fun. To avoid this, start by understanding and documenting every step an alert takes, from the source device through to the ticket, email, page, or smoke signal that is sent to the alert recipient. From there, create and document the ways you can validate that each of those steps is working independently. This will allow you to quickly validate each subsystem and nail down the point at which a message may have gotten lost.
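Once each step of the alert path is documented, validating them independently can be as simple as running each documented check and collecting the failures. This is a sketch only; the three step names are hypothetical examples, and in practice each callable would exercise the real subsystem (inject a test event, send a test email, and so on):

```python
def run_pipeline_checks(checks):
    """Run each documented validation step independently and report failures.

    checks maps step name -> zero-argument callable returning True if healthy.
    Returning the failing steps lets you pinpoint exactly where in the
    alert path a message is getting lost.
    """
    failures = []
    for step, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False  # a crashing check counts as a failed subsystem
        if not ok:
            failures.append(step)
    return failures


# Hypothetical checks for a three-stage alert path:
checks = {
    "collector writes to DB": lambda: True,
    "event correlator processes test event": lambda: False,  # simulated failure
    "email gateway accepts test message": lambda: True,
}
# run_pipeline_checks(checks) == ["event correlator processes test event"]
```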
Wait. You can test each alert subsystem?
A test procedure is just a monitor waiting for your loving touch. Get it done. You’ll need to do it on a separate system since, you know, if your alerting infrastructure is broken, you won’t get the alert telling you so. This can usually be done inexpensively. Just to be clear: once you can manually test each of your alert infrastructure components (monitoring, event correlation, etc.), turn those manual tests into continuous monitors, and set those monitors up with thresholds so you actually get an alert.
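Turning a manual test into a continuous monitor mostly means adding state and a threshold around it. A minimal sketch, assuming a three-failures-in-a-row threshold (the `ThresholdMonitor` name and the threshold value are illustrative, not from any specific product):

```python
class ThresholdMonitor:
    """Wrap a manual test so it can run on a schedule with a failure threshold.

    A scheduler (cron, a monitoring tool, etc.) calls record() with each
    check result; record() returns True when an alert should fire.
    """

    def __init__(self, failure_threshold=3):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    def record(self, check_passed):
        """Record one check result; return True when the threshold is hit."""
        if check_passed:
            self.consecutive_failures = 0  # any success resets the streak
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures >= self.failure_threshold
```

The threshold exists so one flaky test run doesn’t page anyone; tune it to how noisy each subsystem check is.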
Create a deadman switch
The concept is that you get an alert if something doesn’t happen. First, set up an event that regularly triggers an alert and travels all the way through the system. Then set up a second monitor that alerts when that first message hasn’t been seen in X minutes.
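The watchdog side of a deadman switch is just a staleness check on the heartbeat alert itself. A minimal sketch, where the 15-minute window stands in for “X minutes” and the function name is hypothetical:

```python
import time

DEADMAN_WINDOW_MIN = 15  # "X minutes": no heartbeat for this long -> alert


def heartbeat_missing(last_heartbeat_ts, now=None, window_min=DEADMAN_WINDOW_MIN):
    """Return True if the end-to-end heartbeat alert is overdue.

    last_heartbeat_ts is the UNIX timestamp of the last synthetic test
    alert observed at the far end of the pipeline (ticket, email, page...).
    This check must run on a system outside the alert path it watches.
    """
    now = now if now is not None else time.time()
    return (now - last_heartbeat_ts) > window_min * 60
```

Note the window should be comfortably larger than the heartbeat interval (e.g., a 5-minute heartbeat with a 15-minute window) so a single delayed message doesn’t fire the deadman alert.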