Between now and the end of Passover, I’m sharing excerpts from my book “The Four Questions Every Monitoring Engineer is Asked”. It blends wisdom and themes from Passover with common questions (and their answers) heard when you are setting up and running monitoring and observability solutions.
You can buy it as an ebook (Amazon Kindle, Barnes & Noble Nook, and more) or as a good old-fashioned physical book. You can even check it out through OverDrive.
Monitoring is not alerting. Some people confuse getting a ticket, page, email, or other alert with actual monitoring. Monitoring is nothing more, and nothing less, than the ongoing collection of data about a particular element or set of elements. Alerting is a happy by-product of monitoring: once you have metrics, you can notify people when a specific metric is above or below a threshold. I mention this here, in connection with the first question, because customers sometimes ask (or demand) that you fix (or even turn off) “monitoring.” What they really want is for you to change something about the alert they received, such as its frequency or level of detail. Rarely do they mean that you should stop collecting metrics.
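To make the distinction concrete, here is a minimal sketch in Python. The names `Monitor`, `ThresholdAlert`, `collect`, and `is_firing`, and the sample host and metric, are hypothetical and do not come from any particular monitoring product. The sketch assumes an "above the threshold" rule only; a "below" rule would work the same way.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Monitor:
    """Monitoring: the ongoing collection of data about one element."""
    element: str
    metric: str
    samples: List[float] = field(default_factory=list)

    def collect(self, value: float) -> None:
        # Collection happens whether or not any alert exists.
        self.samples.append(value)


@dataclass
class ThresholdAlert:
    """Alerting: a by-product that reads the collected data."""
    monitor: Monitor
    threshold: float
    enabled: bool = True

    def is_firing(self) -> bool:
        # No alert if it is switched off or nothing has been collected yet.
        if not self.enabled or not self.monitor.samples:
            return False
        # Compare only the most recent sample against the threshold.
        return self.monitor.samples[-1] > self.threshold


# Hypothetical usage: silence the alert without touching collection.
cpu = Monitor(element="web01", metric="cpu_percent")
alert = ThresholdAlert(monitor=cpu, threshold=90.0)

cpu.collect(95.0)
print(alert.is_firing())

alert.enabled = False   # what the customer usually means by "turn off monitoring"
cpu.collect(97.0)       # metrics keep arriving
print(alert.is_firing(), len(cpu.samples))
```

In this sketch, turning the alert off changes nothing about what is collected, which is usually what the person asking actually wants.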
To answer this question correctly, most of your work will go into discovering how the alert message was created, because a vague message is the most likely reason the recipient is confused. In short, you should ensure that every alert message contains a few key pieces of information.
Yes, this creates more work when you are setting up the alert. But an alert with this kind of messaging lets the recipient more easily answer the question, “Why did I get this alert?” If you do it right, which means writing solid documentation and sharing information in meetings, then users, especially those in heavy support roles, will learn to analyze alerts on their own.
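As a rough illustration of an alert message that explains itself, the Python sketch below builds the text from the same facts that triggered the alert. The field names in `AlertContext` and the `render_alert` helper are assumptions chosen for this example, based only on the terms used above (element, metric, threshold); they are not a definitive list of what a message must contain.

```python
from dataclasses import dataclass


@dataclass
class AlertContext:
    # Hypothetical fields, chosen for illustration only.
    element: str      # what was being monitored
    metric: str       # which measurement triggered the alert
    value: float      # the reading that crossed the line
    threshold: float  # the line it crossed
    direction: str    # "above" or "below"


def render_alert(ctx: AlertContext) -> str:
    """Build a message that tells the recipient why the alert was sent."""
    if ctx.direction not in ("above", "below"):
        raise ValueError("direction must be 'above' or 'below'")
    return (
        f"Alert for {ctx.element}: {ctx.metric} is {ctx.value}, "
        f"which is {ctx.direction} the threshold of {ctx.threshold}."
    )


# Hypothetical usage with made-up values.
print(render_alert(AlertContext("web01", "cpu_percent", 95.0, 90.0, "above")))
```

A message built this way carries its own explanation, so the recipient does not have to reverse-engineer the rule that produced it.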