Between now and the end of Passover, I’m sharing excerpts from my book “The Four Questions Every Monitoring Engineer is Asked”. It blends wisdom and themes from Passover with common questions (and their answers) heard when you are setting up and running monitoring and observability solutions.
If you have worked in IT for more than ten minutes, you know that things go wrong. In fact, it should be obvious that we have jobs in IT specifically because things go wrong.
This is what systems monitoring and automation is all about: building solutions that automatically mind the shop, raise a flag when things start going south, and provide the information that helps us understand what happened and when. We transform this knowledge into a solution that both fixes the issue and improves the environment so that the same issue can be avoided in the future.
That’s regular monitoring, though, and this chapter is about monitoring grief.
Let’s talk about regular grief first. Grief is what my wife feels when dinner isn’t ready because I got distracted watching cat videos and didn’t put the casserole in the oven like she asked. Grief is what you feel when you are driving around at 3:00 am looking for an open convenience store because you didn’t buy diapers earlier, like you said you would. Grief is what my mechanic feels when I tell him the “check engine” light has been on for the last two months.
Now, monitoring grief is what the monitoring engineer feels when consumers of monitoring (the network team, server admins, NOC operators, etc.) do things that cause problems down the road even when they know better. You, the monitoring engineer, have warned these people that certain actions will backfire, yet they choose to take them anyway.
I have spent the past two decades implementing monitoring systems at companies of all sizes. In that time, I’ve had the opportunity to witness certain behaviors that I eventually categorized into five, often sequential, stages. Organizations tend to experience these stages when rolling out a new monitoring system, but they also occur when a group or department starts to seriously implement an existing solution, when new capabilities are added to the current monitoring suite, or when it’s Tuesday.
Spoiler alert: Unlike the standard Kübler-Ross model, acceptance is not on this list.