(“In Case You Missed It Monday” is my chance to showcase something that I wrote and published in another venue, but is still relevant. This week’s post originally appeared on THWACK)
Last month we looked at false positives and how, as managers, we can avoid requesting monitors and alerts that generate false positives as an unintended consequence. In that post, one of the key techniques I shared was asking “and then what?” questions to follow a logical path.
This time, I want to focus again on the “and then what” line of thinking.
Let’s say, for the sake of argument, you have a set of monitors feeding an alert. The appropriate team gets the alerts (in whatever form they arrive: email, ticket, owl post, etc.) and then… nothing changes. The problem continues to happen. Over and over again. Maybe not every hour or day or even week, but the exact same issue (possibly on the exact same systems) keeps popping up. Why?
Maybe the people responsible for the system have decided that, to quote a popular ’80s-era movie, “Screws fall out all the time. The world is an imperfect place.” On the other hand, maybe the problem seems too complex to solve, and throwing human effort at it each time feels reassuringly easy by comparison. Or perhaps nobody on the supporting team especially enjoys getting notified about the problem, but nobody really knows why it’s happening, either.
Whatever the underlying reason, one of the key ways to overcome this roadblock is to recognize that the team doesn’t have enough information, or enough of the right information.
Here’s the question you, as a manager/leader, can ask to help them get unstuck:
“How can we get a better understanding of what’s happening at the time the error is actually happening?”
Let me explain how this works and why it helps by using my old standby for computer problems: the ubiquitous high CPU error. Typically, IT folks will set up the alert, set a threshold (let’s say 95%), and wait for it to trigger. When it does, they jump on the system and look around for clues. Perhaps you can already see how this manual response carries within it the seeds of its own destruction.
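For the technically inclined, here’s roughly what that bare-bones check looks like. This is a minimal sketch in Python using the psutil library, not any particular monitoring product, and send_alert() is a hypothetical stand-in for whatever email or ticketing system your team actually uses.

```python
# Minimal sketch of the classic "high CPU" alert: poll overall CPU and
# notify when it crosses a threshold. Illustrative only.
import psutil

CPU_THRESHOLD = 95.0  # percent

def send_alert(message):
    # Hypothetical stand-in for email, ticket, owl post, etc.
    print(f"ALERT: {message}")

def check_cpu():
    cpu = psutil.cpu_percent(interval=5)  # average CPU over a 5-second sample
    if cpu >= CPU_THRESHOLD:
        send_alert(f"CPU at {cpu:.1f}% (threshold {CPU_THRESHOLD:.0f}%)")

if __name__ == "__main__":
    check_cpu()
```

Notice the alert says only that CPU is high, not what is consuming it. Keep that in mind, because the trouble starts once a human has to go find out.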
First, it’s slow. Because of delays built into monitoring collection, polling cycles, alert triggers, and more, it’s likely to be several minutes before the alert message hits the email or ticket queue.
Second, it’s slow. Once the human gets the message, they have to jump onto their computer, look at the message, look at monitoring, and likely remote into the problem system, all of which takes still more time.
Third and finally, it’s slow… just kidding. It’s prone to the “observer effect.” Jumping on the problem system likely causes all sorts of other processes to fire up, obscuring what might actually be going wrong.
The result: in all but the most egregious cases, it’s almost impossible to catch the culprit in the act of spiking the CPU.
The solution is to see what additional information can be collected at the time the threshold is breached and include this insight as part of the initial email/ticket/owl.
Maybe your monitoring solution can pull the top 10 processes running at the time of the threshold breach. Or maybe it can grab some logfile or event log messages and add them to the ticket/email. Or perhaps you can also monitor the processes related to major applications, and display both the system CPU and the CPU consumed by those major processes in one graph.
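To make the first of those ideas concrete, here’s a sketch that extends the earlier example: at the moment the threshold is breached, it samples the ten busiest processes and puts them right into the alert body. As before, Python and psutil are illustrative choices and send_alert() is a hypothetical placeholder; your monitoring tool may well offer something equivalent as a built-in option rather than a script.

```python
# Enriched version: capture the top 10 processes at the moment of the breach
# and attach them to the alert, so the ticket arrives with context built in.
import time
import psutil

CPU_THRESHOLD = 95.0  # percent

def send_alert(subject, body):
    # Hypothetical stand-in for email/ticket delivery.
    print(f"ALERT: {subject}\n{body}")

def top_processes(count=10):
    """Return the `count` busiest processes, sampled over a one-second window."""
    procs = list(psutil.process_iter(["pid", "name"]))
    for p in procs:
        try:
            p.cpu_percent(None)          # prime each process's CPU counter
        except psutil.Error:
            pass
    time.sleep(1)                        # let the counters accumulate
    usage = []
    for p in procs:
        try:
            usage.append((p.cpu_percent(None), p.pid, p.info["name"]))
        except psutil.Error:
            continue                     # process exited during the sample
    return sorted(usage, reverse=True)[:count]

def check_cpu():
    cpu = psutil.cpu_percent(interval=5)
    if cpu >= CPU_THRESHOLD:
        details = "\n".join(f"{pct:5.1f}%  pid={pid}  {name}"
                            for pct, pid, name in top_processes())
        send_alert(f"CPU at {cpu:.1f}% (threshold {CPU_THRESHOLD:.0f}%)",
                   f"Top processes at time of breach:\n{details}")

if __name__ == "__main__":
    check_cpu()
```

The design point is that the enrichment happens automatically, at the moment of the breach and before any human looks at the ticket, which sidesteps both the delays and the observer effect described above.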
My point in this post isn’t to tell you and your IT folks precisely how to do it. My point is to help you, their leader, understand how additional actions like these would enrich the existing monitoring data you have, and potentially shine a light on the root cause of the issue.
Because once you know that, you can ensure the outcome isn’t just “we get a ticket, so we know it happened. Then we close the ticket.”
In other words: “FYI alerts,” which are a waste of everyone’s time and your business’s precious resources.