(“In Case You Missed It Monday” is my chance to showcase something that I wrote and published in another venue, but is still relevant. This week’s post originally appeared on THWACK.)
“It is a tale told by an idiot, full of sound and fury, signifying nothing.”
Not long ago, I wrote about ineffective monitoring—a situation where monitoring creates a lot of noise but not a lot of change. In the context of avoiding false positives and collecting information that allowed the team to solve underlying issues, I said, “Because once you know that, you can ensure the outcome isn’t just ‘we get a ticket, so we know it happened. Then we close the ticket.’”
What I was getting at was the idea of “FYI alerts.” Whether or not they’re explicitly given that name, they’re a waste of your team’s time and energy, not to mention a crime against monitoring best practices.
Before I get into the details of why I think “FYI alerts” are bad, I should take a moment to explain what they are. An FYI alert is one (usually an email or a message to a mobile device) that carries un-actionable information or status. There’s nothing the person receiving it is expected to do, whether immediately or within a relatively short timeframe; the entire point of the alert is to inform the recipient that something happened, with no expectation they’ll do anything about it.
You can tell an alert is of the FYI variety if you ask, “What do you do when you get it?” and hear some variation of these responses:
- “Nothing. I just like to keep tabs on the system.”
- “I don’t do anything after the first one. But if I get several all at the same time, I have to fix something.”
- “I put them in a folder and count up how many I get each week.”
- “I don’t even see them. I have an Outlook rule to put them in a folder, but if we have a big issue, I can go back through those messages later and see if there’s important insight.”
FYI alerts get created out of a fundamental misunderstanding (or lack of knowledge) of the other ways monitoring systems can share information. As a manager, you need to be aware of what these other options are, so you can:
- A) help guide your team to provide the correct deliverables, even if the requestor asks for something else, and
- B) communicate to partner teams and superiors the range of ways monitoring data can be delivered.
What I’m talking about breaks down into three basic categories: screens, reports, and actual alerts.
Screens (also sometimes referred to as “views”) have three key qualities: they’re viewable on a display of some kind, they show the real-time or near-real-time status of the objects being monitored, and they contain relatively little history. The point of a screen is to tell you what’s happening right now. When folks ask for an alert so they can keep tabs on a system, application, or what have you, they’re most often trying to create the experience of a screen: something they can glance at quickly whenever the mood strikes.
Reports, whether they’re printed, emailed, left in a directory on a file server, or pushed to another system for viewing, represent “history.” They show how the monitored devices looked at a particular point in time, but not NOW-now. Because of this, they’re better suited to showing a range of past values.
If a screen shows you near-real-time status, an alert is the ultimate expression of that: it’s the status of a system, application, counter, and so on right now. But the important aspect of an alert is that it requires not only someone’s attention but also human intervention.
Because of this, I want to take a short sidetrack and talk about automation.
We touched on automation in the previous post, where I suggested that, at the time an alert triggers, it’s possible to collect more information to give the responding team greater insight. But there’s another, arguably more important, use of automation: doing the thing the human would have done in the first place.
As a manager, you can help your team find opportunities for automation by asking the alert recipient this question: “After you get the alert, what do you do next?” If they say something like “I run the XYZ command,” or “I clear the TEMP folder,” or “I restart the blahblah service,” then the logical next step is to have the alert perform that action automatically and immediately. Only then, if the action doesn’t clear up the issue, should a human be notified.
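To make the “automate first, then notify” flow concrete, here is a minimal Python sketch. It isn’t tied to any particular monitoring product: `respond_to_alert`, `is_healthy`, `remediations`, and `notify_human` are hypothetical names standing in for whatever your tool uses to run scripts and send notifications. The idea is to run each scripted fix, re-check the condition after each one, and only notify a human (along with a list of what was already tried) if the condition persists.

```python
from typing import Callable, List, Tuple

# Hypothetical shape of a remediation step: a human-readable description
# plus the function that performs it (e.g. clear TEMP, restart a service).
Remediation = Tuple[str, Callable[[], None]]


def respond_to_alert(
    is_healthy: Callable[[], bool],
    remediations: List[Remediation],
    notify_human: Callable[[str], None],
) -> bool:
    """Try the automated fixes first; notify a human only if they all fail.

    Returns True if automation cleared the condition, False if a human
    was notified.
    """
    actions_taken: List[str] = []
    for description, action in remediations:
        action()
        actions_taken.append(description)
        # Re-check after each step so we stop as soon as the problem clears.
        if is_healthy():
            return True
    # Automation didn't fix it: tell the responder what was already tried,
    # so they can start at the point where a human brain is needed.
    tried = "; ".join(actions_taken) or "no automated actions"
    notify_human(f"Condition persists after: {tried}")
    return False


# Usage: a simulated disk that stays "full" until the TEMP folder is cleared.
state = {"disk_full": True}
messages: List[str] = []


def clear_temp_folder() -> None:
    state["disk_full"] = False


resolved = respond_to_alert(
    is_healthy=lambda: not state["disk_full"],
    remediations=[("cleared TEMP folder", clear_temp_folder)],
    notify_human=messages.append,
)
print(resolved, messages)  # True []
```

Passing the list of actions already taken into the notification is what lets the responder skip the steps automation has already covered.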
The benefit to the business (and the IT practitioners) is threefold:
First, the response to a problem is lightning-fast and can often correct an issue before anyone would have otherwise noticed.
Second, when the technologists get a ticket, they know these actions have already been taken and can pick up where they left off—at the point where a human brain is needed.
Third, support folks know that if they get an alert, it’s actionable. Something MUST be done about it. It’s important, if not urgent.
So let’s look back at those non-alert responses and see if we can’t categorize them correctly. Say it with me, “What do you do when you get the alert?”
“Nothing. I just like to keep tabs on the system.”
That’s a screen.
“I don’t do anything after the first one. But if I get several all at the same time, I have to fix something.”
This is an alert, but it needs to be tuned to only trigger after there have been “several” occurrences.
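One way to express “only after several occurrences” is a count over a sliding time window. The sketch below is a hypothetical Python illustration: the `ThresholdAlert` class and its `threshold` and `window_seconds` parameters are assumptions, not any specific tool’s settings, though many monitoring platforms offer something similar in their trigger configuration. It assumes events arrive in timestamp order.

```python
from collections import deque
from typing import Deque


class ThresholdAlert:
    """Fire only when `threshold` occurrences land within `window_seconds`.

    A single occurrence does not fire; "several all at the same time" does.
    Assumes timestamps are passed in non-decreasing order.
    """

    def __init__(self, threshold: int, window_seconds: float) -> None:
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._events: Deque[float] = deque()

    def record(self, timestamp: float) -> bool:
        """Record one occurrence; return True if the alert should fire now."""
        self._events.append(timestamp)
        # Drop occurrences that have aged out of the window.
        while self._events and timestamp - self._events[0] > self.window_seconds:
            self._events.popleft()
        if len(self._events) >= self.threshold:
            # Reset after firing so one burst produces one alert, not many.
            self._events.clear()
            return True
        return False


# Usage: three occurrences within a minute fire; three spread out do not.
burst = ThresholdAlert(threshold=3, window_seconds=60)
print([burst.record(t) for t in (0, 10, 20)])     # [False, False, True]
spread = ThresholdAlert(threshold=3, window_seconds=60)
print([spread.record(t) for t in (0, 100, 200)])  # [False, False, False]
```

Clearing the window after firing is a design choice: it notifies once per burst rather than on every occurrence that follows, which keeps a flapping condition from turning back into noise.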
“I put them in a folder and count up how many I get each week.”
A report. You just described a report.
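If a weekly count is the real goal, it can come straight from the data the monitoring tool already collects rather than from a folder of emails. As a minimal, tool-agnostic Python sketch (`weekly_counts` is a hypothetical helper, and the timestamps are assumed to come from wherever your monitoring tool stores its events), grouping events by ISO calendar week looks like this:

```python
from collections import Counter
from datetime import datetime
from typing import Dict, Iterable, Tuple


def weekly_counts(event_times: Iterable[datetime]) -> Dict[Tuple[int, int], int]:
    """Count events per ISO (year, week number), sorted chronologically."""
    counts: Counter = Counter()
    for ts in event_times:
        iso = ts.isocalendar()  # (ISO year, ISO week, ISO weekday)
        counts[(iso[0], iso[1])] += 1
    return dict(sorted(counts.items()))


# Usage with made-up timestamps: Jan 1 and 3, 2024 fall in ISO week 1; Jan 8 in week 2.
events = [datetime(2024, 1, 1), datetime(2024, 1, 3), datetime(2024, 1, 8)]
print(weekly_counts(events))  # {(2024, 1): 2, (2024, 2): 1}
```

In practice, your monitoring tool’s own reporting features would likely produce this for you; the sketch just shows that the count is a query over existing data, not something that needs an email per event.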
“I don’t even see them. I have an Outlook rule to put them in a folder, but if we have a big issue, I can go back through those messages later and see if there’s important insight.”
Honestly? This is just plain “monitoring” with data retention. Anything someone receives as an alert has, by definition, already created a data point in the monitoring tool. Therefore, you can go back through that data (in the form of charts, graphs, raw data tables, and so on) and view the historical trends.
As always, thank you for joining me in these discussions. If you have questions on how to manage monitoring teams or better represent monitoring solutions to your leadership, drop me a line.