The Four Questions, Part 1

 

In the first part of this series, I described the four (ok, really five) questions that monitoring professionals are frequently asked: Why did I get an alert? Why didn’t I get an alert? What is being monitored on my system? What will alert on my system? And what standard monitoring do you do? My goal in this post is to give you the tools you need to answer the first of those:

Why did I get an alert?

Reader’s Note: While this article uses examples specific to the SolarWinds monitoring platform, the fact is that most of the techniques can be translated to any toolset.

**************
It’s 8:45am, and you are just settling in at your desk. You notice that one email came in overnight from your company’s 24-7 operations desk:

“We got an alert for high CPU on the server WinSrvABC123 at 2:37am. We didn’t notice anything when we jumped on the box. Can you explain what happened?”

Out of all of The Four Questions of monitoring, this is the easiest one to answer, as long as you have done your homework and set up your environment.

Before I dig in, I want to clarify that this is not the same question as “What WILL alert on my server?” or “What are the monitoring and alerting standards for this type of device?” (I’ll cover both of those in later parts of this series.) Here, we’re dealing strictly with a user’s reaction when they receive an alert.

I also have to stress that it’s imperative that you always take the time to answer this question. It can be annoying, tedious, and time-consuming. But if you don’t, before long all of your alerts will be dismissed as “useless.” That is the first step on a long road that leads to a CIO-mandated RFP for monitoring tools, you defending your choice of tools, and other conversations that are significantly more annoying, tedious, and time-consuming.

However, my tips below should cut down on your workload significantly. So let’s get started.

First, let’s be clear: monitoring is not alerting. Some people confuse getting a ticket, page, email, or other alert with actual monitoring. In my book, “monitoring” is the ongoing collection of data about a particular element or set of elements. Alerting is a happy by-product of having monitoring, because once you have those metrics you can notify people when a specific metric is above or below a threshold. I say this because customers sometimes ask (or demand) that you fix (or even turn off) “monitoring.” What they really want is for you to change the alert they receive. Rarely do they really mean you should stop collecting metrics.
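To make that distinction concrete, here is a minimal Python sketch. It is not SolarWinds code; the `Sample` class, the function names, the 90% threshold, and the 15-minute window are all assumptions chosen for illustration. The samples are the monitoring. The alert is just a rule evaluated over them.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Sample:
    """One collected data point (hypothetical shape, for illustration only)."""
    taken_at: datetime
    cpu_load: float  # percent


def breach_duration(samples, threshold):
    """How long the most recent run of samples has stayed above the threshold."""
    ordered = sorted(samples, key=lambda s: s.taken_at)
    start = None
    for s in ordered:
        if s.cpu_load > threshold:
            if start is None:
                start = s.taken_at  # first sample of the current breach
        else:
            start = None  # metric recovered, so the run is over
    if start is None:
        return timedelta(0)
    return ordered[-1].taken_at - start


def should_alert(samples, threshold=90.0, sustain=timedelta(minutes=15)):
    """Alerting is a decision layered on top of the collected data."""
    return breach_duration(samples, threshold) >= sustain
```

Changing `threshold` or `sustain` changes who gets notified and when. It never changes what is collected, which is usually what the customer actually wants.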

The bulk of your work is going to be in the way you create alert messages, because in reality, it’s the vagueness of those messages that has the recipient confused. Basically, you should ensure that every alert message contains a few key elements. Some are obvious:

  • The machine having the problem
  • The time of alert
  • Current statistic

Some are slightly less obvious but no less important:

  • Any other identifying information about the device
    • Any custom properties indicating location, owner group, etc.
    • OS type and version (the MachineType variable)
    • The IP address
    • The DNS Name and/or Sysname variables if your device names are… less than standard
  • The threshold value which breached
  • The duration – how long the alert has been in effect
  • A link or other reference to a place where the alert recipient can see this metric. Speaking in SolarWinds-specific terms, this could be:
    • The node Details page – using either the ${NodeDetailsURL} (or the equivalent for your kind of alert) or a “forged” URL (i.e.: “http://myserver/Orion/NetPerfMon/NodeDetails.aspx?NetObject=N:${NodeID}”)
    • A link to the metric details page. For example, the CPU average would be http://myserver/Orion/NetPerfMon/CustomChart.aspx?chartname=HostAvgCPULoad&NetObject=N:${NodeID}
    • Or even a report that shows this device (or a collection of devices where this is one member) and the metric involved
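Inside an alert message, the alert engine fills in ${NodeID} for you. If you ever need the same “forged” links outside of an alert (in a ticket note or a script, say), a small helper like the following Python sketch can build them. The function names are hypothetical, and the base URL and paths are simply the ones from the examples above; adjust them for your own install.

```python
from urllib.parse import urlencode


def node_details_url(base_url, node_id):
    """Build the Node Details link for a node ID (path taken from the example above)."""
    # safe=":" keeps the "N:123" NetObject form readable instead of "N%3A123"
    query = urlencode({"NetObject": f"N:{node_id}"}, safe=":")
    return f"{base_url}/Orion/NetPerfMon/NodeDetails.aspx?{query}"


def cpu_chart_url(base_url, node_id):
    """Build the average CPU load chart link for a node ID."""
    query = urlencode(
        {"chartname": "HostAvgCPULoad", "NetObject": f"N:{node_id}"},
        safe=":",
    )
    return f"{base_url}/Orion/NetPerfMon/CustomChart.aspx?{query}"


if __name__ == "__main__":
    print(node_details_url("http://myserver", 42))
    print(cpu_chart_url("http://myserver", 42))
```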

Finally, one element that should always be included in each alert:

  • The name of alert

For your straightforward alerts, this should not be a difficult task and can be something you (almost) copy and paste from one alert to another. Here’s an example for CPU:

CPU Utilization on the ${MachineType} device owned by ${OwnerGroup} named ${NodeName} (IP: ${IP_Address}, DNS: ${DNS}) has been over ${CPU_Thresh} for more than 15 minutes. Current load at ${AlertTriggerTime} is ${CPULoad}.

View full device details here: ${NodeDetailsURL}.
Click here to acknowledge the alert: ${N=Alerting;M=AcknowledgeUrl}

This message was brought to you by the alert: ${AlertName}
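If you maintain a lot of alerts, it can help to check message text for the key elements before it ever reaches a recipient. The following is a rough Python sketch, not a SolarWinds feature: it treats an abridged version of the message above as a plain string, uses a regular expression to find the ${...} variables, and uses Python's own `string.Template` (which happens to share the ${name} syntax) to preview the result with made-up sample values. The `REQUIRED` set and both function names are my assumptions.

```python
import re
from string import Template

# Abridged copy of the CPU message above, held as an ordinary string
MESSAGE = (
    "CPU Utilization on the ${MachineType} device owned by ${OwnerGroup} "
    "named ${NodeName} (IP: ${IP_Address}, DNS: ${DNS}) has been over "
    "${CPU_Thresh} for more than 15 minutes. "
    "Current load at ${AlertTriggerTime} is ${CPULoad}.\n"
    "View full device details here: ${NodeDetailsURL}.\n"
    "This message was brought to you by the alert: ${AlertName}"
)

# Hypothetical checklist: the variables every message should carry
REQUIRED = {
    "NodeName", "AlertTriggerTime", "CPULoad",
    "CPU_Thresh", "NodeDetailsURL", "AlertName",
}


def missing_elements(message, required=REQUIRED):
    """Return the required variables that the message text never mentions."""
    used = set(re.findall(r"\$\{(\w+)\}", message))
    return sorted(required - used)


def preview(message, sample_values):
    """Fill the message with sample values; raises KeyError if one is missing."""
    return Template(message).substitute(sample_values)
```

Running `missing_elements` against a terse message such as "CPU is high on ${NodeName}" returns every element the recipient would have had to ask you about.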

While it means more work during alert setup, having an alert with this kind of messaging means that the recipient has several answers to the “Why did I get this alert?” at their fingertips:

  • They have everything they need to identify the machine – which team owns it, what version of OS it’s running, and the office or data center where it’s located.
  • They have what they need to connect to the device – whether by name, DNS name, IP address, etc.
  • They know what metric (CPU) triggered the alert.
  • They know when the problem was detected (because let’s face it, sometimes emails DO get delayed).
  • They have a way to quickly get to a status screen (i.e.: the Node details page) to see the history of that metric and hopefully see where the spike occurred.

Finally, by including the ${AlertName}, you’re enabling the recipient to help you help them. You now know precisely which alert to research. And that’s critical, because there are more things you should be prepared to do.

There is one more value you might want to include if you have a larger environment, and that’s the name of the SolarWinds polling engine. There are times when a device is moved to the wrong poller—wrong because of networking rules, AD membership, etc. Having the polling engine in the message is a good sanity check in this situation.

Let’s say that the owner of the device is still unclear why they received the alert. (Hey, it happens!) With the information the recipient can give you from the alert message, you can now use the following tools and techniques:

The Message Center

Some people live and die on this screen. Some never touch it. But in this case, it can be your best friend. Note two specific areas:

  • The Network Object drop-down – this lets you zero in on just the alerts from the offending device. Step one is to look at EVERYTHING coming off this box for the time period. Events, alerts, etc. See if this builds a story about what may have led up to the event.
  • The Alert Name drop-down under Triggered Alerts – this allows you to look at ALL of the instances when this alert triggered, or further zero in on the one event you are trying to find.

Side Note: The Time Period drop-down is critical here. Make sure you set it to show the correct period of time for the alert or else you’re going to constantly miss the mark.

Using these two simple controls in Message Center, you (and your users) should be able to drill into the event stream around the ticket time. Hopefully that will answer their question.
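The logic behind those controls (one device, optionally one alert name, a sensible time window) is simple enough to sketch in Python. This is only an illustration of the filtering idea; the `Event` shape and the `around` function are hypothetical and have nothing to do with how Message Center stores its data.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Event:
    """One row of an event stream (hypothetical shape, for illustration)."""
    when: datetime
    node: str
    kind: str  # "event" or "alert"
    name: str


def around(events, node, center, window=timedelta(hours=1), alert_name=None):
    """Everything from one node within +/- window of the ticket time.

    Pass alert_name to narrow the result to a single alert, the way the
    Alert Name drop-down does.
    """
    start, end = center - window, center + window
    hits = [
        e for e in events
        if e.node.lower() == node.lower()
        and start <= e.when <= end
        and (alert_name is None or e.name == alert_name)
    ]
    return sorted(hits, key=lambda e: e.when)
```

Start wide (no `alert_name`, a generous window) to build the story, then narrow down to the one alert in question.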

If you do it right (meaning take your time explaining what you are doing in a meeting, or using a screen share; maybe even come up with some light “how to” documentation with screen shots), users—especially those in heavy support roles—will learn over the course of time to analyze alerts on their own.

But what about the holdouts? The ones where Message Center hasn’t shown them (or you) what you hoped to see. What then?

Be prepared to test your alert. Testing is something you should do every time you’re ready to release a new alert into your environment. But sometimes you get busy and skip it, and sometimes you test everything and then the situation on the ground changes without your participation.

So, however you got here, you need to go back to the testing phase.

  • Make a copy of the alert. Never test a live production alert. There’s a COPY button in the alert manager for that very reason.
  • Change the alert copy by adding a trigger condition for the machine in question. JUST that machine. (i.e.: “where node caption is equal to WinSrvABC123”).
  • Set your triggering criteria (“CPULoad > 90%” or whatever) to a value so low that it’s guaranteed to trigger.

At that point, test the heck out of that bugger until both you and the recipient are satisfied that it works as expected. Then copy whatever modifications you need over to the existing alert. But beware: updating the alert trigger will cause any existing alerts to re-fire, so you may need to hold off on those changes until a quieter moment.
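The copy, scope, and lower-the-threshold routine above can be sketched in Python as follows. The dictionary layout, the `make_test_copy` and `triggers` names, and the default test threshold are all assumptions made for illustration; in practice you do this through the alert manager, not through code.

```python
from copy import deepcopy


def make_test_copy(alert, node_caption, test_threshold=1.0):
    """Clone an alert, pin it to one node, and lower the threshold.

    The production alert is never modified; deepcopy guarantees that.
    """
    test = deepcopy(alert)
    test["name"] = f"TEST - {alert['name']}"
    test["scope"] = {"node_caption": node_caption}  # JUST that machine
    test["threshold"] = test_threshold              # low enough to always fire
    return test


def triggers(alert, node_caption, cpu_load):
    """Would this alert fire for the given node and CPU load?"""
    scope = alert.get("scope")
    in_scope = scope is None or scope["node_caption"].lower() == node_caption.lower()
    return in_scope and cpu_load > alert["threshold"]


if __name__ == "__main__":
    production = {"name": "High CPU", "threshold": 90.0, "scope": None}
    test = make_test_copy(production, "WinSrvABC123")
    print(triggers(test, "WinSrvABC123", 5.0))        # test copy fires at idle load
    print(triggers(production, "WinSrvABC123", 5.0))  # production alert stays quiet
```

The point of the design is isolation: the test copy can fire as often as you like without touching any other device or the live alert.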

Stay tuned for our next installment: “Why didn’t I get an alert?”