The Four Questions, Part 2

In the first part of this series, I described the four (ok, really five) questions that monitoring professionals are frequently asked: Why did I get an alert? Why didn’t I get an alert? What is being monitored on my system? What will alert on my system? and What standard monitoring do you do? My goal in this post is to give you the tools you need to answer the second question:

Why DIDN’T I get an alert?

You can find information on the first question (Why did I get this alert?) here.

Reader’s Note: While this article uses examples specific to the SolarWinds monitoring platform, my goal is to provide information and techniques which can be translated to any toolset.

 

Good Morning, Dave…

It’s 9:45am. You finally got your first caller off the phone – the one who wanted to know why they got a particular alert.

You hear the joyous little “ping” of a new email. (Never mind that the joy of that ping sound got old 15 minutes after you set it up.)

“We had an outage on CorpSRV042, but never got a notification from your system. What exactly are we paying all that money for, anyway?”

Out of The Four Questions of monitoring, this one is possibly the most labor-intensive because proving why something didn’t trigger requires an intimate knowledge of both the monitoring in place and the on-the-ground conditions of the system in question.

Unlike the previous question, there’s very little preparation you can do to lessen your workload. For that reason, my advice to you is going to be more of a checklist of items and areas to look into.

I’m also going to be working on the assumption that an event really did happen, and that a monitor was in place, which (ostensibly) should have caught it.

So what could have failed?

What We Have Here is a Failure…

It’s a non-failure failure (it was designed to work that way)

Items in this category represent more of a lack of awareness on the part of the device owner about how monitoring works. Once you narrow down the alleged “miss” to one of these, the next thing you need to evaluate is whether you should provide additional end-user education (lunch-and-learns, documentation, a narrated interpretive dance piece in the cafeteria, etc).

  • Alert windows
    Some alerting systems will allow you to specify that the alert should only trigger during certain times of the day. If the problem occurred and corrected itself (or was manually corrected) outside of that window, then no alert is triggered.
  • Alert duration
    The best alerts are set up to look for more than one occurrence of the issue. Nobody wants to get a page at 2:00am when a device failed one ping. But if you have an alert that is set to trigger when ping fails for 15 minutes, 13 minutes can seem like an eternity to the application support team that already knows the app is down.
  • The alert never reset
    After the last failure, the owners of the device worked on the problem, but they never actually got it into a good enough state where your monitoring solution registered that it was actually resolved. This also happens when staff decide to close the ticket (because it was probably nothing, anyway) without looking at the system. A good example of this is the disk alert that triggers when the disk is over 90% utilized, but doesn’t reset until it’s under 70% utilized. Staff may clear logs and get things down to a nice comfy 80%, but the alert never resets. Thus when the disk fills to 100% a week later, no alert is cut.
  • The device was unmanaged
    [Image: unmanaged node with blue dot]
    It’s surprisingly easy to overlook that cute blue dot, especially if your staff aren’t in the habit of looking at the monitoring portal at all. Nevertheless, if it’s unmanaged, no alerts will trigger.
  • Mute, Squelch, Hush, or “Please-Shut-Up-Now” functions
    I’m a big believer in using custom properties for many of the things SolarWinds doesn’t do by design, and one of the first techniques I developed was a “mute” option. This gets around the issue with UnManage, where monitoring actually stops. Read more about that here (https://thwack.solarwinds.com/message/142288). With that said, if you use this type of option, you’ll also need to check its status as part of your analysis when you get this type of question.
  • Parent-Child, part 1
    The first kind of parent-child option I want to talk about is the one added to Orion Core in version 10.3. From that point forward, SolarWinds had the ability to make one device, element, application, component, group, or “thingy” (I hope I don’t lose you with these technical terms) the parent of another device, element, application, component… etc. Thus, if the parent is deemed to be down, the child will NOT trigger an alert (even if it is also down) but will rather be listed in a state of “unreachable.”
  • Parent-Child, part 2
    The second kind of suppression is implicit but often unrecognized by many users. In its simplest terms, if a device is down, you won’t get an alert about the disk being full. That one makes sense. But frequently an application team will ask why they didn’t get an alert that their app is down, and the reason is that the device was down (i.e., ping had failed) during that same period. Because of this, SolarWinds suppressed the application-down alert.
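
To make “the alert never reset” concrete, here is a minimal sketch of an alert with separate trigger and reset thresholds, using the 90%/70% disk numbers from the example above. The class and function names are my own invention for illustration; this is not how SolarWinds (or any other tool) implements it internally, just the general shape of the logic.

```python
from dataclasses import dataclass


@dataclass
class DiskAlert:
    """Hypothetical threshold alert with separate trigger and reset levels."""
    trigger_pct: float = 90.0   # fire when utilization rises above this
    reset_pct: float = 70.0     # re-arm only once utilization drops below this
    active: bool = False        # True from the first firing until a reset

    def observe(self, used_pct: float) -> bool:
        """Feed one poll result; return True only if a NEW alert fires."""
        if self.active:
            # Already triggered: stay silent until the reset condition is met.
            if used_pct < self.reset_pct:
                self.active = False
            return False
        if used_pct > self.trigger_pct:
            self.active = True
            return True
        return False


def fired_alerts(readings):
    """Return the indexes of the readings that produced a notification."""
    alert = DiskAlert()
    return [i for i, pct in enumerate(readings) if alert.observe(pct)]


# Disk hits 92%, staff clean up to 80% (still above the 70% reset), then it fills to 100%.
never_reset = fired_alerts([85, 92, 80, 95, 100])
# Same story, but the cleanup gets the disk down to 65%, so the alert re-arms.
properly_reset = fired_alerts([85, 92, 65, 95, 100])
print(never_reset, properly_reset)
```

In the first run only the reading at index 1 produces a notification, and the later climb to 100% is silent. In the second run the alert re-arms and fires again at index 3. When someone asks why the second outage was quiet, the current state of the alert (still active or reset) is the first thing to look up.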

Failure with Change Control

In this section, we’re looking at changes, either in the environment or within the monitoring system, that would cause an alert to be missed. I’m calling this “change control” because if there was a record and/or awareness of the change in question (as well as how the alert is configured), the device owner would probably not be calling you.

  • A credential changed
    If someone changes the SNMP string or the AD account password you’re using to connect, your monitoring tool ceases to be able to collect. Usually you’ll get some type of indication that this has happened, but not always.
  • The network changed
    All it takes is one firewall rule or routing change to make the difference between beautiful data and total radio silence. The problem is that the most basic monitoring (ping) is usually not impacted, so you won’t get that device-down message that everyone generally relies on to know something is amiss. But higher-level protocols like SNMP or WMI are blocked. You end up with a device which is up (ping) but where disk or CPU information is no longer being collected.
  • A custom property changed
    As I said before, I love me some custom properties. Along with the previously mentioned “mute,” there are properties for the owner group, location, environment (prod, dev, QA, etc.), server status (build, testing, managed), criticality (low, normal, business-critical) and more. Alert triggers leverage these properties so that we can have escalated alerts for some, and just warnings for others. But what happens when someone changes a server’s environment from “PROD” to “DEV”? Any alert configured to look specifically for “PROD” servers will silently stop applying to that server, and the next failure will most likely be missed.
  • Drive (or other element) has been removed/changed
    I say drives because this seems to happen most often. If your environment doesn’t include a “disk down” alert (don’t laugh, I’ve seen them), then volumes can be unmounted or mounted with a new name with amazing frequency. When that happens, many monitoring tools do not automatically start monitoring the new element, and the tools that do almost never apply all the correct settings (like those custom properties). You end up with a situation where the device owner is completely aware of the update, but monitoring is the last to know.
  • The Server Vanished into the Virtualization Cloud
    The drive toward virtualization is strong. When a server goes from physical to virtual (P-to-V), it’s effectively a whole new machine. Even though the IP address and server name are the same, the drives go from a series of physical disks attached to a real storage controller to (usually fewer) volumes which appear to be running off a generic SCSI bus. Not only that, but other elements (interfaces, hardware monitors, CPU, and more) all completely change. Almost all monitoring tools require manual updating to track those changes, or else you are left with ghost elements that don’t respond to polling requests.
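
Several of the items above share one symptom: the device still answers ping, but nothing else is being collected. A rough way to spot that is to compare ping status against the time of the last successful collection. The sketch below assumes you can export that information as a list of records; the field names (`ping_up`, `last_collected`) and every server name other than CorpSRV042 are made up for illustration.

```python
from datetime import datetime, timedelta


def silent_nodes(nodes, now, max_age=timedelta(minutes=30)):
    """Return names of nodes that answer ping but have stale (or no) SNMP/WMI data.

    `nodes` is a hypothetical export: a list of dicts with the keys
    'name', 'ping_up' (bool) and 'last_collected' (datetime or None).
    """
    stale = []
    for node in nodes:
        if not node["ping_up"]:
            continue  # a down node should already be covered by a node-down alert
        last = node["last_collected"]
        if last is None or now - last > max_age:
            stale.append(node["name"])
    return stale


now = datetime(2024, 1, 15, 9, 45)
inventory = [
    {"name": "CorpSRV041", "ping_up": True, "last_collected": now - timedelta(minutes=5)},
    {"name": "CorpSRV042", "ping_up": True, "last_collected": now - timedelta(hours=6)},
    {"name": "CorpSRV043", "ping_up": False, "last_collected": now - timedelta(hours=6)},
    {"name": "CorpSRV044", "ping_up": True, "last_collected": None},
]
stale_names = silent_nodes(inventory, now)
print(stale_names)
```

Here CorpSRV042 and CorpSRV044 are flagged: both respond to ping, but one has not been collected for six hours and the other has never been collected at all. The 30-minute cutoff is an arbitrary choice; pick something a few times larger than your real polling interval.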

Failure of the Monitoring Technology

The previous two sections speak to educational or procedural breakdowns. But loath as I am to admit it, sometimes our monitoring tools fail us too. Here are some things you need to be aware of:

  • The element or device is not actually getting polled
    Often, this is a result of disks or other elements being removed and added; or a P-to-V migration (see previous section). But it also happens that an element simply stops getting polled. You’ll see this when you dig into the detailed data and find nothing collected for a certain period of time.
  • Polling interval is throttled
    One of the first things that a polling server does (at least the good ones) when it begins to be overwhelmed is to pull back on polling cycles so that it can collect at least SOME data from each element. You’ll see this as gaps in data sets. It’s not a wholesale loss of polling, but sort of a herky-jerky collection.
  • Polling data is out of sync
    This one can be quite challenging to nail down. In some cases, a monitoring system will add data into the central data store using localized times (either from the polling engine or (horrors!) from the target device itself). If that happens, then an event that occurred at 9am in New York shows up as having happened at 8am in Chicago. This shouldn’t be a problem unless, as mentioned in the first section, your system won’t trigger alerts before 9am.
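
Both “not getting polled” and “throttled polling” show up as holes in the collected data, so a quick gap check on the sample timestamps is a reasonable first step. This is a minimal sketch that assumes you have already pulled the timestamps for one element out of your data store; the 5-minute interval and the 1.5x tolerance are assumptions you would adjust to match your own polling settings.

```python
from datetime import datetime, timedelta


def polling_gaps(timestamps, expected=timedelta(minutes=5), tolerance=1.5):
    """Return (start, end) pairs where consecutive samples are too far apart.

    A gap is any spacing larger than expected * tolerance.
    """
    ordered = sorted(timestamps)
    limit = expected * tolerance
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > limit]


start = datetime(2024, 1, 15, 8, 0)
# Samples at 0, 5, 10, 25 and 30 minutes: nothing was collected between 8:10 and 8:25.
samples = [start + timedelta(minutes=m) for m in (0, 5, 10, 25, 30)]
gaps = polling_gaps(samples)
for gap_start, gap_end in gaps:
    print(f"no data from {gap_start:%H:%M} to {gap_end:%H:%M}")
```

One long hole suggests the element stopped being polled for a while. Many short holes scattered across many elements on the same poller look more like throttling.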

Failure Somewhere After the Monitoring Solution

As much as you might hate to admit it, monitoring isn’t the center of the universe. And it’s not even the center of the universe with regard to monitoring. If everything within your monitoring solution checks out and you are still scratching your head, here are some additional areas to look:

  • Email (or whatever notification system you use) is down
    One of the most obvious items, but often not something IT pros think to check, is whether the system that is sending out alerts is actually alive and well. If email is down, you won’t get that email telling you the email server has crashed.
  • Event correlation rules
    Event correlation rules are wonderful, magical things. They take you beyond the simple parent-child suppression discussed earlier, and into a whole new realm of dependencies. But there are times when they inhibit a “good” alert in an unexpected way:

    • De-duplication
      The point of de-duplication is that multiple alerts won’t create multiple tickets. But if a ticket closed and didn’t update the event correlation system, de-dup will continue forever.
    • Blackout/Maintenance windows
      Another common feature for EC systems is the ability to look up a separate data source that lists times when a device is “out of service.” This can be a recurring schedule, or a specific one-time rule. Either way, you’ll want to check if the device in question was listed on the blackout list during the time when the error occurred.
  • Already open ticket
    Ticket systems themselves can be quite sophisticated, and many have the ability to suppress new tickets if there is already one open for the same alert/device combination. If you have a team that forgets to close their old ticket, they may never hear about the new events.
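
For the de-duplication and already-open-ticket cases, the question to answer is: was there an older ticket for the same device and alert still open when the new event happened? Here is a small sketch of that lookup. It assumes you can export open tickets from your ticket system; the field names, ticket IDs, alert names, and CorpSRV007 are hypothetical.

```python
from datetime import datetime


def suppressing_tickets(open_tickets, device, alert_name, event_time):
    """Return open tickets that could have swallowed a new event.

    `open_tickets` is a hypothetical export: dicts with 'id', 'device',
    'alert' and 'opened' (datetime). A ticket qualifies when it matches the
    device/alert pair and was already open when the new event occurred.
    """
    return [
        t for t in open_tickets
        if t["device"] == device
        and t["alert"] == alert_name
        and t["opened"] <= event_time
    ]


tickets = [
    {"id": "T-1001", "device": "CorpSRV042", "alert": "Node Down",
     "opened": datetime(2024, 1, 2, 14, 0)},
    {"id": "T-1002", "device": "CorpSRV042", "alert": "Disk Full",
     "opened": datetime(2024, 1, 10, 9, 0)},
    {"id": "T-1003", "device": "CorpSRV007", "alert": "Node Down",
     "opened": datetime(2024, 1, 14, 8, 0)},
]
outage = datetime(2024, 1, 15, 3, 30)
blockers = suppressing_tickets(tickets, "CorpSRV042", "Node Down", outage)
print([t["id"] for t in blockers])
```

In this made-up data, ticket T-1001 has been open for almost two weeks, which would explain why a new “Node Down” event on CorpSRV042 never produced a fresh ticket.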

Hokey religions … are no match for a good blaster, kid

After laying out all the glorious theoretical ways in which monitoring can be missed, I thought it was only fair to give you some advice on techniques or tools you can use to identify these problems, resolve them, or (best of all) avoid them in the first place.

An Ounce of Prevention…

Here are some things to have in place that will let you know when all is not puppy dogs and rainbows:

  • Alerts that give you the health of the environment
    Under the heading of “who watches the watchmen,” in any moderately-sophisticated (or mission critical) monitoring environment, you should have both internal and external checks that things are working well:

    • The SolarWinds out-of-the-box (OOTB) alert that tells you SNMP has not been collected for a node in xx minutes.
    • The SolarWinds OOTB alert that tells you a poller hasn’t collected any metrics in xx minutes.
    • If you can swing it, running the OOTB Server & Application Monitor (SAM) templates for SolarWinds on a separate system is a great option. If having a second SolarWinds instance watching the first is simply not possible, look at the SAM template and mimic it using whatever other options you have available to you.
  • Have a way to test individual aspects of the alert stream
    It’s a horrible sinking feeling when you realize that no alerts have been going out because one piece of the alerting infrastructure failed on you. Start by understanding (and documenting) every step an alert takes, from the source device through to the ticket, email, page, or smoke signal that is sent to the alert recipient. From there, create and document the ways you can validate that each of those steps is independently working. This will allow you to quickly validate each subsystem and nail down where a message might have gotten lost.
  • So wait… you can test each alert subsystem?
    A test procedure is just a monitor waiting for your loving touch. Get it done. You’ll need to do it on a separate system (since, you know, if your alerting infrastructure is broken you won’t get an alert), but usually this can be done inexpensively. Just to be clear, once you can manually test each of your alert infrastructure components (monitoring, event correlation, etc), turn those manual tests into continuous monitors, and then set those monitors up with thresholds so you get an alert.
  • Create a deadman-switch
    The concept of a deadman switch is that you get an alert if something DOESN’T happen. In this case, you set yourself up to receive an alert if something doesn’t clear a trigger. You then send that clearing event through the monitoring system.

    • Example: Every 10 minutes, an internal process on your ticket system creates a ticket for you saying, “Monitoring is broken!” That ticket escalates, and alerts you, if it has been open for 7 minutes. Now you have your monitoring system send an alert whenever the current minutes are evenly divisible by 5. This alert is the reversing event your ticket system is looking for. As long as monitoring is working, the original ticket will be cleared before it escalates.
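
If the timing in that example is hard to picture, the following sketch simulates it minute by minute: a ticket every 10 minutes, a clearing event every 5 minutes while monitoring is healthy, and escalation after 7 minutes. It is a toy model, not an integration with any real ticket system. The function name, the parameters, and the order in which things happen within a single minute (clear first, then create, then check for escalation) are all my assumptions.

```python
def simulate_deadman(total_minutes, monitoring_dies_at=None,
                     create_every=10, clear_every=5, escalate_after=7):
    """Return the minutes at which the watchdog ticket escalates."""
    escalations = []
    opened_at = None  # minute the current ticket was opened, or None if no ticket
    for minute in range(total_minutes):
        alive = monitoring_dies_at is None or minute < monitoring_dies_at
        # 1. A healthy monitoring system sends the clearing event.
        if alive and minute % clear_every == 0:
            opened_at = None
        # 2. The ticket system opens its scheduled "Monitoring is broken!" ticket.
        if minute % create_every == 0 and opened_at is None:
            opened_at = minute
        # 3. A ticket that has stayed open too long escalates.
        if opened_at is not None and minute - opened_at >= escalate_after:
            escalations.append(minute)
            opened_at = None  # escalated; wait for the next scheduled ticket
    return escalations


healthy = simulate_deadman(30)
broken = simulate_deadman(30, monitoring_dies_at=12)
print(healthy, broken)
```

With monitoring healthy, every ticket is cleared 5 minutes after it opens and nothing escalates. If monitoring dies at minute 12, the ticket opened at minute 10 is never cleared and escalates at minute 17, and the next one escalates at minute 27. The design only works because the clearing interval (5) is shorter than the escalation delay (7).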

… about that pound of cure, now

Inevitably, you’ll have to dig in and analyze a failed alert. Here are some specific techniques to use in those cases:

  • Re-scan the device
    The problem might be that the device has changed. It could be a new firewall rule, an SNMP string change, or even the SNMP agent on the target device dying on you. Regardless, you’ll want to go through all the same steps you used to add the device. In SolarWinds, this means using the “Test” button under node properties, and then also running the “List Resources” option to make sure all of the interfaces, disks, and yes, even CPU and RAM options are correctly selected.

    • TRAP: Remember that List Resources will never remove elements. Un-checking them doesn’t do diddly. You have to go into Manage Nodes and specifically delete them before they are really gone.
  • Test the alert
    Are you sure your logic was solid? Be prepared to copy the alert and add a statement limiting it down to the single device in question. Then re-fire that sucker and see if it flies.
  • Message Center Kung-Fu
    The Message Center remains your best friend for forensic analysis of alerts. Things to keep in mind:

    • Set the timeframe correctly – It defaults to “today.” If you got a call about something yesterday, make sure you change this.
    • Set the message count – If you are dealing with a wide range of time or a large number of devices, bump this number up. There’s no paging on this screen so if the event you want is #201, out of 200, you’ll never know.
    • Narrow down to a single device – To help avoid message count issues, use the Network Object dropdown to select a specific device.
    • Un-check the message categories you don’t want – Syslog and Trap are the big offenders here. But even Events can get in your way depending on what you’re trying to find.
    • Limiting Events (or Alerts) – Speaking of Events and Alerts, if you know you’re looking for something particular, use those drop-down boxes to narrow down the pile of data you need. This option lets you look for a particular alert while still seeing any event during that time period.
    • The search box – Yes, there is one. It’s on the right side (often off the right side of the page and you have to scroll for it). If you put something in there, it acts as an additional filter along with all the stuff in the grey box.
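
If your toolset has no equivalent of the Message Center, the same narrowing-down can be done on an exported event list. The sketch below applies the filters in the order described above (timeframe, device, category, search text). The record layout is hypothetical, as is everything in the sample data apart from CorpSRV042.

```python
from datetime import datetime


def filter_events(events, start, end, device=None, categories=None, text=None):
    """Narrow an exported event list by time, device, category and search text.

    `events` is a hypothetical export: dicts with 'time' (datetime),
    'device', 'category' and 'message'. Filters left as None are skipped.
    """
    wanted = []
    for e in events:
        if not (start <= e["time"] < end):
            continue
        if device is not None and e["device"] != device:
            continue
        if categories is not None and e["category"] not in categories:
            continue
        if text is not None and text.lower() not in e["message"].lower():
            continue
        wanted.append(e)
    return sorted(wanted, key=lambda e: e["time"])


events = [
    {"time": datetime(2024, 1, 14, 3, 31), "device": "CorpSRV042",
     "category": "Event", "message": "Node CorpSRV042 is Down"},
    {"time": datetime(2024, 1, 14, 3, 32), "device": "CorpSRV042",
     "category": "Syslog", "message": "link down on eth0"},
    {"time": datetime(2024, 1, 14, 3, 40), "device": "CorpSRV007",
     "category": "Event", "message": "Node CorpSRV007 is Down"},
    {"time": datetime(2024, 1, 15, 9, 0), "device": "CorpSRV042",
     "category": "Event", "message": "Node CorpSRV042 is Up"},
]
hits = filter_events(
    events,
    start=datetime(2024, 1, 14),   # "yesterday", not the default "today"
    end=datetime(2024, 1, 15),
    device="CorpSRV042",
    categories={"Event", "Alert"},  # Syslog and Trap left un-checked
    text="down",
)
for e in hits:
    print(e["time"], e["message"])
```

With these filters only the “Node CorpSRV042 is Down” event from 3:31 survives, which is the kind of single, timestamped fact you want in hand before you call the device owner back.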

Stay tuned for our next installment where we’ll dive into the third question: “What is being monitored on my system?”