Are We There Yet? Are We There Yet? Are We…

How often you request information from a target device (commonly known as the polling cycle) is one of the fine arts of monitoring design, and one of the things that requires the most up-front discussion with consumers of your information.

Why? Because the consumer (the person who is getting the ticket, report, etc.) often has no clear picture of the delays that exist between the occurrence of an actual event and the reporting/recording of it. Worse, most of us in the monitoring segment of IT do a poor job of communicating these things.

Let’s take good ol’ ping as an example.

You want to check the up-down status of your systems on a regular basis. But you DON’T want to turn your monitoring system into an inadvertent DDoS Smurf attack generator.

So you pick some nice innocuous interval – say a ping check every 1 minute.

Then you realize that you need to monitor 3,000 systems, and so you back that off to every 2 minutes.

Once you’ve settled on that frequency, there’s the additional level of verification. Just because a system failed ONE ping check doesn’t mean it’s actually down. So maybe your toolset does an additional level of just-in-time validation (let’s say it pings an additional 10 times, just to be sure the target is really down). That verification introduces at least another full minute of delay.

Is your alerting system doing nothing else? It’s just standing idle, waiting for this one alert to come in and be processed? Probably not. Let’s throw another 60 seconds into the mix for inherent delays in the alert engine.

Does your server team really want to know when a device is unavailable (i.e., ping failed) for 1 minute? Probably not. They know servers get busy and don’t respond to ping even though they are processing. Heck, they may even want to avoid cutting tickets during patching cycle reboots. My experience is that most teams don’t want to hear that a server is down unless it’s really been down for at least 5 minutes.
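To make that "down for at least 5 minutes" rule concrete, here is a minimal sketch of a hold-down check. The `HoldDown` class and its `observe` method are hypothetical names made up for illustration, not any particular tool's API, and the 5-minute threshold is simply the figure from my experience above.

```python
from typing import Optional

HOLD_DOWN_SECONDS = 5 * 60  # assumed threshold: "down for at least 5 minutes"


class HoldDown:
    """Hypothetical helper: suppress a ticket until a target has stayed down long enough."""

    def __init__(self, hold_down_seconds: float = HOLD_DOWN_SECONDS) -> None:
        self.hold_down_seconds = hold_down_seconds
        self.down_since: Optional[float] = None  # time of the first failed poll
        self.ticketed = False  # True once a ticket was cut for this outage

    def observe(self, is_up: bool, now: float) -> bool:
        """Record one poll result; return True only when a ticket should be cut."""
        if is_up:
            # Any successful poll ends the outage and re-arms the check.
            self.down_since = None
            self.ticketed = False
            return False
        if self.down_since is None:
            self.down_since = now
        if not self.ticketed and now - self.down_since >= self.hold_down_seconds:
            self.ticketed = True
            return True
        return False


# Failed polls every 2 minutes (times in seconds since the first failure).
hold_down = HoldDown()
results = [hold_down.observe(False, t) for t in (0, 120, 240, 360, 480)]
print(results)
```

Notice that in this sketch the hold-down is only evaluated when a poll result arrives, so with a 2-minute polling cycle the 5-minute rule does not actually fire until the poll at the 6-minute mark.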

Now you get the call: “I thought we were pinging every minute. But I got a ticket this morning and when I went to check the server, it had been down for over 10 minutes. What gives?”

There’s no single “right” answer to this. Nobody wants false alarms. And nobody wants to find out a system is down after the customer already knows.

But it is between those two requirements that one minute stretches to ten.
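If you want to show a consumer where the time goes, a rough tally like the one below can help. It is only a sketch: it assumes the stages from this example stack end to end, that the alert engine adds anywhere from 0 to 60 seconds, and that verification takes exactly one minute. The `Stage` and `detection_delay` names are made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Stage:
    name: str
    min_seconds: int  # best-case delay this stage adds
    max_seconds: int  # worst-case delay this stage adds


def detection_delay(stages: List[Stage]) -> Tuple[int, int]:
    """Return (best, worst) total delay in seconds, assuming stages run back to back."""
    best = sum(stage.min_seconds for stage in stages)
    worst = sum(stage.max_seconds for stage in stages)
    return best, worst


# Figures taken from the example in this post; the ranges are assumptions.
STAGES = [
    Stage("wait for the next poll (2-minute cycle)", 0, 120),
    Stage("verification pings", 60, 60),
    Stage("alert engine processing", 0, 60),
    Stage("hold-down before cutting a ticket", 300, 300),
]

best, worst = detection_delay(STAGES)
print(f"ticket arrives {best / 60:.0f} to {worst / 60:.0f} minutes after the failure")
# → ticket arrives 6 to 9 minutes after the failure
```

Add the time it takes a human to notice the ticket and log in, and "down for over 10 minutes" is exactly what you should expect to hear.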
