In the first part of this series, I described the four (OK, really five) questions that monitoring professionals are frequently asked. You can read that introduction here. Information on the first question (“Why did I get this alert?”) is here. You can get the low-down on the second question (“Why DIDN’T I get an alert?”) here. The third question (“What is monitored on my system?”) is here. And the fourth question (“What alerts WILL trigger for my system?”) is posted here.
But as I’ve hinted all along, and despite the title of the series, there are actually five questions that you will be asked when you start down the path to becoming a monitoring engineer. The fifth, and final, question that you will most likely be asked is:
What do you monitor “standard”?
Reader’s Note: While this article uses examples specific to the SolarWinds monitoring platform, my goal is to provide information and techniques which can be translated to any toolset.
No Rest for the Weary…
You thought you were going to make it out of the building without another task hanging over your head, but the CIO caught you in the elevator and decided to have a “quick chat” on the way down.
“I’m glad I caught up to you,” he says, “I just have a quick question…”
“I love what you’re doing with the monitoring system,” he begins. “But one thing that I keep hearing from support teams is that they feel like every new device has to be monitored from the ground up. Is there a way we can just have some monitors in place as soon as it enters the system?”
Choosing your words carefully (it *is* the CIO, after all) you respond “Well, there’s a whole raft of metrics we collect as soon as any new system is added into monitoring. Then we augment that with metrics based on what the new device provides and the owner and support teams need.”
“That’s a relief,” smiles your CIO, as you arrive at your car. He is now literally standing between you and a well-earned beer. “But what our teams – your customers, you know,” he adds with a chuckle, “need to know is what those standard options are. Like when we’re buying a new car,” he says, eyeing your 2007 rustbucket. “Monitoring should list which features are standard, and which are optional upgrades. That shouldn’t be too hard, right?”
“Not at all!” you chirp, your enthusiasm having more to do with the fact that he’s moved out of the way than the specter of new work.
“Excellent,” he says. “Can you pull it together for me to look over tomorrow?” Without waiting for an answer he calls “Have a great night!” over his shoulder.
Standard Time
The reason this (“what do you monitor standard?”) is a common question has a lot to do with how monitoring solutions usually come into being in organizations; how monitoring teams are established; and how conversations between you (the monitoring engineer) and the recipient of monitoring tend to go.
First, while in some cases monitoring solutions are conceived and implemented as a company-wide initiative in a greenfield scenario, those cases are rare. More often what ends up being the company standard started off as a departmental (or even individual) initiative, which somebody noticed and liked and promoted until it spread.
“Standard procedures? We do it like this because Harold did it that way when he installed this thing 3 years ago.”
Second, monitoring teams often form around the solution (product XYZ) OR around the cause (“What do we want?” MONITORING! “When do we want it?” PING!). Unlike more mature IT disciplines like networking, storage, or even the relative newcomer virtualization, people don’t usually set out to become monitoring engineers. Thus there are precious few courses or books, and little tribal knowledge, to fall back on to understand how it is “usually” done. Teams form and, for better or worse, begin to write their own set of rules.
Third (and last), because of the preceding two points, conversations between the ersatz monitoring engineer and the monitoring services consumer (the person or team who will read the reports, look at the screens, and/or receive the alerts) tend to have a number of disconnects. More on that in a moment.
What We Have Here…
Put yourself in the shoes of that monitoring consumer I mentioned a minute ago. You’ve got a new device, application, or service which needs to be monitored. At best, you need it monitored because you realize the value monitoring brings to the table. At worst, you need it monitored because you were told your device, application, or service is a class III critical element to the organization, and therefore monitoring is mandatory. You just need to check the little box that says “monitoring” so that you can get this sucker rolled into production.
So you make the call and set up a meeting with someone from the vaguely shadowy-sounding “monitoring team” (what, do they manage all the keylogger and spyware software you’re sure this place is crawling with?), and a little while later, you’re sitting down in a conference room with them.
You explain your new project, device, application, or service. The person is taking notes, nodding at all the right places. This is looking positive. Then they look up from their notes and ask:
“So, what do you need to have monitored here?”
In your head, you’re thinking “I thought YOU were supposed to tell ME that. Why do I have to do all the heavy lifting here?”
But you’re a pro so of course you don’t say that. Instead you reply, “Well, I’m sure whatever you monitor standard should be fine.”
The monitoring person looks mildly annoyed, and asks “Anything ELSE? I mean, this is a pretty important <device/application/service>, right?”
So now you DO open your mouth. “Well, it’s hard to tell since I’m not sure what standard monitoring is!”
“Oh, come on,” comes the retort. “You’ve been using XYZ monitoring for two years now. You know what we monitor on the other stuff for you. It’s…”
And then they do the most infuriating thing: they rattle off a list of words. Some of them are components, some are counters, some sound vaguely familiar, and others you’ve never heard of. And there they sit with their arms crossed, looking at you across the table.
*************
While the above scene is probably (hopefully!) not as dramatic as the ones you’ve encountered in real life, it nevertheless captures the essence of the issue. If the monitoring team and those who benefit from monitoring are working from different playbooks, the overall effectiveness of monitoring (not to mention YOU) is going to suffer.
So how do you avoid this? Or, if you’ve already fallen into friction, how do you get past it? As with the rest of this series, I have a few suggestions, which largely boil down to “knowing you will be asked this question is half the battle,” because now you can prepare.
But the daemon is, as they say, in the details. So let’s dig in!
好記性，不如爛筆頭
(“A good memory is no match for a worn pen nib.”)
Yes, my first suggestion is going to be to make sure you have good documentation.
Before you quail at the thought of mountains of research, keep in mind that you aren’t documenting the specific thresholds, values, etc., for each node. You are compiling a list of the default values which are collected when a new system is added into your monitoring solution. Depending on your software, that could be a consistent list regardless of device type, or it could be a core set of common values along with a few additional variations by device type.
Your first stop should always be to RTFM, which stands, of course, for “Read The Friendly Manual.” Many vendors have already done this work for you, and all you need to do is copy and paste it into your documentation.
But if your vendor considers such information to be legally protected intellectual property, your next step is to just LOOK AT THE SCREENS THEMSELVES. I mean, it’s not that hard. After scanning a couple of the detail pages for devices, you’re going to see a pattern emerge. It will probably be something like:
- Availability
- Response time
- Packet loss
- CPU
  - Overall percent utilization, 5-minute average
  - Per-processor percent utilization, 5-minute average
- Physical RAM
  - Percent utilization, 5-minute average
  - I/O operations per second (IOPS)
  - Errors
- Virtual RAM
  - Percent utilization, 5-minute average
  - I/O operations per second (IOPS)
  - Errors
- Disk
  - Availability
  - Capacity
  - Used
  - IOPS
- Interface
  - Availability
  - Bandwidth
    - Bits per second (bps)
    - Percent utilization
  - Packets per second (PPS)
  - Errors
  - Discards
There, in that relatively short list, is probably the core of the metrics every machine provides. Then you have specifics for routers (flapping routes, VPN metrics, etc.), switches (stacked switch metrics, VLANs), virtual machines (host statistics, “noisy neighbor” events), and so on.
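If your platform exposes an API, you can pull that list programmatically instead of transcribing screens. Here is a minimal sketch, assuming the SolarWinds Orion SDK for Python (the orionsdk package); the hostname and credentials are placeholders, and you should verify the Orion.Nodes field names in SWQL Studio before publishing your document.

```python
# A minimal sketch, assuming the orionsdk package (pip install orionsdk).
# The hostname and credentials are placeholders for your environment.
from orionsdk import SwisClient

swis = SwisClient("orion.example.com", "monitoring_ro", "changeme")

# Pull the out-of-the-box metrics Orion stores for every managed node.
results = swis.query("""
    SELECT Caption, Status, ResponseTime, PercentLoss,
           CPULoad, PercentMemoryUsed
    FROM Orion.Nodes
    ORDER BY Caption
""")

for node in results["results"]:
    print("{Caption}: status={Status} response={ResponseTime}ms "
          "loss={PercentLoss}% cpu={CPULoad}% mem={PercentMemoryUsed}%"
          .format(**node))
```

Dump the output into a table, add a sentence of plain-English explanation per metric, and you have the first page of your “standard features” document.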
When all else fails, there’s Wireshark. By monitoring the connection between the monitoring system and a single target device (preferably one you have newly added), you can capture all communication and suss out the metrics being requested and supplied. It’s not pretty, and it might not even be easy. But it’s a lot easier than having to repeatedly tell your colleagues that you have no idea what gets monitored.
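If you do end up here, a capture script can take some of the sting out of it. This is a rough sketch, assuming SNMP-based polling (v1/v2c) and the scapy library; the target address is hypothetical, and you will need root privileges (or a SPAN port) to capture.

```python
# A rough sketch, assuming SNMP v1/v2c polling and the scapy library
# (pip install scapy). Sniffing requires root/administrator rights.
from scapy.all import sniff
from scapy.layers.snmp import SNMP

TARGET = "10.10.10.50"  # hypothetical: the newly added device

def show_oids(pkt):
    """Print each OID the poller requests (or the device reports)."""
    if pkt.haslayer(SNMP):
        for varbind in pkt[SNMP].PDU.varbindlist:
            print(varbind.oid.val)

# BPF filter: only SNMP traffic to or from the target device.
sniff(filter=f"host {TARGET} and (udp port 161 or udp port 162)",
      prn=show_oids, count=200)
```

Run the resulting OIDs through a MIB browser or net-snmp’s snmptranslate to turn them into human-readable metric names.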
PS: If you are down to the Wireshark method, it might be time to look at another monitoring solution.
There’s An App For That
Everything I’ve listed above is great for hardware-based metrics, including specialized storage and virtualization tools. But once you cross into the valley of the shadow of application monitoring, you need a better strategy.
By its very nature, application monitoring is highly complex, highly customized, and highly variable. Take something as straightforward as Microsoft Exchange. When building a monitoring solution, you might need specific monitoring for the hub transport server, the mailbox server, the client access server, or even the external OWA server (just to name a few). While there are common elements from one role to the next, they serve very different purposes and require very different components and thresholds.
Take that concept and multiply it by all the applications in your environment.
So what is the diligent yet slightly overwhelmed monitoring engineer to do? While movies like “Glengarry Glen Ross” and “Boiler Room” have popularized the phrase “ABC – Always Be Closing” within sales circles, in monitoring (and indeed, within much of IT) I prefer “ABS – Always Be Standardizing”.
Whether the topic is alerts, reports, or sets of application monitoring components, the best thing you can do for yourself is to create a single “gold standard” for each particular need and then keep using it. Expand that single standard when necessary to account for additional variations, but avoid, as much as possible, the “copy and paste” process that leaves you with 42 different Exchange monitoring templates and (18 months down the road) no idea which one is the default, let alone which one is applied to which server.
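To make that concrete, here is a hypothetical sketch (in Python, though the idea is tool-agnostic) of one Exchange “gold standard” that gets extended per role instead of copied per role. The two service names are real Windows services; every threshold is an illustrative placeholder, not a recommendation.

```python
# Hypothetical sketch: one base template, extended per role.
# Service names are real; thresholds are illustrative placeholders.
BASE = {
    "services": ["MSExchangeIS", "MSExchangeADTopology"],
    "counters": {"cpu_pct": 80, "memory_pct": 90},
}

ROLE_OVERLAYS = {
    "hub_transport": {"counters": {"smtp_queue_length": 500}},
    "mailbox":       {"counters": {"rpc_avg_latency_ms": 50}},
    "client_access": {"counters": {"owa_response_ms": 2000}},
}

def build_template(role: str) -> dict:
    """Compose base + overlay instead of copy/pasting a new template."""
    template = {
        "services": list(BASE["services"]),
        "counters": dict(BASE["counters"]),
    }
    overlay = ROLE_OVERLAYS.get(role, {})
    template["services"] += overlay.get("services", [])
    template["counters"].update(overlay.get("counters", {}))
    return template

print(build_template("mailbox"))
```

Change the base once, and every role inherits the fix; that is the exact opposite of the 42-template problem.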
If you are able to adhere to this one practice, then answering the question “What do you monitor ‘standard’?” for applications becomes immeasurably more achievable.
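And if you suspect you have already drifted, you can audit for sprawl. Another sketch assuming the orionsdk package; Orion.APM.ApplicationTemplate is, to my knowledge, the SAM entity that holds template names, but confirm it in SWQL Studio for your version.

```python
# A sketch for spotting template sprawl ("Exchange", "Exchange - Copy",
# "Exchange v2 FINAL", and so on). Entity name is an assumption to verify.
from collections import Counter

from orionsdk import SwisClient

swis = SwisClient("orion.example.com", "monitoring_ro", "changeme")

templates = swis.query(
    "SELECT Name FROM Orion.APM.ApplicationTemplate ORDER BY Name"
)["results"]

# Group templates by their first word; several per root word is a
# decent hint that copy-and-paste has been at work.
roots = Counter(t["Name"].split()[0].lower()
                for t in templates if t["Name"].strip())
for root, count in sorted(roots.items()):
    if count > 1:
        print(f"{count} templates start with '{root}': consolidation candidates")
```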
Good Monitoring Is Its Own Reward
Armed with the information I’ve described above, that conversation I described at the beginning of this essay goes in a very different direction:
Them: “We need to monitor about 20 of our devices and the applications that go with them.”
You: “Great! It’s a good thing you’re talking to me and not accounts receivable then. Do you have a list of what you are putting together?”
Them: (laughing politely at your very-not-funny joke) “Right here.”
You: “OK, great. I’ve got a list here of the things that get monitored automatically when we load your devices into our software. Can you take a look and tell me if anything you think is important is missing?” (hands printed list of default hardware monitors across the table)
Them: (Scanning the list) “Wow. This is… a lot. It’s more than we expected. I don’t want tickets on all this, though…”
You: “Oh, no. These are the metrics we collect. Alerts are something else. But if you need to see the general health of your systems, this is what we can provide you automatically.”
Them: “Oh, I get it! That’s great. You know what? We’ll ride with this list for the time being.”
You: “Perfect. Now, on the application side, here’s what we have in place today.” (hands over printed list of application monitoring).
Them: “OK, I can already see we’ll need a few extra items that the dev’s have said they want to track. They will also want to alert on some custom messages they’re generating.”
You: “No worries. You can get me that list by email later if you want. I’m sure we’ll be able to get it all set up for you.”
Them: “Wow, this was a lot less painful than I expected.”
You: (Winks at camera) “I know!”
The Long And Winding Road
You pull into the driveway and head into your home, reflecting on how 8:45am seemed like 100 years ago in terms of the miles you’ve put on that old brain of yours. But then you sit down and take a minute to clear your head, and you realize a few things:
First, that monitoring, GOOD monitoring, is achievable by anyone who is willing to put in a little time and effort, and can be accomplished with most of the monitoring solutions on the market today.
Second, that putting in the time and effort to create good monitoring makes the overall job significantly more enjoyable.
Third, that monitoring is, and should be treated as, a separate discipline within IT, as valid a sub-specialty as networking, programming, storage, virtualization or information security. And only when it IS treated as a discipline will people be willing to put in the time and effort I’ve been describing, and to invest in the tools that help make that possible.
Fourth, and finally, that everything that happened today – from the 8:45am beginning I described in part 1 all the way to the cold beer in your hand now – is all part of a day’s work in IT.