ICYMI: Monitoring Nirvana

(this originally appeared on Data Center Journal)

Recently, I had the privilege of giving a talk at the ByNet Expo in Tel Aviv. There, in a room of about 200 local IT professionals from all walks of life (including international companies and at least three branches of the Israel Defense Forces), I took 30 minutes to wax philosophical about cloud and hybrid IT. But not before launching into one of my core soapbox issues: what data center monitoring is, and who is doing it (and who should be) in your company.

In talking about monitoring, I feel like I should start with a quote from one of my heroes, Professor Albus Dumbledore. In Harry Potter and the Prisoner of Azkaban by J.K. Rowling (1999), he remarks, “This is magic at its deepest, its most impenetrable, Harry.” Well, monitoring isn’t magic, but the results can certainly be magical (in my mind, at least), meaning transformative to both the individual and the business. But only if it’s done mindfully, and done well.

So, over the next few posts here at the Data Center Journal, I’d like to walk through some elements that make up great monitoring and that, when followed, make monitoring (as an IT discipline) great.

Monitoring Mad Lib: Monitoring Is…

Let’s start with what monitoring is by talking about what it isn’t:

Monitoring isn’t a ticket (even though I worked at a company where uptime was calculated as “100 percent minus the number of tickets for the system”)
Monitoring is not a blinky icon on a screen
Monitoring is not a page, an email or a “whoop whoop” noise on your speakers

Monitoring is nothing more (or less) than the ongoing, regular and consistent collection of data and metrics from a set of target devices. Everything else mentioned above (the ticket, the blinky icon and the whoop-whoop alarm) is a happy byproduct that you get to enjoy only after you do the monitoring first.
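
To make that definition concrete, here is a minimal sketch in Python of "ongoing, regular and consistent collection." Everything in it is an assumption made for illustration: the Sample record, the Collector interface, and the poll_once and run functions are hypothetical names, not part of any particular monitoring product.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Optional


@dataclass(frozen=True)
class Sample:
    """One data point: which device, which metric, what value, and when."""
    timestamp: float
    target: str
    metric: str
    value: float


# A collector takes a target name and returns a number (hypothetical interface).
Collector = Callable[[str], float]


def poll_once(targets: Iterable[str],
              collectors: Dict[str, Collector],
              now: Callable[[], float] = time.time) -> List[Sample]:
    """Collect every metric from every target exactly once."""
    return [Sample(now(), target, metric, collect(target))
            for target in targets
            for metric, collect in collectors.items()]


def run(targets: List[str],
        collectors: Dict[str, Collector],
        store: List[Sample],
        interval_seconds: float = 60.0,
        cycles: Optional[int] = None) -> None:
    """Repeat poll_once on a fixed interval; cycles=None means run until stopped."""
    completed = 0
    while cycles is None or completed < cycles:
        store.extend(poll_once(targets, collectors))
        completed += 1
        if cycles is None or completed < cycles:
            time.sleep(interval_seconds)
```

Note that nothing here raises a ticket, blinks or makes noise. The loop only gathers and stores samples; alerts and dashboards would be built later on top of the stored data.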

Confuse those things with monitoring and you make the kind of mistake Bruce Lee talked about in Enter the Dragon (1973): “It’s like a finger pointing away to the moon. Don’t concentrate on the finger or you will miss all that heavenly glory.”

Monitoring as a Discipline

Once we agree on what data center monitoring is, the next thing is who in IT should be in charge of it.

In many companies, monitoring is a checkbox on a long list of to-dos for staff members, including people on the teams that support the network, servers, storage, voice system and more. These individuals often use software that’s highly specific to their area and that doesn’t share information with other teams or systems. When I’m giving a talk and I say this out loud, the response I get most often is, “When you put it that way, it does sound kind of funny.”

Think back about a decade—few people called themselves “information-security professionals.” Sure, a few did, but they usually worked for mammoth companies (or the government) and were the IT version of a superspy.

Today, no company would dream of operating without an information-security resource—on call if not on staff. But 10 years ago, that job was done by the network engineer, who happened to like ACLs and security in general, or the server administrator, who had a bizarre love of log files. Even today, we don’t have one information-security person for networking and another for servers. It’s a single role that’s responsible for the entire IT environment.

This is the direction monitoring must go. We IT professionals who manage the staff, workflow and resources associated with the modern data center must recognize that the only way monitoring can live up to its potential—by which I mean monitor everything holistically, detect issues, resolve problems automatically when possible and generally keep us out of trouble—is to have staff dedicated to that purpose. It should include people who understand more than a single silo of technology and instead focus on a wide range of techniques to collect data, detect failure modes and quickly build automation that can respond to those failure modes by either taking corrective action or by gathering additional insight into the event. This approach can then enrich the messaging sent to staff, enabling them to get to work faster on the right problem.

Monitoring has to become—and I believe, is becoming—its own discipline in the IT ecosystem. Becoming a monitoring engineer is as valid a career goal as becoming a systems administrator, storage administrator or network engineer.

Where to Start

Sometimes it helps to build a concept into a model or imaginary framework so you can fit all the new concepts into an overarching structure. Luckily, monitoring already has one: it’s called FCAPS.

FCAPS is usually expanded as fault, configuration, accounting, performance and security; the variant used here is fault, capacity, administration, performance and security. Using the imagery of an airplane flying from one city to another, the FCAPS model would look something like the following:

Although this table is a simplification of what can be a broad subject, most of it holds true most of the time.

Monitoring is largely concerned with the F, C and P of FCAPS. Administration (who has access to a system) and security (who accessed the system at any particular time) are usually the purview of the security team and/or your RADIUS/TACACS-type tools.

Thus, monitoring professionals tend to focus on the F, C and P.

The Technique Is All in the Wrist

With the philosophy and theory now on the table, the core of monitoring glory (or the basics, at least) is knowing what tools you have and how to use them effectively. Delving into the darkest recesses of protocols and functions is beyond the scope of this article. Nevertheless, you should understand how each of the following techniques works and in which situations it is best used:

Ping
SNMP (poll and trap)
Syslog
Log-file reading
Log-file aggregation
WMI
- Eventlog (Windows)
- Performance Monitor counters
SQL query
IP SLA
NetFlow
Vendor API/scripted
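
As one small illustration of what "understanding a technique" can mean, take syslog from the list above. Every syslog message starts with a priority value in angle brackets, and under RFC 5424 that number is the facility multiplied by 8 plus the severity. The sketch below decodes it; the function name decode_priority and the SEVERITIES list are illustrative choices, not part of any library.

```python
import re
from typing import Tuple

# Severity names in numeric order 0 through 7, as defined by RFC 5424.
SEVERITIES = ["emergency", "alert", "critical", "error",
              "warning", "notice", "informational", "debug"]

_PRI = re.compile(r"^<(\d{1,3})>")


def decode_priority(line: str) -> Tuple[int, str]:
    """Return (facility number, severity name) from a syslog line's <PRI> prefix."""
    match = _PRI.match(line)
    if match is None:
        raise ValueError("no <PRI> prefix found")
    pri = int(match.group(1))
    if pri > 191:  # highest valid value: facility 23 * 8 + severity 7
        raise ValueError(f"priority {pri} is out of range")
    facility, severity = divmod(pri, 8)
    return facility, SEVERITIES[severity]
```

For example, a line beginning with <34> decodes to facility 4 with severity "critical", and one beginning with <165> decodes to facility 20 with severity "notice". Knowing this is what lets you filter a syslog stream for the handful of messages that matter.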

An Alarming Development

A desirable feature of a monitoring solution is the ability to send notifications when something goes wrong. Unfortunately, what all too often follows is a situation where someone says, “Turn on all the alerts and we’ll decide which ones to turn off later.” Spoiler alert: If you do this, the answer to the question, “Which ones should we turn off?” will be “All of them!”

Making the situation worse, the default alerts (the ones that ship with the software) are often used without modification, leading to the claim that “this monitoring is useless!” Note once again that alerting should not be mistaken for monitoring.

To avoid this situation, it’s important to understand a few things about alerting and monitoring software in general. First, the alerts included with your monitoring software are suggestions and examples, not best practices. They are either starter alerts, generic enough to work in almost any environment, or samples that showcase a particular function, technique or concept. In either case, you shouldn’t use these alerts as is; instead, modify them, build on them or use them as templates for other alerts tailored to your business’s particular architecture, workflow and needs.

Second, good alerts come from good conversations. Over 30 years in IT, I’ve found that folks who request monitoring tend to fall into one of two categories:

Those who come in telling you what to monitor and when to alert, but don’t tell you what they are actually trying to accomplish
Those who come in with no idea what to monitor or alert on because they have no idea what they are trying to accomplish

In working with both categories of people, I ask the following questions:

How do you know when something is wrong? (No, it’s not when the user calls and screams.)
1. What system do you look at or jump into when something appears not quite right?
2. Which commands do you run when something appears not quite right?
3. What thresholds are you paying attention to when something appears not quite right?
What do you do then?
1. Do you clear a directory or counter?
2. Do you restart a service?
3. Do you run some additional commands in the application API?
How do you know a situation has been resolved?
1. Is it the same set of items as in the first question above, just in reverse?
2. Are there other indicators you watch to ensure the pressure has been relieved?

The answers to these questions will inform what you should monitor, how you should build the alert and what actions you should automate when the alert triggers.
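
One way to picture how those three answers turn into an alert is the rough Python sketch below. It is only an illustration under stated assumptions: AlertDefinition, evaluate and the disk-space example (including the 90 and 80 percent thresholds and the "cleared temp directory" action) are all hypothetical and stand in for whatever your own conversations uncover.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Metrics = Dict[str, float]


@dataclass
class AlertDefinition:
    name: str
    is_wrong: Callable[[Metrics], bool]     # "How do you know when something is wrong?"
    respond: Callable[[Metrics], str]       # "What do you do then?"
    is_resolved: Callable[[Metrics], bool]  # "How do you know it has been resolved?"


def evaluate(alert: AlertDefinition, metrics: Metrics,
             currently_active: bool) -> Tuple[bool, List[str]]:
    """Look at one round of metrics; return (still active?, messages to send)."""
    if not currently_active and alert.is_wrong(metrics):
        note = alert.respond(metrics)  # automated action runs before anyone is paged
        return True, [f"{alert.name}: TRIGGERED; automated action: {note}"]
    if currently_active and alert.is_resolved(metrics):
        return False, [f"{alert.name}: RESOLVED"]
    return currently_active, []


# Hypothetical example: thresholds and the action are placeholders.
disk_alert = AlertDefinition(
    name="disk-full",
    is_wrong=lambda m: m["disk_used_pct"] >= 90,
    respond=lambda m: "cleared temp directory",
    is_resolved=lambda m: m["disk_used_pct"] < 80,
)
```

The trigger and the reset are deliberately separate conditions. In this sketch a disk at 85 percent neither fires a new alert nor clears an existing one, which matches the idea that "resolved" may be judged by different indicators than "wrong."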

Once Again, Good Technique Will Win the Day

As with monitoring, the “work of the work” after you understand the philosophy and theory of monitoring is to have a solid grasp of the tools. For alerting, that means understanding concepts such as:

Flapping
Parent-child
Delta trigger
Multi-event trigger
Deduplication
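
Three of these concepts are small enough to sketch in a few lines of Python. The versions below are simplified readings of flapping, parent-child and deduplication, written for illustration only; the function names, the default of three consecutive failures and the device names in the example are assumptions, and real monitoring tools implement these ideas in their own ways.

```python
from typing import Dict, Iterable, List, Optional, Set, Tuple


def sustained_failure(history: List[bool], required: int = 3) -> bool:
    """Flapping guard: True only if the most recent `required` polls all failed.

    history holds one entry per poll, oldest first; True means the check failed.
    """
    return len(history) >= required and all(history[-required:])


def suppress_children(down: Set[str],
                      parent_of: Dict[str, Optional[str]]) -> Set[str]:
    """Parent-child: keep only the down devices whose parent is not also down."""
    return {device for device in down if parent_of.get(device) not in down}


def deduplicate(events: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Deduplication: drop repeats of the same (device, problem) pair, keeping order."""
    seen: Set[Tuple[str, str]] = set()
    unique: List[Tuple[str, str]] = []
    for event in events:
        if event not in seen:
            seen.add(event)
            unique.append(event)
    return unique
```

With a hypothetical topology where server1 sits behind switch1 and switch1 sits behind router1, an outage of all three leaves only router1 after suppress_children runs, so staff get one alert about the actual point of failure instead of three.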

What Should I Do?

Ultimately, you still need a software tool that does some of the things I’ve described here. Because all the software vendors out there are working from the same basic playbook, what should you look for as a differentiating factor? What exactly makes brand X better than brand Y?

The answer has as much to do with you and your organization as it does with how monitoring gets done.

Will your monitoring team be one person who is also your server team and network team and help-desk team and database team? If so, you probably need a tool that sacrifices comprehensive options for simplicity and manageability. Does your organization need absolute flexibility so that the monitoring solution is the one-stop shop for all your needs? You will pay more and require more staff, but you’ll have a software suite that fits you like a glove.

In the end, if you choose wisely and follow the suggestions above, you will be well on your way to monitoring nirvana!