(This originally appeared on DataCenterJournal.com)
In my last guest article, I described how speaking at the recent ByNet Expo in Tel Aviv gave me the opportunity to think about one of my favorite questions: “What is data center monitoring?” And more to the point, “What is good data center monitoring?” To dig into that topic, I first had to lay out the foundation of monitoring: the fundamental techniques, tools and technologies at the heart of every monitoring solution. It was a crash course to be sure, but it covered the main ideas. (The free e-book Monitoring 101 digs a little deeper.)
Now, we’re ready to go further—Monitoring 201, if you will.
For our purposes, I’ll assume you’re already familiar with technologies such as SNMP and WMI; protocols such as IP SLA and NetFlow; and correlation techniques such as deduplication, parent-child suppression and triggering on a delta. I’m also assuming you’ve selected a monitoring solution capable of performing these and all the other tasks I outlined previously.
But remember that simply buying a race horse doesn’t automatically qualify you for the Kentucky Derby. Thus, in this series I’ll cover how to create monitors and alerts that are meaningful, actionable and valuable to you and your organization.
Once again, there’s only so much I can explain here, so if these articles leave you wanting more, you may want to check out Monitoring 201, another free e-book, in which I delve further into this topic.
The Four Phases of Development: Create, Test, Test and Test
Hopefully, I’ll cover a few examples in this series that’ll get you excited, spark your creativity or maybe make you say, “I need to see that for myself.” But diving right in and putting one of them into action in your production environment is never a good idea.
Not ever. I’m serious.
Before we can begin any discussion that suggests creating new monitors and/or alerts, I must first discuss proper testing. Here are some guidelines to keep in mind:
- There’s a difference between the scope of an alert and the trigger conditions. “Operating System = Windows” isn’t an alert trigger, it’s a scoping statement. When testing, ratchet the scope as small as you can and expand it slowly. What you really want to test is the trigger condition first.
- Learn to use reverse thresholds. Although your ultimate alert will check for CPU > 90%, you probably want to avoid spiking the test machines repeatedly. An inverted condition such as CPU < 90% will trigger reliably on an idle test box, letting you exercise the alert actions without stressing the hardware.
- Verbose is your friend. Not at cocktail parties or a movie theater, but in this case, you want to have every possible means of understanding what’s happening and when. If your tool supports its own logging, turn it on. Insert “I’m beginning step XYZ now” and “I just completed step XYZ” messages generously in your alert actions. It’s tedious, but you’ll be glad you did.
- Eat your own dog food. If you were thinking you’d test by sending those alerts to the production team, think again. You won’t send them to any team; you’ll be receiving those alerts yourself.
- Serve the dog food in a simple bowl. You don’t need to fire those alerts through email. All that approach does is add delays and pressure on your infrastructure, as well as run the risk of creating other problems if your alert kicks off 732 messages at the same time. Send the messages to a local log file, to the display and so on.
- Share the dog food. Now you can share them with the team as part of a conversation. Yes, a conversation. Setting up a new (and likely more sophisticated) monitor or alert is collaborative, because you and the folks who will live with the results daily should agree on everything from base function to message formatting.
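To make those guidelines concrete, here’s a minimal sketch of a test-mode alert check in Python. Everything in it is illustrative: the hostnames, the threshold, and the `check_cpu` helper are assumptions, not part of any particular monitoring product. It narrows the scope to one test machine, uses a reversed threshold so an idle box still fires the trigger, logs verbosely at every step, and writes to a local log file instead of sending email.

```python
import logging

# Verbose logging to a local file -- no email, no pressure on the mail infrastructure.
logging.basicConfig(filename="alert_test.log", level=logging.DEBUG,
                    format="%(asctime)s %(message)s")

# Scope ratcheted down to a single test machine (a scoping statement, not a trigger).
SCOPE = {"os": "Windows", "hostname": "test-box-01"}

THRESHOLD = 90
REVERSED = True  # testing mode: fire on CPU < 90% so an idle box trips it immediately

def check_cpu(host, cpu_pct):
    """Evaluate the trigger condition for one host, with step-by-step logging."""
    logging.debug("Beginning CPU check on %s", host)
    triggered = cpu_pct < THRESHOLD if REVERSED else cpu_pct > THRESHOLD
    if triggered:
        # During testing, the "alert action" is just a log line we can inspect.
        logging.info("ALERT (test): CPU=%s%% on %s", cpu_pct, host)
    logging.debug("Completed CPU check on %s", host)
    return triggered

print(check_cpu("test-box-01", 42))  # an idle test box trips the reversed trigger
```

Flipping `REVERSED` to `False` restores the production condition once the alert actions have been proven out.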
Where to Start
Now that we’ve covered how to test, let’s talk about what to test. Meaning: which example—either from this series or from others—should you start with?
The answer? Find the things that deliver the biggest bang for the least effort.
Look at your current help-desk tickets and identify monitoring-based alerts that have a high incidence of “mass close” operations (if your ticket system supports that feature), or where large numbers of the same ticket type are closed at about the same time (within roughly three minutes of each other). They’re likely the monitors and alerts that happen too often with no actionable result.
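If your ticket system can export closes as raw data, that “same type, closed within minutes of each other” pattern is easy to detect programmatically. Here’s a sketch, assuming a hypothetical export of (ticket type, close time) pairs; the field layout and the three-ticket cluster size are assumptions you’d adapt to your own system.

```python
from datetime import datetime, timedelta

# Hypothetical help-desk export: (ticket type, close timestamp).
tickets = [
    ("interface issue", "2016-03-01 02:00:10"),
    ("interface issue", "2016-03-01 02:01:05"),
    ("interface issue", "2016-03-01 02:02:30"),
    ("system outage",   "2016-03-01 09:14:00"),
]

WINDOW = timedelta(minutes=3)

def mass_close_candidates(tickets, min_cluster=3):
    """Flag ticket types where min_cluster or more closes land inside a
    ~3-minute window -- a hint the alert fires often with no actionable result."""
    by_type = {}
    for ttype, ts in tickets:
        by_type.setdefault(ttype, []).append(
            datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"))
    flagged = []
    for ttype, times in by_type.items():
        times.sort()
        # Slide a window of min_cluster consecutive closes over the sorted times.
        for i in range(len(times) - min_cluster + 1):
            if times[i + min_cluster - 1] - times[i] <= WINDOW:
                flagged.append(ttype)
                break
    return flagged

print(mass_close_candidates(tickets))
```

Running this against a month of real exports gives you a ranked shortlist of alerts to rework first.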
Another good place to find monitoring-automation ideas is the lunch room—whether physical or virtual. (Don’t ignore your remote folks. They often know more about things that go bump in the night because their hours are more flexible.) Listen to teams complain and decide whether any of those complaints are driven by system failures. If so, it may be an opportunity for a sophisticated alert to save the day.
Finally, don’t plan too far ahead. As apprehensive as you may feel now, after one or two solid (if small) successes, you’ll find that teams are seeking you out with suggestions about ways you can help.
Show Me the Money
Monitoring improvements, especially the kind that we’ll be discussing here, take time. They take time to dream up, develop and test. They take time to test again, and one more time after that, because we’re professionals and we know how things go. They also take time to deploy. And—please excuse the cliché—all that time translates into money for the business.
So, smart data center professionals must take the accountants into account. Although I can’t tell you everything you must do to satisfy your company’s number crunchers, the following list of suggestions should get you started.
Remember the Bad Old Days?
As painful as it may be, you should think back to the worst of times, at least from a numbers perspective. Make sure you have data on the ticket counts before you install a new monitoring-automation system. Doing so will allow you to say, for example, “Before monitoring, we were generating 1,000 systems-related tickets a month, with approximately 200 for interface issues, 400 for system outages, 150 for service failures and 250 for assorted application issues. After implementing improved alerts, those tickets dropped to…”
Avoid the Retro Encabulator
Watch this video. It’s what you sound like to everyone else. It’s especially what you sound like to your company’s business leaders. They may nod appreciatively, but your pet iguana probably understands more of your conversation than they do.
Several data center managers learned this lesson many moons ago, but I’m always surprised at how many people I meet at user groups and trade shows haven’t caught on. So, for the sake of those who may be unaware: in the name of getting what you want, skip it. All of it. Instead, put things into terms others care about. You have the ticket data from the previous step, right? Now turn it into dollars.
The algorithm is easy. Set an estimated amount of time to resolve each of the ticket types you listed in the previous steps. For example,
- Interface issue: 0.75 hours
- System outage: 1 hour
- Service failure: 1.5 hours
and so on.
Now ask your number crunchers what the average total loaded cost of an employee is. It’s the hourly rate for an employee that factors in everything about them, including their portion of the heat and electrical bill. Now do the following: <total loaded cost> x <time per issue> x <number of tickets for that issue>.
This calculation will give you the total cost for that issue in that period of time. If you do it for both the before and after phases, you know how much you’ve saved the company just by adding or improving monitoring and alerting.
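As a minimal sketch of that arithmetic, here’s the before-and-after comparison in Python. The resolution times come from the list above; the $75 loaded hourly cost and the “after” ticket counts are made-up placeholders, so substitute your own finance team’s numbers.

```python
LOADED_COST = 75.0  # assumed average loaded hourly cost per employee (ask finance)

# Estimated resolution time per ticket type, in hours (from the list above).
HOURS_PER_TICKET = {
    "interface issue": 0.75,
    "system outage": 1.0,
    "service failure": 1.5,
}

# Hypothetical monthly ticket counts before and after improving the alerts.
before = {"interface issue": 200, "system outage": 400, "service failure": 150}
after = {"interface issue": 40, "system outage": 120, "service failure": 30}

def period_cost(counts):
    """Loaded cost x time per issue x number of tickets, summed over issue types."""
    return sum(LOADED_COST * HOURS_PER_TICKET[t] * n for t, n in counts.items())

savings = period_cost(before) - period_cost(after)
print(f"Monthly savings: ${savings:,.2f}")
```

Even with conservative estimates, putting a dollar figure on the delta turns “we closed fewer tickets” into a statement the number crunchers can act on.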
To help you out, I’ll share real-life examples.
Gather Your Posse
Despite our suspicions to the contrary, your typical business leader’s day doesn’t comprise three hours skimming the Wall Street Journal followed by a round of golf, martinis and one meeting with you, where they listen perfunctorily and simply say “no” when you’re done.
Rather, their day is an endless series of meetings where the person at the front of the room declares how much they know, followed by some version of “trust me, this is important.” Inevitably, they request an amount of money that the business leader suspects is three times what’s necessary, to protect against eventual budget cuts.
One of the best ways to attract the kind of attention you want from important decision makers in your company is to give them solid numbers. Another is to show your results in the form of end-user testimonials. When coupled with numbers, positive before-and-after accounts from people in various departments reinforce the idea that implementing monitoring automation is worth the effort. They also carry a secondary message: investing in the monitoring solution was a good decision, and further investments will have similarly positive outcomes.
Revenue, Cost and Risk
During the 2015 ThwackCamp session, “Buy Me a Pony,” I sat down with the SolarWinds CTO to discuss the drivers that help executives make decisions. He boiled it down to three things:
- Increasing revenue
- Reducing cost
- Avoiding risk
If your project, software or initiative doesn’t speak to one of those things, it simply won’t be a priority for them.
The good news is that effective monitoring does at least two of those things. First, it helps reduce costs by catching issues earlier in the failure process—potentially before they spiral out of control—thereby cutting business expenses. Second, monitoring helps avoid risk. Conversations about risk mitigation and avoidance frequently focus on ways for teams to predict and then circumvent potential failures that could affect a system, application or service. Monitoring comes into play when you’ve descended from that philosophical mountain and accepted that some failures are simply unavoidable. In that case, monitoring helps you avoid the downstream consequences of failure by detecting and responding to it as soon as possible.
Hey, Leon, Where Are Those Examples?
Like any good technology implementation, success is often contingent on setting a good foundation. I hope that’s what this article has done. Now that you know the why and the how, we’re ready to dig deeper into the what next time. Stay tuned!