ICYMI: Two Recurring Issues in Data Center Monitoring Automation

(This originally appeared on DataCenterJournal.com and is a continuation of part 1 and part 2.)

In my previous post, I began to explore the what and why of data center monitoring automation:

What: Device discovery, connectivity, application discovery, monitoring assignments, automated actions triggered by alerts and more.

Why: Because we’re lazy.

In my next post, I’m going to dig into the how and provide specific examples that demonstrate the raw potential of automation and how it can be a powerful force for good in your data center. But in this post, I want to take a step sideways and address two major issues that often come up at this point in the conversation.

Issue 1: The Elephant in the Room

“Strategy is finding broccoli on sale and buying it for dinner. Tactics are getting your kids to eat it.”
—Thomas LaRock, Head Geek, SolarWinds

Before we dig any further into automation—whether that’s automated discovery, report delivery or alert-trigger actions—we absolutely must address a critical concept. In some circles, it’s called the DPR cycle: detection, prevention and response. So far in this series, I’ve discussed detection (or monitoring), and in future posts I’ll address response (or automated remediation). But I would be negligent not to address prevention right now.

Alerts are the way we catch errors when they occur, but then it’s up to us to determine why they occurred and find a way to keep them from occurring again. When we build a solution to automatically respond to an alert and remediate it, as responsible data center professionals we should also commit to doing the hard work of analyzing the situation to find a pattern and root cause. Then we need to address the root cause and also create checks so we know whether the problem occurs again.

(As a side note, anyone who’s been within 500 feet of an ITIL manual should find this concept very familiar.)

Yes, automatic responses to alerts keep your business running at all hours of the night and help ensure you get the beauty rest you need, but in the morning, you and your team must be able to see that something happened and do the work of figuring out why it happened, so you can prevent it from happening in the future.

Issue 2: “I’m Scared”

“I must not fear. Fear is the mind-killer.”
—Litany Against Fear, Dune by Frank Herbert

Many data center professionals feel apprehensive when I first bring up the idea of including automatic responses to alerts. There is something deeply comforting about knowing that a real, live human with a brain will receive an alert, think about it carefully, and then act deliberately and with caution. That’s the idea, anyway.

Yes, standing at the edge of the ocean called “automation” can be a little daunting. But you have to trust me when I tell you that the water is fine, you won’t drown and you have the ability to try things one step at a time. This is not an all-or-nothing proposition where the risk is that you’d go from zero to all-my-systems-are-offline.

As with any IT work, having a plan of attack is sometimes more important than the attack (or in this case, the automation) itself. So let’s talk a little more about this plan of attack:

  • Identify your test machines first. Whether that’s lab gear set aside for these purposes or a few less critical volunteers, set up your alert so that it only triggers for those machines.
  • Learn to use reverse thresholds. Although your ultimate alert will check for CPU>90%, you probably want to avoid spiking the test machines repeatedly. Turn that bracket around. CPU<90% will trigger a whole lot more reliably—at least we hope so.
  • Find the reset option. Closely related to the previous point, know how your monitoring tool resets an alert so it triggers again. You will likely be using that function a lot.
  • Be verbose. Maybe not at cocktail parties or at the movies, but in this case you want every possible means of understanding what’s happening and when. If your tool supports its own logging, turn it on. Insert “I’m starting the XYZ step now” messages generously in your automation. It’s tedious, but you’ll be glad you did.
  • Eat your own dog food. If you were thinking you’d test by sending those alerts to the server team, think again. In fact, you aren’t going to send it to any team. You’ll be getting those alerts yourself.
  • Serve the dog food in a very simple bowl. You really don’t need to fire those alerts through email. All that approach does is create additional delays and pressure on your infrastructure, as well as run the risk of creating other problems if your alert does kick off 732 messages at the same time. Send the messages to a local log file, to the display and so on.
  • Share the dog food. Now, you can share the alerts with the rest of the team as part of a conversation. Yes, a conversation.
  • Embrace the conversation. This process will involve talking to other people. Setting up automation is collaborative because you and the folks who will live with the results day in and day out should agree on everything from base function to message formatting.
  • Set phase-ers to full. Once the automation is working on your test systems, plan on a phased approach. Using the same mechanism you used to limit the alert to just a couple of machines, you are going to widen the net a bit, maybe 10-20 systems. And you test again to observe the results in the wild. Then you expand out to 50 or so. Make sure both you and the recipients are comfortable with what you’re seeing. Remember, by this point, the team is receiving the regular alerts, but you should still be seeing the verbose messages I mentioned earlier. You should be reviewing with the team to make sure what you think is happening is what really is happening.
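The testing guidelines above can be sketched as a small alert-action harness. This is a hypothetical illustration, not any particular monitoring tool’s API: the machine names, the `TEST_MACHINES` scope, the reversed threshold and the log file name are all placeholders for whatever your tool actually exposes.

```python
import logging

# Hypothetical test scope: only these lab machines ever trigger the alert.
TEST_MACHINES = {"lab-web-01", "lab-web-02"}

# Reverse threshold for testing: CPU < 90% fires almost constantly, so we
# can exercise the automation without spiking the test boxes. Flip the
# comparison back to > 90 before the phased rollout.
TEST_THRESHOLD = 90.0

# Be verbose: every step goes to a local log file, not email.
logging.basicConfig(
    filename="alert_test.log",
    level=logging.DEBUG,
    format="%(asctime)s %(levelname)s %(message)s",
)
log = logging.getLogger("alert-test")


def should_fire(machine: str, cpu_percent: float) -> bool:
    """Return True only for in-scope test machines under the reversed threshold."""
    if machine not in TEST_MACHINES:
        return False
    return cpu_percent < TEST_THRESHOLD  # reversed on purpose for testing


def handle_alert(machine: str, cpu_percent: float) -> None:
    # "I'm starting the XYZ step now" messages, as recommended above.
    log.info("Starting alert evaluation for %s", machine)
    if should_fire(machine, cpu_percent):
        log.info("Alert fired for %s (cpu=%.1f%%)", machine, cpu_percent)
        # Eat your own dog food: the alert goes to you, not the server team.
        print(f"[TEST ALERT] {machine} cpu={cpu_percent:.1f}%")
    else:
        log.debug("No alert for %s (cpu=%.1f%%)", machine, cpu_percent)


handle_alert("lab-web-01", 12.5)  # test machine, fires
handle_alert("prod-db-01", 12.5)  # out of scope, silently ignored
```

Widening the net for the phased rollout is then just a matter of growing `TEST_MACHINES` to 10–20 systems, then 50 or so, while the verbose log file keeps recording what actually happened at each step.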

By following these guidelines, any automated response has a good chance of success, or at least you’ll catch bad automation before it does too much damage. A good rule of thumb for automating is to find the things that offer the biggest bang for the least effort. One place to start is your current help-desk tickets: whatever system-based events you are seeing the most of now, that’s probably where you can get the biggest impact. Another good place to find ideas for automation is the lunchroom. Listen to teams complain, and see whether any of those complaints are driven by system failures. If so, it may be an opportunity for automation to save the day. Finally, don’t plan too far ahead. As apprehensive as you may feel right now, after one or two solid (even if small) successes, you will find that teams are seeking you out with suggestions about ways you can help.
