(“In Case You Missed It Monday” is my chance to showcase something that I wrote and published in another venue, but is still relevant. This week’s post originally appeared on DataCenterJournal.com)
In my last post, I highlighted the need for those of us who work in data centers to think more broadly: beyond managing the data we’re responsible for maintaining, and beyond turning it into information, to transforming that information into action (in the form of automation). That made me realize that even in 2016, not everyone has had a chance to work in an environment where monitoring and automation function together as a seamless whole, so some may not understand why I’m so emphatic about it.
Therefore, I’d like to explain what I’m talking about and offer some examples to illustrate.
If a perfectly good monitoring system is in place, you may wonder why automation is so important.
The answer is because you’re lazy. Because most seasoned IT professionals are lazy. Because spending your day responding to tickets, alerts and emails is both tedious and boring. Most of all, because your time is valuable.
As a systems-monitoring engineer, my second-favorite thing is listening to colleagues answer the question, “How do you know something is wrong?” It makes them define “bad” for themselves, which leads to good data center monitoring and alerting. But as I mentioned in my last post, my favorite thing is following with, “So, then what? When ‘it’ is bad, like you just described, what do you do about it?”
The answer to that question sits at the heart of the conversational path that leads to automation. Maybe after they get an alert they clear a queue, restart a service, or delete all of the files in a temp directory. Whatever it is, it will likely be something that could have run without human intervention, assuming the action isn’t already built into the monitoring tool. This is why sophisticated monitoring solutions allow you to build an alert that triggers an initial action, then waits a specified amount of time. If the condition persists, a second (or third, fourth or whatever) action will be triggered.
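To make the trigger, wait, re-check and escalate pattern concrete, here is a minimal Python sketch. It is not any particular monitoring product's API: `run_escalation`, the `condition_persists` callback and the three example actions are hypothetical names invented for illustration.

```python
import time
from typing import Callable, List, Sequence


def run_escalation(
    condition_persists: Callable[[], bool],
    actions: Sequence[Callable[[], None]],
    wait_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run actions in order for as long as the alert condition persists.

    Returns the number of actions that were triggered.
    """
    triggered = 0
    for action in actions:
        if not condition_persists():
            break  # the previous action fixed it, so stop escalating
        action()
        triggered += 1
        sleep(wait_seconds)  # give the action time to take effect
    return triggered


# Hypothetical demo: restarting the service does not help, clearing the queue does.
state = {"queue_depth": 5000}
history: List[str] = []


def restart_service() -> None:
    history.append("restart service")


def clear_queue() -> None:
    history.append("clear queue")
    state["queue_depth"] = 0


def page_on_call() -> None:
    history.append("page on-call")


count = run_escalation(
    condition_persists=lambda: state["queue_depth"] > 1000,
    actions=[restart_service, clear_queue, page_on_call],
    wait_seconds=0,  # no delay in the demo; a real alert might wait minutes
)
print(count, history)
```

In this demo the human is never paged, because the second automated action clears the condition before the third step is reached.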
But let’s say there is no definitive action that can be taken. Maybe the answer is to check the last 15 lines of a certain log file, look at another counter and run a test query from the application server to the database. Then, on the basis of the results, you will know what to do next. In this case, your automated action is to do all those steps and then insert those results into the alert message. You will then receive a ticket that already contains greater insight into the conditions at the time of the failure, as opposed to 15 minutes later when you’ve dragged your butt out of bed, fired up the laptop and started to dig into the situation.
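As a rough sketch of that "gather the evidence first" idea, the Python below tails a log file and runs a set of checks, then appends everything to the alert text. The function names, the `checks` dictionary and the message layout are assumptions made for illustration; the counter lookup and the test query would be whatever callables fit your environment.

```python
from collections import deque
from typing import Callable, Dict, List


def tail(path: str, line_count: int = 15) -> List[str]:
    """Return the last `line_count` lines of a text file."""
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        return [line.rstrip("\n") for line in deque(handle, maxlen=line_count)]


def enrich_alert(message: str, log_path: str,
                 checks: Dict[str, Callable[[], str]]) -> str:
    """Append log context and check results to an alert message."""
    parts = [message, f"--- last lines of {log_path} ---"]
    try:
        parts.extend(tail(log_path))
    except OSError as error:
        parts.append(f"(could not read log: {error})")
    for name, check in checks.items():
        try:
            result = check()
        except Exception as error:  # a failed check is itself useful context
            result = f"check failed: {error}"
        parts.append(f"{name}: {result}")
    return "\n".join(parts)
```

Note that a check that throws is recorded rather than allowed to swallow the alert. If the test query to the database times out, that fact belongs in the ticket too.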
Trust me when I tell you that doing this kind of thing will make heroes out of you and your monitoring system. Now, with a clearer understanding of why automation is a good idea, let’s look at some of the areas where automation brings solid, measurable value to the monitoring discipline of IT.
Discovery, Part One
One of the first things you do when you fire up a new monitoring system is load devices. Although you can do so manually by specifying the name or IP address of the machines and connection information, in any data center larger than about 20 systems, this task becomes an exercise in tedium and frustration. Scanning the environment is far more convenient and yields a complete picture. For about a week.
Afterwards, in data centers larger than 20 systems, devices have likely been added, changed or removed. Manually kicking off a scan of the environment gets old after the third week. Now is often when the itch for automation starts.
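What an automated rescan buys you is the difference between two scans. The following Python sketch shows one way to compute it, assuming each scan is reduced to a simple mapping of device name to IP address. `diff_inventory` and `InventoryDiff` are hypothetical names, and how the scan is scheduled is left to whatever your tool or scheduler provides.

```python
from typing import Dict, List, NamedTuple


class InventoryDiff(NamedTuple):
    added: List[str]
    removed: List[str]
    changed: List[str]


def diff_inventory(previous: Dict[str, str],
                   current: Dict[str, str]) -> InventoryDiff:
    """Compare two scans, each mapping device name -> IP address."""
    added = sorted(current.keys() - previous.keys())
    removed = sorted(previous.keys() - current.keys())
    changed = sorted(
        name for name in previous.keys() & current.keys()
        if previous[name] != current[name]
    )
    return InventoryDiff(added, removed, changed)


# Hypothetical scans taken a week apart.
last_week = {"web-01": "10.0.0.11", "web-02": "10.0.0.12", "db-01": "10.0.0.21"}
this_week = {"web-01": "10.0.0.11", "web-02": "10.0.0.52", "app-01": "10.0.0.31"}
print(diff_inventory(last_week, this_week))
```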
Discovering a bunch of devices is good, but knowing how they all connect is better. Why? Imagine a simple scenario in which you have one router connected to two switches, which are connected to 10 servers. If the router goes down, how many device-down alerts should you get? That’s right: one. Just one. For the router.
Now, how many alerts will you typically get? If you answered 10 (for the servers), 12 (for the servers plus the switches) or any number greater than one, you may be right and at the same time very wrong.
The solution is to set up downstream suppression so that when the router goes down, the rest of the downstream devices go into an unreachable (i.e., not down) state. Although you can do this task manually, it’s much better when automation does it for you.
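The router scenario can be sketched in Python as follows. This is a simplified illustration, assuming each device has a single upstream parent and that the ten servers are split five per switch; `classify_outages` and the device names are hypothetical.

```python
from typing import Dict, Optional, Set


def classify_outages(parent_of: Dict[str, Optional[str]],
                     not_responding: Set[str]) -> Dict[str, str]:
    """Label each silent device 'down' or 'unreachable'.

    A device is only 'unreachable' when its upstream parent is silent too,
    because then we cannot tell whether the device itself has failed.
    """
    status = {}
    for device in sorted(not_responding):
        parent = parent_of.get(device)
        if parent is not None and parent in not_responding:
            status[device] = "unreachable"
        else:
            status[device] = "down"
    return status


# One router, two switches, ten servers (five per switch is an assumption).
parent_of: Dict[str, Optional[str]] = {"router": None}
for s in (1, 2):
    switch = f"switch-{s}"
    parent_of[switch] = "router"
    for n in range(1, 6):
        parent_of[f"server-{s}-{n}"] = switch

# The router fails, so nothing behind it answers a poll either.
status = classify_outages(parent_of, set(parent_of))
alerts = [device for device, state in status.items() if state == "down"]
print(alerts)
```

Only the router ends up in the "down" list; the other twelve devices are marked unreachable and generate no alert. The automation opportunity is in building `parent_of` from discovered topology rather than maintaining it by hand.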
Discovery, Part Two
Say you’ve found every device that was hiding in the nooks and crannies of your data center. You’ve scanned them to find out how they’re connected. You’ve established your virtualization hierarchy, from VM to host to cluster to data center. You’ve even determined what kind of hardware each device is, how many interfaces and drives they have, and the status of their physical hardware, such as fans, temperature sensors, RAID controllers and more. You’ve got it all figured out.
Now if you could just figure out which of those stinkers is your Exchange box.
That’s right. This discovery is all about software. As in the previous section, “Discovery, Part One,” how this is done is less important than how often. Because, sure, you can scan a server when you add it and determine that it is running Exchange or SharePoint or your company’s custom app, but you’ll never know (until it’s too late) that three weeks later the app developer enabled IIS—or, worse, DHCP.
If anything, you need automation to scan for applications even more than you need it to scan for new, changed or deleted devices. But it’s actually more complicated than that.
It’s one thing to know that a server is running IIS (or DHCP, SharePoint, etc.). It’s another thing to understand what your company thinks about that server. Is it a critical SharePoint box, or just a QA server? Is it currently a new build that is under test or is it full production?
In many organizations, a server will go through a build and staging process before moving to production. At the other end of the cycle, it may spend considerable time in a decommissioned state where it is running but not in use except during emergencies. Similarly, some companies have systems that could move from test to QA to production and/or back again.
In each of those cases, the intensity of the monitoring may change drastically. So, the automation opportunity here is not so much detection as it is applying the correct templates on the basis of custom variables. You want to empower the owner teams to set these attributes and then have the monitoring system detect those settings and automatically adjust the monitoring accordingly.
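A minimal Python sketch of attribute-driven templates might look like this. Everything in it is an assumption for illustration: the `lifecycle` attribute name, the stage names, the polling intervals and the alerting modes are invented, not taken from any specific tool.

```python
from typing import Dict

# Hypothetical templates: monitoring intensity per lifecycle stage.
TEMPLATES: Dict[str, Dict[str, object]] = {
    "build":          {"poll_seconds": 900, "alerting": "none"},
    "qa":             {"poll_seconds": 300, "alerting": "email"},
    "production":     {"poll_seconds": 60,  "alerting": "page"},
    "decommissioned": {"poll_seconds": 900, "alerting": "email"},
}

# Assumed policy: an unlabeled server is treated as production, so a missing
# attribute never results in silently weaker monitoring.
DEFAULT_STAGE = "production"


def select_template(attributes: Dict[str, str]) -> Dict[str, object]:
    """Pick a monitoring template from the owner-maintained attributes."""
    stage = attributes.get("lifecycle", "").strip().lower()
    return TEMPLATES.get(stage, TEMPLATES[DEFAULT_STAGE])


print(select_template({"role": "sharepoint", "lifecycle": "QA"}))
```

With this shape, the owner team only changes the attribute when a server is promoted or retired, and the next evaluation picks up the matching template without anyone touching the monitoring configuration.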
And Then There’s Alerting
Finally we come back around to alerting. But by this point, I’m hopeful you already understand the concept behind automated alert actions. What we need now is to dig into specifics—the proverbial steak to go with the sizzle.
Where to Go From Here?
Good automation is enabled by and is a result of good monitoring. When done correctly, it’s elegantly simple, and most importantly, it’s not artisanal—it’s just automation the way automation is meant to work. In the end, monitoring and automation are limited only by your ability to imagine and implement, assuming you’ve got a good monitoring tool in place.
Stay tuned over the next little while as I explore more specifics of the areas laid out here and offer techniques and thoughts about how to implement automation in your environment, in the hopes that it makes your life easier. And your weekends longer.