ICYMI: Data Center Monitoring Automation: Getting Started

(“In Case You Missed It Monday” is my chance to showcase something that I wrote and published in another venue, but is still relevant. This week’s post originally appeared on DataCenterJournal.com)

Previously, I covered the what and why of automation. I then digressed a bit to explain that automation is no excuse for skipping the ITIL best practices that actually fix problems, and I offered some reassurance that automation need not be scary.

Now, as promised, I will describe some concrete ways you can automate monitoring with your own tools, in your own data center.

Device Discovery and Connectivity

As a first step, make sure you can scan your network in several ways, alone or in combination:

Enter a subnet
Enter a list of IP addresses or DNS-resolvable machine names
Enter a seed device from which the network is discovered by finding connected devices
Specify an Active Directory (AD) organizational unit (OU) and scan the computers in that OU

It may help to use a monitoring solution that provides the ability to specify devices that should not be scanned. This “no-fly zone” capability should take the exact same input as the discovery—in other words, a subnet, list of IPs, AD OU and so on.
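To make the “no-fly zone” idea concrete, here is a minimal Python sketch. It assumes the only inputs are subnets and literal IP addresses. DNS names and AD OUs would have to be resolved to addresses first, and this sketch leaves that step out. The function names are hypothetical. The point is that the exclusion list takes the same kind of input as the discovery list and is subtracted before anything is scanned.

```python
import ipaddress
from typing import Iterable


def expand_targets(subnets: Iterable[str], addresses: Iterable[str]) -> set[ipaddress.IPv4Address]:
    """Expand subnets and explicit IP addresses into a flat set of host addresses."""
    targets: set[ipaddress.IPv4Address] = set()
    for cidr in subnets:
        # strict=False tolerates inputs like "10.0.0.5/24" that have host bits set
        targets.update(ipaddress.IPv4Network(cidr, strict=False).hosts())
    for addr in addresses:
        targets.add(ipaddress.IPv4Address(addr))
    return targets


def build_scan_list(include_subnets: Iterable[str], include_ips: Iterable[str],
                    exclude_subnets: Iterable[str], exclude_ips: Iterable[str]) -> list[ipaddress.IPv4Address]:
    """Apply the no-fly zone: exclusions use the same input format as inclusions."""
    wanted = expand_targets(include_subnets, include_ips)
    no_fly_nets = [ipaddress.IPv4Network(c, strict=False) for c in exclude_subnets]
    no_fly_ips = {ipaddress.IPv4Address(a) for a in exclude_ips}
    # Membership is tested against whole excluded networks, so every address
    # inside them is skipped, not just their usable host range
    return sorted(
        addr for addr in wanted
        if addr not in no_fly_ips and not any(addr in net for net in no_fly_nets)
    )
```

For example, including 192.168.1.0/29 plus 10.0.0.5 while excluding 192.168.1.0/30 and 192.168.1.6 leaves only 10.0.0.5, 192.168.1.4 and 192.168.1.5 to scan.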

When discovery finds a device, the interrogation (to determine device type, hardware components and so on) should take the gentlest possible approach, requesting just a few basic pieces of information such as the device name, vendor and model. That information should then kick off a discovery of the specific elements unique to that vendor and model, rather than a walk of every known SNMP OID (object identifier). You will also find it helpful to be able to use protocols other than SNMP, including WMI and vendor APIs (Cisco UCS, VMware and Microsoft Hyper-V, to name a few).
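A minimal sketch of the “identify first, then target” approach might look like the following. It assumes a hypothetical identity probe has already returned the device name, vendor and model. The profile table, protocol labels and element names are illustrative placeholders, not any product's actual catalog.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Identity:
    name: str
    vendor: str
    model: str


# Hypothetical catalog: which protocol to use and which element types to
# enumerate for each vendor/model family. Entries are illustrative only.
PROFILES = {
    ("cisco", "catalyst"): {"protocol": "snmp", "elements": ["interfaces", "fans", "power_supplies"]},
    ("cisco", "ucs"): {"protocol": "vendor_api", "elements": ["blades", "fabric_interconnects"]},
    ("vmware", "esxi"): {"protocol": "vendor_api", "elements": ["vms", "datastores"]},
}
# Deliberately small fallback instead of a full walk of every known OID
GENERIC_PROFILE = {"protocol": "snmp", "elements": ["interfaces"]}


def plan_discovery(identity: Identity) -> dict:
    """Choose a targeted discovery plan from the device's basic identity."""
    vendor = identity.vendor.lower()
    model = identity.model.lower()
    for (p_vendor, p_model), profile in PROFILES.items():
        # Prefix match so "Catalyst 9300" picks up the "catalyst" profile
        if vendor == p_vendor and model.startswith(p_model):
            return {"device": identity.name, **profile}
    return {"device": identity.name, **GENERIC_PROFILE}
```

The design choice here is to fail small. An unrecognized device gets a minimal generic plan rather than an exhaustive walk, and adding support for a new model means adding one profile entry.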

Finally, part of the scan should determine connectivity to other devices. It should reveal as much as possible, including which switch a server is connected to and the VM, host and cluster hierarchy, both within the data center and beyond it.
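The sketch below shows one way to turn raw connectivity data into the two views mentioned above. It assumes you already have two kinds of records:

- neighbor records, for example from a switch's LLDP or CDP neighbor table;
- VM placement records from your hypervisor manager.

The function names and record shapes are hypothetical.

```python
from collections import defaultdict


def server_uplinks(neighbor_rows):
    """Map each server to the (switch, port) pairs it is connected to.

    neighbor_rows: (switch, switch_port, neighbor_name) tuples, e.g. as might be
    read from a switch's LLDP/CDP neighbor table (an assumption of this sketch).
    """
    uplinks = defaultdict(list)
    for switch, port, neighbor in neighbor_rows:
        uplinks[neighbor].append((switch, port))
    return dict(uplinks)


def vm_hierarchy(placements):
    """Build a cluster -> host -> [VMs] tree from (vm, host, cluster) records."""
    tree = defaultdict(lambda: defaultdict(list))
    for vm, host, cluster in placements:
        tree[cluster][host].append(vm)
    # Convert the nested defaultdicts to plain dicts for predictable output
    return {cluster: dict(hosts) for cluster, hosts in tree.items()}
```

With these two maps in hand, an alert on a switch port can be traced to the host behind it, and from there to the cluster and the VMs it carries.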

Automating Discovery

But that procedure covers only the initial scan. What about subsequent scans?

The next step is to take the whole discovery profile we just discussed and have your data center monitoring system run it on a schedule:

Every x hours/days/weeks
Every xth day of the week
Every xth day of the month
At specified times of the day
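The four schedule types above can be expressed as small next-run calculations. This is a minimal sketch using only the Python standard library. It interprets “every xth day of the week” as “on a chosen weekday.” The function names are hypothetical, not any monitoring tool's scheduler API.

```python
from datetime import date, datetime, time, timedelta


def next_interval(last_run: datetime, every: timedelta) -> datetime:
    """Every x hours/days/weeks: add the interval to the last run."""
    return last_run + every


def next_weekday(now: datetime, weekday: int, at: time) -> datetime:
    """On a chosen day of the week (Monday=0 ... Sunday=6) at a fixed time."""
    days_ahead = (weekday - now.weekday()) % 7
    candidate = datetime.combine(now.date() + timedelta(days=days_ahead), at)
    if candidate <= now:  # today's slot has already passed; use next week
        candidate += timedelta(days=7)
    return candidate


def next_monthday(now: datetime, day: int, at: time) -> datetime:
    """On a given day of the month; months too short for `day` are skipped."""
    if not 1 <= day <= 31:
        raise ValueError("day must be between 1 and 31")
    year, month = now.year, now.month
    while True:
        try:
            candidate = datetime.combine(date(year, month, day), at)
        except ValueError:  # e.g. day=31 in a 30-day month
            candidate = None
        if candidate is not None and candidate > now:
            return candidate
        month += 1
        if month > 12:
            year, month = year + 1, 1


def next_time_of_day(now: datetime, times: list[time]) -> datetime:
    """At any of several specified times of day."""
    for t in sorted(times):
        candidate = datetime.combine(now.date(), t)
        if candidate > now:
            return candidate
    # All of today's slots have passed; use the earliest slot tomorrow
    return datetime.combine(now.date() + timedelta(days=1), min(times))
```

For example, a monthly scan set for the 31st that last ran on January 31 next runs on March 31, because February has no 31st.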

In addition, implement controls to turn off the scan if it runs past a particular time of the day, or if it has been running for more than a fixed duration. In this way, you can break down large environments into manageable slices; set up robust, sophisticated discoveries; and avoid overworking your monitoring system and the environment you’re scanning.
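One way to implement those controls is to split the scan into slices, for example one subnet per slice. Before starting each slice, the scan checks a single deadline: the cutoff time or the maximum duration, whichever comes first. The sketch below assumes that approach. `scan_slice` is a hypothetical stand-in for whatever actually performs the discovery, and the `clock` parameter exists only so the logic can be tested.

```python
from datetime import datetime, time, timedelta
from typing import Callable, Iterable, List


def cutoff_deadline(started: datetime, cutoff: time) -> datetime:
    """First occurrence of the cutoff time of day after the scan started."""
    deadline = datetime.combine(started.date(), cutoff)
    if deadline <= started:  # e.g. start 22:00, cutoff 06:00 means next morning
        deadline += timedelta(days=1)
    return deadline


def run_in_slices(
    slices: Iterable[str],
    scan_slice: Callable[[str], None],
    started: datetime,
    cutoff: time,
    max_duration: timedelta,
    clock: Callable[[], datetime] = datetime.now,
) -> List[str]:
    """Scan slices in order, stopping before a new slice once either limit is hit.

    Returns the slices that were not scanned so a later run can resume them.
    """
    deadline = min(cutoff_deadline(started, cutoff), started + max_duration)
    remaining = list(slices)
    while remaining and clock() < deadline:
        scan_slice(remaining.pop(0))
    return remaining
```

Because the deadline is checked between slices, a slice that is already running is allowed to finish. The unscanned slices are returned so the next scheduled run can pick up where this one stopped.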

Finally, in addition to running the scan as a scheduled job, you should configure it to be triggered by events. For example, if an interface on a router has been down for more than 30 minutes, trigger a scan of the subnet that interface belonged to. The scan checks whether a new interface has been brought up and whether any new far-end devices have come online. Whatever the triggering event, the goal is the same: a controlled discovery set off by real-time events in your data center.
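Here is a minimal sketch of the router-interface example, assuming your monitoring system can report when each interface went down. The 30-minute threshold mirrors the example above. The `InterfaceStatus` record, the function name and the de-duplication set are hypothetical design choices.

```python
import ipaddress
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, Optional

DOWN_THRESHOLD = timedelta(minutes=30)


@dataclass
class InterfaceStatus:
    router: str
    name: str
    address: str                    # e.g. "10.20.30.1/24" as configured on the interface
    down_since: Optional[datetime]  # None while the interface is up


def subnets_to_rescan(statuses: Iterable[InterfaceStatus], now: datetime,
                      already_triggered: set[tuple[str, str]]) -> list[str]:
    """Return subnets whose router interface has been down past the threshold.

    already_triggered remembers (router, interface) pairs so that one outage
    produces one controlled discovery, not a new scan on every poll.
    """
    targets = []
    for s in statuses:
        key = (s.router, s.name)
        if s.down_since is None:
            already_triggered.discard(key)  # interface recovered; re-arm the trigger
            continue
        if now - s.down_since >= DOWN_THRESHOLD and key not in already_triggered:
            already_triggered.add(key)
            # Derive the subnet the interface belonged to from its address/prefix
            targets.append(str(ipaddress.ip_interface(s.address).network))
    return targets
```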

Processing the Discovered Devices

That brings us to the question of what to do with new hardware when it is found. The hard truth that many of us run headlong into is that not all hardware in a data center needs to be monitored, and even in highly regimented and controlled environments, not every device that appears in a subnet is supposed to be there.

So, first, your monitoring system should at the very least list newly discovered devices for approval. Second, it should be able to list the subelements of those devices. Third, it should let you filter out certain element types outright; for example, nobody ever needs to monitor the CD drive for disk capacity. More to the point, these filters should be specific to each device type.
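The approval list and per-device-type filters might be modeled like this. The device types, element types and filter contents are hypothetical examples chosen to mirror the CD-drive case. Substitute whatever categories your monitoring tool exposes.

```python
from dataclasses import dataclass, field

# Hypothetical per-device-type filters: element types never worth monitoring
# on that kind of device. Values are illustrative only.
ELEMENT_FILTERS: dict[str, set[str]] = {
    "windows_server": {"cdrom_volume", "floppy_volume"},
    "linux_server": {"tmpfs_volume", "loopback_interface"},
    "switch": {"null_interface"},
}


@dataclass
class DiscoveredDevice:
    name: str
    device_type: str
    elements: list[tuple[str, str]] = field(default_factory=list)  # (element_type, element_name)
    approved: bool = False


def pending_approval(devices: list[DiscoveredDevice]) -> list[str]:
    """List newly discovered devices that nobody has approved yet."""
    return [d.name for d in devices if not d.approved]


def monitorable_elements(device: DiscoveredDevice) -> list[tuple[str, str]]:
    """Sub-elements left after dropping the types filtered for this device type."""
    blocked = ELEMENT_FILTERS.get(device.device_type, set())
    return [(etype, ename) for etype, ename in device.elements if etype not in blocked]
```

Keying the filters by device type means a rule like “ignore CD drives” applies to every Windows server at once, without touching switches or appliances.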

Application Discovery

Although hardware discovery is important, it tends to be the simplest aspect of the total monitoring-automation story. In the years since I started in IT, hardware discovery, identification and enumeration have become fairly standard and predictable. But applications continue to be their own ball of wax. Figuring out what is installed on a server, what is running and what those running applications are doing continues to be a challenge for even the most advanced data center professionals.

Most application-monitoring vendors work diligently to create and maintain libraries of application signatures. A signature is the combination of running services, file names in standard locations and registry entries that indicates, with a high degree of certainty, that a piece of software is present. Even so, accurately understanding what is running on a server remains technically challenging. Yet robust and accurate application monitoring is critical to the business.
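To illustrate how a signature yields a “degree of certainty,” the sketch below scores each signature as the fraction of its expected evidence found on a server. The scoring rule, names and 0.75 default threshold are hypothetical. Real vendor signature libraries are considerably more sophisticated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signature:
    app: str
    services: frozenset[str] = frozenset()
    files: frozenset[str] = frozenset()
    registry_keys: frozenset[str] = frozenset()


@dataclass(frozen=True)
class Inventory:
    """What was actually observed on one server."""
    services: frozenset[str]
    files: frozenset[str]
    registry_keys: frozenset[str]


def confidence(sig: Signature, inv: Inventory) -> float:
    """Fraction of a signature's expected evidence that is present on the server."""
    expected = len(sig.services) + len(sig.files) + len(sig.registry_keys)
    if expected == 0:
        return 0.0
    found = (len(sig.services & inv.services)
             + len(sig.files & inv.files)
             + len(sig.registry_keys & inv.registry_keys))
    return found / expected


def detect(signatures: list[Signature], inv: Inventory, threshold: float = 0.75) -> list[tuple[str, float]]:
    """Return (app, score) pairs that clear the threshold, best match first."""
    hits = [(s.app, confidence(s, inv)) for s in signatures]
    return sorted((h for h in hits if h[1] >= threshold), key=lambda h: -h[1])
```

Consider a hypothetical XYZ_App signature with one service, one file and two registry keys. If three of those four items are present on a server, it scores 0.75 and just clears the default threshold.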

So, how do you ensure that business-critical applications are being properly monitored? In addition to common-sense items of software discovery, there’s one form of data center automation that has the potential to make your application monitoring push-button simple: assignment based on role.

To understand this concept, let me clarify a few assumptions:

Not all applications are equal, even if they are the same. What I mean is that an Exchange server running in the DMZ, an Exchange server running in the data center completely inside the corporate firewall and an Exchange server running in a cloud instance are all running Exchange but are completely different in their usage, security profile and needs.
Even within those specifications, a single server will have multiple statuses during its life cycle, which will affect the level of monitoring it needs: build versus test versus production versus decommissioned.
You—meaning the technical team in your organization that is requesting and provisioning those servers—already know what the usage and needs are.

With these assumptions in mind, it should be clear that monitoring automation requires multiple variations of the same template or set of application monitoring components. But how will you know which variation to apply?

The answer is to use the information that is already in your asset-management system, your provisioning requests or even your naming conventions. With properties such as the following, most robust monitoring tools can automatically assign or unassign monitoring based on role, status, location and so on:

Net_location: DMZ, data center, warehouse, remote, etc.
Disposition: build, stage, test, preproduction, production, decommissioned, etc.
Business_critical: 1 through 5
Primary_use: SQL, AD, Tomcat, file server, etc.
Associated_application: email, order entry, XYZ_App, etc.
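Here is a minimal sketch of role-based assignment. It assumes the properties above are available as one dictionary per server, with keys lowercased here. The rules and template names are hypothetical, and the sketch assumes a Business_critical value of 1 means most critical.

```python
from typing import Callable

# Hypothetical rules: (condition on a server's properties, monitoring template).
RULES: list[tuple[Callable[[dict], bool], str]] = [
    (lambda p: p.get("primary_use") == "SQL", "sql_server_base"),
    (lambda p: p.get("primary_use") == "SQL" and p.get("disposition") == "production",
     "sql_server_full"),
    (lambda p: p.get("net_location") == "DMZ", "dmz_security_checks"),
    # Assumes Business_critical 1 is the most critical tier (missing -> least critical)
    (lambda p: p.get("business_critical", 5) <= 2 and p.get("disposition") == "production",
     "page_on_call"),
    (lambda p: p.get("associated_application") == "email", "email_service_checks"),
]


def templates_for(props: dict) -> set[str]:
    """Return the monitoring templates a server should have, given its properties."""
    if props.get("disposition") == "decommissioned":
        return set()  # decommissioned servers get no monitoring at all
    return {template for condition, template in RULES if condition(props)}
```

Consider a production SQL server with a Business_critical value of 1. It gets the base and full SQL templates plus on-call paging. The same server in test gets only the base SQL template. An Exchange server in the DMZ picks up DMZ-specific checks that an internal one would not, which is the “same application, different needs” case described above.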

When applied correctly, these properties allow your application monitoring to proceed without frequent scans because the monitoring you apply is based on your intended use of the server.
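Because monitoring follows the properties rather than repeated scans, some changes need no rediscovery at all. Moving a server from test to production, or to decommissioned, only requires comparing what is applied with what should be applied. The sketch below assumes a hypothetical mapping from Disposition to template names.

```python
# Hypothetical mapping from lifecycle status (Disposition) to monitoring templates.
DISPOSITION_TEMPLATES: dict[str, set[str]] = {
    "build": set(),
    "test": {"basic_health"},
    "production": {"basic_health", "app_checks", "on_call_alerts"},
    "decommissioned": set(),
}


def reconcile(applied: set[str], disposition: str) -> tuple[set[str], set[str]]:
    """Return (templates to assign, templates to unassign) for one server."""
    if disposition not in DISPOSITION_TEMPLATES:
        # Unknown status: change nothing rather than silently strip monitoring
        return set(), set()
    desired = DISPOSITION_TEMPLATES[disposition]
    return desired - applied, applied - desired
```

Run this whenever the asset record changes. Promoting a test server to production adds the application checks and alerting. Marking it decommissioned removes everything, so a retired box stops generating alerts.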

What About the Cloud?

All this talk of servers and applications may sound woefully metal-centric, as in “old school bare-metal servers with manually installed applications.” But the truth is that all of these monitoring-automation techniques apply to hybrid IT or even pure cloud environments.

For cloud-based systems, monitoring hooks can (and should) be included in the build scripts so that each new server reaches out to the monitoring solution and registers its existence along with its purpose, location and so on.
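For the cloud case, a build script might assemble a registration message like the one below. This is a sketch only. The endpoint URL, bearer token and payload fields are hypothetical placeholders, since every monitoring product exposes its own registration API. It builds the request with Python's standard `urllib.request` module but does not send it.

```python
import json
import urllib.request

# Hypothetical registration endpoint; your monitoring product's real API will differ.
MONITORING_URL = "https://monitoring.example.com/api/register"


def build_registration(hostname: str, ip: str, properties: dict) -> dict:
    """Assemble what a new instance reports about itself: existence plus purpose."""
    return {"hostname": hostname, "ip": ip, "properties": properties}


def registration_request(payload: dict, url: str = MONITORING_URL,
                         token: str = "CHANGE_ME") -> urllib.request.Request:
    """Create (but do not send) an HTTP POST carrying the JSON payload."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        # Hypothetical bearer-token auth; use whatever your tool actually requires
        headers={"Content-Type": "application/json", "Authorization": f"Bearer {token}"},
        method="POST",
    )
```

In a real build script, you would send the request with `urllib.request.urlopen(req, timeout=10)`. Do that only once the server's hostname, address and properties (such as Disposition, Net_location and Primary_use) are known.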

What’s Missing?

Eagle-eyed readers may have noticed one glaring omission from all this talk of data center monitoring automation: automated responses to alerts. Truth is, I’m saving the best for last. So look for the next (and final) installment on automation for some real-world examples of ways you can reduce cost, increase productivity and save your uptime figures with alert-trigger automation.
