ICYMI: Automation’s Impact on Data Center Monitoring Alerts

(This originally appeared on DataCenterJournal.com and is a continuation of part 3, part 2, as well as part 1)

In my last installment, I discussed a few different areas where data center monitoring automation can not only make life in the data center more convenient but also become a force multiplier. I ran out of space, however, before I ran out of ideas (the story of my life). The one thing I didn’t cover was the automation you can implement in response to an alert.

A Child’s Garden of Automation

As a data center professional, you probably have a solid understanding of monitoring and alerting already, but to truly appreciate how automation can relieve an enormous burden, it may be helpful to review a few examples.

What follows are some clippings from my “garden of automation”—alert responses that have had a huge impact on the environments where they were implemented.

Example 1: Disk Full

Disk-full alerting is a simple concept with a deceptively large number of moving parts, so I want to break it down into specifics. First, get the alert right. As my fellow SolarWinds Head Geek Thomas LaRock and I discussed in a recent episode of SolarWinds Lab, simplistic disk alerts help nobody. If you have a 2TB disk, alerting when it’s 90 percent used means the alert fires while 204.8GB of disk space still remains, which is rarely an emergency.

A good solution to this problem is to check both percent used and remaining space. A better solution is to include logic in the alert that tests for the total space of the drive, so that drives with less than 1TB of space have one set of criteria and drives with 1TB or more have another. These tests should all be in the same alert, if possible, because who wants to manage hundreds of alert rules? Either way, you want to ensure you are monitoring disk space in a way that is reasonable for the volumes in question, and create only the alerts you need.
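
As a rough illustration of the size-aware logic described above, here is a minimal Python sketch. The threshold values and the names `Thresholds`, `SMALL_VOLUME`, `LARGE_VOLUME`, and `should_alert` are my own assumptions for illustration, not settings from any particular monitoring product.

```python
from dataclasses import dataclass

GB = 1024 ** 3
TB = 1024 ** 4


@dataclass(frozen=True)
class Thresholds:
    max_percent_used: float  # alert only at or above this percent used
    min_free_bytes: int      # ...and only at or below this much free space


# Hypothetical criteria: one set for volumes under 1TB, another for 1TB and up.
SMALL_VOLUME = Thresholds(max_percent_used=90.0, min_free_bytes=10 * GB)
LARGE_VOLUME = Thresholds(max_percent_used=95.0, min_free_bytes=50 * GB)


def should_alert(total_bytes: int, free_bytes: int) -> bool:
    """Return True when a volume is both proportionally full and low on free space."""
    limits = SMALL_VOLUME if total_bytes < TB else LARGE_VOLUME
    percent_used = 100.0 * (total_bytes - free_bytes) / total_bytes
    return (percent_used >= limits.max_percent_used
            and free_bytes <= limits.min_free_bytes)
```

With these assumed numbers, a 2TB disk at 90 percent used stays quiet, while the same disk with 40GB free triggers the alert. The two inputs could come from Python's `shutil.disk_usage(path)`, which reports `total` and `free` in bytes.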

Next, clear unnecessary disk files out of various directories. For the purpose of this article, I’ll just say that all systems have a temporary directory and that you can delete all files out of that folder with impunity. The challenge in doing so easily comes down to a problem of impersonation. Many monitoring solutions run on the server as the system account. As a result, performing certain actions requires the script to impersonate a privileged user account. There are a variety of ways to do so, which is why I’ll leave the problem here for you to solve in a way that best fits your individual environment.

Once the impersonation issue is resolved, there’s another challenge specific to the disk-full alert: knowing that the correct directories for the specific server are being targeted. The best approach is to use a common shared folder that maps to all servers and place a script file there. That script can be set up to first detect the proper directories and then clear them out with all the necessary safeguards and checks in place to avoid accidental damage.
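
To make "safeguards and checks" a little more concrete, here is one possible shape for such a cleanup script, sketched in Python. Everything in it is an assumption on my part: the function name `clear_old_files`, the allow-list of roots, the one-day minimum age, and the dry-run default. It does not address the impersonation problem.

```python
import time
from pathlib import Path
from typing import Iterable, List


def clear_old_files(directory: str,
                    allowed_roots: Iterable[str],
                    min_age_seconds: float = 86400,
                    dry_run: bool = True) -> List[Path]:
    """Delete (or, in dry-run mode, just list) old regular files under directory."""
    target = Path(directory).resolve()
    roots = [Path(root).resolve() for root in allowed_roots]

    # Safeguard 1: refuse to touch anything outside the allow-listed roots.
    if not any(target == root or root in target.parents for root in roots):
        raise ValueError(f"{target} is not inside an allowed temp root")

    cutoff = time.time() - min_age_seconds
    removed = []
    for path in sorted(target.rglob("*")):
        # Safeguard 2: skip symlinks and anything that is not a regular file.
        if path.is_symlink() or not path.is_file():
            continue
        # Safeguard 3: leave recently modified files alone.
        if path.stat().st_mtime > cutoff:
            continue
        # Safeguard 4: dry-run by default, so deletion must be requested.
        if not dry_run:
            path.unlink()
        removed.append(path)
    return removed
```

Running it first with the default `dry_run=True` returns the files that would be deleted, which is a cheap way to verify that the right directories are being targeted on a given server.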

Example 2: Restart an IIS Application Pool

Sadly, restarting application pools is often the easiest and best fix for website-related issues. I’m not saying that running appcmd stop… and then appcmd start… from the server command line is a quick kludge that ignores the bigger issues. I’m saying that often, resetting the application pool is the fix.

If your web team finds itself in this situation, waking a human being to do the honors is absolutely your most expensive option. But automatically restarting the application pool becomes slightly more challenging because one server could be running multiple websites, which in turn have multiple application pools. Or you could have one big application pool controlling multiple websites. It all depends on how the server and websites were configured and you have no way of knowing.

If your monitoring solution can monitor the application pool, it will provide the name for you. Most mature monitoring solutions do so already. Once you have the name, you can do the following:

  • Use the built-in restart application pool option that’s now included in most sophisticated monitoring tools.
  • Run this command from the command line of the affected server: appcmd [stop/start] apppool /apppool.name:"<app pool name goes here>". Keep in mind that appcmd.exe may not be in the path. You can typically find it in C:\Windows\System32\inetsrv\appcmd.exe. Also note that appcmd.exe can’t run against a remote system, so you will have to use another utility, such as psexec, to run it from the monitoring server against a remote machine.
  • Run a PowerShell script locally or remotely. The code would look like the below:
# Load IIS module:
Import-Module WebAdministration
# Set a name of the site we want to recycle the pool for:
$site = "Default Web Site"
# Get pool name by the site name:
$pool = (Get-Item "IIS:\Sites\$site" | Select-Object applicationPool).applicationPool
# Recycle the application pool:
Restart-WebAppPool $pool

Example 3: Restart IIS

Running a close second behind restarting application pools is resetting IIS. Doing so is, of course, the nuclear option of website fixes since you are bouncing all websites and all connections. Even though it’s drastic, it’s a necessary step in some cases.

As with restarting application pools, getting a human involved in this incredibly simple action is a waste of everyone’s time and the company’s money. It’s far better to automatically restart and then recheck the website a minute or two later. If all is well, the server logs can be investigated in the morning as part of a postmortem. If the website is still down, it’s time to send in the troops.

You can restart the IIS web server in a number of ways:

  • Use the restart IIS option in your monitoring solution (many of the better tools have this built in).
  • Execute iisreset /restart at the local command line of the affected system.
  • Remotely execute iisreset <computername> /restart.
  • Create and execute a PowerShell command, such as invoke-command -scriptblock {iisreset}.
  • Or, more simply, use the call operator & {iisreset}.
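
The restart-then-recheck flow described above (restart, wait a minute or two, check again, and only then wake someone) can be sketched as follows. This is a Python outline under my own assumptions: the function name, the 90-second default wait, and the three callables are hypothetical placeholders for whatever your monitoring tool provides.

```python
import time
from typing import Callable


def remediate_and_recheck(check: Callable[[], bool],
                          remediate: Callable[[], None],
                          escalate: Callable[[], None],
                          wait_seconds: float = 90.0,
                          sleep: Callable[[float], None] = time.sleep) -> str:
    """Try an automatic fix first; involve a human only if it did not help."""
    if check():
        return "healthy"      # nothing to do
    remediate()               # for example, run iisreset /restart
    sleep(wait_seconds)       # give the service time to come back
    if check():
        return "recovered"    # review the logs in the morning
    escalate()                # still down: send in the troops
    return "escalated"
```

Here `remediate` might wrap a call such as `subprocess.run(["iisreset", "/restart"], check=True)`, and `check` might be an HTTP probe of the affected site. Passing `sleep` as a parameter keeps the function easy to test without actually waiting.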

Example 4: Restart a Server

If restarting the IIS service is the nuclear option, restarting the entire server is akin to nuclear Armageddon. Yet we all know there are times when restarting the server is the best option, given a certain set of conditions that you can monitor. Assuming your monitoring solution doesn’t support a built-in capability for this function, some options include the following:

  • For Linux, issue the command ssh -l <username> <computername> 'shutdown -r now'.
  • In Windows, you can remotely restart a machine by issuing the command shutdown /r /f /t 0 /m \\<machinename> /c <comment to add to eventlog>.
  • Using PowerShell, you can do it with restart-computer <computername>.

Example 5: Restart a Service

Occasionally, services stop. They are sometimes even services that you, as a data center professional who needs to monitor your infrastructure, care about, such as SNMP. So, you are cutting dozens of service-down alerts. Have you thought about restarting them? In some cases, a restart doesn’t really help much. But in far more situations it does. Computers are funny things. After all, “Screws fall out all the time. The world is an imperfect place.” (From The Breakfast Club.)

Sometimes, they just need a gentle nudge. If this is the case, you can do the following:

  • Use the restart service action that is built into most monitoring solutions.
  • Issue the command net start <servicename> on the local computer.
  • Issue the command sc \\<computer> start <servicename> from a remote machine.
  • Run a PowerShell script with the following commands: (get-service -ComputerName <computername> -Name <servicename>).Start().
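  • For Linux systems, run the pkill command, either locally (pkill -9 <process name>) or remotely (ssh -l <username> <computername> 'pkill -9 <process name>'). Keep in mind that pkill only kills the process, so this helps only if something respawns it. On systemd-based distributions, systemctl restart <service name> restarts the service directly.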
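
If the alert action is a script rather than a built-in feature, the options above boil down to choosing the right command line for the target. A small Python sketch of that choice follows. The function name and parameters are hypothetical, and for Linux it assumes a systemd-based host and uses `systemctl restart` rather than pkill, which is my substitution.

```python
import shlex
from typing import List, Optional


def restart_service_command(service: str,
                            os_family: str,
                            host: Optional[str] = None,
                            ssh_user: Optional[str] = None) -> List[str]:
    """Build the argument list for starting or restarting a service."""
    if os_family == "windows":
        if host is None:
            return ["net", "start", service]           # local machine
        return ["sc", rf"\\{host}", "start", service]  # remote machine
    if os_family == "linux":
        local = ["systemctl", "restart", service]      # assumes systemd
        if host is None:
            return local
        if ssh_user is None:
            raise ValueError("ssh_user is required for a remote Linux host")
        # The remote command is passed to ssh as a single quoted string.
        return ["ssh", "-l", ssh_user, host, shlex.join(local)]
    raise ValueError(f"unsupported os_family: {os_family!r}")
```

The returned list can be handed to `subprocess.run(..., check=True)`. Building the arguments as a list, rather than one long string, avoids most quoting problems with service names that contain spaces.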

Example 6: Back Up a Network-Device Configuration

Everything I’ve gone over so far covers direct remediation-type actions. But in some cases, automation can be defensive and informational. Backing up network-device configurations is a good example: the backup doesn’t fix anything, but it gathers additional information to help you fix the issue faster.

It’s important to note that between 40 and 80 percent of all corporate-network downtime is the result of unauthorized or uncontrolled changes to network devices. These changes aren’t always malicious. Often, the change simply went unreviewed by another set of eyes or an otherwise simple error slipped past the team.

So, having the ability to pull a device configuration automatically based on an event trigger is extremely helpful. To do so, you can use one of the following approaches:

  • Copy the config with built-in functions in your monitoring solution.
  • Copy the config with PowerShell:
New-SshSession <device_IP> -Username <username> -Password "<password>"
Invoke-SshCommand -InvokeOnAll -Command "show run" | Out-File "<filepath and filename>"
Remove-SshSession -RemoveAll

There are two general cases when you may want to execute this automatic action. The first is when your monitoring solution receives a config change trap. Although the details of SNMP traps are beyond the scope of this article, you can configure your network devices to send spontaneous alerts on the basis of certain events. One of these events is a configuration change. The second is when the behavior of a device changes drastically, such as when ping success drops below 75 percent or ping latency increases. In either case, often the device is in the process of becoming unavailable. But in some situations, it’s wobbly, and there’s a chance to grab the configuration before it drops completely.
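
The two trigger cases can be expressed as a single decision function. The Python sketch below is only an illustration: the 75 percent ping threshold comes from the text, while the `DeviceSample` fields, the baseline latency, and the factor of three for "latency increases" are assumptions of mine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceSample:
    config_change_trap: bool     # a config-change trap arrived for the device
    ping_success_percent: float  # recent ping success rate, 0 to 100
    ping_latency_ms: float       # recent average latency
    baseline_latency_ms: float   # what is normal for this device


def should_backup_config(sample: DeviceSample,
                         min_ping_success: float = 75.0,
                         latency_factor: float = 3.0) -> bool:
    """Decide whether to pull the running configuration right now."""
    # Case 1: the device told us its configuration changed.
    if sample.config_change_trap:
        return True
    # Case 2: behavior changed drastically, so grab the config while we can.
    if sample.ping_success_percent < min_ping_success:
        return True
    return sample.ping_latency_ms > latency_factor * sample.baseline_latency_ms
```

Keeping the thresholds as parameters makes it easy to tune them per device class without duplicating the alert logic.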

In both of those situations, having the latest configuration provides valuable forensic information that can help troubleshoot the issue. It also gives you a chance to restore the absolute last-known-good configuration, if necessary. And if that leads you to think, “Well, if I have the last-known-good configuration, why can’t I just push that one back?” then you, my friend, have caught the automation bug! Run with it.

Example 7: Reset a User Session

Somewhere in the murky past, the first computer went online and became Node 1 in the vast network we now call the Internet. The next thing that probably happened, mere seconds later, was that the first user forgot to log off their session and left it hanging.

For any system that supports remote connections—whether it’s in the form of telnet/ssh, drive mappings or RDP sessions—having the ability to monitor and manage remote-connection user sessions can make running weekly, if not daily, restarts unnecessary. Or at least much smoother.

For Linux, use the “who” command to discover current sessions, or get greater granularity by remotely running netstat -tnpa | grep 'ESTABLISHED.*sshd'. Once you have the process ID, you can kill it. For Windows, you can list the active sessions on a system using the query session /server:<servername> command and end a session using the reset session <session name or ID> /server:<servername> command. Or you can use the PowerShell cmdlet Invoke-RDUserLogoff.
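
As a rough sketch of the Linux discovery step, the Python below parses captured `who` output and picks out remote sessions that have been logged in longer than a chosen limit. It assumes the `YYYY-MM-DD HH:MM` date layout that GNU `who` commonly prints (other systems and locales differ), and it uses login age as a stand-in for "forgotten," which is my simplification. The names `Session`, `parse_who`, and `stale_sessions` are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, NamedTuple, Optional


class Session(NamedTuple):
    user: str
    terminal: str
    login_time: datetime
    remote_host: Optional[str]


def parse_who(output: str) -> List[Session]:
    """Parse lines such as 'alice pts/0 2024-05-01 09:12 (10.0.0.5)'."""
    sessions = []
    for line in output.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        try:
            login_time = datetime.strptime(f"{fields[2]} {fields[3]}",
                                           "%Y-%m-%d %H:%M")
        except ValueError:
            continue  # unexpected date layout: skip rather than guess
        host = fields[4].strip("()") if len(fields) > 4 else None
        sessions.append(Session(fields[0], fields[1], login_time, host))
    return sessions


def stale_sessions(sessions: List[Session],
                   now: datetime,
                   max_age: timedelta) -> List[Session]:
    """Return remote sessions that have been open longer than max_age."""
    return [s for s in sessions
            if s.remote_host and now - s.login_time > max_age]
```

The result is only a candidate list. Whether to end those sessions automatically, or just report them, is a policy decision for each environment.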

Example 8: Clear DNS Cache

At times, a server and/or application will misbehave because it can’t contact an external system. This misbehavior is either because the DNS cache (the list of known systems and their IP addresses) is corrupt, or because the remote system has moved. In either case, a really easy fix is to clear the DNS cache and let the server attempt to contact the system at its new location.

In Windows, use the command ipconfig /flushdns. In Linux, the command varies from one distribution to another, so it’s possible that sudo /etc/init.d/nscd restart will do the trick, or /etc/init.d/dns-clean, or perhaps another command. Research may be necessary for this one.
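
Because the right command varies by platform, an automated action can simply try a list of candidates and use the first one that exists on the machine. A minimal Python sketch of that idea follows. The first, third, and fourth candidates come from the text above. The `resolvectl flush-caches` entry, for hosts running systemd-resolved, is my own addition, and the function name is hypothetical.

```python
import shutil
from typing import Callable, List, Optional, Sequence

# Candidate commands, tried in order.
CANDIDATES: List[List[str]] = [
    ["ipconfig", "/flushdns"],         # Windows
    ["resolvectl", "flush-caches"],    # systemd-resolved (not from the article)
    ["/etc/init.d/nscd", "restart"],   # distributions running nscd
    ["/etc/init.d/dns-clean"],         # some Debian-style systems
]


def pick_flush_command(
        candidates: Sequence[Sequence[str]] = CANDIDATES,
        which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[List[str]]:
    """Return the first candidate whose executable exists, or None."""
    for argv in candidates:
        if which(argv[0]):
            return list(argv)
    return None
```

If a command is found, it can be run with `subprocess.run(command, check=True)`, usually with elevated privileges. A result of `None` means research is still necessary for that system.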

Wrapping Up

Hopefully at least a few of the things I’ve shared here, and in this series on automation as a whole, have inspired you to give automation a try in your data center. If so, or if you’re already well on your way to automating “all the things,” I’d love to hear about your experiences and perspective in the comments section.
