(This originally appeared in the print version of Network Security Newsletter. I’ve reprinted it below.)
IT professionals take a certain pride in their own IT infrastructure, believing that their environment is somehow unique and special – measurably different from the usual enterprise infrastructure designs. Therefore, when it comes to best practices, common techniques and standard solutions, there is no one-size-fits-all approach. Instead, the approach must be tailored to fit the delicate and distinctive DNA of each individual IT infrastructure.
Experience shows that this perception of a unique, one-of-a-kind IT infrastructure is found most commonly in monitoring technologies. IT departments that treat their particular mix of servers, applications, network devices and overall infrastructure as exclusive have often built a custom monitoring system in-house. There is one caveat here, though. Rather than a logical, properly thought-out monitoring system, these custom solutions often come across as more of an interpretative dance, requiring special care and attention from a specifically trained sysadmin who can speak their mysterious language.
However, speaking as someone with over 30 years of experience in IT and monitoring – and having used my fair share of monitoring solutions and tools in every imaginable environment – I think monitoring is simple and doesn’t need to be transformed into a complex infrastructure.
Monitoring is nothing more (or less) than the ongoing, regular collection of a consistent set of metrics from a group of targets, whether these are physical devices, virtual machines, cloud-based applications or something in between. It is not an alarm, page, screen or ticket. It is the collection of metrics, data and information from a set of devices. Everything else – alerts, tickets and automation – is a happy by-product of the first part. All of this is in support of the true goal: monitoring should provide meaningful, actionable alerts (rather than just a lot of noise), seamlessly collect the statistics and insights you need and perform automated responses for frequent issues.
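To make that definition concrete, here is a minimal sketch of a collection loop in Python. It assumes the third-party psutil library and polls a small, fixed set of local metrics on a regular interval; the metric set, interval and output file are arbitrary choices for illustration, not features of any particular monitoring product.

```python
# Minimal sketch of a metric-collection loop: gather a consistent set of
# readings on a fixed schedule and append them to a log file.
# Assumptions: psutil is installed; the metric set, interval and output
# path are example choices only.
import json
import time

import psutil


def collect_metrics():
    """Gather one consistent set of readings for this host."""
    return {
        "timestamp": time.time(),
        "cpu_percent": psutil.cpu_percent(interval=1),
        "memory_percent": psutil.virtual_memory().percent,
        "disk_percent": psutil.disk_usage("/").percent,
    }


if __name__ == "__main__":
    POLL_INTERVAL_SECONDS = 60  # example interval
    while True:
        sample = collect_metrics()
        with open("metrics.log", "a") as log:
            log.write(json.dumps(sample) + "\n")
        time.sleep(POLL_INTERVAL_SECONDS)
```

Everything else in this article – alerts, enrichment, automated responses – is built on top of a loop no more complicated than this.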
As with most things, initially setting up a good monitoring system can be a hard slog. Rest assured, though, that the rewards will be worth it.
Oddly Elusive
One magnificent part of monitoring is often overlooked, and that is automation – an aspect of monitoring that remains oddly elusive to many organisations. Many IT professionals believe that automation is best used with servers and applications, and that the only way to use it is through the brave new world of software-defined networking (SDN). However, this isn’t entirely the case.
Good and effective automation can be the result of good monitoring. For example, with a robust monitoring system in place, it is simple for the IT professional to:
- Collect network device configurations on a regular basis.
- Receive configuration-change traps.
- Collect the configuration from the device that just sent a trap.
- Compare that configuration against the most recent regularly collected copy.
With a workflow like this, devices that are modified without proper change control – the cause of the majority of corporate network downtime problems – are forced back to their previous state until the new changes can be understood.
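Here is a rough Python sketch of that trap-driven check, under some loud assumptions: fetch_running_config() and restore_config() are hypothetical placeholders for whatever mechanism (SSH, API or vendor tool) actually talks to your devices, and the trap receiver itself is out of scope. The core of the idea – diffing the freshly collected configuration against the last known-good copy – is just the standard-library difflib.

```python
# Sketch of a trap-driven configuration check and rollback.
# fetch_running_config() and restore_config() are hypothetical placeholders;
# replace them with your own device-access mechanism.
import difflib
from pathlib import Path

BASELINE_DIR = Path("config_baselines")  # filled by the regular collection job


def fetch_running_config(device: str) -> str:
    """Placeholder: pull the current running configuration from the device."""
    raise NotImplementedError("replace with your device-access mechanism")


def restore_config(device: str, config: str) -> None:
    """Placeholder: push a known-good configuration back to the device."""
    raise NotImplementedError("replace with your device-access mechanism")


def on_config_change_trap(device: str) -> None:
    """Called by the trap receiver when a configuration-change trap arrives."""
    baseline = (BASELINE_DIR / f"{device}.cfg").read_text()
    current = fetch_running_config(device)

    diff = list(difflib.unified_diff(
        baseline.splitlines(), current.splitlines(),
        fromfile="baseline", tofile="current", lineterm=""))

    if diff:
        # An unapproved change: record the difference and roll the device back
        # to its previous state until the change can be reviewed.
        print(f"Unapproved change on {device}:\n" + "\n".join(diff))
        restore_config(device, baseline)
```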
Little Things Add Up
Minor error alerts reach the IT team on a daily basis, so ideally the responses need to be automatic. If they aren’t automated, these little actions and remedies – even if they only take a few minutes individually – quickly add up to a lot of time wasted, time that could be saved through effective automation.
Setting up sophisticated automation and monitoring can help solve those pesky errors that are often overlooked:
- Alert: XYZ service is down. Automated response: attempt a device restart.
- Alert: Disk is over X% full. Automated response: clear standard TEMP folders.
- Alert: IP address conflict detected. Automated response: shut down the port of the newer device.
Whenever an automated response is not successful, proper monitoring tools will trigger a secondary action. At worst, an email, text message or ticket will be delayed by just a few minutes. Even then, the technicians responding to that email, text or ticket will know that the initial action was already attempted and failed, so they are already a few steps ahead in the normal troubleshooting process flow. Either way, the issue is dealt with – and far more quickly with the automation tools than it would be ordinarily.
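As a rough illustration of this pattern, the sketch below maps alert types to small remediation functions and escalates when the automated attempt fails. It is full of example assumptions: the alert field names, the use of systemctl to restart the service named in the alert (rather than rebooting the whole device) and a fixed list of ‘standard’ TEMP folders – none of this refers to a specific monitoring product.

```python
# Sketch of an alert-to-action map with escalation on failure.
# Alert field names, the service-restart mechanism and the TEMP paths are
# example assumptions, not features of any particular tool.
import shutil
import subprocess
from pathlib import Path

TEMP_DIRS = [Path("/tmp"), Path("/var/tmp")]  # example "standard TEMP folders"


def attempt_restart(alert: dict) -> bool:
    """Try restarting the failed service (assumes a systemd host)."""
    result = subprocess.run(
        ["systemctl", "restart", alert["service"]], capture_output=True)
    return result.returncode == 0


def clear_temp_folders(alert: dict) -> bool:
    """Clear the standard TEMP folders, then re-check the disk threshold."""
    for temp_dir in TEMP_DIRS:
        for entry in temp_dir.iterdir():
            if entry.is_dir():
                shutil.rmtree(entry, ignore_errors=True)
            else:
                entry.unlink(missing_ok=True)
    usage = shutil.disk_usage(alert["mount_point"])
    return (usage.used / usage.total) * 100 < alert["threshold_percent"]


RESPONSES = {
    "service_down": attempt_restart,
    "disk_full": clear_temp_folders,
}


def handle_alert(alert: dict) -> None:
    action = RESPONSES.get(alert["type"])
    if action and action(alert):
        return  # the automated response succeeded; no human needed
    # Secondary action: escalate (email, text or ticket in a real system),
    # noting that the first-line fix has already been attempted.
    print(f"Escalating {alert['type']}: automated response already attempted.")


# Example: handle_alert({"type": "disk_full", "mount_point": "/", "threshold_percent": 90})
```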
Doing the thinking for you
The possibilities for automation don’t end at simple one-step solutions.
Effective monitoring tools also allow you to automatically start collecting additional necessary information at the time of the alert and then ‘inject’ it into the alert itself. For example:
- Alert: CPU utilisation is over X%. Automated response: identify the top 10 processes, sorted by CPU usage.
- Alert: RAM utilisation is over X%. Automated response: identify the top 10 processes, sorted by RAM usage.
- Alert: VM is using more than X% of host resources. Automated response: identify the VM by name, then gather and list the other VMs on the same host.
- Alert: Disk is over X% full after clearing TEMP folders. Automated response: scan the disk for the top 10 files, sorted by size, that have been added or updated in the past 24 hours.
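A rough sketch of that enrichment step, again assuming the psutil library: at alert time, gather the top processes sorted by CPU or RAM usage and append them to the alert text before it is sent. The alert structure and formatting are illustrative assumptions.

```python
# Sketch of alert enrichment: gather extra context at alert time and inject
# it into the alert text. Assumes psutil; the alert format is an example only.
import psutil


def top_processes(sort_key: str, count: int = 10):
    """Return the top processes sorted by 'cpu_percent' or 'memory_percent'."""
    procs = [p.info for p in psutil.process_iter(
        ["pid", "name", "cpu_percent", "memory_percent"])]
    procs.sort(key=lambda p: p[sort_key] or 0.0, reverse=True)
    return [
        f"{p['pid']:>7}  {(p['name'] or '?'):<25}  {p[sort_key] or 0.0:.1f}"
        for p in procs[:count]
    ]


def enrich_alert(alert_text: str, metric: str) -> str:
    """Append the top-10 process list to a CPU or RAM alert before sending it."""
    key = "cpu_percent" if metric == "cpu" else "memory_percent"
    return alert_text + "\nTop processes:\n" + "\n".join(top_processes(key))


# Example: print(enrich_alert("CPU utilisation is over 90%", metric="cpu"))
```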
Not only does the automation of configuration, provisioning, operation, orchestration and management save significant time and resources, it also maximises network and team efficiency. That time saving means the IT team is able to focus on more critical problems that require a human approach.
Eliminate human error
If the time savings alone haven’t already motivated you to implement an automated management model, consider past ‘fat finger’ accidents – we’ve all been there. The more converged an infrastructure is, the more likely it is that a single misconfigured interface could take down hundreds of mission-critical applications.
Finding a way to help ensure repeatability and auditability, templatise operations, implement version control and quickly identify misconfigurations is key. Automating these processes (with programming code or third-party tools for configuration and code management, such as GitHub, Puppet, Chef and others) takes a lot of human error out of the equation.
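As a minimal sketch of the version-control piece, assuming nothing more than plain git on the command line: each freshly collected device configuration is written into a local repository and committed, so every change is time-stamped, diffable and attributable. The repository layout and commit-message format are arbitrary choices for this example.

```python
# Sketch of simple configuration version control with a local git repository.
# Assumes git is installed and CONFIG_REPO is already an initialised repo.
import subprocess
from datetime import datetime, timezone
from pathlib import Path

CONFIG_REPO = Path("network-configs")  # example repository location


def commit_device_config(device: str, config_text: str) -> None:
    """Write the latest collected configuration and commit it if it changed."""
    config_file = CONFIG_REPO / f"{device}.cfg"
    config_file.write_text(config_text)

    subprocess.run(["git", "-C", str(CONFIG_REPO), "add", config_file.name],
                   check=True)
    timestamp = datetime.now(timezone.utc).isoformat()
    # 'git commit' exits non-zero when nothing changed since the last collection.
    result = subprocess.run(
        ["git", "-C", str(CONFIG_REPO), "commit", "-m",
         f"{device} configuration collected {timestamp}"],
        capture_output=True)
    if result.returncode == 0:
        # A change was recorded: show what differs from the previous version.
        diff = subprocess.run(
            ["git", "-C", str(CONFIG_REPO), "show", "HEAD"],
            capture_output=True, text=True)
        print(diff.stdout)
```

Tools such as Puppet or Chef take this much further, but even a script like this makes it quick to see exactly what changed, and when, after a misconfiguration.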
The bottom line
Saving time and reducing errors – what’s not to like? But does this type of monitoring automation also impact the bottom line? The answer is a resounding ‘yes’.
Case in point: a company recently implemented nothing more sophisticated than two of the automated responses outlined above – clearing the TEMP folders (and alerting again after another 15 minutes if the disks were still full) and adding the top 10 processes to the high-CPU alert. The result was anywhere from 30% to 70% fewer alerts compared with the same month the previous year – in real numbers, anywhere from 43 to 175 fewer alerts per month. In addition, the support staff saw the results and responded faster to the remaining alerts, because they knew the automated initial actions had already been taken.
The CPU-related alerts obviously didn’t disappear altogether, but once again the support staff response improved, since the tickets included information about what specifically was going wrong. In one case, the company was able to go back to a vendor and request a patch, because it could finally prove a long-standing issue with the software.
As virtualisation and falling costs push the growth of IT environments – coupled, thankfully, with expanding budgets as businesses recognise the integral role that IT plays in marketplace success – the need to leverage monitoring to ensure the stability of computing environments becomes ever more obvious. Less obvious, but just as critical and valuable, is keeping the human cost of that monitoring low by implementing a monitoring solution that facilitates automation, and by actually leveraging those automation capabilities.
It’s simple
This is automation – simple, elegant and hassle-free, and something that IT professionals need to start prioritising in their monitoring protocols. If it’s so great, why don’t more IT teams use it? The biggest barrier to implementation is not the wrong skills or the wrong tools, but often the wrong mindset. Monitoring and automation have a reputation among many IT professionals for being difficult and complicated, but the reality is that these technologies are only as problematic as you make them. The lesson here is to implement a good monitoring tool and then reap the rewards of automation, rather than continue to perform a strange interpretative dance that no one understands. Setting up these monitoring and automation technologies may not feel like an easy feat, but in the long run it will make the lives of IT professionals much better. They can finally focus their efforts on the important stuff, while the background noise is taken care of at last.