(“In Case You Missed It Monday” is my chance to showcase something that I wrote and published in another venue, but that is still relevant. This week’s post originally appeared on OrangeMatter.)
“Monitoring” is a deceptively broad category, so it’s not surprising that IT professionals often describe it differently. Some believe monitoring is the thing that creates tickets in the helpdesk system. Others say it’s merely the archival data that can be used for forensics after a system has crashed. Still other IT pros define monitoring as the source of reports, which are run automatically or manually.
The reality is that monitoring is nothing more than the ongoing, steady collection of metrics and data from a set of target devices. Everything else – tickets, alerts, reports, and forensic information – is merely the happy by-product that comes once monitoring starts.
But that still doesn’t bring us any closer to a meaningful definition of monitoring.
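The best method I’ve found in my career as a monitoring engineer, monitoring architect, and Head Geek is the FCAPS model. In the ISO network management model, FCAPS stands for fault, configuration, accounting, performance, and security. The variation I use here reads it as fault, capacity, authority, performance, and security.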
Right off the bat I’m going to ignore authority and security. Those are topics for another day.
Fault is easy to understand. When a system or a service goes down, when a hardware element like CPU or temperature goes past a threshold, or when an event message is received that indicates a problem state, that’s a fault.
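To make the three kinds of fault concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `Sample` record, the threshold values, and the event names are hypothetical and are not taken from any particular monitoring product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sample:
    """One polled data point from a target device (hypothetical shape)."""
    device: str
    metric: str                  # e.g. "cpu_percent" or "temperature_c"
    value: float
    is_up: bool = True           # did the device or service respond?
    event: Optional[str] = None  # event message received with the poll, if any


# Assumed thresholds and problem events; real values depend on your environment.
THRESHOLDS = {"cpu_percent": 90.0, "temperature_c": 75.0}
PROBLEM_EVENTS = {"link_down", "disk_failure"}


def find_fault(sample: Sample) -> Optional[str]:
    """Return the kind of fault this sample represents, or None if it is not a fault."""
    if not sample.is_up:
        return "down"            # a system or service went down
    limit = THRESHOLDS.get(sample.metric)
    if limit is not None and sample.value > limit:
        return "threshold"       # a hardware element went past its threshold
    if sample.event in PROBLEM_EVENTS:
        return "event"           # an event message indicates a problem state
    return None                  # ordinary data, not a fault
```

A sample that returns `None` is still collected and stored. It simply isn't a fault.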
But what about all the data that ISN’T a fault? What does that data represent?
The answer is capacity and performance.
Capacity tells you how much of a particular resource (GHz of CPU, GB of RAM, Mbps of bandwidth, TB of disk, etc.) a system has, as well as how much is currently available. Meanwhile, performance indicates the usage of those resources over time, as well as a prediction of how resource usage will appear in the future.
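As a rough illustration of the difference, the Python sketch below treats capacity as "total minus used right now" and performance as a trend fitted to usage over time. It assumes a simple straight-line (least-squares) trend, which is only one possible forecasting approach, and the disk figures and function names are invented for the example.

```python
def linear_trend(samples):
    """Least-squares slope and intercept for a list of (day, used) pairs."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("samples must span more than one point in time")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_full(samples, total):
    """Days after the last sample until the trend line reaches total, or None."""
    slope, intercept = linear_trend(samples)
    if slope <= 0:
        return None  # usage is flat or shrinking, so no exhaustion is forecast
    day_full = (total - intercept) / slope
    return max(0.0, day_full - samples[-1][0])


# Hypothetical disk usage in GB, one sample per day.
usage = [(0, 400.0), (1, 410.0), (2, 420.0), (3, 430.0)]
total_gb = 500.0

available_gb = total_gb - usage[-1][1]              # capacity: what is left right now
growth_per_day, _ = linear_trend(usage)             # performance: usage over time
forecast_days = days_until_full(usage, total_gb)    # the prediction

print(available_gb, growth_per_day, forecast_days)  # 70.0 10.0 7.0
```

Real usage is rarely this tidy, so a straight line is best read as a first approximation rather than a promise.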
Monitoring tools traditionally do a great job handling the fault part of FCAPS. But capacity and performance are often given short shrift, or require a separate tool set that itself does a poor job of fault monitoring.
This is, to put things politely, sub-optimal, because it’s all the same data. We at SolarWinds have built capacity and performance information into our core tools for precisely that reason. But since monitoring professionals tend to focus on fault, we thought we’d take a moment to describe ways to use the tool set you already know (and hopefully love) to create performance trends and capacity forecasts. That way, resource utilization becomes predictable and, therefore, proactively manageable, rather than something that blindsides your team and creates a fire drill.
Even without a monitoring tool, we all understand that fewer fire drills make for a better day.