In the first part of this series, I described the four (ok, really five) questions that monitoring professionals are frequently asked: Why did I get an alert? Why didn’t I get an alert? What is being monitored on my system? What will alert on my system? and What standard monitoring do you do? My goal in this post is to give you the tools you need to answer the second question:
What is being monitored on my system
You can find information on the first question (Why did I get this alert) here.
And information on the second question (Why DIDN’T I get an alert) here.
Reader’s Note: While this article uses examples specific to the SolarWinds monitoring platform, my goal is to provide information and techniques which can be translated to any toolset.
Not so fast, my friend!
It’s 1:35pm. Your first two callers—the one who wanted to know why they got a particular alert and the one who wanted to know why they didn’t get an alert—are finally a distant memory, and you’ve managed to squeeze out some productive work setting up a monitor that will detect when your cell phone backup circuit will…
That’s when your manager ambles over and looks at you expectantly over the cube wall.
“I just met with the manager of the APP-X support team,” he tells you. “They want a matrix of what is monitored on their system.”
To his credit he adds, “I checked on the system in the Reports section, but nothing jumped out at me. Did I overlook something?”
This question is solved with a combination of foresight (knowing you are going to get asked this question), preparation, and know-how.
It is also one of the questions where the steps are extremely tool-specific. An agent-based solution will have a lot of this information embedded in the agent, whereas you will probably find what you need on an agentless solution like SolarWinds in a central database or configuration system.
Understand that this is a question that you can answer, and with some preparation you can have the answer with the push of a button (or two). But like so many solutions in this series, preparation now will save you from desperation later.
It’s also important to recognize that this type of report is absolutely essential, both to you and to the owner of the systems.
<PHILOSOPY>
I believe strongly that monitoring is more than just watching the elements which will create alerts (be they tickets, emails, pager messages, or an ahoogah noise over the loudspeakers. Your monitoring scope should cover elements which are used for capacity planning, performance analysis, and forensics after a critical event. For example, you may never alert on the size of the swap drive, but you will absolutely want to know what its size was during the time of a crash. For that reason, knowing what is monitored is essential, even if you won’t alert on some of those elements.
</PHILOSOPHY>
The knee bone’s connected to the…
In order to answer this question, the first thing you should do is break down the areas of monitoring data. That can include:
- Hardware information for the primary device – CPU and RAM are primary items but there are other aspects. The key is that you are ONLY interested in hardware that affects the whole “chassis” or box, not sub-elements like cards, disks, ports, etc..
- Hardware sub-elements – You may find that you have one, more than one, or none. Network cards, disks, external mount points, and VLAN’s are just a few examples.
- Specialized hardware elements – Fans, power supplies, temperature sensors, and the like.
- Application components – PerfMon counters, services, processes, logfile monitors and more—all the things that make up a holistic view of an application.
And now… I’m going to stop. While there are certainly many more items on that list, if you can master the concept of those first few bullets, adding more should come fairly easily.
It should be noted that this type of report is not a fire-and-forget affair. It’s more of a labor of love that you will come back to and refine over time.
I also need to point out that this will likely not be a one-size-fits-all-device-types solution. The report you create for network devices like routers and access points may need to be radically different from the one you build for server-type devices. Virtual hosts may need data points that have no relevance to anything else. And specialty devices like load balancers, UPS-es, or environmental systems are in a class of their own.
Finally, in order to get what you want, you also have to understand how the data is stored, and be extremely comfortable interacting with that system. Because the tool I have handy is SolarWinds, that’s the data structure we’re going to use here.
As I mentioned earlier, this type of report is push-button simple on some toolsets. If that’s the case for you, then feel free to stop reading and walk away with the knowledge that you will be asked this question on a regular basis, and you should be prepared.
For those using a toolset where answering this question requires effort, read on!
select nodes.statusled, nodes.nodeid, nodes.caption, s1.LED, s1.elementtype, s1.element, s1.Description from nodes left join ( select '01' as elementorder, 'HW' as elementtype, APM_HardwareAlertData.nodeid, APM_HardwareAlertData.HardwareStatus as LED, APM_HardwareAlertData.Model as element, APM_HardwareAlertData.SensorsWithStatus as description from APM_HardwareAlertData union all select '02' as elementorder, 'NIC' as elementtype, interfaces.nodeid as NodeID, interfaces.statusled as LED, interfaces.InterfaceName as element, interfaces.InterfaceTypeDescription as Description from interfaces union all select '03' as elementorder, 'DISK' as elementtype, volumes.nodeid as NodeID, volumes.statusled as LED, volumes.caption as element, volumes.VolumeType as description from volumes ) s1 on nodes.nodeid = s1.NodeID Order by nodes.nodeid , s1.elementorder asc, s1.element Catch all that? OK, let’s break this down a little. The basic format of
this structure is:
Get device information (name, IP, etc) Get sub-element information (name, status, description)
The key to this process is standardizing the sub-element information so that it’s consistent across each type of element.
One thing to note is that the SQL “union all” command will let you combine results from two separate queries – such as a query to the interfaces table and another to the volumes table. BUT it requires the same number of columns as a result. In my example I’ve kept it simple – just the name and description, really.
The other trick I learned was to add icons rather than text wherever possible. That includes the “statusLED” and “status” columns, which display dots instead of text when rendered by the SolarWinds report engine. I find this gives a much quicker sense of what’s going on (oh, they’re monitoring this disk, but it’s currently offline).
Another addition worth noting is:
select 'xx' as elementorder, 'yyy' as elementtype,
I use element order to sort each of the query blocks, and elementtype to give the report reader a clue as to the source of this information (disk, application, hardware, etc.)
But what if I included data points that existed for some elements, but not for others? Well, you still have to ensure that each query connected by “union all” has the same number of columns. So let’s say that we wanted to include circuit capacity for interfaces, disk capacity for disks, but nothing for hardware, it would look like this:
select nodes.statusled, nodes.nodeid, nodes.caption, s1.LED, s1.elementtype, s1.element, s1.Description, s1.capacity from nodes left join ( select '01' as elementorder, 'HW' as elementtype, APM_HardwareAlertData.nodeid, APM_HardwareAlertData.HardwareStatus as LED, APM_HardwareAlertData.Model as element, APM_HardwareAlertData.SensorsWithStatus as description, '' as capacity from APM_HardwareAlertData union all select '02' as elementorder, 'NIC' as elementtype, interfaces.nodeid as NodeID, interfaces.statusled as LED, interfaces.InterfaceName as element, interfaces.InterfaceTypeDescription as Description, interfaces.InterfaceSpeed as capacity from interfaces union all select '03' as elementorder, 'DISK' as elementtype, volumes.nodeid as NodeID, volumes.statusled as LED, volumes.caption as element, volumes.VolumeType as description, volumes.VolumeSize as capacity from volumes ) s1 on nodes.nodeid = s1.NodeID Order by nodes.nodeid , s1.elementorder asc, s1.element
By adding to the ” as capacity hardware block (and any other section where it’s needed, we avert errors with the union all command.
Conspicuous by their absence in all of this are the things I listed first on the “must have” list: CPU, RAM, etc.
In this case, I used a couple of simple tricks: For CPU, I’m going to give the count of cpus since any other data (current processor utilization, etc.) is probably not helpful. For RAM, the solution is even simpler: I just queried the nodes table again and pulled out JUST the Total memory.
select nodes.statusled, nodes.nodeid, nodes.caption, s1.LED, s1.elementtype, s1.element, s1.Description, s1.capacity from nodes left join ( select '01' as elementorder, 'CPU' as elementtype, c1.NodeID, 'Up.gif' as LED, 'CPU Count:' as element, CONVERT(varchar,COUNT(c1.CPUIndex)) as description, '' as capacity from (select DISTINCT CPUMultiLoad.NodeID, CPUMultiLoad.CPUIndex from CPUMultiLoad) c1 group by c1.NodeID union all select '02' as elementorder, 'RAM' as elementtype, nodes.NodeID, 'Up.gif' as LED, 'Total RAM' as element, CONVERT(varchar,nodes.TotalMemory), '' as capacity from nodes union all select '03' as elementorder, 'HW' as elementtype, APM_HardwareAlertData.nodeid, APM_HardwareAlertData.HardwareStatus as LED, APM_HardwareAlertData.Model as element, APM_HardwareAlertData.SensorsWithStatus as description, '' as capacity from APM_HardwareAlertData union all select '04' as elementorder, 'NIC' as elementtype, interfaces.nodeid as NodeID, interfaces.statusled as LED, interfaces.InterfaceName as element, interfaces.InterfaceTypeDescription as Description, interfaces.InterfaceSpeed as capacity from interfaces union all select '05' as elementorder, 'DISK' as elementtype, volumes.nodeid as NodeID, volumes.statusled as LED, volumes.caption as element, volumes.VolumeType as description, volumes.VolumeSize as capacity from volumes union all select '06' as elementorder, 'APP' as elementtype, APM_AlertsAndReportsData.nodeid as NodeID, APM_AlertsAndReportsData.ComponentStatus as LED, APM_AlertsAndReportsData.ComponentName as element, APM_AlertsAndReportsData.ApplicationName as description, '' as capacity from APM_AlertsAndReportsData union all select '07' as elementorder, 'POLLER' as elementtype, CustomPollerAssignmentView.NodeID, 'up.gif' as LED, CustomPollerAssignmentView.CustomPollerName as element, CustomPollerAssignmentView.CustomPollerDescription as description, '' as capacity from CustomPollerAssignmentView ) s1 on nodes.nodeid = s1.NodeID Order by nodes.nodeid , s1.elementorder asc, s1.element
In this iteration I found that the data output from some sources was integer, some was float, and somme was even text! So I started using the “CONVERT()” option to keep everything in the same format.
The result looks something like this:
I could stop here and you would have, more or less, the building blocks you need to build your own “What is monitored on my system?” report. But there is one more piece that takes this type of report to the next level.
Including the built-in thresholds for these elements increases complexity to the query, but also adds an entirely new (and important) dimension to the information you are providing.
More than ever, the success of this type of report lies in your knowing where threshold data is kept. In the case of SolarWinds, a series of “Thresholds” views (InterfacesThresholds, NodesPercentMemoryUsedThreshold, NodesCpuLoadThreshold, and so on) makes the job easier but you still have to know where thresholds are kept for applications, and that there are NO built-in thresholds for disks or custom pollers.
With that said, the final report query would look like this:
select nodes.statusled, nodes.nodeid, nodes.caption, s1.LED, s1.elementtype, s1.element, s1.Description, s1.capacity, s1.threshold_value, s1.warn, s1.crit from nodes left join ( select '01' as elementorder, 'CPU' as elementtype, c1.NodeID, 'Up.gif' as LED, 'CPU Count:' as element, CONVERT(varchar,COUNT(c1.CPUIndex)) as description, '' as capacity, 'CPU Utilization' as threshold_value, convert(varchar, t1.Level1Value) as warn, convert(varchar, t1.Level2Value) as crit from (select DISTINCT CPUMultiLoad.NodeID, CPUMultiLoad.CPUIndex from CPUMultiLoad) c1 join NodesCpuLoadThreshold t1 on c1.nodeID = t1.InstanceId group by c1.NodeID, t1.Level1Value, t1.Level2Value union all select '02' as elementorder, 'RAM' as elementtype, nodes.NodeID, 'Up.gif' as LED, 'Total RAM' as element, CONVERT(varchar,nodes.TotalMemory), '' as capacity, 'RAM Utilization' as threshold_value, convert(varchar, NodesPercentMemoryUsedThreshold.Level1Value) as warn, convert(varchar, NodesPercentMemoryUsedThreshold.Level2Value) as crit from nodes join NodesPercentMemoryUsedThreshold on nodes.nodeid = NodesPercentMemoryUsedThreshold.InstanceId union all select '95' as elementorder, 'HW' as elementtype, APM_HardwareAlertData.nodeid, APM_HardwareAlertData.HardwareStatus as LED, APM_HardwareAlertData.Model as element, APM_HardwareAlertData.SensorsWithStatus as description, '' as capacity, '' as threshold_value, '' as warn, '' as crit from APM_HardwareAlertData union all select '03' as elementorder, 'NIC' as elementtype, interfaces.nodeid as NodeID, interfaces.statusled as LED, interfaces.InterfaceName as element, interfaces.InterfaceTypeDescription as Description, interfaces.InterfaceSpeed as capacity, 'bandwidth in/out' as threshold_value, convert(varchar, i1.Level1Value)+'/'+convert(varchar,i2.level1value) as warn, convert(varchar, i1.Level2Value)+'/'+convert(varchar,i2.level2value) as crit from interfaces join (select InterfacesThresholds.instanceid, InterfacesThresholds.level1value , InterfacesThresholds.level2value from InterfacesThresholds where InterfacesThresholds.name = 'NPM.Interfaces.Stats.InPercentUtilization') i1 on interfaces.interfaceid = i1.InstanceId join (select InterfacesThresholds.instanceid, InterfacesThresholds.Level1Value, InterfacesThresholds.level2value from InterfacesThresholds where InterfacesThresholds.name = 'NPM.Interfaces.Stats.OutPercentUtilization') i2 on interfaces.interfaceid = i2.InstanceId union all select '04' as elementorder, 'DISK' as elementtype, volumes.nodeid as NodeID, volumes.statusled as LED, volumes.caption as element, volumes.VolumeType as description, volumes.VolumeSize as capacity, '' as threshold_value, '' as warn, '' as crit from volumes union all select '05' as elementorder, 'APP' as elementtype, APM_AlertsAndReportsData.nodeid as NodeID, APM_AlertsAndReportsData.ComponentStatus as LED, APM_AlertsAndReportsData.ComponentName as element, APM_AlertsAndReportsData.ApplicationName as description, '' as capacity, 'CPU Utilization' as threshold_value, convert(varchar, APM_AlertsAndReportsData.[Threshold-CPU-Warning]) as warn, convert(varchar, APM_AlertsAndReportsData.[Threshold-CPU-Critical]) as crit from APM_AlertsAndReportsData union all select '06' as elementorder, 'POLLER' as elementtype, CustomPollerAssignmentView.NodeID, 'up.gif' as LED, CustomPollerAssignmentView.CustomPollerName as element, CustomPollerAssignmentView.CustomPollerDescription as description, '' as capacity, '' as threshold_value, '' as warn, '' as crit from CustomPollerAssignmentView ) s1 on nodes.nodeid = s1.NodeID Order by nodes.nodeid , s1.elementorder asc, s1.element
And here are the results:
It’s 90% Perspiration…
While answering this question requires persistence, skill, and in-depth knowledge of your monitoring toolset, the benefits are significantly greater than for the previous two questions.
Done right, teams can use this report to validate that the correct elements on each device are monitored – nothing is left out, nothing which has been decommissioned is still there. And when an alert does trigger, it will be easier to understand where you can look for hints, instead of just clicking around screens looking for something interesting.
Stock up on your tea leaves, goat entrails, and crystal balls because in my next post we’re going to take a peek into the future by answering the question “What *WILL* alert on my system?”