Category Archives: Monitoring

If you Knew Me, You Wouldn’t Believe Me

And all of a sudden, people are referring to you as an expert in the field. When this first happens, you may even feel like an imposter, a fraud. But don’t worry.

As long as you stay focused on helping make things better, on helping others, on elevating good work (whether it’s yours or someone else’s) that you find “out there”… as long as you do that, you won’t be a fraud.

Because that’s genuine work, even if it doesn’t feel like work to you. And as hard as it is to believe, it’s not easy enough that anyone can do it, because very few other people are doing it.

Otherwise, who would you be helping in the first place?

The Four Questions, Part 4

In the first part of this series, I described the four (ok, really five) questions that monitoring professionals are frequently asked. You can read that introduction here. Information on the first question (Why did I get this alert) is here. You can get the low-down on the second question (Why DIDN’T I get an alert) here. And the third question (What is monitored on my system) is here.

My goal in this post is to give you the tools you need to answer the fourth question: Which of the existing alerts will potentially trigger for my system?

Reader’s Note: While this article uses examples specific to the SolarWinds monitoring platform, my goal is to provide information and techniques which can be translated to any toolset.

Riddle Me This, Batman…

It’s 3:00pm. You can’t quite see the end of the day over the horizon, but you know it’s there. You throw a handful of trail mix into your face to try to avoid the onset of mid-afternoon nap-attack syndrome and hope to slide through the next two hours unmolested.

Which, of course, is why you are pulled into a team meeting. Not your team meeting, mind you. It’s the Linux server team. On the one hand, you’re flattered. They typically don’t invite anyone who can’t speak fluent Perl or quote every XKCD comic in chronological order (using epoch time as their reference, of course). On the other…well, team meeting.

The manager wrote:

            kill `ps -ef | grep -i talking | grep -v grep | awk '{print $2}'`

on the board, eliciting a chorus of laughter (from everyone but me). Of course, this gave the manager the perfect opportunity to focus the conversation on yours truly.

“We have this non-trivial issue, and we’re hoping you can grep out the solution for us,” he begins. “We’re responsible for roughly 4,000 systems…”

Unable to contain herself, a staff member jumped in: “4,732 systems. Of which 200 are physical and the remainder are virtualized…”

Unimpressed, her manager said, “Ms. Deal, unless I’m off by an order of magnitude, there’s no need to correct me.”

She replied, “Sorry boss.”

“As I was saying,” he continued. “We have a…significant number of systems. Now how many alerts currently exist in the monitoring system which could generate a ticket?”

“436, with 6 currently in active development,” I respond, eager to show that I’m just as on top of my systems as they are of theirs.

“So how many of those affect our systems?” the manager asked.

Now I’m in my element. I answer, “Well, if you aren’t getting tickets, then none. I mean, if nothing has a spiked CPU or RAM or whatever, then it’s safe to say all of your systems are stable. You can look at each node’s detail page for specifics, although with 4,000—I can see where you would want a summary. We can put something together to show the current statistics, or the average over time, or…”

“You misunderstand,” he cuts me off. “I’m fully cognizant of the fact that our systems are stable. That’s not my question. My question is…should one of my systems become unstable, how many of your 436-soon-to-be-442 alerts WOULD trigger for my systems?”

He continued, “As I understand it, your alert logic does two things: it identifies the devices which could trigger the alert—All Windows systems in the 10.199.1 subnet, for example—and at the same time specifies the conditions under which an alert is triggered—say, when the CPU goes over 80% for more than 15 minutes.”

“So what I mean,” he concluded, “is this: can you create a report that shows me the devices which are included in the scope of an alert’s logic, irrespective of the trigger condition?”

Your Mission, Should You Choose to Accept it…

As with the other questions we’ve discussed in this series, the specifics of HOW to answer this question are less critical than knowing that you will be asked it.

In this case, it’s also important to understand that this question is actually two questions masquerading as one:

  1. For each alert, tell me which machines could potentially trigger it
  2. For each machine, tell me which alerts may potentially be triggered
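A hedged sketch of why these are really one question: if you can export a flat list of (alert, node) pairs from your tool, both forms are just the same mapping read in opposite directions. All alert and node names below are invented for illustration.

```python
from collections import defaultdict

# Invented sample data: (alert name, node name) pairs, as you might
# export them from any monitoring tool.
assignments = [
    ("High CPU", "CorpSRV042"),
    ("High CPU", "CorpSRV043"),
    ("Disk > 90%", "CorpSRV042"),
]

# Form 1: for each alert, which machines could trigger it?
nodes_per_alert = defaultdict(list)
# Form 2: for each machine, which alerts could fire?
alerts_per_node = defaultdict(list)

for alert, node in assignments:
    nodes_per_alert[alert].append(node)
    alerts_per_node[node].append(alert)

print(nodes_per_alert["High CPU"])    # every node in scope of the CPU alert
print(alerts_per_node["CorpSRV042"])  # every alert that covers CorpSRV042
```

One export, two reports: the work is in getting the pairs out of the tool, not in pivoting them.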

Why is this such an important question—perhaps the most important one in this series? Because it determines the scale of the potential notifications monitoring may generate. It’s one thing if 5 alerts apply to 30 machines. It’s entirely another when 30 alerts apply to 4,000 machines.

The answer to this question has implications for staffing, shift allocation, pager rotation, and even the number of alerts a particular team may approve for production.

The way you go about building this information is going to depend heavily on the monitoring solution you are using.

In general, agent-based solutions are better at this because trigger logic – in the form of an alert definition – is usually pushed down to the agent on each device, and thus can be queried in both directions (“Hey, node, which alerts are on you?” and “Hey, alert, which nodes have you been pushed to?”).

That’s not to say that agentless monitoring solutions are intrinsically unable to get the job done. The more full-featured monitoring tools have options built-in.

Reports that look like this:

Not to mention little “reminders” like this on the alert screen:

Or even resources on the device details page that look like this:

Houston, We Have a Problem

What if it doesn’t, though? What if you have pored through the documentation, opened a ticket with the vendor, visited the online forums, asked the greatest gurus up on the mountain, and come back with a big fat goose egg? What then?

Your choices at this point still depend largely on the specific software, but generally speaking there are 3 options:

  • Reverse-engineer the alert trigger and remove the actual trigger part

Many monitoring solutions use a database back-end for the bulk of their metrics, and alerts are simply a query against this data. The alert trigger queries may exist in the database itself, or in a configuration file. Once you have found them, you will need to create a copy of each alert and then go through each, removing the parts which comprise the actual trigger (e.g., CPU_Utilization > 80%) and leaving the parts that simply indicate scope (where Vendor = “Microsoft”; where “OperatingSystem” = “Windows 2003”; where “IP_address” contains “10.199.1”; etc.). This will likely necessitate your learning the back-end query language for your tool. Difficult? Probably. Will it increase your street cred with the other users of the tool? Undoubtedly. Will it save your butt within the first month after you create it? Guaranteed.

And once you’ve done it, running a report for each alert becomes extremely simple.
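As a sketch of that idea (using SQLite for brevity; the table, column, and node names are invented and will differ per tool): the stored alert combines a scope clause and a trigger clause, and the report copy keeps only the scope.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE nodes (caption TEXT, vendor TEXT, cpu_load INTEGER)")
con.executemany("INSERT INTO nodes VALUES (?,?,?)", [
    ("CorpSRV042", "Microsoft", 95),
    ("CorpSRV043", "Microsoft", 10),
    ("CoreRTR01",  "Cisco",     85),
])

# The alert as stored: scope (vendor) AND trigger (cpu_load > 80).
alerting = con.execute(
    "SELECT caption FROM nodes WHERE vendor = 'Microsoft' AND cpu_load > 80"
).fetchall()

# The report copy: same query with the trigger clause removed,
# leaving only the scope -- every node the alert COULD fire on.
in_scope = con.execute(
    "SELECT caption FROM nodes WHERE vendor = 'Microsoft'"
).fetchall()

print(alerting)  # only the node currently over threshold
print(in_scope)  # every Microsoft node, regardless of current state
```

The scope-only query is the one the Linux team’s manager was asking for.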

  • Create duplicate alerts with no trigger

If you can’t export the alert triggers, another option is to create a duplicate of each alert that has the “scope” portion but not the trigger elements (so the “Windows machines in the 10.199.1.x subnet” part but not the “CPU_Utilization > 80%” part). The only recipient of that alert will be you, and the alert action should be something like writing to a logfile with a very simple string (“Alert x has triggered for Device y”). If you are VERY clever, you can export the key information in CSV format so you can import it into a spreadsheet or database for easy consumption. Every so often—every month or quarter—fire off those alerts and then tally up the results into a form that recipient groups can slice and dice.

  • Do it by hand

If all else fails (and the inability to answer this very essential question doesn’t cause you to re-evaluate your choice of monitoring tool), you can start documenting by hand. If you know up-front that you are in this situation, then it’s simply part of the ongoing documentation process. But most times it’s going to be a slog through the list of existing alerts, writing down the trigger information. Hopefully you can take that trigger info and turn it into an automated query against your existing devices. If not, then I would seriously recommend looking at another tool. Because in any decent-sized environment, this is NOT the kind of thing you want to spend your life documenting, and it’s also not something you want to live without.

What Time Is It? Beer o’Clock

After that last meeting—not to mention the whole day—you are ready to pack it in. You successfully navigated the four impossible questions that every monitoring expert is asked (on more or less a daily basis)—Why did I get that alert, Why didn’t I get that alert, What is being monitored on my systems, and What alerts might trigger on my systems? Honestly, if you can do that, there’s not much more that life can throw at you.

Of course, the CIO walks up to you on your way to the elevator. “I’m glad I caught up to you,” he says, “I just have a quick question…”

Stay tuned for the bonus question!

#FeatureFriday: What is SNMP?

Welcome to “Feature Friday”, a series of (typically short) videos which explain a feature, function, or technique.

Today’s episode finds Chris O’Brien and me talking about one of the other old workhorses of monitoring: SNMP. We tease apart the structure of SNMP objects, how systems interact with them, and some tricks for usage in your monitoring environment.


For more insights into monitoring, as well as random silliness, you can follow me on Twitter (@LeonAdato) or find me on the SolarWinds THWACK.com forums (@adatole)

Don’t Fear Starting Over

Sometimes revisions are inevitable. But have you ever considered starting over when you didn’t have to?

What if you reinstalled the system from ground-up because you understand how all the applications fit together? What if you started that code project from a blank screen, now that the requirements are completely fleshed out? What if you finished the essay, put it in a drawer, and started writing it all over again from the top?

What would you lose? Time, to be sure.

But what might you gain?

Some of our greatest discoveries were found after losing the initial effort, the first draft, the prototype.

So why don’t we do it – start over – more often, on purpose?

Do we really think the time we save is more valuable than the wonders we could uncover?

The Four Questions, Part 3

In the first part of this series, I described the four (ok, really five) questions that monitoring professionals are frequently asked: Why did I get an alert? Why didn’t I get an alert? What is being monitored on my system? What will alert on my system? and What standard monitoring do you do? My goal in this post is to give you the tools you need to answer the third question:

What is being monitored on my system?

You can find information on the first question (Why did I get this alert) here.
And information on the second question (Why DIDN’T I get an alert) here.

Reader’s Note: While this article uses examples specific to the SolarWinds monitoring platform, my goal is to provide information and techniques which can be translated to any toolset.

Not so fast, my friend!

It’s 1:35pm. Your first two callers—the one who wanted to know why they got a particular alert and the one who wanted to know why they didn’t get an alert—are finally a distant memory, and you’ve managed to squeeze out some productive work setting up a monitor that will detect when your cellular backup circuit will…

That’s when your manager ambles over and looks at you expectantly over the cube wall.

“I just met with the manager of the APP-X support team,” he tells you. “They want a matrix of what is monitored on their system.”

To his credit he adds, “I checked on the system in the Reports section, but nothing jumped out at me. Did I overlook something?”

This question is solved with a combination of foresight (knowing you are going to get asked this question), preparation, and know-how.

It is also one of the questions where the steps are extremely tool-specific. An agent-based solution will have a lot of this information embedded in the agent, whereas you will probably find what you need on an agentless solution like SolarWinds in a central database or configuration system.

Understand that this is a question that you can answer, and with some preparation you can have the answer with the push of a button (or two). But like so many solutions in this series, preparation now will save you from desperation later.

It’s also important to recognize that this type of report is absolutely essential, both to you and to the owner of the systems.

<PHILOSOPHY>

I believe strongly that monitoring is more than just watching the elements which will create alerts (be they tickets, emails, pager messages, or an ahoogah noise over the loudspeakers). Your monitoring scope should cover elements which are used for capacity planning, performance analysis, and forensics after a critical event. For example, you may never alert on the size of the swap drive, but you will absolutely want to know what its size was during the time of a crash. For that reason, knowing what is monitored is essential, even if you won’t alert on some of those elements.

</PHILOSOPHY>

The knee bone’s connected to the…

In order to answer this question, the first thing you should do is break down the areas of monitoring data. That can include:

  1. Hardware information for the primary device – CPU and RAM are primary items but there are other aspects. The key is that you are ONLY interested in hardware that affects the whole “chassis” or box, not sub-elements like cards, disks, ports, etc.
  2. Hardware sub-elements – You may find that you have one, more than one, or none. Network cards, disks, external mount points, and VLANs are just a few examples.
  3. Specialized hardware elements – Fans, power supplies, temperature sensors, and the like.
  4. Application components – PerfMon counters, services, processes, logfile monitors and more—all the things that make up a holistic view of an application.

And now… I’m going to stop. While there are certainly many more items on that list, if you can master the concept of those first few bullets, adding more should come fairly easily.

It should be noted that this type of report is not a fire-and-forget affair. It’s more of a labor of love that you will come back to and refine over time.

I also need to point out that this will likely not be a one-size-fits-all-device-types solution. The report you create for network devices like routers and access points may need to be radically different from the one you build for server-type devices. Virtual hosts may need data points that have no relevance to anything else. And specialty devices like load balancers, UPSes, or environmental systems are in a class of their own.

Finally, in order to get what you want, you also have to understand how the data is stored, and be extremely comfortable interacting with that system. Because the tool I have handy is SolarWinds, that’s the data structure we’re going to use here.

As I mentioned earlier, this type of report is push-button simple on some toolsets. If that’s the case for you, then feel free to stop reading and walk away with the knowledge that you will be asked this question on a regular basis, and you should be prepared.

For those using a toolset where answering this question requires effort, read on!

select nodes.statusled, nodes.nodeid, nodes.caption,
s1.LED, s1.elementtype, s1.element, s1.Description
from nodes
left join (
select '01' as elementorder, 'HW' as elementtype, APM_HardwareAlertData.nodeid, APM_HardwareAlertData.HardwareStatus as LED, APM_HardwareAlertData.Model as element, APM_HardwareAlertData.SensorsWithStatus as description
from APM_HardwareAlertData
union all
select '02' as elementorder, 'NIC' as elementtype, interfaces.nodeid as NodeID, interfaces.statusled as LED, interfaces.InterfaceName as element, interfaces.InterfaceTypeDescription as Description
from interfaces
union all
select '03' as elementorder, 'DISK' as elementtype, volumes.nodeid as NodeID, volumes.statusled as LED, volumes.caption as element, volumes.VolumeType as description
from volumes
) s1
on nodes.nodeid = s1.NodeID
Order by nodes.nodeid ,  s1.elementorder asc, s1.element
Catch all that? OK, let’s break this down a little. The basic format of this structure is:

Get device information (name, IP, etc)
            Get sub-element information (name, status, description)

The key to this process is standardizing the sub-element information so that it’s consistent across each type of element.

One thing to note is that the SQL “union all” command will let you combine results from two separate queries—such as a query to the interfaces table and another to the volumes table. BUT it requires that each query return the same number of columns. In my example I’ve kept it simple—just the name and description, really.
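To see that column-count requirement in action, here is a hedged sketch using SQLite (table and column names invented, much simpler than the real SolarWinds schema) showing both the failure and the padding fix:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE interfaces (name TEXT, descr TEXT)")
con.execute("CREATE TABLE volumes (name TEXT)")
con.execute("INSERT INTO interfaces VALUES ('eth0', 'GigE uplink')")
con.execute("INSERT INTO volumes VALUES ('C:')")

# Mismatched column counts make UNION ALL fail outright...
try:
    con.execute(
        "SELECT name, descr FROM interfaces UNION ALL SELECT name FROM volumes"
    )
except sqlite3.OperationalError as e:
    print("error:", e)

# ...so pad the short side with an empty placeholder column.
rows = con.execute(
    "SELECT name, descr FROM interfaces "
    "UNION ALL SELECT name, '' FROM volumes"
).fetchall()
print(rows)
```

The '' placeholder is exactly the trick the larger queries in this post use for their empty capacity and threshold columns.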

The other trick I learned was to add icons rather than text wherever possible. That includes the “statusLED” and “status” columns, which display dots instead of text when rendered by the SolarWinds report engine. I find this gives a much quicker sense of what’s going on (oh, they’re monitoring this disk, but it’s currently offline).

Another addition worth noting is:

            select 'xx' as elementorder, 'yyy' as elementtype,

I use element order to sort each of the query blocks, and elementtype to give the report reader a clue as to the source of this information (disk, application, hardware, etc.)

But what if I want to include data points that exist for some elements but not for others? Well, you still have to ensure that each query connected by “union all” has the same number of columns. So let’s say we want to include circuit capacity for interfaces and disk capacity for disks, but nothing for hardware; it would look like this:

select nodes.statusled, nodes.nodeid, nodes.caption,
s1.LED, s1.elementtype, s1.element, s1.Description, s1.capacity
from nodes
left join (
select '01' as elementorder, 'HW' as elementtype, APM_HardwareAlertData.nodeid, APM_HardwareAlertData.HardwareStatus as LED, APM_HardwareAlertData.Model as element, APM_HardwareAlertData.SensorsWithStatus as description, '' as capacity
from APM_HardwareAlertData
union all
select '02' as elementorder, 'NIC' as elementtype, interfaces.nodeid as NodeID, interfaces.statusled as LED, interfaces.InterfaceName as element, interfaces.InterfaceTypeDescription as Description, interfaces.InterfaceSpeed as capacity
from interfaces
union all
select '03' as elementorder, 'DISK' as elementtype, volumes.nodeid as NodeID, volumes.statusled as LED, volumes.caption as element, volumes.VolumeType as description, volumes.VolumeSize as capacity
from volumes
) s1
on nodes.nodeid = s1.NodeID
Order by nodes.nodeid ,  s1.elementorder asc, s1.element

By adding the empty '' as capacity column to the hardware block (and any other section where it’s needed), we avert errors with the union all command.

Conspicuous by their absence in all of this are the things I listed first on the “must have” list: CPU, RAM, etc.

In this case, I used a couple of simple tricks. For CPU, I give the count of CPUs, since any other data (current processor utilization, etc.) is probably not helpful. For RAM, the solution is even simpler: I just query the nodes table again and pull out JUST the total memory.

select nodes.statusled, nodes.nodeid, nodes.caption, s1.LED, s1.elementtype, s1.element, s1.Description, s1.capacity
from nodes
left join (
select '01' as elementorder, 'CPU' as elementtype, c1.NodeID, 'Up.gif' as LED, 'CPU Count:' as element, CONVERT(varchar,COUNT(c1.CPUIndex)) as description, '' as capacity
from (select DISTINCT CPUMultiLoad.NodeID, CPUMultiLoad.CPUIndex from CPUMultiLoad) c1
group by c1.NodeID
union all
select '02' as elementorder, 'RAM' as elementtype, nodes.NodeID, 'Up.gif' as LED, 'Total RAM' as element, CONVERT(varchar,nodes.TotalMemory) as description, '' as capacity
from nodes
union all
select '03' as elementorder, 'HW' as elementtype, APM_HardwareAlertData.nodeid, APM_HardwareAlertData.HardwareStatus as LED, APM_HardwareAlertData.Model as element, APM_HardwareAlertData.SensorsWithStatus as description, '' as capacity
from APM_HardwareAlertData
union all
select '04' as elementorder, 'NIC' as elementtype, interfaces.nodeid as NodeID, interfaces.statusled as LED, interfaces.InterfaceName as element, interfaces.InterfaceTypeDescription as Description, interfaces.InterfaceSpeed as capacity
from interfaces
union all
select '05' as elementorder, 'DISK' as elementtype, volumes.nodeid as NodeID, volumes.statusled as LED, volumes.caption as element, volumes.VolumeType as description, volumes.VolumeSize as capacity
from volumes
union all
select '06' as elementorder, 'APP' as elementtype, APM_AlertsAndReportsData.nodeid as NodeID, APM_AlertsAndReportsData.ComponentStatus as LED, APM_AlertsAndReportsData.ComponentName as element, APM_AlertsAndReportsData.ApplicationName as description, '' as capacity
from APM_AlertsAndReportsData
union all
select '07' as elementorder, 'POLLER' as elementtype, CustomPollerAssignmentView.NodeID, 'up.gif' as LED, CustomPollerAssignmentView.CustomPollerName as element, CustomPollerAssignmentView.CustomPollerDescription as description, '' as capacity
from CustomPollerAssignmentView
) s1
on nodes.nodeid = s1.NodeID
Order by nodes.nodeid ,  s1.elementorder asc, s1.element

In this iteration I found that the data output from some sources was integer, some was float, and some was even text! So I started using the “CONVERT()” option to keep everything in the same format.
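A small illustration of the type problem, using standard SQL CAST() in SQLite (CONVERT() is the SQL Server spelling of the same idea; the values are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")

# One source yields an integer, another a float, another text.
# Casting them all to TEXT gives UNION ALL one uniform column type.
rows = con.execute(
    "SELECT CAST(8192 AS TEXT) AS description "
    "UNION ALL SELECT CAST(99.5 AS TEXT) "
    "UNION ALL SELECT 'up'"
).fetchall()
print([r[0] for r in rows])
```

Whichever spelling your database uses, the point is the same: convert at the query layer so the report engine only ever sees one type per column.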

The result looks something like this:

I could stop here and you would have, more or less, the building blocks you need to build your own “What is monitored on my system?” report. But there is one more piece that takes this type of report to the next level.

Including the built-in thresholds for these elements increases complexity to the query, but also adds an entirely new (and important) dimension to the information you are providing.

More than ever, the success of this type of report lies in your knowing where threshold data is kept. In the case of SolarWinds, a series of “Thresholds” views (InterfacesThresholds, NodesPercentMemoryUsedThreshold, NodesCpuLoadThreshold, and so on) makes the job easier but you still have to know where thresholds are kept for applications, and that there are NO built-in thresholds for disks or custom pollers.

With that said, the final report query would look like this:

select nodes.statusled, nodes.nodeid, nodes.caption, s1.LED, s1.elementtype, s1.element, s1.Description, s1.capacity,
s1.threshold_value, s1.warn, s1.crit
from nodes
left join (
select '01' as elementorder, 'CPU' as elementtype, c1.NodeID, 'Up.gif' as LED, 'CPU Count:' as element, CONVERT(varchar,COUNT(c1.CPUIndex)) as description, '' as capacity, 'CPU Utilization' as threshold_value, convert(varchar, t1.Level1Value) as warn, convert(varchar, t1.Level2Value) as crit
from (select DISTINCT CPUMultiLoad.NodeID, CPUMultiLoad.CPUIndex from CPUMultiLoad) c1
join NodesCpuLoadThreshold t1 on c1.nodeID = t1.InstanceId
group by c1.NodeID, t1.Level1Value, t1.Level2Value
union all
select '02' as elementorder, 'RAM' as elementtype, nodes.NodeID, 'Up.gif' as LED, 'Total RAM' as element, CONVERT(varchar,nodes.TotalMemory) as description, '' as capacity, 'RAM Utilization' as threshold_value, convert(varchar, NodesPercentMemoryUsedThreshold.Level1Value) as warn, convert(varchar, NodesPercentMemoryUsedThreshold.Level2Value) as crit
from nodes
join NodesPercentMemoryUsedThreshold on nodes.nodeid = NodesPercentMemoryUsedThreshold.InstanceId
union all
select '95' as elementorder, 'HW' as elementtype, APM_HardwareAlertData.nodeid, APM_HardwareAlertData.HardwareStatus as LED, APM_HardwareAlertData.Model as element, APM_HardwareAlertData.SensorsWithStatus as description, '' as capacity, '' as threshold_value, '' as warn, '' as crit
from APM_HardwareAlertData
union all
select '03' as elementorder, 'NIC' as elementtype, interfaces.nodeid as NodeID, interfaces.statusled as LED, interfaces.InterfaceName as element, interfaces.InterfaceTypeDescription as Description, interfaces.InterfaceSpeed as capacity, 'bandwidth in/out' as threshold_value,
convert(varchar, i1.Level1Value)+'/'+convert(varchar,i2.level1value) as warn, convert(varchar, i1.Level2Value)+'/'+convert(varchar,i2.level2value) as crit
from interfaces
join (select InterfacesThresholds.instanceid, InterfacesThresholds.level1value , InterfacesThresholds.level2value
       from InterfacesThresholds where InterfacesThresholds.name = 'NPM.Interfaces.Stats.InPercentUtilization') i1 on interfaces.interfaceid = i1.InstanceId
join (select InterfacesThresholds.instanceid, InterfacesThresholds.Level1Value, InterfacesThresholds.level2value
       from InterfacesThresholds where InterfacesThresholds.name = 'NPM.Interfaces.Stats.OutPercentUtilization') i2 on interfaces.interfaceid = i2.InstanceId
union all
select '04' as elementorder, 'DISK' as elementtype, volumes.nodeid as NodeID, volumes.statusled as LED, volumes.caption as element, volumes.VolumeType as description, volumes.VolumeSize as capacity, '' as threshold_value, '' as warn, '' as crit
from volumes
union all
select '05' as elementorder, 'APP' as elementtype, APM_AlertsAndReportsData.nodeid as NodeID, APM_AlertsAndReportsData.ComponentStatus as LED, APM_AlertsAndReportsData.ComponentName as element, APM_AlertsAndReportsData.ApplicationName as description, '' as capacity, 'CPU Utilization' as threshold_value, convert(varchar, APM_AlertsAndReportsData.[Threshold-CPU-Warning]) as warn, convert(varchar, APM_AlertsAndReportsData.[Threshold-CPU-Critical]) as crit
from APM_AlertsAndReportsData
union all
select '06' as elementorder, 'POLLER' as elementtype, CustomPollerAssignmentView.NodeID, 'up.gif' as LED, CustomPollerAssignmentView.CustomPollerName as element, CustomPollerAssignmentView.CustomPollerDescription as description, '' as capacity, '' as threshold_value, '' as warn, '' as crit
from CustomPollerAssignmentView
) s1
on nodes.nodeid = s1.NodeID
Order by nodes.nodeid ,  s1.elementorder asc, s1.element

And here are the results:

It’s 90% Perspiration…

While answering this question requires persistence, skill, and in-depth knowledge of your monitoring toolset, the benefits are significantly greater than for the previous two questions.

Done right, teams can use this report to validate that the correct elements on each device are monitored – nothing is left out, nothing which has been decommissioned is still there. And when an alert does trigger, it will be easier to understand where you can look for hints, instead of just clicking around screens looking for something interesting.

Stock up on your tea leaves, goat entrails, and crystal balls because in my next post we’re going to take a peek into the future by answering the question “What *WILL* alert on my system?”

#FeatureFriday: What is Syslog?

Welcome to “Feature Friday”, a series of (typically short) videos which explain a feature, function, or technique.

In this installment, Chris O’Brien and I dig into the protocol/feature/tool called Syslog: How it works and how you use it effectively for monitoring in your environment.


For more insights into monitoring, as well as random silliness, you can follow me on Twitter (@LeonAdato) or find me on the SolarWinds THWACK.com forums (@adatole)

Luxury

In monitoring solutions, there are features which are essential.

There are features which are convenient.

There are features which are interesting, but not particularly useful.

Then there are features which are luxury items.

I think of a luxury feature in a monitoring solution (or any software, really) as something which may cost more, but is so amazingly convenient, or well-designed, or comfortable to use that it makes me excited to use the tool. It makes me better at what I do – able to accomplish more in less time, or devote more time doing it because it takes longer for me to get tired of it.

Knowing the difference between essential, convenient, and luxury is important. You need to know what you are paying for, and why.

You need to build the skill of identifying luxury features, and making a case for them when needed, so that you don’t find yourself “making do” when it’s not necessary.

The Four Questions, Part 2

In the first part of this series, I described the four (ok, really five) questions that monitoring professionals are frequently asked: Why did I get an alert? Why didn’t I get an alert? What is being monitored on my system? What will alert on my system? and What standard monitoring do you do? My goal in this post is to give you the tools you need to answer the second question:

Why DIDN’T I get an alert?

You can find information on the  first question (Why did I get this alert) here.

Reader’s Note: While this article uses examples specific to the SolarWinds monitoring platform, my goal is to provide information and techniques which can be translated to any toolset.

 

Good Morning, Dave…

It’s 9:45am. You finally got your first caller off the phone – <LINK-to-first-question>the one who wanted to know why they got a particular alert.</LINK>

You hear the joyous little “ping” of a new email. (Never mind that the joy of that ping sound got old 15 minutes after you set it up.)

“We had an outage on CorpSRV042, but never got a notification from your system. What exactly are we paying all that money for, anyway?”

Out of The Four Questions of monitoring, this one is possibly the most labor-intensive because proving why something didn’t trigger requires an intimate knowledge of both the monitoring in place and the on-the-ground conditions of the system in question.

Unlike the previous question, there’s very little preparation you can do to lessen your workload. For that reason, my advice to you is going to be more of a checklist of items and areas to look into.

I’m also going to be working on the assumption that an event really did happen, and that a monitor was in place, which (ostensibly) should have caught it.

So what could have failed?

What We Have Here is a Failure…

It’s a non-failure failure (it was designed to work that way)

Items in this category represent more of a lack of awareness on the part of the device owner about how monitoring works. Once you narrow down the alleged “miss” to one of these, the next thing you need to evaluate is whether you should provide additional end-user education (lunch-and-learns, documentation, a narrated interpretive dance piece in the cafeteria, etc.).

  • Alert windows
    Some alerting systems will allow you to specify that the alert should only trigger during certain times of the day. If the problem occurred and corrected itself (or was manually corrected) outside of that window, then no alert is triggered.
  • Alert duration
    The best alerts are set up to look for more than one occurrence of the issue. Nobody wants to get a page at 2:00am when a device failed one ping. But if you have an alert that is set to trigger when ping fails for 15 minutes, 13 minutes can seem like an eternity to the application support team that already knows the app is down.
  • The alert never reset
    After the last failure, the owners of the device worked on the problem, but they never actually got it into a good enough state where your monitoring solution registered that it was actually resolved. This also happens when staff decide to close the ticket (because it was probably nothing, anyway) without looking at the system. A good example of this is the disk alert that triggers when the disk is over 90% utilized, but doesn’t reset until it’s under 70% utilized. Staff may clear logs and get things down to a nice comfy 80%, but the alert never resets. Thus when the disk fills to 100% a week later, no alert is cut.
  • The device was unmanaged
    <IMAGE: unmanaged node with blue dot>It’s surprisingly easy to overlook that cute blue dot—especially if your staff aren’t in the habit of looking at the monitoring portal at all. Nevertheless, if it’s unmanaged, no alerts will trigger.
  • Mute, Squelch, Hush, or “Please-Shut-Up-Now” functions
    I’m a big believer in using custom properties for many of the things SolarWinds doesn’t do by design, and one of the first techniques I developed was a “mute” option. This gets around the issue with UnManage, where monitoring actually stops. Read more about that here (https://thwack.solarwinds.com/message/142288). With that said, if you use this type of option, you’ll also need to check its status as part of your analysis when you get this type of question.
  • Parent-Child, part 1
    The first kind of parent-child option I want to talk about is the one added to Orion Core in version 10.3. From that point forward, SolarWinds had the ability to make one device, element, application, component, group, or “thingy” (I hope I don’t lose you with these technical terms) the parent of another device, element, application, component… etc. Thus, if the parent is deemed to be down, the child will NOT trigger an alert (even if it is also down) but will rather be listed in a state of “unreachable.”
  • Parent-Child, part 2
    The second kind of suppression is implicit but often unrecognized by many users. In its simplest terms, if a device is down, you won’t get an alert about the disk being full. That one makes sense. But frequently an application team will ask why they didn’t get an alert that their app is down, and the reason is that the device was down (i.e., ping had failed) during that same period. Because of this, SolarWinds suppressed the application-down alert.
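The “alert never reset” trap from the disk example above is easy to demonstrate in code. Here’s a minimal sketch (thresholds and function names are illustrative, not any product’s actual API) of trigger/reset hysteresis:

```python
# Sketch of the "alert never reset" trap from the disk example above:
# trigger at 90% utilization, but only re-arm once usage drops below 70%.
# Thresholds and names here are illustrative, not a SolarWinds API.

TRIGGER = 90  # alert fires at or above this utilization (%)
RESET = 70    # alert re-arms only below this utilization (%)

def evaluate(samples):
    """Return the indexes of the samples that would actually fire an alert."""
    armed = True
    fired = []
    for i, pct in enumerate(samples):
        if armed and pct >= TRIGGER:
            fired.append(i)
            armed = False          # alert is now active; no re-fire
        elif not armed and pct < RESET:
            armed = True           # only NOW does the alert reset
    return fired

# Usage: disk hits 95%, staff clean up to a "comfy" 80%, disk later fills to 100%.
# Because 80% never dropped below the 70% reset line, the second spike is silent.
history = [60, 95, 80, 80, 100]
print(evaluate(history))  # [1] -- only the 95% spike fires; the 100% does not
```

Note that the 100% sample at the end fires nothing, exactly the silent failure the device owner will call you about a week later.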

Failure with Change Control

In this section, the issue we’re looking at is a change, either in the environment or within the monitoring system, that causes an alert to be missed. I’m calling this “change control” because if there had been a record and/or awareness of the change in question (as well as of how the alert is configured), the device owner probably wouldn’t be calling you.

  • A credential changed
    If someone changes the SNMP string or the AD account password you’re using to connect, your monitoring tool ceases to be able to collect. Usually you’ll get some type of indication that this has happened, but not always.
  • The network changed
    All it takes is one firewall rule or routing change to make the difference between beautiful data and total radio silence. The problem is that the most basic monitoring (ping) is usually not impacted, so you won’t get the device-down message that everyone generally relies on to know something is amiss. But higher-level protocols like SNMP or WMI are blocked. So you have a device which is up (ping) but where disk or CPU information is no longer being collected.
  • A custom property changed
    As I said before, I love me some custom properties. Along with the previously mentioned “mute,” there are properties for the owner group, location, environment (prod, dev, QA, etc.), server status (build, testing, managed), criticality (low, normal, business-critical), and more. Alert triggers leverage these properties so that we can have escalated alerts for some devices and just warnings for others. But what happens when someone changes a server status from “PRODUCTION” to “DEV”? If an alert is configured to look specifically for “PRODUCTION” servers, that change alone is enough to cause a miss.
  • Drive (or other element) has been removed/changed
    I say drives because this seems to happen most often. If your environment doesn’t include a “disk down” alert (don’t laugh, I’ve seen them), then volumes can be unmounted or mounted with a new name with amazing frequency. When that happens, many monitoring tools do not automatically start monitoring the new element; and the tools that do almost never apply all the correct settings (like those custom properties). You end up with a situation where the device owner is completely aware of the update, but monitoring is the last to know.
  • The Server Vanished into the Virtualization Cloud
    The drive toward virtualization is strong. When a server goes from physical to virtual (P-to-V), it’s effectively a whole new machine. Even though the IP address and server name are the same, the drives go from a series of physical disks attached to a real storage controller to (usually fewer) volumes which appear to be running off a generic SCSI bus. Not only that, but other elements (interfaces, hardware monitors, CPU, and more) all completely change. Almost all monitoring tools require manual updating to track those changes, or else you are left with ghost elements that don’t respond to polling requests.
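Several of the changes above (credentials, firewall rules, P-to-V migrations) share one signature: the device still answers ping while its SNMP or WMI data quietly goes stale. A check for that pattern can be sketched as follows; the node records and field names here are hypothetical stand-ins for whatever your tool exposes:

```python
# Sketch of an "up but silent" check: a device that answers ping while its
# SNMP/WMI data has gone stale is the classic signature of a credential or
# firewall change. The data layout is hypothetical, not a SolarWinds API.
from datetime import datetime, timedelta

STALE_AFTER = timedelta(minutes=30)

def up_but_silent(nodes, now):
    """nodes: list of dicts with 'name', 'ping_ok', and 'last_snmp_poll'."""
    suspects = []
    for n in nodes:
        if n["ping_ok"] and now - n["last_snmp_poll"] > STALE_AFTER:
            suspects.append(n["name"])  # answers ping, but no real data
    return suspects

now = datetime(2016, 5, 1, 12, 0)
nodes = [
    {"name": "web01", "ping_ok": True,  "last_snmp_poll": now - timedelta(hours=2)},
    {"name": "db01",  "ping_ok": True,  "last_snmp_poll": now - timedelta(minutes=5)},
    {"name": "app01", "ping_ok": False, "last_snmp_poll": now - timedelta(hours=2)},
]
print(up_but_silent(nodes, now))  # ['web01'] -- up on ping, silent on SNMP
```

Note that app01 is excluded on purpose: a fully down device should trip your normal device-down alert, not this check.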

Failure of the Monitoring Technology

The previous two sections speak to educational or procedural breakdowns. But loath as I am to admit it, sometimes our monitoring tools fail us too. Here are some things you need to be aware of:

  • The element or device is not actually getting polled
    Often, this is a result of disks or other elements being removed and added; or a P-to-V migration (see previous section). But it also happens that an element simply stops getting polled. You’ll see this when you dig into the detailed data and find nothing collected for a certain period of time.
  • Polling interval is throttled
    One of the first things that a polling server does (at least the good ones) when it begins to be overwhelmed is to pull back on polling cycles so that it can collect at least SOME data from each element. You’ll see this as gaps in data sets. It’s not a wholesale loss of polling, but sort of a herky-jerky collection.
  • Polling data is out of sync
    This one can be quite challenging to nail down. In some cases, a monitoring system will add data into the central data store using localized times (either from the polling engine or (horrors!) from the target device itself). If that happens, then an event that occurred at 9am in New York shows up as having happened at 8am in Chicago. This shouldn’t be a problem unless, as mentioned in the first section, your system won’t trigger alerts before 9am.
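You can spot both throttled polling and wholesale polling loss the same way: look for holes in the collected timestamps that are much larger than the polling interval. A rough sketch (the interval and slack factor are made-up values to tune for your environment):

```python
# Sketch of a polling-gap check: given the timestamps of collected samples,
# flag any gap much larger than the expected polling interval. This is the
# "herky-jerky collection" pattern described above; values are illustrative.
from datetime import datetime, timedelta

def find_gaps(timestamps, interval, slack=2.0):
    """Return (start, end) pairs where the gap exceeds slack * interval."""
    limit = interval * slack
    gaps = []
    for earlier, later in zip(timestamps, timestamps[1:]):
        if later - earlier > limit:
            gaps.append((earlier, later))
    return gaps

base = datetime(2016, 5, 1, 9, 0)
# 5-minute polling with a 25-minute hole between 09:10 and 09:35
polls = [base + timedelta(minutes=m) for m in (0, 5, 10, 35, 40)]
for start, end in find_gaps(polls, timedelta(minutes=5)):
    print(f"no data between {start:%H:%M} and {end:%H:%M}")
```

Run against the detailed data for the element in question, this tells you in seconds whether “nothing collected for a certain period of time” is what actually happened.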

Failure Somewhere After the Monitoring Solution

As much as you might hate to admit it, monitoring isn’t the center of the universe. And your monitoring solution isn’t even the center of the universe with regard to monitoring. If everything within your monitoring solution checks out and you are still scratching your head, here are some additional areas to look into:

  • Email (or whatever notification system you use) is down
    One of the most obvious items, but often not something IT pros think to check, is whether the system that is sending out alerts is actually alive and well. If email is down, you won’t get that email telling you the email server has crashed.
  • Event correlation rules
    Event correlation rules are wonderful, magical things. They take you beyond the simple parent-child suppression discussed earlier, and into a whole new realm of dependencies. But there are times when they inhibit a “good” alert in an unexpected way:

    • De-duplication
      The point of de-duplication is that multiple alerts won’t create multiple tickets. But if a ticket is closed without updating the event correlation system, de-dup will keep suppressing new tickets indefinitely.
    • Blackout/Maintenance windows
      Another common feature for EC systems is the ability to look up a separate data source that lists times when a device is “out of service.” This can be a recurring schedule, or a specific one-time rule. Either way, you’ll want to check if the device in question was listed on the blackout list during the time when the error occurred.
  • Already open ticket
    Ticket systems themselves can be quite sophisticated, and many have the ability to suppress new tickets if there is already one open for the same alert/device combination. If you have a team that forgets to close their old ticket, they may never hear about the new events.
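The blackout/maintenance lookup described above amounts to a simple question: did the event time fall inside any “out of service” window for that device? A sketch (the schedule format is made up; your EC system will have its own):

```python
# Sketch of a blackout-window lookup like the one described above: given
# maintenance windows per device, was the device "out of service" when the
# event occurred? The schedule format here is hypothetical.
from datetime import datetime

blackouts = {
    "web01": [(datetime(2016, 5, 1, 2, 0), datetime(2016, 5, 1, 4, 0))],
}

def was_blacked_out(device, event_time):
    """True if event_time falls inside any maintenance window for device."""
    return any(start <= event_time < end
               for start, end in blackouts.get(device, []))

print(was_blacked_out("web01", datetime(2016, 5, 1, 3, 15)))  # True: suppressed
print(was_blacked_out("web01", datetime(2016, 5, 1, 5, 0)))   # False
```

If this comes back True for the time in question, the “missed” alert wasn’t missed at all; it was suppressed exactly as someone asked it to be.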

Hokey religions … are no match for a good blaster, kid

After laying out all the glorious theoretical ways in which an alert can be missed, I thought it was only fair to give you some advice on techniques and tools you can use to identify these problems, resolve them, or (best of all) avoid them in the first place.

An Ounce of Prevention…

Here are some things to have in place that will let you know when all is not puppy dogs and rainbows:

  • Alerts that give you the health of the environment
    Under the heading of “who watches the watchmen,” in any moderately-sophisticated (or mission critical) monitoring environment, you should have both internal and external checks that things are working well:

    • The SolarWinds out-of-the-box (OOTB) alert that tells you SNMP has not been collected for a node in xx minutes.
    • The SolarWinds OOTB alert that tells you a poller hasn’t collected any metrics in xx minutes.
    • If you can swing it, running the OOTB Server & Application Monitor (SAM) templates for SolarWinds on a separate system is a great option. If having a second SolarWinds instance watching the first is simply not possible, look at the SAM template and mimic it using whatever other options you have available to you.
  • Have a way to test individual aspects of the alert stream
    It’s a horrible sinking feeling when you realize that no alerts have been going out because one piece of the alerting infrastructure failed on you. Start by understanding (and documenting) every step an alert takes, from the source device through to the ticket, email, page, or smoke signal that is sent to the alert recipient. From there, create and document the ways you can validate that each of those steps is independently working. This will allow you to quickly validate each subsystem and nail down where a message might have gotten lost.
  • So wait… you can test each alert subsystem?
    A test procedure is just a monitor waiting for your loving touch. Get it done. You’ll need to do it on a separate system (since, you know, if your alerting infrastructure is broken you won’t get an alert), but usually this can be done inexpensively. Just to be clear: once you can manually test each of your alert infrastructure components (monitoring, event correlation, etc.), turn those manual tests into continuous monitors, and then set those monitors up with thresholds so you get an alert when a component fails.
  • Create a deadman-switch
    The concept of a deadman switch is that you get an alert if something DOESN’T happen. In this case, you set yourself up to receive an alert if something doesn’t clear a trigger. You then send that clearing event through the monitoring system.

    • Example: Every 10 minutes, an internal process on your ticket system creates a ticket for you saying, “Monitoring is broken!” That ticket escalates, and alerts you, if it has been open for 7 minutes. Now you have your monitoring system send an alert whenever the current minutes are evenly divisible by 5. This alert is the reversing event your ticket system is looking for. As long as monitoring is working, the original ticket will be cleared before it escalates.
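The timing logic in that example can be sketched as a small simulation. This is just the arithmetic of the scenario above (the 10-minute ticket, the 7-minute escalation, the divisible-by-5 clearing event), not any ticketing system’s real API:

```python
# Sketch of the deadman-switch example above: a ticket opens every 10 minutes
# and escalates if still open after 7; the monitoring system sends a clearing
# event on every minute evenly divisible by 5. Timing values are from the
# example; everything else is illustrative.

def deadman_escalates(clearing_events_ok, opened_at_minute=1, escalate_after=7):
    """Simulate one ticket cycle; True means it escalated (monitoring is dead)."""
    for minute in range(opened_at_minute, opened_at_minute + escalate_after):
        if clearing_events_ok and minute % 5 == 0:
            return False  # the reversing event arrived; ticket cleared in time
    return True           # nothing cleared it: silence is the alarm

print(deadman_escalates(clearing_events_ok=True))   # False: all is well
print(deadman_escalates(clearing_events_ok=False))  # True: monitoring is broken
```

The key property: any 7-minute span contains at least one minute divisible by 5, so a healthy monitoring system always clears the ticket before it escalates, and a dead one never does.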

… about that pound of cure, now

Inevitably, you’ll have to dig in and analyze a failed alert. Here are some specific techniques to use in those cases:

  • Re-scan the device
    The problem might be that the device has changed. It could be a new firewall rule, an SNMP string change, or even the SNMP agent on the target device dying on you. Regardless, you’ll want to go through all the same steps you used to add the device. In SolarWinds, this means using the “Test” button under node properties, and then also running the “List Resources” option to make sure all of the interfaces, disks, and yes, even CPU and RAM options are correctly selected.

    • TRAP: Remember that List Resources will never remove elements. Un-checking them doesn’t do diddly. You have to go into Manage Nodes and specifically delete them before they are really gone.
  • Test the alert
    Are you sure your logic was solid? Be prepared to copy the alert and add a statement limiting it down to the single device in question. Then re-fire that sucker and see if it flies.
  • Message Center Kung-Fu
    The Message Center remains your best friend for forensic analysis of alerts. Things to keep in mind:
    • Set the timeframe correctly – It defaults to “today.” If you got a call about something yesterday, make sure you change this.
    • Set the message count – If you are dealing with a wide range of time or a large number of devices, bump this number up. There’s no paging on this screen, so if the event you want is #201 out of 200, you’ll never know.
    • Narrow down to a single device – To help avoid message count issues, use the Network Object dropdown to select a specific device.
    • Un-check the message categories you don’t want – Syslog and Trap are the big offenders here. But even Events can get in your way, depending on what you’re trying to find.
    • Limiting Events (or Alerts) – Speaking of Events and Alerts, if you know you’re looking for something particular, use those drop-down boxes to narrow down the pile of data. This also lets you look for a particular alert while still seeing every other event during that time period.
    • The search box – Yes, there is one. It’s on the right side (often off the right side of the page, so you have to scroll for it). Anything you put in there acts as an additional filter along with all the settings in the grey box.

Stay tuned for the next installment in this series.

#FeatureFriday: What is Ping?

I’m starting a new series of blog posts this week, which I’m calling “Feature Friday”. These will be (typically short) videos which explain a feature, function, or technique.

For our inaugural post, I’m going to start at the VERY beginning, with the granddaddy of all monitoring techniques: Ping.

In this video, Chris O’Brien and I dig into Ping, looking at how it works and what it can do for you from a monitoring perspective.
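The video is the main event, but for a taste of ping as a monitoring check, here’s a minimal sketch that shells out to the system ping utility and reports up/down (host and echo count are placeholder values):

```python
# A minimal taste of ping as a monitoring check: shell out to the system ping
# utility and report up/down. One echo request is a crude test; as discussed
# elsewhere on this blog, good alerts look for more than one failure before
# crying wolf. Host and count here are placeholder values.
import platform
import subprocess

def is_up(host, count=3):
    """Return True if the host answers at least one of `count` echo requests."""
    flag = "-n" if platform.system() == "Windows" else "-c"
    result = subprocess.run(
        ["ping", flag, str(count), host],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

if __name__ == "__main__":
    host = "127.0.0.1"
    print(f"{host} is {'up' if is_up(host) else 'down'}")
```

That up/down bit is the foundation every fancier monitoring technique is built on top of.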

For more insights into monitoring, as well as random silliness, you can follow me on Twitter (@LeonAdato) or find me on the SolarWinds THWACK.com forums (@adatole)

I Wish Someone Had Told Me

Ira Glass (Host of “This American Life“) has an oft-quoted piece about creative work titled “The Gap”. You can read it here, but it’s been made into vines and videos and lots of other forms that are more fun to watch – you can Google the first sentence and find enough to waste an hour or two.

What Ira describes has a parallel in the world of I.T., and just like Ira’s experience, nobody told me when I started. So with all necessary apologies and legal disclaimers, here is my adaptation of Ira’s famous advice:

Nobody tells this to I.T. noobies. I wish someone told me. All of us who do technical work, we get into it because we have a desire to make things work better. Folks drawn to IT are quick to figure out how things work, but then we have this vision of how it could work – how cool it could be. And we know that if we could just get in there and tinker with it, we’d get it all sorted out and it would be incredible.

But there’s this gap.

For the first couple of years you are just plowing along. There’s so much to learn, and so much to do, and you have to earn respect before you get to do some of the cool stuff. So you do it. You just do the hard work of doing the work and learning and growing.

But at the same time you can see so much more that you want to fix, to improve, to be part of.

So you keep plugging away, soaking it all in and just trying to be part of whatever you can get into.

After a few years, you realize you’ve fallen into this trap – you have done all these different things (and done them well!) but now everyone expects you to be a jack of all trades all the time. To be the “he can figure it out” guy.

So this is the part nobody told me. The part that I had to figure out for myself. The part I’m telling you now:

After a few years, when you’ve seen the whole landscape of I.T. and you know what it’s all about, you need to pick. You need to decide where your personal desires and skills overlap. It might be storage, or voice, or server, or appdev, or whatever. It might have NOTHING to do with what you are doing here and now (actually, that’s a pretty safe bet). The point isn’t that you start doing “it” (at least not at first). The point is that you have chosen, and you commit to that goal.

To get there, you might need to work with a whole other team “on the side”, or after hours, or volunteer, or just hang out with “those guys” when they eat lunch. You might have to start reading a whole other set of blogs, or sneak to user groups or conventions on your lunch hour and days off.

And the job you are doing now, at the company you work for? You should get used to the idea that they’re not going to help you get there. Right now you are this amazing do-it-all resource. If you start only doing one thing they’re going to have to hire 2 or 3 more people to cover what you used to be doing. So don’t expect a lot of love in that direction.

But please, PLEASE keep doing it. Tap into the passion that got you into this in the first place – the desire to figure it out and the vision of how it can be better – and push ahead. You’ll start commenting in forums, or writing blog posts, or jumping on tweet-ups.

You transform interest and enthusiasm into experience.

And all of a sudden, people are referring to you as an expert in the field. And all of a sudden, you are doing what you love, not just what you can.

Like Ira says: “It’s gonna take awhile. It’s normal to take awhile. You’ve just gotta fight your way through.”