Category Archives: SolarWinds

The Four Questions, Part 4

In the first part of this series, I described the four (ok, really five) questions that monitoring professionals are frequently asked. You can read that introduction here. Information on the first question (Why did I get this alert) is here. You can get the low-down on the second question (Why DIDN’T I get an alert) here. And the third question (What is monitored on my system) is here.

My goal in this post is to give you the tools you need to answer the fourth question: Which of the existing alerts will potentially trigger for my system?

Reader’s Note: While this article uses examples specific to the SolarWinds monitoring platform, my goal is to provide information and techniques which can be translated to any toolset.

Riddle Me This, Batman…

It’s 3:00pm. You can’t quite see the end of the day over the horizon, but you know it’s there. You throw a handful of trail mix into your face to try to avoid the onset of mid-afternoon nap-attack syndrome and hope to slide through the next two hours unmolested.

Which, of course, is why you are pulled into a team meeting. Not your team meeting, mind you. It’s the Linux server team. On the one hand, you’re flattered. They typically don’t invite anyone who can’t speak fluent Perl or quote every XKCD comic in chronological order (using EPOC time as their reference, of course). On the other…well, team meeting.

The manager wrote:

            kill `ps -ef | grep -i talking | awk '{print $2}'`

on the board, eliciting a chorus of laughter (from everyone but me). Of course, this gave the manager the perfect opportunity to focus the conversation on yours truly.

“We have this non-trivial issue, and are hoping you can grep out the solution for us,” he begins. “We’re responsible for roughly 4,000 systems…”

Unable to contain herself, a staff member interjected, “4,732 systems. Of which 200 are physical and the remainder are virtualized…”

Unimpressed, her manager said, “Ms. Deal, unless I’m off by an order of magnitude, there’s no need for the correction.”

She replied, “Sorry boss.”

“As I was saying,” he continued. “We have a…significant number of systems. Now how many alerts currently exist in the monitoring system which could generate a ticket?”

“436, with 6 currently in active development.” I respond, eager to show that I’m just as on top of my systems as they are of theirs.

“So how many of those affect our systems?” the manager asked.

Now I’m in my element. I answer, “Well, if you aren’t getting tickets, then none. I mean, if nothing has a spiked CPU or RAM or whatever, then it’s safe to say all of your systems are stable. You can look at each node’s detail page for specifics, although with 4,000—I can see where you would want a summary. We can put something together to show the current statistics, or the average over time, or…”

“You misunderstand,” he cuts me off. “I’m fully cognizant of the fact that our systems are stable. That’s not my question. My question is…should one of my systems become unstable, how many of your 436-soon-to-be-442 alerts WOULD trigger for my systems?”

He continued, “As I understand it, your alert logic does two things: it identifies the devices which could trigger the alert—All Windows systems in the 10.199.1 subnet, for example—and at the same time specifies the conditions under which an alert is triggered—say, when the CPU goes over 80% for more than 15 minutes.”

“So what I mean,” he concluded, “Is this: can you create a report that shows me the devices which are included in the scope of an alert’s logic, irrespective of the trigger condition?”

Your Mission, Should You Choose to Accept it…

As with the other questions we’ve discussed in this series, the specifics of HOW to answer this question are less critical than knowing you will be asked it.

In this case, it’s also important to understand that this question is actually two questions masquerading as one:

  1. For each alert, tell me which machines could potentially trigger it
  2. For each machine, tell me which alerts could potentially trigger

Why is this such an important question—perhaps the most important one in this series? Because it determines the scale of the potential notifications monitoring may generate. It’s one thing if 5 alerts apply to 30 machines. It’s entirely another when 30 alerts apply to 4,000 machines.

The answer to this question has implications for staffing, shift allocation, pager rotation, and even the number of alerts a particular team may approve for production.

The way you go about building this information is going to depend heavily on the monitoring solution you are using.

In general, agent-based solutions are better at this because trigger logic – in the form of an alert name – is usually pushed down to the agent on each device, and thus can be queried from both directions (“Hey, node, what alerts are on you?” and “Hey, alert, which nodes have you been pushed to?”).

That’s not to say that agentless monitoring solutions are intrinsically unable to get the job done. The more full-featured monitoring tools have options built-in.

Reports that look like this:

Not to mention little “reminders” like this on the alert screen:

Or even resources on the device details page that look like this:

Houston, We Have a Problem

What if it doesn’t though? What if you have pored over the documentation, opened a ticket with the vendor, visited the online forums, asked the greatest gurus up on the mountain, and come back with a big fat goose egg? What then?

Your choices at this point still depend largely on the specific software, but generally speaking there are 3 options:

  • Reverse-engineer the alert trigger and remove the actual trigger part

Many monitoring solutions use a database back-end for the bulk of their metrics, and alerts are simply a query against this data. The alert trigger queries may exist in the database itself, or in a configuration file. Once you have found them, you will need to create a copy of each alert and then go through each, removing the parts which comprise the actual trigger (i.e.: CPU_Utilization > 80%) and leaving the parts that simply indicate scope (where Vendor = “Microsoft”; where “OperatingSystem” = “Windows 2003”; where “IP_address” contains “10.199.1”; etc). This will likely necessitate your learning the back-end query language for your tool. Difficult? Probably. Will it increase your street cred with the other users of the tool? Undoubtedly. Will it save your butt within the first month after you create it? Guaranteed.

And once you’ve done it, running a report for each alert becomes extremely simple.
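
To make that concrete, here’s a minimal sketch of what a scope-only copy might look like for the alert from the meeting above—Windows systems in the 10.199.1 subnet with CPU over 80% for 15 minutes. The column names (caption, ip_address, machinetype, cpuload) assume the standard Orion nodes table; adjust them for your own tool and schema:

-- Scope-only version of "High CPU on Windows servers in 10.199.1.x".
-- The commented-out line is the trigger portion we strip away.
select nodes.nodeid, nodes.caption, nodes.ip_address, nodes.machinetype
from nodes
where nodes.machinetype like '%Windows%'
and nodes.ip_address like '10.199.1.%'
-- and nodes.cpuload > 80   -- the actual trigger condition, removed for the scope report
order by nodes.caption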

  • Create duplicate alerts with no trigger

If you can’t export the alert triggers, another option is to create a duplicate of each alert that has the “scope” portion, but not the trigger elements (so the “Windows machines in the 10.199.1.x subnet” part but not the “CPU_Utilization > 80%” part). The only recipient of that alert should be you, and the alert action should be something like writing to a logfile with a very simple string (“Alert x has triggered for Device y”). If you are VERY clever, you can write out the key information in CSV format so you can import it into a spreadsheet or database for easy consumption. Every so often—every month or quarter—fire off those alerts and tally up the results so that recipient groups can slice and dice them.

  • Do it by hand

If all else fails (and the inability to answer this very essential question doesn’t cause you to re-evaluate your choice of monitoring tool), you can start documenting by hand. If you know up-front that you are in this situation, then it’s simply part of the ongoing documentation process. But most times it’s going to be a slog through the existing alerts, writing down the trigger information for each. Hopefully you can take that trigger info and turn it into an automated query against your existing devices. If not, then I would seriously recommend looking at another tool. Because in any decent-sized environment, this is NOT the kind of thing you want to spend your life documenting, and it’s also not something you want to live without.

What Time Is It? Beer O’Clock

After that last meeting—not to mention the whole day—you are ready to pack it in. You successfully navigated the four impossible questions that every monitoring expert is asked (on more or less a daily basis)—Why did I get that alert, Why didn’t I get that alert, What is being monitored on my systems, and What alerts might trigger on my systems? Honestly, if you can do that, there’s not much more that life can throw at you.

Of course, the CIO walks up to you on your way to the elevator. “I’m glad I caught up to you,” he says, “I just have a quick question…”

Stay tuned for the bonus question!

#FeatureFriday: What is SNMP?

Welcome to “Feature Friday”, a series of (typically short) videos which explain a feature, function, or technique.

Today’s episode finds Chris O’Brien and me talking about one of the other old workhorses of monitoring: SNMP. We tease apart the structure of SNMP objects, how systems interact with them, and some tricks for usage in your monitoring environment.


For more insights into monitoring, as well as random silliness, you can follow me on Twitter (@LeonAdato) or find me on the SolarWinds THWACK.com forums (@adatole)

The Four Questions, Part 3

In the first part of this series, I described the four (ok, really five) questions that monitoring professionals are frequently asked: Why did I get an alert? Why didn’t I get an alert? What is being monitored on my system? What will alert on my system? and What standard monitoring do you do? My goal in this post is to give you the tools you need to answer the third question:

What is being monitored on my system?

You can find information on the first question (Why did I get this alert) here.
And information on the second question (Why DIDN’T I get an alert) here.

Reader’s Note: While this article uses examples specific to the SolarWinds monitoring platform, my goal is to provide information and techniques which can be translated to any toolset.

Not so fast, my friend!

It’s 1:35pm. Your first two callers—the one who wanted to know why they got a particular alert and the one who wanted to know why they didn’t get an alert—are finally a distant memory, and you’ve managed to squeeze out some productive work setting up a monitor that will detect when your cellular backup circuit will…

That’s when your manager ambles over and looks at you expectantly over the cube wall.

“I just met with the manager of the APP-X support team,” he tells you. “They want a matrix of what is monitored on their system.”

To his credit he adds, “I checked on the system in the Reports section, but nothing jumped out at me. Did I overlook something?”

This question is solved with a combination of foresight (knowing you are going to get asked this question), preparation, and know-how.

It is also one of the questions where the steps are extremely tool-specific. An agent-based solution will have a lot of this information embedded in the agent, whereas with an agentless solution like SolarWinds you will probably find what you need in a central database or configuration system.

Understand that this is a question that you can answer, and with some preparation you can have the answer with the push of a button (or two). But like so many solutions in this series, preparation now will save you from desperation later.

It’s also important to recognize that this type of report is absolutely essential, both to you and to the owner of the systems.

<PHILOSOPHY>

I believe strongly that monitoring is more than just watching the elements which will create alerts (be they tickets, emails, pager messages, or an ahoogah noise over the loudspeakers). Your monitoring scope should cover elements which are used for capacity planning, performance analysis, and forensics after a critical event. For example, you may never alert on the size of the swap drive, but you will absolutely want to know what its size was during the time of a crash. For that reason, knowing what is monitored is essential, even if you won’t alert on some of those elements.

</PHILOSOPHY>

The knee bone’s connected to the…

In order to answer this question, the first thing you should do is break down the areas of monitoring data. That can include:

  1. Hardware information for the primary device – CPU and RAM are primary items but there are other aspects. The key is that you are ONLY interested in hardware that affects the whole “chassis” or box, not sub-elements like cards, disks, ports, etc.
  2. Hardware sub-elements – You may find that you have one, more than one, or none. Network cards, disks, external mount points, and VLANs are just a few examples.
  3. Specialized hardware elements – Fans, power supplies, temperature sensors, and the like.
  4. Application components – PerfMon counters, services, processes, logfile monitors and more—all the things that make up a holistic view of an application.

And now… I’m going to stop. While there are certainly many more items on that list, if you can master the concept of those first few bullets, adding more should come fairly easily.

It should be noted that this type of report is not a fire-and-forget affair. It’s more of a labor of love that you will come back to and refine over time.

I also need to point out that this will likely not be a one-size-fits-all-device-types solution. The report you create for network devices like routers and access points may need to be radically different from the one you build for server-type devices. Virtual hosts may need data points that have no relevance to anything else. And specialty devices like load balancers, UPS-es, or environmental systems are in a class of their own.

Finally, in order to get what you want, you also have to understand how the data is stored, and be extremely comfortable interacting with that system.  Because the tool I have handy is SolarWinds, that’s the data structure we’re going to use here.

As I mentioned earlier, this type of report is push-button simple on some toolsets. If that’s the case for you, then feel free to stop reading and walk away with the knowledge that you will be asked this question on a regular basis, and you should be prepared.

For those using a toolset where answering this question requires effort, read on!

select nodes.statusled, nodes.nodeid, nodes.caption,
s1.LED, s1.elementtype, s1.element, s1.Description
from nodes
left join (
select '01' as elementorder, 'HW' as elementtype, APM_HardwareAlertData.nodeid, APM_HardwareAlertData.HardwareStatus as LED, APM_HardwareAlertData.Model as element, APM_HardwareAlertData.SensorsWithStatus as description
from APM_HardwareAlertData
union all
select '02' as elementorder, 'NIC' as elementtype, interfaces.nodeid as NodeID, interfaces.statusled as LED, interfaces.InterfaceName as element, interfaces.InterfaceTypeDescription as Description
from interfaces
union all
select '03' as elementorder, 'DISK' as elementtype, volumes.nodeid as NodeID, volumes.statusled as LED, volumes.caption as element, volumes.VolumeType as description
from volumes
) s1
on nodes.nodeid = s1.NodeID
Order by nodes.nodeid ,  s1.elementorder asc, s1.element

Catch all that? OK, let’s break this down a little. The basic format of this structure is:

Get device information (name, IP, etc)
            Get sub-element information (name, status, description)

The key to this process is standardizing the sub-element information so that it’s consistent across each type of element.

One thing to note is that the SQL “union all” command will let you combine results from two separate queries – such as a query to the interfaces table and another to the volumes table. BUT it requires that each query return the same number of columns. In my example I’ve kept it simple – just the name and description, really.

The other trick I learned was to add icons rather than text wherever possible. That includes the “statusLED” and “status” columns, which display dots instead of text when rendered by the SolarWinds report engine. I find this gives a much quicker sense of what’s going on (oh, they’re monitoring this disk, but it’s currently offline).

Another addition worth noting is:

            select 'xx' as elementorder, 'yyy' as elementtype,

I use element order to sort each of the query blocks, and elementtype to give the report reader a clue as to the source of this information (disk, application, hardware, etc.)

But what if I want to include data points that exist for some elements, but not for others? Well, you still have to ensure that each query connected by “union all” returns the same number of columns. So if we wanted to include circuit capacity for interfaces and disk capacity for disks, but nothing for hardware, it would look like this:

select nodes.statusled, nodes.nodeid, nodes.caption,
s1.LED, s1.elementtype, s1.element, s1.Description, s1.capacity
from nodes
left join (
select '01' as elementorder, 'HW' as elementtype, APM_HardwareAlertData.nodeid, APM_HardwareAlertData.HardwareStatus as LED, APM_HardwareAlertData.Model as element, APM_HardwareAlertData.SensorsWithStatus as description, '' as capacity
from APM_HardwareAlertData
union all
select '02' as elementorder, 'NIC' as elementtype, interfaces.nodeid as NodeID, interfaces.statusled as LED, interfaces.InterfaceName as element, interfaces.InterfaceTypeDescription as Description, interfaces.InterfaceSpeed as capacity
from interfaces
union all
select '03' as elementorder, 'DISK' as elementtype, volumes.nodeid as NodeID, volumes.statusled as LED, volumes.caption as element, volumes.VolumeType as description, volumes.VolumeSize as capacity
from volumes
) s1
on nodes.nodeid = s1.NodeID
Order by nodes.nodeid ,  s1.elementorder asc, s1.element

By adding '' as capacity to the hardware block (and any other section where it’s needed), we avoid errors with the union all command.

Conspicuous by their absence in all of this are the things I listed first on the “must have” list: CPU, RAM, etc.

In this case, I used a couple of simple tricks: for CPU, I give the count of CPUs, since any other data (current processor utilization, etc.) is probably not helpful. For RAM, the solution is even simpler: I just queried the nodes table again and pulled out JUST the TotalMemory value.

select nodes.statusled, nodes.nodeid, nodes.caption, s1.LED, s1.elementtype, s1.element, s1.Description, s1.capacity
from nodes
left join (
select '01' as elementorder, 'CPU' as elementtype, c1.NodeID, 'Up.gif' as LED, 'CPU Count:' as element, CONVERT(varchar,COUNT(c1.CPUIndex)) as description, '' as capacity
from (select DISTINCT CPUMultiLoad.NodeID, CPUMultiLoad.CPUIndex from CPUMultiLoad) c1
group by c1.NodeID
union all
select '02' as elementorder, 'RAM' as elementtype, nodes.NodeID, 'Up.gif' as LED, 'Total RAM' as element, CONVERT(varchar,nodes.TotalMemory) as description, '' as capacity
from nodes
union all
select '03' as elementorder, 'HW' as elementtype, APM_HardwareAlertData.nodeid, APM_HardwareAlertData.HardwareStatus as LED, APM_HardwareAlertData.Model as element, APM_HardwareAlertData.SensorsWithStatus as description, '' as capacity
from APM_HardwareAlertData
union all
select '04' as elementorder, 'NIC' as elementtype, interfaces.nodeid as NodeID, interfaces.statusled as LED, interfaces.InterfaceName as element, interfaces.InterfaceTypeDescription as Description, interfaces.InterfaceSpeed as capacity
from interfaces
union all
select '05' as elementorder, 'DISK' as elementtype, volumes.nodeid as NodeID, volumes.statusled as LED, volumes.caption as element, volumes.VolumeType as description, volumes.VolumeSize as capacity
from volumes
union all
select '06' as elementorder, 'APP' as elementtype, APM_AlertsAndReportsData.nodeid as NodeID, APM_AlertsAndReportsData.ComponentStatus as LED, APM_AlertsAndReportsData.ComponentName as element, APM_AlertsAndReportsData.ApplicationName as description, '' as capacity
from APM_AlertsAndReportsData
union all
select '07' as elementorder, 'POLLER' as elementtype, CustomPollerAssignmentView.NodeID, 'up.gif' as LED, CustomPollerAssignmentView.CustomPollerName as element, CustomPollerAssignmentView.CustomPollerDescription as description, '' as capacity
from CustomPollerAssignmentView
) s1
on nodes.nodeid = s1.NodeID
Order by nodes.nodeid ,  s1.elementorder asc, s1.element

In this iteration I found that the data output from some sources was integer, some was float, and some was even text! So I started using the “CONVERT()” option to keep everything in the same format.

The result looks something like this:

I could stop here and you would have, more or less, the building blocks you need to build your own “What is monitored on my system?” report. But there is one more piece that takes this type of report to the next level.

Including the built-in thresholds for these elements adds complexity to the query, but it also adds an entirely new (and important) dimension to the information you are providing.

More than ever, the success of this type of report lies in your knowing where threshold data is kept. In the case of SolarWinds, a series of “Thresholds” views (InterfacesThresholds, NodesPercentMemoryUsedThreshold, NodesCpuLoadThreshold, and so on) makes the job easier but you still have to know where thresholds are kept for applications, and that there are NO built-in thresholds for disks or custom pollers.

With that said, the final report query would look like this:

select nodes.statusled, nodes.nodeid, nodes.caption, s1.LED, s1.elementtype, s1.element, s1.Description, s1.capacity,
s1.threshold_value, s1.warn, s1.crit
from nodes
left join (
select '01' as elementorder, 'CPU' as elementtype, c1.NodeID, 'Up.gif' as LED, 'CPU Count:' as element, CONVERT(varchar,COUNT(c1.CPUIndex)) as description, '' as capacity, 'CPU Utilization' as threshold_value, convert(varchar, t1.Level1Value) as warn, convert(varchar, t1.Level2Value) as crit
from (select DISTINCT CPUMultiLoad.NodeID, CPUMultiLoad.CPUIndex from CPUMultiLoad) c1
join NodesCpuLoadThreshold t1 on c1.nodeID = t1.InstanceId
group by c1.NodeID, t1.Level1Value, t1.Level2Value
union all
select '02' as elementorder, 'RAM' as elementtype, nodes.NodeID, 'Up.gif' as LED, 'Total RAM' as element, CONVERT(varchar,nodes.TotalMemory) as description, '' as capacity, 'RAM Utilization' as threshold_value, convert(varchar, NodesPercentMemoryUsedThreshold.Level1Value) as warn, convert(varchar, NodesPercentMemoryUsedThreshold.Level2Value) as crit
from nodes
join NodesPercentMemoryUsedThreshold on nodes.nodeid = NodesPercentMemoryUsedThreshold.InstanceId
union all
select '95' as elementorder, 'HW' as elementtype, APM_HardwareAlertData.nodeid, APM_HardwareAlertData.HardwareStatus as LED, APM_HardwareAlertData.Model as element, APM_HardwareAlertData.SensorsWithStatus as description, '' as capacity, '' as threshold_value, '' as warn, '' as crit
from APM_HardwareAlertData
union all
select '03' as elementorder, 'NIC' as elementtype, interfaces.nodeid as NodeID, interfaces.statusled as LED, interfaces.InterfaceName as element, interfaces.InterfaceTypeDescription as Description, interfaces.InterfaceSpeed as capacity, 'bandwidth in/out' as threshold_value,
convert(varchar, i1.Level1Value)+'/'+convert(varchar,i2.level1value) as warn, convert(varchar, i1.Level2Value)+'/'+convert(varchar,i2.level2value) as crit
from interfaces
join (select InterfacesThresholds.instanceid, InterfacesThresholds.level1value , InterfacesThresholds.level2value
       from InterfacesThresholds where InterfacesThresholds.name = 'NPM.Interfaces.Stats.InPercentUtilization') i1 on interfaces.interfaceid = i1.InstanceId
join (select InterfacesThresholds.instanceid, InterfacesThresholds.Level1Value, InterfacesThresholds.level2value
       from InterfacesThresholds where InterfacesThresholds.name = 'NPM.Interfaces.Stats.OutPercentUtilization') i2 on interfaces.interfaceid = i2.InstanceId
union all
select '04' as elementorder, 'DISK' as elementtype, volumes.nodeid as NodeID, volumes.statusled as LED, volumes.caption as element, volumes.VolumeType as description, volumes.VolumeSize as capacity, '' as threshold_value, '' as warn, '' as crit
from volumes
union all
select '05' as elementorder, 'APP' as elementtype, APM_AlertsAndReportsData.nodeid as NodeID, APM_AlertsAndReportsData.ComponentStatus as LED, APM_AlertsAndReportsData.ComponentName as element, APM_AlertsAndReportsData.ApplicationName as description, '' as capacity, 'CPU Utilization' as threshold_value, convert(varchar, APM_AlertsAndReportsData.[Threshold-CPU-Warning]) as warn, convert(varchar, APM_AlertsAndReportsData.[Threshold-CPU-Critical]) as crit
from APM_AlertsAndReportsData
union all
select '06' as elementorder, 'POLLER' as elementtype, CustomPollerAssignmentView.NodeID, 'up.gif' as LED, CustomPollerAssignmentView.CustomPollerName as element, CustomPollerAssignmentView.CustomPollerDescription as description, '' as capacity, '' as threshold_value, '' as warn, '' as crit
from CustomPollerAssignmentView
) s1
on nodes.nodeid = s1.NodeID
Order by nodes.nodeid ,  s1.elementorder asc, s1.element

And here are the results:

It’s 90% Perspiration…

While answering this question requires persistence, skill, and in-depth knowledge of your monitoring toolset, the benefits are significantly greater than for the previous two questions.

Done right, teams can use this report to validate that the correct elements on each device are monitored – nothing is left out, nothing which has been decommissioned is still there. And when an alert does trigger, it will be easier to understand where you can look for hints, instead of just clicking around screens looking for something interesting.

Stock up on your tea leaves, goat entrails, and crystal balls because in my next post we’re going to take a peek into the future by answering the question “What *WILL* alert on my system?”

#FeatureFriday: What is Syslog?

Welcome to “Feature Friday”, a series of (typically short) videos which explain a feature, function, or technique.

In this installment, Chris O’Brien and I dig into the protocol/feature/tool called Syslog: How it works and how you use it effectively for monitoring in your environment.


For more insights into monitoring, as well as random silliness, you can follow me on Twitter (@LeonAdato) or find me on the SolarWinds THWACK.com forums (@adatole)

The Four Questions, Part 2

In the first part of this series, I described the four (ok, really five) questions that monitoring professionals are frequently asked: Why did I get an alert? Why didn’t I get an alert? What is being monitored on my system? What will alert on my system? and What standard monitoring do you do? My goal in this post is to give you the tools you need to answer the second question:

Why DIDN’T I get an alert?

You can find information on the first question (Why did I get this alert) here.

Reader’s Note: While this article uses examples specific to the SolarWinds monitoring platform, my goal is to provide information and techniques which can be translated to any toolset.

 

Good Morning, Dave…

It’s 9:45am. You finally got your first caller off the phone – the one who wanted to know why they got a particular alert.

You hear the joyous little “ping” of a new email. (Never mind that the joy of that ping sound got old 15 minutes after you set it up.)

“We had an outage on CorpSRV042, but never got a notification from your system. What exactly are we paying all that money for, anyway?”

Out of The Four Questions of monitoring, this one is possibly the most labor-intensive because proving why something didn’t trigger requires an intimate knowledge of both the monitoring in place and the on-the-ground conditions of the system in question.

Unlike the previous question, there’s very little preparation you can do to lessen your workload. For that reason, my advice to you is going to be more of a checklist of items and areas to look into.

I’m also going to be working on the assumption that an event really did happen, and that a monitor was in place, which (ostensibly) should have caught it.

So what could have failed?

What We Have Here is a Failure…

It’s a non-failure failure (it was designed to work that way)

Items in this category represent more of a lack of awareness on the part of the device owner about how monitoring works. Once you narrow down the alleged “miss” to one of these, the next thing you need to evaluate is whether you should provide additional end-user education (lunch-and-learns, documentation, a narrated interpretive dance piece in the cafeteria, etc).

  • Alert windows
    Some alerting systems will allow you to specify that the alert should only trigger during certain times of the day. If the problem occurred and corrected itself (or was manually corrected) outside of that window, then no alert is triggered.
  • Alert duration
    The best alerts are set up to look for more than one occurrence of the issue. Nobody wants to get a page at 2:00am when a device failed one ping. But if you have an alert that is set to trigger when ping fails for 15 minutes, 13 minutes can seem like an eternity to the application support team that already knows the app is down.
  • The alert never reset
    After the last failure, the owners of the device worked on the problem, but they never actually got it into a good enough state where your monitoring solution registered that it was actually resolved. This also happens when staff decide to close the ticket (because it was probably nothing, anyway) without looking at the system. A good example of this is the disk alert that triggers when the disk is over 90% utilized, but doesn’t reset until it’s under 70% utilized. Staff may clear logs and get things down to a nice comfy 80%, but the alert never resets. Thus when the disk fills to 100% a week later, no alert is cut. (This trigger/reset pattern is sketched as a pair of queries after this list.)
  • The device was unmanaged
    <IMAGE: unmanaged node with blue dot>It’s surprisingly easy to overlook that cute blue dot—especially if your staff aren’t in the habit of looking at the monitoring portal at all. Nevertheless, if it’s unmanaged, no alerts will trigger.
  • Mute, Squelch, Hush, or “Please-Shut-Up-Now” functions
    I’m a big believer in using custom properties for many of the things SolarWinds doesn’t do by design, and one of the first techniques I developed was a “mute” option. This gets around the issue with UnManage, where monitoring actually stops. Read more about that here (https://thwack.solarwinds.com/message/142288). With that said, if you use this type of option, you’ll also need to check its status as part of your analysis when you get this type of question.
  • Parent-Child, part 1
    The first kind of parent-child option I want to talk about is the one added to Orion Core in version 10.3. From that point forward, SolarWinds had the ability to make one device, element, application, component, group, or “thingy” (I hope I don’t lose you with these technical terms) the parent of another device, element, application, component… etc. Thus, if the parent is deemed to be down, the child will NOT trigger an alert (even if it is also down) but will rather be listed in a state of “unreachable.”
  • Parent-Child, part 2
    The second kind of suppression is implicit but often unrecognized by many users. In its simplest terms, if a device is down, you won’t get an alert about the disk being full. That one makes sense. But frequently an application team will ask why they didn’t get an alert that their app is down, and the reason is that the device was down (i.e., ping had failed) during that same period. Because of this, SolarWinds suppressed the application-down alert.
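
To make the “alert never reset” case above concrete, here is a minimal sketch of that disk alert’s trigger and reset conditions expressed as queries (this assumes the standard VolumePercentUsed column on the volumes table; your alert engine may store these conditions differently):

-- Trigger condition: fires when a monitored volume crosses 90% used
select nodes.caption, volumes.caption as volume, volumes.VolumePercentUsed
from volumes
join nodes on nodes.nodeid = volumes.nodeid
where volumes.VolumePercentUsed > 90

-- Reset condition: the alert only clears once usage drops back below 70%
-- where volumes.VolumePercentUsed < 70
-- A volume "cleaned up" to 80% satisfies neither condition, so the original
-- alert stays active and nothing new fires when the disk fills again.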

Failure with Change Control

In this section, we’re looking at changes—either in the environment or within the monitoring system—that would cause an alert to be missed. I’m calling this “change control” because if there was a record and/or awareness of the change in question (as well as how the alert is configured), the device owner would probably not be calling you.

  • A credential changed
    If someone changes the SNMP string or the AD account password you’re using to connect, your monitoring tool ceases to be able to collect. Usually you’ll get some type of indication that this has happened, but not always.
  • The network changed
    All it takes is one firewall rule or routing change to make the difference between beautiful data and total radio silence. The problem is that the most basic monitoring—ping—is usually not impacted. So you won’t get that device down message that everyone generally relies on to know something is amiss. But higher-level protocols like SNMP or WMI are blocked. So you have a device which is up (ping) but where disk or CPU information is no longer being collected.
  • A custom property changed
    As I said before, I love me some custom properties. Along with the previously mentioned “mute,” there are properties for the owner group, location, environment (prod, dev, QA, etc.), server status (build, testing, managed), criticality (low, normal, business-critical) and more. Alert triggers leverage these properties so that we can have escalated alerts for some, and just warnings for others. But what happens when someone changes a server status from “PRODUCTION” to “DEV”? An alert configured to look specifically for “PROD” servers will likely miss that machine entirely.
  • Drive (or other element) has been removed/changed
    I say drives because this seems to happen most often. If your environment doesn’t include a “disk down” alert (don’t laugh, I’ve seen them), then volumes can be unmounted or mounted with a new name with amazing frequency. When that happens, many monitoring tools do not automatically start monitoring the new element; and the tools that do almost never apply all the correct settings (like those custom properties). You end up with a situation where the device owner is completely aware of the update, but monitoring is the last to know.
  • The Server Vanished into the Virtualization Cloud
    The drive toward virtualization is strong. When a server goes from physical to virtual (P-to-V), it’s effectively a whole new machine. Even though the IP address and server name are the same, the drives go from a series of physical disks attached to a real storage controller to (usually fewer) volumes which appear to be running off a generic SCSI bus. Not only that, but other elements (interfaces, hardware monitors, CPU, and more) all completely change. Almost all monitoring tools require manual updating to track those changes, or else you are left with ghost elements that don’t respond to polling requests.

Failure of the Monitoring Technology

The previous two sections speak to educational or procedural breakdowns. But loath as I am to admit it, sometimes our monitoring tools fail us too. Here are some things you need to be aware of:

  • The element or device is not actually getting polled
    Often, this is a result of disks or other elements being removed and added; or a P-to-V migration (see previous section). But it also happens that an element simply stops getting polled. You’ll see this when you dig into the detailed data and find nothing collected for a certain period of time.
  • Polling interval is throttled
    One of the first things that a polling server does (at least the good ones) when it begins to be overwhelmed is to pull back on polling cycles so that it can collect at least SOME data from each element. You’ll see this as gaps in data sets. It’s not a wholesale loss of polling, but sort of a herky-jerky collection.
  • Polling data is out of sync
    This one can be quite challenging to nail down. In some cases, a monitoring system will add data into the central data store using localized times (either from the polling engine or (horrors!) from the target device itself). If that happens, then an event that occurred at 9am in New York shows up as having happened at 8am in Chicago. This shouldn’t be a problem unless, as mentioned in the first section, your system won’t trigger alerts before 9am.

Failure Somewhere After the Monitoring Solution

As much as you might hate to admit it, monitoring isn’t the center of the universe. And it’s not even the center of the universe with regard to monitoring. If everything within your monitoring solution checks out and you are still scratching your head, here are some additional areas to look:

  • Email (or whatever notification system you use) is down
    One of the most obvious items, but often not something IT pros think to check, is whether the system that is sending out alerts is actually alive and well. If email is down, you won’t get that email telling you the email server has crashed.
  • Event correlation rules
    Event correlation rules are wonderful, magical things. They take you beyond the simple parent-child suppression discussed earlier, and into a whole new realm of dependencies. But there are times when they inhibit a “good” alert in an unexpected way:

    • De-duplication
      The point of de-duplication is that multiple alerts won’t create multiple tickets. But if a ticket is closed without updating the event correlation system, de-dup will keep suppressing new tickets indefinitely.
    • Blackout/Maintenance windows
      Another common feature for EC systems is the ability to look up a separate data source that lists times when a device is “out of service.” This can be a recurring schedule, or a specific one-time rule. Either way, you’ll want to check if the device in question was listed on the blackout list during the time when the error occurred.
  • Already open ticket
    Ticket systems themselves can be quite sophisticated, and many have the ability to suppress new tickets if there is already one open for the same alert/device combination. If you have a team that forgets to close their old ticket, they may never hear about the new events.

Hokey religions … are no match for a good blaster, kid

After laying out all the glorious theoretical ways in which alerts can be missed, I thought it was only fair to give you some advice on techniques and tools you can use to identify these problems, resolve them, or (best of all) avoid them in the first place.

An Ounce of Prevention…

Here are some things to have in place that will let you know when all is not puppy dogs and rainbows:

  • Alerts that give you the health of the environment
    Under the heading of “who watches the watchmen,” in any moderately-sophisticated (or mission critical) monitoring environment, you should have both internal and external checks that things are working well:

    • The SolarWinds out-of-the-box (OOTB) alert that tells you SNMP has not been collected for a node in xx minutes.
    • The SolarWinds OOTB alert that tells you a poller hasn’t collected any metrics in xx minutes. (A home-grown version of this kind of check is sketched after this list.)
    • If you can swing it, running the OOTB Server & Application Monitor (SAM) templates for SolarWinds on a separate system is a great option. If having a second SolarWinds instance watching the first is simply not possible, look at the SAM template and mimic it using whatever other options you have available to you.
  • Have a way to test individual aspects of the alert stream
    It’s a horrible sinking feeling when you realize that no alerts have been going out because one piece of the alerting infrastructure failed on you. Start by understanding (and documenting) every step an alert takes, from the source device through to the ticket, email, page, or smoke signal that is sent to the alert recipient. From there, create and document the ways you can validate that each of those steps is independently working. This will allow you to quickly validate each subsystem and nail down where a message might have gotten lost.
  • So wait… you can test each alert subsystem?
    A test procedure is just a monitor waiting for your loving touch. Get it done. You’ll need to do it on a separate system (since, you know, if your alerting infrastructure is broken you won’t get an alert), but usually this can be done inexpensively. Just to be clear, once you can manually test each of your alert infrastructure components (monitoring, event correlation, etc), turn those manual tests into continuous monitors, and then set those monitors up with thresholds so you get an alert.
  • Create a deadman-switch
    The concept of a deadman switch is that you get an alert if something DOESN’T happen. In this case, you set yourself up to receive an alert if something doesn’t clear a trigger. You then send that clearing event through the monitoring system.

    • Example: Every 10 minutes, an internal process on your ticket system creates a ticket for you saying, “Monitoring is broken!” That ticket escalates, and alerts you, if it has been open for 7 minutes. Now you have your monitoring system send an alert whenever the current minutes are evenly divisible by 5. This alert is the reversing event your ticket system is looking for. As long as monitoring is working, the original ticket will be cleared before it escalates.
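
The internal “who watches the watchmen” checks above ultimately boil down to asking the database when each node last returned data. Here’s a minimal home-grown sketch, assuming the Orion nodes table exposes LastSync and UnManaged columns (verify both against your own schema before trusting it):

-- Nodes that should be polled but haven't returned data in over 30 minutes
select nodes.nodeid, nodes.caption, nodes.lastsync
from nodes
where nodes.unmanaged = 0
and datediff(minute, nodes.lastsync, getdate()) > 30
order by nodes.lastsync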

… about that pound of cure, now

Inevitably, you’ll have to dig in and analyze a failed alert. Here are some specific techniques to use in those cases:

  • Re-scan the device
    The problem might be that the device has changed. It could be a new firewall rule, an SNMP string change, or even the SNMP agent on the target device dying on you. Regardless, you’ll want to go through all the same steps you used to add the device. In SolarWinds, this means using the “Test” button under node properties, and then also running the “List Resources” option to make sure all of the interfaces, disks, and yes, even CPU and RAM options are correctly selected.

    • TRAP: Remember that List Resources will never remove elements. Un-checking them doesn’t do diddly. You have to go into Manage Nodes and specifically delete them before they are really gone.
  • Test the alert
    Are you sure your logic was solid? Be prepared to copy the alert and add a statement limiting it down to the single device in question. Then re-fire that sucker and see if it flies.
  • Message Center Kung-Fu
    The Message Center remains your best friend for forensic analysis of alerts. Things to keep in mind:
  • Set the timeframe correctly – It defaults to “today.” If you got a call about something yesterday, make sure you change this.
  • Set the message count – If you are dealing with a wide range of time or a large number of devices, bump this number up. There’s no paging on this screen, so if the event you want is #201 out of the 200 displayed, you’ll never know.
  • Narrow down to a single device – to help avoid message count issues, use the Network Object dropdown to select a specific device
  • Un-check the message categories you don’t want. Syslog and Trap are the big offenders here. But even Events can get in your way depending on what you’re trying to find.
  • Limiting Events (or Alerts) – Speaking of Events and Alerts, if you know you’re looking for something particular, use those drop-down boxes to narrow down the pile of data you need. This also lets you look for a particular alert while still seeing every other event during that time period.
  • The search box. Yes, there is one. It’s on the right side (often off the right side of the page and you have to scroll for it). If you put something in there, it acts as an additional filter along with all the stuff in the grey box.

Stay tuned for our next installment where we’ll dive into the third question: “What is being monitored on my system?”

#FeatureFriday: What is Ping?

I’m starting a new series of blog posts this week, which I’m calling “Feature Friday”. These will be (typically short) videos which explain a feature, function, or technique.

For our inaugural post, I’m going to start at the VERY beginning, with the granddaddy of all monitoring techniques: Ping.

In this video, Chris O’Brien and I dig into Ping, looking at how it works and what it can do for you from a monitoring perspective.

 


For more insights into monitoring, as well as random silliness, you can follow me on Twitter (@LeonAdato) or find me on the SolarWinds THWACK.com forums (@adatole)

The Four Questions, Part 1

 

In the first part of this series, I described the four (ok, really five) questions that monitoring professionals are frequently asked: Why did I get an alert? Why didn’t I get an alert? What is being monitored on my system? What will alert on my system? and What standard monitoring do you do? My goal in this next post is to give you the tools you need to answer the first of those:

Why did I get an alert?

Reader’s Note: While this article uses examples specific to the SolarWinds monitoring platform, the fact is that most of the techniques can be translated to any toolset.

**************
It’s 8:45am, and you are just settling in at your desk. You notice that one email came in overnight from your company’s 24-7 operations desk:

“We got an alert for high CPU on the server WinSrvABC123 at 2:37am. We didn’t notice anything when we jumped on the box. Can you explain what happened?”

Out of all of The Four Questions of monitoring, this is the easiest one to answer, as long as you have done your homework and set up your environment.

Before I dig in, I want to clarify that this is not the same question as “What WILL alert on my server?” or “What are the monitoring and alerting standards for this type of device?” (I’ll cover both of those in later parts of this series.) Here, we’re dealing strictly with a user’s reaction when they receive an alert.

I also have to stress that it’s imperative that you always take the time to answer this question. It can be annoying, tedious, and time-consuming. But if you don’t, before long all of your alerts will be dismissed as “useless.” That is the first step on a long road that leads to a CIO-mandated RFP for monitoring tools, you defending your choice of tools, and other conversations that are significantly more annoying, tedious, and time-consuming.

However, my tips below should cut down on your workload significantly. So let’s get started.

First, let’s be clear: monitoring is not alerting. Some people confuse getting a ticket, page, email, or other alert with actual monitoring. In my book, “Monitoring” is the ongoing collection of data about a particular element or set of elements. Alerting is a happy by-product of having monitoring, because once you have those metrics you can notify people when a specific metric is above or below a threshold. I say this because customers sometimes ask (or demand) that you fix (or even turn off) “monitoring.” What they really want is for you to change the alert they receive. Rarely do they really mean you should stop collecting metrics.

The bulk of your work is going to be in the way you create alert messages, because in reality, it’s the vagueness of those messages that has the recipient confused. Basically, you should ensure that every alert message contains a few key elements. Some are obvious:

  • The machine having the problem
  • The time of the alert
  • The current statistic

Some are slightly less obvious but no less important:

  • Any other identifying information about the device
    • Any custom properties indicating location, owner group, etc.
    • OS type and version (the MachineType variable)
    • The IP address
    • The DNS Name and/or Sysname variables if your device names are… less than standard
  • The threshold value which was breached
  • The duration – how long the alert has been in effect
  • A link or other reference to a place where the alert recipient can see this metric. Speaking in SolarWinds-specific terms, this could be:
    • The node Details page – using either the ${NodeDetailsURL} (or the equivalent for your kind of alert) or a “forged” URL (i.e.: “http://myserver/Orion/NetPerfMon/NodeDetails.aspx?NetObject=N:${NodeID}”)
    • A link to the metric details page. For example, the CPU average would be http://myserver/Orion/NetPerfMon/CustomChart.aspx?chartname=HostAvgCPULoad&NetObject=N:${NodeID}
    • Or even a report that shows this device (or a collection of devices where this is one member) and the metric involved

Finally, one element that should always be included in each alert:

  • The name of the alert

For your straightforward alerts, this should not be a difficult task and can be something you (almost) copy and paste from one alert to another. Here’s an example for CPU:

CPU Utilization on the ${MachineType} device owned by ${OwnerGroup} named ${NodeName} (IP: ${IP_Address}, DNS: ${DNS}) has been over ${CPU_Thresh} for more than 15 minutes. Current load at ${AlertTriggerTime} is ${CPULoad}.

View full device details here: ${NodeDetailsURL}.
Click here to acknowledge the alert: ${AcknowledgeTime}

This message was brought to you by the alert: ${AlertName}

While it means more work during alert setup, having an alert with this kind of messaging means that the recipient has several answers to the “Why did I get this alert?” question at their fingertips:

  • They have everything they need to identify the machine – which team owns it, what version of OS it’s running, and the office or data center where it’s located.
  • They have what they need to connect to the device – whether by name, DNS name, IP address, etc.
  • They know what metric (CPU) triggered the alert.
  • They know when the problem was detected (because let’s face it, sometimes emails DO get delayed).
  • They have a way to quickly get to a status screen (i.e.: the Node details page) to see the history of that metric and hopefully see where the spike occurred.

Finally, by including the ${AlertName}, you’re enabling the recipient to help you help them. You now know precisely which alert to research. And that’s critical, because there are more things you should be prepared to do.

There is one more value you might want to include if you have a larger environment, and that’s the name of the SolarWinds polling engine. There are times when a device is moved to the wrong poller—wrong because of networking rules, AD membership, etc. Having the polling engine in the message is a good sanity check in this situation.
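
If you want to double-check that assignment outside of the alert message, here is a minimal query sketch (this assumes the standard Nodes.EngineID column and the Engines table; confirm the names against your own database):

-- Which polling engine currently owns each node?
select nodes.caption, engines.servername as polling_engine
from nodes
join engines on nodes.engineid = engines.engineid
order by engines.servername, nodes.caption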

Let’s say that the owner of the device is still unclear why they received the alert. (Hey, it happens!) With the information the recipient can give you from the alert message, you can now use the following tools and techniques:

The Message Center

Some people live and die on this screen. Some never touch it. But in this case, it can be your best friend. Note two specific areas:

  • The Network Object drop-down – this lets you zero in on just the alerts from the offending device. Step one is to look at EVERYTHING coming off this box for the time period. Events, alerts, etc. See if this builds a story about what may have led up to the event.
  • The Alert Name drop-down under Triggered Alerts – this allows you to look at ALL of the instances when this alert triggered, or further zero in on the one event you are trying to find.

Side Note: The Time Period drop-down is critical here. Make sure you set it to show the correct period of time for the alert or else you’re going to constantly miss the mark.

Using these two simple controls in Message Center, you (and your users) should be able to drill into the event stream around the ticket time. Hopefully that will answer their question.

If you do it right (meaning take your time explaining what you are doing in a meeting, or using a screen share; maybe even come up with some light “how to” documentation with screen shots), users—especially those in heavy support roles—will learn over the course of time to analyze alerts on their own.

But what about the holdouts? The ones where Message Center hasn’t shown them (or you) what you hoped to see. What then?

Be prepared to test your alert. It’s something you should do every time you’re ready to release a new alert into your environment. Also remember that sometimes you get busy, and sometimes you test everything, but then the situation on the ground changes without your participation.

So, however you got here, you need to go back to the testing phase.

  • Make a copy of the alert. Never test a live production alert. There’s a COPY button in the alert manager for that very reason.
  • Change the alert copy by adding a trigger condition for the machine in question. JUST that machine (i.e.: “where node caption is equal to WinSrvABC123”).
  • Set your triggering criteria (“CPULoad > 90%” or whatever) to a value so low that it’s guaranteed to trigger. The combined test trigger ends up looking something like the sketch below.
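
Put together, the test copy’s trigger boils down to something like this sketch (using the standard Caption and CPULoad columns on the nodes table; the point is the single-node filter plus the absurdly low threshold):

-- Test copy of the CPU alert: scoped to one node, with the threshold set so
-- low it is guaranteed to fire. Delete this copy when testing is done.
select nodes.nodeid, nodes.caption, nodes.cpuload
from nodes
where nodes.caption = 'WinSrvABC123'
and nodes.cpuload > 1   -- production value is 90; lowered only for the test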

At that point, test the heck out of that bugger until both you and the recipient are satisfied that it works as expected. Copy whatever modifications you need over to the existing alert, and beware that updating the alert trigger will cause any existing alerts to re-fire. So you may need to hold off on those changes until a quieter moment.

Stay tuned for our next installment: “Why didn’t I get an alert?”

IT Monitoring Scalability Planning: 3 Roadblocks

Planning for growth is key to effective IT monitoring, but it can be stymied by certain mindsets. Here’s how to overcome them.

As IT professionals, planning for growth is something we do all day almost unconsciously. Whether it’s a snippet of code, provisioning the next server, or building out a network design, we’re usually thinking: Will it handle the load? How long until I’ll need a newer, faster, or bigger one? How far will this scale?

Despite this almost compulsive concern with scalability, there are still areas of IT where growth tends to be an afterthought. One of these happens to be my area of specialization: IT monitoring. So, I’d like to address growth planning (or non-planning) as it pertains to monitoring by highlighting several mindsets that typically hinder this important, but often surprisingly overlooked element, and showing how to deal with each.

The fire drill mindset
This occurs when something bad has already happened, either because there was no monitoring solution in place or because the existing toolset didn’t scale far enough to detect a critical failure, and so it was missed. Regardless, the result is usually a focus on finding a tool that would have caught the problem that already occurred, and finding it fast.

However, short of a TARDIS, there’s no way to implement an IT monitoring tool that will help avoid a problem after it occurs. Furthermore, moving too quickly as a result of a crisis can mean you don’t take the time to plan for future growth, focusing instead solely on solving the current problem.

My advice is to stop, take a deep breath, and collect yourself. Start by quickly but intelligently developing a short list of possible tools that will both solve the current problem and scale with your environment as it grows. Next, ask the vendors if they have free (or cheap) licenses for in-house demoing and proofs of concept.

Then, and this is where you should let the emotion surrounding the failure creep back in, get a proof-of-concept environment set up quickly and start testing. Finally, make a smart decision based on all the factors important to you and your environment. (Hint: one of which should always be scalability.) Then implement the tool right away.

The bargain hunter
The next common pitfall that often prevents better growth planning when implementing a monitoring tool is the bargain-hunter mindset. This usually occurs not because of a crisis, but when there is pressure to find the cheapest solution for the current environment.

How do you overcome this mindset? Consider the following scenario: If your child currently wears a size 3 shoe, you absolutely don’t want to buy a size 5 today, right? But you should also recognize that your child is going to grow. So, buying enough size 3 shoes for the next five years is not a good strategy, either.

Also, if financials really are one of the top priorities preventing you from better preparing for future growth, remember that the cheapest time to buy the right-sized solution for your current and future environment is now. Buying a solution for your current environment alone because “that’s all we need” is going to result in your spending more money later for the right-sized solution you will need in the future. (I’m not talking about incrementally more, but start-all-over-again more.)

My suggestion is to use your company’s existing business growth projections to calculate how big of a monitoring solution you need. If your company foresees 10% revenue growth each year over the next three years and then 5% each year after that, and you are willing to consider completely replacing your monitoring solution after five years, then buy a product that can scale to at least 40% beyond what you currently need. (If that growth compounds, it actually works out closer to 47%, so round up.)

The dollar auction
The dollar auction mindset happens when there is already a tool in place — a tool that wasn’t cheap and that a lot of time was spent perfecting. The problem is, it’s no longer perfect. It needs to be replaced because company growth has expanded beyond its scalability, but the idea of walking away from everything invested in it is a hard pill to swallow.

Really, this isn’t so much a mindset that prevents preparing for future growth as it is an expensive lesson learned too late: you should have planned for growth the first time around. The reality is that if you’re in this situation, you need a new solution. Just don’t make the same mistake again. This time, take scalability into account.

Whether you’re suffering from one of these mindsets or another that is preventing you from better preparing your IT monitoring for future growth, remember: scalability is key to long-term success.

(This article originally appeared on NetworkComputing)

Time for a network monitoring application? What to look for

You might think that implementing a network monitoring tool is like every other rollout. You would be wrong.

Oh, so you’re installing a new network monitoring tool, huh? No surprise there, right? What, was it time for a rip-and-replace? Is your team finally moving away from monitoring in silos? Perhaps there were a few too many ‘Let me Google that for you’ moments with the old vendor’s support line?

Let’s face it. There are any number of reasons that could have led you to this point. What’s important is that you’re here. Now, you may think a new monitoring implementation is no different than any other rollout. There are some similarities, but there are also some critical elements that are very different. How you handle these can mean the difference between success and failure.

I’ve found there are three primary areas that are often overlooked when it comes to deploying a network monitoring application. This isn’t an exhaustive list, but taking your time with these three things will pay off in the end.

Scope: First, consider how far and how deep you need the monitoring to go. This will affect every other aspect of your rollout, so take your time thinking this through. When deciding how far, ask yourself the following questions:

  • Do I need to monitor all sites, or just the primary data center?
  • How about the development, test or quality assurance systems?
  • Do I need to monitor servers or just network devices?
  • If I do need to include servers, should I cover every OS or just the main one(s)?
  • What about devices in DMZs?
  • What about small remote sites across low-speed connections?

And when considering how deep to go, ask these questions:

  • Do I need to also monitor up/down for non-routable interfaces (e.g., EtherChannel connections, multiprotocol label switching links, etc.)?
  • Do I need to monitor items that are normally down and alert when they’re up (e.g., cold standby servers, cellular wide area network links, etc.)?
  • Do I need to be concerned about virtual elements like host resource consumption by virtual machine, storage, security, log file aggregation and custom, home-grown applications?

Protocols and permissions: After you’ve decided which systems to monitor and what data to collect, you need to consider the collection methods to use. Protocols such as Simple Network Management Protocol (SNMP), Windows Management Instrumentation (WMI), syslog and NetFlow each have their own permissions and connection points in the environment.

For example, many organizations plan to use SNMP for hardware monitoring, only to discover it’s not enabled on dozens, or even hundreds, of systems. Alternatively, they find out it is enabled, but the community strings are inconsistent, undocumented or unset. Then they go to monitor in the DMZ and realize that the security policy won’t allow SNMP across the firewall.

Additionally, remember that different collection methods have different access schemes. For example, WMI uses a Windows account on the target machine. If that account isn’t there, has the wrong permissions or is locked out, monitoring won’t work. Meanwhile, SNMP uses a simple community string that can be different on each machine.
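
Before the rollout, it can pay to sweep the environment and confirm which hosts actually answer SNMP with the community string you have on file. Below is a minimal sketch of that kind of check, assuming the net-snmp command-line tools are installed; the hosts.txt file and the "public" community value are hypothetical placeholders, not anything prescribed by a particular product:

    # Rough pre-rollout check: can we poll each host with the community string
    # we think is configured? Assumes the net-snmp CLI (snmpget) is installed.
    # hosts.txt and the community value are hypothetical placeholders.
    import subprocess

    COMMUNITY = "public"                 # replace with your documented string
    OID_SYSDESCR = "1.3.6.1.2.1.1.1.0"   # sysDescr.0 -- answers if SNMP is reachable

    with open("hosts.txt") as f:
        hosts = [line.strip() for line in f if line.strip()]

    for host in hosts:
        result = subprocess.run(
            ["snmpget", "-v2c", "-c", COMMUNITY, "-t", "2", "-r", "1", host, OID_SYSDESCR],
            capture_output=True, text=True,
        )
        status = "OK" if result.returncode == 0 else "no response / wrong community"
        print(f"{host}: {status}")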

Architecture: Finally, look at the architecture of the tools you’re considering. This breaks down to two things: connectivity and scalability.

First, let’s consider connectivity. Agent-based platforms have on-device agents that collect and store data locally, then forward large data sets to a collector at regular intervals. Each collector bundles and sends this data to a manager-of-managers, which passes it to the repository. Meanwhile, agentless solutions use a collector that directly polls source devices and forwards the information to the data store.

You need to understand the connectivity architecture of these various tools so you can effectively handle DMZs, remote sites, secondary data centers and the like. You also need to look at the connectivity limitations of various tools, such as how many devices each collector can support and how much data will be traversing the wire, so you can design a monitoring implementation that doesn’t cripple your network or collapse under its own weight.
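
To make the “how much data will be traversing the wire” question concrete, a quick estimate like the sketch below is often enough to tell you whether a single collector, and the link it sits behind, will cope. Every number here is an illustrative placeholder rather than a figure from any particular product:

    # Back-of-the-envelope load estimate for an agentless collector.
    # All inputs are illustrative placeholders; substitute your own element
    # counts, polling interval, and payload size.
    devices = 1_000
    elements_per_device = 25          # interfaces, volumes, CPUs, etc.
    polling_interval_sec = 300        # 5-minute polling cycle
    bytes_per_element_poll = 400      # rough SNMP request/response payload

    polls_per_sec = devices * elements_per_device / polling_interval_sec
    bandwidth_kbps = polls_per_sec * bytes_per_element_poll * 8 / 1_000

    print(f"Polls per second per collector: {polls_per_sec:,.0f}")
    print(f"Steady-state polling traffic:   {bandwidth_kbps:,.1f} kbps")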

Next comes scalability. Understand what kind of load the monitoring application will tolerate, and what your choices are to expand when (yes, when, not if) you hit that limit. To be honest, this is a tough one, and many vendors hope you’ll accept some form of an ‘it really depends’ response.

In all fairness, it really does depend, and some things are simply impossible to predict. For example, I once had a client who wanted to implement syslog monitoring on 4,000 devices. It ended up generating upwards of 20 million messages per hour (more than 5,500 per second). That was not a foreseeable outcome.

By taking these key elements of a monitoring tool implementation into consideration, you should be able to avoid most of the major missteps monitoring rollouts suffer from. And the good news is that from there, the same techniques that serve you well during other implementations will help here: ask lots of questions; meet with customers in similar situations (similar environment size, business sector and so on); set up a proof of concept first; engage experienced professionals to assist as necessary; and be prepared, both financially and psychologically, to adapt as wrinkles crop up. Because they will.

(This article originally appeared on SearchNetworking)

ICYMI: IT monitoring: ignore it at your peril

This interview was originally posted on http://onlyit.ca

To many businesses, IT monitoring software is a luxury they cannot afford. That mindset is dangerous, however: not monitoring your IT infrastructure can cost you in stolen data and a damaged reputation. Leon Adato, who holds the title of “head geek” at SolarWinds, shared his thoughts on why IT monitoring software is vital to the health of companies, as well as the consequences of ignoring the need to monitor your IT infrastructure.

“Over the course of my 25 years in IT, with 12 years specifically focused on monitoring, I would say that more often than not (say, 60 percent of the time) businesses lack a gut understanding that monitoring helps save them money, and lots of it,” said Adato. “In addition, I’ve never seen a company, large or small, actually do the work to estimate and document the savings monitoring provides, either overall or on a per-alert basis.”

Adato recounted an anecdote from when he first started working in IT. “As an example, early in my career when I was doing desktop support, I got a call that the barcode printer on ‘production line seven’ was down,” he remarked. “When I got there, I realized the fix was going to take some time. It was the end of the day, I was tired and I wanted to get home. I figured this particular printer issue could wait until the next day. The guy working that line said to me, ‘I completely understand if you’ve got responsibilities, but let me make sure you understand the choice you are making: each one of these circuit boards is $10,000 of profit, and we don’t get the money until they ship, and they don’t ship until they get a barcode from that printer.’ I realized I was looking at 4 racks with about 150 boards per rack. I made a few calls and stayed late to get the printer back up and running.”

“The point of the story is that the guy on the line knew exactly what the cost breakdown was,” Adato continued. “He knew the material costs, labor costs, gross and net revenue, and he could have told you per minute, per hour, per production line how much money was being lost. That’s not uncommon in production environments. Unfortunately, companies usually don’t approach IT monitoring and alerting with the same attitude and level of awareness, even though they could, and in my opinion, certainly should.”

Even if businesses have some type of IT monitoring in place, it might not span the entire business. “Monitoring is always happening, whether it’s a server tech who checks all his servers manually from time to time (‘monitoring via eyeballs’) or teams that implement their own ‘skunkworks’ systems,” Adato commented. “People in the trenches don’t like surprises. Those systems will be narrowly focused, though, and will probably overlap in terms of features as well as scope. For example, the server team and the Exchange team might both monitor the same server, possibly using two different tools that collect much of the same data.” This approach is inefficient and not cost-effective.

Adato cited the benefits of a business-wide IT monitoring program. He noted that it provides “the ability to have ongoing metrics that allow for capacity planning, forensic analysis of unexpected events (there will always be black swans) and the shortening of not only mean time to repair but also mean time to innocence by using data to prove that something, such as the network, is not at fault so efforts can be focused elsewhere.”

SolarWinds’ head geek acknowledged that businesses will need to invest financial and personnel resources into IT monitoring. Furthermore, IT monitoring can shatter some illusions about infrastructure. “[There is] the potentially unhappy realization that the environment is not as stable as you thought it was,” Adato said. He sees a silver lining to that situation, though. “Of course, this is a good thing masquerading as a bad thing because knowing there’s a previously-undetectable problem is the first step to fixing it before it blows up,” Adato concluded.