Category Archives: SolarWinds

#FeatureFriday: Validating SolarWinds Database Maintenance

Welcome to “Feature Friday”, a series of (typically short) videos which explain a feature, function, or technique.

One of the key aspects of SolarWinds tools is their ease of use. Other enterprise-class monitoring solutions require you to have subject matter experts on hand to help with implementation and maintenance. SolarWinds can be installed by a single technician and doesn’t require you to have a DBA or Linux expert or CCIE on hand. 

But that doesn’t mean there’s no maintenance happening. And while a lot of it is automated, it’s important for folks who are responsible for the SolarWinds toolset to understand whether that maintenance is running correctly.

In this video, Head Geeks Kong Yang, Patrick Hubbard, and I go over the SolarWinds maintenance subroutines and how to see whether things are happy or not under the hood.


For more insights into monitoring, as well as random silliness, you can follow me on Twitter (@LeonAdato) or find me on the SolarWinds THWACK.com forums (@adatole)

The Four Questions, Part 5

In the first part of this series, I described the four (ok, really five) questions that monitoring professionals are frequently asked. You can read that introduction here. Information on the first question (Why did I get this alert) is here. You can get the low-down on the second question (Why DIDN’T I get an alert) here. The third question (What is monitored on my system) is here. And the fourth question (What alerts WILL trigger for my system) is posted here.

But as I’ve hinted all along, and despite the title of the series, there are actually five questions that you will be asked when you start down the path to becoming a monitoring engineer. The fifth, and final, question that you will most likely be asked is:

What do you monitor “standard”?

Reader’s Note: While this article uses examples specific to the SolarWinds monitoring platform, my goal is to provide information and techniques which can be translated to any toolset.

No Rest for the Weary…

You thought you were going to make it out of the building without another task hanging over your head, but the CIO caught you in the elevator and decided to have a “quick chat” on the way down.

 “I’m glad I caught up to you,” he says, “I just have a quick question…”

“I love what you’re doing with the monitoring system,” he begins. “but one thing that I keep hearing from support teams is that they feel like every new device has to be monitored from the ground up. Is there a way we can just have some monitors in place as soon as it enters the system?”

Choosing your words carefully (it *is* the CIO, after all) you respond “Well, there’s a whole raft of metrics we collect as soon as any new system is added into monitoring. Then we augment that with metrics based on what the new device provides and the owner and support teams need.”

“That’s a relief,” smiles your CIO as you arrive at your car. He is now literally standing between you and a well-earned beer. “But what our teams – your customers, you know,” he adds with a chuckle, “need to know is what those standard options are. Like when we’re buying a new car,” he says, eyeing your 2007 rustbucket. “Monitoring should list which features are standard, and which are optional upgrades. That shouldn’t be too hard, right?”

“Not at all!” you chirp, your enthusiasm having more to do with the fact that he’s moved out of the way than the specter of new work.

“Excellent,” he says. “Can you pull it together for me to look over tomorrow?” Without waiting for an answer, he calls “Have a great night!” over his shoulder.

Standard Time

The reason this (“what do you monitor standard?”) is a common question has a lot to do with how monitoring solutions usually come into being in organizations; how monitoring teams are established; and how conversations between you (the monitoring engineer) and the recipient of monitoring tend to go.

First, while in some cases monitoring solutions are conceived and implemented as a company-wide initiative in a greenfield scenario, those cases are rare. More often what ends up being the company standard started off as a departmental (or even individual) initiative, which somebody noticed and liked and promoted until it spread.

“Standard procedures? We do it like this because Harold did it that way when he installed this thing 3 years ago.”

Second, monitoring teams often form around the solution (product XYZ) OR around the cause (“what do we want?” MONITORING! “When do we want it?” PING!). Unlike more mature IT disciplines like networking, storage, or even the relative newcomer virtualization, people don’t usually set out to become monitoring engineers. And thus, there are precious few courses, books, or even tribal knowledge to fall back on to understand how it is “usually” done. Teams form and, for better or worse, begin to write their own set of rules.

Third (and last), because of the preceding two points, conversations between the ersatz monitoring engineer and the monitoring services consumer (the person or team who will read the reports, look at the screens, and/or receive the alerts) tend to have a number of disconnects. More on that in a moment.

What We Have Here…

Put yourself in the shoes of that monitoring consumer I mentioned a minute ago. You’ve got a new device, application, or service which needs to be monitored. At best, you need it monitored because you realize the value monitoring brings to the table. At worst, you need it monitored because you were told your device, application, or service is a class III critical element to the organization, and therefore monitoring is mandatory. You just need to check the little box that says “monitoring” so that you can get this sucker rolled into production.

So you make the call and set up a meeting with someone from the vaguely shadowy sounding ‘monitoring team’ (what, do they manage all the keylogger and spyware software you’re sure this place is crawling with?) and a little while later, you’re sitting down in a conference room with them.

You explain your new project, device, application, or service. The person is taking notes, nodding at all the right places. This is looking positive. Then they look up from their notes and ask:

“So, what do you need to have monitored here?”

In your head, you’re thinking “I thought YOU were supposed to tell ME that. Why do I have to do all the heavy lifting here?”

But you’re a pro so of course you don’t say that. Instead you reply, “Well, I’m sure whatever you monitor standard should be fine.”

The monitoring person looks mildly annoyed, and asks “Anything ELSE? I mean, this is a pretty important <device/application/service>, right?”

So now you DO open your mouth. “Well, it’s hard to tell since I’m not sure what standard monitoring is!”

“Oh, come on,” comes the retort. “You’ve been using XYZ monitoring for 2 years now. You know what we monitor on the other stuff for you. It’s…”

And then they do the most infuriating thing: they rattle off a list of words. Some of them are components, some are counters, some sound vaguely familiar, and others you’ve never heard of. And there they sit with their arms crossed, looking at you across the table.

*************

While the above scene is probably (hopefully) more dramatic than the ones you’ve encountered in real life, it nevertheless captures the essence of the issue. If the monitoring team and those who benefit from monitoring are working from different playbooks, the overall effectiveness of monitoring (not to mention YOU) is going to be impacted.

So how do you avoid this? Or, if you’ve already fallen into friction, how do you get past it? As with the rest of this series, I have a few suggestions, which largely boil down to “knowing you will be asked this question is half the battle” because now you can prepare.

But the daemon is, as they say, in the details. So let’s dig in!

好記性,不如爛筆頭

(“A good memory is no match for a worn pen nib”)

Yes, my first suggestion is going to be to make sure you have good documentation.

Before you quail at the thought of mountains of research, keep in mind that you aren’t documenting the specific thresholds, values, etc. for each node. You are compiling a list of the default values which are collected when a new system is added into your monitoring solution. Depending on your software, that could be a consistent list regardless of device type, or it could be a core set of common values along with a few additional variations by device type.

Your first stop should always be to RTFM, which stands, of course, for “Read The Friendly Manual”. Many vendors have already done this work for you, and all you need to do is copy and paste that information into your documentation.

But in the case that your vendor considers such information to be legally protected intellectual property, your next step is to just LOOK AT THE SCREENS THEMSELVES. I mean, it’s not that hard. After scanning a couple of the detail pages for devices, you’re going to see a pattern emerge. It will probably be something like:

  • Availability
    • Response time
    • Packet loss
  • CPU
    • Overall percent utilization, 5 minute average
    • Per processor % util 5 min avg
  • Physical RAM
    • % Util 5 min avg
    • IO Operations Per Second (IOPS)
    • Errors
  • Virtual RAM
    • % Util 5 min avg
    • IO Operations Per Second (IOPS)
    • Errors
  • Disk
    • Availability
    • Capacity
    • Used
    • IOPS
  • Interface
    • Availability
    • Bandwidth
      • Bits per second (BPS)
      • % Utilization
      • Packets Per Second (PPS)
      • Errors
      • Discards

There, in that relatively short list, is probably the core of the metrics every machine provides. Then you have specifics for routers (Flapping routes, VPN metrics, etc), switches (stacked switch metrics, VLANs), virtual machines (host statistics,  “noisy neighbor” events), and so on.
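
If you want to keep that list honest over time, it helps to store it in a machine-readable form and generate the human-readable version from it. Here’s a minimal sketch in Python; every device type and metric name in it is an illustrative stand-in, not pulled from any particular product:

```python
# Generate a "what do we monitor standard?" document from a single
# machine-readable definition, so the list and the docs can't drift apart.
# All metric and device-type names here are illustrative examples.

CORE_METRICS = {
    "Availability": ["Response time", "Packet loss"],
    "CPU": ["Overall % utilization (5 min avg)",
            "Per-processor % utilization (5 min avg)"],
    "Disk": ["Availability", "Capacity", "Used", "IOPS"],
}

# Extra metrics layered on top of the core set, per device type.
EXTRAS = {
    "router": {"Routing": ["Flapping routes", "VPN tunnel status"]},
    "switch": {"Switching": ["Stack member status", "VLANs"]},
}

def standard_doc(device_type):
    """Return the standard-monitoring list for a device type as text."""
    merged = dict(CORE_METRICS)
    merged.update(EXTRAS.get(device_type, {}))
    lines = [f"Standard monitoring: {device_type}"]
    for category, metrics in merged.items():
        lines.append(f"* {category}")
        lines.extend(f"  - {m}" for m in metrics)
    return "\n".join(lines)

print(standard_doc("router"))
```

The same definition can then feed your reports, your wiki page, and the printed hand-out from the conversation at the end of this post.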

When all else fails, there’s Wireshark. By monitoring the connection between the monitoring system and a single target device (preferably one you have newly added), you can capture all communication and suss out the metrics being requested and supplied. It’s not pretty, and it might not even be easy. But it’s a lot easier than having to repeatedly tell your colleagues that you have no idea what gets monitored.

PS: If you are down to the Wireshark method, it might be time to look at another monitoring solution.

There’s An App For That

Everything I’ve listed above is great for hardware-based metrics, including specialized storage and virtualization tools. But once you cross into the valley of the shadow of application monitoring, you need a better strategy.

By its very nature, application monitoring is highly complex, highly customized, and highly variable. Take something as straightforward as Microsoft Exchange. When building a monitoring solution, you might need specific monitoring for the hub transport server, the mailbox server, the client access server, or even the external OWA server (just to name a few). While there are common elements from one to the other, they serve very different purposes and require very different components and thresholds.

Take that concept and multiply it by all the applications in your environment.

So what is the diligent yet slightly-overwhelmed monitoring engineer to do? While movies like “Glengarry Glen Ross” and “Boiler Room” have popularized the phrase “ABC – Always Be Closing” within sales circles, in monitoring (and indeed, within much of IT) I prefer “ABS – Always Be Standardizing”.

Whether the topic is alerts, reports, or sets of application monitoring components, the best thing you can do for yourself is to continuously work toward a single “gold standard” for each particular need and then keep using it. Expand that single standard when necessary to account for additional variations, but avoid as much as possible the “copy and paste” process that leaves you with 42 different Exchange monitoring templates and (18 months down the road) no idea which one is the default, let alone which one is applied to which server.
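
If your tool can export template definitions, the “ABS” habit can even be partially automated. The sketch below flags near-duplicate templates by comparing their component sets; the template names and component lists are hypothetical, and a real export would need tool-specific parsing:

```python
# Flag application-monitoring templates that are near-duplicates of each
# other -- prime candidates for consolidation into one gold standard.
# Template names and component sets are hypothetical examples.

templates = {
    "Exchange-HubTransport":      {"SMTP queue", "Transport service", "CPU"},
    "Exchange-HubTransport-COPY": {"SMTP queue", "Transport service", "CPU", "Disk"},
    "Exchange-Mailbox":           {"Mailbox DB", "RPC latency", "CPU"},
}

def near_duplicates(templates, threshold=0.75):
    """Yield template pairs whose component overlap (Jaccard) meets threshold."""
    names = sorted(templates)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            overlap = len(templates[a] & templates[b]) / len(templates[a] | templates[b])
            if overlap >= threshold:
                yield (a, b, round(overlap, 2))

for a, b, score in near_duplicates(templates):
    print(f"{a} <-> {b}: {score:.0%} overlap")
```

Run quarterly, a report like this tells you which copies to fold back into the standard before the count reaches 42.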

If you are able to adhere to this one practice, then answering the question “what do you monitor, standard” for applications becomes immeasurably more achievable.

Good Monitoring Is Its Own Reward

Armed with the information above, the conversation from the beginning of this essay goes in a very different direction:

Them: “We need to monitor about 20 of our devices and the applications that go with them.”

You: “Great! It’s a good thing you’re talking to me and not accounts receivable then. Do you have a list of what you are putting together?”

Them: (laughing politely at your very not funny joke) “Right here”

You : “OK, great. I’ve got a list here of the things that get monitored automatically when we load your devices into our software. Can you take a look and tell me if anything you think is important is missing?” (hands printed list of default hardware monitors across the table)

Them: (Scanning the list) “Wow. This is… a lot. It’s more than we expected. I don’t want tickets on all this tho…”

You: “Oh, no. These are the metrics we collect. Alerts are something else. But if you need to see the general health of your systems, this is what we can provide you automatically.”

Them: “Oh, I get it! That’s great. You know what? We’ll ride with this list for the time being.”

You: “Perfect. Now, on the application side, here’s what we have in place today.” (hands over printed list of application monitoring).

Them: “OK, I can already see we’ll need a few extra items that the devs have said they want to track. They will also want to alert on some custom messages they’re generating.”

You: “No worries. You can get me that list by email later if you want. I’m sure we’ll be able to get it all set up for you.”

Them: “Wow, this was a lot less painful than I expected.”

You: (Winks at camera) “I know!”

The Long And Winding Road

You pull into the driveway and head into your home, reflecting on how 8:45am seemed like 100 years ago in terms of the miles you’ve put on that old brain of yours. But then you sit down and take a minute to clear your head, and you realize a few things:

First, that monitoring, GOOD monitoring, is something that is achievable by anyone who is willing to put in a little time and effort, and can be accomplished with most of the monitoring solutions currently on the market.

Second, that putting in the time and effort to create good monitoring makes the overall job significantly more enjoyable.

Third, that monitoring is, and should be treated as, a separate discipline within IT, as valid a sub-specialty as networking, programming, storage, virtualization or information security. And only when it IS treated as a discipline will people be willing to put in the time and effort I’ve been describing, and to invest in the tools that help make that possible.

Fourth, and finally, that everything that happened today – from the 8:45am beginning I described in part 1 all the way to the cold beer in your hand now – is all part of a day’s work in IT.

 

#FeatureFriday: What is WMI?

Welcome to “Feature Friday”, a series of (typically short) videos which explain a feature, function, or technique.

WMI (Windows Management Instrumentation) is, as the name implies, a foundational protocol for Microsoft Windows-based environments. Despite this, it is not well understood in terms of what it can do and how you make it do it.

In this video Chris O’Brien and I take a look at the basics of WMI and what it looks like in its most basic form.


For more insights into monitoring, as well as random silliness, you can follow me on Twitter (@LeonAdato) or find me on the SolarWinds THWACK.com forums (@adatole)

#FeatureFriday: Understanding Network Configuration Backups

Welcome to “Feature Friday”, a series of (typically short) videos which explain a feature, function, or technique.

Network Configuration Backups is usually not considered to be a monitoring technique, and yet the opportunities this technique opens up are pretty significant.

In this video, Chris O’Brien and I take a look at Config Backups and their potential for you and your environment.


For more insights into monitoring, as well as random silliness, you can follow me on Twitter (@LeonAdato) or find me on the SolarWinds THWACK.com forums (@adatole)

#FeatureFriday: What is IPSLA?

Welcome to “Feature Friday”, a series of (typically short) videos which explain a feature, function, or technique.

Like NetFlow, it’s easy to fall into the trap of thinking IPSLA is some “new” feature that’s just come on the scene. In reality, it was part of Cisco IOS release 11.2 (as a feature called “Response Time Reporter”, or RTR) all the way back in October 1996. Later it was renamed “Service Assurance Agent” (SAA), and finally changed to its current moniker, which stands for “Internet Protocol Service Level Agreement”.

In this video, Chris O’Brien and I talk over the things IPSLA can do for you and how it tends to be pigeon-holed into being “just good for voice monitoring” where it can actually do much, much more.


For more insights into monitoring, as well as random silliness, you can follow me on Twitter (@LeonAdato) or find me on the SolarWinds THWACK.com forums (@adatole)

#FeatureFriday: What is NetFlow?

Welcome to “Feature Friday”, a series of (typically short) videos which explain a feature, function, or technique.

Compared to features like Ping and SNMP, NetFlow is a relative newcomer to the monitoring scene, but that doesn’t mean it’s young and untested. In fact, it was introduced by Cisco back in 1996!

In this video, Chris O’Brien and I talk about what NetFlow means to a monitoring engineer and the insights it can provide about your network.


For more insights into monitoring, as well as random silliness, you can follow me on Twitter (@LeonAdato) or find me on the SolarWinds THWACK.com forums (@adatole)

The Four Questions, Part 4

In the first part of this series, I described the four (ok, really five) questions that monitoring professionals are frequently asked. You can read that introduction here. Information on the first question (Why did I get this alert) is here. You can get the low-down on the second question (Why DIDN’T I get an alert) here. And the third question (What is monitored on my system) is here.

My goal in this post is to give you the tools you need to answer the fourth question: Which of the existing alerts will potentially trigger for my system?

Reader’s Note: While this article uses examples specific to the SolarWinds monitoring platform, my goal is to provide information and techniques which can be translated to any toolset.

Riddle Me This, Batman…

It’s 3:00pm. You can’t quite see the end of the day over the horizon, but you know it’s there. You throw a handful of trail mix into your face to try to avoid the onset of mid-afternoon nap-attack syndrome and hope to slide through the next two hours unmolested.

Which, of course, is why you are pulled into a team meeting. Not your team meeting, mind you. It’s the Linux server team. On the one hand, you’re flattered. They typically don’t invite anyone who can’t speak fluent Perl or quote every XKCD comic in chronological order (using EPOC time as their reference, of course). On the other…well, team meeting.

The manager wrote:

            kill `ps -ef | grep -i talking | awk '{print $2}'`

on the board, eliciting a chorus of laughter (from everyone but me). Of course, this gave the manager the perfect opportunity to focus the conversation on yours truly.

“We have this non-trivial issue, and are hoping you can grep out the solution for us,” he begins. “We’re responsible for roughly 4,000 systems…”

Unable to contain herself, a staff member follows up with, “4,732 systems. Of which 200 are physical and the remainder are virtualized…”

Unimpressed, her manager said, “Ms. Deal, unless I’m off by an order of magnitude, there’s no need to correct me.”

She replied, “Sorry boss.”

“As I was saying,” he continued. “We have a…significant number of systems. Now how many alerts currently exist in the monitoring system which could generate a ticket?”

“436, with 6 currently in active development,” I respond, eager to show that I’m just as on top of my systems as they are of theirs.

“So how many of those affect our systems?” the manager asked.

Now I’m in my element. I answer, “Well, if you aren’t getting tickets, then none. I mean, if nothing has a spiked CPU or RAM or whatever, then it’s safe to say all of your systems are stable. You can look at each node’s detail page for specifics, although with 4,000—I can see where you would want a summary. We can put something together to show the current statistics, or the average over time, or…”

“You misunderstand,” he cuts me off. “I’m fully cognizant of the fact that our systems are stable. That’s not my question. My question is…should one of my systems become unstable, how many of your 436-soon-to-be-442 alerts WOULD trigger for my systems?”

He continued, “As I understand it, your alert logic does two things: it identifies the devices which could trigger the alert—All Windows systems in the 10.199.1 subnet, for example—and at the same time specifies the conditions under which an alert is triggered—say, when the CPU goes over 80% for more than 15 minutes.”

“So what I mean,” he concluded, “is this: can you create a report that shows me the devices which are included in the scope of an alert’s logic, irrespective of the trigger condition?”

Your Mission, Should You Choose to Accept it…

As with the other questions we’ve discussed in this series, the specifics of HOW to answer this question are less critical than knowing that you will be asked it.

In this case, it’s also important to understand that this question is actually two questions masquerading as one:

  1. For each alert, tell me which machines could potentially trigger it
  2. For each machine, tell me which alerts could potentially be triggered
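
However your tool stores its scope rules, those two questions are just the two directions of one many-to-many relationship, which makes them cheap to answer together. A small illustration (node attributes, alert names, and scope rules are all invented for the example):

```python
# The two questions are the two directions of one many-to-many mapping:
# alert -> machines in scope, and machine -> alerts in scope.
# Node attributes, alert names, and scope rules are hypothetical.

nodes = [
    {"name": "web01", "os": "Windows", "subnet": "10.199.1"},
    {"name": "db01",  "os": "Linux",   "subnet": "10.199.2"},
    {"name": "web02", "os": "Windows", "subnet": "10.199.2"},
]

# Scope predicates only -- no trigger conditions (no "CPU > 80%").
alert_scopes = {
    "High CPU (Windows)": lambda n: n["os"] == "Windows",
    "Core subnet down":   lambda n: n["subnet"] == "10.199.1",
}

def alerts_per_node(nodes, alert_scopes):
    """Question 2: for each machine, which alerts could fire?"""
    return {n["name"]: sorted(a for a, in_scope in alert_scopes.items() if in_scope(n))
            for n in nodes}

def nodes_per_alert(nodes, alert_scopes):
    """Question 1: for each alert, which machines are in scope?"""
    return {a: sorted(n["name"] for n in nodes if in_scope(n))
            for a, in_scope in alert_scopes.items()}
```

The point of the sketch is the shape of the data, not the predicates: once scope is separated from trigger, both reports fall out of the same pass.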

Why is this such an important question—perhaps the most important in this series? Because it determines the scale of the potential notifications monitoring may generate. It’s one thing if 5 alerts apply to 30 machines. It’s another thing entirely when 30 alerts apply to 4,000 machines.

The answer to this question has implications for staffing, shift allocation, pager rotation, and even the number of alerts a particular team may approve for production.

The way you go about building this information is going to depend heavily on the monitoring solution you are using.

In general, agent-based solutions are better at this because trigger logic – in the form of an alert name – is usually pushed down to the agent on each device, and thus can be queried in both directions (“Hey, node, which alerts are ‘on’ you?” and “Hey, alert, which nodes have you been pushed to?”).

That’s not to say that agentless monitoring solutions are intrinsically unable to get the job done. The more full-featured monitoring tools have options built-in.

Reports that look like this:

Not to mention little “reminders” like this on the alert screen:

Or even resources on the device details page that look like this:

Houston, We Have a Problem

What if it doesn’t, though? What if you have pored through the documentation, opened a ticket with the vendor, visited the online forums, asked the greatest gurus up on the mountain, and come back with a big fat goose egg? What then?

Your choices at this point still depend largely on the specific software, but generally speaking there are 3 options:

  • Reverse-engineer the alert trigger and remove the actual trigger part

Many monitoring solutions use a database back-end for the bulk of their metrics, and alerts are simply a query against this data. The alert trigger queries may exist in the database itself, or in a configuration file. Once you have found them, you will need to create a copy of each alert and then go through each one, removing the parts which comprise the actual trigger (e.g., CPU_Utilization > 80%) and leaving the parts that simply indicate scope (where Vendor = “Microsoft”; where “OperatingSystem” = “Windows 2003”; where “IP_address” contains “10.199.1”; etc.). This will likely necessitate learning the back-end query language for your tool. Difficult? Probably. Will it increase your street cred with the other users of the tool? Undoubtedly. Will it save your butt within the first month after you create it? Guaranteed.

And once you’ve done it, running a report for each alert becomes extremely simple.
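
As a rough illustration of that “remove the trigger, keep the scope” step: if an alert’s trigger is stored as a flat, AND-joined WHERE clause, you can split it and keep only the clauses that reference scope columns. The column names and clause format below are assumptions; real alert definitions will almost certainly need tool-specific parsing:

```python
import re

# Columns that define *scope* (which machines) vs. *trigger* (when to fire).
# The column names are hypothetical; adjust to your tool's schema.
SCOPE_COLUMNS = {"Vendor", "OperatingSystem", "IP_Address"}

def scope_only(where_clause):
    """Keep only the scope clauses of a flat, AND-joined WHERE clause."""
    clauses = re.split(r"\s+AND\s+", where_clause, flags=re.IGNORECASE)
    kept = [c for c in clauses
            if any(col.lower() in c.lower() for col in SCOPE_COLUMNS)]
    return " AND ".join(kept)

trigger = ("Vendor = 'Microsoft' AND OperatingSystem = 'Windows 2003' "
           "AND IP_Address LIKE '10.199.1%' AND CPU_Utilization > 80")
print(scope_only(trigger))
```

This naive split falls apart on nested OR conditions, which is exactly why learning your tool’s query language is part of the job.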

  • Create duplicate alerts with no trigger

If you can’t export the alert triggers, another option is to create a duplicate of each alert that has the “scope” portion but not the trigger elements (so the “Windows machines in the 10.199.1.x subnet” part but not the “CPU_Utilization > 80%” part). The only recipient of that alert will be you, and the alert action should be something like writing a very simple string to a logfile (“Alert x has triggered for Device y”). If you are VERY clever, you can export the key information in CSV format so you can import it into a spreadsheet or database for easy consumption. Every so often—every month or quarter—fire off those alerts and tally up the results into something recipient groups can slice and dice.
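
The logfile that scheme produces is trivially machine-readable. Here is one way the tally step might look, assuming the simple “Alert x has triggered for Device y” line format suggested above (the alert and device names are made up):

```python
import csv
import io
from collections import defaultdict

# Tally the "scope-only" alert log into a per-alert device list, then
# emit it as CSV for the spreadsheet crowd. Sample lines are illustrative.
log_lines = [
    "Alert High CPU has triggered for Device web01",
    "Alert High CPU has triggered for Device web02",
    "Alert Disk Full has triggered for Device db01",
]

def tally(lines):
    """Group logged scope-only firings into {alert: [devices]}."""
    per_alert = defaultdict(list)
    for line in lines:
        alert, _, device = line[len("Alert "):].partition(" has triggered for Device ")
        per_alert[alert].append(device)
    return dict(per_alert)

def to_csv(per_alert):
    """Flatten the tally into alert,device rows with a header."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["alert", "device"])
    for alert, devices in sorted(per_alert.items()):
        for device in devices:
            writer.writerow([alert, device])
    return buf.getvalue()

print(to_csv(tally(log_lines)))
```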

  • Do it by hand

If all else fails (and the inability to answer this very essential question doesn’t cause you to re-evaluate your choice of monitoring tool), you can start documenting by hand. If you know up-front that you are in this situation, then it’s simply part of the ongoing documentation process. But most times it’s going to be a slog through the existing alerts, writing down the trigger information as you go. Hopefully you can take that trigger info and turn it into an automated query against your existing devices. If not, then I would seriously recommend looking at another tool. Because in any decent-sized environment, this is NOT the kind of thing you want to spend your life documenting, and it’s also not something you want to live without.

What Time Is It? Beer:o’clock

After that last meeting—not to mention the whole day—you are ready to pack it in. You successfully navigated the four impossible questions that every monitoring expert is asked (on more or less a daily basis): Why did I get that alert? Why didn’t I get that alert? What is being monitored on my systems? and What alerts might trigger on my systems? Honestly, if you can do that, there’s not much more that life can throw at you.

Of course, the CIO walks up to you on your way to the elevator. “I’m glad I caught up to you,” he says, “I just have a quick question…”

Stay tuned for the bonus question!

#FeatureFriday: What is SNMP?

Welcome to “Feature Friday”, a series of (typically short) videos which explain a feature, function, or technique.

Today’s episode finds Chris O’Brien and me talking about one of the other old workhorses of monitoring: SNMP. We tease apart the structure of SNMP objects, how systems interact with them, and some tricks for usage in your monitoring environment.


For more insights into monitoring, as well as random silliness, you can follow me on Twitter (@LeonAdato) or find me on the SolarWinds THWACK.com forums (@adatole)

The Four Questions, Part 3

In the first part of this series, I described the four (ok, really five) questions that monitoring professionals are frequently asked: Why did I get an alert? Why didn’t I get an alert? What is being monitored on my system? What will alert on my system? and What standard monitoring do you do? My goal in this post is to give you the tools you need to answer the third question:

What is being monitored on my system?

You can find information on the  first question (Why did I get this alert) here.
And information on the second question (Why DIDN’T I get an alert) here.

Reader’s Note: While this article uses examples specific to the SolarWinds monitoring platform, my goal is to provide information and techniques which can be translated to any toolset.

Not so fast, my friend!

It’s 1:35pm. Your first two callers—the one who wanted to know why they got a particular alert  and the one who wanted to know why they didn’t get an alert—are finally a distant memory, and you’ve managed to squeeze out some productive work setting up a monitor that will detect when your cell phone backup circuit will…

That’s when your manager ambles over and looks at you expectantly over the cube wall.

“I just met with the manager of the APP-X support team,” he tells you. “They want a matrix of what is monitored on their system.”

To his credit he adds, “I checked on the system in the Reports section, but nothing jumped out at me. Did I overlook something?”

This question is solved with a combination of foresight (knowing you are going to get asked this question), preparation, and know-how.

It is also one of the questions where the steps are extremely tool-specific. An agent-based solution will have a lot of this information embedded in the agent, whereas with an agentless solution like SolarWinds you will probably find what you need in a central database or configuration system.

Understand that this is a question that you can answer, and with some preparation you can have the answer with the push of a button (or two). But like so many solutions in this series, preparation now will save you from desperation later.

It’s also important to recognize that this type of report is absolutely essential, both to you and to the owner of the systems.

<PHILOSOPHY>

I believe strongly that monitoring is more than just watching the elements which will create alerts (be they tickets, emails, pager messages, or an ahoogah noise over the loudspeakers). Your monitoring scope should cover elements which are used for capacity planning, performance analysis, and forensics after a critical event. For example, you may never alert on the size of the swap drive, but you will absolutely want to know what its size was at the time of a crash. For that reason, knowing what is monitored is essential, even if you won’t alert on some of those elements.

</PHILOSOPHY>

The knee bone’s connected to the…

In order to answer this question, the first thing you should do is break down the areas of monitoring data. That can include:

  1. Hardware information for the primary device – CPU and RAM are the primary items, but there are other aspects. The key is that you are ONLY interested in hardware that affects the whole “chassis” or box, not sub-elements like cards, disks, ports, etc.
  2. Hardware sub-elements – You may find that you have one, more than one, or none. Network cards, disks, external mount points, and VLANs are just a few examples.
  3. Specialized hardware elements – Fans, power supplies, temperature sensors, and the like.
  4. Application components – PerfMon counters, services, processes, logfile monitors, and more—all the things that make up a holistic view of an application.

And now… I’m going to stop. While there are certainly many more items on that list, if you can master the concept of those first few bullets, adding more should come fairly easily.

It should be noted that this type of report is not a fire-and-forget affair. It’s more of a labor of love that you will come back to and refine over time.

I also need to point out that this will likely not be a one-size-fits-all-device-types solution. The report you create for network devices like routers and access points may need to be radically different from the one you build for server-type devices. Virtual hosts may need data points that have no relevance to anything else. And specialty devices like load balancers, UPSes, or environmental systems are in a class of their own.

Finally, in order to get what you want, you also have to understand how the data is stored, and be extremely comfortable interacting with that system.  Because the tool I have handy is SolarWinds, that’s the data structure we’re going to use here.

As I mentioned earlier, this type of report is push-button simple on some toolsets. If that’s the case for you, then feel free to stop reading and walk away with the knowledge that you will be asked this question on a regular basis, and you should be prepared.

For those using a toolset where answering this question requires effort, read on!

select nodes.statusled, nodes.nodeid, nodes.caption,
s1.LED, s1.elementtype, s1.element, s1.Description
from nodes
left join (
select '01' as elementorder, 'HW' as elementtype, APM_HardwareAlertData.nodeid, APM_HardwareAlertData.HardwareStatus as LED, APM_HardwareAlertData.Model as element, APM_HardwareAlertData.SensorsWithStatus as description
from APM_HardwareAlertData
union all
select '02' as elementorder, 'NIC' as elementtype, interfaces.nodeid as NodeID, interfaces.statusled as LED, interfaces.InterfaceName as element, interfaces.InterfaceTypeDescription as Description
from interfaces
union all
select '03' as elementorder, 'DISK' as elementtype, volumes.nodeid as NodeID, volumes.statusled as LED, volumes.caption as element, volumes.VolumeType as description
from volumes
) s1
on nodes.nodeid = s1.NodeID
order by nodes.nodeid, s1.elementorder asc, s1.element
Catch all that? OK, let’s break this down a little. The basic format of this structure is:

Get device information (name, IP, etc.)
            Get sub-element information (name, status, description)

The key to this process is standardizing the sub-element information so that it’s consistent across each type of element.

One thing to note is that the SQL “UNION ALL” command lets you combine the results of two separate queries – such as a query to the interfaces table and another to the volumes table. BUT it requires that each query return the same number of columns. In my example I’ve kept it simple – just the name and description, really.
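To see the UNION ALL rule in action outside of Orion, here is a minimal sketch using Python’s built-in sqlite3 module. The table names and columns mimic the ones in the query above, but the schema and data are made up for illustration – this is not the real SolarWinds database:

```python
import sqlite3

# Illustrative only: tiny stand-ins for the interfaces and volumes tables
# (made-up columns, not the real SolarWinds schema), just to show the
# UNION ALL column-count rule.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE interfaces (nodeid INT, InterfaceName TEXT, InterfaceTypeDescription TEXT);
CREATE TABLE volumes    (nodeid INT, caption TEXT, VolumeType TEXT);
INSERT INTO interfaces VALUES (1, 'Gi0/1', 'ethernetCsmacd');
INSERT INTO volumes    VALUES (1, 'C:', 'Fixed Disk');
""")

# Every SELECT feeding a UNION ALL must return the same number of columns,
# so both branches emit the same four: elementorder, elementtype, element,
# and description.
rows = con.execute("""
SELECT '02' AS elementorder, 'NIC' AS elementtype,
       InterfaceName AS element, InterfaceTypeDescription AS description
FROM interfaces
UNION ALL
SELECT '03' AS elementorder, 'DISK' AS elementtype,
       caption AS element, VolumeType AS description
FROM volumes
ORDER BY elementorder
""").fetchall()

print(rows)
```

Drop a column from either branch and the database engine rejects the query – the same behavior you’ll hit in SQL Server when you add a new field to one block of the report and forget the others.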

The other trick I learned was to add icons rather than text wherever possible. That includes the “statusLED” and “status” columns, which display dots instead of text when rendered by the SolarWinds report engine. I find this gives a much quicker sense of what’s going on (oh, they’re monitoring this disk, but it’s currently offline).

Another addition worth noting is:

            select 'xx' as elementorder, 'yyy' as elementtype,

I use element order to sort each of the query blocks, and elementtype to give the report reader a clue as to the source of this information (disk, application, hardware, etc.)

But what if I want to include data points that exist for some elements, but not for others? You still have to ensure that each query connected by “UNION ALL” has the same number of columns. So if we wanted to include circuit capacity for interfaces and disk capacity for disks, but nothing for hardware, it would look like this:

select nodes.statusled, nodes.nodeid, nodes.caption,
s1.LED, s1.elementtype, s1.element, s1.Description, s1.capacity
from nodes
left join (
select '01' as elementorder, 'HW' as elementtype, APM_HardwareAlertData.nodeid, APM_HardwareAlertData.HardwareStatus as LED, APM_HardwareAlertData.Model as element, APM_HardwareAlertData.SensorsWithStatus as description, '' as capacity
from APM_HardwareAlertData
union all
select '02' as elementorder, 'NIC' as elementtype, interfaces.nodeid as NodeID, interfaces.statusled as LED, interfaces.InterfaceName as element, interfaces.InterfaceTypeDescription as Description, interfaces.InterfaceSpeed as capacity
from interfaces
union all
select '03' as elementorder, 'DISK' as elementtype, volumes.nodeid as NodeID, volumes.statusled as LED, volumes.caption as element, volumes.VolumeType as description, volumes.VolumeSize as capacity
from volumes
) s1
on nodes.nodeid = s1.NodeID
order by nodes.nodeid, s1.elementorder asc, s1.element

By adding '' as capacity to the hardware block (and any other section where it’s needed), we avoid errors with the UNION ALL command.

Conspicuous by their absence in all of this are the things I listed first on the “must have” list: CPU, RAM, etc.

In this case, I used a couple of simple tricks: for CPU, I give the count of CPUs, since any other data (current processor utilization, etc.) is probably not helpful here. For RAM, the solution is even simpler: I query the nodes table again and pull out JUST the TotalMemory column.
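The DISTINCT-then-COUNT trick for CPUs can be sketched with a throwaway SQLite table. The table name mimics CPUMultiLoad, but the columns and data here are invented for illustration – the real table holds one row per polling cycle per CPU index, which is why a raw COUNT would overstate the CPU count:

```python
import sqlite3

# Illustrative stand-in for the CPUMultiLoad table (made-up data): one row
# per polling cycle per CPU index, so counting raw rows would overstate
# the number of CPUs.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE CPUMultiLoad (NodeID INT, CPUIndex INT)")
con.executemany(
    "INSERT INTO CPUMultiLoad VALUES (?, ?)",
    [(1, 0), (1, 1), (1, 0), (1, 1),  # node 1: two CPUs, polled twice
     (2, 0), (2, 0), (2, 0)],         # node 2: one CPU, polled three times
)

# DISTINCT collapses the repeated polls to one row per (node, CPU index);
# COUNT per node then yields the true CPU count.
rows = con.execute("""
SELECT c1.NodeID, COUNT(c1.CPUIndex) AS cpu_count
FROM (SELECT DISTINCT NodeID, CPUIndex FROM CPUMultiLoad) c1
GROUP BY c1.NodeID
ORDER BY c1.NodeID
""").fetchall()

print(rows)
```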

select nodes.statusled, nodes.nodeid, nodes.caption, s1.LED, s1.elementtype, s1.element, s1.Description, s1.capacity
from nodes
left join (
select '01' as elementorder, 'CPU' as elementtype, c1.NodeID, 'Up.gif' as LED, 'CPU Count:' as element, CONVERT(varchar,COUNT(c1.CPUIndex)) as description, '' as capacity
from (select DISTINCT CPUMultiLoad.NodeID, CPUMultiLoad.CPUIndex from CPUMultiLoad) c1
group by c1.NodeID
union all
select '02' as elementorder, 'RAM' as elementtype, nodes.NodeID, 'Up.gif' as LED, 'Total RAM' as element, CONVERT(varchar,nodes.TotalMemory) as description, '' as capacity
from nodes
union all
select '03' as elementorder, 'HW' as elementtype, APM_HardwareAlertData.nodeid, APM_HardwareAlertData.HardwareStatus as LED, APM_HardwareAlertData.Model as element, APM_HardwareAlertData.SensorsWithStatus as description, '' as capacity
from APM_HardwareAlertData
union all
select '04' as elementorder, 'NIC' as elementtype, interfaces.nodeid as NodeID, interfaces.statusled as LED, interfaces.InterfaceName as element, interfaces.InterfaceTypeDescription as Description, interfaces.InterfaceSpeed as capacity
from interfaces
union all
select '05' as elementorder, 'DISK' as elementtype, volumes.nodeid as NodeID, volumes.statusled as LED, volumes.caption as element, volumes.VolumeType as description, volumes.VolumeSize as capacity
from volumes
union all
select '06' as elementorder, 'APP' as elementtype, APM_AlertsAndReportsData.nodeid as NodeID, APM_AlertsAndReportsData.ComponentStatus as LED, APM_AlertsAndReportsData.ComponentName as element, APM_AlertsAndReportsData.ApplicationName as description, '' as capacity
from APM_AlertsAndReportsData
union all
select '07' as elementorder, 'POLLER' as elementtype, CustomPollerAssignmentView.NodeID, 'up.gif' as LED, CustomPollerAssignmentView.CustomPollerName as element, CustomPollerAssignmentView.CustomPollerDescription as description, '' as capacity
from CustomPollerAssignmentView
) s1
on nodes.nodeid = s1.NodeID
Order by nodes.nodeid ,  s1.elementorder asc, s1.element

In this iteration I found that the data coming from some sources was integer, some was float, and some was even text! So I started using the CONVERT() function to keep everything in the same format.
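Here is a minimal sketch of why the conversion matters, again using made-up sqlite3 stand-ins rather than the real tables. One UNION ALL branch produces an integer column and the other produces text; casting both to text keeps the combined column uniform. (SQL Server spells this CONVERT(varchar, x); SQLite’s equivalent is CAST(x AS TEXT).)

```python
import sqlite3

# Illustrative only (made-up tables): TotalMemory is an integer while
# VolumeType is text, so the branches are cast to a common type before
# being combined with UNION ALL.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE nodes   (NodeID INT, TotalMemory INT);
CREATE TABLE volumes (NodeID INT, VolumeType TEXT);
INSERT INTO nodes   VALUES (1, 8589934592);
INSERT INTO volumes VALUES (1, 'Fixed Disk');
""")

rows = con.execute("""
SELECT 'RAM' AS elementtype, CAST(TotalMemory AS TEXT) AS description FROM nodes
UNION ALL
SELECT 'DISK' AS elementtype, VolumeType AS description FROM volumes
ORDER BY elementtype
""").fetchall()

print(rows)
```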

The result looks something like this:

I could stop here and you would have, more or less, the building blocks you need to build your own “What is monitored on my system?” report. But there is one more piece that takes this type of report to the next level.

Including the built-in thresholds for these elements adds complexity to the query, but it also adds an entirely new (and important) dimension to the information you are providing.

More than ever, the success of this type of report lies in knowing where threshold data is kept. In the case of SolarWinds, a series of “Thresholds” views (InterfacesThresholds, NodesPercentMemoryUsedThreshold, NodesCpuLoadThreshold, and so on) makes the job easier, but you still have to know where thresholds are kept for applications, and that there are NO built-in thresholds for disks or custom pollers.

With that said, the final report query would look like this:

select nodes.statusled, nodes.nodeid, nodes.caption, s1.LED, s1.elementtype, s1.element, s1.Description, s1.capacity,
s1.threshold_value, s1.warn, s1.crit
from nodes
left join (
select '01' as elementorder, 'CPU' as elementtype, c1.NodeID, 'Up.gif' as LED, 'CPU Count:' as element, CONVERT(varchar,COUNT(c1.CPUIndex)) as description, '' as capacity, 'CPU Utilization' as threshold_value, convert(varchar, t1.Level1Value) as warn, convert(varchar, t1.Level2Value) as crit
from (select DISTINCT CPUMultiLoad.NodeID, CPUMultiLoad.CPUIndex from CPUMultiLoad) c1
join NodesCpuLoadThreshold t1 on c1.nodeID = t1.InstanceId
group by c1.NodeID, t1.Level1Value, t1.Level2Value
union all
select '02' as elementorder, 'RAM' as elementtype, nodes.NodeID, 'Up.gif' as LED, 'Total RAM' as element, CONVERT(varchar,nodes.TotalMemory) as description, '' as capacity, 'RAM Utilization' as threshold_value, convert(varchar, NodesPercentMemoryUsedThreshold.Level1Value) as warn, convert(varchar, NodesPercentMemoryUsedThreshold.Level2Value) as crit
from nodes
join NodesPercentMemoryUsedThreshold on nodes.nodeid = NodesPercentMemoryUsedThreshold.InstanceId
union all
select '95' as elementorder, 'HW' as elementtype, APM_HardwareAlertData.nodeid, APM_HardwareAlertData.HardwareStatus as LED, APM_HardwareAlertData.Model as element, APM_HardwareAlertData.SensorsWithStatus as description, '' as capacity, '' as threshold_value, '' as warn, '' as crit
from APM_HardwareAlertData
union all
select '03' as elementorder, 'NIC' as elementtype, interfaces.nodeid as NodeID, interfaces.statusled as LED, interfaces.InterfaceName as element, interfaces.InterfaceTypeDescription as Description, interfaces.InterfaceSpeed as capacity, 'bandwidth in/out' as threshold_value,
convert(varchar, i1.Level1Value)+'/'+convert(varchar,i2.level1value) as warn, convert(varchar, i1.Level2Value)+'/'+convert(varchar,i2.level2value) as crit
from interfaces
join (select InterfacesThresholds.instanceid, InterfacesThresholds.level1value , InterfacesThresholds.level2value
       from InterfacesThresholds where InterfacesThresholds.name = 'NPM.Interfaces.Stats.InPercentUtilization') i1 on interfaces.interfaceid = i1.InstanceId
join (select InterfacesThresholds.instanceid, InterfacesThresholds.Level1Value, InterfacesThresholds.level2value
       from InterfacesThresholds where InterfacesThresholds.name = 'NPM.Interfaces.Stats.OutPercentUtilization') i2 on interfaces.interfaceid = i2.InstanceId
union all
select '04' as elementorder, 'DISK' as elementtype, volumes.nodeid as NodeID, volumes.statusled as LED, volumes.caption as element, volumes.VolumeType as description, volumes.VolumeSize as capacity, '' as threshold_value, '' as warn, '' as crit
from volumes
union all
select '05' as elementorder, 'APP' as elementtype, APM_AlertsAndReportsData.nodeid as NodeID, APM_AlertsAndReportsData.ComponentStatus as LED, APM_AlertsAndReportsData.ComponentName as element, APM_AlertsAndReportsData.ApplicationName as description, '' as capacity, 'CPU Utilization' as threshold_value, convert(varchar, APM_AlertsAndReportsData.[Threshold-CPU-Warning]) as warn, convert(varchar, APM_AlertsAndReportsData.[Threshold-CPU-Critical]) as crit
from APM_AlertsAndReportsData
union all
select '06' as elementorder, 'POLLER' as elementtype, CustomPollerAssignmentView.NodeID, 'up.gif' as LED, CustomPollerAssignmentView.CustomPollerName as element, CustomPollerAssignmentView.CustomPollerDescription as description, '' as capacity, '' as threshold_value, '' as warn, '' as crit
from CustomPollerAssignmentView
) s1
on nodes.nodeid = s1.NodeID
order by nodes.nodeid, s1.elementorder asc, s1.element

And here are the results:

It’s 90% Perspiration…

While answering this question requires persistence, skill, and in-depth knowledge of your monitoring toolset, the benefits are significantly greater than for the previous two questions.

Done right, teams can use this report to validate that the correct elements on each device are monitored – nothing is left out, nothing which has been decommissioned is still there. And when an alert does trigger, it will be easier to understand where you can look for hints, instead of just clicking around screens looking for something interesting.

Stock up on your tea leaves, goat entrails, and crystal balls because in my next post we’re going to take a peek into the future by answering the question “What *WILL* alert on my system?”

#FeatureFriday: What is Syslog?

Welcome to “Feature Friday”, a series of (typically short) videos which explain a feature, function, or technique.

In this installment, Chris O’Brien and I dig into the protocol/feature/tool called Syslog: How it works and how you use it effectively for monitoring in your environment.


For more insights into monitoring, as well as random silliness, you can follow me on Twitter (@LeonAdato) or find me on the SolarWinds THWACK.com forums (@adatole)