Category Archives: SolarWinds

#FeatureFriday: Improving Alerts with Query Execution Plans

Welcome to “Feature Friday”, a series of (typically short) videos which explain a feature, function, or technique.

Alerts are, for many monitoring engineers, the bread and butter of their job. What many fail to recognize is that, regardless of how graphical and “natural English language” the alert builder appears, what you are really creating is a query. Often it is a query which runs frequently (every minute, or even more often) against the entire database.

Because of that, a single, poorly constructed query can have a huge (and hugely negative) impact on overall performance. Get a few bad eggs in the mix, and the rest of the monitoring system – polling, display, reports, etc. – can slow to a crawl, or even grind to a halt.

Luckily there’s a tool which can help you discover a query’s execution performance, and identify where the major bottlenecks are.

In the video below, my fellow SolarWinds Head Geeks and I channel our inner SQLRockstar and dive into query execution plans and how to apply that technique to SolarWinds Orion alerts.
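If you want to try this yourself after watching, here is a minimal sketch of the technique in SQL Server Management Studio. The SELECT below is a hypothetical stand-in for the SQL your alert actually generates (capture the real thing from the Orion logs or a profiler trace); the SET STATISTICS options are standard T-SQL.

```sql
-- Ask SQL Server for timing, I/O, and the actual execution plan
-- for the kind of query an alert definition produces.
SET STATISTICS IO ON;    -- logical/physical reads per statement
SET STATISTICS TIME ON;  -- CPU and elapsed time per statement
SET STATISTICS XML ON;   -- returns the actual execution plan as XML

-- Hypothetical stand-in for an alert's query; substitute your own.
SELECT NodeID, Caption, Status
FROM dbo.Nodes
WHERE Status = 2         -- e.g., "down"
  AND UnManaged = 0;

SET STATISTICS XML OFF;
SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;
```

Look for table scans, missing-index warnings, and operators carrying a large share of the total cost – those are your bottlenecks.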


For more insights into monitoring, as well as random silliness, you can follow me on Twitter (@LeonAdato) or find me on the SolarWinds THWACK.com forums (@adatole)

#FeatureFriday: All About Diagnostics, Baselines, and Dependencies in SolarWinds Orion

Welcome to “Feature Friday”, a series of (typically short) videos which explain a feature, function, or technique.

After a quick reminder that running diagnostics is NOT just for when you are in trouble, my discussion with fellow SolarWinds Head Geeks Kong Yang and Patrick Hubbard turns to when and how to enable baseline calculations – which allow the system to use collected metrics to build a model of what is “normal” – and automatic dependencies – which suppress alerts on devices downstream of the root cause.


For more insights into monitoring, as well as random silliness, you can follow me on Twitter (@LeonAdato) or find me on the SolarWinds THWACK.com forums (@adatole)

#FeatureFriday: Verifying and Fixing Permissions in SolarWinds Orion

Welcome to “Feature Friday”, a series of (typically short) videos which explain a feature, function, or technique.

Every once in a while I come across an installation that has inexplicable problems – little hiccoughs here and there. Not enough for the monitoring admin to throw up their hands in frustration, but enough to make people scratch their heads. Also not enough to make those admins open a ticket or even mention it to me when we’re talking casually.

Usually I (and the support team) hear about it, however, during upgrades. Because that’s when stuff just doesn’t work as expected. And often, this is because of permissions.

Before you even watch the video, let me lay some SolarWinds Orion wisdom on you: install as local administrator. Not admin-equivalent. Not “Joe, who has admin privileges.” Not even DOMAIN admin. Local admin. Doing the install as anything else is going to cause you problems somewhere down the road.

BUT… if this is the first time you are hearing about this (because nobody reads the admin guide), what do you do with the install you already have in place? The answer lies in the video below, where my fellow SolarWinds Head Geeks Kong Yang, Patrick Hubbard, and I talk about how to verify and fix permission issues in Orion:


For more insights into monitoring, as well as random silliness, you can follow me on Twitter (@LeonAdato) or find me on the SolarWinds THWACK.com forums (@adatole)

#FeatureFriday: Understanding (and fixing) Logging Levels

Welcome to “Feature Friday”, a series of (typically short) videos which explain a feature, function, or technique.

Within SolarWinds Orion (the framework that provides a common set of functions like reporting and alerting, and that also provides the glue binding all the different modules together), one of the more robust ways to understand what is happening under the hood is the log files. But not all the information you COULD see is there by default, so SolarWinds provides a way to tweak those logging levels.

Unfortunately, over time the logging levels end up out of whack with what you need on a daily basis.

In the video below SolarWinds Head Geeks Kong Yang, Patrick Hubbard, and I dig into where those logs are, what they contain, how to change your logging levels, and how to put them back to the “right” way when you are done.
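As a point of reference: Orion components write their logs to the SolarWinds logs directory on the Orion server, and most of the services use log4net-style configuration, so the “level” knob usually looks something like the snippet below. Treat this as an illustration, not a recipe – file names, appender names, and exact layout vary by module and version.

```xml
<log4net>
  <root>
    <!-- Raise to DEBUG while troubleshooting, then set it back to
         INFO (or WARN) when you are done, or the log volume itself
         becomes a performance problem. -->
    <level value="INFO" />
    <appender-ref ref="RollingLogFileAppender" />
  </root>
</log4net>
```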


For more insights into monitoring, as well as random silliness, you can follow me on Twitter (@LeonAdato) or find me on the SolarWinds THWACK.com forums (@adatole)

#FeatureFriday: Checking and Changing Polling Cycles and Retention

Welcome to “Feature Friday”, a series of (typically short) videos which explain a feature, function, or technique.

One of the biggest challenges when maintaining a monitoring system is handling the sheer volume of data. While your initial thought may be “keep it all,” the reality is that doing so has implications for everything from display speed (data has to be pulled from the database and loaded onto the screen, and if the data set is large, the query time will be long) to storage (leading to the question, “If you were the IT pro who had everything, where would you put it all?”).

The secret to effectively managing this issue lies in understanding – and being able to tune – polling cycles and retention times, which is what I discuss with my fellow SolarWinds Head Geeks Kong Yang and Patrick Hubbard in the video below.
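To get a feel for the math: 1,000 interfaces polled for statistics every 10 minutes (the out-of-the-box interval is in that ballpark) is 144,000 rows per day; keep 30 days of detail and you are sitting on roughly 4.3 million rows for that one element type alone. And if you would rather measure than estimate, here is a hedged sketch using standard SQL Server DMVs (nothing Orion-specific) that lists the biggest tables in your database, so you can see where retention is actually costing you:

```sql
-- Ten largest tables by data size (nonclustered indexes excluded).
SELECT TOP (10)
       t.name                        AS table_name,
       SUM(p.rows)                   AS row_count,
       SUM(a.total_pages) * 8 / 1024 AS size_mb
FROM sys.tables t
JOIN sys.partitions p       ON p.object_id = t.object_id
                           AND p.index_id IN (0, 1)  -- heap or clustered index
JOIN sys.allocation_units a ON a.container_id = p.partition_id
GROUP BY t.name
ORDER BY size_mb DESC;
```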


For more insights into monitoring, as well as random silliness, you can follow me on Twitter (@LeonAdato) or find me on the SolarWinds THWACK.com forums (@adatole)

#FeatureFriday: Checking Additional Polling Health and Load

Welcome to “Feature Friday”, a series of (typically short) videos which explain a feature, function, or technique.

Once your monitoring environment grows beyond a certain size, a single server – no matter how robust your software or how beefy your hardware – is simply not going to cut it. And not long after that, when you’ve added another polling server, you will want to know how it is performing, and if it’s time for yet another additional poller.

To that end, in the video below, my fellow SolarWinds Head Geeks Kong Yang, Patrick Hubbard, and I speak about ways you can validate the health and current load being carried by each additional poller in your environment, so there are no surprises.
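If you would rather not click through the UI every time, the data behind that screen is queryable. Here is a hedged sketch in SWQL (run it in SWQL Studio against the SolarWinds Information Service); Orion.Engines is the entity that backs the polling engines view, though exact property names can shift between versions:

```sql
-- Roughly how busy is each polling engine?
-- Elements = nodes + interfaces + volumes assigned to that engine.
SELECT EngineID,
       ServerName,
       IP,
       Elements,
       PollingCompletion  -- percentage of scheduled polls completing on time
FROM Orion.Engines
ORDER BY Elements DESC
```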


For more insights into monitoring, as well as random silliness, you can follow me on Twitter (@LeonAdato) or find me on the SolarWinds THWACK.com forums (@adatole)

#FeatureFriday: SolarWinds NPM Syslog/Trap Health

Welcome to “Feature Friday”, a series of (typically short) videos which explain a feature, function, or technique.

A frequently misunderstood aspect of setting up a monitoring solution is how to handle syslog and trap. Without going too deeply into it here (because the video below does a great job of that), the volume of messages in some environments can overwhelm the most robust software on the beefiest hardware.

The simple fact is that syslog and trap are chatty protocols. If you don’t have a design in place that can filter out the noise, you may end up thinking your monitoring solution is performing poorly when it is merely struggling under the weight of an unmanageable message stream.

In this video, I explain to my fellow SolarWinds Head Geeks Kong Yang and Patrick Hubbard exactly how this happens, and how to build a design that avoids the problem in the first place.
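One quick way to find out whether you are in “unmanageable message stream” territory is to ask the database who is shouting the loudest. The sketch below uses hypothetical table and column names (dbo.SysLog, SourceIP, MessageDateTime) – verify them against your own schema before running anything:

```sql
-- Syslog messages per source over the last hour; a handful of devices
-- usually account for the bulk of the noise.
SELECT TOP (20)
       SourceIP,
       COUNT(*) AS messages_last_hour
FROM dbo.SysLog
WHERE MessageDateTime >= DATEADD(HOUR, -1, GETDATE())
GROUP BY SourceIP
ORDER BY messages_last_hour DESC;
```

Once you know the top talkers, you can decide what to filter at the source, what to discard on arrival, and what genuinely deserves to be stored and alerted on.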


For more insights into monitoring, as well as random silliness, you can follow me on Twitter (@LeonAdato) or find me on the SolarWinds THWACK.com forums (@adatole)

#FeatureFriday: Rebuilding Database Indexes

Welcome to “Feature Friday”, a series of (typically short) videos which explain a feature, function, or technique.

While many monitoring solutions have built-in database maintenance routines, often more may be needed to ensure everything is absolutely squeaky-clean. This is especially important to do prior to upgrades and patches. And one of the basic cleanup tasks for a database is rebuilding indexes. 

In this video, SolarWinds Head Geeks Kong Yang, Patrick Hubbard, and I look at what database reindexing looks like, and even offer some simple scripts to do it in your environment.
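To give you the flavor of such a script, here is a minimal sketch built on standard SQL Server DMVs (nothing SolarWinds-specific). The 5%/30% thresholds are the commonly cited rules of thumb; take a backup first and run it in a maintenance window:

```sql
-- Rebuild indexes over 30% fragmented; reorganize those between 5% and 30%.
DECLARE @sql nvarchar(max) = N'';

SELECT @sql += N'ALTER INDEX ' + QUOTENAME(i.name) + N' ON '
             + QUOTENAME(SCHEMA_NAME(t.schema_id)) + N'.' + QUOTENAME(t.name)
             + CASE WHEN s.avg_fragmentation_in_percent > 30
                    THEN N' REBUILD;' ELSE N' REORGANIZE;' END
             + CHAR(10)
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') s
JOIN sys.indexes i ON i.object_id = s.object_id AND i.index_id = s.index_id
JOIN sys.tables  t ON t.object_id = s.object_id
WHERE s.avg_fragmentation_in_percent > 5
  AND i.name IS NOT NULL;   -- skip heaps

PRINT @sql;                 -- review the generated statements first...
-- EXEC sp_executesql @sql; -- ...then uncomment to run them
```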


For more insights into monitoring, as well as random silliness, you can follow me on Twitter (@LeonAdato) or find me on the SolarWinds THWACK.com forums (@adatole)

#FeatureFriday: Validating SolarWinds Database Maintenance

Welcome to “Feature Friday”, a series of (typically short) videos which explain a feature, function, or technique.

One of the key aspects of the SolarWinds tools is their ease of use. Other enterprise-class monitoring solutions require you to have subject matter experts on hand to help with implementation and maintenance. SolarWinds can be installed by a single technician, and doesn’t require you to have a DBA, Linux expert, or CCIE on hand.

But that doesn’t mean there’s no maintenance happening. And while a lot of it is automated, it’s important for folks who are responsible for the SolarWinds toolset to understand whether that maintenance is running correctly.

In this video, Head Geeks Kong Yang, Patrick Hubbard, and I go over the SolarWinds maintenance subroutines and how to see whether things are happy or not under the hood.
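One sanity check you can run yourself, regardless of which scheduler kicks the maintenance off: compare the oldest row in a detail table against your configured retention. The table and column below are illustrative – substitute whichever detail table your installation actually uses:

```sql
-- If retention is set to 30 days but the oldest detail row is 180 days
-- old, nightly maintenance is not doing its job.
SELECT MIN([DateTime])                           AS oldest_detail_row,
       DATEDIFF(DAY, MIN([DateTime]), GETDATE()) AS days_of_detail
FROM dbo.ResponseTime;  -- illustrative table name; check your schema
```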


For more insights into monitoring, as well as random silliness, you can follow me on Twitter (@LeonAdato) or find me on the SolarWinds THWACK.com forums (@adatole)

The Four Questions, Part 5

In the first part of this series, I described the four (OK, really five) questions that monitoring professionals are frequently asked. You can read that introduction here. Information on the first question (Why did I get this alert?) is here. You can get the low-down on the second question (Why DIDN’T I get an alert?) here. The third question (What is monitored on my system?) is here. And the fourth question (What alerts WILL trigger for my system?) is posted here.

But as I’ve hinted all along, and despite the title of the series, there are actually five questions that you will be asked when you start down the path to becoming a monitoring engineer. The fifth, and final, question that you will most likely be asked is:

What do you monitor “standard”?

Reader’s Note: While this article uses examples specific to the SolarWinds monitoring platform, my goal is to provide information and techniques which can be translated to any toolset.

No Rest for the Weary…

You thought you were going to make it out of the building without another task hanging over your head, but the CIO caught you in the elevator and decided to have a “quick chat” on the way down.

“I’m glad I caught up to you,” he says. “I just have a quick question…”

“I love what you’re doing with the monitoring system,” he continues, “but one thing that I keep hearing from support teams is that they feel like every new device has to be monitored from the ground up. Is there a way we can just have some monitors in place as soon as it enters the system?”

Choosing your words carefully (it *is* the CIO, after all) you respond “Well, there’s a whole raft of metrics we collect as soon as any new system is added into monitoring. Then we augment that with metrics based on what the new device provides and the owner and support teams need.”

“That’s a relief,” smiles your CIO as you arrive at your car. He is now literally standing between you and a well-earned beer. “But what our teams – your customers, you know,” he adds with a chuckle, “need to know is what those standard options are. Like when we’re buying a new car,” he says, eyeing your 2007 rustbucket. “Monitoring should list which features are standard, and which are optional upgrades. That shouldn’t be too hard, right?”

“Not at all!” you chirp, your enthusiasm having more to do with the fact that he’s moved out of the way than the specter of new work.

“Excellent,” he says. “Can you pull it together for me to look over tomorrow?” Without waiting for an answer, he calls “Have a great night!” over his shoulder.

Standard Time

The reason this (“What do you monitor, standard?”) is a common question has a lot to do with how monitoring solutions usually come into being in organizations; how monitoring teams are established; and how conversations between you (the monitoring engineer) and the recipients of monitoring tend to go.

First, while in some cases monitoring solutions are conceived and implemented as a company-wide initiative in a greenfield scenario, those cases are rare. More often what ends up being the company standard started off as a departmental (or even individual) initiative, which somebody noticed and liked and promoted until it spread.

“Standard procedures? We do it like this because Harold did it that way when he installed this thing 3 years ago.”

Second, monitoring teams often form around the solution (product XYZ) OR around the cause (“what do we want?” MONITORING! “When do we want it?” PING!). Unlike more mature IT disciplines like networking, storage, or even the relative newcomer virtualization, people don’t usually set out to become monitoring engineers. And thus there are precious few courses or books – and precious little tribal knowledge – to fall back on to understand how it is “usually” done. Teams form and, for better or worse, begin to write their own set of rules.

Third (and last), because of the preceding two points, conversations between the ersatz monitoring engineer and the monitoring services consumer (the person or team who will read the reports, look at the screens, and/or receive the alerts) tend to have a number of disconnects. More on that in a moment.

What We Have Here…

Put yourself in the shoes of that monitoring consumer I mentioned a minute ago. You’ve got a new device, application, or service which needs to be monitored. At best, you need it monitored because you realize the value monitoring brings to the table. At worst, you need it monitored because you were told your device, application, or service is a class III critical element to the organization, and therefore monitoring is mandatory. You just need to check the little box that says “monitoring” so that you can get this sucker rolled into production.

So you make the call and set up a meeting with someone from the vaguely shadowy-sounding “monitoring team” (what, do they manage all the keylogger and spyware software you’re sure this place is crawling with?), and a little while later you’re sitting down in a conference room with them.

You explain your new project, device, application, or service. The person is taking notes, nodding at all the right places. This is looking positive. Then they look up from their notes and ask:

“So, what do you need to have monitored here?”

In your head, you’re thinking “I thought YOU were supposed to tell ME that. Why do I have to do all the heavy lifting here?”

But you’re a pro, so of course you don’t say that. Instead you reply, “Well, I’m sure whatever you monitor standard should be fine.”

The monitoring person looks mildly annoyed, and asks “Anything ELSE? I mean, this is a pretty important <device/application/service>, right?”

So now you DO open your mouth. “Well, it’s hard to tell since I’m not sure what standard monitoring is!”

“Oh, come on,” comes the retort. “You’ve been using XYZ monitoring for 2 years now. You know what we monitor on the other stuff for you. It’s…”

And then they do the most infuriating thing: they rattle off a list of words. Some are components, some are counters, some sound vaguely familiar, and others you’ve never heard of. And there they sit with their arms crossed, looking at you across the table.

*************

While the above scene is probably (hopefully) more dramatic than the ones you’ve encountered in real life, it nevertheless captures the essence of the issue. If the monitoring team and those who benefit from monitoring are working from different playbooks, the overall effectiveness of monitoring (not to mention YOU) is going to suffer.

So how do you avoid this? Or, if you’ve already hit this friction, how do you get past it? As with the rest of this series, I have a few suggestions, which largely boil down to “knowing you will be asked this question is half the battle” – because now you can prepare.

But the daemon is, as they say, in the details. So let’s dig in!

好記性,不如爛筆頭

(“A good memory is no match for a worn-out pen.”)

Yes, my first suggestion is going to be to make sure you have good documentation.

Before you quail at the thought of mountains of research, keep in mind that you aren’t documenting the specific thresholds, values, etc., for each node. You are compiling a list of the default values which are collected when a new system is added into your monitoring solution. Depending on your software, that could be a consistent list regardless of device type, or it could be a core set of common values along with a few additional variations by device type.

Your first stop should always be to RTFM, which stands, of course, for “Read The Friendly Manual.” Many vendors have already done this work for you, and all you need to do is copy and paste it into your documentation.

But if your vendor considers such information to be legally protected intellectual property, your next step is to just LOOK AT THE SCREENS THEMSELVES. I mean, it’s not that hard. After scanning a couple of the detail pages for devices, you’re going to see a pattern emerge. It will probably be something like:

  • Availability
    • Response time
    • Packet loss
  • CPU
    • Overall % utilization (5-minute average)
    • Per-processor % utilization (5-minute average)
  • Physical RAM
    • % utilization (5-minute average)
    • I/O operations per second (IOPS)
    • Errors
  • Virtual RAM
    • % utilization (5-minute average)
    • I/O operations per second (IOPS)
    • Errors
  • Disk
    • Availability
    • Capacity
    • Amount used
    • IOPS
  • Interface
    • Availability
    • Bandwidth
      • Bits per second (bps)
      • % utilization
      • Packets per second (PPS)
    • Errors
    • Discards
There, in that relatively short list, is probably the core of the metrics every machine provides. Then you have specifics for routers (flapping routes, VPN metrics, etc.), switches (stacked-switch metrics, VLANs), virtual machines (host statistics, “noisy neighbor” events), and so on.

When all else fails, there’s Wireshark. By monitoring the connection between the monitoring system and a single target device (preferably one you have newly added), you can capture all communication and suss out the metrics being requested and supplied. It’s not pretty, and it might not even be easy. But it’s a lot easier than having to repeatedly tell your colleagues that you have no idea what gets monitored.
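If it comes to that, you don’t even need the GUI. Here is a sketch using tshark (Wireshark’s command-line sibling) to capture the SNMP conversation between your poller and one device; the interface name and target address are placeholders for your environment:

```
# Capture SNMP polls and responses (UDP 161) to/from one monitored device.
tshark -i eth0 -f "udp port 161 and host 10.10.10.42" -w snmp-sample.pcap
```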

PS: If you are down to the Wireshark method, it might be time to look at another monitoring solution.

There’s An App For That

Everything I’ve listed above is great for hardware-based metrics, including specialized storage and virtualization tools. But once you cross into the valley of the shadow of application monitoring, you need a better strategy.

By its very nature, application monitoring is highly complex, highly customized, and highly variable. Take something as straightforward as Microsoft Exchange. When building a monitoring solution, you might need specific monitoring for the hub transport server, the mailbox server, the client access server, or even the external OWA server (just to name a few). While there are common elements from one to the next, they serve very different purposes and require very different components and thresholds.

Take that concept and multiply it by all the applications in your environment.

So what is the diligent yet slightly overwhelmed monitoring engineer to do? While movies like “Glengarry Glen Ross” and “Boiler Room” have popularized the phrase “ABC – Always Be Closing” within sales circles, in monitoring (and indeed, within much of IT) I prefer “ABS – Always Be Standardizing.”

Whether the topic is alerts, reports, or sets of application monitoring components, the best thing you can do for yourself is to continuously work toward a single “gold standard” for a particular need, and then keep using it. Expand that single standard when necessary to account for additional variations, but avoid, as much as possible, the “copy and paste” process that leaves you with 42 different Exchange monitoring templates and (18 months down the road) no idea which one is the default, let alone which one is applied to which server.

If you are able to adhere to this one practice, then answering the question “What do you monitor, standard?” for applications becomes immeasurably more achievable.

Good Monitoring Is Its Own Reward

Armed with the information above, the conversation I described at the beginning of this essay goes in a very different direction:

Them: “We need to monitor about 20 of our devices and the applications that go with them.”

You: “Great! It’s a good thing you’re talking to me and not accounts receivable then. Do you have a list of what you are putting together?”

Them: (laughing politely at your very not-funny joke) “Right here.”

You: “OK, great. I’ve got a list here of the things that get monitored automatically when we load your devices into our software. Can you take a look and tell me if anything you think is important is missing?” (hands printed list of default hardware monitors across the table)

Them: (Scanning the list) “Wow. This is… a lot. It’s more than we expected. I don’t want tickets on all this tho…”

You: “Oh, no. These are the metrics we collect. Alerts are something else. But if you need to see the general health of your systems, this is what we can provide you automatically.”

Them: “Oh, I get it! That’s great. You know what? We’ll ride with this list for the time being.”

You: “Perfect. Now, on the application side, here’s what we have in place today.” (hands over printed list of application monitoring).

Them: “OK, I can already see we’ll need a few extra items that the devs have said they want to track. They will also want to alert on some custom messages they’re generating.”

You: “No worries. You can get me that list by email later if you want. I’m sure we’ll be able to get it all set up for you.”

Them: “Wow, this was a lot less painful than I expected.”

You: (Winks at camera) “I know!”

The Long And Winding Road

You pull into the driveway and head into your home, reflecting on how 8:45am seemed like 100 years ago in terms of the miles you’ve put on that old brain of yours. But then you sit down and take a minute to clear your head, and you realize a few things:

First, that monitoring – GOOD monitoring – is something achievable by anyone who is willing to put in a little time and effort, and can be accomplished with most of the monitoring solutions on the market today.

Second, that putting in the time and effort to create good monitoring makes the overall job significantly more enjoyable.

Third, that monitoring is, and should be treated as, a separate discipline within IT, as valid a sub-specialty as networking, programming, storage, virtualization, or information security. And only when it IS treated as a discipline will people be willing to put in the time and effort I’ve been describing, and to invest in the tools that help make that possible.

Fourth, and finally, that everything that happened today – from the 8:45am beginning I described in part 1 all the way to the cold beer in your hand now – is all part of a day’s work in IT.