#FeatureFriday: Checking Additional Polling Health and Load

Welcome to “Feature Friday”, a series of (typically short) videos which explain a feature, function, or technique.

Once your monitoring environment grows beyond a certain size, a single server – no matter how robust your software or how beefy your hardware – is simply not going to cut it. And not long after that, when you’ve added another polling server, you will want to know how it is performing, and if it’s time for yet another additional poller.

To that end, in the video below, my fellow SolarWinds Head Geeks Kong Yang, Patrick Hubbard, and I speak about ways you can validate the health and current load being carried by each additional poller in your environment, so there are no surprises.


For more insights into monitoring, as well as random silliness, you can follow me on Twitter (@LeonAdato) or find me on the SolarWinds THWACK.com forums (@adatole)

“Nice to Have” is Relative

What could you absolutely positively not live without? … Draw a line. Come up with reasons. Know where you stand.

Cute. Except without requirements you can’t draw a line effectively.

So go from there. Start with what you need to accomplish, and then set your must-haves to achieve that goal. Don’t embellish, don’t waffle. Just the essentials.

Draw the line, set the next goal, list the deal-breakers.

Believe it or not, this is an exercise most organizations don’t do (at least for monitoring). They start with an (often vague) goal – “Select a NetFlow tool” or “Monitor customer experience for applications” – and then they look for tools that do that.

Before you know it, you have 20, or 50, or 150 tools (I am NOT exaggerating). You have staff – even whole teams – who invest hundreds of hours getting good at using those tools.

And then you get allegiances. And then you get politics.

Respect Your Elders

NOTE: This article originally appeared here.

“Oh Geez,” exclaimed the guy who sits 2 desks from me, “that thing is ancient! Why would they give him that?”

Taking the bait, I popped my head over the wall and asked, “What is?”

He showed me a text message, sent to him from a buddy—an engineer (EE, actually) who worked for an oil company. My co-worker’s iPhone 6 displayed an image of a laptop we could only describe as “vintage”:

(A Toshiba Tecra 510CDT, which was cutting edge…back in 1997.)

“Wow,” I said. “Those were amazing. I worked on a ton of those. They were serious workhorses—you could practically drop one from a four-story building and it would still work. I wanted one like nobody’s business, but I could never afford it.”

“OK, back in the day I’m sure they were great,” said my 20-something coworker dismissively. “But what the hell is he going to do with it NOW? Can it even run an OS anymore?”

I realized he was coming from a particular frame of reference that is common to all of us in I.T. Newer is better. Period. With few exceptions (COUGH-Windows M.E.-COUGH), the latest version of something—be it hardware or software—is always a step up from what came before.

While that’s often true, it leads to a frame of mind that is patently untrue: a belief that what is old is also irrelevant. Especially for I.T. professionals, it’s a dangerous line of thought that almost always leads to unnecessary mistakes and avoidable failures.

In fact, ask any I.T. pro who’s been at it for a decade, and you’ll hear story after story:

  • When programmers used COBOL, back when dinosaurs roamed the earth, one of the fundamental techniques drilled into their heads was “check your inputs.” Think about the latest crop of exploits—an SSLv3 issue like POODLE, a SQL injection, or any of a plethora of web-based security problems—and the fundamental flaw is the server NOT sanity-checking its inputs. (See the sketch just after this list.)
  • How about the OSI model? Yes, we all know it’s required knowledge for many certification exams (and at least one IT joke). But more importantly, it was (and still is) directly relevant to basic network troubleshooting.
  • Nobody needs to know CORBA database structure anymore, right? Except that a major monitoring tool was originally developed on CORBA and that foundation has stuck. Which is why, if you try to create a folder-inside-a-folder more than 3 times, the entire system corrupts. CORBA (one of the original object-oriented databases) could only handle 3 levels of object containership.
  • PowerShell can be learned without understanding Unix/Linux command-line concepts. But it’s sure EASIER to learn if you already know how to pipe ls into grep into awk into sort so that you get a list of just the files you want, sorted by date. Supporting that kind of pipeline (among other Unix/Linux concepts) was one of the original design goals of PowerShell.
  • Older revs of industrial motion-control systems used specific pin-outs on the serial port. The new USB-to-serial cables don’t mimic those pin-outs correctly, and trying to upload a program with the new cables will render the entire system useless.
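
That “check your inputs” lesson is easy to demonstrate. Below is a minimal sketch in Python (using the standard library’s sqlite3 module, with a made-up table and user purely for illustration) of how an unchecked input rewrites a query, while a parameterized query treats the very same input as plain data:

```python
import sqlite3

# Toy database, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'admin')")

user_input = "nobody' OR '1'='1"  # a classic injection attempt

# DANGEROUS: concatenating the input into the SQL trusts it completely.
# The injected quote rewrites the WHERE clause and returns EVERY row.
rows = conn.execute(
    "SELECT role FROM users WHERE name = '" + user_input + "'"
).fetchall()
print(rows)  # [('admin',)] -- the "input" became part of the query

# SAFE: a parameterized query treats the input as data, never as SQL.
rows = conn.execute(
    "SELECT role FROM users WHERE name = ?", (user_input,)
).fetchall()
print(rows)  # [] -- no user is literally named that
```

The COBOL programmers had it right: sanity-check (or safely parameterize) everything that crosses the boundary into your system.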

And in fact, that’s why my co-worker’s buddy was handed one of those venerable Tecra laptops. It had a standard serial port, and it was preloaded with the vendor’s DOS-based ladder-logic programming utility. Nobody expected it to run Windows 10, but it filled a role that modern hardware simply couldn’t.

It’s a fun story, but you have to ask: aside from a few interesting anecdotes and bizarre use-cases, does any of this have relevance to our day-to-day work?

You bet.

We live in a world where servers, storage, and now the network are rushing toward a quantum singularity of virtualization.

And the “old-timers” in the mainframe team are laughing their butts off as they watch us run in circles, inventing new words to describe techniques they learned at the beginning of their career; making mistakes they solved decades ago; and (worst of all) dismissing everything they know as utterly irrelevant.

Think I’m exaggerating? SAN and NAS look suspiciously like DASD, just on faster hardware. Services like Azure and AWS, for all their glitz and automation, aren’t as far from rented time on a mainframe as we’d like to imagine. And when my company replaces my laptop with a fancy “appliance” that connects to a Citrix VDI session, it reminds me of nothing so much as the VAX terminals I supported back in the day.

My point isn’t that I’m a techno-Ecclesiastes shouting “there is nothing new under the sun!” Or some I.T. hipster who was into the cloud before it was cool. My point is that it behooves us to remember that everything we do, and every technology we use, had its origins in something much older than 20 minutes ago.

If we take the time to understand that foundational technology, we have the chance to avoid past missteps, leverage undocumented efficiencies built into the core of the tools, and build on ideas elegant enough to have withstood the test of time.


Got your own “vintage” story, or long-ago-learned tidbit that is still true today? Share it in the comments!

#FeatureFriday: SolarWinds NPM Syslog/Trap Health

Welcome to “Feature Friday”, a series of (typically short) videos which explain a feature, function, or technique.

A frequently misunderstood aspect of setting up a monitoring solution is how to handle syslog and trap messages. Without going too deeply into it here (because the video below does a great job of that), the volume of messages in some environments can overwhelm the most robust software on the beefiest hardware.

The simple fact is that syslog and trap are chatty protocols. If you don’t have a design in place that can filter out the noise, you may end up thinking your monitoring solution is performing poorly when it is merely struggling under the weight of an unmanageable message stream.
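
To make “filtering out the noise” concrete, here’s a minimal sketch of a filtration layer in Python. Everything here is hypothetical—the upstream address and the noise patterns are stand-ins, and a purpose-built tool like Kiwi Syslog Server does this job far more robustly—but the principle is the same: receive everything, drop the chatter, forward only the signal.

```python
import re
import socket

LISTEN_ADDR = ("0.0.0.0", 514)            # syslog's standard UDP port
FORWARD_ADDR = ("npm.example.com", 514)   # hypothetical upstream collector

# Patterns for chatter we never want to alert on; tune these for your shop.
NOISE = [
    re.compile(rb"%LINEPROTO-5-UPDOWN"),  # interface flap chatter
    re.compile(rb"logged (in|out)"),      # routine login noise
]

recv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
recv.bind(LISTEN_ADDR)                    # binding to 514 needs admin/root
send = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

while True:
    msg, src = recv.recvfrom(8192)
    if any(p.search(msg) for p in NOISE):
        continue                          # drop the noise on the floor
    send.sendto(msg, FORWARD_ADDR)        # pass along only what matters
```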

In this video, I explain to my fellow SolarWinds Head Geeks Kong Yang and Patrick Hubbard exactly how this happens, and how to build a design that avoids the problem in the first place.


For more insights into monitoring, as well as random silliness, you can follow me on Twitter (@LeonAdato) or find me on the SolarWinds THWACK.com forums (@adatole)

Are We There Yet? Are We There Yet? Are We…

How often you request information from a target device (commonly known as the polling cycle) is one of the fine arts of monitoring design, and one of the things that requires the most up-front discussion with consumers of your information.

Why? Because the consumer (the person who gets the ticket, report, etc.) is often unaware of the delays that exist between the occurrence of an actual event and the reporting/recording of it. Worse, most of us in the monitoring segment of IT do a poor job of communicating these things.

Let’s take good ol’ ping as an example.

You want to check the up-down status of your systems on a regular basis. But you DON’T want to turn your monitoring system into an inadvertent DDoS Smurf-attack generator.

So you pick some nice innocuous interval – say a ping check every 1 minute.

Then you realize that you need to monitor 3,000 systems, and so you back that off to 2 minutes.

Once you get past the frequency question, there’s the additional level of verification. Just because a system failed ONE ping check doesn’t mean it’s actually down. So maybe your toolset does an additional level of just-in-time validation (let’s say it pings an additional 10 times, just to be sure the target is really down). That verification introduces at least another full minute of delay.

Is your alerting system doing nothing else? It’s just standing idle, waiting for this one alert to come in and be processed? Probably not. Let’s throw another 60 seconds into the mix for inherent delays in the alert engine.

Does your server team really want to know when a device is unavailable (i.e., ping failed) for 1 minute? Probably not. They know servers get busy and don’t respond to ping even though they are still processing. Heck, they may even want to avoid cutting tickets during patching-cycle reboots. My experience is that most teams don’t want to hear that a server is down unless it’s really been down for at least 5 minutes.
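
Add all of those delays together and the math tells the story. Here’s a back-of-the-napkin sketch using the illustrative numbers from this post:

```python
# Worst-case time from "server actually goes down" to "ticket is cut".
# All values in seconds, using the illustrative numbers above.
polling_interval = 120  # ping check every 2 minutes
verification     = 60   # ~10 extra pings to confirm it's really down
alert_engine_lag = 60   # inherent processing delay in the alert engine
down_threshold   = 300  # the team only wants a ticket after 5 minutes down

worst_case = polling_interval + verification + alert_engine_lag + down_threshold
print(f"{worst_case} seconds, or {worst_case / 60:.0f} minutes")
# 540 seconds, or 9 minutes
```

Nine-plus minutes, and every one of those numbers was chosen for a sensible reason.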

Now you get the call: “I thought we were pinging every minute. But I got a ticket this morning and when I went to check the server, it had been down for over 10 minutes. What gives?”

There’s no single “right” answer to this. Nobody wants false alarms. And nobody wants to find out a system is down after the customer already knows.

But it is between those two requirements that one minute stretches to ten.

When, Not What, Defines Today’s Networking Career

Back in December, Cisco filed a lawsuit against Arista Networks, claiming that Arista’s network device operating system, EOS, is too similar to Cisco’s beloved IOS.

This caused Tom Hollingsworth (a.k.a. “The Networking Nerd”) to speculate that this action presaged the ultimate death of the network device command-line interface (CLI).

Time will tell whether Hollingsworth is right or wrong and to what degree, but the idea intrigued me. Why would it matter if the command-line interface went away? What would be the loss?

Now, before going further, here’s a little background on me: I tend to be a “toaster” guy when it comes to technology. I don’t love my toaster or hate my toaster. I don’t proselytize the glorious features of my toaster to non-users. I just use my toaster. If it burns the toast, it’s time for a new toaster. Sure, over the years I’ve built up a body of experience that tells me my bagels tend to get stuck in the upright models, so I prefer the toaster/oven combos. But at the end of the day, it’s about making good toast.

Today’s networking career means learning new techniques

Jeez! Now I have a craving for a panini. Where was I? Oh right, technology.

I use a lot of technologies. My phone is Android. My work laptop runs Windows 8.1. My home desktop runs Linux. My wife lives on her iPad. So on and so forth. I’ve come to believe that learning technology is like learning to play cards.

The first game you learn is new, interesting and a little confusing, but ultimately thrilling because of the world it opens up. But the second card game, that’s the hard one. You know all the rules of the first game, but now there’s this other game that completely shatters what you knew. Then you learn your third card game, and you start to see the differences and similarities. By the fifth game, you understand that the cards themselves are just a vehicle for different ways of structuring sets.

I believe that’s why people are concerned about Hollingsworth’s prediction of the death of CLI. If you only know one game — and let’s face it, CLI is an extremely comprehensive and well-known “game” — and you’ve invested a lot of time and energy learning not only the rules of that game but also its nuances and tricks, finding out that game is about to be discontinued can be distressing. But when it comes to CLI, I believe that distress is actually due to a misplaced sense of self. Because you aren’t really a CLI professional, are you?

You’re a networking professional, not a CLI pro

Sure, you know a lot about CLI. But really, you’re a networking professional. Being able to configure Open Shortest Path First (OSPF) from memory makes your job easier. But your job is knowing what OSPF is, when to use it versus Enhanced Interior Gateway Routing Protocol (EIGRP), how to evaluate its performance, and so on.

No, the concern about the death of CLI is really rooted in the fear of personal obsolescence. I’ve heard that notion repeated in conversations about the mainframe, Novell networking, WordPerfect 5.1 and dozens of other technologies that were brilliant in their time, but which, for various reasons, were superseded by something else — sometimes something else that is better, and sometimes not.

And a fear of personal obsolescence in your networking career is ultimately false, unless you are digging in your heels and choosing never to learn anything new again. (OK, that was sarcasm, folks. As IT pros, we should be committed to life-long learning. Even if you are two years away from retirement, learning new stuff is still A Good Thing™.) As long as you are open to new ideas, new techniques and yes, new systems, then you won’t become obsolete.

Employers exploit networkers’ insecurity

I’ll be honest. I think there are a lot of employers that exploit this insecurity. “Where’s a Perl script-kiddie like you going to find this kind of role?” they whisper — usually implicitly, although sometimes much more explicitly than any of us prefer. Or if we’re interviewing for a new job, they ask, “I see you have a lot of experience with AIX, but we’re a Windows shop. Do you really think your skills translate?”

I’m not here to talk about interviewing skills, salary negotiations or career improvements, so I’m not going to get into the potential responses, but I will say that the ultimate answer in each of these cases — and many others — is “Yes.” Why? Because it’s not about whether I know the fifth parameter of the gerfrinkel command in CodeMe version 12.3.9.7, which was deprecated in 12.3.9.8 in favor of the unglepacht function. It’s not about any of that. It’s about my experience knowing when to use certain commands, when to look for a workaround, how to manage an environment of this scale and scope, and so on.

To play off the old joke about the copier repairman, a small part of your paycheck goes toward turning the screw; more of it is based on knowing which screw to turn.

As IT pros, we are paid — and are valuable — because we know how to find out which screw to turn and when to turn it. So to speak, of course.

#FeatureFriday: Rebuilding Database Indexes

Welcome to “Feature Friday”, a series of (typically short) videos which explain a feature, function, or technique.

While many monitoring solutions have built-in database maintenance routines, often more may be needed to ensure everything is absolutely squeaky-clean. This is especially important to do prior to upgrades and patches. And one of the basic cleanup tasks for a database is rebuilding indexes. 

In this video, SolarWinds Head Geeks Kong Yang, Patrick Hubbard, and I look at what database reindexing looks like, and even offer some simple scripts to do it in your environment.
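
If you can’t wait for the video, here’s a flavor of what a reindexing script can look like—a hypothetical sketch in Python using pyodbc against a SQL Server database. The server and database names are placeholders, and you’d want to run anything like this during a maintenance window (and against a test system first):

```python
import pyodbc

# Hypothetical connection string -- point this at your own database server.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=sqlserver.example.com;DATABASE=MonitoringDB;"
    "Trusted_Connection=yes;",
    autocommit=True,
)
cur = conn.cursor()

# Enumerate the user tables, then rebuild every index on each one.
tables = cur.execute(
    "SELECT TABLE_SCHEMA + '.' + TABLE_NAME FROM INFORMATION_SCHEMA.TABLES "
    "WHERE TABLE_TYPE = 'BASE TABLE'"
).fetchall()

for (table,) in tables:
    print(f"Rebuilding indexes on {table}")
    cur.execute(f"ALTER INDEX ALL ON {table} REBUILD")
```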


For more insights into monitoring, as well as random silliness, you can follow me on Twitter (@LeonAdato) or find me on the SolarWinds THWACK.com forums (@adatole)

A Cornucopia

Before you know it, you have 20, or 50, or 150 monitoring tools in your company.

How does that happen?

The way most other things happen: organically, slowly, in response to events on the ground, to pressures of time and budget and need (both real and perceived), and to people’s biases.

Can you quantify which of your monitoring tools can ping? Can perform an SNMP walk? Can receive NetFlow information? Can perform a WMI query?

Of all the tools in your environment, do you know how many have features which overlap? And how much overlap?

If you had one solution which covered 90% (or even 80%) of the functionality of another tool (or more than one!), would you know? And would you know if the remaining 10-20% was essential?

And now that I’ve asked, how would you go about capturing that information?

“Logfile Monitoring” – I Do Not Think It Means What You Think It Means

This is a conversation I have A LOT with clients. They say they want “logfile monitoring,” and I am not sure what they mean. So I end up having to unwind all the different things it COULD be, so we can get to what they actually need.

It’s also an important clarification for me to make as SolarWinds Head Geek because, depending on what the requester means, I might need to point them toward Kiwi Syslog Server, Server & Application Monitor, or Log & Event Manager (LEM).

Here’s a handy guide to identify what people are talking about. “Logfile monitoring” is usually applied to 4 different and mutually exclusive areas. Before you allow the speaker to continue, please ask them to clarify which one they are talking about:

  1. Windows Logfile
  2. Syslog
  3. Logfile aggregation
  4. Monitoring individual text files on specific servers

More clarification on each of these areas below:

Windows Logfile

Monitoring in this area refers specifically to the Windows event log, which isn’t actually a log “file” at all, but a database unique to Windows machines.

In the SolarWinds world, the tool that does this is Server & Application Monitor. Or if you are looking for a small, quick-and-dirty utility, the Eventlog Forwarder for Windows will take Eventlog messages that match a search pattern and pass them via syslog to another machine.
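
For the curious, here’s a toy sketch of what reading the Windows event log programmatically looks like, using Python’s pywin32 package. (This is purely illustrative—it is not how SAM or the Eventlog Forwarder are implemented.)

```python
import win32evtlog  # from the pywin32 package; Windows-only

EVENTLOG_ERROR_TYPE = 1  # Windows API constant for "Error" severity

handle = win32evtlog.OpenEventLog("localhost", "Application")
flags = win32evtlog.EVENTLOG_BACKWARDS_READ | win32evtlog.EVENTLOG_SEQUENTIAL_READ

# Read the most recent batch of records and surface just the errors.
for event in win32evtlog.ReadEventLog(handle, flags, 0):
    if event.EventType == EVENTLOG_ERROR_TYPE:
        print(event.TimeGenerated, event.SourceName, event.EventID & 0xFFFF)

win32evtlog.CloseEventLog(handle)
```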

Syslog

Syslog is a protocol that describes how to send a message from one machine to another on UDP port 514. The messages must fit a pre-defined structure, and syslog is distinct from SNMP traps. The protocol is most often found when monitoring *nix (Unix, Linux) systems, although network and security devices send out their fair share as well.
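
Because the protocol is so simple, sending a syslog message takes only a few lines. Here’s a sketch using Python’s standard-library logging module, pointed at a hypothetical collector:

```python
import logging
import logging.handlers

# A hypothetical collector address; 514/UDP is the syslog standard.
handler = logging.handlers.SysLogHandler(
    address=("syslog.example.com", 514),
    facility=logging.handlers.SysLogHandler.LOG_LOCAL0,
)
log = logging.getLogger("demo")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.warning("disk usage on /var crossed 90%")  # one UDP datagram per call
```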

In terms of products, this is covered natively by Network Performance Monitor (NPM). But as I’ve said often, you shouldn’t send syslog or trap directly to your NPM primary poller. You should send it into a syslog/trap “filtration” layer first, and that would be the Kiwi Syslog Server (or its freeware cousin).

Logfile aggregation

This technique involves sending (or pulling) log files from multiple machines and collecting them on a central server. This collection is done at regular intervals. A second process then searches across all the collected logs, looking for trends or patterns in the enterprise. When the audit and security groups talk about “logfile monitoring,” this is usually what they mean.

As you may have already guessed, the SolarWinds tool for this job is Log & Event Manager. I should point out that LEM will ALSO receive syslog and traps, so you kind of get a twofer if you have this tool. Although, I personally STILL think you should send all of your syslog and trap to a filtration layer, and then send the non-garbage messages to the next step in the chain (NPM or LEM).

Monitoring individual text files on specific servers

This activity focuses on watching a specific (usually plain text) file in a specific directory on a specific machine, looking for a string or pattern to appear. When that pattern is found, an alert is triggered. Now it can get more involved than that—maybe not a specific file, but a file matching a specific pattern (like a date); maybe not a specific directory, but the newest sub-directory in a directory; maybe not a specific string, but a string pattern; maybe not ONE string, but 3 occurrences of the string within a 5 minute period; and so on. But the goal is the same—to find a string or pattern within a file.
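
As a sketch of what that boils down to, here’s a toy Python version of the “3 occurrences within 5 minutes” case. The file path and pattern are hypothetical, and a production tool would handle log rotation, restarts, and alert routing far more gracefully:

```python
import re
import time
from collections import deque

LOGFILE = "/var/log/app/worker.log"  # hypothetical file to watch
PATTERN = re.compile(r"ERROR|timed out")
THRESHOLD, WINDOW = 3, 300           # alert on 3 matches within 5 minutes

hits = deque()
with open(LOGFILE) as f:
    f.seek(0, 2)                     # start at the end, like `tail -f`
    while True:
        line = f.readline()
        if not line:
            time.sleep(1)            # nothing new yet; check again shortly
            continue
        if PATTERN.search(line):
            now = time.monotonic()
            hits.append(now)
            while hits and now - hits[0] > WINDOW:
                hits.popleft()       # forget matches outside the window
            if len(hits) >= THRESHOLD:
                print("ALERT:", line.strip())  # hand off to alerting here
                hits.clear()
```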

Within the context of SolarWinds, SAM has been the go-to solution for this type of thing. But at the moment, it does that only through a series of Perl, PowerShell, and VBScript templates.

We know that’s not the best way to get the job done, but that’s a subject for another post.

The More You Know…

For now, it’s important that you are able to clearly define—for yourself as well as your colleagues, customers, and consumers—which flavor of “logfile monitoring” is under discussion, and which tool or technique you need to employ to get the job done.

#FeatureFriday: Validating SolarWinds Database Maintenance

Welcome to “Feature Friday”, a series of (typically short) videos which explain a feature, function, or technique.

One of the key aspects of SolarWinds tools is their ease of use. Other enterprise-class monitoring solutions require you to have subject matter experts on hand to help with implementation and maintenance. SolarWinds can be installed by a single technician and doesn’t require you to have a DBA or Linux expert or CCIE on hand.

But that doesn’t mean there’s no maintenance happening. And while a lot of it is automated, it’s important for folks who are responsible for the SolarWinds toolset to understand whether that maintenance is running correctly.

In this video, Head Geeks Kong Yang, Patrick Hubbard, and I go over the SolarWinds maintenance subroutines and how to see whether things are happy or not under the hood.


For more insights into monitoring, as well as random silliness, you can follow me on Twitter (@LeonAdato) or find me on the SolarWinds THWACK.com forums (@adatole)