Category Archives: Monitoring

When It Comes to System Outages, Don’t Prepare For the Worst

NOTE: This article originally appeared here.

During the 2015 World Cup soccer competition, Nate Silver and the psychic witches he keeps in his basement — because how else could he make the predictions he does with such accuracy? — got it wrong. Really, really wrong. They were completely blindsided by Germany’s win over Brazil. As Silver described it, it was a completely unforeseeable event.

 

In sports and, to a lesser extent, politics, the tendency in the face of these things is to eat the loss, chalk it up to a fluke — a black swan in statistics parlance — and get on with life.

But as network administrators, we know that’s not how it works in IT.

In my experience, when a black swan event affects IT systems, management usually acquires a dark obsession with the event. Meetings are called under the guise of “lessons learned exercises,” with the express intent of ensuring said system outages never happen again.

Don’t spend too much time studying what might occur

Now, I’m not saying that after a failure we should just blithely ignore any lessons that could be learned. Far from it, actually. In the ashes of a failure, you often find the seeds of future avoidance. One of the first things an IT organization should do after such an event is determine whether the failure was predictable, or if it was one of those cases where there wasn’t enough historical data to determine a decent probability.

 

If the latter is the case, I’m here to tell you your efforts are much better spent elsewhere. What’s a better approach? Instead of spending time trying to figure out if a probability may or may not exist, catch and circumvent those common, everyday IT annoyances. This is a tactic that’s overlooked far too often.

Don’t believe me? Well, let’s take the example of a not-so-imaginary company I know that had a single, spectacular IT failure that cost somewhere in the neighborhood of $100,000. Management was understandably upset. It immediately set up a task force to identify the root cause of the failure and recommend steps to avoid it in the future. Sounds reasonable, right?

The task force — five experts pulled from the server, network, storage, database and applications teams — took three months and more than 100 staff-hours to investigate the root cause. Being conservative, let’s say the hourly cost to the company was $50. Now, multiply that by five people, then by 100 hours, then by three months. It comes to a nice round $125,000.

Not so reasonable after all

Yes, at the end of it all the root problem was not only identified — at least, as much as possible — but code was put in place to (probably) predict the next time the exact same event might occur. Doesn’t sound so bad. But keep this in mind: The company spent $25,000 more than the cost of the original failure to create a system outages solution that may or may not predict the occurrence of a black swan exactly like the one that hit before.

Maybe it wasn’t so reasonable after all.

You may be thinking, “But where else are you saying we should be focusing on? After all, we’re held accountable to the bottom line as much as anyone else in the company.”

I get that, and it’s actually my point. Let’s compare the previous example of chasing a black swan to another, far more common problem: network interface card (NIC) failures.

In this example, another not-so-fictitious company saw bandwidth usage spike and stay high. NICs threw errors until the transmission rates bottomed out, and eventually the card just up and died. The problem was that while bandwidth usage was monitored, there was no alerting in place for interfaces that stopped responding or disappeared (the company monitored the IP at the end of the connection, which meant WAN links were absent alerts until the far end went down).

Let’s assume that a NIC failure takes an average of one hour to notice and correctly diagnose, and then it takes two hours to fix by network administrators who cost the company $53 per hour. While the circuit is out, the company loses about $1,000 per hour in revenue, lost opportunity, etc. That means system outages like this one could cost the company $3,106.

Setting a framework anchored by alerting and monitoring

Now, consider that, in my experience, proper monitoring and alerting reduces the time it takes to notice and diagnose problems such as NIC failures to 15 minutes. That’s it. Nothing else fancy, at least not in this scenario. But that simple thing could reduce the cost of the outage by $750.

I know those numbers don’t sound too impressive. That is, until you realize a moderately sized company can easily experience 100 NIC failures per year. That translates to more than $300,000 in lost revenue if the problem is unmonitored, and an annual savings of $75,000 if alerting is in place.

And that doesn’t take into account the ability to predict NIC failures and replace the card pre-emptively. If we estimate that 50% of the failures could be avoided using predictive monitoring, the savings could rise to more than $190,000.

Again, I’m not saying preparing for black swan events isn’t a worthy endeavor, but when tough budget decisions need to be made, some simple alerting on common problems can save more than trying to predict and prevent “the big one” that may or may not ever happen.

After all, NIC failures are no black swan. I think even Nate Silver would agree they’re a sure thing.

#FeatureFriday: Checking Additional Polling Health and Load

Welcome to “Feature Friday”, a series of (typically short) videos which explain a feature, function, or technique.

Once your monitoring environment grows beyond a certain size, a single server – no matter how robust your software or how beefy your hardware – is simply not going to cut it. And not long after that, when you’ve added another polling server, you will want to know how it is performing, and if it’s time for yet another additional poller.

To that end, in the video below, my fellow SolarWinds Head Geeks Kong Yang, Patrick Hubbard, and I speak about ways you can validate the health and current load being carried by each additional poller in your environment, so there are no surprises.


For more insights into monitoring, as well as random silliness, you can follow me on Twitter (@LeonAdato) or find me on the SolarWinds THWACK.com forums (@adatole)

“Nice to Have” is Relative

What could you absolutely positively not live without? … Draw a line. Come up with reasons. Know where you stand.

Cute. Except without requirements you can’t draw a line effectively.

So go from there. Start with what you need to accomplish, and then set your must-haves to achieve that goal. Don’t embellish, don’t waffle. Just the essentials.

Draw the line, set the next goal, list the deal-breakers.

Believe it or not, this is an exercise most organizations don’t do (at least for montioring). They start with a (often vague) goal – “Select a NetFlow tool” or “Monitor customer experience for applications”. And then they look for tools that do that.

Before you know it, you have 20, or 50, or 150 tools (I am NOT exaggerating). . You have staff – even whole teams – who invest hundreds of hours getting good using those tools.

And then you get allegiences. And then you get politics.

Respect Your Elders

NOTE: This article originally appeared here.

“Oh Geez,” exclaimed the guy who sits 2 desks from me, “that thing is ancient! Why would they give him that?”

Taking the bait, I popped my head over the wall and asked “what is?”

He showed me a text message, sent to him from a buddy—an engineer (EE, actually) who worked for an oil company. My co-worker’s iPhone 6 displayed an image of a laptop we could only describe as “vintage”:

(A Toshiba Tecra 510CDT, which was cutting edge…back in 1997.)

“Wow.” I said. “Those were amazing. I worked on a ton of those. They were serious workhorses—you could practically drop one from a 4 story building and it would still work. I wanted one like nobody’s business, but I could never afford it.”

“OK, back in the day I’m sure they were great,” said my 20-something coworker dismissively. “But what the hell is he going to do with it NOW? Can it even run an OS anymore?”

I realized he was coming from a particular frame of reference that is common to all of us in I.T. Newer is better. Period. With few exceptions (COUGH-Windows M.E.-COUGH), the latest version of something—be it hardware or software—is always a step up from what came before.

While true, it leads to a frame of mind that is patently un-true: a belief that what is old is also irrelevant. Especially for I.T. Professionals, it’s a dangerous line of thought that almost always leads to un-necessary mistakes and avoidable failures.

In fact, ask any I.T. pro who’s been at it for a decade, and you’ll hear story after story:

  • When programmers used COBOL, back when dinosaurs roamed the earth, one of the fundamental techniques that were drilled into their heads was, “check your inputs.” Thinking about the latest version of exploits, be they an SSLv3 thing like ‘Poodle’, or a SQL injection, or any of a plethora of web based security problems, the fundamental flaw is the server NOT checking its inputs, for sanity.
  • How about the OSI model? Yes, we all know its required knowledge for many certification exams (and at least one IT joke). But more importantly, it was (and still is) directly relevant to basic network troubleshooting.
  • Nobody needs to know CORBA database structure anymore, right? Except that a major monitoring tool was originally developed on CORBA and that foundation has stuck. Which is why, if you try to create a folder-inside-a-folder more than 3 times, the entire system corrupts. CORBA (one of the original object-oriented databases) could only handle 3 levels of object containership.
  • Powershell can be learned without understanding the Unix/Linux command line concepts. But, it’s sure EASIER to learn if you already know how to pipe ls into grep into awk into awk so that you get a list of just the files you want, sorted by date. That technique (among other Unix/Linux concepts) was one of the original goals of Powershell.
  • Older rev’s of industrial motion-control systems used specific pin-outs on the serial port. The new USB-to-Serial cables don’t mimic those pin-outs correctly, and trying to upload a program with the new cables will render the entire system useless.

And in fact, that’s why my co-worker’s buddy was handed one of those venerable Tecra laptops. It had a standard serial port and it was preloaded with the vendor’s DOS-based ladder-logic programming utility. Nobody expected it to run Windows 10, but it fulfilled a role that modern hardware simply couldn’t have done.

It’s an interesting story, but you have to ask: aside from some interesting anecdotes and a few bizarre use-cases, does this have any relevance to our work day-today?

You bet.

We live in a world where servers, storage, and now the network is rushing toward a quantum singularity of virtualization.

And the “old-timers” in the mainframe team are laughing their butts off as they watch us run in circles, inventing new words to describe techniques they learned at the beginning of their career; making mistakes they solved decades ago; and (worst of all) dismissing everything they know as utterly irrelevant.

Think I’m exaggerating? SAN and NAS look suspiciously like DASD, just on faster hardware. Services like Azure and AWS, for all their glitz and automation, aren’t as far from rented time on a mainframe as we’d like to imagine. And when my company replaces my laptop with a fancy “appliance” that connects to Citrix VDI session, it reminds me of nothing as much as the VAX terminals I supported back in the day.

My point isn’t that I’m a techno-Ecclesiastes shouting “there is nothing new under the sun!” Or some I.T. hipster who was into the cloud before it was cool. My point is that it behooves us to remember that everything we do, and every technology we use, had its origins in something much older than 20 minutes ago.

If we take the time to understand that foundational technology we have the chance to avoid past mis-steps, leverage undocumented efficiencies built into the core of the tools, and build on ideas elegant enough to have withstood the test of time.


Got your own “vintage” story, or long-ago-learned tidbit that is still true today? Share it in the comments!

#FeatureFriday: SolarWinds NPM Syslog/Trap Health

Welcome to “Feature Friday”, a series of (typically short) videos which explain a feature, function, or technique.

A frequently misunderstood aspect of setting up a monitoring solution is how to handle syslog and trap. Without going too deeply into it here (because the video below does a great job of that) the volume of messages in some environments can overwhelm the most robust software on the beefiest hardware.

The simple fact is that syslog and trap are chatty protocols. If you don’t have a design in place that can filter out the noise, you may end up thinking your monitoring solution is performing poorly when it is merely struggling under the weight of an unmanageable message stream.

In this video, I explain to my fellow SolarWinds Head Geeks Kong Yang and Patrick Hubbard exactly how this happens  and how build a design that avoids the problem in the first place.


For more insights into monitoring, as well as random silliness, you can follow me on Twitter (@LeonAdato) or find me on the SolarWinds THWACK.com forums (@adatole)

Are We There Yet? Are We There Yet? Are We…

How often you request information from a target device (commonly known as the polling cycle) is one of the fine arts of monitoring design, and one of the things that requires the most up-front discussion with consumers of your information.

Why? Because it’s often unclear to the consumer (the person who is getting the ticket, report, etc) the delays that exist between the occurrence of an actual event  and the reporting/recording of it. Worse, most of us in the monitoring segment of IT do a poor job of communicating these things.

Let’s take good ol’ ping as an example.

You want to check the up-down status of your systems on a regular basis. But you DON’T want to turn your monitoring system into an inadvertent DDOS Smurf attack generator.

So you pick some nice innocuous interval – say a ping check every 1 minute.

Then you realize that you need to monitor 3,000 systems, and so you back that off to 2 minutes

 Once you get past that frequency, there’s the additional level of verification. Just because a system failed ONE ping check doesn’t mean it’s actually down. So maybe your toolset does an additional level of just-in-time validation (let’s say it pings an additional 10 times, just to be sure the target is really down). That verification introduces at least another full minute of delay.

Is your alerting system doing nothing else? It’s just standing idle, waiting for this one alert to come in and be processed? Probably not. Let’s throw another 60 seconds into the mix for inherent delays in the alert engine.

Does your server team really want to know when a device is unavailable (ie: ping failed) for 1 minute? Probably not. They know servers get busy and don’t respond to ping even though they are processing. Heck, they may even want to avoid cutting tickets during patching cycle reboots. My experience is that most teams don’t want to hear that a server is down unless its really been down for at least 5 minutes.

Now you get the call: “I thought we were pinging every minute. But I got a ticket this morning and when I went to check the server, it had been down for over 10 minutes. What gives?”

There’s no single “right” answer to this. Nobody wants false alarms. And nobody wants to find out a system is down after the customer already knows.

But it is between those two requirements that one minute stretches to ten.

When, Not What, Defines Today’s Networking Career

Back in December, Cisco filed a lawsuit against Arista Networks because Arista’s network device operating system, EOS, was too similar to Cisco’s beloved IOS.

 This caused Tom Hollingsworth (a.k.a. “The Networking Nerd”) to speculate that this action presaged the ultimate death of the network device command-line interface (CLI).

Time will tell whether Hollingsworth is right or wrong and to what degree, but the idea intrigued me. Why would it matter if the command-line interface went away? What would be the loss?

Now, before going further, here’s a little background on me: I tend to be a “toaster” guy when it comes to technology. I don’t love my toaster or hate my toaster. I don’t proselytize the glorious features of my toaster to non-users. I just use my toaster. If it burns the toast, it’s time for a new toaster. Sure, over the years I’ve built up a body of experience that tells me my bagels tend to get stuck in the upright models, so I prefer the toaster/oven combos. But at the end of the day, it’s about making good toast.

Today’s networking career means learning new techniques

Jeez! Now I have a craving for a panini. Where was I? Oh right, technology.

I use a lot of technologies. My phone is Android. My work laptop runs Windows 8.1. My home desktop runs Linux. My wife lives on her iPad. So on and so forth. I’ve come to believe that learning technology is like learning to play cards.

The first game you learn is new, interesting and a little confusing, but ultimately thrilling because of the world it opens up. But the second card game, that’s the hard one. You know all the rules of the first game, but now there’s this other game that completely shatters what you knew. Then you learn your third card game, and you start to see the differences and similarities. By the fifth game, you understand that the cards themselves are just a vehicle for different ways of structuring sets.

I believe that’s why people are concerned about Hollingsworth’s prediction of the death of CLI. If you only know one game — and let’s face it, CLI is an extremely comprehensive and well-known “game” — and you’ve invested a lot of time and energy learning not only the rules of that game but also its nuances and tricks, finding out that game is about to be discontinued can be distressing. But when it comes to CLI, I believe that distress is actually due to a misplaced sense of self. Because you aren’t really a CLI professional, are you?

You’re a networking professional, not a CLI pro

Sure, you know a lot about CLI. But really, you’re a networking professional. Being able to configure open shortest path first (OSPF) from memory makes your job easier. But your job is knowing what OSPF is, when to use it versus enhanced interior gateway routing protocol, how to evaluate its performance and so on.

No, the concern about the death of CLI is really rooted in the fear of personal obsolescence. I’ve heard that notion repeated in conversations about the mainframe, Novell networking, WordPerfect 5.1 and dozens of other technologies that were brilliant in their time, but which, for various reasons, were superseded by something else — sometimes something else that is better, and sometimes not.

And a fear of personal obsolescence in your networking career is ultimately false, unless you are digging in your heels and choosing never to learn anything new again. (OK, that was sarcasm, folks. As IT pros, we should be committed to life-long learning. Even if you are two years away from retirement, learning new stuff is still A Good Thing™.) As long as you are open to new ideas, new techniques and yes, new systems, then you won’t become obsolete.

Employers exploit networkers’ insecurity

I’ll be honest. I think there are a lot of employers that exploit this insecurity. “Where’s a Perl script-kiddie like you going to find this kind of role?” they whisper — usually implicitly, although sometimes much more explicitly than any of us prefer. Or if we’re interviewing for a new job, they ask, “I see you have a lot of experience with AIX, but we’re a Windows shop. Do you really think your skills translate?”

I’m not here to talk about interviewing skills, salary negotiations or career improvements, so I’m not going to get into the potential responses, but I will say that the ultimate answer in each of these cases — and many others — is “Yes.” Why? Because it’s not about whether I know the fifth parameter of the gerfrinkel command in CodeMe version 12.3.9.7, which was deprecated in 12.3.9.8 in favor of the unglepacht function. It’s not about any of that. It’s about my experience on when to use certain commands, when to look for a workaround, how to manage an environment of this scale and scope and so on.

To play off the old joke about the copier repairman, a small part of your paycheck goes toward turning the screw; more of it is based on knowing which screw to turn.

As IT pros, we are paid — and are valuable — because we know how to find out which screw to turn and when to turn it. So to speak, of course.

#FeatureFriday: Rebuilding Database Indexes

Welcome to “Feature Friday”, a series of (typically short) videos which explain a feature, function, or technique.

While many monitoring solutions have built-in database maintenance routines, often more may be needed to ensure everything is absolutely squeaky-clean. This is especially important to do prior to upgrades and patches. And one of the basic cleanup tasks for a database is rebuilding indexes. 

In this video SolarWinds Head Geeks Kong Yang, Patrick Hubbard, and myself look at what database reindexing looks like, and even offer some simple scripts to do it in your environment.


For more insights into monitoring, as well as random silliness, you can follow me on Twitter (@LeonAdato) or find me on the SolarWinds THWACK.com forums (@adatole)

A Cornucopia

Before you know it, you have 20, or 50, or 150 monitoring tools in your company.

How does that happen?

Like most other things happen. Organically, slowly, in response to events on the ground, to pressures of time and budget and need (both real and perceived), to people’s biases.

Can you quantify which of your monitoring tools can ping? Can perform an SNMP walk? Can receive NetFlow information? Can perform a WMI query?

Of all the tools in your environment, do you know how many have features which overlap? And how much overlap?

If you had one solution which covered 90% (or even 80%) of the functionality of another tool (or more than one!), would you know? And would you know if the remaining 10-20% was essential?

And now that I’ve asked, how would you go about capturing that information?

“Logfile Monitoring” – I Do Not Think It Means What You Think It Means

This is a conversation I have A LOT with clients. They say we want “logfile monitoring” and I am not sure what they mean. So I end up having to unwind all the different things it COULD be, so we can get to what it is they actually need.

It’s also an important clarification for me to make as SolarWinds Head Geek because depending on what the requested means, I might need to point them toward Kiwi Syslog Server, Software & Application Monitor, or Log & Event Manager (LEM).

Here’s a handy guide to identify what people are talking about. “Logfile monitoring” is usually applied to 4 different and mutually exclusive areas. Before you allow the speaker to continue, please ask them to clarify which one they are talking about:

  1. Windows Logfile
  2. Syslog
  3. Logfile aggregation
  4. Monitoring individual text files on specific servers

More clarification on each of these areas below:

Windows Logfile

Monitoring in this area refers specifically to the Windows event log, which isn’t actually a log “file” at all, but a database unique to Windows machines.

In the SolarWinds world, the tool that does this is Server & Application Monitor. Or if you are looking for a small, quick, and dirty utility, the Eventlog Forwarder for Windows will take Eventlog messages that match a search pattern and pass them via Syslog to another machine.

Syslog

Syslog is a protocol, which describes how to send a message from one machine to another on UDP port 514. The messages must fit a pre-defined structure. Syslog is different from SNMP Traps. This protocol is most often found when monitoring network and *nix (Unix, Linux) devices, although network and security devices send out their fair share as well.

In terms of products, this is covered natively by Network Performance Monitor (NPM), but as I’ve said often you shouldn’t send syslog or trap directly to your NPM primary poller. You should send it into a syslog/trap “filtration” first. And that would be the Kiwi Syslog server (or its freeware cousin).

Logfile aggregation

This technique involves sending (or pulling) log files from multiple machines and collecting them on a central server. This collection is done at regular intervals. A second process then searches across all the collected logs, looking for trends or patterns in the enterprise. When the audit and security groups talk about “logfile monitoring,” this is usually what they mean.

As you may have already guessed, the SolarWinds tool for this job is Log & Event Manager. I should point out that LEM will ALSO receive syslog and traps, so you kind of get a twofer if you have this tool. Although, I personally STILL think you should send all of your syslog and trap to a filtration layer, and then send the non-garbage messages to the next step in the chain (NPM or LEM).

Monitoring individual text files on specific servers

This activity focuses on watching a specific (usually plain text) file in a specific directory on a specific machine, looking for a string or pattern to appear. When that pattern is found, an alert is triggered. Now it can get more involved than that—maybe not a specific file, but a file matching a specific pattern (like a date); maybe not a specific directory, but the newest sub-directory in a directory; maybe not a specific string, but a string pattern; maybe not ONE string, but 3 occurrences of the string within a 5 minute period; and so on. But the goal is the same—to find a string or pattern within a file.

Within the context of SolarWinds, SAM has been the go-to solution for this type of thing. But, at this moment it’s only through a series of Perl, Powershell, and VBScript templates.

We know that’s not the best way to get the job done, but that’s a subject for another post.

The More You Know…

For now, it’s important that you are able to clearly define—for both you and your colleagues, customers, and consumers—the difference between “logfile monitoring” and which tool or technique you need to employ to get the job done.