
The Four Questions – Introduction

This article originally appeared in a scaled-down version on PacketPushers.net. I’m posting it here in its full form as an introduction to the series.


For people who are interested in monitoring, there is a leap that you make when you go from watching systems that YOU care about, to monitoring systems that other people care about.

When you are doing it for yourself, it’s all about ease of maintenance, getting good (meaning useful, interesting) data, and having the information at your fingertips to deflect accusations that YOUR system is down/slow/ugly/whatever.

But if you do that job well, and show up at enough meetings showing off your shiny happy data, inevitably you will get nominated/conscripted into the monitoring group, where you’re expected to take as much interest in other people’s sh…tuff as you did in the beloved systems from your former job.

And this is where things get especially tricky.

Assuming you LIKE monitoring as a discipline, and find it exciting to learn about different types of systems (and ways they can fail), you are going to want to provide the same levels of insight for your coworkers as you had for yourself.

Inevitably, you will find yourself answering The Four Questions. These are questions which—for reasons that will become apparent—you never really had to ask yourself when you were doing it on your own. The four questions—with brief explanations—are:

  1. Why did I get an alert?
    The person is not asking, “Why did this alert trigger at this time?” They are asking why they got the alert at all.
  2. Why didn’t I get an alert?
    Something happened that the owner of the system felt should have triggered an alert, but they didn’t receive one.
  3. What is being monitored on my system?
    What reports and data can be pulled for their system (and in what form) so they can look at trending, performance, and forensic information after a failure.
  4. What will alert on my system?
    I’d like to be able to predict under which conditions I will get an alert for this system.

…and the Fifth Beatle… I mean question.

  5. What do you monitor “standard”?
    What metrics and data are typically collected for systems like this? This is the inevitable (and logical) response when you say, “We put standard monitoring in place.”

In the coming (weeks/days/months/series) I’m going to explore each of these questions in depth and offer techniques you can use to respond to each one.

Using NetFlow monitoring tactics to solve bandwidth mysteries

NetFlow took the mystery out of network troubleshooting after a school complained that its Internet access kept going down.

Life in IT can be pretty entertaining. And apparently we admins aren’t the only ones who think so — Hollywood’s taken notice, too. The problem is, the television shows and feature films about IT are all comedies. I’m not saying we don’t experience some pretty humorous stuff, but I want a real show; you know, something with substance. I can see it now — The Xmit (and RCV) Files: The Truth is Out There.

In fact, I’ve already got the pilot episode all planned. It’s based on an experience I had not long ago with the NetFlow monitoring protocol.

The company I was with at the time offered monitoring as a service to a variety of clients. One day, I was holding the receiver away from my head as a school principal shouted, “The Internet keeps going down! You’ve got to do something.”

Now, there are few phrases that get my goat like “the Internet is down,” or its more common cousin, “the network is down.” So, my first thought was, “Buddy, we have redundant circuits, switches configured for failover and comprehensive monitoring. The network is not down, so please shut up.”

Of course, that’s not what I said. Instead, I asked a few questions that would help me narrow down the root cause of the problem.

First up: “How often are you experiencing this issue?”

“A bunch,” I was told.

“Ooookay … at any particular time?” I asked.

He replied, “Well, it seems kind of random.”

Gee, thanks. I’m sure I can figure it out with such insightful detail.

It was obvious I was going to have to do some real investigation. My first check was the primary circuit to our provider. Nothing there. So, I’m sorry, Virginia, but “the Internet” is not down, as if I had any doubt.

Next, I looked at the school’s WAN interface, which revealed that yes indeed, the WAN link to the school was becoming saturated at various intervals during the day. Usage would spike for 20 to 30 minutes, then drop down until the next incident. I checked the firewall logs (not my favorite job), which showed a high volume of HTTP connections at the same times.

Now, for many years, checking was the pinnacle of network troubleshooting — check the devices, check the logs, wait for the event to happen again, dig a little further. And in my case, that might have been all I could do. Our contract had us monitoring the entire core data center for the school system, but that only extended to the edge router for the school. We had exactly zero visibility beyond each individual school building’s WAN connection.

But as chance would have it, I had one more trick up my sleeve: NetFlow.

NetFlow has been around a while, but it’s only in the last few years that it has entered the common network admin lexicon, largely due to the maturation of tools that can reliably and meaningfully collect and display NetFlow data. NetFlow collects and correlates “conversations” between two devices as the traffic passes through a router. You don’t have to monitor the specific endpoints in the conversation; you just have to get data from one router that sits between the two points.
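To make the idea concrete, here is a minimal sketch, in Python, of the kind of aggregation a NetFlow collector performs: take the flow records a router exports, group them into conversations, and total the bytes per endpoint pair. The sample records and addresses are entirely hypothetical, and a real collector decodes the NetFlow v5/v9 or IPFIX packet formats for you.

    # A minimal sketch of what a NetFlow collector does with exported flow
    # records: group them into "conversations" and total the traffic.
    # The records below are hypothetical; a real collector decodes NetFlow
    # v5/v9 or IPFIX packets from the router for you.
    from collections import defaultdict

    flow_records = [
        {"src": "10.1.20.31", "dst": "203.0.113.10", "bytes": 48_000_000},
        {"src": "10.1.20.32", "dst": "203.0.113.10", "bytes": 51_500_000},
        {"src": "10.1.20.31", "dst": "198.51.100.7", "bytes": 1_200_000},
    ]

    conversations = defaultdict(int)
    for rec in flow_records:
        # A conversation is simply a source/destination pair observed at the router.
        conversations[(rec["src"], rec["dst"])] += rec["bytes"]

    # Print the top talkers, largest conversation first.
    for (src, dst), total in sorted(conversations.items(), key=lambda kv: -kv[1]):
        print(f"{src} -> {dst}: {total / 1_000_000:.1f} MB")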

 

Hmm, that sounds a lot like a WAN router connected to the Internet provider, which is exactly what I had. Correlating the spike times from our bandwidth stats, we saw that during the same period, 10 MAC addresses were starting conversations with YouTube. Every time there was a spike, it was the same set of MAC addresses.

Now, if we had been monitoring inside the school, we could have gleaned much more information — IP address, location, maybe even username if we had some of the more sophisticated user device tracking tools in place — but we weren’t. However, a visit to Wireshark’s OUI Lookup Tool revealed that all 10 of those MAC addresses were from — and please forgive the irony — Apple Inc.
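In case you’re wondering how that lookup works: the first three bytes of a MAC address are the manufacturer’s OUI, so even with no visibility inside the school you can map an address to a vendor. Here is a rough sketch in Python; the prefix table is a tiny hypothetical sample, and the authoritative list is the IEEE OUI registry that Wireshark’s tool queries.

    # Rough sketch: map a MAC address to its vendor via the OUI prefix
    # (the first three octets). The table is a tiny hypothetical sample;
    # consult the IEEE OUI registry for real assignments.
    OUI_VENDORS = {
        "28:CF:E9": "Apple, Inc.",        # illustrative entries only
        "00:50:56": "VMware, Inc.",
        "00:1B:21": "Intel Corporation",
    }

    def vendor_for_mac(mac: str) -> str:
        prefix = mac.upper().replace("-", ":")[:8]  # first three octets
        return OUI_VENDORS.get(prefix, "unknown vendor")

    print(vendor_for_mac("28:cf:e9:12:34:56"))  # -> Apple, Inc.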

At that point, I had exhausted all of the tools at my disposal. So, I called the principal back and gave him the start and stop times of the spikes, along with the fact that 10 Apple devices appeared to be to blame.

“Wait, what time was that?” he asked.

I repeated the times.

“Oh, for the love of … I know what the problem is.” Click.

It turns out the art teacher had been awarded a grant for 10 shiny new iPads. He would go from room to room during art period handing them out and teaching kids how to do video mashups.

This was one of those rare times when a bandwidth increase really was warranted, and after the school’s WAN circuit was reprovisioned, the Internet stopped mysteriously “going down.”

The episode would close with the handsome and sophisticated admin — played by yours truly, of course — looking into the camera and, channeling the great Fox Mulder, saying, “Remember, my fellow admins, the truth is out there.” (And, I would add, for those of you reading this blog post, don’t forget how valuable NetFlow can be in finding network truth.)

Now, if that’s not compelling TV, I don’t know what is.

(This article originally appeared on SearchNetworking)

IT Monitoring Scalability Planning: 3 Roadblocks

Planning for growth is key to effective IT monitoring, but it can be stymied by certain mindsets. Here’s how to overcome them.

As IT professionals, planning for growth is something we do all day almost unconsciously. Whether it’s a snippet of code, provisioning the next server, or building out a network design, we’re usually thinking: Will it handle the load? How long until I’ll need a newer, faster, or bigger one? How far will this scale?

Despite this almost compulsive concern with scalability, there are still areas of IT where growth tends to be an afterthought. One of these happens to be my area of specialization: IT monitoring. So, I’d like to address growth planning (or non-planning) as it pertains to monitoring by highlighting several mindsets that typically hinder this important, but often surprisingly overlooked element, and showing how to deal with each.

The fire drill mindset
This occurs when something bad has already happened, either because there was no monitoring solution in place or because the existing toolset didn’t scale far enough to detect a critical failure, and so it was missed. Regardless, the result is usually a focus on finding a tool that would have caught the problem that already occurred, and finding it fast.

However, short of a TARDIS, there’s no way to implement an IT monitoring tool that will help avoid a problem after it occurs. Furthermore, moving too quickly as a result of a crisis can mean you don’t take the time to plan for future growth, focusing instead solely on solving the current problem.

My advice is to stop, take a deep breath, and collect yourself. Start by quickly but intelligently developing a short list of possible tools that will both solve the current problem and scale with your environment as it grows. Next, ask the vendors if they have free (or cheap) licenses for in-house demoing and proofs of concept.

Then, and this is where you should let the emotion surrounding the failure creep back in, get a proof-of-concept environment set up quickly and start testing. Finally, make a smart decision based on all the factors important to you and your environment. (Hint: one of those factors should always be scalability.) Then implement the tool right away.

The bargain hunter
The next common pitfall that often prevents better growth planning when implementing a monitoring tool is the bargain-hunter mindset. This usually occurs not because of a crisis, but when there is pressure to find the cheapest solution for the current environment.

How do you overcome this mindset? Consider the following scenario: If your child currently wears a size 3 shoe, you absolutely don’t want to buy a size 5 today, right? But you should also recognize that your child is going to grow. So, buying enough size 3 shoes for the next five years is not a good strategy, either.

Also, if financials really are one of the top priorities preventing you from better preparing for future growth, remember that the cheapest time to buy the right-sized solution for your current and future environment is now. Buying a solution for your current environment alone because “that’s all we need” is going to result in your spending more money later for the right-sized solution you will need in the future. (I’m not talking about incrementally more, but start-all-over-again more.)

My suggestion is to use your company’s existing business growth projections to calculate how big of a monitoring solution you need. If your company foresees 10% revenue growth each year over the next three years and then 5% each year after that, and you are willing to consider completely replacing your monitoring solution after five years, then buy a product that can scale to roughly 40 to 50 percent beyond what you currently need (compounding those rates works out to about 1.47 times today’s size after five years).
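As a back-of-the-envelope sketch (in Python, with hypothetical numbers, and assuming the count of monitored elements grows roughly in line with revenue), the sizing math looks like this:

    # Compound the projected yearly growth rates onto today's monitored
    # element count. All numbers are hypothetical placeholders.
    def required_capacity(current_elements, yearly_growth_rates):
        size = current_elements
        for rate in yearly_growth_rates:
            size *= 1 + rate
        return size

    # 10% growth for three years, then 5% for two more years.
    print(required_capacity(1_000, [0.10, 0.10, 0.10, 0.05, 0.05]))
    # -> roughly 1467 elements, i.e. plan for about 45-50% more than today.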

The dollar auction
The dollar auction mindset happens when there is already a tool in place — a tool that wasn’t cheap and that a lot of time was spent perfecting. The problem is, it’s no longer perfect. It needs to be replaced because company growth has expanded beyond its scalability, but the idea of walking away from everything invested in it is a hard pill to swallow.

Really, this isn’t so much a mindset that prevents preparing for future growth as it is a lesson that’s all too often overlooked: if only you had planned better for future growth the first time around. The reality is that if you’re experiencing this mindset, you need a new solution. However, don’t make the same mistake. This time, take scalability into account.

Whether you’re suffering from one of these mindsets or another that is preventing you from better preparing your IT monitoring for future growth, remember: scalability is key to long-term success.

(This article originally appeared on NetworkComputing)

Time for a network monitoring application? What to look for

You might think that implementing a network monitoring tool is like every other rollout. You would be wrong.

Oh, so you’re installing a new network monitoring tool, huh? No surprise there, right? What, was it time for a rip-and-replace? Is your team finally moving away from monitoring in silos? Perhaps there were a few too many ‘Let me Google that for you’ moments with the old vendor’s support line?

Let’s face it. There are any number of reasons that could have led you to this point. What’s important is that you’re here. Now, you may think a new monitoring implementation is no different than any other rollout. There are some similarities, but there are also some critical elements that are very different. How you handle these can mean the difference between success and failure.

I’ve found there are three primary areas that are often overlooked when it comes to deploying a network monitoring application. This isn’t an exhaustive list, but taking your time with these three things will pay off in the end.

Scope–First, consider how far and how deep you need the monitoring to go. This will affect every other aspect of your rollout, so take your time thinking this through. When deciding how far, ask yourself the following questions:

  • Do I need to monitor all sites, or just the primary data center?
  • How about the development, test or quality assurance systems?
  • Do I need to monitor servers or just network devices?
  • If I do need to include servers, should I cover every OS or just the main one(s)?
  • What about devices in DMZs?
  • What about small remote sites across low-speed connections?

And when considering how deep to go, ask these questions:

  • Do I need to also monitor up/down for non-routable interfaces (e.g., EtherChannel connections, multiprotocol label switching links, etc.)?
  • Do I need to monitor items that are normally down and alert when they’re up (e.g., cold standby servers, cellular wide area network links, etc.)?
  • Do I need to be concerned about virtual elements like host resource consumption by virtual machine, storage, security, log file aggregation and custom, home-grown applications?

Protocols and permissions–After you’ve decided which systems to monitor and what data to collect, you need to consider the methods to use. Protocols such as Simple Network Management Protocol (SNMP), Windows Management Instrumentation (WMI), syslog and NetFlow each have their own permissions and connection points in the environment.

For example, many organizations plan to use SNMP for hardware monitoring, only to discover it’s not enabled on dozens (or hundreds) of systems. Alternatively, they find out it is enabled, but the community strings are inconsistent, undocumented or unset. Then they go to monitor in the DMZ and realize that the security policy won’t allow SNMP across the firewall.

Additionally, remember that different collection methods have different access schemes. For example, WMI uses a Windows account on the target machine. If that account isn’t there, has the wrong permissions or is locked, monitoring won’t work. Meanwhile, SNMP uses a simple community string that can be different on each machine.
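One cheap way to find those gaps before rollout day is to sweep your device list and see which boxes actually answer SNMP with the community string you think they have. Here is a minimal sketch using pysnmp’s synchronous high-level API; the addresses and community strings are hypothetical.

    # Sweep a list of devices and report which ones answer an SNMPv2c GET
    # for sysDescr with the expected community string. Addresses and
    # strings are hypothetical; requires the pysnmp library.
    from pysnmp.hlapi import (
        getCmd, SnmpEngine, CommunityData, UdpTransportTarget,
        ContextData, ObjectType, ObjectIdentity,
    )

    devices = {
        "192.0.2.10": "public",
        "192.0.2.11": "monitoring-ro",
    }

    for host, community in devices.items():
        error_indication, error_status, _, var_binds = next(
            getCmd(SnmpEngine(),
                   CommunityData(community),                 # v2c community string
                   UdpTransportTarget((host, 161), timeout=2, retries=0),
                   ContextData(),
                   ObjectType(ObjectIdentity("SNMPv2-MIB", "sysDescr", 0)))
        )
        if error_indication or error_status:
            print(f"{host}: no usable SNMP answer ({error_indication or error_status})")
        else:
            print(f"{host}: {var_binds[0]}")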

Architecture–Finally, consider the architecture of the tools you’re considering. This breaks down to connectivity and scalability.

First, let’s consider connectivity. Agent-based platforms have on-device agents that collect and store data locally, then forward large data sets to a collector at regular intervals. Each collector bundles and sends this data to a manager-of-managers, which passes it to the repository. Meanwhile, agentless solutions use a collector that directly polls source devices and forwards the information to the data store.

You need to understand the connectivity architecture of these various tools so you can effectively handle DMZs, remote sites, secondary data centers and the like. You also need to look at the connectivity limitations of various tools, such as how many devices each collector can support and how much data will be traversing the wire, so you can design a monitoring implementation that doesn’t cripple your network or collapse under its own weight.
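A rough way to sanity-check that design before you commit to it is to run the arithmetic on per-collector polling load: devices, times monitored elements per device, times bytes per poll, divided by the polling interval. Every number in this sketch is a hypothetical placeholder for figures from your own environment and the vendor’s stated limits.

    # Back-of-the-envelope polling load for a single collector.
    # All numbers are hypothetical placeholders.
    devices_per_collector = 500
    elements_per_device = 40         # interfaces, CPUs, disks, etc.
    bytes_per_element_poll = 300     # request + response, rough average
    poll_interval_seconds = 300      # five-minute polling cycle

    polls_per_second = devices_per_collector * elements_per_device / poll_interval_seconds
    bits_per_second = polls_per_second * bytes_per_element_poll * 8

    print(f"{polls_per_second:.0f} element polls/sec, "
          f"~{bits_per_second / 1_000:.0f} kbps of monitoring traffic")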

Next comes scalability. Understand what kind of load the monitoring application will tolerate, and what your choices are to expand when — yes, when, not if — you hit that limit. To be honest, this is a tough one, and many vendors hope you’ll accept some form of an “it really depends” response.

In all fairness, it really does depend, and some things are simply impossible to predict. For example, I once had a client who wanted to implement syslog monitoring on 4,000 devices. It ended up generating upwards of 20 million messages per hour (more than 5,500 messages per second). That was not a foreseeable outcome.

By taking these key elements of a monitoring tool implementation into consideration, you should be able to avoid most of the major missteps many monitoring rollouts suffer from. And the good news is that from there, the same techniques that serve you well during other implementations will help here. You want to ask lots of questions; meet with customers in similar situations (similar environment size, business sector, and so on); set up a proof of concept first; engage experienced professionals to assist as necessary; and be prepared — both financially and psychologically — to adapt as wrinkles crop up. Because they will.

(This article originally appeared on SearchNetworking)