Category Archives: ICYMI

“Logfile Monitoring” – I Do Not Think It Means What You Think It Means

This is a conversation I have A LOT with clients. They say they want “logfile monitoring,” and I am never sure what they mean. So I end up having to unwind all the different things it COULD be, so we can get to what they actually need.

It’s also an important clarification for me to make as SolarWinds Head Geek because, depending on what the requester means, I might need to point them toward Kiwi Syslog Server, Server & Application Monitor, or Log & Event Manager (LEM).

Here’s a handy guide to identify what people are talking about. “Logfile monitoring” is usually applied to 4 different and mutually exclusive areas. Before you allow the speaker to continue, please ask them to clarify which one they are talking about:

  1. Windows Logfile
  2. Syslog
  3. Logfile aggregation
  4. Monitoring individual text files on specific servers

More clarification on each of these areas below:

Windows Logfile

Monitoring in this area refers specifically to the Windows event log, which isn’t actually a log “file” at all, but a database unique to Windows machines.

In the SolarWinds world, the tool that does this is Server & Application Monitor. Or if you are looking for a small, quick, and dirty utility, the Eventlog Forwarder for Windows will take Eventlog messages that match a search pattern and pass them via Syslog to another machine.


Syslog

Syslog is a protocol that describes how to send a message from one machine to another, typically on UDP port 514. The messages must fit a pre-defined structure. Syslog is different from SNMP traps. The protocol is most often found when monitoring *nix (Unix, Linux) systems, although network and security devices send out their fair share as well.
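
To make that structure concrete, here is a minimal Python sketch of a syslog sender. It follows the RFC 3164 convention where the leading <PRI> value packs facility and severity together (PRI = facility × 8 + severity). This is a simplified sketch: a real message would also carry a timestamp, and the collector address below is a placeholder.

```python
import socket

def build_syslog_message(facility, severity, hostname, tag, text):
    """Build a minimal RFC 3164-style syslog payload.

    The <PRI> value encodes facility and severity together:
    PRI = facility * 8 + severity.
    """
    pri = facility * 8 + severity
    return f"<{pri}>{hostname} {tag}: {text}"

def send_syslog(message, server, port=514):
    """Fire-and-forget the message over UDP, syslog's default transport."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode("utf-8"), (server, port))

# Facility 1 (user-level), severity 4 (warning) -> PRI 12
msg = build_syslog_message(1, 4, "web01", "myapp", "disk nearly full")
# send_syslog(msg, "192.0.2.10")  # hypothetical collector address
```

Point it at a Kiwi Syslog Server (or anything else listening on UDP 514) and the message shows up like any other syslog event.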

In terms of products, this is covered natively by Network Performance Monitor (NPM), but as I’ve said often, you shouldn’t send syslog or traps directly to your NPM primary poller. You should send them into a syslog/trap “filtration” layer first. And that would be Kiwi Syslog Server (or its freeware cousin).

Logfile aggregation

This technique involves sending (or pulling) log files from multiple machines and collecting them on a central server. This collection is done at regular intervals. A second process then searches across all the collected logs, looking for trends or patterns in the enterprise. When the audit and security groups talk about “logfile monitoring,” this is usually what they mean.
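
The two-step pattern described above (collect centrally, then search across everything) can be sketched in a few lines of Python. The file layout and naming scheme here are illustrative, not how LEM or any other aggregation product actually stores things:

```python
import re
from pathlib import Path

def aggregate_logs(source_dirs, central_dir):
    """Pull log files from several locations into one central directory,
    prefixing each with its origin so entries stay attributable."""
    central = Path(central_dir)
    central.mkdir(parents=True, exist_ok=True)
    for src in map(Path, source_dirs):
        for logfile in src.glob("*.log"):
            # Hypothetical naming scheme: "<source-dir>__<filename>"
            (central / f"{src.name}__{logfile.name}").write_text(
                logfile.read_text())

def search_aggregated(central_dir, pattern):
    """Second pass: scan every collected log for a pattern,
    looking for trends across the whole estate."""
    regex = re.compile(pattern)
    hits = []
    for logfile in Path(central_dir).glob("*.log"):
        for lineno, line in enumerate(logfile.read_text().splitlines(), 1):
            if regex.search(line):
                hits.append((logfile.name, lineno, line))
    return hits
```

Run `aggregate_logs` on a schedule, then `search_aggregated` for patterns like failed logins, and you have the skeleton of what the audit and security folks are asking for.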

As you may have already guessed, the SolarWinds tool for this job is Log & Event Manager. I should point out that LEM will ALSO receive syslog and traps, so you kind of get a twofer if you have this tool. Although, I personally STILL think you should send all of your syslog and trap to a filtration layer, and then send the non-garbage messages to the next step in the chain (NPM or LEM).

Monitoring individual text files on specific servers

This activity focuses on watching a specific (usually plain text) file in a specific directory on a specific machine, looking for a string or pattern to appear. When that pattern is found, an alert is triggered. Now it can get more involved than that—maybe not a specific file, but a file matching a specific pattern (like a date); maybe not a specific directory, but the newest sub-directory in a directory; maybe not a specific string, but a string pattern; maybe not ONE string, but 3 occurrences of the string within a 5 minute period; and so on. But the goal is the same—to find a string or pattern within a file.
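
As a rough illustration of that last variation, here is a Python sketch that flags three occurrences of a pattern within a five-minute window. It assumes a hypothetical log format where each line begins with an ISO timestamp; real log layouts vary widely:

```python
import re
from datetime import datetime, timedelta

def check_threshold(lines, pattern, count=3, window_minutes=5):
    """Return True if `pattern` appears at least `count` times within
    any `window_minutes` span. Assumes each line starts with an ISO
    timestamp like '2024-01-15T10:03:22' (a hypothetical layout)."""
    regex = re.compile(pattern)
    times = sorted(
        datetime.fromisoformat(line.split()[0])
        for line in lines if regex.search(line)
    )
    window = timedelta(minutes=window_minutes)
    # Slide over the sorted match times looking for a dense cluster.
    for i in range(len(times) - count + 1):
        if times[i + count - 1] - times[i] <= window:
            return True
    return False
```

Swap in a compiled filename pattern, a directory walk, or a tail-style follow loop and you have the other variations described above.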

Within the context of SolarWinds, SAM has been the go-to solution for this type of thing. But at this moment, it’s only through a series of Perl, PowerShell, and VBScript templates.

We know that’s not the best way to get the job done, but that’s a subject for another post.

The More You Know…

For now, it’s important that you are able to clearly define—for yourself as well as your colleagues, customers, and consumers—what is meant by “logfile monitoring,” and which tool or technique you need to employ to get the job done.

The Three Black Boxes (hint: the network was only the beginning)

NOTE: This post originally appeared here

Once upon a time, back in the dark ages of IT, people would sit around the table and talk about “the black box” and everyone knew what was meant—the network. The network, whether it was ARCNET, DECnet, Ethernet or even LANTastic, seemed inscrutable to them. They just connected their stuff to the wires and hoped for the best.

Back in those early days, conversations with us early network engineers often went something like this:

Them: “I think the network is slow.”
Us (who they considered pointy-hatted-wizards): “No, it’s not.”
Them: “Look, I checked the systems and they’re fine. I think it’s the network.”
Us: “Come on, it’s rarely ever the network.”
Them: “Well, I still think…”
Us: “Then you’ll need to prove it.”

It was so difficult for them to pierce this veil that—if urban legends on the subject are to be believed—the reason the image of a cloud is used to signify a network is because it was originally labeled by those outside the network with the acronym TAMO. This stood for, “then a miracle occurs,” and the cloud graphic reinforced the divine and un-knowable nature of bits flowing through the wire.

But we in the network knew it wasn’t a miracle, though it was still somewhat of a black box even to us—a closed system that took a certain type of input, implemented only somewhat monitorable processes inside and then produced a certain type of output.

With time, though, the network became much less of a black box to everyone. Devices, software and our general knowledge grew in sophistication so that we now have come to expect bandwidth metrics, packet error data, NetFlow conversations, deep packet inspection results, IPSLA and more to be available on demand and in near real-time.

But recently, two new black boxes have arrived on the scene. And this time, we net admins are on the outside with almost everyone else.

The first of these, virtualization—as well as its commoditized cousin, cloud computing—has grown to the point where the count of physical servers in medium-to-large companies is sometimes just a tenth of the overall server count.

Ask an application owner if he knows how many other VMs are running on the same host and you’ll be met with a blank stare. Probe further by asking if he thinks a “noisy neighbor”—a VM on the same host that is consuming more resources than it should—is impacting his system and he’ll look at you conspiratorially and say, “Well, I sure think there’s one of those, but heck if I could prove it.”

Still, we love virtual environments. We love the cost savings, the convenience and the flexibility they afford our companies. But don’t fool yourself—unless we’re actually on the virtualization team, we don’t really understand them one bit.

Storage is the other “new” black box. It presents the same challenge as virtualization, but only worse. Disks are the building blocks of arrays, which are collected via a software layer into LUNs, which connect through a separate network “fabric” to be presented as data stores to the virtual layer or as contiguous disk resources to physical servers.

Ask that already paranoid application owner which actual physical disks his application is installed on and he’ll say you may as well ask him to point out a specific grain of sand on a beach.

Making the storage environment even more challenging is its blended nature. Virtualization, for all its complexity, is a binary choice. Your server is either virtualized or it’s not. Storage isn’t that clear cut—a physical server may have a single traditional platter-based disk for its system drive, connect to a SAN for a separate drive where software is installed and then use a local array of SSD drives to support high-performance database I/O.

OK, so what does all this have to do with the network? Well, what’s most interesting about these new black boxes—especially to us network folk—is how they are turning networking back into a black box as well.

Think about it—software-based “virtual” switches distribute bandwidth from VMs to multiple network connections.

Also, consider that SAN “fabric” is often more software than hardware.

And then there is the rise of SDN, a promising new technology to be sure, but one that still needs to have some of the rough edges smoothed away.

The good news is that, like our original, inscrutable networking from the good old days, the ongoing drive towards maturity and sophistication will crack the lid on these two new black boxes and reverse the slide of the network back into one as well.

Even now it’s possible to use the convergence of networking, virtualization and storage to connect more dots than ever before. Because of the seamless flow from the disk through the array, LUN, datastore, hypervisor and on up to the actual application, we’re able to show—with a tip of the old fedora to detective Dirk Gently—the interconnectedness of all things. With the right tools in hand, we can now show how an array that is experiencing latency affects just about anything.

That paranoid application owner might even stop using his “they’re out to get me” coffee mug.

The Four Questions – Introduction

This article originally appeared in a scaled-down version here. I’m posting it in its full form as an introduction to the full series.

For people who are interested in monitoring, there is a leap that you make when you go from watching systems that YOU care about, to monitoring systems that other people care about.

When you are doing it for yourself, it’s all about ease of maintenance, getting good (meaning useful, interesting) data, and having the information at your fingertips to deflect accusations that YOUR system is down/slow/ugly/whatever.

But if you do that job well, and show up at enough meetings showing off your shiny happy data, inevitably you will get nominated/conscripted into the monitoring group where it is expected you will take as much interest in other people’s sh…tuff as your beloved systems from your former job.

And this is where things get especially tricky.

Assuming you LIKE monitoring as a discipline, and find it exciting to learn about different types of systems (and ways they can fail), you are going to want to provide the same levels of insight for your coworkers as you had for yourself.

Inevitably, you will find yourself answering The Four Questions. These are questions which—for reasons that will become apparent—you never really had to ask yourself when you were doing it on your own. The four questions—with brief explanations—are:

  1. Why did I get an alert?
    The person is not asking, “Why did this alert trigger at this time?” They are asking why they got the alert at all.
  2. Why didn’t I get an alert?
    Something happened that the owner of the system felt should have triggered an alert, but they didn’t receive one.
  3. What is being monitored on my system?
    What reports and data can be pulled for their system (and in what form) so they can look at trending, performance, and forensic information after a failure.
  4. What will alert on my system?
    I’d like to be able to predict under which conditions I will get an alert for this system.

…and the Fifth Beatle… I mean question.

5. What do you monitor “standard”?
What metrics and data are typically collected for systems like this? This is the inevitable (and logical) response when you say, “We put standard monitoring in place.”

In the coming (weeks/days/months/series) I’m going to explore each of these questions in-depth, and offer techniques you can use to respond to each one.

Using NetFlow monitoring tactics to solve bandwidth mysteries

NetFlow eliminated the hassle of network troubleshooting after a school complained about its Internet access.

Life in IT can be pretty entertaining. And apparently we admins aren’t the only ones who think so — Hollywood’s taken notice, too. The problem is, the television shows and feature films about IT are all comedies. I’m not saying we don’t experience some pretty humorous stuff, but I want a real show; you know, something with substance. I can see it now — The Xmit (and RCV) Files: The Truth is Out There.

In fact, I’ve already got the pilot episode all planned. It’s based on an experience I had not long ago with the NetFlow protocol.

The company I was with at the time offered monitoring as a service to a variety of clients. One day, I was holding the receiver away from my head as a school principal shouted, “The Internet keeps going down! You’ve got to do something.”

Now, there are few phrases that get my goat like “the Internet is down,” or its more common cousin, “the network is down.” So, my first thought was, “Buddy, we have redundant circuits, switches configured for failover and comprehensive monitoring. The network is not down, so please shut up.”

Of course, that’s not what I said. Instead, I asked a few questions that would help me narrow down the root cause of the problem.

First up: “How often are you experiencing this issue?”

“A bunch,” I was told.

“Ooookay … at any particular time?” I asked.

He replied, “Well, it seems kind of random.”

Gee, thanks. I’m sure I can figure it out with such insightful detail.

It was obvious I was going to have to do some real investigation. My first check was the primary circuit to our provider. Nothing there. So, I’m sorry, Virginia, but “the Internet” is not down, as if I had any doubt.

Next, I looked at the school’s WAN interface, which revealed that yes indeed, the WAN link to the school was becoming saturated at various intervals during the day. Usage would spike for 20 to 30 minutes, then drop down until the next incident. I checked the firewall logs — not my favorite job — which showed a high volume of HTTP connections at the same times.

Now, for many years, checking was the pinnacle of network troubleshooting — check the devices, check the logs, wait for the event to happen again, dig a little further. And in my case, that might have been all I could do. Our contract had us monitoring the entire core data center for the school system, but that only extended to the edge router for the school. We had exactly zero visibility beyond each individual school building’s WAN connection.

But as chance would have it, I had one more trick up my sleeve: NetFlow.

NetFlow has been around a while, but it’s only in the last few years that it’s entered the common network admin lexicon, largely due to the maturation of tools that can reliably and meaningfully collect and display NetFlow data. NetFlow collects and correlates “conversations” between two devices as the data passes through a router. You don’t have to monitor the specific endpoints in the conversation; you just have to get data from one router that sits between the two points.
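
Conceptually, what a flow collector does with those conversations can be sketched like this. The three-field record is a drastic simplification of a real NetFlow v5 export (which also carries ports, protocol, packet counts, and more), and the addresses are made up:

```python
from collections import defaultdict

def summarize_conversations(flow_records):
    """Roll individual flow records (src, dst, bytes) up into
    per-conversation byte totals, the way a collector correlates
    traffic seen at a single router."""
    totals = defaultdict(int)
    for src, dst, nbytes in flow_records:
        # Treat a conversation as direction-agnostic: A<->B and B<->A match.
        key = tuple(sorted((src, dst)))
        totals[key] += nbytes
    return dict(totals)

# Hypothetical flows: student devices talking to a video site.
flows = [
    ("10.1.1.20", "203.0.113.5", 50_000),
    ("203.0.113.5", "10.1.1.20", 4_200_000),
    ("10.1.1.21", "203.0.113.5", 48_000),
]
# The heaviest conversation jumps out immediately.
top = max(summarize_conversations(flows).items(), key=lambda kv: kv[1])
```

Sort those totals during a spike window and the bandwidth hog identifies itself, which is essentially what the NetFlow tool did for me here.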


Hmm, that sounds a lot like a WAN router connected to the Internet provider, which is exactly what I had. Correlating the spike times from our bandwidth stats, we saw that during the same period, 10 MAC addresses were starting conversations with YouTube. Every time there was a spike, it was the same set of MAC addresses.

Now, if we had been monitoring inside the school, we could have gleaned much more information — IP address, location, maybe even username if we had some of the more sophisticated user device tracking tools in place — but we weren’t. However, a visit to Wireshark’s OUI Lookup Tool revealed that all 10 of those MAC addresses were from — and please forgive the irony — Apple Inc.
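
An OUI lookup itself is nothing magic: the first three octets of a MAC address identify the manufacturer. Here is a toy version, with a couple of illustrative entries standing in for the full IEEE registry that tools like Wireshark’s lookup query:

```python
# A tiny, hypothetical slice of the IEEE OUI registry; the real list
# has tens of thousands of entries.
OUI_VENDORS = {
    "F0:18:98": "Apple, Inc.",   # example prefix, for illustration only
    "00:50:56": "VMware, Inc.",  # example prefix, for illustration only
}

def vendor_for_mac(mac):
    """Map a MAC address to its manufacturer via the first three octets."""
    prefix = mac.upper().replace("-", ":")[:8]
    return OUI_VENDORS.get(prefix, "unknown")
```

Feed it the MAC addresses from the flow data and the vendor pattern (ten devices, one manufacturer) falls right out.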

At that point, I had exhausted all of the tools at my disposal. So, I called the principal back and gave him the start and stop times of the spikes, along with the information about 10 Apple products being to blame.

“Wait, what time was that?” he asked.

I repeated the times.

“Oh, for the love of … I know what the problem is.” Click.

It turns out the art teacher had been awarded a grant for 10 shiny new iPads. He would go from room to room during art period handing them out and teaching kids how to do video mashups.

This was one of those rare times when a bandwidth increase really was warranted, and after the school’s WAN circuit was reprovisioned, the Internet stopped mysteriously “going down.”

The episode would close with the handsome and sophisticated admin — played by yours truly, of course — looking into the camera and while channeling the great Fox Mulder saying, “Remember, my fellow admins, the truth is out there.” (And, I would add, for those of you reading this blog post, don’t forget how valuable NetFlow can be in finding network truth.)

Now, if that’s not compelling TV, I don’t know what is.

(This article originally appeared on SearchNetworking)

IT Monitoring Scalability Planning: 3 Roadblocks

Planning for growth is key to effective IT monitoring, but it can be stymied by certain mindsets. Here’s how to overcome them.

As IT professionals, planning for growth is something we do all day almost unconsciously. Whether it’s a snippet of code, provisioning the next server, or building out a network design, we’re usually thinking: Will it handle the load? How long until I’ll need a newer, faster, or bigger one? How far will this scale?

Despite this almost compulsive concern with scalability, there are still areas of IT where growth tends to be an afterthought. One of these happens to be my area of specialization: IT monitoring. So, I’d like to address growth planning (or non-planning) as it pertains to monitoring by highlighting several mindsets that typically hinder this important, but often surprisingly overlooked element, and showing how to deal with each.

The fire drill mindset
This mindset occurs when something bad has already happened, either because there was no monitoring solution in place or because the existing toolset didn’t scale far enough to detect a critical failure, and so it was missed. Regardless, the result is usually a focus on finding a tool that would have caught the problem that already occurred, and finding it fast.

However, short of a TARDIS, there’s no way to implement an IT monitoring tool that will help avoid a problem after it occurs. Furthermore, moving too quickly as a result of a crisis can mean you don’t take the time to plan for future growth, focusing instead solely on solving the current problem.

My advice is to stop, take a deep breath, and collect yourself. Start by quickly, but intelligently developing a short list of possible tools that will both solve the current problem and scale with your environment as it grows. Next, ask the vendors if they have free (or cheap) licenses for in-house demoing and proofs of concept.

Then, and this is where you should let the emotion surrounding the failure creep back in, get a proof-of-concept environment set up quickly and start testing. Finally, make a smart decision based on all the factors important to you and your environment. (Hint: one of which should always be scalability.) Then implement the tool right away.

The bargain hunter
The next common pitfall that often prevents better growth planning when implementing a monitoring tool is the bargain-hunter mindset. This usually occurs not because of a crisis, but when there is pressure to find the cheapest solution for the current environment.

How do you overcome this mindset? Consider the following scenario: If your child currently wears a size 3 shoe, you absolutely don’t want to buy a size 5 today, right? But you should also recognize that your child is going to grow. So, buying enough size 3 shoes for the next five years is not a good strategy, either.

Also, if financials really are one of the top priorities preventing you from better preparing for future growth, remember that the cheapest time to buy the right-sized solution for your current and future environment is now. Buying a solution for your current environment alone because “that’s all we need” is going to result in your spending more money later for the right-sized solution you will need in the future. (I’m not talking about incrementally more, but start-all-over-again more.)

My suggestion is to use your company’s existing business growth projections to calculate how big of a monitoring solution you need. If your company foresees 10% revenue growth each year over the next three years and then 5% each year after that, and you are willing to consider completely replacing your monitoring solution after five years, then buy a product that can scale to at least 40% more than you currently need.
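
That 40% comes from simply adding the yearly rates; if the growth compounds, the number is a bit higher (closer to 47% over five years). A quick sketch of the arithmetic, using the example’s rates as defaults; in practice you would plug in your own projections:

```python
def capacity_multiplier(years, early_rate=0.10, later_rate=0.05, early_years=3):
    """Compound projected growth rates to size a monitoring purchase.
    Defaults mirror the example above: 10%/yr for 3 years, then 5%/yr."""
    factor = 1.0
    for year in range(years):
        factor *= 1 + (early_rate if year < early_years else later_rate)
    return factor

# Five-year horizon: simple addition suggests ~40% growth;
# compounding gives roughly 47%.
need = capacity_multiplier(5)
```

Either way, the message is the same: size the purchase for the environment you will have at replacement time, not the one you have today.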

The dollar auction
The dollar auction mindset happens when there is already a tool in place — a tool that wasn’t cheap and that a lot of time was spent perfecting. The problem is, it’s no longer perfect. It needs to be replaced because company growth has expanded beyond its scalability, but the idea of walking away from everything invested in it is a hard pill to swallow.

Really, this isn’t so much of a mindset that prevents preparing for future growth as it is something that’s all too often overlooked as an important lesson: If only you had better planned for future growth the first time around. The reality is that if you’re experiencing this mindset, you need a new solution. However, don’t make the same mistake. This time, take scalability into account.

Whether you’re suffering from one of these mindsets or another that is preventing you from better preparing your IT monitoring for future growth, remember: scalability is key to long-term success.

(This article originally appeared on NetworkComputing)

Time for a network monitoring application? What to look for

You might think that implementing a network monitoring tool is like every other rollout. You would be wrong.

Oh, so you’re installing a new network monitoring tool, huh? No surprise there, right? What, was it time for a rip-and-replace? Is your team finally moving away from monitoring in silos? Perhaps there were a few too many ‘Let me Google that for you’ moments with the old vendor’s support line?

Let’s face it. There are any number of reasons that could have led you to this point. What’s important is that you’re here. Now, you may think a new monitoring implementation is no different than any other rollout. There are some similarities, but there are also some critical elements that are very different. How you handle these can mean the difference between success and failure.

I’ve found there are three primary areas that are often overlooked when it comes to deploying a network monitoring application. This isn’t an exhaustive list, but taking your time with these three things will pay off in the end.

Scope–First, consider how far and how deep you need the monitoring to go. This will affect every other aspect of your rollout, so take your time thinking this through. When deciding how far, ask yourself the following questions:

  • Do I need to monitor all sites, or just the primary data center?
  • How about the development, test or quality assurance systems?
  • Do I need to monitor servers or just network devices?
  • If I do need to include servers, should I cover every OS or just the main one(s)?
  • What about devices in DMZs?
  • What about small remote sites across low-speed connections?

And when considering how deep to go, ask these questions:

  • Do I need to also monitor up/down for non-routable interfaces (e.g., EtherChannel connections, multiprotocol label switching links, etc.)?
  • Do I need to monitor items that are normally down and alert when they’re up (e.g., cold standby servers, cellular wide area network links, etc.)?
  • Do I need to be concerned about virtual elements like host resource consumption by virtual machine, storage, security, log file aggregation and custom, home-grown applications?
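
That second question—alerting on “normally down” elements—is worth a moment, because most tools assume up is good. The decision logic is just a comparison against the expected state; the element records below are hypothetical:

```python
def should_alert(element, is_up):
    """Alert when an element's current state differs from its expected one.
    Most interfaces alert when they go down, but a cold-standby server or
    a cellular backup link should alert when it comes UP."""
    return is_up != element["normally_up"]

# Hypothetical watchlist entries.
watchlist = [
    {"name": "core-uplink Gi0/1", "normally_up": True},
    {"name": "cellular-backup WAN", "normally_up": False},
]
# If both elements currently report "up", only the backup link alerts.
alerts = [e["name"] for e in watchlist if should_alert(e, is_up=True)]
# -> ['cellular-backup WAN']
```

Whatever tool you choose needs a way to express this inversion per element, or those standby links go unwatched.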

Protocols and permissions–After you’ve decided which systems to monitor and what data to collect, you need to consider the methods you’ll use. Protocols such as Simple Network Management Protocol (SNMP), Windows Management Instrumentation (WMI), syslog and NetFlow each have their own permissions and connection points in the environment.

For example, many organizations plan to use SNMP for hardware monitoring, only to discover it’s not enabled on dozens — or hundreds — of systems. Alternatively, they find out it is enabled, but the community strings are inconsistent, undocumented or unset. Then they go to monitor in the DMZ and realize that the security policy won’t allow SNMP across the firewall.

Additionally, remember that different collection methods have different access schemes. For example, WMI uses a Windows account on the target machine. If it’s not there, has the wrong permissions or is locked, monitoring won’t work. Meanwhile, SNMP uses a simple community string that can be different on each machine.

Architecture–Finally, consider the architecture of the tools you’re considering. This breaks down to connectivity and scalability.

First, let’s consider connectivity. Agent-based platforms have on-device agents that collect and store data locally, then forward large data sets at regular intervals. Each collector bundles and sends this data to a manager-of-managers, which passes it to the repository. Meanwhile, agentless solutions use a collector that directly polls source devices and forwards the information to the data store.

You need to understand the connectivity architecture of these various tools so you can effectively handle DMZs, remote sites, secondary data centers and the like. You also need to look at the connectivity limitations of various tools, such as how many devices each collector can support and how much data will be traversing the wire, so you can design a monitoring implementation that doesn’t cripple your network or collapse under its own weight.

Next comes scalability. Understand what kind of load the monitoring application will tolerate, and what your choices are to expand when — yes, when, not if — you hit that limit. To be honest, this is a tough one, and many vendors hope you’ll accept some form of an “it really depends” response.

In all fairness, it does matter, and some things are simply impossible to predict. For example, I once had a client who wanted to implement syslog monitoring on 4,000 devices. It ended up generating upwards of 20 million messages per hour. That was not a foreseeable outcome.

By taking these key elements of a monitoring tool implementation into consideration, you should be able to avoid most of the major missteps many monitoring rollouts suffer from. And the good news is that from there, the same techniques that serve you well during other implementations will help here. You want to ask lots of questions; meet with customers in similar situations, such as environment size, business sector, etc.; set up a proof of concept first; engage experienced professionals to assist as necessary; and be prepared — both financially and psychologically — to adapt as wrinkles crop up. Because they will.

(This article originally appeared on SearchNetworking)