Blueprint: The Evolution of the Network, Part 1

NOTE: This article originally appeared here.

Learn from the past, live in the present and prepare for the future.

While this may sound like it belongs hanging on a high school guidance counselor’s wall, these are words to live by, especially in IT. And they apply to no infrastructure element better than the network. After all, the network has long been a foundational building block of IT; it’s even more important today than it was in the days of SAGE and ARPANET; and its importance will only continue to grow, even as the network itself becomes more complex.

For those of us charged with maintaining the network, it’s valuable to take a step back and examine the evolution of the network. Doing so helps us take an inventory of lessons learned—or the lessons we should have learned; determine what today’s essentials of monitoring and managing networks are; and finally, turn an eye to the future to begin preparing now for what’s on the horizon.

Learn from the Past

Think back to the time before the luxuries of Wi-Fi, the proliferation of virtualization, and today’s cloud computing.

The network used to be defined by a mostly wired, physical entity controlled by routers and switches. Business connections were based on T1 and ISDN, and Internet connectivity was always backhauled through the data center. Each network device was a piece of company-owned hardware, and applications operated on well-defined ports and protocols. VoIP was used infrequently, and anywhere connectivity—if even a thing—was provided by the low-quality bandwidth of cell-based Internet access.

With this yesteryear in mind, consider the following lessons we all (should) have learned that still apply today:

It Has to Work
Where better to start than with a throwback to RFC 1925, “The Twelve Networking Truths”? It’s just as true today as it was in 1996—if your network doesn’t actually work, then all the fancy hardware is for naught. Anything that impacts the ability of your network to work should be suspect.

The Shortest Distance Between Two Points is Still a Straight Line
Wired or wireless, MPLS, EIGRP or OSPF, your job as a network engineer is still fundamentally to create the conditions where the path between the provider of information, usually a server, and the consumer of that information, usually a PC, is as near to a straight line as possible. When you forget that and get caught up in quality of service maps, automated functions and fault tolerance, you’ve lost your way.

An Unconfigured Switch is Better than the Wizard
It was a long-standing truth that running the configuration wizard on a switch was the fastest way to break it, whereas just unboxing and plugging it in would work fine. Wizards are a fantastic convenience and come in all forms, but if you don’t know what the wizard is making convenient, you are heading for trouble.

What is Not Explicitly Permitted is Forbidden
No, this policy isn’t fun and it won’t make you popular. And it will actually create work for you on an ongoing basis. But there is honestly no other way to run your network. If espousing this policy will get you fired, then the truth is you’re going to get fired one way or the other. You might as well be able to pack your self-respect and professional ethics into the box along with your potted fern and stapler when the other shoe drops. Because otherwise that huge security breach is on you.

Live in the Present 

Now let’s fast forward and consider the network of present day.

Wireless is becoming ubiquitous—it’s even overtaking wired networks in many instances—and the number of devices wirelessly connecting to the network is exploding (think Internet of Things). It doesn’t end there, though—networks are growing in all directions. Some network devices are even virtualized, resulting in a complex amalgam of the physical, the virtual and the Internet. Business connections are DSL/cable and Ethernet services, and increased use of cloud services is stretching Internet capacity at remote sites, not to mention opening security and policy issues since it’s not all backhauled through the data center. BYOD, BYOA, tablets and smartphones are prevalent, creating bandwidth capacity and security issues. Application visibility based on port and protocol is largely impossible due to applications tunneling via HTTP/HTTPS. VoIP is common, also imposing higher demands on network bandwidth, and LTE provides high-quality anywhere connectivity.

Are you nostalgic for the days of networking yore yet? The complexity of today’s networking environment underscores that while lessons of the past are still important, a new set of network monitoring and management essentials is necessary to meet the challenges of today’s network administration head on. These new essentials include:

Network Mapping
While perhaps a bit back-to-basics, and also suitable as a lesson we all should have learned by now, network mapping—and the understanding of monitoring and management needs that flows from it—has never been more essential than it is today, given the complexity of modern networks and network traffic. Moving ahead without a plan—without knowing the reality on the ground—is a sure way to make the wrong network monitoring choices based on assumptions and guesswork.

Wireless Management
The growth of wireless networks presents new problems: ensuring adequate signal strength, and keeping the proliferation of devices and their physical mobility—potentially hundreds of thousands of network-connected devices, few of which are stationary and many of which may not be owned by the company (BYOD)—from getting out of hand. What’s needed are tools such as wireless heat maps, user device tracking, alerts on over-subscribed access points, and device IP address management.

Application Firewalls
When it comes to surviving the Internet of Things, you first must understand that all of the “things” connect to the cloud. Because they’re not coordinating with a controller on the LAN, each device incurs a full conversation load, burdening the WAN and every element in the network. And worse, many of these devices prefer IPv6, meaning you’ll have more pressure to dual-stack all of those components. Application firewalls can untangle device conversations, get IP address management under control and help prepare for IPv6. They can also classify and segment device traffic; implement effective quality of service to ensure that critical business traffic has headroom; and of course, monitor flow.

Capacity Planning
Nobody plans for not growing; it’s just that sometimes infrastructure doesn’t read the plan we’ve so carefully laid out. You need to integrate capacity forecasting tools, configuration management and web-based reporting to be able to predict scale and growth. There’s the oft-quoted statistic that 70 percent of network outages come from unexpected network configuration changes. Admins have to avoid the Jurassic Park effect—outages that are unexpected, but in hindsight clearly predictable, are the bane of any IT manager’s existence. “How did we not know and respond to this?” is a question nobody wants to have to answer.

Application Performance Insight
Many network engineers have complained that the network would be stable if it weren’t for the end users. While it’s an amusing thought, it ignores the universal truth of IT—everything we do is because of, and for, end users. The whole point of having a network is to run the business applications end users need to do their jobs. Face it, applications are king. Technologies such as deep packet inspection, or packet-level analysis, can help you ensure the network is not the source of application performance problems.

Prepare for the Future

Now that we’ve covered the evolution of the network from past to present—and identified lessons we can learn from the network of yesterday and what the new essentials of monitoring and managing today’s network are—we can prepare for the future. So, stay tuned for part two in this series to explore what the future holds for the evolution of the network.

It’s Not About What Happened

“I call bullshit,” he said with authority. “It couldn’t have happened like that. Let’s move on to something real.”

And just like that, he missed the most important part.

Because sometimes – maybe often – what you need to know is not whether something really, actually, 100% accurately happened “that way”.

It’s about what happened next. More to the point, it’s about what people did about the thing that may-or-may-not-have-happened-that-way.

How we respond to events (real or perceived) tells us and others more about who we are than the situations we find ourselves in.

“The entire data center was crashing”
(it was only 10 servers)
“the CEO was calling my cell every 2 minutes”
(he called twice in the first 30 minutes and then left you alone)
“it was a massive hack, probably out of China or Russia”
(it was a mis-configured router)

Whatever. I’m not as interested in that as what you did next. Did you:

  • Call a vendor and scream that they needed to “fix this yesterday”?
  • Pull the team together, solicit ideas and put together a plan?
  • Tell everyone to stay out of your way and work 24 hours straight to sort it out?
  • Wait 30 minutes before doing anything to see if anyone noticed, or if it sorted itself out?
  • Start documenting everything you saw happening, to review afterward?
  • Simply shut everything down and start it up again and see if that fixes it?
  • Look at your historical data to see if you can spot the beginning of the failure?
  • Immediately recover from backups, and let people know work will be lost?

Notice that most of those aren’t inherently wrong, although several are wrong depending on the specific circumstances.

And that is the ONLY point where “what happened” comes into play. The events around us shape our environment.

But how we decide to respond shapes who we are.

When It Comes to System Outages, Don’t Prepare For the Worst

NOTE: This article originally appeared here.

During the 2014 World Cup soccer competition, Nate Silver and the psychic witches he keeps in his basement — because how else could he make the predictions he does with such accuracy? — got it wrong. Really, really wrong. They were completely blindsided by Germany’s win over Brazil. As Silver described it, it was a completely unforeseeable event.


In sports and, to a lesser extent, politics, the tendency in the face of these things is to eat the loss, chalk it up to a fluke — a black swan in statistics parlance — and get on with life.

But as network administrators, we know that’s not how it works in IT.

In my experience, when a black swan event affects IT systems, management usually acquires a dark obsession with the event. Meetings are called under the guise of “lessons learned exercises,” with the express intent of ensuring said system outages never happen again.

Don’t spend too much time studying what might occur

Now, I’m not saying that after a failure we should just blithely ignore any lessons that could be learned. Far from it, actually. In the ashes of a failure, you often find the seeds of future avoidance. One of the first things an IT organization should do after such an event is determine whether the failure was predictable, or if it was one of those cases where there wasn’t enough historical data to determine a decent probability.


If the latter is the case, I’m here to tell you your efforts are much better spent elsewhere. What’s a better approach? Instead of spending time trying to figure out if a probability may or may not exist, catch and circumvent those common, everyday IT annoyances. This is a tactic that’s overlooked far too often.

Don’t believe me? Well, let’s take the example of a not-so-imaginary company I know that had a single, spectacular IT failure that cost somewhere in the neighborhood of $100,000. Management was understandably upset. It immediately set up a task force to identify the root cause of the failure and recommend steps to avoid it in the future. Sounds reasonable, right?

The task force — five experts pulled from the server, network, storage, database and applications teams — spent three months investigating the root cause, with each expert logging roughly 500 hours along the way. Being conservative, let’s say the hourly cost to the company was $50. Multiply that by five people, then by 500 hours apiece. It comes to a nice round $125,000.

Not so reasonable after all

Yes, at the end of it all the root problem was not only identified — at least, as much as possible — but code was put in place to (probably) predict the next time the exact same event might occur. Doesn’t sound so bad. But keep this in mind: The company spent $25,000 more than the cost of the original failure to create a solution that may or may not predict the occurrence of a black swan exactly like the one that hit before.

Maybe it wasn’t so reasonable after all.

You may be thinking, “But where else are you saying we should be focusing? After all, we’re held accountable to the bottom line as much as anyone else in the company.”

I get that, and it’s actually my point. Let’s compare the previous example of chasing a black swan to another, far more common problem: network interface card (NIC) failures.

In this example, another not-so-fictitious company saw bandwidth usage spike and stay high. NICs threw errors until the transmission rates bottomed out, and eventually the card just up and died. The problem was that while bandwidth usage was monitored, there was no alerting in place for interfaces that stopped responding or disappeared (the company monitored the IP at the end of the connection, which meant WAN links were absent alerts until the far end went down).

Let’s assume that a NIC failure takes an average of one hour to notice and correctly diagnose, and then it takes two hours to fix by network administrators who cost the company $53 per hour. While the circuit is out, the company loses about $1,000 per hour in revenue, lost opportunity, etc. That means system outages like this one could cost the company $3,106.

Setting a framework anchored by alerting and monitoring

Now, consider that, in my experience, proper monitoring and alerting reduces the time it takes to notice and diagnose problems such as NIC failures to 15 minutes. That’s it. Nothing else fancy, at least not in this scenario. But that simple thing could reduce the cost of the outage by $750.

I know those numbers don’t sound too impressive. That is, until you realize a moderately sized company can easily experience 100 NIC failures per year. That translates to more than $300,000 in lost revenue if the problem is unmonitored, and an annual savings of $75,000 if alerting is in place.

And that doesn’t take into account the ability to predict NIC failures and replace the card pre-emptively. If we estimate that 50% of the failures could be avoided using predictive monitoring, the savings could rise to more than $190,000.
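Since figures like these end up in budget meetings, it’s worth being able to reproduce them. Here’s the whole cost model in a few lines of Python, using only the assumed figures from the scenario above:

```python
# Back-of-the-envelope cost model for the NIC-failure scenario.
# Every number here is an assumption from the article, not measured data.

ADMIN_RATE = 53      # $/hour for a network administrator
REVENUE_LOSS = 1000  # $/hour lost in revenue/opportunity while the circuit is down
FIX_HOURS = 2        # hands-on repair time once the problem is diagnosed

def outage_cost(detect_hours):
    """Total cost of one NIC failure, given time to notice and diagnose."""
    downtime = detect_hours + FIX_HOURS
    return downtime * REVENUE_LOSS + FIX_HOURS * ADMIN_RATE

unmonitored = outage_cost(1.0)   # one hour to notice and diagnose
monitored = outage_cost(0.25)    # 15 minutes with proper alerting
per_failure_savings = unmonitored - monitored

failures_per_year = 100
annual_unmonitored = failures_per_year * unmonitored
annual_savings = failures_per_year * per_failure_savings

# Predictive monitoring: assume half the failures are avoided outright;
# the rest still benefit from fast alerting.
avoided = failures_per_year // 2
predictive_savings = (avoided * unmonitored
                      + (failures_per_year - avoided) * per_failure_savings)

print(unmonitored)          # 3106.0   -> "could cost the company $3,106"
print(per_failure_savings)  # 750.0    -> "reduce the cost of the outage by $750"
print(annual_unmonitored)   # 310600.0 -> "more than $300,000"
print(predictive_savings)   # 192800.0 -> "more than $190,000"
```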

Again, I’m not saying preparing for black swan events isn’t a worthy endeavor, but when tough budget decisions need to be made, some simple alerting on common problems can save more than trying to predict and prevent “the big one” that may or may not ever happen.

After all, NIC failures are no black swan. I think even Nate Silver would agree they’re a sure thing.

“Nice to Have” is Relative

What could you absolutely positively not live without? … Draw a line. Come up with reasons. Know where you stand.

Cute. Except without requirements you can’t draw a line effectively.

So go from there. Start with what you need to accomplish, and then set your must-haves to achieve that goal. Don’t embellish, don’t waffle. Just the essentials.

Draw the line, set the next goal, list the deal-breakers.

Believe it or not, this is an exercise most organizations don’t do (at least for monitoring). They start with an (often vague) goal – “Select a NetFlow tool” or “Monitor customer experience for applications”. And then they look for tools that do that.

Before you know it, you have 20, or 50, or 150 tools (I am NOT exaggerating). You have staff – even whole teams – who invest hundreds of hours getting good at using those tools.

And then you get allegiances. And then you get politics.

Respect Your Elders

NOTE: This article originally appeared here.

“Oh Geez,” exclaimed the guy who sits 2 desks from me, “that thing is ancient! Why would they give him that?”

Taking the bait, I popped my head over the wall and asked “what is?”

He showed me a text message, sent to him from a buddy—an engineer (EE, actually) who worked for an oil company. My co-worker’s iPhone 6 displayed an image of a laptop we could only describe as “vintage”:

(A Toshiba Tecra 510CDT, which was cutting edge…back in 1997.)

“Wow.” I said. “Those were amazing. I worked on a ton of those. They were serious workhorses—you could practically drop one from a 4 story building and it would still work. I wanted one like nobody’s business, but I could never afford it.”

“OK, back in the day I’m sure they were great,” said my 20-something coworker dismissively. “But what the hell is he going to do with it NOW? Can it even run an OS anymore?”

I realized he was coming from a particular frame of reference that is common to all of us in I.T. Newer is better. Period. With few exceptions (COUGH-Windows M.E.-COUGH), the latest version of something—be it hardware or software—is always a step up from what came before.

While true, it leads to a frame of mind that is patently untrue: a belief that what is old is also irrelevant. Especially for I.T. professionals, it’s a dangerous line of thought that almost always leads to unnecessary mistakes and avoidable failures.

In fact, ask any I.T. pro who’s been at it for a decade, and you’ll hear story after story:

  • When programmers used COBOL, back when dinosaurs roamed the earth, one of the fundamental techniques drilled into their heads was, “check your inputs.” Think about the latest crop of exploits, be they an SSLv3 thing like ‘POODLE’, a SQL injection, or any of a plethora of web-based security problems: the fundamental flaw is the server NOT checking its inputs for sanity.
  • How about the OSI model? Yes, we all know it’s required knowledge for many certification exams (and at least one IT joke). But more importantly, it was (and still is) directly relevant to basic network troubleshooting.
  • Nobody needs to know CORBA object structures anymore, right? Except that a major monitoring tool was originally developed on CORBA, and that foundation has stuck. Which is why, if you try to create a folder-inside-a-folder more than 3 levels deep, the entire system corrupts. CORBA’s object model could only handle 3 levels of object containership.
  • PowerShell can be learned without understanding Unix/Linux command-line concepts. But it’s sure EASIER to learn if you already know how to pipe ls into grep into sort so that you get a list of just the files you want, sorted by date. That technique (among other Unix/Linux concepts) was one of the original inspirations for PowerShell.
  • Older rev’s of industrial motion-control systems used specific pin-outs on the serial port. The new USB-to-Serial cables don’t mimic those pin-outs correctly, and trying to upload a program with the new cables will render the entire system useless.

And in fact, that’s why my co-worker’s buddy was handed one of those venerable Tecra laptops. It had a standard serial port and it was preloaded with the vendor’s DOS-based ladder-logic programming utility. Nobody expected it to run Windows 10, but it fulfilled a role that modern hardware simply couldn’t have done.

It’s an interesting story, but you have to ask: aside from some interesting anecdotes and a few bizarre use-cases, does this have any relevance to our work day-to-day?

You bet.

We live in a world where servers, storage, and now the network are rushing toward a quantum singularity of virtualization.

And the “old-timers” in the mainframe team are laughing their butts off as they watch us run in circles, inventing new words to describe techniques they learned at the beginning of their career; making mistakes they solved decades ago; and (worst of all) dismissing everything they know as utterly irrelevant.

Think I’m exaggerating? SAN and NAS look suspiciously like DASD, just on faster hardware. Services like Azure and AWS, for all their glitz and automation, aren’t as far from rented time on a mainframe as we’d like to imagine. And when my company replaces my laptop with a fancy “appliance” that connects to Citrix VDI session, it reminds me of nothing as much as the VAX terminals I supported back in the day.

My point isn’t that I’m a techno-Ecclesiastes shouting “there is nothing new under the sun!” Or some I.T. hipster who was into the cloud before it was cool. My point is that it behooves us to remember that everything we do, and every technology we use, had its origins in something much older than 20 minutes ago.

If we take the time to understand that foundational technology, we have the chance to avoid past missteps, leverage undocumented efficiencies built into the core of the tools, and build on ideas elegant enough to have withstood the test of time.

Got your own “vintage” story, or long-ago-learned tidbit that is still true today? Share it in the comments!

Are We There Yet? Are We There Yet? Are We…

How often you request information from a target device (commonly known as the polling cycle) is one of the fine arts of monitoring design, and one of the things that requires the most up-front discussion with consumers of your information.

Why? Because it’s often unclear to the consumer (the person who is getting the ticket, report, etc.) what delays exist between the occurrence of an actual event and the reporting/recording of it. Worse, most of us in the monitoring segment of IT do a poor job of communicating these things.

Let’s take good ol’ ping as an example.

You want to check the up-down status of your systems on a regular basis. But you DON’T want to turn your monitoring system into an inadvertent DDoS Smurf attack generator.

So you pick some nice innocuous interval – say a ping check every 1 minute.

Then you realize that you need to monitor 3,000 systems, and so you back that off to a check every 2 minutes.
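The back-off is simple arithmetic: at a one-minute interval, the poller’s aggregate ping rate starts to look unfriendly (the numbers below use the example’s 3,000 systems):

```python
# Aggregate ping load on the poller for the example above.
systems = 3000
for interval_seconds in (60, 120):
    rate = systems / interval_seconds
    print(f"one check per system every {interval_seconds}s -> {rate:.0f} pings/second")
```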

Once you get past the polling frequency, there’s an additional level of verification. Just because a system failed ONE ping check doesn’t mean it’s actually down. So maybe your toolset does an additional level of just-in-time validation (let’s say it pings an additional 10 times, just to be sure the target is really down). That verification introduces at least another full minute of delay.

Is your alerting system doing nothing else? It’s just standing idle, waiting for this one alert to come in and be processed? Probably not. Let’s throw another 60 seconds into the mix for inherent delays in the alert engine.

Does your server team really want to know when a device is unavailable (i.e., ping failed) for 1 minute? Probably not. They know servers get busy and don’t respond to ping even though they are processing. Heck, they may even want to avoid cutting tickets during patching-cycle reboots. My experience is that most teams don’t want to hear that a server is down unless it’s really been down for at least 5 minutes.

Now you get the call: “I thought we were pinging every minute. But I got a ticket this morning and when I went to check the server, it had been down for over 10 minutes. What gives?”

There’s no single “right” answer to this. Nobody wants false alarms. And nobody wants to find out a system is down after the customer already knows.

But it is between those two requirements that one minute stretches to ten.
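The way one minute stretches toward ten is easy to see if you simply stack up the delays described above (the intervals are this post’s example values, taken worst-case):

```python
# Worst-case detection delay for a "down" server, summing the stages
# described above. All intervals are this article's example values.
delays_minutes = {
    "polling interval (ping check every 2 minutes)":     2,
    "just-in-time re-verification (10 extra pings)":     1,
    "alert engine queueing and processing":              1,
    "'down for at least 5 minutes' suppression window":  5,
}

total = sum(delays_minutes.values())
print(f"worst case: ~{total} minutes from failure to ticket")
```

Nine minutes on paper, and a busy poller or a single retry easily pushes it past ten.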

When, Not What, Defines Today’s Networking Career

Back in December, Cisco filed a lawsuit against Arista Networks because Arista’s network device operating system, EOS, was too similar to Cisco’s beloved IOS.

This caused Tom Hollingsworth (a.k.a. “The Networking Nerd”) to speculate that this action presaged the ultimate death of the network device command-line interface (CLI).

Time will tell whether Hollingsworth is right or wrong and to what degree, but the idea intrigued me. Why would it matter if the command-line interface went away? What would be the loss?

Now, before going further, here’s a little background on me: I tend to be a “toaster” guy when it comes to technology. I don’t love my toaster or hate my toaster. I don’t proselytize the glorious features of my toaster to non-users. I just use my toaster. If it burns the toast, it’s time for a new toaster. Sure, over the years I’ve built up a body of experience that tells me my bagels tend to get stuck in the upright models, so I prefer the toaster/oven combos. But at the end of the day, it’s about making good toast.

Today’s networking career means learning new techniques

Jeez! Now I have a craving for a panini. Where was I? Oh right, technology.

I use a lot of technologies. My phone is Android. My work laptop runs Windows 8.1. My home desktop runs Linux. My wife lives on her iPad. So on and so forth. I’ve come to believe that learning technology is like learning to play cards.

The first game you learn is new, interesting and a little confusing, but ultimately thrilling because of the world it opens up. But the second card game, that’s the hard one. You know all the rules of the first game, but now there’s this other game that completely shatters what you knew. Then you learn your third card game, and you start to see the differences and similarities. By the fifth game, you understand that the cards themselves are just a vehicle for different ways of structuring sets.

I believe that’s why people are concerned about Hollingsworth’s prediction of the death of CLI. If you only know one game — and let’s face it, CLI is an extremely comprehensive and well-known “game” — and you’ve invested a lot of time and energy learning not only the rules of that game but also its nuances and tricks, finding out that game is about to be discontinued can be distressing. But when it comes to CLI, I believe that distress is actually due to a misplaced sense of self. Because you aren’t really a CLI professional, are you?

You’re a networking professional, not a CLI pro

Sure, you know a lot about CLI. But really, you’re a networking professional. Being able to configure open shortest path first (OSPF) from memory makes your job easier. But your job is knowing what OSPF is, when to use it versus enhanced interior gateway routing protocol, how to evaluate its performance and so on.

No, the concern about the death of CLI is really rooted in the fear of personal obsolescence. I’ve heard that notion repeated in conversations about the mainframe, Novell networking, WordPerfect 5.1 and dozens of other technologies that were brilliant in their time, but which, for various reasons, were superseded by something else — sometimes something else that is better, and sometimes not.

And a fear of personal obsolescence in your networking career is ultimately false, unless you are digging in your heels and choosing never to learn anything new again. (OK, that was sarcasm, folks. As IT pros, we should be committed to life-long learning. Even if you are two years away from retirement, learning new stuff is still A Good Thing™.) As long as you are open to new ideas, new techniques and yes, new systems, then you won’t become obsolete.

Employers exploit networkers’ insecurity

I’ll be honest. I think there are a lot of employers that exploit this insecurity. “Where’s a Perl script-kiddie like you going to find this kind of role?” they whisper — usually implicitly, although sometimes much more explicitly than any of us prefer. Or if we’re interviewing for a new job, they ask, “I see you have a lot of experience with AIX, but we’re a Windows shop. Do you really think your skills translate?”

I’m not here to talk about interviewing skills, salary negotiations or career improvements, so I’m not going to get into the potential responses, but I will say that the ultimate answer in each of these cases — and many others — is “Yes.” Why? Because it’s not about whether I know the fifth parameter of the gerfrinkel command in some version of CodeMe, which was deprecated in favor of the unglepacht function. It’s not about any of that. It’s about my experience knowing when to use certain commands, when to look for a workaround, how to manage an environment of this scale and scope, and so on.

To play off the old joke about the copier repairman, a small part of your paycheck goes toward turning the screw; more of it is based on knowing which screw to turn.

As IT pros, we are paid — and are valuable — because we know how to find out which screw to turn and when to turn it. So to speak, of course.

A Cornucopia

Before you know it, you have 20, or 50, or 150 monitoring tools in your company.

How does that happen?

Like most other things happen. Organically, slowly, in response to events on the ground, to pressures of time and budget and need (both real and perceived), to people’s biases.

Can you quantify which of your monitoring tools can ping? Can perform an SNMP walk? Can receive NetFlow information? Can perform a WMI query?

Of all the tools in your environment, do you know how many have features which overlap? And how much overlap?

If you had one solution which covered 90% (or even 80%) of the functionality of another tool (or more than one!), would you know? And would you know if the remaining 10-20% was essential?

And now that I’ve asked, how would you go about capturing that information?
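One lightweight way to start capturing that information is to inventory each tool’s capabilities as a set and compare them pairwise. The tool names and feature lists below are invented purely for illustration:

```python
# Sketch: quantify feature overlap between monitoring tools.
# Tool names and capability lists here are hypothetical.
capabilities = {
    "ToolA": {"ping", "snmp-walk", "netflow", "wmi-query"},
    "ToolB": {"ping", "snmp-walk", "syslog"},
    "ToolC": {"ping", "netflow", "wmi-query"},
}

def coverage(a, b):
    """What fraction of tool b's features does tool a also provide?"""
    return len(capabilities[a] & capabilities[b]) / len(capabilities[b])

for other in ("ToolB", "ToolC"):
    unique = capabilities[other] - capabilities["ToolA"]
    print(f"ToolA covers {coverage('ToolA', other):.0%} of {other}; "
          f"features unique to {other}: {sorted(unique)}")
```

Once the coverage number is high, the remaining question from above answers itself: is what’s left in the `unique` set essential, or just familiar?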

“Logfile Monitoring” – I Do Not Think It Means What You Think It Means

This is a conversation I have A LOT with clients. They say they want “logfile monitoring,” and I am not sure what they mean. So I end up having to unwind all the different things it COULD be, so we can get to what they actually need.

It’s also an important clarification for me to make as SolarWinds Head Geek because depending on what the requester means, I might need to point them toward Kiwi Syslog Server, Server & Application Monitor (SAM), or Log & Event Manager (LEM).

Here’s a handy guide to identify what people are talking about. “Logfile monitoring” is usually applied to 4 different and mutually exclusive areas. Before you allow the speaker to continue, please ask them to clarify which one they are talking about:

  1. Windows Logfile
  2. Syslog
  3. Logfile aggregation
  4. Monitoring individual text files on specific servers

More clarification on each of these areas below:

Windows Logfile

Monitoring in this area refers specifically to the Windows event log, which isn’t actually a log “file” at all, but a database unique to Windows machines.

In the SolarWinds world, the tool that does this is Server & Application Monitor. Or if you are looking for a small, quick, and dirty utility, the Eventlog Forwarder for Windows will take Eventlog messages that match a search pattern and pass them via Syslog to another machine.


Syslog

Syslog is a protocol that describes how to send a message from one machine to another on UDP port 514. The messages must fit a pre-defined structure, and syslog is different from SNMP traps. The protocol is most often found when monitoring *nix (Unix, Linux) systems, although network and security devices send out their fair share as well.
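For the curious, the wire format is simple enough to sketch: a UDP datagram whose payload begins with a PRI value computed as facility × 8 + severity. This is a minimal, RFC 3164-style illustration only, and the destination host name is a placeholder:

```python
import socket

def format_syslog(message, facility=1, severity=5):
    """Build an RFC 3164-style payload; <PRI> is facility*8 + severity.
    Defaults: facility 1 (user-level), severity 5 (notice)."""
    pri = facility * 8 + severity
    return f"<{pri}>{message}"

def send_syslog(message, host="logs.example.com", port=514):
    # Classic syslog is fire-and-forget UDP to port 514.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.sendto(format_syslog(message).encode(), (host, port))

print(format_syslog("myhost myapp: backup completed"))  # <13>myhost myapp: backup completed
# send_syslog("myhost myapp: backup completed")  # uncomment with a real collector
```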

In terms of products, this is covered natively by Network Performance Monitor (NPM), but as I’ve often said, you shouldn’t send syslog or traps directly to your NPM primary poller. You should send them through a syslog/trap “filtration” layer first. And that would be Kiwi Syslog Server (or its freeware cousin).

Logfile aggregation

This technique involves sending (or pulling) log files from multiple machines and collecting them on a central server. This collection is done at regular intervals. A second process then searches across all the collected logs, looking for trends or patterns in the enterprise. When the audit and security groups talk about “logfile monitoring,” this is usually what they mean.

As you may have already guessed, the SolarWinds tool for this job is Log & Event Manager. I should point out that LEM will ALSO receive syslog and traps, so you kind of get a twofer if you have this tool. Although, I personally STILL think you should send all of your syslog and trap to a filtration layer, and then send the non-garbage messages to the next step in the chain (NPM or LEM).

Monitoring individual text files on specific servers

This activity focuses on watching a specific (usually plain text) file in a specific directory on a specific machine, looking for a string or pattern to appear. When that pattern is found, an alert is triggered. Now it can get more involved than that—maybe not a specific file, but a file matching a specific pattern (like a date); maybe not a specific directory, but the newest sub-directory in a directory; maybe not a specific string, but a string pattern; maybe not ONE string, but 3 occurrences of the string within a 5 minute period; and so on. But the goal is the same—to find a string or pattern within a file.
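The trickiest variation above (N occurrences of a pattern within a time window) boils down to a small amount of logic. This is an illustrative sketch only, not how any particular product does it; a real tool also handles file tailing, rotation, and pattern-matched file names, and the class name and sample log lines are made up:

```python
import re
from collections import deque

class PatternWatcher:
    """Fire an alert when a pattern appears N times within a rolling time window."""
    def __init__(self, pattern, occurrences=3, window_seconds=300):
        self.pattern = re.compile(pattern)
        self.occurrences = occurrences
        self.window = window_seconds
        self.hits = deque()  # timestamps of recent matches

    def feed(self, timestamp, line):
        """Feed one log line; return True when the alert threshold is reached."""
        if not self.pattern.search(line):
            return False
        self.hits.append(timestamp)
        # Drop matches that have aged out of the window.
        while timestamp - self.hits[0] > self.window:
            self.hits.popleft()
        return len(self.hits) >= self.occurrences

# Three ORA-* errors within 5 minutes should trigger; two should not.
watcher = PatternWatcher(r"ORA-\d+", occurrences=3, window_seconds=300)
log = [(0,   "ORA-00600 internal error"),
       (90,  "backup completed ok"),
       (120, "ORA-01555 snapshot too old"),
       (200, "ORA-00600 internal error")]
fired = [watcher.feed(t, line) for t, line in log]
print(fired)  # [False, False, False, True]
```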

Within the context of SolarWinds, SAM has been the go-to solution for this type of thing. But at the moment, it does so only through a series of Perl, PowerShell, and VBScript templates.

We know that’s not the best way to get the job done, but that’s a subject for another post.

The More You Know…

For now, it’s important that you are able to clearly define—for both you and your colleagues, customers, and consumers—the difference between “logfile monitoring” and which tool or technique you need to employ to get the job done.


What is “essential” to your monitoring environment? (hey, this *is* a monitoring blog. I could certainly ask that about your kitchen, but that would be a completely different discussion)

Seriously – if your budget was infinite, you could have all the tools running on giant systems with all the disk in the world. But it (the budget) isn’t (infinite) and you don’t (have all the toys).

So start from the other end. What could you absolutely positively not live without? Is ping and a DOS batch file enough? A rock-solid SNMP trap listener? Is it a deal-breaker if you don’t have agents running local commands on each server?

Draw a line. Come up with reasons. Know where you stand.