Category Archives: ICYMI

Blueprint: The Evolution of the Network, Part 2

NOTE: This article originally appeared here.

If you’re not prepared for the future of networking, you’re already behind.

That may sound harsh, but it’s true. Given how quickly technology evolves compared to how quickly most of us evolve our skillsets, there’s no time to waste in preparing ourselves to manage and monitor the networks of tomorrow. Yes, this is a bit of a daunting proposition considering that some of us are still trying to catch up with today’s essentials of network monitoring and management, but the reality is that the two aren’t mutually exclusive, are they?

In part one of this series, I outlined how the networks of today have evolved from those of yesteryear, and what today’s new essentials of network monitoring and management are as a consequence. If you paid careful attention, you likely picked up on the ways the lessons from the past helped shape those new essentials.

Similarly, today’s essentials will help shape those of tomorrow. Thus, as I said, getting better at leveraging today’s essentials of network monitoring and management is not mutually exclusive with preparing for the networks of tomorrow.

Before delving into what the next generation of network monitoring and management will look like, it’s important to first explore what the next generation of networking will look like.

On the Horizon

Above all else, one thing is for certain: We networking professionals should expect tomorrow’s technology to create more complex networks resulting in even more complex problems to solve. With that in mind, here are the top networking trends that are likely to shape the networks of the future:

Networks growing in all directions
Fitbits, tablets, phablets and applications galore. The explosion of IoT, BYOD, BYOA and BYO-everything else is upon us. With this trend still in its infancy, the future of connected devices and applications will be defined not only by the quantity of connected devices, but also by the quality of their connections and the network bandwidth they consume.

But it goes beyond the gadgets end users bring into the environment. More and more, commodity devices such as HVAC infrastructure, environmental systems such as lighting, security devices and more all use bandwidth—cellular or WiFi—to communicate outbound and receive updates and instructions inbound. Companies are using, or planning to use, IoT devices to track products, employees and equipment. This explosion of devices that consume or produce data will, not might, create a potentially disruptive surge in bandwidth consumption, security concerns, and monitoring and management requirements.

IPv6 eventually takes the stage…or sooner (as in now!)
Recently, ARIN was unable to fulfill a request for IPv4 addresses because the request was greater than the contiguous blocks available. Meanwhile, IPv6 is now almost always enabled by default and is therefore creating challenges for IT professionals even if they, and their organizations, have committed to putting off their own IPv6 decisions. The upshot of all this is that IPv6 is a reality today. There is an inevitable and quickly approaching moment when switching over will no longer be an option, but a requirement.

SDN and NFV will become the mainstream
Software defined networking (SDN) and network function virtualization (NFV) are just in their infancy and should be expected to become mainstream in the next five to seven years. With SDN and virtualization creating new opportunities for hybrid infrastructure, a serious look at adoption of these technologies is becoming more and more important.

So long WAN Optimization, Hello ISPs
There are a number of reasons WAN optimization technology is being, and will continue to be, kicked to the curb. With bandwidth increases outpacing the ability of CPUs and custom hardware to perform deep inspection and optimization, and with ISPs helping to circumvent the cost and complexity associated with WAN accelerators, WAN optimization will only see the light of tomorrow in unique use cases where the rewards outweigh the risks. As most of us will admit, WAN accelerators are expensive and complicated, making ISPs more and more attractive. Their future inside our networks is certainly bright.

Farewell L4 Firewalling 
With the mass of applications and services moving toward web-based deployment, using Layer 4 (L4) firewalls to block these services entirely will not be tolerated. A firewall incapable of performing deep packet analysis and understanding the nature of traffic at Layer 7 (L7), the application layer, will not provide the level of granularity and flexibility that most network administrators should offer their users. On this front, change is clearly inevitable for us network professionals, whether it means added network complexity and adapting to new infrastructures or simply letting withering technologies go.

Preparing to Manage the Networks of Tomorrow  

So, what can we do to prepare to monitor and manage the networks of tomorrow? Consider the following:

Understand the “who, what, why and where” of IoT, BYOD and BYOA
Connected devices cannot be ignored. According to 451 Research, mobile Internet of Things (IoT) and Machine-to-Machine (M2M) connections will increase to 908 million in just five years, compared to 252 million last year. This staggering statistic should prompt you to start creating a plan of action for managing nearly four times the number of devices infiltrating your networks today.

Your strategy can either aim to manage these devices within the network or set an organizational policy to regulate the traffic altogether. As nonprofit IT trade association CompTIA noted in a recent survey, many companies are trying to implement partial and even zero-BYOD policies to regulate security and bandwidth issues. Even though policies may seem like an easy fix, curbing all of tomorrow’s BYOD/BYOA is nearly impossible. As such, you will have to understand your network device traffic in granular detail in order to optimize and secure it. Even more so, you will need visibility into the devices and network segments that aren’t in your direct control—the tablets, phablets and Fitbits—to properly isolate issues.

Know the ins and outs of the new mainstream 
As stated earlier, SDN, NFV and IPv6 will become the new mainstream. We can start preparing for these technologies’ future takeovers by taking a hybrid approach to our infrastructures today. This will put us ahead of the game with an understanding of how these technologies work, the new complexities they create and how they will ultimately affect configuration management and troubleshooting ahead of mainstream deployment.

Start comparison shopping now
Going through the exercise of evaluating ISPs, virtualized network options and other on-the-horizon technologies—even if you don’t intend to switch right now—will help you nail down your particular requirements. Knowing that a vendor has or works with technology you don’t need right now, such as IPv6, but might need later can and should influence your decision.

Brick in, brick out
Taking on new technologies can feel overwhelming to those of us with “boots on the ground” because the new technology can often simply seem like one more mouth to feed, so to speak. As much as possible, look for ways that potential new additions will not just enhance, but replace, the old guard. Maybe your new real-time deep packet inspection won’t completely replace your L4 firewalls, but if it can significantly reduce your reliance on them—while at the same time increasing insight and the ability to respond intelligently to issues—then the net result should be a better day for you. If you don’t do this, then more often than not, new technology will indeed simply seem to increase workload and do little else. This is also a great measuring stick for identifying new technologies whose time may not have truly come just yet, at least not for your organization.

At a more basic layer, if you have to replace three broken devices and you realize that the newer equipment is far more manageable or has more useful features, consider replacing the entire fleet of old technology even if it hasn’t fallen apart yet. The benefits of consistency often far outweigh the initial pain of sticker shock.

To conclude this series, my opening statement from part one merits repeating: learn from the past, live in the present and prepare for the future. The evolution of networking waits for no one. Don’t be left behind.

Blueprint: The Evolution of the Network, Part 1

NOTE: This article originally appeared here.

Learn from the past, live in the present and prepare for the future.

While this may sound like something that belongs hanging on a high school guidance counselor’s wall, they are words to live by, especially in IT. And they apply to perhaps no other infrastructure element better than the network. After all, the network has long been a foundational building block of IT; it’s even more important today than it was in the days of SAGE and ARPANET; and it will only continue to grow in importance—and complexity—in the future.

For those of us charged with maintaining the network, it’s valuable to take a step back and examine the evolution of the network. Doing so helps us take an inventory of lessons learned—or the lessons we should have learned; determine what today’s essentials of monitoring and managing networks are; and finally, turn an eye to the future to begin preparing now for what’s on the horizon.

Learn from the Past

Think back to the time before the luxuries of Wi-Fi, before the proliferation of virtualization, and before today’s cloud computing.

The network used to be defined by a mostly wired, physical entity controlled by routers and switches. Business connections were based on T1 and ISDN, and Internet connectivity was always backhauled through the data center. Each network device was a piece of company-owned hardware, and applications operated on well-defined ports and protocols. VoIP was used infrequently, and anywhere connectivity—if even a thing—was provided by the low-quality bandwidth of cell-based Internet access.

With this yesteryear in mind, consider the following lessons we all (should) have learned that still apply today:

It Has to Work
Where better to start than with a throwback to IETF RFC 1925, “The Twelve Networking Truths”? It’s just as true today as it was in 1996—if your network doesn’t actually work, then all the fancy hardware is for naught. Anything that impacts the ability of your network to work should be suspect.

The Shortest Distance Between Two Points is Still a Straight Line
Wired or wireless, MPLS, EIGRP or OSPF—your job as a network engineer is still fundamentally to create the conditions where the distance between the provider of information, usually a server, and the consumer of that information, usually a PC, is as near to a straight line as possible. When you forget that and get caught up in quality-of-service maps, automated functions and fault tolerance, you’ve lost your way.

An Unconfigured Switch is Better than the Wizard
It was a long-standing truth that running the configuration wizard on a switch was the fastest way to break it, whereas just unboxing and plugging it in would work fine. Wizards are a fantastic convenience and come in all forms, but if you don’t know what the wizard is making convenient, you are heading for trouble.

What is Not Explicitly Permitted is Forbidden
No, this policy isn’t fun, and it won’t make you popular. And it will actually create work for you on an ongoing basis. But there is honestly no other way to run your network. If espousing this policy will get you fired, then the truth is you’re going to get fired one way or the other. You might as well be able to pack your self-respect and professional ethics into the box along with your potted fern and stapler when the axe falls. Because otherwise, that huge security breach is on you.

Live in the Present 

Now let’s fast forward and consider the network of present day.

Wireless is becoming ubiquitous—it’s even overtaking wired networks in many instances—and the number of devices wirelessly connecting to the network is exploding (think Internet of Things). It doesn’t end there, though—networks are growing in all directions. Some network devices are even virtualized, resulting in a complex amalgam of the physical, the virtual and the Internet. Business connections are DSL/cable and Ethernet services, and increased use of cloud services is stretching Internet capacity at remote sites, not to mention opening security and policy issues since traffic is no longer all backhauled through the data center. BYOD, BYOA, tablets and smartphones are prevalent, creating bandwidth capacity and security issues. Application visibility based on port and protocol is largely impossible due to applications tunneling via HTTP/HTTPS. VoIP is common, imposing still higher demands on network bandwidth, and LTE provides high-quality anywhere connectivity.

Are you nostalgic for the days of networking yore yet? The complexity of today’s networking environment underscores that while lessons of the past are still important, a new set of network monitoring and management essentials is necessary to meet the challenges of today’s network administration head on. These new essentials include:

Network Mapping
Network mapping may be a bit back-to-basics—and arguably a lesson we all should have learned by now—but when you consider the complexity of today’s networks and network traffic, mapping the network and understanding its monitoring and management needs has never been more essential. Moving ahead without a plan—without knowing the reality on the ground—is a sure way to make the wrong network monitoring choices, based on assumptions and guesswork.

Wireless Management
The growth of wireless networks presents new problems, such as ensuring adequate signal strength and keeping the proliferation of devices and their physical mobility—potentially hundreds of thousands of network-connected devices, few of which are stationary and many of which may not be owned by the company (BYOD)—from getting out of hand. What’s needed are tools such as wireless heat maps, user device tracking, alerts on over-subscribed access points, and device IP address tracking and management.

Application Firewalls
When it comes to surviving the Internet of Things, you first must understand that all of the “things” connect to the cloud. Because they’re not coordinating with a controller on the LAN, each device incurs a full conversation load, burdening the WAN and every element in the network. And worse, many of these devices prefer IPv6, meaning you’ll have more pressure to dual-stack all of those components. Application firewalls can untangle device conversations, get IP address management under control and help prepare for IPv6. They can also classify and segment device traffic; implement effective quality of service to ensure that critical business traffic has headroom; and of course, monitor flow.

Capacity Planning
Nobody plans for not growing; it’s just that sometimes infrastructure doesn’t read the plan we’ve so carefully laid out. You need to integrate capacity forecasting tools, configuration management and web-based reporting to be able to predict scale and growth. There’s the oft-quoted statistic that 70 percent of network outages come from unexpected network configuration changes. Admins have to avoid the Jurassic Park effect: outages that were unexpected, but in hindsight clearly predictable, are the bane of any IT manager’s existence. “How did we not know and respond to this?” is a question nobody wants to have to answer.
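To make that a bit more concrete, here is a minimal sketch of the kind of trending a capacity forecasting tool automates, using made-up utilization numbers and a plain least-squares trend line (a real tool would also account for seasonality, bursts and confidence intervals):

    # Minimal capacity-forecast sketch: fit a straight-line trend to monthly
    # peak utilization samples and estimate when the link crosses 80%.
    # The sample data and the 80% threshold are hypothetical.
    monthly_peak_pct = [42, 45, 47, 51, 54, 58, 61, 63]   # last 8 months
    threshold_pct = 80

    n = len(monthly_peak_pct)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(monthly_peak_pct) / n

    # Ordinary least-squares slope (growth per month) and intercept
    slope = sum((x - x_mean) * (y - y_mean)
                for x, y in zip(xs, monthly_peak_pct)) / sum((x - x_mean) ** 2 for x in xs)
    intercept = y_mean - slope * x_mean

    months_left = (threshold_pct - intercept) / slope - (n - 1)
    print(f"Trend is ~{slope:.1f} points/month; "
          f"roughly {months_left:.0f} months until this link hits {threshold_pct}%.")

Even a crude projection like this turns “we’ll deal with it when it fills up” into a date on the calendar, which is the whole point of capacity planning.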

Application Performance Insight
Many network engineers have complained that the network would be stable if it weren’t for the end users. While it’s an amusing thought, it ignores the universal truth of IT—everything we do is because of, and for, end users. The whole point of having a network is to run the business applications end users need to do their jobs. Face it: applications are king. Technologies such as deep packet inspection, or packet-level analysis, can help you ensure the network is not the source of application performance problems.

Prepare for the Future

Now that we’ve covered the evolution of the network from past to present—and identified lessons we can learn from the network of yesterday and what the new essentials of monitoring and managing today’s network are—we can prepare for the future. So, stay tuned for part two in this series to explore what the future holds for the evolution of the network.

When It Comes to System Outages, Don’t Prepare For the Worst

NOTE: This article originally appeared here.

During the 2014 World Cup soccer competition, Nate Silver and the psychic witches he keeps in his basement — because how else could he make the predictions he does with such accuracy? — got it wrong. Really, really wrong. They were completely blindsided by Germany’s win over Brazil. As Silver described it, it was an entirely unforeseeable event.


In sports and, to a lesser extent, politics, the tendency in the face of these things is to eat the loss, chalk it up to a fluke — a black swan in statistics parlance — and get on with life.

But as network administrators, we know that’s not how it works in IT.

In my experience, when a black swan event affects IT systems, management usually acquires a dark obsession with the event. Meetings are called under the guise of “lessons learned exercises,” with the express intent of ensuring said system outages never happen again.

Don’t spend too much time studying what might occur

Now, I’m not saying that after a failure we should just blithely ignore any lessons that could be learned. Far from it, actually. In the ashes of a failure, you often find the seeds of future avoidance. One of the first things an IT organization should do after such an event is determine whether the failure was predictable, or if it was one of those cases where there wasn’t enough historical data to determine a decent probability.


If the latter is the case, I’m here to tell you your efforts are much better spent elsewhere. What’s a better approach? Instead of spending time trying to figure out if a probability may or may not exist, catch and circumvent those common, everyday IT annoyances. This is a tactic that’s overlooked far too often.

Don’t believe me? Well, let’s take the example of a not-so-imaginary company I know that had a single, spectacular IT failure that cost somewhere in the neighborhood of $100,000. Management was understandably upset. It immediately set up a task force to identify the root cause of the failure and recommend steps to avoid it in the future. Sounds reasonable, right?

The task force — five experts pulled from the server, network, storage, database and applications teams — took three months and an enormous number of staff-hours to investigate the root cause. Being conservative, let’s say the hourly cost to the company was $50. Add up five people’s time across those three months and it comes to a nice round $125,000.

Not so reasonable after all

Yes, at the end of it all, the root problem was not only identified — at least, as much as possible — but code was put in place to (probably) predict the next time the exact same event might occur. Doesn’t sound so bad. But keep this in mind: The company spent $25,000 more than the cost of the original failure to create a solution that may or may not predict the occurrence of a black swan exactly like the one that hit before.

Maybe it wasn’t so reasonable after all.

You may be thinking, “But where else are you saying we should focus? After all, we’re held accountable to the bottom line as much as anyone else in the company.”

I get that, and it’s actually my point. Let’s compare the previous example of chasing a black swan to another, far more common problem: network interface card (NIC) failures.

In this example, another not-so-fictitious company saw bandwidth usage spike and stay high. A NIC threw errors until transmission rates bottomed out, and eventually the card just up and died. The problem was that while bandwidth usage was monitored, there was no alerting in place for interfaces that stopped responding or disappeared (the company monitored the IP at the far end of the connection, which meant WAN links generated no alerts until the far end went down).

Let’s assume that a NIC failure takes an average of one hour to notice and correctly diagnose, and then two hours to fix by network administrators who cost the company $53 per hour. While the circuit is out, the company loses about $1,000 per hour in revenue, lost opportunity, etc. That means a system outage like this one could cost the company $3,106.

Setting a framework anchored by alerting and monitoring

Now, consider that, in my experience, proper monitoring and alerting reduces the time it takes to notice and diagnose problems such as NIC failures to 15 minutes. That’s it. Nothing else fancy, at least not in this scenario. But that simple thing could reduce the cost of the outage by $750.

I know those numbers don’t sound too impressive. That is, until you realize a moderately sized company can easily experience 100 NIC failures per year. That translates to more than $300,000 in lost revenue if the problem is unmonitored, and an annual savings of $75,000 if alerting is in place.

And that doesn’t take into account the ability to predict NIC failures and replace the card pre-emptively. If we estimate that 50% of the failures could be avoided using predictive monitoring, the savings could rise to more than $190,000.
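If you want to check my math, here is a quick sketch of the cost model using the same figures assumed above:

    # Back-of-the-envelope model using the figures assumed in this article.
    ADMIN_RATE = 53          # $/hour for a network administrator
    REVENUE_LOSS = 1000      # $/hour while the circuit is down
    REPAIR_HOURS = 2         # hands-on time to replace a failed NIC
    FAILURES_PER_YEAR = 100

    def outage_cost(detect_hours):
        """One failure: downtime revenue loss plus admin time for the repair."""
        downtime = detect_hours + REPAIR_HOURS
        return downtime * REVENUE_LOSS + REPAIR_HOURS * ADMIN_RATE

    unmonitored = outage_cost(detect_hours=1.0)    # ~1 hour to notice and diagnose
    monitored = outage_cost(detect_hours=0.25)     # ~15 minutes with alerting in place

    print(f"Per incident, unmonitored:  ${unmonitored:,.0f}")   # $3,106
    print(f"Per incident, monitored:    ${monitored:,.0f}")     # $2,356
    print(f"Annual cost, unmonitored:   ${unmonitored * FAILURES_PER_YEAR:,.0f}")
    print(f"Annual savings, alerting:   ${(unmonitored - monitored) * FAILURES_PER_YEAR:,.0f}")

    # If predictive monitoring heads off half of the failures entirely:
    remaining = 0.5 * FAILURES_PER_YEAR * monitored
    print(f"Annual savings, predictive: ${unmonitored * FAILURES_PER_YEAR - remaining:,.0f}")

Change any of the assumptions — admin rate, revenue per hour, failure count — and the model recalculates itself, which is exactly the kind of exercise to run before the next budget meeting.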

Again, I’m not saying preparing for black swan events isn’t a worthy endeavor, but when tough budget decisions need to be made, some simple alerting on common problems can save more than trying to predict and prevent “the big one” that may or may not ever happen.

After all, NIC failures are no black swan. I think even Nate Silver would agree they’re a sure thing.

Respect Your Elders

NOTE: This article originally appeared here.

“Oh Geez,” exclaimed the guy who sits 2 desks from me, “that thing is ancient! Why would they give him that?”

Taking the bait, I popped my head over the wall and asked “what is?”

He showed me a text message, sent to him from a buddy—an engineer (EE, actually) who worked for an oil company. My co-worker’s iPhone 6 displayed an image of a laptop we could only describe as “vintage”:

(A Toshiba Tecra 510CDT, which was cutting edge…back in 1997.)

“Wow,” I said. “Those were amazing. I worked on a ton of those. They were serious workhorses—you could practically drop one from a four-story building and it would still work. I wanted one like nobody’s business, but I could never afford it.”

“OK, back in the day I’m sure they were great,” said my 20-something coworker dismissively. “But what the hell is he going to do with it NOW? Can it even run an OS anymore?”

I realized he was coming from a particular frame of reference that is common to all of us in I.T. Newer is better. Period. With few exceptions (COUGH-Windows M.E.-COUGH), the latest version of something—be it hardware or software—is always a step up from what came before.

While true, it leads to a frame of mind that is patently untrue: a belief that what is old is also irrelevant. Especially for I.T. professionals, it’s a dangerous line of thought that almost always leads to unnecessary mistakes and avoidable failures.

In fact, ask any I.T. pro who’s been at it for a decade, and you’ll hear story after story:

  • When programmers used COBOL, back when dinosaurs roamed the earth, one of the fundamental techniques drilled into their heads was, “check your inputs.” Think about the latest crop of exploits—be it an SSLv3 issue like POODLE, a SQL injection, or any of a plethora of web-based security problems—and the fundamental flaw is the server NOT checking its inputs for sanity.
  • How about the OSI model? Yes, we all know it’s required knowledge for many certification exams (and at least one IT joke). But more importantly, it was (and still is) directly relevant to basic network troubleshooting.
  • Nobody needs to know CORBA data structures anymore, right? Except that a major monitoring tool was originally developed on CORBA, and that foundation has stuck. Which is why, if you try to create a folder-inside-a-folder more than 3 times, the entire system corrupts. CORBA (one of the original object-oriented technologies) could only handle 3 levels of object containership.
  • PowerShell can be learned without understanding Unix/Linux command-line concepts. But it’s sure EASIER to learn if you already know how to pipe ls into grep into awk into sort so that you get a list of just the files you want, sorted by date. Supporting that kind of technique (among other Unix/Linux concepts) was one of the original goals of PowerShell.
  • Older revs of industrial motion-control systems used specific pin-outs on the serial port. Newer USB-to-serial cables don’t mimic those pin-outs correctly, and trying to upload a program with the new cables will render the entire system useless.

And in fact, that’s why my co-worker’s buddy was handed one of those venerable Tecra laptops. It had a standard serial port, and it came preloaded with the vendor’s DOS-based ladder-logic programming utility. Nobody expected it to run Windows 10, but it fulfilled a role that modern hardware simply couldn’t.

It’s an interesting story, but you have to ask: aside from some interesting anecdotes and a few bizarre use cases, does this have any relevance to our day-to-day work?

You bet.

We live in a world where servers, storage, and now the network are rushing toward a quantum singularity of virtualization.

And the “old-timers” in the mainframe team are laughing their butts off as they watch us run in circles, inventing new words to describe techniques they learned at the beginning of their career; making mistakes they solved decades ago; and (worst of all) dismissing everything they know as utterly irrelevant.

Think I’m exaggerating? SAN and NAS look suspiciously like DASD, just on faster hardware. Services like Azure and AWS, for all their glitz and automation, aren’t as far from rented time on a mainframe as we’d like to imagine. And when my company replaces my laptop with a fancy “appliance” that connects to Citrix VDI session, it reminds me of nothing as much as the VAX terminals I supported back in the day.

My point isn’t that I’m a techno-Ecclesiastes shouting “there is nothing new under the sun!” Or some I.T. hipster who was into the cloud before it was cool. My point is that it behooves us to remember that everything we do, and every technology we use, had its origins in something much older than 20 minutes ago.

If we take the time to understand that foundational technology, we have the chance to avoid past missteps, leverage undocumented efficiencies built into the core of the tools, and build on ideas elegant enough to have withstood the test of time.


Got your own “vintage” story, or long-ago-learned tidbit that is still true today? Share it in the comments!

When, Not What, Defines Today’s Networking Career

Back in December, Cisco filed a lawsuit against Arista Networks, claiming that Arista’s network device operating system, EOS, is too similar to Cisco’s beloved IOS.

 This caused Tom Hollingsworth (a.k.a. “The Networking Nerd”) to speculate that this action presaged the ultimate death of the network device command-line interface (CLI).

Time will tell whether Hollingsworth is right or wrong and to what degree, but the idea intrigued me. Why would it matter if the command-line interface went away? What would be the loss?

Now, before going further, here’s a little background on me: I tend to be a “toaster” guy when it comes to technology. I don’t love my toaster or hate my toaster. I don’t proselytize the glorious features of my toaster to non-users. I just use my toaster. If it burns the toast, it’s time for a new toaster. Sure, over the years I’ve built up a body of experience that tells me my bagels tend to get stuck in the upright models, so I prefer the toaster/oven combos. But at the end of the day, it’s about making good toast.

Today’s networking career means learning new techniques

Jeez! Now I have a craving for a panini. Where was I? Oh right, technology.

I use a lot of technologies. My phone is Android. My work laptop runs Windows 8.1. My home desktop runs Linux. My wife lives on her iPad. So on and so forth. I’ve come to believe that learning technology is like learning to play cards.

The first game you learn is new, interesting and a little confusing, but ultimately thrilling because of the world it opens up. But the second card game, that’s the hard one. You know all the rules of the first game, but now there’s this other game that completely shatters what you knew. Then you learn your third card game, and you start to see the differences and similarities. By the fifth game, you understand that the cards themselves are just a vehicle for different ways of structuring sets.

I believe that’s why people are concerned about Hollingsworth’s prediction of the death of CLI. If you only know one game — and let’s face it, CLI is an extremely comprehensive and well-known “game” — and you’ve invested a lot of time and energy learning not only the rules of that game but also its nuances and tricks, finding out that game is about to be discontinued can be distressing. But when it comes to CLI, I believe that distress is actually due to a misplaced sense of self. Because you aren’t really a CLI professional, are you?

You’re a networking professional, not a CLI pro

Sure, you know a lot about CLI. But really, you’re a networking professional. Being able to configure Open Shortest Path First (OSPF) from memory makes your job easier. But your job is knowing what OSPF is, when to use it versus Enhanced Interior Gateway Routing Protocol (EIGRP), how to evaluate its performance and so on.

No, the concern about the death of CLI is really rooted in the fear of personal obsolescence. I’ve heard that notion repeated in conversations about the mainframe, Novell networking, WordPerfect 5.1 and dozens of other technologies that were brilliant in their time, but which, for various reasons, were superseded by something else — sometimes something else that is better, and sometimes not.

And a fear of personal obsolescence in your networking career is ultimately false, unless you are digging in your heels and choosing never to learn anything new again. (OK, that was sarcasm, folks. As IT pros, we should be committed to life-long learning. Even if you are two years away from retirement, learning new stuff is still A Good Thing™.) As long as you are open to new ideas, new techniques and yes, new systems, then you won’t become obsolete.

Employers exploit networkers’ insecurity

I’ll be honest. I think there are a lot of employers that exploit this insecurity. “Where’s a Perl script-kiddie like you going to find this kind of role?” they whisper — usually implicitly, although sometimes much more explicitly than any of us prefer. Or if we’re interviewing for a new job, they ask, “I see you have a lot of experience with AIX, but we’re a Windows shop. Do you really think your skills translate?”

I’m not here to talk about interviewing skills, salary negotiations or career improvements, so I’m not going to get into the potential responses, but I will say that the ultimate answer in each of these cases — and many others — is “Yes.” Why? Because it’s not about whether I know the fifth parameter of the gerfrinkel command in CodeMe version 12.3.9.7, which was deprecated in 12.3.9.8 in favor of the unglepacht function. It’s not about any of that. It’s about my experience on when to use certain commands, when to look for a workaround, how to manage an environment of this scale and scope and so on.

To play off the old joke about the copier repairman, a small part of your paycheck goes toward turning the screw; more of it is based on knowing which screw to turn.

As IT pros, we are paid — and are valuable — because we know how to find out which screw to turn and when to turn it. So to speak, of course.

“Logfile Monitoring” – I Do Not Think It Means What You Think It Means

This is a conversation I have A LOT with clients. They say they want “logfile monitoring,” and I am not sure what they mean. So I end up having to unwind all the different things it COULD be, so we can get to what they actually need.

It’s also an important clarification for me to make as SolarWinds Head Geek because, depending on what the requester means, I might need to point them toward Kiwi Syslog Server, Server & Application Monitor, or Log & Event Manager (LEM).

Here’s a handy guide to identify what people are talking about. “Logfile monitoring” is usually applied to 4 different and mutually exclusive areas. Before you allow the speaker to continue, please ask them to clarify which one they are talking about:

  1. Windows Logfile
  2. Syslog
  3. Logfile aggregation
  4. Monitoring individual text files on specific servers

More clarification on each of these areas below:

Windows Logfile

Monitoring in this area refers specifically to the Windows event log, which isn’t actually a log “file” at all, but a database unique to Windows machines.

In the SolarWinds world, the tool that does this is Server & Application Monitor. Or if you are looking for a small, quick, and dirty utility, the Eventlog Forwarder for Windows will take Eventlog messages that match a search pattern and pass them via Syslog to another machine.

Syslog

Syslog is a protocol that describes how to send a message from one machine to another on UDP port 514. The messages must fit a pre-defined structure. Syslog is different from SNMP traps. It is most often found when monitoring *nix (Unix, Linux) systems, although network and security devices send out their fair share as well.
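To show just how simple the transport really is, here is a minimal sketch that sends a syslog message using nothing but the Python standard library. The collector hostname is a placeholder, and a production setup would also care about facility, severity and message formatting (RFC 3164 vs. RFC 5424):

    # Minimal sketch: emit a syslog message over UDP port 514 using only the
    # Python standard library. "syslog.example.com" is a placeholder collector.
    import logging
    import logging.handlers

    handler = logging.handlers.SysLogHandler(
        address=("syslog.example.com", 514),      # UDP datagrams are the default
        facility=logging.handlers.SysLogHandler.LOG_LOCAL0,
    )
    logger = logging.getLogger("demo-app")
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)

    # Arrives at the collector as a local0.warning message from "demo-app"
    logger.warning("interface GigabitEthernet0/1 error count is rising")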

In terms of products, this is covered natively by Network Performance Monitor (NPM), but as I’ve said often, you shouldn’t send syslog or traps directly to your NPM primary poller. You should send them through a syslog/trap “filtration” layer first. And that would be the Kiwi Syslog Server (or its freeware cousin).

Logfile aggregation

This technique involves sending (or pulling) log files from multiple machines and collecting them on a central server. This collection is done at regular intervals. A second process then searches across all the collected logs, looking for trends or patterns in the enterprise. When the audit and security groups talk about “logfile monitoring,” this is usually what they mean.
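As a toy illustration of that “second process,” here is a sketch that assumes each host’s collected logs have already landed in their own directory on the central server, then scans all of them for a single pattern and reports the hit count per host (the paths and the pattern are placeholders):

    # Toy aggregation search across logs collected from many hosts. Assumes
    # each host's files land under /var/log/collected/<hostname>/*.log.
    import re
    from collections import Counter
    from pathlib import Path

    COLLECTED = Path("/var/log/collected")          # hypothetical collection root
    PATTERN = re.compile(r"authentication failure", re.IGNORECASE)

    hits = Counter()
    for logfile in COLLECTED.glob("*/*.log"):
        host = logfile.parent.name
        with open(logfile, errors="replace") as fh:
            hits[host] += sum(1 for line in fh if PATTERN.search(line))

    for host, count in hits.most_common():
        print(f"{host}: {count} matching lines")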

As you may have already guessed, the SolarWinds tool for this job is Log & Event Manager. I should point out that LEM will ALSO receive syslog and traps, so you kind of get a twofer if you have this tool. Although, I personally STILL think you should send all of your syslog and trap to a filtration layer, and then send the non-garbage messages to the next step in the chain (NPM or LEM).

Monitoring individual text files on specific servers

This activity focuses on watching a specific (usually plain text) file in a specific directory on a specific machine, looking for a string or pattern to appear. When that pattern is found, an alert is triggered. Now it can get more involved than that—maybe not a specific file, but a file matching a specific pattern (like a date); maybe not a specific directory, but the newest sub-directory in a directory; maybe not a specific string, but a string pattern; maybe not ONE string, but 3 occurrences of the string within a 5 minute period; and so on. But the goal is the same—to find a string or pattern within a file.
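As a rough sketch of that last variation—alerting only when a pattern shows up several times within a window—something like the following would do the trick. The file path, pattern and thresholds are stand-ins for whatever your application actually requires:

    # Sketch: tail one log file and "alert" when a pattern appears 3 or more
    # times within a 5-minute window. Path, pattern and thresholds are placeholders.
    import re
    import time
    from collections import deque

    LOGFILE = "/var/log/app/app.log"
    PATTERN = re.compile(r"ORA-\d{5}|connection refused")
    THRESHOLD, WINDOW_SECONDS = 3, 300

    matches = deque()
    with open(LOGFILE) as fh:
        fh.seek(0, 2)                  # start at the end of the file, like tail -f
        while True:
            line = fh.readline()
            if not line:
                time.sleep(1)          # nothing new yet; wait and check again
                continue
            if PATTERN.search(line):
                now = time.time()
                matches.append(now)
                while matches and now - matches[0] > WINDOW_SECONDS:
                    matches.popleft()  # drop matches that fell out of the window
                if len(matches) >= THRESHOLD:
                    print(f"ALERT: {len(matches)} matches in the last 5 minutes")
                    matches.clear()    # reset so we don't re-alert on every new line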

Within the context of SolarWinds, SAM has been the go-to solution for this type of thing. But at the moment, it does so only through a series of Perl, PowerShell, and VBScript templates.

We know that’s not the best way to get the job done, but that’s a subject for another post.

The More You Know…

For now, it’s important that you are able to clearly define—for yourself as well as your colleagues, customers, and consumers—which kind of “logfile monitoring” is being requested, and which tool or technique you need to employ to get the job done.

The Three Black Boxes (hint: the network was only the beginning)

NOTE: This post originally appeared here

Once upon a time, back in the dark ages of IT, people would sit around the table and talk about “the black box” and everyone knew what was meant—the network. The network, whether it was ARCNET, DECnet, Ethernet or even LANTastic, seemed inscrutable to them. They just connected their stuff to the wires and hoped for the best.

Back in those early days, conversations with us early network engineers often went something like this:

Them: “I think the network is slow.”
Us (who they considered pointy-hatted-wizards): “No, it’s not.”
Them: “Look, I checked the systems and they’re fine. I think it’s the network.”
Us: “Come on, it’s rarely ever the network.”
Them: “Well, I still think…”
Us: “Then you’ll need to prove it.”

It was so difficult for them to pierce this veil that—if urban legends on the subject are to be believed—the reason the image of a cloud is used to signify a network is because it was originally labeled by those outside the network with the acronym TAMO. This stood for, “then a miracle occurs,” and the cloud graphic reinforced the divine and un-knowable nature of bits flowing through the wire.

But we in the network knew it wasn’t a miracle, though it was still somewhat of a black box even to us—a closed system that took a certain type of input, implemented only somewhat monitorable processes inside and then produced a certain type of output.

With time, though, the network became much less of a black box to everyone. Devices, software and our general knowledge grew in sophistication so that we now have come to expect bandwidth metrics, packet error data, NetFlow conversations, deep packet inspection results, IPSLA and more to be available on demand and in near real-time.

But recently, two new black boxes have arrived on the scene. And this time, we net admins are on the outside with almost everyone else.

The first of these, virtualization—along with its commoditized cousin, cloud computing—has grown to the point where the count of physical servers in medium-to-large companies is sometimes just a tenth of the overall server count.

Ask an application owner if he knows how many other VMs are running on the same host and you’ll be met with a blank stare. Probe further by asking if he thinks a “noisy neighbor”—a VM on the same host that is consuming more resources than it should—is impacting his system, and he’ll look at you conspiratorially and say, “Well, I sure think there’s one of those, but heck if I could prove it.”

Still, we love virtual environments. We love the cost savings, the convenience and the flexibility they afford our companies. But don’t fool yourself—unless we’re actually on the virtualization team, we don’t really understand them one bit.

Storage is the other “new” black box. It presents the same challenge as virtualization, only worse. Disks are the building blocks of arrays, which are collected via a software layer into LUNs, which connect through a separate network “fabric” to be presented as datastores to the virtual layer or as contiguous disk resources to physical servers.

Ask that already paranoid application owner which actual physical disks his application is installed on and he’ll say you may as well ask him to point out a specific grain of sand on a beach.

Making the storage environment even more challenging is its blended nature. Virtualization, for all its complexity, is a binary choice. Your server is either virtualized or it’s not. Storage isn’t that clear cut—a physical server may have a single traditional platter-based disk for its system drive, connect to a SAN for a separate drive where software is installed and then use a local array of SSDs to support high-performance database I/O.

OK, so what does all this have to do with the network? Well, what’s most interesting about these new black boxes—especially to us network folk—is how they are turning networking back into a black box as well.

Think about it—software-based “virtual” switches distribute bandwidth from VMs to multiple network connections.

Also, consider that SAN “fabric” is often more software than hardware.

And then there is the rise of SDN, a promising new technology to be sure, but one that still needs to have some of the rough edges smoothed away.

The good news is that, like our original, inscrutable networking from the good old days, the ongoing drive towards maturity and sophistication will crack the lid on these two new black boxes and reverse the slide of the network back into one as well.

Even now it’s possible to use the convergence of networking, virtualization and storage to connect more dots than ever before. Because of the seamless flow from the disk through the array, LUN, datastore, hypervisor and on up to the actual application, we’re able to show—with a tip of the old fedora to detective Dirk Gently—the interconnectedness of all things. With the right tools in hand, we can now show how an array that is experiencing latency affects just about anything.

That paranoid application owner might even stop using his “they’re out to get me” coffee mug.

The Four Questions – Introduction

This article originally appeared in a scaled-down version here on PacketPushers.net. I’m posting it in its full form here as an introduction to the full series.


For people who are interested in monitoring, there is a leap that you make when you go from watching systems that YOU care about, to monitoring systems that other people care about.

When you are doing it for yourself, it’s all about ease of maintenance, getting good (meaning useful, interesting) data, and having the information at your fingertips to deflect accusations that YOUR system is down/slow/ugly/whatever.

But if you do that job well, and show up at enough meetings showing off your shiny happy data, inevitably you will get nominated/conscripted into the monitoring group where it is expected you will take as much interest in other people’s sh…tuff as your beloved systems from your former job.

And this is where things get especially tricky.

Assuming you LIKE monitoring as a discipline, and find it exciting to learn about different types of systems (and ways they can fail), you are going to want to provide the same levels of insight for your coworkers as you had for yourself.

Inevitably, you will find yourself answering The Four Questions. These are questions which—for reasons that will become apparent—you never really had to ask yourself when you were doing it on your own. The four questions—with brief explanations—are:

  1. Why did I get an alert?
    The person is not asking, “Why did this alert trigger at this time?” They are asking why they got the alert at all.
  2. Why didn’t I get an alert?
    Something happened that the owner of the system felt should have triggered an alert, but they didn’t receive one.
  3. What is being monitored on my system?
    What reports and data can be pulled for their system (and in what form) so they can look at trending, performance, and forensic information after a failure.
  4. What will alert on my system?
    I’d like to be able to predict under which conditions I will get an alert for this system.

…and the Fifth Beatle… I mean question.

5. What do you monitor “standard”?
What metrics and data are typically collected for systems like this? This is the inevitable (and logical) response when you say, “We put standard monitoring in place.”

In the coming (weeks/days/months/series) I’m going to explore each of these questions in-depth, and offer techniques you can use to respond to each one.

Using NetFlow monitoring tactics to solve bandwidth mysteries

NetFlow eliminated the hassle of network troubleshooting after a school complained about its Internet access.

Life in IT can be pretty entertaining. And apparently we admins aren’t the only ones who think so — Hollywood’s taken notice, too. The problem is, the television shows and feature films about IT are all comedies. I’m not saying we don’t experience some pretty humorous stuff, but I want a real show; you know, something with substance. I can see it now — The Xmit (and RCV) Files: The Truth is Out There.

 In fact, I’ve already got the pilot episode all planned. It’s based on an experience I had not long ago with the NetFlow monitoring protocol.

The company I was with at the time offered monitoring as a service to a variety of clients. One day, I was holding the receiver away from my head as a school principal shouted, “The Internet keeps going down! You’ve got to do something.”

Now, there are few phrases that get my goat like “the Internet is down,” or its more common cousin, “the network is down.” So, my first thought was, “Buddy, we have redundant circuits, switches configured for failover and comprehensive monitoring. The network is not down, so please shut up.”

Of course, that’s not what I said. Instead, I asked a few questions that would help me narrow down the root cause of the problem.

First up: “How often are you experiencing this issue?”

“A bunch,” I was told.

“Ooookay … at any particular time?” I asked.

He replied, “Well, it seems kind of random.”

Gee, thanks. I’m sure I can figure it out with such insightful detail.

It was obvious I was going to have to do some real investigation. My first check was the primary circuit to our provider. Nothing there. So, I’m sorry, Virginia, but “the Internet” is not down, as if I had any doubt.

Next, I looked at the school’s WAN interface, which revealed that yes indeed, the WAN link to the school was becoming saturated at various intervals during the day. Usage would spike for 20 to 30 minutes, then drop down until the next incident. I checked the firewall logs — not my favorite job — which showed a high volume of HTTP connections at the same times.

Now, for many years, checking was the pinnacle of network troubleshooting — check the devices, check the logs, wait for the event to happen again, dig a little further. And in my case, that might have been all I could do. Our contract had us monitoring the entire core data center for the school system, but that only extended to the edge router for the school. We had exactly zero visibility beyond each individual school building’s WAN connection.

But as chance would have it, I had one more trick up my sleeve: NetFlow.

NetFlow has been around a while, but it’s only in the last few years that it’s entered the common network admin lexicon, largely due to the maturation of tools that can reliably and meaningfully collect and display NetFlow data. NetFlow collects and correlates “conversations” between two devices as the data passes through a router. You don’t have to monitor the specific endpoints in the conversation; you just have to get data from one router that sits between the two points.
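To make the idea of “conversations” a little more concrete, here is a toy sketch that totals up bytes per source/destination pair from flow records already exported to CSV and prints the top talkers. The file name and the column names are assumptions—every collector labels its export fields a little differently:

    # Toy "top talkers" report built from flow records exported to CSV.
    # The file name and the column names (src, dst, bytes) are assumptions.
    import csv
    from collections import Counter

    conversations = Counter()
    with open("flows.csv", newline="") as fh:
        for row in csv.DictReader(fh):
            conversations[(row["src"], row["dst"])] += int(row["bytes"])

    print("Top 10 conversations by volume:")
    for (src, dst), total in conversations.most_common(10):
        print(f"{src:>18} -> {dst:<18} {total / 1e6:8.1f} MB")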


Hmm, that sounds a lot like a WAN router connected to the Internet provider, which is exactly what I had. Correlating the spike times from our bandwidth stats, we saw that during the same period, 10 MAC addresses were starting conversations with YouTube. Every time there was a spike, it was the same set of MAC addresses.

Now, if we had been monitoring inside the school, we could have gleaned much more information — IP address, location, maybe even username if we had some of the more sophisticated user device tracking tools in place — but we weren’t. However, a visit to Wireshark’s OUI Lookup Tool revealed that all 10 of those MAC addresses belonged to — and please forgive the irony — Apple Inc.

At that point, I had exhausted all of the tools at my disposal. So, I called the principal back and gave him the start and stop times of the spikes, along with the information about 10 Apple products being to blame.

“Wait, what time was that?” he asked.

I repeated the times.

“Oh, for the love of … I know what the problem is.” Click.

It turns out the art teacher had been awarded a grant for 10 shiny new iPads. He would go from room to room during art period handing them out and teaching kids how to do video mashups.

This was one of those rare times when a bandwidth increase really was warranted, and after the school’s WAN circuit was reprovisioned, the Internet stopped mysteriously “going down.”

The episode would close with the handsome and sophisticated admin — played by yours truly, of course — looking into the camera and, channeling the great Fox Mulder, saying, “Remember, my fellow admins, the truth is out there.” (And, I would add for those of you reading this blog post, don’t forget how valuable NetFlow can be in finding network truth.)

Now, if that’s not compelling TV, I don’t know what is.

(This article originally appeared on SearchNetworking)

IT Monitoring Scalability Planning: 3 Roadblocks

Planning for growth is key to effective IT monitoring, but it can be stymied by certain mindsets. Here’s how to overcome them.

As IT professionals, planning for growth is something we do all day almost unconsciously. Whether it’s a snippet of code, provisioning the next server, or building out a network design, we’re usually thinking: Will it handle the load? How long until I’ll need a newer, faster, or bigger one? How far will this scale?

Despite this almost compulsive concern with scalability, there are still areas of IT where growth tends to be an afterthought. One of these happens to be my area of specialization: IT monitoring. So, I’d like to address growth planning (or non-planning) as it pertains to monitoring by highlighting several mindsets that typically hinder this important, but often surprisingly overlooked element, and showing how to deal with each.

The fire drill mindset
This mindset occurs when something bad has already happened, either because there was no monitoring solution in place or because the existing toolset didn’t scale far enough to detect a critical failure, and so the problem was missed. Regardless, the result is usually a focus on finding a tool that would have caught the problem that already occurred, and finding it fast.

However, short of a TARDIS, there’s no way to implement an IT monitoring tool that will help avoid a problem after it occurs. Furthermore, moving too quickly as a result of a crisis can mean you don’t take the time to plan for future growth, focusing instead solely on solving the current problem.

My advice is to stop, take a deep breath, and collect yourself. Start by quickly but intelligently developing a short list of possible tools that will both solve the current problem and scale with your environment as it grows. Next, ask the vendors if they have free (or cheap) licenses for in-house demos and proofs of concept.

Then, and this is where you should let the emotion surrounding the failure creep back in, get a proof-of-concept environment set up quickly and start testing. Finally, make a smart decision based on all the factors important to you and your environment. (Hint: one of those factors should always be scalability.) Then implement the tool right away.

The bargain hunter
The next common pitfall that often prevents better growth planning when implementing a monitoring tool is the bargain-hunter mindset. This usually occurs not because of a crisis, but when there is pressure to find the cheapest solution for the current environment.

How do you overcome this mindset? Consider the following scenario: If your child currently wears a size 3 shoe, you absolutely don’t want to buy a size 5 today, right? But you should also recognize that your child is going to grow. So, buying enough size 3 shoes for the next five years is not a good strategy, either.

Also, if financials really are one of the top priorities preventing you from better preparing for future growth, remember that the cheapest time to buy the right-sized solution for your current and future environment is now. Buying a solution for your current environment alone because “that’s all we need” is going to result in your spending more money later for the right-sized solution you will need in the future. (I’m not talking about incrementally more, but start-all-over-again more.)

My suggestion is to use your company’s existing business growth projections to calculate how big a monitoring solution you need. If your company foresees 10% revenue growth each year over the next three years and then 5% each year after that, and you are willing to consider completely replacing your monitoring solution after five years, then buy a product that can scale to at least 40 to 50 percent more than the size you currently need.
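If you want to sanity-check that sizing, the arithmetic is just compounding the growth projections. A quick sketch using the percentages from the example:

    # Compound the example growth projections to size a monitoring purchase:
    # 10% annual growth for three years, then 5% per year, replaced after five.
    growth_by_year = [0.10, 0.10, 0.10, 0.05, 0.05]

    scale = 1.0
    for g in growth_by_year:
        scale *= 1 + g

    print(f"Plan for roughly {scale:.2f}x current capacity "
          f"(about {100 * (scale - 1):.0f}% more than you need today).")
    # -> roughly 1.47x, i.e. 40 to 50 percent more than today's environment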

The dollar auction
The dollar auction mindset happens when there is already a tool in place — a tool that wasn’t cheap and that a lot of time was spent perfecting. The problem is, it’s no longer perfect. It needs to be replaced because company growth has expanded beyond its scalability, but the idea of walking away from everything invested in it is a hard pill to swallow.

Really, this isn’t so much a mindset that prevents preparing for future growth as it is an all-too-often-overlooked lesson: if only you had planned better for future growth the first time around. The reality is that if you’re experiencing this mindset, you need a new solution. However, don’t make the same mistake. This time, take scalability into account.

Whether you’re suffering from one of these mindsets or another that is preventing you from better preparing your IT monitoring for future growth, remember, scalability is key to long term success.

(This article originally appeared on NetworkComputing)