Category Archives: Monitoring

ICYMI: Preparing For the Big One (or not)

(This originally appeared on Data Center Journal)

During last year’s  [ed: 2014] World Cup soccer competition, Nate Silver and the psychic witches he keeps in his basement — because how else could he make the predictions he does with such accuracy? — got it wrong. Really, really wrong. They were completely blindsided by Germany’s win over Brazil. As Silver described it, it was a completely unforeseeable event.

In sports and, to a lesser extent, politics, the tendency in the face of these things is to eat the loss, chalk it up to a fluke — a black swan in statistics parlance — and get on with life.

But as network administrators, we know that’s not how it works in IT.

In my experience, when a black swan event affects IT systems, management usually acquires a dark obsession with the event. Meetings are called under the guise of “lessons learned exercises,” with the express intent of ensuring said system outages never happen again.

Don’t spend too much time studying what might occur

Now, I’m not saying that after a failure we should just blithely ignore any lessons that could be learned. Far from it, actually. In the ashes of a failure, you often find the seeds of future avoidance. One of the first things an IT organization should do after such an event is determine whether the failure was predictable, or if it was one of those cases where there wasn’t enough historical data to determine a decent probability.

If the latter is the case, I’m here to tell you your efforts are much better spent elsewhere. What’s a better approach? Instead of spending time trying to figure out if a probability may or may not exist, catch and circumvent those common, everyday IT annoyances. This is a tactic that’s overlooked far too often.

Don’t believe me? Well, let’s take the example of a not-so-imaginary company I know that had a single, spectacular IT failure that cost somewhere in the neighborhood of $100,000. Management was understandably upset. It immediately set up a task force to identify the root cause of the failure and recommend steps to avoid it in the future. Sounds reasonable, right?

The task force — five experts pulled from the server, network, storage, database and applications teams — took three months to investigate the root cause, with each member sinking roughly 500 staff-hours into the effort. Being conservative, let's say the hourly cost to the company was $50. Multiply that by five people, then by 500 hours. It comes to a nice round $125,000.
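For anyone who likes the math spelled out, here's a quick back-of-the-envelope sketch; every figure is the hypothetical from the paragraph above, not real billing data.

    # Task-force cost, using the hypothetical figures above.
    PEOPLE = 5
    HOURS_EACH = 500        # roughly 500 staff-hours per person over the three months
    HOURLY_COST = 50.0      # conservative fully loaded cost, dollars per hour

    task_force_cost = PEOPLE * HOURS_EACH * HOURLY_COST
    print(task_force_cost)  # 125000.0 -- versus the $100,000 the original failure cost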

Not so reasonable after all

Yes, at the end of it all the root problem was not only identified — at least, as much as possible — but code was put in place to (probably) predict the next time the exact same event might occur. Doesn't sound so bad. But keep this in mind: The company spent $25,000 more than the cost of the original failure to create a solution that may or may not predict the next occurrence of a black swan exactly like the one that hit before.

Maybe it wasn’t so reasonable after all.

You may be thinking, "But where else are you saying we should focus? After all, we're held accountable to the bottom line as much as anyone else in the company."

I get that, and it’s actually my point. Let’s compare the previous example of chasing a black swan to another, far more common problem: network interface card (NIC) failures.

In this example, another not-so-fictitious company saw bandwidth usage spike and stay high. NICs threw errors until the transmission rates bottomed out, and eventually the card just up and died. The problem was that while bandwidth usage was monitored, there was no alerting in place for interfaces that stopped responding or disappeared (the company monitored the IP address at the far end of the connection, which meant WAN links generated no alerts until the far end went down).

Let's assume that a NIC failure takes an average of one hour to notice and correctly diagnose, and then two hours for network administrators, who cost the company $53 per hour, to fix. While the circuit is out, the company loses about $1,000 per hour in revenue, lost opportunity, etc. That means a system outage like this one costs the company $3,106.
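If you want to sanity-check that number, here's a quick sketch of the math; the rates and durations are the illustrative assumptions from the paragraph above.

    # Cost of a single unmonitored NIC failure, using the assumptions above.
    HOURS_TO_NOTICE = 1.0        # time to notice and diagnose without alerting
    HOURS_TO_FIX = 2.0           # hands-on repair time
    ADMIN_RATE = 53.0            # network admin cost, dollars per hour
    REVENUE_LOSS_RATE = 1000.0   # revenue/opportunity lost per hour of outage

    outage_hours = HOURS_TO_NOTICE + HOURS_TO_FIX
    labor_cost = HOURS_TO_FIX * ADMIN_RATE            # 106
    revenue_cost = outage_hours * REVENUE_LOSS_RATE   # 3,000
    print(labor_cost + revenue_cost)                  # 3106.0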

Setting a framework anchored by alerting and monitoring

Now, consider that, in my experience, proper monitoring and alerting reduces the time it takes to notice and diagnose problems such as NIC failures to 15 minutes. That’s it. Nothing else fancy, at least not in this scenario. But that simple thing could reduce the cost of the outage by $750.

I know those numbers don’t sound too impressive. That is, until you realize a moderately sized company can easily experience 100 NIC failures per year. That translates to more than $300,000 in lost revenue if the problem is unmonitored, and an annual savings of $75,000 if alerting is in place.

And that doesn’t take into account the ability to predict NIC failures and replace the card pre-emptively. If we estimate that 50% of the failures could be avoided using predictive monitoring, the savings could rise to more than $190,000.
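Here's one way those annual numbers work out, continuing the sketch above. The assumptions: alerting cuts notice-and-diagnose time from 60 minutes to 15, and predictive monitoring lets you avoid half of the 100 failures entirely while the remaining half are still caught by alerting.

    # Annual impact, extending the per-failure math above.
    FAILURES_PER_YEAR = 100
    COST_UNMONITORED = 3106.0                               # per failure, from the earlier sketch

    # With alerting: 15 minutes to notice, 2 hours to fix, 2.25 hours of lost revenue.
    cost_with_alerting = (2.0 * 53.0) + (2.25 * 1000.0)     # 2,356 per failure
    savings_per_failure = COST_UNMONITORED - cost_with_alerting   # 750

    annual_unmonitored = FAILURES_PER_YEAR * COST_UNMONITORED          # 310,600 ("more than $300,000")
    annual_alerting_savings = FAILURES_PER_YEAR * savings_per_failure  # 75,000

    # Predictive monitoring: half the failures never happen at all.
    remaining_failures = FAILURES_PER_YEAR * 0.5
    annual_predictive_cost = remaining_failures * cost_with_alerting          # 117,800
    annual_predictive_savings = annual_unmonitored - annual_predictive_cost   # 192,800 ("more than $190,000")

    print(annual_unmonitored, annual_alerting_savings, annual_predictive_savings)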

Again, I’m not saying preparing for black swan events isn’t a worthy endeavor, but when tough budget decisions need to be made, some simple alerting on common problems can save more than trying to predict and prevent “the big one” that may or may not ever happen.

After all, NIC failures are no black swan. I think even Nate Silver would agree they’re a sure thing.

SCP Exam Overview – My Perspective

(NOTE: This is an OLD post from THWACK, which I’m re-posting here for posterity. Much about the SCP program has changed since writing this, and you can expect an updated post soon.)

I like to take tests. I'm just weird that way. At one of my jobs, they actually put "exam hamster" on my business cards. For me, it's like doing a crossword puzzle, and most of the time I don't have a lot riding on the exam. Plus, with IT certification tests, I know I can usually retake them if I bomb horribly. So, at the very worst, taking a test and doing badly is just a way of finding out EXACTLY what kinds of questions that test is going to ask.

I recently took the SCP exam “cold”. Meaning I looked over the sample questions, watched a couple of the online videos, and then said “what the hell” and dove in.

Now, "cold" for me means: I have used SolarWinds (on and off) since 2004, I passed my CCNA (also in 2004; it has since lapsed), and I've been working with monitoring tools (BMC Patrol, Tivoli, OpenView, Nagios, Zenoss, etc.) for the last 11 years. But the point is, I didn't intensively study the SCP prep material to make sure I had the "right" answers.

The rest of the guys on my team want to take the test so I wrote up an overview of the exam for them, which appears below. I thought I would share it here in case:

  1. you don’t share my love of tests
  2. you aren’t sure if you are ready
  3. you don’t want to waste your money/time/hair/stomach lining by feeling unprepared.

(NOTE: I checked with the SolarWinds Exam Overlords, to make sure I’m not giving too much away here. Just in case you are worried about that kind of thing. I was.)

Test Mechanics Overview

  • The test is online. You don’t go into a testing center. You can take it from work, home, the coffee shop, your secret lair in your parent’s basement, etc.
  • The test is made up of 77 randomly selected multiple choice questions.
  • The test is not adaptive. You will answer all 77 questions.
  • The exam is “one way” – no “back”, no “skip”, no “pause” and no “review my answers”
  • Most questions have 1 answer.
  • A few have multi-answer (but it will tell you how many – ie: “Pick the best 2 from the choices below”).
  • Partial answers are marked as wrong.
  • Blank answers are marked as wrong
  • If you accidentally leave a blank or partial answer, you’ll get a warning prompt. But if you confirm, it’s done.
  • You have 90 minutes to complete the test
    • DON’T PANIC! That’s a little over 1 minute per question (the quick math is sketched just after this list). PLENTY of time. Seriously.
    • No seriously. Make a fist and squeeze it as hard as you can for 60 seconds. That’s how long you have to think about and answer EACH question.
  • 70% is passing.
  • Every question is worth the same (ie: questions are not weighted)
    • That means you need 54 correct answers to pass.
    • Or to put it another way, you can get 23 questions wrong and still pass.
  • You have 3 attempts to pass the exam
  • You must wait 14 days between attempts
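If you want to double-check the timing and scoring bullets above, the math is nothing more than this:

    # SCP exam timing and scoring math.
    import math

    QUESTIONS = 77
    MINUTES = 90
    PASS_RATE = 0.70

    seconds_per_question = MINUTES * 60 / QUESTIONS      # ~70 seconds each
    needed_correct = math.ceil(QUESTIONS * PASS_RATE)    # 54 correct to pass
    allowed_wrong = QUESTIONS - needed_correct           # 23 you can afford to miss

    print(seconds_per_question, needed_correct, allowed_wrong)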

Am I Ready?

There are a couple of ways I think you can confirm you are ready:

  1. You go through the sample tests and you not only get the answers right, you understand:
    1. The broader topic they are discussing (netflow, router configuration, firewalls, troubleshooting)
    2. WHY the right answer is the right one (versus the others)
    3. HOW the other answers are wrong for this situation
    4. WHERE those other answers would be the correct answer
  2. When you read one of the sample test questions, you know what screen/utility they are talking about and you can get to that same screen and use it (maybe not for what THEY are asking, but you know how to get around in it).

General testing ideas (they work for any test)

  • TAKE YOUR TIME
  • Don’t give up.
    • I watched a guy bail on his CCNA exam with 10 questions left. When the proctor ran the score, he missed passing by 5 points.
  • If you are really stumped, start over by looking at the answers first, and then seeing which one(s) seem to fit the question.
  • Remember, this is a SolarWinds exam. The right answer is always from the SW perspective (ie: if you have a choice between “do it with a DOS batch file” and “Do it with a SolarWinds SAM script” and you aren’t sure, SolarWinds is your better bet).
  • If you don’t know the answer and one of the answers is significantly longer than the rest, that’s a good bet as well.
  • If you really don’t know, eliminate the stupid answers (there’s usually at least one) and then guess.

Good Ideas for the SCP

  • This is “open book” – have a separate browser window with tabs open to google, thwack, and the NPM admin guide as well as browser AND RDP open to the polling engine (assuming you have one handy) so you can check things out before hitting “submit”
  • Also have a calculator open
  • Also have a subnet calculator open

Specific thoughts on each of the sections

** Indicates thoughts I added after I took the test

Network Management Fundamentals

  • Know the OSI model (come on dude: All People Seem To Need Data Processing) and how SolarWinds stuff (SNMP, Netflow, SSH, etc) maps to it.
  • Ping is ICMP (no port)
  • SNMP poll is UDP port 161, trap is 162
  • Syslog is port 514
  • NetFlow is UDP 2055
  • Netflow is push-based. When the router sees a conversation is done, it pushes the information to the configured Netflow receiver(s)
  • WMI requires 1) the service to be enabled on the target server, 2) all ports over 1024, 3) a windows user account (domain or local)
  • Know the terms MIB, OID, Perfmon Counter
  • ** Know the very basic basics of subnetting (ie: that 10.10.12.1 and 10.10.15.1 both fall within a subnet with mask 255.255.240.0; see the sketch just after this list)
  • ** Know the IOS string to configure a router for SNMP (traps and poll)
  • ** Know the basic concept of how to configure an ACL on a router
  • ** Know what NetFlow is, how it works, etc.
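Since the subnetting item above trips people up, here's a minimal sketch that verifies the claim using Python's built-in ipaddress module: 255.255.240.0 is a /20 mask, so 10.10.0.0/20 runs from 10.10.0.0 through 10.10.15.255.

    # Verify the subnetting example above with the standard library.
    import ipaddress

    network = ipaddress.ip_network("10.10.0.0/20")   # same as mask 255.255.240.0

    for host in ("10.10.12.1", "10.10.15.1", "10.10.16.1"):
        print(host, ipaddress.ip_address(host) in network)
    # 10.10.12.1 True
    # 10.10.15.1 True
    # 10.10.16.1 False (falls in the next /20)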

Network Management Planning

  • Protocol “load” on the network from least to most: ICMP, SNMP, WMI, NetFlow
  • Document goals, take a baseline
  • Know how to build a report
  • When do you need a distributed solution?
    • Shaky connections back to the poller
    • Redundancy
    • ACL issues
  • Understand SolarWinds licensing model

Network Management Operations

  • Know the SNMP versions (1, 2c, 3) and what each added to the standard
  • Know the NetFlow versions (5, 9 and IPFIX) and the basic features of each
  • Know the levels of syslog logging in order (listed in the sketch just after this list)
  • Know what SNMP provides versus Netflow
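For the syslog item above, a small cheat-sheet snippet; these are the standard severity levels from the syslog RFCs (3164/5424), where a lower number means more severe.

    # Syslog severity levels, in order from most to least severe.
    SYSLOG_SEVERITIES = {
        0: "Emergency",      # system is unusable
        1: "Alert",          # action must be taken immediately
        2: "Critical",
        3: "Error",
        4: "Warning",
        5: "Notice",         # normal but significant condition
        6: "Informational",
        7: "Debug",
    }

    for level, name in SYSLOG_SEVERITIES.items():
        print(level, name)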

Network Fault & Performance Troubleshooting

  • Know the OSI model, and where (what layer) you would find: telnet, ping, SSH, ACLs, SNMP, syslog and NetFlow (see the cheat sheet just after this list)
  • ** Know the format of the OSPF syslog message you’ll see when routing is failing
  • ** Think through the SaaS CRM example, all the different ways it could fail, and how you would determine which one it is (ping stops on their network…etc)
  • ** Understand the various levels of counters (node details pages) in the virtualization “stack” (vCenter, cluster, datacenter, host, guests)
  • ** Know what SolarWinds can tell you about virtual environments that would prompt you to change them (ie: how do you know when it’s time to add more hosts?)
  • Understand general VMware concepts: VirtualCenter/vCenter, cluster, datacenter, host, guest; what happens when you go P2V, etc.
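For the OSI item at the top of that list, here's a rough cheat sheet; these are the commonly accepted exam answers, and real-world placement can be fuzzier (ACLs, for instance, are written against Layer 3/4 fields).

    # Rough OSI-layer placement of the items mentioned above (exam-answer style).
    OSI_LAYER = {
        "ping (ICMP)": "3",   # network layer, no TCP/UDP port
        "ACLs": "3-4",        # match on IP addresses and ports
        "telnet": "7",
        "SSH": "7",
        "SNMP": "7",
        "syslog": "7",
        "NetFlow": "7",       # the export rides on UDP, but the protocol itself is an application
    }

    for item, layer in OSI_LAYER.items():
        print(f"{item}: layer {layer}")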

SolarWinds NPM Administration

  • Obviously, this is the biggest section
  • Understand escalation triggers
  • Know the basics of the SolarWinds Engineer’s Toolset and how to integrate it with NPM
  • ** Understand HOW reportwriter works (how to create, clone, import, export – including to/from thwack)
  • ** Understand report design – timeframes, groupings, summarization
  • ** Understand WHAT report scheduler does (but not necessarily HOW to configure it)
  • ** Understand account settings – especially limitations and how they work
  • ** Understand network discovery, including the one we never use (seed file)
  • ** Know how to create a trap alert versus a regular alert (ditto for syslog)
  • ** Know the indicators that tell you the SolarWinds installation (database, etc) is over capacity

#FeatureFriday: Improving Alerts with Query Execution Plans

Welcome to “Feature Friday”, a series of (typically short) videos which explain a feature, function, or technique.

Alerts are, for many monitoring engineers, the bread and butter of their job. What many fail to recognize is that, regardless of how graphical and “Natural English Language” the alert builder appears, what you are really creating is a query. Often it is a query that runs frequently (every minute or even more often) against the entire database.

Because of that, a single, poorly constructed query can have a huge (and hugely negative) impact on overall performance. Get a few bad eggs, and the rest of the monitoring system – polling, display, reports, etc. – can slow to a crawl, or even grind to a halt.

Luckily there’s a tool which can help you discover a query’s execution performance, and identify where the major bottlenecks are.

In the video below, my fellow SolarWinds Head Geeks and I channel our inner SQLRockstar and dive into query execution plans and how to apply that technique to SolarWinds Orion alerts.


For more insights into monitoring, as well as random silliness, you can follow me on Twitter (@LeonAdato) or find me on the SolarWinds THWACK.com forums (@adatole)

Blueprint: The Evolution of the Network, Part 2

NOTE: This article originally appeared here.

If you’re not prepared for the future of networking, you’re already behind.

That may sound harsh, but it’s true. Given the speed at which technology evolves compared to the rate most of us typically evolve in terms of our skillsets, there’s no time to waste in preparing ourselves to manage and monitor the networks of tomorrow. Yes, this is a bit of a daunting proposition considering the fact that some of us are still trying to catch up with today’s essentials of network monitoring and management, but the reality is that they’re not really mutually exclusive, are they?

In part one of this series, I outlined how the networks of today have evolved from those of yesteryear, and what today’s new essentials of network monitoring and management are as a consequence. If you paid careful attention, you likely picked up on how the lessons from the past that I described helped shape those new essentials.

Similarly, today’s essentials will help shape those of tomorrow. Thus, as I said, getting better at leveraging today’s essentials of network monitoring and managing is not mutually exclusive from preparing for the networks of tomorrow.

Before delving into what the next generation of network monitoring and management will look like, it’s important to first explore what the next generation of networking will look like.

On the Horizon

Above all else, one thing is for certain: We networking professionals should expect tomorrow’s technology to create more complex networks resulting in even more complex problems to solve. With that in mind, here are the top networking trends that are likely to shape the networks of the future:

Networks growing in all directions
Fitbits, tablets, phablets and applications galore. The explosion of IoT, BYOD, BYOA and BYO-everything else is upon us. With this trend still in its infancy, the future of connected devices and applications will be not only about the quantity of connected devices, but also about the quality of their connections and the network bandwidth they consume.

But it goes beyond the gadgets end users bring into the environment. More and more, commodity devices such as HVAC infrastructure, environmental systems such as lighting, security devices and more all use bandwidth—cellular or WiFi—to communicate outbound and receive updates and instructions inbound. Companies are using, or planning to use, IoT devices to track products, employees and equipment. This explosion of devices that consume or produce data will, not might, create a potentially disruptive surge in bandwidth consumption, security concerns, and monitoring and management requirements.

IPv6 eventually takes the stage…or sooner (as in now!)
Recently, ARIN was unable to fulfill a request for IPv4 addresses because the request was greater than the contiguous blocks available. Meanwhile, IPv6 is now almost always enabled by default and is therefore creating challenges for IT professionals even if they, and their organizations, have committed to putting off their own IPv6 decisions. The upshot of all this is that IPv6 is a reality today. There is an inevitable and quickly approaching moment when switching over will no longer be an option, but a requirement.

SDN and NFV will become the mainstream
Software defined networking (SDN) and network function virtualization (NFV) are just in their infancy and should be expected to become mainstream in the next five to seven years. With SDN and virtualization creating new opportunities for hybrid infrastructure, a serious look at adoption of these technologies is becoming more and more important.

So long WAN Optimization, Hello ISPs
There are a number of reasons WAN optimization technology is being, and will continue to be, kicked to the curb with greater fervency. With bandwidth increases outpacing the ability of CPUs and custom hardware to perform deep inspection and optimization, and with ISPs helping to circumvent the cost and complexity associated with WAN accelerators, WAN optimization will only see the light of tomorrow in unique use cases where the rewards outweigh the risks. As most of us will admit, WAN accelerators are expensive and complicated, making ISPs more and more attractive. Their future inside our networks is certainly bright.

Farewell L4 Firewalling 
With the mass of applications and services moving towards web-based deployment, using Layer 4 (L4) firewalls to block these services entirely will not be tolerated. A firewall incapable of performing deep packet analysis and understanding the nature of the traffic at Layer 7 (L7), the application layer, will not satisfy the level of granularity and flexibility that most network administrators should offer their users. On this front, change is clearly inevitable for us network professionals, whether it means added network complexity and adapting to new infrastructures or simply letting withering technologies go.

Preparing to Manage the Networks of Tomorrow  

So, what can we do to prepare to monitor and manage the networks of tomorrow? Consider the following:

Understand the “who, what, why and where” of IoT, BYOD and BYOA
Connected devices cannot be ignored. According to 451 Research, mobile Internet of Things (IoT) and Machine-to-Machine (M2M) connections will increase to 908 million in just five years, compared with 252 million last year. This staggering statistic should prompt you to start creating a plan of action for how you will manage nearly four times the number of devices infiltrating your networks today.

Your strategy can either aim to manage these devices within the network or set an organizational policy to regulate traffic altogether. As nonprofit IT trade association CompTIA noted in a recent survey, many companies are trying to implement partial (and even zero) BYOD policies to regulate security and bandwidth issues. Even though policies may seem like an easy fix, curbing all of tomorrow’s BYOD/BYOA is nearly impossible. As such, you will have to understand your network device traffic at a granular level in order to optimize and secure it. Even more so, you will need to understand traffic from devices that aren’t even in your direct control, like the tablets, phablets and Fitbits, to properly isolate issues.

Know the ins and outs of the new mainstream 
As stated earlier, SDN, NFV and IPv6 will become the new mainstream. We can start preparing for these technologies’ future takeovers by taking a hybrid approach to our infrastructures today. This will put us ahead of the game with an understanding of how these technologies work, the new complexities they create and how they will ultimately affect configuration management and troubleshooting ahead of mainstream deployment.

Start comparison shopping now
Going through the exercise of evaluating ISPs, virtualized network options and other on-the-horizon technologies—even if you don’t intend to switch right now—will help you nail down your particular requirements. Sometimes, knowing that a vendor has, or works with, a technology you don’t need right now (such as IPv6) but might need later can and should influence your decision.

Brick in, brick out
Taking on new technologies can feel overwhelming to those of us with “boots on the ground” because the new technology can often simply seem like one more mouth to feed, so to speak. As much as possible, look for ways that potential new additions will not just enhance, but replace the old guard. Maybe your new real-time deep packet inspection won’t completely replace L4 firewalls, but if it can reduce them significantly—while at the same time increasing insight and the ability to respond intelligently to issues—then the net result should be a better day for you. If you don’t do this, then more often than not, new technology will indeed simply seem to increase workload and do little else. This is also a great measuring stick to identify new technologies whose time may not have truly come just yet, at least not for your organization.

At a more basic layer, if you have to replace three broken devices and you realize that the newer equipment is far more manageable or has more useful features, consider replacing the entire fleet of old technology even if it hasn’t fallen apart yet. The benefits of consistency often far outweigh the initial pain of sticker shock.

To conclude this series, my opening statement from part one merits repeating: learn from the past, live in the present and prepare for the future. The evolution of networking waits for no one. Don’t be left behind.

Change is Good… For Other People

Things really do stay the same more than they change.

And I’ll argue that they do so because we want them to stay the same. When you are responsible for monitoring thousands of devices, and you’ve built a career on your guru-like expertise in a particular toolset, the last thing you want is for everything (or even part of everything) to change radically.

If you are a MIB wizard, your worst fear may be that everything goes to REST API calls. If you’ve spent years learning the ins and outs of a vendor’s database, the last thing you want to hear is that they’re moving to NoSQL.

So how will we respond to the pressures of IoT, SDN, hybrid cloud? Heck, how are we responding to the pressure of BYOD?

Are you going to try to tackle it with more of the same old?

Or is it finally time to re-think the way YOU do things, and let the vendors catch up to you for once?

Blueprint: The Evolution of the Network, Part 1

NOTE: This article originally appeared here.

Learn from the past, live in the present and prepare for the future.

While this may sound like it belongs hanging on a high school guidance counselor’s wall, they are words to live by, especially in IT. They apply perhaps to no other infrastructure element better than the network. After all, the network has long been a foundational building block of IT, it’s even more important today than it was in the days of SAGE and ARPANET, and its importance will only continue to grow in the future while simultaneously becoming more complex.

For those of us charged with maintaining the network, it’s valuable to take a step back and examine the evolution of the network. Doing so helps us take an inventory of lessons learned—or the lessons we should have learned; determine what today’s essentials of monitoring and managing networks are; and finally, turn an eye to the future to begin preparing now for what’s on the horizon.

Learn from the Past

Think back to the time before the luxury of Wi-Fi, before the proliferation of virtualization, and before today’s cloud computing.

The network used to be defined by a mostly wired, physical entity controlled by routers and switches. Business connections were based on T1 and ISDN, and Internet connectivity was always backhauled through the data center. Each network device was a piece of company-owned hardware, and applications operated on well-defined ports and protocols. VoIP was used infrequently, and anywhere connectivity—if even a thing—was provided by the low-quality bandwidth of cell-based Internet access.

With this yesteryear in mind, consider the following lessons we all (should) have learned that still apply today:

It Has to Work
Where better to start than with a throwback to RFC 1925, “The Twelve Networking Truths”? It’s just as true today as it was in 1996—if your network doesn’t actually work, then all the fancy hardware is for naught. Anything that impacts the ability of your network to work should be suspect.

The Shortest Distance Between Two Points is Still a Straight Line
Wired or wireless, MPLS, EIGRP or OSPF, your job as a network engineer is still fundamentally to create the conditions where the distance between the provider of information, usually a server, and the consumer of that information, usually a PC, is as near to a straight line as possible. When you forget that and get caught up instead in quality-of-service maps, automated functions and fault tolerance, you’ve lost your way.

An Unconfigured Switch is Better than the Wizard
It was a long-standing truth that running the configuration wizard on a switch was the fastest way to break it, whereas just unboxing and plugging it in would work fine. Wizards are a fantastic convenience and come in all forms, but if you don’t know what the wizard is making convenient, you are heading for trouble.

What is Not Explicitly Permitted is Forbidden
No, this policy isn’t fun and it won’t make you popular. And it will actually create work for you on an ongoing basis. But there is honestly no other way to run your network. If espousing this policy will get you fired, then the truth is you’re going to get fired one way or the other. You might as well be able to pack your self-respect and professional ethics into the box along with your potted fern and stapler when the other shoe drops. Because otherwise that huge security breach is on you.

Live in the Present 

Now let’s fast forward and consider the network of present day.

Wireless is becoming ubiquitous—it’s even overtaking wired networks in many instances—and the number of devices wirelessly connecting to the network is exploding (think Internet of Things). It doesn’t end there, though—networks are growing in all directions. Some network devices are even virtualized, resulting in a complex amalgam of the physical, the virtual and the Internet. Business connections are DSL/cable and Ethernet services, and increased use of cloud services is stretching Internet capacity at remote sites, not to mention opening security and policy issues since it’s not all backhauled through the data center. BYOD, BYOA, tablets and smartphones are prevalent and are creating bandwidth capacity and security issues. Application visibility based on port and protocol is largely impossible due to applications tunneling via HTTP/HTTPS. VoIP is common, also imposing higher demands on network bandwidth, and LTE provides high-quality anywhere connectivity.

Are you nostalgic for the days of networking yore yet? The complexity of today’s networking environment underscores that while lessons of the past are still important, a new set of network monitoring and management essentials is necessary to meet the challenges of today’s network administration head on. These new essentials include:

Network Mapping
While perhaps a bit back-to-basics and also suitable as a lesson we all should have learned by now, when you consider the complexity of today’s networks and network traffic, network mapping and the subsequent understanding of management and monitoring needs has never been more essential than it is today. Moving ahead without a plan—without knowing the reality on the ground—is a sure way to make the wrong choices in terms of network monitoring based on assumptions and guesswork.

Wireless Management
The growth of wireless networks presents new problems, such as ensuring adequate signal strength and that the proliferation of devices and their physical mobility—potentially hundreds of thousands of network-connected devices, few of which are stationary and many of which may not be owned by the company (BYOD)—doesn’t get out of hand. What’s needed are tools such as wireless heat maps, user device tracking, alerting on over-subscribed access points, and device IP address tracking and management.

Application Firewalls
When it comes to surviving the Internet of Things, you first must understand that all of the “things” connect to the cloud. Because they’re not coordinating with a controller on the LAN, each device incurs a full conversation load, burdening the WAN and every element in a network. And worse, many of these devices prefer IPv6, meaning you’ll have more pressure to dual-stack all of those components. Application firewalls can untangle device conversations, get IP address management under control and help prepare for IPv6. They can also classify and segment device traffic; implement effective quality of service to ensure that critical business traffic has headroom; and of course, monitor flow.

Capacity Planning
Nobody plans for not growing; it’s just that sometimes infrastructure doesn’t read the plan we’ve so carefully laid out. You need to integrate capacity forecasting tools, configuration management and web-based reporting to be able to predict scale and growth. There’s the oft-quoted statistic that 70 percent of network outages come from unexpected network configuration changes. Admins have to avoid the Jurassic Park effect: outages that were unexpected, but in hindsight clearly predictable, are the bane of any IT manager’s existence. “How did we not know and respond to this?” is a question nobody wants to have to answer.

Application Performance Insight
Many network engineers have complained that the network would be stable if it weren’t for the end users. While it’s an amusing thought, it ignores the universal truth of IT—everything we do is because of and for end-users. The whole point of having a network is to run the business applications end-users need to do their jobs. Face it, applications are king. Technologies such as deep packet inspection, or packet-level analysis, can help you ensure the network is not the source of application performance problems.

Prepare for the Future

Now that we’ve covered the evolution of the network from past to present—and identified lessons we can learn from the network of yesterday and what the new essentials of monitoring and managing today’s network are—we can prepare for the future. So, stay tuned for part two in this series to explore what the future holds for the evolution of the network.

It’s Not About What Happened

“I call bullshit,” he said with authority. “It couldn’t have happened like that. Let’s move on to something real.”

And just like that, he missed the most important part.

Because sometimes – maybe often – what you need to know is not whether something really, actually, 100% accurately happened “that way”.

It’s about what happened next. More to the point, it’s about what people did about the thing that may-or-may-not-have-happened-that-way.

How we respond to events (real or perceived) tells us and others more about who we are than the situations we find ourselves in.

“The entire data center was crashing”
(it was only 10 servers)
“the CEO was calling my cell every 2 minutes”
(he called twice in the first 30 minutes and then left you alone)
“it was a massive hack, probably out of China or Russia”
(it was a mis-configured router)

Whatever. I’m not as interested in that as what you did next. Did you:

  • Call a vendor and scream that they needed to “fix this yesterday”?
  • Pull the team together, solicit ideas, and put together a plan?
  • Tell everyone to stay out of your way and work 24 hours straight to sort it out?
  • Wait 30 minutes before doing anything to see if anyone noticed, or if it sorted itself out?
  • Start documenting everything you saw happening, to review afterward?
  • Simply shut everything down and start it up again and see if that fixes it?
  • Look at your historical data to see if you can spot the beginning of the failure?
  • Immediately recover from backups, and let people know work will be lost?

Notice that most of those aren’t inherently wrong, although several are wrong depending on the specific circumstances.

And that is the ONLY point where “what happened” comes into play. The events around us shape our environment.

But how we decide to respond shapes who we are.

When It Comes to System Outages, Don’t Prepare For the Worst

NOTE: This article originally appeared here.

During the 2014 World Cup soccer competition, Nate Silver and the psychic witches he keeps in his basement — because how else could he make the predictions he does with such accuracy? — got it wrong. Really, really wrong. They were completely blindsided by Germany’s win over Brazil. As Silver described it, it was a completely unforeseeable event.


In sports and, to a lesser extent, politics, the tendency in the face of these things is to eat the loss, chalk it up to a fluke — a black swan in statistics parlance — and get on with life.

But as network administrators, we know that’s not how it works in IT.

In my experience, when a black swan event affects IT systems, management usually acquires a dark obsession with the event. Meetings are called under the guise of “lessons learned exercises,” with the express intent of ensuring said system outages never happen again.

Don’t spend too much time studying what might occur

Now, I’m not saying that after a failure we should just blithely ignore any lessons that could be learned. Far from it, actually. In the ashes of a failure, you often find the seeds of future avoidance. One of the first things an IT organization should do after such an event is determine whether the failure was predictable, or if it was one of those cases where there wasn’t enough historical data to determine a decent probability.


If the latter is the case, I’m here to tell you your efforts are much better spent elsewhere. What’s a better approach? Instead of spending time trying to figure out if a probability may or may not exist, catch and circumvent those common, everyday IT annoyances. This is a tactic that’s overlooked far too often.

Don’t believe me? Well, let’s take the example of a not-so-imaginary company I know that had a single, spectacular IT failure that cost somewhere in the neighborhood of $100,000. Management was understandably upset. It immediately set up a task force to identify the root cause of the failure and recommend steps to avoid it in the future. Sounds reasonable, right?

The task force — five experts pulled from the server, network, storage, database and applications teams — took three months to investigate the root cause, with each member sinking roughly 500 staff-hours into the effort. Being conservative, let's say the hourly cost to the company was $50. Multiply that by five people, then by 500 hours. It comes to a nice round $125,000.

Not so reasonable after all

Yes, at the end of it all the root problem was not only identified — at least, as much as possible — but code was put in place to (probably) predict the next time the exact same event might occur. Doesn’t sound so bad. But keep this in mind: The company spent $25,000 more than the cost of the original failure to create a solution that may or may not predict the next occurrence of a black swan exactly like the one that hit before.

Maybe it wasn’t so reasonable after all.

You may be thinking, “But where else are you saying we should focus? After all, we’re held accountable to the bottom line as much as anyone else in the company.”

I get that, and it’s actually my point. Let’s compare the previous example of chasing a black swan to another, far more common problem: network interface card (NIC) failures.

In this example, another not-so-fictitious company saw bandwidth usage spike and stay high. NICs threw errors until the transmission rates bottomed out, and eventually the card just up and died. The problem was that while bandwidth usage was monitored, there was no alerting in place for interfaces that stopped responding or disappeared (the company monitored the IP address at the far end of the connection, which meant WAN links generated no alerts until the far end went down).

Let’s assume that a NIC failure takes an average of one hour to notice and correctly diagnose, and then two hours for network administrators, who cost the company $53 per hour, to fix. While the circuit is out, the company loses about $1,000 per hour in revenue, lost opportunity, etc. That means a system outage like this one costs the company $3,106.

Setting a framework anchored by alerting and monitoring

Now, consider that, in my experience, proper monitoring and alerting reduces the time it takes to notice and diagnose problems such as NIC failures to 15 minutes. That’s it. Nothing else fancy, at least not in this scenario. But that simple thing could reduce the cost of the outage by $750.

I know those numbers don’t sound too impressive. That is, until you realize a moderately sized company can easily experience 100 NIC failures per year. That translates to more than $300,000 in lost revenue if the problem is unmonitored, and an annual savings of $75,000 if alerting is in place.

And that doesn’t take into account the ability to predict NIC failures and replace the card pre-emptively. If we estimate that 50% of the failures could be avoided using predictive monitoring, the savings could rise to more than $190,000.

Again, I’m not saying preparing for black swan events isn’t a worthy endeavor, but when tough budget decisions need to be made, some simple alerting on common problems can save more than trying to predict and prevent “the big one” that may or may not ever happen.

After all, NIC failures are no black swan. I think even Nate Silver would agree they’re a sure thing.

“Nice to Have” is Relative

What could you absolutely positively not live without? … Draw a line. Come up with reasons. Know where you stand.

Cute. Except without requirements you can’t draw a line effectively.

So go from there. Start with what you need to accomplish, and then set your must-haves to achieve that goal. Don’t embellish, don’t waffle. Just the essentials.

Draw the line, set the next goal, list the deal-breakers.

Believe it or not, this is an exercise most organizations don’t do (at least for monitoring). They start with an (often vague) goal – “Select a NetFlow tool” or “Monitor customer experience for applications”. And then they look for tools that do that.

Before you know it, you have 20, or 50, or 150 tools (I am NOT exaggerating). You have staff – even whole teams – who invest hundreds of hours getting good at using those tools.

And then you get allegiances. And then you get politics.

Respect Your Elders

NOTE: This article originally appeared here.

“Oh Geez,” exclaimed the guy who sits 2 desks from me, “that thing is ancient! Why would they give him that?”

Taking the bait, I popped my head over the wall and asked “what is?”

He showed me a text message, sent to him from a buddy—an engineer (EE, actually) who worked for an oil company. My co-worker’s iPhone 6 displayed an image of a laptop we could only describe as “vintage”:

(A Toshiba Tecra 510CDT, which was cutting edge…back in 1997.)

“Wow.” I said. “Those were amazing. I worked on a ton of those. They were serious workhorses—you could practically drop one from a 4 story building and it would still work. I wanted one like nobody’s business, but I could never afford it.”

“OK, back in the day I’m sure they were great,” said my 20-something coworker dismissively. “But what the hell is he going to do with it NOW? Can it even run an OS anymore?”

I realized he was coming from a particular frame of reference that is common to all of us in I.T. Newer is better. Period. With few exceptions (COUGH-Windows M.E.-COUGH), the latest version of something—be it hardware or software—is always a step up from what came before.

While true, it leads to a frame of mind that is patently untrue: a belief that what is old is also irrelevant. Especially for I.T. professionals, it’s a dangerous line of thought that almost always leads to unnecessary mistakes and avoidable failures.

In fact, ask any I.T. pro who’s been at it for a decade, and you’ll hear story after story:

  • When programmers used COBOL, back when dinosaurs roamed the earth, one of the fundamental techniques drilled into their heads was “check your inputs.” Thinking about the latest exploits, be they an SSLv3 issue like POODLE, a SQL injection, or any of a plethora of web-based security problems, the fundamental flaw is the server NOT checking its inputs for sanity.
  • How about the OSI model? Yes, we all know it’s required knowledge for many certification exams (and at least one IT joke). But more importantly, it was (and still is) directly relevant to basic network troubleshooting.
  • Nobody needs to know CORBA object structures anymore, right? Except that a major monitoring tool was originally developed on CORBA and that foundation has stuck. Which is why, if you try to create a folder-inside-a-folder more than 3 levels deep, the entire system corrupts. CORBA (one of the original object-oriented architectures) could only handle 3 levels of object containership.
  • PowerShell can be learned without understanding Unix/Linux command-line concepts. But it’s sure EASIER to learn if you already know how to pipe ls into grep into awk so that you get a list of just the files you want, sorted by date. That technique (among other Unix/Linux concepts) was one of the original goals of PowerShell.
  • Older rev’s of industrial motion-control systems used specific pin-outs on the serial port. The new USB-to-Serial cables don’t mimic those pin-outs correctly, and trying to upload a program with the new cables will render the entire system useless.

And in fact, that’s why my co-worker’s buddy was handed one of those venerable Tecra laptops. It had a standard serial port and it was preloaded with the vendor’s DOS-based ladder-logic programming utility. Nobody expected it to run Windows 10, but it fulfilled a role that modern hardware simply couldn’t have done.

It’s an interesting story, but you have to ask: aside from some interesting anecdotes and a few bizarre use cases, does this have any relevance to our day-to-day work?

You bet.

We live in a world where servers, storage, and now the network are all rushing toward a quantum singularity of virtualization.

And the “old-timers” in the mainframe team are laughing their butts off as they watch us run in circles, inventing new words to describe techniques they learned at the beginning of their career; making mistakes they solved decades ago; and (worst of all) dismissing everything they know as utterly irrelevant.

Think I’m exaggerating? SAN and NAS look suspiciously like DASD, just on faster hardware. Services like Azure and AWS, for all their glitz and automation, aren’t as far from rented time on a mainframe as we’d like to imagine. And when my company replaces my laptop with a fancy “appliance” that connects to a Citrix VDI session, it reminds me of nothing so much as the VAX terminals I supported back in the day.

My point isn’t that I’m a techno-Ecclesiastes shouting “there is nothing new under the sun!” Or some I.T. hipster who was into the cloud before it was cool. My point is that it behooves us to remember that everything we do, and every technology we use, had its origins in something much older than 20 minutes ago.

If we take the time to understand that foundational technology, we have the chance to avoid past missteps, leverage undocumented efficiencies built into the core of the tools, and build on ideas elegant enough to have withstood the test of time.


Got your own “vintage” story, or long-ago-learned tidbit that is still true today? Share it in the comments!