When It Comes to System Outages, Don’t Prepare For the Worst

NOTE: This article originally appeared here.

During the 2014 World Cup, Nate Silver and the psychic witches he keeps in his basement (because how else could he make the predictions he does with such accuracy?) got it wrong. Really, really wrong. They were completely blindsided by Germany’s win over Brazil. As Silver described it, it was a completely unforeseeable event.

 

In sports and, to a lesser extent, politics, the tendency in the face of these things is to eat the loss, chalk it up to a fluke — a black swan in statistics parlance — and get on with life.

But as network administrators, we know that’s not how it works in IT.

In my experience, when a black swan event affects IT systems, management usually acquires a dark obsession with the event. Meetings are called under the guise of “lessons learned exercises,” with the express intent of ensuring said system outages never happen again.

Don’t spend too much time studying what might occur

Now, I’m not saying that after a failure we should just blithely ignore any lessons that could be learned. Far from it, actually. In the ashes of a failure, you often find the seeds of future avoidance. One of the first things an IT organization should do after such an event is determine whether the failure was predictable, or if it was one of those cases where there wasn’t enough historical data to determine a decent probability.

 

If the latter is the case, I’m here to tell you your efforts are much better spent elsewhere. What’s a better approach? Instead of spending time trying to figure out if a probability may or may not exist, catch and circumvent those common, everyday IT annoyances. This is a tactic that’s overlooked far too often.

Don’t believe me? Well, let’s take the example of a not-so-imaginary company I know that had a single, spectacular IT failure that cost somewhere in the neighborhood of $100,000. Management was understandably upset. It immediately set up a task force to identify the root cause of the failure and recommend steps to avoid it in the future. Sounds reasonable, right?

The task force (five experts pulled from the server, network, storage, database and applications teams) took three months to investigate the root cause, with each member putting in roughly 100 hours a month. Being conservative, let’s say the hourly cost to the company was $50. Now, multiply that by five people, then by 100 hours, then by three months. It comes to a nice round $75,000.

Not so reasonable after all

Yes, at the end of it all the root problem was not only identified (at least, as much as possible), but code was put in place to (probably) predict the next time the exact same event might occur. Doesn’t sound so bad. But keep this in mind: The company spent $75,000, three-quarters of the cost of the original failure, on a solution that may or may not predict the occurrence of a black swan exactly like the one that hit before.

Maybe it wasn’t so reasonable after all.

You may be thinking, “But where are you saying we should focus instead? After all, we’re held accountable to the bottom line as much as anyone else in the company.”

I get that, and it’s actually my point. Let’s compare the previous example of chasing a black swan to another, far more common problem: network interface card (NIC) failures.

In this example, another not-so-fictitious company saw bandwidth usage spike and stay high. The NIC threw errors until the transmission rates bottomed out, and eventually the card just up and died. The problem was that while bandwidth usage was monitored, there was no alerting in place for interfaces that stopped responding or disappeared (the company monitored the IP at the end of the connection, which meant WAN links generated no alerts until the far end went down).
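To make that gap concrete, here is a minimal sketch of the kind of check that was missing. It assumes a poller already collects, for each interface, whether it is up and its cumulative error counter. The InterfaceSample structure, the interface names and the threshold of 100 new errors are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InterfaceSample:
    """One poll of a single interface (hypothetical structure)."""
    oper_up: bool      # did the interface report itself as up?
    error_count: int   # cumulative error counter at poll time


def interface_alerts(previous, current, error_threshold=100):
    """Compare two polls and return a list of alert strings.

    previous and current map interface name -> InterfaceSample.
    error_threshold is an assumed value, not a recommendation.
    """
    alerts = []
    for name, before in sorted(previous.items()):
        now = current.get(name)
        if now is None:
            # The interface was there last poll and is gone now.
            alerts.append(f"{name}: interface disappeared")
        elif before.oper_up and not now.oper_up:
            alerts.append(f"{name}: interface went down")
        elif now.error_count - before.error_count >= error_threshold:
            alerts.append(f"{name}: errors climbing")
    return alerts


# Made-up sample data for two consecutive polls.
previous = {
    "wan0": InterfaceSample(True, 10),
    "wan1": InterfaceSample(True, 0),
    "lan0": InterfaceSample(True, 5),
}
current = {
    "wan0": InterfaceSample(True, 450),
    "lan0": InterfaceSample(False, 5),
}

for alert in interface_alerts(previous, current):
    print(alert)
```

The point of the sketch is that the check looks at the interface itself rather than at the IP on the far end of the link, so a dying card raises an alert before the connection is gone.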

Let’s assume that a NIC failure takes an average of one hour to notice and correctly diagnose, and then two hours to fix, with the repair done by network administrators who cost the company $53 per hour. While the circuit is out, the company loses about $1,000 per hour in revenue, lost opportunity, etc. That means an outage like this one could cost the company $3,106: three hours of downtime at $1,000 each, plus two hours of repair labor at $53 each.
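For anyone who wants to plug in their own numbers, the per-outage figure can be sketched in a few lines. This assumes labor is billed only for the two repair hours, which is the reading that reproduces the $3,106 total. The function and constant names are mine, for illustration only.

```python
REVENUE_LOSS_PER_HOUR = 1_000   # revenue, lost opportunity, etc.
ADMIN_COST_PER_HOUR = 53        # network administrator cost
REPAIR_HOURS = 2.0              # time to fix once diagnosed


def outage_cost(detect_hours, repair_hours=REPAIR_HOURS):
    """Cost of one NIC outage.

    Assumption: revenue is lost for the whole outage, while
    administrator time is billed for the repair hours only.
    """
    downtime_hours = detect_hours + repair_hours
    lost_revenue = downtime_hours * REVENUE_LOSS_PER_HOUR
    labor = repair_hours * ADMIN_COST_PER_HOUR
    return lost_revenue + labor


# One hour to notice and diagnose, as in the unmonitored case.
print(f"${outage_cost(1.0):,.0f}")  # prints $3,106
```

Changing detect_hours is the only thing monitoring affects in this model, which is why detection time is the argument and everything else is a constant.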

Setting a framework anchored by alerting and monitoring

Now, consider that, in my experience, proper monitoring and alerting reduces the time it takes to notice and diagnose problems such as NIC failures from an hour to 15 minutes. That’s it. Nothing else fancy, at least not in this scenario. But that simple thing could reduce the cost of the outage by $750, the revenue lost during those 45 minutes.

I know those numbers don’t sound too impressive. That is, until you realize a moderately sized company can easily experience 100 NIC failures per year. That translates to more than $300,000 in lost revenue if the problem is unmonitored, and an annual savings of $75,000 if alerting is in place.

And that doesn’t take into account the ability to predict NIC failures and replace the card pre-emptively. If we estimate that 50% of the failures could be avoided using predictive monitoring, the savings could rise to more than $190,000.
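Here is a rough sketch of how those annual figures fall out of the per-outage numbers. One assumption is mine rather than stated above: in the predictive case, I treat each avoided failure as saving the full $3,106, and each remaining failure as still saving $750 through faster detection. That reading lands on the "more than $190,000" figure. The names are illustrative.

```python
FAILURES_PER_YEAR = 100
COST_PER_UNMONITORED_FAILURE = 3_106   # dollars per outage
DETECTION_HOURS_SAVED = 0.75           # one hour down to 15 minutes
REVENUE_LOSS_PER_HOUR = 1_000
AVOIDABLE_SHARE = 0.5                  # failures predicted and pre-empted


def annual_figures():
    """Return (unmonitored cost, alerting savings, predictive savings)."""
    saving_per_failure = DETECTION_HOURS_SAVED * REVENUE_LOSS_PER_HOUR

    unmonitored_cost = FAILURES_PER_YEAR * COST_PER_UNMONITORED_FAILURE
    alerting_savings = FAILURES_PER_YEAR * saving_per_failure

    # Assumption: an avoided failure saves its whole cost; the rest
    # still happen but are caught sooner thanks to alerting.
    avoided = FAILURES_PER_YEAR * AVOIDABLE_SHARE
    remaining = FAILURES_PER_YEAR - avoided
    predictive_savings = (avoided * COST_PER_UNMONITORED_FAILURE
                          + remaining * saving_per_failure)
    return unmonitored_cost, alerting_savings, predictive_savings


unmonitored, alerting, predictive = annual_figures()
print(f"Unmonitored cost:   ${unmonitored:,.0f}")   # $310,600
print(f"Alerting savings:   ${alerting:,.0f}")      # $75,000
print(f"Predictive savings: ${predictive:,.0f}")    # $192,800
```

Swap in your own failure count and hourly loss and the comparison with a one-off black swan investigation becomes easy to run for your own shop.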

Again, I’m not saying preparing for black swan events isn’t a worthy endeavor, but when tough budget decisions need to be made, some simple alerting on common problems can save more than trying to predict and prevent “the big one” that may or may not ever happen.

After all, NIC failures are no black swan. I think even Nate Silver would agree they’re a sure thing.
