The Four Questions, Part 1

 

In the introduction to this series, I described the four (ok, really five) questions that monitoring professionals are frequently asked: Why did I get an alert? Why didn’t I get an alert? What is being monitored on my system? What will alert on my system? And what standard monitoring do you do? My goal in this post is to give you the tools you need to answer the first of those:

Why did I get an alert?

Reader’s Note: While this article uses examples specific to the SolarWinds monitoring platform, the fact is that most of the techniques can be translated to any toolset.

**************
It’s 8:45am, and you are just settling in at your desk. You notice that one email came in overnight from your company’s 24-7 operations desk:

“We got an alert for high CPU on the server WinSrvABC123 at 2:37am. We didn’t notice anything when we jumped on the box. Can you explain what happened?”

Out of all of The Four Questions of monitoring, this is the easiest one to answer, as long as you have done your homework and set up your environment.

Before I dig in, I want to clarify that this is not the same question as “What WILL alert on my server?” or “What are the monitoring and alerting standards for this type of device?” (I’ll cover both of those in later parts of this series.) Here, we’re dealing strictly with a user’s reaction when they receive an alert.

I also have to stress that it’s imperative that you always take the time to answer this question. It can be annoying, tedious, and time-consuming. But if you don’t, before long all of your alerts will be dismissed as “useless.” That is the first step on a long road that leads to a CIO-mandated RFP for monitoring tools, you defending your choice of tools, and other conversations that are significantly more annoying, tedious, and time-consuming.

However, my tips below should cut down on your workload significantly. So let’s get started.

First, let’s be clear: monitoring is not alerting. Some people confuse getting a ticket, page, email, or other alert with actual monitoring. In my book, “monitoring” is the ongoing collection of data about a particular element or set of elements. Alerting is a happy by-product of having monitoring, because once you have those metrics you can notify people when a specific metric is above or below a threshold. I say this because customers sometimes ask (or demand) that you fix (or even turn off) “monitoring.” What they really want is for you to change the alert they receive. Rarely do they really mean you should stop collecting metrics.
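To make that distinction concrete, here is a minimal Python sketch. It is not SolarWinds code: the names `Sample` and `should_alert`, the sample values, and the 90% / 15-minute rule are all made up for illustration. The point is that the history keeps growing no matter what, while the alert is only a question asked of that history.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Sample:
    taken_at: datetime
    cpu_load: float  # percent


def should_alert(samples, threshold, duration, now):
    """Return True only if every sample in the last `duration` is above `threshold`."""
    window = [s for s in samples if now - duration <= s.taken_at <= now]
    # An empty window means there is no data to judge (a different problem entirely).
    return bool(window) and all(s.cpu_load > threshold for s in window)


# Monitoring: samples accumulate whether or not anyone is ever notified.
start = datetime(2024, 1, 1, 2, 0)  # illustrative timestamp
history = [
    Sample(start + timedelta(minutes=5 * i), load)
    for i, load in enumerate([40, 55, 93, 95, 97, 96])  # illustrative values
]

# Alerting: a rule evaluated against data that monitoring already collected.
now = history[-1].taken_at
fired = should_alert(history, threshold=90, duration=timedelta(minutes=15), now=now)
```

Changing the threshold or the duration changes who gets notified and when. It does nothing to `history`, which is usually what the customer actually wants to keep.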

The bulk of your work is going to be in the way you create alert messages, because in reality, it’s the vagueness of those messages that has the recipient confused. Basically, you should ensure that every alert message contains a few key elements. Some are obvious:

  • The machine having the problem
  • The time of alert
  • Current statistic

Some are slightly less obvious but no less important:

  • Any other identifying information about the device
    • Any custom properties indicating location, owner group, etc.
    • OS type and version (the MachineType variable)
    • The IP address
    • The DNS Name and/or Sysname variables if your device names are… less than standard
  • The threshold value which breached
  • The duration – how long the alert has been in effect
  • A link or other reference to a place where the alert recipient can see this metric. Speaking in SolarWinds-specific terms, this could be:
    • The node Details page – using either the ${NodeDetailsURL} (or the equivalent for your kind of alert) or a “forged” URL (i.e.: “http://myserver/Orion/NetPerfMon/NodeDetails.aspx?NetObject=N:${NodeID}”)
    • A link to the metric details page. For example, the CPU average would be http://myserver/Orion/NetPerfMon/CustomChart.aspx?chartname=HostAvgCPULoad&NetObject=N:${NodeID}
    • Or even a report that shows this device (or a collection of devices where this is one member) and the metric involved
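If you ever need to build those “forged” links outside the alert engine (say, in a script that generates a report or a ticket), a small helper keeps them consistent. This is only a sketch: the function names are hypothetical, and the host `myserver` and the page paths are simply the ones from the examples above.

```python
from urllib.parse import urlencode

# Base path copied from the example URLs above; swap in your own server name.
ORION_BASE = "http://myserver/Orion/NetPerfMon"


def node_details_url(node_id):
    """Link to the node Details page for one node."""
    # safe=":" keeps the "N:123" form readable instead of encoding the colon.
    query = urlencode({"NetObject": f"N:{node_id}"}, safe=":")
    return f"{ORION_BASE}/NodeDetails.aspx?{query}"


def cpu_chart_url(node_id):
    """Link to the average CPU load chart for one node."""
    query = urlencode(
        {"chartname": "HostAvgCPULoad", "NetObject": f"N:{node_id}"}, safe=":"
    )
    return f"{ORION_BASE}/CustomChart.aspx?{query}"
```

Inside an alert message you would not need any of this, because the ${NodeID} variable does the substitution for you.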

Finally, one element that should always be included in each alert:

  • The name of alert

For your straightforward alerts, this should not be a difficult task and can be something you (almost) copy and paste from one alert to another. Here’s an example for CPU:

CPU Utilization on the ${MachineType} device owned by ${OwnerGroup} named ${NodeName} (IP: ${IP_Address}, DNS: ${DNS}) has been over ${CPU_Thresh} for more than 15 minutes. Current load at ${AlertTriggerTime} is ${CPULoad}.

View full device details here: ${NodeDetailsURL}.
Click here to acknowledge the alert: ${AcknowledgeTime}

This message was brought to you by the alert: ${AlertName}
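If you maintain more than a handful of alerts, it can help to check message templates for the key elements before they go live. The Python sketch below is one way to do that, assuming you can get the message text out of your tool as a plain string. The `REQUIRED` set is my own pick from the lists above, and `missing_variables` and `preview` are hypothetical helper names, not part of any SolarWinds API.

```python
import re
from string import Template

# Variables every message is expected to carry (an assumption; adjust to taste).
REQUIRED = {
    "NodeName", "IP_Address", "CPU_Thresh", "CPULoad",
    "AlertTriggerTime", "NodeDetailsURL", "AlertName",
}

MESSAGE = (
    "CPU Utilization on the ${MachineType} device owned by ${OwnerGroup} "
    "named ${NodeName} (IP: ${IP_Address}, DNS: ${DNS}) has been over "
    "${CPU_Thresh} for more than 15 minutes. "
    "Current load at ${AlertTriggerTime} is ${CPULoad}.\n"
    "View full device details here: ${NodeDetailsURL}.\n"
    "This message was brought to you by the alert: ${AlertName}"
)


def missing_variables(template_text, required=REQUIRED):
    """List the required ${...} variables that the template never mentions."""
    used = set(re.findall(r"\$\{(\w+)\}", template_text))
    return sorted(required - used)


def preview(template_text, values):
    """Fill in sample values so you can read the message as a recipient would."""
    # safe_substitute leaves unknown ${...} placeholders in place instead of raising.
    return Template(template_text).safe_substitute(values)
```

Running `missing_variables` over each template gives you a quick list of alerts whose messages would leave the recipient guessing.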

While it means more work during alert setup, having an alert with this kind of messaging means that the recipient has several answers to the “Why did I get this alert?” at their fingertips:

  • They have everything they need to identify the machine – which team owns it, what version of OS it’s running, and the office or data center where it’s located.
  • They have what they need to connect to the device – whether by name, DNS name, IP address, etc.
  • They know what metric (CPU) triggered the alert.
  • They know when the problem was detected (because let’s face it, sometimes emails DO get delayed).
  • They have a way to quickly get to a status screen (i.e.: the Node details page) to see the history of that metric and hopefully see where the spike occurred.

Finally, by including the ${AlertName}, you’re enabling the recipient to help you help them. You now know precisely which alert to research. And that’s critical, because there are more things you should be prepared to do.

There is one more value you might want to include if you have a larger environment, and that’s the name of the SolarWinds polling engine. There are times when a device is moved to the wrong poller—wrong because of networking rules, AD membership, etc. Having the polling engine in the message is a good sanity check in this situation.

Let’s say that the owner of the device is still unclear why they received the alert. (Hey, it happens!) With the information the recipient can give you from the alert message, you can now use the following tools and techniques:

The Message Center

Some people live and die on this screen. Some never touch it. But in this case, it can be your best friend. Note two specific areas:

  • The Network Object drop-down – this lets you zero in on just the alerts from the offending device. Step one is to look at EVERYTHING coming off this box for the time period. Events, alerts, etc. See if this builds a story about what may have led up to the event.
  • The Alert Name drop-down under Triggered Alerts – this allows you to look at ALL of the instances when this alert triggered, or further zero in on the one event you are trying to find.

Side Note: The Time Period drop-down is critical here. Make sure you set it to show the correct period of time for the alert or else you’re going to constantly miss the mark.

Using these two simple controls in Message Center, you (and your users) should be able to drill into the event stream around the ticket time. Hopefully that will answer their question.
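The same drill-down works on any exported event list, whatever tool it came from. Here is a rough Python sketch of the idea, assuming each event is a dictionary with `node`, `time`, and `message` keys. That shape is my assumption, not a SolarWinds export format, and the one-hour and fifteen-minute windows are arbitrary defaults.

```python
from datetime import datetime, timedelta


def events_around(events, node, alert_time,
                  before=timedelta(hours=1), after=timedelta(minutes=15)):
    """Return one node's events near the alert time, oldest first."""
    start, end = alert_time - before, alert_time + after
    hits = [e for e in events if e["node"] == node and start <= e["time"] <= end]
    return sorted(hits, key=lambda e: e["time"])


# Made-up events, purely for illustration.
events = [
    {"node": "WinSrvABC123", "time": datetime(2024, 1, 1, 2, 37), "message": "High CPU alert triggered"},
    {"node": "OtherBox", "time": datetime(2024, 1, 1, 2, 30), "message": "Interface down"},
    {"node": "WinSrvABC123", "time": datetime(2024, 1, 1, 2, 10), "message": "Backup job started"},
    {"node": "WinSrvABC123", "time": datetime(2023, 12, 31, 9, 0), "message": "Node rebooted"},
]

story = events_around(events, "WinSrvABC123", datetime(2024, 1, 1, 2, 37))
```

The filter does the work of the Network Object and Time Period drop-downs. What remains is the short story of what the box was doing just before the alert fired.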

If you do it right (meaning take your time explaining what you are doing in a meeting, or using a screen share; maybe even come up with some light “how to” documentation with screen shots), users—especially those in heavy support roles—will learn over the course of time to analyze alerts on their own.

But what about the holdouts? The ones where Message Center hasn’t shown them (or you) what you hoped to see. What then?

Be prepared to test your alert. It’s something you should do every time you’re ready to release a new alert into your environment. Also remember that sometimes you get busy, and sometimes you test everything, but then the situation on the ground changes without your participation.

So, however you got here, you need to go back to the testing phase.

  • Make a copy of the alert. Never test against a live production alert. There’s a COPY button in the alert manager for that very reason.
  • Change the alert copy by adding an alert trigger for the machine in question. JUST that machine. (i.e.: “where node caption is equal to WinSrvABC123”).
  • Set your triggering criteria (“CPULoad > 90%” or whatever) to a value so low that it’s guaranteed to trigger.
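The three steps above can be sketched in Python as a copy-and-narrow operation. The `AlertDefinition` class is a made-up stand-in for whatever your tool stores, and `make_test_copy` is a hypothetical helper; in SolarWinds itself you would do this through the alert manager rather than in code.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class AlertDefinition:
    name: str
    scope: str          # which nodes the alert applies to
    cpu_threshold: int  # percent
    enabled: bool = True


def make_test_copy(alert, node_caption, test_threshold=1):
    """Copy an alert, scope it to one node, and drop the threshold so it must fire."""
    return replace(
        alert,
        name=f"TEST COPY - {alert.name}",
        scope=f"node caption is equal to {node_caption}",
        cpu_threshold=test_threshold,
    )


# The production alert is frozen, so the copy cannot change it by accident.
production = AlertDefinition(name="High CPU", scope="all Windows servers", cpu_threshold=90)
test_alert = make_test_copy(production, "WinSrvABC123")
```

Keeping the original immutable mirrors the rule above: everything you experiment with lives on the copy, and the production alert only changes when you deliberately carry a fix back to it.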

At that point, test the heck out of that bugger until both you and the recipient are satisfied that it works as expected. Copy whatever modifications you need over to the existing alert, and beware that updating the alert trigger will cause any existing alerts to re-fire. So you may need to hold off on those changes until a quieter moment.

Stay tuned for our next installment: “Why didn’t I get an alert?”

Moments

We’re told to visualize success. And when we do, when we imagine ourselves winning, it’s in a moment.

But there’s another moment we should be envisioning – the moment before the moment. The one that comes before the winning move, before the piece of code that pulls it all together, before the report is handed in to smiles and nods and thumbs up.

And there’s a moment before that moment, too.

If you plan on winning today, you might want to spare some time to imagine for a moment what it looks like in those moments before the moment where you win.

Because that’s probably where the winning actually happens.

The Four Questions – Introduction

This article originally appeared in a scaled-down version here on PacketPushers.net. I’m posting it in its full form here as an introduction to the full series.


For people who are interested in monitoring, there is a leap that you make when you go from watching systems that YOU care about, to monitoring systems that other people care about.

When you are doing it for yourself, it’s all about ease of maintenance, getting good (meaning useful, interesting) data, and having the information at your fingertips to deflect accusations that YOUR system is down/slow/ugly/whatever.

But if you do that job well, and show up at enough meetings showing off your shiny happy data, inevitably you will get nominated/conscripted into the monitoring group where it is expected you will take as much interest in other people’s sh…tuff as your beloved systems from your former job.

And this is where things get especially tricky.

Assuming you LIKE monitoring as a discipline, and find it exciting to learn about different types of systems (and ways they can fail), you are going to want to provide the same levels of insight for your coworkers as you had for yourself.

Inevitably, you will find yourself answering The Four Questions. These are questions which—for reasons that will become apparent—you never really had to ask yourself when you were doing it on your own. The four questions—with brief explanations—are:

  1. Why did I get an alert?
    The person is not asking, “Why did this alert trigger at this time?” They are asking why they got the alert at all.
  2. Why didn’t I get an alert?
    Something happened that the owner of the system felt should have triggered an alert, but they didn’t receive one.
  3. What is being monitored on my system?
    What reports and data can be pulled for their system (and in what form) so they can look at trending, performance, and forensic information after a failure.
  4. What will alert on my system?
    The person wants to be able to predict the conditions under which they will get an alert for this system.

…and the Fifth Beatle… I mean question.

  5. What do you monitor “standard”?
    What metrics and data are typically collected for systems like this? This is the inevitable (and logical) response when you say, “We put standard monitoring in place.”

In the coming (weeks/days/months/series) I’m going to explore each of these questions in-depth, and offer techniques you can use to respond to each one.

Lose, Survive, or Win?

My teenage son has had a rough patch lately – like many teens. Getting up for school requires Herculean commitment. Being civil, let alone kind, is almost impossible.

No surprises there, it’s all part of the journey.

This morning as I drove him to school, I asked him if he was going to let the morning’s bad start ruin the rest of his day.

“Nope” came his typically verbose reply.
“So,” I asked. “Do you plan to lose, survive, or win?”
“Uh…. survive?” he answered, clearly thrown off his recalcitrant game.
“Interesting choice.” I said, and left it at that.

But I think it’s worth asking ourselves each day. As we prepare to meet the challenges inherent in the day of a typical IT pro, do we envision ourselves losing, surviving, or finding a way to win?

Sometimes it’s not important whether you actually win. Sometimes what matters – on day 463 of your job – is what your plan is.

The rest is part of the journey.

MovieBob and Magneto

(Inspired by the article: “Magneto was Right” which has subsequently been taken down, but the video is here)

I really liked the character of Magneto in X-Men 1 and 2. The character had a point and a purpose and an inner consistency. He wasn’t “evil” any more than most of us are, he simply framed things differently than Xavier and acted based on his own values.

It’s like this: The power goes out in my neighborhood and some people think “candle light block party” while others lock the doors in case there’s looting and riots. Neither option is totally far out; it just depends on how you see the world.

Coming back to Bob. I got picked on in school. Most of the people I associated with got picked on too. Depending on the day and context, it was because I was a band geek, or a theater dweeb, or a fashion train wreck, or socially inept, or somehow being “an easy target”.

At least, that’s what I’ve always assumed. And since it was me getting picked on and not me picking on them, I assumed there was a flaw in me that invited the abuse.

But I think MovieBob is truly onto something, and not just because he’s using comic book characters as his foil.

My favorite point:
Bullies pick on us NOT because “we’re different” (MovieBob says “I can attest that they came in all shapes and sizes. A veritable rainbow coalition of torment.”).

NO, the thing we all suspect deep down is that it’s not that we’re different, we’re BETTER.

Bob uses images from Revenge of the Nerds in his discussion, and that might be the most accurate. The narrative of the geeks realizing their own self worth and playing to their strengths may be a fantasy, but it’s definitely a satisfying one as well as one that is actually playing out in reality with more and more frequency.

Breaking the Loop

There are times when doing it all over again can be part of a brand new discovery. And there are times (maybe MOST of the time) when it’s not.

There are times when doing it over is just busywork, repetition, your own little slice of Groundhog Day (but without getting Andie MacDowell at the end, or becoming a surgeon, or anything).

Welcome to 80% of the work of IT. Figuring out a solution once and then doing it again and again and…

Hopefully, right about now, you are asking yourself “who would want to do that?”

If a response is known, repeatable, and predictable, shouldn’t it be automated? If a service stops, automatically restart it. If a disk is filling up, clear the temp directory. If the server has too many connections, clear the ones that are old, or stale, or showing no activity.
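Take the “disk is filling up” case. A known, repeatable response like that fits in a few lines of Python. This is a sketch under assumptions of my own: the 90% threshold, the seven-day age limit, and the function names are all illustrative, and you would want to point it only at a directory you are certain is safe to clean.

```python
import os
import shutil
import time


def disk_percent_used(path):
    """Percentage of the disk holding `path` that is currently in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def clear_stale_files(directory, max_age_seconds, now=None):
    """Delete regular files older than `max_age_seconds`; return their names."""
    now = time.time() if now is None else now
    with os.scandir(directory) as it:
        entries = list(it)  # snapshot first, then delete
    removed = []
    for entry in entries:
        if not entry.is_file(follow_symlinks=False):
            continue  # leave subdirectories and symlinks alone
        if now - entry.stat(follow_symlinks=False).st_mtime > max_age_seconds:
            os.remove(entry.path)
            removed.append(entry.name)
    return sorted(removed)


def remediate_full_disk(directory, percent_threshold=90.0,
                        max_age_seconds=7 * 24 * 3600):
    """Clean up only when the known, predictable condition is actually present."""
    if disk_percent_used(directory) < percent_threshold:
        return []
    return clear_stale_files(directory, max_age_seconds)
```

Returning the list of removed files matters more than it looks: it gives you something to log, so the automated fix still leaves a trail a human can review later.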

“But it’s not that simple” you say? Each solution is a snowflake, unique in its particulars? That’s fine. NOT repeatable or predictable happens too.

But if you find yourself locked into a circular routine, each ticket blurring into the previous one, it might be time to look for a pattern so you can break out of it.

LINK: Do what you love

And thanks to Doug at http://www.asknice.com for pointing it out.

http://www.paulgraham.com/love.html
How to Do What You Love
January 2006

To do something well you have to like it. That idea is not exactly novel. We’ve got it down to four words: “Do what you love.” But it’s not enough just to tell people that. Doing what you love is complicated.

The very idea is foreign to what most of us learn as kids. When I was a kid, it seemed as if work and fun were opposites by definition. Life had two states: some of the time adults were making you do things, and that was called work; the rest of the time you could do what you wanted, and that was called playing. Occasionally the things adults made you do were fun, just as, occasionally, playing wasn’t– for example, if you fell and hurt yourself. But except for these few anomalous cases, work was pretty much defined as not-fun.

And it did not seem to be an accident. School, it was implied, was tedious because it was preparation for grownup work.

The world then was divided into two groups, grownups and kids. Grownups, like some kind of cursed race, had to work. Kids didn’t, but they did have to go to school, which was a dilute version of work meant to prepare us for the real thing. Much as we disliked school, the grownups all agreed that grownup work was worse, and that we had it easy.

Teachers in particular all seemed to believe implicitly that work was not fun. Which is not surprising: work wasn’t fun for most of them. Why did we have to memorize state capitals instead of playing dodge-ball? For the same reason they had to watch over a bunch of kids instead of lying on a beach. You couldn’t just do what you wanted.

I’m not saying we should let little kids do whatever they want. They may have to be made to work on certain things. But if we make kids work on dull stuff, it might be wise to tell them that tediousness is not the defining quality of work, and indeed that the reason they have to work on dull stuff now is so they can work on more interesting stuff later. [1]

Once, when I was about 9 or 10, my father told me I could be whatever I wanted when I grew up, so long as I enjoyed it. I remember that precisely because it seemed so anomalous. It was like being told to use dry water. Whatever I thought he meant, I didn’t think he meant work could literally be fun– fun like playing. It took me years to grasp that.

Jobs
By high school, the prospect of an actual job was on the horizon. Adults would sometimes come to speak to us about their work, or we would go to see them at work. It was always understood that they enjoyed what they did. In retrospect I think one may have: the private jet pilot. But I don’t think the bank manager really did.

The main reason they all acted as if they enjoyed their work was presumably the upper-middle class convention that you’re supposed to. It would not merely be bad for your career to say that you despised your job, but a social faux-pas.

Why is it conventional to pretend to like what you do? The first sentence of this essay explains that. If you have to like something to do it well, then the most successful people will all like what they do. That’s where the upper-middle class tradition comes from. Just as houses all over America are full of chairs that are, without the owners even knowing it, nth-degree imitations of chairs designed 250 years ago for French kings, conventional attitudes about work are, without the owners even knowing it, nth-degree imitations of the attitudes of people who’ve done great things.

What a recipe for alienation. By the time they reach an age to think about what they’d like to do, most kids have been thoroughly misled about the idea of loving one’s work. School has trained them to regard work as an unpleasant duty. Having a job is said to be even more onerous than schoolwork. And yet all the adults claim to like what they do. You can’t blame kids for thinking “I am not like these people; I am not suited to this world.”

Actually they’ve been told three lies: the stuff they’ve been taught to regard as work in school is not real work; grownup work is not (necessarily) worse than schoolwork; and many of the adults around them are lying when they say they like what they do.

The most dangerous liars can be the kids’ own parents. If you take a boring job to give your family a high standard of living, as so many people do, you risk infecting your kids with the idea that work is boring. [2] Maybe it would be better for kids in this one case if parents were not so unselfish. A parent who set an example of loving their work might help their kids more than an expensive house. [3]

It was not till I was in college that the idea of work finally broke free from the idea of making a living. Then the important question became not how to make money, but what to work on. Ideally these coincided, but some spectacular boundary cases (like Einstein in the patent office) proved they weren’t identical.
… read more here: http://www.paulgraham.com/love.html

I’d Like the Record to Reflect…

What do you do when creativity refuses to cooperate? When you have the intention to create something, but that spark, that “thing” just isn’t happening?

Author Elizabeth Gilbert (“Eat, Pray, Love”) gave a speech about nurturing creativity at the 2009 TED conference. I strongly recommend it to anyone who has 19 minutes to spare:  http://www.ted.com/talks/view/id/453

Near the end of Ms. Gilbert’s talk she speaks to the invisible externalized part of her that provides inspiration:

You and I both know that if this book is not brilliant, that is not entirely my fault, because you can see that I am putting everything I have into this […] so if you want it to be better, then you’re going to have to show up and do your part of the deal […] but if you don’t do that, then […] I’m going to keep writing because *that’s my job*. And I would please like the record to reflect today that *I* showed up for my part of the job.

As I.T. professionals, we don’t always think of ourselves as particularly “creative”. But if we are honest, we know there are moments. We actually seek them out.

When I.T. pros connect with their creative selves, “just fixing it” turns into a solution which is elegant, inspired, and repeatable.

Don’t Tell Me “It’s Complicated”

HT to my hero and writing inspiration Seth Godin. His post here got me started, and his style is something I have wanted to emulate for years now.


Please

don’t tell me that it – monitoring – is complicated.

Don’t tell me you’re a snowflake – unique in your need for 1200 alert rules.

Don’t tell me “but our company is different. WE create value for our shareholders. Not like your other clients.”

Don’t tell me you can’t do it because…

Because

I’ve been creating monitoring solutions for over a decade.

I’ve designed solutions that scaled to 250,000 systems, in 5,000 locations.

I work at a company that has written millions of lines of code to do this one thing, and do it well.

So please don’t tell me it’s complicated.

Tell me what you need. What you want. What you wish you could have.

And then LISTEN to what I have to say. Because I’ve seen this before. I’ve done this before. And it’s NOT complicated. It’s also not easy.

But it is simple.