Category Archives: philosophy

Essential

What is “essential” to your monitoring environment? (hey, this *is* a monitoring blog. I could certainly ask that about your kitchen, but that would be a completely different discussion)

Seriously – if your budget were infinite, you could have all the tools running on giant systems with all the disk in the world. But it (the budget) isn't (infinite) and you don't (have all the toys).

So start from the other end. What could you absolutely positively not live without? Is ping and a DOS batch file enough? A rock-solid SNMP trap listener? Is it a deal-breaker if you don't have agents running local commands on each server?

Draw a line. Come up with reasons. Know where you stand.

Nothing is New Under the Sun

Ping. SNMP. WMI. Script. API call.

As I’ve elaborated on before, monitoring hasn’t changed much in the last decade or so.

To be sure, we've gotten better at presenting the information. We've found more efficient ways to collect that same information. And we've done a better job of streamlining the process of enabling those collections.

At the same time, the targets of monitoring have gotten smarter. Onboard agents expose more information, or are more responsive to polling requests.

But the raw underlying techniques haven't shifted much. And cars still burn fossil fuels (well, unless you have one of these). And children still go to school to learn (more or less) the three Rs out of actual paper books.

Things really do stay the same more than they change.

And I’ll argue that they do so because we want them to stay the same.

More on that to come…

Why I care about Women in Monitoring (and IT)

(originally posted Feb 10, 2015)
(image credit: Josh Rossi)

I started a thread on Twitter asking, "Who are some awesome women in monitoring?" One of the common reactions (privately and respectfully, I'm happy to say) has been asking me why I started the discussion in the first place. I thought that question deserved a response.

Because I'm a feminist. Yes, Virginia, Orthodox Jewish middle-aged white guys can be feminists, too. Because I think that anything that can be done to promote and encourage women getting into STEM professions should be done. Full stop. Because people from different backgrounds, cultures, and environments see the world differently, and if there's one thing you need in a "the order entry system is down again" crisis, it's as many experienced perspectives as possible to get that sucker running again.

"But why 'women in monitoring'?" I'm then asked. "Why not 'awesome women in I.T.' or just 'awesome women in STEM'?"

Because on top of all the “Because”-es above, I’m also a MONITORING enthusiast. I think monitoring (especially monitoring done right) is awesome, a lot of fun, and provides a huge value to organizations of all sizes.

I also think it's an under-appreciated discipline within I.T. The current state of monitoring-as-a-discipline reminds me of InfoSec, Storage, or Virtualization about a decade ago. Back then, it (infosec, virtualization, etc.) was a set of skills, but few people claimed that it was their sole role within a company. Fast forward to today, and IT departments wouldn't dream of not having specialists in those areas. I think (and hope) that in a few years we'll look back at monitoring and see the same type of progression.

I want to see monitoring recognized as a career path, the same as being a Voice engineer, or cloud admin, or a data analytics specialist.

Of course, this all ties back to my role as Head Geek. Part of the job of a Head Geek is to promote the amazing—amazing solutions, amazing trends, amazing companies, and amazing groups—as it relates to monitoring.

One reason this is explicitly part of my job is to build an environment where the people who are quietly doing the work, but not identifying as part of "the group," feel more comfortable doing so. The more "the group" gains visibility, the more that people who WANT to be part of the group will gravitate towards it rather than falling into it by happenstance.

Which brings me back to the point about “amazing women in monitoring”. This isn’t a zero-sum competition. Looking for amazing women doesn’t somehow imply women are MORE amazing than x (men, minorities, nuns, hamsters, etc).

This is about doing my part to start a conversation where achievements can be recognized for their own merit.

I know that’s a pretty big soapbox to balance on a series of twitter posts, but I figure it’s gotta start somewhere.

So, if you know of any exceptional women in monitoring: forward this to them. Encourage them to connect – on Twitter (@LeonAdato), THWACK (@adatole), or in the comments below.

If you Knew Me, You Wouldn’t Believe Me

And all of a sudden, people are referring to you as an expert in the field. When this first happens, you may even feel like an imposter, a fraud. But don’t worry.

As long as you stay focused on helping make things better, on helping others, on elevating good work (whether it’s yours or someone else’s) that you find “out there”… as long as you do that, you won’t be a fraud.

Because that’s genuine work, even if it doesn’t feel like work to you. And as hard as it is to believe, it’s not easy enough that anyone can do it, because very few other people are doing it.

Otherwise, who would you be helping in the first place?

The Four Questions, Part 4

In the first part of this series, I described the four (ok, really five) questions that monitoring professionals are frequently asked. You can read that introduction here. Information on the first question (Why did I get this alert?) is here. You can get the low-down on the second question (Why DIDN'T I get an alert?) here. And the third question (What is monitored on my system?) is here.

My goal in this post is to give you the tools you need to answer the fourth question: Which of the existing alerts will potentially trigger for my system?

Reader’s Note: While this article uses examples specific to the SolarWinds monitoring platform, my goal is to provide information and techniques which can be translated to any toolset.

Riddle Me This, Batman…

It’s 3:00pm. You can’t quite see the end of the day over the horizon, but you know it’s there. You throw a handful of trail mix into your face to try to avoid the onset of mid-afternoon nap-attack syndrome and hope to slide through the next two hours unmolested.

Which, of course, is why you are pulled into a team meeting. Not your team meeting, mind you. It's the Linux server team. On the one hand, you're flattered. They typically don't invite anyone who can't speak fluent Perl or quote every XKCD comic in chronological order (using epoch time as their reference, of course). On the other…well, team meeting.

The manager wrote:

            kill `ps -ef | grep -i talking | awk '{print $2}'`

on the board, eliciting a chorus of laughter (from everyone but me). Of course, this gave the manager the perfect opportunity to focus the conversation on yours truly.

"We have this non-trivial issue, and are hoping you can grep out the solution for us," he begins. "We're responsible for roughly 4,000 systems…"

Unable to contain herself, a staff member interjected, "4,732 systems. Of which 200 are physical and the remainder are virtualized…"

Unimpressed, her manager said, “Ms. Deal, unless I’m off by an order of magnitude, there’s no need to correct.”

She replied, “Sorry boss.”

“As I was saying,” he continued. “We have a…significant number of systems. Now how many alerts currently exist in the monitoring system which could generate a ticket?”

"436, with 6 currently in active development," I respond, eager to show that I'm just as on top of my systems as they are of theirs.

“So how many of those affect our systems?” the manager asked.

Now I’m in my element. I answer, “Well, if you aren’t getting tickets, then none. I mean, if nothing has a spiked CPU or RAM or whatever, then it’s safe to say all of your systems are stable. You can look at each node’s detail page for specifics, although with 4,000—I can see where you would want a summary. We can put something together to show the current statistics, or the average over time, or…”

“You misunderstand,” he cuts me off. “I’m fully cognizant of the fact that our systems are stable. That’s not my question. My question is…should one of my systems become unstable, how many of your 436-soon-to-be-442 alerts WOULD trigger for my systems?”

He continued, “As I understand it, your alert logic does two things: it identifies the devices which could trigger the alert—All Windows systems in the 10.199.1 subnet, for example—and at the same time specifies the conditions under which an alert is triggered—say, when the CPU goes over 80% for more than 15 minutes.”

"So what I mean," he concluded, "is this: can you create a report that shows me the devices which are included in the scope of an alert's logic, irrespective of the trigger condition?"

Your Mission, Should You Choose to Accept it…

As with the other questions we've discussed in this series, the specifics of HOW to answer this question are less critical than knowing you will be asked it.

In this case, it’s also important to understand that this question is actually two questions masquerading as one:

  1. For each alert, tell me which machines could potentially trigger it
  2. For each machine, tell me which alerts could potentially be triggered
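
Both views really come from the same underlying data: a mapping of each alert's scope to the devices it covers. Here's a minimal sketch of that idea in Python, using entirely made-up alert and device names, just to show that answering one question effectively answers the other:

    # Hypothetical example: the same scope data answers both questions.
    # alert_scope maps each alert to the devices its scope clause matches.
    alert_scope = {
        "High CPU - Windows": ["web01", "web02", "sql01"],
        "Low Disk - Linux": ["app01", "app02"],
        "Node Down - All": ["web01", "web02", "sql01", "app01", "app02"],
    }

    # Question 1: for each alert, which machines could potentially trigger it?
    for alert, devices in alert_scope.items():
        print(f"{alert}: {', '.join(devices)}")

    # Question 2: for each machine, which alerts could potentially be triggered?
    per_device = {}
    for alert, devices in alert_scope.items():
        for device in devices:
            per_device.setdefault(device, []).append(alert)

    for device, alerts in sorted(per_device.items()):
        print(f"{device}: {', '.join(alerts)}")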

Why is this such an important question—perhaps the most important in this series? Because it determines the scale of the potential notifications monitoring may generate. It's one thing if 5 alerts apply to 30 machines. It's entirely another when 30 alerts apply to 4,000 machines.

The answer to this question has implications for staffing, shift allocation, pager rotation, and even the number of alerts a particular team may approve for production.

The way you go about building this information is going to depend heavily on the monitoring solution you are using.

In general, agent-based solutions are better at this because trigger logic – in the form of an alert name – is usually pushed down to the agent on each device, and thus can be queried from both directions ("Hey, node, which alerts are on you?" and "Hey, alert, which nodes have you been pushed to?").

That’s not to say that agentless monitoring solutions are intrinsically unable to get the job done. The more full-featured monitoring tools have options built-in.

Reports that list which devices fall within the scope of each alert; little "reminders" on the alert definition screen; or even resources on the device details page showing which alerts apply to that node.

Houston, We Have a Problem

What if it doesn't, though? What if you have pored through the documentation, opened a ticket with the vendor, visited the online forums and asked the greatest gurus up on the mountain, and come back with a big fat goose egg? What then?

Your choices at this point still depend largely on the specific software, but generally speaking there are 3 options:

  • Reverse-engineer the alert trigger and remove the actual trigger part

Many monitoring solutions use a database back-end for the bulk of their metrics, and alerts are simply a query against this data. The alert trigger queries may exist in the database itself, or in a configuration file. Once you have found them, you will need to create a copy of each alert and then go through each, removing the parts which comprise the actual trigger (e.g., CPU_Utilization > 80%) and leaving the parts that simply indicate scope (where Vendor = "Microsoft"; where "OperatingSystem" = "Windows 2003"; where "IP_address" contains "10.199.1"; etc.). This will likely necessitate your learning the back-end query language for your tool. Difficult? Probably. Will it increase your street cred with the other users of the tool? Undoubtedly. Will it save your butt within the first month after you create it? Guaranteed.

And once you’ve done it, running a report for each alert becomes extremely simple.
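
To make that concrete, here is a rough sketch of the idea in Python. The schema below (a Nodes table with Caption, Vendor, and IP_Address columns) is entirely hypothetical – substitute whatever your tool's back-end actually uses – but it shows how a scope-only copy of an alert query becomes a per-alert report:

    # Rough sketch of "strip the trigger, keep the scope."
    # The table and column names are hypothetical, not any particular vendor's schema.
    import sqlite3

    # Original alert query (scope + trigger):
    #   SELECT Caption FROM Nodes
    #   WHERE Vendor = 'Microsoft' AND IP_Address LIKE '10.199.1.%'
    #     AND CPU_Utilization > 80
    # Scope-only copy (trigger condition removed):
    scope_queries = {
        "High CPU - Windows 10.199.1.x":
            "SELECT Caption FROM Nodes "
            "WHERE Vendor = 'Microsoft' AND IP_Address LIKE '10.199.1.%'",
    }

    conn = sqlite3.connect("monitoring_copy.db")  # work against a copy, never production
    for alert_name, query in scope_queries.items():
        devices = [row[0] for row in conn.execute(query)]
        print(f"{alert_name}: {len(devices)} devices in scope")
        for caption in devices:
            print(f"  {caption}")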

  • Create duplicate alerts with no trigger

If you can't export the alert triggers, another option is to create a duplicate of each alert that has the "scope" portion, but not the trigger elements (so the "Windows machines in the 10.199.1.x subnet" part but not the "CPU_Utilization > 80%" part). The only recipient of that alert will be you, and the alert action should be something like writing to a logfile with a very simple string ("Alert x has triggered for Device y"). If you are VERY clever, you can export the key information in CSV format so you can import it into a spreadsheet or database for easy consumption. Every so often—every month or quarter—fire off those alerts and then tally up the results into something the recipient groups can slice and dice.
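
If you go this route, the tallying part can be a very small script. Here's one possible sketch – the log file name and the exact "Alert x has triggered for Device y" message format are just the assumptions from the example above:

    # Minimal sketch: turn "Alert <x> has triggered for Device <y>" log lines
    # into a CSV that can be pivoted in a spreadsheet or loaded into a database.
    import csv
    import re

    pattern = re.compile(r"Alert (?P<alert>.+) has triggered for Device (?P<device>.+)")

    rows = []
    with open("scope_alerts.log") as log:
        for line in log:
            match = pattern.search(line)
            if match:
                rows.append((match.group("alert"), match.group("device")))

    with open("alert_scope.csv", "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["Alert", "Device"])
        writer.writerows(rows)

    print(f"Wrote {len(rows)} alert/device pairs to alert_scope.csv")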

  • Do it by hand

If all else fails (and the inability to answer this very essential question doesn't cause you to re-evaluate your choice of monitoring tool), you can start documenting by hand. If you know up-front that you are in this situation, then it's simply part of the ongoing documentation process. But most times it's going to be a slog through the existing alerts, writing down the trigger information. Hopefully you can take that trigger info and turn it into an automated query against your existing devices. If not, then I would seriously recommend looking at another tool. Because in any decent-sized environment, this is NOT the kind of thing you want to spend your life documenting, and it's also not something you want to live without.

What Time Is It? Beer:o’clock

After that last meeting—not to mention the whole day—you are ready to pack it in. You successfully navigated the four impossible questions that every monitoring expert is asked (on more or less a daily basis)—Why did I get that alert, Why didn't I get that alert, What is being monitored on my systems, and What alerts might trigger on my systems? Honestly, if you can do that, there's not much more that life can throw at you.

Of course, the CIO walks up to you on your way to the elevator. “I’m glad I caught up to you,” he says, “I just have a quick question…”

Stay tuned for the bonus question!

Don’t Fear Starting Over

Sometimes revisions are inevitable. But have you ever considered starting over when you didn’t have to?

What if you reinstalled the system from the ground up, now that you understand how all the applications fit together? What if you started that code project from a blank screen, now that the requirements are completely fleshed out? What if you finished the essay, put it in a drawer, and started writing it all over again from the top?

What would you lose? Time, to be sure.

But what might you gain?

Some of our greatest discoveries were found after losing the initial effort, the first draft, the prototype.

So why don’t we do it – start over – more often, on purpose?

Do we really think the time we save is more valuable than the wonders we could uncover?

Luxury

In monitoring solutions, there are features which are essential.

There are features which are convenient.

There are features which are interesting, but not particularly useful.

Then there are features which are luxury items.

I think of a luxury feature in a monitoring solution (or any software, really) as something which may cost more, but is so amazingly convenient, or well-designed, or comfortable to use that it makes me excited to use the tool. It makes me better at what I do – able to accomplish more in less time, or to devote more time to the work because it takes longer for me to get tired of it.

Knowing the difference between essential, convenient, and luxury is important. You need to know what you are paying for, and why.

You need to build the skill of identifying luxury features, and making a case for them when needed, so that you don't find yourself "making do" when it's not necessary.

I Wish Someone Had Told Me

Ira Glass (host of "This American Life") has an oft-quoted piece about creative work titled "The Gap". You can read it here, but it's been made into Vines and videos and lots of other forms that are more fun to watch – you can Google the first sentence and find enough to waste an hour or two.

What Ira describes has a parallel in the world of I.T., and just like Ira’s experience, nobody told me when I started. So with all necessary apologies and legal disclaimers, here is my adaptation of Ira’s famous advice:

Nobody tells this to I.T. noobies. I wish someone told me. All of us who do technical work, we get into it because we have a desire to make things work better. Folks drawn to IT are quick to figure out how things work, but then we have this vision of how it could work – how cool it could be. And we know that if we could just get in there and tinker with it, we'd get it all sorted out and it would be incredible.

But there’s this gap.

For the first couple of years you are just plowing along. There’s so much to learn, and so much to do, and you have to earn respect before you get to do some of the cool stuff. So you do it. You just do the hard work of doing the work and learning and growing.

But at the same time you can see so much more that you want to fix, to improve, to be part of.

So you keep plugging away, soaking it all in and just trying to be part of whatever you can get into.

After a few years, you realize you’ve fallen into this trap – you have done all these different things (and done them well!) but now everyone expects you to be a jack of all trades all the time. To be the “he can figure it out” guy.

So this is the part nobody told me. The part that I had to figure out for myself. The part I’m telling you now:

After a few years, when you’ve seen the whole landscape of I.T. and you know what it’s all about, you need to pick. You need to decide where your personal desires and skills overlap. It might be storage, or voice, or server, or appdev, or whatever. It might have NOTHING to do with what you are doing here and now (actually, that’s a pretty safe bet). The point isn’t that you start doing “it” (at least not at first). The point is that you have chosen, and you commit to that goal.

To get there, you might need to work with a whole other team “on the side”, or after hours, or volunteer, or just hang out with “those guys” when they eat lunch. You might have to start reading a whole other set of blogs, or sneak to user groups or conventions on your lunch hour and days off.

And the job you are doing now, at the company you work for? You should get used to the idea that they’re not going to help you get there. Right now you are this amazing do-it-all resource. If you start only doing one thing they’re going to have to hire 2 or 3 more people to cover what you used to be doing. So don’t expect a lot of love in that direction.

But please, PLEASE keep doing it. Tap into the passion that got you into this in the first place – the desire to figure it out and the vision of how it can be better – and push ahead. You'll start commenting in forums, or writing blog posts, or jumping on tweet-ups.

You transform interest and enthusiasm into experience.

And all of a sudden, people are referring to you as an expert in the field. And all of a sudden, you are doing what you love, not just what you can.

Like Ira says: “It’s gonna take awhile. It’s normal to take awhile. You’ve just gotta fight your way through.”

Moments

We're told to visualize success. And when we do – when we imagine ourselves winning – it's in a moment.

But there’s another moment we should be envisioning – the moment before the moment. The one that comes before the winning move, before the piece of code that pulls it all together, before the report is handed in to smiles and nods and thumbs up.

And there’s a moment before that moment, too.

If you plan on winning today, you might want to spare some time to imagine for a moment what it looks like in those moments before the moment where you win.

Because that’s probably where the winning actually happens.

Lose, Survive, or Win?

My teenage son has had a rough patch lately – like many teens. Getting up for school requires Herculean commitment. Being civil, let alone kind, is almost impossible.

No surprises there, it’s all part of the journey.

This morning as I drove him to school, I asked him if he was going to let the morning’s bad start ruin the rest of his day.

"Nope," came his typically verbose reply.
“So,” I asked. “Do you plan to lose, survive, or win?”
“Uh…. survive?” he answered, clearly thrown off his recalcitrant game.
“Interesting choice.” I said, and left it at that.

But I think it’s worth asking ourselves each day. As we prepare to meet the challenges inherent in the day of a typical IT pro, do we envision ourselves losing, surviving, or finding a way to win?

Sometimes it's not important whether you actually win. Sometimes what matters – on day 463 of your job – is what your plan is.

The rest is part of the journey.