And I’ll argue that they do so because we want them to stay the same. When you are responsible for monitoring thousands of devices, and you’ve built a career on your guru-like expertise in a particular toolset, the last thing you want is for everything (or even part of everything) to change radically.
If you are a MIB wizard, your worst fear may be that everything goes to REST API calls. If you’ve spent years learning the ins and outs of a vendor’s database, the last thing you want to hear is that they’re moving to NoSQL.
So how will we respond to the pressures of IoT, SDN, hybrid cloud? Heck, how are we responding to the pressure of BYOD?
Are you going to try to tackle it with more of the same old?
Or is it finally time to re-think the way YOU do things, and let the vendors catch up to you for once?
“I call bullshit,” he said with authority. “It couldn’t have happened like that. Let’s move on to something real.”
And just like that, he missed the most important part.
Because sometimes – maybe often – what you need to know is not whether something really, actually, 100% accurately happened “that way”.
It’s about what happened next. More to the point, it’s about what people did about the thing that may-or-may-not-have-happened-that-way.
How we respond to events (real or perceived) tells us and others more about who we are than the situations we find ourselves in.
“The entire data center was crashing”
(it was only 10 servers)
“the CEO was calling my cell every 2 minutes”
(he called twice in the first 30 minutes and then left you alone)
“it was a massive hack, probably out of China or Russia”
(it was a mis-configured router)
Whatever. I’m not as interested in that as what you did next. Did you:
Call a vendor and scream that they needed to “fix this yesterday”?
Pull the team together, solicit ideas, and put together a plan?
Tell everyone to stay out of your way and work 24 hours to sort it out?
Wait 30 minutes before doing anything to see if anyone noticed, or if it sorted itself out?
Start documenting everything you saw happening, to review afterward?
Simply shut everything down and start it up again and see if that fixes it?
Look at your historical data to see if you can spot the beginning of the failure?
Immediately recover from backups, and let people know work will be lost?
Notice that most of those aren’t inherently wrong, although several are wrong depending on the specific circumstances.
And that is the ONLY point where “what happened” comes into play. The events around us shape our environment.
Cute. Except without requirements you can’t draw a line effectively.
So go from there. Start with what you need to accomplish, and then set your must-haves to achieve that goal. Don’t embellish, don’t waffle. Just the essentials.
Draw the line, set the next goal, list the deal-breakers.
Believe it or not, this is an exercise most organizations don’t do (at least for monitoring). They start with an (often vague) goal – “Select a NetFlow tool” or “Monitor customer experience for applications”. And then they look for tools that do that.
Before you know it, you have 20, or 50, or 150 tools (I am NOT exaggerating). You have staff – even whole teams – who invest hundreds of hours getting good at using those tools.
And then you get allegiances. And then you get politics.
How often you request information from a target device (commonly known as the polling cycle) is one of the fine arts of monitoring design, and one of the things that requires the most up-front discussion with consumers of your information.
Why? Because the delays between an actual event and the reporting or recording of it are often unclear to the consumer (the person who is getting the ticket, report, etc.). Worse, most of us in the monitoring segment of IT do a poor job of communicating these things.
Let’s take good ol’ ping as an example.
You want to check the up-down status of your systems on a regular basis. But you DON’T want to turn your monitoring system into an inadvertent DDOS Smurf attack generator.
So you pick some nice innocuous interval – say a ping check every 1 minute.
Then you realize that you need to monitor 3,000 systems, and so you back that off to 2 minutes.
Once you get past that frequency, there’s the additional level of verification. Just because a system failed ONE ping check doesn’t mean it’s actually down. So maybe your toolset does an additional level of just-in-time validation (let’s say it pings an additional 10 times, just to be sure the target is really down). That verification introduces at least another full minute of delay.
Is your alerting system doing nothing else? It’s just standing idle, waiting for this one alert to come in and be processed? Probably not. Let’s throw another 60 seconds into the mix for inherent delays in the alert engine.
Does your server team really want to know when a device is unavailable (i.e., ping failed) for 1 minute? Probably not. They know servers get busy and don’t respond to ping even though they are processing. Heck, they may even want to avoid cutting tickets during patching cycle reboots. My experience is that most teams don’t want to hear that a server is down unless it’s really been down for at least 5 minutes.
Now you get the call: “I thought we were pinging every minute. But I got a ticket this morning and when I went to check the server, it had been down for over 10 minutes. What gives?”
There’s no single “right” answer to this. Nobody wants false alarms. And nobody wants to find out a system is down after the customer already knows.
But it is between those two requirements that one minute stretches to ten.
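To make the arithmetic concrete, here is a rough sketch that adds up the delays from the ping example. The stage names and the `worst_case_delay_minutes` helper are made up for illustration, and it assumes the stages simply stack one after another, which is a simplification. The numbers are the ones used above: a 2-minute polling cycle, about a minute of verification, 60 seconds in the alert engine, and a 5-minute “really down” threshold.

```python
# Delay budget for the ping example above. Stage names are illustrative,
# not taken from any particular monitoring product.
DELAY_STAGES_MINUTES = {
    "polling cycle (worst case: failure right after a poll)": 2,
    "verification pings": 1,
    "alert engine processing": 1,
    "team's 'really down' threshold": 5,
}


def worst_case_delay_minutes(stages):
    """Sum every stage to get the longest gap between failure and ticket."""
    return sum(stages.values())


if __name__ == "__main__":
    for stage, minutes in DELAY_STAGES_MINUTES.items():
        print(f"{minutes:>2} min  {stage}")
    print(f"{worst_case_delay_minutes(DELAY_STAGES_MINUTES):>2} min  total")
```

Under those assumptions the ticket is cut roughly 9 minutes after the failure, so anyone who checks the server a little later will find it has been down for over 10. Swap in your own tool's numbers before you have this conversation with the people receiving the tickets.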
What is “essential” to your monitoring environment? (hey, this *is* a monitoring blog. I could certainly ask that about your kitchen, but that would be a completely different discussion)
Seriously – if your budget was infinite, you could have all the tools running on giant systems with all the disk in the world. But it (the budget) isn’t (infinite) and you don’t (have all the toys).
So start from the other end. What could you absolutely positively not live without? Is ping and a DOS batch file enough? A rock-solid SNMP trap listener? Is it a deal-breaker if you don’t have agents running local commands on each server?
Draw a line. Come up with reasons. Know where you stand.
To be sure, we’ve gotten better at presenting the information. We’ve found more efficient ways to collect that same information. And we’ve done a better job of streamlining the process of enabling those collections.
At the same time, the targets of monitoring have gotten smarter. Onboard agents expose more information, or are more responsive to polling requests.
But the raw underlying techniques haven’t shifted much. And cars still burn fossil fuels (well, unless you have one of these). And children still go to school to learn (more or less) the 3 R’s out of actual paper books.
Things really do stay the same more than they change.
And I’ll argue that they do so because we want them to stay the same.
(originally posted Feb 10, 2015)
(image credit Josh Rossi)
I started a thread on Twitter asking Who are some awesome women in monitoring? One of the common reactions (privately and respectfully, I’m happy to say) has been asking me why I started the discussion in the first place. I thought that question deserved a response.
Because, I’m a feminist. Yes, Virginia, Orthodox Jewish middle-aged white guys can be feminists, too. Because I think that anything that can be done to promote and encourage women getting into STEM professions should be done. Full stop. Because people from different backgrounds, cultures, and environments see the world differently, and if there’s one thing you need in a “the order entry system is down again” crisis, it’s as many experienced perspectives as possible to get that sucker running again.
“But why ‘women in monitoring’?” I’m then asked. “Why not ‘awesome women in I.T.’ or just ‘awesome women in STEM’ ?”
Because on top of all the “Because”-es above, I’m also a MONITORING enthusiast. I think monitoring (especially monitoring done right) is awesome, a lot of fun, and provides a huge value to organizations of all sizes.
I also think it’s an under-appreciated discipline within I.T. today. The current state of monitoring-as-a-discipline within IT reminds me of InfoSec, Storage, or Virtualization about a decade ago. Back then, it (infosec, virtualization, etc.) was a set of skills, but few people claimed that it was their sole role within a company. Fast forward to today, and IT departments wouldn’t dream of not having specialists in those areas. I think (and hope) that in a few years we’ll look back at monitoring and see the same type of progression.
I want to see monitoring recognized as a career path, the same as being a Voice engineer, or cloud admin, or a data analytics specialist.
Of course, this all ties back to my role as Head Geek. Part of the job of a Head Geek is to promote the amazing—amazing solutions, amazing trends, amazing companies, and amazing groups—as it relates to monitoring.
One reason this is explicitly part of my job is to build an environment where those people who are quietly doing the work, but not identifying as part of “the group”, feel more comfortable doing so. The more “the group” gains visibility, the more that people who WANT to be part of the group will gravitate towards it rather than falling into it by happenstance.
Which brings me back to the point about “amazing women in monitoring”. This isn’t a zero-sum competition. Looking for amazing women doesn’t somehow imply women are MORE amazing than x (men, minorities, nuns, hamsters, etc).
This is about doing my part to start a conversation where achievements can be recognized for their own merit.
I know that’s a pretty big soapbox to balance on a series of Twitter posts, but I figure it’s gotta start somewhere.
So, if you know of any exceptional women in monitoring: Forward this to them. Encourage them to connect – on Twitter (@LeonAdato), THWACK (@adatole) or in the comments below.
As long as you stay focused on helping make things better, on helping others, on elevating good work (whether it’s yours or someone else’s) that you find “out there”… as long as you do that, you won’t be a fraud.
Because that’s genuine work, even if it doesn’t feel like work to you. And as hard as it is to believe, it’s not easy enough that anyone can do it, because very few other people are doing it.
Otherwise, who would you be helping in the first place?
In the first part of this series, I described the four (ok, really five) questions that monitoring professionals are frequently asked. You can read that introduction here. Information on the first question (Why did I get this alert) is here. You can get the low-down on the second question (Why DIDN’T I get an alert) here. And the third question (What is monitored on my system) is here.
My goal in this post is to give you the tools you need to answer the fourth question: Which of the existing alerts will potentially trigger for my system?
Reader’s Note: While this article uses examples specific to the SolarWinds monitoring platform, my goal is to provide information and techniques which can be translated to any toolset.
Riddle Me This, Batman…
It’s 3:00pm. You can’t quite see the end of the day over the horizon, but you know it’s there. You throw a handful of trail mix into your face to try to avoid the onset of mid-afternoon nap-attack syndrome and hope to slide through the next two hours unmolested.
Which, of course, is why you are pulled into a team meeting. Not your team meeting, mind you. It’s the Linux server team. On the one hand, you’re flattered. They typically don’t invite anyone who can’t speak fluent Perl or quote every XKCD comic in chronological order (using epoch time as their reference, of course). On the other…well, team meeting.
on the board, eliciting a chorus of laughter (from everyone but me). Of course, this gave the manager the perfect opportunity to focus the conversation on yours truly.
“We have this non-trivial issue, and are hoping you can grep out the solution for us,” he begins. “We’re responsible for roughly 4,000 systems…”
Unable to contain herself, a staff member followed by stating, “4,732 systems. Of which 200 are physical and the remainder are virtualized…”
Unimpressed, her manager said, “Ms. Deal, unless I’m off by an order of magnitude, there’s no need to correct.”
She replied, “Sorry boss.”
“As I was saying,” he continued. “We have a…significant number of systems. Now how many alerts currently exist in the monitoring system which could generate a ticket?”
“436, with 6 currently in active development,” I respond, eager to show that I’m just as on top of my systems as they are of theirs.
“So how many of those affect our systems?” the manager asked.
Now I’m in my element. I answer, “Well, if you aren’t getting tickets, then none. I mean, if nothing has a spiked CPU or RAM or whatever, then it’s safe to say all of your systems are stable. You can look at each node’s detail page for specifics, although with 4,000—I can see where you would want a summary. We can put something together to show the current statistics, or the average over time, or…”
“You misunderstand,” he cuts me off. “I’m fully cognizant of the fact that our systems are stable. That’s not my question. My question is…should one of my systems become unstable, how many of your 436-soon-to-be-442 alerts WOULD trigger for my systems?”
He continued, “As I understand it, your alert logic does two things: it identifies the devices which could trigger the alert—All Windows systems in the 10.199.1 subnet, for example—and at the same time specifies the conditions under which an alert is triggered—say, when the CPU goes over 80% for more than 15 minutes.”
“So what I mean,” he concluded, “is this: can you create a report that shows me the devices which are included in the scope of an alert logic irrespective of the trigger condition?”
Your Mission, Should You Choose to Accept it…
As with the other questions we’ve discussed in this series, the specifics of HOW to answer this question are less critical than knowing you will be asked it.
In this case, it’s also important to understand that this question is actually two questions masquerading as one:
For each alert, tell me which machines could potentially trigger it
For each machine, tell me which alerts may potentially be triggered
Why is this such an important question – perhaps the most important – in this series? Because it determines the scale of the potential notifications monitoring may generate. It’s one thing if 5 alerts apply to 30 machines. It’s entirely another when 30 alerts apply to 4,000 machines.
The answer to this question has implications for staffing, shift allocation, pager rotation, and even the number of alerts a particular team may approve for production.
The way you go about building this information is going to depend heavily on the monitoring solution you are using.
In general, agent-based solutions are better at this because trigger logic – in the form of an alert name – is usually pushed down to the agent on each device, and thus can be queried (both “Hey, node, what alerts are ‘on’ you?” and “Hey, alert, which nodes have you been pushed to?”).
That’s not to say that agentless monitoring solutions are intrinsically unable to get the job done. The more full-featured monitoring tools have options built-in.
Reports that look like this:
Not to mention little “reminders” like this on the alert screen:
Or even resources on the device details page that look like this:
Houston, We Have a Problem…
What if it doesn’t, though? What if you have pored over the documentation, opened a ticket with the vendor, visited the online forums and asked the greatest gurus up on the mountain, and come back with a big fat goose egg? What then?
Your choices at this point still depend largely on the specific software, but generally speaking there are 3 options:
Reverse-engineer the alert trigger and remove the actual trigger part
Many monitoring solutions use a database back-end for the bulk of their metrics, and alerts are simply a query against this data. The alert trigger queries may exist in the database itself, or in a configuration file. Once you have found them, you will need to create a copy of each alert and then go through each, removing the parts which comprise the actual trigger (e.g., CPU_Utilization > 80%) and leaving the parts that simply indicate scope (where Vendor = “Microsoft”; where “OperatingSystem” = “Windows 2003”; where “IP_address” contains “10.199.1”; etc.). This will likely necessitate your learning the back-end query language for your tool. Difficult? Probably. Will it increase your street cred with the other users of the tool? Undoubtedly. Will it save your butt within the first month after you create it? Guaranteed.
And once you’ve done it, running a report for each alert becomes extremely simple.
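As a rough illustration of what “removing the trigger” means, here is a small sketch using Python's built-in `sqlite3` module. Everything about the data is hypothetical: the `nodes` table, its columns, and the sample rows are invented for the example, and your tool's back-end will look different. The point is only that the alert query splits into a scope clause and a trigger clause, and the report you want is the scope clause on its own.

```python
import sqlite3

# Hypothetical schema and data; a real tool's tables and columns will differ.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE nodes (name TEXT, vendor TEXT, ip_address TEXT, cpu_utilization REAL)"
)
conn.executemany(
    "INSERT INTO nodes VALUES (?, ?, ?, ?)",
    [
        ("win-app-01", "Microsoft", "10.199.1.15", 35.0),
        ("win-app-02", "Microsoft", "10.199.1.16", 92.0),
        ("win-db-01", "Microsoft", "10.200.4.20", 88.0),
        ("lnx-web-01", "Red Hat", "10.199.1.30", 95.0),
    ],
)

# The scope portion of the alert: which devices the alert could ever apply to.
SCOPE = "vendor = 'Microsoft' AND ip_address LIKE '10.199.1.%'"
# The trigger portion: the condition that actually fires the alert.
TRIGGER = "cpu_utilization > 80"


def nodes_matching(where_clause):
    """Return node names matching a WHERE clause, sorted for stable output.

    The clause is interpolated directly, so only pass trusted constants.
    """
    rows = conn.execute(
        f"SELECT name FROM nodes WHERE {where_clause} ORDER BY name"
    )
    return [name for (name,) in rows]


firing_now = nodes_matching(f"{SCOPE} AND {TRIGGER}")  # the original alert
in_scope = nodes_matching(SCOPE)  # the copy with the trigger removed

print("Firing now:", firing_now)
print("Could fire:", in_scope)
```

With the sample rows, only one Windows box in the subnet is firing right now, but two are in scope. The second list is what the server team is asking for.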
Create duplicate alerts with no trigger
If you can’t export the alert triggers, another option is to create a duplicate of each alert that has the “scope” portion, but not the trigger elements (so the “Windows machines in the 10.199.1.x subnet” part but not the “CPU_Utilization > 80%” part). The only recipient of that alert will be you and the alert action should be something like writing to a logfile with a very simple string (“Alert x has triggered for Device y”). If you are VERY clever, you can export the key information in CSV format so you can import it into a spreadsheet or database for easy consumption. Every so often—every month or quarter—fire off those alerts and then tally up the results that recipient groups can slice and dice.
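If the trigger-less alerts write CSV, the tallying step might look something like the sketch below. The two-column layout (alert name, device name), the sample alert names, and the `tally` helper are all assumptions made for illustration. It builds both views at once: which devices fall under each alert, and which alerts apply to each device.

```python
import csv
import io
from collections import defaultdict

# Stand-in for the logfile the trigger-less alerts would write to.
# The column layout (alert name, device name) is an assumption.
SAMPLE_LOG = """\
High CPU - Windows,win-app-01
High CPU - Windows,win-app-02
Disk Full - All Servers,win-app-01
Disk Full - All Servers,lnx-web-01
"""


def tally(log_file):
    """Build both views: alert -> devices and device -> alerts."""
    devices_by_alert = defaultdict(set)
    alerts_by_device = defaultdict(set)
    for row in csv.reader(log_file):
        if len(row) != 2:
            continue  # skip blank or malformed lines
        alert, device = (field.strip() for field in row)
        devices_by_alert[alert].add(device)
        alerts_by_device[device].add(alert)
    return devices_by_alert, alerts_by_device


devices_by_alert, alerts_by_device = tally(io.StringIO(SAMPLE_LOG))

for alert, devices in sorted(devices_by_alert.items()):
    print(f"{alert}: {len(devices)} device(s)")
for device, alerts in sorted(alerts_by_device.items()):
    print(f"{device}: {len(alerts)} alert(s)")
```

In practice you would pass an open file instead of the in-memory sample, and hand the two dictionaries to whatever spreadsheet or report the recipient groups prefer.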
Do it by hand
If all else fails (and the inability to answer this very essential question doesn’t cause you to re-evaluate your choice of monitoring tool), you can start documenting by hand. If you know up-front that you are in this situation, then it’s simply part of the ongoing documentation process. But most times it’s going to be a slog through existing alerts, writing down the trigger information. Hopefully you can take that trigger info and turn it into an automated query against your existing devices. If not, then I would seriously recommend looking at another tool. Because in any decent-sized environment, this is NOT the kind of thing you want to spend your life documenting, and it’s also not something you want to live without.
What Time Is It? Beer:o’clock
After that last meeting – not to mention the whole day – you are ready to pack it in. You successfully navigated the four impossible questions that every monitoring expert is asked (on more or less a daily basis): Why did I get that alert, Why didn’t I get that alert, What is being monitored on my systems, and What alerts might trigger on my systems? Honestly, if you can do that, there’s not much more that life can throw at you.
Of course, the CIO walks up to you on your way to the elevator. “I’m glad I caught up to you,” he says, “I just have a quick question…”