(and how monitoring can solve them)
I spend a lot of time talking about the value that monitoring can bring an organization, and helping IT professionals make a compelling case for expanding or creating a monitoring environment. One of the traps I fall into is talking about the functions and features that monitoring tools provide while believing that the problems they solve are self-evident.
While this is often not true when speaking to non-technical decision makers, it can come as a surprise that it’s sometimes not obvious even to a technical audience!
So I have found it helpful to describe the problem first, so that the listener understands and buys into the fact that a challenge exists. Once that’s done, talking about solutions becomes much easier.
With that in mind, here are the top 5 issues I see in companies today, along with ways that sophisticated monitoring addresses them.
Wireless Networks
Issue #1:
Ubiquitous wireless has directly influenced the decision to embrace BYOD programs, which has in turn created an explosion of devices on the network. It’s not uncommon for a single employee to have 3, 4, or even 5 devices.
This spike in device density has put an unanticipated strain on wireless networks. In addition to the sheer load, there are issues with the type of connections, mobility, and device proximity.
The need to know how many users are on each wireless AP, how much data they are pulling, and how devices move around the office has far outstripped the built-in options that come with the equipment.
Monitoring Can Help!
Wireless monitoring solutions tell you more than when an AP is down. They can alert you when an AP is over-subscribed, or when an individual device is consuming larger-than-expected amounts of data.
In addition, sophisticated monitoring tools now include wireless heat maps – which take the feedback from client devices and generate displays showing where signal strength is best (and worst) and the movement of devices in the environment.
Capacity Planning
Issue #2
We work hard to provision systems appropriately, and to keep tabs on how that system is performing under load. But this remains a largely manual process. Even with monitoring tools in place, capacity planning—knowing how far into the future a resource (CPU, RAM, disk, bandwidth) will last given current usage patterns—is something that humans do (often with a lot of guesswork). And all too often, resources still reach capacity without anyone noticing until it is far too late.
Monitoring Can Help!
This is a math problem, pure and simple. Sophisticated monitoring tools now have the logic built-in to consider both trending and usage patterns day-by-day and week-by-week in order to come up with a more accurate estimate of when a resource will run out. With this feature in place, alerts can be triggered so that staff can act proactively to do the higher-level analysis and act accordingly.
Packet Monitoring
Issue #3
We’ve gotten very good at monitoring the bits on a network – how many bits per second in and out; the number of errored bits; the number of discarded bits. But knowing how much is only half the story. Where those bits are going and how fast they are traveling is now just as crucial. User experience is now as important as network provisioning. As the saying goes: “Slow is the new down.” In addition, knowing where those packets are going is the first step to catching data breaches before they hit the front page of your favourite Internet news site.
Monitoring Can Help!
A new breed of monitoring tools includes the ability to read data as it crosses the wire and track source, destination, and timing. Thus you can get a listing of internal systems and who they are connecting to (and how much data is being transferred) as well as whether slowness is caused by network congestion or an anaemic application server.
Intelligent Alerts
Issue #4
“Slow is the new down”, but down is still down, too! The problem is that knowing something is down gets more complicated as systems evolve. Also, it would be nice to alert when a system is on its way down, so that the problem could be addressed before it impacts users.
Monitoring Can Help!
Monitoring tools have come a long way since the days of “ping failure” notifications. Alert logic can now take into account multiple elements simultaneously such as CPU, interface, and application metrics so that alerts are incredibly specific. Alert logic also now allows for de-duplication, delay based on time or number of occurrences, and more. Finally, the increased automation built into target systems allows monitoring tools to take action and then re-test at the next cycle to see if that automatic action fixed the situation.
Automatic Dependency Mapping
Issue #5
One device going down should not create 30 tickets. But it often does. This is because testing upstream/downstream devices requires knowing which devices those are, and how each depends on the other. This is either costly in terms of processing power, difficult given complex environments, time-consuming for staff to configure and maintain, or all three.
Monitoring Can Help!
Sophisticated monitoring tools now collect topology information using devices’ built-in commands, and then use that to build automatic dependency maps. These parent-child lists can be reviewed by staff and adjusted as needed, but they represent a huge leap ahead in terms of reducing “noise” alerts. And by reducing the noise, you increase the credibility of every remaining alert so that staff responds faster and with more trust in the system.
So, what are you waiting for?
At this point, the discussion doesn’t have to spiral around whether a particular feature is meaningful or not. As long as the audience agrees that they don’t want to find out what happens when everyone piles into conference room 4, phones, pads, and laptops in tow; or when the “free” movie streaming site starts pulling data out of your drive; or when the CEO finds out that the customer site crashed because a disk filled, but had been steadily filling up for weeks.
As long as everyone agrees that those are really problems, the discussion on features more or less runs itself.
(originally published on GeekSpeak)