(“In Case You Missed It Monday” is my chance to showcase something that I wrote and published in another venue, but is still relevant. This week’s post originally appeared on DataCenterJournal.com)
In part one of this series, I described a situation I often run up against when I work with customers to develop or improve their monitoring environment. It was so shocking the first time that I compared it to the discovery that one of my children had been persistently bullied at school. Even though I’ve run into this monitoring issue many times, familiarity has rendered it no less shocking.
What I found is that IT professionals—solid, technically astute, experienced folks—are willing to put up with ineffective monitoring simply because they’ve come to believe that “that’s the way it is.” A combination of persistent corporate bad habits and a general de-emphasis of monitoring creates environments where monitoring doesn’t just lack user-friendly features; it’s sometimes actively user-hostile.
Aside from the generally good business policy of removing practices that reduce morale or make people’s lives harder, addressing this issue is important because in the 20 years I’ve been building, improving and maintaining monitoring systems, I’ve not seen this behavior change, even though changing it is incredibly simple.
In this post, my focus is on two specific areas: first, how to build monitoring solutions that don’t make the users feel like they’re being beaten up when they use them, and second, how you and the folks who receive the benefits of monitoring can move to a healthy place where your expectations match what a good solution can provide.
But first, a brief aside.
My Message to Management: Why Bother?
It should be clear that I wrote this post (and the one that preceded it) to help monitoring engineers and those who receive all that monitoring goodness. But as their manager, you’re the one who both enables and encourages improvement of the technology under your purview. Since the suggestions I’m offering below are nontrivial to implement, I wanted to first get your buy-in (like any good IT practitioner should). You may be wondering what the real business value of improved monitoring is.
First and foremost, monitoring is about reducing cost to the business. Ask yourself: what’s the cost of a typical outage? I’m not talking about “the big one”—an outage so massive that it’s become part of corporate lore. I’m talking about the regular, ongoing outages and failures that many folks chalk up to the cost of doing business. Disk and NIC failures, drives filling up, application crashes, flapping routes, and so on. Each one of those failures requires some amount of staff time (and sometimes hardware) to resolve. More to the point, they happen repeatedly. So even if the individual cost of an incident is low, it can become a case of death (or at least annoyance) by a thousand cuts—a thousand cuts that cost you and your organization a lot over the long haul.
Great monitoring can help you avoid outages entirely by detecting the patterns that precede (and even predict) the failure and by taking preemptive action (restarting a service, clearing temp files, moving a VM, etc.). But even monitoring that’s simply “good” can alert you to the condition ahead of time and allow your staff to get ahead of it, reducing the event’s impact (and cost).
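To make the “preemptive action” idea concrete, here’s a minimal sketch of the disk-space case: decide whether a volume is dangerously full and, if so, clear a scratch directory before the disk actually fills. The threshold, the temp directory, and the split into a testable decision function are my illustrative assumptions, not a prescription.

```python
import shutil

def needs_remediation(used, total, limit=0.90):
    """Decide whether a volume is close enough to full to act on."""
    return used / total >= limit

def clear_temp_if_full(path, temp_dir):
    """Preemptively reclaim space before the disk actually fills.

    `path` is any location on the volume to check; `temp_dir` is a
    directory of expendable scratch files (an assumption -- pick
    something genuinely safe to delete in your environment).
    Returns True if remediation ran.
    """
    usage = shutil.disk_usage(path)
    if needs_remediation(usage.used, usage.total):
        shutil.rmtree(temp_dir, ignore_errors=True)
        return True
    return False
```

In practice a monitoring tool would run a check like this on a schedule, but the point is how little logic stands between “we saw the pattern” and “we prevented the outage.”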
Let’s say you do the math and realize a specific event costs $100 (in staff time, lost opportunity, etc.) each time it happens. And “merely good” monitoring reduces that cost to $25. Small potatoes, right? But then you look at your ticket system and realize this event occurs about 15 times a month. That’s $1,125 in savings per month, or $13,500 a year, that monitoring has cut off the event’s impact.
And that’s just a single event. Great (or even “merely good”) monitoring can enable your team to address hundreds of these types of incidents, which can add up to millions of dollars in avoided costs.
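This back-of-the-envelope math is simple enough to script and hand to the finance folks; a sketch, using the illustrative figures from the text:

```python
def annual_savings(cost_per_event, cost_with_monitoring, events_per_month):
    """Annual cost avoided by cutting the per-event impact of one event type."""
    saved_per_event = cost_per_event - cost_with_monitoring
    return saved_per_event * events_per_month * 12

# The example above: $100 -> $25 per event, 15 events a month.
savings = annual_savings(100, 25, 15)  # 75 * 15 * 12 = 13,500
```

Run it across every recurring event type in the ticket system and the “millions of dollars in avoided costs” claim stops sounding like hand-waving.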
The second reason you, as the manager, should care is because you understand what monitoring is and what it provides your organization when it works:
- It allows you to know exactly what you have and where it is—whether “it” is a series of ephemeral containers in the cloud, a set of switches in closets, or access points embedded in ceilings around the campus.
- It lets you collect and retain network and server-system configurations so you can tell what changed, when and by whom. And it lets you put things back the way they were without asking your chief network engineer to rebuild a 300-line core switch configuration from memory. (Or retrieve it from a flash drive he keeps around his neck. Even in the shower. True story.)
- It enables automated responses to events, executed with the unerring reflexes of a computer.
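The configuration-retention point above—keep copies so you can tell exactly what changed and put things back—needs nothing more than the standard library to sketch. The device names and config strings here are hypothetical examples:

```python
import difflib

def config_diff(old_config, new_config):
    """Show exactly what changed between two saved device configs."""
    return list(difflib.unified_diff(
        old_config.splitlines(), new_config.splitlines(),
        fromfile="previous", tofile="current", lineterm=""))

old = "hostname core-sw1\nvlan 10\nvlan 20"
new = "hostname core-sw1\nvlan 10\nvlan 30"
changes = [line for line in config_diff(old, new)
           if line.startswith(("+", "-"))
           and not line.startswith(("+++", "---"))]
# changes -> ['-vlan 20', '+vlan 30']
```

A real solution adds scheduled collection and timestamped storage, but the diff is the part that saves your chief network engineer from rebuilding 300 lines from memory.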
Third, you should care because you recognize that if your solution is monitoring business-critical systems (and honestly, why wouldn’t it?), then monitoring is now mission critical and should be treated as such.
And finally, you know that “operational excellence” you keep talking about? You can’t know if you have it, or how to get it, without monitoring it.
How to Build Solutions That Don’t Hurt
Let’s start by focusing on how to build (or maybe rebuild) monitoring that does what it should and provides the value it’s capable of.
First, keep in mind that it should always be about the data, not the tool. I’ll go out on a limb and say there’s no one-size-fits-all solution. In fact, there’s not even a one-size-fits-most solution. You will have multiple tools performing various monitoring operations, so your primary focus should be on ensuring you consume all your data inputs in the widest and most convenient ways possible. Your secondary focus should be on keeping the number of disparate tools to a minimum.
Second, if you have multiple tools, you’ll immediately find at least a little overlap among them. Decide which tools are your primary source of truth and which should be a backup and/or a sanity check to ensure that the primary tools are gathering accurate information.
Along those lines, one trap many companies fall into—especially if the monitoring solution is corporate selected (versus home grown, as I described in the previous post on this topic)—is to take the most expensive tool and use it in the largest possible number of tasks, often going so far as to shoehorn it into situations where it’s suboptimal. The thinking is that “we paid so much for this, we better get our money’s worth.” But the opposite is usually true. By attempting to use it everywhere, the tool’s total cost climbs even higher while its usefulness decreases.
Instead, imagine deploying your monitoring tools like a luscious cake. The base of the cake is the actual cake (called “sponge” in the baking biz). When made correctly, the sponge is certainly delicious, but it’s also the cheapest of the cake elements, while at the same time being the most solid. It can cover a wide base for a relatively low cost. In between layers of sponge goes the filling. The flavor of the filling (compared with the sponge) is much more intense, but you can’t build a whole cake out of it. It’s also more expensive than the sponge layer and therefore should be used more thoughtfully. Finally, there are the Belgian-chocolate shavings. They’re incredibly intense (and expensive) and therefore used sparingly as well as for one specific task: to adorn the top of the cake.
With this mouth-watering example in mind, you should use your most economical (and versatile) tool for the widest possible set of tasks around your enterprise, with other tools adding “flavor” by serving in more-limited situations. You should use your most expensive tools sparingly, keeping them in their sweet spot and also controlling costs.
My last piece of advice is to invest in the idea of monitoring engineers and even a monitoring team. Regardless of the size of your organization, you should have one or more people whose primary responsibility centers on monitoring—not just maintaining the tools that do monitoring, but the techniques that get monitoring done.
Although this approach sounds radical, I liken it to the advent of the InfoSec professional. A decade ago, InfoSec was a set of tasks that network or systems admins performed as a slice of their work but not their main job. Today, however, no company (regardless of size) would dream of operating without a dedicated security professional, either on staff or at least under contract.
Systems- and network-monitoring software has reached the same point. The convergence, overlap and interdependence among technologies has hit critical mass, where monitoring in silos is no longer viable. Companies need people who are comfortable exploring failure modes regardless of where in “the stack” they occur, who love data in all its forms, and who are conversant with collecting that data through multiple protocols, techniques and technologies. These people should also be able to create automation that employs the near-real-time status that monitoring provides and reacts with lightning speed to either correct the situation or gather more data, enabling human responders to have the richest insights into the issue.
Monitoring engineers help bridge the gap between the solution and the teams that ultimately rely on it. They can employ a broad knowledge of various technologies and work with the teams that have depth expertise to address repeated or pernicious issues.
How to Help the Healing Process
If you’re the monitoring engineer and you’re coming into an environment where poor tools and poorer implementations have left a bad taste in people’s mouths, here’s how you can work your way back to a healthy state. First, ensure that you have or are in the process of building (or rebuilding) a monitoring solution as described in the previous section. Second, focus on satisfying existing demand first. Going back to my initial essay on this topic, when the network engineer I spoke to asked me for monitoring of nonroutable interfaces, I showed him this capability right out of the gate. I proved I could deliver. His immediate reaction was to ask whether I could monitor nonroutable interfaces that weren’t up. When I asked for clarification, I found out that backup cellular circuits were coming up unexpectedly, but the network team couldn’t prove that the primary circuit was up.
The punchline? A simple notification to the network personnel alerting them that a cell line came up allowed them to quickly spot-check. The result was a refund of over $1,500 of unnecessary cell charges in the first month and a 75% decrease in overall backup-circuit usage over the following year. But I wouldn’t have had the opportunity to do that if I hadn’t addressed the first (admittedly simple) request. So make sure you listen to what the team wants and that you deliver it exactly as requested and without excuses. If you do so, you’ll have plenty of opportunities to show off your real monitoring chops.
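The notification in that story reduces to a transition check once the interface status is in hand. Here’s a sketch; the status source and the data shapes are my assumptions (in practice, `previous` and `current` would come from successive SNMP polls of ifOperStatus, and the result would feed your alerting tool):

```python
def backup_circuit_alerts(previous, current, backup_interfaces):
    """Report backup interfaces that just transitioned to 'up'.

    `previous` and `current` map interface name -> operational status
    from two successive polls; `backup_interfaces` lists the interfaces
    that should normally stay down.
    """
    return [name for name in backup_interfaces
            if previous.get(name) != "up" and current.get(name) == "up"]

before = {"cell0": "down", "mpls0": "up"}
after = {"cell0": "up", "mpls0": "up"}
alerts = backup_circuit_alerts(before, after, ["cell0"])
# alerts -> ["cell0"]: time to spot-check whether the primary really failed
```

Ten lines of logic, $1,500 back in the first month. That’s the kind of ratio monitoring can deliver once someone actually listens to the request.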
Next, look to the ticket system. Not just for inspiration, but for data. You want to use your resources—both your technology and your time—to address recurring, unresolved and high-impact events. When you find an event that seems to be occurring repeatedly, use it as an opportunity to talk to the team stuck dealing with it. Often you’ll find another case where the monitoring solution has “beaten them up” so often, they’ve stopped asking for it to be better. That’s your chance to come in, have an honest talk and start the healing process.
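Mining the ticket system for repeat offenders can start as a few lines over exported data; a sketch, where the field names are assumptions about your export format:

```python
from collections import Counter

def top_recurring(tickets, n=3):
    """Rank ticket categories by frequency -- the repeat offenders
    are where monitoring effort pays off first."""
    return Counter(t["category"] for t in tickets).most_common(n)

tickets = [
    {"category": "disk-full"}, {"category": "disk-full"},
    {"category": "nic-failure"}, {"category": "disk-full"},
]
# top_recurring(tickets, 1) -> [("disk-full", 3)]
```

Weight each category by its per-event cost and you have a ranked backlog for the monitoring team, grounded in the organization’s own history rather than guesswork.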
Finally, if you skipped it initially, go back and read my message to management now. You can help your C-suite heal from the pain of false monitoring promises. I asked them to calculate the cost of an outage. You can help yourself by helping them in that task. Commit to making the value assessment part of your process when building a new alert. Find out what the cost (in hours, dollars, etc.) is per event pre-alert and how the alert helped improve it. Then you can assign that dollar value in savings each time the alert fires.
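Baking that value assessment into the alert itself can be as simple as carrying the per-event savings alongside the alert definition. This structure is a hypothetical illustration, not any particular tool’s API:

```python
class Alert:
    """Tracks cumulative savings every time an alert fires.

    `saved_per_event` is (pre-alert cost per event) minus (cost with
    the alert) -- the figure the text suggests working out when the
    alert is first built.
    """
    def __init__(self, name, saved_per_event):
        self.name = name
        self.saved_per_event = saved_per_event
        self.fire_count = 0

    def fire(self):
        self.fire_count += 1

    @property
    def total_savings(self):
        return self.fire_count * self.saved_per_event

a = Alert("backup-cell-circuit-up", saved_per_event=75)
for _ in range(15):
    a.fire()
# a.total_savings -> 1125 (one month at 15 events)
```

Report that running total alongside uptime numbers and the value conversation with management gets a lot shorter.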
The entire company may be amazed at how much actual measurable value it brings to the table and realize that it’s never necessary to put up with monitoring solutions that hurt.