An excerpt from an essay I wrote for an upcoming issue of “Data Center Journal” was especially relevant to today’s post. The essay is titled “Data, Information, Action”:
The saying, “you can have data without information, but you cannot have information without data,” may never have been so blindingly obvious or true as it is today. We are awash in seas of data, fed by thundering, swollen tributaries like the Internet of Things, mobile computing and social media. The goal of the so-called “big data” movement is to channel those raging rivers into meaningful insight.
For almost 20 years, my specialty within the field of IT has been systems monitoring and management. Those who share my passion for finding ever newer and more creative ways to determine when, how, and if a server went bump in the night understand that data versus information is not really a dichotomy. It’s a triad.
Of course, good monitoring starts with data. Lots of it, collected regularly from a variety of devices, applications and sources across the data center. And of course, transforming that data into meaningful information (charts, graphs, tables and even speedometers) that represents the current status and health of critical services is the work of the work.
But unless that information leads to action, it’s all for naught. And that, patient reader, is what this article is about—the importance of taking that extra step to turn data-driven insight into actionable behavior. What is surprising to me is how often this point is overlooked. Let me explain:
Let’s say you diligently set up your monitoring to collect hard drive data for all of your critical servers. You’re not only collecting disk size and space used, but you also pull statistics on IOPS, read errors and write errors.
Now, let’s say your sophisticated and robust monitoring technology goes the extra mile, not only converting those metrics to pretty charts and graphs, but also analyzing historical data to establish baselines so that your alerts don’t just trigger when, for example, disk usage is over 90 percent, but rather, for example, when disk usage jumps 50 percent over normal for a certain time period.
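To make the difference between a fixed threshold and a baseline-relative alert concrete, here is a minimal Python sketch. It assumes "normal" is simply the average of recent samples and that "50 percent over normal" means 50 percent higher than that average. Real monitoring platforms compute baselines in their own ways, and the function names here are hypothetical.

```python
from statistics import mean


def over_fixed_threshold(current_pct, limit_pct=90.0):
    """Classic alert: fire whenever disk usage exceeds a fixed percentage."""
    return current_pct > limit_pct


def over_baseline(history_pct, current_pct, jump=0.5):
    """Baseline alert: fire when usage is more than `jump` (a fraction)
    above the average of recent samples.

    Assumption: the baseline is a plain mean of `history_pct`.
    """
    if not history_pct:
        # With no history there is no baseline to compare against.
        return False
    baseline = mean(history_pct)
    return current_pct > baseline * (1 + jump)


# A server that normally sits around 40 percent full and suddenly hits 65.
recent_usage = [40, 42, 38, 40]
fixed_alert = over_fixed_threshold(65)
baseline_alert = over_baseline(recent_usage, 65)
```

In this sketch the fixed 90 percent threshold stays quiet at 65 percent usage, while the baseline check fires because 65 is well above the 60 percent mark implied by a 40 percent norm.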
Now, let’s say you roll that monitoring out to all 5,000 of your critical servers and begin to “enjoy” about 375 “disk full” tickets per month.
That, sadly, is the normal state of affairs at most companies. It’s the point where, as a monitoring engineer (or, at the very least, the person in charge of the server monitoring), you begin to notice the dark looks and poorly hidden sneers from colleagues who have had “your” monitoring wake them one too many times at 2 a.m.
So, what’s missing? The answer is found in a simple question: Now what? As in, once you and the server team have hashed out the details of the disk full alert, the next thing you should do is ask, “What should we do now? What’s our next step?” In this case, it would likely involve clearing the temp directory to see if that resolves the issue.
And the next logical step from there is automation. Often, the same monitoring platform that kicks up a fuss about a server being down at 2 a.m. can clear that nasty old temp directory for you. Right then and there, all while you’re still sound asleep. Then, if and only if the problem persists, a ticket will be cut so a human can get involved. And said human will know that before their precious beauty sleep was so rudely interrupted, the temp directory had already been cleared, so it’s something just a bit more sophisticated than that.
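The flow is simple enough to sketch in a few lines of Python. This is only an illustration of the idea, not any particular platform's feature: the three callables (`is_disk_full`, `clear_temp_dir`, `open_ticket`) are hypothetical stand-ins for whatever your monitoring tool and ticketing system actually provide.

```python
def handle_disk_full_alert(is_disk_full, clear_temp_dir, open_ticket):
    """Run when a disk full alert fires: remediate first, escalate second.

    All three arguments are hypothetical hooks supplied by the caller:
      is_disk_full()   -> bool, re-checks the alert condition
      clear_temp_dir() -> performs the automated cleanup
      open_ticket(msg) -> pages a human with context
    """
    # Step 1: try the obvious fix while everyone is still asleep.
    clear_temp_dir()

    # Step 2: re-check. If the cleanup worked, nobody gets woken up.
    if not is_disk_full():
        return "resolved"

    # Step 3: the problem persists, so cut a ticket and say what was tried.
    open_ticket("Disk still full after temp directory was cleared")
    return "ticket"
```

The message passed to `open_ticket` matters as much as the cleanup itself, because it tells the person who gets paged that the easy fix has already been ruled out.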
This type of automated action is neither difficult to understand nor super complicated to establish. But in the environments where I’ve personally implemented it, the result was a whopping 70 percent reduction in disk full tickets.