(“In Case You Missed It Monday” is my chance to showcase something that I wrote and published in another venue, but that is still relevant. This week’s post was co-written with my fellow Head Geek Thomas LaRock, and originally appeared on OrangeMatter.)
In this age of instant gratification, the digital experience can make or break a business. Downtime or subpar performance can cause customer attrition and decreased productivity, leading to lower revenues. Slow truly is the new down, so IT professionals must become even more attuned to the overall health of data centers to quickly and proactively solve issues and identify problems affecting end users’ experiences and businesses’ bottom lines.
In a fitting example of parallelism, the data center echoes how a business handles adjusting and adapting to external market forces. As trends like cloud, virtualization, hybrid IT, converged infrastructure, and more continue to transform traditional IT management and troubleshooting, it’s no longer enough to say, “I know how and where my gear is located and connected, and I have platforms from which I can pull metrics when needed.”
Instead, you need to know how your infrastructure responds to the external stimuli of real users. You need tools to help you understand the multivariate inputs that create performance effects on discrete elements in your environment. For example, can you say that on the second Tuesday of every month, when a backup runs, a misconfigured, concurrent MapReduce job for Big Data is colliding with backup traffic on the storage network? This kind of knowledge is required to untangle performance metrics in today’s environments, and you must have a fundamental understanding of how these disparate pieces of infrastructure and technology work together to best deliver services, and quality of service, that satisfy end users. With the rate of change in today’s data center, that’s a tall order.
Data Center Monitoring Is Important
Why? Because even beyond the ultimate end goal of monitoring for and deciphering key performance metrics, modern troubleshooting presents even the most seasoned IT professionals with several hurdles, starting with what kind of environment you’re managing. If you’re tasked with overseeing a traditional data center, you’ve more than likely selected the technology and infrastructure systems (or you’re probably only a phone call away from that person, even if they’ve since moved on). It’s much easier to generate data for these systems because your IT department has agreed on a common set of standards for gathering those metrics, and monitoring tools are typically very mature.
On the other hand, if you’re at an organization working either partially or entirely in the cloud, IT departments typically take a back seat—business leaders often choose the service providers, so administrators are playing catch-up and getting to know someone else’s technology. Cloud service providers also deliver a relatively new service and are much more focused on rapidly developing features and functionality than on developing monitoring capabilities. DevOps helps to bridge this gap by enabling higher-level integration of tools, but, unfortunately, it’s quite rare to find an integrated tool with the breadth to monitor everything from traditional enterprise technology all the way to something as abstracted as containers.
What Data Center Metrics to Monitor
Another major hurdle is simply sifting through the seemingly endless number of data points that even one monitoring tool can generate, let alone the multiple tools many organizations deploy simultaneously. For IT professionals, more data isn’t always better. Certainly, the more metrics you have, the more visibility you have, but it’s also a much larger data set to manage. The single largest problem IT troubleshooters face is identifying that single point of truth amidst all the noise. The key is to identify and leverage the right data, getting only what you need when you need it.
To help streamline troubleshooting, there are a handful of performance metrics you should always monitor, including:
- Percent of capacity used (requires an understanding of your base capacity and metrics that report how much you’re using)
- Quality of service at the endpoint
- Network performance across the internet
- Component performance metrics in a composite application (that is, can you discretely monitor the relevant performance of each component?)
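To make the first metric on that list concrete, here is a minimal Python sketch of percent of capacity used: usage divided by base capacity. The volume names, the numbers, and the 80% warning threshold are all hypothetical placeholders, not values from any particular monitoring product.

```python
from typing import Dict, List, Tuple


def percent_used(used: float, capacity: float) -> float:
    """Return used/capacity as a percentage; capacity must be positive."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return 100.0 * used / capacity


def over_threshold(
    volumes: Dict[str, Tuple[float, float]], threshold: float = 80.0
) -> List[str]:
    """Return the sorted names of volumes at or above the threshold."""
    return sorted(
        name
        for name, (used, capacity) in volumes.items()
        if percent_used(used, capacity) >= threshold
    )


# Hypothetical sample data: name -> (used TB, base capacity TB)
volumes = {
    "san-vol-01": (42.0, 50.0),   # 84% used
    "san-vol-02": (10.0, 40.0),   # 25% used
    "backup-01": (95.0, 100.0),   # 95% used
}
hot = over_threshold(volumes)  # volumes worth a closer look
```

The point is less the arithmetic than the prerequisite: you can only compute this if you know your base capacity and have a metric that reports current usage.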
However, when it comes to troubleshooting and understanding performance metrics in today’s data center, the ability to easily and quickly collaborate with your peers across silos is arguably the fastest route to resolution. Investigating a potential issue to the extent of your ability or responsibility, and then distributing your findings widely via a team communication platform, ensures that no one else responds to the alert until you’ve had a look, and the same goes for the other teams involved.
Look for data center monitoring tools that allow you to selectively send metrics and problem requests to the next person or team in the chain and, ideally, include details about the metrics and the troubleshooting steps you’ve already attempted. Just to be clear, I’m not talking about a ticket system that logs notes and changes owners. I mean a system that lets you build a set of metrics that tell a specific story, and then share that story. Using the shared metrics and reporting capabilities of this tool, the second responder can immediately see what you’ve already addressed and the most likely root causes of the problem. Now, this second administrator can investigate additional metrics related to their specific domain, such as virtualization. They might discover a noisy neighbor problem and move some virtual machines around to create additional capacity.
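One way to picture that kind of shareable story is as a small record that bundles the relevant metrics with the steps already tried. The Python sketch below is purely illustrative: the class names, fields, team names, and sample values are assumptions of mine, not the data model of any real monitoring tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MetricSnapshot:
    """One metric reading that supports the story (hypothetical shape)."""
    name: str
    value: float
    unit: str


@dataclass
class Handoff:
    """A troubleshooting story passed to the next team in the chain."""
    summary: str
    from_team: str
    to_team: str
    metrics: List[MetricSnapshot] = field(default_factory=list)
    steps_tried: List[str] = field(default_factory=list)

    def render(self) -> str:
        """Format the story as plain text for a team chat channel."""
        lines = [f"{self.from_team} -> {self.to_team}: {self.summary}", "Metrics:"]
        lines += [f"  {m.name} = {m.value} {m.unit}" for m in self.metrics]
        lines.append("Already tried:")
        lines += [f"  - {step}" for step in self.steps_tried]
        return "\n".join(lines)


# Hypothetical example: storage hands an issue to virtualization.
handoff = Handoff(
    summary="High latency on shared LUN during backups",
    from_team="storage",
    to_team="virtualization",
    metrics=[MetricSnapshot("storage_latency_ms", 48.0, "ms")],
    steps_tried=["Confirmed backup schedule overlaps the latency spike"],
)
text = handoff.render()
```

The second responder reads the rendered story and starts from what is already known, rather than from a bare alert.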
Having the Right Data Center Monitoring Tools Matters
Unfortunately, most organizations deploy disparate monitoring tools in each operating silo, so cross-platform sharing and collaboration are difficult, let alone correlation of metrics (another must-have). However, as the cloud continually drives data center convergence, look for comprehensive monitoring and management tools that allow your IT department to easily find that single point of truth to more effectively manage and troubleshoot problems.
Beyond filing away the handful of metrics most likely to indicate performance problems and cultivating a more collaborative environment, consider the following best practices as additional guidance for navigating the tangled web of modern data center troubleshooting.
- Don’t panic. If you’re broadly measuring your infrastructure—meaning you have invested time and resources into deploying a monitoring tool or system that ensures you have a significant collection of metrics and data at your disposal—you will be able to solve the problem.
- Play with your data. Pull it into a view where you can juxtapose metrics you haven’t considered side-by-side before. Take wild guesses. You’ll start to more quickly make associations that uncover the root cause.
- Befriend the machines. Rely on automated context discovery wherever possible. Physical topology is one thing, but many monitoring platforms can now go a step further and determine a logical topology and interconnections between applications. Which application sits on which server, accessing which database, stored on which disk of which LUN? Then, when you start mixing and matching data and exploring juxtaposed metrics, you don’t have to do it off the top of your head. Use the automated context discovery provided by your monitoring solution as a starting point.
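As a rough sketch of what “playing with your data” can look like, the Python below computes a plain Pearson correlation for every pair of metric series and ranks the pairs by strength. The metric names and sample values are hypothetical, and a strong correlation is only a hint about where to look next, not proof of root cause.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List, Sequence, Tuple


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation of two equal-length series (0.0 if either is flat)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_pairs(series: Dict[str, Sequence[float]]) -> List[Tuple[str, str, float]]:
    """Return every metric pair, strongest absolute correlation first."""
    pairs = [
        (a, b, pearson(series[a], series[b]))
        for a, b in combinations(sorted(series), 2)
    ]
    return sorted(pairs, key=lambda p: abs(p[2]), reverse=True)


# Hypothetical samples taken at the same five points in time.
series = {
    "backup_mbps": [10, 20, 30, 40, 50],
    "storage_latency_ms": [2, 4, 6, 8, 10],
    "web_cpu_pct": [30, 10, 40, 20, 30],
}
ranked = rank_pairs(series)
top_a, top_b, top_r = ranked[0]  # the pair most worth a second look
```

In this made-up data, backup throughput and storage latency move together, which is exactly the kind of association worth checking against the topology your monitoring tool has discovered.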
With the rate of data center change showing no sign of slowing down, streamline and simplify your performance monitoring and troubleshooting processes by investing in next-generation, comprehensive tools that enable cross-team collaboration and let you apply the best practices above. The bottom line is that a fundamental understanding of how your infrastructure elements, both on-premises and in the cloud, work together to deliver services is critical to solving issues quickly and to proactively identifying problems before they affect end users’ experiences.