“Logfile Monitoring” – I Do Not Think It Means What You Think It Means

This is a conversation I have A LOT with clients. They say we want “logfile monitoring” and I am not sure what they mean. So I end up having to unwind all the different things it COULD be, so we can get to what it is they actually need.

It’s also an important clarification for me to make as SolarWinds Head Geek because, depending on what the requester means, I might need to point them toward Kiwi Syslog Server, Server & Application Monitor (SAM), or Log & Event Manager (LEM).

Here’s a handy guide to identify what people are talking about. “Logfile monitoring” is usually applied to 4 different and mutually exclusive areas. Before you allow the speaker to continue, please ask them to clarify which one they are talking about:

  1. Windows Logfile
  2. Syslog
  3. Logfile aggregation
  4. Monitoring individual text files on specific servers

More clarification on each of these areas below:

Windows Logfile

Monitoring in this area refers specifically to the Windows event log, which isn’t actually a log “file” at all, but a database unique to Windows machines.

In the SolarWinds world, the tool that does this is Server & Application Monitor. Or if you are looking for a small, quick, and dirty utility, the Eventlog Forwarder for Windows will take Eventlog messages that match a search pattern and pass them via Syslog to another machine.

Syslog

Syslog is a protocol that describes how to send a message from one machine to another, typically over UDP port 514. The messages must fit a pre-defined structure. Syslog is different from SNMP traps. It is most often encountered when monitoring *nix (Unix, Linux) systems, although network and security devices send out their fair share as well.
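
To make that “pre-defined structure” concrete, here’s a minimal sketch in Python (deliberately product-neutral) of what a classic RFC 3164-style syslog message looks like on the wire. The hostname and collector address are made up for the example:

```python
import socket

def build_syslog_message(facility, severity, hostname, tag, text):
    """Build a minimal RFC 3164-style syslog message.
    PRI (the number in angle brackets) is facility * 8 + severity."""
    pri = facility * 8 + severity
    return f"<{pri}>{hostname} {tag}: {text}"

def send_syslog(message, server, port=514):
    """Send the message to a collector over UDP, syslog's classic transport."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode("utf-8"), (server, port))

# facility 16 (local0) at severity 6 (informational) yields PRI 134
msg = build_syslog_message(16, 6, "web01", "myapp", "service started")
# send_syslog(msg, "kiwi.example.com")  # hypothetical collector address
```

A fully conformant RFC 3164 message also carries a timestamp after the PRI; many collectors will stamp the arrival time themselves if it’s missing.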

In terms of products, this is covered natively by Network Performance Monitor (NPM), but as I’ve said often, you shouldn’t send syslog or traps directly to your NPM primary poller. You should send them through a syslog/trap “filtration” layer first. And that would be the Kiwi Syslog Server (or its freeware cousin).

Logfile aggregation

This technique involves sending (or pulling) log files from multiple machines and collecting them on a central server. This collection is done at regular intervals. A second process then searches across all the collected logs, looking for trends or patterns in the enterprise. When the audit and security groups talk about “logfile monitoring,” this is usually what they mean.
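
As a rough illustration of that second process (not how LEM actually works internally, just the general idea), here’s a Python sketch: once log files have been pulled into a central archive directory, a search pass walks all of them looking for a pattern.

```python
import re
from pathlib import Path

def sweep_collected_logs(archive_dir, pattern):
    """Scan every log file already pulled into the central archive and
    return (filename, line number, line) for each line matching `pattern`."""
    regex = re.compile(pattern)
    hits = []
    for logfile in sorted(Path(archive_dir).glob("**/*.log")):
        for lineno, line in enumerate(logfile.read_text().splitlines(), start=1):
            if regex.search(line):
                hits.append((logfile.name, lineno, line))
    return hits
```

The interesting (and hard) part in a real product is the first half of the job: reliably pulling the files from hundreds of machines on a schedule, which this sketch takes as a given.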

As you may have already guessed, the SolarWinds tool for this job is Log & Event Manager. I should point out that LEM will ALSO receive syslog and traps, so you kind of get a twofer if you have this tool. Although, I personally STILL think you should send all of your syslog and trap to a filtration layer, and then send the non-garbage messages to the next step in the chain (NPM or LEM).

Monitoring individual text files on specific servers

This activity focuses on watching a specific (usually plain text) file in a specific directory on a specific machine, looking for a string or pattern to appear. When that pattern is found, an alert is triggered. Now it can get more involved than that—maybe not a specific file, but a file matching a specific pattern (like a date); maybe not a specific directory, but the newest sub-directory in a directory; maybe not a specific string, but a string pattern; maybe not ONE string, but 3 occurrences of the string within a 5 minute period; and so on. But the goal is the same—to find a string or pattern within a file.
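
For illustration, here’s a hedged Python sketch of that last variation: flag a problem only once a pattern shows up three times inside a rolling five-minute window. It assumes, purely for the example, that every log line begins with an ISO-8601 timestamp:

```python
import re
from datetime import datetime, timedelta

def pattern_breaches_threshold(lines, pattern, count=3, window=timedelta(minutes=5)):
    """Return True once `pattern` has matched `count` times within any
    rolling `window`. Assumes (for the example only) that every log line
    starts with an ISO-8601 timestamp, e.g. '2016-03-01T08:45:00 ERROR ...'."""
    regex = re.compile(pattern)
    stamps = []
    for line in lines:
        if not regex.search(line):
            continue
        stamps.append(datetime.fromisoformat(line.split(" ", 1)[0]))
        # drop matches that fell out of the window ending at this newest match
        stamps = [t for t in stamps if stamps[-1] - t <= window]
        if len(stamps) >= count:
            return True
    return False
```

Every variation in the paragraph above (file-name patterns, newest sub-directory, string patterns) is just a different knob on the same loop.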

Within the context of SolarWinds, SAM has been the go-to solution for this type of thing. But at the moment, that’s only possible through a series of Perl, PowerShell, and VBScript templates.

We know that’s not the best way to get the job done, but that’s a subject for another post.

The More You Know…

For now, it’s important that you are able to clearly define—for both you and your colleagues, customers, and consumers—the difference between “logfile monitoring” and which tool or technique you need to employ to get the job done.

#FeatureFriday: Validating SolarWinds Database Maintenance

Welcome to “Feature Friday”, a series of (typically short) videos which explain a feature, function, or technique.

One of the key aspects of SolarWinds tools is their ease of use. Other enterprise-class monitoring solutions require you to have subject matter experts on hand to help with implementation and maintenance. SolarWinds can be installed by a single technician and doesn’t require you to have a DBA or Linux expert or CCIE on hand.

But that doesn’t mean there’s no maintenance happening. And while a lot of it is automated, it’s important for folks who are responsible for the SolarWinds toolset to understand whether that maintenance is running correctly.

In this video, Head Geeks Kong Yang, Patrick Hubbard, and I go over the SolarWinds maintenance subroutines and how to see whether things are happy or not under the hood.


For more insights into monitoring, as well as random silliness, you can follow me on Twitter (@LeonAdato) or find me on the SolarWinds THWACK.com forums (@adatole)

Essential

What is “essential” to your monitoring environment? (hey, this *is* a monitoring blog. I could certainly ask that about your kitchen, but that would be a completely different discussion)

Seriously – if your budget was infinite, you could have all the tools running on giant systems with all the disk in the world. But it (the budget) isn’t (infinite) and you don’t (have all the toys).

So start from the other end. What could you absolutely, positively not live without? Is ping and a DOS batch file enough? A rock-solid SNMP trap listener? Is it a deal-breaker if you don’t have agents running local commands on each server?

Draw a line. Come up with reasons. Know where you stand.

The Four Questions, Part 5

In the first part of this series, I described the four (ok, really five) questions that monitoring professionals are frequently asked. You can read that introduction here. Information on the first question (Why did I get this alert) is here. You can get the low-down on the second question (Why DIDN’T I get an alert) here. The third question (What is monitored on my system) is here. And the fourth question (What alerts WILL trigger for my system) is posted here.

But as I’ve hinted all along, and despite the title of the series, there are actually five questions that you will be asked when you start down the path to becoming a monitoring engineer. The fifth, and final, question that you will most likely be asked is:

What do you monitor “standard”?

Reader’s Note: While this article uses examples specific to the SolarWinds monitoring platform, my goal is to provide information and techniques which can be translated to any toolset.

No Rest for the Weary…

You thought you were going to make it out of the building without another task hanging over your head, but the CIO caught you in the elevator and decided to have a “quick chat” on the way down.

 “I’m glad I caught up to you,” he says, “I just have a quick question…”

“I love what you’re doing with the monitoring system,” he begins, “but one thing that I keep hearing from support teams is that they feel like every new device has to be monitored from the ground up. Is there a way we can just have some monitors in place as soon as it enters the system?”

Choosing your words carefully (it *is* the CIO, after all) you respond “Well, there’s a whole raft of metrics we collect as soon as any new system is added into monitoring. Then we augment that with metrics based on what the new device provides and the owner and support teams need.”

“That’s a relief,” smiles your CIO, as you arrive at your car. He is now literally standing between you and a well-earned beer. “But what our teams – your customers, you know,” he adds with a chuckle, “need to know is what those standard options are. Like when we’re buying a new car,” he says, eyeing your 2007 rustbucket. “Monitoring should list which features are standard, and which are optional upgrades. That shouldn’t be too hard, right?”

“Not at all!” you chirp, your enthusiasm having more to do with the fact that he’s moved out of the way than the specter of new work.

“Excellent,” he says. “Can you pull it together for me to look over tomorrow?” Without waiting for an answer, he calls “Have a great night!” over his shoulder.

Standard Time

The reason this (“what do you monitor standard?”) is a common question has a lot to do with how monitoring solutions usually come into being in organizations; how monitoring teams are established; and how conversations between you (the monitoring engineer) and the recipient of monitoring tend to go.

First, while in some cases monitoring solutions are conceived and implemented as a company-wide initiative in a greenfield scenario, those cases are rare. More often what ends up being the company standard started off as a departmental (or even individual) initiative, which somebody noticed and liked and promoted until it spread.

“Standard procedures? We do it like this because Harold did it that way when he installed this thing 3 years ago.”

Second, monitoring teams often form around the solution (product XYZ) OR around the cause (“what do we want?” MONITORING! “When do we want it?” PING!). Unlike more mature IT disciplines like networking, storage, or even the relative newcomer virtualization, people don’t usually set out to become monitoring engineers. And thus, there are precious few courses, books, or even tribal knowledge to fall back on to understand how it is “usually” done. Teams form and, for better or worse, begin to write their own set of rules.

Third (and last), because of the preceding two points, conversations between the ersatz monitoring engineer and the monitoring services consumer (the person or team who will read the reports, look at the screens, and/or receive the alerts) tend to have a number of disconnects. More on that in a moment.

What We Have Here…

Put yourself in the shoes of that monitoring consumer I mentioned a minute ago. You’ve got a new device, application, or service which needs to be monitored. At best, you need it monitored because you realize the value monitoring brings to the table. At worst, you need it monitored because you were told your device, application, or service is a class III critical element to the organization, and therefore monitoring is mandatory. You just need to check the little box that says “monitoring” so that you can get this sucker rolled into production.

So you make the call and set up a meeting with someone from the vaguely shadowy sounding ‘monitoring team’ (what, do they manage all the keylogger and spyware software you’re sure this place is crawling with?) and a little while later, you’re sitting down in a conference room with them.

You explain your new project, device, application, or service. The person is taking notes, nodding at all the right places. This is looking positive. Then they look up from their notes and ask:

“So, what do you need to have monitored here?”

In your head, you’re thinking “I thought YOU were supposed to tell ME that. Why do I have to do all the heavy lifting here?”

But you’re a pro so of course you don’t say that. Instead you reply, “Well, I’m sure whatever you monitor standard should be fine.”

The monitoring person looks mildly annoyed, and asks “Anything ELSE? I mean, this is a pretty important <device/application/service>, right?”

So now you DO open your mouth. “Well, it’s hard to tell since I’m not sure what standard monitoring is!”

“Oh, come on,” comes the retort. “You’ve been using XYZ monitoring for 2 years now. You know what we monitor on the other stuff for you. It’s…”

And then they do the most infuriating thing: they rattle off a list of words. Some are components, some are counters, some sound vaguely familiar, and others you’ve never heard of. And there they sit with their arms crossed, looking at you across the table.

*************

While the above scene is (hopefully) more dramatic than the ones you’ve encountered in real life, it nevertheless captures the essence of the issue. If the monitoring team and those who benefit from monitoring are working from different playbooks, the overall effectiveness of monitoring (not to mention YOU) is going to be impacted.

So how do you avoid this? Or if you’ve already fallen into friction, how do you get past it? As with the rest of this series, I have a few suggestions, which largely boil down to “knowing you will be asked this question is half the battle,” because now you can prepare.

But the daemon is, as they say, in the details. So let’s dig in!

好記性,不如爛筆頭

(“A good memory is no match for a worn pen nib”)

Yes, my first suggestion is going to be to make sure you have good documentation.

Before you quail at the thought of mountains of research, keep in mind that you aren’t documenting the specific thresholds, values, etc. for each node. You are compiling a list of the default values which are collected when a new system is added into your monitoring solution. Depending on your software, that could be a consistent list regardless of device type, or it could be a core set of common values along with a few additional variations by device type.

Your first stop should always be to RTFM, which stands, of course, for “Read The Friendly Manual”. Many vendors have already done this work for you, and all you need to do is copy and paste that into your documentation.

But in the case that your vendor considers such information to be legally protected intellectual property, your next step is to just LOOK AT THE SCREENS THEMSELVES. I mean, it’s not that hard. After scanning a couple of the detail pages for devices, you’re going to see a pattern emerge. It will probably be something like:

  • Availability
    • Response time
    • Packet loss
  • CPU
    • Overall percent utilization, 5 minute average
    • Per processor % util 5 min avg
  • Physical RAM
    • % Util 5 min avg
    • IO Operations Per Second (IOPS)
    • Errors
  • Virtual RAM
    • % Util 5 min avg
    • IO Operations Per Second (IOPS)
    • Errors
  • Disk
    • Availability
    • Capacity
    • Used
    • IOPS
  • Interface
    • Availability
    • Bandwidth
      • Bits per second (BPS)
      • % Utilization
      • Packets Per Second (PPS)
      • Errors
      • Discards
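
One low-effort way to keep that documentation honest is to store the “standard” list as data and generate the human-readable handout from it, so the wiki page, the email, and the intake-meeting printout never drift apart. A Python sketch (the metric names are illustrative, not any product’s actual defaults):

```python
# The metric names below are illustrative, not any product's actual defaults.
STANDARD_METRICS = {
    "Availability": ["Response time", "Packet loss"],
    "CPU": ["Overall % utilization (5 min avg)",
            "Per-processor % utilization (5 min avg)"],
    "Physical RAM": ["% utilization (5 min avg)", "IOPS", "Errors"],
    "Virtual RAM": ["% utilization (5 min avg)", "IOPS", "Errors"],
    "Disk": ["Availability", "Capacity", "Used", "IOPS"],
    "Interface": ["Availability", "Bits per second", "% utilization",
                  "Packets per second", "Errors", "Discards"],
}

def render_handout(metrics):
    """Flatten the dict into the printed list you hand across the table."""
    lines = []
    for component, counters in metrics.items():
        lines.append(component)
        lines.extend(f"  - {counter}" for counter in counters)
    return "\n".join(lines)
```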

There, in that relatively short list, is probably the core of the metrics every machine provides. Then you have specifics for routers (flapping routes, VPN metrics, etc.), switches (stacked-switch metrics, VLANs), virtual machines (host statistics, “noisy neighbor” events), and so on.

When all else fails, there’s Wireshark. By monitoring the connection between the monitoring system and a single target device (preferably one you have newly added) you can capture all communication and suss out the metrics being requested and supplied. It’s not pretty, and it might not even be easy. But it’s a lot easier than having to repeatedly tell your colleagues that you have no idea what gets monitored.

PS: If you are down to the Wireshark method, it might be time to look at another monitoring solution.

There’s An App For That

Everything I’ve listed above is great for hardware-based metrics, including specialized storage and virtualization tools. But once you cross into the valley of the shadow of application monitoring, you need a better strategy.

By its very nature, application monitoring is highly complex, highly customized, and highly variable. Take something as straightforward as Microsoft Exchange. When building a monitoring solution, you might need specific monitoring for the hub transport server, the mailbox server, the client access server, or even the external OWA server (just to name a few). While there are common elements from one to the other, they serve very different purposes and require very different components and thresholds.

Take that concept and multiply it by all the applications in your environment.

So what is the diligent yet slightly-overwhelmed monitoring engineer to do? While movies like “Glengarry Glen Ross” and “Boiler Room” have popularized the phrase “ABC – Always Be Closing” within sales circles, in monitoring (and indeed, within much of IT) I prefer “ABS – Always Be Standardizing”.

Whether the topic is alerts, reports, or sets of application monitoring components, the best thing you can do for yourself is to continuously work toward a single “gold standard” for a particular need and then keep using it. Expand that single standard when necessary to account for additional variations, but avoid as much as possible the copy-and-paste process that leaves you with 42 different Exchange monitoring templates and (18 months down the road) no idea which one is the default, let alone which one is applied to which server.

If you are able to adhere to this one practice, then answering the question “what do you monitor, standard” for applications becomes immeasurably more achievable.

Good Monitoring Is Its Own Reward

Armed with the information above, the conversation I described at the beginning of this essay goes in a very different direction:

Them: “We need to monitor about 20 of our devices and the applications that go with them.”

You: “Great! It’s a good thing you’re talking to me and not accounts receivable then. Do you have a list of what you are putting together?”

Them: (laughing politely at your very not funny joke) “Right here”

You: “OK, great. I’ve got a list here of the things that get monitored automatically when we load your devices into our software. Can you take a look and tell me if anything you think is important is missing?” (hands printed list of default hardware monitors across the table)

Them: (Scanning the list) “Wow. This is… a lot. It’s more than we expected. I don’t want tickets on all this, though…”

You: “Oh, no. These are the metrics we collect. Alerts are something else. But if you need to see the general health of your systems, this is what we can provide you automatically.”

Them: “Oh, I get it! That’s great. You know what? We’ll ride with this list for the time being.”

You: “Perfect. Now, on the application side, here’s what we have in place today.” (hands over printed list of application monitoring).

Them: “OK, I can already see we’ll need a few extra items that the devs have said they want to track. They will also want to alert on some custom messages they’re generating.”

You: “No worries. You can get me that list by email later if you want. I’m sure we’ll be able to get it all set up for you.”

Them: “Wow, this was a lot less painful than I expected.”

You: (Winks at camera) “I know!”

The Long And Winding Road

You pull into the driveway and head into your home, reflecting on how 8:45am seemed like 100 years ago in terms of the miles you’ve put on that old brain of yours. But then you sit down and take a minute to clear your head, and you realize a few things:

First, that monitoring, GOOD monitoring, is something achievable by anyone who is willing to put in a little time and effort, and can be accomplished with most of the monitoring solutions on the market today.

Second, that putting in the time and effort to create good monitoring makes the overall job significantly more enjoyable.

Third, that monitoring is, and should be treated as, a separate discipline within IT, as valid a sub-specialty as networking, programming, storage, virtualization or information security. And only when it IS treated as a discipline will people be willing to put in the time and effort I’ve been describing, and to invest in the tools that help make that possible.

Fourth, and finally, that everything that happened today – from the 8:45am beginning I described in part 1 all the way to the cold beer in your hand now – is all part of a day’s work in IT.

 

#FeatureFriday: What is WMI?

Welcome to “Feature Friday”, a series of (typically short) videos which explain a feature, function, or technique.

WMI (Windows Management Instrumentation) is, as the name implies, a foundational protocol for Microsoft Windows-based environments. Despite this, it is not well understood in terms of what it can do and how to make it do it.

In this video Chris O’Brien and I take a look at the basics of WMI and what it looks like in its most basic form.


For more insights into monitoring, as well as random silliness, you can follow me on Twitter (@LeonAdato) or find me on the SolarWinds THWACK.com forums (@adatole)

Nothing is New Under the Sun

Ping. SNMP. WMI. Script. API call.

As I’ve elaborated on before, monitoring hasn’t changed much in the last decade or so.

To be sure, we’ve gotten better at presenting the information. We’ve found more efficient ways to collect that same information. And we’ve done a better job of streamlining the process of enabling those collections.

At the same time, the targets of monitoring have gotten smarter. Onboard agents expose more information, or are more responsive to polling requests.

But the raw underlying techniques haven’t shifted much. And cars still burn fossil fuels (well, unless you have one of these). And children still go to school to learn (more or less) the three Rs out of actual paper books.

Things really do stay the same more than they change.

And I’ll argue that they do so because we want them to stay the same.

More on that to come…

The Three Black Boxes (hint: the network was only the beginning)

NOTE: This post originally appeared here

Once upon a time, back in the dark ages of IT, people would sit around the table and talk about “the black box” and everyone knew what was meant—the network. The network, whether it was ARCNET, DECnet, Ethernet or even LANTastic, seemed inscrutable to them. They just connected their stuff to the wires and hoped for the best.

Back in those early days, conversations with us early network engineers often went something like this:

Them: “I think the network is slow.”
Us (who they considered pointy-hatted-wizards): “No, it’s not.”
Them: “Look, I checked the systems and they’re fine. I think it’s the network.”
Us: “Come on, it’s rarely ever the network.”
Them: “Well, I still think…”
Us: “Then you’ll need to prove it.”

It was so difficult for them to pierce this veil that—if urban legends on the subject are to be believed—the reason the image of a cloud is used to signify a network is because it was originally labeled by those outside the network with the acronym TAMO. This stood for, “then a miracle occurs,” and the cloud graphic reinforced the divine and un-knowable nature of bits flowing through the wire.

But we in the network knew it wasn’t a miracle, though it was still somewhat of a black box even to us—a closed system that took a certain type of input, implemented only somewhat monitorable processes inside and then produced a certain type of output.

With time, though, the network became much less of a black box to everyone. Devices, software and our general knowledge grew in sophistication so that we now have come to expect bandwidth metrics, packet error data, NetFlow conversations, deep packet inspection results, IPSLA and more to be available on demand and in near real-time.

But recently, two new black boxes have arrived on the scene. And this time, we net admins are on the outside with almost everyone else.

The first of these, virtualization (along with its commoditized cousin, cloud computing), has grown to the point where the count of physical servers in medium-to-large companies is sometimes just a tenth of the overall server count.

Ask an application owner if he knows how many other VMs are running on the same host and you’ll be met with a blank stare. Probe further by asking if he thinks a “noisy neighbor”—a VM on the same host that is consuming more resources than it should—is impacting his system and he’ll look at you conspiratorially and say, “Well, I sure think there’s one of those, but heck if I could prove it.”

Still, we love virtual environments. We love the cost savings, the convenience and the flexibility they afford our companies. But don’t fool yourself—unless we’re actually on the virtualization team, we don’t really understand them one bit.

Storage is the other “new” black box. It presents the same challenge as virtualization, only worse. Disks are the building blocks of arrays, which are collected via a software layer into LUNs, which connect through a separate network “fabric” to be presented as datastores to the virtual layer or as contiguous disk resources to physical servers.

Ask that already paranoid application owner which actual physical disks his application is installed on and he’ll say you may as well ask him to point out a specific grain of sand on a beach.

Making the storage environment even more challenging is its blended nature. Virtualization, for all the complexity, is a binary choice. Your server is either virtualized or it’s not. Storage isn’t that clear cut—a physical server may have a single traditional platter-based disk for its system drive, connect to a SAN for a separate drive where software is installed and then use a local array of SSD drives to support high-performance database I/O.

OK, so what does all this have to do with the network? Well, what’s most interesting about these new black boxes—especially to us network folk—is how they are turning networking back into a black box as well.

Think about it—software-based “virtual” switches distribute bandwidth from VMs to multiple network connections.

Also, consider that SAN “fabric” is often more software than hardware.

And then there is the rise of SDN, a promising new technology to be sure, but one that still needs to have some of the rough edges smoothed away.

The good news is that, like our original, inscrutable networking from the good old days, the ongoing drive towards maturity and sophistication will crack the lid on these two new black boxes and reverse the slide of the network back into one as well.

Even now it’s possible to use the convergence of networking, virtualization and storage to connect more dots than ever before. Because of the seamless flow from the disk through the array, LUN, datastore, hypervisor and on up to the actual application, we’re able to show—with a tip of the old fedora to detective Dirk Gently—the interconnectedness of all things. With the right tools in hand, we can now show how an array that is experiencing latency affects just about anything.

That paranoid application owner might even stop using his “they’re out to get me” coffee mug.

#FeatureFriday: Understanding Network Configuration Backups

Welcome to “Feature Friday”, a series of (typically short) videos which explain a feature, function, or technique.

Network configuration backup is usually not considered to be a monitoring technique, and yet the opportunities this technique opens up are pretty significant.

In this video, Chris O’Brien and I take a look at Config Backups and their potential for you and your environment.


For more insights into monitoring, as well as random silliness, you can follow me on Twitter (@LeonAdato) or find me on the SolarWinds THWACK.com forums (@adatole)

The Myth of the “Five Nines”

“Five-Nines” refers to something that is available 99.999% of the time. It’s become a catch phrase in various parts of the IT industry.

It’s also complete bullshit.

Sean Hull did a great job explaining why five-nines is over-rated in this post. But my point is NOT that this level of reliability is expensive. It’s that it’s nearly impossible to functionally achieve.

I’m also saying that the demands for (or the claims of) “Reliability in the five-nines” are highly over-blown.

Let’s do the math.

  • In a single minute, 5-9’s means you could be unavailable for just 0.0006 seconds.
  • In an hour, you could have 0.036 seconds of downtime.
  • In a day, your system would get 0.864 seconds of breathing room.
  • In a week, you could take a 6.048-second break before dipping below 5-9’s.
  • In a (four-week) month, you’d only get 24.192 seconds of downtime.
  • In any given fiscal quarter, you could expect just over a minute – 72.576 seconds, to be precise – of an outage.
  • Half a year? You get over two minutes – 145.152 seconds – where that system was not available.
  • And across a full 365-day year, your 5-9’s system would experience just 315.36 seconds – about five and a quarter minutes – of outage. Total. Over the entire year.
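
If you’d like to check or extend these numbers yourself (the month and quarter figures assume four-week months), the arithmetic fits in a few lines of Python:

```python
def downtime_budget_seconds(period_seconds, availability=0.99999):
    """Seconds of allowed downtime over a period at a given availability."""
    return period_seconds * (1 - availability)

MINUTE, HOUR, DAY = 60, 3600, 86400
WEEK, YEAR = 7 * DAY, 365 * DAY

# Spot-checks (abs() because floating point makes these only approximately equal)
assert abs(downtime_budget_seconds(HOUR) - 0.036) < 1e-9
assert abs(downtime_budget_seconds(WEEK) - 6.048) < 1e-9
# A full 365-day year allows about 315 seconds -- roughly 5.26 minutes
print(round(downtime_budget_seconds(YEAR), 2))
```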

You seriously expect any device, server, or service to be available to all users for all but about five minutes in an entire year?

This has implications for us as monitoring professionals. After all, we’re the ones tasked with watching the systems, raising the flag when things become unavailable. When someone is less than 99.999% available, we’re the ones the system owners come to, asking us to “paint the roses green”. We’re the ones who will have to re-check our numbers, re-calculate our thresholds, and re-explain for the thousandth time that “availability” always carries with it observational bias.

Yes, Mr. CEO, the server was up. It was up and running in a datacenter where a gardener with a backhoe had severed the WAN circuit; it was up and running and everyone in the country could see it except for you, because wifi was turned off on your laptop; it was up and running but it showed “down” in monitoring because someone changed the firewall rules so that my monitoring server could no longer reach it…

There’s an even more pressing fact-on-the-ground: polling cycles are neither constant nor instant. Realistic polling intervals sit at around 1-2 minutes for “ping”-type checks, and 5 minutes for data collection. If I’m only checking the status of a server every minute, and my monitoring server is dealing with more than that one machine, the reality is that I won’t be cutting a ticket for a “down” device for 3-5 minutes. That blows your 5-9’s out of the water right there.
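
You can sketch that worst case in a couple of lines. The retry count and trigger delay below are illustrative defaults, not any product’s actual settings:

```python
def worst_case_detection_seconds(poll_interval, confirm_polls=2, alert_delay=60):
    """Worst case: the device dies right after a successful poll, then has to
    fail `confirm_polls` consecutive checks before the alert fires, plus any
    built-in trigger delay. All values are in seconds; the defaults here are
    illustrative, not any product's actual settings."""
    return poll_interval + confirm_polls * poll_interval + alert_delay

# 60-second pings, two confirming failures, a one-minute trigger delay:
print(worst_case_detection_seconds(60) / 60)  # minutes gone before a ticket exists
```

With those (fairly gentle) assumptions, four minutes elapse before anyone is even paged, which is most of a five-nines annual budget in a single incident.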

But all of that is beside the point. First you need to let me know if your corporate server team is down with 5-9’s availability guarantees. Are they promising that the monthly patch-and-reboot process will take less than half a minute, end to end?

I’m thinking ‘no’.

#FeatureFriday: What is IPSLA?

Welcome to “Feature Friday”, a series of (typically short) videos which explain a feature, function, or technique.

Like NetFlow, it’s easy to fall into the trap of thinking IPSLA is some “new” feature that’s just come on the scene. In reality, it was part of Cisco IOS release 11.2 (as a feature called “Response Time Reporter”, or RTR) all the way back in October 1996. Later it was renamed “Service Assurance Agent” (SAA), and finally changed to its current moniker, which stands for “Internet Protocol Service Level Agreement”.

In this video, Chris O’Brien and I talk over the things IPSLA can do for you and how it tends to be pigeon-holed as “just good for voice monitoring” when it can actually do much, much more.


For more insights into monitoring, as well as random silliness, you can follow me on Twitter (@LeonAdato) or find me on the SolarWinds THWACK.com forums (@adatole)