This article originally appeared in a scaled-down version here on PacketPushers.net. I’m posting it in its full form here as an introduction to the full series.
For people who are interested in monitoring, there is a leap you make when you go from watching systems that YOU care about to monitoring systems that other people care about.
When you are doing it for yourself, it’s all about ease of maintenance, getting good (meaning useful, interesting) data, and having the information at your fingertips to deflect accusations that YOUR system is down/slow/ugly/whatever.
But if you do that job well, and show up at enough meetings showing off your shiny happy data, inevitably you will get nominated/conscripted into the monitoring group, where you are expected to take as much interest in other people’s sh…tuff as you did in your beloved systems from your former job.
And this is where things get especially tricky.
Assuming you LIKE monitoring as a discipline, and find it exciting to learn about different types of systems (and ways they can fail), you are going to want to provide the same levels of insight for your coworkers as you had for yourself.
Inevitably, you will find yourself answering The Four Questions. These are questions which—for reasons that will become apparent—you never really had to ask yourself when you were doing it on your own. The four questions—with brief explanations—are:
- Why did I get an alert?
The person is not asking, “Why did this alert trigger at this time?” They are asking why they got the alert at all.
- Why didn’t I get an alert?
Something happened that the owner of the system felt should have triggered an alert, but they didn’t receive one.
- What is being monitored on my system?
What reports and data can be pulled for their system (and in what form) so they can look at trending, performance, and forensic information after a failure.
- What will alert on my system?
I’d like to be able to predict under which conditions I will get an alert for this system.
…and the Fifth Beatle… I mean question.
5. What do you monitor “standard”?
What metrics and data are typically collected for systems like this? This is the inevitable (and logical) response when you say, “We put standard monitoring in place.”
In the coming (weeks/days/months/series), I’m going to explore each of these questions in depth and offer techniques you can use to respond to each one.