(This post originally appeared on the Kentik blog)
“A few years back, I was tuning an alert rule and accidentally triggered the alert, which created 772 tickets. Twice.”
This (all too true) story introduces the main thesis of my talk in the video below (and the reason for its title): alerts, contrary to the popular opinion held by IT practitioners across the spectrum of tech, don’t inherently suck. The problem lies in how alerts are typically created, which causes them to be… well, let’s just say “sub-optimal” and leave it at that.
I’ve given this talk frequently at conferences such as DevOpsDays BRUM, DevOpsDays TLV, Monitorama, and others. I believe its popularity is largely due to its fun approach to a frustrating issue.
I’d like to take a few moments of your time here to emphasize points I make in the talk, but then extend those ideas in ways that don’t fit the limitations of time or format common in conference presentations.
The slippery slope to “Monitoring Engineer”
If you’ve read this far, there’s a good chance you care about alerts for more than just your own personal reasons. You probably have people — whether on your immediate team or in the larger organization — who look to and rely on you for help designing, implementing, maintaining, and fixing alerts.
While most of us first encounter monitoring solutions because we want to know more about our own sh… tuff, it quickly follows that we’re helping others set up monitoring for themselves. Before long, we find ourselves in the “resident expert” role. Once that reputation gets around, the job (whether official or not) is irrevocably added to our responsibilities.
The good news is that this is a huge opportunity for those who enjoy the work. Monitoring is an undeniable game-changer in organizations willing to embrace and use it.
Alerts ≠ Monitoring
One of my first encounters with alerting that was completely off the rails was at a company that defined uptime as “100% minus the # of alerts” in a given period. It was utterly unhinged.
While it was an extreme example, the underlying issue — confusing alerting with monitoring — isn’t rare at all. For many individuals (and teams, departments, and entire companies), the raison d’être for monitoring is to have alerts, which is simply not helpful or effective.
Monitoring is nothing more (and nothing less) than the consistent, persistent collection of data from a set of systems. Everything else that a monitoring and observability solution provides — dashboards, widgets, reports, automation, and alerts — is merely a happy by-product of having monitoring in the first place.
As a monitoring engineer, I know something is amiss when I see people hyper-focusing on alerts to the exclusion (if not the detriment) of monitoring.
Alerts need proof of their value
An alert should only exist if it has a proven, measurable, meaningful impact. The best way to validate that is to see if an alert is intended to cause:
- someone
- to do something
- RIGHT. NOW.
- about a problem
If any one of those conditions isn’t met, you’re looking at an alert that is trying to replace some other monitoring structure, such as a dashboard or a report.
But that merely proves that an alert is actionable, not valuable. And I must be clear: “important” isn’t the same as “valuable.” Importance implies that it is technically, intellectually, or (believe it or not) emotionally meaningful to some person or group.
“Valuable” is much more particular: The existence of the alert can be directly tied to a financial outcome.
How does one establish this? Start with what the world would look like without the alert:
- How would the people who can fix the issue find out about the problem? And more to the point, how LONG would it take for the people who can resolve the issue to find out?
- Are there any inherent losses while the problem is happening? An online sales system that generates $1,000 an hour loses that amount every hour it’s unavailable.
- How long would it take to fix the problem? In some cases, it’s the same amount of time, alert or not. But in far more circumstances, if the problem were left unaddressed for the length of time identified in the first bullet, it would take longer (possibly significantly longer) to resolve.
- What is the regular (“total loaded”) rate for the staff who can fix the issue?
- What is the “interruption cost” for that staff? They are (ostensibly) not sitting around waiting for this particular error, so what is the value of their normal work? Because they will NOT be doing it while they are addressing this issue.
You are welcome to take the formula above and, as the saying goes, “salt to taste.”
Once you have this, recalculate all of the above WITH the alert. The difference between the first calculation and the second is the dollar value of the alert.
Now, you can set up a simple report showing the number of times the alert triggered in a given period, multiplied by that dollar value. That is the amount this one alert has saved the company during that time.
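For the spreadsheet-averse, here is one way the before-and-after math might look in Python. Treat it as a minimal sketch rather than the official formula: the class and function names are my own invention, the way the costs are combined is an assumption (salt to taste), and apart from the $1,000-an-hour sales system, every number is made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class IncidentScenario:
    """Inputs for one scenario, either with or without the alert.

    Every field name here is hypothetical; rename or extend to fit your shop.
    """
    hours_to_detect: float        # time until the people who can fix it find out
    hours_to_fix: float           # time to resolve once they know
    loss_per_hour: float          # inherent losses while the problem persists
    staff_rate_per_hour: float    # "total loaded" rate of the responders
    interruption_per_hour: float  # value of the normal work they are NOT doing

    def cost(self) -> float:
        # Assumption: losses accrue for the whole outage (detection + repair),
        # while staff costs accrue only during the repair itself.
        outage_hours = self.hours_to_detect + self.hours_to_fix
        business_loss = outage_hours * self.loss_per_hour
        staff_cost = self.hours_to_fix * (
            self.staff_rate_per_hour + self.interruption_per_hour
        )
        return business_loss + staff_cost


def alert_value(without_alert: IncidentScenario, with_alert: IncidentScenario) -> float:
    """Dollar value of the alert: cost without it minus cost with it."""
    return without_alert.cost() - with_alert.cost()


def total_savings(value_per_trigger: float, trigger_count: int) -> float:
    """What the report shows: number of triggers times the per-trigger value."""
    return value_per_trigger * trigger_count


# Illustrative numbers only. Without the alert, someone notices after 4 hours
# and the neglected problem takes 3 hours to fix. With it, 15 minutes and 1 hour.
without_alert = IncidentScenario(4.0, 3.0, 1000.0, 100.0, 50.0)
with_alert = IncidentScenario(0.25, 1.0, 1000.0, 100.0, 50.0)

value = alert_value(without_alert, with_alert)
print(f"Value per trigger: ${value:,.2f}")
print(f"Saved over 6 triggers: ${total_savings(value, 6):,.2f}")
```

If the result comes out at or near zero (or negative), that is a strong hint the alert is important to somebody but not actually valuable.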
Observability enables us to change our focus
Back when I started working with monitoring solutions (yeah, yeah, Grampa. When dinosaurs ruled the earth and you had to chisel each bit into the hard drive by hand with a lodestone), we had to guess at the user’s experience from an array of much lower-level data. We’d look at network traffic, disk I/O, server connections, and other metrics and use those metrics to guess what was happening at the top of the OSI model.
We didn’t do it because we thought it was the best option. We did it because it was the ONLY option. Tracing didn’t really come onto the scene (in terms of true application monitoring) until 2010. And it only took hold because of a fundamental change in application architecture.
The widespread adoption of cloud computing (AWS EC2 launched in public beta in 2006) and mobile phones (the first iPhone came on the scene in 2007) radically changed how we interacted with applications. Facebook had an unbelievable (for the time) 600 million users in 2010. That number grew to 800 million in 2011 and over 1 billion in 2012.
Against THAT backdrop, application tracing and real user monitoring went from something we could only do in carefully controlled QA environments to a technique that was not only possible but game-changing.
Because the entire reason we have monitoring — the whole damn point — is to understand what users are experiencing with an application. That’s it. That’s the whole enchilada.
So, I will go on record as saying that alerting should focus on that aspect first and foremost. If the user experience is impacted, sound the alarm and get people out of bed if necessary.
At that point, all the other telemetry (metrics, events, and logs) can be used to understand the details of the impact. But those lower-level data points no longer have to be the trigger point for alerts. Not in most cases.
Where do we go from here?
Hopefully, you have enough time between this blog and my talk to reflect on your existing alerts with an eye toward real improvement. You may find yourself deleting alerts you once thought essential. You will also undoubtedly spend time tweaking your alerts to make them more actionable, meaningful, and valuable.
Just ensure you don’t trigger an alert storm in the process, or you’ll end up at the helpdesk, manually closing 1,544 tickets.
Don’t ask me how I know.