I’ve found myself engaged in a very particular type of debate lately, about the definition and nature of “observability”. It’s a discussion that isn’t unique to me, my role, or my experience; nor is it particularly new. Better voices – more sage, more passionate, more experienced – have weighed in on the topic.
Nevertheless, I feel compelled to put these thoughts down in blog form, if for no other reason than that it will allow me NOT to have a long, drawn-out conversation in the future, but instead give me an easy link to reference.
The Word Observability Has a Particular Pedigree
“Observability” is a very specific word. It has a history predating IT entirely, with roots in “pure” engineering – in control theory, it describes how well a system’s internal states can be inferred from its external outputs. More recently, IT powerhouses have extended the term to apply to software purpose-built to include hooks / outputs that enable observability.
Observability in IT was first applied to software development / devops / continuous integration-continuous delivery (CI/CD) pipelines because that’s where things (software) could be readily changed to include MORE outbound information when needed. It’s much harder to get an operating system vendor (let alone a hardware vendor) to build (and continue to expand/enhance) those kinds of data streams. Not impossible, but slower and less responsive.
For this reason, observability has remained largely in the dev (and devops) domain for the better part of a decade, since that’s where the most responsive opportunities exist.
The Hallmarks of Observability
First, let me define how monitoring differs. At its core, monitoring solutions ask a system “how are you?” and collect the response.
Compare this to observability solutions, which are designed to listen for everything the system says on its own without being asked and infer how it’s doing.
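To make the distinction concrete, here’s a minimal sketch of both postures in Python. The field name, the 90% threshold, and the event format are all invented for illustration:

```python
import json

# Monitoring: ask the system "how are you?" on a schedule, and compare
# the answer against a condition we wrote down in advance.
def poll_health(get_status):
    status = get_status()              # e.g., an HTTP health check or SNMP get
    if status["cpu_percent"] > 90:     # a known threshold (hypothetical)
        print("ALERT: CPU over 90%")

# Observability: don't ask. Accept whatever the system emits on its own,
# and keep all of it so questions can be asked later.
def consume_events(stream, event_store):
    for raw_line in stream:            # e.g., a log pipe or message queue
        event_store.append(json.loads(raw_line))
```

The monitoring half already knows exactly what question to ask; the observability half deliberately refuses to decide in advance.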
Observability requires a partnership between the system designer and the collector-of-data. The designer has to be willing to output incredibly detailed streams of information about the application—the good, the bad, the weird, the seemingly trivial, and everything in-between—and the collector has to be nimble enough to accept multiple, disparate, varied types of data and normalize them into a system designed to allow someone (preferably software, not a human) to identify suspicious events.
Therefore, a system can only be observable if it can be adjusted or updated to provide new information as new conditions and situations are identified. If a system fails in a way the outbound stream of data doesn’t include, the system must be updated. If there’s no way to update the system, there’s no real way to make it observable either.
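On the designer’s side, this partnership often looks like emitting one “wide,” structured event per unit of work. A hedged sketch, assuming a hypothetical checkout service (the service name, the fields, and the order shape are all mine):

```python
import json
import time

def emit_checkout_event(order, latency_ms):
    """Emit one wide, structured event per unit of work: the good, the
    bad, the weird, and the seemingly trivial, all in a single record."""
    event = {
        "ts": time.time(),
        "service": "checkout",                 # hypothetical service name
        "order_id": order["id"],
        "user_id": order["user_id"],
        "cart_size": len(order["items"]),
        "latency_ms": latency_ms,
        # When a new failure mode is discovered, the fix is one new field
        # here -- possible only because we can update the code:
        "payment_retries": order.get("payment_retries", 0),
    }
    print(json.dumps(event))                   # the collector ships this stream
```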
Second, monitoring is concerned about the “known unknowns” – I know a specific piece is likely to break, but I don’t know when. So, I just watch (monitor) it and track its status, waiting for the known threshold (whether that’s “down” or “over 90%” or some other combination of known conditions) to occur.
Observability, on the other hand, is designed to detect and showcase “unknown unknowns” – defined (by me) as a circumstance where I don’t know what will break, and I also don’t know when.
You can see this is a completely different type of problem from the one monitoring addresses. How can you monitor for a thing you haven’t even considered as having a possibility of occurring?
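One way to see the difference in code: known unknowns can be enumerated as explicit checks, while unknown unknowns can only be caught by flagging behavior that is statistically unlike the system’s own history. A toy Python sketch; the metric names and the four-sigma cutoff are my own choices, not anyone’s product:

```python
from statistics import mean, stdev

# Known unknowns: a fixed list of conditions written down in advance.
KNOWN_CHECKS = {
    "disk_percent": lambda v: v > 90,      # "over 90%"
    "status":       lambda v: v == "down", # "down"
}

# Unknown unknowns: we can't name the failure ahead of time, so instead
# flag ANY metric whose latest value is wildly unlike its own history.
def surprising(history, latest, sigmas=4.0):
    if len(history) < 10 or stdev(history) == 0:
        return False                       # not enough data to judge
    return abs(latest - mean(history)) > sigmas * stdev(history)
```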
Third, observability is primarily focused on events with “high cardinality.” The term “cardinality” refers to how unique a piece of data is. Think about a phone book. The data element with the lowest cardinality is “state.” It’s likely to be the same for the entire phone book. Slightly higher in cardinality would be city (but there are still a boatload of repeats). As you think through the phone book analogy, something becomes clear: nothing has particularly high cardinality by itself. Not last name, not street number, nothing. What IS highly unique is a combination of elements—last name, first name, street, and house number. The likelihood of two people named “Adato, Leon” living at 123 Oak Avenue is effectively zero. That is a data point with high cardinality.
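The phone book version of this is easy to express in code. A quick Python illustration (the sample rows are, of course, made up):

```python
# Cardinality = the number of distinct values a field (or combination
# of fields) takes on. Alone, every field repeats; together, almost never.
phone_book = [
    {"last": "Adato", "first": "Leon", "street": "Oak Avenue",
     "num": "123", "city": "Cleveland", "state": "OH"},
    {"last": "Smith", "first": "Pat", "street": "Oak Avenue",
     "num": "99", "city": "Cleveland", "state": "OH"},
    # ...thousands more rows...
]

def cardinality(rows, fields):
    """Count distinct values for a field, or a combination of fields."""
    return len({tuple(row[f] for f in fields) for row in rows})

print(cardinality(phone_book, ["state"]))  # lowest: 1 for the whole book
print(cardinality(phone_book, ["city"]))   # still low: lots of repeats
print(cardinality(phone_book, ["last", "first", "street", "num"]))
# approaches len(phone_book): high cardinality
```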
Additionally, observability solutions aren’t looking for a single data point with high cardinality; they look for a unique combination of events. Given the vast number of data inputs, the kind of processing I’m describing requires sophisticated analysis. Hence the emphasis of most solutions on AI/ML (which, to be quite honest, is often vendor hyperbole for “really good algorithms” – but that’s nitpicking; I’m less concerned with what vendors call what they’re doing, and more with the fact that they’re doing it at all) to process the incoming data, re-assemble those data points in multiple ways, and make visible the events that are both unique and interesting.
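Stripped of the AI/ML branding, the core move is something like this: group incoming events by combinations of attributes and surface the combinations that almost never occur. A deliberately naive Python sketch; the field names are hypothetical, and real products are far more sophisticated:

```python
from collections import Counter

def rare_combinations(events, fields=("status", "endpoint", "region"),
                      max_seen=1):
    """Count each combination of attribute values across all events,
    and return the combinations seen at most `max_seen` times."""
    counts = Counter(tuple(e.get(f) for f in fields) for e in events)
    return [combo for combo, n in counts.items() if n <= max_seen]

# e.g., ("500", "/checkout", "us-east-1") appearing once in a million
# events is exactly the unique-and-interesting needle we're after.
```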
Why Is Observability So Hard?
Joe Reves, my friend and colleague, has a great metaphor illustrating why observability is a difficult nut to crack: Let’s imagine we want to create an app to help us navigate in our car from point A to point B.
In its simplest form, I’d create a static map showing my relative position via GPS. Updates would be few and far between because new roads don’t just pop out of thin air. Including alternate routes is a trivial matter because the roads are a fixed set of elements. My app could even present alternate routes based on road closures, assuming there are reliable sources for that information. Monitoring those data streams is, after all, a “known unknown.” I may not know when a road will be closed, but I’m certain roads DO have problems, and so I can watch for this data.
But it’s never so simple. Getting from point A to point B quickly and efficiently means taking current traffic patterns into account. For this, you need to include GPS information from every other driver on the road; then filter for the cars that share my current route (whether their destination is the same or different); it also requires keeping track of cars on every alternate route, in case traffic patterns change and an alternate route becomes faster; and not just that, but watching the speed of all those individual not-my-car objects to determine whether their relative speed drops to a point that might indicate a traffic jam or congestion; and much, much more.
What makes the whole thing orders of magnitude more challenging is that this isn’t just happening for a single “user” who wants directions from point A to point B. All those individual cars want the same information, from their own point of view, all at the same time.
Collecting such a large volume of information, parsing it, customizing it, and flagging potential “issues” (again, in this example it would be traffic jams, road closures, better routes, and more) is a mind-bogglingly complex set of computations.
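To give a feel for even one slice of this computation, here’s a toy Python version of the “are the cars ahead of me slowing down?” question. The ping format and the 15 mph cutoff are invented for illustration:

```python
def congested_segments(pings, my_route, slow_mph=15):
    """Average the reported speeds of every car on each segment of MY
    route, and flag the segments where traffic has slowed to a crawl."""
    speeds_by_segment = {}
    for p in pings:                          # one ping per car, per interval
        if p["segment"] in my_route:         # filter to cars sharing my route
            speeds_by_segment.setdefault(p["segment"], []).append(p["speed_mph"])
    return {seg: sum(s) / len(s)
            for seg, s in speeds_by_segment.items()
            if sum(s) / len(s) < slow_mph}
```

And remember: something like this has to run continuously, for every driver’s route at once, not just mine.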
I share all of this by way of showing how a “simple” navigation app is anything but; how it requires computing capability far exceeding the power of the humble phone in our pocket; how it involves collecting wildly disparate sets of data, then processing the data, and finally presenting it, graphically visualized, from an extremely particular point of view.
Just like observability.
Can a Router Be Observable?
The simple answer is “no.” Traditional infrastructure—from routers and switches to servers and IoT devices, extending even to operating systems—isn’t inherently observable. Why? Because we’re not Cisco (or Juniper, or Microsoft, or whoever the vendor is who created the system). When we experience an unanticipated failure mode not covered by the existing streams of data, we can’t go modify the code to include more information.
This isn’t to say the industry isn’t trying to get there. OpenTelemetry is one effort to standardize how observability data is emitted and collected, with an eye toward extending that to infrastructure devices. It should be noted there’s also the usual chaos of vendors developing their own competing standards and engaging in both marketing and development knife fights. Only time will tell which standard or vision will win out.
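For a sense of what the standard looks like from the software side today, here’s a minimal OpenTelemetry example in Python (it assumes the opentelemetry-sdk package is installed; the span and attribute names are my own):

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Wire up a tracer that prints finished spans to the console.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("demo")

# Each span carries arbitrary, high-cardinality attributes: exactly the
# kind of rich outbound stream infrastructure devices mostly lack.
with tracer.start_as_current_span("handle_request") as span:
    span.set_attribute("http.route", "/checkout")  # illustrative attributes
    span.set_attribute("user.id", "12345")
```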
In its current state, it’s certainly possible to apply observability concepts to traditional infrastructure devices and systems—collecting the outbound data streams the target systems already provide; combining them into a single coherent view; applying software-based analysis to bring unexpected events to the fore; and presenting the data we’re able to collect in a way that feels familiar to folks who use “real” observability solutions for software. But this is probably as close as we can get, and will get, until a common standard is adopted by all (or at least most) of the industry.
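As a final sketch of what that “as close as we can get” looks like in practice, here’s a hedged Python example that merges whatever streams the devices already offer into one timeline (the record shapes are invented, and each stream is assumed to already be time-ordered):

```python
import heapq

def unified_view(*streams):
    """Merge several time-ordered telemetry streams (syslog, SNMP polls,
    flow records...) into one coherent, time-ordered view."""
    return heapq.merge(*streams, key=lambda rec: rec["ts"])

syslog = [{"ts": 100.0, "source": "syslog", "msg": "link down Gi0/1"}]
snmp   = [{"ts": 101.5, "source": "snmp", "oid": "ifOperStatus.1", "value": 2}]

for record in unified_view(syslog, snmp):
    print(record)   # feed this merged stream to the analysis layer
```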