It would not be wrong to say that Observability has been the buzzword of at least the last couple of years, and we often find organizations burdening themselves with questions like:

  • How do you create an Observability set up?
  • Should we invest in more monitoring tools?
  • Is Observability an extension of monitoring?
  • We have created a culture of Observability but are we doing it the right way?

The answer to these questions lies in understanding the concept of Observability and how it ties in with the digital experience monitoring strategy of your organization. Only then can you determine where you stand in terms of Observability.

What is Observability?

Let us first try to understand the definition of Observability. According to Wikipedia/Google, “In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs”. As we read through the definition, take a step back and evaluate the tools we already use. Do these tools support this definition? Yes, they do!

We already have various monitoring tools, like APM, that measure the internal state of a system and correlate it with the external output using Synthetic and Real User Monitoring. So why do we need this new concept? To answer this, we need to understand the three broad components of Observability: Structured logs, Metrics and Traces.

The idea of Observability is to tie these components together and make the system more observable, rather than looking at metrics, logs, or traces individually. For example, when a synthetic monitoring tool alerts on a high wait time, correlate it with the right APM data. Use logs and traces to identify the root cause, and then focus on ensuring the issue does not recur.
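To make the correlation idea concrete, here is a minimal sketch of tying an alert to nearby slow traces and their error logs. The `Alert`, `Trace` and `LogLine` types, the five-minute window and the 1000 ms threshold are all assumptions made for illustration; they are not the data model of any particular monitoring product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Alert:
    metric: str          # e.g. "wait_time" reported by a synthetic test
    fired_at: datetime


@dataclass
class Trace:
    trace_id: str
    started_at: datetime
    duration_ms: float


@dataclass
class LogLine:
    trace_id: str
    level: str
    message: str


def correlate(alert, traces, logs, window=timedelta(minutes=5), slow_ms=1000.0):
    """Map each slow trace near the alert to the error messages logged for it."""
    start, end = alert.fired_at - window, alert.fired_at + window
    slow = [
        t for t in traces
        if start <= t.started_at <= end and t.duration_ms >= slow_ms
    ]
    return {
        t.trace_id: [
            line.message for line in logs
            if line.trace_id == t.trace_id and line.level == "ERROR"
        ]
        for t in slow
    }
```

Starting from the alert and narrowing down by time and trace ID keeps the engineer from reading every log line; the metric points to the traces, and the traces point to the logs.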

Consider another scenario: an engineer is paged in the middle of the night about a server that is slow or running low on disk space. But only a small percentage of queries were impacted, and by the time the engineer identified the root cause, the load balancer had already rerouted traffic to another server. Does the scenario sound all too familiar? Consider the cost of such an incident and, more importantly, the impact on end users.

These examples highlight the importance of understanding the three components of Observability and the challenges associated with setting up a truly observable system.

Logs help comprehend and correlate data for more detailed analysis. With more complex infrastructure and multiple vendors, a distributed system runs code written by different parties, which makes it a challenge to navigate the logs and make sense of them in one go. For example, you would have Audit logs, Service logs, App logs, System logs, Platform logs and so on. The data comes from different sources, and every system defines its logs with its own logic, so correlating the data becomes a challenge. It is also not about logs from your application alone; at times you also need to collect logs over network protocols like UDP if the request has passed the firewall. Root cause analysis with logs does indeed come with its own set of challenges.
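One common way to ease the correlation problem is to normalize every source into a single structured shape before analysis. The sketch below assumes two hypothetical input formats (a JSON line with `ts`, `severity` and `msg` fields, and a plain-text line with a leading timestamp); real sources will differ, and each would need its own small parser.

```python
import json
import re
from datetime import datetime, timezone

# Assumed plain-text layout: "2020-06-01 02:13:40 ERROR disk usage at 95%"
TEXT_PATTERN = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (\w+) (.*)$")


def from_json_line(line, source):
    """Parse a JSON log line whose 'ts' field is epoch seconds (assumed)."""
    record = json.loads(line)
    return {
        "timestamp": datetime.fromtimestamp(record["ts"], tz=timezone.utc),
        "level": record["severity"].upper(),
        "source": source,
        "message": record["msg"],
    }


def from_text_line(line, source):
    """Parse a plain-text log line; timestamps are assumed to be UTC."""
    match = TEXT_PATTERN.match(line)
    if match is None:
        raise ValueError(f"unrecognized log line: {line!r}")
    stamp, level, message = match.groups()
    ts = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
    return {"timestamp": ts, "level": level.upper(), "source": source, "message": message}


def merge(*streams):
    """Combine normalized records from all sources into one timeline."""
    return sorted(
        (record for stream in streams for record in stream),
        key=lambda record: record["timestamp"],
    )
```

Once App logs, System logs and the rest share the same fields, they can be sorted into one timeline and filtered together instead of being read source by source.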

Tracing is another component, with an added layer of complexity introduced by multiple infrastructure vendors and the use of microservices. It is difficult to trace individual components to identify bottlenecks. For example, while using trace IDs in a system, we see that A-B takes 100ms and B-C takes 3000ms (Fig 1).

Fig 1

We may conclude that the issue is between B-C, but guess what, after C the request goes to D and then E (Fig 2). So does latency exist only between B-C or beyond D as well?

Fig 2

Also, tracing is not limited to understanding latency from the network point of view. What if a function written at D creates a loop that has caused things to break?
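The A-to-E example can be sketched as a per-hop latency report. The A-B and B-C timings come from the example above; the C-D and D-E timings are made-up values used only to show how the conclusion can change once the whole path is visible.

```python
def hop_report(hops):
    """Rank hops by latency and show each hop's share of the total time.

    `hops` is a list of (source, destination, duration_ms) tuples.
    """
    total = sum(ms for _, _, ms in hops)
    report = [
        (f"{src}-{dst}", ms, (ms / total) if total else 0.0)
        for src, dst, ms in hops
    ]
    return sorted(report, key=lambda row: row[1], reverse=True)


hops = [
    ("A", "B", 100),    # from the example
    ("B", "C", 3000),   # from the example
    ("C", "D", 200),    # hypothetical
    ("D", "E", 4500),   # hypothetical
]

partial = hop_report(hops[:2])   # what we see if the trace stops at C
full = hop_report(hops)          # what we see with the complete path
```

With only the first two hops, B-C ranks as the slowest; with the full path, the hypothetical D-E hop overtakes it. A trace that stops partway can point to the wrong bottleneck.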

Metrics are the most important component. Metrics are the first indication of an issue. We do not go through each and every log and trace; we look at metrics and then filter the relevant logs. APM-related metrics like CPU utilization, Memory, and Disk are some of the many metrics available. The real essence of a metric is to provide the pulse of your system, and this is exactly what key synthetic metrics such as Response time, Wait time, TCP handshake time, SSL time, and Time to Interactive do. There are a good number of reasons to back this argument. For this, we have to go back to the definition of Observability, which talks about “knowledge of external output”. With that in mind, let us take a look at a couple of examples:

Example 1: Are we doing it the right way?

In this example, we look at one of the largest CRM platforms. While monitoring their CRM URL from global locations, we started to see a gradual rise in Wait time, and the page eventually became too slow. Over a period of 20 days, the Response time jumped from 0.7 seconds to over 9 seconds, an increase of well over 1,000%.


In the charts above, we can see that these spikes did not happen all of a sudden. The Wait time climbed consistently for almost 20 days. While this was happening, many alerts and metrics from Dynatrace APM, which would otherwise have indicated an issue with CPU, Disk or Memory, were ignored.
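A slow climb like this rarely trips a fixed threshold on any single day. One possible way to catch it, sketched here with assumed window sizes and an assumed 2x threshold, is to compare a recent average against an earlier baseline. The function names and the daily series are hypothetical; only the 0.7 second and 9 second endpoints come from the example.

```python
from statistics import mean


def drift_ratio(daily_values, baseline_days=5, recent_days=3):
    """Ratio of the recent average to the baseline average."""
    if len(daily_values) < baseline_days + recent_days:
        raise ValueError("not enough data points for both windows")
    baseline = mean(daily_values[:baseline_days])
    recent = mean(daily_values[-recent_days:])
    return recent / baseline


def has_drifted(daily_values, threshold=2.0):
    """Flag a series whose recent average is `threshold` times its baseline."""
    return drift_ratio(daily_values) >= threshold


# Hypothetical 20-day series rising linearly from 0.7 s to 9.0 s.
response_times = [0.7 + day * (9.0 - 0.7) / 19 for day in range(20)]
```

Here `has_drifted(response_times)` returns `True`, while a flat series would not be flagged. A check like this on an end-user metric gives one clear signal instead of many component-level alerts.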

The more distributed the systems are, the more complex it is to detect issues pertaining to a specific component, and more often than not some of these get ignored. When we reached out to the customer highlighting the issue, we were told that the NOC team gets over 50,000 alerts every month and many of these alerts are simply ignored. In such a scenario, synthetic monitoring from the end user’s perspective provides valuable “knowledge of external output indicating the internal state of a system”.

Example 2: Need to be observable from outside

While organizations ramp up Observability for better correlation of metrics, traces and logs, one aspect that is ignored is Observability from outside, that is, from the perspective of end users accessing the application over real ISPs. Here is another example.


The internal APM did not indicate any issue, but end users from a certain location (North America in this case) couldn’t connect to the application because a connection with the CDN’s Edge server could not be established. A very important question here is: are we observable enough to understand issues with components that are outside our control, like CDNs or ISPs? And if we do not measure this, then are we fully observable? Or are we content being observable from within our application alone, creating a culture of saying “not my fault”?
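The gap between the two views can be sketched as a small check that combines an internal health flag with connection results from external probes. The `(region, connected)` tuple shape, the 50% failure threshold and the function names are assumptions for illustration, not the output format of any specific tool.

```python
from collections import defaultdict


def failing_regions(probe_results, min_failure_rate=0.5):
    """Return regions where at least `min_failure_rate` of probes failed to connect.

    `probe_results` is an iterable of (region, connected) tuples.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for region, connected in probe_results:
        totals[region] += 1
        if not connected:
            failures[region] += 1
    return sorted(
        region for region in totals
        if failures[region] / totals[region] >= min_failure_rate
    )


def diagnose(internal_healthy, probe_results):
    """Combine the internal (APM) view with the outside-in view."""
    regions = failing_regions(probe_results)
    if regions and internal_healthy:
        return "external issue (possibly CDN or ISP) in: " + ", ".join(regions)
    if regions:
        return "internal and external issues in: " + ", ".join(regions)
    return "no externally visible issue"
```

For instance, with `internal_healthy=True` and most North America probes failing to connect, `diagnose` reports an external issue for that region, which is the case an internal-only view would miss.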

In conclusion, the terms Observability and Monitoring can be interpreted differently, and the interpretation depends entirely on the organization. For some, it is a fresh approach, while for others it is an extension of their existing monitoring strategy. But even with the different interpretations, we should consider this important question: are we doing it the right way? We should also understand how to improve the culture around performance to make Observability effective. The examples discussed in this blog show how crucial it is to monitor end-user performance and not rely on internal APM tools alone.

We discuss another important aspect to consider when introducing Observability in the Catchpoint SRE Report 2020. Download the full report and read about Observability from an SRE’s perspective.