External monitoring tools paint a powerful picture of your website's performance, availability, and reliability from the end user's perspective. That said, the picture can fall short when it comes to answering your main question: what happened, and where? Things can get blurry!
The main reason for this shortcoming is that, from the outside, these tools can see only one server and one request. Internally, however, a site or application performs many tasks and relies on other systems and services that are hidden from the end user. The picture becomes more powerful when you can add context to it: internal application context!
Let me illustrate such a case. We instructed Catchpoint to measure how long it takes to get search results from Google for the keyword “Google” using Internet Explorer.
The chart below displays the performance of the test over the last 30 days.
When using external performance monitoring services, like Catchpoint, you can overlay additional metrics to help understand where the slowness occurred. In this case we added the following:
In the new chart you can see a corresponding increase in both Wait and Load, a possible sign that the Response time was driven by some kind of “internal” bottleneck. However, these two metrics can also be affected by a slow connection: the slower the connection between the client and the server, the longer it takes to get the data from the server. So we are still not 100% sure what the cause is. It could be Google, or it could be the Internet.
In the case of the Google search results page, Google gives us one extra piece of information: how long it takes for their “backend” system to process the search request before sending the results to the user. This metric is clearly tied to how their internal systems perform; it is not impacted by external factors like Internet connectivity.
Thanks to our Catchpoint Insight product we can capture and overlay this “internal” context with the external performance data we collect during each test.
The new chart clearly shows that the spikes were in fact tied to the performance of the Google backend, not to connectivity or the monitoring nodes!
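One way to check this kind of overlay numerically, rather than by eye, is to correlate the two series. The sketch below is only an illustration: the `pearson` helper and the sample values are assumptions made up for the example (only the 100/180 and 290/489 pairs echo figures quoted in this post), and it does not represent how Catchpoint Insight works internally.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-test samples, in milliseconds (NOT real measurements).
# backend_ms: the "internal" processing time reported by the site.
# response_ms: the response time measured externally for the same test run.
backend_ms = [100, 105, 98, 290, 102, 285, 99]
response_ms = [180, 186, 178, 489, 183, 480, 181]

if __name__ == "__main__":
    r = pearson(backend_ms, response_ms)
    print(f"correlation between backend time and response time: {r:.3f}")
```

A coefficient close to 1 would support the reading that the external spikes follow the backend. A low coefficient would point back toward connectivity or the monitoring nodes as candidates.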
One interesting observation is that when the internal Google performance jumped from 100ms to 290ms, the response time jumped from 180ms to 489ms.
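To make that observation concrete, here is a small sketch of the arithmetic. The function name `delta_breakdown` is hypothetical; the four input values are the ones quoted above.

```python
def delta_breakdown(backend_before, backend_after, response_before, response_after):
    """Return (backend increase, response increase, remainder), all in ms.

    The remainder is the part of the response-time increase that the
    backend increase alone does not account for.
    """
    backend_delta = backend_after - backend_before
    response_delta = response_after - response_before
    remainder = response_delta - backend_delta
    return backend_delta, response_delta, remainder


# Values quoted in the post: backend 100 ms -> 290 ms, response 180 ms -> 489 ms.
backend_delta, response_delta, remainder = delta_breakdown(100, 290, 180, 489)

if __name__ == "__main__":
    share = backend_delta / response_delta
    print(f"backend +{backend_delta} ms, response +{response_delta} ms, "
          f"remainder {remainder} ms, backend share {share:.0%}")
```

The backend increase of 190 ms accounts for roughly 61% of the 309 ms increase in response time. The data here does not say where the remaining 119 ms comes from.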
Monitoring is not just about detecting failures; it is about watching an application 24×7, collecting all the data, and understanding that data in order to optimize your performance and avoid failures in the future.