The move to a more hybrid and distributed application architecture has pushed cloud providers to offer higher availability. Availability has become the key differentiator among competitors. The focus on offering higher and higher availability comes at the cost of other vital performance factors such as service reliability.

This blog post discusses some of the important takeaways from our recent on-demand webcast on improving service reliability. We look at the simplified approach that Digital Experience Monitoring (DEM) takes to service reliability, and at some real-world scenarios where DEM can be leveraged to improve it.

From High Availability to High Reliability

Higher availability is not the sole indicator of high performance. Monitoring methodologies are based on four major performance pillars: Reachability, Availability, Performance and Reliability. Reliability, simply put, is a measure of how consistently a service performs. It is a critical metric in the application delivery chain, and there are multiple challenges when it comes to guaranteeing it:

  • Reliability is an emergent property
  • Reliability is relative to the end user’s perception
  • Reliability requires drilling into granular historical data
  • Reliability requires advanced data analytics capabilities

These challenges do not make reliability impossible to monitor. DEM tools measure all the metrics relevant to the four performance pillars, but there are ways to make reliability monitoring more effective. The monitoring methodology should account for each of the challenges listed above, and can then be fine-tuned by working through three overlooked aspects of reliability.
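To make the distinction between availability and reliability concrete, here is a minimal Python sketch. It assumes that "consistency" can be approximated by the coefficient of variation of response times; that choice of metric, the function names and the sample data are illustrative assumptions, not something prescribed by the webcast.

```python
from statistics import mean, pstdev

def availability(samples):
    """Fraction of checks that succeeded (None marks a failed check)."""
    return sum(1 for s in samples if s is not None) / len(samples)

def consistency(samples):
    """Coefficient of variation of successful response times.

    Lower values mean more consistent performance. This is one possible
    proxy for reliability, chosen here purely for illustration.
    """
    ok = [s for s in samples if s is not None]
    return pstdev(ok) / mean(ok)

# Hypothetical response times in milliseconds for two services.
steady = [200, 210, 190, 205, 195, 200, 210, 190, 200, None]
erratic = [100, 400, 90, 350, 110, 300, 95, 380, 175, None]

for name, samples in (("steady", steady), ("erratic", erratic)):
    print(name, availability(samples), round(consistency(samples), 3))
```

Both hypothetical services report the same availability, yet their response times are far from equally consistent, which is the gap an availability-only view would miss.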

Three Overlooked Aspects of Reliability

1. When measuring reliability, consider distributed systems appropriately

The digital world is highly distributed: networks, microservices, DNS, CDNs and even cloud services. This distribution is designed to improve speed, performance and efficiency while eliminating single points of failure. But all these distributed components add to the complexity of monitoring reliability.

To understand the impact of such a distributed system, let us take the example of CDN mapping. End users are mapped to a CDN PoP based on their location, and this mapping determines how quickly the application loads for the end user. How quickly the critical components of the application load, in turn, determines the end user’s “perceived” reliability of your service. Figure 1 compares the impact of CDN performance. There is a significant difference in the way the page renders, and because rendering defines the end-user experience, a slower render translates to poor reliability as perceived by the end user.

Figure 1

CDN geo-mapping plays a crucial role in determining end-user experience. For instance, a user in Boston, MA on AT&T is served content from Ashburn, VA. This raises several questions: was the mapping efficient? Does the same mapping pattern apply to traffic from different providers?

Understanding how the CDN mapping works will help you evaluate different CDN providers and improve the current service. It will also help you identify and resolve incidents faster. Working with your CDN partner to optimize the content distribution paths will greatly improve service reliability for all end users.

2. Measure reliability from the real user’s perspective

The end user’s perspective is crucial to reliability. The end user is unaware of the complexities involved in delivering an application; to them, reliability is a perception of how the system worked for them, not something based on any actual quantifier. So, to ensure reliability you need the real user’s insight, which is possible with Real User NEL (Network Error Logging). NEL captures and reports errors (DNS errors, TCP timeouts, HTTP errors) encountered by the real user. The captured data is very helpful when trying to evaluate the actual end-user experience.
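As a rough illustration of how NEL is switched on, the sketch below builds the two HTTP response headers that browsers supporting NEL look for: `Report-To`, which names a reporting endpoint group, and `NEL`, which points the error-logging policy at that group. The field names follow the W3C Network Error Logging and Reporting API drafts; the helper name `nel_headers`, the group name and the collector URL are hypothetical, and a DEM product may well configure this for you.

```python
import json

def nel_headers(endpoint_url, max_age=2592000, failure_fraction=1.0):
    """Return the Report-To and NEL response headers as a dict of strings.

    endpoint_url is a hypothetical HTTPS collector that receives reports.
    max_age is the policy lifetime in seconds (30 days here).
    failure_fraction is the share of failed requests to report (0.0 to 1.0).
    """
    group = "nel"  # arbitrary group name shared by both headers
    report_to = {
        "group": group,
        "max_age": max_age,
        "endpoints": [{"url": endpoint_url}],
    }
    nel = {
        "report_to": group,
        "max_age": max_age,
        "include_subdomains": True,
        "failure_fraction": failure_fraction,
    }
    return {"Report-To": json.dumps(report_to), "NEL": json.dumps(nel)}

# Hypothetical collector URL, used only for illustration.
for name, value in nel_headers("https://reports.example.com/nel").items():
    print(f"{name}: {value}")
```

The headers must be served over HTTPS for the policy to be accepted, and browser support for NEL varies, so real-user error data gathered this way covers only part of your audience.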

For example, when you push a new release into production and intend it to be a ‘no-downtime’ release, you can compare the performance baselines of the build and verify whether users experienced any errors. The data from real users will answer reliability questions such as:

  • Is the issue specific to a region? (higher DNS errors in Asia?)
  • Is the issue specific to a connection type? (ISP related?)
  • Are there any network peering issues?
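Questions like these usually come down to grouping error reports by a dimension and looking for outliers. The following sketch counts NEL reports by failure phase for a chosen dimension. The `body.type` and `body.phase` values follow the NEL report format; the `region` and `isp` fields are hypothetical enrichments that a collector would have to add (for example from the client IP), and the sample reports are invented.

```python
from collections import Counter

def count_errors(reports, dimension):
    """Count reports per (dimension value, NEL failure phase)."""
    return Counter((r[dimension], r["body"]["phase"]) for r in reports)

# Invented sample reports; "region" and "isp" are assumed enrichment fields.
reports = [
    {"region": "asia", "isp": "isp-a",
     "body": {"type": "dns.name_not_resolved", "phase": "dns"}},
    {"region": "asia", "isp": "isp-b",
     "body": {"type": "dns.name_not_resolved", "phase": "dns"}},
    {"region": "europe", "isp": "isp-a",
     "body": {"type": "tcp.timed_out", "phase": "connection"}},
    {"region": "asia", "isp": "isp-a",
     "body": {"type": "http.error", "phase": "application"}},
]

by_region = count_errors(reports, "region")
by_isp = count_errors(reports, "isp")

for key, count in by_region.most_common():
    print(key, count)
```

Running the same count on `isp` instead of `region` gives a first hint as to whether an issue follows geography or a particular provider.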

Combining performance data from synthetic monitoring with real user monitoring provides complementary viewpoints which can then be leveraged to improve the synthetic monitors you use and in turn improve service reliability.

3. Understanding reliability requires high-cardinality, raw historical data

The third important aspect that is usually overlooked is historical reliability patterns. Analyzing day-to-day data may not give you the true reliability picture. Consider this example, in which we measure the page render time for google.com. Figure 2 illustrates performance data for over a year. If we were to analyze reliability for a day or week of any month, we would conclude that it is mostly above average. But when you look at the historical trend, there is a clear pattern of performance degradation. How we judge the reliability of the service depends directly on the time window we observe.

Figure 2
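The effect of the observation window can be reproduced with a small sketch. It uses synthetic daily render times (a small weekly ripple plus a slow upward drift), not the google.com data from the figure, and compares what a single week shows with what a least-squares trend over the whole year shows. The numbers and names are assumptions chosen only to illustrate the idea.

```python
import math

def slope(values):
    """Least-squares slope of values against their index (units per sample)."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den

# Synthetic render times in ms: 0.5 ms/day drift plus a 5 ms weekly ripple.
year = [1000 + 0.5 * day + 5 * math.sin(2 * math.pi * day / 7)
        for day in range(365)]
week = year[180:187]

week_spread = max(week) - min(week)   # variation visible inside one week
year_drift = slope(year) * len(year)  # degradation implied by the yearly trend

print(f"spread within one week: {week_spread:.1f} ms")
print(f"drift over the year:    {year_drift:.1f} ms")
```

Within any single week the render time barely moves, while the yearly trend adds up to a degradation many times larger. Detecting that requires retaining the raw historical data rather than only short-term aggregates.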

Reliability Checklist

We discussed three important aspects you need to reconsider when trying to improve reliability. If you want to monitor reliability effectively then start with this reliability checklist:

  • Global monitoring infrastructure for distributed systems
    • If you want to capture data from actual user locations, you need a global monitoring infrastructure. Such a monitoring strategy will provide full visibility even in a highly distributed system.
  • Historical data retention and multi-dimensional analysis
    • It is not enough to capture performance metrics data without a long-term data retention plan in place. The ability to pull historical data to identify trends and patterns is vital when trying to improve service reliability.
  • User-focused metrics
    • Another important aspect is the metrics used to measure reliability. The metrics should be user-focused to reflect the end user’s perspective.
  • Synthetic and Real User Insights
    • Detailed performance analysis is possible only when you have data sets from multiple viewpoints. Combining proactive synthetic monitoring with real user monitoring will provide insight into the actual end-user experience. These two types of monitoring are complementary, and the data they provide can have a great impact on service reliability.

The webcast offers many more insightful use cases that help you improve service reliability. Watch the webcast here and learn about digital experience monitoring’s simplified approach to service reliability.