In my experience, prior to 2005 companies worried mostly about uptime: everyone was measured on availability and on how many 9s they could achieve. Since then a lot has changed: better and cheaper hardware, better hosting, virtualization, and now cloud computing with “infinite” scale. There is really no excuse for downtime anymore (although it still happens).
The new threat facing online services (websites, travel sites, SaaS, blogs, etc.) is performance: how fast the service responds.
Downtime is binary: a site is either up or down. Performance, on the other hand, is subjective, sometimes undetectable and, worse yet, very easy to dismiss because it is often caused by elements outside of the site’s control (ad tags, widgets). I would even argue that some of the new technology that is giving us infinite scale (cloud, virtualization) might cause some of the performance issues we see today. We fixed availability but are hurting performance.
In 2000, research showed us that if a page took longer than 10 seconds to load, users would get frustrated and leave. In 2009, research from Akamai & Forrester showed that end users will not wait more than 2 seconds for retail sites to load.
The problem with these research numbers, and with web performance in general, is that it is all very subjective. We all expect the Google homepage to load in the blink of an eye, which is about 300 to 400 ms. However, we understand that searching for a flight on a travel site might take a few seconds. So performance numbers are relative to the content and to user expectations, and thresholds need to be adapted to your business.
Slow web performance is like clutter: it sneaks up on you slowly. 100 ms because of some feature, 300 ms because of a survey tag, and 200 ms because of your network peering. Eventually they add up, and your page can go from 2 seconds to 4 seconds in the blink of an eye (or, more accurately, of a release).
You need to keep tabs on performance in real time, or at least on a daily basis, depending on your business. You need to analyze every aspect of what impacts web performance: geography, ISPs, peering, DNS providers, APIs, third-party content… You need to become a HAWK! (The internal code name for Catchpoint was Hawk in its first year.) You need to slice and dice your performance data to understand what is being impacted, where, for whom, and by what, and turn this data into insight.
Recently, a site we were monitoring displayed an interesting pattern: its performance slowly started to degrade out of nowhere. Looking at the average, the response time rose by only 150 ms, which did not seem like a big deal!
Well, those 150 ms measured on backbone nodes will mean something else to end users who are accessing the site via DSL or slower pipes. Secondly, the average hid the real culprit and the real impact! By slicing the performance data by geography, ISP, and the different metrics Catchpoint collects, we identified that the problem existed across the United States but impacted only one backbone ISP. When we looked at it just from that ISP’s perspective, the increase was not 150 ms but rather 1,000 ms (when averages lie!).
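To make the "averages lie" point concrete, here is a minimal Python sketch of that kind of slicing. Everything in it is made up for illustration: the ISP names, the sample counts, and the response times are hypothetical and are not the data from the site in question. They are simply chosen so that one ISP slowing down by 1,000 ms shows up as a 150 ms shift in the overall average.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical samples: (isp, response_ms_before, response_ms_after).
# 17 samples from healthy ISPs and 3 from the one ISP with a problem.
samples = (
    [("isp_a", 2000, 2000)] * 6
    + [("isp_b", 2100, 2100)] * 6
    + [("isp_c", 1900, 1900)] * 5
    + [("isp_x", 2000, 3000)] * 3
)


def overall_delta(samples):
    """Average change in response time across all samples."""
    return mean(after - before for _, before, after in samples)


def delta_by_isp(samples):
    """Average change in response time, sliced per ISP."""
    grouped = defaultdict(list)
    for isp, before, after in samples:
        grouped[isp].append(after - before)
    return {isp: mean(deltas) for isp, deltas in grouped.items()}


if __name__ == "__main__":
    print("overall:", overall_delta(samples), "ms")
    for isp, delta in sorted(delta_by_isp(samples).items()):
        print(isp, delta, "ms")
```

With this made-up data, the overall average moves by 150 ms while the slice for isp_x moves by 1,000 ms and every other ISP stays flat. The same grouping works for any other dimension you collect, such as city or DNS provider.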
By using traditional networking tools like traceroute, built into Catchpoint, we were able to identify and prove a peering issue with a major backbone ISP. The site fixed the issue after a few days and its performance returned to “normal”.
Performance is the new downtime, and unlike downtime, it is harder to establish whether there is a problem or not. It is crucial for any site or service to monitor and collect as much data as possible 24/7, set up alerts based on deltas and variations and, most importantly, review this data very seriously on at least a daily basis. Silent killers can be prevented by early diagnostics and constant monitoring.
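As a rough illustration of what an alert "based on deltas" could look like, here is a small Python sketch. It is not how Catchpoint implements alerting. The function name, the use of a median baseline, and the 20% threshold are all assumptions made for the example, and the right threshold depends on your business.

```python
from statistics import median


def should_alert(history_ms, current_ms, max_increase_pct=20.0):
    """Return True when the current response time exceeds the baseline
    by more than max_increase_pct percent.

    The baseline is the median of recent measurements (an assumption for
    this sketch), so a single outlier in the history does not skew it.
    """
    baseline = median(history_ms)
    increase_pct = (current_ms - baseline) / baseline * 100
    return increase_pct > max_increase_pct


if __name__ == "__main__":
    # Hypothetical recent response times for one page, in milliseconds.
    history = [2000, 2050, 1950, 2000, 2100]
    print(should_alert(history, 2150))  # 7.5% above the baseline
    print(should_alert(history, 3000))  # 50% above the baseline
```

Comparing against a relative delta rather than a fixed number of seconds lets the same rule apply to a fast homepage and to a slower search page. Running it per slice (per ISP, per geography) keeps a localized problem from being hidden by the overall average.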
Update: On 10/3/2010 SFGate published a very interesting article: “Google’s speed need”. When Google, as part of an experiment, slowed down their site by 100-400 ms, they noticed a drop in users of between 0.2% and 0.6%. These are minuscule numbers, but the business impact is huge: a potential loss of “$900 million in revenue last year”. The lesson: Speed matters. A lot.
Mehdi – Catchpoint