
My 2 cents on the AWS failure and lessons learned from the past.

Single Point of Failure

A lot has already been published about the AWS EC2 failure. I wanted to add my 2 cents on the issue because it reminded me of a notorious event that happened to DoubleClick in August 2000.

What AWS and their customers experienced is unfortunate, but it can and will happen to anyone! In IT we are dealing with various complex systems (hardware, software, and people), and things are bound to break at some point. Failure is not limited to IT. Human history is full of such failures: automobile recalls, bank failures, nuclear disasters, collapsing bridges. What people should understand is that failure is bound to happen, so be ready for it, and learn from it to avoid repeating it in the future.

Let’s be real: very few companies out there have the money and resources to run redundant transactional systems in parallel that can act as a backup. For most companies, you just have to fail “nicely”. You should have plans and processes to deal with everything from troubleshooting the failure and recovering from it to notifying customers, and most importantly you should architect your application and systems so they fail nicely and can recover from such failures.

Companies that have websites or web applications must be able to redirect all requests to a “Service is down” webpage. Mobile or desktop applications relying on APIs might need special logic built in for such failures. However, if you are a company delivering services to other websites via tags, like adserving or widgets, things get a little more complicated. You cannot remove the tags from the webpages unless your clients build that capability into their pages. You need to be able to deliver from another location, at least enough to ensure your tags do not hurt the web performance and usability of your clients’ websites!
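To make the “Service is down” idea concrete, here is a minimal sketch in Python of a WSGI wrapper that answers every request with a static 503 page while an outage flag is set. The flag file path, the page content, and the name `fail_nicely` are all assumptions made for illustration; in practice the switch often lives in the load balancer or CDN rather than in the application.

```python
from pathlib import Path

# Hypothetical flag file: someone on call creates it to switch the site into outage mode.
OUTAGE_FLAG = Path("/tmp/service_down.flag")

# Static page kept in memory so it can be served without touching any backend.
DOWN_PAGE = b"<html><body><h1>Service is down</h1><p>We are working on it.</p></body></html>"


def fail_nicely(app, flag=OUTAGE_FLAG):
    """Wrap a WSGI app so every request gets a static 503 page while the flag file exists."""
    def wrapper(environ, start_response):
        if flag.exists():
            start_response("503 Service Unavailable", [
                ("Content-Type", "text/html; charset=utf-8"),
                ("Content-Length", str(len(DOWN_PAGE))),
                ("Retry-After", "300"),  # assumed value: ask clients to retry in 5 minutes
            ])
            return [DOWN_PAGE]
        # Normal operation: pass the request through to the real application.
        return app(environ, start_response)
    return wrapper
```

Wrapping an existing application is then one line, for example `application = fail_nicely(application)`.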

Back at DoubleClick we ran a fairly large infrastructure delivering billions of impressions; the DART tags are present on almost every major website. One day in 2000 we had a really bad outage and our tags “stopped” working because the adserving system experienced a catastrophic meltdown. Customers were not happy, but they understood that technology fails sometimes, and they had SLAs to protect them. What they were most unhappy about was that the DoubleClick ad tag had such an incredible impact on the performance of their sites. Webpages slowed to a crawl or stopped loading, and the user experience was horrible! Our clients couldn’t recover from our failure. Some were able to remove the tags via their Content Management Systems, but others just had to suffer through it.

So we went back to the drawing board and built a complete secondary system capable of handling the billions of ad calls, but one that would only deliver 1×1 pixels or empty JavaScript. In case of a major outage the ads would not work, but at least we would not take down the customer’s entire site, and its user experience, with us. That “Dot” system was never used in real life, but it was always there in case we needed it.
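A fallback of this kind can be very small. The sketch below is not the actual “Dot” system; it is a hypothetical Python illustration of the same idea, answering every ad call with either a 1×1 GIF or an empty JavaScript body. The rule that a path ending in `.js` gets JavaScript, the port, and all names are assumptions.

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# A 43-byte transparent 1x1 GIF, spelled out block by block.
PIXEL_GIF = bytes.fromhex(
    "474946383961"          # "GIF89a" signature
    "01000100" "800000"     # 1x1 logical screen, 2-colour global palette
    "000000ffffff"          # palette: black, white
    "21f9040100000000"      # graphic control extension (colour 0 is transparent)
    "2c000000000100010000"  # image descriptor for a 1x1 image
    "0202440100"            # LZW-compressed pixel data
    "3b"                    # trailer
)


def fallback_response(path):
    """Return (content_type, body) for any ad call; never touches a backend."""
    if path.split("?", 1)[0].endswith(".js"):
        return "application/javascript", b""  # empty script: the tag becomes a no-op
    return "image/gif", PIXEL_GIF


class DotHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        content_type, body = fallback_response(self.path)
        self.send_response(200)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(body)))
        self.send_header("Cache-Control", "no-store")  # do not cache the placeholder
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # skip per-request logging to keep the fallback cheap


def run(port=8080):
    """Serve placeholders until interrupted."""
    ThreadingHTTPServer(("", port), DotHandler).serve_forever()
```

Calling `run()` starts the server. The point of the design is that the response is fast and tiny, so the customer’s page keeps loading even though no ad is shown.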

The first lesson for companies that provide services to other websites is to not rely on a single vendor for hosting: spend a few hundred dollars and get a backup plan. Then the next time AWS or anyone else goes down, you will not have hurt the user experience of the folks visiting your customers’ sites. And once you have that backup system in place, test it frequently! Make sure the right folks know when to pull the trigger and that the system has the capacity to handle the load.

The second lesson is about diversification; do not put all your eggs in one basket. If you go with vendor A for hosting, choose vendor B for DNS, choose vendor C for CDN…

Lastly, if you are a website relying on 3rd party vendors, make sure you monitor them. Also learn about their technology and their vendors: who they rely on for hosting, who their DNS provider is, and most importantly what their backup plans are in case that tag slows to a crawl!
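Monitoring a vendor’s tag can start as a simple timed request. The following Python sketch classifies a tag URL as “ok”, “slow”, or “down”; the thresholds and the function names are assumptions, and a real monitor would run this on a schedule from several locations and alert on the result.

```python
import time
import urllib.request


def fetch_with_urllib(url, timeout):
    """Download the tag; raises on HTTP errors, network errors, and socket timeouts."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read()


def check_tag(url, slow_after=0.5, timeout=3.0, fetch=fetch_with_urllib):
    """Classify a third-party tag as 'ok', 'slow', or 'down'.

    slow_after and timeout are in seconds; both defaults are assumed values.
    The fetch parameter exists so the check can be exercised without a network.
    """
    start = time.monotonic()
    try:
        fetch(url, timeout)
    except Exception:
        return "down"
    elapsed = time.monotonic() - start
    return "slow" if elapsed > slow_after else "ok"
```

A call such as `check_tag("https://vendor.example/tag.js")` (a made-up URL) returns one of the three strings, which can then feed whatever alerting you already have.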

The cloud is great, and it is the future of IT, but do not drink too much of the kool-aid or “cloud-aid”. Be ready for outages and failures!

Mehdi – one of the guys who handled those angry customer phone calls in 2000.

For more about the AWS issue: The Big List of Articles on the Amazon Outage

Posted on: April 25th, 2011

Category: Outage
