This past weekend, the media was ablaze with another online outage, and this time it was not a major ecommerce site, social network, or cloud provider. Dropbox, a very popular file-sharing service, was down for nearly 48 hours after scheduled maintenance to update the OS on its servers. During the update a scripting error slipped through, causing Dropbox to go down from 5:30 PM PT Friday to 4:40 PM PT Sunday.
I would like to commend the Dropbox team. They worked around the clock to restore service and kept in touch with their users. I’d like to give them an A+ in Mastering Disaster for sharing their post-mortem analysis, what they learned, and how they are updating their service to avoid this kind of failure in the future. At the same time, there were some issues with how they handled the outage, from which the rest of us can learn a thing or two.
There is no clear way to eliminate outages. Failure is bound to happen no matter how big or small you are, how much money you have invested in the infrastructure, or how many smart people are on staff. The only option is to manage failure in order to minimize its occurrences and the impact on end users/customers.
If there is one thing you could implement today to help your company manage failure, it is to instill in your company culture a passion for building fast and reliable systems and, most importantly, an understanding of failure and how to handle it.
This does not mean giving a slideshow of the latest industry performance statistics to all of your employees. It means setting goals, attending events, building effective cross-department communication, and lots of training. Definitely pay attention to the speakers at the Web Perf Meetup near you, and attend as many sessions as possible to see what others are doing and what is working.
Make performance part of development and testing – identify all of the places that can break your service or slow things down and make sure all of these conditions are tested for before going live.
Things do fail or break; it’s part of nature. We are human, and no human is perfect. Ensure that your Development, Marketing/PR, and Support teams have a failure plan ready that covers everything from defining triage and war room roles, to building a backup site/service, to deciding what is communicated by whom during and after the outage.
Most importantly, ensure no one falls prey to jumping the gun on “the problem is fixed”: there is nothing worse than broadcasting “problem solved” to your customers while the problem still exists for them. It is the quickest way to lose their trust in you. Ensure proper testing is done after the problem is resolved, and confirm that all your customers can access the site/service for some period of time (60 minutes at least) before claiming victory.
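To make the "60 minutes at least" rule concrete, here is a minimal sketch of how a team might gate the "problem solved" announcement on a clean streak of health checks. Everything in it is an assumption for illustration: the function name `safe_to_announce`, the shape of the check records, and the idea that your monitoring already produces timestamped pass/fail results. It is not how Dropbox or any particular monitoring product does it.

```python
from datetime import datetime, timedelta
from typing import Iterable, Tuple

# Hypothetical record shape: (time the check ran, whether it passed).
Check = Tuple[datetime, bool]


def safe_to_announce(checks: Iterable[Check],
                     window: timedelta = timedelta(minutes=60)) -> bool:
    """Return True only if the most recent unbroken run of passing
    checks spans at least `window` (60 minutes by default)."""
    ordered = sorted(checks, key=lambda c: c[0])

    # No data, or the latest check failed: do not announce anything.
    if not ordered or not ordered[-1][1]:
        return False

    latest = ordered[-1][0]
    streak_start = latest

    # Walk backwards until the first failure to find where the
    # current passing streak began.
    for timestamp, passed in reversed(ordered):
        if not passed:
            break
        streak_start = timestamp

    return latest - streak_start >= window
```

In practice you would feed this with results gathered from several locations and customer segments, so that "fixed for us" is not mistaken for "fixed for everyone". A single failed check resets the clock, which is exactly the point.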