Amazon Web Services (AWS) S3, Amazon’s web-based storage service, experienced widespread issues in North America beginning at 12:37 PM EST on February 28. Amazon’s status dashboard reported “high error rates with S3 in US-EAST-1,” which was the only explanation provided at the time.
Consequently, many popular online services that use S3, such as Quora, Imgur, and Trello, suffered outages throughout the day. Amazon’s own status dashboard was affected too: its status icons are hosted on S3 and could not be updated until 2:35 PM EST.
S3 was completely unavailable beginning around 12:37 PM EST and began improving around 3:45 PM EST, as seen in the chart below.
Ironically, Isitdownrightnow.com, a website that reports whether another site is currently unavailable, was also down during this time.
AWS continued to provide updates for affected services throughout the day on their status page.
While Quora was unavailable, their website www.quora.com was returning a “504 Gateway Timed Out” error. Using our synthetic monitoring tool, we could see the failures occurring in real time.
You can also see the 504 being returned for Quora’s homepage in the Waterfall chart below.
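To make the idea of a synthetic check concrete, here is a minimal sketch in Python. It is not how our monitoring tool works internally; the function names `probe` and `is_failure` are hypothetical, and it uses only the standard library. It issues one GET request and reports either the HTTP status code (such as the 504 seen above) or the network-level error.

```python
import urllib.error
import urllib.request


def probe(url, timeout=10.0):
    """Issue one GET request and return (status_code, error_message).

    Exactly one of the two values is None. The name and return shape are
    assumptions made for this sketch.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, None
    except urllib.error.HTTPError as exc:
        # 4xx/5xx responses, including "504 Gateway Timeout", arrive here.
        # HTTPError must be caught before URLError, which is its parent class.
        return exc.code, None
    except (urllib.error.URLError, OSError) as exc:
        # DNS failures, refused connections, and timeouts arrive here.
        return None, str(exc)


def is_failure(status):
    """Treat a missing response or any 5xx status as a failed check."""
    return status is None or status >= 500


if __name__ == "__main__":
    status, error = probe("https://www.quora.com/")
    print("status:", status, "error:", error, "failed:", is_failure(status))
```

Running a check like this on a schedule from several locations is the basic idea; a real monitoring product adds timing breakdowns, waterfalls, and alerting on top.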
Mashable.com was among the many other sites that faced significant issues, such as images failing to load because they were hosted in S3 buckets. Below is an instant test of Mashable.com showing multiple images that were not served.
The traceroute below ran from Catchpoint to one of the S3 buckets. As you can see, timeouts occurred closer to the destination.
We can group the downtime into two buckets:
We could not establish a TCP connection to the S3 end points from anywhere in the world (it was not a geo or network transit issue).
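A failed TCP connection is easy to check for yourself. The sketch below, in Python, is only an illustration and not the test we ran; the function name `can_connect` and the default port and timeout are assumptions. It attempts a TCP handshake and reports whether it completed.

```python
import socket


def can_connect(host, port=443, timeout=5.0):
    """Return True if a TCP connection to (host, port) can be established.

    Only the TCP handshake is tested; no TLS or HTTP traffic is sent.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution errors.
        return False


if __name__ == "__main__":
    # s3.amazonaws.com is the generic S3 endpoint; swap in your own bucket host.
    print(can_connect("s3.amazonaws.com", 443))
```

Running the same check from several networks and regions is what lets you rule out a geographic or transit problem: if every vantage point fails at the handshake, the issue is at the destination.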
At the end of this hectic day, we’re left with a cold, hard truth – 100% uptime is unrealistic. Precautions must be taken for when situations like this occur, no matter how robust the system is. Monitoring your own services, along with third-party services, enables you to catch performance issues and resolve them in a timely manner to ensure your user base’s confidence in your service. Communication with your users is also crucial when catastrophe strikes. Amazon took the proper steps in communication by being upfront and transparent about the issue across multiple platforms, allowing some reprieve for their users during a time of utter chaos.
We should also remember that it wasn’t Amazon’s fault that these major websites and services were completely out of service during this time. The cloud is still just a bunch of servers, switches, and someone’s code. This means it’s still vulnerable to failures, outages, and performance issues, and this isn’t the first time AWS has failed. It’s not Amazon’s responsibility to create a redundancy plan for its customers; it’s the customer’s job to make sure that their business is covered when the services they use fail. Many of the companies that went down yesterday offer products and services that other companies rely on every day to do their jobs.
Having a contingency plan, such as distributing across multiple cloud services and zones, is what determines how much damage an outage like this does to a business.
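As a rough illustration of what such a contingency plan can look like in application code, the Python sketch below tries a list of storage backends in order and returns the first successful result. Everything here is hypothetical: `fetch_with_failover`, `AllBackendsFailed`, and the two stand-in stores are invented for the example, and a real setup would call actual storage clients in different regions or providers and keep their data in sync.

```python
class AllBackendsFailed(Exception):
    """Raised when every configured backend has failed."""


def fetch_with_failover(key, backends):
    """Try each (name, fetch) pair in order; return (name, data) on first success.

    `backends` is a sequence of (name, callable) pairs, where each callable
    takes a key and returns the object or raises an exception.
    """
    errors = []
    for name, fetch in backends:
        try:
            return name, fetch(key)
        except Exception as exc:  # broad on purpose: any failure triggers failover
            errors.append(f"{name}: {exc}")
    raise AllBackendsFailed("; ".join(errors))


# Hypothetical stand-ins for two independent storage locations.
def primary_store(key):
    raise ConnectionError("primary region unreachable")


SECONDARY_OBJECTS = {"logo.png": b"fake-image-bytes"}


def secondary_store(key):
    return SECONDARY_OBJECTS[key]


if __name__ == "__main__":
    source, data = fetch_with_failover(
        "logo.png",
        [("primary", primary_store), ("secondary", secondary_store)],
    )
    print(source, len(data))
```

The ordering of the list is the design decision: the preferred location goes first, and the fallback is only touched when the primary raises. The hard part in practice is not this loop but keeping the secondary copy current.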
Many people are quick to assume that those affected by this outage were single-entity websites; however, the magnitude is much larger than that. The scope of impact ranges anywhere from websites to IoT (Internet of Things), and many companies rely on such cloud services for every aspect of their digital experience. In fact, we found ourselves deeply affected by this outage in several different ways: the video conferencing systems Zoom and Bluejeans were down, our office door management system Kisi was inaccessible, and Duo Security and several other tools and systems we use on a daily basis were completely unavailable.
This is the second time in the last several months that our daily operations were severely affected by a major cloud service’s outage. We are now going to be grilling our vendors and asking them:
We do not use any public cloud service (AWS, Google, Azure), not because we do not want to, but because many of our customers forbid us from using them. Now we understand why!
The most important takeaway from this incident is that we all have a duty to our customers to provide the best service possible, under any circumstance. The tools we use will only take us so far—it’s up to us to make sure our critical components are covered by redundancy.
By: Mehdi Daoudi, Nilabh Mishra, Mitchell Zelmanovich, David Lui