Wednesday, March 13th, was an interesting day: the world got to see what happens when a social media giant that is also a technology leader goes down. The Facebook outage affected users around the world for an extended period.

At Catchpoint, we monitor and benchmark the availability and performance of major websites across multiple segments. Both are key to delivering an amazing digital experience to end users.

First warning sign: a micro-outage

The first alert indicating that Facebook was having issues came in at 4:02:40 UTC. This micro-outage lasted around 30 minutes and didn’t have a widespread impact.

During this period, we saw Facebook servers return an HTTP 503 error.

Details showing a test failure and 503 error message from Facebook

When we receive notifications like this, we follow a few troubleshooting steps.

  • We look at other social networks to see if users are complaining.
  • We manually try loading the site to see if we experience issues.

At this time, we found no corroborating complaints from users on other social networks, there was no news of a Facebook outage, and we were able to open both Facebook and Instagram. We decided to keep an eye on it.

Critical alerts received

We started receiving critical alerts again at 3/13/2019 16:06:58 UTC. This time around, the server was returning a 500 HTTP error response code.

Waterfall showing 500 error code response from Facebook.
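The two status codes seen in these alerts, 503 and 500, are both server-side errors. As a minimal sketch (the `describe` helper is hypothetical and is not part of Catchpoint's tooling), Python's standard `http.HTTPStatus` enum can turn a raw code into its standard reason phrase when triaging alerts:

```python
from http import HTTPStatus


def describe(code):
    """Return the code, its standard reason phrase, and a coarse class."""
    status = HTTPStatus(code)
    # 5xx codes mean the server failed to fulfil an apparently valid request.
    kind = "server error" if 500 <= code <= 599 else "other"
    return f"{code} {status.phrase} ({kind})"


print(describe(503))  # 503 Service Unavailable (server error)
print(describe(500))  # 500 Internal Server Error (server error)
```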

Scatter plot showing the micro-outage and the start of the major outage.

The issue was global. It was not tied to any particular ISP or region. Based on our network monitoring data, we quickly ruled out a network-level problem such as DNS or BGP.

Chart showing no issues with packet loss, ping, or connection times.

image of Facebook traceroute results

Impact on end-user experience

While we waited for Facebook to post an update on the outage, we started looking at the impact on end-user experience.

World map showing end user impact of Facebook outage

Users across the globe were impacted. While users in some cities on certain ISPs could load at least the homepage, the majority could not.

Bar charts showing ISP performance

We then tried to understand why the outage was spotty rather than the site being hard down.

Bar charts showing Facebook available in some cities.

The chart above shows that users in some city and ISP combinations, such as Milan (Telia), Atlanta (Zayo), and Mumbai (Telia), were able to view the homepage. We looked at the IPs serving these nodes to understand whether the behavior was specific to certain VIPs, PoPs, or servers.

We noted that some servers never resulted in failures.

Listing of availability by server

For the locations with no failures, we analyzed the server IPs involved to see whether those servers always produced successful page loads. They did not.

Listing of working servers showing no pattern

Thus, this also wasn’t a DDoS attack, as we didn’t see servers go down one after another or the service degrade gradually on a specific server.
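The per-server check described above amounts to grouping test results by the server IP that answered and computing a success rate for each. The sketch below is illustrative only: the `availability_by_server` function is hypothetical, and the sample data uses documentation-range IP addresses, not Facebook's servers or our actual measurements.

```python
from collections import defaultdict


def availability_by_server(results):
    """Map each server IP to its percentage of successful page loads.

    `results` is an iterable of (server_ip, succeeded) pairs.
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for ip, ok in results:
        totals[ip] += 1
        if ok:
            successes[ip] += 1
    return {ip: 100.0 * successes[ip] / totals[ip] for ip in totals}


# Hypothetical sample data (192.0.2.0/24 is reserved for documentation).
sample = [
    ("192.0.2.10", True), ("192.0.2.10", False),
    ("192.0.2.11", True), ("192.0.2.11", True),
    ("192.0.2.12", False), ("192.0.2.12", False),
]
availability = availability_by_server(sample)
for ip, pct in sorted(availability.items()):
    print(f"{ip}: {pct:.0f}% available")
```

If one group of servers sat at 100% while the rest sat near 0%, that would point to a specific PoP or VIP. A mix of partial success rates across servers, as in this outage, shows no such pattern.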

Communication during the Facebook outage

Facebook first tweeted at 10:49 AM PT, acknowledging the issue.

Tweet from Facebook saying "We're aware that some people are currently having trouble accessing the Facebook family of apps. We're working to resolve the issue as soon as possible."

That was over 1.5 hours after Catchpoint initially caught the issue at 9:06 AM PT, and after Facebook’s user base had already taken to Twitter to complain.
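The gap between detection and acknowledgment can be checked with a little date arithmetic. This is a minimal sketch that assumes Pacific Daylight Time (UTC-7) was in effect on 3/13/2019, which matches the 16:06:58 UTC and 9:06 AM figures in this post. The variable names are illustrative.

```python
from datetime import datetime, timedelta, timezone

# Assumption: US Pacific daylight time (UTC-7) on 3/13/2019.
PDT = timezone(timedelta(hours=-7), "PDT")

alert_utc = datetime(2019, 3, 13, 16, 6, 58, tzinfo=timezone.utc)
alert_pt = alert_utc.astimezone(PDT)               # first critical alert
tweet_pt = datetime(2019, 3, 13, 10, 49, tzinfo=PDT)  # Facebook's first tweet

gap = tweet_pt - alert_pt
print(alert_pt.strftime("%H:%M:%S"))  # 09:06:58
print(gap)                            # 1:42:02
```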

We also took a look at how Facebook was handling user experience during the outage. Most organizations put up a maintenance page or a “Sorry, we will be back soon” page.

During the outage, users saw a variety of error pages.

Some saw:

Example of Facebook error message saying "Something went wrong"

And a few others saw:

Message stating Facebook is down for maintenance and will be back soon.

Those who checked too often to see whether Facebook was back up were shown the message below:

Facebook message saying "Please try again later. You are trying too often."

This, together with the lack of communication about what was going on, resulted in outrage among users.

Tweets on Facebook Outage

What would have helped from an end-user experience perspective is frequent updates on what was going on. Facebook did assure users that it wasn’t a DDoS attack.

Incident resolution

Facebook fixed the issue around 15:00 PT. After an incident is resolved, it is important to look at how the remediation takes effect.

1. Are all users able to load the page successfully at the same time?
2. Are there any corner cases that still need to be addressed?

Below are the last failure timestamps in the respective countries:

United Kingdom – 3/13/2019 14:58:11 PT
Switzerland – 3/13/2019 15:35:50 PT
Canada – 3/13/2019 18:55:31 PT
United States – 3/14/2019 01:10:16 PT
Hong Kong – 3/14/2019 05:40:53 PT
Australia – 3/13/2019 22:42:19
India – 3/14/2019 06:44:26 PT
Singapore – 3/14/2019 04:41:55
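To see the order in which countries recovered, the timestamps above can be sorted. This sketch assumes every entry is in the same time zone (Pacific Time), which the list does not state explicitly for every country, so treat the resulting order as indicative. The dictionary and variable names are illustrative.

```python
from datetime import datetime

FMT = "%m/%d/%Y %H:%M:%S"

# Last failure per country, copied from the list above.
# Assumption: all values share one time zone (PT).
last_failure = {
    "United Kingdom": "3/13/2019 14:58:11",
    "Switzerland": "3/13/2019 15:35:50",
    "Canada": "3/13/2019 18:55:31",
    "United States": "3/14/2019 01:10:16",
    "Hong Kong": "03/14/2019 05:40:53",
    "Australia": "3/13/2019 22:42:19",
    "India": "3/14/2019 06:44:26",
    "Singapore": "3/14/2019 04:41:55",
}

# Sort countries from earliest to latest last-seen failure.
ordered = sorted(
    last_failure,
    key=lambda country: datetime.strptime(last_failure[country], FMT),
)
for country in ordered:
    print(f"{country}: {last_failure[country]}")
```

Under that assumption, the United Kingdom recovered first and India last, nearly 16 hours apart.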

Scatter plot showing service restored in Australia, Hong Kong, India, and Singapore

Scatter plots showing when service returned to normal in Canada, Switzerland, United Kingdom, and United States.

In the United States, failures continued into Thursday morning.

The last failure we noted was in Philadelphia at 03/14/2019 01:10:16 PT.

Across the globe, the last failure we noted was from the Delhi (Airtel) node in India at 3/14/2019 06:44:26 PT.

Line chart showing failures Thursday morning

Facebook tweeted at 9:24 AM on Thursday that a server configuration issue had caused the problems. The total downtime noted in Catchpoint ran from 3/13/2019 09:06:58 PT to 3/14/2019 06:44:26 PT.

Facebook tweet on resolution of incident.
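For reference, the length of the downtime window noted in Catchpoint follows directly from its two endpoints. This is a small sketch with illustrative variable names; both timestamps are taken as PT, as stated in the post.

```python
from datetime import datetime

FMT = "%m/%d/%Y %H:%M:%S"

start = datetime.strptime("3/13/2019 09:06:58", FMT)  # first critical alert
end = datetime.strptime("3/14/2019 06:44:26", FMT)    # last failure (Delhi)

downtime = end - start
hours, remainder = divmod(int(downtime.total_seconds()), 3600)
minutes, seconds = divmod(remainder, 60)
print(f"{hours}h {minutes}m {seconds}s")  # 21h 37m 28s
```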

2018 saw some major outages; this is the biggest outage of 2019 so far.

Lesson learned:
Outages happen, even to the best and biggest companies out there. Customer notification and communication are key in moments like this.