On August 30, CenturyLink/Level3 suffered a major IP outage across its global network, adversely impacting its customers, other ISPs, and digital services around the world. The outage lasted more than five hours (10:00 UTC to 15:30 UTC on a Sunday morning), spread outward from CenturyLink’s network into other internet service providers, and caused connectivity problems for many companies, including tech giants like Amazon, Twitter, Microsoft (Xbox Live), Cloudflare, EA, Blizzard, Steam, Discord, Reddit, Hulu, Duo Security, Imperva, NameCheap, OpenDNS, and many more.

Fig 1: Outage heatmap

As the Geo Chart dashboard shows, this global internet outage, which originated at CenturyLink’s Mississauga data center, impacted not only CenturyLink’s network but other entities as well: the affected areas were not restricted to North America but also covered the APAC and EMEA regions.

Outage Impact

Because CenturyLink is a transit provider, any network (enterprise or otherwise) peering with it was impacted. The outage impact can be categorized into four groups:

  1. End users originating from the impacted ISPs – Last Mile
  2. Cloud providers, CDNs, managed DNS, and other third-party vendors
  3. Networks whose immediate peer was CenturyLink/Level3 – Middle Mile
  4. Enterprises using CenturyLink/Level3 as their First Mile provider

End users originating from the impacted ISPs – Last Mile

Fig 2: Data breakdown by ISP

In Fig 2, we highlight the impacted ISPs. CenturyLink/Level3 showed significantly more downtime than the other ISPs while accessing a popular global eCommerce website. End users experienced:

  • DNS resolution errors
  • TCP connection failures
  • In some cases, key page elements such as images, CSS, JavaScript, and fonts failed to load
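These two failure modes can be distinguished with a simple active probe. The sketch below is a hypothetical helper (not the monitoring tooling used in this analysis) that separates DNS resolution errors from TCP connection failures:

```python
import socket

def probe(host, port=443, timeout=3):
    """Classify basic reachability: DNS resolution first, then TCP connect."""
    try:
        addr = socket.gethostbyname(host)  # DNS resolution step
    except socket.gaierror:
        return "dns_fail"
    try:
        # TCP handshake to the resolved address
        with socket.create_connection((addr, port), timeout=timeout):
            return "ok"
    except OSError:
        return "tcp_fail"
```

Running a probe like this from vantage points inside each ISP is what produces the kind of per-ISP breakdown shown in Fig 2.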

Cloud providers, CDNs, managed DNS, and other third-party vendors impacted

Most organizations rely on multiple Managed DNS providers and multiple CDN providers. In Fig 3, we highlight the multiple DNS Name Servers impacted during the outage. This explains the DNS resolution failures seen by the end users.

Fig 3: DNS impacted during the outage
Fig 4: DNS Direct test for NS

Enterprises adopt a multi-CDN strategy to ensure they can deliver a reliable and optimal experience to their end users. In scenarios like this, where a single incident impacts multiple networks and services, having real-time monitoring data helps organizations quickly detect issues and determine the best network or service to which to reroute their end users.
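The rerouting decision itself can be reduced to a simple policy over the monitoring data. The sketch below is an illustrative (not vendor-specific) selector: among CDNs meeting an availability floor, pick the lowest-latency one; if none qualify, fall back to the most available.

```python
def pick_cdn(metrics, min_availability=0.99):
    """metrics: {cdn_name: (availability, latency_ms)}.
    Among CDNs meeting the availability floor, return the lowest-latency one;
    if none qualify, fall back to the most available CDN."""
    ok = {c: m for c, m in metrics.items() if m[0] >= min_availability}
    if ok:
        return min(ok, key=lambda c: ok[c][1])
    return max(metrics, key=lambda c: metrics[c][0])
```

During an event like this one, the availability numbers for CDNs reached through the impacted transit would collapse, and the policy would steer traffic to a CDN still reachable over healthy paths.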

In Fig 5, we see that while all CDNs were impacted, the Level 3 CDN was hit the hardest because of the CenturyLink/Level3 IP outage.

Fig 5: Impacted CDNs

Drilling down further (Fig 6) we identified a couple of patterns:

  • The reachability of the Level3 CDN was impacted across all ISPs.
  • All CDNs suffered a performance hit on CenturyLink and Level 3; however, certain CDNs performed better on consumer ISPs such as AT&T, Comcast, and Verizon.

Fig 6: Impact on CDN across ISPs

Networks whose immediate peer was CenturyLink/Level3 – Middle Mile

Some CDNs, like Fastly, were able to minimize the impact by detecting the issue and adapting to it.

For example, in the scenario below (Fig 7), before the outage the GTT AS peered with the Fastly AS via the Level3 AS. As soon as the issue began, GTT switched to the Telia AS instead of Level3; once the network recovered, GTT switched back to Level3.

Fig 7: Peering before, during and after
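The peering switch in Fig 7 is visible directly in the AS paths collected over time. A minimal sketch, using made-up path samples (real ASNs: GTT 3257, Level3 3356, Telia 1299, Fastly 54113), extracts the transit AS immediately upstream of the origin and reports when it changes:

```python
def upstream_of_origin(as_path):
    """Return the AS immediately preceding the origin AS in a space-separated path."""
    hops = as_path.split()
    return hops[-2] if len(hops) >= 2 else None

def detect_switches(samples):
    """samples: list of (timestamp, as_path) in time order.
    Return [(timestamp, old_upstream, new_upstream)] for each change."""
    changes, prev = [], None
    for ts, path in samples:
        up = upstream_of_origin(path)
        if prev is not None and up != prev:
            changes.append((ts, prev, up))
        prev = up
    return changes
```

Applied to a route collector's view of Fastly's prefixes, this surfaces exactly the Level3 → Telia → Level3 sequence described above.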

The above example was about a vendor. The next example (Fig 8) focuses on BGP routes and highlights how an enterprise was impacted because its upstream ISP was peering with Level3.

Fig 8: BGP routing changes

NTT begins peering with Cogent instead of Level 3 at 13:00 UTC (9 AM ET), approximately three hours after the start of the outage. Level3 announcements do not reflect NTT’s revocation of routes until the incident is resolved and CenturyLink regains control of its network, at which point the BGP path reflects NTT’s peering change from Level3 to Cogent.

Another major ISP, Telia, indicated that at approximately 14:00 UTC CenturyLink asked it to remove the peering with its network in order to reduce the number of route announcements CenturyLink received (a move likely meant to stabilize its network). Once the incident was resolved, CenturyLink asked Telia to re-establish peering. Route flapping can take a significant toll on routing infrastructure, and de-peering would have reduced the number of BGP updates CenturyLink received from its peers, allowing its control plane to stabilize and break the looping pattern described earlier.

Root Cause Analysis

The outage started on August 30th at 10:04 UTC and was caused by an offending Flowspec announcement (RFC 5575) – a mechanism typically used by providers to distribute a set of firewall rules across their network – that prevented BGP sessions from establishing correctly, according to CenturyLink’s own official outage explanation.
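To make the trigger concrete, a Flowspec rule couples a traffic match with an action and is carried inside BGP itself. The fragment below is an illustrative rule in ExaBGP-style configuration; the prefix and port are placeholders, since CenturyLink did not publish the contents of the actual offending announcement:

```
flow {
    route mitigate-attack {
        match {
            destination 203.0.113.0/24;
            protocol udp;
            destination-port =53;
        }
        then {
            discard;
        }
    }
}
```

Because such rules propagate through the control plane like ordinary routes, a malformed or overly broad rule can ripple across an entire backbone, which is what makes Flowspec both powerful and dangerous.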

By monitoring BGP routes, it is possible to spot suspicious volumes of traffic on BGP sessions, far larger than usual. For example, it is easy to see that the RouteViews route collector located at the London Internet Exchange (LINX) was collecting up to five times the number of BGP packets it collects during regular times.
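This kind of volume check can be sketched as a simple rate comparison: bucket BGP updates by time and flag buckets that exceed a multiple of the pre-incident baseline. The thresholds and data below are illustrative, not those used against the RouteViews feed:

```python
from collections import Counter

def flag_bursts(timestamps, bucket_secs=300, baseline_buckets=6, factor=3.0):
    """Count updates per `bucket_secs` bucket and return the start times of
    buckets exceeding `factor` times the average of the first
    `baseline_buckets` (pre-incident) buckets."""
    counts = Counter(ts // bucket_secs for ts in timestamps)
    buckets = sorted(counts)
    base = buckets[:baseline_buckets]
    baseline = sum(counts[b] for b in base) / max(len(base), 1)
    return [b * bucket_secs for b in buckets if counts[b] > factor * baseline]
```

A 5x spike like the one seen at LINX would trip any reasonable `factor` setting within a single bucket.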

A deeper look at the data collected on August 30th shows that AS 3356 (Level3) was the source of a large number of announcements, both appearing directly in the AS paths of the collected data and indirectly causing rerouting through its flapping sessions and outage.

Fig 9: IPv6 announcements

From the very same dataset, we can also see how many networks were involved in the outage. Before the outage, AS 3356 appeared in the AS path of about 14% of the IPv4 routes and 10% of the IPv6 routes collected by the RouteViews collector. Once the outage started, some of these networks were able to mitigate the problem in a matter of minutes via multihoming or direct peers, most likely using automated mitigation methods and network monitoring tools. Interestingly, about 6% of the routes were not as responsive, probably reacting only once tickets started flooding their NOCs. Even worse, about 8% of the routes were still using CenturyLink and thus remained fully exposed to the outage, most likely because of poor network planning and the lack of a valid alternative to bypass the outage.

Fig 10: Routes crossing AS3356
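The route-share figures above come from a straightforward scan of the collected routing table: count how many AS paths contain AS 3356. A minimal sketch over synthetic paths (the real analysis ran over the full RouteViews dump):

```python
def share_crossing(paths, asn="3356"):
    """Fraction of routes whose space-separated AS path contains `asn`."""
    if not paths:
        return 0.0
    hits = sum(1 for p in paths if asn in p.split())
    return hits / len(paths)
```

Running this before, during, and after the incident is what yields the 14%-before versus 8%-stuck breakdown described above.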

Conclusion

There is a common notion that network outages are beyond the control of organizations, and we have seen such incidents happen again and again. The truth is that there is always something that can be improved. Most organizations that had built their networks with multiple transit providers were able to mitigate the problem by turning down their direct BGP sessions with CenturyLink, as CenturyLink itself suggested, and redirecting their traffic to other providers, avoiding the source of the problem.

In today’s distributed setup, the application delivery chain consists of multiple disparate yet interdependent parts, and incidents like this show the impact a network outage can have on your infrastructure (DNS, load balancers, CDNs, cloud infrastructure, data centers, etc.) and, most importantly, on your end-user experience and overall business.

The incident is also a reminder that monitoring is essential. In this case, even multihomed BGP setups were unable to auto-correct the situation. Mitigating the impact starts with detection through effective digital experience monitoring; the issue can then be quickly analyzed and resolved with either manual or automated intervention.

Some people were upset with the time it took to bring the services back online. An error or a hardware failure can cause havoc in a single second, bringing an entire datacenter and network down. Resolving the issue and bringing the systems back up with the right configurations and without compromising data integrity can be time-consuming. The problem is compounded when a majority of the workforce is remote.

This outage can serve as a learning experience for many organizations to:

  • Review their monitoring strategy for everything between their end users and their content.
  • Identify fallback mechanisms.
  • Monitor their vendors and hold them accountable.
  • Review their incident management practices and those of their vendors.
  • Test their mitigation plans.