If you have ever visited California in June, you are familiar with June Gloom weather. Well, apparently it now applies to our digital forecast as well. This is no longer just about watching cat videos. Serious business is at stake, along with real-world impact on people, their finances, and their health.
This month has seen multiple incidents that left the internet crippled for hours at a time. Just a week after the Google Cloud outage, the internet suffered another major outage that impacted various companies; the most visible one was Cloudflare because of the impact it had on hundreds of websites. Catchpoint detected the issue within seconds and kept track of performance, availability, and reachability across multiple sites as the issue unfolded.
At around 10:30 AM UTC (6:30 AM EDT) on June 24, we noticed several websites starting to have performance issues, including Barnes and Noble, Discord, AWS services, and even privacy management services such as OneTrust. Analyzing the data coming in, it was interesting to note that most of these websites were either using Cloudflare or third-party services that depended on Cloudflare. The heat map below illustrates the global impact of the outage.
Performance data for Cloudflare showed a similar pattern. Increased wait and connect times impacted website availability, leaving end users to deal with websites that were either too slow or didn’t load at all.
The graph below compares the incident data against data from two days earlier; the blue line shows a clear spike in connect time during the incident, which impacted document complete time and corresponds to the dip in website page views recorded by our Real User Monitoring (RUM).
Further analysis of the data indicated a network routing issue. The Sankey chart shows us where the routing went wrong. Routes announced by DQE Communications (AS33154) were picked up by Allegheny Technologies Inc (AS396531), which then forwarded them to Verizon (AS701). The new routing path created a bottleneck in the network that steadily stalled incoming traffic, causing the drop in performance.
The outage had a domino effect on performance; websites that used third-party services dependent on Cloudflare also exhibited performance degradation, while websites using Cloudflare DNS services were rendered inaccessible due to unresponsive DNS servers.
The issue lasted for over an hour before the flawed routing was corrected. Cloudflare reported a 15% drop in its global traffic during the incident. The image below highlights the traffic flow during the incident and once it was resolved.
Where did it go wrong?
At 10:30 am (UTC), Allegheny Technologies Inc. started to propagate prefixes received from one of its providers (DQE Communications) to another provider (AS701 – Verizon), triggering a BGP route leak of type 1, a “Hairpin Turn with Full Prefix.” RFC 7908 defines a route leak as “[…] the propagation of routing announcement(s) beyond their intended scope. That is, an announcement from an Autonomous System (AS) of a learned BGP route to another AS is in violation of the intended policies of the receiver, the sender, and/or one of the ASes along the preceding AS path.”
According to the community, AS33154 sent AS396531 more-specific routes for several popular Internet destinations, including Cloudflare, Amazon, and Facebook. In normal operations, AS396531 should have stored these BGP routes and used them only for its own routing. Probably due to a misconfigured filtering mechanism, AS396531 instead advertised those routes to its other provider, AS701, which is operated by Verizon. Probably due to a poorly configured protection mechanism, AS701 accepted these routes and advertised them to its own neighbors, effectively propagating the leak on a global scale.
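The export rule that AS396531's filters should have enforced can be sketched in a few lines. This is a simplification of the "valley-free" routing model (routes learned from a provider or peer should only be exported to customers), not any router's actual configuration language:

```python
# Simplified "valley-free" export check. Relationship labels and the rule
# below are an illustration of the policy AS396531 failed to enforce, not
# real router configuration.

def may_export(learned_from: str, export_to: str) -> bool:
    """Routes learned from a customer may be exported to anyone; routes
    learned from a provider or peer may only be exported to customers."""
    if learned_from == "customer":
        return True
    return export_to == "customer"

# AS396531 learned routes from DQE (a provider) and announced them to
# Verizon (another provider) -- a type 1 route leak per RFC 7908.
print(may_export(learned_from="provider", export_to="provider"))  # False: leak
print(may_export(learned_from="provider", export_to="customer"))  # True: normal
```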
In summary, the leak was caused by a mixture of loose protection mechanisms, bad habits, and a lack of counter-measures (RPKI, IRR-based filtering, or a max-prefix limit on the BGP session).
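To illustrate the max-prefix counter-measure: if Verizon had capped the number of prefixes accepted from AS396531 at something proportionate to its normal announcements, the session would have been torn down instead of the leak being accepted globally. A minimal sketch, with purely illustrative numbers:

```python
# Sketch of a max-prefix safeguard: if a neighbor suddenly announces far
# more prefixes than expected, drop the session rather than accept them.
# The limit of 50 is illustrative; real deployments tune it per peer.

def session_ok(announced_prefixes: int, max_prefix_limit: int) -> bool:
    """Return True if the BGP session stays up under the configured limit."""
    return announced_prefixes <= max_prefix_limit

# A small customer like AS396531 might normally announce a handful of
# prefixes; any reasonable limit would have rejected the thousands it leaked.
print(session_ok(12, 50))     # normal day: session stays up
print(session_ok(65180, 50))  # June 24: session would have been shut down
```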
Did the BGP leak affect only Cloudflare, Amazon, and Facebook?
Definitely not. It is possible to understand the impact of the BGP leak thanks to public route collectors such as the University of Oregon Route Views project, the RIPE NCC Routing Information Service (RIS), and the IIT-CNR Isolario project, and the results are quite scary. As shown in Figures 1 and 2, 11,937 subnets previously announced by 1,425 different ASes were redirected via AS396531 between 10:34 am and 12:40 pm. According to the route-views2 collector, which gathers BGP data directly from AS701, the situation was probably even worse for direct customers of AS701: the collector recorded 65,180 different networks, belonging to 4,553 ASes, crossing AS396531.
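The kind of analysis these collectors enable can be sketched as a scan over AS_PATH attributes for announcements that traverse the leaking AS. The sample paths below are made up for illustration; real data would come from the MRT dumps published by Route Views, RIS, or Isolario:

```python
# Toy version of the collector analysis: find announcements whose AS_PATH
# crosses the leaking AS (AS396531) and count affected prefixes and origin
# ASes. The sample table is fabricated for illustration only.

LEAKER = "396531"

sample_paths = {
    "104.16.0.0/13": "701 396531 33154 13335",  # leaked path toward Cloudflare
    "8.8.8.0/24":    "3356 15169",              # unaffected route
}

leaked = {prefix: path for prefix, path in sample_paths.items()
          if LEAKER in path.split()}
origins = {path.split()[-1] for path in leaked.values()}  # origin = last ASN

print(f"{len(leaked)} leaked prefix(es) from {len(origins)} origin AS(es)")
```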
Besides Cloudflare, Amazon, and Facebook, networks owned by Comcast, T-Mobile, Bloomberg, and Fastly were also affected, as well as networks belonging to ASes attributable to at least eleven American banks and credit associations (AS33730 Bank of America, AS393346 City National Bank, AS32996 AgriBank, AS14310 Safra National Bank of New York, AS54418 American National Bank of Texas, AS33504 First National Bank of PA, AS16790 TransNational Payments, AS393714 Wheelhouse Credit Union, AS394145 SCE Federal Credit Union, AS14590 Desert Financial Credit Union, and AS16910 Farm Credit Financial Partners).
As shown in Figure 3, the leak affected ASes belonging to all five Regional Internet Registries, with the highest incidence on the North American registry (ARIN).
Not every Tier 1 AS propagated the leak it received from AS701, but the fact that some did shows that not every Tier 1 AS is strictly compliant with RFC 8212 (Default External BGP Route Propagation Behavior without Policies), which states that “[…] routes are neither imported nor exported unless specifically enabled by configuration.”
Out of the 21 ASes marked as Tier 1 on Wikipedia that do not belong to Verizon, only 10 never appear in any AS path containing the leak, among them AT&T (AS7018), Telia Carrier (AS1299), some ASes of CenturyLink (AS209, AS4323), NTT Communications (AS2914), GTT Communications (AS3257, AS4436, AS5580), and Orange (AS5511). Some of these ISPs are known to take route-leak prevention very seriously (e.g., NTT Communications and AT&T), while others may appear in the list simply because of a lack of data in BGP collectors.
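RFC 8212's default-deny behavior can be illustrated with a small sketch: when no explicit export policy is configured for a neighbor, nothing is exported at all. This is purely illustrative, not a real BGP implementation:

```python
# Illustration of RFC 8212 semantics: absent an explicit export policy,
# a compliant BGP speaker exports no routes to an EBGP neighbor.

def routes_to_export(routes, export_policy=None):
    """Apply an export policy to candidate routes; no policy means deny all."""
    if export_policy is None:  # RFC 8212: no configured policy => export nothing
        return []
    return [r for r in routes if export_policy(r)]

# With no policy configured, the leaked more-specifics go nowhere.
print(routes_to_export(["104.16.0.0/13", "31.13.64.0/18"]))  # []
# Only an explicitly configured policy enables export.
print(routes_to_export(["104.16.0.0/13"], export_policy=lambda r: True))
```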
End user monitoring is a must!
As of today, the Internet is so complex and dynamic that prevention and good MANRS (Mutually Agreed Norms for Routing Security) alone are not enough to guarantee that routing anomalies will not happen. Over the years, we have seen multiple incidents caused by routing issues. A fundamental and complementary action to prevention is monitoring the inter-domain routing system.
A BGP monitoring infrastructure able to receive and analyze BGP data from multiple collectors in real time could raise alarms as soon as routing anomalies are detected, allowing the involved networks to take counter-measures quickly, saving time and money.
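A minimal sketch of such an alarm, assuming a baseline of expected upstream ASes for your prefixes (the baseline ASNs here are hypothetical):

```python
# Toy real-time BGP alarm: compare each observed AS_PATH for your prefixes
# against a baseline of expected transit ASes and flag anything unknown.
# The baseline set is hypothetical, for illustration only.

EXPECTED_TRANSITS = {"174", "3356", "1299"}

def check_path(as_path: str) -> set:
    """Return the set of unexpected ASes in the path (origin excluded)."""
    transits = set(as_path.split()[:-1])
    return transits - EXPECTED_TRANSITS

# A path crossing the leaker would trip the alarm immediately.
alarm = check_path("701 396531 33154 13335")
if alarm:
    print("ALERT: unexpected transit ASes:", sorted(alarm))
```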
Outages like these bring the spotlight back to monitoring strategies and how the right monitoring strategy can mitigate, and even prevent, such incidents. Visibility at different layers of the application stack is necessary. It is the only way to take control during a major incident; this monitoring strategy is what answers crucial questions such as:
- What was the impact on end-user experience?
- How did the issue impact business?
- What was the root cause of the issue?
Catchpoint does exactly this. With Catchpoint, you monitor from where it matters, effectively eliminating blind spots in your monitoring strategy and regaining the visibility needed to quickly detect, identify, and act.
The internet is very fragile; we see it with DNS and BGP every day. We hope to see a thorough Root Cause Analysis from all the parties involved (Verizon, DQE, and others). It’s also time to rethink how some of these core protocols must evolve to add layers of control and security, so we can have a more stable and reliable Internet.
Alessandro Improta and Luca Sani contributed to the data collection and analysis contained in this article.