There are multiple protocols and components that keep the complex Internet engine running. And just like any other well-oiled machine, it is important to regularly check whether it is functioning efficiently and delivering optimum performance.

The Internet is essentially a mesh of networks relaying data packets across different paths. One of the most important processes that make up the Internet is IP routing. Several protocols manage the flow of data; Border Gateway Protocol (BGP) governs how traffic is routed between the autonomous systems that make up the network.

The need to strengthen security protocols and to implement proactive measures that quickly identify performance degradation across a network cannot be stressed enough. This was highlighted by the BGP routing issue Google faced yesterday. Although the issue was quickly resolved, it still had a significant impact on user experience across multiple platforms.

Issue Analysis

At 16:30 EST on November 12th, Google noticed connectivity issues across multiple services including APIs, load balancers, and even their cloud services.

Catchpoint triggered performance alerts as soon as the issue surfaced. The charts below show some of the different Google services that were impacted.

Response time spikes for Google properties.

Impact of BGP routing issues on Google services.

Looking at the performance data from multiple customers, we realized this was a routing problem. For example, in the instance illustrated below, traffic was routed from Germany to Russia.

Traffic routed from Germany to Russia

The RIPEstat data shows the routing path. AS37282 was advertised as the route to Google prefixes. This route information was then accepted by AS4809 (China Telecom) and then picked up by AS20485 (Transtelecom Russia).

RIPEstat visual showing BGP routing change
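The kind of path anomaly described above can be checked programmatically. The following is only a minimal sketch, not how Catchpoint or RIPEstat detect such events. It assumes the observed AS path is available as a list with the origin last, assumes AS15169 as the expected origin for the monitored prefixes, and uses a hypothetical allow-list of upstream networks drawn from the ASN range reserved for documentation.

```python
# Assumption for illustration: the AS we expect to originate the monitored prefixes.
EXPECTED_ORIGIN = 15169
# Hypothetical allow-list of transit networks (documentation-reserved ASNs).
EXPECTED_UPSTREAMS = {64500, 64501}


def check_path(as_path, origin=EXPECTED_ORIGIN, allowed=EXPECTED_UPSTREAMS):
    """Classify an observed AS path (nearest hop first, origin last).

    Returns a (verdict, unexpected_asns) tuple.
    """
    if not as_path or as_path[-1] != origin:
        # The prefix is announced by someone other than the expected origin.
        return ("origin-mismatch", [])
    # Every hop before the origin should be a network we expect to carry the traffic.
    unexpected = [asn for asn in as_path[:-1] if asn not in allowed]
    if unexpected:
        return ("suspicious", unexpected)
    return ("ok", [])


# Path order assumed from the description above:
# AS20485 learned the route from AS4809, which accepted it from AS37282.
print(check_path([20485, 4809, 37282, 15169]))
print(check_path([64500, 15169]))
```

A check like this only flags paths for a human to review; whether an unexpected hop is a leak, a hijack, or a legitimate new peering still has to be confirmed from routing data.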

Initial reports of the incident, coupled with the suspicious routing paths, pointed to a potential BGP hijack. But a Google representative clarified to Ars Technica that this was an accident and not malicious.

The Nigerian ISP MainOne Cable Company, identified as the origin of the issue, also tweeted that this was an error that occurred during planned maintenance.

Tweet from MainOne

 

Within 30 minutes the issue was resolved. Google issued this statement on their Cloud Status Dashboard.

“Throughout the duration of this issue Google services were operating as expected and we believe the root cause of the issue was external to Google. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.”

Not just another third-party

We are constantly discussing the performance tax that comes with integrating third-party tags. Such incidents are a testament to the fact that third-party monitoring should never be overlooked.

The routing issue made Google services unreachable for many users, which had an immediate impact on performance: multiple websites had unusually high page load times. This was mainly due to the Google Ajax libraries that are referenced by many websites. The outage brings the focus right back to third-party tag management and how performance issues introduced by these tags lead to downtime.

Customers using the Ajax libraries provided by Google (ajax.googleapis.com) had a noticeable drop in performance throughout the duration of the routing issue. Websites that relied on the Google Ajax library did not load properly, leaving the page blank. For example, this website was blank for over 31 seconds.

Filmstrip showing a blank screen for 30 seconds.

The waterfall graph shows the unusually high wait time for the Google APIs resource, which pushed the page load time to 54 seconds.

Waterfall chart highlighting 54 second page load time due to Google API.
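The same reading of a waterfall can be automated. As a rough sketch, assuming waterfall data is available as (URL, wait time in milliseconds) pairs, the hypothetical helper below totals wait time per third-party host and reports the hosts that exceed a threshold. The sample URLs and timings are made up for illustration and are not the measurements from the chart.

```python
from collections import defaultdict
from urllib.parse import urlparse


def slowest_third_parties(entries, first_party_host, threshold_ms):
    """Return (host, total_wait_ms) for third-party hosts at or above the threshold.

    entries: iterable of (url, wait_ms) pairs taken from a waterfall.
    Results are sorted with the slowest host first.
    """
    totals = defaultdict(float)
    for url, wait_ms in entries:
        host = urlparse(url).hostname
        if host is None or host == first_party_host:
            continue  # skip malformed URLs and first-party requests
        totals[host] += wait_ms
    flagged = [(h, ms) for h, ms in totals.items() if ms >= threshold_ms]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Illustrative data only.
sample = [
    ("https://www.example.com/", 120),
    ("https://ajax.googleapis.com/lib.js", 31000),
    ("https://cdn.example.net/site.css", 80),
]
print(slowest_third_parties(sample, "www.example.com", threshold_ms=1000))
```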

 

Multiple features make up an online application, so dependencies on such third-party services are inevitable. Proactive and constant monitoring of these services is key to mitigating the impact on performance. It is even more important to be prepared to handle such incidents; we shared tips on how you can do this in our blog “5 Lessons for Managing a Third Party Outage”.
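As one simple illustration of proactive monitoring, a third-party endpoint can be probed on a schedule with a strict timeout so that a slow dependency is noticed before users report it. This is a minimal sketch using only the Python standard library, not a description of any monitoring product; the `probe` function, its `opener` parameter, and the 5-second default are assumptions made for the example.

```python
import time
import urllib.request


def probe(url, timeout_s=5.0, opener=urllib.request.urlopen):
    """Fetch a URL once and report whether it answered within the timeout.

    opener is injectable so the probe can be exercised without network access.
    """
    start = time.monotonic()
    status = None
    try:
        with opener(url, timeout=timeout_s) as resp:
            resp.read()
            status = resp.status
        ok = 200 <= status < 400
    except OSError:
        # URLError and socket timeouts are OSError subclasses.
        ok = False
    return {
        "url": url,
        "ok": ok,
        "status": status,
        "elapsed_s": time.monotonic() - start,
    }


if __name__ == "__main__":
    # The host is the one named in this post; the result depends on your network.
    print(probe("https://ajax.googleapis.com/"))
```

Running such a probe from several locations matters here, because a routing problem like this one can affect some networks while others reach the service normally.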

Performance monitoring is no longer just about the uptime or downtime of an application. Advanced monitoring provides you with the data and tools necessary to identify issues quickly as well as predict and prevent potential performance issues.
