I started this Monday morning like any other Monday, catching up with the office in NY via Skype. All of a sudden I lost my internet connectivity. It was not the first time this had happened, so I figured it might be the router or the ISP and simply rebooted the router, but no luck.
Fifteen minutes later the internet came back on its own and I got back on Skype to continue my conversation. To my surprise, the person on the other end of the line, all the way in New York City, had just experienced the same connectivity problem. We quickly found out we were not alone: people on Twitter were complaining of issues all around the US and the globe, and Gizmodo put up a quick article blaming it on Time Warner.
We took a look at the performance monitoring data collected by Catchpoint monitoring stations around the world and discovered a jump in failures across all cities, both on backbone carriers and on last-mile locations on Time Warner, AT&T and Verizon FiOS. The problem started at around 9:15 am EDT and lasted until 9:30 am EDT.
Our probes captured failures for major web sites, ad serving companies, content distribution networks and public DNS resolvers, as if someone had used some kind of kill switch. Several of the traceroutes captured during this period showed routes that went nowhere: the routers did not know which path to take to reach the destination.
- From Singapore AWS to a major CDN (why it was not served out of Singapore is a topic for another post):
Tracing route to i.cdn.turner.com [220.127.116.11] over a maximum of 30 hops:
1 * * * Timed Out
2 <1 ms <1 ms <1 ms ec2-175-41-128-192.ap-southeast-1.compute.amazonaws.com[18.104.22.168]
3 <1 ms <1 ms <1 ms ec2-175-41-128-233.ap-southeast-1.compute.amazonaws.com[22.214.171.124]
4 * * * Timed Out
5 * * * Timed Out
6 1 ms 1 ms 1 ms 126.96.36.199
7 2 ms 2 ms 2 ms ae-2.r20.sngpsi02.sg.bb.gin.ntt.net[188.8.131.52]
8 185 ms 187 ms 187 ms as-3.r20.snjsca04.us.bb.gin.ntt.net[184.108.40.206]
9 186 ms 173 ms 175 ms ae-1.r07.snjsca04.us.bb.gin.ntt.net[220.127.116.11]
10 ms ms ms Unknown
11 ms ms ms Unknown
12 ms ms ms Unknown
13 ms ms ms Unknown
- From Hong Kong to Amazon S3:
Tracing route to xyz.s3.amazonaws.com [18.104.22.168] over a maximum of 30 hops:
1 2 ms <1 ms <1 ms 22.214.171.124
2 170 ms 132 ms 63 ms ge4-6.br02.hkg04.pccwbtn.net[126.96.36.199]
3 147 ms 147 ms 147 ms sjp-brdr-03.inet.qwest.net[188.8.131.52]
4 ms ms ms Unknown
5 ms ms ms Unknown
6 ms ms ms Unknown
7 ms ms ms Unknown
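To make "routes that went nowhere" a bit more concrete, here is a minimal Python sketch that finds the last hop that answered in a trace like the ones above and counts the silent hops after it. The line format is assumed from the traces in this post, and the function names (`parse_hops`, `trailing_dead_end`) are made up for illustration; this is not how the Catchpoint probes actually process traceroutes.

```python
import re

# A hop "responded" if at least one probe shows a latency like "2 ms" or "<1 ms".
LATENCY = re.compile(r"<?\d+ ms")
# A hop line starts with the hop number; header lines do not.
HOP = re.compile(r"^\s*(\d+)\s+(.*)$")

# Hong Kong to Amazon S3 trace, copied from the post.
SAMPLE = """Tracing route to xyz.s3.amazonaws.com [18.104.22.168] over a maximum of 30 hops:
1 2 ms <1 ms <1 ms 22.214.171.124
2 170 ms 132 ms 63 ms ge4-6.br02.hkg04.pccwbtn.net[126.96.36.199]
3 147 ms 147 ms 147 ms sjp-brdr-03.inet.qwest.net[188.8.131.52]
4 ms ms ms Unknown
5 ms ms ms Unknown
6 ms ms ms Unknown
7 ms ms ms Unknown
"""


def parse_hops(text):
    """Return a list of (hop_number, host, responded) tuples."""
    hops = []
    for line in text.splitlines():
        match = HOP.match(line)
        if not match:
            continue  # skip the "Tracing route to ..." header and blank lines
        number, rest = int(match.group(1)), match.group(2)
        responded = bool(LATENCY.search(rest))
        # The host is the last token on the line; silent hops have no host.
        host = rest.split()[-1] if responded else None
        hops.append((number, host, responded))
    return hops


def trailing_dead_end(hops):
    """Return (last_responding_hop, number_of_silent_hops_after_it)."""
    last = None
    for hop in hops:
        if hop[2]:
            last = hop
    if last is None:
        return None, len(hops)
    silent_after = sum(1 for hop in hops if hop[0] > last[0])
    return last, silent_after


if __name__ == "__main__":
    last, silent = trailing_dead_end(parse_hops(SAMPLE))
    print("last responding hop:", last)
    print("silent hops after it:", silent)
```

A trace whose tail is entirely silent only shows where replies stopped coming back. On its own it does not prove that the router at the last responding hop had lost its route.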
There has been no official report on what happened. However, we were able to collect various reports from NANOG, Twitter and emails pointing to multiple concurrent issues:
– problems with Juniper routers, which could be tied to the BGP announcements (source: NANOG).
– DNS cache poisoning in Brazil that is creating havoc around the world (http://net-security.org/secworld.php?id=11903)
Some other graphs from Team Cymru:
- Internet Routing Table Delta
- BGP Announcements / Withdrawals
Let’s hope this was a one-time glitch or human mistake that can be resolved, and not something worse.
======= UPDATE =======
Juniper Networks confirmed a bug in their routers that caused the BGP issues.
Mehdi – Catchpoint