DNS is the most important component of the engine that keeps the internet running. If DNS is not managed effectively or the servers go down, applications are inaccessible and users are frustrated.
Public DNS resolvers are becoming the preferred choice among internet users. They improve the online customer experience and optimize performance by reducing latency, as we recently wrote about.
But what happens when these public DNS resolvers experience an outage or unexpected latency? End users and businesses both rely on these services for the smooth operation of the internet.
On May 30th, Google Public DNS experienced a minor outage. End users were unable to access websites, and frustrated users took to Twitter to complain about the service disruption.
Businesses were also affected by the outage. The APIs served by a cloud provider experienced errors because they were querying the Google DNS server. Once the issue was uncovered, the provider immediately deployed a mitigation to divert the DNS traffic, which improved availability.
Catchpoint regularly monitors key internet infrastructure, including public DNS providers, to help customers diagnose and troubleshoot issues. The public Google DNS resolver is one such service we regularly monitor. Traceroute tests running from multiple locations within the United States identified an outage with the Google servers starting at 12:54 ET. The servers were unresponsive and the ICMP packets were unable to reach their destination.
The failed datapoints denote packet loss while attempting to reach the destination IP address 188.8.131.52. The failure of the destination IP to respond resulted in timeouts.
In addition to traceroute tests, we were also running DNS Direct tests. DNS Direct tests help troubleshoot DNS issues by querying a domain from a specific resolver.
To test the Google DNS resolver, we have a DNS Direct test configured to resolve the domain www.google.com using 184.108.40.206 as the resolver. After seeing the failures on the traceroute test in the United States, we analyzed this test. Failures on this test started at 12:30 ET across the Brazil, Singapore, India, and United States regions.
The waterfall graphs identified "connection timed out" failures for the 220.127.116.11 server.
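The idea behind a direct test can be illustrated with a minimal Python sketch that sends one A-record query straight to a chosen resolver and times the reply. This is not how Catchpoint's DNS Direct test is implemented; the function names `build_query` and `time_query`, the fixed query ID, and the 2-second timeout are all assumptions made for illustration.

```python
import socket
import struct
import time


def build_query(hostname: str, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query for an A record (RFC 1035 wire format)."""
    # Header: ID, flags (recursion desired), 1 question, 0 other records
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: length-prefixed labels terminated by a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE=1 (A record), QCLASS=1 (IN)
    return header + qname + struct.pack("!HH", 1, 1)


def time_query(hostname: str, resolver_ip: str, timeout: float = 2.0):
    """Query one resolver over UDP; return elapsed ms, or None on failure."""
    query = build_query(hostname)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)  # hypothetical timeout, tune as needed
        start = time.perf_counter()
        try:
            sock.sendto(query, (resolver_ip, 53))
            response, _ = sock.recvfrom(512)
        except OSError:
            # Covers timeouts and unreachable-network errors
            return None
        elapsed_ms = (time.perf_counter() - start) * 1000.0
    # Treat a reply with a mismatched ID as a failure
    if response[:2] != query[:2]:
        return None
    return elapsed_ms
```

Calling `time_query("www.google.com", resolver_ip)` with the address of the resolver under test returns the round-trip time in milliseconds, or `None` when the resolver does not answer, which corresponds to the timeout failures described here.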
Higher than normal DNS response times were present from 05/30/2018 12:30 ET – 13:50 ET.
Using the Comparative time breakdown feature, we were able to compare and analyze performance over different time periods. We compared the data from two hours prior and 12 hours prior to the incident and noticed that the response time had more than doubled, increasing by approximately 24 ms across our backbone nodes.
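The arithmetic behind such a comparison is simple enough to sketch. The helper below is a hypothetical illustration, not the Catchpoint feature itself: it takes two lists of response times in milliseconds (a baseline window and an incident window) and returns both the ratio and the absolute increase of the mean.

```python
from statistics import mean


def compare_windows(baseline_ms, incident_ms):
    """Return (ratio, increase_ms) of mean response time, incident vs. baseline.

    Both arguments are non-empty sequences of response times in milliseconds.
    """
    base = mean(baseline_ms)
    incident = mean(incident_ms)
    return incident / base, incident - base
```

A ratio above 2.0 corresponds to a response time that has "more than doubled", and the second value is the increase in milliseconds.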
From Twitter, we knew that end users were impacted. Using data from our last mile and wireless tests, we were able to examine the impact on users. The last mile tests run on dedicated nodes deployed in the homes of real end users, and the wireless tests run on major wireless providers.
During the timeframe 05/30/2018 12:30 ET – 13:50 ET, the response time increased more than twofold across both last mile and wireless nodes. DNS resolution averaged 60 ms on last mile nodes and 100 ms on wireless nodes.
Eliminating DNS Bottlenecks
Even though this was a minor outage, the impact on users relying solely on the public DNS server was major. Companies had to deploy mitigations, and end users either had to wait out the outage or switch to an alternate DNS resolver.
According to Labs.Apnic.net, 10% of Russian users, 14% of Brazilian users, 11% of Indian users, and 5% of users in both China and Great Britain direct DNS queries to Google DNS.
If you are using the Google Public DNS resolver as your sole preferred DNS server, we recommend adding a fallback or secondary server so that you and your users are not impacted during an outage. If your business relies on third-party DNS resolvers, then in addition to having a failover strategy, it is equally important to proactively monitor DNS performance. Monitoring the performance of DNS resolvers will leave you better prepared to handle such a crisis. Speed is a major governing factor when it comes to end-user experience: a website slowed down by latency at the public DNS resolver can result in a negative user experience and lost business.
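The fallback recommendation can be sketched in a few lines of Python. This is an illustrative outline under stated assumptions, not a production resolver: `resolve_with_fallback` is a hypothetical helper, and `query_fn` stands in for whatever lookup routine you already use (it is assumed to return an answer string, return `None`, or raise an `OSError` such as a timeout).

```python
from typing import Callable, Iterable, Optional, Tuple


def resolve_with_fallback(
    hostname: str,
    resolvers: Iterable[str],
    query_fn: Callable[[str, str], Optional[str]],
) -> Tuple[Optional[str], Optional[str]]:
    """Try each resolver in order; return (answer, resolver_used).

    Returns (None, None) if every resolver fails.
    """
    for resolver_ip in resolvers:
        try:
            answer = query_fn(hostname, resolver_ip)
        except OSError:
            # Timeouts and unreachable-network errors are OSError subclasses
            answer = None
        if answer is not None:
            return answer, resolver_ip
    return None, None
```

Listing the primary resolver first and a secondary from a different provider second means a failure at one provider does not leave lookups without an answer. Recording which resolver actually served each lookup also gives you a simple signal to monitor.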