Published on Tue, Apr 14, 2015 by Aaron
On April 11, 2015, during US daytime, Zerigo suffered a global DNS outage due to a DDoS attack. The Zerigo status page informs us that the attack hit their origin nameservers, and gives the impression the problems started before 15:24 UTC and ended at 18:30 UTC. But Zerigo customers kept complaining on Twitter even hours later. What happened, really? Our data tells the story.
Here at TurboBytes we closely monitor the real-world response time and availability of many authoritative DNS providers, including Zerigo, with our RUM for DNS. Every day we run millions of measurements from across the globe, and our data for April 11 clearly shows: Zerigo DNS performance was very poor for about 8 hours globally.
The DDoS attack hit the Zerigo origin nameservers hard worldwide, resulting in very slow responses and poor availability:
Browsers beacon a Fail when the authoritative nameserver was too slow, was down, or sent a bad response.
Fail Ratio = % of measurements that failed. More info.
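To make the definition concrete, here is a minimal sketch of how a Fail Ratio could be computed from a batch of beacons. The function name and the boolean input format are assumptions for illustration, not the actual TurboBytes pipeline.

```python
def fail_ratio(outcomes):
    """Return the percentage of measurements that failed.

    `outcomes` is a hypothetical list of booleans, one per beacon:
    True for a successful measurement, False for a Fail.
    """
    if not outcomes:
        raise ValueError("no measurements")
    failed = sum(1 for ok in outcomes if not ok)
    return 100.0 * failed / len(outcomes)


# Two of four beacons reported a Fail.
print(fail_ratio([True, False, False, True]))
```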
Zerigo first mentions the DDoS attack at 15:24 UTC, but it started well before that, at ~ 14:40 UTC. Fail Ratio jumped and response times went sky high. Between 14:45 and 16:40 the median response time was well over 4 seconds (!) and Fail Ratio was between 40% and 60%.
In reality the Fail Ratio was even higher, because the number of beacons we received from browsers running the performance tests for Zerigo dropped by ~ 12%. This is easily explained: our tests don’t have a set time limit, so if a test takes a long time, some users will navigate to the next page before the test completes or fails.
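A rough way to see how much the missing beacons could matter is to assume, pessimistically, that every missing beacon belonged to a test that would have ended in a Fail. That assumption is ours for the sake of the sketch and gives an upper estimate rather than a measured value; the function name is hypothetical.

```python
def adjusted_fail_ratio(observed_fail_pct, beacon_drop_pct):
    """Upper estimate of the Fail Ratio when some beacons never arrived.

    Assumes every missing beacon would have been a Fail.
    Both arguments and the return value are percentages.
    """
    received = 1.0 - beacon_drop_pct / 100.0           # share of expected beacons received
    observed_failed = received * observed_fail_pct / 100.0
    missing = beacon_drop_pct / 100.0                   # counted as failures
    return 100.0 * (observed_failed + missing)


# An observed Fail Ratio of 50% with 12% of beacons missing
# comes out at roughly 56% under this assumption.
print(round(adjusted_fail_ratio(50, 12), 1))
```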
Two hours after Zerigo DNS performance went bad, performance is suddenly back to normal. The Zerigo status page: “Valued customers, please be advised that the DDOS attack has been mitigated and all Zerigo services are now restored”. However, within 10 minutes performance goes bad again, and it gets even worse than before.
About 30 minutes after performance was back to normal levels, Zerigo DNS is completely down. Fail Ratio hits the 100% mark and response times are very, very high.
Luckily for Zerigo customers, performance starts to improve soon and at ~ 18:00 UTC response times are much lower and the Fail Ratio is down to ~ 25%.
Global median response time is ~ 260 ms and this is normal for Zerigo. But Fail Ratio is still a whopping 25% and it stays like that for over 2 hours.
Fail Ratio is back to normal … but only for about 10 minutes. At 21:30, it’s above 20%.
Finally, all problems are gone and Zerigo DNS performance is back to normal levels.
So far we’ve taken a global view on Zerigo’s DNS performance. The charts for individual countries look more or less the same as the ones we showed above. When browsing through our data some more, we did spot some interesting things though.
Perhaps one would expect performance to be the same on all networks, because the DDoS attack hit Zerigo origin nameservers directly. If those nameservers are down due to an overload in traffic, this impacts performance on all networks the same, right? Well, this turns out not to be the case.
In the global chart we saw that after ~ 18:00 UTC the Fail Ratio was at ~ 25%.
On AS7922 however, Zerigo DNS remains almost completely unavailable until 22:30.
Fail Ratio does not get below 60%, except for that short 10 minute dip at ~ 21:05.
On some networks in other countries we see the same.
From our data it is clear that resolvers on AS7922 had a harder time getting a response from Zerigo’s nameservers than, for example, Google Public DNS resolvers (see below). Why? We don’t know.
AS15169 is Google’s network. The Google Public DNS resolvers live on this network. After 18:00 UTC, the Fail Ratio stays relatively low and well below that 20% - 25% we see on AS7922. This is what our data shows for most networks.
Zerigo could have done a better job at informing their customers during the DNS performance problems. Many customer tweets went unanswered and the info on the status page was not awesome. We give a shoutout here to DNSimple. They suffered from a global DDoS attack in December, communicated about it in a proactive, professional way during the outage, and wrote a lengthy post-mortem.
DDoS attacks happen and will continue to happen. It’s likely Zerigo will be hit again, and it’s likely other DNS providers will be under attack too. And then there are all sorts of other causes of poor DNS performance (BGP route leaks, broken peering links, …). What can you as a customer of managed DNS do? One thing: use two DNS providers.
Here at TurboBytes we use two DNS providers (NSONE and AWS Route53) for extra reliability and speed. Read our blog post about Why You Should Use Two DNS Providers.
Since February 2015, Zerigo has used CloudFlare Virtual DNS, which means CloudFlare proxies the DNS traffic and caches the results.
If CloudFlare receives a query for an expired or uncached record, it will query the Zerigo origin.
Many DNS records have a TTL lower than 5 minutes (source), so if the origin nameservers are down, that will cause real-world problems for Zerigo customers.
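The interplay between caching and TTL can be illustrated with a small sketch. This is a toy model of a caching DNS proxy, not CloudFlare's implementation; `CachingProxy`, `OriginDown` and the `query_origin` callback are hypothetical names introduced for illustration.

```python
import time


class OriginDown(Exception):
    """Raised by the origin callback when the origin does not answer."""


class CachingProxy:
    """Toy caching proxy: serve from cache until the TTL expires."""

    def __init__(self, query_origin, clock=time.monotonic):
        # query_origin(name) returns (address, ttl_seconds) or raises OriginDown.
        self._query_origin = query_origin
        self._clock = clock
        self._cache = {}  # name -> (address, expires_at)

    def resolve(self, name):
        now = self._clock()
        entry = self._cache.get(name)
        if entry is not None and now < entry[1]:
            return entry[0]  # cache hit, origin is not contacted
        # Expired or never cached: the proxy has to ask the origin.
        address, ttl = self._query_origin(name)
        self._cache[name] = (address, now + ttl)
        return address
```

In this model a record with a 300 second TTL keeps resolving for at most 300 seconds after the origin goes down. After that, every lookup for it depends on the origin again and fails while the origin is unreachable.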
TurboBytes measures authoritative DNS performance using a wildcard A record, so with Zerigo, we always hit their origin nameservers.
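One way a wildcard record can guarantee that a query reaches the origin is to give every test a hostname that no cache has seen before. Whether our tests generate names exactly like this is not spelled out here, so treat the following as a sketch; the zone `dnstest.example.com` and the function name are placeholders.

```python
import uuid


def unique_test_hostname(zone="dnstest.example.com"):
    """Build a never-before-seen hostname under a wildcard zone.

    Assumes a wildcard A record (*.<zone>) exists, so any label resolves.
    """
    label = uuid.uuid4().hex  # 32 hex characters, within the 63-character label limit
    return f"{label}.{zone}"


print(unique_test_hostname())
```

Because the label is random, a caching layer in front of the origin cannot have the answer stored and has to forward the query.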
April 14, 13:50 UTC: our data shows Zerigo DNS performance has been steadily getting worse in the past 48 hours, with Fail Ratio peaking at 10% at 23:00 UTC on April 13. We’ll keep a close eye out and update this blog post when it makes sense.
We always welcome your thoughts, ideas and feedback. Please share below in the comments section and don't forget to check out our Authoritative DNS Performance Reports.