Published on Tue, Dec 16, 2014 by Aaron
No CDN is excellent all the time everywhere. They all have their bad days.
Last week we showed how several CDNs struggle to deliver great performance in France, and on Nov 27 CloudFront had a major global outage lasting about 90 minutes, reportedly caused by DNS issues. Today we write about another CDN, Highwinds, whose performance recently went bad as a result of DNS issues, and we show how our Multi-CDN platform provided great value by reducing downtime by about 60 minutes.
Here at TurboBytes, we monitor the performance of CDNs with RUM (Real User Monitoring) all the time from across the globe, and we use the data to power our Multi-CDN service. Our non-blocking JS executes after page load, silently fetches a 15 KB object from a few CDNs in the background, and beacons the load time details to our servers. If the 15 KB object fails to load within 5 seconds, we beacon a Fail.
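To make the Fail rule and the Failratio concrete, here is a minimal Python sketch of how beacons could be classified and aggregated per minute. It is an illustration, not our production code: the function names, the `(timestamp, load_time_ms)` beacon shape, the use of `None` for a fetch that never completed, and the sample data are all assumptions made for this example. Only the 5000 ms timeout comes from the description above.

```python
from collections import defaultdict
from datetime import datetime

TIMEOUT_MS = 5000  # from the post: the 15 KB object must load within 5 seconds


def is_fail(load_time_ms):
    """Return True when a beacon counts as a Fail.

    None is a hypothetical encoding for a fetch that never completed.
    """
    return load_time_ms is None or load_time_ms > TIMEOUT_MS


def fail_ratio_per_minute(beacons):
    """Group (timestamp, load_time_ms) beacons by minute; return {minute: Failratio}."""
    totals = defaultdict(int)
    fails = defaultdict(int)
    for ts, load_time_ms in beacons:
        minute = ts.replace(second=0, microsecond=0)
        totals[minute] += 1
        if is_fail(load_time_ms):
            fails[minute] += 1
    return {minute: fails[minute] / totals[minute] for minute in totals}


# Made-up sample beacons, for illustration only.
sample = [
    (datetime(2014, 12, 14, 14, 20, 5), 310),
    (datetime(2014, 12, 14, 14, 20, 40), 420),
    (datetime(2014, 12, 14, 14, 30, 10), None),
    (datetime(2014, 12, 14, 14, 30, 30), 5200),
    (datetime(2014, 12, 14, 14, 30, 50), 380),
    (datetime(2014, 12, 14, 14, 30, 55), None),
]
ratios = fail_ratio_per_minute(sample)
```

In this made-up sample, the 14:20 bucket contains only passing tests and the 14:30 bucket contains three Fails out of four beacons.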
On Sunday Dec 14 2014 around 14:20 UTC, the Failratio of Highwinds jumped globally and it was not until well over an hour later that the Failratio returned to a normal level. Let’s take a look at what happened with Highwinds and see how TurboBytes’ Multi-CDN platform performed. We zoom in on France because in that country the TurboBytes Multi-CDN platform was routing traffic to Highwinds when the problem started.
In the charts below, a vertical blue line was drawn for every test that passed (browser fetched the 15 KB object from the CDN within 5000 ms) and a vertical red line was drawn for every test that failed to finish within 5000 ms. Before the problem started, there is clearly a lot more blue than red, and during the outage, red has the upper hand.
You can see the Failratio increase starting around 14:22 and after 10 minutes it’s almost all red, but some ‘ok’ beacons do keep coming in (our guess is that some resolvers serve a stale response if they can’t reach the authoritative DNS). Around 15:25, about an hour after the issue started, we started receiving a lot more ‘ok’ beacons and the Failratio declined. All in all the downtime lasted about 70 minutes.
How did TurboBytes’ Multi-CDN service perform? Did we switch away from Highwinds quickly? As it turned out, we had switched to Highwinds right before the issue started at 14:22. In the next few minutes, more and more traffic started flowing to Highwinds, but most traffic was still going to the CDN we had previously mapped to. Apparently most DNS resolvers in France were still handing out the CNAME to that previous CDN (we use a DNS TTL of 300 seconds). At 14:25, our platform decided to switch away from Highwinds and automatically updated our authoritative DNS. Due to the DNS TTL, it took about 5 minutes for traffic to stop flowing to Highwinds altogether. Conclusion: TurboBytes prevented ~ 60 minutes of downtime in France.
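The arithmetic behind that conclusion can be sketched in a few lines of Python. The timestamps, the 300 second TTL and the roughly 70 minutes of total downtime are the approximate figures reported in this post. The variable names are made up for the example, and the drain time assumes the worst case, in which a resolver cached the Highwinds CNAME just before the switch and keeps serving it for a full TTL.

```python
from datetime import datetime, timedelta

DNS_TTL = timedelta(seconds=300)        # TTL on our CNAME records, from the post
TOTAL_DOWNTIME = timedelta(minutes=70)  # approximate Highwinds downtime, from the post

issue_start = datetime(2014, 12, 14, 14, 22)      # Failratio starts to rise
switch_decision = datetime(2014, 12, 14, 14, 25)  # authoritative DNS updated

# Worst case: a resolver keeps handing out the old CNAME for one full TTL
# after the authoritative DNS was updated.
traffic_drained = switch_decision + DNS_TTL

# Time during which some TurboBytes traffic in France could still hit Highwinds.
exposed = traffic_drained - issue_start

# Downtime avoided compared to staying on Highwinds for the whole incident.
prevented = TOTAL_DOWNTIME - exposed

print("traffic drained by:", traffic_drained.strftime("%H:%M"))
print("exposed:", exposed)
print("prevented:", prevented)
```

Under these assumptions, the exposure window is about 8 minutes and the avoided downtime is about 62 minutes, in line with the ~ 60 minutes stated above. A lower TTL shrinks the exposure window directly.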
As mentioned, Highwinds suffered from DNS issues not just in France but globally.
Below you see charts for Highwinds and TurboBytes in The Netherlands and globally.
Note: the TurboBytes charts don’t provide great value here. We were not routing traffic to Highwinds in The Netherlands at the time of the issue, and a global comparison makes little sense because our platform doesn’t make routing decisions at a global level. Still, I thought it best to show them anyway.
We’ve closely analyzed our data and what happened in our platform before, during and after the issue with Highwinds, and from that analysis we have defined a few ways to optimize our service, with the goal of switching away from a bad CDN more quickly. One thing we can do is lower the DNS TTL. That is easy and was already on our to-do list. Another improvement has to do with how our platform processes the incoming beacons and makes switching decisions: process the data more quickly and make an equally good decision with less data.
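One possible way to decide with less data is a small sliding window over the most recent beacons for a CDN in a given country, combined with a minimum sample count so that a handful of unlucky beacons can't trigger a switch. The Python sketch below is a hypothetical illustration and not a description of how our platform actually works: the class name, the window size, the minimum sample count and the threshold are all assumptions chosen for the example.

```python
from collections import deque


class FailRatioMonitor:
    """Tracks recent beacons for one CDN in one country (hypothetical design)."""

    def __init__(self, window_size=50, min_samples=20, threshold=0.5):
        # Only the most recent `window_size` beacons are kept; older ones drop out.
        self.window = deque(maxlen=window_size)
        self.min_samples = min_samples
        self.threshold = threshold

    def record(self, ok):
        """Store one beacon result: True for 'ok', False for a Fail."""
        self.window.append(ok)

    def fail_ratio(self):
        """Share of Fails among the beacons currently in the window."""
        if not self.window:
            return 0.0
        return self.window.count(False) / len(self.window)

    def should_switch_away(self):
        """Switch only with enough samples and a Failratio above the threshold."""
        enough_data = len(self.window) >= self.min_samples
        return enough_data and self.fail_ratio() > self.threshold


# Small made-up example: the decision flips once enough beacons have arrived.
monitor = FailRatioMonitor(window_size=10, min_samples=5, threshold=0.5)
for ok in [True, True, False, False]:
    monitor.record(ok)
early = monitor.should_switch_away()  # only 4 samples so far

for ok in [False, False]:
    monitor.record(ok)
late = monitor.should_switch_away()   # 6 samples, 4 of them Fails
```

The trade-off is in the parameters: a smaller window and a lower minimum sample count react faster, but they also raise the risk of switching away from a CDN that is only having a brief blip.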
We can’t know from our RUM data why Highwinds was failing, and we did not run any other tests during the time of the incident. We know it was DNS because Highwinds told us. Highwinds informed customers about the incident not long after it started and proactively kept them informed until it was resolved. Highwinds deserves credit for this behavior. Unfortunately, it’s not common for CDN providers to inform customers about content delivery performance degradations.
We don’t know how many endpoints of Highwinds were unavailable due to the DNS issue. TurboBytes has two endpoints with Highwinds: one for HTTP-only traffic and one for SSL-enabled traffic. Our HTTP-only endpoint was impacted but the other was just fine.
Was your Highwinds endpoint broken on Dec 14? If so, how did you spot it and what action did you take to mitigate the problem? We always welcome your thoughts, ideas and feedback. Please share below in the comments section.