Published on Tue, Aug 25, 2015 by Aaron
CloudFront had a bad day in the UK yesterday, causing many websites to be unavailable to the millions of people on the BT network. The outage lasted 6.5 hours, right in the middle of the day. Ouch.
We'll show that the problem was in DNS, and how you - a CloudFront customer - can quickly and easily run diagnostics on your CDN from many consumer ISP networks across the globe.
It's not the first time CloudFront has had DNS-related availability problems. On Nov 27, 2014, CloudFront suffered a major global outage lasting ~90 minutes.
Yesterday's CloudFront outage in Great Britain was different: it affected a single ISP network only, but it lasted for a very long time. What happened? In short: BT's DNS resolvers sent 'empty' responses for <something>.cloudfront.net queries, leaving browsers and apps with no IP address to connect to a CloudFront CDN server.
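What does an 'empty' response look like on the wire? A DNS response can say NOERROR (response code 0) yet carry zero records in its ANSWER section, and a client treats that as "this name has no address". Here's a minimal sketch, using only Python's standard library, that parses the 12-byte DNS header and flags that condition; the function names are ours, and the sample packet is a hand-crafted header rather than a real capture:

```python
import struct

def parse_dns_header(packet: bytes):
    """Parse the 12-byte DNS header; return (rcode, answer_count)."""
    _id, flags, qdcount, ancount, nscount, arcount = struct.unpack("!6H", packet[:12])
    rcode = flags & 0x000F  # low 4 bits of the flags field = response code
    return rcode, ancount

def looks_like_empty_noerror(packet: bytes) -> bool:
    """True when the resolver said NOERROR (rcode 0) but returned no answers."""
    rcode, ancount = parse_dns_header(packet)
    return rcode == 0 and ancount == 0

# Synthetic response header: ID 0x1234, flags 0x8180 (QR=1, RD=1, RA=1,
# rcode NOERROR), one question, zero answer records -- the shape of the
# responses BT's resolvers were sending.
empty_response = struct.pack("!6H", 0x1234, 0x8180, 1, 0, 0, 0)
print(looks_like_empty_noerror(empty_response))  # True
```

A healthy response to a cloudfront.net query would have an answer count of at least 1, so the same check would return False.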
Here at TurboBytes, we constantly monitor CDN performance with RUM (Real User Monitoring) from within the browsers of people at home and at work, all over the world. We use this data to power our Multi-CDN service. Our non-blocking JS executes after page load and then silently fetches a 15 KB object from a few CDNs in the background, beaconing the load-time details to our servers. If the 15 KB object fails to load within 5 seconds, we beacon a Fail.
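The real probe runs as JavaScript in the browser, but the pass/fail rule is simple enough to sketch in Python. Everything below is illustrative - the function names and beacon shape are hypothetical, only the 15 KB / 5000 ms rule comes from the text:

```python
import time
import urllib.request
import urllib.error

TIMEOUT_MS = 5000  # a fetch slower than this (or any error) beacons a Fail

def classify(elapsed_ms, fetched_ok):
    """Turn one fetch attempt into the beacon result we'd report."""
    if fetched_ok and elapsed_ms <= TIMEOUT_MS:
        return {"result": "ok", "ms": elapsed_ms}
    return {"result": "fail", "ms": elapsed_ms}

def probe(url, timeout_ms=TIMEOUT_MS):
    """Fetch the test object and time it; timeouts and DNS errors count as fails."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_ms / 1000) as resp:
            resp.read()  # pull the full ~15 KB body
            fetched_ok = True
    except (urllib.error.URLError, OSError):
        # An empty DNS response surfaces here as a name-resolution error.
        fetched_ok = False
    elapsed_ms = int((time.monotonic() - start) * 1000)
    return classify(elapsed_ms, fetched_ok)

print(classify(480, True))    # {'result': 'ok', 'ms': 480}
print(classify(5200, False))  # {'result': 'fail', 'ms': 5200}
```

Note that a DNS failure never even opens a connection, which is why during the outage the fails arrived quickly rather than after the full 5-second timeout.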
In the chart below (UTC time zone), a vertical blue line was drawn for every test that passed (browser fetched 15 KB object from CDN within 5000 ms) and a vertical red line was drawn for every test that failed. Before and after the CloudFront outage it’s clear there is a lot more blue than red. During the outage it’s almost all red.
The fail ratio jumps at around 10:10 UTC, and you can clearly see it’s a hard ‘break’. In the following 6.5 hours some ‘ok’ beacons do come in (likely because some users on BT have configured their machines to use Google Public DNS or OpenDNS), but the vast majority of beacons show failure. Around 16:40 UTC the problem was fixed.
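Computing a fail ratio like the one in the chart is just bucketing beacons by time and dividing fails by totals. A small sketch, with hypothetical function names and made-up beacons around the 10:10 UTC break (times given as minute-of-day):

```python
from collections import defaultdict

def fail_ratio_by_bucket(beacons, bucket_minutes=10):
    """Group (minute_of_day, result) beacons into time buckets, return fail ratios."""
    totals = defaultdict(lambda: [0, 0])  # bucket start -> [fails, total]
    for minute, result in beacons:
        bucket = minute // bucket_minutes * bucket_minutes
        totals[bucket][1] += 1
        if result == "fail":
            totals[bucket][0] += 1
    return {b: fails / total for b, (fails, total) in sorted(totals.items())}

# Illustrative beacons: 10:00-10:09 is healthy, 10:10-10:19 mostly fails.
beacons = [(605, "ok"), (607, "ok"), (612, "fail"), (614, "fail"),
           (615, "fail"), (618, "ok")]
print(fail_ratio_by_bucket(beacons))  # {600: 0.0, 610: 0.75}
```

A ratio jumping from ~0 to near 1 between adjacent buckets is exactly the hard 'break' visible in the chart.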
AWS first reported on the problem about two and a half hours after it started. That’s not great.
Also, and surprisingly, it was classified as ‘Informational message’ and not as ‘Performance issues’ or ‘Service disruption’.
We first heard about the CloudFront outage from a tweet by @OpenRent, who responded to a site visitor and linked to Pulse test results. We then ran an HTTPS test on Pulse against their CloudFront endpoint and all seemed fine. The @OpenRent person gave a good hint as to why that HTTPS test did not show the CloudFront problem: “… or the Agent is using a different public DNS”. Ah, yes, that must be it! We then quickly ran a DNS test (scroll down to agent 128-Lee-Armstrong, located in Portsmouth) and this gave insight into what was going on: BT’s resolvers sent a NOERROR response without an ANSWER section, meaning the client (browser/app) gets no IP address to connect to.
Searching Twitter for “Cloudfront BT” led us to some tweets by @Lovell, a nice guy from London who runs an image-resizing web service. Lovell’s tweets one and two gave more insight into what was going on with CloudFront on BT:
We could have (and should have) run Pulse tests to verify Lovell’s findings about that single NS being unavailable, but we didn’t, so let’s safely assume Lovell is right.
TurboBytes Pulse has grown from 10 to 80+ agents (test locations) in a few months’ time, and we’re always looking for more. Agent hosts get access to the Pulse API, so if you’d like that and can install the Pulse software on a Linux machine or Raspberry Pi connected to a consumer ISP network, please reach out to us.