Published on Thu, Nov 27, 2014 by Aaron
The CDN of Amazon Web Services, CloudFront, experienced a major global outage yesterday due to DNS issues.
The CloudFront outage lasted about 90 minutes, starting at around 00:15 UTC (4:15 PM PST).
On Twitter and other social media channels (including Hacker News), people started talking about it, and the news was picked up by The Next Web, Forbes and other media. The AWS status page did not flag CloudFront as having problems until 45 minutes after they started and, surprisingly, the entry was marked as 'informational' rather than as something more severe.
The CloudFront outage had a big impact around the globe. Thousands of websites delivered a poor user experience and undoubtedly suffered from a drop in conversion, clicks, sales etc. But the impact does not stop there: banner ads, analytics trackers and widgets did not load, as quite a few of these 'third party content' providers use CloudFront CDN.
Here at TurboBytes, we continuously monitor the performance of CDNs from all across the globe with RUM (Real User Monitoring), and use the data to power our Multi-CDN service. Our customers add our non-blocking JS to their site, which executes after page load. In the background, it silently fetches a 15 KB object from a few CDNs and beacons the load time details to our servers. If the 15 KB object fails to load within 5 seconds, we beacon a Fail.
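To make the pass/fail rule concrete, here is a minimal sketch of the test logic in Python. It is not our production script (that one is JavaScript running in the browser); the names `run_test` and `fetch`, and the shape of the beacon record, are assumptions made up for illustration. Only the 5 second limit comes from the description above.

```python
import time
from typing import Callable, Dict, Optional

TIMEOUT_MS = 5000  # from the post: the object must load within 5 seconds


def run_test(cdn: str,
             fetch: Callable[[float], bytes],
             clock: Callable[[], float] = time.monotonic) -> Dict[str, object]:
    """Time one fetch of the test object and build a beacon record.

    `fetch` is a hypothetical callable that downloads the object and is
    given the timeout in seconds; it should raise OSError (which includes
    TimeoutError) when the download fails or times out.
    """
    start = clock()
    load_ms: Optional[float] = None
    try:
        fetch(TIMEOUT_MS / 1000)
        elapsed_ms = (clock() - start) * 1000
        # A download that completes but took longer than the limit is a Fail.
        if elapsed_ms <= TIMEOUT_MS:
            load_ms = elapsed_ms
    except OSError:
        pass  # network error or timeout: load_ms stays None
    return {"cdn": cdn, "ok": load_ms is not None, "load_ms": load_ms}
```

In a real run, `fetch` could be a thin wrapper around something like `urllib.request.urlopen(url, timeout=timeout_s).read()`. Passing the clock in as a parameter is only there so the logic can be exercised without a network.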
Our RUM clearly shows how big the CloudFront outage was. The fail ratio went sky high, but for CloudFront it did not reach 1 (= every test fails). Why not? Well, our RUM data does not tell us exactly what was going on, but from all the info we gathered online, it seems the authoritative DNS of cloudfront.net was not responding *most of the time*. Resolvers often retry, and apparently, sometimes, one of the authoritative DNS servers would send a good response and the browser would then be able to connect to CloudFront. When the DNS lookup did succeed, it was on average much slower than normal.
While looking at the data, we created a visualization that perhaps makes it even clearer how bad things got. In this chart, a vertical blue line was drawn for every test that passed (browser fetched the 15 KB object from CloudFront within 5000 ms) and a vertical red line was drawn for every test that failed to finish within 5000 ms. Before the problems started, there is clearly a lot more blue than red, and during the outage, red has the upper hand.
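For readers who want to run a similar analysis on their own measurements, the fail ratio discussed above is simply the number of failed tests divided by the total number of tests in a time window. The sketch below assumes each beacon is a `(timestamp_in_seconds, ok)` pair and uses one-minute windows; the function name, the input format and the window size are illustrative assumptions, not a description of our pipeline.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def fail_ratio_by_window(beacons: Iterable[Tuple[float, bool]],
                         window_s: int = 60) -> Dict[int, float]:
    """Group beacons into fixed time windows and return fails / total per window.

    A ratio of 0.0 means every test in the window passed;
    1.0 means every test in the window failed.
    """
    totals: Dict[int, int] = defaultdict(int)
    fails: Dict[int, int] = defaultdict(int)
    for timestamp, ok in beacons:
        # Key each beacon by the start of the window it falls into.
        window_start = int(timestamp // window_s) * window_s
        totals[window_start] += 1
        if not ok:
            fails[window_start] += 1
    return {w: fails[w] / totals[w] for w in sorted(totals)}


# Made-up sample data: two tests in the first minute, two in the second.
sample = [(0, True), (10, False), (70, False), (80, False)]
ratios = fail_ratio_by_window(sample)
```

With the made-up sample, the first window has one fail out of two tests and the second has two out of two. Plotting the ratios over time gives a fail ratio curve, while drawing one colored line per individual beacon gives the blue/red picture described above.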
Was your business impacted by this CloudFront outage? How will you prepare for a similar failure of your CDN in the future? We welcome your thoughts, ideas and feedback. Please share them below in the comments section.