It started out like any other Monday morning. I woke up, got dressed, put my contacts in and started making my way to the kitchen for coffee. Along the way, I launched a browser and the mail client on my laptop (as I always do on “home office” days) and I checked to make sure my son was up. After making coffee, I had a few minutes before it was time to drive my 14-year-old to school, so I scanned the headlines in my newsfeed. The top two headlines read:
I only got to read the hover-over teaser paragraphs before:
a) I realized it was no longer like any other Monday morning and
b) my son informed me it was time to go.
My son and I bantered our way to school the way that only fathers and teenage sons can, but I was distracted. Major outages at both Google and Microsoft on the same day? That’s just odd. I’m familiar with the performance teams and practices at both Google and Microsoft and they are both quite good – not perfect, but good enough that it just feels wrong for both of them to have outages at approximately the same time without a common external trigger.
By the time I got back home to read the rest of the articles, I’d decided that it had to be the result of some kind of deliberate cyber-attack. Not because I’m a conspiracy theorist, but rather because I couldn’t come up with a more likely explanation for this apparent coincidence.
As I read, I quickly realized this wasn’t the result of cyber-attacks, data-center fires, or natural disasters. It was just “run of the mill” performance issues.
Google has yet to release any specifics other than the following message from their Apps Status Board shortly after the outage:
“Between 15:51 and 15:52 PDT, 50% to 70% of requests to Google received errors; service was mostly restored one minute later, and entirely restored after four minutes.”
Based on what is known (as of my last chance to check the facts before this goes live), the most likely cause is either a physical infrastructure problem or human error.
I’m the first person to acknowledge that no amount of performance testing and monitoring will prevent physical infrastructure failure or human error. Things fail and mistakes happen. And given that Google was able to mostly restore service within a minute and entirely restore it after four minutes, it would seem that Google has (at least) a reasonable live site monitoring program and (at least) reasonable risk mitigation measures in place in the form of outage detection and recovery procedures. I’d even go so far as to wager that teams practice those outage detection and recovery procedures as part of their performance testing and monitoring program. So, assuming the Google outage does turn out to be the result of a physical infrastructure failure or a garden variety mistake, my hat is off to them for the quick recovery. That said, I hope the outage leads to enhancements in their program to defend against future outages, rather than to resting on the relative success of a quick recovery.
Microsoft, for its part, offered this explanation of its outage:
“This incident was a result of a failure in a caching service that interfaces with devices using Exchange ActiveSync, including most smart phones. The failure caused these devices to receive an error and continuously try to connect to our service. This resulted in a flood of traffic that our services did not handle properly, with the effect that some customers were unable to access their Outlook.com email and unable to share their SkyDrive files via email.”
Now, this is something that could have been simulated during performance testing. And if it had been simulated, it could have been prevented. Again, I am sympathetic to the degree that there is never enough time to simulate every possible oddball scenario under every possible condition. What gets me is “…and continuously try to connect to our service. This resulted in a flood of traffic that our services did not handle properly…”
Experienced performance testers know that automatic retries are always a fun thing to simulate, because auto-retry has a nasty habit of causing geometric growth in traffic until something breaks, unless someone has put a mechanism in place to counteract that growth. In this case, I’d have to wager that the resolution that took four hours was something akin to a reset, possibly a minor performance improvement and probably some enhanced monitoring for early warning if/when it happens again. What I hope is that the auto-retry lesson has been learned and that in the near future there are both performance improvements and mechanisms put in place to automatically mitigate runaway traffic resulting from auto-retries – no matter the root cause.
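To make the “mechanism to counteract” idea concrete, here is a minimal Python sketch of one common client-side approach: capped exponential backoff with random jitter, plus a hard limit on attempts. It is an illustration only, not a description of what Microsoft or ActiveSync clients actually do. The names `backoff_delay` and `call_with_backoff`, the default values, and the use of `ConnectionError` as the failure signal are all my own assumptions.

```python
import random
import time


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random):
    """Return a random wait in [0, min(cap, base * 2**attempt)] seconds.

    Doubling the ceiling spaces retries out; the randomness keeps
    thousands of clients from retrying in lockstep.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


def call_with_backoff(operation, max_attempts=5, base=1.0, cap=60.0,
                      sleep=time.sleep, rng=random):
    """Call operation(); on ConnectionError, wait and retry.

    Gives up (re-raises) after max_attempts tries, so a client can
    never hammer a struggling service indefinitely.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(backoff_delay(attempt, base, cap, rng))
```

A device-side sync routine could be wrapped as `call_with_backoff(sync_mailbox)`, where `sync_mailbox` is a hypothetical function that raises `ConnectionError` on failure. A client-side measure like this only helps with clients you control, so a service would typically still need its own protection, such as rate limiting or load shedding.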
Nope, this is not a normal Monday at all. It seems that Amazon.com was jealous of all the headlines Google and Microsoft were getting, because as I returned to put the finishing touches on this, Amazon.com was taking its turn at being down. What are the odds of three companies with solid performance teams and sound performance practices making headline news with outages within 12 hours of one another? Did I miss something? Did I go to bed last night in mid-August and wake up in late November? Is today actually Cyber Monday? If not, I hope, for the sake of online consumers everywhere, that companies sit up and take notice.
Amazon’s outage is currently being reported to be specific to "an increased error rate for CreateTags and DeleteTags APIs in the US-EAST-1 region." That information alone is insufficient for me to speculate about either the cause or which aspects of performance testing or monitoring would have helped to prevent or mitigate this situation, but I can show you a cool chart from SmartBear’s AlertSite product that makes the impact and duration of the Amazon outage even more poignant.
AlertSite also showed that errors consisted of HTTP server errors which typically displayed this screen:
Well, in theory, my work day ended about 20 minutes ago, and I still have to pack and then drive a few hours to the hotel for a few days in the office before I can really call it a day. But before I do that, I’d like to leave you with the following thoughts.
If these three organizations can all make the headlines from unrelated outages on the same day, how confident are you that your Web and mobile sites or apps are really ready for the “Wild Wild Web”? Knowing that three of the most well-funded software companies in the world, all accustomed to huge volumes of unpredictable traffic hitting their sites, can all make the headlines over outages on the same, otherwise completely average, day, how do you feel about your company’s performance testing and monitoring practices? And maybe more importantly, how much would your organization be willing to invest to keep its name out of headlines like these tomorrow?
Think about it.