Any Given Thursday - Digging into Nasdaq's 3-Hour Outage
This week began like Any Given Monday, except for the strange headlines reporting that select Google, Microsoft, and Amazon websites, applications, and services experienced Web outages. We still don’t know all of the details surrounding what happened - and we probably never will. The only things we are pretty confident of are that there is no indication these outages were related in any way, no indication they were caused by foul play, and no indication they resulted from unexpectedly high usage – leaving only a few likely culprits:
- physical failure of hardware, networks, power sources, etc.
- human error
- under-estimating or under-mitigating known performance and/or reliability risks
- really bad luck
Now it’s any given Thursday, and we’ve got another high-profile outage. This time it’s Nasdaq:
That particular reporter is clearly focused on the security angle, although there is no current indication that this outage was the result of foul play either.
“The first indication of a problem at Nasdaq occurred at 11:45 a.m. when the exchange announced that it was experiencing “momentary interruptions” in the system.
“The UTP SIP experienced momentary interruptions in quote dissemination across all UQDF Channels from approximately 10:57 to 11:03,” the note read. “All channels are now operating normally.”
At 12:21, a second alert was sent out announcing that Nasdaq was halting trading on all Tape C securities due to the problems, followed by another note indicating that trading on all Nasdaq options markets was being halted due to the same problem.
The White House sent out a message to reporters saying that President Obama had been briefed about the issue by his chief of staff.
Trading resumed around 3:41 p.m.
The exact problem at Nasdaq is not known, nor have the causes of the other outages been explained, other than to indicate that they involved "network issues.”
I have no inside information about Nasdaq’s development, testing, networks, or operations. What I do have is lots of experience with detecting, resolving, mitigating, and reacting to performance-related issues. And that experience has taught me that when an outage is reported as having “involved network issues,” that means one of a few things:
- Something physically failed
- Someone made a mistake (like unplugging the wrong cable)
- Pretty much anything leading to insufficient available internal network or ISP bandwidth
- “We have no idea, but that sounds really bad. People accept occasional network issues, we’ll go with that. It’ll keep our clients from losing confidence in us.”
Here’s the thing. If an outage is related to a physical failure, why not say that? It happens. Things wear out. Maybe the thing that failed was a “lemon.” Physical failures don’t shake my confidence in a service or provider unless or until it becomes a regular thing.
If someone made a mistake, why not say it was human error? Again, it happens. All of us who have worked in IT for a while have, at some point, done something accidentally or with all good intent that went very badly. And, again, this doesn’t shake my confidence unless it becomes a regular occurrence.
While I don’t approve of misleading the public, I do have some sympathy for organizations that cave to relentless badgering for information and provide an entirely generic, but plausible, response in the early stages of troubleshooting an issue. The reality is that it’s exceptionally difficult to troubleshoot a problem when you have to stop every five minutes to tell someone else that you don’t know what it is yet. That sympathy disappears, however, if the generic response isn’t replaced with a real one once the root cause is found.
Insufficient available bandwidth causing an outage, however, bothers me. A lot. There is absolutely no good reason for insufficient bandwidth to cause an outage. Maybe a slowdown, but if a flood of network traffic (not a flood of traffic to your site, just a whole bunch of traffic on the same network as your site) leads to an outage, something is wrong, at least in my book.
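To make that concrete: one standard way to turn network congestion into a slowdown rather than a failure is for clients to retry timed-out requests with exponential backoff and jitter. Here’s a minimal, purely illustrative sketch - the function names and the simulated “flaky link” are mine, not anything from Nasdaq or these vendors:

```python
import random
import time

def call_with_backoff(op, attempts=4, base_delay=0.05):
    """Retry a network operation with exponential backoff and jitter.

    Under congestion, each retry waits a bit longer, so the caller
    experiences a slowdown instead of an immediate hard failure,
    and retries don't hammer the congested link in lockstep.
    """
    for attempt in range(attempts):
        try:
            return op()
        except (TimeoutError, ConnectionError):
            if attempt == attempts - 1:
                raise  # out of retries: surface the error to the caller
            # Exponential backoff (2 ** attempt) with random jitter.
            delay = base_delay * (2 ** attempt) * (0.5 + random.random() / 2)
            time.sleep(delay)

# Simulate a link that times out twice before recovering.
failures = iter([TimeoutError, TimeoutError, None])
def flaky_request():
    exc = next(failures)
    if exc:
        raise exc()
    return "quote data"

print(call_with_backoff(flaky_request))  # survives two timeouts
```

The point isn’t this particular code; it’s that the pattern is cheap, well known, and exactly the kind of thing that keeps “a whole bunch of traffic on the same network” from cascading into an outage.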
When we’re discussing high-volume, high-reliability sites and apps associated with high-profile, well-established, and well-funded organizations like Microsoft, Google, Amazon, and Nasdaq, it’s safe to assume that delivering good performance and reliability is a real priority. Therefore, it is at least fair to assume that a deliberate effort has been made to deliver good performance and reliability, which includes things like conducting load testing and establishing and testing failover scenarios. Furthermore, it’s fair to assume that these kinds of sites have, and have tested, reasonably robust Denial of Service Attack defense mechanisms. (I *know* that to be true for at least two of these organizations because, one time, long ago, I fell victim to those very defense mechanisms while I was testing a vendor’s claim about how many virtual users their product could generate. Oops.) And if all of that is true, I would find it entirely unforgivable if testing wasn’t done related to “suddenly limited bandwidth” conditions and if mechanisms weren’t put in place to help prevent such a situation - and to return to acceptable service levels in short order if it did happen.
I would find it unforgivable because, after you’ve done the testing and implemented defense and recovery mechanisms around unexpected peaks of legitimate traffic, hardware failure and denial of service attacks, it would literally take under an hour to test what would happen if part of the network “went down” or was suddenly flooded with unrelated traffic.
Now maybe I’m being too narrow. Maybe “network issues” means a construction crew across town accidentally cut a relevant fiber-optic cable. Or maybe it means a software update being done by the “network team” went bad. But we’re talking about four major sites during an otherwise unremarkable week (not to mention the New York Times outage last week that was also attributed to “network issues”). Personally, I think the odds of all this being related to SkyNet becoming self-aware are better than it being the result of construction crews, or of software updates gone bad during peak usage hours.
All I know for sure is that I’m starting to think that Cyber Monday could be *really* interesting this year if this trend continues.