On Monday, June 3, two major social networks experienced significant downtime on both mobile and desktop interfaces. Twitter was the first to go down, with features starting to fail at around 4:08 EST. Users couldn't refresh their timelines, click on trending topics, or view profiles. The outage only lasted about twenty-five minutes, but its reach was extensive.
Facebook Messages also experienced downtime that evening. For more than an hour, some users, both mobile and desktop, were unable to send or receive messages. The chat feature, which identifies which friends are available for messaging, was also affected. The platform was running smoothly again by 10:30 EST.
Neither company provided much information about how the problem was resolved. Twitter simply displayed the “fail whale,” its icon for platform errors, and later attributed the outage to a routine change, which it apparently rolled back as soon as it realized the damage. Many Facebook Messages users received a message claiming that the platform was “down for maintenance.”
A Brief History of Twitter and Facebook Downtime
While both companies have experienced outages before, Twitter is the more reliable of the two. The platform has consistently performed at over 97% uptime despite brief outages over the past six years, and the issues it has experienced were resolved quickly and effectively. (For more information on this, check out our recent infographic, Twitter as an Emergency Broadcast Network.)
Facebook’s record has not been as strong. 2009 was a bad year for the platform, and its worst outage came in September 2010, lasting over three hours and affecting over 60% of users. While both the website and the mobile app have improved over the last few years, the company is facing tough competition from sites like Google and other apps, making reliability even more important.
Why It Happens
Any number of problems can cause an outage. The most common is a power outage or failed backup. This problem can be mitigated by spanning connectivity across multiple availability zones (that is, storing data at, and serving it from, multiple locations instead of a single location). The second and third most frequent causes of downtime are natural disasters (Hurricane Sandy caused six major outages!) and traffic spikes or DNS routing problems. While researchers have yet to master controlling (or even predicting) the weather, there are things you can do to prevent or manage outages due to traffic spikes.
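As a rough illustration of serving from multiple locations instead of a single one, the Python sketch below tries a list of locations in order and returns the first successful response. The names `fetch_with_failover`, `AllLocationsFailed`, `locations`, and `fetch` are hypothetical, chosen only for this example. In practice this kind of routing is usually handled by infrastructure such as load balancers or DNS rather than by application code.

```python
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")


class AllLocationsFailed(Exception):
    """Raised when no location could serve the request."""


def fetch_with_failover(
    locations: Sequence[str],
    fetch: Callable[[str], T],
) -> Tuple[str, T]:
    """Try each location in order; return (location, result) for the first success.

    `fetch` is a caller-supplied function (an assumption of this sketch) that
    takes a location name and either returns data or raises an exception.
    """
    errors = []
    for location in locations:
        try:
            return location, fetch(location)
        except Exception as exc:  # any failure means "try the next location"
            errors.append(f"{location}: {exc}")
    # Every location failed, so report all of the collected errors together.
    raise AllLocationsFailed("; ".join(errors))
```

With two or more locations, a failure in the first is invisible to the caller as long as a later one responds. With only one location, any failure becomes an outage.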
What You Can Do
Obviously, we at SmartBear advocate testing as thoroughly as possible to make sure you have identified vulnerabilities both in your code and in your network. Being thorough in your testing will provide invaluable insight and ensure fewer surprises down the road. But if you do experience an outage (and you will – it’s one thing you can count on), it is important to deal with it effectively and efficiently. Here are some tips for handling the inevitable outages on your own site:
Set up Notifications
To deal with downtime in a timely manner, set up a production monitoring tool that notifies you immediately of any problems with performance or availability. A notification from a tool like this gives you the information you need to confirm that the site is actually down, rather than a single user experiencing connection problems of his or her own.
When you learn of an issue in your production environment, you can identify the exact problem and communicate with your users as quickly as possible. Take this time to notify your users via email or social media of the issue and what is being done to resolve it. Proactively communicating with the user base can save a lot of frustration on the customer side and keep your customers loyal to you.
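To make the monitoring idea concrete, here is a minimal Python sketch of a probe that checks a site over HTTP and raises an alert only after several consecutive failures, so that one dropped request is not mistaken for an outage. It is not how any particular monitoring product works. The names `check_site`, `OutageDetector`, and `monitor`, the threshold of three failures, and the `notify` callback are all assumptions made for this example.

```python
import time
import urllib.error
import urllib.request


def check_site(url, timeout=5.0, opener=urllib.request.urlopen):
    """Probe `url` once and return (is_up, detail)."""
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as exc:  # server answered with 4xx/5xx
        return False, f"HTTP {exc.code}"
    except OSError as exc:  # DNS failure, refused connection, timeout, ...
        return False, f"unreachable: {exc}"
    return 200 <= status < 400, f"HTTP {status}"


class OutageDetector:
    """Turn a stream of up/down probe results into alert events."""

    def __init__(self, failures_before_alert=3):
        self.failures_before_alert = failures_before_alert
        self.consecutive_failures = 0
        self.alerted = False

    def record(self, is_up):
        """Return "alert", "recovered", or None for this probe result."""
        if is_up:
            was_alerted = self.alerted
            self.consecutive_failures = 0
            self.alerted = False
            return "recovered" if was_alerted else None
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failures_before_alert and not self.alerted:
            self.alerted = True  # alert once per outage, not once per probe
            return "alert"
        return None


def monitor(url, notify, interval_seconds=60.0, max_checks=None, check=check_site):
    """Probe `url` repeatedly and call `notify(message)` on each state change.

    `notify` is a hypothetical callback; it could send an email or a page.
    """
    detector = OutageDetector()
    checks = 0
    while max_checks is None or checks < max_checks:
        is_up, detail = check(url)
        event = detector.record(is_up)
        if event is not None:
            notify(f"{url} {event}: {detail}")
        checks += 1
        time.sleep(interval_seconds)
```

Calling `monitor("https://www.example.com", print)` would probe that placeholder address once a minute and print a message when the site goes down and again when it comes back. Running the probe from outside your own network helps separate a real outage from a local connection problem.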
Limit your advertising during the outage
Just remember, the longer your site is down, the more traffic you are losing. For this reason, if you see that the downtime is going to be prolonged, try to limit online advertising or marketing that drives users to the website. Most outages last between 1 and 4 hours. Depending on your normal traffic and when the outage occurs, this could mean a significant loss in website traffic, along with the risk that potential customers who are pushed to your site and see problems there will go to the competition.
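To see how much the timing of an outage matters, the short Python sketch below adds up the visits a site would normally receive during the hours it is down. The function name `estimate_lost_visits` and the traffic profile are invented for illustration, so substitute figures from your own analytics.

```python
def estimate_lost_visits(hourly_visits, start_hour, duration_hours):
    """Sum the visits expected during an outage lasting a whole number of hours.

    hourly_visits: 24 typical visit counts, index 0 = midnight to 1 a.m.
    start_hour:    hour of day (0-23) at which the outage begins.
    """
    if len(hourly_visits) != 24:
        raise ValueError("hourly_visits must contain 24 values")
    return sum(
        hourly_visits[(start_hour + offset) % 24]  # wrap around midnight
        for offset in range(duration_hours)
    )


# Hypothetical traffic profile: quiet overnight, busiest in the evening.
profile = [50] * 8 + [200] * 10 + [600] * 4 + [100] * 2

evening = estimate_lost_visits(profile, start_hour=19, duration_hours=4)
overnight = estimate_lost_visits(profile, start_hour=2, duration_hours=4)
print(f"Evening outage: {evening} visits lost; overnight outage: {overnight}")
```

With this made-up profile, the same four-hour outage costs 1,900 visits in the evening but only 200 overnight. That gap is why pausing paid campaigns matters most during peak hours.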
Keep calm and restore service
Remember, downtime is not the end of the world. It happened this week to Facebook and Twitter, and it happens 1-4 times a year on average to companies with data centers. However, as online reliability becomes even more important for companies, consider taking preventative measures and creating a plan of action in case it happens to you.