Breaking Down the CrowdStrike Outage Part 1: Preventing Critical Errors from Reaching Production
On July 19th, 2024, the world witnessed a large-scale computer outage caused by a faulty update from cybersecurity giant CrowdStrike. This incident, affecting millions of Windows devices globally, serves as a stark reminder of the domino effect that software errors can have.
Since then, CrowdStrike has shared its preliminary incident report, which outlines the incident and the steps the company will take to prevent similar issues in the future. That said, this was no small bug: the errors resulted in an estimated loss of over $5.4 billion for Fortune 500 companies alone. Even with the best forward-looking intentions, this event caused irreparable reputational harm that will likely cost the company in the future.
Bugs are always a risk, but you can protect your business by implementing the right development and testing processes. That’s why we’ve created this brief blog series where we’ll explore what happened, the impact it had, and how you can prevent similar disasters for your app.
Here’s what you can expect from this two-part blog series:
- Part One: Preventing critical errors from reaching production
- Part Two: Resolving bugs swiftly with post-production app monitoring
Read part two of this series here.
What Caused the CrowdStrike Outage?
The culprit behind the outage was a seemingly innocuous update – a sensor configuration update for CrowdStrike’s Falcon antivirus software on Windows machines. This update contained a logic error that triggered system crashes and the dreaded Blue Screen of Death (BSOD) upon activation. Read the preliminary incident report for more information.
The technical details, as explained by CrowdStrike themselves, point to a configuration update triggering a logic error. This error resulted in a system crash, causing affected devices to reboot repeatedly and become unusable.
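To see how a configuration-only change can break code that itself hasn’t changed, consider the generic sketch below. This is not CrowdStrike’s code, and every name in it (`EXPECTED_FIELDS`, `apply_rule`, `validate_rule`, `apply_rule_safely`) is hypothetical. It only illustrates a latent logic error that a content update can trigger, and the kind of validation that catches malformed content before it is used.

```python
EXPECTED_FIELDS = 3  # hypothetical: number of fields the consuming code reads


def apply_rule(fields):
    """Consumer with a latent logic error: it assumes every rule has 3 fields."""
    return f"{fields[0]}:{fields[1]}:{fields[2]}"


def validate_rule(fields):
    """Reject content whose shape doesn't match what the consumer expects."""
    return len(fields) == EXPECTED_FIELDS and all(isinstance(f, str) for f in fields)


def apply_rule_safely(fields):
    """Validate first, and skip bad content instead of failing on it."""
    if not validate_rule(fields):
        return None
    return apply_rule(fields)
```

Passing a two-field rule to `apply_rule` raises an `IndexError`, while `apply_rule_safely` returns `None` and moves on. In a Python script the failure surfaces as an exception; in lower-level code the same class of mistake can take down the whole process or machine.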
The Global Impact
The impact of the CrowdStrike outage was widespread and disruptive. Here’s a breakdown of how different areas were affected:
- Businesses: It’s estimated that the Fortune 500 alone have lost over $5.4 billion due to the outage. Countless companies across various industries, including finance, healthcare, retail, and transportation, were left paralyzed. Critical operations were halted as employees couldn’t access their workstations. There will also likely be continued fallout as businesses attempt to recover from the event.
- Individuals: Individual users who rely on Windows machines for work or personal use were also left inconvenienced and, in some cases, at risk. The airline industry was hit hard and was unable to recover for days, with passengers left stranded across the globe. More concerning, there were reports of 911 dispatchers unable to answer calls and deploy emergency services for upwards of 7 hours in some cases.
- Overall Productivity: The outage caused significant productivity losses across the globe, highlighting the interconnectedness of today’s digital landscape. It also highlighted potential overreliance on a single software provider: if organizations running Windows had used different or multiple security providers, the outage might not have been so widespread.
- CrowdStrike Themselves: CrowdStrike will face both financial and reputational battles moving forward. Businesses like Delta have already begun to pursue legal action to recover their losses. Even if they can handle the financial demands of the lawsuits, their reputation is irreparably damaged. Additionally, CrowdStrike’s stock value has dropped by a whopping 41% over the course of two weeks. Recovering from this will require significant effort, which most companies lack the resources to achieve.
How Could This Have Been Avoided?
In short, this outage happened because a seemingly small update didn’t go through the normal testing channels before being released. Bugs are unavoidable, but there are steps you can take to prevent them from having a catastrophic effect on your business.
Here are some tips that can minimize your risk of leaking bugs to production:
- Stronger Application of QA Process Standards: CrowdStrike acknowledges that they have an extensive QA process that includes automated testing, manual testing, validation, and rollout steps, but also that the rapid response content through which this update was delivered followed a different process. While different delivery methods may require different testing steps, setting organizational standards promotes a mindset that prevents even edge cases from impacting your release quality. No release is small enough to forgo your standards.
- Treat All Releases with the Same Care: This billion-dollar bug was caused by a small, 40KB release, something no team would have expected to cause this much pain. This should serve as a reminder that code is only becoming more complex, and even the smallest update can create massive change. It’s important to maintain full testing coverage and quality no matter the size of your release.
- Use Automation for Insight: Automation is essential for teams to stay caught up on work, but it’s important to remember that automation, like AI, isn’t 100% perfect and doesn’t replace a human set of eyes. As you grow your test automation, be sure you are keeping manual checks in place and that you aren’t overly reliant on, or trusting of, automation. In this case, more diligent attention to what automation is “passing” could have flagged this potential issue sooner.
- OS- & Device-Specific Testing: The outage wasn’t the result of a codebase issue that impacted every device and operating system; it was Windows-specific. Generic cross-platform tests can often miss errors like this, so it’s important to also test on the devices, operating systems, and browsers you actually support. Had CrowdStrike run this update on real Windows machines before release, they would likely have detected the error before deployment.
- Remember the Purpose of Testing: All of this leads to the final point, which is to remember the point of testing: you are not just trying to release as fast as you can, you are expected to deliver a quality experience. As you grow, business leaders will put pressure on you to meet release demands quickly, but if it comes at the cost of quality you could end up causing more harm than good.
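The “same standards for every release” idea from the tips above can be sketched as a simple release gate. This is a minimal illustration under stated assumptions, not a description of any real pipeline: the check names in `REQUIRED_CHECKS`, the `Release` class, and both functions are hypothetical. The point is that the gate never looks at the release’s type or size.

```python
from dataclasses import dataclass, field

# Hypothetical organization-wide standard: every release needs all of these.
REQUIRED_CHECKS = frozenset(
    {"automated_tests", "manual_review", "target_os_run", "staged_rollout"}
)


@dataclass
class Release:
    name: str
    kind: str      # e.g. "code" or "config"; deliberately ignored by the gate
    size_kb: int   # deliberately ignored by the gate
    passed_checks: set = field(default_factory=set)


def missing_checks(release):
    """Return the required checks this release has not passed, sorted."""
    return sorted(REQUIRED_CHECKS - release.passed_checks)


def can_ship(release):
    """A release ships only when nothing is missing, whatever its kind or size."""
    return not missing_checks(release)


if __name__ == "__main__":
    small = Release("content-update", "config", 40, {"automated_tests"})
    print(can_ship(small))        # False
    print(missing_checks(small))  # ['manual_review', 'staged_rollout', 'target_os_run']
```

Because `kind` and `size_kb` play no part in the decision, a 40KB configuration update faces the same bar as a full code release.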
How to Run a More Comprehensive Test Suite
So how do you go about implementing some of these stopgaps? We’ve outlined some of the tactical steps you should take to shore up your QA processes.
- Automate More Types of Tests: Functional tests are helpful for making sure your UI works, but you need more than that to ship confidently. If you can incorporate additional test types like load testing, visual testing, and accessibility testing, you’ll have a much more well-rounded picture of where bugs may appear.
- Use Manual Testing for More Than Just Spot-Checking: Manual testing can offer important UI/UX insights to improve your customer experience, but it can also be used as a stopgap in your QA process. As you automate more of your tests, run occasional manual tests at the same time to make sure the automation is still reliable.
- Test on Real Devices and Browsers: The more complex devices and OSs become, the more nuanced your app delivery has to be. It’s not enough to run your app code through a basic round of tests. You then need to replicate those tests in the actual environments where your customers can be found.
- Apply Organizational Standards to Every Department, Even if They Work in Different Focus Areas: Teams might work on different apps or functions within your company, but that doesn’t mean you can’t have consistency. Seek ways to organize test cases across your teams and to share knowledge and guidelines across different stakeholders.
Essentially, our testing guidance boils down to this: get more done in a single place and become more thorough in your application of standards.
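Two of the steps above, testing in real environments and using manual testing to double-check automation, can be sketched in a few lines. This is a rough illustration only: the result format (dicts with `environment`, `real_device`, and `passed` keys) and both function names are assumptions made for the example, not the output of any particular tool.

```python
import random


def coverage_gaps(target_environments, results):
    """Return target environments that have no passing run on a real device."""
    covered = {
        r["environment"]
        for r in results
        if r["real_device"] and r["passed"]
    }
    return [env for env in target_environments if env not in covered]


def pick_for_manual_review(passed_tests, k, seed=None):
    """Sample up to k automated passes for a human to re-check by hand."""
    pool = sorted(passed_tests)   # stable order so a fixed seed is repeatable
    rng = random.Random(seed)     # local generator; global state is untouched
    return rng.sample(pool, min(k, len(pool)))


if __name__ == "__main__":
    targets = ["windows-11", "macos-14", "ubuntu-22.04"]
    results = [
        {"environment": "windows-11", "real_device": False, "passed": True},
        {"environment": "macos-14", "real_device": True, "passed": True},
        {"environment": "ubuntu-22.04", "real_device": True, "passed": False},
    ]
    print(coverage_gaps(targets, results))  # ['windows-11', 'ubuntu-22.04']
    print(pick_for_manual_review({"login", "checkout", "search"}, 2, seed=1))
```

Here `windows-11` is flagged even though its run passed, because the pass came from an emulated environment rather than a real device.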
Use Tools Built for Full-Scale Test Orchestration
Thorough testing often requires complex workstreams and effective communication. At SmartBear, we’ve designed our Test Hub to give you a single destination to achieve the next level in QA and ship higher quality applications while meeting fast-paced release demands. Whether you’re seeking help with automation, management, or both, we have solutions to fit your needs.
Seeking Automation? TestComplete can help users of any skill level automate more types of tests and run them in parallel on real devices and browsers. You can even perform manual tests and pull reports to stay on top of the testing processes.
Seeking Management? Our suite of Zephyr products gives you the means to organize and optimize your test cases. Choose from in-Jira or standalone options that give you the freedom to work where you need to while maintaining close communication with your developers.
With TestComplete and Zephyr, you will achieve full-scale test automation across your teams and can apply organizational standards that will ensure releases of any size are always of the highest quality. Trial one or both today for free and see how quickly you can build confidence for your next releases!
Next Up: Maintaining Successful Live Performance with Detailed Monitoring
In this piece, we discussed beneficial pre-release steps you should consider implementing to avoid leaking major bugs to production. Of course, not every bug is preventable, so it’s important to also prepare for post-launch performance stability.
Check out the next part of this series, where we will cover tips for spotting and resolving bugs quickly to prevent live errors from impacting your business.