On July 19, 2024, 8.5 million Windows computers running CrowdStrike security software crashed, in what many consider the biggest IT failure in history. The outage caused massive disruption to organizations including airlines, hotels, banks, hospitals, stock markets, and governments. According to a recent Fortune magazine article, the incident resulted in over $5 billion in damage to Fortune 500 companies.
The irony of the incident is that CrowdStrike software is intended to protect systems from downtime caused by unauthorized access by bad actors and cyber criminals. But by releasing a buggy update, CrowdStrike ended up causing major financial damage to the very customers it was committed to protecting. Along with many other CIOs, I’ve been asking, “How could this happen? And how can we mitigate the risk of it happening again?”
Securing systems is necessary but not sufficient. Digital systems are the backbone of modern business, and IT providers must ensure these systems are not only protected but also available, stable, and resilient.
This article will cover the cause of the CrowdStrike incident, key lessons for executives, our reasons for selecting SentinelOne security software, and how it mitigates the risks that led to the recent outage.
What Went Wrong in the CrowdStrike Incident?
CrowdStrike is a cybersecurity company headquartered in Austin, TX, with over 24,000 clients, including Fortune 100 companies. CrowdStrike develops Managed Detection and Response (MDR) software, which is used to monitor computers for malicious threats and is typically overseen by a 24/7 Security Operations Center (SOC).
On July 19, 2024, a buggy security software update went out to all computers running CrowdStrike software. The update caused machines running Windows to crash and display the dreaded “blue screen of death.”
This raises a serious question: why was a buggy update released in the first place?
Ultimately, the CrowdStrike outage represents a QA failure. We now know that the CrowdStrike testing procedures had serious flaws:
- There were no regression tests.
- Unit and manual tests covered only the “happy path” and ignored edge cases.
- There were no staged rollouts and no mechanism to quickly withdraw a bad release, and no special precautions were taken when deploying to critical infrastructure.
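To make the second point concrete, here is a minimal sketch of the difference between happy-path-only testing and edge-case testing. The `parse_channel_file` function and its file format are purely illustrative assumptions, not CrowdStrike’s actual code; the point is that a test suite which only exercises well-formed input can pass while the parser still crashes or misbehaves on malformed input.

```python
# Hypothetical sketch: edge-case tests of the kind that were reportedly missing.
# "parse_channel_file" and its comma-separated format are illustrative
# stand-ins, not any vendor's real update format.

def parse_channel_file(data: bytes) -> dict:
    """Parse a (hypothetical) update file, rejecting malformed input
    with a clear error instead of failing unpredictably later."""
    if not data:
        raise ValueError("empty update file")
    fields = data.split(b",")
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}")
    name, version, payload = fields
    return {"name": name.decode(), "version": version.decode(), "payload": payload}

def run_tests() -> bool:
    # Happy path -- the only case a naive suite covers.
    ok = parse_channel_file(b"rule-291,v1,ABCD")
    assert ok["version"] == "v1"
    # Edge cases -- empty, truncated, and over-long inputs must be rejected.
    for bad in (b"", b"rule-291", b"a,b,c,d"):
        try:
            parse_channel_file(bad)
            return False  # a malformed file was silently accepted
        except ValueError:
            pass  # correctly rejected
    return True
```

A suite without the `for bad in …` loop would report green on every release while leaving exactly the failure mode that matters untested.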
Big Picture Lessons for Executives
The QA failures evident in the CrowdStrike incident hold important lessons for executives. First, proper testing and deployment procedures must be in place. For example, releases should be rolled out to test environments first, and it should be possible to immediately roll back a bad release without any impact on system availability. CIOs need to ensure that any software accessing the kernel has an internal QA process for testing critical patches. In addition, more rigorous testing and validation procedures should be applied when working with critical infrastructure.
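The staged-rollout-with-rollback idea above can be sketched as follows. The ring names, the health-check interface, and the deployment step are all illustrative assumptions, not any vendor’s actual pipeline; the sketch just shows the control flow executives should expect their vendors to have.

```python
# Minimal sketch of a staged (ring-based) rollout with automatic rollback.
# Ring names, ordering, and the health-check interface are illustrative
# assumptions, not any vendor's real pipeline.

from typing import Callable, Dict

RINGS = ["internal", "canary", "broad"]  # smallest blast radius first

deployed: Dict[str, str] = {}  # ring -> currently deployed version

def push(ring: str, version: str) -> None:
    """Stand-in for the real deployment step."""
    deployed[ring] = version

def deploy(version: str, previous: str,
           healthy: Callable[[str, str], bool]) -> str:
    """Roll out ring by ring; on any failed health check, roll every
    touched ring back to the known-good version and stop."""
    touched = []
    for ring in RINGS:
        push(ring, version)
        touched.append(ring)
        if not healthy(ring, version):
            for r in touched:
                push(r, previous)  # immediate rollback
            return previous
    return version
```

With real telemetry (crash rates, boot failures) behind `healthy`, a bad build fails on the small internal ring and is rolled back before it ever reaches customers, which is exactly the safeguard the July 19 update lacked.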
For companies that create or modify software, the incident also demonstrates the importance of validation and testing. Automated tests are ineffective if they do not test points of potential system failure. Likewise, manual tests need to be thorough – checking that the “happy path” works is not enough.
This incident also highlights the importance of having disaster recovery plans in place. When referring to IT incidents, cybersecurity experts often use the phrase “not if, but when.” Don’t just plan for best-case scenarios; plan under the assumption that eventually something is going to go wrong.
Finally, the CrowdStrike incident demonstrates that in a world that runs on software, mistakes made by vendors can have wide-reaching consequences for other organizations. For CIOs, this means that performing thorough vendor due diligence is essential. When vendors provide mission-critical software, it is crucial that they adhere to industry best practices with regard to testing and deployment. The CrowdStrike outage would not have surprised anyone familiar with the flaws in CrowdStrike’s testing and deployment procedures.
Why We Use SentinelOne
As a Managed Services Provider entrusted with securing systems for clients in a variety of industries, CPU RX takes vendor due diligence seriously. We carefully evaluate the third-party solutions that we provide to our clients. In the case of MDR software, we chose SentinelOne precisely because it avoids some of CrowdStrike’s flaws:
- SentinelOne has strict QA and testing procedures. A major failure of CrowdStrike was that its testing procedures were not adequate to catch the bug in the update that caused the outage.
- SentinelOne updates are available on-demand in staging environments before deployment, and every critical patch is tested internally by our Security Operations Center (SOC) before being rolled out. CrowdStrike updates are pushed to all machines automatically, and there is no way for the SOC to test updates in a staging environment before deployment.
- SentinelOne has a newer codebase. Compared to CrowdStrike, SentinelOne has less technical debt.
- SentinelOne’s live security updates operate only in user space and do not affect the Windows kernel. CrowdStrike security updates do impact the kernel, creating the possibility that a buggy update may crash the Windows OS.
We understand the critical role IT systems play in your business, and cutting corners is not an option when it comes to protecting them. Our internal safeguards ensure that every solution we implement is safe, reliable, and effective. To learn more about our managed services and 24/7 cybersecurity protection, book a call today.
Learn About Our Managed IT Services
With over 20 years of experience, CPU RX has the expertise needed to secure your systems and protect your confidential data. Contact us today to learn more about our managed IT services.