The Falcon’s Great Fall: The 2024 Update That Shook CrowdStrike’s Legacy
Background and Context
CrowdStrike is a leading cybersecurity firm, founded in 2011 by George Kurtz (former CTO, McAfee) and Dmitri Alperovitch (former VP Threat Research, McAfee). Shawn Henry, a former FBI official, later joined them to lead CrowdStrike Services, Inc., which delivers security response services. The company was fast on its way to becoming one of the pre-eminent global cybersecurity companies, reaching the coveted unicorn status (a $1 billion valuation) in 2017 and passing a $3 billion valuation the very next year. The founders’ vision was to counter new cyber threats and provide strong endpoint protection.
Key Achievements
It was among the first companies to build advanced threat intelligence and real-time response into its products, shaping modern cybersecurity solutions. This helped financial and critical infrastructure service providers defend against complex threats in real time.
CrowdStrike reinvented endpoint detection and response with its flagship product, the CrowdStrike Falcon platform. Falcon is cloud-native and can therefore scale threat detection at unprecedented speed. Later, CrowdStrike launched Falcon X, an advanced threat intelligence tool. It also offered Falcon OverWatch, a detection and response service.
For its innovative approach to cyberthreats, CrowdStrike has won recognition from analyst firms like Gartner and Forrester. Thousands of enterprise clients, including the majority of Fortune 500 companies and organizations like the U.S. Department of Health and Human Services, rely on its solutions.
From very early on, CrowdStrike has focused on breach prevention, and delivering effective, innovative security solutions has been its mission. But on July 19, 2024, the company issued a faulty security software update that caused global computer outages, severely disrupting air travel, banking, broadcasting, and other services.
The Fateful Update That Brought Down The World
Context and Timeline
On July 19, 2024, at 04:09 UTC, CrowdStrike released a sensor configuration update to its Falcon platform to enhance the detection of new threat techniques.
Release of the Update: The update, delivered as Channel File 291, went out to Windows hosts running the Falcon sensor at 04:09 UTC.
Immediate Aftermath: Within minutes of deployment, a logic error triggered by the update began causing BSOD (Blue Screen of Death) crashes and system failures on affected Windows machines.
Discovery of the Problem: CrowdStrike identified the issue and began rolling back the update at 05:27 UTC, but by then far too many systems had already been affected. Machines that had received the file were caught in a boot loop, which hit Windows 10 and later systems. Mac and Linux systems were not affected.
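The two timestamps in the timeline bound the window in which online Windows hosts could pick up the faulty file. A minimal sketch of that arithmetic follows; the variable names are illustrative, and the times are the ones quoted above.

```python
from datetime import datetime, timezone

# Timestamps as quoted in the timeline above (UTC).
released = datetime(2024, 7, 19, 4, 9, tzinfo=timezone.utc)
rollback_started = datetime(2024, 7, 19, 5, 27, tzinfo=timezone.utc)

# Length of the window during which the faulty file was being distributed.
exposure = rollback_started - released
exposure_minutes = int(exposure.total_seconds() // 60)

print(f"Exposure window: {exposure_minutes} minutes")
```

A window of little more than an hour was enough to reach millions of machines, which is why the speed of distribution matters as much as the defect itself.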
A series of events leading to system failures
The Impact
The security update backfired, disrupting the very systems it was meant to protect. Critical services, including stock exchanges, power plants, hospitals, and airlines, halted due to system failures, highlighting the urgent need for more rigorous testing of software updates.
Global Effect of the BSOD Crashes
The faulty CrowdStrike update triggered widespread system failures, with the Blue Screen of Death (BSOD) crippling industries worldwide. Airlines, hospitals, and banks were among the hardest hit: flights were grounded, surgeries delayed, financial transactions halted, and government services disrupted, paralysing key infrastructure and economic activity globally.
Areas Affected
Some regions of the world were more affected than others. North America, Europe, and Asia were severely impacted. In North America, tech companies, banks, and emergency services faced operational issues, especially in the US and Canada. Europe saw significant disruptions in the UK, Germany, and France, while Asia experienced the worst effects in India and Japan. The crisis underscored the vulnerability of global business and infrastructure.
Affected Users
Microsoft estimated that the update affected 8.5 million Windows devices globally, severely disrupting vital services. Key impacts included:
- Service Delays: Banking, transport, and other daily services were delayed, exposing our dependence on technology.
- Workplace Stress: Office workers faced record stress levels and backlogs due to inaccessible systems.
- Healthcare Disruption: Hospitals struggled to deliver urgent care, resorting to pen-and-paper methods amid system failures.
- Retail Chaos: Both online and offline stores experienced transaction failures, leading to lost sales and frustrated customers.
- Travel Disruptions: Hundreds of flights were canceled, and thousands of rail, metro, bus, and cab services were impacted, stranding travellers.
- Education Interruption: Schools and universities couldn’t conduct exams, mark attendance, or access essential facilities, disrupting academic activities.
Economic Impact
The financial loss from the outage is estimated at $5.4 billion for Fortune 500 companies alone. That figure excludes Microsoft, so the overall impact could be higher.
Early estimates for the top four sectors affected by the Falcon sensor update incident, with the estimated losses and the major players in each sector. It does not represent all affected sectors or the total worldwide loss caused by the incident.
- Health: The most affected sector, with an estimated $1.94 billion loss, as disruptions in appointment systems led to delays and cancellations, impacting patient care.
- Banking & Financial Services: Incurred losses estimated at $1.15 billion, excluding future insurance and reinsurance claims.
- Airlines & Logistics: Grounded flights resulted in $860 million in lost revenues and costs for overnight stays.
- Retail & Wholesale: Transaction failures led to a $470 million loss, along with significant damage to consumer trust.
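To put the sector figures in proportion, the short sketch below adds up the four estimates listed above and compares them with the $5.4 billion overall estimate. The numbers are copied from this section; the variable names are illustrative.

```python
# Estimated losses in billions of USD, copied from the list above.
sector_losses = {
    "Health": 1.94,
    "Banking & Financial Services": 1.15,
    "Airlines & Logistics": 0.86,
    "Retail & Wholesale": 0.47,
}
total_estimate = 5.4  # overall estimate quoted earlier in this section

top4_total = round(sum(sector_losses.values()), 2)
top4_share = top4_total / total_estimate

print(f"Top 4 sectors: ${top4_total:.2f}B ({top4_share:.0%} of ${total_estimate}B)")
```

On these estimates, the four sectors account for about $4.42 billion, or roughly four fifths of the total.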
Affected Businesses
The CrowdStrike incident had widespread effects across various industries, severely disrupting operations and services. In media and entertainment, major organizations like CBS, Disney, and Sky News experienced outages that paralysed live transmissions. Retail and restaurant chains, including McDonald’s and Best Buy, faced downtime that blocked customer transactions, leading to lost revenue and eroded trust. Financial institutions such as JPMorgan Chase and Bank of America dealt with malfunctioning online banking systems, causing complications for customers. Healthcare organizations like the NHS and Blue Cross Blue Shield saw disruptions in electronic health records, delaying patient care and emergency services. Technology and manufacturing companies like HP and Tesla experienced infrastructure disruptions, affecting service delivery and product management.
Airlines like Delta and Ryanair faced operational failures that led to flight delays and cancellations, while airports such as Heathrow and Changi grappled with overcrowding and long delays. Hospitality giants like Hilton and Marriott encountered issues in reservations and customer service, inconveniencing guests. Logistics companies like FedEx and UPS struggled with late shipments, and educational institutions like Harvard University faced disruptions in administrative activities. Government agencies, including NASA and the Department of Homeland Security, experienced critical delays in public services, while emergency services, such as Alaska’s 911 lines, were hit hard, putting lives at risk. The incident also affected professional services firms like Deloitte, resulting in financial and reputational damage.
Response to the Incident
CrowdStrike Responses
CrowdStrike quickly addressed the Windows BSOD outage by publicly acknowledging the issue and working with Microsoft on recovery. The cause was traced to a defect in the Falcon content update itself, not to a cyberattack or a Windows patch. The published workaround asked users to boot each affected machine into Safe Mode or the Windows Recovery Environment, delete the faulty channel file, and reboot. However, this had to be done by hand on every machine and was complex for non-technical users, leading to widespread frustration. CrowdStrike’s technical teams worked tirelessly to guide users through the process, but the intricate nature of the fix ultimately hurt the customer experience, especially for small businesses and those without IT support.
Microsoft’s Response
Microsoft collaborated with CrowdStrike on remediation, publishing recovery guidance and a recovery tool, including instructions for machines protected by BitLocker, where a recovery key was needed to reach the disk. However, the technical nature of these steps led to further frustration among non-technical users, many of whom struggled to follow the guidelines and felt unsupported by Microsoft’s response.
Business Continuity Strategies
Companies adopted various strategies to maintain operations during the crisis. For example, Air India issued handwritten tickets to avoid flight cancellations, and Mayo Clinic reassigned staff to critical departments to ensure continued patient care despite disruptions to their electronic health records system.
Legal Implications
CrowdStrike faces a class action lawsuit filed in Texas by the Plymouth County Retirement Association, alleging that the company misled shareholders about its software updates, inflating stock prices and exposing investors to losses. Additionally, Delta Air Lines is seeking damages from CrowdStrike and Microsoft for losses it has put at roughly $500 million, with insurance covering only a portion. Other organizations are also assessing their legal options against CrowdStrike.
Customer Confidence
The failed update severely damaged customer confidence in CrowdStrike, raising concerns about the reliability of their security products. While CrowdStrike’s transparency and ongoing support helped mitigate some of the damage, the highly technical fix frustrated many users, particularly those without technical expertise. To restore confidence fully, CrowdStrike may need to improve testing, quality assurance, and possibly initiate independent third-party audits.
Shareholder Value Evaporated
CrowdStrike’s stock fell by 11.1% on Friday and another 13.5% on Monday due to fears of financial and reputational damage. While the company’s swift response helped stabilize the stock, there remain concerns about the long-term effects on its reputation and customer trust.
Insurance Coverage
Cyber insurance is expected to cover only 10% to 20% of the losses, translating to an estimated $540 million to $1.08 billion in insured losses. This limited coverage highlights a significant gap between potential risks and actual insurance coverage. The incident underscores how tech failures can disrupt today’s computer-dependent economy.
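The insured range quoted above follows directly from the $5.4 billion estimate and the 10% to 20% coverage band. A minimal sketch of the calculation, which also shows the uninsured remainder, is given below; the variable names are illustrative.

```python
total_loss_bn = 5.4                       # estimated total loss, billions of USD
coverage_low, coverage_high = 0.10, 0.20  # expected insured share of the loss

insured_low = total_loss_bn * coverage_low
insured_high = total_loss_bn * coverage_high

# Whatever insurance does not cover stays with the affected companies.
uninsured_low = total_loss_bn - insured_high
uninsured_high = total_loss_bn - insured_low

print(f"Insured:   ${insured_low:.2f}B to ${insured_high:.2f}B")
print(f"Uninsured: ${uninsured_low:.2f}B to ${uninsured_high:.2f}B")
```

On these figures, between $4.32 billion and $4.86 billion of the estimated loss would fall outside insurance cover.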
Crisis Management Gone Bonkers
CrowdStrike attempted to offer a $10 Uber Eats coupon as a goodwill gesture to those affected by the outage, but the plan backfired when Uber’s fraud detection system flagged and banned the coupons, rendering them unusable. This misstep attracted heavy criticism and further tarnished CrowdStrike’s reputation during an already challenging time.
In-Depth Look at the System Update
The update did not open a security hole; it exposed a latent defect that caused system failure. The defect was an out-of-bounds read in the Content Interpreter of the CrowdStrike Falcon sensor. It was triggered by a newly introduced Rapid Response Content file, which caused the interpreter to request data beyond the allocated bounds of an input array.
Fatal Flaw: Missed Input Validation
The root mistake was a cardinal one in software development: input validation was missing for the relevant template type. Such validation would normally check the number of input fields against the expected parameters. Without that safeguard, the system went on to process data without checking its integrity.
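The kind of check described here can be sketched in a few lines. This is an illustration only, not CrowdStrike’s code: `TemplateType`, `validate_inputs`, the template name, and the field counts are all hypothetical, chosen simply to show a mismatch being rejected before any content is processed.

```python
from dataclasses import dataclass
from typing import Sequence


class ContentValidationError(ValueError):
    """Raised when content does not fit the template it claims to use."""


@dataclass(frozen=True)
class TemplateType:
    name: str
    expected_fields: int  # how many input fields rules of this type may reference


def validate_inputs(template: TemplateType, inputs: Sequence[str]) -> None:
    """Reject content whose field count differs from the template definition."""
    if len(inputs) != template.expected_fields:
        raise ContentValidationError(
            f"{template.name}: expected {template.expected_fields} fields, "
            f"got {len(inputs)}"
        )


# Hypothetical template and counts, chosen only to show a mismatch.
template = TemplateType(name="example-template", expected_fields=21)
supplied = ["value"] * 20

try:
    validate_inputs(template, supplied)
    rejected = False
except ContentValidationError as err:
    rejected = True
    message = str(err)

print(rejected, message)
```

Failing loudly at validation time is the design choice that matters: the mismatch is caught where the content is defined, not where it is later read.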
The Domino Effect: Unchecked Data and System Crashes
Because the input data was not validated, the result was a catastrophic failure. When the Content Interpreter read outside the bounds of the array to retrieve data, a memory access violation occurred, which crashed the system. The defect went unnoticed because the test cases used during development did not model the conditions that would reveal it.
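The sketch below models how such a defect can hide from tests. It assumes, for illustration, that a rule skips any field marked as a wildcard, so a test rule made only of wildcards never touches the missing input. The function name, the wildcard convention, and the field counts are assumptions, and Python raises an `IndexError` where kernel code would read invalid memory and crash.

```python
WILDCARD = "*"


def rule_matches(criteria, inputs):
    """Compare each non-wildcard criterion with the input at the same index."""
    for index, pattern in enumerate(criteria):
        if pattern == WILDCARD:
            continue  # wildcard: the input at this index is never read
        if inputs[index] != pattern:  # IndexError if index >= len(inputs)
            return False
    return True


inputs = ["value"] * 20                 # only 20 values are supplied
tested_rule = [WILDCARD] * 21           # test content: every field is a wildcard
shipped_rule = [WILDCARD] * 20 + ["x"]  # new content: the last field must match

# The test passes because the out-of-range input is never read.
test_passed = rule_matches(tested_rule, inputs)

# The shipped rule reads index 20 of a 20-element list and fails.
try:
    rule_matches(shipped_rule, inputs)
    crashed = False
except IndexError:
    crashed = True

print(test_passed, crashed)
```

A test suite that includes at least one rule reading every declared field would have exercised the out-of-range index before release.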
The Role of Signed Drivers: A False Sense of Security
A concerning part of the incident is the role of signed drivers. As important as driver signing is for authenticating the source and protecting integrity, it does not guarantee freedom from vulnerabilities or overlooked bugs. In this case, the CrowdStrike Falcon sensor driver was properly signed, and the flawed Content Interpreter was part of the signed code. The incident is a case for comprehensive security measures that go beyond digital signatures.
This again brings home the need for rigorous testing, robust input validation, and security in layers. Signing of drivers is an important system security measure, but a system cannot depend on it as an independent security mechanism.
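The point that a signature proves origin and integrity, but not correctness, can be shown with a small sketch. It uses an HMAC from the Python standard library as a stand-in for real driver signing, which relies on certificates and asymmetric keys; the key, payload format, and field counts are hypothetical.

```python
import hashlib
import hmac

SIGNING_KEY = b"demo-key-not-a-real-secret"  # hypothetical key for illustration


def sign(payload: bytes) -> str:
    """Produce a signature that binds the payload to the key holder."""
    return hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()


def verify(payload: bytes, signature: str) -> bool:
    """Check that the payload is unmodified and came from the key holder."""
    return hmac.compare_digest(sign(payload), signature)


def is_well_formed(payload: bytes, expected_fields: int) -> bool:
    """Separate content check: does the payload have the expected field count?"""
    return len(payload.split(b",")) == expected_fields


payload = b",".join([b"*"] * 20)  # one field short of what the consumer expects
signature = sign(payload)

signature_ok = verify(payload, signature)
content_ok = is_well_formed(payload, expected_fields=21)

print(signature_ok, content_ok)
```

The signature verifies while the content check fails, which is why signing has to be paired with validation and testing rather than treated as a substitute for them.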
Why were only Windows Users Impacted?
The CrowdStrike Windows outage in July 2024 highlighted some key digital security issues. It also sheds light on the significance of kernel-mode and user-mode operation in an operating system (OS). The event showed that when security software fails, the results can be catastrophic, especially if the failure occurs inside the OS kernel.
Kernel Mode vs. User Mode
Kernel Mode: The kernel is the core of any operating system. Code running in kernel mode has unrestricted access to all hardware and system resources. This access is usually needed to manage hardware and perform low-level system functions. A fault in kernel-mode code can therefore cause system-wide failures, as in the CrowdStrike incident, where the Falcon sensor update caused BSOD errors on Windows machines.
User Mode: In contrast, user mode limits access to system resources. Applications run in a more restricted, sandboxed environment, so a crash in one application is far less likely to bring down the whole system. Notably, the CrowdStrike update did not affect macOS, where the sensor runs mostly in user mode. This shows how user mode can offer an added layer of protection.
The difference between these two modes played out dramatically in the CrowdStrike incident. On Windows, the Falcon sensor runs in kernel mode, so the flawed update caused widespread crashes. The impact also raises the question of how testing and validation should be done when kernel-level operations are involved, since failures there can affect whole systems.
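Process isolation gives a rough feel for the protection user mode provides. The sketch below is an analogy only, not a model of Windows internals: a faulty worker runs in a separate process, it dies from an out-of-range read, and the supervising program carries on. The worker code and variable names are hypothetical.

```python
import subprocess
import sys

# A deliberately faulty worker: it reads index 20 of a 20-element list.
FAULTY_WORKER = "fields = ['a'] * 20\nprint(fields[20])\n"

# Run the worker in its own process, isolated from this one.
result = subprocess.run(
    [sys.executable, "-c", FAULTY_WORKER],
    capture_output=True,
    text=True,
)

worker_crashed = result.returncode != 0
supervisor_alive = True  # reached only because the fault stayed in the child

print(f"worker crashed: {worker_crashed}, supervisor alive: {supervisor_alive}")
```

The same fault inside kernel-mode code has no supervisor to fall back on, which is why it ends in a system-wide crash.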
Linux and macOS Users
Because macOS and Linux largely confine third-party security software to user mode, they were immune to the update’s harmful effects. This shows that a more restricted mode can prevent such failures. It also underscores the need for strong security provisions that take account of the operating configuration, whether kernel mode or user mode.
Reflecting and Learning
The incident at CrowdStrike exposes serious flaws in their testing procedures and update release policies. It reminds us to constantly improve digital security. Businesses should invest in:
Digital Security
- Vulnerability Management: Ensure updates undergo rigorous testing to avoid new vulnerabilities.
- Mode-Specific Security: Consider security implications across different operational modes (e.g., kernel vs. user mode).
- Cross-Platform Strategies: Develop security solutions compatible with various operating systems.
Testing and Policies
- Inadequate Testing: The Falcon Sensor update wasn’t tested in diverse real-world scenarios, leading to global system crashes and questioning the reliability of CrowdStrike’s development process.
- Quality Assurance Gaps: Weak quality assurance processes failed to catch issues pre-deployment, indicating a need for more robust QA standards.
- Internal Policy Flaws: Rapid deployment overshadowed thorough testing, suggesting a need to reevaluate internal policies to prioritize quality over speed.
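One common way to put quality ahead of speed is a staged rollout, in which an update reaches a small group of machines first and proceeds only if they stay healthy. The sketch below is a minimal illustration under stated assumptions: the ring names, host names, failure threshold, and the `deploy` callback are all hypothetical, and it does not describe CrowdStrike’s actual release process.

```python
def staged_rollout(rings, deploy, max_failure_rate=0.01):
    """Deploy ring by ring and halt at the first ring that exceeds the threshold.

    `deploy(host)` returns True if the host is healthy after the update.
    Returns (completed ring names, name of the ring that halted the rollout).
    """
    completed = []
    for name, hosts in rings:
        failures = sum(1 for host in hosts if not deploy(host))
        if hosts and failures / len(hosts) > max_failure_rate:
            return completed, name
        completed.append(name)
    return completed, None


touched = []  # every host that actually received the update


def faulty_update(host):
    """Simulated update that crashes every Windows host."""
    touched.append(host)
    return not host.startswith("win")


rings = [
    ("internal", ["win-lab-1", "win-lab-2", "mac-lab-1"]),
    ("canary", [f"win-canary-{i}" for i in range(50)]),
    ("broad", [f"win-prod-{i}" for i in range(5000)]),
]

completed, halted_at = staged_rollout(rings, faulty_update)
print(completed, halted_at, len(touched))
```

In this simulation the rollout halts at the first ring after touching three lab machines, and the 5,000 production hosts never receive the faulty update.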
Responsibility and Accountability
- Systemic Issues: CrowdStrike’s development cycle lacks stringent practices to identify vulnerabilities, highlighting the need for accountability and improved processes.
- Cultural Shift: CrowdStrike must prioritize security and reliability over speed, starting from leadership, to regain client trust and ensure operational resilience.
Actions for Enterprise Clients
The incident offers lessons for improving organizational resilience:
- Zero-Defect Testing: Thoroughly test updates in diverse environments before deployment to ensure compatibility and stability.
- Incident Response Plans: Regularly update response plans, clearly defining roles and activities for effective incident management.
- Incident Management Best Practices:
— Clear Communication Channels: Establish and practice swift emergency communication networks.
— Crisis Management Teams: Create and regularly drill teams for emergency handling.
— Cybersecurity Insurance: Consider insurance to mitigate losses from breaches or reputational damage.
Invest in Training and Development
Investing in specialized training is essential for strengthening cybersecurity defences. Companies can significantly enhance their security posture by equipping teams with the right knowledge and skills. Courses like those offered by DataCouch on SRE (Site Reliability Engineering) and cybersecurity provide teams with the tools they need to improve testing processes, streamline deployments, and manage incidents more effectively. Continuous learning ensures that organizations stay ahead of emerging threats, making them more resilient to attacks and better prepared to safeguard their assets in an increasingly challenging landscape.