On 19 July 2024, a routine update by cybersecurity firm CrowdStrike to their Falcon Sensor agent left millions of Windows PC users facing the blue screen of death (BSOD).
Now that the dust has settled on the incident, ORX has spoken to members of the Operational Resilience Working Group (ORWG) to understand their experiences and to gain insight into the actions taken both during the incident and in the days that followed.
In this blog, Emilie Odin, ORX Research Manager for operational resilience, reflects on what they said.
What happened?
The outage, caused by a coding error in the software update, led to disruptions across various sectors, notably global travel, financial services, television broadcasting and healthcare. The event is now being described as one of the largest IT outages in history, if not the largest, affecting millions of Windows users. ORX News subscribers can access a deep dive into the incident on the ORX News website here.
Why does this matter?
Members of the ORWG widely agreed that while operational resilience was already at the top of the agenda at their organisations, the global outage served as a reminder of some of the important aspects of the discipline, including:
- No one/nothing is ‘too big to fail’
- The critical role of testing and being prepared for a wide range of severe scenarios
- The challenges associated with being accountable for third-party services despite having limited control and oversight of them
- The role of crisis management and disaster recovery and the importance of effective communication
What has the incident brought to light?
Turning to the CrowdStrike incident itself, a number of firms praised the effectiveness of their existing controls for rolling out major software updates or patches, such as staggered implementation and pre-rollout testing, which kept them from being directly impacted. A minimal sketch of this kind of control follows.
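As a purely illustrative sketch of the staggered-rollout pattern members described (the ring names, ring sizes, health threshold and simulated failure rate below are our own assumptions, not any member's or vendor's actual process), an update might be gated through progressively larger rings like this:

```python
# Hypothetical sketch: an update must pass pre-rollout tests, then is
# promoted ring by ring, with a health gate stopping promotion before
# the blast radius grows. All names and numbers are illustrative only.

import random
from dataclasses import dataclass

@dataclass
class Ring:
    name: str
    hosts: list[str]

def passes_prerollout_tests(update_id: str) -> bool:
    # Stand-in for lab/canary validation; always passes in this simulation.
    return True

def deploy_and_measure(update_id: str, hosts: list[str]) -> float:
    # Stand-in for pushing the update and observing host health;
    # here we simulate a 99.5% per-host success rate.
    healthy = sum(random.random() < 0.995 for _ in hosts)
    return healthy / len(hosts)

def staggered_rollout(update_id: str, rings: list[Ring],
                      min_healthy: float = 0.99) -> bool:
    if not passes_prerollout_tests(update_id):
        print(f"{update_id}: failed pre-rollout tests; nothing deployed")
        return False
    for ring in rings:
        health = deploy_and_measure(update_id, ring.hosts)
        if health < min_healthy:
            print(f"{update_id}: halted at ring '{ring.name}' "
                  f"({health:.1%} healthy); later rings untouched")
            return False
    print(f"{update_id}: rolled out to all rings")
    return True

if __name__ == "__main__":
    rings = [
        Ring("canary", [f"c{i}" for i in range(20)]),
        Ring("early adopters", [f"e{i}" for i in range(200)]),
        Ring("full fleet", [f"f{i}" for i in range(2000)]),
    ]
    staggered_rollout("update-2024-07-19", rings)
```

The design point is that a bad update is caught while its blast radius is still a single small ring, rather than reaching the whole estate at once.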
However, whether firms were impacted by the incident or not, the outage provided a real-life example of how a systemic risk may materialise. While cloud providers and their associated services are generally considered resilient and recoverable, disruption does happen. Members therefore stressed the importance of understanding how a service provider failure could affect the resilience of their critical/important business services. Achieving this means continuing to build a picture of the impact a single vulnerability can have across the end-to-end delivery of those services.
Scenario testing and simulation exercises may provide an effective way of achieving this, but members flagged scenario testing as an ongoing challenge, with the following questions currently being asked:
- Are we testing sufficiently severe but plausible scenarios to enable us to plan forward?
- Are scenario storylines specific and detailed enough?
- Are we moving towards end-to-end testing to deliver more comprehensive response plans?
- Is testing considering what is important/critical from a resilience point of view?
- Are third parties being involved in scenario testing? If not, how can they be?
How are firms (and regulators) responding?
Participants in the ORWG listed a range of actions and outcomes that came as a direct result of the CrowdStrike incident:
- Conversations with regulators, typically prompted by regulators themselves (regardless of whether organisations were impacted)
  - Some firms have been invited to run specific stress tests in light of the outage, as well as in preparation for upcoming regulatory deadlines (e.g. DORA)
- Broadening the scope away from focusing primarily on cyber incidents and looking at impacts more broadly
  - Considering the difference in recoverability depending on whether an outage is caused by a malicious actor or by a benign event
  - This may provide an opportunity to bring together relevant stakeholders (e.g. crisis management and technology teams) to discuss response plans (e.g. in this scenario, how rollback processes are implemented)
- Renewed focus on addressing concentration risk and digital monocultures
- Revisiting controls and remediation plans to ensure there is a clear understanding of:
  - Who has the mandate to make important decisions when something needs to be done urgently
  - How to interact with third parties that are responsible for managing impacted technology
- Focus on internal communication and collaboration to drive individual and collective accountability
Looking ahead
ORX will continue to work with members to develop the points raised in this article. To access our wide range of operational resilience discussion summaries, visit our Operational Resilience Working Group page, where you can also register to be invited to future sessions (exclusive to ORX members).
ORX members who are keen to learn more about the forthcoming DORA regulatory deadline are invited to participate in our DORA Focused Discussion 2024.