As many of you know, an issue with an update to CrowdStrike, a security platform installed on millions of computers worldwide, caused a global IT outage on Friday, and many organizations (and IT workers) are still feeling the sting of that outage today. The root issue appears to have been a bad configuration update from CrowdStrike that prevented affected systems from booting. We are thankful that none of our clients were directly impacted, but the incident provides an excellent opportunity to “sharpen the saw” and update our internal processes. In this email, I’d like to quickly summarize what happened, some key things we took away from the incident, and some changes we’re making as a result.
What happened?
CrowdStrike released a corrupt configuration update to the CrowdStrike Falcon sensor. When Windows systems tried to load the driver with the corrupt configuration, they were unable to continue, resulting in a Blue Screen of Death (BSOD) error. Since the driver is loaded at boot and reads the (then corrupt) configuration file during startup, systems were left in a non-bootable state until administrators and/or support personnel could either update the configuration or manually remove it. Since Friday, both CrowdStrike and Microsoft have released a number of articles and tools to aid with remediation.
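For reference, the widely published manual fix was to boot the affected system into Safe Mode or the Windows Recovery Environment and delete the corrupted channel file (the file matching C-00000291*.sys in C:\Windows\System32\drivers\CrowdStrike), then reboot. The snippet below is only a minimal sketch of that step in Python; in practice this was done from a command prompt in the recovery environment, and CrowdStrike’s own remediation hub (linked under Additional Information) should be treated as the authoritative procedure.

```python
from pathlib import Path

# Sketch of the widely published manual fix: from Safe Mode or the Windows
# Recovery Environment, delete the corrupted channel file(s) matching
# C-00000291*.sys from the CrowdStrike driver directory, then reboot.
# Requires administrative rights; follow CrowdStrike's official guidance
# rather than relying on this sketch.
DRIVER_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")

for channel_file in DRIVER_DIR.glob("C-00000291*.sys"):
    print(f"Deleting {channel_file}")
    channel_file.unlink()
```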
Some key takeaways
- This does not appear to have been an attack. A number of organizations, understandably concerned in this era of the cyber attack du jour, have asked whether we believe this was an attack. At this point, we have no reason to believe it was anything other than a corrupt configuration update.
- In many cases, the impacted computers and the administrators / support personnel were in very different geographic locations. The failure left impacted systems in a non-bootable state with no network access to perform the repair remotely and no plan B.
- Full drive encryption is becoming more common, either through organizational policy or as an operating system default, so a large number of the impacted systems were encrypted. This is a very good thing but, unfortunately, a number of organizations had either no key management system or a non-functional one in place for recording and storing their encryption keys, and when those keys were needed to repair these systems, they either were not there or did not work.
- We have seen an unexpectedly large number of organizations, including other MSPs, caught completely off guard by this incident, either because they didn’t have a plan or because their plan was so far out of date, or so overly complicated, that it was irrelevant.
How are our processes changing?
- Remote key for MyIT Clients – I do not have an ETA for deployment to all MyIT clients, but we are now working on a “network recovery key” (the name is a work in progress) that we can use with the client to get remote access to non-bootable systems, including workstations, to resolve issues like the ones we saw with the CrowdStrike outage. For a single-client or smaller-scale outage, our primary response would be to go to the client site, but for large-scale outages where that simply isn’t viable, the network recovery key would be used.
- We are updating our MyIT Monthly Maintenance Process to spot-check BitLocker recovery keys (see the sketch after this list).
- We have updated our Triage policy to better accommodate large-scale events. Moving forward, in the case of a large-scale event, MyIT Clients will be prioritized in the following order:
- Emergent Medical
- Critical Infrastructure
- DoD Contractors
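As a rough illustration of the BitLocker spot-check mentioned above, the sketch below shells out to the built-in manage-bde tool to confirm that the system drive has a numerical password (recovery key) protector and prints the protector details so they can be compared against the key recorded in our documentation. The script itself is an assumption for illustration, not part of our maintenance tooling; the actual check will follow our maintenance runbook.

```python
import subprocess

def has_recovery_protector(drive: str = "C:") -> bool:
    """Spot-check that a BitLocker numerical password (recovery key)
    protector exists on the given drive. Relies on the built-in
    manage-bde tool and must be run from an elevated prompt on Windows."""
    result = subprocess.run(
        ["manage-bde", "-protectors", "-get", drive],
        capture_output=True, text=True, check=False,
    )
    # The output lists each protector, including the recovery password ID,
    # which can be compared against the key recorded in our documentation.
    print(result.stdout)
    return "Numerical Password" in result.stdout

if __name__ == "__main__":
    if not has_recovery_protector("C:"):
        print("WARNING: no recovery key protector found on C: - follow up.")
```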
Additional Information
- Nice walkthrough of the problem – https://x.com/Perpetualmaniac/status/1814376668095754753?t=QZnNdAEMWglD6bY8uG8RVw&s=19
- CISA coverage – https://www.cisa.gov/news-events/alerts/2024/07/19/widespread-it-outage-due-crowdstrike-update
- SANS coverage – https://isc.sans.edu/diary/rss/31098
- CrowdStrike hub – https://www.crowdstrike.com/falcon-content-update-remediation-and-guidance-hub/