CrowdStrike Outage: What happened, what next.

EPISODE - #60

Published on July 30, 2024

Leading IT podcast episode summary

Josh: Good afternoon, and welcome to episode 60 of the Leading IT Podcast. Today, we’re going to analyze the CrowdStrike outage, and we’ve got Empyrean Cyber Lead David Stevenson with us. We’re going to talk about what happened, the ramifications, and the next steps that people in IT leadership can and should take. Our podcast, if you’re new, is for Australian IT leaders who are looking to stay up to date with the latest news and trends in AI, cybersecurity, cloud and infrastructure, strategy, and leadership. Your hosts are Tom Leyden, the CIO at Longview, and myself, Josh Rubens, CEO at consulting firm Empyrean. We look to tackle the fast-changing IT landscape from both sides of the client-vendor paradigm with pragmatic and actionable advice.

So, good day, guys. Dave, welcome. I think you’re the first guest we’ve had back, actually—maybe the second. Honored nonetheless. Well, it’s good to have you. Tom, how are you going?

Tom: I’m good, I’m good. Luckily, we weren’t scathed by Friday’s events, but I know plenty of people who had to work throughout the night, on the weekend, into Monday. We’ll talk about that in more detail shortly.

Josh: Yes, yes. And we’ve been able to stay warm in fantastic Noran weather.

Tom: Ah, it only gets better from here, right, Josh?

Josh: Yeah, yeah. It definitely can’t get any worse, right?

Tom: No, it definitely can’t get any worse.

Josh: All right. So, Dave, we’re going to bring you into the news. We’re going to keep this episode very, very cybersecurity-centric, given it’s been all AI, all the time, for quite a while. And if AI is what you’re looking for, don’t worry: in the next episode we’re looking to talk about Copilot-enabled business processes and automation, so we’ll get back on that train. But I think, Tom, this incident was big enough to change our strategy for today.

Tom: Absolutely. And I feel like a lot of people have talked about it, but I think what we’re going to do is reflect and take some time to go through it in a bit of detail, and then work out what the hell we would do to avoid this kind of catastrophe again, right?

Josh: Yeah. And what should we do? I think hopefully we’ll have some pragmatic and actionable advice for IT leads out there on what they could do.

We’ll get into the news. So, the first bit of news was that the Microsoft Entra suite was made generally available this month. These solutions were announced last year, but they’re now actually GA. And I think it’s important because they’re a key component of a zero trust strategy. They are Microsoft’s security service edge solution. Security service edge is a subset of SASE, secure access service edge. It’s all cloud-based; these services are all delivered out of the cloud. That includes things like secure web gateways, zero trust network access, CASB, and firewall as a service.

So, to date, the main vendors have been people like Zscaler, Netskope, and Cloudflare. And now Microsoft, as they do, are fast followers. They come in with something that’ll be nicely integrated with everything else and bundled in at a price point that’ll be very attractive for most enterprises. So, just quickly, before I get some commentary: what is now GA is what’s called Entra Private Access. This is zero trust network access to on-prem applications and resources, which would allow you to replace a legacy VPN solution. We’ve got users working everywhere and we need to protect their access. One of the things they’re accessing is on-prem, and with legacy VPNs there’s infrastructure cost, etc. So now they’re providing this Private Access solution. And the other one is called Entra Internet Access. This is for securing access from anywhere to SaaS applications and the internet. So, this is the Zscaler alternative. Between these two and all the other Microsoft stuff, it’s really giving you secure access from anywhere to any application, on-prem or in the cloud, all delivered as a cloud service, right? You don’t need physical devices for any of this stuff. It’s all via cloud, all connected into the Microsoft platform, as you say.

Tom: So, can people now rip up all their, you know, legacy devices in their server comms rooms around the world—around the country? Is that what you’re saying, Josh?

Josh: Pretty much. I mean, if they had some VPN appliances, they could.

Tom: Dave, did you want to comment on those two? There were a couple of other bits that were, that I think were useful, but those are the two main ones. Did you want to comment on the ecosystem and the change here?

David: For a lot of organizations, traditionally, you go and buy your enterprise-grade firewalls, and when you’re sitting behind them, you’re incredibly protected. In the modern age, where we’re primarily a work-from-home or hybrid working environment, so it’s more when users leave that ecosystem, they’re no longer as protected as what they were behind your half-a-million-dollar firewall build. So, by introducing Entra Internet Access, you can start doing sort of that software gateway at the endpoint, protecting users, but also protecting them at the identity level. Yeah, so if we’ve got a whole raft of applications that are specified, we can really specify that these users can access them, those users can’t access them at that identity level within Entra ID. And it’s the same as when we’re looking at Entra Private Access. It’s very similar to, you know, we can dial into a VPN, but once we dialed on that VPN, we’ve primarily got access to everything that’s exposed to the VPN itself. With Entra Private Access, again, we focus on the identity itself. Yeah, and we’ll say, you know, Tom, you do need access to this install system, where I don’t need access to it, but I might need access to, say, our knowledge base, where you won’t need access to that. It also gives us and sort of replaces the always-on VPN technology as well. So, I’m not looking to dial in—I open up my laptop, we’re checking the posture to make sure that I’m secure, and I’ve just got access to those resources. So, the end-user experience goes through the roof as well. Yeah, and also, it’s not like once you’re in, you have access all the time. It’s continually checking your posture. So, it’s continuous, which is one of the principles of zero trust. It’s continually verifying and assuming breach, um, using, you know, AI and ML to continue watching you. So, if you move to another location or to an insecure network, so, you know, it’s providing that sort of, you know, ongoing verification.
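To illustrate the per-identity, continuously evaluated access described above, here is a minimal sketch in Python. It is not Microsoft's implementation or API: the `Session` fields, the `APP_POLICY` table, and the application names are all hypothetical, chosen only to show the idea that every request is checked against identity and current posture rather than trusted once at login.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Session:
    """Hypothetical snapshot of a user's identity and current posture."""
    user: str
    groups: frozenset
    device_compliant: bool
    sign_in_risk_low: bool


# Hypothetical policy table: application -> groups allowed to reach it.
APP_POLICY = {
    "finance-system": frozenset({"finance"}),
    "knowledge-base": frozenset({"support", "engineering"}),
}


def is_allowed(session: Session, app: str) -> bool:
    """Evaluate a single request; intended to run on every access."""
    allowed_groups = APP_POLICY.get(app)
    if allowed_groups is None:
        return False  # default deny for applications with no policy
    if not (session.device_compliant and session.sign_in_risk_low):
        return False  # posture is re-checked each time, not only at login
    # Access is granted per identity (group membership), not per network.
    return bool(session.groups & allowed_groups)
```

In this sketch, a user in the `finance` group can reach `finance-system` but not `knowledge-base`, and loses access to both as soon as the posture flags change, which is the difference from a VPN that exposes everything once the tunnel is up.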

Other parts were something called Entra ID Governance. This is a governance solution that automates the identity and access lifecycle. When a user joins an organization, they’re going to have access to certain things, but that changes if they change roles or when they leave. This solution will automate that whole process, which would be the bane of many IT departments: oh, Tom’s moved to this role, and now he needs access, so there’s a lot of manual stuff. Whereas through ID Governance, I think you create groups, Dave, and then once you move into a different group, your access will change based on the group that you move into.

David: Yeah, exactly. I think there’s been a trend in the industry, over the last two to three years, of putting the power back into HR’s hands. So, as they change where your profile sits within the organization, ID Governance is going to make sure that you’ve got the right access, and remove access you no longer require, depending on your position in the company.
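The joiner, mover, leaver automation described above can be sketched as a simple diff between what a user has and what their role should have. This is a minimal illustration in Python, not how Entra ID Governance is implemented; the `ROLE_ENTITLEMENTS` table, the role names, and the `plan_access_change` function are assumptions made up for the example.

```python
from typing import List, Optional, Set, Tuple

# Hypothetical mapping from an HR-defined role to the access it should carry.
ROLE_ENTITLEMENTS = {
    "finance-analyst": {"finance-system", "email"},
    "support-engineer": {"knowledge-base", "ticketing", "email"},
}


def plan_access_change(
    current_access: Set[str], new_role: Optional[str]
) -> Tuple[List[str], List[str]]:
    """Return (to_grant, to_revoke) for a role change.

    new_role=None models a leaver: everything is revoked.
    An unknown role raises KeyError rather than silently revoking access.
    """
    target = set() if new_role is None else ROLE_ENTITLEMENTS[new_role]
    to_grant = sorted(target - current_access)
    to_revoke = sorted(current_access - target)
    return to_grant, to_revoke
```

When HR moves someone from `finance-analyst` to `support-engineer`, the plan grants the new role's access and revokes only what the new role no longer covers, so shared access such as `email` is left alone.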

Josh: Yeah. So, this is automation. It does use things like Power Automate and Logic Apps. And the other one was ID Protection, which is a managed identity service, so it’s verifying your identity all the time. There’s another one called Verified ID (very confusing, these names), for which I think they’ve released something called Face Check.

David: Face Check, right?

Josh: Yeah, I don’t know what that is. Um, so instead of scanning your driver’s license to validate your identity, you can now prove your identity by scanning with Windows Hello for Business. Ultimately, it’s all about trying to improve the security posture for users while making the end-user experience a lot more friendly. I don’t know if you’ve seen anything else on these Entra solutions or how it might affect any of your clients.

David: No, I think the key part is, as I said, it’s the ongoing verification. So, it’s, um, you’re now validating identity continuously and the end-user experience goes through the roof. And a big part is the risk side of it as well. So, it’s risk management. So, making sure that people don’t get access to resources they shouldn’t be accessing and, therefore, reducing the amount of, um, potential leakage from a security perspective.

Josh: Yeah, I agree. It’s all that continuous verification and identity and conditional access based on risk, rather than the, you know, trust, but verify. And that’s really the key to zero trust, isn’t it?

David: Yeah, exactly.

Josh: Yeah. So, very good. All right. I think that’s enough on that news item. The second bit of news was more a matter of interest. I noticed that a cybersecurity firm that’s been around for four years knocked back a $35 billion offer from Alphabet, Google’s parent company. Just, excuse me, I don’t want $35 billion. These are four blokes who started the company in 2020. I have to say, I mean, I started in 2020, so if anyone came to me with $35 billion, sorry guys, sorry Tom, I’d be out the door. That’s a lot of cash. So, they’re going down the IPO route. They’re the fastest-growing software company in history: they went from a $1 million run rate to a $100 million run rate in 18 months. Wow.

So, if you look at these guys, they’re ex-Unit 8200, the Israeli military’s cyber intelligence team. They had a business that was acquired by Microsoft for many millions of dollars. So, they were at Microsoft, where they set up the Azure security team, and they left Microsoft and started this business in 2020. Um, Dave, I don’t know if you’ve had a look at the product. I don’t know. All I know is it’s basically a cloud security posture management solution. So, I think you use it multiple times. The main life for it is that it’s cross-tenancy, cross-cloud. So, it’s giving you one portal to manage all of your cloud infrastructure and really leaning into that cloud security posture management for you as well.

Well, it’s kind of nice to see that they have pulled out and they’re going to remain as an independent product, but at the same time, Josh, I would have taken it. Dave, if you had it, would that have helped you with the CrowdStrike problems?

Tom: I don’t think so, really. People are saying, would it have helped?

Josh: You reckon?

David: No, not necessarily. Not really. I think it’s so different. I’ll explain to you, Tom, what it does. So basically, it’s agentless scanning across all your cloud resources, whether it’s PaaS, containers, or serverless, looking at all your data. That’s one of the differentiators; you don’t need to install anything. It scans your data across public buckets, databases, and data volumes, and then it has a nice graph which shows all the interdependencies, DU pathways, and then it has cloud detection and incident response as well for any issues that are going on in your cloud environment. So it wouldn’t have helped your Windows endpoints, Tom, I think is the point, right? Yeah, wouldn’t have helped your Windows endpoints. They’re positioning themselves as an independent view of security. You’ve got all these things going on, and (there’s a few others like this) you use it as your one pane of glass, a holistic view of your D in your systems. But it’s very cloud-native. So if you were a company that was very public-cloud-centric, using cloud-native applications, doing a lot of DevOps, Wiz would be what you could look at. And interestingly enough, they’ve got 40% of the Fortune 100 already as customers, so it’s incredible. Morgan Stanley, BMW, Salesforce, Allgate, Blackstone. So, you know, these guys have smashed it. It’s hard to imagine that within four years you could already be turning down that money. Yeah, I’d like to know what the strategy is here. We can talk about that in another podcast. Yes, yes.

Josh: So, all right. So that was the news. Outside of CrowdStrike, there was definitely not a lot happening. It was a struggle to find. But let’s get into the CrowdStrike incident. I’ll do a little bit of preamble, and then Dave, we can get into the tech. So, it’s been called the largest IT outage in history. On July 19th, a single software update from CrowdStrike impacted airlines, banks, and retailers. If you went into Woolies or Coles, you saw it. It was really in your face. Flight cancellations, cargo. So this was as big as you could want. So, Dave, tell us, from a technical point of view, what actually happened here?

David: Yeah, so I guess ultimately, it’s been chalked up as a mistake: CrowdStrike pushed an update out to their agent, which is known as the Falcon Sensor. We often talk about Microsoft Defender as an EDR platform, so endpoint detection and response (or extended detection and response, XDR). CrowdStrike have their Windows version, known as the Falcon Sensor. It’s continuously receiving updates from the CrowdStrike team, to make sure that they’re picking up any new indicators of exposure and that their scanning agent can then address any of those new indicators. It’s picked up an update. That update has then caused a kernel panic within the Windows environment, the kernel being kind of the heart and brain of your Windows operating system. And that’s led into a blue screen of death. I think most people will be very familiar with them, sadly, but essentially, your computer’s going from operational to no longer operational. You’re presented with the blue screen, at which stage you have to restart the computer. The issue being, as the computer’s restarting, it then gets stuck in another blue screen. Microsoft posted on their blog, as information was coming through, that some people were finding that if you turn it off and on 10 to 15 times in a row, it might start up. You keep turning it off and on, and you eventually get past it and get into the operating system. Alternative to that, the remediation actions were quite heavy and required a lot of effort to go through and remediate the endpoints. Yeah, there was a lot of manual work. So, it was reboot in safe mode, delete a particular file entry, reboot. And if the device wasn’t on the network, then the user had to literally bring it in. So that’s a pain in the ass.
So CrowdStrike have said, by the way, their share price dropped 15% on Monday and SentinelOne’s went up 11% on the same day. Well, it’s not good because I actually have CrowdStrike shares, Tom, and quite a decent position. So they’ve, you know, um, although I will say—go on, sorry Tom.

Tom: So can we just talk about the actual incident in a bit of detail? Because, frankly, it’s frightening, right? I think it’s 8.5 million machines, is that right? 8.5 million Windows computers were impacted. I think the release went out, you know, whatever, in the middle of the night US time. Within, what, an hour of that release going out, it was impacting New Zealand’s business hours. From the moment that someone pressed the button on the release to the impact occurring across the globe, that time frame was absolutely frightening. To my mind, that was a really scary takeaway. And yes, there was a fix, kind of. What was it, about two hours later, is that right?

David: They had a fix in an hour and a half. They actually pulled the update about an hour and 15 minutes after pushing it out. So, they had already reverted back. Yeah, the issue is that, given the nature of how CrowdStrike updates, it had already been deployed within that time frame. And then the other aspect of this thing was that the CrowdStrike device hit the kernel, as you said, and that hit right at the core of the Windows operating system. Not just Windows laptops and PCs, but servers as well, right? And I know people had the situation. Your phone was saying servers are out, Windows machines are out, we’re toast. Yeah, so people were faced with that reality. Yeah.

Josh: And the issue is, there’s nothing you could do at your end because these things all get pushed out. And, you know, in 2009, the EU forced Microsoft to give third-party ISVs kernel access because they said it was anti-competitive. By the way, Apple and Linux have said, “Get stuffed” to the EU. But Microsoft, given it’s more of an enterprise organization with enterprise software, was probably more responsive to all the governments in Europe. There’ll be a lot of software at risk there. So they’ve given third-party ISVs access, which is why Macs and Linux machines and servers weren’t impacted. Yeah, Microsoft was impacted. So, yeah, there was not a lot you could do at that time. I mean, we’ll talk about what you can do moving forward to prevent it, but at the time, it’s basically vendors pushing out an update, which they need to do regularly in response to the changing cybersecurity landscape, and it’s hitting straight into the kernel of the Windows operating system and just taking you out. Yeah, out, right? So for someone who’s in control of their network, or is charged by the organization to be in control of the network, you pretty much had no control over this incident. And I think that strikes fear into a lot of people who listen to this show, or who are responsible for making sure this never ever happens.

David: Well, you know, I saw something, right? So there’s basically four key issues there that are fundamental to our industry and the ecosystem. Firstly, we’re giving ISVs direct access to the kernel, which removes the operating system vendor from the trust value chain. So the trust value chain is simply the ISV and the customer. Secondly, the silent updates that are coming out mean you’re 100% reliant on the QA process of the ISV. And then the third point is, what are the checks and balances for the ISV? What’s the accountability? How accountable can you hold CrowdStrike? And then we’ve got the fragile, so-called human-centric Windows stack, unlike the modern network-centric Unix and Linux OSs, that essentially cannot deal with this. So those are, I suppose, the bigger issues, and maybe it’s worth having a conversation, because if you think about China, China wasn’t impacted. China, you know, they don’t give a stuff about Microsoft. They use their own technology. There wouldn’t have been any CrowdStrike, I think. So there are some of those geopolitical considerations as well. So Dave, Tom, what are your thoughts on some of those?

Tom: Well, I mean, the obvious one is maybe, Dave, you jump in.

David: Yeah, look, it’s kind of interesting from the kernel perspective as well. Microsoft has had so many things in the past that have been able to get to kernel level. It’s been around since the early 2000s, when we were getting a lot of viruses and a lot of malware that were getting direct to the kernel. It sort of makes sense from the ISV point of view. They’re building upon an architecture that’s been around since the dawn of time and hasn’t really changed its security principles. So, you know, it’s the good and bad from an ISV point of view. You kind of want CrowdStrike to be able to watch the kernel to make sure that they’re protecting you at the layer that’s going to be attacked. But at the same time, is it time for Microsoft to really re-architect that system and move more towards how Mac has privacy first, and start protecting you from more of that landscape? And, you know, it’s a big change. It’d be interesting to see how this would affect the new Microsoft ARM operating system as well because, again, that’s a bit more of a re-architecture away from the x86 architecture.

Josh: Yeah, yeah, yeah. Look, I think the other issue here is the way they roll it out, right? I really struggle with this idea that they just press a button and it goes straight out. Surely, it should be going out through a process. And I’ve noticed that CrowdStrike have announced that they’re going to test the releases now for process. Oh, thank you. Thank you for doing that. Something we were hoping you were doing already. They said today they’re implementing additional validation checks in their content validator. Well, they’ve used the word validation many times for rapid response content, with new check processes currently in place. So, what would you expect? I mean, quickly, do we think CrowdStrike are going to survive this? Can they survive this?
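The "going out through a process" idea above is commonly implemented as a staged, or ring-based, rollout, where each ring must stay healthy before the update reaches the next. The following is a minimal sketch in Python under that assumption. It does not describe CrowdStrike's actual release pipeline; the ring names and the `deploy` and `healthy` callbacks are hypothetical stand-ins for real deployment and telemetry systems.

```python
from typing import Callable, List, Sequence, Tuple

Ring = Tuple[str, Sequence[str]]  # (ring name, hosts in that ring)


def staged_rollout(
    rings: Sequence[Ring],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
) -> List[str]:
    """Deploy ring by ring and halt at the first ring that fails its health check.

    Returns the names of the rings that completed successfully.
    """
    completed: List[str] = []
    for name, hosts in rings:
        for host in hosts:
            deploy(host)
        # Gate: every host in this ring must report healthy before the
        # update is allowed to reach any later (larger) ring.
        if not all(healthy(host) for host in hosts):
            return completed
        completed.append(name)
    return completed
```

With rings such as a small canary group, then early adopters, then the broad fleet, a bad update that crashes a host in an early ring stops there instead of reaching every machine within the hour.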

Tom: We’ll know soon enough. George Kurtz has been brought in front of Congress, so I think that’s going to really dictate the outcome of a lot of things at the moment. You know, from their share prices, we saw the dip. It’s recovered a little bit, but I think it’s a lot of people just buying the dip ultimately.

Josh: Well, I mean, let’s face it, though. You’re not going to be able to—it’s very hard to sue them directly, right? So if your business is out and you’re going, “Right, lost revenue, who am I going to sue?” It’s actually quite hard.

David: Well, can you? I mean, can you though? My concern is that, you know, as a CrowdStrike—uh, I will state, we are, you know, we’re not a big CrowdStrike partner. We don’t—we don’t sell a lot. I will say that as a caveat up front. We do—we are a partner, but we don’t do a lot. We don’t do a lot with them. Um, so it’s hard, right? Maybe, I mean, they’re going to be in court. They’re going to be tied up in court cases for a long time. I don’t see how they can escape that. And there’s so many big customers and big entities, you know, there’s going to be government suits. They’re going to be taken. I can’t see how they’re going to survive it.

Josh: Well, that’s right. And I think, so, okay, so you can’t go direct, but you go cyber insurance. So you look at insurance and get money back. Well, guess what the cyber insurance people do? They want their money back from CrowdStrike. So I think CrowdStrike are going to be tied up in this world for a long, long time. So, but ultimately, these companies are all about reputation, aren’t they? They’re all about reputation. If they aren’t safe, why the hell have you got them? Yeah, that’s their job.

Josh: Their one job is to keep you safe—not just from hackers but also from their own internal processes. So, it’d be very hard to go up to your board and say, “Hey guys, we’re going with CrowdStrike.” Very difficult conversation. “We’re going with these other options,” that’s a much easier conversation. Yes, so, let’s talk through the scenario, Dave. So, let’s say you’re a CrowdStrike customer—what are you doing?

David: I guess it still comes back to looking at what functions and needs you have with CrowdStrike. I think a lot of people who choose CrowdStrike are going for a full and complete solution; they’re getting the MDR service and all the bells and whistles that CrowdStrike offers. If you’re looking to move away from CrowdStrike or if you’re already evaluating products and CrowdStrike is off the table, realistically their biggest competitor is going to be someone like SentinelOne. They have a similar offering, including MDR service and the tools you need. Alternatively, you might consider Palo Alto as well, though they might not be as price-competitive in the mid-to-enterprise market, especially not in the SMB space. Another option is Microsoft Defender, but you would need to pair it with a provider like Arctic Wolf for managed detection and response. Together, that can offer a very good solution.

Josh: There’s also the consideration of a multi-vendor strategy. If you’re dealing with a mix of different environments—servers, endpoints, IoT—would you recommend mixing vendors a bit?

David: In the enterprise market, that’s quite common. A lot of the big players have minimum seat requirements, so you might need to use different vendors for different parts of your infrastructure. For instance, you might use CrowdStrike for endpoints and another vendor for servers if CrowdStrike can’t service the server needs effectively. It’s about balancing the number of management consoles and ensuring consistency across your policies. A multi-vendor strategy makes sense for larger organizations with complex needs, but for smaller or mid-sized businesses, sticking with one vendor might be more straightforward.

Josh: So, considering this incident, do you think businesses should be re-evaluating their risk management strategies? How do they prepare?

David: Absolutely. This incident highlights the vendor-related risks and underscores the need for a robust strategy. Companies should evaluate their vendors, understand their patching processes, and have a mitigation strategy in place. It’s crucial to be prepared for when issues arise, not just if they do.

Tom: Yes, and adding to that, understanding the processes that your vendors follow for patches and updates is essential. If vendors are pushing patches without adequate testing or processes, it can have significant impacts. Businesses need to have a strategy to manage these risks and ensure they’re not solely dependent on one vendor.

David: Exactly. It’s also important to have incident response and business continuity plans in place. How do you manage and recover from such disruptions? It’s not just about having a plan but also practicing and updating it regularly.

Josh: And what about tabletop exercises and recovery plans? Are they necessary?

David: Yes, they are crucial. Businesses should regularly conduct tabletop exercises to simulate different scenarios and test their response plans. This helps in understanding the impact of incidents and refining the response strategies.

Tom: And as for current trends, besides the CrowdStrike issue, what’s happening in the cyber world?

David: We’re seeing a lot of focus on data governance and AI. Clients are increasingly concerned about data protection, classification, and loss prevention. There’s also a shift away from traditional VPNs towards more secure models like Zero Trust Networks. Additionally, Essential Eight assessments and remediation are gaining traction, especially as organizations try to close gaps in their cybersecurity posture.

Tom: AI is indeed a big topic. Using AI for data classification and protection is becoming more prevalent. It’s a powerful tool for managing and securing information.

David: Exactly. AI can significantly enhance our ability to protect against vulnerabilities and manage data. It’s an exciting development, but it also highlights the need for rigorous processes and safeguards.

Josh: Well, that’s been a comprehensive discussion. If anyone wants to get in touch with us to discuss these topics further, how should they reach out?

David: They can connect with me on LinkedIn—David Stevenson at Empyrean—or email me at DStevenson@empyreanit.com.au.

Josh: Great, thanks, Dave. And thank you to everyone for joining us. We appreciate your support over the years. Stay safe out there!

David: Thank you, Josh. It’s been great being here. Cheers!
