On Friday, a whole lot of Microsoft Windows servers and the services running on them went down for a good portion of the morning. You probably weren’t affected much (neither was I), but thousands of corporations and businesses were, including the airline and rail industries, bringing transportation and other services to a standstill.
Needless to say, it was messy and will end up costing the companies affected millions. Messy, expensive technical blunders are fascinating to me and one of the things I think is always worth exploring more. At the risk of sounding like the proverbial Monday morning quarterback, let’s have a look at this one.
Android & Chill
One of the web’s longest-running tech columns, Android & Chill is your Saturday discussion of Android, Google, and all things tech.
While I think the overall blame must be laid at Microsoft’s feet, the Redmond giant didn’t cause this outage. An optional third-party Windows component from CrowdStrike, a Windows security vendor, sent out an update that crashed the low-level systems of the affected computers and sent them into the famous Windows blue screen. The only thing Microsoft did wrong was build a system that allows this to happen, but that is also the most important part of what happened.
That should also be your biggest takeaway from this because the next time it happens—and there will be a next time—you could be affected, and it could be much worse. CrowdStrike may have caused this, but it was Microsoft’s fault.
How does CrowdStrike factor into all of this?
Let’s talk a little more about what CrowdStrike is and why so many big companies use their products. According to the company’s website, CrowdStrike has “redefined security”, securing “the most critical areas of risk – endpoints and cloud workloads, identity, and data.” I am definitely not a Windows security professional but I can recognize a sales pitch when I see one.
I’m sure the software offers an important service. I’m equally sure that the decision to use what CrowdStrike offers is as much a financial one as a technical one, if not more so. Salesmen exist because they are good at selling a good or service, and if the service is legitimate, it’s a lot easier to do.
I have no problem with an entrepreneur finding a way to get the corporate world to buy into their product. I do find two things very concerning here.
Firstly, and most importantly, if CrowdStrike offers something so important, why is it not already a part of Windows Server? Microsoft is one of the biggest, and dare I say best, software companies in the world. If there is a legitimate need for a product like the ones CrowdStrike offers, Microsoft could provide it themselves. With Windows Server licensing being so expensive, it probably should be provided.
My next concern is how an optional piece of software can get such low-level OS access and cripple a machine if it’s corrupt or misconfigured. Microsoft should never allow software from another company to hijack its operating system this way.
This is why I’ll place the blame for this particular outage on Microsoft even though the company did nothing to directly cause it. I’m always going to hold the best companies to higher standards.
Neither of these ideas is crazy or new. I guarantee that engineers at Microsoft knew this could happen, looked at how it could be prevented, and analyzed what the company needed to do to “fix” it. It’s trendy to hate on the company, but Microsoft is one of the best companies in the world when it comes to computing, both at the edge and in the cloud. Even if you’re not a fan of its products, you can easily see this. Critical infrastructure depends on Microsoft because it is so good at what it does.
What about next time?
Enough with the amateur analysis, though. This is all concerning because we got off easy this time. Yes, your flight got canceled if you were traveling today, and maybe you had no cell service on your new phone for a few hours this morning. If you were lucky, you got to slack off instead of work at your office this morning. If you’re unlucky, you get to spend the weekend repairing the damage the outage caused to your IT department.
What if, the next time, the national power grid goes down? Imagine an entire country in the dark for an extended amount of time because of a misconfigured kernel module from a third-party vendor. I know there are multiple fail-safes in place to prevent anything like this, but you should never say never.
More realistically, what if the next global outage affects mobile devices? Forget the inconvenience of Gmail or iMessage going down and instead imagine every Android phone, iPhone, or Surface laptop crapping out for a few hours. It’s easy to say it would be an opportunity to go outside and get some much-needed fresh air, but billions and billions of dollars would be lost, and entire companies would go bankrupt because of it.
I’m certain that incidents like what happened this week are great educational tools and help prevent a more serious incident from happening. I hope the right people—the ones who control the purse strings—use them as a learning opportunity.