Single Point of Failure: Global IT Outage Underscores Over-Reliance on Microsoft by Corporates

Perhaps the most alarming takeaway from yesterday’s global IT outage is the realization that a single update, such as the faulty CrowdStrike update blamed for it, could bring down corporate systems the world over. As a result of just a few lines of faulty code in the update, supermarkets, banks, hospitals, and even airlines were crippled, and their staff were left scrambling to find and fix the problem.

Affected computers’ screens went blue and showed a recovery message following the IT outage. Source: CD.

A Single Point of Failure: Sending the Wrong Message to Hackers

When a huge number of the world’s corporate entities rely on a single system to run their operations and do business, they expose themselves to serious danger, because that system becomes a single point of failure. Even worse, the global IT outage tells hackers that they can target one system and take down corporate systems the world over.

What hackers see in events such as yesterday’s global IT outage is that corporations the world over are over-reliant on Microsoft systems and suites such as Microsoft 365. To them, this is a huge opportunity: they now know there is a single point of failure that they could exploit to bring down corporate systems worldwide.

How Microsoft Failed: “Largest IT Outage in History” a Huge Risk Going into the Future

Yesterday’s outage has been termed the “largest IT outage in history,” and that sends shivers down my spine. It is shocking that in 2024 a few lines of code can cause the largest IT outage in history. How on earth can a company such as Microsoft not have protections in place to prevent something like this?

The fix going forward is that Microsoft needs to have better policies to roll back defective drivers and not just raw dog risky updates to customers.
Crowdstrike will likely promote their code safety officer to put in code sanitization tools that will catch this automatically.
— Zach Vorhies / Google Whistleblower (@Perpetualmaniac) July 19, 2024

What is more, how can corporations the world over put all their trust and operations in this one suite or company [Microsoft]? How can hospitals, banks, supermarkets, and airlines bet on this one horse and place all their eggs in one basket?

A hospital in Michigan affected by the CrowdStrike outage. Hospitals’ operations were curtailed and surgeries could not be performed during the outage. Source: DFP.

How CrowdStrike Failed

As I understand it, CrowdStrike also failed to put adequate checks on its code in place, and pushed a faulty update knowing very well that its updates could affect millions of devices. Microsoft, for its part, should have done better and not simply allowed a security program to load faulty code without vetting that code itself.

And Crowdstrike will likely take a hard look at rewriting their system driver from what it currently is, C++ to a more modern language like Rust, which doesn’t have this problem.
— Zach Vorhies / Google Whistleblower (@Perpetualmaniac) July 19, 2024
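One common way to limit the damage of a bad update is a staged (canary) rollout: ship to a small slice of devices first, check their health, and halt and roll back before the update reaches everyone. The following is only a minimal Python sketch of that idea. The names `staged_rollout`, `install`, and `rollback`, the stage sizes, and the failure threshold are all hypothetical and do not describe CrowdStrike’s or Microsoft’s actual release process.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    devices: Sequence[str],
    install: Callable[[str], bool],
    rollback: Callable[[str], None],
    stage_fractions: Sequence[float] = (0.01, 0.10, 1.0),
    max_failure_rate: float = 0.02,
) -> bool:
    """Push an update in growing stages; halt and roll back on too many failures.

    Every name and threshold here is a hypothetical illustration.
    `install` must return True if the device is healthy after the update.
    """
    updated: List[str] = []
    done = 0  # number of devices already processed
    for fraction in stage_fractions:
        # Cumulative number of devices that should be covered after this stage.
        target = max(done + 1, int(len(devices) * fraction))
        target = min(target, len(devices))
        batch = devices[done:target]
        if not batch:
            continue
        failures = 0
        for device in batch:
            updated.append(device)
            if not install(device):
                failures += 1
        done = target
        if failures / len(batch) > max_failure_rate:
            # Stop before the next, larger stage and undo what was shipped.
            for device in updated:
                rollback(device)
            return False
    return True
```

With the default stages and a fleet of 1,000 devices, an update that breaks every machine it touches would be stopped after the first 10 devices rather than reaching the whole fleet.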


Microsoft Blog Post: “8.5 Million Windows Devices Affected”

Microsoft, via a blog post, has said, “We currently estimate that CrowdStrike’s update affected 8.5 million Windows devices, or less than one percent of all Windows machines.” Can you imagine that? 8.5 million devices were brought down by a sloppy update. The post, from David Weston, a cybersecurity executive at Microsoft, went on to say, “While the percentage was small, the broad economic and societal impacts reflect the use of CrowdStrike by enterprises that run many critical services.” Critical services that could be targeted, I must say.

CrowdStrike CEO: “A Fix Has Been Deployed”

CrowdStrike CEO George Kurtz said, “This is not a security incident or cyberattack. The issue has been identified, isolated and a fix has been deployed.” The thing is, though, this was an attack on the confidence people will have in IT systems such as Microsoft Azure, and in the companies behind them, going forward.

The Human Cost: Thousands Stranded in Airports

We could talk about all the monetary losses, but I want to talk about the human suffering the outage has caused. First, hundreds if not thousands of airline passengers have spent the night in airports. These are families and individuals, including children, the elderly, and the sick, who have been forced to camp out in airports just because the airlines did not have backups in place.

Stranded passengers at an airport following the global IT outage. Most passengers were forced to sleep at the airport. Source: BI.

A Shame that Even Hospitals Lacked Alternative Systems or Analogue Backups

Second, the same can be said for hospitals. Had this been a hack rather than a faulty update, millions of patients’ health records could have been leaked, all because hospitals were not security-conscious enough to know that you never rely on a single system in IT. Moreover, you should never be without an analog backup of everything. A solar storm, for example, could cause a global electricity outage; would that mean that people won’t be treated because the hospital cannot operate without its digital systems?

Possible Solutions Going Forward

Interoperable Alternative Systems

First and foremost, I think every business, individual, hospital, or airline using Microsoft’s operating system (OS) and products such as Microsoft Azure and 365 should immediately sign up for an alternative program or OS alongside the Microsoft one. With an additional tech service provider whose system is interoperable with Microsoft’s, they can switch to the other system when one goes down, and reconcile the data later.
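The “use the other system, then reconcile later” idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `FailoverWriter` class, the two writer callables, and the use of `ConnectionError` to signal that the primary is down are all hypothetical, and a real setup would also have to deal with conflicting edits and duplicate records.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]


class FailoverWriter:
    """Write to a primary system, fall back to a secondary, reconcile later.

    Hypothetical sketch: `primary` and `secondary` are any callables that
    store a record, and the primary raises ConnectionError when it is down.
    """

    def __init__(
        self,
        primary: Callable[[Record], None],
        secondary: Callable[[Record], None],
    ) -> None:
        self.primary = primary
        self.secondary = secondary
        self.pending: List[Record] = []  # records the primary has not seen yet

    def write(self, record: Record) -> str:
        try:
            self.primary(record)
            return "primary"
        except ConnectionError:
            # Primary is down: keep working on the secondary system and
            # remember the record so it can be replayed later.
            self.secondary(record)
            self.pending.append(record)
            return "secondary"

    def reconcile(self) -> int:
        """Replay queued records into the primary; return how many were sent."""
        replayed = 0
        while self.pending:
            self.primary(self.pending[0])  # raises if the primary is still down
            self.pending.pop(0)
            replayed += 1
        return replayed
```

A record is removed from the queue only after the primary has accepted it, so a reconciliation attempt that fails partway through loses nothing.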


System Backups and Analogue Backups

Second, every business and entity should always have backups of their systems, and backups of those backups stored in remote locations, perhaps even on a cloud server, so that no single copy becomes a point of failure.
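A backup is only useful if it actually matches the original, so copies are usually verified after they are made. Below is a minimal Python sketch that copies a file to several locations and checks each copy against a SHA-256 checksum. The function names and the idea of passing a list of backup directories are assumptions made for illustration; real backup tooling also handles scheduling, encryption, and retention.

```python
import hashlib
import shutil
from pathlib import Path
from typing import List


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_and_verify(source: Path, backup_dirs: List[Path]) -> List[Path]:
    """Copy `source` into every backup directory and verify each copy.

    Hypothetical helper: raises OSError if any copy differs from the original.
    """
    expected = sha256_of(source)
    copies: List[Path] = []
    for directory in backup_dirs:
        directory.mkdir(parents=True, exist_ok=True)
        copy = directory / source.name
        shutil.copy2(source, copy)  # copies the contents and file metadata
        if sha256_of(copy) != expected:
            raise OSError(f"backup at {copy} does not match {source}")
        copies.append(copy)
    return copies
```

In practice the directories passed in would sit on different drives or at different sites, for example a local disk plus a mounted remote share.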

Several drives connected to a computer. Backups and backups of backups are hugely important for any business. Photo by Markus Spiske / Unsplash

Third, every entity should have an analog backup of its data, records, and files. For instance, a hospital should keep analog patient health records that it can use to deliver care when the digital system is down or during something like an electrical outage. An analog backup could also be an analog system of operations that works without computers and can later be reconciled with the digital system once normalcy is restored.

Has the IT Outage been Fixed?

Unfortunately, even now there are businesses, banks, supermarkets, airlines, and hospitals still trying to recover from the outage. What happened is that CrowdStrike’s faulty update forced computers to crash and shut down in a way that meant they could not easily be started again. Here’s an X thread explaining that:

Crowdstrike Analysis:
It was a NULL pointer from the memory unsafe C++ language.
Since I am a professional C++ programmer, let me decode this stack trace dump for you. pic.twitter.com/uUkXB2A8rm
— Zach Vorhies / Google Whistleblower (@Perpetualmaniac) July 19, 2024

Essentially, according to the thread, a single line of code that referenced something that did not exist (a NULL pointer) crashed the computers at the driver level, leaving them displaying an error message and unable to start up normally. Additionally, these computers are difficult to repair remotely, so the affected companies will need their IT administrators to work out how to get their systems booting again.
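The usual defence against this class of bug, code reading a value that is not actually there, is to validate input before using it. Python cannot reproduce a driver-level crash, so the sketch below only illustrates the validate-before-use idea. The comma-separated record layout, `EXPECTED_FIELDS`, `parse_rule`, and `load_rules` are hypothetical and do not reflect CrowdStrike’s actual file format or driver code.

```python
from typing import List, Optional

EXPECTED_FIELDS = 3  # hypothetical layout: name, pattern, action


def parse_rule(line: str) -> Optional[List[str]]:
    """Return the rule's fields, or None if the line is malformed."""
    fields = [part.strip() for part in line.split(",")]
    # Reject records with the wrong number of fields or with empty fields,
    # rather than indexing into data that may not exist.
    if len(fields) != EXPECTED_FIELDS or any(not field for field in fields):
        return None
    return fields


def load_rules(lines: List[str]) -> List[List[str]]:
    """Keep valid rules and skip malformed ones instead of failing on them."""
    rules: List[List[str]] = []
    for line in lines:
        rule = parse_rule(line)
        if rule is not None:
            rules.append(rule)
    return rules
```

The design choice here is that a malformed record is dropped and the rest of the file is still loaded, so one bad entry cannot take the whole system down.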

The Lesson: Stay Prepared, Don’t Rely on IT Service Providers Too Much

I think it is a shame that so many entities, including Microsoft itself, were so unprepared for something like this. The good thing, though, is that the vulnerabilities have been exposed, and we hope that both CrowdStrike and Microsoft will do better next time. Corporate entities should not wait for them to do better, however; they should bolster their own systems first and put mechanisms in place to ensure that they do not fall victim to the failures of their IT service providers.
