Six measures businesses should take to prevent future outages like the recent Microsoft event

A blue Windows error message caused by the CrowdStrike software update is displayed on a screen in a bus shelter on July 22, 2024 in Washington, DC. Photo: AFP

Published Jul 27, 2024

By Francois Smit

The recent global Microsoft/CrowdStrike outage affected many businesses, from small operators to global supply chains. It could have been worse, so it is important to look at the lessons learned, the need to change systems and the need for transparency.

It is a timely reminder that our “robust” infrastructure isn’t always sufficiently robust, and that concentrating technology in the hands of far too few can have negative consequences. Following this incident, there needs to be greater awareness of our digital resilience, not only in the systems we run but also in globally connected systems and the Internet of Things.

Every business should be asking: does it have all of its eggs in one basket? Does it have fail-safes in case of an emergency?

As we observe the growing adoption of AI, it is noticeable that businesspeople tend to emphasise its capabilities over its potential failures. In our increasingly interconnected and automated world, ensuring business continuity is more crucial than ever.

Businesses can take several proactive measures to prevent and mitigate the impacts of future outages. By following six steps - diversifying IT infrastructure, implementing robust disaster recovery (DR) plans, strengthening cybersecurity, enhancing communication protocols, investing in continuous improvement, and collaborating with technology partners - companies can build a resilient and reliable digital ecosystem. Preparing for potential disruptions ensures continuity, maintains trust, and safeguards business operations in an increasingly interconnected world.

The Microsoft outage has underscored the need for businesses to safeguard their operations against similar disruptions by diversifying their IT infrastructure. Using services from multiple cloud providers avoids total dependency on a single vendor and reduces the risk of widespread disruption, while distributing workloads across different platforms helps ensure continuity if one provider experiences an outage.

For instance, a hybrid cloud solution could involve combining cloud services with on-premises infrastructure to maintain critical operations during cloud service interruptions, while ensuring data is backed up across multiple locations, both on-cloud and on-premises.

This unwelcome event highlights the need to implement robust DR plans with regular updates to adapt to new technologies and emerging threats, complete with DR drills and simulations to test the effectiveness of plans and to identify areas for improvement.

This could include automated fail-over systems to switch operations to backup systems without manual intervention, ensuring minimal downtime, as well as data replication technologies to keep real-time copies of critical data and applications in backup systems. AI and machine learning can also be used to detect and respond to potential threats in real time.
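For readers who want to see what automated fail-over means in practice, the following is a minimal sketch in Python rather than a production design. It assumes each system exposes some kind of health probe. The `Backend` class, the `choose_backend` function and the names `cloud-primary` and `on-prem-backup` are hypothetical and used only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Backend:
    """A hypothetical service endpoint with its own health probe."""
    name: str
    is_healthy: Callable[[], bool]


def choose_backend(backends: Sequence[Backend]) -> Optional[Backend]:
    """Return the first healthy backend in priority order, or None."""
    for backend in backends:
        try:
            if backend.is_healthy():
                return backend
        except Exception:
            # A probe that raises is treated the same as an unhealthy result.
            continue
    return None


# Illustrative setup: the primary is down, the on-premises backup is up.
primary = Backend("cloud-primary", lambda: False)
backup = Backend("on-prem-backup", lambda: True)

active = choose_backend([primary, backup])
print(active.name if active else "no healthy backend")
```

In this sketch the switch to the backup happens without anyone intervening, simply because the primary fails its probe. A real deployment would typically also need to handle data synchronisation and the return to the primary once it recovers.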

The outage underscored vulnerabilities in critical infrastructure, prompting calls for stronger cybersecurity measures. Future outages of a larger scale could have devastating economic consequences, emphasising the need for proactive measures.

Communication within companies is equally important. Businesses should develop clear communication protocols to ensure timely and accurate information during outages, keeping all stakeholders, including employees, customers and partners/vendors, informed about the status and resolution steps.

It is important to foster a culture of transparency where users are informed about potential risks and mitigation measures, and ensure those users are educated about contingency plans and how they can stay informed and prepared during outages.

In addition to these six steps, I would encourage businesses to conduct thorough post-incident reviews to gather insights and improve future preparedness, and to collect feedback from users to understand their concerns and improve communication and support systems.

The Microsoft outage should serve as a wake-up call to a wide cross-section of society. Many small and medium enterprises (SMEs) experienced operational disruptions, leading to decreased productivity and customer dissatisfaction. Consumers relying on Microsoft services for daily activities faced inconveniences, affecting work-from-home setups and online learning. Global firms with integrated systems saw interruptions in their workflows, causing delays in projects and communications.

But most severely affected were critical sectors: healthcare, finance and others relying on Microsoft’s cloud services faced heightened risks and operational challenges.

Companies dependent on Microsoft services experienced significant revenue losses due to halted operations, as well as additional costs incurred in deploying alternative solutions or compensating for the downtime.

There are also indirect costs that businesses should factor in: the erosion of trust among users could lead to long-term financial impacts as customers seek more reliable alternatives.

Businesses were affected to different extents based on their dependency levels. Companies with highly integrated Microsoft ecosystems faced more significant disruptions than those with diversified IT solutions, while entities with robust backup and recovery systems managed to mitigate the impact better.

Strengthening the digital supply chain to withstand similar outages through diversification and redundancy is one of the major lessons we must learn, and it will require collaboration between tech companies to develop more resilient infrastructure.

Governments and regulatory bodies may need to enforce stricter guidelines on service continuity for major tech firms.

By learning from this incident and implementing strategic changes, companies and users can better prepare for and mitigate the impacts of future outages. Building a more secure and reliable digital ecosystem is essential for maintaining trust and ensuring continuity in our increasingly digital world.

Francois Smit is a senior lecturer in IT at Belgium Campus iTversity.

BUSINESS REPORT