CrowdStrike debacle underlines single-point-of-failure risk

A blue Windows error message caused by the CrowdStrike software update is displayed on a screen in a bus shelter in Washington, D.C. Justin Sullivan via Getty Images

By Alan R. Shark

August 15, 2024

COMMENTARY | As our dependency on technology and energy increases, state and local leaders need to take a hard look at their disaster recovery and business plans.

At the height of the summer travel season last month, thousands of flights worldwide were halted and hundreds of thousands of travelers were stranded for days, all because of a faulty software update that the cybersecurity firm CrowdStrike pushed to a seemingly secure area within Microsoft operating systems. The update affected around 8.5 million Windows devices globally. Although those devices represented less than 1% of all Windows machines, the disruption was significant because of the critical services it upended. Aside from airlines, the outage hit federal, state and local government entities.

The CrowdStrike debacle is far from the only example of technology failing. Just a few weeks later, the District of Columbia’s 911 Emergency Communications Center, the nation’s fourth largest by volume, was knocked offline for up to six hours because of a faulty software update.

Given the exponential growth in the use of technology permeating every aspect of business, government and our personal lives, we have become proportionally more vulnerable to catastrophic failure. Faulty software updates are but one example. Other forces threaten to disrupt a rather fragile digital infrastructure, such as aging power networks and the growing risk of extreme weather like flooding, heat and high winds knocking out the electric grid.

Take, for instance, the Great New York City Blackout of 1977, which was caused by a series of lightning strikes that overwhelmed the grid. Or consider the Northeast Blackout of 2003, the largest blackout in North American history. It affected millions of people across the Northeast and Midwest, and was caused by a combination of factors, including a failure to address overloaded power lines and a series of equipment failures. The West and Southwest have also had their share of power disruptions. The Blackout of 2011, the largest in California’s history, was caused by a technician’s error. For around 12 hours, 2.7 million Americans had no electricity.

To add to the collective worry facing IT leaders, most state and local governments have increased their reliance on third-party vendors as they seek ways to reduce costs, supplement staff expertise and hopefully gain services they could not alone afford to provide. This trend toward greater dependence on “outside” expertise is itself a challenge.

All of this leads to what every technology leader worries about the most: a single point of failure, which even has its own acronym, SPOF.

Here are some common examples of SPOFs in technology systems:

One server running a critical application: If the server fails, the entire application becomes unavailable.

A lone network switch: If the switch connects multiple servers and fails, all those servers become inaccessible.

A single Internet service provider: Relying on just one for internet connectivity can lead to complete loss of internet access if that provider experiences an outage.

Single power source: Having only one power supply for critical equipment can lead to system-wide failure if that power source goes down.

One database: If a critical database is not replicated and fails, it can bring down all applications and services that depend on it.

A lone storage device: Relying on a single storage device or drive for important data without backups creates a SPOF.

Single network connection: Having only one network link between critical parts of the infrastructure can lead to isolation if that link fails.

A single firewall: If only one firewall protects the network and fails, the entire network becomes vulnerable.

One domain controller: In Windows environments, having only one domain controller can cause authentication and policy issues if it fails.

A solitary load balancer: If all traffic is routed through a single load balancer and it fails, it can disrupt access to all backend services.

Single cooling system: In data centers, relying on a single cooling system can lead to overheating and system shutdowns if it malfunctions.

One administrator or subject matter expert: When only one person knows how to manage or troubleshoot a critical system, their unavailability can become a SPOF.

A single vendor responsible for “everything”: A dependency on one vendor can lead to unexpected failure if the vendor itself faces a failure in an internal system or the execution of a standard operating procedure.
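The examples above share one pattern: a critical function with only one thing standing behind it. A minimal sketch of that check in Python follows, assuming a jurisdiction keeps a simple inventory that maps each critical function to the components able to provide it. The inventory contents and the `find_spofs` name are hypothetical and purely illustrative.

```python
from typing import Dict, List

# Hypothetical inventory: each critical function mapped to the
# components (or people, or vendors) that can currently provide it.
inventory: Dict[str, List[str]] = {
    "application server": ["app-01"],
    "network switch": ["sw-a", "sw-b"],
    "internet provider": ["isp-1"],
    "database": ["db-primary", "db-replica"],
    "domain controller": ["dc-01"],
    "system administrator": ["alice"],
}


def find_spofs(inventory: Dict[str, List[str]]) -> List[str]:
    """Return, in alphabetical order, every function backed by fewer
    than two distinct providers, i.e. a single point of failure."""
    return sorted(
        function
        for function, providers in inventory.items()
        if len(set(providers)) < 2
    )


for function in find_spofs(inventory):
    print(f"SPOF: {function}")
```

A real assessment would also need to capture shared dependencies, such as two servers on the same power circuit, which a flat inventory like this one cannot see.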

To mitigate these risks and more, governmental organizations should implement redundancy and failover mechanisms, and distribute critical components across multiple systems or locations. While no responsible tech leader would disagree with these approaches, many lament that, in practice, they often fall short in this area of preparedness. This is why regular risk assessments and system audits are always recommended to help identify potential SPOFs before actual problems arise. Perhaps the old adage is appropriate here: “Don’t place all your eggs in one basket.”

Redundancy and failover are the operative words when it comes to SPOFs. An example of this is server clustering. This involves multiple servers working together to provide the same service. If one server in the cluster fails, another server can take over its workload seamlessly, ensuring continuous availability of applications and services. This redundancy helps prevent downtime and data loss that could occur if a single server fails. Other approaches include shared storage with redundancy, network redundancy and geographical distribution.
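The failover idea behind server clustering can be sketched in a few lines of Python. This is a simplified illustration, not how any particular clustering product works: the node names, the `failed` set and the `first_healthy` function are assumptions made up for the example, and a real cluster would rely on health checks and coordination between nodes.

```python
from typing import Callable, List, Optional


def first_healthy(
    servers: List[str], is_healthy: Callable[[str], bool]
) -> Optional[str]:
    """Return the first server that passes the health check,
    or None if every server in the cluster is down."""
    for server in servers:
        if is_healthy(server):
            return server
    return None


# Hypothetical three-node cluster in which the primary has failed.
cluster = ["node-a", "node-b", "node-c"]
failed = {"node-a"}

# Work moves to the next healthy node instead of stopping.
active = first_healthy(cluster, lambda server: server not in failed)
print(active)
```

With only one node in the list, the same function has nowhere to go when that node fails, which is the single-point-of-failure case in miniature.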

We are only beginning to learn some important lessons from the CrowdStrike debacle. But clearly, over-dependence on and “overtrust” in one vendor is a paramount lesson. This event is just another “wake-up call” that IT leaders and managers need to do a better job of planning, testing, documenting, training and conducting lifelike simulations. As our dependency on technology and energy increases, so does the need to actively reexamine disaster recovery and continuity of business operations plans. Such plans must be tested and updated and, as importantly, practiced.

At this moment, we are in the throes of a summer that has brought several record-breaking storms; it is hurricane season, and our power grids are at capacity. When was the last time your jurisdiction did an SPOF analysis?

Dr. Alan R. Shark is the executive director of the Public Technology Institute (PTI) and Associate Professor for the Schar School of Policy and Government, George Mason University, where he is also an affiliate faculty member at the Center for Advancing Human-Machine Partnership (CAHMP). Shark is a National Academy of Public Administration Fellow and Co-Chair of the Standing Panel on Technology Leadership. Shark also hosts the bi-monthly podcast Sharkbytes.net. Dr. Shark acknowledges collaboration with generative AI in developing certain materials.
