When AWS’s US-East-1 region went dark in late October, followed just a week later by a Microsoft Azure outage, it was yet another stark reminder that even the world’s biggest cloud vendors are not immune to failure. A simple DNS failure in AWS’s Route 53 rippled outward, knocking out applications, disrupting database services, and reminding us how dependent our tech infrastructure has become on a handful of cloud regions. The Azure outage, attributed to “an inadvertent tenant configuration change,” further underscored the fragility of these systems, demonstrating once again how small changes can have an outsized impact.
With CyberCube estimating that the AWS outage could cost anywhere from $38 million to $581 million, the economic and operational toll can’t be overstated. That’s especially true for small and midsize organizations that lack the resources to absorb multi-hour or multi-day downtime. For many businesses, this latest disruption exposed the hidden cost of cloud centralization: When one region falters, everything can grind to a halt.
Outages are inevitable. Even AWS’s own CTO has said as much: Systems will fail, so they must be architected to expect and withstand failure. Yet too many organizations still design as if the cloud itself is infallible. They assume redundancy, backups, and recovery are baked in automatically and discover far too late that they aren’t.
The good news is that resiliency can be built in before the next failure strikes.
PRE-OUTAGE DIVERSIFICATION: DON’T WAIT FOR THE NEXT OUTAGE
The first line of defense is simple in concept but hard in execution: You must diversify before disaster strikes. Think of it as an investment portfolio. You wouldn’t put all your money into a single account; you’d spread it across a variety of assets to give your investments the best chance of success. In the cloud, this means designing for failure across multiple availability zones or regions. AWS itself recommends doing so in its AWS Well-Architected Framework.
A well-architected system should be able to shift traffic from one region to another (say, US-East-1 to US-West-1) in seconds. Outages rarely take down multiple regions at once, so a multiregion architecture remains one of the most effective defenses against downtime.
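To make that pattern concrete, here is a minimal sketch of one common approach: DNS-based failover between two regional endpoints, written in Python with boto3. The hosted zone ID, domain, and IP addresses are hypothetical placeholders, and a real deployment would also need the application and data layers replicated across regions; this is an illustration of the technique, not a drop-in solution.

```python
# Hypothetical sketch: Route 53 failover routing between two regional endpoints.
# Zone ID, domain, and IPs are placeholders; a real setup also needs IAM
# permissions, health-check tuning, and cross-region data replication.
import boto3

route53 = boto3.client("route53")

ZONE_ID = "Z0000000EXAMPLE"      # hypothetical hosted zone
DOMAIN = "app.example.com"
PRIMARY_IP = "203.0.113.10"      # endpoint in US-East-1 (placeholder)
SECONDARY_IP = "198.51.100.20"   # endpoint in US-West-1 (placeholder)

# Health check that probes the primary endpoint every 30 seconds.
health_check_id = route53.create_health_check(
    CallerReference="primary-endpoint-check-001",  # must be unique per request
    HealthCheckConfig={
        "IPAddress": PRIMARY_IP,
        "Port": 443,
        "Type": "HTTPS",
        "ResourcePath": "/healthz",
        "RequestInterval": 30,
        "FailureThreshold": 3,
    },
)["HealthCheck"]["Id"]

def failover_record(ip, role, check_id=None):
    """Build a PRIMARY or SECONDARY failover record for the domain."""
    record = {
        "Name": DOMAIN,
        "Type": "A",
        "SetIdentifier": f"{DOMAIN}-{role.lower()}",
        "Failover": role,                 # "PRIMARY" or "SECONDARY"
        "TTL": 60,                        # short TTL so clients re-resolve quickly
        "ResourceRecords": [{"Value": ip}],
    }
    if check_id:
        record["HealthCheckId"] = check_id
    return record

# If the primary's health check fails, DNS answers shift to the secondary region.
route53.change_resource_record_sets(
    HostedZoneId=ZONE_ID,
    ChangeBatch={
        "Changes": [
            {"Action": "UPSERT",
             "ResourceRecordSet": failover_record(PRIMARY_IP, "PRIMARY", health_check_id)},
            {"Action": "UPSERT",
             "ResourceRecordSet": failover_record(SECONDARY_IP, "SECONDARY")},
        ]
    },
)
```

The short TTL is the point: when the primary region’s health check fails, clients re-resolve within about a minute and land on the standby region instead of timing out against the failed one.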
TURN TO MULTICLOUD AND ELIMINATE WASTEFUL SPEND
Some organizations take this a step further, distributing workloads across multiple cloud providers. Multicloud designs offer additional resilience, but they bring significant complexity, demand deeper technical skills, and can carry higher costs. The key is to start small, adding redundancy only for your most critical workloads or control planes. Then, once you’ve evaluated the complexity and costs involved, you can expand.
Most companies will find multiregion diversification within a single cloud more practical, but whichever route they choose, the mindset must be the same: Assume something will break, and plan accordingly.
Equally critical is identifying and eliminating wasteful technology spend. Not every workload needs to run in the most expensive, high-availability configuration. Through a proper business impact analysis, organizations can align investments with risk, spending where failure would truly hurt the business, and economizing where they’re able. For smaller firms, this understanding of what’s mission-critical and what can wait to come back online is key to cost-efficient resiliency.
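One way to picture what that analysis produces is a simple mapping from each workload’s tolerance for downtime to a deployment tier. The sketch below is purely illustrative, assuming made-up workloads, recovery-time objectives (RTOs), and tier names; the point is the logic of matching spend to impact, not the specific thresholds.

```python
# Hypothetical sketch: turning a business impact analysis into tiered spend.
# Workloads with tight recovery-time objectives get costly multiregion
# redundancy; everything else runs in a cheaper single-region configuration.
from dataclasses import dataclass

@dataclass
class Workload:
    name: str
    rto_minutes: int        # how long the business can tolerate this being down
    revenue_impacting: bool

def deployment_tier(w: Workload) -> str:
    """Map a workload's impact profile to a deployment pattern."""
    if w.revenue_impacting and w.rto_minutes <= 60:
        return "multiregion, active-active"
    if w.rto_minutes <= 8 * 60:
        return "multiregion, warm standby"
    return "single region, restore from backup"

portfolio = [
    Workload("checkout API", rto_minutes=15, revenue_impacting=True),
    Workload("customer support portal", rto_minutes=4 * 60, revenue_impacting=True),
    Workload("internal reporting", rto_minutes=24 * 60, revenue_impacting=False),
]

for w in portfolio:
    print(f"{w.name}: {deployment_tier(w)}")
```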
BCDR TO MANAGE DATA CENTER AND NETWORK RESILIENCE
If your organization has already diversified across geographic regions or even cloud providers, it’s crucial to recognize that resilience does not end with those infrastructure choices. This is where business continuity and disaster recovery (BCDR) plans come into play. Diversification reduces exposure, but without a tested plan to respond when things go wrong, even the most well-architected environment can falter. When you’re prepared for anything, nothing can faze you.
Whatever your organization’s BCDR plans may be, an easy way to build resilience is to test them regularly. Netflix famously uses a tool it calls Chaos Monkey, which randomly disables production instances to ensure systems can withstand unexpected failures. There’s no telling how or when Chaos Monkey may strike. By intentionally injecting chaos, Netflix forces its teams to build fault-tolerant architectures that recover quickly and continue operating under stress. This is an extreme example.
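The same idea can be illustrated at a far smaller scale. The snippet below is a toy sketch in Python, not Netflix’s actual tool: during a scheduled test window, it picks one instance from a pool that has opted in via a hypothetical tag and stops it, so the team can watch whether the rest of the system keeps serving traffic.

```python
# Toy chaos-test sketch: stop one randomly chosen, opted-in instance during a
# planned "game day" and observe whether the system stays up. The tag name and
# dry-run default are assumptions; this is not Netflix's Chaos Monkey.
import random
import boto3

ec2 = boto3.client("ec2")

def pick_random_instance(tag_key="chaos-eligible", tag_value="true"):
    """Return the ID of one running instance opted in to chaos tests."""
    reservations = ec2.describe_instances(
        Filters=[
            {"Name": f"tag:{tag_key}", "Values": [tag_value]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )["Reservations"]
    instances = [i["InstanceId"] for r in reservations for i in r["Instances"]]
    return random.choice(instances) if instances else None

def run_chaos_test(dry_run=True):
    """Stop one random target; with dry_run=True, only report what would happen."""
    target = pick_random_instance()
    if target is None:
        print("No chaos-eligible instances found; nothing to do.")
        return
    print(f"Chaos target: {target}")
    if not dry_run:
        ec2.stop_instances(InstanceIds=[target])

if __name__ == "__main__":
    run_chaos_test(dry_run=True)  # flip to False only during a planned test window
```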
Smaller organizations can start with once- or twice-yearly tests, refining plans as they grow. Larger organizations may want to run these kinds of tests more frequently, perhaps quarterly, before following in Netflix’s footsteps. Either way, dust off the binder and give that plan an upgrade that accounts for any and every situation.
A FORWARD-LOOKING RESILIENCE MINDSET
Just as we wouldn’t connect a city to the world by a single bridge, we shouldn’t anchor the digital economy to a handful of hyperscaler regions. The recent AWS and Microsoft outages weren’t the first of their kind, and they certainly won’t be the last. The difference next time will be how prepared organizations are.
The hidden cost of centralization isn’t just downtime; it’s the fragility baked into modern digital systems. If you don’t spend up front on architecting for failures and outages, you’ll lose far more over the long run. But with smart architecture and disciplined investment, we can turn past fragility into future resilience and lower costs over time.
The next outage is not a matter of if, but when. The question is, will you be ready or caught flat-footed?
Juan Orlandini is chief technology officer of Insight Enterprises.
