AWS sees 60 services go down in Bahrain – second outage in a month

Have we reached the limit regarding resilience in the hyperscalers’ highly centralised cloud model?

Amazon announced its Amazon Web Services region in Bahrain has been “disrupted” by the conflict in the Middle East – the second time in a month its operations have been affected by the war.

Reuters was first to report the disruption and attributed it to drone activity, but exactly what happened has not been made public. The report quoted AWS as saying it is helping customers migrate to alternative AWS regions while it recovers.

Colin Bannon, formerly CTO of BT Global and now CTO of BT Business, has been talking for several years about the need to rethink and redesign resilience at scale. He does so entertainingly in this interview from 2024: “Customers do not pay cloud providers to have an image of their application in every availability zone, they tend to just put one in a single region and think that’s resilient”.

In this article, he points out that a fault could result in the return journey having a latency of between 200 and 250 milliseconds (ms), by which time the customer’s experience has likely degraded, for example with screens timing out.

This is what led to BT developing its global fabric.

Bannon continues in the article, “The way to fix it has to be additional diversity – and what we’re spending our CapEx on. Being dual homed, via different providers such as Equinix and Digital Realty, into that hyperscaler’s data centre in Korea. We’re doing this for Microsoft, AWS and Google, etc so rerouting is just a street across town with a 5ms failover, not going through the sea four times. That means you have resilience and performance.

“This greater robustness and differentiation brings real quality of experience that the cloud providers can’t fix themselves. We’re identifying needs in the market and solving for them as only service providers can. That’s just one example.”

The hyperscaler model ‘has a ceiling’

David Sherman, Head of Brand Strategy at io.net, a leading, open source, distributed GPU network, also has views on resilience and the hyperscale cloud model: “This week’s AWS disruptions in the Middle East are a reminder of something the industry has been quietly ignoring: the hyperscaler model has a ceiling, and resilience is where you find it. We’ve spent a decade consolidating AI and cloud infrastructure into the hands of a few enormous providers. That concentration made things faster. It also meant that when something goes wrong in one place, the shockwaves are global.”

He adds, “Unfortunately, due to the nature of centralised architecture, 60 services going down at once isn’t a freak event – it’s what centralised architecture looks like under pressure. This model has inbuilt fragility and as AI workloads become more critical the cost of that fragility keeps going up. The assumption has always been that scale equals reliability.

“But concentration and resilience are not the same thing. A network with thousands of independent nodes spread across dozens of geographies doesn’t have a single facility whose loss cascades into a global outage. That architecture exists today. The question is whether the industry is willing to take it seriously before the next disruption forces the conversation.

“The industry built itself around convenience. The conversation now needs to be about what it means to build for the long term, and whether putting that much of the world’s compute in so few hands was ever really a good idea.”
