Why Coinbase Could Not Simply Switch to a Backup During the AWS Outage
-

Coinbase has explained why an AWS data center problem in Northern Virginia produced a 12-hour trading outage rather than a brief disruption absorbed by backup infrastructure. The technical detail is worth understanding for anyone who assumed that a company of Coinbase's scale would have instant failover capabilities. The exchange said its systems are designed to withstand an outage in a single AWS Availability Zone and to recover quickly when one occurs, which is the standard redundancy architecture for major cloud-dependent platforms.

The problem on Thursday was that failures were observed across multiple AWS Availability Zones simultaneously: more than one of the separate data center clusters within AWS's US-EAST-1 region in Northern Virginia was affected at the same time. A design built to route around one failed zone depends on the remaining zones staying healthy. When several zones fail concurrently, that assumption no longer holds, the redundancy cannot fully compensate, and core services go down rather than shifting seamlessly to backup infrastructure.
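
The difference between a single-zone and a multi-zone failure can be shown with a minimal sketch. It assumes a generic majority-quorum rule over three zones, with hypothetical zone names and function names chosen for illustration. It is not a description of Coinbase's or AWS's actual architecture, which neither company has detailed here.

```python
# Illustrative only: hypothetical zone names and a generic majority-quorum rule,
# not Coinbase's or AWS's actual configuration.
ZONES = ("zone-a", "zone-b", "zone-c")


def healthy_zones(zones, failed):
    """Return the zones that are still up, preserving their original order."""
    return [zone for zone in zones if zone not in failed]


def service_available(zones, failed):
    """True while a strict majority of zones is healthy (assumed quorum rule)."""
    return len(healthy_zones(zones, failed)) > len(zones) // 2


if __name__ == "__main__":
    for failed in (set(), {"zone-a"}, {"zone-a", "zone-b"}):
        status = "available" if service_available(ZONES, failed) else "down"
        print(f"failed={sorted(failed)} -> {status}")
```

Under this assumed rule, losing one of three zones leaves two healthy and the service stays available. Losing two leaves only one, the majority is gone, and the service is down.
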

AWS attributed the incident to increased temperatures in the affected data center, which impaired infrastructure and forced it to shift traffic away from the compromised Availability Zone. The physical cause, overheating data center equipment, is a different category of failure from software bugs or network issues: it can spread across physically co-located equipment and may affect more than one redundancy zone before cooling or containment measures take effect. For Coinbase, the failure exposed a gap between its stated resilience design, which tolerates single-zone failures, and the actual conditions of a multi-zone thermal incident. The exchange said it will publish a full post-incident analysis once its investigation and AWS's official retrospective are complete. That analysis should provide more detail on whether architectural changes are planned to address the multi-zone failure scenario that the current design did not fully account for.