TL;DR
- Root Trigger: AWS cooling failure shut down racks, triggering widespread Coinbase service outages throughout buying and selling and account actions.
- Compounding Points: Matching‑engine quorum loss and a silent Kafka management‑aircraft defect considerably prolonged restoration time.
- Subsequent Steps: Coinbase is enhancing cross‑zone resiliency, increasing failover testing, and upgrading Kafka deployments to stop related incidents.
The Could 7 service disruption left prospects throughout retail and institutional platforms going through hours of halted exercise, and the corporate is now outlining how a localized AWS cooling failure escalated into a protracted, multi‑layer outage. Coinbase says the eight‑hour interruption, adopted by a twelve‑hour full‑system restoration window, fell wanting its requirements and warranted a detailed accounting of the technical breakdowns that compounded the incident.
AWS Cooling Failure Triggered Widespread Shutdowns
In response to the corporate, the chain of occasions started at 7:20 PM ET when a number of chiller items failed inside a single information corridor in AWS’s us‑east‑1 area. The cooling loss pressured a thermal‑security shutdown of racks internet hosting EC2 situations and EBS volumes, impacting Coinbase techniques alongside different main web companies. By 7:48 PM ET, practically all buying and selling on the platform had halted, leaving retail customers unable to purchase, promote, ship, obtain, deposit, or withdraw.
Institutional purchasers on Prime additionally noticed order‑routing degradation as Coinbase Trade markets went offline. Restoration progressed inconsistently. The matching engine returned in cancel‑solely mode at 2:25 AM ET on Could 8, full buying and selling resumed at 3:49 AM ET, and coinbase.com and the cell app regained full performance by 9:53 AM ET. Occasion‑streaming backlogs cleared by 2:00 PM ET. Coinbase additionally notified regulators inside required home windows and is finishing formal influence assessments.

Two Architectural Weaknesses Extended the Outage
The corporate recognized two core points that turned a single‑zone AWS occasion right into a multi‑hour platform disruption. First, the matching engine was pinned to a single constructing resulting from its Raft‑based mostly cluster design, which prioritizes low‑latency co‑location. When AWS terminated situations at 9:29 PM ET, three of 5 nodes went down, eliminating quorum. With no automated cross‑zone failover, engineers needed to ship an emergency code change, create a brand new node group, and restore a 3‑of‑5 quorum manually. Second, AWS’s managed Kafka service failed silently.
A defect within the MSK management aircraft prevented computerized partition‑chief reelection, leaving producers unable to put in writing and blocking downstream techniques, together with charges, quoting, ledger parts, funds, and information pipelines. Handbook partition reassignments started at 3:00 AM ET, with precedence subjects restored by 9:30 AM ET. Coinbase says it’s bettering cross‑zone standby design for its matching engines, rising failover testing, collaborating with AWS on the Kafka defect, enhancing inner tooling, and migrating the remaining 2‑AZ Kafka clusters to three‑AZ deployments.

