The Day DNS Broke the Internet: Understanding the October 2025 AWS Outage

20 November 2025

On 19–20 October 2025, Amazon Web Services experienced a failure in the Northern Virginia (us-east-1) region that began as a problem with Domain Name System (DNS) automation and ended up hobbling a long list of cloud services and household-name apps. AWS later explained the incident in a detailed post-event summary, confirming it was not a cyber attack but a fault in how DNS records for DynamoDB were managed. The result was a wave of connection errors, delayed recoveries in related services, and a stark reminder that when DNS falters, the effects can feel global.

What happened

AWS attributes the disruption to a latent race condition inside DynamoDB’s automated DNS management. DynamoDB keeps hundreds of thousands of DNS records up to date to steer traffic to healthy capacity. Two independent “DNS Enactor” processes collided in an unlucky sequence: an older DNS plan briefly overwrote a newer one and was then automatically deleted, leaving the regional endpoint (dynamodb.us-east-1.amazonaws.com) with an empty answer. With no IP addresses to return, resolvers had nothing to hand back, and clients could not establish new connections to DynamoDB in us-east-1. Engineers restored correct DNS data, but internal control-plane systems that also rely on DynamoDB had already begun to backlog. That’s why downstream capabilities such as new EC2 instance launches, network configuration propagation and Network Load Balancer health checks continued to misbehave for hours after the DNS fix.
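
To make the failure mode concrete, here is a deliberately simplified Python sketch of that sequence. The names and data structures are entirely hypothetical and bear no resemblance to AWS’s real automation, which includes many safeguards; the point is only to show how an older plan applied out of order, and then cleaned up, can leave an endpoint with no records at all.

    # Simplified, hypothetical illustration of the race described above.
    class DnsStore:
        """Holds the DNS record set currently published for an endpoint."""
        def __init__(self):
            self.records = {"generation": 0, "ips": ["198.51.100.10"]}

        def apply_plan(self, plan):
            # No check that the incoming plan is newer than what is live:
            # this is the latent race condition.
            self.records = plan

        def delete_plan(self, plan):
            # Cleanup removes records belonging to a stale plan. If that stale
            # plan happens to be the one currently live, the endpoint is left empty.
            if self.records["generation"] == plan["generation"]:
                self.records = {"generation": None, "ips": []}

    store = DnsStore()
    old_plan = {"generation": 1, "ips": ["198.51.100.11"]}
    new_plan = {"generation": 2, "ips": ["198.51.100.12"]}

    store.apply_plan(new_plan)    # one enactor applies the newer plan first...
    store.apply_plan(old_plan)    # ...a delayed enactor applies the older plan over the top...
    store.delete_plan(old_plan)   # ...and cleanup deletes the "stale" plan that is now live.

    print(store.records)          # {'generation': None, 'ips': []} -> empty answer for the endpoint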

Timeline (BST)

  • 07:48 – Event begins: regional DynamoDB endpoint DNS fails; clients can’t establish new connections.
  • 10:25–10:40 – DNS data restored; recovery proceeds as negative answers and caches expire. Primary DynamoDB disruption ends.
  • 10:25–18:36 – EC2 new instance launches impaired; network config propagation backlog builds.
  • 13:30–22:09 – Network Load Balancer (NLB) connection errors due to health-check churn resolve in stages.
  • 21:50 – EC2 fully recovers; AWS reports services operating normally later that day.

What is DNS?

DNS is the internet’s phonebook. When your device asks where a name lives, a recursive resolver (often your ISP’s or a security resolver you configure) chases down the answer from the domain’s authoritative nameservers and then caches that answer for a “time-to-live” (TTL). Caching is why DNS feels fast, but it also explains why fixes can take minutes to show up everywhere: clients will keep using whatever answer or error they’ve cached until the TTL expires. In events like this one, once the correct records are republished, the internet “heals” progressively as caches refresh.
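
You can see answers and their TTLs for yourself with a short query. This sketch uses the third-party dnspython library, and example.com is simply a stand-in for any name you care about.

    # Quick look at a DNS answer and its TTL using dnspython (pip install dnspython).
    import dns.resolver

    answer = dns.resolver.resolve("example.com", "A")

    # The TTL tells caches how long they may keep reusing this answer
    # before asking the authoritative nameservers again.
    print(f"TTL: {answer.rrset.ttl} seconds")
    for record in answer:
        print(f"A record: {record.address}")

Run it twice in quick succession and you will usually see the TTL counting down on the second query, because your resolver is serving the answer from its cache rather than asking the authoritative servers again.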

Why it had such a global impact

It came down to three things, and they all hit at once. First, central dependency: huge numbers of AWS services and customer applications use DynamoDB, and they all need DNS to reach it. When the regional endpoint stopped returning IPs, any workflow that required a fresh connection in us-east-1 began to fail immediately. Second, control-plane coupling: EC2’s launch orchestration and network state propagation both depend on subsystems that talk to DynamoDB, so fixing DNS did not instantly unwind the operational backlogs. Finally, concentration: us-east-1 is AWS’s busiest region and a hub for many third-party apps. That’s why the outage was visible across popular services and even UK institutions, with rolling reports throughout 20 October and confirmation from major outlets that this was an AWS-side DNS problem rather than an attack.

How to prepare for DNS attacks

Although this incident was not malicious, DNS remains a prime target for attackers through DDoS, hijacking, and cache poisoning. A sensible first step for UK organisations is to adopt a Protective DNS (PDNS) resolver for staff devices so connections to known-malicious domains are blocked at resolution time. The National Cyber Security Centre publishes guidance for private-sector organisations on choosing and deploying PDNS, and the broader advice is clear: use vetted resolvers and make DNS visibility part of your security telemetry.
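
As a first piece of that telemetry, a simple spot-check can confirm the protective resolver is actually in the query path and enforcing policy. The resolver address and test domain in this sketch are placeholders to swap for your own PDNS deployment; a PDNS service may block by returning NXDOMAIN or by answering with a sinkhole address.

    # Hypothetical spot-check that a protective DNS resolver is blocking as expected.
    import dns.exception
    import dns.resolver

    pdns = dns.resolver.Resolver(configure=False)
    pdns.nameservers = ["192.0.2.53"]        # placeholder: your protective DNS resolver
    test_domain = "blocked-test.example"     # placeholder: a domain your PDNS policy should block

    try:
        answer = pdns.resolve(test_domain, "A")
        print("Answered:", [r.address for r in answer])   # a sinkhole address also indicates blocking
    except dns.resolver.NXDOMAIN:
        print("Blocked at resolution time (NXDOMAIN)")
    except (dns.resolver.NoAnswer, dns.resolver.NoNameservers, dns.exception.Timeout) as exc:
        print("No usable answer:", exc)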

Strengthening your own domains is equally important. Enable DNSSEC on your zones to add cryptographic integrity to answers and reduce the risk of spoofing, and lock down domain registrar accounts with multi-factor authentication and change alerts. NCSC’s guidance on managing public domain names lays out these controls in practical terms and is written with UK operators in mind.
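
A quick way to confirm DNSSEC is doing its job for a zone is to ask a validating resolver for one of its records and check the AD (Authenticated Data) flag on the response. This dnspython sketch uses example.com as a placeholder; the AD flag is only meaningful when the resolver you query performs DNSSEC validation, as Google’s 8.8.8.8 does.

    # Check whether a validating resolver returns DNSSEC-authenticated answers for a zone.
    import dns.flags
    import dns.message
    import dns.query

    domain = "example.com"      # placeholder: your zone
    resolver_ip = "8.8.8.8"     # a DNSSEC-validating public resolver

    query = dns.message.make_query(domain, "A", want_dnssec=True)
    response = dns.query.udp(query, resolver_ip, timeout=5)

    if response.flags & dns.flags.AD:
        print(f"{domain}: answer validated with DNSSEC (AD flag set)")
    else:
        print(f"{domain}: no DNSSEC validation on this answer")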

Resilience also means avoiding single points of failure. Running your authoritative DNS with more than one provider, often called “secondary” or multi-provider DNS, adds redundancy so that a provider-specific issue does not take your domain down. Modern platforms support automated zone transfers (AXFR/IXFR) and health-checked failover, making dual-provider setups straightforward to run day-to-day. Keep TTLs short on critical records so changes propagate quickly during incidents.
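
A lightweight health check for a multi-provider setup is to compare SOA serial numbers across providers: if they drift apart, zone transfers are lagging or broken. The zone and nameserver names in this sketch are placeholders for your own providers.

    # Compare the SOA serial served by each authoritative provider for the same zone.
    import dns.resolver

    zone = "example.co.uk"                                              # placeholder: your zone
    providers = ["ns1.provider-a.example", "ns1.provider-b.example"]    # placeholders: one NS per provider

    for ns in providers:
        ns_ip = dns.resolver.resolve(ns, "A")[0].address   # look up the nameserver's own address
        direct = dns.resolver.Resolver(configure=False)
        direct.nameservers = [ns_ip]                       # query that nameserver directly
        soa = direct.resolve(zone, "SOA")[0]
        print(f"{ns}: serial {soa.serial}")
    # Mismatched serials suggest zone transfers (AXFR/IXFR) are lagging or broken.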

Finally, secure how clients make DNS queries. Agencies like CISA provide implementation guidance for encrypted DNS (DoH/DoT) in enterprise environments. Properly deployed, this reduces on-path tampering on untrusted networks while still allowing you to apply policy at your chosen resolvers. Pair this with monitoring for spikes in NXDOMAIN and SERVFAIL responses, anomalies in answer counts, and shifts in resolver diversity: signals that often precede user-visible problems.
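
To give a feel for what an encrypted DNS lookup involves, here is a minimal DoH query against Cloudflare’s public JSON endpoint; other resolvers expose similar interfaces, and in an enterprise rollout you would point this at your chosen policy resolver rather than a public one.

    # Resolve a name over DNS-over-HTTPS using Cloudflare's JSON API (requires the requests library).
    import requests

    resp = requests.get(
        "https://cloudflare-dns.com/dns-query",
        params={"name": "example.com", "type": "A"},
        headers={"Accept": "application/dns-json"},
        timeout=5,
    )
    data = resp.json()

    # Status 0 is NOERROR; 2 is SERVFAIL; 3 is NXDOMAIN (DNS RCODE values).
    print("Status:", data.get("Status"))
    for answer in data.get("Answer", []):
        print(answer["name"], "->", answer["data"])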

What ICT Solutions could do to help

We help organisations translate these lessons into operational resilience. Our team can design and implement a dual-provider authoritative DNS architecture, enable DNSSEC, set sensible TTLs, and harden your registrar security with MFA and registry locks. We also roll out Protective DNS for end-users in line with NCSC guidance, giving you actionable visibility into malicious lookups and policy-driven blocking at the resolver. For cloud-hosted services, we run 24×7 monitoring across DNS and critical control-plane signals, with runbooks that cover degraded modes, CDN-hosted maintenance pages, and status-page communications so you can keep customers informed even during brownouts. If your estate runs heavily on AWS, we’ll validate that you deploy multi-AZ by default, assess where multi-region endpoints make sense for the business, and exercise disaster-recovery plans so recovery time objectives are realistic and tested. The goal isn’t to chase perfection; it’s to ensure your services remain reachable, your teams have clear playbooks, and your customers experience a smaller blip the next time the internet’s phonebook has a bad day. Contact us today for a quick chat and we will happily discuss how to protect your business and your data with our cyber security and cloud hosting services.