
On October 20, 2025, a problem in a single AWS region, us-east-1 in Northern Virginia, caused a massive chain reaction that took down parts of the internet. Apps and services people use every day, like Snapchat, Reddit, Venmo, Coinbase, Zoom, Signal, Roblox, Fortnite, and even Amazon’s own services like Prime Video and Alexa, were disrupted.
Amazon Web Services (AWS) explained that the disruption started with a DNS-related failure connected to DynamoDB, one of AWS’s core database services, in us-east-1. This DNS failure then cascaded into other parts of AWS and affected many other services that depend on DynamoDB behind the scenes.
According to AWS, customers began seeing high error rates when calling DynamoDB APIs in the us-east-1 region.
Long story short: software that expected DynamoDB to respond started getting “something is wrong” responses instead. Because many AWS services rely on DynamoDB internally, and many customer apps also rely on it directly, those errors spread far beyond the database itself.
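To make that concrete, here is a minimal sketch (not AWS’s or any particular customer’s code) of a routine DynamoDB call in Python with boto3, and the kind of failure clients started seeing when the endpoint became unreachable. The table name and key are made up for illustration.

```python
import boto3
from botocore.exceptions import ClientError, EndpointConnectionError

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

try:
    response = dynamodb.get_item(
        TableName="example-table",        # hypothetical table name
        Key={"pk": {"S": "user#123"}},    # hypothetical key
    )
    print(response.get("Item"))
except EndpointConnectionError as err:
    # The SDK could not reach dynamodb.us-east-1.amazonaws.com at all,
    # which is what a DNS resolution failure looks like from the client side.
    print(f"Could not reach the DynamoDB endpoint: {err}")
except ClientError as err:
    # The service answered, but with an error response.
    print(f"DynamoDB returned an error: {err.response['Error']['Code']}")
```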
AWS said early in the incident that “the issue appears to be related to DNS resolution of the DynamoDB API endpoint in US-EAST-1,” and they advised customers to try flushing DNS caches if they were still having trouble reaching the service.
By later in the day, AWS reported the outage was mitigated and services were recovering. But by then the impact had already hit thousands of businesses worldwide.
So yes: one regional problem in Virginia rippled out to the rest of the world.
Why? Two words: DNS and us-east-1.
DNS (Domain Name System) is basically the internet’s address book.
When an app wants to talk to a service — say dynamodb.us-east-1.amazonaws.com — it first asks DNS: “What IP address should I talk to?” DNS answers with an IP, and then the app connects.
If DNS can’t answer, or answers with the wrong thing, the app can’t reach the service — even if the service itself is technically healthy.
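You can see that first step for yourself with nothing more than Python’s standard library. This is just an illustration of the lookup an application performs before connecting, not anything specific to AWS’s internals.

```python
import socket

endpoint = "dynamodb.us-east-1.amazonaws.com"

try:
    # Ask DNS: what IP addresses should I talk to for this name?
    addresses = {info[4][0] for info in socket.getaddrinfo(endpoint, 443)}
    print(f"{endpoint} resolves to: {addresses}")
except socket.gaierror as err:
    # This is roughly what applications experienced during the incident:
    # the name could not be resolved, so the service was unreachable
    # even though the servers behind it were healthy.
    print(f"DNS resolution failed for {endpoint}: {err}")
```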
During this incident, the problem wasn’t “DynamoDB crashed and disappeared.” According to AWS’s post-incident explanation, the issue was that the DNS system responsible for telling clients where DynamoDB lives returned bad or empty data for the DynamoDB endpoint in us-east-1. In effect, software couldn’t find DynamoDB anymore.
You can think of it like this: DynamoDB was still standing, but its entry had effectively vanished from the internet’s address book. The building was fine; nobody could look up its address. That is how a DNS problem becomes an internet-scale problem fast.

From AWS’s public statements and reporting, here are the takeaways that are directly supported by what happened and by how AWS described it.
When DNS is wrong, everything that depends on it is effectively offline even if the underlying servers are fine. AWS itself explicitly tied the outage to DNS resolution of the DynamoDB API endpoint in us-east-1.
If you run critical services, you should treat DNS like a first-class reliability component, not an afterthought.
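What does “first-class” look like in practice? At minimum, actively checking that your critical endpoints still resolve, instead of assuming they do. Below is a hedged sketch of such a check in Python; the endpoint list and the alert hook are placeholders you would replace with your own dependencies and monitoring system.

```python
import socket
import time

# Hypothetical list of names your stack cannot live without.
CRITICAL_ENDPOINTS = [
    "dynamodb.us-east-1.amazonaws.com",
    "api.example.internal",
]

def resolves(hostname: str) -> bool:
    """Return True if the name currently resolves to at least one address."""
    try:
        return len(socket.getaddrinfo(hostname, 443)) > 0
    except socket.gaierror:
        return False

def alert(message: str) -> None:
    # Placeholder: wire this to Slack, PagerDuty, CloudWatch alarms, etc.
    print(f"ALERT: {message}")

while True:
    for endpoint in CRITICAL_ENDPOINTS:
        if not resolves(endpoint):
            alert(f"DNS resolution failing for {endpoint}")
    time.sleep(60)  # re-check every minute
```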
Many teams believe, “We’re safe because we’re running in multiple regions.” But if those regions still depend (directly or indirectly) on a single “core” region for identity updates, database metadata, configuration, or control-plane actions, then you still have a hidden single point of failure. Multiple analysts have pointed out, and long-time AWS customers have long complained, that global AWS features can still rely on us-east-1.
Architects need to ask: “If us-east-1 vanished, what exactly stops working in my stack?”
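One way to start answering that question, at least for the data path, is to actually test a fallback. The sketch below assumes, hypothetically, a DynamoDB global table replicated to us-west-2 and falls back to the replica when the primary region’s endpoint is unreachable. It says nothing about control-plane or IAM dependencies, which can still pin you to us-east-1.

```python
import boto3
from botocore.exceptions import ClientError, EndpointConnectionError

PRIMARY_REGION = "us-east-1"
FALLBACK_REGION = "us-west-2"   # assumes a global-table replica exists here
TABLE_NAME = "example-table"    # hypothetical

def get_item_with_fallback(key):
    """Try the primary region first, then the replica if the primary fails."""
    for region in (PRIMARY_REGION, FALLBACK_REGION):
        client = boto3.client("dynamodb", region_name=region)
        try:
            response = client.get_item(TableName=TABLE_NAME, Key=key)
            return response.get("Item")
        except (EndpointConnectionError, ClientError):
            # Region unreachable or erroring; try the next one.
            continue
    return None

item = get_item_with_fallback({"pk": {"S": "user#123"}})
print(item)
```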
AWS’s own explanation says a race condition in automated DNS management caused the bad update, and the self-healing mechanism didn’t immediately reverse it.
Automation is powerful: it lets you run huge systems at scale. But it also means a subtle logic bug can impact millions of users in seconds. The lesson for any team (even a small startup): review what your “safety scripts” are allowed to touch. Can they delete production records? Under what conditions?
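To make that concrete, here is a hypothetical guardrail for an automated DNS updater: refuse to publish a change that would leave a critical endpoint with no records at all, and fail closed instead. The data structures and function names are invented for illustration; this is not AWS’s actual tooling.

```python
CRITICAL_ENDPOINTS = {"dynamodb.us-east-1.amazonaws.com"}

def apply_dns_plan(plan):
    """plan maps hostname -> list of IP addresses the record should contain."""
    for hostname, addresses in plan.items():
        if hostname in CRITICAL_ENDPOINTS and not addresses:
            # A bug or race condition produced an empty record set.
            # Fail closed: keep the old records and page a human.
            raise RuntimeError(
                f"Refusing to publish an empty record set for {hostname}"
            )
    for hostname, addresses in plan.items():
        publish_records(hostname, addresses)

def publish_records(hostname, addresses):
    # Placeholder for the real DNS update (Route 53 API call, etc.).
    print(f"Publishing {hostname} -> {addresses}")

# A plan like this gets rejected instead of silently wiping the endpoint.
try:
    apply_dns_plan({"dynamodb.us-east-1.amazonaws.com": []})
except RuntimeError as err:
    print(err)
```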
AWS has publicly committed to adding safeguards so that an automated DNS race condition can’t wipe out a critical endpoint’s DNS mapping again. They’ve also disabled the behavior in the affected automation and are putting in additional protections to prevent recurrence.
For the rest of us (engineers, team leads, founders, students preparing for certifications), this incident is more than outage gossip. It’s a free architecture lesson, delivered at internet scale.
If your product (or future product) depends on the cloud, here’s the homework: map what breaks in your stack if a single region disappears, treat DNS as a first-class reliability component, and audit what your automation is allowed to touch.
Because when something as fundamental as “where is the database?” becomes unanswerable, the whole internet feels it.

Founder of CertiPass.io
Professional Cloud & DevOps Architect with over 10 years of experience in migration projects, landing zone service offerings, and cloud advisory for technical teams