DynamoDB, Don't Leave Me Hanging!
Please put the "root" back into "root cause analysis".
A Root Cause Analysis (RCA) post is like the final chapter of a murder mystery: we want to learn “whodunnit” and how. Unfortunately, for the large-scale AWS outage in October, the material put out by Amazon ends on more of a cliffhanger.
To elaborate, in the official RCA, you will see a lengthy explanation of the state that triggers the bug, as well as the cascading series of failures that took down much of AWS. However, the bug itself is described only by the following single sentence:
Additionally, because the active plan was deleted, the system was left in an inconsistent state that prevented subsequent plan updates from being applied by any DNS Enactors.
That’s not much to go on.
Earlier this month, my hopes for a more complete explanation were bolstered by the sudden appearance of an AWS re:Invent presentation devoted entirely to this particular outage. Sadly, in the hour-long video, only twelve seconds were spent on the actual bug.
Both the RCA and the presentation do a good job of explaining the conditions that triggered the bug. Assuming I understood correctly, the setup goes like this:
DynamoDB uses multiple DNS “enactors”, running concurrently, to update DNS records for load balancing (among other things).
Enactors use locking and backoff to partially serialize their behavior and make them “easier to reason about”1.
Pathological lock contention can cause one enactor to be “enacting” a series of DNS records so old that another enactor will assume they couldn’t still be in use, and try to delete them.
An “inconsistent state” occurs when one enactor deletes DNS records it thought were sufficiently old while another enactor updates a DNS endpoint to refer to those same records as authoritative.
Enactors will stop operating if they attempt to update a DNS endpoint when it does not already refer to a resolvable name (which it won’t once the “inconsistent state” has occurred).
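To make that interleaving concrete, here is a toy sketch of the race in Python. This is my own simplification, not AWS's code: a dict stands in for Route53, the plan names follow the re:Invent example, and the 20-plan cleanup threshold is an invented number.

```python
# A toy sketch of the race as I understand it - my own simplification,
# not AWS's code. A dict stands in for Route53.

def plan_number(name: str) -> int:
    # "plan-110.ddb.aws" -> 110
    return int(name.split("-")[1].split(".")[0])

dns = {
    "plan-110.ddb.aws": "1.2.3.10",                    # a very old plan's record
    "plan-145.ddb.aws": "1.2.3.45",                    # the newest plan's record
    "dynamodb.us-east-1.api.aws": "plan-145.ddb.aws",  # the public endpoint
}
NEWEST_PLAN = 145

def stalled_enactor_finally_enacts_plan_110():
    # An enactor that lost the lock-contention lottery wakes up and applies
    # the plan it was holding, which is by now far out of date.
    dns["dynamodb.us-east-1.api.aws"] = "plan-110.ddb.aws"

def cleanup_enactor_deletes_old_plans():
    # Another enactor assumes plans this far behind the newest one can no
    # longer be referenced, and deletes their records - without checking
    # what the endpoint currently points to.
    for name in [n for n in dns if n.startswith("plan-")]:
        if plan_number(name) < NEWEST_PLAN - 20:
            del dns[name]

# The pathological interleaving:
stalled_enactor_finally_enacts_plan_110()
cleanup_enactor_deletes_old_plans()

target = dns["dynamodb.us-east-1.api.aws"]
print(target, "resolvable:", target in dns)   # plan-110.ddb.aws resolvable: False
```

The point is only that the endpoint can end up pointing at a record that no longer exists; so far, nothing here explains why that would be fatal.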
After all that setup, the story leaves me hanging. To quote the twelve seconds of the presentation:
… but when that plan’s missing, our code couldn’t handle the fact that we tried to point the rollback to nothing …
To me, this is the actual bug, so I wanted to hear about it in detail. Why didn’t the code handle that? Or more specifically, why would the code even need to do anything special to “handle” that case at all?
Let me elaborate. Whenever I’m reading something like an RCA, I’m building a model of the system in my head. I’m sure many of you do this as well - it’s how programmers think through programming problems as they are being described.
My mental model of the AWS system at this point doesn’t yet have a bug. If I imagine myself implementing the code they’re describing here, the pseudo-code version is really just three steps:
1 get the current value of dynamodb.us-east-1.api.aws
("plan-110.ddb.aws" in the presentation example)
2 set rollback.ddb.aws to the value from step 1
3 set dynamodb.us-east-1.api.aws to the new value
("plan-145.ddb.aws" in the presentation example)

When written in the straightforward way, there should be no issue with step 1 receiving an unresolvable name - some kind of “null”, garbage like 0.0.0.0, or what I assume would have happened in the actual outage: the “old” name whose record had been deleted (shown as plan-110.ddb.aws in the example from the re:Invent presentation).
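To show what I mean, here is a minimal, runnable version of those three steps - my own sketch, with a plain dict standing in for Route53 and a hypothetical enact() helper; none of it is AWS's actual enactor code.

```python
# A minimal sketch of the three steps as I imagine the "straightforward way".
# Only the record names come from the presentation; everything else is mine.

dns = {
    # The endpoint currently points at plan-110, whose own record has already
    # been deleted - i.e. the "inconsistent state" has occurred.
    "dynamodb.us-east-1.api.aws": "plan-110.ddb.aws",
    "plan-145.ddb.aws": "1.2.3.45",
}

def enact(endpoint: str, new_target: str) -> None:
    old_target = dns.get(endpoint)          # step 1: read whatever is there
    dns["rollback.ddb.aws"] = old_target    # step 2: save it, resolvable or not
    dns[endpoint] = new_target              # step 3: point at the new plan

enact("dynamodb.us-east-1.api.aws", "plan-145.ddb.aws")

print(dns["dynamodb.us-east-1.api.aws"])    # plan-145.ddb.aws - healthy again
print(dns["rollback.ddb.aws"])              # plan-110.ddb.aws - dangling but harmless
```

Written this way, the dangling plan-110 value is just an opaque string being copied from one record to another; nothing in the three steps needs it to resolve.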
As far as we’ve been told, performing these DNS record modifications does not require the enactor to actually communicate with any target load balancers, so it shouldn’t matter whether the target DNS records exist or not. The rollback entrypoint (rollback.ddb.aws) would simply be set to the unresolvable name, just as dynamodb.us-east-1.api.aws held an unresolvable name prior to this update. As far as the explanation goes, the enactors shouldn’t have cared whether these names were resolvable or not.
Additionally, in step 3, the endpoint (dynamodb.us-east-1.api.aws) would be updated to a valid name. Had the enactor continued running, everything except debug code that looks at rollback.ddb.aws would have worked properly. On the next update, literally everything - even rollback.ddb.aws - would be back to normal: it would then point to plan-145.ddb.aws, which remains valid and will exist for some time before being cleaned up.
So to me, we have not yet hit the root cause! The actual root cause is something in the way Amazon wrote the enactor code, and that is exactly what I want to know.
To underscore why I think this is important: taking the explanations at face value, DynamoDB could have been taken down in exactly the same way - causing the same catastrophic outage - simply by manually deleting the target DNS record of a DynamoDB endpoint. According to the available explanation, just that action alone would cause all enactors to cease functioning, and the endpoint would remain an unresolvable name forever.
So it’s hard to call any of the “setup” the root cause, because none of it is actually required for the exact same outage. An accidental manual deletion of an endpoint’s target - even just once - would bring the entire system down, and it would be unrecoverable without operator intervention.
Although the rare, emergent appearance of the “inconsistent state” is something that you would want to fix as well, by itself it wouldn’t have caused AWS to go down catastrophically. At worst it sounds like it would have been a momentary outage. If the enactors weren’t halted by the inconsistent state, one of them would shortly enact a new DNS plan, and things would be back to normal.
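Purely as speculation about what the enactor code might look like - nothing in the RCA or the presentation confirms this - the gap between “self-healing” and “permanently stuck” could be as small as one defensive check on the old value:

```python
# Pure speculation about the shape of the bug - not AWS's code. The only
# difference from the sketch above is a defensive check on the *old* value,
# which turns a dangling record into a permanent halt.

class EnactorHalted(Exception):
    pass

def enact_defensively(dns: dict, endpoint: str, new_target: str) -> None:
    old_target = dns.get(endpoint)
    if old_target not in dns:
        # Hypothetical validation: refuse to proceed if the current value
        # doesn't resolve. Every subsequent plan now fails the same way.
        raise EnactorHalted(f"{old_target!r} does not resolve")
    dns["rollback.ddb.aws"] = old_target
    dns[endpoint] = new_target

dns = {
    "dynamodb.us-east-1.api.aws": "plan-110.ddb.aws",   # plan-110's record is gone
    "plan-145.ddb.aws": "1.2.3.45",
}

try:
    enact_defensively(dns, "dynamodb.us-east-1.api.aws", "plan-145.ddb.aws")
except EnactorHalted as exc:
    print("enactor stopped:", exc)   # and stays stopped until a human steps in
```

If something along these lines is in there, then that check itself - not the race that tripped it - is the root cause I want to read about.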
In closing, I should also mention that my interest in this is more than idle curiosity. A lot of programming paradigms today rely on architectural patterns I consider dangerous for mission-critical systems. As described in the RCA and presentation, this bug sounds like one that wasn’t really caused by a latent race condition as claimed - it was merely exposed by it. Given the disclosures so far, it sounds far more likely that this bug was actually caused by a bad programming practice in a piece of practical code that communicated with Route53 to change DNS entries.
So don’t leave me hanging, DynamoDB! Tell me what the actual code looked like that took down the internet last October. I want to know how something as basic as exchanging a DNS record was written in such a way that it could permanently halt not one but three different concurrent processes on a case as simple as an unresolvable prior record value.
1. The presentation did not elaborate on what aspect of DNS “enacting”, specifically, the serialization makes easier to reason about.

