We're running Aurora PostgreSQL 15.12, which includes the fix mentioned in the release notes. Looking at this comment and the AWS documentation, we think there's an important distinction to make about what was actually fixed in Aurora PostgreSQL 15.12.4. Based on our experience and analysis, we believe AWS's fix focused on protecting data rather than on eliminating the race condition itself.

Here's what we think is happening:

Before the fix (pre-15.12.4):

1. Failover starts

2. Both instances accept and process writes simultaneously

3. Failover eventually completes after the writer steps down

4. Result: potential data consistency issues (this is our assumption; we have not confirmed it)

After the fix (15.12.4+):

1. Failover starts

2. If the old writer doesn't demote before the new writer is promoted, the storage layer now detects this and rejects write requests

3. Both instances restart/crash

4. Failover fails or requires manual intervention

The underlying race condition between writer demotion and reader promotion still exists - AWS just added a safety mechanism at the storage layer to prevent the dangerous scenario of two writers operating simultaneously. They essentially converted a data inconsistency risk into an availability issue. This would explain why we're still seeing failover failures on 15.12 - the race condition wasn't eliminated, just made safer.
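To make the hypothesis concrete, here is a toy model of the behavior we're describing. It is only a sketch of our guess, not AWS's implementation: the `Storage` class, the epoch counter, and `simulate_race` are all hypothetical names we made up, and the sketch does not model the instance restarts we see after a rejected write.

```python
class Storage:
    """Toy shared storage layer (hypothetical, not Aurora's real design)."""

    def __init__(self, fencing):
        self.fencing = fencing  # True models our guess at the post-fix behavior
        self.epoch = 0          # identifies the most recently promoted writer
        self.log = []           # accepted writes as (writer_epoch, record)

    def promote(self):
        """Promote a writer and return the epoch it should attach to writes."""
        self.epoch += 1
        return self.epoch

    def write(self, writer_epoch, record):
        """Accept a write unless fencing is on and the writer is stale."""
        if self.fencing and writer_epoch != self.epoch:
            return False
        self.log.append((writer_epoch, record))
        return True


def simulate_race(fencing):
    """New writer is promoted before the old writer has stepped down."""
    storage = Storage(fencing)
    old_writer = storage.promote()  # epoch 1
    new_writer = storage.promote()  # epoch 2, old writer has not demoted yet
    old_ok = storage.write(old_writer, "from old writer")
    new_ok = storage.write(new_writer, "from new writer")
    return old_ok, new_ok, storage.log


if __name__ == "__main__":
    print("without fencing:", simulate_race(fencing=False))
    print("with fencing:   ", simulate_race(fencing=True))
```

In both runs the race itself still happens (the new writer is promoted while the old one is active). The only difference is whether the stale writer's write reaches the log, which is the distinction we're drawing between fixing the race and mitigating its consequences.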

The comment in the release notes about "fixed a race condition where an old writer instance may not step down" is somewhat misleading. It would be more accurate to say they "mitigated the consequences of the race condition" by having the storage layer reject writes when it detects the problematic state. That is probably also why AWS Support did not point us to this release when we raised the issue.