How we discovered, and recovered from, Postgres corruption on the matrix.org homeserver
2025-07-23 - General, matrix.org homeserver - Richard van der Hoff
Greetings from Element's backend/SRE team, who run the matrix.org homeserver on behalf of the Matrix.org Foundation.
Recently, users of the matrix.org homeserver began seeing problems where rooms would simply stop working. Operations such as sending a new message, or joining the room as a new member, would fail for mysterious reasons. Where an error message was shown at all, it tended to be something cryptic like "No create event in auth events".
After a couple of weeks of hard work by a team of Element staff including backend developers and systems engineers, we were able to repair almost all of the affected rooms. Although we're still investigating exactly what went wrong and checking that everything is now working as it should, we'd like to share some details about what we know and the work we've done to date.
We'll be diving into some quite technical details. Hopefully you'll find it interesting learning a bit about how Synapse works, how Postgres works, and the work we sometimes find ourselves doing to keep the matrix.org homeserver running.
TL;DR
Let's start with a high-level summary.
The matrix.org homeserver is backed by a large PostgreSQL database instance. Parts of an index on one of the tables in this database had become corrupted. We are unsure exactly what caused this corruption, but believe it happened at least a year ago, and likely significantly longer.
The nature of this corruption was such that it had little or no effect at first. However, a background maintenance task which removes old, unreferenced data from this table recently started working on the corrupted region. Due to the corrupt index, the maintenance task incorrectly removed active data from the table, in effect corrupting rooms.
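To make that failure mode concrete, here is a deliberately tiny Python sketch. It is a toy model under our own assumptions, not Postgres internals or Synapse's real schema: the `table`, the `reference_index`, and the `purge_unreferenced` helper are all hypothetical names invented for illustration. The point is only that a cleanup task which trusts an index will delete live data if the index wrongly reports "nothing refers to this row".

```python
# Toy model only: a dict stands in for the table, and a second dict stands in
# for an index answering "which rows refer to this one?". All names are
# hypothetical; this is not how Postgres or Synapse store anything.

def purge_unreferenced(table, reference_index, candidates):
    """Delete candidate rows that the index claims nothing refers to."""
    deleted = []
    for row_id in candidates:
        # The maintenance task trusts the index instead of scanning the table.
        if not reference_index.get(row_id):
            del table[row_id]
            deleted.append(row_id)
    return deleted


def fresh_table():
    # Row 1 is old but still referenced by row 2; row 3 is genuinely unused.
    return {1: "old but active", 2: "refers to row 1", 3: "old and unused"}


# A healthy index knows that row 2 refers to row 1.
healthy_index = {1: [2]}
# A corrupt index has lost that entry, so lookups give a false negative.
corrupt_index = {}

healthy_table = fresh_table()
deleted_healthy = purge_unreferenced(healthy_table, healthy_index, [1, 3])

corrupt_table = fresh_table()
deleted_corrupt = purge_unreferenced(corrupt_table, corrupt_index, [1, 3])

print("healthy index deleted:", deleted_healthy)
print("corrupt index deleted:", deleted_corrupt)
```

In the healthy run only the unused row is removed. In the corrupt run the still-active row disappears as well, even though the cleanup logic itself is unchanged.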
Having identified the problem, we rebuilt the corrupted index, and then restored the incorrectly removed data from database backups.
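The two recovery steps can be sketched in the same toy style. Again, this is an illustration under stated assumptions rather than the procedure we actually ran: `rebuild_index` and `restore_missing` are hypothetical helpers, rows are plain dicts with a made-up `refs` field, and the "backup" is just another dict. In Postgres itself an index rebuild would typically be done with `REINDEX`; that detail is an assumption here, since this summary does not name the command.

```python
# Toy model only: rows are dicts with a hypothetical "refs" list naming the
# rows they depend on. None of this mirrors the real schema.

def rebuild_index(table):
    """Recompute the reference index from the table, ignoring the old index."""
    index = {}
    for row_id, row in table.items():
        for target in row["refs"]:
            index.setdefault(target, []).append(row_id)
    return index


def restore_missing(live, backup, index):
    """Copy back rows that live data still refers to but which are missing."""
    restored = []
    for row_id, row in backup.items():
        # Only restore rows that were wrongly deleted: absent from the live
        # table, yet referenced according to the freshly rebuilt index.
        if row_id not in live and row_id in index:
            live[row_id] = row
            restored.append(row_id)
    return restored


# Live table after the bad cleanup: row 1 was wrongly removed, while row 3
# was correctly removed because nothing referred to it.
live = {2: {"refs": [1]}}
backup = {1: {"refs": []}, 2: {"refs": [1]}, 3: {"refs": []}}

index = rebuild_index(live)                      # step 1: rebuild the index
restored = restore_missing(live, backup, index)  # step 2: restore from backup

print("rebuilt index:", index)
print("restored rows:", restored)
```

Rebuilding the index first matters in this sketch: the restore step relies on a trustworthy index to tell wrongly deleted rows apart from rows that were legitimately cleaned up.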