<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
    <title>Matrix.org - matrix.org homeserver</title>
    <subtitle>The Matrix.org Foundation</subtitle>
    <link href="https://matrix.org/category/matrix-org-homeserver/atom.xml" rel="self" type="application/atom+xml"/>
    <link href="https://matrix.org"/>
    <generator uri="https://www.getzola.org/">Zola</generator>
    <updated>2025-10-29T10:00:00+00:00</updated>
    <id>https://matrix.org/category/matrix-org-homeserver/atom.xml</id>
    
    
    
<entry xml:lang="en">
    <title>Post-mortem of the September 2 outage</title>
    <published>2025-10-29T10:00:00+00:00</published>
    <updated>2025-10-29T10:00:00+00:00</updated>
    <author>
      <name>Matthew Hodgson, Neil Johnson, Thib, SRE Team</name>
    </author>
    <link rel="alternate" href="https://matrix.org/blog/2025/10/post-mortem/" type="text/html"/>
    <id>https://matrix.org/blog/2025/10/post-mortem/</id>
    <content type="html">&lt;p&gt;On 2nd September 2025 the &lt;a href=&quot;https:&#x2F;&#x2F;matrix.org&#x2F;homeserver&#x2F;about&#x2F;&quot;&gt;matrix.org homeserver&lt;&#x2F;a&gt; suffered a ~24h outage.&lt;&#x2F;p&gt;
&lt;p&gt;During routine maintenance to increase disk capacity, the primary database failed, and we fell back to the secondary. In attempting to restore the original primary, we lost the secondary-turned-primary rendering matrix.org unavailable.&lt;&#x2F;p&gt;
&lt;p&gt;To recover, it was necessary to restore from S3 storage, however the restore process was lengthy due to the size of the dataset (51TB).&lt;&#x2F;p&gt;
&lt;p&gt;The matrix.org homeserver was unavailable from 2025-09-02 17:45 UTC and full service resumed at 2025-09-03 18:00 UTC. No data was lost as a result of the incident.&lt;&#x2F;p&gt;
&lt;span id=&quot;continue-reading&quot;&gt;&lt;&#x2F;span&gt;&lt;h2 id=&quot;what-happened&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#what-happened&quot; aria-label=&quot;Anchor link for: what-happened&quot;&gt;🔗&lt;&#x2F;a&gt;What happened&lt;&#x2F;h2&gt;
&lt;p&gt;The matrix.org homeserver is made of a main Synapse instance with hundreds of workers, backed by a single logical Postgres cluster made up of two machines. The primary database is replicated to a secondary, read-only instance via &lt;a href=&quot;https:&#x2F;&#x2F;www.postgresql.org&#x2F;docs&#x2F;current&#x2F;warm-standby.html#STREAMING-REPLICATION&quot;&gt;streaming&lt;&#x2F;a&gt; replication.&lt;&#x2F;p&gt;
&lt;figure style=&quot;height:100%;&quot;&gt;
    &lt;img src=&quot;&amp;#x2F;blog&amp;#x2F;img&amp;#x2F;morg-high-level-architecture.png&quot; &#x2F;&gt;
    &lt;figcaption&gt;&lt;p&gt;A schema showing Synapse connected to a primary database. It also shows a secondary database pulling WALs from the primary. Finally the primary database also pushes WALs to a S3 bucket.&lt;&#x2F;p&gt;
&lt;&#x2F;figcaption&gt;
&lt;&#x2F;figure&gt;
&lt;p&gt;Confusingly, at the time of the incident, the primary database server was called &lt;code&gt;db-02&lt;&#x2F;code&gt;, and the secondary database server was called &lt;code&gt;db-01&lt;&#x2F;code&gt;. The deployment runs on bare metal servers at &lt;a href=&quot;https:&#x2F;&#x2F;www.mythic-beasts.com&#x2F;&quot;&gt;Mythic Beasts&lt;&#x2F;a&gt; and the Postgres database servers both use their own logical RAID 10 array with &lt;a href=&quot;https:&#x2F;&#x2F;docs.kernel.org&#x2F;admin-guide&#x2F;md.html&quot;&gt;&lt;code&gt;mdraid&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Our primary database is backed up to an S3 bucket in AWS. At the time of the incident, we performed a full database backup weekly, incremental database backups daily, and we archived WALs continuously to a separate S3 bucket. If you are not familiar with WALs, you can see them as the primary database recording what it does when inserting or removing records into its tables.&lt;&#x2F;p&gt;
&lt;p&gt;Since WALs are exact records of what happened, they can be useful for two things:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Archive&#x2F;backups.&lt;&#x2F;strong&gt; WALs can be seen as “small incremental backups” to aid point-in-time recovery and&#x2F;or bridge the gap between full backups. This is why we keep them in the S3 bucket in addition to the weekly and daily backups.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Replication.&lt;&#x2F;strong&gt; The secondary database will fetch those WALs from the primary database and also replay them on itself, to have the exact same records as the primary database.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;The primary database will produce WALs as it adds or removes records, and keep them until they have been both archived to a S3 bucket &lt;em&gt;and&lt;&#x2F;em&gt; been fetched by the secondary database.&lt;&#x2F;p&gt;
&lt;p&gt;We monitor the database size and growth, and when the database reached roughly 51TB (90% of disk capacity) we set about adding more disks in the array.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;timeline&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#timeline&quot; aria-label=&quot;Anchor link for: timeline&quot;&gt;🔗&lt;&#x2F;a&gt;Timeline&lt;&#x2F;h3&gt;
&lt;p&gt;At 11:03 UTC on Sept 2nd 2025, Mythic Beasts’ teams added 2 NVMe drives to &lt;code&gt;db-02&lt;&#x2F;code&gt; and &lt;code&gt;db-01&lt;&#x2F;code&gt;, respectively our primary and secondary database servers. We then set about introducing the new drives to the respective RAID arrays.&lt;&#x2F;p&gt;
&lt;p&gt;At 11:17 UTC, one existing drive disappeared from the RAID array of &lt;code&gt;db-02&lt;&#x2F;code&gt;, our primary database server. Our monitoring fired, and Mythic Beasts confirmed the issue. Because we’re using &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Nested_RAID_levels#RAID_10_(RAID_1+0)&quot;&gt;RAID 10&lt;&#x2F;a&gt;, the setup was still functional but running in degraded mode. There was no data loss, but the RAID array could potentially not survive another drive failure, and performance could be degraded.&lt;&#x2F;p&gt;
&lt;p&gt;We had to restore the RAID array of &lt;code&gt;db-02&lt;&#x2F;code&gt;, our primary database server, to a non-degraded state. That meant failing over to our secondary database on &lt;code&gt;db-01&lt;&#x2F;code&gt; and doing maintenance on &lt;code&gt;db-02&lt;&#x2F;code&gt;, a decision we took at 12:57 UTC.
At 13:27 UTC the fail-over to the database on &lt;code&gt;db-01&lt;&#x2F;code&gt; was complete, and &lt;code&gt;db-01&lt;&#x2F;code&gt; was now the primary. Synapse happily started writing to it. At this point there had been minimal disruption. But the new primary didn’t archive WALs to S3 due to an issue in the archiving script. Because of this and because the new secondary was offline, WALs could not be discarded from &lt;code&gt;db-01&lt;&#x2F;code&gt; yet.&lt;&#x2F;p&gt;
&lt;p&gt;At 13:30 UTC, we restarted the postgres instance on &lt;code&gt;db-02&lt;&#x2F;code&gt; in replica mode, effectively turning our former primary database into a secondary. The new secondary needed to catch up with what had been happening on the new primary running on &lt;code&gt;db-01&lt;&#x2F;code&gt; by consuming its WALs.&lt;&#x2F;p&gt;
&lt;p&gt;At 13:53 UTC, after the new secondary on &lt;code&gt;db-02&lt;&#x2F;code&gt; caught up with the new primary on &lt;code&gt;db-01&lt;&#x2F;code&gt;, we decided to restart the &lt;code&gt;db-02&lt;&#x2F;code&gt; server, in the hope of restoring its RAID 10 array to a fully functional state.&lt;&#x2F;p&gt;
&lt;p&gt;At 14:01 UTC, the &lt;code&gt;db-02&lt;&#x2F;code&gt; server rebooted in recovery mode, because its RAID array could not be assembled as an additional drive was now missing. Recovery mode means no network, no ssh, no postgres instance was running. At this point, our secondary database was offline, and our new primary still didn’t archive WALs to S3. WALs kept accumulating on &lt;code&gt;db-01&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;At 15:44 UTC, we reached the conclusion that&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;The RAID array on our &lt;code&gt;db-02&lt;&#x2F;code&gt; server was not recoverable as the RAID headers were missing on both drives that were missing from the RAID array.&lt;&#x2F;li&gt;
&lt;li&gt;We needed to recreate a fresh RAID array.&lt;&#x2F;li&gt;
&lt;li&gt;We would need to restore the database on &lt;code&gt;db-02&lt;&#x2F;code&gt;, ideally by making it a replica of the new primary running on &lt;code&gt;db-01&lt;&#x2F;code&gt;.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;At 16:11 UTC, the &lt;code&gt;db-02&lt;&#x2F;code&gt; server went back online with a fresh RAID 10 array, and by 16:50 UTC we unblocked WAL archival from the primary on &lt;code&gt;db-01&lt;&#x2F;code&gt; to S3. WALs could start being discarded on the primary on &lt;code&gt;db-01&lt;&#x2F;code&gt;; it was time to restore the secondary on &lt;code&gt;db-02&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;At 17:20 UTC, we upgraded the Postgres on the brand new and empty secondary on &lt;code&gt;db-02&lt;&#x2F;code&gt; to the latest patch version. That meant not having to do another set of failovers to upgrade the databases after getting back to a healthy state. At this point, we still had a fully functional primary database.&lt;&#x2F;p&gt;
&lt;p&gt;At 17:25 UTC we attempted to start restoring the data on &lt;code&gt;db-02&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;First we ran a command on the machine to list all of the backups and identify the correct backup ID:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#1e1e1e;color:#dcdcdc;&quot;&gt;&lt;code&gt;&lt;span&gt;sudo &#x2F;opt&#x2F;wal-g&#x2F;wal-g \
&lt;&#x2F;span&gt;&lt;span&gt;  --walg-s3-prefix=s3:&#x2F;&#x2F;&amp;lt;backup-bucket&amp;gt; \
&lt;&#x2F;span&gt;&lt;span&gt;  --aws-shared-credentials-file=&#x2F;home&#x2F;postgres&#x2F;.aws&#x2F;credentials \
&lt;&#x2F;span&gt;&lt;span&gt;  --aws-region=eu-west-2 backup-list
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;We were able to identify the most recent backup and target it with a restore command that we have documented as part of our restore procedures:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#1e1e1e;color:#dcdcdc;&quot;&gt;&lt;code&gt;&lt;span&gt;sudo time &#x2F;opt&#x2F;wal-g&#x2F;wal-g \
&lt;&#x2F;span&gt;&lt;span&gt;  --walg-s3-prefix=s3:&#x2F;&#x2F;&amp;lt;backup-bucket&amp;gt; \
&lt;&#x2F;span&gt;&lt;span&gt;  --aws-shared-credentials-file=&#x2F;home&#x2F;postgres&#x2F;.aws&#x2F;credentials \
&lt;&#x2F;span&gt;&lt;span&gt;  --aws-region=eu-west-2  \
&lt;&#x2F;span&gt;&lt;span&gt;  --walg-download-concurrency=32 \
&lt;&#x2F;span&gt;&lt;span&gt;  backup-fetch &#x2F;mnt&#x2F;data&#x2F;postgresql-14&#x2F; &amp;lt;backup_id&amp;gt; \
&lt;&#x2F;span&gt;&lt;span&gt;  2&amp;gt;&amp;amp;1 | tee restore.log
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This command was entered while the current directory was the Postgres database directory, which caused the &lt;code&gt;tee&lt;&#x2F;code&gt; command to fail and abort the restore process, which had enough time to create some directories in the data path but nothing else. We switched to the home path and re-ran the command, which successfully wrote to the log file, but failed due to the data directory being non-empty after the previous aborted restore.&lt;&#x2F;p&gt;
&lt;p&gt;The necessary course of action at this point was to clear the remains of the failed restore attempt from the data directory and start again. Since &lt;code&gt;db-02&lt;&#x2F;code&gt; had already been cleared and needed to be restored, this didn’t register as a particularly high risk manoeuvre.&lt;&#x2F;p&gt;
&lt;p&gt;Unfortunately, in attempting to do so, we erroneously deleted the data directory of the primary on &lt;code&gt;db-01&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;After realising our mistake, we decided to keep our Postgres up on &lt;code&gt;db-01&lt;&#x2F;code&gt; in case deleted files were still open in Postgres processes, with the hopes that the open file handles would forestall the actual deletion of the data on disk.&lt;&#x2F;p&gt;
&lt;p&gt;With both &lt;code&gt;db-01&lt;&#x2F;code&gt; and &lt;code&gt;db-02&lt;&#x2F;code&gt; out of action we had no other option but to restore at least one database from offsite backup. Since &lt;code&gt;db-02&lt;&#x2F;code&gt; was in a pristine state, with an expanded RAID array, we decided to restore the database on this server.&lt;&#x2F;p&gt;
&lt;p&gt;As detailed earlier, our backup strategy at the time was: full database backups weekly, incremental database backups daily, and WAL archival continuously. To perform a complete restore without any data loss on &lt;code&gt;db-02&lt;&#x2F;code&gt;, we needed to:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;Restore the latest weekly full database backup from S3.&lt;&#x2F;li&gt;
&lt;li&gt;Restore all the daily incremental backups from S3 taken since that full backup.&lt;&#x2F;li&gt;
&lt;li&gt;Replay the WALs since the last daily incremental backup.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;So at 17:30 UTC, we started restoring the database on &lt;code&gt;db-02&lt;&#x2F;code&gt; by using &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;wal-g&#x2F;wal-g&quot;&gt;wal-g&lt;&#x2F;a&gt;, a well-known tool that pulls the backups from S3 to restore databases. That was going to be costly and slow, but we didn’t have a choice and that’s what backups are for.&lt;&#x2F;p&gt;
&lt;p&gt;In the meantime, the backend team was paged to manage the impact to Synapse, an incident was opened, and an emergency was declared. Our primary database on &lt;code&gt;db-01&lt;&#x2F;code&gt; was partially wiped and throwing errors, but not corrupt enough to crash Synapse. We decided to shut down both Synapse and the primary database to avoid unknown database states. At this point, the matrix.org homeserver was down.&lt;&#x2F;p&gt;
&lt;p&gt;At 18:06 UTC we decided to re-mount the data partition of &lt;code&gt;db-01&lt;&#x2F;code&gt; as read-only. We were now in emergency mode, and wanted to ensure we couldn’t damage the database further, in case we could salvage it later.&lt;&#x2F;p&gt;
&lt;p&gt;At 18:40 UTC, after taking the time to consider our options, we realised the following&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;extundelete.sourceforge.net&#x2F;&quot;&gt;extundelete&lt;&#x2F;a&gt; and &lt;a href=&quot;https:&#x2F;&#x2F;ext4magic.sourceforge.net&#x2F;ext4magic_en.html&quot;&gt;ext4magic&lt;&#x2F;a&gt; were both unmaintained for a decade, and are unable to work on an unmounted filesystem. ext4magic even explicitly documents it “can no longer successfully process current ext4 file systems”&lt;&#x2F;li&gt;
&lt;li&gt;We also tried &lt;a href=&quot;https:&#x2F;&#x2F;www.r-studio.com&#x2F;free-linux-recovery&#x2F;&quot;&gt;R-Linux&lt;&#x2F;a&gt;, but weren’t confident in the integrity of the recovered files - especially with our recent experiences with slow-burning &lt;a href=&quot;https:&#x2F;&#x2F;matrix.org&#x2F;blog&#x2F;2025&#x2F;07&#x2F;postgres-corruption-postmortem&#x2F;&quot;&gt;postgres corruption&lt;&#x2F;a&gt;.&lt;&#x2F;li&gt;
&lt;li&gt;So we decided against trying to recover the lost data by carving or undeletion, in favour of a guaranteed reliable restore from offsite backup.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;At 20:30 UTC, &lt;code&gt;db-02&lt;&#x2F;code&gt; was still restoring from the S3 backup. After restoring the database on &lt;code&gt;db-02&lt;&#x2F;code&gt; from its full and incremental backups, we would need to replay the WALs produced by &lt;code&gt;db-01&lt;&#x2F;code&gt; to fill the gap between the last backup taken from &lt;code&gt;db-02&lt;&#x2F;code&gt; and the moment we lost &lt;code&gt;db-01&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;When we promoted &lt;code&gt;db-01&lt;&#x2F;code&gt; as the primary, the script that archives WALs to S3 started erroring out. As a result, there were WALs on &lt;code&gt;db-01&lt;&#x2F;code&gt; that were not in S3. We were going to need those to bring &lt;code&gt;db-02&lt;&#x2F;code&gt; up to date with the point of the outage. We started copying these WALs from &lt;code&gt;db-01&lt;&#x2F;code&gt; to &lt;code&gt;db-02&lt;&#x2F;code&gt; to have them ready to replay once the restore from S3 backup would complete. Restoring 51 TB from S3 &lt;em&gt;takes time&lt;&#x2F;em&gt; so we didn’t have much more to do than wait for the restore to complete.&lt;&#x2F;p&gt;
&lt;p&gt;At 07:21 UTC the next morning, the data extraction from the full weekly backup was complete. However, as soon as wal-g attempted to start restoring the next daily incremental backup, it immediately errored out due to an &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;wal-g&#x2F;wal-g&#x2F;issues&#x2F;499&quot;&gt;issue&lt;&#x2F;a&gt; with wal-g that had &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;wal-g&#x2F;wal-g&#x2F;pull&#x2F;1320&quot;&gt;already received a fix&lt;&#x2F;a&gt;. Now, we regularly run backup recovery tests during which we spin up a short-lived EC2 instance, called our Disaster Recovery Server, perform a full database restore on it and run a few tests before tearing it down. During one of those recovery tests, we had already run into the wal-g problem and fixed it in the backup recovery test ansible playbook… but unfortunately this got missed on the actual database servers.&lt;&#x2F;p&gt;
&lt;p&gt;This meant that our production version of wal-g was outdated and hadn’t received this fix. At this point, we had pulled all the full base backup data from S3, but wal-g had failed to restore any incremental backups on top of it because of this bug. We needed to update wal-g to the latest release of the same major version to benefit from the fix. After doing so, we tried to relaunch the restore, and it failed because the data directory already contained a partial restore.
So, we decided to patch wal-g to recover from a partial failed restore, and after fighting with the dependencies we figured out how to make it accept a non-empty data directory that contained a pristine full base backup, so we didn’t have to pull everything from S3 again. We patched it, built it, and used it against &lt;code&gt;db-02&lt;&#x2F;code&gt; at 09:23 UTC.&lt;&#x2F;p&gt;
&lt;p&gt;At 09:35 UTC the first incremental backup was restored, then the second at 09:44 UTC, the third at 09:54 UTC, and the final backup was restored at 10:03 UTC.&lt;&#x2F;p&gt;
&lt;p&gt;At 10:45 UTC we attempted to start the new instance in standby mode to check its consistency. But the standby mode of Postgres is meant to be for replicas, and replicas need either a primary to grab WALs from, or a &lt;code&gt;restore_command&lt;&#x2F;code&gt; set to fetch WALs. Since the new Postgres on &lt;code&gt;db-02&lt;&#x2F;code&gt; couldn’t reach any primary and it didn’t have any &lt;code&gt;restore_command&lt;&#x2F;code&gt; set, it refused to start in standby mode.&lt;&#x2F;p&gt;
&lt;p&gt;So we configured a &lt;code&gt;restore_command&lt;&#x2F;code&gt; with a wrapper script that could fetch WALs from either S3 (our “continuous backups”) or the filesystem (db WALs carried over from &lt;code&gt;db-01&lt;&#x2F;code&gt;) and started Postgres in standby mode successfully. It started catching up on WALs from S3 at 11:00 UTC.&lt;&#x2F;p&gt;
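&lt;p&gt;The wrapper script itself is not reproduced in this post. As a rough illustration only, a wrapper of this kind could look like the sketch below. The directory path, the variable names and the choice to look at the local copy first are all assumptions made for the example; &lt;code&gt;wal-g wal-fetch&lt;&#x2F;code&gt; is the wal-g subcommand that downloads a single WAL segment, and Postgres substitutes &lt;code&gt;%f&lt;&#x2F;code&gt; (segment name) and &lt;code&gt;%p&lt;&#x2F;code&gt; (destination path) into &lt;code&gt;restore_command&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;

```shell
#!/usr/bin/env bash
# Hypothetical restore_command wrapper (a sketch, not the script used in the incident).
# Postgres would call it as: restore_command = '/path/to/fetch_wal.sh %f %p'

# Assumed directory holding WAL segments copied by hand from the old primary.
LOCAL_WAL_DIR="${LOCAL_WAL_DIR:-/mnt/data/carried-over-wals}"
# Command used to pull a segment from the S3 archive; it receives the
# segment name and the destination path as its two arguments.
WAL_FETCH_CMD="${WAL_FETCH_CMD:-wal-g wal-fetch}"

fetch_wal() {
  local wal_name="$1" dest="$2"
  # Prefer the local copy when the segment was carried over by hand.
  if [ -f "$LOCAL_WAL_DIR/$wal_name" ]; then
    cp "$LOCAL_WAL_DIR/$wal_name" "$dest"
    return $?
  fi
  # Otherwise fall back to the archive. A non-zero exit status tells
  # Postgres that the segment is not available.
  $WAL_FETCH_CMD "$wal_name" "$dest"
}

# Only act when invoked with the two arguments Postgres passes.
if [ "$#" -eq 2 ]; then
  fetch_wal "$1" "$2"
fi
```

&lt;p&gt;Whichever source is tried first, the important property is that the script exits non-zero when a segment cannot be found anywhere, so that Postgres knows it has reached the end of the available WALs.&lt;&#x2F;p&gt;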
&lt;p&gt;Frustratingly, the playback rate was slower than expected: replaying the ~18 hours of WALs ended up taking 5.5 hours (we had been hoping it would take around 10 minutes for every 1 hour of WALs). It took until 16:27 UTC to replay all the WALs. And at this point we could log into the Postgres database on &lt;code&gt;db-02&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
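&lt;p&gt;To put the replay figures side by side, the short calculation below uses only the numbers quoted above (about 18 hours of WALs, a hoped-for 10 minutes of replay per hour of WAL, and 5.5 hours observed). The variable names are made up for the example, and the integer division rounds down.&lt;&#x2F;p&gt;

```shell
# Figures quoted in the post.
wal_hours=18                     # roughly 18 hours of WALs to replay
hoped_minutes_per_wal_hour=10    # hoped-for replay rate
actual_replay_minutes=330        # 5.5 hours observed

# Replay time we had hoped for, in minutes.
expected_replay_minutes=$((wal_hours * hoped_minutes_per_wal_hour))

# Observed minutes of replay per hour of WAL (integer division, rounds down).
actual_minutes_per_wal_hour=$((actual_replay_minutes / wal_hours))

echo "hoped for: ${expected_replay_minutes} min"
echo "observed:  ${actual_replay_minutes} min (about ${actual_minutes_per_wal_hour} min per hour of WAL)"
```

&lt;p&gt;In other words, the hoped-for rate would have meant about 3 hours of replay, while the observed rate was a little over 18 minutes per hour of WAL, not far off twice as slow.&lt;&#x2F;p&gt;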
&lt;p&gt;At long last, we had a working database instance, with no data loss. We promoted it to a primary database at 16:45 UTC, and started a Synapse test worker at 16:51 UTC. We could see new WALs start to appear in S3, which meant WAL shipping worked. It was time to restart Synapse and bring matrix.org back online. We started Synapse at 16:54 UTC, and after various thundering-herd overloads as everyone reconnected, all the workers were online and stable by 18:00 UTC.&lt;&#x2F;p&gt;
&lt;p&gt;At this point, the server was back online, matrix.org was catching up with everything that had happened on the rest of the federation while it was offline, albeit with a single database node (although WALs were being archived to S3 for safety).&lt;&#x2F;p&gt;
&lt;p&gt;At this point, if our database had caught fire we would have been able to restore it without losing data, but at the cost of bringing matrix.org offline again. We had just been through it and didn’t want to do it again. We needed our secondary back.&lt;&#x2F;p&gt;
&lt;p&gt;But we also needed the team to get some rest. Given how slow it was to replay WALs, we reconfigured our backups to happen against the primary database rather than against the (missing) replica. We let the European team go to bed, while our American SRE kept tabs on everything. At 03:26 UTC a new incremental backup completed.&lt;&#x2F;p&gt;
&lt;p&gt;At 09:21 UTC we added the two NVMe disks to the RAID array and to the LVM volume group of &lt;code&gt;db-01&lt;&#x2F;code&gt;. We rebooted to ensure the disks were properly detected and mounted, but the server didn’t come back. We opened the lights-out console Mythic Beasts provides us, and saw that the RAID array was not in a functional state. We had rebooted &lt;code&gt;db-01&lt;&#x2F;code&gt; at a critical moment of the array reshaping.
Once we had fixed up the array to bring it into a bootable state, &lt;code&gt;db-01&lt;&#x2F;code&gt; finally restarted, and we copied over the basebackup from &lt;code&gt;db-02&lt;&#x2F;code&gt; and set it replicating.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;lessons-learned&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#lessons-learned&quot; aria-label=&quot;Anchor link for: lessons-learned&quot;&gt;🔗&lt;&#x2F;a&gt;Lessons learned&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;we-have-a-massive-database&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#we-have-a-massive-database&quot; aria-label=&quot;Anchor link for: we-have-a-massive-database&quot;&gt;🔗&lt;&#x2F;a&gt;We have a massive database&lt;&#x2F;h3&gt;
&lt;p&gt;A lot of the pain we experienced during this outage came from how massive our database is.&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Now that we have extra storage, it’s the right time to run &lt;code&gt;pg_repack&lt;&#x2F;code&gt; and reclaim free space.&lt;&#x2F;li&gt;
&lt;li&gt;We have already increased the frequency of incremental backups, since restoring them is much faster than replaying WALs.&lt;&#x2F;li&gt;
&lt;li&gt;We also know Synapse could do much better in terms of data storage and there &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;element-hq&#x2F;synapse&#x2F;issues?q=sort%3Aupdated-desc%20is%3Aissue%20is%3Aopen%20label%3AA-Disk-Space&quot;&gt;are plans to drastically reduce storage requirements in future&lt;&#x2F;a&gt;, also see Matthew’s “how hard could it be” hack from the week before the incident: &lt;a href=&quot;https:&#x2F;&#x2F;youtu.be&#x2F;D5zAgVYBuGk?t=1852&quot;&gt;https:&#x2F;&#x2F;youtu.be&#x2F;D5zAgVYBuGk?t=1852&lt;&#x2F;a&gt;.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h3 id=&quot;our-safeguards-can-be-improved&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#our-safeguards-can-be-improved&quot; aria-label=&quot;Anchor link for: our-safeguards-can-be-improved&quot;&gt;🔗&lt;&#x2F;a&gt;Our safeguards can be improved&lt;&#x2F;h3&gt;
&lt;p&gt;Running a destructive command on the incorrect server was a key moment in the incident. While it can be attributed to human error, it would be wrong to focus on the individual. Instead, we should consider how to improve the tooling and processes surrounding them to minimise the chances of a repeat in the future.&lt;&#x2F;p&gt;
&lt;p&gt;When making the sensitive changes, the on-call group effectively paired as a trio. However, in the heat of the moment, this was insufficient to catch the error.&lt;&#x2F;p&gt;
&lt;p&gt;We realised that the database server names were a source of confusion. &lt;code&gt;db-01&lt;&#x2F;code&gt; reads like “Primary DB” and &lt;code&gt;db-02&lt;&#x2F;code&gt; reads like “Secondary DB”. Not only was this false in our case, but a primary database server can become a secondary database server, and the other way around. Names with intrinsic meanings are a source of confusion.&lt;&#x2F;p&gt;
&lt;p&gt;We’re considering changing the background colour of the terminal dynamically depending on the role the database is playing in the cluster. An idea we floated is to monitor the presence of the &lt;code&gt;standby.signal&lt;&#x2F;code&gt; file in the database data directory to know whether it is a primary or a secondary database, and update the terminal’s background colour accordingly. This is not a silver bullet since the background colour would only change after a command has been sent, but that would already be an improvement.&lt;&#x2F;p&gt;
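&lt;p&gt;As a rough sketch of that idea, and not something we have deployed, the snippet below derives the role from the presence of &lt;code&gt;standby.signal&lt;&#x2F;code&gt; and sets the terminal background before each prompt. The function names and colours are invented for the example, the data directory is assumed to be the one used in the restore command earlier, and the OSC 11 escape sequence is only honoured by xterm-compatible terminals.&lt;&#x2F;p&gt;

```shell
# Assumed data directory, matching the path used in the restore command above.
PGDATA_DIR="${PGDATA_DIR:-/mnt/data/postgresql-14}"

# Print "secondary" when standby.signal exists in the data directory,
# otherwise "primary". A missing directory also reports "primary",
# which a real implementation would want to handle separately.
pg_role() {
  if [ -e "$1/standby.signal" ]; then
    echo secondary
  else
    echo primary
  fi
}

# Map a role to a background colour (arbitrary choices for illustration).
role_colour() {
  case "$1" in
    primary) echo "#3a0000" ;;
    *) echo "#002b36" ;;
  esac
}

# Emit the OSC 11 escape sequence, which xterm-compatible terminals
# interpret as "set the background colour".
set_terminal_background() {
  printf '\033]11;%s\007' "$1"
}

# Hypothetical hook: bash runs PROMPT_COMMAND before printing each prompt.
update_role_colour() {
  set_terminal_background "$(role_colour "$(pg_role "$PGDATA_DIR")")"
}
# To enable in an interactive shell: PROMPT_COMMAND=update_role_colour
```

&lt;p&gt;Because the hook only runs when a new prompt is drawn, the colour lags behind a role change until the next command returns, which is the limitation described above.&lt;&#x2F;p&gt;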
&lt;p&gt;We also discussed wrapper scripts around sensitive commands (such as an alias for &lt;code&gt;rm&lt;&#x2F;code&gt;) or automating some operations, such as starting a base backup from primary to secondary as a means to minimise risk.&lt;&#x2F;p&gt;
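&lt;p&gt;One possible shape for such a wrapper is sketched below, purely as an illustration of the idea. The function name and the two checks are assumptions: the operator has to type the name of the host they believe they are on, and the wipe is refused if the directory contains a &lt;code&gt;postmaster.pid&lt;&#x2F;code&gt;, the file Postgres keeps in its data directory while the server is running.&lt;&#x2F;p&gt;

```shell
# Hypothetical guard around wiping a Postgres data directory.
# Usage: safe_wipe_datadir DATADIR HOSTNAME_AS_TYPED_BY_THE_OPERATOR
safe_wipe_datadir() {
  local datadir="$1" typed_host="$2"

  # Check 1: the operator must name the host they think they are on.
  if [ "$typed_host" != "$(uname -n)" ]; then
    echo "refusing: you typed '$typed_host' but this host is $(uname -n)"
    return 1
  fi

  # Check 2: a postmaster.pid suggests Postgres is (or was just) running here.
  if [ -e "$datadir/postmaster.pid" ]; then
    echo "refusing: $datadir contains postmaster.pid, Postgres may be running"
    return 1
  fi

  # ${datadir:?} aborts when the variable is empty, so this can never expand to /*
  rm -rf -- "${datadir:?}"/*
}
```

&lt;p&gt;Neither check is foolproof (a stale &lt;code&gt;postmaster.pid&lt;&#x2F;code&gt; can survive a crash, for instance), but each one adds a moment where a command aimed at the wrong server would stop instead of proceeding.&lt;&#x2F;p&gt;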
&lt;p&gt;We restored the service after 24h offline, &lt;em&gt;without any data loss&lt;&#x2F;em&gt;, despite losing both our primary and secondary databases. This amounts to a great Recovery Point Objective and is testament to our PITR processes, which we test regularly. We should take pride in the recovery, but we need to work on a shorter Recovery Time Objective. We’re currently talking to service providers to get free infrastructure that would make it easier and faster to recover.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;we-can-have-better-tools&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#we-can-have-better-tools&quot; aria-label=&quot;Anchor link for: we-can-have-better-tools&quot;&gt;🔗&lt;&#x2F;a&gt;We can have better tools&lt;&#x2F;h3&gt;
&lt;p&gt;We upgraded wal-g on all servers, not just the Disaster Recovery Server, and have done a round of Disaster Recovery testing with it. We have not yet explored how to ensure the Disaster Recovery Server and the production servers stay aligned.&lt;&#x2F;p&gt;
&lt;p&gt;At the next hardware refresh, we will explore using ZFS so we can make local snapshots and recover much more quickly from not so happy accidents such as accidentally wiping the wrong database.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;we-have-a-great-community-and-providers&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#we-have-a-great-community-and-providers&quot; aria-label=&quot;Anchor link for: we-have-a-great-community-and-providers&quot;&gt;🔗&lt;&#x2F;a&gt;We have a great community and providers&lt;&#x2F;h3&gt;
&lt;p&gt;We received a lot of support on social media where we communicated actively around the incident. This was welcomed positively by the broad community, despite our status page not receiving the attention it deserved. We’re adding steps to our incident response playbook to update &lt;a href=&quot;http:&#x2F;&#x2F;status.matrix.org&quot;&gt;status.matrix.org&lt;&#x2F;a&gt; as the canonical source of truth during an incident, and liaise with the advocacy team to keep social media updated as well.&lt;&#x2F;p&gt;
&lt;p&gt;The SRE team would like to thank our hosting provider Mythic Beasts. They reached out quickly and proactively when adding new disks, reporting the errors they were seeing. They have been much more than just a pair of remote hands. They also reached out with an offer of support during the incident.&lt;&#x2F;p&gt;
&lt;p&gt;Finally, we’d like to sincerely apologise again to everyone impacted by the outage. We hope you found the post-mortem informative, and if you would like to investigate running your own homeserver, there are plenty of &lt;a href=&quot;https:&#x2F;&#x2F;matrix.org&#x2F;ecosystem&#x2F;distributions&#x2F;&quot;&gt;distributions&lt;&#x2F;a&gt; to choose from.&lt;&#x2F;p&gt;
</content>
</entry>

    
<entry xml:lang="en">
    <title>Matrix.org (Official Account) and Terms updates</title>
    <published>2025-07-31T00:00:00+00:00</published>
    <updated>2025-07-31T00:00:00+00:00</updated>
    <author>
      <name>Amandine Le Pape</name>
    </author>
    <link rel="alternate" href="https://matrix.org/blog/2025/07/terms-update/" type="text/html"/>
    <id>https://matrix.org/blog/2025/07/terms-update/</id>
    <content type="html">&lt;p&gt;Users of the Matrix.org homeserver have recently received – or will shortly receive as the notifications are rolled out progressively – an invite from a user called &lt;em&gt;Matrix.org (Official Account).&lt;&#x2F;em&gt; Those checking the room will have noticed that it announces upcoming changes to the Matrix.org Homeserver Terms and Conditions.&lt;&#x2F;p&gt;
&lt;p&gt;Some of you have asked us questions about these two events so we would like to offer some clarification and (hopefully) some reassurance.&lt;&#x2F;p&gt;
&lt;span id=&quot;continue-reading&quot;&gt;&lt;&#x2F;span&gt;&lt;h2 id=&quot;matrix-org-official-account&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#matrix-org-official-account&quot; aria-label=&quot;Anchor link for: matrix-org-official-account&quot;&gt;🔗&lt;&#x2F;a&gt;Matrix.org (Official Account)&lt;&#x2F;h2&gt;
&lt;p&gt;Firstly, the Matrix.org (Official Account): since not all users with an account on the Matrix.org homeserver have an email address, or another way for us to reach them, linked to their Matrix account, we decided to use a room to send official communications to them.&lt;&#x2F;p&gt;
&lt;p&gt;This user is currently sending out messages about the upcoming changes to the Terms and Conditions (see next section), but we anticipate sending out other account-related messages in future. If you are a long-time user of the homeserver and scroll back, you may also see some messages from several years ago.&lt;&#x2F;p&gt;
&lt;p&gt;You can verify that this user is legitimate by checking the &lt;a href=&quot;&#x2F;homeserver&#x2F;official&#x2F;&quot;&gt;page on our website&lt;&#x2F;a&gt; we have dedicated to it. In short, the user’s matrix identifier is &lt;strong&gt;@server:matrix.org&lt;&#x2F;strong&gt; and the room should look like the screenshot below from various clients. Any communications coming from a different identifier are not coming from us.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;blog&#x2F;img&#x2F;official-account-1.png&quot; alt=&quot;A screenshot of a mouse hovering the avatar of an account. The tooltip display @server:matrix.org&quot; &#x2F;&gt;
&lt;img src=&quot;&#x2F;blog&#x2F;img&#x2F;official-account-2.png&quot; alt=&quot;A screenshot of the account view of an account. The identifier is @server:matrix.org&quot; &#x2F;&gt;
&lt;img src=&quot;&#x2F;blog&#x2F;img&#x2F;official-account-3.png&quot; alt=&quot;A screenshot of the account view of an account in Element web. The identifier is @server:matrix.org&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;h2 id=&quot;terms-and-conditions-updates&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#terms-and-conditions-updates&quot; aria-label=&quot;Anchor link for: terms-and-conditions-updates&quot;&gt;🔗&lt;&#x2F;a&gt;Terms and Conditions updates&lt;&#x2F;h2&gt;
&lt;p&gt;It is the first time we have introduced significant changes to our Terms and Conditions (also known as “&lt;a href=&quot;&#x2F;legal&#x2F;terms-and-conditions&quot;&gt;Homeserver Terms&lt;&#x2F;a&gt;”) since their creation in 2018. These changes have been material enough to warrant notifications going out to all matrix.org accounts. Some of these changes are associated with the incoming &lt;a href=&quot;&#x2F;blog&#x2F;2025&#x2F;06&#x2F;funding-homeserver-premium&#x2F;&quot;&gt;premium accounts&lt;&#x2F;a&gt; and others are directly related to our compliance with the Online Safety Act (OSA). But all in all, if you’re a typical matrix.org user there shouldn’t be any impacts to how you use Matrix on a day-to-day basis.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;premium-accounts-and-fair-usage-limits&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#premium-accounts-and-fair-usage-limits&quot; aria-label=&quot;Anchor link for: premium-accounts-and-fair-usage-limits&quot;&gt;🔗&lt;&#x2F;a&gt;Premium accounts and fair usage limits&lt;&#x2F;h3&gt;
&lt;p&gt;When it comes to the premium accounts, the main changes we’ve introduced are directly related to payment and clarification of fair usage limits. As &lt;a href=&quot;&#x2F;blog&#x2F;2025&#x2F;06&#x2F;funding-homeserver-premium&#x2F;&quot;&gt;announced a couple of weeks ago&lt;&#x2F;a&gt;, we will be progressively introducing premium accounts on the Matrix.org homeserver, first to new users, then slowly migrating existing accounts. This is due to the need for the Foundation to help both cover the costs of running the server and limit the abuse we are seeing from a few users.&lt;&#x2F;p&gt;
&lt;p&gt;The introduction of payment is why you might have noticed the terms becoming a bit heavier on the “legalese”: there are certain things we need to make sure we cover when payments are involved which can be quite dense. These are mainly clauses 5 and 6, around liability and payment, respectively. We will work to make these friendlier over time.&lt;&#x2F;p&gt;
&lt;p&gt;In terms of fair usage: until now the terms made no mention of it, which of course led to some users pushing the limits by storing a large amount of data on the server. We have now made it clear that this won’t be tolerated, but these fair usage limits should not impact normal users. We are introducing two rolling caps, one per 24 hours and one per 4 weeks. Their values are based on observed usage statistics, and we may iterate on them if needed.&lt;&#x2F;p&gt;
&lt;p&gt;It is worth noting that whilst the terms will be effective on the 14th of August for existing users, they will be effective immediately for new users.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;With regards to the roll out of premium accounts, here are a few important notes:&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Pricing details and usage limits are published &lt;a href=&quot;&#x2F;homeserver&#x2F;pricing&#x2F;&quot;&gt;here&lt;&#x2F;a&gt; and may still evolve&lt;&#x2F;strong&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Premium accounts will start being rolled out to some of the new users in a few days&lt;&#x2F;strong&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Existing users will only be migrated from their legacy account to the new Free accounts in several weeks (we will communicate when we start the migration)&lt;&#x2F;strong&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Existing users will be notified with enough advance warning before the migration of their account happens&lt;&#x2F;strong&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;For any questions, please refer to the &lt;a href=&quot;&#x2F;blog&#x2F;2025&#x2F;06&#x2F;funding-homeserver-premium&quot;&gt;announcement blog post&lt;&#x2F;a&gt; or join &lt;a href=&quot;https:&#x2F;&#x2F;matrix.to&#x2F;#&#x2F;#foundation-office:matrix.org&quot;&gt;https:&#x2F;&#x2F;matrix.to&#x2F;#&#x2F;#foundation-office:matrix.org&lt;&#x2F;a&gt;.&lt;&#x2F;strong&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h3 id=&quot;online-safety-act-and-digital-services-act-related-changes&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#online-safety-act-and-digital-services-act-related-changes&quot; aria-label=&quot;Anchor link for: online-safety-act-and-digital-services-act-related-changes&quot;&gt;🔗&lt;&#x2F;a&gt;Online Safety Act and Digital Services Act related changes&lt;&#x2F;h3&gt;
&lt;p&gt;Outside of these, the most common questions we are seeing are associated with clause 3.2 “user content”. This explains the measures we take to comply with our OSA and DSA obligations around proactive monitoring for illegal content. To reiterate and be as clear as possible:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Your encrypted messages are not being looked at - we have never introduced backdoors and never will;&lt;&#x2F;strong&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Your private discussions remain private.&lt;&#x2F;strong&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;What we are doing is using readily available tooling like &lt;a href=&quot;https:&#x2F;&#x2F;developers.cloudflare.com&#x2F;cache&#x2F;reference&#x2F;csam-scanning&#x2F;&quot;&gt;Cloudflare’s scanning tool&lt;&#x2F;a&gt; to identify &lt;strong&gt;unencrypted&lt;&#x2F;strong&gt; media which might have a match with content identified in reputable Child Sexual Abuse Materials (CSAM) databases such as &lt;a href=&quot;https:&#x2F;&#x2F;www.missingkids.org&#x2F;home&quot;&gt;NCMEC&lt;&#x2F;a&gt;. Like all proactive tooling described in this clause, this is all done in public and unencrypted rooms. We have no way of looking into encrypted rooms nor do we have plans to do so.&lt;&#x2F;p&gt;
&lt;p&gt;If you’ve been around for a while you will have seen that we started &lt;a href=&quot;&#x2F;blog&#x2F;2021&#x2F;05&#x2F;19&#x2F;how-the-uk-s-online-safety-bill-threatens-matrix&#x2F;&quot;&gt;raising the alarm&lt;&#x2F;a&gt; about the dangers and potential risks of the OSA back in 2021. Whilst we have certain legal obligations as a UK based organisation, we will not falter or compromise on our principles and values.&lt;&#x2F;p&gt;
&lt;p&gt;We are planning to do a longer blog post on our regulatory compliance in the near future. If there is anything in particular you would like to see answered in that post, please feel free to drop us an email at &lt;a href=&quot;mailto:legal@matrix.org&quot;&gt;legal@matrix.org&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
</content>
</entry>

    
<entry xml:lang="en">
    <title>How we discovered, and recovered from, Postgres corruption on the matrix.org homeserver</title>
    <published>2025-07-23T00:00:00+00:00</published>
    <updated>2025-07-23T00:00:00+00:00</updated>
    <author>
      <name>Richard van der Hoff</name>
    </author>
    <link rel="alternate" href="https://matrix.org/blog/2025/07/postgres-corruption-postmortem/" type="text/html"/>
    <id>https://matrix.org/blog/2025/07/postgres-corruption-postmortem/</id>
    <content type="html">&lt;p&gt;Greetings from Element&#x27;s backend&#x2F;SRE team, who run the &lt;a href=&quot;https:&#x2F;&#x2F;matrix.org&#x2F;homeserver&#x2F;&quot;&gt;&lt;code&gt;matrix.org&lt;&#x2F;code&gt; homeserver&lt;&#x2F;a&gt; on behalf of the Matrix.org Foundation.&lt;&#x2F;p&gt;
&lt;p&gt;Recently users of the &lt;code&gt;matrix.org&lt;&#x2F;code&gt; homeserver began &lt;a href=&quot;https:&#x2F;&#x2F;status.matrix.org&#x2F;incidents&#x2F;8gljb3gtlv11&quot;&gt;seeing problems where rooms would simply stop working&lt;&#x2F;a&gt;. Operations such as sending a new message, or joining the room as a new member, would fail for mysterious reasons. Where an error message was shown at all, it tended to be something cryptic like &quot;No create event in auth events&quot;.&lt;&#x2F;p&gt;
&lt;p&gt;After a couple of weeks of hard work by a team of Element staff including backend developers and systems engineers, we were able to repair almost all of the affected rooms. Although we&#x27;re still investigating exactly what went wrong and checking that everything is now working as it should, we&#x27;d like to share some details about what we know and the work we&#x27;ve done to date.&lt;&#x2F;p&gt;
&lt;p&gt;We&#x27;ll be diving into some quite technical details. Hopefully you&#x27;ll find it interesting learning a bit about how Synapse works, how Postgres works, and the work we sometimes find ourselves doing to keep the &lt;code&gt;matrix.org&lt;&#x2F;code&gt; homeserver running.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;tl-dr&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#tl-dr&quot; aria-label=&quot;Anchor link for: tl-dr&quot;&gt;🔗&lt;&#x2F;a&gt;TL;DR&lt;&#x2F;h2&gt;
&lt;p&gt;Let&#x27;s start with a high-level summary.&lt;&#x2F;p&gt;
&lt;p&gt;The &lt;code&gt;matrix.org&lt;&#x2F;code&gt; homeserver is backed by a large PostgreSQL database instance. Parts of an index on one of the tables in this database had become corrupted. We are unsure exactly what caused this corruption, but believe it happened at least a year ago, and likely significantly longer.&lt;&#x2F;p&gt;
&lt;p&gt;The nature of this corruption was such that it had little or no effect at first. However, a background maintenance task which removes old, unreferenced data from this table recently started working on the corrupted region. Due to the corrupt index, the maintenance task incorrectly removed &lt;em&gt;active&lt;&#x2F;em&gt; data from the table, in effect corrupting rooms.&lt;&#x2F;p&gt;
&lt;p&gt;Having identified the problem, we rebuilt the corrupted index, and then restored the data that had been incorrectly removed, from database backups.&lt;&#x2F;p&gt;
&lt;span id=&quot;continue-reading&quot;&gt;&lt;&#x2F;span&gt;&lt;h2 id=&quot;initial-investigations-or-what-exactly-is-a-state-group&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#initial-investigations-or-what-exactly-is-a-state-group&quot; aria-label=&quot;Anchor link for: initial-investigations-or-what-exactly-is-a-state-group&quot;&gt;🔗&lt;&#x2F;a&gt;Initial investigations, or &quot;what exactly is a state group?&quot;&lt;&#x2F;h2&gt;
&lt;p&gt;We were first alerted to the problem via a &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;element-hq&#x2F;synapse&#x2F;issues&#x2F;18606&quot;&gt;bug report&lt;&#x2F;a&gt; from a user, and similar reports in public Matrix rooms and other social media. As more anecdotal reports came in, we started to investigate what was going on.&lt;&#x2F;p&gt;
&lt;p&gt;To understand what we found, you&#x27;ll need to understand what we mean by a &quot;state group&quot;.&lt;&#x2F;p&gt;
&lt;p&gt;As most readers probably know, Matrix allows applications to associate &quot;state&quot; with a room. In contrast to &quot;message&quot; events which are normal messages that fit at one particular point in the timeline, state sticks around, visible to all, until it is replaced. One example of state is a user&#x27;s room membership — whether or not they are currently a member of that room. Another example is &lt;code&gt;m.room.name&lt;&#x2F;code&gt;, which, as the name implies, holds the room&#x27;s name.&lt;&#x2F;p&gt;
&lt;p&gt;Yet another type of state is the &quot;create event&quot;: this is the very first event that happened in a room. The create event is somewhat special in that it can never be changed, but we still always expect it to be part of the room state.&lt;&#x2F;p&gt;
&lt;p&gt;Obviously, the state of a room changes over time. What may be less obvious is that a homeserver often needs to know what the state of a room was at some point in the past, to answer questions such as &quot;should this user be allowed to see this event&quot; or &quot;should I accept this event that has been sent to me over federation from another homeserver&quot;. Whilst in theory we could figure out what the state was at any given point in history by replaying each event that happened in the room before that point, that would be extremely computationally intensive. So in practice, homeservers end up storing what amounts to a snapshot of the room state at each historical event.&lt;&#x2F;p&gt;
&lt;p&gt;Of course, regular events don&#x27;t change the state of the room, so there is no point actually storing the state at each of those events. So, at last we can understand what a &quot;state group&quot; is: Synapse groups together a set of events in a given room, where the state in that room remained unchanged. In other words, a run of &lt;code&gt;m.room.message&lt;&#x2F;code&gt; events (normal room messages) will likely all share the same &quot;state group&quot;. Once somebody changes the room state (for example, by joining the room), we&#x27;ll start a new state group, and subsequent events will be part of that new state group.&lt;&#x2F;p&gt;
&lt;p&gt;The diagram below illustrates this. Blue creates a new room, and Yellow joins. The first few events each change the state of the room, meaning that each new event goes into a new state group. But events &lt;code&gt;F&lt;&#x2F;code&gt; and &lt;code&gt;G&lt;&#x2F;code&gt; are regular messages, meaning they don&#x27;t change the state of the room. The room state after each of events &lt;code&gt;E&lt;&#x2F;code&gt;, &lt;code&gt;F&lt;&#x2F;code&gt; and &lt;code&gt;G&lt;&#x2F;code&gt; is the same, so they can all be in state group 5.&lt;&#x2F;p&gt;
&lt;p&gt;Things get a bit more complicated at &lt;code&gt;H&lt;&#x2F;code&gt; and &lt;code&gt;I&lt;&#x2F;code&gt;: both Yellow and Blue try to change the name at the same time, so the state after &lt;code&gt;H&lt;&#x2F;code&gt; includes &lt;code&gt;H&lt;&#x2F;code&gt; and the state after &lt;code&gt;I&lt;&#x2F;code&gt; includes &lt;code&gt;I&lt;&#x2F;code&gt;. The state resolution algorithm determines that &lt;code&gt;I&lt;&#x2F;code&gt; ends up &quot;winning&quot;, so the state after &lt;code&gt;J&lt;&#x2F;code&gt; includes &lt;code&gt;I&lt;&#x2F;code&gt; and not &lt;code&gt;H&lt;&#x2F;code&gt;, meaning that &lt;code&gt;J&lt;&#x2F;code&gt; (and &lt;code&gt;K&lt;&#x2F;code&gt;) can share state group 7 with &lt;code&gt;I&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;blog&#x2F;img&#x2F;stategroups.png&quot; alt=&quot;State-groups diagram&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Now, when we started investigating the rooms where people had reported problems, we found clear signs of corrupted state groups. For example, the state in some of the state groups in affected rooms was completely empty. As I said earlier, the room&#x27;s create event is always part of the state of a room, and it can never change, so finding state groups whose state does not at least include a create event was a big red flag.&lt;&#x2F;p&gt;
&lt;p&gt;This also gives a clue to the meaning of that error I mentioned earlier: when we decide whether to accept an event into the room, we check the state of the room. One of the things we check for is the presence of a create event: &quot;No create event in auth events&quot; means Synapse rejected the new event because there was no create event in the room state.&lt;&#x2F;p&gt;
&lt;p&gt;There&#x27;s one more wrinkle we&#x27;ll need to understand about state groups. As you can see in the diagram above, most state groups only differ very slightly (typically by a single piece of state) from the previous state group in the same room. Storing a complete snapshot of the state every time the state in a room changes would be very expensive in terms of storage. So instead, Synapse normally just stores the difference from an earlier state group; then, to stop lookups becoming too expensive, we store a complete snapshot every 100 state groups or so.&lt;&#x2F;p&gt;
&lt;p&gt;Again, you can see that &quot;compression&quot; technique at play in the diagram above. Most state groups have a grey arrow representing the link to the previous state group, meaning that each state group only needs to store the delta from the previous state group (shown in bold whilst those states implied by the &quot;previous&quot; link are greyed out). State groups 1 and 8 are stored as complete snapshots.&lt;&#x2F;p&gt;
&lt;p&gt;Synapse stores all this data in its database: the &lt;code&gt;event_to_state_groups&lt;&#x2F;code&gt; table tells us which state group each event is in, &lt;code&gt;state_groups_state&lt;&#x2F;code&gt; stores the actual state snapshot or delta for that state group, and &lt;code&gt;state_group_edges&lt;&#x2F;code&gt; gives us the previous state group for delta-stored state groups.&lt;&#x2F;p&gt;
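&lt;p&gt;To make the delta storage concrete, here is a rough sketch of how the full state for one state group could be reassembled from those tables with a recursive query. It is an illustration under stated assumptions rather than the exact query Synapse runs: the &lt;code&gt;state_key&lt;&#x2F;code&gt; and &lt;code&gt;event_id&lt;&#x2F;code&gt; columns and the placeholder ID &lt;code&gt;12345&lt;&#x2F;code&gt; do not appear elsewhere in this post, and the sketch assumes that a state group always has a higher ID than the group it is a delta against.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#1e1e1e;color:#dcdcdc;&quot;&gt;&lt;code&gt;&lt;span&gt;-- Sketch only: state_key, event_id and the ID 12345 are assumptions.
&lt;&#x2F;span&gt;&lt;span&gt;WITH RECURSIVE chain(state_group) AS (
&lt;&#x2F;span&gt;&lt;span&gt;    -- Start from the state group we are interested in (hypothetical ID).
&lt;&#x2F;span&gt;&lt;span&gt;    VALUES (12345::bigint)
&lt;&#x2F;span&gt;&lt;span&gt;    UNION ALL
&lt;&#x2F;span&gt;&lt;span&gt;    -- Follow each previous-group link until we reach a full snapshot.
&lt;&#x2F;span&gt;&lt;span&gt;    SELECT e.prev_state_group
&lt;&#x2F;span&gt;&lt;span&gt;    FROM state_group_edges e
&lt;&#x2F;span&gt;&lt;span&gt;    JOIN chain c ON e.state_group = c.state_group
&lt;&#x2F;span&gt;&lt;span&gt;)
&lt;&#x2F;span&gt;&lt;span&gt;-- For each (type, state_key) pair, keep the row from the highest-numbered
&lt;&#x2F;span&gt;&lt;span&gt;-- (assumed newest) state group in the chain.
&lt;&#x2F;span&gt;&lt;span&gt;SELECT DISTINCT ON (type, state_key) type, state_key, event_id
&lt;&#x2F;span&gt;&lt;span&gt;FROM state_groups_state
&lt;&#x2F;span&gt;&lt;span&gt;WHERE state_group IN (SELECT state_group FROM chain)
&lt;&#x2F;span&gt;&lt;span&gt;ORDER BY type, state_key, state_group DESC;
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Because every lookup walks the chain in this way, the state of a group is only as complete as the groups it links back to.&lt;&#x2F;p&gt;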
&lt;h2 id=&quot;the-hunt-for-suspects&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#the-hunt-for-suspects&quot; aria-label=&quot;Anchor link for: the-hunt-for-suspects&quot;&gt;🔗&lt;&#x2F;a&gt;The hunt for suspects&lt;&#x2F;h2&gt;
&lt;p&gt;Thanks to the way Matrix works, once Synapse has created a state group, we very rarely ever have to change it. (If more events arrive, they may be assigned to an existing state group, but the state group itself, and the room state for that state group, remain unchanged). The only exceptions are:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;the &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;matrix-org&#x2F;rust-synapse-compress-state&quot;&gt;state compressor&lt;&#x2F;a&gt;, which rewrites state groups so that they can be stored more efficiently.&lt;&#x2F;li&gt;
&lt;li&gt;purge operations, where all or part of a room&#x27;s history is removed from the database, making the corresponding state groups redundant.&lt;&#x2F;li&gt;
&lt;li&gt;a cleanup job which removes state groups which were created but never used.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;... and of course, the creation of the state group in the first place.&lt;&#x2F;p&gt;
&lt;p&gt;At least that gave us a place to start looking, but since we hadn&#x27;t made any changes to those areas of the code recently, we were still at a bit of a loss.&lt;&#x2F;p&gt;
&lt;p&gt;The state compressor was easy to rule out, at least, since it runs as a separate process and we were certain it wasn&#x27;t running on &lt;code&gt;matrix.org&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;As a precaution, we temporarily disabled the cleanup job that removes redundant state groups. We couldn&#x27;t figure out how it could cause the problem, but better safe than sorry, and disabling it would just mean we used a bit more disk space for a while.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;more-evidence-comes-to-light&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#more-evidence-comes-to-light&quot; aria-label=&quot;Anchor link for: more-evidence-comes-to-light&quot;&gt;🔗&lt;&#x2F;a&gt;More evidence comes to light&lt;&#x2F;h2&gt;
&lt;p&gt;Our next step was to try and figure out when the problem started. Searching the logs for one Synapse process gave some clear, and worrying, results:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;2025-06-24: 0 results for “No create event”&lt;&#x2F;li&gt;
&lt;li&gt;2025-06-25: 0 results for “No create event”&lt;&#x2F;li&gt;
&lt;li&gt;2025-06-26: 0 results for “No create event”&lt;&#x2F;li&gt;
&lt;li&gt;2025-06-27: 48 results for “No create event”&lt;&#x2F;li&gt;
&lt;li&gt;2025-06-28: 1100 results for “No create event”&lt;&#x2F;li&gt;
&lt;li&gt;2025-06-29: 3610 results for “No create event”&lt;&#x2F;li&gt;
&lt;li&gt;2025-06-30: 6902 results for “No create event”&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;So, we double-checked for changes that had been made around 27th June, and still didn&#x27;t find anything. We considered rolling back Synapse to an older version, but since we couldn&#x27;t figure out what had changed, we didn&#x27;t know how far we would have to roll back.&lt;&#x2F;p&gt;
&lt;p&gt;What&#x27;s more, we found that state groups which must have been fine initially (say, on 2025-06-29) were now corrupt. This confirmed that the problem wasn&#x27;t that we were creating new, invalid state groups, but that a process somewhere in the system was corrupting &lt;em&gt;existing&lt;&#x2F;em&gt; state groups.&lt;&#x2F;p&gt;
&lt;p&gt;The diagram below illustrates the problem. The state in state group 4 has been corrupted, meaning that that state group (and state groups 5, 6, and 7 which all reference it) are now missing an important part of the room state, and events will not be authorised.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;blog&#x2F;img&#x2F;stategroups-borked.png&quot; alt=&quot;Broken state-groups diagram&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;h2 id=&quot;some-remedial-steps&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#some-remedial-steps&quot; aria-label=&quot;Anchor link for: some-remedial-steps&quot;&gt;🔗&lt;&#x2F;a&gt;Some remedial steps&lt;&#x2F;h2&gt;
&lt;p&gt;Now that we knew we were dealing with data loss, it seemed likely that we would need to restore data from backup, so we started restoring the database backup from 26th June into a new Postgres instance hosted in Amazon EC2. The restore process takes several hours, so we wanted to get it started early. On the other hand, it would leave the Matrix Foundation with an EC2 bill of hundreds of USD per day for an EC2 instance large enough to host the database!&lt;&#x2F;p&gt;
&lt;p&gt;We also set up a guard against further corruption: we &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;element-hq&#x2F;synapse&#x2F;blob&#x2F;64126ac9797895ce24734b4093cb849b4f9c5468&#x2F;synapse&#x2F;storage&#x2F;schema&#x2F;state&#x2F;delta&#x2F;92&#x2F;08_no_empty_state_groups.sql.postgres&quot;&gt;added&lt;&#x2F;a&gt; a Postgres &quot;constraint&quot; which would reject any SQL queries which attempted to delete the state from a state group while that state group was still in use.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;a-culprit-emerges&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#a-culprit-emerges&quot; aria-label=&quot;Anchor link for: a-culprit-emerges&quot;&gt;🔗&lt;&#x2F;a&gt;A culprit emerges&lt;&#x2F;h2&gt;
&lt;p&gt;By this point, it was the morning of 3rd July. The cleanup job had been disabled for 24 hours, and we hadn&#x27;t seen any further corruption. Now that we had the protective constraint in place, we decided to re-enable the cleanup job, and see what happened. Almost immediately, we could see that the cleanup job was hitting the constraint. From the Postgres logs:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#1e1e1e;color:#dcdcdc;&quot;&gt;&lt;code&gt;&lt;span&gt;2025-07-03 12:30:38.250 UTC [matrix background_worker1] ERROR: Deleting state_groups_state row when it still exists in state_groups_edges: prev_state_group = 963361509
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;... meaning it was trying to delete the state for state group &lt;code&gt;963361509&lt;&#x2F;code&gt; while that state group was still in use. The Synapse logs, meanwhile, suggested it was actually trying to delete completely different state groups. Was it a bug in Synapse? Or the &lt;a href=&quot;https:&#x2F;&#x2F;pypi.org&#x2F;project&#x2F;psycopg2&#x2F;&quot;&gt;Python Postgres driver&lt;&#x2F;a&gt;?&lt;&#x2F;p&gt;
&lt;p&gt;We spent a while narrowing down the problem, even resorting to &lt;a href=&quot;https:&#x2F;&#x2F;www.tcpdump.org&#x2F;&quot;&gt;tcpdump&lt;&#x2F;a&gt; to see what was happening between Synapse and the database. With &lt;code&gt;tcpdump&lt;&#x2F;code&gt;, we could see &lt;code&gt;DELETE&lt;&#x2F;code&gt; queries being made, but none which would affect state group &lt;code&gt;963361509&lt;&#x2F;code&gt;. Maybe this was actually a bug in &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;postgresml&#x2F;pgcat&quot;&gt;PgCat&lt;&#x2F;a&gt;, which we use to pool Postgres connections? Or even in Postgres itself?&lt;&#x2F;p&gt;
&lt;p&gt;We tried replaying the query that &lt;code&gt;tcpdump&lt;&#x2F;code&gt; had captured. Here&#x27;s a screenshot from our ops room:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;blog&#x2F;img&#x2F;oh-wow.png&quot; alt=&quot;A transcript from our ops room, in which Erik notes that a DELETE query deletes different rows, and everyone else expresses astonishment&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Oh wow indeed. That shouldn&#x27;t happen. We narrowed the problem down to one particular state group: &lt;code&gt;483128098&lt;&#x2F;code&gt;. What happens if we just try and read that state group from the database?&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#1e1e1e;color:#dcdcdc;&quot;&gt;&lt;code&gt;&lt;span&gt;matrix=&amp;gt; SELECT state_group, room_id FROM state_groups_state WHERE state_group = 483128098;
&lt;&#x2F;span&gt;&lt;span&gt;state_group |            room_id
&lt;&#x2F;span&gt;&lt;span&gt;------------+----------------------------------------
&lt;&#x2F;span&gt;&lt;span&gt;  483128098 | !XtFbidoIcAVPuQtXcG:matrix.org
&lt;&#x2F;span&gt;&lt;span&gt;  963361875 | !IvVovpFpWhKsKMCGCO:irc.snt.utwente.nl
&lt;&#x2F;span&gt;&lt;span&gt;  483128098 | !XtFbidoIcAVPuQtXcG:matrix.org
&lt;&#x2F;span&gt;&lt;span&gt;(3 rows)
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Oh dear. Once your database starts returning nonsense results, you&#x27;re going to be in for a bad time.&lt;&#x2F;p&gt;
&lt;p&gt;What it meant here was that, although the cleanup job was (correctly) trying to clean up state group &lt;code&gt;483128098&lt;&#x2F;code&gt;, Postgres would &lt;em&gt;also&lt;&#x2F;em&gt; delete the data for state group &lt;code&gt;963361875&lt;&#x2F;code&gt;. Suddenly things started to make sense: rooms were getting corrupted by cleanup jobs for &lt;em&gt;completely unrelated&lt;&#x2F;em&gt; rooms.&lt;&#x2F;p&gt;
&lt;p&gt;We&#x27;ve encountered Postgres index corruption &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;matrix-org&#x2F;synapse&#x2F;issues&#x2F;6696&quot;&gt;before&lt;&#x2F;a&gt;, and this matched the symptoms perfectly. In short: the index entries for state group &lt;code&gt;483128098&lt;&#x2F;code&gt; point to the wrong place in the main table data (the &quot;heap&quot;). So, if we did a query that Postgres could answer by &lt;em&gt;just&lt;&#x2F;em&gt; looking at the index, we&#x27;d get plausible-looking results:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#1e1e1e;color:#dcdcdc;&quot;&gt;&lt;code&gt;&lt;span&gt;matrix=&amp;gt; SELECT state_group, type FROM state_groups_state WHERE state_group = 483128098;
&lt;&#x2F;span&gt;&lt;span&gt;state_group | type
&lt;&#x2F;span&gt;&lt;span&gt;------------+--------------
&lt;&#x2F;span&gt;&lt;span&gt;  483128098 | m.room.member
&lt;&#x2F;span&gt;&lt;span&gt;  483128098 | m.room.member
&lt;&#x2F;span&gt;&lt;span&gt;  483128098 | m.room.member
&lt;&#x2F;span&gt;&lt;span&gt;(3 rows)
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;... but as soon as Postgres had to look at the heap, it would return nonsense, as above.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;give-it-to-me-straight-doc&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#give-it-to-me-straight-doc&quot; aria-label=&quot;Anchor link for: give-it-to-me-straight-doc&quot;&gt;🔗&lt;&#x2F;a&gt;Give it to me straight, doc&lt;&#x2F;h2&gt;
&lt;p&gt;The good news, such as it was, was that we could now be reasonably certain that other homeservers would not be affected: this was data corruption on the &lt;code&gt;matrix.org&lt;&#x2F;code&gt; Postgres instance.&lt;&#x2F;p&gt;
&lt;p&gt;On the other hand, we had no idea how extensive the corruption was, when it had happened, or if it was still happening.&lt;&#x2F;p&gt;
&lt;p&gt;We did several things to try to assess the damage.&lt;&#x2F;p&gt;
&lt;p&gt;The first thing to check was whether both Postgres instances had the same problem. (We replicate all our data to a warm standby server using &lt;a href=&quot;https:&#x2F;&#x2F;www.postgresql.org&#x2F;docs&#x2F;current&#x2F;warm-standby.html#STREAMING-REPLICATION&quot;&gt;streaming replication&lt;&#x2F;a&gt; so that we can fail over rapidly in the event of a hardware failure.) As far as we could tell, both servers had identical corruption.&lt;&#x2F;p&gt;
&lt;p&gt;Secondly, we wrote a script which sampled the &lt;code&gt;state_groups_state&lt;&#x2F;code&gt; table to look for corruption. It told us that the problem was worryingly large: millions of state groups were affected. But for some reason, it only seemed to affect state groups in the range 147M - 541M. (State group 541M was created in January 2021. As of July 2025, we&#x27;re now up to 1040M.)&lt;&#x2F;p&gt;
&lt;p&gt;We also ran &lt;a href=&quot;https:&#x2F;&#x2F;www.postgresql.org&#x2F;docs&#x2F;current&#x2F;app-pgamcheck.html&quot;&gt;pg_amcheck&lt;&#x2F;a&gt; on the affected index. This is a tool that forms part of the Postgres distribution, and it checks for inconsistencies in all or part of a database. It took a while, but didn&#x27;t return any problems. This mostly told us that &lt;code&gt;amcheck&lt;&#x2F;code&gt; couldn&#x27;t detect this sort of corruption, but one thing it checks is that all rows in the table also appear in the index; so now we knew that we weren&#x27;t &lt;em&gt;missing&lt;&#x2F;em&gt; any index rows — we just had &lt;em&gt;extra&lt;&#x2F;em&gt; ones.&lt;&#x2F;p&gt;
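&lt;p&gt;The same kind of check can also be run from SQL through the &lt;code&gt;amcheck&lt;&#x2F;code&gt; extension that &lt;code&gt;pg_amcheck&lt;&#x2F;code&gt; builds on. The following is a minimal sketch and not necessarily the exact invocation that was run: the index name is the one that appears in the &lt;code&gt;pageinspect&lt;&#x2F;code&gt; queries later in this post, and passing &lt;code&gt;true&lt;&#x2F;code&gt; for the &lt;code&gt;heapallindexed&lt;&#x2F;code&gt; argument is an assumption.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#1e1e1e;color:#dcdcdc;&quot;&gt;&lt;code&gt;&lt;span&gt;-- Sketch only: assumes the amcheck extension is available on the server.
&lt;&#x2F;span&gt;&lt;span&gt;CREATE EXTENSION IF NOT EXISTS amcheck;
&lt;&#x2F;span&gt;&lt;span&gt;-- Second argument (heapallindexed) = true: in addition to the structural
&lt;&#x2F;span&gt;&lt;span&gt;-- checks, verify that every row in the table has an entry in the index.
&lt;&#x2F;span&gt;&lt;span&gt;SELECT bt_index_check(&amp;#39;state_groups_state_type_idx&amp;#39;::regclass, true);
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;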
&lt;p&gt;Meanwhile, we tried &lt;a href=&quot;https:&#x2F;&#x2F;www.postgresql.org&#x2F;message-id&#x2F;flat&#x2F;CAPo1J60Vcu%2B5G0EvvAZtYgTn6U6ADij3aVJ8WFVz77jP%2BBd_Tw%40mail.gmail.com&quot;&gt;reaching out&lt;&#x2F;a&gt; to the helpful folks on the &lt;code&gt;pgsql-general&lt;&#x2F;code&gt; mailing list. We figured if anyone knew what could have caused this, they would.&lt;&#x2F;p&gt;
&lt;p&gt;The final thing we did at this point was to take a look at the actual index data with &lt;a href=&quot;https:&#x2F;&#x2F;www.postgresql.org&#x2F;docs&#x2F;current&#x2F;pageinspect.html&quot;&gt;pageinspect&lt;&#x2F;a&gt;, to see if there were any clues there. It didn&#x27;t really tell us anything we didn&#x27;t already know (i.e., that the index rows were pointing at the wrong place in the heap), but it was interesting to check out the structure of the index.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;a-deeper-dive-into-postgres-indexes&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#a-deeper-dive-into-postgres-indexes&quot; aria-label=&quot;Anchor link for: a-deeper-dive-into-postgres-indexes&quot;&gt;🔗&lt;&#x2F;a&gt;A deeper dive into Postgres indexes&lt;&#x2F;h2&gt;
&lt;p&gt;On the morning of 4th July, our backup from 26th June at last finished restoring. That meant two things: first, we could check if it had the same index corruption as our primary and secondary servers (it did), and secondly, we could start to think about how to repair the damage.&lt;&#x2F;p&gt;
&lt;p&gt;We noticed something else interesting, though. On the production servers, some index entries pointed to state groups which didn&#x27;t yet exist on 26th June:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#1e1e1e;color:#dcdcdc;&quot;&gt;&lt;code&gt;&lt;span&gt;-- On the production database
&lt;&#x2F;span&gt;&lt;span&gt;matrix=&amp;gt; SELECT state_group, type, ctid FROM state_groups_state WHERE state_group = 353864583;
&lt;&#x2F;span&gt;&lt;span&gt; state_group |           type            |      ctid      
&lt;&#x2F;span&gt;&lt;span&gt;-------------+---------------------------+----------------
&lt;&#x2F;span&gt;&lt;span&gt;   353864583 | m.room.member             | (39060361,12)
&lt;&#x2F;span&gt;&lt;span&gt;  1034753774 | m.room.member             | (264925234,54)
&lt;&#x2F;span&gt;&lt;span&gt;  1034753810 | im.vector.modular.widgets | (264925240,54)
&lt;&#x2F;span&gt;&lt;span&gt;  1034753803 | m.room.member             | (264925252,54)
&lt;&#x2F;span&gt;&lt;span&gt;  1034753803 | m.room.member             | (264925252,55)
&lt;&#x2F;span&gt;&lt;span&gt;(5 rows)
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;(&lt;code&gt;ctid&lt;&#x2F;code&gt;, or &quot;current tuple ID&quot; is Postgres&#x27;s internal identifier for a row in a table: the format is a &lt;a href=&quot;https:&#x2F;&#x2F;www.postgresql.org&#x2F;docs&#x2F;current&#x2F;storage-page-layout.html&quot;&gt;page&lt;&#x2F;a&gt; number, followed by an offset within that page. We&#x27;ll return to &lt;code&gt;ctid&lt;&#x2F;code&gt;s in a minute.)&lt;&#x2F;p&gt;
&lt;p&gt;Those state groups (&lt;code&gt;1034753774&lt;&#x2F;code&gt; etc.) were only created on 3rd July, so clearly the backup will look different. Indeed:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#1e1e1e;color:#dcdcdc;&quot;&gt;&lt;code&gt;&lt;span&gt;-- On the restored backup
&lt;&#x2F;span&gt;&lt;span&gt;matrix=# SELECT state_group, type, ctid FROM state_groups_state WHERE state_group = 353864583;
&lt;&#x2F;span&gt;&lt;span&gt; state_group |     type      |     ctid      
&lt;&#x2F;span&gt;&lt;span&gt;-------------+---------------+---------------
&lt;&#x2F;span&gt;&lt;span&gt;   353864583 | m.room.member | (39060361,12)
&lt;&#x2F;span&gt;&lt;span&gt;(1 row)
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Did that mean that the corruption was ongoing? Time for another look with &lt;code&gt;pageinspect&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;As with most Postgres indexes, this one is a &lt;a href=&quot;https:&#x2F;&#x2F;www.postgresql.org&#x2F;docs&#x2F;current&#x2F;indexes-types.html#INDEXES-TYPES-BTREE&quot;&gt;B-Tree&lt;&#x2F;a&gt;. To find a specific entry, you start at the &quot;root&quot; of the tree (a single page which covers the whole table, but with very coarse index entries: there might be one sub-page for all the A&#x27;s, for example, and another for all the B&#x27;s), and work down the tree until you get to the right &quot;leaf&quot; page.&lt;&#x2F;p&gt;
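&lt;p&gt;The descent from root to leaf can be sketched in a few lines of Python. This is a toy model only: the &lt;code&gt;Page&lt;&#x2F;code&gt; class and &lt;code&gt;lookup&lt;&#x2F;code&gt; function are hypothetical, keys are plain integers, and real Postgres B-Tree pages carry extra machinery (high keys, sibling links, duplicate handling) that is left out here.&lt;&#x2F;p&gt;

```python
from bisect import bisect_right
from dataclasses import dataclass, field


@dataclass
class Page:
    keys: list                                     # sorted keys on this page
    children: list = field(default_factory=list)   # sub-pages; empty on a leaf
    ctids: list = field(default_factory=list)      # heap pointers; leaf pages only


def lookup(root, key):
    """Walk from the root to a leaf; return (matching ctids, pages visited)."""
    page, visited = root, 1
    while page.children:
        # In this toy layout, keys[i] is the smallest key under children[i + 1].
        page = page.children[bisect_right(page.keys, key)]
        visited += 1
    return [c for k, c in zip(page.keys, page.ctids) if k == key], visited


leaves = [
    Page(keys=[10, 20], ctids=[(1, 1), (1, 2)]),
    Page(keys=[30, 40], ctids=[(2, 1), (2, 2)]),
    Page(keys=[50, 60], ctids=[(3, 1), (3, 2)]),
]
root = Page(keys=[30, 50], children=leaves)
print(lookup(root, 40))
```

&lt;p&gt;Each step only inspects one page, which is why the number of pages read grows with the depth of the tree rather than with the size of the table.&lt;&#x2F;p&gt;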
&lt;p&gt;On our restored backup, we manually walked the tree to find the leaf index pages for state group &lt;code&gt;353864583&lt;&#x2F;code&gt;. Turned out, there were several pages of entries: it seems like, at some point in the past, this state group had lots of state associated with it. Anyway, the interesting page was this:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#1e1e1e;color:#dcdcdc;&quot;&gt;&lt;code&gt;&lt;span&gt;-- On the restored backup
&lt;&#x2F;span&gt;&lt;span&gt;matrix=# select ctid, left(data, 77) as data from bt_page_items(&amp;#39;state_groups_state_type_idx&amp;#39;, 192904826);
&lt;&#x2F;span&gt;&lt;span&gt;      ctid      |                                     data                                      
&lt;&#x2F;span&gt;&lt;span&gt;----------------+-------------------------------------------------------------------------------
&lt;&#x2F;span&gt;&lt;span&gt; (264925236,41) | 87 8b 17 15 00 00 00 00 1d 6d 2e 72 6f 6f 6d 2e 6d 65 6d 62 65 72 35 40 66 72
&lt;&#x2F;span&gt;&lt;span&gt; (264925234,54) | 87 8b 17 15 00 00 00 00 1d 6d 2e 72 6f 6f 6d 2e 6d 65 6d 62 65 72 4b 40 66 72
&lt;&#x2F;span&gt;&lt;span&gt; (264925235,54) | 87 8b 17 15 00 00 00 00 1d 6d 2e 72 6f 6f 6d 2e 6d 65 6d 62 65 72 47 40 66 72
&lt;&#x2F;span&gt;&lt;span&gt;(3 rows)
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Being a leaf index page, the &lt;code&gt;ctid&lt;&#x2F;code&gt; points to the actual row in the heap. This is an index on &lt;code&gt;(state_group, type, state_key)&lt;&#x2F;code&gt;, so the &lt;code&gt;data&lt;&#x2F;code&gt; here is:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;a little-endian 64-bit representation of &lt;code&gt;353864583&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;li&gt;a length&#x2F;flags byte (&lt;code&gt;1d&lt;&#x2F;code&gt; =&amp;gt; 13 bytes of uncompressed text)&lt;&#x2F;li&gt;
&lt;li&gt;the event type (&lt;code&gt;m.room.member&lt;&#x2F;code&gt;)&lt;&#x2F;li&gt;
&lt;li&gt;another length&#x2F;flags byte&lt;&#x2F;li&gt;
&lt;li&gt;the &lt;code&gt;state_key&lt;&#x2F;code&gt;: a user ID, which I&#x27;ve truncated in the above for brevity and privacy.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
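&lt;p&gt;As a cross-check, the first row of that output can be decoded by hand. The following Python sketch follows the layout described in the list above; it assumes a little-endian server and that both text values use the short one-byte length header (low bit set, remaining bits holding the total length including the header byte). The function name &lt;code&gt;read_short_text&lt;&#x2F;code&gt; is made up for this example.&lt;&#x2F;p&gt;

```python
def read_short_text(buf, pos):
    """Read a text value with a 1-byte header; return (text, next position)."""
    header = buf[pos]
    # Assumption: little-endian short header. The low bit marks the 1-byte
    # form; the other 7 bits give the total length, header byte included.
    if header % 2 != 1:
        raise ValueError("not a 1-byte length header")
    total_len = header // 2
    payload = buf[pos + 1 : pos + total_len]
    return payload.decode("utf-8"), pos + total_len


# First row of the bt_page_items output above (already truncated to 26 bytes).
raw = bytes.fromhex(
    "87 8b 17 15 00 00 00 00 1d 6d 2e 72 6f 6f 6d 2e 6d 65 6d 62 65 72 35 40 66 72"
)

state_group = int.from_bytes(raw[:8], "little")     # 64-bit little-endian integer
event_type, pos = read_short_text(raw, 8)           # header byte 0x1d, then the type
state_key_len = raw[pos] // 2 - 1                   # full length of the state_key
state_key_prefix = raw[pos + 1 :].decode("utf-8")   # only the visible, truncated part

print(state_group, event_type, state_key_len, state_key_prefix)
```

&lt;p&gt;The decoded integer matches the state group we were querying for, which confirms that this index entry really does claim to belong to it.&lt;&#x2F;p&gt;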
&lt;p&gt;The point is, even in the backup, we have index rows pointing to heap tuple &lt;code&gt;(264925234,54)&lt;&#x2F;code&gt;. And what is at that heap tuple?&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#1e1e1e;color:#dcdcdc;&quot;&gt;&lt;code&gt;&lt;span&gt;matrix=# SELECT * FROM heap_page_items(get_raw_page(&amp;#39;state_groups_state&amp;#39;, 264925234));
&lt;&#x2F;span&gt;&lt;span&gt; lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid | t_data 
&lt;&#x2F;span&gt;&lt;span&gt;----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+--------
&lt;&#x2F;span&gt;&lt;span&gt;  1 |      0 |        0 |      0 |        |        |          |        |             |            |        |        |       | 
&lt;&#x2F;span&gt;&lt;span&gt;(1 row)
&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Nothing at all. That tuple doesn&#x27;t exist. It&#x27;s just empty space in the table data.&lt;&#x2F;p&gt;
&lt;p&gt;Finally, we can understand a bit about what&#x27;s happened here. The corruption is &lt;strong&gt;not&lt;&#x2F;strong&gt; ongoing. Rather, the index was already corrupt at the time the backup was taken, but the index rows point into empty space -- and apparently Postgres ignores such index rows.&lt;&#x2F;p&gt;
&lt;p&gt;Then, on 3rd July, that empty space got used for state group &lt;code&gt;1034753774&lt;&#x2F;code&gt;, meaning that the index entry for state group &lt;code&gt;353864583&lt;&#x2F;code&gt; now points to the data for state group &lt;code&gt;1034753774&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
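&lt;p&gt;The sequence of events can be reproduced with a toy model in Python. This is purely illustrative and not how Postgres is implemented: the heap is a dictionary keyed by &lt;code&gt;ctid&lt;&#x2F;code&gt;, the index is a list of &lt;code&gt;(state_group, ctid)&lt;&#x2F;code&gt; pairs, and &lt;code&gt;index_scan&lt;&#x2F;code&gt; is a hypothetical helper that skips entries pointing at empty space, mirroring the behaviour we observed.&lt;&#x2F;p&gt;

```python
# Toy heap: ctid -> row. The slot (264925234, 54) is absent, i.e. free space.
heap = {
    (39060361, 12): {"state_group": 353864583, "type": "m.room.member"},
}

# Toy index: (indexed state_group, ctid). The second entry is the dangling one.
index = [
    (353864583, (39060361, 12)),
    (353864583, (264925234, 54)),
]


def index_scan(state_group):
    """Return heap rows reached via the index, ignoring pointers to free space."""
    rows = []
    for key, ctid in index:
        if key == state_group and ctid in heap:
            rows.append(heap[ctid])
    return rows


# At backup time: the dangling entry is silently skipped.
before = index_scan(353864583)

# Later, the free slot is reused for a row belonging to a different state group.
heap[(264925234, 54)] = {"state_group": 1034753774, "type": "m.room.member"}
index.append((1034753774, (264925234, 54)))

# The same query now also returns a row that belongs to the other state group.
after = index_scan(353864583)

print([row["state_group"] for row in before])
print([row["state_group"] for row in after])
```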
&lt;p&gt;This tells us something else interesting: this corruption could have been there for months or years, without anyone noticing. It was only once Postgres started populating that bit of table space that any problem would have been observable.&lt;&#x2F;p&gt;
&lt;p&gt;So why was the index entry pointing at empty space? That&#x27;s a great question, and something we spent a long time discussing. Presumably, at some point in the past, we used to have lots of entries in &lt;code&gt;state_groups_state&lt;&#x2F;code&gt; for state group &lt;code&gt;353864583&lt;&#x2F;code&gt;. Then, most of these entries were removed (likely by the &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;matrix-org&#x2F;rust-synapse-compress-state&quot;&gt;state compressor&lt;&#x2F;a&gt;), causing a bunch of free space to be created in the table data -- but for some reason, the index entries for those rows didn&#x27;t get correctly cleaned up, leaving them dangling.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;repairing-the-damage&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#repairing-the-damage&quot; aria-label=&quot;Anchor link for: repairing-the-damage&quot;&gt;🔗&lt;&#x2F;a&gt;Repairing the damage&lt;&#x2F;h2&gt;
&lt;p&gt;We now had enough information to start to get things working again.&lt;&#x2F;p&gt;
&lt;p&gt;The first priority was to get Postgres back to a consistent state. That meant rebuilding the index, which in itself wasn&#x27;t trivial, given the index takes up over 4 TB — but we had just enough spare disk, so we set the reindex going overnight.&lt;&#x2F;p&gt;
&lt;p&gt;Next, we needed to repair any state groups which were incorrectly modified by the cleanup job due to the corrupt index. To do this, we considered the range of state groups that the cleanup job had been working on, and wrote a script which queried each of those state groups on our restored backup, noting down the targets of any bogus data: this was the list of potential victims of incorrect cleanup.&lt;&#x2F;p&gt;
&lt;p&gt;We then cross-referenced that list of &lt;em&gt;potential&lt;&#x2F;em&gt; victims against the production database, checking for &lt;code&gt;state_groups_state&lt;&#x2F;code&gt; entries which had been removed but where the state group was still in use: this gave us the &lt;em&gt;actual&lt;&#x2F;em&gt; victim list. Each of those victims had to be re-inserted into the production database.&lt;&#x2F;p&gt;
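&lt;p&gt;Conceptually, the cross-referencing step is a set intersection. The Python sketch below is not the script we ran; it assumes the three inputs have already been gathered as sets of state group IDs, and the function name and the small IDs used are invented for illustration.&lt;&#x2F;p&gt;

```python
def actual_victims(potential_victims, groups_missing_rows, groups_in_use):
    """Narrow the potential victims down to the groups that need repair.

    potential_victims   -- targets of bogus index entries, from the restored backup
    groups_missing_rows -- groups whose entries have been removed from production
    groups_in_use       -- groups that production still refers to
    """
    return sorted(
        potential_victims.intersection(groups_missing_rows, groups_in_use)
    )


# Made-up IDs, purely for demonstration.
potential = {101, 102, 103, 104}
missing = {102, 103, 105}
in_use = {101, 103, 105}

print(actual_victims(potential, missing, in_use))
```

&lt;p&gt;Groups that lost rows but are no longer in use, or that were never touched by the cleanup job, drop out of the list, so only genuinely damaged groups are re-inserted.&lt;&#x2F;p&gt;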
&lt;p&gt;We started those scripts running on 5th July, but due to the amount of data involved, it took nearly a week before we were able to announce on 11th July that the majority of rooms were repaired.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;assessing-the-root-cause&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#assessing-the-root-cause&quot; aria-label=&quot;Anchor link for: assessing-the-root-cause&quot;&gt;🔗&lt;&#x2F;a&gt;Assessing the root cause&lt;&#x2F;h2&gt;
&lt;p&gt;So, what went wrong to cause those index pages to get corrupted? The short answer is, we don&#x27;t know.&lt;&#x2F;p&gt;
&lt;p&gt;First, some timeframes. We know for certain that corruption happened &lt;em&gt;after&lt;&#x2F;em&gt; January 2021 (or at least, that corruption was still ongoing at that point), since it affected state groups created at that time. And we know that it happened &lt;em&gt;before&lt;&#x2F;em&gt; July 2025, since corruption was present in the backup from the end of June. It&#x27;s hard to be any more certain than that.&lt;&#x2F;p&gt;
&lt;p&gt;The one thing we can be sure it&#x27;s &lt;em&gt;not&lt;&#x2F;em&gt; is a bug in Synapse or PgCat: there is no way that an application should be able to cause internal corruption within a Postgres database.&lt;&#x2F;p&gt;
&lt;p&gt;One possibility is a Postgres bug, but Postgres is an extremely robust piece of software, and the Postgres team treats corruption bugs extremely seriously. We were using Postgres 10.12 in January 2021, and we&#x27;ve looked through the Postgres release notes for every version since then, and not found any bug fixes that would fit this pattern.&lt;&#x2F;p&gt;
&lt;p&gt;It&#x27;s worth noting that Postgres relies heavily on its underlying filesystem, as well as the device drivers and hardware, to behave correctly: in particular, if the filesystem claims that data has been persisted, it really must have been persisted. Problems in this area are far from unknown: back in 2018, the Postgres team discovered that their 20-year-old assumptions about how &lt;code&gt;fsync&lt;&#x2F;code&gt; worked were incorrect (&lt;a href=&quot;https:&#x2F;&#x2F;wiki.postgresql.org&#x2F;wiki&#x2F;Fsync_Errors&quot;&gt;wiki page&lt;&#x2F;a&gt;, &lt;a href=&quot;https:&#x2F;&#x2F;archive.fosdem.org&#x2F;2019&#x2F;schedule&#x2F;event&#x2F;postgresql_fsync&#x2F;&quot;&gt;FOSDEM presentation&lt;&#x2F;a&gt;). But the fixes for that were backported to Postgres 10.7, so that problem can&#x27;t explain this corruption.&lt;&#x2F;p&gt;
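&lt;p&gt;To illustrate the contract being relied upon, here is a minimal Python sketch of a write that is only considered complete once the operating system reports it as persisted. It is not how Postgres does things internally; the function name &lt;code&gt;durable_write&lt;&#x2F;code&gt; is invented, and the directory sync assumes a POSIX system such as Linux.&lt;&#x2F;p&gt;

```python
import os
import tempfile


def durable_write(path, data):
    """Write data to path and return only once the OS reports it persisted."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()                 # push Python's buffer down to the OS
        os.fsync(f.fileno())      # ask the OS to push its cache to the device
    # Also sync the containing directory so the file's entry itself is durable.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
    # Any OSError above is deliberately left to propagate rather than retried:
    # after a failed fsync, a later successful one does not prove that the
    # earlier data reached the disk.


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "example.dat")
    durable_write(target, b"hello")
    with open(target, "rb") as f:
        round_trip = f.read()

print(round_trip)
```

&lt;p&gt;Everything after the &lt;code&gt;os.fsync&lt;&#x2F;code&gt; call depends on the kernel, drivers and hardware telling the truth, which is exactly the layer we cannot rule out here.&lt;&#x2F;p&gt;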
&lt;p&gt;So that really leaves kernel or disk firmware bugs, and hardware failures. Our filesystem is nothing fancy, just &lt;code&gt;ext4&lt;&#x2F;code&gt;, and we&#x27;re using stock Debian kernels. Some sort of hardware problem seems like the most plausible cause. We&#x27;re somewhat surprised that hardware failure would cause extensive damage to a single index, whilst apparently leaving all other data intact, but it&#x27;s at least possible.&lt;&#x2F;p&gt;
&lt;p&gt;For the curious: our current generation of database servers run Linux kernel 6.1, and each server uses eight 15TB Intel NVME SSDs in a RAID10 configuration to give us 64TB of storage. The previous generation (retired in November 2023) used 8TB SSDs with LVM and no RAID, on Linux 4.19. Of course, we have checked &lt;code&gt;fsck&lt;&#x2F;code&gt;, &lt;code&gt;smartctl&lt;&#x2F;code&gt; and &lt;code&gt;mdadm&lt;&#x2F;code&gt; for any errors on the current disks: none have shown up.&lt;&#x2F;p&gt;
&lt;p&gt;There was a disk failure on the primary database server in October 2021, which caused us to fail over to the secondary, so it&#x27;s conceivable that the dying disk lost some writes, though it would have to have been doing so for a while for the corruption to have made it onto the secondary. We&#x27;re not entirely satisfied with this explanation.&lt;&#x2F;p&gt;
&lt;p&gt;If you&#x27;ve got any ideas, &lt;a href=&quot;https:&#x2F;&#x2F;matrix.to&#x2F;#&#x2F;#synapse:matrix.org&quot;&gt;let us know&lt;&#x2F;a&gt;!&lt;&#x2F;p&gt;
&lt;h2 id=&quot;conclusions&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#conclusions&quot; aria-label=&quot;Anchor link for: conclusions&quot;&gt;🔗&lt;&#x2F;a&gt;Conclusions&lt;&#x2F;h2&gt;
&lt;p&gt;Incidents like this happen from time to time when running software services, particularly relatively large scale ones like the &lt;code&gt;matrix.org&lt;&#x2F;code&gt; homeserver. They are impossible to plan for and often, as in this case, take significant time and effort from people who would otherwise be developing features or fixing bugs.&lt;&#x2F;p&gt;
&lt;p&gt;We know that there are plenty of users out there who will have been affected by the problem, and found themselves unable to communicate as a result. We very much share your frustration, and we&#x27;d like to apologise for the disruption to service.&lt;&#x2F;p&gt;
&lt;p&gt;With that said, we&#x27;re glad that we were able to get to the bottom of most of the problem, and get the lost data restored within a relatively short time. If nothing else, hopefully this blog post will be of use to future generations faced with Postgres index corruption!&lt;&#x2F;p&gt;
</content>
</entry>

    
<entry xml:lang="en">
    <title>Introducing premium accounts to fund the matrix.org homeserver</title>
    <published>2025-06-13T14:00:00+00:00</published>
    <updated>2025-06-13T14:00:00+00:00</updated>
    <author>
      <name>Amandine Le Pape</name>
    </author>
    <link rel="alternate" href="https://matrix.org/blog/2025/06/funding-homeserver-premium/" type="text/html"/>
    <id>https://matrix.org/blog/2025/06/funding-homeserver-premium/</id>
    <content type="html">&lt;h2 id=&quot;tl-dr&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#tl-dr&quot; aria-label=&quot;Anchor link for: tl-dr&quot;&gt;🔗&lt;&#x2F;a&gt;TL;DR&lt;&#x2F;h2&gt;
&lt;p&gt;As we need to take more concrete steps to improve the financial situation of the Foundation, we will be rolling out a freemium offer for the matrix.org homeserver users. The alternative is to turn off the server, which we want to avoid doing. The goal is for the most active users to support the cost of the service. Free users will have limits on how they can use the service (mostly around media). The change can be supported by any client with limited to no development. Premium plans will be rolled out over the summer, and we will be iterating on the exact scope in the first few weeks. The Homeserver Terms and Privacy Policy will be updated accordingly and deployed in the coming weeks.&lt;&#x2F;p&gt;
&lt;span id=&quot;continue-reading&quot;&gt;&lt;&#x2F;span&gt;&lt;h2 id=&quot;the-full-story&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#the-full-story&quot; aria-label=&quot;Anchor link for: the-full-story&quot;&gt;🔗&lt;&#x2F;a&gt;The full story&lt;&#x2F;h2&gt;
&lt;p&gt;We have been communicating about the lack of funds in the Foundation for a while now, most recently &lt;a href=&quot;&#x2F;blog&#x2F;2025&#x2F;02&#x2F;crossroads&#x2F;&quot;&gt;here&lt;&#x2F;a&gt;. And whilst we’ve been working hard to gather new members and are happy to see the &lt;a href=&quot;&#x2F;support&#x2F;#supporters&quot;&gt;number of logos increasing&lt;&#x2F;a&gt; (thank you all for seeing the need for Matrix to stay independent and safe, and the value in supporting it!), none of the big players in the ecosystem have actually committed to one of the higher membership tiers, so we need to find other routes to sustainability.&lt;&#x2F;p&gt;
&lt;p&gt;The &lt;a href=&quot;&#x2F;foundation&#x2F;about&#x2F;#mission&quot;&gt;Foundation’s mission&lt;&#x2F;a&gt; can basically be summarised by 4 main goals:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;Ensure the specification of the protocol stays canonical and unencumbered, to avoid fragmentation and being overridden by a single player.&lt;&#x2F;li&gt;
&lt;li&gt;Ensure that all players in the ecosystem are on a level playing field, helping them succeed by giving them visibility and listening to their needs.&lt;&#x2F;li&gt;
&lt;li&gt;Promote the Matrix standard, as the value of Matrix is directly proportional to the size of the public network and how much it is used and commercialised.&lt;&#x2F;li&gt;
&lt;li&gt;Ensure the public network is safe by building moderation tools that can be used by the server admins, for the sake of our users and making sure the network is attractive.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;In practice, it means that we are currently spending money on:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;A small team of developers and moderators, to develop Trust &amp;amp; Safety tooling, moderate the matrix.org server, and redirect people who do not understand the decentralised nature of Matrix reporting abuse to us towards the appropriate server admins.&lt;&#x2F;li&gt;
&lt;li&gt;The infrastructure of the matrix.org homeserver, including the SRE team, who are on call to keep it running, and the support team.&lt;&#x2F;li&gt;
&lt;li&gt;Organising and sponsoring events to promote and evangelise the protocol.&lt;&#x2F;li&gt;
&lt;li&gt;A tiny team to run the Foundation itself, including the support of external contractors for the administrative side (finance, legal, tax). The staff works on governance (organising the governing board elections, running the meetings, liaising between the different teams), raises money and brings members in, manages social media and liaises with the community, keeps the website up and up to date, publishes TWIMs and blogs, organises the events, etc. This team whose day job is to keep the Foundation running is also supported by a lot of volunteer (and sometimes sponsored by employer) time from the Governing Board and its Working Groups, the Spec Core team, the Guardians, and other external staff.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;We haven’t gotten to the point of publishing the public financial report (although it should be almost finalised now), because we are frantically trying to focus on closing the financial gap, but here is an overview of the split of expenditures in the last year:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;blog&#x2F;img&#x2F;foundation-expenses-graph.png&quot; alt=&quot;A pie chart showing the Foundation&amp;#39;s expenses: 30% Trust &amp;amp; Safety, 20% Server Infrastructure, 14.2% Management, 12.5% Events, 20% Other staff, 2.5% Other expenses&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;As you can see, 20% of the Foundation’s expenditure goes towards hosting the free and public matrix.org homeserver. If we add in the cost of the moderation work done by the Trust and Safety team, the total share of costs attributable to the matrix.org homeserver accounts for almost 50% of all expenditure. Meanwhile, today, only 50% of the Foundation’s spending is covered by its revenues (donations and memberships), and we are working hard towards reducing this gap.&lt;&#x2F;p&gt;
&lt;p&gt;We’ve kept the matrix.org homeserver around so far, despite its costs, as we consider it essential to seed the network in support of the nurturing part of the Foundation’s mission: despite Matrix being decentralised by design, users need a trusted place to create a free Matrix account to try it out in the first place.&lt;&#x2F;p&gt;
&lt;p&gt;However, we can’t continue to bear the cost of the server as is, and before we get to the extreme position of being forced to turn it off, leaving its 370k monthly active users in the awkward position of finding a new home for their accounts, we’ve decided to try to alleviate some of these costs by setting up a freemium offering and proposing premium plans in addition to the free ones. The goal is to get the server to at least a financial break-even position. If, by any chance, it ended up profitable, the profit would be invested directly in &lt;a href=&quot;&#x2F;blog&#x2F;2025&#x2F;02&#x2F;building-a-safer-matrix&#x2F;&quot;&gt;Trust and Safety&lt;&#x2F;a&gt;, or other new programmes which can support the ecosystem. As a reminder, the Foundation is a Community Interest Company, i.e. a limited company which operates to provide a benefit to the community it serves rather than private profit.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;what-will-the-freemium-offer-look-like&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#what-will-the-freemium-offer-look-like&quot; aria-label=&quot;Anchor link for: what-will-the-freemium-offer-look-like&quot;&gt;🔗&lt;&#x2F;a&gt;What will the freemium offer look like?&lt;&#x2F;h3&gt;
&lt;p&gt;The idea is to set some limits for users on the free plans, which would be lifted for users on the premium plans in exchange for an affordable membership.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;We are still iterating (and will do for a while) on how it looks,&lt;&#x2F;strong&gt; but users can expect limits around media sizes and&#x2F;or volumes. The goal is to ensure that the most active users participate in covering the costs of the service, in return for the access to a fully encrypted and decentralised open network.&lt;&#x2F;p&gt;
&lt;p&gt;We are limited in scope and design by the fact that we need to ship a minimum viable product as soon as possible (we need to reduce costs now) and by not wanting to impose too much development work (if any) on Matrix client developers.&lt;&#x2F;p&gt;
&lt;p&gt;Obviously we would have preferred to keep everything free of charge. We will never sell our users’ data or cripple our services with ads, so we need to find ethical sources of revenue.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;when-will-the-new-plans-take-effect&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#when-will-the-new-plans-take-effect&quot; aria-label=&quot;Anchor link for: when-will-the-new-plans-take-effect&quot;&gt;🔗&lt;&#x2F;a&gt;When will the new plans take effect?&lt;&#x2F;h3&gt;
&lt;p&gt;The roll-out will happen progressively, starting in the coming weeks and hopefully completing in the summer of 2025. We will start by opening up premium plans to new users only, before progressively migrating all existing accounts to a free plan which will give them the option to upgrade to a premium plan. Users will of course be notified ahead of their account being migrated.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;will-this-work-with-whatever-matrix-client-i-use&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#will-this-work-with-whatever-matrix-client-i-use&quot; aria-label=&quot;Anchor link for: will-this-work-with-whatever-matrix-client-i-use&quot;&gt;🔗&lt;&#x2F;a&gt;Will this work with whatever Matrix client I use?&lt;&#x2F;h3&gt;
&lt;p&gt;Yes. The plan management will be handled via the &lt;a href=&quot;https:&#x2F;&#x2F;account.matrix.org&#x2F;account&#x2F;&quot;&gt;My Account&lt;&#x2F;a&gt; screens provided by the Matrix Authentication Service (MAS), and notifications to users will be sent in a dedicated room using the &lt;a href=&quot;https:&#x2F;&#x2F;spec.matrix.org&#x2F;v1.14&#x2F;client-server-api&#x2F;#server-notices&quot;&gt;Server Notices&lt;&#x2F;a&gt; feature built into the Matrix protocol – already used by the homeserver to send automatic messages to the user – so should be seamless for every client.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;i-am-a-matrix-client-developer-do-i-need-to-do-anything&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#i-am-a-matrix-client-developer-do-i-need-to-do-anything&quot; aria-label=&quot;Anchor link for: i-am-a-matrix-client-developer-do-i-need-to-do-anything&quot;&gt;🔗&lt;&#x2F;a&gt;I am a Matrix client developer, do I need to do anything?&lt;&#x2F;h3&gt;
&lt;p&gt;There are two considerations from a Matrix client point of view:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;support for the &lt;a href=&quot;https:&#x2F;&#x2F;spec.matrix.org&#x2F;v1.14&#x2F;client-server-api&#x2F;#server-notices&quot;&gt;Server Notices&lt;&#x2F;a&gt; feature&lt;&#x2F;li&gt;
&lt;li&gt;if the client is distributed via the Apple App Store, then support for &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;matrix-org&#x2F;matrix-spec-proposals&#x2F;pull&#x2F;4286&quot;&gt;MSC4286&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;If the client doesn&#x27;t show server notices at all then, whilst the client will remain usable with the matrix.org homeserver, your users will have a degraded UX as they won&#x27;t receive notifications when encountering usage limits.&lt;&#x2F;p&gt;
&lt;p&gt;Apple places &lt;a href=&quot;https:&#x2F;&#x2F;developer.apple.com&#x2F;app-store&#x2F;review&#x2F;guidelines&#x2F;#in-app-purchase&quot;&gt;restrictions&lt;&#x2F;a&gt; on how payments are implemented by iOS (et al) apps that are distributed via the App Store.&lt;&#x2F;p&gt;
&lt;p&gt;We expect that most, if not all, apps that fall within scope would be classified as what Apple calls “&lt;a href=&quot;https:&#x2F;&#x2F;developer.apple.com&#x2F;app-store&#x2F;review&#x2F;guidelines&#x2F;#free-stand-alone-apps&quot;&gt;Free Stand-alone Apps&lt;&#x2F;a&gt;”. Such apps do not need to use in-app purchases so long as “there is no purchasing inside the app, or calls to action for purchase outside of the app”.&lt;&#x2F;p&gt;
&lt;p&gt;In order to meet these requirements we have proposed &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;matrix-org&#x2F;matrix-spec-proposals&#x2F;pull&#x2F;4286&quot;&gt;MSC4286&lt;&#x2F;a&gt; which provides a way for a homeserver (such as the matrix.org homeserver) to flag parts of messages as containing a call to action and for affected clients to be able to hide that content. Example implementations are linked in the MSC.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;i-am-already-supporting-the-foundation-by-paying-an-individual-membership-will-i-have-to-pay-for-a-premium-plan-too&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#i-am-already-supporting-the-foundation-by-paying-an-individual-membership-will-i-have-to-pay-for-a-premium-plan-too&quot; aria-label=&quot;Anchor link for: i-am-already-supporting-the-foundation-by-paying-an-individual-membership-will-i-have-to-pay-for-a-premium-plan-too&quot;&gt;🔗&lt;&#x2F;a&gt;I am already supporting the Foundation by paying an individual membership, will I have to pay for a premium plan too?&lt;&#x2F;h3&gt;
&lt;p&gt;No, &lt;a href=&quot;&#x2F;membership&quot;&gt;individual members&lt;&#x2F;a&gt; of the Foundation will get access to the premium features at no extra cost. This benefit will be implemented as part of the process of migrating existing accounts to the free plan.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;what-else-will-be-changing&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#what-else-will-be-changing&quot; aria-label=&quot;Anchor link for: what-else-will-be-changing&quot;&gt;🔗&lt;&#x2F;a&gt;What else will be changing?&lt;&#x2F;h3&gt;
&lt;p&gt;In order to support these changes we will be releasing updates to the Homeserver Terms and the Privacy Policy in the coming weeks. Users of the matrix.org homeserver will be notified and will need to accept the new terms. The scope of change will be clearly highlighted in the release note, but essentially you can expect new terms around payment and additional information on the types of information we will collect about your account, as well as the processors we will use to enable payments.&lt;&#x2F;p&gt;
&lt;p&gt;We realise this is quite a big change, but our position is that a slightly limited service is better than no service at all, so we chose to ask for financial contribution rather than turn off the server. Paying a subscription for the matrix.org homeserver is basically a way to support Matrix, ensuring the Foundation can continue to play its role of neutral custodian, enabler and safeguardian of the protocol and the network. We will be publishing more details and a proper FAQ as the roll-out happens, so watch this space for more details.&lt;&#x2F;p&gt;
</content>
</entry>

    
<entry xml:lang="en">
    <title>Matrix.org is now running MAS!</title>
    <published>2025-04-08T15:30:00+00:00</published>
    <updated>2025-04-08T15:30:00+00:00</updated>
    <author>
      <name>Quentin</name>
    </author>
    <link rel="alternate" href="https://matrix.org/blog/2025/04/morg-now-running-mas/" type="text/html"/>
    <id>https://matrix.org/blog/2025/04/morg-now-running-mas/</id>
    <content type="html">&lt;p&gt;We&#x27;re thrilled to announce that the migration of matrix.org to the Matrix Authentication Service (MAS) is complete and went according to plan - having been running for over 24h in our brave new world, we’re declaring the migration a success! As of Monday April 7th 07:30 UTC, matrix.org is running on Matrix’s &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;matrix-org&#x2F;matrix-spec-proposals&#x2F;blob&#x2F;hughns&#x2F;delegated-oidc-architecture&#x2F;proposals&#x2F;3861-next-generation-auth.md&quot;&gt;next-generation auth system&lt;&#x2F;a&gt; based on OAuth 2.0&#x2F;OpenID Connect.&lt;&#x2F;p&gt;
&lt;p&gt;This is no mean feat - the migration shifted all 45M access tokens and 110M users from Synapse to MAS in under 30 minutes (thanks in part to MAS’s cheeky use of the x86-64-v2 architecture; who knew that database migrations can be SIMD-accelerated?) - and represents the culmination of over 4 years of work to move Matrix to a modern authentication standard. Many thanks go to Element for funding, Hugh, Olivier and many other contributors who helped me make Next Gen Auth happen!&lt;&#x2F;p&gt;
&lt;span id=&quot;continue-reading&quot;&gt;&lt;&#x2F;span&gt;&lt;h2 id=&quot;what-does-this-mean-for-you&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#what-does-this-mean-for-you&quot; aria-label=&quot;Anchor link for: what-does-this-mean-for-you&quot;&gt;🔗&lt;&#x2F;a&gt;What does this mean for you?&lt;&#x2F;h2&gt;
&lt;p&gt;See our &lt;a href=&quot;https:&#x2F;&#x2F;matrix.org&#x2F;blog&#x2F;2025&#x2F;04&#x2F;matrix-auth-service&#x2F;&quot;&gt;previous announcement&lt;&#x2F;a&gt; for the full details of the migration, but in short, your existing sessions remain active: no logging out and back in required.&lt;&#x2F;p&gt;
&lt;p&gt;The move to MAS provides enormous improvements to security and usability:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Access tokens rotate regularly, so a leaked token has a limited lifespan&lt;&#x2F;li&gt;
&lt;li&gt;A single home for your account credentials which password managers can manage for you&lt;&#x2F;li&gt;
&lt;li&gt;Consistent auth and account management experience across apps&lt;&#x2F;li&gt;
&lt;li&gt;All Matrix.org users can finally fully enjoy next generation clients like Element X&lt;&#x2F;li&gt;
&lt;li&gt;A solid basis for all our upcoming authentication features, which we’ll enable on matrix.org as they get approved for merge in the Matrix spec:
&lt;ul&gt;
&lt;li&gt;Login via QR code, complete with E2EE identity!&lt;&#x2F;li&gt;
&lt;li&gt;Support for 2FA, MFA, Passkeys etc&lt;&#x2F;li&gt;
&lt;li&gt;OAuth 2.0 scopes let users control what features an app can access.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h2 id=&quot;your-account-has-a-new-home&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#your-account-has-a-new-home&quot; aria-label=&quot;Anchor link for: your-account-has-a-new-home&quot;&gt;🔗&lt;&#x2F;a&gt;Your account has a new home&lt;&#x2F;h2&gt;
&lt;p&gt;&lt;a href=&quot;https:&#x2F;&#x2F;account.matrix.org&#x2F;&quot;&gt;&lt;strong&gt;account.matrix.org&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt; is now the dedicated home for managing your matrix.org account, which you can access through your browser or supported clients. Here you can:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;View and manage your connected devices&lt;&#x2F;li&gt;
&lt;li&gt;Update your email address and contact information&lt;&#x2F;li&gt;
&lt;li&gt;Change your password&lt;&#x2F;li&gt;
&lt;li&gt;Manage account settings and security options&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;(Eagle-eyed users may notice that client URLs for web logins aren’t shown in account.matrix.org - this only affects migrated devices; new logins will show up correctly. One workaround is to use the native device manager in Element Web to see the URLs of your old sessions).&lt;&#x2F;p&gt;
&lt;h2 id=&quot;see-it-in-action&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#see-it-in-action&quot; aria-label=&quot;Anchor link for: see-it-in-action&quot;&gt;🔗&lt;&#x2F;a&gt;See it in action&lt;&#x2F;h2&gt;
&lt;p&gt;If you’re wondering what the new world of Next Gen Auth looks like, but don’t want to mess around logging in to a new client - fear not, for we have videos!&lt;&#x2F;p&gt;
&lt;p&gt;Here’s an example of native Next Gen Auth in Element X iOS logging into the shiny new matrix.org system:&lt;&#x2F;p&gt;
&lt;noscript&gt;
  Today&#x27;s Matrix Live:
  &lt;a href=&quot;https:&#x2F;&#x2F;youtube.com&#x2F;watch?v=K5dxgNN1Vmc&quot;&gt;
    https:&#x2F;&#x2F;youtube.com&#x2F;watch?v=K5dxgNN1Vmc
  &lt;&#x2F;a&gt;
&lt;&#x2F;noscript&gt;
&lt;youtube-player video-id=&quot;K5dxgNN1Vmc&quot;&gt;&lt;&#x2F;youtube-player&gt;
&lt;p&gt;…and here’s Fractal showing off its native Next Gen Auth support too!&lt;&#x2F;p&gt;
&lt;noscript&gt;
  Watch the video:
  &lt;a href=&quot;https:&#x2F;&#x2F;youtube.com&#x2F;watch?v=uvP24r7ul04&quot;&gt;
    https:&#x2F;&#x2F;youtube.com&#x2F;watch?v=uvP24r7ul04
  &lt;&#x2F;a&gt;
&lt;&#x2F;noscript&gt;
&lt;youtube-player video-id=&quot;uvP24r7ul04&quot;&gt;&lt;&#x2F;youtube-player&gt;
&lt;h1 id=&quot;moving-forward&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#moving-forward&quot; aria-label=&quot;Anchor link for: moving-forward&quot;&gt;🔗&lt;&#x2F;a&gt;Moving forward&lt;&#x2F;h1&gt;
&lt;p&gt;The MSCs that power this new authentication system have now all completed their Final Comment Period and will be merged into the next spec release!&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;matrix-org&#x2F;matrix-spec-proposals&#x2F;pull&#x2F;3861&quot;&gt;MSC3861: Next-generation auth for Matrix, based on OAuth 2.0&#x2F;OIDC&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;matrix-org&#x2F;matrix-spec-proposals&#x2F;pull&#x2F;2964&quot;&gt;MSC2964: Usage of OAuth 2.0 authorization code grant and refresh token grant&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;matrix-org&#x2F;matrix-spec-proposals&#x2F;pull&#x2F;2965&quot;&gt;MSC2965: OAuth 2.0 Authorization Server Metadata&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;matrix-org&#x2F;matrix-spec-proposals&#x2F;pull&#x2F;2966&quot;&gt;MSC2966: Usage of OAuth 2.0 Dynamic Client Registration in Matrix&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;matrix-org&#x2F;matrix-spec-proposals&#x2F;pull&#x2F;2967&quot;&gt;MSC2967: API scopes&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;matrix-org&#x2F;matrix-spec-proposals&#x2F;pull&#x2F;4254&quot;&gt;MSC4254: Usage of RFC7009 Token Revocation for Matrix client logout&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Next up is landing all the non-core MSCs and then getting them enabled on matrix.org too:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;matrix-org&#x2F;matrix-spec-proposals&#x2F;pull&#x2F;3824&quot;&gt;MSC3824: OIDC aware clients&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;matrix-org&#x2F;matrix-spec-proposals&#x2F;pull&#x2F;4108&quot;&gt;MSC4108: Mechanism to allow OIDC sign in and E2EE set up via QR code&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;matrix-org&#x2F;matrix-spec-proposals&#x2F;pull&#x2F;4190&quot;&gt;MSC4190: Device management for application services&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;matrix-org&#x2F;matrix-spec-proposals&#x2F;pull&#x2F;4191&quot;&gt;MSC4191: Account management deep-linking&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;matrix-org&#x2F;matrix-spec-proposals&#x2F;pull&#x2F;4198&quot;&gt;MSC4198: Usage of OIDC login_hint&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h2 id=&quot;questions-or-issues&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#questions-or-issues&quot; aria-label=&quot;Anchor link for: questions-or-issues&quot;&gt;🔗&lt;&#x2F;a&gt;Questions or issues?&lt;&#x2F;h2&gt;
&lt;p&gt;If you encounter any problems or have questions about the new authentication system, please join us in &lt;a href=&quot;https:&#x2F;&#x2F;matrix.to&#x2F;#&#x2F;#matrix-auth:matrix.org&quot;&gt;Matrix Auth &amp;amp; Identity&lt;&#x2F;a&gt; where the team resides.&lt;&#x2F;p&gt;
</content>
</entry>

    
<entry xml:lang="en">
    <title>Matrix.org will migrate to MAS on Apr 7th 2025</title>
    <published>2025-04-02T15:00:00+00:00</published>
    <updated>2025-04-02T15:00:00+00:00</updated>
    <author>
      <name>Quentin</name>
    </author>
    <link rel="alternate" href="https://matrix.org/blog/2025/04/matrix-auth-service/" type="text/html"/>
    <id>https://matrix.org/blog/2025/04/matrix-auth-service/</id>
    <content type="html">&lt;p&gt;&lt;strong&gt;On Monday 7th of April 2025 at 7am UTC, we will migrate the Matrix.org homeserver&#x27;s authentication system over to MAS (Matrix Authentication Service) in order to benefit from Next-generation authentication.&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The migration will involve up to one hour of downtime.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;matrix-org&#x2F;matrix-spec-proposals&#x2F;pull&#x2F;3861&quot;&gt;MSC3861&lt;&#x2F;a&gt; (Next-generation auth for Matrix, based on OAuth 2.0&#x2F;OpenID Connect (OIDC)) and its dependent MSCs have progressed sufficiently that the Foundation is confident in MAS and the new next-generation auth APIs. Specifically, all the MSCs are now in or have passed Final Comment Period (FCP) with disposition to merge! 🎉&lt;&#x2F;p&gt;
&lt;p&gt;We expect the MSCs to finish FCP and get merged into the next spec release. The full list of core Next-gen Auth MSCs is:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;matrix-org&#x2F;matrix-spec-proposals&#x2F;pull&#x2F;3861&quot;&gt;MSC3861: Next-generation auth for Matrix, based on OAuth 2.0&#x2F;OIDC&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;matrix-org&#x2F;matrix-spec-proposals&#x2F;pull&#x2F;2964&quot;&gt;MSC2964: Usage of OAuth 2.0 authorization code grant and refresh token grant&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;matrix-org&#x2F;matrix-spec-proposals&#x2F;pull&#x2F;2965&quot;&gt;MSC2965: OAuth 2.0 Authorization Server Metadata&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;matrix-org&#x2F;matrix-spec-proposals&#x2F;pull&#x2F;2966&quot;&gt;MSC2966: Usage of OAuth 2.0 Dynamic Client Registration in Matrix&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;matrix-org&#x2F;matrix-spec-proposals&#x2F;pull&#x2F;2967&quot;&gt;MSC2967: API scopes&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;matrix-org&#x2F;matrix-spec-proposals&#x2F;pull&#x2F;4254&quot;&gt;MSC4254: Usage of RFC7009 Token Revocation for Matrix client logout&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;This is incredibly exciting, reflecting 4 years of work on next-generation auth, and brings with it a new account management interface, additional security, and a better registration experience.&lt;&#x2F;p&gt;
&lt;span id=&quot;continue-reading&quot;&gt;&lt;&#x2F;span&gt;&lt;h3 id=&quot;the-account-management-interface&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#the-account-management-interface&quot; aria-label=&quot;Anchor link for: the-account-management-interface&quot;&gt;🔗&lt;&#x2F;a&gt;The account management interface&lt;&#x2F;h3&gt;
&lt;p&gt;You will be able to manage your account on a &lt;strong&gt;dedicated interface at &lt;a href=&quot;https:&#x2F;&#x2F;account.matrix.org&quot;&gt;account.matrix.org&lt;&#x2F;a&gt;&lt;&#x2F;strong&gt; (accessible through your client or browser), where you can:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;See and delete your devices.&lt;&#x2F;li&gt;
&lt;li&gt;Update your contact information, like your email address.&lt;&#x2F;li&gt;
&lt;li&gt;Change your password and deactivate your account.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;figure style=&quot;height:100%;&quot;&gt;
    &lt;img src=&quot;&amp;#x2F;blog&amp;#x2F;img&amp;#x2F;mas-devices.webp&quot; &#x2F;&gt;
    &lt;figcaption&gt;&lt;p&gt;The new device overview in MAS&lt;&#x2F;p&gt;
&lt;&#x2F;figcaption&gt;
&lt;&#x2F;figure&gt;
&lt;h3 id=&quot;improved-security&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#improved-security&quot; aria-label=&quot;Anchor link for: improved-security&quot;&gt;🔗&lt;&#x2F;a&gt;Improved security&lt;&#x2F;h3&gt;
&lt;p&gt;MAS comes with a significant refactoring of how authentication works on Matrix. Without breaking compatibility with the former authentication API, it brings several benefits:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Now, only your server will be able to see your account credentials! No more typing your password in every client you’d like to log in to.&lt;&#x2F;li&gt;
&lt;li&gt;Restricted access to sensitive operations, like deactivating your account.&lt;&#x2F;li&gt;
&lt;li&gt;Clearer view of which clients are using your account.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;figure style=&quot;height:100%;&quot;&gt;
    &lt;img src=&quot;&amp;#x2F;blog&amp;#x2F;img&amp;#x2F;mas-grant-access.webp&quot; &#x2F;&gt;
    &lt;figcaption&gt;&lt;p&gt;An example of the scope request view in MAS, showing Element requesting permission to see your profile, view existing messages and data, and send messages on your behalf.&lt;&#x2F;p&gt;
&lt;&#x2F;figcaption&gt;
&lt;&#x2F;figure&gt;
&lt;h3 id=&quot;improved-registration-experience&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#improved-registration-experience&quot; aria-label=&quot;Anchor link for: improved-registration-experience&quot;&gt;🔗&lt;&#x2F;a&gt;Improved registration experience&lt;&#x2F;h3&gt;
&lt;p&gt;Regardless of the client you use, the new registration and login experience makes it clear where your account lives, and it supports next-generation clients like Element X.&lt;&#x2F;p&gt;
&lt;figure style=&quot;height:100%;&quot;&gt;
    &lt;img src=&quot;&amp;#x2F;blog&amp;#x2F;img&amp;#x2F;mas-create-account.webp&quot; &#x2F;&gt;
    &lt;figcaption&gt;&lt;p&gt;The new Registration Dialog for MAS showing an input field for a username and various social logins&lt;&#x2F;p&gt;
&lt;&#x2F;figcaption&gt;
&lt;&#x2F;figure&gt;
&lt;h2 id=&quot;impact&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#impact&quot; aria-label=&quot;Anchor link for: impact&quot;&gt;🔗&lt;&#x2F;a&gt;Impact&lt;&#x2F;h2&gt;
&lt;p&gt;Your current sessions will remain active after the migration has taken effect. In other words, you will not be logged out of your clients.&lt;&#x2F;p&gt;
&lt;p&gt;We’re providing backwards compatibility for existing Matrix clients - this does not remove the stable pre-Matrix 2.0 APIs. You can read more about the impact on clients in our previously published blog article - &lt;a href=&quot;https:&#x2F;&#x2F;matrix.org&#x2F;blog&#x2F;2025&#x2F;01&#x2F;06&#x2F;authentication-changes&#x2F;&quot;&gt;Authentication changes on Matrix.org&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;this-is-only-the-beginning&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#this-is-only-the-beginning&quot; aria-label=&quot;Anchor link for: this-is-only-the-beginning&quot;&gt;🔗&lt;&#x2F;a&gt;This is only the beginning!&lt;&#x2F;h2&gt;
&lt;p&gt;Matrix Authentication Service is Matrix&#x27;s next-generation authentication stack. Together with the next-generation authentication APIs, it is the base of a new exciting era for authentication in Matrix!&lt;&#x2F;p&gt;
&lt;p&gt;This has been one of the most ambitious projects within the Matrix project, the result of a multi-year investment by Element, funded in turn by Element’s customers, including BWI.&lt;&#x2F;p&gt;
&lt;p&gt;It will enable new forms of authentication flows, like QR-code login (coming soon to matrix.org with &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;matrix-org&#x2F;matrix-spec-proposals&#x2F;pull&#x2F;4108&quot;&gt;MSC4108&lt;&#x2F;a&gt;), and new categories of applications building on Matrix, thanks to fine-grained control over client access to the account.&lt;&#x2F;p&gt;
&lt;p&gt;You can find all the technical details in Quentin&#x27;s Matrix Conference talk, &lt;a href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=wOW8keNafdE&quot;&gt;Harder Better Faster Stronger Authentication with OpenID Connect&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Finally, if you have any concerns, please come talk to us in &lt;a href=&quot;https:&#x2F;&#x2F;matrix.to&#x2F;#&#x2F;#matrix-auth:matrix.org&quot;&gt;#matrix-auth:matrix.org&lt;&#x2F;a&gt;&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
</content>
</entry>

    
<entry xml:lang="en">
    <title>Authentication changes on Matrix.org</title>
    <published>2025-01-06T18:00:00+00:00</published>
    <updated>2025-01-06T18:00:00+00:00</updated>
    <author>
      <name>Will Lewis</name>
    </author>
    <link rel="alternate" href="https://matrix.org/blog/2025/01/06/authentication-changes/" type="text/html"/>
    <id>https://matrix.org/blog/2025/01/06/authentication-changes/</id>
    <content type="html">&lt;p&gt;The Matrix.org homeserver will see changes related to authentication in Q1 2025. The team will turn off guest account access on Matrix.org on January 16th and roll out Matrix Authentication Service (MAS) to embrace Matrix 2.0 after February 10. Client developers need to ensure their clients support the required changes.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;what-is-mas&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#what-is-mas&quot; aria-label=&quot;Anchor link for: what-is-mas&quot;&gt;🔗&lt;&#x2F;a&gt;What is MAS&lt;&#x2F;h2&gt;
&lt;p&gt;Matrix Authentication Service is Matrix&#x27;s next-generation authentication stack. It allows for more flexible authentication journeys without requiring client developers to support every one of them.&lt;&#x2F;p&gt;
&lt;p&gt;You can find all the technical details in Quentin&#x27;s Matrix Conf talk, &lt;a href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=wOW8keNafdE&quot;&gt;Harder Better Faster Stronger Authentication with OpenID Connect&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;span id=&quot;continue-reading&quot;&gt;&lt;&#x2F;span&gt;&lt;h2 id=&quot;what-s-the-impact&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#what-s-the-impact&quot; aria-label=&quot;Anchor link for: what-s-the-impact&quot;&gt;🔗&lt;&#x2F;a&gt;What&#x27;s the impact&lt;&#x2F;h2&gt;
&lt;p&gt;Client developers need to ensure that their projects support &lt;a href=&quot;https:&#x2F;&#x2F;areweoidcyet.com&#x2F;#next-gen-auth-aware-clients&quot;&gt;the requirements listed on areweoidcyet.com&lt;&#x2F;a&gt; and, more precisely, the requirements listed in &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;matrix-org&#x2F;matrix-spec-proposals&#x2F;pull&#x2F;3824&quot;&gt;MSC3824&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Developers can already use beta.matrix.org to see if their clients are compatible with MAS. &lt;strong&gt;If you notice anything that doesn&#x27;t work as intended, make sure to give your feedback on &lt;a href=&quot;https:&#x2F;&#x2F;areweoidcyet.com&#x2F;#next-gen-auth-aware-clients&quot;&gt;those MSCs&lt;&#x2F;a&gt;.&lt;&#x2F;strong&gt; If clients work on beta.matrix.org, they will be able to connect to matrix.org after the rollout.&lt;&#x2F;p&gt;
&lt;p&gt;Homeserver administrators from the public federation don&#x27;t have to worry about this deployment. MAS only affects the APIs between the clients and the server, so this deployment only impacts clients connecting to matrix.org. Federation APIs, used for servers to talk to each other, remain unchanged.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;disabling-guest-accounts&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#disabling-guest-accounts&quot; aria-label=&quot;Anchor link for: disabling-guest-accounts&quot;&gt;🔗&lt;&#x2F;a&gt;Disabling guest accounts&lt;&#x2F;h2&gt;
&lt;p&gt;Guest accounts are a legacy Matrix feature that allows clients to create temporary, limited technical accounts to participate in specific rooms that allow it.&lt;&#x2F;p&gt;
&lt;p&gt;The Matrix.org Foundation would have liked to find an efficient way to let people create guest accounts when joining a conversation and then turn them into fully fledged accounts later. Nobody in the ecosystem found resources to design and implement such a user journey, and guest accounts ended up being used for technical reasons, like displaying room previews or badges via shields.io.&lt;&#x2F;p&gt;
&lt;p&gt;Those accounts make up a significant load on the matrix.org homeserver. For that reason, the Matrix.org Foundation has decided to disable them at least temporarily to save precious resources and go ahead with the rollout of the new authentication stack.&lt;&#x2F;p&gt;
&lt;p&gt;The Matrix.org Foundation is open to re-enabling guest accounts once it has the financial capacity to support them. If guest accounts on matrix.org are important to you and your business, please &lt;a href=&quot;https:&#x2F;&#x2F;matrix.org&#x2F;membership&#x2F;&quot;&gt;join the Matrix.org Foundation as a supporting member&lt;&#x2F;a&gt; to contribute to its financial sustainability.&lt;&#x2F;p&gt;
&lt;p&gt;We encourage developers using guest access for room information, such as topics, aliases, or member counts, to utilize the endpoint proposed by &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;matrix-org&#x2F;matrix-spec-proposals&#x2F;pull&#x2F;3266&quot;&gt;MSC3266&lt;&#x2F;a&gt;. This endpoint is publicly accessible without authentication and can serve as an alternative resource until guest access is reinstated in a more robust form.&lt;&#x2F;p&gt;
&lt;p&gt;We appreciate your understanding as we take these steps to enhance the user experience on Matrix.org.&lt;&#x2F;p&gt;
</content>
</entry>

    
<entry xml:lang="en">
    <title>Sunsetting the Sliding Sync Proxy: Moving to Native Support</title>
    <published>2024-11-14T16:00:00+00:00</published>
    <updated>2024-11-14T16:00:00+00:00</updated>
    <author>
      <name>Will Lewis</name>
    </author>
    <link rel="alternate" href="https://matrix.org/blog/2024/11/14/moving-to-native-sliding-sync/" type="text/html"/>
    <id>https://matrix.org/blog/2024/11/14/moving-to-native-sliding-sync/</id>
    <content type="html">&lt;p&gt;&lt;em&gt;We will be decommissioning the &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;matrix-org&#x2F;matrix-spec-proposals&#x2F;pull&#x2F;3575&quot;&gt;sliding sync proxy&lt;&#x2F;a&gt; next week (21&#x2F;11&#x2F;2024) and Element are removing client support in mid-January (17&#x2F;01&#x2F;2025).&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Sliding Sync is designed to provide a significantly faster and more scalable sync experience in our clients. The initial implementation was first prototyped in Element Web backed by an entirely experimental server proxy. The implementation had half an eye on low bandwidth use cases, and the prototype led to &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;matrix-org&#x2F;matrix-spec-proposals&#x2F;pull&#x2F;3575&quot;&gt;MSC3575&lt;&#x2F;a&gt;. We then realised that a simpler approach would be beneficial, and reused the same (experimental) proxy concept to facilitate beta testing with Element X, this time making it available on matrix.org. In doing so, we learned valuable lessons, leading to a refined and simplified API design in &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;matrix-org&#x2F;matrix-spec-proposals&#x2F;pull&#x2F;4186&quot;&gt;MSC4186&lt;&#x2F;a&gt;. The proxy itself was only ever considered as a temporary arrangement to aid speed of development, rather than being a long term solution.&lt;&#x2F;p&gt;
&lt;p&gt;Simplified Sliding Sync (&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;matrix-org&#x2F;matrix-spec-proposals&#x2F;blob&#x2F;erikj&#x2F;sss&#x2F;proposals&#x2F;4186-simplified-sliding-sync.md&quot;&gt;MSC4186&lt;&#x2F;a&gt;, also known as native sliding sync) has since been implemented in Synapse, with encouraging results. Now that we don’t expect the API shape to change significantly, we recommend that homeserver developers implement MSC4186 natively.&lt;&#x2F;p&gt;
&lt;p&gt;The Matrix.org Foundation does not have the resources to keep maintaining the proxy service or its codebase, and plans to decommission the proxy from mid-November and archive the sliding-sync repo.&lt;&#x2F;p&gt;
&lt;p&gt;Recognising that the community needs time to adopt sliding sync natively, Element will keep client support for the old API (MSC3575) until the 17th of January, 2025.&lt;&#x2F;p&gt;
&lt;span id=&quot;continue-reading&quot;&gt;&lt;&#x2F;span&gt;&lt;h2 id=&quot;the-timeline&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#the-timeline&quot; aria-label=&quot;Anchor link for: the-timeline&quot;&gt;🔗&lt;&#x2F;a&gt;The Timeline&lt;&#x2F;h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Now: EX Apps support migrating from the proxy server to native Sliding Sync.&lt;&#x2F;strong&gt; The apps automatically detect when the homeserver supports native Sliding Sync and offer the option to migrate. If users choose to migrate, they will be prompted to log in again. This migration is optional, as the apps continue to support both native Sliding Sync and the proxy.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;November 21st: Service decommissioning.&lt;&#x2F;strong&gt; We plan to decommission the proxy service on Matrix.org and archive its codebase.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;January 17th: Element X stops supporting MSC3575.&lt;&#x2F;strong&gt; EX apps (and matrix-rust-sdk) will remove proxy support, fully shifting to native SS. The migration on EX apps will be forced. Users will get logged out so that they can log in again using native Sliding Sync. We encourage server developers to implement Sliding Sync natively by this point.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;h2 id=&quot;what-this-means-for-users&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#what-this-means-for-users&quot; aria-label=&quot;Anchor link for: what-this-means-for-users&quot;&gt;🔗&lt;&#x2F;a&gt;What This Means for Users&lt;&#x2F;h2&gt;
&lt;p&gt;To continue enjoying the speed of Sliding Sync, your homeserver and client must support the native Sliding Sync implementation (MSC4186).&lt;&#x2F;p&gt;
&lt;p&gt;At the time of writing, the latest versions of Synapse support native Sliding Sync, as do the Element X clients. Other server &#x2F; client implementations may also have support, or be in the process of adding it.
If you do use Element X apps, native Sliding Sync is used for every new login. For those currently using Element X through the proxy service, the app will prompt you to log out to switch to native Sliding Sync. While this migration is optional for now, it will become mandatory on the 21st of November for those on Matrix.org, when the proxy will be decommissioned.
Element X will discontinue support for the previous Sliding Sync implementation (MSC3575) entirely by January 17th.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;guidance-for-server-client-developers&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#guidance-for-server-client-developers&quot; aria-label=&quot;Anchor link for: guidance-for-server-client-developers&quot;&gt;🔗&lt;&#x2F;a&gt;Guidance for Server &amp;amp; Client Developers&lt;&#x2F;h2&gt;
&lt;p&gt;Server &amp;amp; Client developers are encouraged to implement MSC4186 for native sliding sync. Server developers should be aware that by the 17th of January Element clients will drop support for MSC3575, marking a transition to the native system.&lt;&#x2F;p&gt;
&lt;p&gt;We appreciate your understanding as we take this step forward for the Matrix ecosystem.&lt;&#x2F;p&gt;
</content>
</entry>

    
<entry xml:lang="en">
    <title>Sunsetting unauthenticated media</title>
    <published>2024-06-26T14:31:27+00:00</published>
    <updated>2024-06-26T14:31:27+00:00</updated>
    <author>
      <name>Travis Ralston</name>
    </author>
    <link rel="alternate" href="https://matrix.org/blog/2024/06/26/sunsetting-unauthenticated-media/" type="text/html"/>
    <id>https://matrix.org/blog/2024/06/26/sunsetting-unauthenticated-media/</id>
    <content type="html">&lt;p&gt;Hello everyone,&lt;&#x2F;p&gt;
&lt;p&gt;The Trust &amp;amp; Safety team has been working hard to get &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;matrix-org&#x2F;matrix-spec-proposals&#x2F;pull&#x2F;3916&quot;&gt;MSC3916&lt;&#x2F;a&gt; in the hands of users, and we’re nearly there with &lt;a href=&quot;https:&#x2F;&#x2F;matrix.org&#x2F;blog&#x2F;2024&#x2F;06&#x2F;20&#x2F;matrix-v1.11-release&#x2F;&quot;&gt;Matrix 1.11 being released last week&lt;&#x2F;a&gt;. This fixes a long-standing design flaw in Matrix where media (images, avatars, files, etc) can be accessed without authentication if the URL is known. Matrix 1.11 fixes this by requiring authentication on these URLs, removing the ability for users to treat homeservers as CDNs for hosting arbitrary Matrix content for arbitrary users.&lt;&#x2F;p&gt;
&lt;p&gt;Rolling this feature out to the entire public federation is a bit tricky, particularly when considering the user safety and privacy benefits which Matrix 1.11 brings. Developers are encouraged to support authenticated media quickly to give server admins the ability to freeze unauthenticated media access on their servers. Media uploaded or cached before the freeze will remain accessible on the unauthenticated endpoints, but any media uploaded or cached after the freeze will only be available through the authenticated endpoints.&lt;&#x2F;p&gt;
&lt;p&gt;This freeze reduces the amount of breakage users will see. We’re aware of links being shared outside the context of a room already, and breaking those would be pretty disappointing for those users. We also don’t want to encourage that capability going forwards due to the space it takes up and the anonymous nature of the requests. Users who keep their clients updated should see no impact when their servers implement their freeze, but may find themselves unable to copy&#x2F;paste media URLs to their friends.&lt;&#x2F;p&gt;
&lt;p&gt;Matrix 1.11 recommends that all servers evaluate their local ecosystem to determine when would be best to implement the freeze, and that the freeze should happen before Matrix 1.12 is released in August 2024. For the matrix.org homeserver, we anticipate that most of our users will have updated clients in July, putting our freeze date in August.&lt;&#x2F;p&gt;
&lt;p&gt;Developers, and those curious, are encouraged to review the &lt;a href=&quot;https:&#x2F;&#x2F;matrix.org&#x2F;blog&#x2F;2024&#x2F;06&#x2F;20&#x2F;matrix-v1.11-release&#x2F;&quot;&gt;Matrix 1.11 blog post&lt;&#x2F;a&gt; for details on the changes they’ll need to make in July to have near-zero matrix.org user impact, and for information about the recommended freeze approach.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;timeline-for-matrix-org-homeserver&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#timeline-for-matrix-org-homeserver&quot; aria-label=&quot;Anchor link for: timeline-for-matrix-org-homeserver&quot;&gt;🔗&lt;&#x2F;a&gt;Timeline for matrix.org homeserver&lt;&#x2F;h2&gt;
&lt;p&gt;To assist developers and other server admins in testing their implementations, we will be updating the beta.matrix.org homeserver to enact the freeze as soon as code is available for that. We expect this to happen in July 2024. The matrix.org (non-beta) homeserver’s freeze will be started on &lt;del&gt;August 28th, 2024&lt;&#x2F;del&gt; &lt;strong&gt;September 4th, 2024&lt;&#x2F;strong&gt; during normal UK business hours.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;Update August 14, 2024: Most of the ecosystem has already updated to support authenticated media with only a few bug fixes pending release. To give a little bit more buffer for these bug fixes to roll out, we&#x27;ve moved our scheduled date to September 4th, 2024.&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;p&gt;All media uploaded and cached prior to the freeze will remain accessible on the unauthenticated endpoints and authenticated endpoints. Media uploaded and cached after the freeze will only be available through the authenticated endpoints, not the unauthenticated ones.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;developer-support&quot;&gt;&lt;a class=&quot;zola-anchor&quot; href=&quot;#developer-support&quot; aria-label=&quot;Anchor link for: developer-support&quot;&gt;🔗&lt;&#x2F;a&gt;Developer support&lt;&#x2F;h2&gt;
&lt;p&gt;The team is making themselves available in the &lt;a href=&quot;https:&#x2F;&#x2F;matrix.to&#x2F;#&#x2F;#matrix-client-developers:matrix.org&quot;&gt;#matrix-client-developers:matrix.org&lt;&#x2F;a&gt; and &lt;a href=&quot;https:&#x2F;&#x2F;matrix.to&#x2F;#&#x2F;#matrix-homeserver-developers:matrix.org&quot;&gt;#matrix-homeserver-developers:matrix.org&lt;&#x2F;a&gt; rooms on Matrix to help support developers in implementing this feature. Client, server, and bridge authors are welcome to visit those rooms to get help in figuring out what needs to happen to support authenticated media. Further resources are also available in the &lt;a href=&quot;https:&#x2F;&#x2F;matrix.org&#x2F;blog&#x2F;2024&#x2F;06&#x2F;20&#x2F;matrix-v1.11-release&#x2F;&quot;&gt;Matrix 1.11 blog post&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;For questions about the rollout itself, the freeze date, or the (beta.)matrix.org homeserver, please visit &lt;a href=&quot;https:&#x2F;&#x2F;matrix.to&#x2F;#&#x2F;#foundation-office:matrix.org&quot;&gt;#foundation-office:matrix.org&lt;&#x2F;a&gt; on Matrix.&lt;&#x2F;p&gt;
&lt;p&gt;We look forward to seeing the ecosystem working towards a safer, authenticated experience for users.&lt;&#x2F;p&gt;
&lt;p&gt;Thank you,&lt;&#x2F;p&gt;
&lt;p&gt;Travis Ralston &amp;amp; the whole Matrix.org Foundation team&lt;&#x2F;p&gt;
</content>
</entry>

    
</feed>
