We launched the Matrix Public Archive publicly on June 2nd, 2023. We decided to take it down on Sunday, June 25th out of precaution after a member of OFTC staff warned us that the archive made the content of two OFTC IRC channels bridged to Matrix available on the Internet.
After investigating the issue, we determined that the Matrix Public Archive's behaviour was expected for these channels, given an IRC chanop had explicitly configured the Matrix side of the rooms to be world-readable.
Let's talk about how room visibility works in vanilla Matrix, how it works with bridges, and what the next steps are.
- archive.matrix.org does not expose history for Matrix rooms (or channels bridged from IRC) unless a room admin (aka IRC chanop) has explicitly configured that room to be world-readable.
- There was confusion over this because the UI failed to explain why a given room is viewable (or not), and folks didn't realise that some rooms had explicitly been configured as world-readable in the dim and distant past.
- archive.matrix.org is not an indexer or an archive - it's just a read-only Matrix client. It doesn't store any messages. We're going to find a better name for it.
- The only reason the @archive:matrix.org bot joins rooms when someone views them via archive.matrix.org is that the Peek API is deprecated - and rather than implementing a deprecated API to view the room without joining, the service explicitly joins the room instead. Once peeking (e.g. MSC2753) lands, the bot won't be needed any more.
Room Visibility in Matrix
Matrix rooms have some flexibility regarding whether new members can see the history of a room or not. People interested in technical details can check the Room History Visibility in the specification. The history visibility possibilities are the following, by increasing order of openness (least open first):
joined: people need to join the room to see the history, and will only see the messages sent after they joined. This behaviour is similar to the experience of IRC on a bouncer, and is how all IRC channels are bridged to Matrix by default
invited: people need to join the room to see the history, and will only see the messages after they were invited.
shared: people need to join the room to see the history, but will then see the history going back to the point when this visibility setting was set (the change is not retroactive).
world_readable: everyone can see the room history without even joining the room.
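The four settings above can be sketched in a few lines of Python. This is only an illustration of the rules as described in this post, not code from any Matrix client or server: the names `HistoryVisibility`, `readable_without_joining` and `earliest_visible` are made up for the example, timestamps are plain integers, and the per-event details of the specification are deliberately simplified.

```python
from enum import Enum
from typing import Optional


class HistoryVisibility(Enum):
    """Values carried by the m.room.history_visibility state event."""
    JOINED = "joined"
    INVITED = "invited"
    SHARED = "shared"
    WORLD_READABLE = "world_readable"


def readable_without_joining(visibility: HistoryVisibility) -> bool:
    """Only world_readable history can be read by someone who never joined."""
    return visibility is HistoryVisibility.WORLD_READABLE


def earliest_visible(
    visibility: HistoryVisibility,
    *,
    setting_changed_at: int,
    invited_at: Optional[int],
    joined_at: int,
) -> int:
    """Simplified: timestamp from which a member who joined can read history."""
    if visibility in (HistoryVisibility.SHARED, HistoryVisibility.WORLD_READABLE):
        # History is visible back to the point where the setting was applied.
        return setting_changed_at
    if visibility is HistoryVisibility.INVITED and invited_at is not None:
        # History is visible from the invitation onwards.
        return invited_at
    # "joined", or "invited" for someone who joined without an invitation.
    return joined_at
```

For instance, `readable_without_joining(HistoryVisibility.SHARED)` is `False`: a shared room still requires joining, which is the distinction the rest of this post relies on.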
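Element is far from being the only Matrix client out there but is commonly used in the Matrix community. The visibility settings described above are translated as follows in Element, by decreasing order of openness:
Anyone: world_readable
Members only (since the point in time of selecting this option): shared
Members only (since they were invited): invited
Members only (since they joined): joined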
An example of a world-readable room (with history visibility set to "Anyone" in Element) is Matrix HQ. When trying to reach it via matrix.to, you can pick Element in a browser, and it will show you a preview of the conversation.
What is not necessarily obvious here is that Element Web creates a guest user who never joins the room in order to peek into it. Indeed, the guest is only created to be able to use Element, and then the guest is looking at a preview of the room (as defined in the Room Previews section of the spec): they're able to read the history without ever joining. All of this is defined for each room, and is vanilla Matrix without any involvement of the Matrix Public Archive.
Room Visibility and IRC Bridging
It is important to note that the Libera Chat and OFTC bridges hosted by the
Matrix.org Foundation (and any other bridge powered by a vanilla matrix-appservice-irc)
mimic the IRC behaviour by default when creating the rooms: their visibility is
joined, a.k.a. "People need to join the room to see history and will
only see new messages since they joined". In other words, by default, IRC
channel history is only ever visible to users currently in the channel - it is
never shared with other users.
For the sake of completeness, let's cover the two types of rooms that exist when bridging a room to IRC, and the implications on history visibility control.
When someone tries to join #example:libera.chat, the bridge creates this room, known as a portalled room, and automatically bridges it to the #example channel on Libera Chat. The Bridge Bot user (@appservice:libera.chat) is the owner of the room and has the maximum power level (PL100).
Nobody on the Matrix side has privileges when the room is created. No Matrix user can change the visibility of this room. If someone from the IRC side promotes the IRC representation of a Matrix user to op in the channel, the bridge bot will promote said user to power level 50 on the Matrix side. This Matrix user will then be able to change the visibility of the room on the Matrix side, and opt in to world-readable visibility.
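To make the power level mechanics concrete, here is a small sketch of how the check could be evaluated from the content of a room's `m.room.power_levels` event. It relies on the defaults in the Matrix specification (`state_default` of 50 and `users_default` of 0 when those keys are absent) and assumes the room has no per-event override for `m.room.history_visibility`. The helper names and the user `@alice:example.org` are hypothetical.

```python
from typing import Any, Dict


def required_level(power_levels: Dict[str, Any], event_type: str) -> int:
    """Power level needed to send a given state event type."""
    events = power_levels.get("events", {})
    # Fall back to state_default, which itself defaults to 50.
    return events.get(event_type, power_levels.get("state_default", 50))


def user_level(power_levels: Dict[str, Any], user_id: str) -> int:
    """Power level of a user, falling back to users_default (default 0)."""
    users = power_levels.get("users", {})
    return users.get(user_id, power_levels.get("users_default", 0))


def can_change_history_visibility(power_levels: Dict[str, Any], user_id: str) -> bool:
    needed = required_level(power_levels, "m.room.history_visibility")
    return user_level(power_levels, user_id) >= needed


# Freshly created portalled room: only the bridge bot has any power.
fresh_room = {"users": {"@appservice:libera.chat": 100}}

# Same room after an IRC op has opped Alice and the bridge mirrored it as PL50.
after_op = {"users": {"@appservice:libera.chat": 100, "@alice:example.org": 50}}
```

With these two dictionaries, `can_change_history_visibility(fresh_room, "@alice:example.org")` is false, while the same call against `after_op` is true. A room that lists `m.room.history_visibility` under `events` with a higher value would need that higher level instead.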
When someone takes an existing Matrix room and tries to manually plug it (or plumb it) to an IRC channel, they can do so using a widget for interactive configuration. The Matrix user needs to specify which IRC network and channel they want to bridge to, and the nick of the IRC op who can approve that request.
Plumbing a room in this fashion requires someone with sufficient privileges in the IRC channel to approve the request. In plumbed rooms, the Matrix user who made the plumbing request has the maximum power level in this room (usually PL100). They are in total control of the history visibility, which can be world-readable from the start. It's worth reiterating that such rooms can only be linked to IRC when an IRC chanop approves the plumbing request.
The Matrix Public Archive is not an archive
In retrospect, the Matrix Public Archive is a terrible name for this project - all the webapp does is to act as a read-only Matrix client for world-readable content. It doesn't archive anything; it doesn't store anything; it just pulls data from world-readable rooms on the Matrix homeserver, and exposes it to the web.
The Matrix Public Archive also depends on a bot joining the room to assess
whether the room is world-readable or not, purely because the original peeking
APIs in Matrix are deprecated, and the new ones (MSC2753)
haven't landed yet. In the case of the Matrix Public Archive hosted by the
Matrix.org Foundation, that bot is
@archive:matrix.org. However, the bot user
is not reading any information which wasn't already publicly visible without
joining the room - but we can see why having a random bot join is scary,
especially when it's called ‘archive’, and it's not actually archiving anything.
Wait, my room's world readable?!
The most obvious issue is that some people were surprised that their room was world readable in the first place. Some rooms have a long history themselves, and it's entirely possible for some admins to have inherited a room someone else created, made public, and never revisited the settings. It then came as a surprise for them that their room history was in the Matrix Public Archive at all.
When matrix.to was introduced, some room administrators also set their rooms to
world_readable so potential joiners could peek at what was happening in the
room. Earlier in Matrix history, guest accounts were popular in some communities
and people also made their rooms world_readable to onboard guests more easily.
All of this leads us to the same two issues today.
First, the UI of archive.matrix.org should make it clearer why a room is world-readable (e.g. "A room admin (chanop) called Bob set the room to be world-readable on Jan 2, 2018"). Moreover, Matrix clients in general could do a better job of calling out when a room's history can be read by everyone, including people who didn't join. Second, the room settings may not make it obvious enough that sharing the history with "anyone" literally means "anyone" and not "anyone who has joined".
A note on shared history visibility
Having a public room doesn't necessarily mean you want everyone to be able to
read the whole history. Initially, the Matrix Public Archive also made rooms
with shared history visibility readable via the archive.matrix.org interface
(acting as a read-only client, effectively) but prevented search engines from
indexing that content. A shared room is a room where, having joined, you can see
the whole history of the room.
In retrospect, this was a thinko -
shared history doesn't mean you expect
anonymous users to be able to read history (otherwise you'd have set it
world_readable), and we've subsequently merged a fix
kindly provided by tulir to address this.
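The corrected rule can be summarised in a short sketch. This is not the actual patch, only an illustration of the intended behaviour; the function name `may_show_to_anonymous_visitor` and the constant are invented for the example. It takes the content of a room's `m.room.history_visibility` state event and, following the specification, treats a missing value as `shared`.

```python
from typing import Any, Dict

# After the fix, only world_readable history is shown to visitors who have not joined.
ANONYMOUS_READABLE = frozenset({"world_readable"})


def may_show_to_anonymous_visitor(visibility_content: Dict[str, Any]) -> bool:
    """Decide whether a read-only viewer may display a room's history."""
    # Rooms without an explicit setting are assumed to be "shared".
    visibility = visibility_content.get("history_visibility", "shared")
    return visibility in ANONYMOUS_READABLE
```

Under this rule a room with `{"history_visibility": "shared"}` is no longer displayed, while `{"history_visibility": "world_readable"}` still is.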
By default, portalled IRC rooms were already set with joined
visibility, which prevented them from being in the archive at all. We have
additionally prevented the Matrix.org hosted public archive from exposing the
content of portalled rooms that are bridged to Libera Chat since June 7, and
rooms bridged to OFTC since June 27, regardless of their history visibility.
Looking at Libera Chat's Public logging policy,
there might be a way to make the bridge change the topic to be explicit about
the channel being publicly logged when the Matrix room is world_readable. That
feature doesn't exist yet, so we would rather prevent the archive from logging
any room bridged to their network.
- Making it clear in the archive.matrix.org UI why a given room is world readable (and thus showing up in the interface)
- Renaming the archive (as Matrix Viewer?)
- Avoiding bots ever joining rooms on behalf of the system; world_readable privileges should mean by definition that nothing needs to join the room.
- MSC2291 to act like the web's Robots Exclusion Protocol, at the room level
A word on GDPR
We understand that there have been some concerns around the GDPR compliance of the Matrix Public Archive. As always, we hear you and welcome your feedback.
The first step we will take is to rename the project, to clarify what its technical purpose is. ‘Archiving’ has very specific connotations within the GDPR, mainly governed by art. 89. No further data is collected and archived outside of your normal use of a Matrix homeserver, but there is indeed some additional processing by further disseminating the data. We are making this clearer in the matrix.org homeserver Privacy Notice.
This article requires measures around data minimisation to be taken when archiving data in the public interest. We would argue that processing this data only in a cache state would meet this principle of minimisation. In fact, removing data from the ‘archive’ is as simple as deleting it from the room.
If you have feedback on the legal aspects of this project, please send it over to [email protected].
Once we've renamed the project and clarified the visibility settings, we'll be turning archive.matrix.org back on. If you have any further feedback, please talk to us at #matrix-public-archive:matrix.org.