Matthew Hodgson

161 posts tagged with "Matthew Hodgson"

How do you implement interoperability in a DMA world?

2022-03-29 — GeneralMatthew Hodgson
Last update: 2022-03-29 09:23

With last week’s revelation that the EU Digital Markets Act will require tech gatekeepers (companies valued at over €75B or with over €7.5B of turnover) to open their communication APIs for the purposes of interoperability, there’s been a lot of speculation on what this could mean in practice. To try to ground the conversation a bit, we’ve had a go at outlining some concrete proposals for how it could work.

However, before we jump in, we should review how the DMA has come to pass.

🔗What’s driven the DMA?

Today’s gatekeepers all began with a great product, which got more and more popular until it grew to such a size that one of the biggest reasons to use the service is not necessarily the product any more, but the benefits of being able to talk to a large network of users. This rapidly becomes anti-competitive, however: the user becomes locked into the network and can’t switch even if they want to. Even when people have a really good reason to move provider (e.g. WhatsApp’s terms of use changing to share user data with Facebook, Apple doing a 180 on end-to-end encrypting iCloud backups, or Telegram not actually being end-to-end encrypted), in practice hardly anything changes - because the users are socially obligated to keep using the service in order to talk to the rest of the users on it.

As a result, it’s literally harmful to the users. Even if a new service launches with a shiny new feature, there is enormous inertia that must be overcome for users to switch, thanks to the pull of their existing network - and even if they do, they’ll just end up with their conversations haphazardly fragmented over multiple apps. This isn’t accepted for email; it isn’t accepted for the phone network; it isn’t accepted on the Internet itself - and it shouldn’t be accepted for messaging apps either.

Similarly: the closed networks of today’s gatekeepers put a completely arbitrary limit on how users can extend and enrich their own conversations. On email, if you want to use a fancy new client like Superhuman - you can. If you want to hook up a digital assistant or translation service to help you wrangle your email - you can. If you want to hook up your emails to a CRM to help you run your business - you can. But with today’s gatekeepers, you have literally no control: you’re completely at the mercy of the service provider - and for something like WhatsApp or iMessage the options are limited at best.

Finally - all the users’ conversation metadata for that service (who talks to who, and when) ends up centralised in the gatekeepers’ databases, which then become an incredibly valuable and sensitive treasure trove, at risk of abuse. And if the service provider identifies users by phone number, the user is forced to disclose their phone number (a deeply sensitive personal identifier) to participate, whether they want to or not. Meanwhile the user is massively incentivised not to move away: effectively they are held hostage by the pull of the service’s network of users.

So, the DMA exists as a strategy to improve the situation for users and service providers alike by building a healthier dynamic ecosystem for communication apps; encouraging products to win by producing the best quality product rather than the biggest network. To quote Cédric O (Secretary of State for the Digital Sector of France), the strategy of the legislation came from Washington advice to address the anticompetitive behaviour of the gatekeepers “not by breaking them up… but by breaking them open.” By requiring the gatekeepers to open their APIs, the door has at last been opened to give users the option to pick whatever service they prefer to use, to choose who they trust with their data and control their conversations as they wish - without losing the ability to talk to their wider audience.

However, something as groundbreaking as this is never going to be completely straightforward. Of course, while some basic use cases (i.e. non-E2EE chat) are easy to implement, they may not initially have a UX as smooth as a closed network which has ingested all your address book; and other use cases (e.g. E2EE support) may require some compromises at first. It’s up to the industry to figure out how to rise to that challenge, and how to do it in a way which minimises the impact on privacy - especially for end-to-end encrypted services.

🔗What problems need to be solved?

We’ve already written about this from a Matrix perspective, but to recap - the main challenge is the trade-off between interoperability and privacy for gatekeepers who provide end-to-end encryption, which at a rough estimate means: WhatsApp, iMessage, secret chats in Facebook Messenger, and Google Messages. The problem is that even with open APIs which correctly expose the end-to-end encrypted semantics (as the DMA requires), the point where you interoperate with a different system inevitably means that you’ll have to re-encrypt the messages for that system, unless they speak precisely the same protocol - and by definition you end up trusting the different system to keep the messages safe. This increases the attack surface on the conversations, putting the end-to-end encryption at risk.

Alex Stamos (ex-CISO at Facebook) said that “WhatsApp rolling out mandatory end-to-end encryption was the largest improvement in communications privacy in human history” – and we agree. Guaranteed end-to-end encrypted conversations on WhatsApp is amazing, and should be protected at all costs. If users are talking to other users on WhatsApp (or any set of users communicating within the same E2EE messenger), E2EE should and must be maintained - and there is nothing in the DMA which says otherwise.

But what if the user consciously wants to prioritise interoperability over encryption? What if the user wants to hook their WhatsApp messages into a CRM, or run them through a machine translation service, or try to start a migration to an alternative service because they don’t trust Meta? Should privacy really come so spectacularly at the expense of user freedom?

We also have the problem of figuring out how to reference users on other platforms. Say that I want to talk to a user with a given phone number, but they’re not on my platform - how do I locate them? What if my platform only knows about phone numbers, but you’re trying to talk to a user on a platform which uses a different format for identifiers?

Finally, we have the problem of mitigating abuse: opening up APIs makes it easier for bad actors to try to spam or phish or otherwise abuse users within the gatekeepers’ platforms. There are going to have to be changes in anti-abuse services/software, and some signals that the gatekeeper platforms currently use are going to go away or be less useful, but that doesn’t mean the whole thing is intractable. It will require changes and innovative thinking, but we’ve been making steady progress (e.g. the work done by Element’s trust and safety team). Meanwhile, the gatekeepers already have massive anti-abuse systems in place to handle the billions of users within their walled gardens, and unofficial APIs are already widespread: adding official APIs does not change the landscape significantly (assuming interoperability is implemented in such a way that the existing anti-abuse mechanisms still apply).

In the past, gatekeepers dismissed the effort of interop as not being worthwhile - after all, the default course of action is to build a walled garden, and having built one, the temptation is to try to trap as many users as possible. It was also not always clear that there were services worth interoperating with (thanks to the chilling effects of the gatekeepers themselves, in terms of stifling innovation for communication startups). Nowadays this situation has fundamentally changed, however: there is a vibrant ecosystem of open communication startups out there, and a huge appetite to build an open ecosystem of interoperable communication, much like the open web itself.

🔗What are the requirements?

Before going further in considering solutions, we need to review the actual requirements of the DMA. Our best understanding at this point is that the DMA will mandate that:

  • Gatekeepers will have to provide open and documented APIs to their services, on request, in order to facilitate interoperability (i.e. so that other services can communicate with their users).
  • These APIs must preserve the same level of end-to-end encryption (if any) to remote users as is available to local users.
  • This applies to 1:1 messaging and file transfer in the short term, and group messaging, file-transfer, 1:1 VoIP and group VoIP in the longer term.

🔗So, what could this actually look like?

The DMA legislation deliberately doesn’t focus on implementation, instead letting the industry figure out how this could actually work in practice. There are many different possible approaches, and so from our point of view as Matrix we’ve tried to sketch out some options to make the discussion more concrete. Please note these are preliminary thoughts, and are far from perfect - but hopefully useful as a starting point for discussion.

🔗Finding Bob

Imagine that you have a user Alice on an existing gatekeeper, which we’ll call AliceChat, which runs an E2EE messaging service that identifies users by phone number. Say that Alice wants to start a 1-to-1 conversation with Bob, who doesn’t use AliceChat, but Alice knows he is a keen user of BobChat. Today, she’d have no choice but to send him an SMS and nag him to join AliceChat (sucks to be him if he doesn’t want to use that service, or is unable to for whatever reason - e.g. his platform isn’t supported, or his government has blocked access, etc), or to join BobChat herself.


However, imagine if instead the gatekeeper app had a user experience where the app prompted you to talk to the user via a different platform instead. It’d be no different to your operating system prompting you to pick which app to use to open a given file extension (rather than the OS vendor hardcoding it to one of their own apps - another win for user rights led by the EU!).


Now, the simplest approach in the short term would be for each gatekeeper to pre-provision a set of options of possible alternative networks. (The DMA says that, on request, other service providers can ask to have access to the gatekeeper’s APIs for the purposes of interoperability, so the gatekeeper knows who the alternative networks may be). “Bob is not on AliceChat - do you want to try to reach him instead on BobChat, CharlieChat, DaveChat (etc)”.

Much like users can configure their preferred applications for file extensions in an operating system today, users would also be able to add their own preferred service providers - simply specifying their domain name.

🔗Connecting to Bob

Now, AliceChat itself needs to figure out how to query the remote service provider to see if Bob actually exists there. Given the DMA requires that gatekeepers provide open APIs with the same level of security to remote users as their local ones using today’s private APIs - and very deliberately doesn’t mandate specific protocols for interoperability - they will need to locate a bridge which can connect to the other system.

In this thought experiment, the bridge used would be up to the destination provider. For instance, bobchat.com could announce that AliceChat users should connect to it via alicechat-bridge.bobchat.com using the AliceChat protocol (or matrix-bridge.bobchat.com via Matrix, or xmpp-bridge.bobchat.com via XMPP), advertised via a simple HTTP API or even a .well-known URL. Users might also be able to override the bridge used to connect to them (e.g. to point instead at a client-side bridge), and could sign the advertisement to prove that it hadn’t been tampered with.
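
To make the discovery step concrete, here’s a minimal sketch in Python of how a service might look up the advertised bridge for its own protocol. The .well-known path, the JSON fields and the bobchat.com domain are all hypothetical - the DMA doesn’t mandate any particular discovery mechanism:

```python
import json
import urllib.request

def discover_bridge(provider_domain: str, our_protocol: str):
    """Fetch a (hypothetical) .well-known document from the destination
    provider and return the bridge endpoint it advertises for our protocol."""
    url = f"https://{provider_domain}/.well-known/interop"  # hypothetical path
    with urllib.request.urlopen(url, timeout=10) as resp:
        advert = json.load(resp)

    # Example (hypothetical) advertisement:
    # {
    #   "bridges": {
    #     "alicechat": "https://alicechat-bridge.bobchat.com",
    #     "matrix":    "https://matrix-bridge.bobchat.com",
    #     "xmpp":      "https://xmpp-bridge.bobchat.com"
    #   },
    #   "signature": "..."   # so the advertisement can be verified
    # }
    return advert.get("bridges", {}).get(our_protocol)

bridge_url = discover_bridge("bobchat.com", "alicechat")
```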

AliceChat would then connect to the discovered bridge using AliceChat’s newly opened vendor-specific API, and would treat Bob as a real AliceChat user and client to all intents and purposes. In other words, Bob would effectively be a “ghost user” on AliceChat, subject to all its existing anti-abuse mechanisms.
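
As a toy illustration of the “ghost user” pattern, here’s a self-contained Python sketch of the mapping a bridge maintains between remote users and the puppet accounts it drives on the other network. The class and identifier formats are entirely made up; a real bridge would sit on top of the vendor’s opened API on one side and (say) the Matrix Application Service API on the other:

```python
from dataclasses import dataclass

@dataclass
class Message:
    sender: str   # e.g. "+441234567890" on AliceChat, "@bob:bobchat.com" on Matrix
    room: str
    text: str

class GhostUserBridge:
    """Toy sketch: relay traffic between networks by provisioning a 'ghost'
    user for each remote participant and re-emitting their messages as that
    ghost. The identifier scheme below is purely illustrative."""

    def __init__(self) -> None:
        self.ghosts: dict[str, str] = {}  # remote user id -> local ghost user id

    def ghost_for(self, remote_user: str) -> str:
        # Provision a ghost user on first sight, e.g. @_alicechat_+44...:bridge.example
        return self.ghosts.setdefault(
            remote_user, f"@_alicechat_{remote_user}:bridge.example"
        )

    def relay(self, msg: Message) -> Message:
        # Re-emit the message on the other network as the corresponding ghost.
        return Message(sender=self.ghost_for(msg.sender), room=msg.room, text=msg.text)
```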

Meanwhile, the other side of the bridge converts through to whatever the target system is - be that XMPP, Matrix, a different proprietary API, etc. For Matrix, it’d be chatting away to a homeserver via the Application Service API (using End-to-Bridge Encryption via MSC3202). It’s also worth noting that the target might not even be a bridge - it could be a system which already natively speaks AliceChat’s end-to-end encrypted API, thus preserving end-to-end encryption without any need to re-encrypt. And while bridges have historically had a bad reputation as second-class afterthoughts, Matrix has shown that by treating them as first-class citizens - and focusing on mapping the highest common denominator between services rather than the lowest - it’s possible for them to work transparently in practice. Beeper is a great example of Matrix bridging being used for real in the wild (rather amusingly, they just shipped emoji reactions for WhatsApp on iOS via their WhatsApp<->Matrix bridge before WhatsApp themselves did…)

Architecturally, it could look like this:

Or, more likely (given a dedicated bridge between two proprietary services would be a bit of a special case, and you’d have to solve the dilemma of who hosts the bridge), both services could run a bridge to a common open standard protocol like Matrix or XMPP instead (thus immediately enabling interoperability with everyone else connected to that network):

Please note that while these examples show server-side bridges, in practice it would be infinitely preferable to use client-side bridges when connecting to E2EE services - meaning that decrypted message data would only ever be exposed on the client (which obviously has access to the decrypted data already). Client-side bridges are currently complicated by OS limits on background tasks and push notification semantics (on mobile, at least), but one could envisage a scenario where you install a little stub AliceChat client on your phone which auths you with AliceChat and then sits in the background receiving messages and bridging them through to Matrix or XMPP, like this:

Another possible architecture could be for the E2EE gatekeeper to expose their open APIs on the clients, rather than the server. DMA allows this, to the best of our knowledge - and would allow other apps on the device to access the message data locally (with appropriate authorisation, of course) - effectively doing a form of realtime data liberation from the closed service to an open system, looking something like this:

Finally, it's worth noting that when peer-to-peer decentralised protocols like P2P Matrix enter production, client-side bridges could bridge directly into a local communication server running on the handset - thus avoiding metadata being exposed on the Matrix or XMPP servers which provide a common language between the service providers.

🔗Locating users

Now, the above describes the simplest and most naive directory lookup system imaginable - the problem of deciding which provider to use to connect to each user is shouldered by the users. This isn’t that unreasonable - after all, users may have strong feelings about what providers to use to talk to a given user. Alice might be quite happy to talk to Bob via BobChat, but might be very deliberately avoiding talking to him on DaveChat, for whatever ominous reasons.

However, it’s likely in future we will also see other directory services appear in order to map phone numbers (or other identities) to providers - whether these piggyback on top of existing identity providers (gatekeepers, DNS, telcos, SSO providers, governments) or are decentralised through some other mechanism. For instance, Bob could send AliceChat a blinded proof that he authorises them to automatically route traffic to him over at BobChat, with BobChat maintaining a matching proof that Bob is who he claims to be (having gone through BobChat’s auth process) - and the proofs could be linked using a temporary key such that Bob doesn’t even need to maintain a long-term one. (Thanks to James Monaghan for suggesting this one!)

Another alternative to having users decide where to find each other could be to use a decentralised Keybase-style identity system to track verified mappings of identities (phone numbers, email addresses etc) through to service providers - perhaps something like IDX might fit the bill? While decentralised identity lookups have historically been a hard problem, there is a lot of promising work happening in this space and the future looks bright.

🔗Talking to Bob

Meanwhile, Alice still needs to talk to Bob. As already discussed, unless everyone speaks the same end-to-end encrypted protocol (be it Matrix, WhatsApp or anything else), we inevitably have a trade-off here between interoperability and privacy if Bob is not on the same system as Alice (assuming AliceChat is end-to-end encrypted) - and we will need to clearly warn Alice that the conversation is no longer end-to-end encrypted:


To be clear: right now, today, if Bob were on AliceChat, he could be copy-pasting all your messages into (say) Google Translate in a frantic effort to work around the fact that his closed E2EE chat platform has no way to do machine translation. However, in a DMA world, Bob could legitimately loop a translation bot into the conversation… and Alice would be warned that the conversation was no longer secure (given the data is now being bridged over to Google).

This is a clear improvement in user experience and transparency. Likewise, if I’m talking to a bridged user today on one of these platforms, I have no way of telling that they have chosen to prioritise interop over E2EE - which is frankly terrifying. If I’m talking to someone on WhatsApp today I blindly assume that they are E2EE as they are on the same platform - and if they’re using an unofficial app or bridge, I have no way to tell. Whereas in a DMA world, you would expect the gatekeeper to transparently expose it.

If anything, this is good news for the gatekeeper, in that it consciously advertises a big selling point for them: that for full E2EE, users need to talk to other users in the same walled garden (unless of course the other platform speaks the same protocol). No more need for bus shelter adverts to remind everyone that WhatsApp is E2EE - instead they can remind the user every time they talk to someone outside the walled garden!

Just to spell it out: the DMA does not require or encourage any reduction in end-to-end encryption for WhatsApp or similar: full end-to-end encryption will still be there for users in the same platform, including through to users on custom clients (assuming the gatekeeper doesn’t flex and turn it off for other reasons).

Obviously, this flow only considers the simple case of Alice inviting Bob. The flow is of course symmetrical for Bob inviting Alice; AliceChat will need to advertise bridges which can be used to connect to its users. As Bob pops up from BobChat, the bridge would use AliceChat’s newly open APIs to provision a user for him, authing him as per any other user (thus ensuring that AliceChat doesn’t need to trust BobChat to have authenticated the user). The bridge then sends/receives messages on Bob’s behalf within AliceChat.

🔗Group communication

This is all very well for 1:1 chats - which are the initial scope of the DMA. However, over the coming years, we expect group chats to also be in scope. The good news is that the same general architecture works for group chats too. We need a better source of identity though: AliceChat can’t possibly independently authenticate all the new users which might be joining via group conversations on other servers (especially if they join indirectly via another server). This means adopting one of the decentralised identity lookup approaches outlined earlier to determine whether Charlie on CharlieChat is the real Charlie or an imposter.

Another problem which emerges with group chats which span multiple service providers is that of indirect routing, especially if the links between the providers use different protocols. What if AliceChat has a direct bridge to BobChat (a bit like AIM and ICQ both spoke OSCAR), BobChat and CharlieChat are connected by Matrix bridges, and AliceChat and CharlieChat are connected via XMPP bridges? We need a way for the bridges to decide who forwards traffic for each network, and who bridges the users for which network. If they were all on Matrix or XMPP this would happen automatically, but with mixed protocols we’d probably have to extend the lookup protocol to establish a spanning tree for each conversation to prevent forwarding loops.
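
Computing such a spanning tree is the easy part - the hard part is agreeing on a protocol for the providers to exchange and trust that routing information. For illustration only, here’s a minimal Python sketch over a made-up provider graph:

```python
from collections import deque

# Hypothetical graph of which providers have working bridges to each other
links = {
    "alicechat":   {"bobchat", "charliechat"},    # proprietary + XMPP bridges
    "bobchat":     {"alicechat", "charliechat"},  # Matrix bridge to CharlieChat
    "charliechat": {"alicechat", "bobchat"},
}

def spanning_tree(root: str) -> dict:
    """Breadth-first spanning tree: maps each provider to the neighbour it
    should forward conversation traffic through, preventing forwarding loops."""
    parent = {root: root}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for neighbour in sorted(links.get(node, ())):
            if neighbour not in parent:
                parent[neighbour] = node
                queue.append(neighbour)
    return parent

print(spanning_tree("alicechat"))
# {'alicechat': 'alicechat', 'bobchat': 'alicechat', 'charliechat': 'alicechat'}
```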

Here’s a deliberately twisty example to illustrate the above thought experiment:

There is also a risk of bridge proliferation here - in the worst case, every service would have to source bridges to directly connect to every other service who came along, creating a nightmarish n-by-m problem. But in practice, we expect direct proprietary-to-proprietary bridges to be rare: instead, we already have open standard communication protocols like Matrix and XMPP which provide a common language between bridges - so in practice, you could just end up in a world where each service has to find a them-to-Matrix or them-to-XMPP bridge (which could be run by them, or whatever trusted party they delegate to).

🔗Conclusion

A mesh of bridges which connect together the open APIs of proprietary vendors by converting them into open standards may seem unwieldy at first - but it’s precisely the sort of ductwork which links both phone networks and the Internet together in practice. As long as the bridging provides highest-common-denominator fidelity with as little impedance mismatch as possible, then it’s conceptually no different to converting circuit-switched phone calls to VoIP, or wired to wireless Ethernet, or any of the other bridges which we take entirely for granted in our lives thanks to their transparency.

Meanwhile, while this means a bit more user interface in the communication apps in order to select networks and warn about trustedness, the benefits to users are enormous as they put the user squarely back in control of their conversations. And the UX will improve as the tech evolves.

The bottom line is, we should not be scared of interoperability, just because we’ve grown used to a broken world where nothing can interconnect. There are tractable ways to solve it in a way that empowers and informs the user - and the DMA has now given the industry the opportunity to demonstrate that it can work.

Interoperability without sacrificing privacy: Matrix and the DMA

2022-03-25 — GeneralMatthew Hodgson
Last update: 2022-03-25 18:01

Yesterday the EU Parliament & Council agreed on the contents of the Digital Markets Act - new legislation from the EU intended to limit anticompetitive behaviour from tech “gatekeepers”, i.e. big tech companies (those with a market capitalisation larger than €75B or with more than €7.5B a year of revenue).

This is absolutely landmark legislation, where the EU has decided not to break the gatekeepers up in order to create a more competitive marketplace - but instead to “break them open”. This is unbelievably good news for the open Internet, as it is obligating the gatekeepers to provide open APIs for their communication services. In other words: no longer will the tech giants be able to arbitrarily lock their users inside their walled gardens - there will be a legal requirement for them to expose APIs to other services.

While the formal outcomes of yesterday’s agreement haven’t been published yet (beyond this press release), our understanding is that the DMA will mandate:

  • Gatekeepers will have to provide open and documented APIs to their services, on request, in order to facilitate interoperability (i.e. so that other services can communicate with their users).
  • These APIs must preserve the same level of end-to-end encryption (if any) to remote users as is available to local users.
  • This applies to 1:1 messaging and file transfer in the short term, and group messaging, file-transfer, 1:1 VoIP and group VoIP in the longer term.

This is the best possible outcome imaginable for the open internet. Never again will a big tech company be able to hold their users hostage in a walled garden, or arbitrarily close down or sabotage their APIs.

🔗So, what’s the catch?

Since the DMA announcement on Thursday, there’s been quite a lot of yelling from some very experienced voices that mandating interoperability via open APIs is going to irrevocably undermine end-to-end encrypted messengers like WhatsApp. This seems to mainly be born out of a concern that the DMA is somehow trying to subvert end-to-end encryption, despite the fact that the DMA explicitly mandates that the APIs must expose the same level of security, including end-to-end encryption, that local users are using. (N.B. Signal doesn’t qualify as a gatekeeper, so none of this is relevant to Signal).

So, for WhatsApp, it means that the API would expose both the message-passing interface as well as the key management APIs required to interoperate with WhatsApp using your own end-to-end-encrypted WhatsApp client - E2EE would be preserved.

However, this does mean that if you were to actively interoperate between providers (e.g. if Matrix turned up and asked WhatsApp, post DMA, to expose an API we could use to write bridges against), then that bridge would need to convert between WhatsApp’s E2EE’d payloads and Matrix’s E2EE’d payloads. (Even though both WhatsApp and Matrix use the Double Ratchet, the actual payloads within the encryption are completely different and would need to be converted). Therefore such a bridge has to re-encrypt the traffic - which means that the plaintext is exposed on the bridge, putting it at risk and breaking the end-to-end encryption guarantee.
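
Schematically, the bridge’s job in that scenario boils down to something like the sketch below. The session objects stand in for the two protocols’ real ratchet implementations - the point is simply that a plaintext variable has to exist in the bridge’s memory between the two calls:

```python
def bridge_message(source_ciphertext: bytes, source_session, target_session) -> bytes:
    """Illustrative only: re-encrypting a message when bridging between two
    E2EE protocols (e.g. WhatsApp's Signal-protocol payloads and Matrix's
    Megolm payloads). The session objects are stand-ins for real crypto state."""
    # 1. Decrypt with the source protocol's session...
    plaintext = source_session.decrypt(source_ciphertext)

    # 2. ...at which point the plaintext exists in the bridge's memory -
    #    which is exactly what weakens the end-to-end guarantee...

    # 3. ...before re-encrypting with the destination protocol's session.
    return target_session.encrypt(plaintext)
```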

There are solutions to this, however:

  • We could run the bridge somewhere relatively safe - e.g. the user’s client. There’s a bunch of work going on already in Matrix to run clientside bridges, so that your laptop or phone effectively maintains a connection over to iMessage or WhatsApp or whatever as if it were logged in… but then relays the messages into Matrix once re-encrypted. By decentralising the bridges and spreading them around the internet, you avoid them becoming a single honeypot that bad actors might look to attack: instead it becomes more a question of endpoint compromise (which is already a risk today).

  • The gatekeeper could switch to a decentralised end-to-end encrypted protocol like Matrix to preserve end-to-end encryption throughout. This is obviously significant work on the gatekeeper’s side, but we shouldn’t rule it out. For instance, making the transition for a non-encrypted service is impressively little work, as we proved with Gitter. (We’d ideally need to figure out decentralised/federated identity-lookup first though, to avoid switching from one centralised identity database to another).

  • Worst case, we could flag to the user that their conversation is insecure (the chat equivalent of a scary TLS certificate warning). Honestly, this is something communication apps (including Matrix-based ones!) should be doing anyway: as a user you should be able to tell what 3rd parties (bots, integrations etc) have been added to a given conversation. Adding this sort of semantic actually opens up a much richer set of communication interactions, by giving the user the flexibility over who to trust with their data, even if it breaks the platonic ideal of pure E2E encryption.

On balance, we think that the benefits of mandating open APIs outweigh the risks that someone is going to run a vulnerable large-scale bridge and undermine everyone’s E2EE. It’s better to have the option to be able to get at your data in the first place than be held hostage in a walled garden.

🔗Other considerations

One other complaint which has come up a bunch is around speed of innovation: the idea that WhatsApp or similar would be seriously slowed down by having to (effectively) maintain stable documented federation APIs, and figure out how to do backwards compatibility for new features. It’s true that this will take a bit more effort (similar to how adding GDPR compliance takes some effort), but the ends make it more than worth it. Plus, if the rag-tag Matrix ecosystem can do it, it doesn’t seem unreasonable to think that a $600B company like Meta can figure it out too...

Another consideration is that it might make it too easy to build malicious 3rd party clients - e.g. building your own "special" version of Signal which connects to the official service, but deliberately or otherwise has security flaws. The fact is that we're already in this position though: there are illicit alternative clients flying around all over the place, and the onus is on the app stores to protect their users from installing malware. This isn't reason to throw the baby of interoperability out with the bathwater of bootleg clients.

The final complaint is about moderation and abuse: while open APIs are good news for consumer choice, they can also be used by spammers, phishers and other miscreants to cause problems for the users within the gatekeeper. Much like a mediaeval citadel, opening up your walled garden means that both good and bad people can turn up. And much like real life, this is a solvable problem, even if it’s unfortunate: the benefits of free trade massively outweigh the downsides of having to police strangers more effectively. Frankly, moderation and anti-abuse approaches on the Internet today are infamously broken, with centralised moderation by gatekeepers producing increasingly erratic results. By opening the walled gardens, we are forcing a much-needed opportunity to review how to empower users and admins to filter unwanted content on their own terms. There’s a recent write-up of the proposed approach for Matrix at https://element.io/blog/moderation-needs-a-radical-change/, which outlines one strategy - but there are many others. Honestly, having to improve moderation tooling is a worthwhile price to pay for the benefits of open APIs.

So, there you have it. Hopefully you’ll agree that the benefits here outweigh the risks: without open APIs we wouldn't even have the option to talk about interoperability. We should be celebrating a new dawn for open access, rather than fearing that the sky is falling and that this is a nefarious attempt to undermine end-to-end encryption.

The Mega Matrix Holiday Special 2021

2021-12-22 — General, Holiday SpecialMatthew Hodgson
Last update: 2021-12-22 17:54

Hi all,

If you’re reading this - congratulations; you made it through another year :) Every winter we sit down and review Matrix’s progress over the last twelve months, and look forward to the next - for it’s all too easy to get lost in the day-to-day development and fail to realise how much the overall project is evolving, especially when it’s one as large and ambitious as Matrix!

Looking back at 2021, it’s unbelievable how much stuff has been going on in the core team (as you can tell by the length of this post - sorry!). There’s been a really interesting mix of activity too - between massive improvements to the core functionality and baseline features that Matrix provides, and also major breakthroughs on next generation work. But first, let’s check out what’s been happening in the wider ecosystem…

Continue reading…

Disclosing CVE-2021-40823 and CVE-2021-40824: E2EE vulnerability in multiple Matrix clients

2021-09-13 — SecurityDenis Kasak, Dan Callahan, Matthew Hodgson

Today we are disclosing a critical security issue affecting multiple Matrix clients and libraries including Element (Web/Desktop/Android), FluffyChat, Nheko, Cinny, and SchildiChat. Element on iOS is not affected.

Specifically, in certain circumstances it may be possible to trick vulnerable clients into disclosing encryption keys for messages previously sent by that client to user accounts later compromised by an attacker.

Exploiting this vulnerability to read encrypted messages requires gaining control over the recipient’s account. This requires either compromising their credentials directly or compromising their homeserver.

Thus, the greatest risk is to users who are in encrypted rooms containing malicious servers. Admins of malicious servers could attempt to impersonate their users' devices in order to spy on messages sent by vulnerable clients in that room.

This is not a vulnerability in the Matrix or Olm/Megolm protocols, nor the libolm implementation. It is an implementation bug in certain Matrix clients and SDKs which support end-to-end encryption (“E2EE”).

We have no evidence of the vulnerability being exploited in the wild.

This issue was discovered during an internal audit by Denis Kasak, a security researcher at Element.

🔗Remediation and Detection

Patched versions of affected clients are available now; please upgrade as soon as possible — we apologise sincerely for the inconvenience. If you are unable to upgrade, consider keeping vulnerable clients offline until you can. If vulnerable clients are offline, they cannot be tricked into disclosing keys. They may safely return online once updated.

Unfortunately, it is difficult or impossible to retroactively identify instances of this attack with standard logging levels present on both clients and servers. However, as the attack requires account compromise, homeserver administrators may wish to review their authentication logs for any indications of inappropriate access.

Similarly, users should review the list of devices connected to their account with an eye toward missing, untrusted, or non-functioning devices. Because an attacker must impersonate an existing or historical device, exploiting this vulnerability would either break an existing login on the user’s account, or cause a historical device to be re-added and flagged as untrusted.
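
For instance, you can list the devices attached to your account directly via the Matrix client-server API and check for anything you don’t recognise. A sketch in Python - substitute your own homeserver URL and access token:

```python
import json
import urllib.request

HOMESERVER = "https://matrix.example.com"   # your homeserver
ACCESS_TOKEN = "YOUR_ACCESS_TOKEN"          # an access token for your account

req = urllib.request.Request(
    f"{HOMESERVER}/_matrix/client/r0/devices",
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
)
with urllib.request.urlopen(req) as resp:
    devices = json.load(resp)["devices"]

for d in devices:
    # Look for devices you don't recognise, or known devices that have vanished.
    print(d["device_id"], d.get("display_name"), d.get("last_seen_ts"))
```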

Lastly, if you have previously verified the users / devices in a room, you would witness the safety shield on the room turn red during the attack, indicating the presence of an untrusted and potentially malicious device.

🔗Affected Software

Given the severity of this issue, Element attempted to review all known encryption-capable Matrix clients and libraries so that patches could be prepared prior to public disclosure.

Known vulnerable software:

We believe the following software is not vulnerable:

We believe the following are not vulnerable due to not implementing key sharing:

🔗Background

Matrix supports the concept of “key sharing”, letting a Matrix client which lacks the keys to decrypt a message request those keys from that user's other devices or the original sender's device.

This was a feature added in 2016 in order to address edge cases where a newly logged-in device might not have the necessary keys to decrypt historical messages. Specifically, if other devices in the room are unaware of the new device due to a network partition, they have no way to encrypt for it—meaning that the only way the new device will be able to decrypt history is if the recipient's other devices share the necessary keys with it.

Other situations where key sharing is desirable include when the recipient hasn't backed up their keys (either online or offline) and needs them to decrypt history on a new login, or when facing implementation bugs which prevent clients from sending keys correctly. Requesting keys from a user's other devices sidesteps these issues.

Key sharing is described in the Matrix E2EE Implementation Guide, which contains the following paragraph:

In order to securely implement key sharing, clients must not reply to every key request they receive. The recommended strategy is to share the keys automatically only to verified devices of the same user.

This is the approach taken in the original implementation in matrix-js-sdk, as used in Element Web and others, with the extension of also letting the sending device service keyshare requests from recipient devices. Unfortunately, the implementation did not sufficiently verify the identity of the device requesting the keyshare, meaning that a compromised account can impersonate the device requesting the keys, creating this vulnerability.

This is not a protocol or specification bug, but an implementation bug which was then unfortunately replicated in other independent implementations.

While we believe we have identified and contacted all affected E2EE client implementations: if your client implements key sharing requests, we strongly recommend you check that you cryptographically verify the identity of the device which originated the key sharing request.
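
In other words, the decision to service a key share request should look roughly like the sketch below. The helper names are illustrative - what matters is that the requesting device is resolved from your cryptographically verified device list, rather than trusted from the request itself:

```python
def should_share_keys(request, own_user_id: str, device_store) -> bool:
    """Illustrative policy for answering a key share request.

    `request.user_id` / `request.device_id` say who is asking; `device_store`
    is assumed to hold cryptographically verified device lists (i.e. the
    device verification / cross-signing state tracked by the client)."""
    # Only ever consider requests from our own user's devices.
    if request.user_id != own_user_id:
        return False

    # Resolve the device from the verified device list - NOT from the request
    # payload - and only share if we have actually verified that device.
    device = device_store.get_verified_device(request.user_id, request.device_id)
    return device is not None and device.is_verified
```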

🔗Next Steps

The fact that this vulnerability was independently introduced so many times is a clear signal that the current wording in the Matrix Spec and the E2EE Implementation Guide is insufficient. We will thoroughly review the related documentation and revise it with clear guidelines on safely implementing key sharing.

Going further, we will also consider whether key sharing is still a necessary part of the Matrix protocol. If it is not, we will remove it. As discussed above, key sharing was originally introduced to make E2EE more reliable while we were ironing out its many edge cases and failure modes. Meanwhile, implementations have become much more robust, to the point that we may be able to go without key sharing completely. We will also consider changing how we present situations in which you cannot decrypt messages because the original sender was not aware of your presence. For example, undecryptable messages could be filed in a separate conversation thread, or those messages could require that keys are shared manually, effectively turning a bug into a feature.

We will also accelerate our work on matrix-rust-sdk as a portable reference implementation of the Matrix protocol, avoiding the implicit requirement that each independent library must necessarily reimplement this logic on its own. This will have the effect of reducing attack surface and simplifying audits for software which chooses to use matrix-rust-sdk.

Finally, we apologise to the wider Matrix community for the inconvenience and disruption of this issue. While Element discovered this vulnerability during an internal audit of E2EE implementations, we will be funding an independent end-to-end audit of the reference Matrix E2EE implementations (not just Olm + libolm) in the near future to help mitigate the risk from any future vulnerabilities. The results of this audit will be made publicly available.

🔗Timeline

Ultimately, Element took two weeks from initial discovery to completing an audit of all known, public E2EE implementations. It took a further week to coordinate disclosure, culminating in today's announcement.

  • Monday, 23rd August — Discovery that Element Web is exploitable.
  • Thursday, 26th August — Determination that Element Android is exploitable with a modified attack.
  • Wednesday, 1st September — Determination that Element iOS fails safe in the presence of device changes.
  • Friday, 3rd September — Determination that FluffyChat and Nheko are exploitable.
  • Tuesday, 7th September — Audit of Matrix clients and libraries complete.
  • Wednesday, 8th September — Affected software authors contacted, disclosure timelines agreed.
  • Friday, 10th September — Public pre-disclosure notification. Downstream packagers (e.g., Linux distributions) notified via Matrix and e-mail.
  • Monday, 13th September — Coordinated releases of all affected software, public disclosure.

Element raises $30M to boost Matrix

2021-07-27 — General, NewsMatthew Hodgson

Hi folks,

Big news today: Element, the startup founded by the team who created Matrix, just raised $30M of Series B funding in order to further accelerate Matrix development and improve Element, the flagship Matrix app. The round is led by our friends at Protocol Labs and Metaplanet, the fund established by Jaan Tallinn (co-founder of Skype and Kazaa). Both Protocol Labs and Metaplanet are spectacularly on board our decentralised communication quest, and you couldn't really ask for a better source of funding to help take Matrix to the next level. Thank you for believing in Matrix and leading Element's latest funding!

You can read all about it from the Element perspective over at the Element Blog, but suffice it to say that this is enormous news for the Matrix ecosystem as a whole. In addition to transforming the Element app, on the Matrix side this means that there is now concrete funding secured to:

Obviously this is in addition to all the normal business-as-usual work going on in terms of:

  • getting Spaces out of beta
  • adding Threading to Element (yes, it's finally happening!)
  • speeding up room joins over federation
  • creating 'sync v3' to lazy-load all content and make the API super-snappy
  • lots of little long-overdue fun bits and pieces (yes, custom emoji, we're looking at you).

If you're wondering whether Protocol Labs' investment means that we'll be seeing more overlap between IPFS and Matrix, then yes - where it makes tech sense to do so, we're hoping to work more closely together; for instance collaborating with the libp2p team on our P2P work (we still need to experiment properly with gossipsub!), or perhaps giving MSC2706 some attention. However, there are no plans to use cryptocurrency incentives in Matrix or Element any time soon.

So, exciting times ahead! We'd like to inordinately thank everyone who has supported Matrix over the years - especially our Patreon supporters, whose donations pay for all the matrix.org infrastructure while inspiring others to open their cheque books; the existing investors at Element (especially Notion and Automattic, who have come in again on this round); all the large scale Matrix deployments out there which are effectively turning Matrix into an industry (hello gematik!) - and everyone who has ever run a Matrix server, contributed code, used the spec to make their own Matrix-powered creation, or simply chatted on Matrix.

Needless to say, Matrix wouldn't exist without you: the protocol and network would have fizzled out long ago were it not for all the people supporting it (the matrix.org server can now see over 35.5M addressable users on the network!) - and meanwhile the ever-increasing energy of the community and the core team combines to keep the protocol advancing forwards faster than ever.

We will do everything we possibly can to succeed in creating the long-awaited secure communication layer of the open Web, and we look forward to large amounts of Element's new funding being directed directly into core Matrix development :)

thanks for flying Matrix,

Matthew, Amandine & the whole Matrix core team.

Dendrite 0.4.1 Released

2021-07-26 — ReleasesMatthew Hodgson

It's only been two weeks since Dendrite 0.4 landed, but there's already a significant new release with Dendrite 0.4.1 (it's amazing how much work we can do on Dendrite when not off chasing low-bandwidth and P2P Matrix!)

This release further improves memory performance and radically improves state resolution performance (rumour has it that it's a 10x speed-up). Meanwhile, Server-Server (federation) API sytest coverage is up to 91% (!!) and the Client-Server API is now at 63%.

We're going to try to keep the pressure up over the coming weeks - and once sytest is at 100% coverage (and we're not missing any big features which sytest doesn't cover yet) we'll be declaring a 1.0 :)

If you're running Dendrite, please upgrade. If not, perhaps this would be a good version to give it a try? You can get it, as always, from https://github.com/matrix-org/dendrite/releases/tag/v0.4.1. The changelog follows:

🔗Features

  • Support for room version 7 has been added
  • Key notary support is now more complete, allowing Dendrite to be used as a notary server for looking up signing keys
  • State resolution v2 performance has been optimised further by caching the create event, power levels and join rules in memory instead of parsing them repeatedly
  • The media API now handles cases where the maximum file size is configured to be less than 0 for unlimited size
  • The initial_state in a /createRoom request is now respected when creating a room
  • Code paths for checking if servers are joined to rooms have been optimised significantly

🔗Fixes

  • A bug resulting in "cannot xref null state block with snapshot" errors during the new state storage migration has been fixed
  • Invites are now retired correctly when rejecting an invite from a remote server which is no longer reachable
  • The DNS cache cache_lifetime option is now handled correctly (contributed by S7evinK)
  • Invalid events in a room join response are now dropped correctly, rather than failing the entire join
  • The prev_state of an event will no longer be populated incorrectly to the state of the current event
  • Receiving an invite to an unsupported room version will now correctly return the M_UNSUPPORTED_ROOM_VERSION error code instead of M_BAD_JSON (contributed by meenal06)

-- Team Dendrite

Germany’s national healthcare system adopts Matrix!

2021-07-21 — General, NewsMatthew Hodgson

Hi folks,

We’re incredibly excited to officially announce that the national agency for the digitalisation of the healthcare system in Germany (gematik) has selected Matrix as the open standard on which to base its interoperable instant messaging standard - the TI-Messenger.

gematik has released a concept paper that explains the initiative in full.

🔗TL;DR

With the TI-Messenger, gematik is creating a nationwide decentralised private communication network - based on Matrix - to support potentially more than 150,000 healthcare organisations within Germany’s national healthcare system. It will provide end-to-end encrypted VoIP/Video and messaging for the whole healthcare system, as well as the ability to share healthcare based data, images and files.

Initially every healthcare provider (HCP) with an HBA (HPC ID card) will be able to choose their own TI-Messenger provider. The homeserver for HCP accounts will be hosted in the provider’s datacentre. The homeserver for institutions can be hosted by TI-Messenger providers, or on-premise.

Each organisation and individual will therefore retain complete ownership and control of their communication data - while being able to share it securely within the healthcare system with end-to-end encryption by default. All servers in the Matrix-based private federation will be hosted within Germany.

Needless to say, security is key when underpinning the entire nation’s healthcare infrastructure and safeguarding sensitive patient data. As such, the entire implementation will be accredited by BSI (Federal Office for Information Security) and BfDI (Federal Commissioner for Data Protection and Freedom of Information).

🔗The full context...

Germany’s digital care modernisation law (“Digitale Versorgung und Pflege Modernisierungs Gesetz” or DVPMG), which came into force in June 2021, spells out the need for an instant messaging solution.

The urgency has been increased by a significant rise in the use of instant messaging and video conferencing within the healthcare system - for instance, the number of medical practices using messenger services doubled in 2020 compared to 2018 (much of this using insecure messaging solutions).

gematik, majority-owned by Germany’s Federal Ministry of Health, is responsible for the standardised digital transformation of Germany’s healthcare sector. It focuses on improving efficiency and introducing new ways of working by setting, testing and certifying healthcare technology including electronic health cards, electronic patient records and e-prescriptions.

TI-Messenger is gematik’s technical specification for an interoperable secure instant messaging standard. The healthcare industry will be able to build a wide range of apps based on TI-Messenger specifications knowing that, being built on Matrix, all those apps will interoperate.

More than 150,000 organisations - ranging from local doctors to clinics, hospitals, and insurance companies - can potentially standardise on instant messaging thanks to gematik’s TI-Messenger initiative.

🔗The road to interoperability

By 1 October 2021, TI-Messenger will initially specify how communication should work in practice between healthcare professionals (HCPs). Physicians will be able to find and communicate with each other via TI-Messenger approved apps - specifications include secure authentication mechanisms with electronic health professional cards (eHBAs), electronic institution cards (SMC-B) and a central FHIR directory. The first compliant apps for HCPs are expected to be licensed by Q2 2022.

Eric Grey (product manager for TI-Messenger at gematik), reckons there will initially be around 10-15 TI-Messenger compliant Matrix-based apps for HCP communications available from different vendors.

Healthcare professionals will be able to choose a TI-Messenger provider, who will be hosting their personal accounts and provide the messenger-client.

Healthcare organisations will choose a TI-Messenger provider to build the dedicated homeserver infrastructure (on prem or in a data center), provide the client and ongoing support.

🔗What does this mean for the Matrix community?

Matrix is already integral to huge parts of the public sector; from the French government’s Tchap platform, to Bundeswehr’s use of BwMessenger and adoption by universities and schools across Europe.

Germany’s healthcare system standardising on Matrix takes this to entirely the next level - and we can’t wait to see the rest of Europe (and the world!) converge on Matrix for healthcare!

We'll have more info about TI-Messenger on this week's Matrix Live, out on Friday - stay tuned!

Security update: Synapse 1.37.1 released

2021-06-30 — Releases, SecurityMatthew Hodgson

Hi all,

Over the last few days we've seen a distributed spam attack across the public Matrix network, where large numbers of spambots have been registered across servers with open registration and then used to flood abusive traffic into rooms such as Matrix HQ.

The spam itself has been handled by temporarily banning the abused servers. However, on Monday and Tuesday the volume of traffic triggered performance problems for the homeservers participating in targeted rooms (e.g. memory explosions, or very delayed federation). This was due to a combination of factors, but one of the most important ones was Synapse issue #9490: that one busy room could cause head-of-line blocking, starving your server from processing events in other rooms, causing all traffic to fall behind.

We're happy to say that Synapse 1.37.1 fixes this and we now process inbound federation traffic asynchronously, ensuring that one busy room won't impact others. First impressions are that this has significantly improved federation performance and end-to-end encryption stability — for instance, new E2EE keys from remote users for a given conversation should arrive immediately rather than being blocked behind other traffic.

Please upgrade to Synapse 1.37.1 as soon as possible, in order to increase resilience to any other traffic spikes.

Also, we highly recommend that you disable open registration or, if you keep it enabled, use SSO or require email validation to avoid abusive signups. Empirically adding a CAPTCHA is not enough. Otherwise you may find your server blocked all over the place if it is hosting spambots.

Finally, if your server has open registration, PLEASE check whether spambots have been registered on your server, and deactivate them. Once deactivated, you will need to contact [email protected] to request that blocks on your server are removed.

Your best bet for spotting and neutralising dormant spambots is to review signups on your homeserver over the past 3-5 days and deactivate suspicious users. We do not recommend relying solely on lists of suspicious IP addresses for this task, as the distributed nature of the attack means any such list is likely to be incomplete or include shared proxies which may also catch legitimate users.

To ease review, we're working on an auditing script in #10290; feedback on whether this is useful would be appreciated. Problematic accounts can then be dealt with using the Deactivate Account Admin API.
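
For reference, the Deactivate Account Admin API is a single authenticated POST per account. A Python sketch (assuming your access token belongs to a server admin; double-check each username before deactivating anything):

```python
import json
import urllib.request

HOMESERVER = "https://matrix.example.com"   # your homeserver
ADMIN_TOKEN = "YOUR_ADMIN_ACCESS_TOKEN"     # must belong to a server admin

def deactivate(user_id: str, erase: bool = False) -> None:
    """Deactivate a (spambot) account via Synapse's admin API."""
    req = urllib.request.Request(
        f"{HOMESERVER}/_synapse/admin/v1/deactivate/{user_id}",
        data=json.dumps({"erase": erase}).encode(),
        headers={"Authorization": f"Bearer {ADMIN_TOKEN}",
                 "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        print(user_id, json.load(resp))

# deactivate("@suspicious-bot:matrix.example.com")
```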

Meanwhile, over to Dan for the Synapse 1.37 release notes.

🔗Synapse 1.37 Release Announcement

Synapse 1.37 is now available!

**Note:** The legacy APIs for Spam Checker extension modules are now considered deprecated and targeted for removal in August. Please see the module docs for information on updating.

This release also removes Synapse's built-in support for the obsolete ACMEv1 protocol for automatically obtaining TLS certificates. Server administrators should place Synapse behind a reverse proxy for TLS termination, or switch to a standalone ACMEv2 client like certbot.

🔗Knock, knock?

After nearly 18 months and 129 commits, Synapse now includes support for MSC2403: Add "knock" feature and Room Version 7! This feature allows users to directly request admittance to private rooms, without having to track down an invitation out-of-band. One caveat: Though the server-side foundation is there, knocking is not yet implemented in clients.

🔗A Unified Interface for Extension Modules

Third party modules can customize Synapse's behavior, implementing things like bespoke media storage providers or user event filters. However, Synapse previously lacked a unified means of enumerating and configuring third-party modules. That changes with Synapse 1.37, which introduces a new, generic interface for extensions.

This new interface consolidates configuration into one place, allowing for more flexibility and granularity by explicitly registering callbacks with specific hooks. You can learn more about the new module API in the docs linked above, or in Matrix Live S6E29, due out this Friday, July 2nd.

🔗Safer Reauthentication

User-interactive authentication ("UIA") is required for potentially dangerous actions like removing devices or uploading cross-signing keys. However, Synapse can optionally be configured to provide a brief grace period such that users are not prompted to re-authenticate on actions taken shortly after logging in or otherwise authenticating.

This improves user experience, but also creates risks for clients which rely on UIA as a guard against actions like account deactivation. Synapse 1.37 protects users by exempting especially risky actions from the grace period. See #10184 for details.

🔗Smaller Improvements

We've landed a number of smaller improvements which, together, make Synapse more responsive and reliable. We now:

  • More efficiently respond to key requests, preventing excessive load (#10221, #10144)
  • Render docs for each vX.Y Synapse release, starting with v1.37 (#10198)
  • Ensure that log entries from failures during early startup are not lost (#10191)
  • Have a notion of database schema "compatibility versions", allowing for more graceful upgrades and downgrades of Synapse (docs)

We've also resolved two bugs which could cause sync requests to immediately return with empty payloads (#8518), producing a tight loop of repeated network requests.

🔗Everything Else

Lastly, we've merged an experimental implementation of MSC2716: Incrementally importing history into existing rooms (#9247) as part of Element's work to fully integrate Gitter into Matrix.

These are just the highlights; please see the Upgrade Information and Release Notes for a complete list of changes in this release.

Synapse is a Free and Open Source Software project, and we'd like to extend our thanks to everyone who contributed to this release, including aaronraimist, Bubu, dklimpel, jkanefendt, lukaslihotzki, mikure, and Sorunome.

The Matrix Space Beta!

2021-05-17 — General, TechMatthew Hodgson
Last update: 2021-05-17 17:35

Hi all,

As many know, over the years we've experimented with how to let users locate and curate sets of users and rooms in Matrix. Back in Nov 2017 we added 'groups' (aka 'communities') as a custom mechanism for this - introducing identifiers beginning with a + symbol to represent sets of rooms and users, like +matrix:matrix.org.

However, it rapidly became obvious that Communities had some major shortcomings. They ended up being an extensive and entirely new API surface (designed around letting you dynamically bridge the membership of a group through to a single source of truth like LDAP) - while in practice groups have enormous overlap with rooms: managing membership, inviting by email, access control, power levels, names, topics, avatars, etc. Meanwhile the custom groups API re-invented the wheel for things like pushing updates to the client (causing a whole suite of problems). So clients and servers alike ended up reimplementing large chunks of similar functionality for both rooms and groups.

And so almost before Communities were born, we started thinking about whether it would make more sense to model them as a special type of room, rather than being their own custom primitive. MSC1215 had the first thoughts on this in 2017, and then a formal proposal emerged at MSC1772 in Jan 2019. We started working on this in earnest at the end of 2020, and christened the new way of handling groups of rooms and users as... Spaces!

Spaces work as follows:

  • You can designate specific rooms as 'spaces', which contain other rooms.
  • You can have a nested hierarchy of spaces.
  • You can rapidly navigate around that hierarchy using the new 'space summary' (aka space-nav) API - MSC2946.
  • Spaces can be shared with other people publicly, or invite-only, or private for your own curation purposes.
  • Rooms can appear in multiple places in the hierarchy.
  • You can have 'secret' spaces where you group your own personal rooms and spaces into an existing hierarchy.
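
Under the hood (per MSC1772), a space is just a room whose creation event carries `type: m.space`, and the hierarchy is expressed with ordinary `m.space.child` state events - so you can build one with nothing more than the existing client-server API. A hedged sketch with python-requests; the homeserver, access token and room IDs below are placeholders:

```python
import requests
from urllib.parse import quote

HS = "https://matrix.example.org"   # placeholder homeserver
TOKEN = "syt_example_token"         # placeholder access token
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

# 1. Create a room that is marked as a space via its m.room.create content.
resp = requests.post(
    f"{HS}/_matrix/client/r0/createRoom",
    headers=HEADERS,
    json={
        "name": "Our Community Space",
        "preset": "public_chat",
        "creation_content": {"type": "m.space"},
    },
)
space_id = resp.json()["room_id"]

# 2. Link an existing room into the space with an m.space.child state event;
#    the state key is the child room's ID, and "via" lists servers to join through.
child_id = "!existingroom:example.org"   # placeholder room ID
requests.put(
    f"{HS}/_matrix/client/r0/rooms/{quote(space_id)}/state/m.space.child/{quote(child_id)}",
    headers=HEADERS,
    json={"via": ["example.org"]},
)
```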

Today, we're ridiculously excited to be launching Space support as a beta in matrix-react-sdk and matrix-android-sdk2 (and thus Element Web/Desktop and Element Android) and Synapse 1.34.0 - so head over to your nearest Element, make sure it's connected to the latest Synapse (and that Synapse has Spaces enabled in its config) and find some Space to explore! #community:matrix.org might be a good start :)

The beta today gives us the bare essentials: we haven't yet finished space-based access controls such as setting power levels in rooms based on space membership (MSC2962) or limiting who can join a room based on their space membership (MSC3083) - but these will be coming asap. We also need to figure out how to implement Flair on top of Spaces rather than Communities.

This is also a bit of a turning point in Matrix's architecture: we are now using rooms more and more as a generic way of modelling new features in Matrix. For instance, rooms could be used as a structured way of storing files (MSC3089); Reputation data (MSC2313) is stored in rooms; Threads can be stored in rooms (MSC2836); Extensible Profiles are proposed as rooms too (MSC1769). As such, this pushes us towards ensuring rooms are as lightweight as possible in Matrix - and that things like sync and profile changes scale independently of the number of rooms you're in. Spaces effectively gives us a way of creating a global decentralised filesystem hierarchy on top of Matrix - grouping the existing rooms of all flavours into an epic multiplayer tree of realtime data. It's like USENET had a baby with the Web!

For lots more info from the Element perspective, head over to the Element blog. Finally, the point of the beta is to gather feedback and fix bugs - so please go wild in Element reporting your first impressions and help us make Spaces as awesome as they deserve to be!

Thanks for flying Matrix into Space;

Matthew & the whole Spaces (and Matrix) team.

How we hosted FOSDEM 2021 on Matrix

2021-02-15 — Events, FOSDEM — Matthew Hodgson

Hi all,

Just over a week ago we had the honour of using Matrix to host FOSDEM: the world's largest free & open source software conference. It's taken us a little while to write up the experience given we had to recover and catch up on business as usual... but better late than never, here's an overview of what it takes to run a ~30K attendee conference on Matrix!

[confetti and firework easter-eggs explode over the closing keynote of FOSDEM 2021]

First of all, a quick (re)introduction to Matrix for any newcomers: Matrix is an open source project which defines an open standard protocol for decentralised communication. The global Matrix network is made up of at least 28M Matrix IDs spread over around 60K servers. For FOSDEM, we set up a fosdem.org server to host newcomers, provided by Element Matrix Services (EMS) - Element being the startup formed by the Matrix core team to help fund Matrix development.

The most unique thing about Matrix is that conversations get replicated across all servers whose users are present in the conversation, so there's never a single point of control or failure for a conversation (much as git repositories get replicated between all contributors). And so hosting FOSDEM in Matrix meant that everyone already on Matrix (including users bridged to Matrix from IRC, XMPP, Slack, Discord etc) could attend directly - in addition to users signing up for the first time on the FOSDEM server. Therefore the chat around FOSDEM 2021 now exists for posterity on all the Matrix servers whose users participated; and we hope that the fosdem.org server will hang around for the benefit of all the newcomers for the foreseeable future so they don't lose their accounts!

Talking of which: the vital stats of the weekend were as follows:

  • We saw almost 30K local users on the FOSDEM server + 4K remote users from elsewhere in Matrix.
  • There were 24,826 guests (read-only invisible users) on the FOSDEM server.
  • There were 8,060 distinct users actively joined to the public FOSDEM rooms...
  • ...of which 3,827 registered on the FOSDEM server. (This is a bit of an eye-opener: over 50% of the actively participating attendees for FOSDEM were already on Matrix!)
  • These numbers don't count users who were viewing the livestreams directly, but only those who were attending via Matrix.

Given last year's FOSDEM had roughly 8,500 in-person attendees at the Université libre de Bruxelles, this feels like a pretty good outcome :)

Graphwise, local user activity on the FOSDEM server looked like this:

🔗How was it built?

There were four main components on the Matrix side:

  1. A horizontally-scalable Matrix server deployment (Synapse hosted in EMS)
  2. A Jitsi cluster for the video conferencing, used to host all the Q&A sessions, hallway sessions, stands, and other ad-hoc video conferences
  3. An elastically scalable Jibri cluster used to livestream the Jitsi conferences both to the official FOSDEM livestreams and to provide a local preview of the conference on Matrix (to avoid the Jitsis getting overloaded with folks who just want to view)
  4. conference-bot - a Matrix bot which orchestrated the overall conference on Matrix, written from scratch for FOSDEM by TravisR, consuming the schedule from FOSDEM and maintaining all the necessary rooms with the correct permissions, widgets, invites, etc.
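
To give a flavour of what that last piece does: the real conference-bot is written in TypeScript, but its core per-talk orchestration boils down to something like this toy Python sketch (the schedule fields, user IDs and homeserver details are invented for illustration, not the bot's actual data model):

```python
import requests

HS = "https://matrix.example.org"   # placeholder homeserver
TOKEN = "syt_example_token"         # placeholder bot access token
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

# Invented stand-in for the parsed FOSDEM schedule.
schedule = [
    {"title": "Talk: Matrix in 2021", "speaker": "@alice:example.org", "host": "@bob:example.org"},
]

for talk in schedule:
    # Create a private room per talk, invite the speaker and host, and give
    # the host moderator rights so they can run the Q&A.
    requests.post(
        f"{HS}/_matrix/client/r0/createRoom",
        headers=HEADERS,
        json={
            "name": talk["title"],
            "preset": "private_chat",
            "invite": [talk["speaker"], talk["host"]],
            "power_level_content_override": {
                "users": {talk["host"]: 50},
            },
        },
    )
```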

Architecturally, it looked like this:

On the clientside, we made heavy use of widgets: the ability to embed arbitrary web content as iframes into Matrix chatrooms. (Widgets currently exist as a set of proposals for the Matrix spec, which have been preemptively implemented in Element.)

For instance, the conference-bot created Matrix rooms for all the FOSDEM devrooms with a predefined widget for viewing the official FOSDEM livestream for that room, pointing at the appropriate HLS stream at stream.fosdem.org - which looked like this:
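
Concretely, a 'predefined widget' is just a state event in the room, so the bot can add a livestream viewer with a single state PUT. A sketch below, assuming Element's de-facto `im.vector.modular.widgets` event type; the room ID, widget ID and stream URL are placeholders rather than FOSDEM's real ones:

```python
import requests
from urllib.parse import quote

HS = "https://matrix.example.org"       # placeholder homeserver
TOKEN = "syt_example_token"             # placeholder bot access token
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

room_id = "!devroom:example.org"        # placeholder devroom
widget_id = "livestream"                # arbitrary state key identifying the widget

# Widgets are state events, so embedding a livestream viewer is one PUT.
requests.put(
    f"{HS}/_matrix/client/r0/rooms/{quote(room_id)}/state/im.vector.modular.widgets/{widget_id}",
    headers=HEADERS,
    json={
        "type": "customwidget",                              # free-form widget type label
        "name": "Devroom livestream",
        "url": "https://stream.example.org/devroom.m3u8",    # placeholder HLS URL
        "data": {},
    },
)
```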

Each devroom also had a schedule widget available on the righthand side, visualising the schedule of that room - huge thanks to Hato and Steffen and folks at Nordeck for putting this together at the last minute; it enormously helped navigate the devrooms (and even had a live countdown to help you track where you were at in the schedule!)

Each devroom was also available via IRC on Freenode via a dedicated bridge (#fosdem-...) and via XMPP.

The bot also created rooms for each and every talk at FOSDEM (all 666 of them), as the space where the speaker and host could hang out in advance, watch the talk together, and then broadcast the Q&A session. At the end of the talk slot, the bot then transformed the talk room into a 'hallway' for the talk, and advertised it to the audience in the devroom, so folks could pose follow-on questions to the speaker as so often happens in real life at FOSDEM. The speaker's view of the talk rooms looked like this:

On the right-hand side you can see a "scoreboard" - a simple widget which tracked which messages in the devroom had been most upvoted, to help select questions for the Q&A session. On the left-hand side you can see a hybrid Jitsi/livestream widget used to coordinate between the speaker & host. By default, the widget showed the local livestream of the video call - if you clicked 'join conference' you'd join the Jitsi itself. This stopped view-only users from overloading the Jitsi once the room became public.

The widgets themselves were hosted by the bot (you can see them at https://github.com/matrix-org/conference-bot/tree/main/web). Meanwhile the chat.fosdem.org webclient itself ended up being identical to mainline Element Web 1.7.19, other than FOSDEM branding and being configured to hook the 'video call' button up to the hybrid Jitsi/livestream widget rather than a plain Jitsi.

Meanwhile, for conferencing we hosted an off-the-shelf Jitsi cluster sized to ~100 concurrent conferences, and for the Jibri livestreaming we set up an elastically scalable cluster using AWS Auto Scaling Groups. Jibri is essentially a Chromium which views the Jitsi webapp, running in a headless X server whose framebuffer and ALSA audio are hooked up to an ffmpeg process which livestreams to the appropriate destination - so we chose to run a separate VM for every concurrent livestream to keep them isolated from each other. The Jibri ffmpegs compressed the livestream to RTMP and relayed it to our nginx, which in turn relayed it to FOSDEM's livestreaming infrastructure for use in the official stream, as well as relaying it back to the local video preview in the Matrix livestream/video widget.
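
For the curious, the capture-and-relay step on each Jibri VM boils down to an ffmpeg pipeline along these lines - a simplified sketch rather than the exact command Jibri constructs, with the display, ALSA device and RTMP URL all placeholders:

```python
import subprocess

RTMP_URL = "rtmp://rtmp-relay.example.org/live/devroom-x"  # placeholder relay URL

# Grab the headless X display that Chromium renders the Jitsi call into, plus
# its ALSA audio, encode to H.264/AAC, and push the result over RTMP to nginx.
subprocess.run(
    [
        "ffmpeg",
        "-f", "x11grab", "-video_size", "1920x1080", "-framerate", "30", "-i", ":0.0",
        "-f", "alsa", "-i", "default",   # actual device depends on the loopback setup
        "-c:v", "libx264", "-preset", "veryfast", "-g", "60",
        "-c:a", "aac", "-b:a", "128k",
        "-f", "flv", RTMP_URL,
    ],
    check=True,
)
```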

Here's a screengrab of the Jitsi/Jibri Grafana dashboard during the first day of the conference, showing 46 concurrent conferences in action, with 25 spare jibris in the scaling group cluster ready for action if needed :)

There was also an explosion of changes to Element itself to try to make things go as smoothly as possible. Probably the most important one was implementing Social Login - giving single-click registration for attendees who were happy to piggyback on existing identity providers (GitHub, GitLab, Google, Apple and Facebook) rather than signing up natively in Matrix:

This was a real epic to get together (and is also an important part of achieving parity between Gitter and Element) - and seems to have been surprisingly successful for FOSDEM. Almost 50% of users who signed up on the FOSDEM server did so via social login! We should also be turning it on for the matrix.org server this week.

Finally, on the Matrix server side, we ran a cluster of Synapse worker processes (1 federation inbound, reader and sender, 1 pusher, 1 initial sync worker, 10 synchrotrons, 1 event persister, 1 event creator, 4 general purpose client readers, 1 typing worker and 1 user directory) within Kubernetes on EMS. These were hooked up for horizontal scalability as follows:

The sort of traffic we saw (from day 2) looked like this:

🔗How did it go?

Overall, people seem to have had a good time. Some folks have even been kind enough to call it the best online event they've been to :) This probably reflects the fact that FOSDEM rocks no matter what - and that Matrix is an inherently social medium, built by and for open source communities (after all, the whole Matrix ecosystem is developed over Matrix!). Also, Matrix being an open network means that folks could join from all over, so the social dynamics already present in Matrix spilled over into FOSDEM - and we even saw a bunch of people spin up their own servers to participate; literally sharing the hosting responsibility themselves. Finally, having critical infrastructure rooms available such as #beerevent:fosdem.org, #cafe:fosdem.org and #food-trucks:fosdem.org probably helped as well.

That said, we did have some production incidents which impacted the event. The most serious one was on Saturday morning, where it transpired that some of the endpoints hosted on the main Synapse process were taking way more CPU than we'd anticipated - most importantly the /groups endpoints which handle traffic relating to communities (the legacy way of defining groups of rooms in Matrix). One of the last things we'd done to prepare for the conference on Friday night was to create a +fosdem:fosdem.org community which spanned all 1000 public rooms in the conference, as well as add the +staff:fosdem.org community to all of those rooms - and unfortunately we didn't anticipate how popular these would be. As a result we had to do some emergency rebalancing of endpoints, spinning up new workers and reconfiguring the loadbalancer to relieve load off the main process.

Ironically the Matrix server was largely working okay during this timeframe, given event-sending no longer passes through the main process - but the most serious impact was that the conference-bot was unable to boot, because on startup it hits a wide range of endpoints to sync up with the conference state, and some of these were timing out. This in turn impacted widgets, which had been hosted by the bot for convenience, meaning that the Jitsi conferences for stands and talk Q&A were unavailable (even though the Jitsi/Jibri cluster was fine). This was solved by lunchtime on Saturday: we are really sorry for folks whose Q&As or conferences got caught in this. On the plus side, we spotted that many affected rooms just added their own widgets for their own Jitsis or BBBs to continue with minimal distraction - effectively manually taking over from the bot.

The other main incident happened briefly first thing on Sunday morning, when two Jibri livestreams ended up trying to broadcast video to the same RTMP URL (potentially due to a race when rapidly removing and re-adding the Jitsi/livestream widget for one of the stands). This caused a cascading failure which briefly impacted all RTMP streams, but was solved within about 30 minutes. We also had a more minor problem with active speaker recognition malfunctioning in Jitsi on Sunday (apparently a risk when using SCTP rather than WebSockets as a transport within Jitsi) - this was solved around lunchtime. Again, we're really sorry if you were impacted by this. We've learned a lot from the experience, and if we end up doing this again we will make sure these failure modes don't repeat!

Other things we'd change if we have the chance to do it again include:

  • Providing a to-the-second countdown via a widget in the talk room so the speaker & host can see precisely when they're going 'live' in the devroom (and when precisely they're going to be cut in favour of the next talk)
  • Providing a scratch-pad of some kind in the talk room so the host & speaker can track which questions they want to answer, and which they've already answered
  • Keeping the questions scoreboard and scratchpad visible to the speaker/host after their Q&A has finished, so they can keep answering questions in the per-talk room - and advertising the per-talk room more effectively
  • Using Spaces rather than Communities to group the rooms together and automatically provide a structured room directory! (Like this!)
  • Using threads (once they land!) to help structure conversations in the devroom (perhaps these could even replace the hallway rooms?)
  • Making the schedule widgets easier to find, and having more of them around the place
  • Making the room directory easier to find
  • Giving the option of recording the video in the per-talk rooms and stands for posterity
  • Providing more tools to stands to help organise demos etc.

So, there you have it. We hope that this shows that it's possible to host successful large-scale conferences on Matrix using an entirely open source stack, and we hope that other events will be inspired to go online via Matrix! We should give a big shout out to HOPE, who independently trailblazed running conferences on Matrix last year and inspired us to make FOSDEM work.

If you want to know more, we also did a talk about FOSDEM-on-Matrix in this month's Open Tech Will Save Us meetup and the Building Massive Virtual Communities on Matrix talk at FOSDEM went into more detail too. Our historical Taking FOSDEM online via Matrix blog has been somewhat overtaken by events but gives further context still.

Finally, huge thanks to FOSDEM for letting Matrix host the social side of the conference. This was a big bet, starting from scratch with our offer to help back in September, and we hope it paid off. Also, thanks to all the folks at Element who bust a gut to pull it together - and to the FOSDEM organisers, who were a real pleasure to work with.

Let's hope that FOSDEM 2022 will be back in person at ULB - but whatever happens, the infrastructure we built this year will be available if ever needed in future.