Pre-disclosure: Upcoming critical security fix for Synapse

Hi all,

During the ongoing work to finalise a stable release of Matrix’s Server-Server federation API, we’ve been doing a full audit of Synapse’s implementation and have identified a serious vulnerability which we are going to release a security update to address (Synapse 0.33.3.1) on Thursday Sept 6th 2018 at 12:00 UTC.

We are coordinating with package maintainers to ensure that patched versions of packages will be available at that time – meanwhile, if you run your own Synapse, please be prepared to upgrade as soon as the patched versions are released.  All previous versions of Synapse are affected, so everyone will want to upgrade.

Thank you for your time, patience and understanding while we resolve the issue,

signed_predisclosure.txt

Matrix Spec Update August 2018

Introducing Client Server API 0.4, and the first ever stable IS, AS and Push APIs spec releases!

Hi folks,

As many know, we’ve been on a massive sprint to improve the spec – both fixing omissions where features have been implemented in the reference servers but were never formalised in the spec, and fixing bugs where the spec has thinkos which stop us from being able to ratify it as stable and thus fit for purpose.

In practice, our target has been to cut stable releases of all the primary Matrix APIs by the end of August – effectively declaring Matrix out of beta, at least at the specification level.  For context: historically only one API has ever been released as stable – the Client Server API, which was the result of a similar sprint back in Jan 2016. This means that the Server Server (SS) API, Identity Service (IS) API, Application Service (AS) API and Push Gateway API have never had an official stable release – which has obviously been problematic for those implementing them.

However, as of the end of Friday Aug 31, we’re proud to announce the first ever stable releases of the IS, AS and Push APIs!


To the best of our knowledge, these API specs are now complete and accurately describe all the current behaviour implemented in the reference implementations (sydent, synapse and sygnal) and are fit for purpose. Any deviation from the spec in the reference implementations should probably be considered a bug in the impl. All changes take the form of filling in spec omissions and adding clarifications to the existing behaviour in order to get things to the point that an independent party can implement these APIs without having to refer to anything other than the spec.

This is the result of a lot of work which spans the whole Spec Core Team, but has been particularly driven by TravisR, who has taken the lead on this whole mission to improve the spec.  Huge thanks are due to Travis for his work here, and also massive thanks to everyone who has reviewed his PRs and contributed to the releases.  The spec is looking unrecognisably better for it – and Matrix 1.0 is feeling closer than ever!

Alongside the work on the IS/AS/Push APIs, there has also been a massive attempt to plug all the spec omissions in the Client Server API.  Historically the CS API releases have missed some of the newer APIs (and of course always miss the ones which postdate a given release), but we’ve released the APIs which /have/ been specified as stable in order to declare them stable.  However, in this release we’ve tried to go through and fill in as many remaining gaps as possible.

The result is the release of Client Server API version 0.4. This is a huge update – increasing the size of the CS API by ~40%. The biggest new stuff includes fully formalising support for end-to-end encryption (thanks to Zil0!), versioning for rooms (so we can upgrade rooms to new versions of the protocol), synchronised read markers, user directories, server ACLs, MSISDN 3rd party ids, and .well-known server discovery (not that it’s widely used yet), but for the full picture, best bet is to look at the changelog (now managed by towncrier!).  It’s probably fair to say that the CS API is growing alarmingly large at this point – Chrome says that it’d be 223 A4 pages if printed. Our solution to this will be to refactor it somehow (and perhaps switch to a more compact representation of the contents).

Some things got deliberately missed from the CS 0.4 release: particularly membership Lazy Loading (because we’re still testing it out and haven’t released it properly in the wild yet), the various GDPR-specific APIs (because they may evolve a bit as we refine them since the original launch), finalising ID grammars in the overall spec (because this is surprisingly hard and subtle and we don’t want to rush it) and finally Communities (aka Groups), as they are still somewhat in flux.

Meanwhile, on the Server to Server API, there has also been a massive amount of work.  Since the beginning of July it’s tripled in size as we’ve filled in the gaps, over the course of >200 commits (>150 of which came from Travis).  If you take a look at the current snapshot it’s pretty unrecognisable from the historical draft, with the main changes being:

  • Adding the new State Resolution algorithm to address flaws in the original one.  This has been where much of our time has gone – see MSC1442 for full details.  Adopting the new algorithm requires rooms to be recreated; we’ll write more about this in the near future when we actually roll it out.
  • Adding room versioning so we can upgrade to the new State Resolution algorithm.
  • Everything is now properly expressed as Swagger (OpenAPI), just like the CS API
  • Adding all the details for E2E encryption (including dependencies like to-device messaging and device-list synchronisation)
  • Improvements in specifying how to authorize inbound events over federation
  • Document federation APIs such as /event_auth and /query_auth and /get_missing_events
  • Document 3rd party invites over federation
  • Document the /user/* federation endpoints
  • Document Server ACLs
  • Document read receipts over federation
  • Document presence over federation
  • Document typing notifications over federation
  • Document content repository over federation
  • Document room directory over federation
  • …and many many other minor bug fixes, omission fixes, and restructuring for coherency – see https://github.com/matrix-org/matrix-doc/issues/1464 for an even longer list :)

However, we haven’t finished it all: despite our best efforts we’re running slightly past the original target of Aug 31.  The current state of play for the r0 release overall (in terms of pending issues) is:

…and you can see the full breakdown over at the public Github project dashboard.

The main stuff we still have remaining on the Server/Server API at this point is:

  • Better specifying how we validate inbound events. See MSC1646 for details & progress.
  • Switching event IDs to be hashes. See MSC1640 for details and progress.
  • Various other remaining security considerations (e.g. how to handle malicious auth events in the DAG; how to better handle DoS situations).
  • Merging in the changes to authoring m.room.power_levels (as per MSC1304)
  • Formally specifying the remaining identifiers which lack a formal grammar – MSC1597 and particularly room aliases (MSC1608)

The plan here is to continue speccing and implementing these at top priority (with Travis continuing to work fulltime on spec work), and we’ll obviously keep you up-to-date on progress.  Some of the changes here (e.g. event IDs) are quite major and we definitely want to implement them before speccing them, so we’re just going to have to keep going as fast as we can. Needless to say we want to cut an r0 of the S2S API alongside the others asap and declare Matrix out of beta (at least at the spec level :)

In terms of visualising progress on this spec mission it’s interesting to look at the rate at which we’ve been closing PRs: this graph shows the total number of PRs which are in state ‘open’ or ‘closed’ on any given day:

…which clearly shows the original sprint to get the r0 of the CS API out the door at the end of 2015, and then a more leisurely pace until the beginning of July 2018, since when the pace has picked up massively.  Other ways of looking at it include the number of open issues…


…or indeed the number of commits per week…


…or the overall Github Project activity for August.  (It’s impressive to see Zil0 sneaking in there in second place on the commit count, thanks to all his GSoC work documenting E2E encryption in the spec as part of implementing it in matrix-python-sdk!)


Anyway, enough numerology.  It’s worth noting that all of the dev for r0 has generally followed the proposed Open Governance Model for Matrix, with the core spec team made up of both historical core team folk (erik, richvdh, dave & matthew), new core team folk (uhoreg & travis) and community folk (kitsune, anoa & mujx) working together to review and approve the changes – and we’ve been doing MSCs (albeit with an accelerated pace) for anything which we feel requires input from the wider community.  Once the Server/Server r0 release is out the door we’ll be finalising the open governance model and switching to a slightly more measured (but productive!) model of spec development as outlined there.

Meanwhile, Matrix 1.0 gets ever closer.  With (almost) all this spec mission done, our plan is to focus more on improving the reference implementations – particularly performance in Synapse, UX in matrix-{react,ios,android}-sdk as used by Riot (especially for E2E encryption), and then declare a 1.0 and get back to implementing new features (particularly Editable Messages and Reactions) at last.

We’d like to thank everyone for your patience whilst we’ve been playing catch up on the spec, and hope you agree it’s been worth the effort :)

Matthew & the core spec team.

GSOC: Implementing End-to-End Encryption in the Matrix Python SDK

Following on from the previous post, we have an update from zil0 on his GSoC project, which entailed implementing E2E support in the Matrix Python SDK.


The goal of my project is to implement Matrix’s end-to-end encryption protocol in Python, as part of matrix-python-sdk. My mentors are Richard van der Hoff (richvdh) and Hubert Chathi (uhoreg).

It was easy to get started on the project, since the simple parts came first (adding API calls), and then the whole process to follow is documented in an implementation guide, while there is also the reference implementation in JavaScript. And most importantly, the community is nice. :)

Some parts of the work consist in wrapping around the cryptographic primitives implemented in libolm (via Python bindings), in order to handle encrypted events. Others are less straightforward, such as tracking device lists of users, or finding the right way to persist keys and related data between startups.

An interesting aspect of this project is that I am working on a new part of the Python SDK, while also having to integrate with existing code, which is a cool balance between freedom and guidelines.

As the encryption documentation is a bit outdated and incomplete, one (fun) difficulty is to look for information across old issues, Gdocs and source code (and asking my mentor when in doubt). For anyone trying to implement E2E, it should be better by the end of the project, as I am currently working on documenting the missing bits.

I have had a great experience so far. Working on an open source project differs from my previous coding experiences, as people are actually going to use what I write! I have learnt to think about the best design from a usability point of view, discuss different approaches, and I had to write tests and document my code, which sadly is not something I do on personal projects. I enjoyed reviews, and the discussions they led to. And of course I have learnt quite an interesting lot about the E2E voodoo, along with some new Python tricks.

Currently, the implementation is in a working state. Some of the code is merged, and some is awaiting review. It is possible to try it here before everything is merged.
The project will be finished in about one week, after some tidying up and when I release device verification and key sharing, which should be the last missing features compared to Riot.

Dendrite Progress Update

As you may know, for the last few months anoa (Andrew) and APWhiteHat have been working on Dendrite, the next generation Matrix homeserver, written in Go. We asked for an update on their progress, and Andrew provided the blog post below. Serious progress has been made on Dendrite this summer!

 


Hey everyone, my name is Andrew Morgan and I’ve been working full-time over the summer on Dendrite, our next-generation Matrix homeserver. Over the last two months, I’ve seen the project transform from a somewhat functioning toy server to a near-production-ready homeserver that is working towards complete feature support. I’ve appreciated the thought put into the project since day one, and enjoy the elegance of the multi-component design. Documentation is fairly decent at the moment, and comments are plentiful throughout the codebase, while the code itself tends towards simple and maintainable rather than complex and unmanageable.

Application Service Integration

The main focus of my time here has been on the implementation of application service support for Dendrite. Application services are external programs that act as privileged extensions to a homeserver, allowing such functionality as bots in rooms and bridges to third-party networks. Supporting application services requires a few different bits and pieces to be set up. Currently all planned features have a PR for them, with the bold items already merged:

  • Sending events to application services
  • Support user masquerading for events
  • Support editing event timestamps
  • Support room alias querying
  • Support user ID querying
  • Support third party lookup proxying

As you can see a decent portion of the functionality is already in master! The rest will hopefully follow after some further back and forth.
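To give a flavour of what "sending events to application services" involves, here is a minimal Python sketch of how a homeserver might decide whether an application service is interested in an event. It is not Dendrite's code (Dendrite is written in Go). The registration contents, the function names and the choice to match each regex against the whole ID are assumptions for illustration; only the general shape of the `namespaces` section (lists of `exclusive`/`regex` entries under `users`, `aliases` and `rooms`) follows the Application Service API.

```python
import re

# Hypothetical registration for a bridge; the ID and regexes are made up.
REGISTRATION = {
    "id": "example-irc-bridge",
    "namespaces": {
        "users": [{"exclusive": True, "regex": r"@_irc_.*:example\.org"}],
        "aliases": [{"exclusive": True, "regex": r"#_irc_.*:example\.org"}],
        "rooms": [],
    },
}


def matches_namespace(entries, value):
    """True if value fully matches any regex in a namespace list."""
    return any(re.fullmatch(entry["regex"], value) for entry in entries)


def is_interested(registration, event, room_aliases=()):
    """Decide whether an event should be sent to this application service."""
    namespaces = registration["namespaces"]
    # Users: the sender, plus the target of a membership event.
    user_ids = [event["sender"]]
    if event.get("type") == "m.room.member" and "state_key" in event:
        user_ids.append(event["state_key"])
    if any(matches_namespace(namespaces.get("users", []), u) for u in user_ids):
        return True
    # Rooms: an explicit match on the room ID.
    if matches_namespace(namespaces.get("rooms", []), event["room_id"]):
        return True
    # Aliases: any alias currently pointing at the room.
    return any(
        matches_namespace(namespaces.get("aliases", []), a) for a in room_aliases
    )
```

In this sketch a message sent by `@_irc_alice:example.org` would be forwarded to the bridge, as would any event in a room aliased as `#_irc_channel:example.org`, while ordinary traffic between unrelated users would not.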

Google Summer of Code

I certainly haven’t been going at this all on my own. Alongside extensive help from Erik, who’s been mentoring me, our resident Google Summer of Code student, APWhiteHat, has been tackling feature after feature in Dendrite wherever he can find them. Application services received a good deal of help on the client-server endpoint authentication side; however, APWhiteHat has mostly been focusing on federation and some other very useful pieces. While his GSoC period still has a week or so before its conclusion, he has so far implemented:

  • Idempotency to roomserver event processing to prevent duplication
  • Username auto generation
  • Tokens library based on macaroons
  • Lots of left-over federation stuff: state API & get missing events being the major ones
  • AS support to clientapi auth
  • Typing server: handling of PUT /typing by clientapi
  • More typing server stuff on its way
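As an illustration of the first item, idempotent event processing can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the class and attribute names are hypothetical, state is kept in memory rather than in a database, and it says nothing about how Dendrite's roomserver actually implements the check.

```python
class EventProcessor:
    """Toy processor that ignores events it has already handled."""

    def __init__(self):
        self._seen_event_ids = set()  # IDs of events already processed
        self.timeline = []            # events accepted so far, in order

    def process(self, event):
        """Process an event once; return False if it was a duplicate."""
        event_id = event["event_id"]
        if event_id in self._seen_event_ids:
            # Re-delivery of the same event has no further effect.
            return False
        self._seen_event_ids.add(event_id)
        self.timeline.append(event)
        return True
```

Submitting the same event twice leaves a single copy in `timeline`, which is the property that prevents duplication when an input is retried.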

From my perspective, APWhiteHat was an excellent developer to work with. He asked good questions and was quick to answer any that I or the community had as well. His code reviews were also very comprehensive. I learned a lot from working with him and everyone else :)

OpenTracing and Prometheus Monitoring

Placing any large server into a production environment requires extensive monitoring capabilities in order to ensure operations are running smoothly. To that end, Dendrite has gained both OpenTracing and Prometheus support. Prometheus, also used heavily in Synapse, allows a homeserver operator to track a wide range of data including endpoint usage, resource management and user statistics over any given range of time.

In Dendrite, we are taking this one step further by introducing OpenTracing, a language- and platform-agnostic framework for tracking the journey of an endpoint call from incoming request to outgoing response, with every method, hierarchy change and database call in between. It will be immensely useful in tracking down performance issues, as well as providing insight into the most critical paths throughout the codebase and where we should focus most of our optimization efforts. It also comes with a lovely dashboard courtesy of Jaeger.
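The idea of following one request through nested steps can be sketched with a toy tracer in Python. This is deliberately not the OpenTracing API and not Dendrite's Go code: the `Tracer` class, the span names and the handler are all hypothetical, and a real tracer would also export the spans to a backend such as Jaeger.

```python
import time
from contextlib import contextmanager


class Tracer:
    """Toy tracer that records nested, timed spans for one request."""

    def __init__(self):
        self.finished = []  # (name, parent_name, duration_seconds)
        self._stack = []    # names of the spans that are currently open

    @contextmanager
    def span(self, name):
        parent = self._stack[-1] if self._stack else None
        self._stack.append(name)
        start = time.perf_counter()
        try:
            yield
        finally:
            duration = time.perf_counter() - start
            self._stack.pop()
            self.finished.append((name, parent, duration))


def handle_sync_request(tracer):
    # Hypothetical endpoint handler with two traced sub-steps.
    with tracer.span("GET /sync"):
        with tracer.span("db.load_events"):
            pass  # a database call would go here
        with tracer.span("build_response"):
            pass  # response serialisation would go here


tracer = Tracer()
handle_sync_request(tracer)
for name, parent, duration in tracer.finished:
    print(f"{name} (parent={parent}) took {duration:.6f}s")
```

Each span records its parent, so the finished list can be reassembled into the tree of calls behind a single request, which is what makes it possible to see where the time went.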

Community

We’ve also seen some encouraging interest and development work from the community in the past couple months. While PR review from our own side is admittedly slow due to our focus on getting the foundational work in place, that hasn’t stopped both old and new developers from sending in PRs and performing code reviews. A huge thank you to everyone involved! From this we’ve gotten API implementations and application service fixes from @turt2live, an end-to-end encryption implementation from @fadeAce, filtering support from @CromFr, and some PRs and numerous helpful review comments from @krombel.

We’ve also started to see some people running Dendrite in live environments, which is incredibly exciting for us to see! While Dendrite is not considered production-ready yet (though it moves closer every day), if you are interested in giving it a go please consult the quickstart installation guide. We look forward to any feedback you may have!

Synapse 0.33.0 is here!!

Hi all,

We’ve just released Synapse 0.33.0!  This is a major performance upgrade which speeds up /sync (i.e. receiving messages) by a factor of almost 2x!  This has already made a massive difference to the CPU usage and snappiness of the matrix.org homeserver since we rolled it out a few days ago – you can see the drop in sync worker CPU just before midday on July 17th; previously we were regularly hitting the CPU ceiling (at which point everything grinds to a halt) – now we’re back down hovering between 40% and 60% CPU (at the current load).  This is actually fixing a bug which crept in around Synapse 0.31, so please upgrade – especially if Synapse has been feeling slower than usual recently, and especially if you are still on Synapse 0.31.

Meanwhile we have a lot of new stuff coming on the horizon – a whole new algorithm for state resolution (watch this space for details); incremental state resolution (at last!) to massively speed up state resolution and mitigate extremities build up (and speed up the synapse master process, which is now the bottleneck again on the matrix.org homeserver); better admin tools for managing resource usage, and all the Python3 porting work (with associated speedups and RAM & GC improvements).  Fun times ahead!

The full changelog follows below; as always you can grab Synapse from https://github.com/matrix-org/synapse.   Thanks for flying Matrix!

Synapse 0.33.0 (2018-07-19)

Bugfixes

  • Disable a noisy warning about logcontexts. (#3561)

Synapse 0.33.0rc1 (2018-07-18)

Features

  • Enforce the specified API for report_event. (#3316)
  • Include CPU time from database threads in request/block metrics. (#3496, #3501)
  • Add CPU metrics for _fetch_event_list. (#3497)
  • Optimisation to make handling incoming federation requests more efficient. (#3541)

Bugfixes

  • Fix a significant performance regression in /sync. (#3505, #3521, #3530, #3544)
  • Use more portable syntax in our use of the attrs package, widening the supported versions. (#3498)
  • Fix queued federation requests being processed in the wrong order. (#3533)
  • Ensure that erasure requests are correctly honoured for publicly accessible rooms when accessed over federation. (#3546)

Misc

Security update: Synapse 0.31.2

Hi all,

On Monday (2018-06-11) we had an incident where #matrix:matrix.org was hijacked by a malicious user pretending to join the room immediately after its creation in 2014 and then setting an m.room.power_levels event ‘before’ the correct initial power_level for the room.

Under normal circumstances this should be impossible because the initial m.room.power_levels for a room should be set before its m.room.join_rules event, meaning users who join the room are subject to its power levels. However, back before we’d even released Synapse, the first two rooms ever created in Matrix (#test:matrix.org and #matrix:matrix.org) were manually created and set the join_rules before the power_levels event, letting users join before the room’s power_levels were defined, and so were vulnerable to this attack. We’ve since re-created #matrix:matrix.org – please re-/join the room if you haven’t already!

As a defensive measure, we are releasing a security update of Synapse (0.31.2) today which changes the rules used to authenticate power_level events, such that we fail-safe rather than fail-deadly if the existing auth mechanisms fail. In practice this means changing the default power level required to set state to be 50 rather than 0 if there is no power_levels event present, thus meaning that only the room creator can set the initial power_levels event.
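The effect of the change can be illustrated with a small Python sketch. This is a simplified model rather than Synapse's actual auth code: the function names are hypothetical, per-event-type overrides are ignored, and the rule that the room creator defaults to power level 100 (and everyone else to 0) when no power_levels event exists is an assumption taken from the Matrix spec rather than from this post.

```python
# Hypothetical, simplified model of the check; not Synapse's real code.
CREATOR_DEFAULT_LEVEL = 100  # assumption: creator's level with no power_levels event
OTHER_DEFAULT_LEVEL = 0      # assumption: everyone else's level in that case


def user_power_level(user_id, creator_id, power_levels):
    """Return the sender's power level, falling back to defaults."""
    if power_levels is None:
        if user_id == creator_id:
            return CREATOR_DEFAULT_LEVEL
        return OTHER_DEFAULT_LEVEL
    users = power_levels.get("users", {})
    return users.get(user_id, power_levels.get("users_default", 0))


def required_state_level(power_levels, fail_safe=True):
    """Level needed to send a state event.

    With no power_levels event in the room, the old rule required 0
    (so anyone could set state); the fail-safe rule requires 50.
    """
    if power_levels is None:
        return 50 if fail_safe else 0
    return power_levels.get("state_default", 50)


def may_send_state(user_id, creator_id, power_levels, fail_safe=True):
    """True if the user is allowed to send a state event."""
    level = user_power_level(user_id, creator_id, power_levels)
    return level >= required_state_level(power_levels, fail_safe)
```

Under the old rule, `may_send_state("@mallory:example.org", "@creator:example.org", None, fail_safe=False)` is allowed, which is the hole described above. With the fail-safe default the same call is refused, while the creator can still set the initial power_levels event.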

We are not aware of anyone abusing this (other than the old #matrix:matrix.org room) but we’d rather be safe than sorry, so would recommend that everyone upgrade as soon as possible.

This of course constitutes a change to the spec, so full technical details and ongoing discussion around the Matrix Spec Change proposal can be followed over at MSC1304.

EDIT: if you are aware of your server participating in rooms whose first power_levels event is deliberately set by a different user to their creator, please let us know asap (and don’t upgrade!)

This work is all part of a general push to finalise and harden and fully specify the Server-Server API as we push towards a long-awaited stable release of Matrix!

As always, you can get the new update from https://github.com/matrix-org/synapse/releases/tag/v0.31.2 or from any of the sources mentioned at https://github.com/matrix-org/synapse.

thanks, and apologies for the inconvenience.

Changes in synapse v0.31.2 (2018-06-14)

SECURITY UPDATE: Prevent unauthorised users from setting state events in a room when there is no m.room.power_levels event in force in the room. (PR #3397)

Discussion around the Matrix Spec change proposal for this change can be followed at https://github.com/matrix-org/matrix-doc/issues/1304.