Matthew Hodgson


Dendrite 0.3.0 released

16.11.2020 17:44 — Releases Matthew Hodgson

Hi all,

Heads up that we just cut another beta release of Dendrite - now at 0.3.0!

This is a really fun release given that almost all the changes were contributed by the wider community - so huge thanks to S7evinK, MayeulC and felix!

The main new feature is full Read Receipt support thanks to S7evinK, which makes an enormous perceptual improvement when using Dendrite - so especial thanks are due there :)

So, if you're interested in helping us test, please spin up a copy from https://github.com/matrix-org/dendrite and let us know how it goes - and if you're already running one, now is an excellent time to upgrade!

Full changelog (including 0.2.1, which we forgot to blog about) follows:

Dendrite 0.3.0 (2020-11-16)

Features

  • Read receipts (both inbound and outbound) are now supported (contributed by S7evinK)
  • Forgetting rooms is now supported (contributed by S7evinK)
  • The -version command line flag has been added (contributed by S7evinK)

Fixes

  • User accounts that contain the = character can now be registered
  • Backfilling should now work properly on rooms with world-readable history visibility (contributed by MayeulC)
  • The gjson dependency has been updated for correct JSON integer ranges
  • Some more client event fields have been marked as omit-when-empty (contributed by S7evinK)
  • The build.sh script has been updated to work properly on all POSIX platforms (contributed by felix)

Dendrite 0.2.1 (2020-10-22)

Fixes

  • Forward extremities are now calculated using only references from other extremities, rather than including outliers, which should fix cases where state can become corrupted (#1556)
  • Old state events will no longer be processed by the sync API as new, which should fix some cases where clients incorrectly believe they have joined or left rooms (#1548)
  • More SQLite database locking issues have been resolved in the latest events updater (#1554)
  • Internal HTTP API calls are now made using H2C (HTTP/2) in polylith mode, mitigating some potential head-of-line blocking issues (#1541)
  • Roomserver output events no longer incorrectly flag state rewrites (#1557)
  • Notification levels are now parsed correctly in power level events (gomatrixserverlib#228, contributed by Pestdoktor)
  • Invalid UTF-8 is now correctly rejected when making federation requests (gomatrixserverlib#229, contributed by Pestdoktor)

How we fixed Synapse's scalability!

03.11.2020 00:00 — Releases Matthew Hodgson

Hi all,

We had a major breakthrough in Synapse 1.22 which we want to talk about in more detail: Synapse now scales horizontally across multiple Python processes.

Horizontal scaling means that you can support more users and traffic by adding in more python processes (spread over more machines, if necessary) without there being a single bottleneck which all the traffic is passing through - as opposed to vertical scaling where you make things go faster overall by making the bottleneck go faster.

After many years of having to vertically scale Synapse (by trying to make the main process go faster) we’re now finally at the point where you can configure Synapse so that messages no longer flow through the main process - eliminating the bottleneck entirely. What’s more, the Matrix.org homeserver has now been successfully running in this config and enjoying the massive scalability improvements for the last 2 weeks! Huge kudos goes to Erik and the wider Synapse team for pulling this off.

Some readers might wonder how this ties in with Dendrite entering beta, given one of Dendrite’s design goals is full horizontal scalability. The answer is that we’re very much using Dendrite for experimentation and next-gen stuff at the moment (currently focused more on scaling downwards for P2P rather than scaling upwards for megaservers) - while Synapse is the stable and long-term supported option.

So, that’s the context - now over to Erik with more than you could possibly ever want to know about how we actually did it...

Background

Synapse started life back in 2014 as a single monolithic Python process, and for quite a while we made it scale by adding more and more in-memory caches to speed things up by avoiding hitting the database (at the expense of RAM). It looked like this:

Eventually the caches stopped helping and we needed more than one thread of execution in order to spread CPU across multiple cores. Python’s Global Interpreter Lock (GIL) means that a Python process can effectively only use one CPU core at a time, so starting more threads doesn’t help with scalability - you have to run multiple processes.

Now, the vast majority of the work that Synapse does is related to “streams”. These are append-only sequences of rows, such as the events stream, typing stream, receipts stream, etc. When a new event arrives (for example) we write it to the events stream, and then notify anything waiting that there has been an update. The /sync endpoint, for instance, will wait for updates to streams and send them down to long-polling Matrix clients.

Streams support being added to concurrently, so have a concept of the “persisted up-to position”. This is the point up to which all earlier rows have finished persisting. Readers only read up to the current “persisted up-to position”, so that they don’t skip over updates that haven’t finished persisting yet. (E.g. if two events A and B get assigned positions 5 and 6, but B finishes persisting first, then the persisted up-to position remains at 4 until A finishes persisting, and then it jumps to 6.)
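
To make this concrete, here’s a minimal sketch (purely illustrative, not Synapse’s actual code) of a stream whose persisted up-to position only advances once every earlier position has finished persisting, reproducing the A/B example above:

```python
class Stream:
    """Toy stream: positions are allocated in order, but rows may finish
    persisting out of order, so the persisted up-to position trails behind."""

    def __init__(self, start=4):
        self._next_pos = start
        self._persisted_upto = start
        self._in_flight = set()      # allocated but not yet persisted

    def allocate_position(self):
        self._next_pos += 1
        self._in_flight.add(self._next_pos)
        return self._next_pos

    def mark_persisted(self, pos):
        self._in_flight.discard(pos)
        # Advance over any contiguous run of completed positions.
        while (self._persisted_upto + 1 <= self._next_pos
               and self._persisted_upto + 1 not in self._in_flight):
            self._persisted_upto += 1


stream = Stream(start=4)
a = stream.allocate_position()       # event A gets position 5
b = stream.allocate_position()       # event B gets position 6
stream.mark_persisted(b)             # B finishes first...
assert stream._persisted_upto == 4   # ...but readers still stop at 4
stream.mark_persisted(a)             # once A finishes...
assert stream._persisted_upto == 6   # ...the position jumps straight to 6
```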

To split any meaningful amount of work into separate processes, we need a mechanism by which processes can be told that updates to streams have happened (otherwise they’d have to repeatedly poll the DB, which would be deeply inefficient). The architecture we ended up with has a “main” process that streams updates via a custom replication protocol (initially long-polling HTTP; later custom TCP) to any number of “worker” processes. This meant that we could move sync stream handling (and other read APIs) off the main process and onto workers, but also that all database writes had to go through the single main process (as it was a star topology, the main process could talk to all workers but workers could only talk to the main process and not each other).

2020-11-03-synapse2.png

As an aside: cache invalidations also had to be streamed down the replication connections, which has the side effect that we could only cache things that would only be invalidated on the main process.

We continued to move more and more read APIs out onto separate workers. We also added workers in front of the main process that would, for example, handle the creation of new events, authentication, etc., and then call out to the main process for it to persist the event.

Moving writes off the main process

Eventually we ran out of stuff to move out of the main process that didn’t involve writing to the DB. To write stuff from other processes we needed a way for the workers to stream updates to each other. The easiest and most obvious way was to just use Redis and its pub/sub support.

2020-11-03-synapse3.png

This almost allowed us to move writing of a particular stream to a different worker, except writing to streams generally also meant invalidating caches which in itself requires writing to a stream. We needed a way of writing to the cache invalidation stream from multiple workers at once.

Sharding the cache invalidation thankfully turned out to be easy, as workers simply call the cache invalidation function whenever they receive an invalidation notice over replication. In particular, the ordering of invalidations from different workers doesn’t matter, and so there isn’t a need to calculate a single “persisted up-to position”. Sharding then just becomes a case of adding the name of the worker that is writing the update to the replication stream, so that workers reading from it can treat the cache stream as if it were multiple streams, one per worker.
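
As a rough sketch (not Synapse’s implementation), a reader of the sharded cache invalidation stream can apply invalidations as soon as they arrive and simply track the last position it has seen from each writer, as if there were one stream per worker:

```python
from collections import defaultdict

class CacheInvalidationReader:
    def __init__(self, caches):
        self._caches = caches                # cache name -> dict used as a cache
        self._positions = defaultdict(int)   # writer name -> last position seen

    def on_replication_row(self, writer, position, cache_name, key):
        # Ordering across writers doesn't matter: just invalidate immediately.
        self._caches[cache_name].pop(key, None)
        # Track positions per writer, as if this were one stream per worker.
        self._positions[writer] = max(self._positions[writer], position)


caches = {"get_user_by_id": {"@alice:example.org": {"displayname": "Alice"}}}
reader = CacheInvalidationReader(caches)
reader.on_replication_row("worker1", 7, "get_user_by_id", "@alice:example.org")
assert "@alice:example.org" not in caches["get_user_by_id"]
```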

This then unlocks the ability to move writing of streams off the main process and onto different workers - and so we added the “event persister” worker for offloading the main event stream off the main process:

2020-11-03-synapse4.png

Sharding the events stream

Eventually the worker responsible for doing nothing but persisting events started maxing out CPU. This meant that we had to look at sharding the events stream, i.e. writing to it from multiple workers.

This is more complicated than sharding the cache invalidation stream as the ordering of the events does matter; we send them down sync streams, in order, with a token that indicates where the sync stream is up to in the events stream. This means that workers need to be able to calculate a “persisted up-to position” when getting updates from different workers.

The easiest way of doing that is to simply set the persisted up-to position to the minimum position received over replication from all active writers. This works, except that events would only be processed after all other writers have subsequently written events (to advance the persisted position past the point at which the event was written), which can add a lot of latency depending on how often events are written.

A refinement is to note that if you have a persisted up-to position of 10, then receive updates at sequential positions 11, 12, 13 and 14, you know that everything between 10 and 14 has finished persisting (as you received updates about them), and so can set the persisted up-to position to 14. Annoyingly, it’s not required that positions are sequential without gaps (due to various technical considerations), and so in the worst case this still has the same problems as the naïve solution.

To avoid these problems we changed the persisted up-to position to be a vector clock: a vector of positions, one per writer. This still allows answering the query “get all events after token X” (as events are written with their position and the name of the writer). The persisted up-to position is then calculated by simply tracking the last position seen to arrive over replication from each writer.
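
A minimal sketch of the idea (not Synapse’s implementation): keep the last position seen from each writer as a vector, and derive an integer persisted up-to position as the minimum across writers whenever one is needed:

```python
class MultiWriterPosition:
    """Toy multi-writer stream position: a vector of per-writer positions."""

    def __init__(self, writers):
        self._last_seen = {writer: 0 for writer in writers}

    def advance(self, writer, position):
        self._last_seen[writer] = max(self._last_seen[writer], position)

    def vector(self):
        return dict(self._last_seen)

    def persisted_upto(self):
        # Everything at or below the slowest writer's position is safe to read.
        return min(self._last_seen.values())


pos = MultiWriterPosition(["persister1", "persister2"])
pos.advance("persister1", 14)
pos.advance("persister2", 11)
assert pos.vector() == {"persister1": 14, "persister2": 11}
assert pos.persisted_upto() == 11
```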

This allows writing events from multiple workers, while ensuring that other workers can correctly keep track of a “persisted up-to position”. Then it's just a matter of inspecting the code to ensure that it does not assume that it is the only writer to the stream. In the case of writing to the events stream, we note that the function persisting events assumes it's the only writer for a given room, so when sharding we have to ensure that there are no concurrent writes to the same room. This is most easily done by sharding based on room ID, and ensuring that the mapping of room ID to worker does not change (without coordination).
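
For illustration, the room-to-worker mapping could be as simple as a stable hash of the room ID over a fixed list of event persisters. This is a sketch rather than Synapse’s actual configuration scheme, and the worker names are made up:

```python
import hashlib

def persister_for_room(room_id, persisters):
    """Deterministically map a room ID to one of a fixed list of event
    persisters. The list must not change without coordination, otherwise two
    workers could end up writing events for the same room concurrently."""
    digest = hashlib.sha256(room_id.encode("utf-8")).digest()
    return persisters[int.from_bytes(digest[:8], "big") % len(persisters)]


persisters = ["event_persister1", "event_persister2"]   # hypothetical names
print(persister_for_room("!someroom:matrix.org", persisters))
```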

The only thing left is then to encode the vector clock position into the sync tokens. We want to ensure that these tokens are not too long, as they get included as query string parameters (e.g. the since= parameter of /sync). By assigning persistent unique integer IDs to workers, the vector clock can be persisted as a sequence of pairs of integers, which is relatively few bytes so long as we don’t have too many workers writing to the events stream. We can further reduce the size of the tokens by calculating an integer “persisted up-to position” as we did before, encoding that, and only including positions for workers that are ahead of the integer persisted up-to position. (The idea here is that most of the time only a small number of workers will be ahead of the calculated persisted up-to position, and so we only need to encode those.)
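
Here’s a sketch of that encoding idea. The prefix and separators below are invented for illustration and are not the real Synapse token format; the point is simply an integer floor plus explicit writer/position pairs only for the writers that are ahead of it:

```python
# Illustrative only: not the real Synapse token format. Writers are assumed to
# have already been assigned small persistent integer IDs.

def encode_token(positions):
    floor = min(positions.values())
    ahead = {w: p for w, p in positions.items() if p > floor}
    parts = [str(floor)] + [f"{w}.{p}" for w, p in sorted(ahead.items())]
    return "m" + "~".join(parts)

def decode_token(token):
    parts = token[1:].split("~")
    floor = int(parts[0])
    ahead = {int(w): int(p) for w, p in (part.split(".") for part in parts[1:])}
    return floor, ahead


# Three writers at positions 10, 14 and 10: only writer 2 needs to be encoded.
token = encode_token({1: 10, 2: 14, 3: 10})
print(token)                  # m10~2.14
print(decode_token(token))    # (10, {2: 14})
```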

And this is what we have today:

2020-11-03-synapse5.png

The major limitation of the current situation is that you can’t dynamically add or remove workers which persist events, as the sharding by room ID is calculated at startup, and so changing it requires restarting the whole system. This could be replaced by any system that allows coordination over which persister is allowed to write to a room at any given point. That is likely tricky to get right in practice, but it would allow dynamic auto-scaling of deployments, or automatic recovery from a worker that gets wedged or dies.

Finally, it’s worth noting that sharding event persisters isn’t the only performance work that’s been going on - switching everything over to Python 3 and async Twisted has helped, along with lots of smaller optimisations on the hot paths, and further rebalancing of workers (e.g. moving background jobs off the main process to dedicated workers). We’ve also benefited a lot from the maintainability of rolling out mypy typing throughout the codebase. And next up, we’ll be going back to speeding up the codebase as a whole - starting with algorithmic state resolution improvements! 🎉

Performance

So, how does it stack up?

Here’s the send time heatmap on Matrix.org showing the step change on Oct 16th when we rolled out the second event persister (full disclosure: this also coincides with moving background processes off the main Synapse process to a background worker). As you can see, we go from messages being spread over a huge range of durations (up to several seconds) to the sweet spot being 50ms or less - a spectacular improvement!

2020-11-03-synapse-heatmap.png

Meanwhile, here’s the actual CPU utilisation as we split the traffic from a single event persister (yellow) to two persisters (one yellow, one blue), showing the sharding beautifully horizontally balancing CPU between the two active/active worker processes:

2020-11-03-synapse-cpu.png

We’ve yet to loadtest to see just how fast we can go now (before we start hitting bottlenecks on the postgres cluster), but it sure feels good to have all our CPU headroom back on Matrix.org again, ready for the next wave of users to arrive.

Conclusion

So there you have it: folks running massive homeservers (50K+ concurrent users) like Matrix.org (and cough various high profile public sector deployments) are no longer held hostage by the bottleneck of the main synapse process and should feel free to experiment with setting up event persister workers to handle high traffic loads. Otherwise, if you can spread your users over smaller servers, that’s also a good bet (assuming they don’t have massively overlapping room membership, like we see on Matrix.org.)

The current worker documentation is up-to-date, although it does assume you are already very familiar with how to administer Synapse. It’s also very much subject to change, as we keep adding new workers and improving the architecture. However, now is a pretty good time to get involved if you’re interested in large-scale Matrix deployments.

-- The Synapse Team

Dendrite 0.2.0 released

20.10.2020 19:35 — Releases Matthew Hodgson

Hi all,

It's been over a week since our next-generation homeserver Dendrite entered beta, and it's been a wild rollercoaster ride as the team has been frantically zapping all the initial teething issues that came up - mostly around room federation getting 'stuck' due to needing to fix bugs in how room state is managed. Huge huge thanks to everyone who has spun up a Dendrite to experiment and report bugs!

We're now in an impressively better place, and it's feeling way more stable now (but please don't trust it with your data yet). So we've skipped 0.1.x and jumped straight to 0.2.0.

Now would be a great time for more intrepid explorers to try spinning up a server from https://github.com/matrix-org/dendrite and see how it feels - the more feedback the better. And if you got scared off by weird bugs in 0.1.0, now's the right time to try it again!

Full changelog follows:

Dendrite 0.2.0 (2020-10-20)

Important

  • This release makes breaking changes for polylith deployments, since they now use the multi-personality binary rather than separate binary files
    • Users of polylith deployments should revise their setups to use the new binary - see the Features section below
  • This release also makes breaking changes for Docker deployments, as we are now publishing images to Docker Hub in separate repositories for monolith and polylith
    • New repositories are as follows: matrixdotorg/dendrite-monolith and matrixdotorg/dendrite-polylith
    • The new latest tag will be updated with the latest release, and new versioned tags, e.g. v0.2.0, will preserve specific release versions
    • Sample Compose configs have been updated - if you are running a Docker deployment, please review the changes
    • Images for the client API proxy and federation API proxy are no longer provided as they are unsupported - please use nginx (or another reverse proxy) instead

Features

  • Dendrite polylith deployments now use a special multi-personality binary, rather than separate binaries
    • This is cleaner, builds faster and simplifies deployment
    • The first command line argument states the component to run, e.g. ./dendrite-polylith-multi roomserver
  • Database migrations are now run at startup
  • Invalid UTF-8 in requests is now rejected (contributed by Pestdoktor)
  • Fully read markers are now implemented in the client API (contributed by Lesterpig)
  • Missing auth events are now retrieved from other servers in the room, rather than just the event origin
  • m.room.create events are now validated properly when processing a /send_join response
  • The roomserver now implements KindOld for handling historic events without them becoming forward extremity candidates, i.e. for backfilled or missing events

Fixes

  • State resolution v2 performance has been improved dramatically when dealing with large state sets
  • The roomserver no longer processes outlier events if they are already known
  • A SQLite locking issue in the previous events updater has been fixed
  • The client API /state endpoint now correctly returns state after the leave event, if the user has left the room
  • The client API /createRoom endpoint now sends cumulative state to the roomserver for the initial room events
  • The federation API /send endpoint now correctly requests the entire room state from the roomserver when needed
  • Some internal HTTP API paths have been fixed in the user API (contributed by S7evinK)
  • A race condition in the rate limiting code resulting in concurrent map writes has been fixed
  • Each component now correctly starts a consumer/producer connection in monolith mode (when using Kafka)
  • State resolution is no longer run for single trusted state snapshots that have been verified before
  • A crash when rolling back the transaction in the latest events updater has been fixed
  • Typing events are now ignored when the sender domain does not match the origin server
  • Duplicate redaction entries no longer result in database errors
  • Recursion has been removed from the code path for retrieving missing events
  • QueryMissingAuthPrevEvents now returns events that have no associated state as if they are missing
  • Signing key fetchers no longer ignore keys for the local domain, if retrieving a key that is not known in the local config
  • Federation timeouts have been adjusted so we don't give up on remote requests so quickly
  • create-account no longer relies on the device database (contributed by ThatNerdyPikachu)

Known issues

  • Old events can incorrectly appear in /sync as if they are new when retrieving missing events from federated servers, causing them to appear at the bottom of the timeline in clients
  • Memory can explode when catching up after a federation outage.

Combating abuse in Matrix - without backdoors.

19.10.2020 00:00 — General Matthew Hodgson

UPDATE: Nov 9th 2020

Not only are UK/US/AU/NZ/CA/IN/JP considering mandating backdoors, but it turns out that the Council of the European Union is working on it too, having created an advanced Draft Council Resolution on Encryption as of Nov 6th, which could be approved by the Council as early as Nov 25th. This doesn't directly translate into EU legislation, but would set the direction for subsequent EU policy.

Even though the Draft Council Resolution does not explicitly call for backdoors, the language used...

Competent authorities must be able to access data in a lawful and targeted manner

...makes it quite clear that they are seeking the ability to break encryption on demand: i.e. a backdoor.

Please help us spread the word that backdoors are fundamentally flawed - read on for the rationale, and an alternative approach to combating online abuse.


Hi all,

Last Sunday (Oct 11th 2020), the UK Government published an international statement on end-to-end encryption and public safety, co-signed by representatives from the US, Australia, New Zealand, Canada, India and Japan. The statement is well written and well worth a read in full, but the central point is this:

We call on technology companies to [...] enable law enforcement access to content in a readable and usable format where an authorisation is lawfully issued, is necessary and proportionate, and is subject to strong safeguards and oversight.

In other words, this is an explicit request from seven of the biggest governments in the world to mandate a backdoor in end-to-end encrypted (E2EE) communication services: a backdoor to which the authorities have a secret key, letting them view communication on demand. This is big news, and is of direct relevance to Matrix as an end-to-end encrypted communication protocol whose core team is currently centred in the UK.

Now, we sympathise with the authorities’ predicament here: we utterly abhor child abuse, terrorism, fascism and similar - and we did not build Matrix to enable it. However, trying to mitigate abuse with backdoors is, unfortunately, fundamentally flawed.

  • Backdoors necessarily introduce a fatal weak point into encryption for everyone, which then becomes the ultimate high value target for attackers. Anyone who can determine the secret needed to break the encryption will gain full access, and you can be absolutely sure the backdoor key will leak - whether that’s via intrusion, social engineering, brute-force attacks, or accident. And even if you unilaterally trust your current government to be responsible with the keys to the backdoor, is it wise to unilaterally trust their successors? Computer security is only ever a matter of degree, and the only safe way to keep a secret like this safe is for it not to exist in the first place.

  • End-to-end encryption is nowadays a completely ubiquitous technology; an attempt to legislate against it is like trying to turn back the tide or declare a branch of mathematics illegal. Even if Matrix did compromise its encryption, users could easily use any number of other approaches to additionally secure their conversations - from PGP, to OTR, to using one-time pads, to sharing content in password-protected ZIP files. Or they could just switch to a E2EE chat system operating from a jurisdiction without backdoors.

  • Governments protect their own data using end-to-end encryption, precisely because they do not want other governments being able to snoop on them. So not only is it hypocritical for governments to argue for backdoors, it immediately puts their own governmental data at risk of being compromised. Moreover, creating infrastructure for backdoors sets an incredibly bad precedent to the rest of the world - where less salubrious governments will inevitably use the same technology to the massive detriment of their citizens’ human rights.

  • Finally, in Matrix’s specific case: Matrix is an encrypted decentralised open network powered by open source software, where anyone can run a server. Even if the Matrix core team were obligated to add a backdoor, this would be visible to the wider world - and there would be no way to make the wider network adopt it. It would just damage the credibility of the core team, push encryption development to other countries, and the wider network would move on irrespectively.

In short, we need to keep E2EE as it is so that it benefits the 99.9% of people who are good actors. If we enforce backdoors and undermine it, then the bad 0.1% will simply switch to non-backdoored systems while the 99.9% are left vulnerable.

We’re not alone in thinking this either: the GDPR (the world-leading regulation towards data protection and privacy) explicitly calls out robust encryption as a necessary information security measure. In fact, the risk of US governmental backdoors explicitly caused the European Court of Justice to invalidate the Privacy Shield for EU->US data. The position of the seven governments here (alongside recent communications by the EU commissioner on the ‘problem’ of encryption) is a significant step back on the protection of the fundamental right of privacy.

So, how do we solve this predicament for Matrix?

Thankfully: there is another way.

This statement from the seven governments aims to protect the general public from bad actors, but it clearly undermines the good ones. What we really need is something that empowers users and administrators to identify and protect themselves from bad actors, without undermining privacy.

What if we had a standard way to let users themselves build up and share their own views of whether other users, messages, rooms, servers etc. are obnoxious or not? What if you could visualise and choose which filters to apply to your view of Matrix?

Just like the Web, Email or the Internet as a whole, there is literally no way to unilaterally censor or block content in Matrix. But what we can do is provide first-class infrastructure to let users (and room/community moderators and server admins) make up their own mind about who to trust, and what content to allow. This would also provide a means for authorities to publish reputation data about illegal content, providing a privacy-respecting mechanism that admins/mods/users can use to keep illegal content away from their servers/clients.

The model we currently have in mind is:

  • Anyone can gather reputation data about Matrix rooms / users / servers / communities / content, and publish it to as wide or narrow an audience as they like - providing their subjective score on whether something in Matrix is positive or negative in a given context.
  • This reputation data is published in a privacy-preserving fashion - i.e. you can look up reputation data if you know the ID being queried, but the data is stored pseudonymised (e.g. indexed by a hashed ID). A sketch of this hashed lookup (and of blending feeds) follows the list.
  • Anyone can subscribe to reputation feeds and blend them together in order to inform how they filter their content. The feeds might be their own data, or from their friends, or from trusted sources (e.g. a fact-checking company). Their blended feed can be republished as their own.
  • To prevent users getting trapped in a factional filter bubble of their own devising, we’ll provide UI to visualise and warn about the extent of their filtering - and make it easy and fun to shift their viewpoint as needed.
  • Admins running servers in particular jurisdictions then have the option to enforce whatever rules they need on their servers (e.g. they might want to subscribe to reputation feeds from a trusted source such as the IWF, identifying child sexual abuse content, and use it to block it from their server).
  • This isn’t just about combating abuse - but the same system can also be used to empower users to filter out spam, propaganda, unwanted NSFW content, etc on their own terms.
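
To make the shape of this concrete, here’s a minimal sketch (purely illustrative; MSC2313 itself currently only covers binary ban rules) of a pseudonymised, blendable reputation lookup. The feed contents and weights below are made-up example data:

```python
import hashlib

def hashed_id(entity_id):
    # Pseudonymise the ID: you can only look up IDs you already know about.
    return hashlib.sha256(entity_id.encode("utf-8")).hexdigest()

# Each feed maps hashed IDs to a score in [-1.0, +1.0] (negative = bad).
trusted_feed = {hashed_id("!abuse:example.org"): -1.0}
friends_feed = {hashed_id("!abuse:example.org"): -0.6,
                hashed_id("!lovely:example.org"): +0.9}

# Subscriptions blend several feeds, each with a weight chosen by the user.
subscriptions = [(trusted_feed, 1.0), (friends_feed, 0.5)]

def blended_score(entity_id):
    key = hashed_id(entity_id)
    scored = [(feed[key], weight) for feed, weight in subscriptions if key in feed]
    if not scored:
        return 0.0
    return sum(s * w for s, w in scored) / sum(w for _, w in scored)

print(blended_score("!abuse:example.org"))    # strongly negative: filter it out
print(blended_score("!unknown:example.org"))  # 0.0: no opinion either way
```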

This forms a relative reputation system. As uncomfortable as it may be, one man’s terrorist is another man’s freedom fighter, and different jurisdictions have different laws - and it’s not up to the Matrix.org Foundation to play God and adjudicate. Each user/moderator/admin should be free to make up their own mind and decide which reputation feeds to align themselves with. That is not to say that this system would help users locate extreme content - the privacy-preserving nature of the reputation data means that it’s only useful to filter out material which would otherwise already be visible to you - not to locate new content.

In terms of how this interacts with end-to-end-encryption and mitigating abuse: the reality is that the vast majority of abuse in public networks like Matrix, the Web or Email is visible from the public unencrypted domain. Abusive communities generally want to attract/recruit/groom users - and that means providing a public front door, which would be flagged by a reputation system such as the one proposed above. Meanwhile, communities which are entirely private and entirely encrypted typically still have touch-points with the rest of the world - and even then, the chances are extremely high that they will avoid any hypothetical backdoored servers. In short, investigating such communities requires traditional infiltration and surveillance by the authorities rather than an ineffective backdoor.

Now, this approach may sound completely sci-fi and implausibly overambitious (our speciality!) - but we’ve actually started successfully building this already, having been refining the idea over the last few years. MSC2313 is a first cut at the idea of publishing and subscribing to reputation data - starting off with simple binary ban rules. It’s been implemented and in production for over a year now, and is used to maintain shared banlists used by both matrix.org and mozilla.org communities. The next step is to expand this to support a blendable continuum of reputation data (rather than just binary banlists), make it privacy preserving, and get working on the client UX for configuring and visualising them.

Finally: we are continuing to hire a dedicated Reputation Team to work full time on building this (kindly funded by Element). This is a major investment in the future of Matrix, and frankly is spending money that we don’t really have - but it’s critical to the long-term success of the project, and perhaps the health of the Internet as a whole. There’s nothing about a good relative reputation system which is particularly specific to Matrix, after all, and many other folks (decentralised and otherwise) are clearly in desperate need of one too. We are actively looking for funding to support this work, so if you’re feeling rich and philanthropic (or a government wanting to support a more enlightened approach) we would love to hear from you at [email protected]!

Here’s to a world where users have excellent tools to protect themselves online - and a world where their safety is not compromised by encryption backdoors.

-- The Matrix.org Core Team

Comments at HN, lobste.rs, r/linux and LWN.

Dendrite is entering Beta!

08.10.2020 00:00 — Releases Matthew Hodgson

Hi all,

We’re very excited to announce that Dendrite, the next-generation Matrix homeserver from the core Matrix team, is at last exiting alpha development and entering beta testing!

The path we’ve taken to get here has been quite a curious one, and it’s worth recapping to give context on why it’s taken reality a little while to catch up with the dream. :)

The Dendrite project has its roots in 2016 as Dendron: an attempt to write a next-generation homeserver in Golang rather than Python, in order to benefit from Go’s stronger typing, ease of profiling (no twisted stack-shredding via deferredInlineCallbacks), multithreading and faster GC performance. The idea for Dendron was to do a strangler pattern rewrite of Synapse - where we’d insert Dendron in front of Synapse as a load balancer, and incrementally replace Synapse’s API endpoints with ones implemented by Dendron.

However, as the project started to progress, it became clear that this was going to end up with many of Synapse’s architectural choices being baked into the project - particularly the DB schema and data flow architecture, such that the new endpoints could interoperate with the existing Python ones. We got as far as putting Dendron live on matrix.org and moving some of the login/registration APIs over to it… but then work fizzled out due to Synapse demanding more urgent attention as traffic grew on Matrix.org, combined with concerns about whether Dendron was the right approach in general.

So, towards the end of 2016 (after the rush to launch Vector - later renamed Riot, now Element - that summer), we went back to the drawing board to devise Dendrite—“Dendron done right!”—as opposed to Dendron, which in retrospect was Dendrite done wrong. ;) The new vision was:

  • Build a massively horizontally scalable architecture, such that large Matrix deployments like matrix.org and big government deployments could run smoothly without the constant scalability headaches we were seeing at the time with Synapse
  • Do so by splitting the server into well-defined microservice components, each of which could independently horizontally scale, each with its own DB (if desired)
  • Connect the components together with a set of append-only logs via Kafka or similar, easily letting components shard and maintain their databases from the logs, allowing rolling upgrades, possibly schema upgrades, and all sorts of other niceties. The logs effectively become a primary source of truth rather than putting all the onus on a massive monolithic ever-growing database (a small sketch of this pattern follows the list)
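
To illustrate the shape of that design: in the sketch below (Python purely for illustration; Dendrite itself is written in Go and uses Kafka, or the in-memory naffka, for its logs), a consumer component tails an append-only log and applies entries to its own database, catching up from its last offset whenever it likes. The class names are invented for the example:

```python
class RoomServerOutputLog:
    """Toy append-only log acting as the source of truth."""

    def __init__(self):
        self._entries = []                 # append-only

    def append(self, entry):
        self._entries.append(entry)
        return len(self._entries)          # offset of the new entry

    def read_from(self, offset):
        return list(enumerate(self._entries[offset:], start=offset + 1))


class SyncServer:
    """A consumer component with its own database, fed purely from the log."""

    def __init__(self, log):
        self._log = log
        self._offset = 0
        self._db = []                      # stand-in for the component's own DB

    def catch_up(self):
        for offset, entry in self._log.read_from(self._offset):
            self._db.append(entry)         # apply the entry, then advance
            self._offset = offset


log = RoomServerOutputLog()
sync = SyncServer(log)
log.append({"type": "m.room.message", "room_id": "!abc:example.org"})
sync.catch_up()
print(sync._offset, len(sync._db))         # 1 1
```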

Rather than Dendron’s top-down approach, instead Dendrite started bottom-up with the very hardest bit: gomatrixserverlib, a standalone Go library implementing the state resolution algorithms and performing federation requests (such that it might also someday be used as a general purpose way to add Matrix federation support to an existing Go codebase).

Then we started building out the various components to implement the various services, starting with the roomserver (the service which models the history and state of one or more rooms in the server), then the syncserver (the service which implements the /sync API to let clients receive messages), etc. We even implemented a simplified in-memory version of Kafka named naffka—useful for glueing together the microservice components when running them all within a single binary.

Things were looking pretty positive by the summer of 2017: we had the server sending/receiving messages, federating with Synapse, and looking tantalisingly close:

We just sent the first ever synapse->dendrite federated traffic, including full dendrite media API (thumbnailing, fed, etc)!!! :D :D :D pic.twitter.com/sBcM2jMAr6

— Matrix (@matrixdotorg) June 8, 2017

However, we then hit three fairly major obstacles:

  • Matrix lost its funding
  • In the ensuing uncertainty, the two lead developers (Mjark & Kegan) went to work elsewhere
  • Meanwhile, Matrix uptake was starting to explode and Synapse was failing to scale to handle the traffic on matrix.org (and elsewhere)

At first, having formed what would become New Vector (now Element) to keep the rest of the core team hired, we pushed to see if we could get Dendrite finished fast enough to replace Synapse, with Erik & richvdh jumping over from Synapse to pick up the remaining work. However, it became clear that we urgently needed a quicker solution to address all the overloaded Synapses out there, and so they swung back to focus on improving Synapse (taking inspiration from some of the design of Dendrite - e.g. offloading endpoints onto worker processes connected via replication streams, and using OpenTracing to debug traffic as it flows over the various services).

At this point, Dendrite maintenance was in effect valiantly taken over by the community, with Brendan and later Anoa keeping the ball going in 2017, joined by APWhitehat in GSoC 2018 and cnly in GSoC 2019. The fact that Dendrite is now here today is thanks in no small part to their work to keep the project alive in its “wilderness years” between Sept 2017 and Dec 2019.

Meanwhile, it became clear that we were overdue getting Matrix itself out of beta - and the last thing we wanted to do was to split and dilute the implementation work of Matrix 1.0 over both Synapse and Dendrite - so we consciously made the decision to focus all our effort on Synapse for solving the remaining bugs and challenges.

Then, in July 2019, Matrix and Synapse exited beta, and we finally started to see light at the end of the tunnel. In October we started dusting off Dendrite again - looking to use it as a relatively simple and flexible codebase for experimenting with Peer-to-Peer Matrix, not least because, being Go, it can compile to WebAssembly and run clientside, and because even though Dendrite was originally built with massive deployments in mind, it turns out the elastic scaling means it can also scale down pretty small too—as part of the iOS P2P demo, we’ve even run full Dendrite homeservers on iPhones embedded into Element iOS! :)

In Dec 2019, we finally got to the point where Element could fund full-time dedicated development on Dendrite once again, with Neil Alexander joining the project and focusing fulltime on getting Dendrite out of alpha and getting it working for P2P and embedded usage (adding libp2p as a federation transport, and adding SQLite support) - and in Jan 2020 we got Dendrite successfully running clientside in a WASM service worker (just in time for FOSDEM!). Then, in Feb 2020, Kegan returned to the project to work fulltime on Dendrite - and the race began in earnest to get Dendrite ready for beta!

Here’s a pretty picture courtesy of GitHub to visualise the progress:

2020-10-08-dendrite-contributors.png

Throughout 2020 there’s been a huge amount of stabilisation work and polish:

  • Refactoring much of Dendrite’s foundation to make the codebase more maintainable
  • Creating all-new user server, key server and signing key server microservices
  • Moving some work out of existing microservices (ultimately superseding the former currentstateserver, publicroomsapi and typingserver microservices altogether)
  • Developing new testing infrastructure:
    • Complement - our brand new Golang Matrix integration test harness
    • Are We Synapse Yet - an aggregator which parses sytest/complement output to compare how close Dendrite is to passing
  • All the Matrix 1.0 work - particularly state res v2 & room version support
  • Making it work with more P2P transports for all the exciting P2P experiments
  • Supporting backfill and fetching missing events
  • Fixing up SQLite support to make it work as a first class citizen (with shared storage code where we can!)
  • Supporting both sending and rejecting invites (even over federation)
  • E2E encryption support (one-time keys, device lists, send-to-device support)
  • Improved federation sender logic (resend retries, backoffs, blacklisting, metrics, resetting backoffs when receiving transactions)
  • Handling both inbound and outbound redactions
  • User interactive authentication (and implemented on various ‘sudo’ endpoints e.g. deleting devices and changing passwords)
  • Respecting server ACLs
  • Rejecting / soft-failing events properly
  • Support for database schema upgrades

... which brings us at last to the present day (Oct 2020), as we declare Dendrite sufficiently stable that we consider it ready for beta testing!

In practice, this means Dendrite is now ready for experimentation by adventurous Matrix sysadmins. It is NOT ready for production usage yet, but we need folks to test it and help us iron out the remaining bugs! Please do not trust it with sensitive data, and we don’t recommend trying to run it at scale yet as we haven’t done any serious optimisation work.

That said, we do provide the following guarantees:

  • We’re providing versioned releases from here on in, beginning with 0.1.0
  • We don’t expect any major breaking changes to the config or architecture before 1.0
  • Ready for early adopters to try running Dendrite without experiencing ~daily breaking churn
  • The database schema is now stable and will upgrade itself going forwards - your database should now be here to stay! (assuming we don’t hit any nasty data loss bugs during beta)

In terms of comparison with Synapse, the main things you should get excited about are:

  • Dendrite aims to provide an efficient, reliable and scalable alternative to Synapse:
    • Efficient: A small memory footprint with better baseline performance than an out-of-the-box Synapse
    • Reliable: Implements the Matrix specification as written, using the same test suite as Synapse as well as a brand new Go test suite
    • Scalable: can run on multiple machines and eventually scale to massive homeserver deployments
  • This means significantly less memory usage than Synapse (depends on joined rooms, often between 50MB - 400MB resident memory) - although we haven’t tuned this at all yet!
  • All-new database model, where every microservice instance has its own database tables, letting them scale arbitrarily wide
  • The ability to efficiently use all your available CPU cores without needing to split into separate processes, thanks to Go and our extensive use of goroutines. No more Python global interpreter lock! :)
  • Future experimental MSCs are likely to land in Dendrite before Synapse (e.g MSC2753 Peeking via /sync and MSC2444 Peeking over Federation are already being prototyped (#1370 and #1391) in Dendrite rather than Synapse!)

The provisos you should know about however are:

  • We’re not feature complete yet: sytest reports 56% CS API coverage and 77% Federation coverage. NB: these are always going to be underestimates of what Dendrite actually supports due to how the tests are spread out; in actuality it’s likely more like 70% CS, 95% Fed.
  • No read receipts, membership lazy-loading, presence, push notifications, search, event context, key backups, cross-signing. See changelog for full limitations.
  • Not battle-tested in the wild by many people (there are probably only ~10 dendrites on the open network today!) - so there’s likely to be a broad spectrum of bugs at first.
  • Clients that require more exotic features, like lazy loading, may not behave properly yet
  • Please use Postgres rather than SQLite wherever possible—it’s faster and has fewer issues regarding concurrency (some requests on SQLite Dendrites may 500 with ‘database is locked’ - though we’ve worked hard to eliminate most of these)
  • Dendrite can run in either “monolith” or “polylith” mode. In monolith, all the microservices are linked into a single binary - and we recommend running in this configuration wherever possible for now. Monolith mode is extremely capable as it is, has fewer moving parts for things to go wrong, and will be the right choice for the majority of beta deployments!
  • Whilst Dendrite is nearly 100% federation compatible, there may still be situations where it will split-brain and disagree with the current room state that Synapse has calculated. We expect these issues to resolve as we get more user feedback.

Architecture-wise, this is what Dendrite looks like under the hood today:

2020-10-08-dendrite-arch.svg

To get up and running, please install Go and head on over to the Get Started guide at https://github.com/matrix-org/dendrite#get-started to join the fun :)

In terms of where we’re going next:

  • Read receipts. It’s a major missing feature and impacts UX significantly.
  • 100% Federation coverage (according to sytest). It’s crucial that Dendrite instances play nicely with other servers. This will be the best metric we have for asserting that we are just as capable as Synapse at the fed level.
  • Optimisation—Dendrite has not been optimised yet for speed or resource utilisation!
    • We plan to add benchmarks which will stress test different microservices in the presence of many different scaling factors (number of users, number of rooms, size of room, number of devices per user, number of sync requests, etc). This will hopefully allow us to identify bottlenecks and slow algorithms early on
    • Good old fashioned pprof with known slow scenarios to see what’s consuming CPU/memory and fixing issues ad-hoc (which we’ve already done a bit of pre-beta). This may involve adding additional in-memory caches, with a healthy respect for the complexities it may introduce (which Synapse has been bitten by)
  • We plan to add first class feature flag support for experimental MSCs—experimentation is one thing which makes Dendrite notably different from Synapse, and supporting it more thoroughly going forwards will be important. This may mean adding additional hooks, or potentially a dedicated microservice to cleanly separate experiments - we don’t know yet
  • P2P work will continue with vigour now we have a working, featureful, and relatively stable HS to embed and play with

Longer term, it’s pretty hard to say right now when we expect to exit beta (it took Synapse 5 years to exit beta, after all ;)) - but obviously we’ll need Dendrite to have parity with Synapse and have no known serious bugs.

Finally: you’re probably wondering what this means for Synapse. Synapse is here to stay - with tens of thousands of deployments around the world serving tens of millions of users. The majority of the core team is still focused on improving and optimising Synapse, and we’ll keep improving it for the foreseeable future.

However, we’ll certainly be experimenting with new stuff on Dendrite first - whether that’s P2P, portable accounts, new-style communities, peeking etc. We expect Synapse to be the stable long-term-supported solution, while Dendrite (particularly while in beta) will be the more unstable and experimental platform. In the longer term we’ll provide ways of migrating from Synapse to Dendrite however (probably via portable accounts), and perhaps in future new deployments may choose to use Dendrite - a bit like you might choose to use nginx rather than Apache for a new web server these days. But this will be a long transition—meanwhile we expect to see more and more next-generation homeservers like Conduit, Mascarene or Construct coming of age too.

So, there you have it. If you’re an intrepid sysadmin please spin up a Dendrite and start filing bugs! :)

— Matthew, Neil Alexander, Kegan and the whole Matrix team.

Here’s the official changelog:

Client-Server API Features

Account registration and management

  • Registration: By password only.
  • Login: By password only. No fallback.
  • Logout: Yes.
  • Change password: Yes.
  • Link email/msisdn to account: No.
  • Deactivate account: Yes.
  • Check if username is available: Yes.
  • Account data: Yes.
  • OpenID: No.

Rooms

  • Room creation: Yes, including presets.
  • Joining rooms: Yes, including by alias or ?server_name=.
  • Event sending: Yes, including transaction IDs.
  • Aliases: Yes.
  • Published room directory: Yes.
  • Kicking users: Yes.
  • Banning users: Yes.
  • Inviting users: Yes, but not third-party invites.
  • Forgetting rooms: No.
  • Room versions: All (v1 - v6)
  • Tagging: Yes.

User management

  • User directory: Basic support.
  • Ignoring users: No.
  • Groups/Communities: No.

Device management

  • Creating devices: Yes.
  • Deleting devices: Yes.
  • Send-to-device messaging: Yes.

Sync

  • Filters: Timeline limit only. Rest unimplemented.
  • Deprecated /events and /initialSync: No.

Room events

  • Typing: Yes.
  • Receipts: No.
  • Read Markers: No.
  • Presence: No.
  • Content repository (attachments): Yes.
  • History visibility: No, defaults to joined.
  • Push notifications: No.
  • Event context: No.
  • Reporting content: No.

End-to-End Encryption

  • Uploading device keys: Yes.
  • Downloading device keys: Yes.
  • Claiming one-time keys: Yes.
  • Querying key changes: Yes.
  • Cross-Signing: No.

Misc

  • Server-side search: No.
  • Guest access: Partial.
  • Room previews: No, partial support for Peeking via MSC2753.
  • Third-Party networks: No.
  • Server notices: No.
  • Policy lists: No.

Federation Features

  • Querying keys (incl. notary): Yes.
  • Server ACLs: Yes.
  • Sending transactions: Yes.
  • Joining rooms: Yes.
  • Inviting to rooms: Yes, but not third-party invites.
  • Leaving rooms: Yes.
  • Content repository: Yes.
  • Backfilling / get_missing_events: Yes.
  • Retrieving state of the room (/state and /state_ids): Yes.
  • Public rooms: Yes.
  • Querying profile data: Yes.
  • Device management: Yes.
  • Send-to-Device messaging: Yes.
  • Querying/Claiming E2E Keys: Yes.
  • Typing: Yes.
  • Presence: No.
  • Receipts: No.
  • OpenID: No.

Welcoming Gitter to Matrix!

30.09.2020 16:28 — General Matthew Hodgson
Last update: 30.09.2020 14:58

Gitter ♥️ Matrix

Hi all,

We are ridiculously excited to announce that Gitter is joining the Matrix ecosystem and will become the first major existing chat platform to switch to natively speaking Matrix!

If you’re reading this from the Gitter community and have no idea what Matrix is: we’re an open source project that provides an open protocol for secure, decentralised communication - effectively the missing real-time communication layer of the open Web. The open Matrix network has more than 20M users on it and is growing fast (adding another 1.7M or so with the arrival of Gitter!)

Gitter is easily one of the best developer community chat systems out there, used by the communities of some massive projects (Node, TypeScript, Angular, Scala etc) and is a custodian of huge archives of knowledge via their chat logs. Gitter is unique in specifically focusing on developers: their tagline is literally “Where developers come to talk” (unlike Slack, which has barely any community features - or Discord, with its ban on unofficial clients, where developers are a bit of an afterthought relative to the gamers). With Gitter natively joining Matrix, we’re super excited to see the global developer community converging on the open Matrix network - and Gitter’s community rooms should see a huge new lease of life as they’re properly made natively available to the wider network as first class citizens :)

We’ve always had a bit of a crush on Gitter ever since we ended up opposite each other in the exhibition hall at TechCrunch Disrupt Europe 2014 - particularly when they demoed us not only their sexy webapp but also their official IRC server bridge at irc.gitter.im :D Over the years we’ve been gently nudging them to consider fully embracing Matrix, but perhaps understandably they’ve been busy focusing on their own stuff. However, earlier this year, our friends at GitLab (who acquired Gitter in 2017) reached out to explore the opportunity of Gitter becoming a core part of Matrix rather than a non-core project at GitLab… and we’ve jumped on that opportunity to bring Gitter fully into Matrix.

In practice, the way this is happening is that Element (the company founded by the Matrix core team to fund Matrix development) is acquiring Gitter from GitLab, with a combined Gitter and Element dev team focusing on giving Gitter a new life in Matrix! You can read about it from the Element angle over on the Element blog.

Practically speaking, we have a pretty interesting plan here, which we’d like to be very transparent about given it’s a little unusual:

At first, Gitter will keep running as it always has - needless to say, we will be doing everything we can to delight the Gitter community and keep the service in good shape.

Then we’re going to build out native Matrix connectivity - running a dedicated Matrix homeserver on gitter.im with a new bridge direct into the heart of Gitter; letting all Gitter rooms be available to Matrix directly as (say) #angular_angular:gitter.im, and bridging all the historical conversations into Matrix via MSC2716 or similar. We will of course do this entirely as open source, just as Gitter itself is open source thanks to GitLab releasing it under the MIT license in 2017. The plan is to comprehensively document our progress as the flagship worked example case study of “how do you make an existing chat system talk Matrix.”

This will of course replace the old and creaky matrix-appservice-gitter bridge we’ve been running since 2016. Gitter users will also be able to talk to other users elsewhere in the open Matrix network - e.g. DMing them, and (possibly) joining arbitrary Matrix rooms. Effectively, Gitter will have become a Matrix client.

Now we come to the interesting bit. Gitter has some really nice features which are sorely lacking in Element today:

  • Instant live room peeking (less than a second to load the webapp into a live-view of a massive room with 20K users!!)
  • Seamless onboarding thanks to using GitLab & GitHub for accounts
  • Curated hierarchical room directory
  • Magical creation of rooms on demand for every GitLab and GitHub project ever
  • GitLab/GitHub activity as a first-class citizen in a room’s side-panel
  • Excellent search-engine-friendly static content and archives
  • KaTeX support for Maths communities
  • Threads!

...and we promise to do everything in our power to preserve and honour these features at all costs and continue to give the Gitter community the experience they’ve come to know and love.

However: in the medium/long term, it’s simply not going to be efficient for the combined Element/Gitter team to split our efforts maintaining two high-profile Matrix clients. Our plan is instead to merge Gitter’s features into Element (or next generations of Element) itself and then - if and only if Element has achieved parity with Gitter based on the above list - we expect to upgrade the deployment on gitter.im to a Gitter-customised version of Element. The inevitable side-effect is that we’ll be adding new features to Element rather than Gitter going forwards.

In practice, the main outcome in the end should be Element having benefited massively from levelling up with Gitter - and Gitter benefiting massively from all the goodies which Element and Matrix brings, including:

  • E2E Encryption
  • Reactions
  • Constantly improving native iOS & Android clients (which should be a welcome alternative to Gitter’s native ones, which are already being deprecated)
  • VoIP and conferencing
  • All the alternative clients, bots, bridges and servers in Matrix
  • The full open standard Matrix API
  • Widgets (embedding webapps into rooms!)
  • ...and of course participation in the wider decentralised Matrix network.

So, there you have it. It’s a new era for Gitter - and we look forward to reinvigorating Gitter’s communities over the coming months. We hope Gitter users will be blown away by the features arriving from Matrix… and we hope that Element users will be ecstatic with the performance and polish work that Gitter-parity will drive us towards. Imagine having guest access in Element that can launch and load a massive room in less than a second!

Finally, we would like to explicitly reassure the Gitter community again that we love and understand Gitter (it was one of the very first ever bridges we wrote for Matrix, for instance) - and we will be doing everything we can to not screw up our responsibility in looking after it. Please, please let us know if you have any concerns or if we ever fall short on this.

Any questions, come talk to us on #gitter:matrix.org - which is bridged with https://gitter.im/matrix-org/gitter. Exciting times ahead!

- Matthew, Amandine, and the whole Matrix, Element and Gitter teams.

Matthew & Amandine being dorky
Matthew and Amandine model 2014-vintage Matrix & Gitter swag in celebration :D

Bonus update - The Changelog Interview!

Sid Sijbrandij (CEO at GitLab) and Matthew had a chance to sit down with The Changelog to talk about Gitter's Big Adventure - so tune in to hear the story first hand! Warning: contains non-ironic use of the word "synergy" :D


Matrix Decomposition: an independent academic analysis of Matrix State Resolution

16.06.2020 20:15 — General Matthew Hodgson
Last update: 16.06.2020 19:09

Hi all,

Regular readers of TWIM may be familiar with the Decentralized Systems and Network Services Research Group at Karlsruhe Institute of Technology, who have been busy over the last few years analysing Matrix from an independent academic point of view. The work started in 2018 with Florian Jacob’s DSN Traveler spidering project, resulting in the Glimpse of the Matrix paper analysing Matrix’s scale and room/server distribution (at least as it was back then).

Last week, they released an entirely new paper: Matrix Decomposition: Analysis of an Access Control Approach on Transaction-based DAGs without Finality by Florian Jacob, Luca Becker, Jan Grashöfer and Hannes Hartenstein, presented at ACM SACMAT ‘20.

Now, the new paper is an absolutely fascinating deep dive analysis into State Resolution v2 - the algorithm at the heart of Matrix which defines how servers merge together their potentially conflicting copies of a given room, such that everyone ends up eventually with a consistent view… even in the face of bad actors. This means that Matrix effectively implements a decentralised access control system - ensuring that users stay banned, and only users with permission can ban, etc. You can see the slides below, and read the full paper here. The video of Florian’s talk from SACMAT should be published shortly.

To give some context from the Matrix side: designing and implementing State Resolution v2 back in 2018 was a bit of a mission. Our original v1 implementation had some bugs which meant that the result of the merge could unexpectedly favour historical state over the current state (so called ‘state resets’) - thus giving an attacker a way to maliciously revert the state of the room. In v2 we thought much more carefully about the algorithm, considering state present in one version of the room but not the other as a conflict, separating and applying access control events from regular events, and adding additional ordering of the state in the room by considering events in the context of their authorisation chain (the ‘auth DAG’). The end result is that we feel confident in v2 State Res, and we haven’t seen any problems with it in the wild since we shipped it in July 2018.

However: state resolution is not intuitive at first - for instance, when you merge two versions of a room together, you treat the state events as unordered sets… even though they are ordered in the context of the room DAG. The reason is that state res needs to work even if you don’t have a copy of the whole room DAG (otherwise you’d have to download way too much data to participate in a large room). Another example is the sequence in which orderings are then applied to the state events - and how that interacts with re-authorising those events, to stop malicious ones creeping in. In the core team, we’ve ended up describing it in several different ways to try to help folks understand: first Erik’s original MSC1442, then uhoreg’s literary Haskell implementation, then the terse reference version in the Spec itself, and most recently Neil Alexander’s State Resolution v2 for the Hopelessly Unmathematical.
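
To make the “unordered sets” point a little more concrete, here’s a minimal sketch - in Go, with illustrative names rather than the real Dendrite/gomatrixserverlib types - of just the very first step of State Res v2 as described above: splitting the state of the diverged copies of a room into the entries everyone agrees on and the entries that conflict (where, per the v2 design, an entry missing from any copy counts as conflicted). All the real subtlety - the auth-DAG ordering and re-authorisation - happens after this step, and the spec and write-ups above are the places to go for that.

```go
// A sketch of the first step of State Resolution v2: partitioning room state
// into "unconflicted" and "conflicted" entries. Names are illustrative only.
package stateres

// StateKeyTuple identifies a piece of room state by (event type, state key).
type StateKeyTuple struct {
	EventType string
	StateKey  string
}

// partitionState takes the state map of each diverged copy ("fork") of the
// room - mapping state keys to event IDs - and splits the union of entries
// into those every fork agrees on and those which differ or are missing.
func partitionState(forks []map[StateKeyTuple]string) (unconflicted map[StateKeyTuple]string, conflicted map[StateKeyTuple][]string) {
	unconflicted = map[StateKeyTuple]string{}
	conflicted = map[StateKeyTuple][]string{}

	// Collect every event ID seen for each state key across all forks.
	seen := map[StateKeyTuple]map[string]bool{}
	for _, fork := range forks {
		for key, eventID := range fork {
			if seen[key] == nil {
				seen[key] = map[string]bool{}
			}
			seen[key][eventID] = true
		}
	}

	for key, eventIDs := range seen {
		// Unconflicted only if every fork contains this entry with the same
		// event; an entry missing from any fork is treated as conflicted.
		agreed := len(eventIDs) == 1
		for _, fork := range forks {
			if _, present := fork[key]; !present {
				agreed = false
				break
			}
		}
		for eventID := range eventIDs {
			if agreed {
				unconflicted[key] = eventID
			} else {
				conflicted[key] = append(conflicted[key], eventID)
			}
		}
	}
	return unconflicted, conflicted
}
```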

As a result we are very excited and happy that Florian and the DSN team have now published the first ever independent in-depth analysis of the algorithm, particularly in the context of decentralised access control (i.e. enforcing bans, power levels, etc). We’re pleasantly surprised that apparently “To the best of our knowledge, Matrix is the only system that implements access control based on an eventually consistent partial order without finality and without a consensus algorithm”.

Even better, the DSN team found some remaining thinkos in Synapse’s implementation and the Matrix specification, which could have caused resolution results to diverge from other implementations, specifically:

  1. we weren’t enforcing integers in JSON to be within the range [-2⁵³+1, 2⁵³-1] (see the sketch just after this list), fixed in https://github.com/matrix-org/synapse/pull/7381 and MSC2540
  2. we forgot to include the notification field when authing power level events, fixed in https://github.com/matrix-org/synapse/issues/7501 and MSC2209 (thanks to Luca from DSN for the MSC!)
  3. we forgot to spec the limit that one should apply to the number of parents of an event in the DAG (fixed in https://github.com/matrix-org/matrix-doc/pull/2538)
  4. we missed that moderators could set server ACLs which could let them undermine room admins (fixed in https://github.com/matrix-org/synapse/pull/6834).
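
To illustrate the first item, here’s roughly the shape of that canonical-JSON integer check - a hedged Go sketch with made-up names, not the actual Synapse (Python) or gomatrixserverlib code:

```go
// A sketch of the canonical JSON integer-range check described in item 1:
// every JSON number in an event must be an integer within [-(2^53)+1, 2^53-1].
// Function names here are illustrative, not a real library API.
package canonicaljson

import (
	"encoding/json"
	"fmt"
	"math"
)

// VerifyIntegerRange parses raw event JSON and rejects any out-of-range number.
func VerifyIntegerRange(raw []byte) error {
	var decoded interface{}
	if err := json.Unmarshal(raw, &decoded); err != nil {
		return err
	}
	return checkValue(decoded)
}

func checkValue(v interface{}) error {
	const maxCanonical = float64(1<<53 - 1)
	switch val := v.(type) {
	case float64: // encoding/json decodes every JSON number as float64
		if val != math.Trunc(val) || val < -maxCanonical || val > maxCanonical {
			return fmt.Errorf("number %v is outside the canonical JSON integer range", val)
		}
	case map[string]interface{}:
		for _, child := range val {
			if err := checkValue(child); err != nil {
				return err
			}
		}
	case []interface{}:
		for _, child := range val {
			if err := checkValue(child); err != nil {
				return err
			}
		}
	}
	return nil
}
```

The reason for that particular range is that 2⁵³-1 is the largest integer which every JSON implementation (including JavaScript’s 64-bit floats) can represent exactly, so staying inside it keeps all homeservers hashing and resolving the same events identically.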

All of these have now been fixed in Synapse and the latest versions of the spec (room v6), and we’d like to sincerely thank Florian and Luca for rapidly and responsibly disclosing the issues to us. In other words: this research is directly improving Matrix, and it’s even more exciting that the stated future work for the DSN team is to work on a formal verification for the security of Matrix’s authorisation rules and state resolution. This stuff is tough, as anyone who’s played with TLA+ will know, and we are incredibly glad that the research community is helping out to formalise and hopefully prove that State Res v2 is as good as we think it is.

We should stress that DSN’s work is completely independent of The Matrix.org Foundation or anyone else building on the protocol; we’re just writing about it here because we think it’s incredibly cool and deserves the attention of the whole Matrix ecosystem.

Thanks again to Florian and the team - we look forward to seeing what comes next!

Introducing P2P Matrix

02.06.2020 00:00 — General Matthew Hodgson

TL;DR: we shipped a major update (v0.1.1) to https://p2p.riot.im - fire up a desktop Chrome or Firefox in non-private-browsing mode and give it a go!

Hi folks,

As many know by now, a few of us have been working away since mid-December on experimenting with running Matrix in a peer-to-peer architecture - one where every user has absolute total autonomy and ownership of their conversations, because the only place their conversations exist is on the devices they own.

In some ways this is the logical end goal of Matrix: our aim has always been to empower users to have full control over their communication rather than being beholden to any given service provider, and in a P2P world we completely return power over secure communication to the people.

Why P2P?

P2P Matrix is about more than just letting users store their own conversations: it can also avoid dependencies on the Internet itself by working over local networks or mesh networks, or in situations where the Internet has been cut off. Even more interestingly, without homeservers, there is nowhere for metadata to accumulate about who is talking to who, and when - which is a legitimate complaint about today’s Matrix network, given the homeservers of all users in a given conversation necessarily have to store that conversation’s metadata. P2P also lets us radically simplify signup for new users if they don’t have to pick a server to get going - and we avoid the unintentional centralisation of users piling onto public servers.

P2P also forces us to solve many of the hardest remaining problems in Matrix: e.g. multi-homed accounts, given multi-device P2P requires your account to exist in multiple places. This in turn unlocks high availability and geo-redundancy for accounts on today’s Matrix network (imagine having a primary and backup homeserver that magically did the right thing!), as well as account portability, and thus also vhosting and load-balancing accounts between servers, and even improved GDPR compliance (since ephemeral user IDs are no longer personally identifying information baked into your Matrix rooms). We’ll also need better safety mechanisms to avoid folks exploiting the anonymous nature of the network for abuse, accelerating the work we’re already doing for today’s Matrix network.

Our approach to P2P has been the “hamfisted but genius” one of taking homeservers and running them clientside, alongside or within your Matrix client - meaning that there are literally no changes required for any Matrix client to talk P2P Matrix, and so P2P Matrix can instantly benefit from all the work which has gone into Riot and other apps. As a result, P2P is also a huge motivator towards developing much smaller homeservers which can run efficiently clientside (e.g. Dendrite!) - which is of course great news for Matrix as a whole. It also forces us to develop more scalable routing algorithms (as you don’t want your client to have to talk to every other device in a room every time it sends a message!) and spurs development of low bandwidth Matrix transports (as you don’t want the additional chatter of talking to multiple peers to consume all your bandwidth). Finally, it forces us to really ruggedize federation, given nodes are constantly appearing and disappearing, giving it much more of a stress test than we see with today’s relatively static homeservers.

P2P in Practice

So, P2P has been acting as fuel for a lot of our longer term Matrix work over the last few months. There have been three main experiments so far: at FOSDEM we showed off our next-gen Dendrite homeserver running clientside, using HTTP over libp2p as the transport. We also highlighted Timothée Floure’s project at EPFL experimenting with Synapse talking P2P CoAP over yggdrasil as the transport via a proxy.

Most recently, however, we’ve been experimenting with compiling Dendrite down to WebAssembly and running it embedded in Riot Web as a Service Worker, using HTTP over libp2p’s websocket transport (coordinated via a websocket rendezvous server). Architecturally, it looks like this:

P2P Architecture Diagram
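
Since the Matrix client-server API is just HTTP, “no client changes” falls out of this architecture almost for free: all that changes is the transport underneath. As a deliberately over-simplified sketch (in Go, with hypothetical names - this is not Dendrite’s actual wiring), if the P2P layer hands the server anything satisfying net.Listener, the same handlers can be served over it unchanged:

```go
// A sketch of the "swap the transport, keep the API" idea from the diagram.
// Names are hypothetical; Dendrite's real wiring is more involved.
package p2psketch

import (
	"net"
	"net/http"
)

// ServeMatrixOverP2P serves an existing Matrix client-server API handler over
// whatever listener the peer-to-peer transport provides - e.g. a libp2p
// websocket listener - exactly as it would be served over plain TCP today.
func ServeMatrixOverP2P(p2pListener net.Listener, clientServerAPI http.Handler) error {
	return http.Serve(p2pListener, clientServerAPI)
}
```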

Today, we’re shipping a major new alpha (v0.1.1) of this P2P demo up at https://p2p.riot.im (requires desktop Chrome or Firefox in non-private-browsing mode) - which hopefully should give a really usable and concrete taste of the shape of things to come.

The main features are:

  • Your conversations are now persisted in your browser storage (via IndexedDB), meaning that as long as the browsers participating in a given conversation don’t all clear their local storage, rooms on the P2P network are here to stay!
  • Your room directory lists all the aliases for all the rooms published by active nodes on the network. Moreover, we now automatically publish a local room alias whenever you join a public room, so that others will be able to discover that room via you, even if the server that originally created the alias has disappeared.
  • Lots and lots of federation improvements between the nodes - for instance, when a node comes online, others should now automatically detect and send scrollback to it. Invites should work, and there should no longer be any unexpectedly redacted messages.

Needless to say, all the code for this is open source under the Apache license, and if you’re feeling particularly adventurous you can embed your very own P2P Dendrite into Riot Web by using the Dockerfile at https://github.com/matrix-org/dendrite/blob/master/build/docker/DendriteJS.Dockerfile or following the instructions at https://github.com/matrix-org/dendrite/blob/master/docs/p2p.md.

Please report bugs to https://github.com/matrix-org/dendrite/issues!

Finally, please understand that the demo is very likely not what the final version of P2P Matrix will look like - this is just one step in a series of experiments as we investigate the best paths forward :)

What’s next?

For the current demo, there’s still lots of stuff remaining, including:

  • More federation debugging (and hooking in tardis and writing up everything we’ve learned about implementing federation in Dendrite!)
  • Making the content repository work in-browser (gotta fill up those IndexedDBs with some GIFs!)
  • Hooking up E2E Encryption APIs in Dendrite (not that it buys us much in a pure P2P world)
  • WebRTC transports. Turns out that service workers aren’t allowed to speak WebRTC, so we’ll have to shim through to Riot to speak true peer-to-peer WebRTC data channels rather than relaying all the traffic through the websocket rendezvous server.
  • Decentralised accounts for multidevice support - reviewing MSC1228 and getting Dendrite supporting multihoming accounts!
  • Finishing all of Dendrite’s other remaining APIs.

Beyond this, there are some bigger picture questions left to be answered in future experiments.

Firstly: we do not yet have a solution for “store and forward” nodes which can relay messages on behalf of a room if all the participating devices are offline. A first cut will be to run a P2P-capable homeserver server-side for this, but then metadata will start to accumulate server-side for the conversations it hosts. A more interesting approach would be to use a store and forward system which obfuscates who is talking to who, such as a mixnet, and could even provide resistance to network traffic pattern analysis. This is very much an open area of research, but one we are getting into :D

Secondly: we want to experiment more with other transports, and find out which works best for Matrix. Libp2p has some really exciting new stuff in the form of Gossipsub v1.1 - a much smarter routing algorithm for pubsub traffic in libp2p, which David Dias gave us a VIP tour of at the first Open Tech Will Save Us meetup. So we’ll need to restructure our libp2p transport as pubsub to see how it works in practice. Separately, we also want to play with hooking up Yggdrasil (the encrypted overlay network) as a transport as a totally different approach - Yggdrasil will easily let us span different underlying network transports, but comes with different tradeoffs (e.g. no browser support yet). We also want to take a look at the DAT / hypercore / hyperswarm / Cabal ecosystem to see if there’s a match :)

Thirdly and finally: we obviously want to unify the new P2P Matrix network with today’s federated one. The ideal outcome here would be to have a hybrid model, where teams who want their users to have a dedicated homeserver (for availability, IT policies, etc) can continue to have one as they do today - but newbies who have just installed Riot would float around on P2P unless they decided to consciously put down roots on a server or two. Best of all, it would let us turn off the matrix.org homeserver: the best public homeserver is one you run yourself on your own phone ;) The approach we take for linking P2P and today’s Matrix will depend very much on the transport we select for P2P in the long run, but the likelihood is that today’s homeservers will sprout P2P gateways to link the networks.

Conclusion

So, there you have it. P2P Matrix exists, and is developing at an alarming speed - and pushing Dendrite development along with it. Most excitingly, there have been no changes yet to the Matrix spec for P2P at all; we’ve just swapped https for http-over-libp2p as the transport. So all of the work we’ve been doing making Dendrite work in a P2P world has directly translated into making Dendrite work on today’s Matrix too. You can now stand up a Dendrite and have it federate pretty reliably with the wider Matrix network, although we’re still rushing through implementing APIs (we’re up to 35% passing sytest coverage - although that 35% does contain most of the important tests :)

Finally, in case you’re worried about why the Matrix core team is off chasing P2P dreams rather than improving Riot’s UX, or implementing Communities, or Extensible Profiles, or working through the MSC backlog etc... in practice only two people (ignoring Matthew) have been working on P2P - Neil Alexander (author of the original FOSDEM demo, Dendrite wrangler and Yggdrasil co-maintainer) and Kegan Dougal (of the original Matrix dev team, one of the original authors of Dendrite, and now wrangling the WASM P2P work too). Huge thanks to Kegan & Neil for pushing P2P forwards - and huge thanks to everyone else on the core team and the wider community for keeping today’s Matrix advancing too!

Hope this has given a tempting glimpse of the shape of things to come. Honestly we never thought we’d get as far as P2P when we started Matrix back in 2014, but it’s really fun to be finally catching up with the future :D

-- Matthew

P.S. You can read more about this from Neil Alexander’s point of view over at his blog (including more thoughts on the potential Yggdrasil demo!)

P.P.S. You can read the gory details of the P2P and WASM implementation from Kegan's point of view over at the Dendrite wiki.

P.P.P.S. Comments over at HN

Welcoming Automattic to Matrix!

21.05.2020 01:42 — General Matthew Hodgson
Last update: 21.05.2020 01:28

Automattic ♥️ Matrix

Hi all,

We’re very excited indeed to announce that Automattic, the creators of WordPress.com, are jumping head first into the Matrix ecosystem with a strategic investment of almost $5M into New Vector (the company which makes Riot and Modular.im, founded by the core Matrix team in 2017). More importantly, Matt Mullenweg (co-founder of WordPress and founder of Automattic) and the Automattic gang are committing to make the most of Matrix in their work going forwards!

This is huge news, not least because WordPress literally runs over 36% of the websites on today’s web - and the potential of bringing Matrix to all those users is incredible. Imagine if every WP site automatically came with its own Matrix room or community? Imagine if all content in WP was automatically published into Matrix as well as the Web? (This isn’t so far-fetched an idea - it turns out that Automattic already runs an XMPP bridge for wordpress.com over at im.wordpress.com!) Imagine there was an excellent Matrix client available as a WordPress plugin for embedding realtime chat into your site? Imagine if Tumblr (which is part of Automattic these days) became decentralised!?

In fact, if you’re a developer in either the Matrix or WordPress communities, now might be a good time to think about how to cross the streams.... not least because Automattic just opened up a role for a Matrix.org/WordPress Integrations Engineer! Quite aside from the investment, this shows Automattic is serious about Matrix - and we’d like to thank them for opening up jobs in these challenging times to further accelerate Matrix. Perhaps some day Matrix Engineer will be as common a career choice as Web Developer ;)

That said, it’s super early days for integration work, and there isn’t a concrete project to announce yet beyond the investment in New Vector (which is effectively an extension of the funding NV raised in October) and Automattic’s job opening - but these are the sort of ideas we’ve been kicking around. And at the very least, we should expect to see Automattic’s communities migrating over to Matrix in the coming months.

It’s been loads of fun working with Matt and the team on this: we see a huge overlap in terms of a genuine love for the open web, open source and open standards. It’s also no coincidence that Matt (independently of Automattic) donated substantially to Matrix via Patreon back in 2017 when we needed it the most. We’re also looking forward to benefiting from Automattic’s experience in sustainably and responsibly funding and growing open source projects in general - WordPress.com is an excellent example of how one can support development of a project like WordPress without compromising its open source nature.

So, we’d like to formally welcome WordPress and the rest of the Automattic family into Matrix. It’s incredibly exciting times, and we can’t wait to see what will come of the partnership! And meanwhile, if any other massive open source organisations want to join Automattic and Mozilla in leaping into Matrix, you know where to find us… :D

Huge thanks go to Matt for believing in Matrix - watch this space for updates.

- Matthew, Amandine & the Matrix Team.

Cross-signing and End-to-end Encryption by Default is HERE!!!

06.05.2020 00:00 — General Matthew Hodgson

Hi all,

As of today, Matrix is end-to-end encrypted by default for private conversations.

Three years have passed since we first announced End-to-end Encryption in Matrix and started to beta test it in Riot - and after an enormous amount of polishing and refinement on its user experience, we are finally declaring it out of beta and enabling it by default for all new private conversations in Riot. As Riot is currently the most common Matrix client, this means that Matrix as a whole should now be considered end-to-end encrypted by default for DMs and invite-only rooms.

Work on E2EE in Matrix has progressed in waves since we first shipped it - including:

  • adding keysharing (letting you share encryption keys between your devices to improve reliability)
  • making Riot Web's encryption resilient to running concurrently in multiple tabs
  • adding online key backup (so you don't lose all your history if you lose all your devices)
  • making encryption resilient to restoring the app from a backup
  • adding interactive key verification via emoji to make the verification process easier.

However, our goal was always to enable E2EE by default for all private rooms, which means having feature parity between unencrypted and E2EE Matrix so that we can enable encryption without any negative impact on usability. The high-level remaining items were significant:

  • Cross-signing: verifying your own logins so others don’t have to.
  • Adding QR codes for even better verification UX, to make cross-signing as painless as possible.
  • Replacing the old prototype UI for E2EE with final polished UI/UX.
  • Ability to support non-E2EE clients.
  • Ability to search encrypted rooms.
  • Ability to view file indexes in encrypted rooms.
  • Fixing the remaining “Unable to decrypt” errors.

Over the last few months the Riot team has been almost entirely focused on implementing solutions to these items - and we're finally at the point where the switch can be flipped: as of Riot Web/Desktop 1.6, Riot iOS 0.11.1 and RiotX Android 0.19, all new private rooms will be encrypted by default, completing the transition we began at FOSDEM 2020 when we landed cross-signing and E2E-by-default in the development branches of Riot.

For full details, please go check out the massive deep dive over at the Riot blog - also featuring all the other recent progress in Riot!

Heads up that encrypted traffic is slightly heavier on the server than unencrypted (due to key exchange, verification traffic, and key backup traffic), and so there is a risk that the already-over-popular Matrix.org server instance may feel a little hugged to death. However, unprecedented Synapse performance breakthroughs are on the horizon in the coming weeks which will fix this - and, of course, you can (and should!) be using your own instance anyway.

Thanks everyone for helping us test encryption over the years and getting us to this point: cross-signing provides a more secure way of tracking device trust than almost any other comms system out there, and we hope that you'll agree the improved UX has been worth the wait.

Next stop: Synapse performance, and rebuilding Riot's first time user experience!

thanks,

Matthew, Amandine & the Matrix Team.

(Comments over at HN)