Load problems on the Matrix.org homeserver

Hi folks,

Since FOSDEM we’ve seen even more interest in Matrix than normal, and we’ve been having some problems getting the Matrix.org homeserver to keep up with demand.  This has resulted in performance being slightly slower than normal at peak times, but the main impact has been the additional traffic exacerbating outages on the homeserver – either by revealing new failure modes, or making it harder to recover rapidly after something goes wrong.

Specifically: on Friday afternoon we had a service disruption caused by someone sending an unusual event into Matrix HQ.  It turns out that both matrix-android-sdk and matrix-ios-sdk based clients (e.g. Riot/Android and iOS) handled this naively by simply resyncing the room state… which has been fine in the past, but not when several hundred clients are actively syncing the room: the result was a thundering herd effect which overloaded the server for ~10 mins whilst they all resynced the room (which, nowadays, involves calculating and sending several MB of JSON state to each client).  The traffic load was then high enough that it took a further 10-20 minutes for the server to fully catch up and recover after the herd had dissipated.  We then had a repeat performance of the same failure mode on Monday morning.
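
For client authors facing similar patterns: the classic defence against a thundering herd is to add randomised exponential backoff before retrying an expensive operation like a full room-state resync, so a crowd of clients spreads its retries out rather than stampeding at once.  A minimal sketch in Python (a hypothetical helper, not the actual matrix-android-sdk/matrix-ios-sdk fix):

```python
import random
import time

def resync_with_backoff(resync, max_attempts=5, base_delay=1.0, max_delay=60.0):
    """Retry an expensive resync with jittered exponential backoff.

    Randomising the delay ("full jitter") stops hundreds of clients from
    all hammering the homeserver at the same instant after a failure.
    """
    for attempt in range(max_attempts):
        try:
            return resync()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            cap = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, cap))
```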

Similarly, we had disruption last night after a user who hadn’t used the service for ages logged on for the first time and rapidly caught up on a few rooms which literally had *millions* of unread messages in them.  Generally this would be okay, but the combination of a loaded DB and the sheer number of notifications being deleted ended up with 4 long-running DB deletes running in parallel.  This seems to have caused postgres to lock the event_push_actions table more aggressively than we’d expect, blocking other queries which were trying to access it… causing most requests to block until the deletes were over.  At current traffic volumes this meant that the main synapse process tried to serve thousands of simultaneous requests as they stacked up, ran out of filehandles within about 10 minutes, and wedged solid before the DB could unblock.  Irritatingly, it turns out our end-to-end monitoring has a bug where it can in turn crash on receiving a 500 from synapse, so despite having PagerDuty all set up and running (and having been receiving pages for traffic delays over the last few weeks)… we didn’t get paged when we got actual failed traffic rather than slow traffic, which delayed resolving the issue.  Finally, whilst rolling out a fix this afternoon, we again hit issues with the traffic load causing more problems than we were expecting, making a routine redeploy distinctly more disruptive.
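
For those wondering how to avoid the long-lock problem in general: the standard trick is to break a huge DELETE into bounded batches, committing between batches so row locks are held only briefly and other transactions can interleave.  A hedged sketch with psycopg2 – the batching idiom is standard Postgres practice, but the user_id column and exact query are illustrative, not synapse’s actual code:

```python
import psycopg2

def delete_push_actions_in_batches(dsn, user_id, batch_size=10000):
    """Delete a user's old push actions in small batches.

    Each batch runs in its own transaction, so the table is never
    locked for the duration of one enormous DELETE.
    """
    conn = psycopg2.connect(dsn)
    try:
        while True:
            with conn:  # commit (or roll back) once per batch
                with conn.cursor() as cur:
                    # Postgres DELETE has no LIMIT, so select ctids first.
                    cur.execute(
                        """
                        DELETE FROM event_push_actions
                        WHERE ctid IN (
                            SELECT ctid FROM event_push_actions
                            WHERE user_id = %s
                            LIMIT %s
                        )
                        """,
                        (user_id, batch_size),
                    )
                    if cur.rowcount == 0:
                        break  # nothing left to delete
    finally:
        conn.close()
```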

So, what are we doing about this?

  1. Fix the root causes:
    • The ‘android/iOS thundering herd’ bug is being worked on from both the android/iOS side (fixing the naive behaviour) and the server side.  A temporary mitigation is now in place which moves the server-side code to worker processes, so that in the worst case it can’t take out the main synapse process, and it can scale better.
    • The ‘event_push_actions table is inefficient’ bug had already been fixed – so this was a matter of rushing through the hotfix to matrix.org before we saw a recurrence.
  2. Move to faster hardware.  Our current DB master is a “fast when we bought it 5 years ago” machine whose IO is simply starting to saturate (6x 300GB 10krpm disks in RAID5, fwiw), maxing out at around 500 IOPS and 20MB/s of random access, and acting as a *very* hard limit on current synapse performance.  We’re currently in the process of evaluating SSD-backed IO for the DB (in fact, we’re already running a DB slave), and assuming this tests out okay we’re hoping to migrate next week, which should give us a 10x-20x speed-up on disk IO and buy considerable headroom.  Watch this space for details.
  3. Make synapse faster.  We’re continuing to plug away at optimisations (e.g. stuff like this), but these are reaching the point of diminishing returns, especially relative to the win from faster hardware.
  4. Fix the end-to-end monitoring.  This already happened.
  5. Load-test before deploying.  This is hard, as you really need to test against the same traffic profile as live traffic, which is difficult to simulate (see the sketch after this list).  We’re thinking about ways of fixing this, but the best solution is probably going to be clustering and being able to do incremental redeploys to gradually test new changes.  On which note:
  6. Fix synapse’s architectural deficiencies to support clustering, allowing for rolling zero-downtime redeploys, and better horizontal scalability to handle traffic spikes like this.  We’re choosing not to fix this in synapse, but we are currently in full swing implementing dendrite as a next-generation homeserver in Golang, architected from the outset for clustering and horizontal scalability.  N.B. most of the exciting stuff is happening on feature branches and gomatrixserverlib atm. Also, we’re deliberately taking the time to try to get it right this time, unlike bits of synapse which were something of a rush job.  It’ll be a few weeks before dendrite is functional enough to even send a message (let alone finish the implementation), but hopefully faster hardware will give the synapse deployment on matrix.org enough headroom for us to get dendrite ready to take over when the time comes!
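
On the load-testing point above: one cheap way to approximate the live traffic profile is to replay a captured request log against a staging server, preserving the original inter-request timing.  A minimal sketch under stated assumptions – a hypothetical “<unix_timestamp> <path>” log format, GETs only; real synapse traffic would also need access tokens and POST bodies:

```python
import time
import requests

def replay_log(log_path, base_url):
    """Replay captured GET requests against a staging server,
    keeping the original spacing so the traffic profile matches live."""
    session = requests.Session()
    prev_ts = None
    with open(log_path) as f:
        for line in f:
            ts_str, path = line.split(None, 1)
            ts, path = float(ts_str), path.strip()
            if prev_ts is not None:
                time.sleep(max(0.0, ts - prev_ts))  # honour original gaps
            prev_ts = ts
            resp = session.get(base_url + path)
            print(resp.status_code, path)

# e.g. replay_log("sync_requests.log", "https://staging.example.com")
```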

The good news of course is that you can run your own synapse today to avoid getting caught up in this operational fun & games, and unless you’re planning to put tens of thousands of daily active users on the server you should be okay!

Meanwhile, please accept our apologies for the instability, and be assured that we’re doing everything we can to get out of this turbulence as rapidly as possible.

Matthew


Synapse 0.19.1 released

Hi folks,

We’re a little late with this, but Synapse 0.19.1 was released last week. The only change is a bugfix to a regression in room state replication that snuck in during the performance improvements that landed in 0.19.0. Please upgrade if you haven’t already. We’ve also fixed the Debian repository to make installing Synapse easier on Jessie by including backported packages for stuff like Twisted where we’re forced to use the latest releases.

You can grab it from https://github.com/matrix-org/synapse/ as always.

Changes in synapse v0.19.1 (2017-02-09)

  • Fix bug where state was incorrectly reset in a room when synapse received an event over federation that did not pass auth checks (PR #1892)

FOSDEM 2017 report

Hi all,

FOSDEM this year was even more crazy and incredible than ever – with attendance up from 6,000 to 9,000 folks, it’s almost impossible to describe the atmosphere. Matt Jordan from Asterisk describes it as DisneyWorld for OSS Geeks, but it’s even more than that: it’s basically a corporeal representation of the whole FOSS movement.  There is no entrance fee; there is no intrusive sponsorship; there is no corporate presence: it’s just a venue for huge numbers of FOSS projects and their users and communities to come together in one place (the Université Libre de Bruxelles) and talk and learn.  Imagine if someone built a virtual world with storefronts for every open source project imaginable, where you could chat to the core team, geek out with other users, or gather in auditoriums to hear updates on the latest projects & ideas.  Well, this is FOSDEM… except even better, it’s in real life.  With copious amounts of Belgian beer.

Anyway: this year we had our normal stand on the 2nd floor of the K building, sharing the Realtime Lounge chill-out space with the XMPP Standards Foundation.  We had a larger representation than ever before, with Matthew, Erik and Luke from the London team as well as Manu & Yannick from Rennes – which is just as well given that all 5 of us ended up speaking literally non-stop from 10am to 6pm on both Saturday & Sunday (and then into the night as proceedings deteriorated/evolved into an impromptu Matrix meetup with Coffee, uhoreg, tadzik, realitygaps and others!).  The level of interest at the Matrix booth was frankly phenomenal: a major change from the last two FOSDEMs, in that this year pretty much everyone had already heard of Matrix, and mostly wanted to enthuse about features and bugs in Synapse or Riot, geek out about writing new bridges/bots/clients, or work out a way to incorporate Matrix into their own projects or companies.

Synapse 0.19 and Riot 0.9.7 were also released on Saturday, to try to ensure that anyone joining Matrix for the first time at FOSDEM was on the latest & greatest code – especially given the performance and E2E fixes present in both.  Amazingly the last-minute release didn’t backfire: if you haven’t upgraded to Synapse 0.19 we recommend doing so asap.  And if you’re a Riot user, make sure you’re on the latest version :)

We were very lucky to have two talks accepted this year: the main one in the Security Track on the Janson main stage, telling the tale of how we added end-to-end encryption to Matrix via Olm & Megolm – and the other in the Decentralised Internet room (AW1.125), focusing on the unsolved future problems of decentralised accounts, identity and reputation in Matrix.  Both talks were well attended, with huge queues for the Decentralised Internet room: we can only apologise to everyone who queued for 20+ minutes only to still not be able to get in.  Hopefully next year FOSDEM will allocate a larger room for decentralisation!  On the plus side, this year FOSDEM did an amazing job of videoing the sessions – livestreaming every talk, and automatically publishing the recordings (via a fantastic ‘publish your own talk’ web interface) – so many of the people who couldn’t get into the room (as well as the rest of the world) were able to watch live via the stream anyway.

You can watch the videos of the talks from the FOSDEM website here and here.  Both talks necessarily include similar introductory exposition for folks unfamiliar with Matrix, so apologies for the duplication – also, the “future of decentralised communication” talk ended up a bit rushed; 20 minutes is not a lot of time in which to both explain Matrix and give an overview of the challenges we face in fixing spam, identity, moderation etc.  But if you like hearing overenthusiastic people talking too fast about how amazing Matrix is, you may wish to check out the videos :)  You can also get the slides as PDF here (E2E Encryption) and here (Future of Decentralisation).

Huge thanks to everyone who came to the talks, or came and spoke to us at the stand or around the campus.  We had an amazing time, and are already looking forward to next year!

Matthew & the team

When Ericsson discovered Matrix…

As something completely different, we’ve invited Stefan Ålund and his team at Ericsson to write a guest blog post about the really cool stuff Ericsson is doing with Matrix.  This is a fascinating glimpse into how major folks are already launching commercial products on top of Matrix – whilst also making significant contributions back to the projects and the community.  We’d like to thank Stefan and Ericsson for all their support and perseverance, and we wish them the very best with the Ericsson Contextual Communication Cloud!

— Matthew

At the end of 2014, my colleague Adam Bergkvist and I attended the WebRTC Expo in Paris, partly to promote our Open Source project OpenWebRTC, but also to meet the rest of the European WebRTC community and see what others were working on.

At Ericsson Research we had been working on WebRTC for quite some time – not only on client-side frameworks and how those could enable some truly experimental stuff, but more importantly on how this emerging technology could be used to build new kinds of communication services, where communication is not *the* service (A calling B) but is integrated as part of some other service or context. A simple example would be a health care solution, where the starting point could be the patient records, with communication technologies integrated to enable remote discussions between patients and their doctors.

Our research in this area, which we started calling “contextual communication”, pointed in a different direction from Ericsson’s traditional communication business, making it hard for us to transfer our ideas and technologies out of Ericsson Research. We increasingly had the feeling that we needed to build something new and start from a clean slate, so to speak.

Some of our guiding principles:

  • Flexibility – the communication should be able to integrate anywhere
  • Fast iterations – browsers and WebRTC are moving targets
  • Open – interoperability is important for communication systems
  • Low cost – operations for the core communication should approach 0
  • Trust – build on the Ericsson brand and technology leadership

We had a pretty good idea about what we wanted to build, but even though Ericsson is a big company, the team working in this area was relatively small and also had a number of other commitments that we couldn’t abandon.

I think that is enough background; let’s circle back to the WebRTC Expo and the reason why I am writing this post on the Matrix blog.

Adam and I were pretty busy in our booth talking to people and giving demos, so we actually missed Matrix winning the Best Innovation Award. Nonetheless, we eventually got some time to walk around, and I started chatting with Matthew and Amandine, who were manning the Matrix booth. Needless to say, I was really impressed with their vision and what they wanted to build. The comparison to email, and how they wanted to make it possible to build interoperable bridges between communication “islands”, all in an open (source) manner, really appealed to me.

To be honest, the altruistic aspects of decentralising communication were not the most important part for us, even if we were sympathetic to the cause, working for a company that was founded on “the belief that communication is a basic human need”. We ultimately wanted to build a new kind of communication offering from Ericsson, and it looked like Matrix might be able to play a part in that.

I had recently hired a couple of interns, and as soon as I came back from Paris we set them to work evaluating Matrix. They were quickly able to port an existing WebRTC service (developed and used in-house) to use Matrix signalling and user management. We initially had some concerns about the maturity of the reference homeserver implementation (remember, this was almost 2 years ago) and we didn’t want to start developing our own since we were still a small team. However, Matthew and the rest of the Matrix team worked closely with us, helping to answer all our (dumb) questions, and we finally got to the point where we had the confidence to say “screw it, let’s try this and see if it flies”. 😄
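
For the curious, porting WebRTC signalling to Matrix is conceptually straightforward: SDP offers/answers and ICE candidates travel as ordinary Matrix room events (m.call.invite, m.call.answer, m.call.candidates).  A minimal sketch of sending an offer via the client-server API – with hypothetical credentials and IDs, and error handling omitted, so a flavour of the approach rather than Ericsson’s actual SDK code:

```python
import json
import uuid
import requests

def send_call_invite(homeserver, access_token, room_id, sdp_offer):
    """Send a WebRTC SDP offer into a Matrix room as an m.call.invite event."""
    txn_id = uuid.uuid4().hex  # transaction ID makes retries idempotent
    url = (
        f"{homeserver}/_matrix/client/r0/rooms/{room_id}"
        f"/send/m.call.invite/{txn_id}"
    )
    content = {
        "call_id": uuid.uuid4().hex,
        "version": 0,
        "lifetime": 60000,  # how long (in ms) the invite remains valid
        "offer": {"type": "offer", "sdp": sdp_offer},
    }
    resp = requests.put(url, params={"access_token": access_token},
                        data=json.dumps(content))
    resp.raise_for_status()
    return resp.json()["event_id"]
```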

Ericsson had recently launched the Ericsson Garage, where employees could pitch ideas for how to build new business. So we decided to give the process a try and presented an idea for how Ericsson could start selling contextual communication as-a-Service, directly to enterprises that wanted help integrating communication into their business processes but didn’t necessarily have the competence or business interest to run their own communication services. We got accepted and moved (physically) out of Research to sit in the Garage for the next 4 months, developing an MVP.

Since the primary interface to our offering would be through SDKs on various platforms, we decided early on to develop our own. The SDKs implemented the standard Matrix specification, but we put a lot of time into increasing the robustness and flexibility of the WebRTC call handling, and eventually added peer-to-peer data and collaboration features on top of the secure WebRTC DataChannel. On the server side, our initial concerns about Synapse were eventually removed completely, as the Matrix team relentlessly kept fixing performance issues, patching security holes and providing a story on how to scale. Over the years we have contributed several patches to Synapse (SAML auth and auth improvements; application service improvements) and provided input to the Matrix specification. We have always found the Matrix team to be very inclusive and easy to work with.

The project graduated successfully from the Ericsson Garage and moved into Ericsson’s Business Unit IT & Cloud Products, where we started to increase the size of the team, and just last month we signed a contract with our first customer. We call the solution Ericsson Contextual Communication Cloud, or ECCC for short; at a high level it can be summarised by the following picture:

[Image: ECCC in a nutshell]

If you are interested in ECCC, feel free to reach out at https://discuss.c3.ericsson.net

As with any project developed in the open, it is essential to have a healthy community around it. We have received excellent support from the Matrix project and they have always been open for discussion, engaged our developers and listened to our needs. We depend on Matrix now and we see great potential for the future. We hope that others will adopt the technology and help make the community grow even stronger.

– Stefan Ålund and the Ericsson ECCC Team

SSL Issues With Chromium

It’s been brought to our attention that some users are unable to connect to matrix.org and riot.im due to an SSL error, failing with “NET::ERR_CERTIFICATE_TRANSPARENCY_REQUIRED”. The cause is the Chromium bug detailed at https://bugs.chromium.org/p/chromium/issues/detail?id=664177

In short, older versions of Chrome / Chromium (including Chromium v53, which is the default in Ubuntu) will refuse to make SSL connections to matrix.org or riot.im, because they are unable to verify that the certificates are in the certificate transparency log. This happens because the build of Chromium is over 10 weeks old, which means it now considers its certificate transparency information to be stale.

This issue is affecting all sites using certificates signed by Symantec and its subsidiaries (which includes amazon.com).
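
If you run a site and want to check whether you’re affected, you can inspect who signed your certificate.  A quick stdlib-only Python sketch (illustrative only – it reports the issuer, and doesn’t check certificate transparency itself):

```python
import socket
import ssl

def cert_issuer(host, port=443):
    """Return the issuer organisation of a site's TLS certificate."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port)) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    # 'issuer' is a tuple of RDNs, each a tuple of (name, value) pairs.
    issuer = dict(rdn[0] for rdn in cert["issuer"])
    return issuer.get("organizationName")

print(cert_issuer("matrix.org"))  # a Symantec subsidiary at the time of writing
```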

There’s little we can do about this, short of completely changing our SSL certificate provider, but for users it should be fairly easy to work around by updating to a newer version of Chromium (which may be as simple as restarting the browser).

Update: see also https://sslmate.com/blog/post/ct_redaction_in_chrome_53 and https://news.ycombinator.com/item?id=12953172 (top of HN right now)