Synapse 0.19 is here, just in time for FOSDEM!

Hi all,

We’re happy to announce the release of Synapse 0.19.0 (same as 0.19.0-rc4) today, just in time for anyone discovering Matrix for the first time at FOSDEM 2017!  In fact, here’s Erik doing the release right now (with moral support from Luke):

[Image: release]

This is a pretty big release, with a bunch of new features and lots and lots of debugging and optimisation work following on from some of the dramas that we had with 0.18 over the Christmas break.  The biggest things are:

  • IPv6 Support (unless you have an IPv6 only resolver), thanks to contributions from Glyph from Twisted and Kyrias!
  • A new API for tracking the E2E devices present in a room (required for fixing many of the remaining E2E bugs…)
  • A rewrite of the ‘state resolution’ algorithm to be orders of magnitude more performant
  • Lots of tuning to the caching logic.

If you’re already running a server, please upgrade!  And if you’re not, go grab yourself a brand new Synapse from GitHub. Debian packages will follow shortly (as soon as Erik can figure out the necessary backporting required for Twisted 16.6.0).

And here’s the full changelog…

 

Changes in synapse v0.19.0 (2017-02-04)

No changes since RC 4.

Changes in synapse v0.19.0-rc4 (2017-02-02)

  • Bump cache sizes for common membership queries (PR #1879)

Changes in synapse v0.19.0-rc3 (2017-02-02)

  • Fix email push in pusher worker (PR #1875)
  • Make presence.get_new_events a bit faster (PR #1876)
  • Make /keys/changes a bit more performant (PR #1877)

Changes in synapse v0.19.0-rc2 (2017-02-02)

  • Include newly joined users in /keys/changes API (PR #1872)

Changes in synapse v0.19.0-rc1 (2017-02-02)

Features:

  • Add support for specifying multiple bind addresses (PR #1709, #1712, #1795, #1835). Thanks to @kyrias!
  • Add /account/3pid/delete endpoint (PR #1714)
  • Add config option to configure the Riot URL used in notification emails (PR #1811). Thanks to @aperezdc!
  • Add username and password config options for turn server (PR #1832). Thanks to @xsteadfastx!
  • Implement device lists updates over federation (PR #1857, #1861, #1864)
  • Implement /keys/changes (PR #1869, #1872)

Changes:

  • Improve IPv6 support (PR #1696). Thanks to @kyrias and @glyph!
  • Log which files we saved attachments to in the media_repository (PR #1791)
  • Linearize updates to membership via PUT /state/ to better handle multiple joins (PR #1787)
  • Limit number of entries to prefill from cache on startup (PR #1792)
  • Remove full_twisted_stacktraces option (PR #1802)
  • Measure size of some caches by sum of the size of cached values (PR #1815)
  • Measure metrics of string_cache (PR #1821)
  • Reduce logging verbosity (PR #1822, #1823, #1824)
  • Don’t clobber a displayname or avatar_url if provided by an m.room.member event (PR #1852)
  • Better handle 401/404 response for federation /send/ (PR #1866, #1871)
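The idea behind measuring a cache by the sum of the sizes of its cached values, rather than by its number of entries (PR #1815), can be sketched roughly as follows. This is a minimal illustration and not Synapse's actual cache code: the class name `SizedLruCache`, its methods and the choice of `len` as the size function are all assumptions made for the example.

```python
from collections import OrderedDict


class SizedLruCache:
    """Hypothetical LRU cache bounded by the total size of its values."""

    def __init__(self, max_size, size_of=len):
        self.max_size = max_size      # budget for the summed value sizes
        self.size_of = size_of        # how to measure one cached value
        self._data = OrderedDict()    # oldest entry first
        self._size = 0                # running sum of value sizes

    @property
    def size(self):
        return self._size

    def __len__(self):
        # Number of entries, which is NOT what bounds this cache.
        return len(self._data)

    def get(self, key, default=None):
        if key not in self._data:
            return default
        self._data.move_to_end(key)   # mark as most recently used
        return self._data[key]

    def set(self, key, value):
        if key in self._data:
            # Replacing a value: give back the old value's size first.
            self._size -= self.size_of(self._data.pop(key))
        self._data[key] = value
        self._size += self.size_of(value)
        # Evict least recently used entries until we fit the budget again.
        while self._size > self.max_size and self._data:
            _, evicted = self._data.popitem(last=False)
            self._size -= self.size_of(evicted)


cache = SizedLruCache(max_size=5)
cache.set("a", [1, 2, 3])   # total size 3
cache.set("b", [4, 5])      # total size 5
cache.set("c", [6])         # total size would be 6, so "a" is evicted
```

With an entry-count limit, one entry holding a very large list costs the same as an entry holding a single item; summing value sizes makes the limit track memory use more closely.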

Fixes:

  • Fix ability to change password to a non-ascii one (PR #1711)
  • Fix push getting stuck due to looking at the wrong view of state (PR #1820)
  • Fix email address comparison to be case insensitive (PR #1827)
  • Fix occasional inconsistencies of room membership (PR #1836, #1840)

Performance:

  • Don’t block messages sending on bumping presence (PR #1789)
  • Change device_inbox stream index to include user (PR #1793)
  • Optimise state resolution (PR #1818)
  • Use DB cache of joined users for presence (PR #1862)
  • Add an index to make membership queries faster (PR #1867)

Matrix.org homeserver outage (25th Jan 2017)

Hi folks,

As many will have noticed there was a major outage on the Matrix homeserver for matrix.org last night (UK-time). This impacted anyone with an account on the matrix.org server, as well as anyone using matrix.org-hosted bots & bridges. As Matrix rooms are shared over all participants, rooms with participants on other servers were unaffected (for users on those servers). Here’s a quick explanation of what went wrong (times are UTC):

  • 2017-01-24 16:00 – We notice that we’re badly running out of diskspace on the matrix.org backup postgres replica. (Turns out the backup box, whilst identical hardware to the master, had been built out as RAID-10 rather than RAID-5 and so has less disk space).
  • 2017-01-24 17:00 – We decide to drop a large DB index: event_push_actions(room_id, event_id, user_id, profile_tag), which was taking up a disproportionate amount of disk space, on the basis that it didn’t appear to be being used according to the postgres stats. All seems good.
  • 2017-01-24 ~23:00 – The core matrix.org team go to bed.
  • 2017-01-24 23:33 – Someone redacts an event in a very active room (probably #matrix:matrix.org) which necessitates redacting the associated push notification from the event_push_actions table. This takes out a lock within persist_event, which is then blocked on deleting the push notification. It turns out that this deletion requires the missing DB index, causing the query to run for hours whilst holding the transaction lock. The symptoms are that anything reading events from the DB was blocked on the transaction, causing messages not to be relayed to other clients or servers despite appearing to send correctly. Meanwhile, the fact that events are being received by the server fine (including over federation) makes the monitoring graphs look largely healthy.
  • 2017-01-24 23:35 – End-to-end monitoring detects problems, and sends alerts into PagerDuty and various Matrix rooms. Unfortunately, we’d failed to upgrade the PagerDuty trial into a paid account a few months ago, so the alerts are lost.
  • 2017-01-25 08:00 – Matrix team starts to wake up and spot problems, but confusion over the right escalation process (especially with Matthew on holiday) means folks assume that other members of the team must already be investigating.
  • 2017-01-25 09:00 – Server gets restarted, service starts to resume, although box suffers from load problems as traffic tries to catch up.
  • 2017-01-25 09:45 – Normal service on the homeserver itself is largely resumed (other than bridges; see below)
  • 2017-01-25 10:41 – Root cause established and the redaction path is patched on matrix.org to stop a recurrence.
  • 2017-01-25 11:15 – Bridges are seen to be lagging and taking much longer to recover than expected. Decision made to let them continue to catch up normally rather than risk further disruption (e.g. IRC join/part spam) by restarting them.
  • 2017-01-25 13:00 – All hosted bridges returned to normal.

Obviously this is rather embarrassing, and a huge pain for everyone using the matrix.org homeserver – many apologies indeed for the outage. On the plus side, all the other Matrix homeservers out there carried on without noticing any problems (which actually complicated spotting that things had broken, given many of the core team primarily use their personal homeservers).

In some ways the root cause here is that the core team has been focusing all its energy recently on improving the overall Matrix codebase rather than operational issues on matrix.org itself, and as a result our ops practices have fallen behind (especially as the health of the Matrix ecosystem as a whole is arguably more important than the health of a single homeserver deployment). However, we clearly need to improve things here given the number of people (>750K at the last count) dependent on the Matrix.org homeserver and its bridges & bots.

Lessons learnt on our side are:

  • Monitoring graphs & thresholds on all the right things are not enough on their own: monitoring alerts actually have to be routed somewhere useful, i.e. phone calls to the team’s phones. PagerDuty is now set up and running properly to this end.
  • Make sure that people know to wake up the right people anyway if the monitoring alerting system fails.
  • To be even more paranoid about hotfixes to production at 5pm, especially if they can wait ’til the next day (as this one could have).
  • To investigate ways to rapidly recover bridges without causing unnecessary disruption.

Apologies again to everyone who was bitten by this – we’re doing everything we can to ensure it doesn’t happen again.

Matthew & the team.

Synapse 0.18.7 is out – Please upgrade, especially if on 0.18.5 or 0.18.6.

Hi all,

TL;DR: Please upgrade to Synapse 0.18.7 – especially if you are on 0.18.5 or 0.18.6 which both have serious federation bugs.

Synapse 0.18.5 contained a really nasty regression in the federation code which causes servers to echo transactions that they receive back out to the other servers participating in a room. This has effectively resulted in a gradual amplification of federation traffic as more people have installed 0.18.5, causing every transaction to be received N times over where N is the number of servers in the room.
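The amplification described above can be made concrete with a toy model. This is not Synapse's federation code: the function name, the server names and the simplifying assumption that a buggy server echoes each transaction it receives directly from the origin exactly once (and that echoes are not echoed again) are all illustrative.

```python
from collections import Counter


def deliveries(servers, origin, buggy):
    """Count how often each server receives one transaction sent by origin.

    Toy model: every server in `buggy` re-sends a transaction it got
    directly from the origin to every other server in the room, once.
    """
    received = Counter()
    for dest in servers:
        if dest == origin:
            continue
        received[dest] += 1              # the legitimate delivery
        if dest in buggy:
            for other in servers:
                if other != dest:
                    received[other] += 1  # the echoed copy
    return received


room = ["a", "b", "c", "d", "e"]

healthy = deliveries(room, origin="a", buggy=set())
one_bad = deliveries(room, origin="a", buggy={"b"})
all_bad = deliveries(room, origin="a", buggy=set(room))

print(sum(healthy.values()), sum(one_bad.values()), sum(all_bad.values()))
```

In this model a healthy room of five servers sees 4 deliveries per transaction, a single buggy server doubles that to 8, and with every server buggy it reaches 20, so traffic per transaction grows roughly with the square of the room's server count as the bad release spreads.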

We’ll do a full write-up once we’re happy we’ve tracked down all the root problems here, but the short story is that this hit critical mass around Dec 26, where typical Synapses started to fail to keep up with the traffic – especially when requests hit some of the more inefficient or buggy codepaths in Synapse.  As servers started to overload with inbound connections, this in turn started to slow down and consume resources on the connecting servers – especially due to an architectural mistake in Synapse which blocks inbound connections until the request has been fully processed (which could require the receiving server in turn to make outbound connections), rather than releasing the inbound connection asap.  This hit the point that servers were running out of file descriptors due to all the outbound and inbound connections, at which point they started to entirely tarpit inbound connections, resulting in a slow feedback loop making the whole situation even worse.
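The difference between holding an inbound connection open until a request is fully processed and releasing it as soon as possible can be sketched with a small asyncio simulation. This is an illustration only: Synapse is built on Twisted rather than asyncio, and the function names, timings and the "peak open connections" counter are assumptions chosen for the example.

```python
import asyncio


async def process(txn_id, work_seconds):
    # Stand-in for slow processing, e.g. making outbound requests.
    await asyncio.sleep(work_seconds)


async def simulate(n_requests, work_seconds, arrival_gap, release_early):
    """Return the peak number of inbound connections open at once."""
    open_conns = 0
    peak = 0
    background = []

    async def handle(txn_id):
        nonlocal open_conns, peak
        open_conns += 1                      # inbound connection accepted
        peak = max(peak, open_conns)
        try:
            if release_early:
                # Hand the work off and respond straight away.
                task = asyncio.create_task(process(txn_id, work_seconds))
                background.append(task)
            else:
                # Keep the connection open until processing completes.
                await process(txn_id, work_seconds)
        finally:
            open_conns -= 1                  # inbound connection closed

    handlers = []
    for txn_id in range(n_requests):
        handlers.append(asyncio.create_task(handle(txn_id)))
        await asyncio.sleep(arrival_gap)     # next request arrives later
    await asyncio.gather(*handlers, *background)
    return peak


blocking_peak = asyncio.run(simulate(5, 0.2, 0.01, release_early=False))
early_peak = asyncio.run(simulate(5, 0.2, 0.01, release_early=True))
print(blocking_peak, early_peak)
```

When requests arrive faster than they can be processed, the blocking variant accumulates one open connection (and file descriptor) per in-flight request, while the early-release variant holds each connection only briefly. The trade-off, not modelled here, is that the sender gets its response before processing has finished.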

We’ve spent the last two weeks hunting all the individual inefficient requests which were mysteriously starting to cause more problems than they ever had before; then trying to understand the feedback misbehaviour; before finally discovering the regression in 0.18.5 as the plausible root cause of the problem.  Troubleshooting has been complicated by most of the team having unplugged for the holidays, and because this is the first (and hopefully last!) failure mode to be distributed across the whole network, making debugging something of a nightmare – especially when the overloading was triggering a plethora of different exotic failure modes.  Huge thanks to everyone who has shared their server logs with the team to help debug this.

Some of these failure modes are still happening (and we’re working on fixing them), but we believe that if everyone upgrades away from the bad 0.18.5 release most of the symptoms will go away, or at least go back to being as bad as they were before.  Meanwhile, if you find your server suddenly grinding to a halt after upgrading to 0.18.7 please come tell us in #matrix-dev:matrix.org.

We’re enormously sorry if you’ve been bitten by the federation instability this has caused – and many many thanks for your patience whilst we’ve hunted it down.  On the plus side, it’s given us a lot of *very* useful insight into how to implement federation in future homeservers to not suffer from any of these failure modes.  It’s also revealed the root cause of why Synapse’s RAM usage is quite so bad – it turns out that it actually idles at around 200MB with default caching, but there’s a particular codepath which causes it to spike temporarily by 1GB or so – and that RAM is then not released back to the OS.  We’re working on a fix for this too, but it’ll come after 0.18.7.

Unfortunately the original release of 0.18.6 still exhibits the root bug, but 0.18.7 (originally released as 0.18.7-rc2) should finally fix this.  Sorry for all the upgrades :(

So please upgrade as soon as possible to 0.18.7. Debian packages are available as normal.

thanks,

Matthew

Changes in synapse v0.18.7 (2017-01-09)

  • No changes from v0.18.7-rc2

Changes in synapse v0.18.7-rc2 (2017-01-07)

Bug fixes:

  • Fix error in rc1’s logic for discarding invalid inbound traffic, which was incorrectly discarding missing events

Changes in synapse v0.18.7-rc1 (2017-01-06)

Bug fixes:

  • Fix error in PR #1764 to actually fix the nightmare #1753 bug.
  • Improve deadlock logging further
  • Discard inbound federation traffic from invalid domains, to immunise against #1753

Changes in synapse v0.18.6 (2017-01-06)

Bug fixes:

  • Fix bug when checking if a guest user is allowed to join a room – thanks to Patrik Oldsberg (PR #1772)

Changes in synapse v0.18.6-rc3 (2017-01-05)

Bug fixes:

  • Fix bug where we failed to send ban events to the banned server (PR #1758)
  • Fix bug where we sent events that didn’t originate on this server to other servers (PR #1764)
  • Fix bug where processing an event from a remote server took a long time because we were making long HTTP requests (PR #1765, PR #1744)

Changes:

  • Improve logging for debugging deadlocks (PR #1766, PR #1767)

Changes in synapse v0.18.6-rc2 (2016-12-30)

Bug fixes:

  • Fix memory leak in twisted by initialising logging correctly (PR #1731)
  • Fix bug where fetching missing events took an unacceptable amount of time in large rooms (PR #1734)

Changes in synapse v0.18.6-rc1 (2016-12-29)

Bug fixes:

  • Make sure that outbound connections are closed (PR #1725)

matrix-appservice-irc 0.7.0 is out!

Also, we’ve just released a major update to the IRC bridge codebase after trialling it on the matrix.org-hosted bridges for the last few days.

The big news is:

  • The bridge uses Synapse 0.18.5’s new APIs for managing the public room list (improving performance a bunch)
  • Much faster startup using the new /joined_rooms and /joined_members APIs in Synapse 0.18.5
  • The bridge will now remember your NickServ password (encrypted at rest) if you want it to via the !storepass command
  • You can now set arbitrary user modes for IRC clients on connection (to mitigate PM spam etc)
  • After a split, the bridge will drop Matrix->IRC messages older than N seconds, rather than try to catch the IRC room up on everything they missed on Matrix :S
  • Operational metrics are now implemented using Prometheus rather than statsd
  • New !quit command to nuke your user from the remote IRC network
  • Membership list syncing for IRC->Matrix is enormously improved, and enabled for all matrix.org-hosted bridges apart from Freenode.  At last, membership lists should be in sync between IRC and Matrix; please let us know if they’re not.
  • Better error logging
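The "drop Matrix->IRC messages older than N seconds" behaviour from the list above boils down to an age check on each event. The sketch below is not the bridge's code (the bridge itself is not written in Python): the function name and the `max_age_seconds` parameter are hypothetical, and the only detail taken from Matrix is that events carry an `origin_server_ts` timestamp in milliseconds.

```python
import time


def should_forward(origin_server_ts_ms, max_age_seconds, now_ms=None):
    """Decide whether a Matrix event is fresh enough to relay to IRC.

    origin_server_ts_ms: the event's origin_server_ts, in milliseconds.
    max_age_seconds:     hypothetical cutoff (the "N seconds" above).
    now_ms:              current time in ms; defaults to the wall clock.
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    age_seconds = (now_ms - origin_server_ts_ms) / 1000.0
    return age_seconds <= max_age_seconds


# An event sent 20 seconds ago is relayed with a 30 second cutoff...
fresh = should_forward(1_000_000, 30, now_ms=1_020_000)
# ...but one sent 40 seconds ago is dropped.
stale = should_forward(1_000_000, 30, now_ms=1_040_000)
```

Dropping stale messages means that after a split the IRC side resumes with current conversation rather than receiving a burst of backlog.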

For full details, please see the changelog.

With things like NickServ-pass storing, !quit support and full bi-directional membership list syncing, it’s never been a better time to run your own IRC bridge.  Please install or upgrade today from https://github.com/matrix-org/matrix-appservice-irc!

Synapse 0.18.5 released!

Hi folks,

We released synapse 0.18.5 on Friday.  This is mainly about fixing performance problems with the unread room counts and the public room directory; polishing the E2E endpoints based on beta feedback; and general minor bits and bobs.

Get it whilst it’s (almost) hot from https://github.com/matrix-org/synapse!  Changelog follows:

Changes in synapse v0.18.5 (2016-12-16)

Bug fixes:

  • Fix federation /backfill returning events it shouldn’t (PR #1700)
  • Fix crash in url preview (PR #1701)

Changes in synapse v0.18.5-rc3 (2016-12-13)

Features:

  • Add support for E2E for guests (PR #1653)
  • Add new API for appservice-specific public room lists (PR #1676)
  • Add new room membership APIs (PR #1680)

Changes:

  • Enable guest access for private rooms by default (PR #653)
  • Limit the number of events that can be created on a given room concurrently (PR #1620)
  • Log the args that we have on UI auth completion (PR #1649)
  • Stop generating refresh_tokens (PR #1654)
  • Stop putting a time caveat on access tokens (PR #1656)
  • Remove unspecced GET endpoints for e2e keys (PR #1694)

Bug fixes:

  • Fix handling of 500s and 429s over federation (PR #1650)
  • Fix Content-Type header parsing (PR #1660)
  • Fix error when previewing sites that include unicode, thanks to @kyrias (PR #1664)
  • Fix some cases where we drop read receipts (PR #1678)
  • Fix bug where calls to /sync didn’t correctly timeout (PR #1683)
  • Fix bug where E2E key query would fail if a single remote host failed (PR #1686)

Changes in synapse v0.18.5-rc2 (2016-11-24)

Bug fixes:

  • Don’t send old events over federation, fixes bug in -rc1.

Changes in synapse v0.18.5-rc1 (2016-11-24)

Features:

  • Implement “event_fields” in filters (PR #1638)

Changes:

  • Use external ldap auth package (PR #1628)
  • Split out federation transaction sending to a worker (PR #1635)
  • Fail with a coherent error message if /sync?filter= is invalid (PR #1636)
  • More efficient notif count queries (PR #1644)

Synapse 0.18.4

Uncharacteristically, we’re actually remembering to announce a new release of Synapse!

Major performance fixes on federation, as well as the changes required to support E2E encrypted attachments (yay!)

Please install or upgrade from https://github.com/matrix-org/synapse :)

Changes in synapse v0.18.4 (2016-11-22)

Bug fixes:

  • Add workaround for buggy clients that fail to register (PR #1632)

Changes in synapse v0.18.4-rc1 (2016-11-14)

Changes:

  • Various database efficiency improvements (PR #1188, #1192)
  • Update default config to blacklist more internal IPs, thanks to Euan Kemp @euank (PR #1198)
  • Allow specifying duration in minutes in config, thanks to Daniel Dent @DanielDent (PR #1625)

Bug fixes:

  • Fix media repo to set CORS headers on responses (PR #1190)
  • Fix registration to not error on non-ascii passwords (PR #1191)
  • Fix create event code to limit the number of prev_events (PR #1615)
  • Fix bug in transaction ID deduplication (PR #1624)