Setup

To trigger the TLS encryption we must first configure the socket. This is done using setsockopt() with the SOL_TLS level:

        /* kTLS requires the "tls" ULP to be attached to the TCP socket first. */
        if (setsockopt(sock, SOL_TCP, TCP_ULP, "tls", sizeof("tls")) < 0)
                err(1, "TCP_ULP");

        static struct tls12_crypto_info_aes_ccm_128 crypto_info;
        crypto_info.info.version = TLS_1_2_VERSION;
        crypto_info.info.cipher_type = TLS_CIPHER_AES_CCM_128;
        /* key, iv, salt and rec_seq are left zeroed in this snippet */

        if (setsockopt(sock, SOL_TLS, TLS_TX, &crypto_info, sizeof(crypto_info)) < 0)
                err(1, "TLS_TX");

This syscall triggers the allocation of the TLS context objects that will become important later, during the exploitation phase.

In the KernelCTF config, PCRYPT (the parallel crypto engine) is disabled, so our only option to trigger async crypto is CRYPTD (the software async crypto daemon).

Each crypto operation needed for TLS is usually implemented by multiple drivers. For example, AES encryption in CBC mode is available through aesni_intel, aes_generic or cryptd (which is a daemon that runs these basic synchronous crypto operations in parallel using an internal queue).

Available drivers can be examined by looking at /proc/crypto; however, it only lists the drivers of currently loaded modules. The Crypto API supports loading additional modules on demand.

As seen in the code snippet above, we don't have direct control over which crypto drivers are going to be used for our TLS encryption. Drivers are selected automatically by the Crypto API based on the priority field, which is calculated internally to choose the "best" driver.

By default, cryptd is not selected and is not even loaded, which gives us no chance to exploit vulnerabilities in async operations.

However, we can cause cryptd to be loaded and influence the selection of drivers for TLS operations by using the Crypto User API. This API is used to perform low-level cryptographic operations and allows the user to select an arbitrary driver.

The interesting thing is that requesting a given driver permanently changes the system-wide list of available drivers and their priorities, affecting future TLS operations.

The following code causes the AES-CCM encryption selected for TLS to be handled by cryptd:

        struct sockaddr_alg sa = {
                .salg_family = AF_ALG,
                .salg_type = "skcipher",
                .salg_name = "cryptd(ctr(aes-generic))"
        };
        int c1 = socket(AF_ALG, SOCK_SEQPACKET, 0);

        if (bind(c1, (struct sockaddr *)&sa, sizeof(sa)) < 0)
                err(1, "af_alg bind");

        struct sockaddr_alg sa2 = {
                .salg_family = AF_ALG,
                .salg_type = "aead",
                .salg_name = "ccm_base(cryptd(ctr(aes-generic)),cbcmac(aes-aesni))"
        };

        if (bind(c1, (struct sockaddr *)&sa2, sizeof(sa2)) < 0)
                err(1, "af_alg bind");

What we start with and what we can do

If we win the race condition, the vulnerability gives us a limited write primitive. To be exact, it gives us the ability to change an 8-bit value from 1 to 0 at offset 0x158 of the struct tls_sw_context_rx object, which is allocated from the general kmalloc-512 cache.

The big problem is finding a victim object in which this limited write gives us the ability to escalate privileges or at least get a better exploitation primitive.

Victim object

We had no success looking for victim objects in kmalloc-512 itself, so we had to turn our attention to objects from other caches, even though this requires a cross-cache attack.

The only object we were able to find is ipcomp_tfms:

struct ipcomp_tfms {
        struct list_head           list;                 /*     0  0x10 */
        struct crypto_comp * *     tfms;                 /*  0x10   0x8 */
        int                        users;                /*  0x18   0x4 */

        /* size: 32, cachelines: 1, members: 3 */
};

This is used in XFRM code. Changing the reference counter 'users' from 1 to 0 gives us a use-after-free.

Unfortunately, only one object can be created for the whole system, so there is no way to spray the whole page with these objects.

There are 128 possible positions of this object in a kmalloc-32 slab and 16 positions of the rx context in a kmalloc-512 slab.

Only a few of these combinations align with the 0x158 offset giving us a chance to perform the attack.

Target: 0x158 (base: 0x0) victim(ipcomp_tfms): 0x158 (base: 0x140)
Target: 0x358 (base: 0x200) victim(ipcomp_tfms): 0x358 (base: 0x340)
Target: 0x558 (base: 0x400) victim(ipcomp_tfms): 0x558 (base: 0x540)
Target: 0x758 (base: 0x600) victim(ipcomp_tfms): 0x758 (base: 0x740)
Target: 0x958 (base: 0x800) victim(ipcomp_tfms): 0x958 (base: 0x940)
Target: 0xb58 (base: 0xa00) victim(ipcomp_tfms): 0xb58 (base: 0xb40)
Target: 0xd58 (base: 0xc00) victim(ipcomp_tfms): 0xd58 (base: 0xd40)
Target: 0xf58 (base: 0xe00) victim(ipcomp_tfms): 0xf58 (base: 0xf40)

Another issue is that kmalloc-32 uses order 0 pages, while kmalloc-512 uses order 1.

This means we not only have to release the slab page back to the page allocator, but also move it from the PCP list to the buddy allocator and arrange the allocator state so that this order-1 page is used to satisfy an order-0 request.

All these issues combined resulted in a very unreliable exploit; however, it was reliable enough to eventually get the flag.

Triggering use-after-free through race condition

        spin_lock_bh(&ctx->decrypt_compl_lock);
        if (!atomic_dec_return(&ctx->decrypt_pending))
[1]                complete(&ctx->async_wait.completion);
[2]        spin_unlock_bh(&ctx->decrypt_compl_lock);
}

To exploit the race condition we have to hit the window between lines [1] and [2] and perform the following actions:

  1. Close the socket to free the tls context (struct tls_sw_context_rx), leading to the slab page being discarded.
  2. Allocate the victim object (ipcomp_tfms) in place of the tls context, so the pending spin_unlock_bh() corrupts it.

To hit this small window and extend it enough to fit our allocations we turn to a well-known timerfd technique invented by Jann Horn. The basic idea is to set hrtimer based timerfd to trigger a timer interrupt during our race window and attach a lot (as many as RLIMIT_NOFILE allows) of epoll watches to this timerfd to make the time needed to handle the interrupt longer. For more details see the original blog post.

Exploitation is done in two threads: the main process runs on CPU 0, and a new thread (child_recv()) is cloned for each attempt and bound to CPU 1.

        CPU 0                                             CPU 1
        allocate tls context                              -
        -                                                 exploit calls recv() triggering async crypto ops
        -                                                 tls_sw_recvmsg() waits on completion
        -                                                 cryptd calls tls_decrypt_done()
        -                                                 tls_decrypt_done() finishes complete() call
        -                                                 timer interrupts tls_decrypt_done()
        recv() returns to userspace, unlocking the socket timerfd code goes through all epoll notifications
        exploit calls close() to free tls context         ...
        exploit allocates the victim object in place      ...
        of the tls context
        -                                                 interrupt finishes and returns control to tls_decrypt_done()
        -                                                 spin_unlock_bh() performs the corrupting write

Ensuring the slab page is discarded

struct tls_sw_context_rx is allocated from kmalloc-512. This cache uses order-1 slabs storing 16 objects each. To ensure the slab is discarded we have to meet the same requirements as in a cross-cache attack:

  • all objects in the same slab as tls_sw_context_rx must be freed. All neighbouring objects are xattrs from the same kmalloc-512 cache and are freed before starting the race condition, which freezes the slab and puts it on a per cpu partial list
  • per cpu partial list must be full to unfreeze the slab after tls context is freed
  • per node partial list must also be full for the slab to be discarded instead of moved to the per node list

All these requirements are met before tls context is freed by freeing enough kmalloc-512 xattrs.

Moving the order-1 page from PCP to buddy allocator

If we free more pages than the 'high' limit of the given PCP list, a batch of pages will be released back to the buddy allocator:

        if (pcp->count >= high) {
                int batch = READ_ONCE(pcp->batch);

                free_pcppages_bulk(zone, nr_pcp_free(pcp, high, batch), pcp);
        }
}

To be able to do this efficiently in the race condition window, we free pages exactly up to the limit, so that the discard of the slab page immediately triggers free_pcppages_bulk(). The information we need about the current state of the PCP comes from reading /proc/zoneinfo.

Allocating an order 1 page

As long as there are no free order-0 pages available, the buddy allocator will split the order-1 page that was recently moved from the PCP and return half of it.

We just have to allocate enough objects from an order-0 slab cache like kmalloc-256; but if we allocate too many, the buddy allocator will split some higher-order pages and the order-0 count might increase instead.

Fortunately, we can parse /proc/buddyinfo to get the zone counts we need.

Triggering the use-after-free after 'users' field change

At this point our users field was changed from 1 to 0 (this is stage2() in the exploit).

This field is a reference counter, but doesn't use the refcount_t type, so there are no protections against invalid values.

The code that checks whether the object is still in use is very simple:

static void ipcomp_free_tfms(struct crypto_comp * __percpu *tfms)
{
        struct ipcomp_tfms *pos;
        int cpu;

        list_for_each_entry(pos, &ipcomp_tfms_list, list) {
                if (pos->tfms == tfms)
                        break;
        }

        WARN_ON(list_entry_is_head(pos, &ipcomp_tfms_list, list));

[1]        if (--pos->users)
                return;

        list_del(&pos->list);
        kfree(pos);

        if (!tfms)
                return;

        for_each_possible_cpu(cpu) {
                struct crypto_comp *tfm = *per_cpu_ptr(tfms, cpu);
                crypto_free_comp(tfm);
        }

}

If 'users' equals 1, the decrement at [1] brings it to 0 and the object is freed.

Right now our counter is at 0, but we can just allocate another XFRM SA to bring it back to 1 and then perform the delete, freeing the object while it is still in use.

Getting RIP control

When ipcomp_tfms is freed, the whole crypto context is freed as well, including the struct crypto_alg, which contains a struct compress_alg:

struct compress_alg {
        int (*coa_compress)(struct crypto_tfm *, const u8 *,
                            unsigned int, u8 *,
                            unsigned int *);               /*     0   0x8 */
        int (*coa_decompress)(struct crypto_tfm *, const u8 *,
                              unsigned int, u8 *,
                              unsigned int *);             /*   0x8   0x8 */

        /* size: 16, cachelines: 1, members: 2 */
};

These function pointers are called to compress/decompress network data on sockets configured with XFRM ipcomp.

If we allocate our payload in place of this object, we can trigger code execution by calling sendmsg() on our XFRM socket.

Pivot to ROP

At this point RSI contains a pointer to our data, so we only need two gadgets to pivot to ROP:

push rsi
jmp qword ptr [rsi+0xf]

and

pop rsp

Second pivot

At this point we have a full ROP chain and enough space available, but our standard privilege escalation payload relies on the ROP chain being at a known location. We therefore choose an unused read/write area in the kernel and use copy_user_generic_string() to copy the second-stage ROP chain from userspace to that area, then pivot there with a pop rsp ; ret gadget.

Privilege escalation

This time execution happens in the context of a syscall, so it is easy to escalate privileges with the standard commit_creds(init_cred); switch_task_namespaces(pid, init_nsproxy); sequence and return to a root shell.