* Re: [PATCH 1/1] io_uring/sqpoll: do not allow pinning outside of cpuset
2024-09-09 15:00 [PATCH 1/1] io_uring/sqpoll: do not allow pinning outside of cpuset Felix Moessbauer
@ 2024-09-09 15:08 ` Jens Axboe
2024-09-09 15:09 ` Jens Axboe
2024-09-18 3:56 ` Lai, Yi
2 siblings, 0 replies; 6+ messages in thread
From: Jens Axboe @ 2024-09-09 15:08 UTC (permalink / raw)
To: Felix Moessbauer
Cc: asml.silence, linux-kernel, io-uring, cgroups, dqminh, longman,
adriaan.schmidt, florian.bezdeka, stable
On 9/9/24 9:00 AM, Felix Moessbauer wrote:
> The submit queue polling threads are userland threads that just never
> exit to the userland. When creating the thread with IORING_SETUP_SQ_AFF,
> the affinity of the poller thread is set to the cpu specified in
> sq_thread_cpu. However, this CPU can be outside of the cpuset defined
> by the cgroup cpuset controller. This violates the rules defined by the
> cpuset controller and is a potential issue for realtime applications.
>
> In b7ed6d8ffd6 we fixed the default affinity of the poller thread, in
> case no explicit pinning is required by inheriting the one of the
> creating task. In case of explicit pinning, the check is more
> complicated, as also a cpu outside of the parent cpumask is allowed.
> We implemented this by using cpuset_cpus_allowed (that has support for
> cgroup cpusets) and testing if the requested cpu is in the set.
This also looks good to me.
> that's hopefully the last fix of cpu pinnings of the sq poller threads.
> However, there is more to come on the io-wq side. E.g the syscalls for
> IORING_REGISTER_IOWQ_AFF that can be used to change the affinites are
> not yet protected. I'm currently just lacking good reproducers for that.
> I also have to admit that I don't feel too comfortable making changes to
> the wq part, given that I don't have good tests.
Yep io-wq will have the same ignorance of cpu limits, so would need the
same love for when someone asks for specific cpus.
> While fixing this, I'm wondering if it makes sense to add tests for the
> combination of pinning and cpuset. If yes, where should these tests be
> added?
Yeah certainly add tests, liburing would be a good spot for that. That's
where the feature/regression/bug tests always go.
--
Jens Axboe
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH 1/1] io_uring/sqpoll: do not allow pinning outside of cpuset
2024-09-09 15:00 [PATCH 1/1] io_uring/sqpoll: do not allow pinning outside of cpuset Felix Moessbauer
2024-09-09 15:08 ` Jens Axboe
@ 2024-09-09 15:09 ` Jens Axboe
2024-09-18 3:56 ` Lai, Yi
2 siblings, 0 replies; 6+ messages in thread
From: Jens Axboe @ 2024-09-09 15:09 UTC (permalink / raw)
To: Felix Moessbauer
Cc: asml.silence, linux-kernel, io-uring, cgroups, dqminh, longman,
adriaan.schmidt, florian.bezdeka, stable
On Mon, 09 Sep 2024 17:00:36 +0200, Felix Moessbauer wrote:
> The submit queue polling threads are userland threads that just never
> exit to the userland. When creating the thread with IORING_SETUP_SQ_AFF,
> the affinity of the poller thread is set to the cpu specified in
> sq_thread_cpu. However, this CPU can be outside of the cpuset defined
> by the cgroup cpuset controller. This violates the rules defined by the
> cpuset controller and is a potential issue for realtime applications.
>
> [...]
Applied, thanks!
[1/1] io_uring/sqpoll: do not allow pinning outside of cpuset
commit: f011c9cf04c06f16b24f583d313d3c012e589e50
Best regards,
--
Jens Axboe
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH 1/1] io_uring/sqpoll: do not allow pinning outside of cpuset
2024-09-09 15:00 [PATCH 1/1] io_uring/sqpoll: do not allow pinning outside of cpuset Felix Moessbauer
2024-09-09 15:08 ` Jens Axboe
2024-09-09 15:09 ` Jens Axboe
@ 2024-09-18 3:56 ` Lai, Yi
2024-09-18 6:16 ` Jens Axboe
2 siblings, 1 reply; 6+ messages in thread
From: Lai, Yi @ 2024-09-18 3:56 UTC (permalink / raw)
To: Felix Moessbauer
Cc: axboe, asml.silence, linux-kernel, io-uring, cgroups, dqminh,
longman, adriaan.schmidt, florian.bezdeka, stable, syzkaller-bugs,
pengfei.xu, yi1.lai
Hi Felix Moessbauer,
Greetings!
I used Syzkaller and found that there is KASAN: use-after-free Read in io_sq_offload_create in Linux-next tree - next-20240916.
After bisection and the first bad commit is:
"
f011c9cf04c0 io_uring/sqpoll: do not allow pinning outside of cpuset
"
All detailed into can be found at:
https://github.com/laifryiee/syzkaller_logs/tree/main/240917_135250_io_sq_offload_create
Syzkaller repro code:
https://github.com/laifryiee/syzkaller_logs/blob/main/240917_135250_io_sq_offload_create/repro.c
Syzkaller repro syscall steps:
https://github.com/laifryiee/syzkaller_logs/blob/main/240917_135250_io_sq_offload_create/repro.prog
Syzkaller report:
https://github.com/laifryiee/syzkaller_logs/blob/main/240917_135250_io_sq_offload_create/repro.report
Kconfig(make olddefconfig):
https://github.com/laifryiee/syzkaller_logs/blob/main/240917_135250_io_sq_offload_create/kconfig_origin
Bisect info:
https://github.com/laifryiee/syzkaller_logs/blob/main/240917_135250_io_sq_offload_create/bisect_info.log
bzImage:
https://github.com/laifryiee/syzkaller_logs/raw/main/240917_135250_io_sq_offload_create/bzImage_7083504315d64199a329de322fce989e1e10f4f7
Issue dmesg:
https://github.com/laifryiee/syzkaller_logs/blob/main/240917_135250_io_sq_offload_create/7083504315d64199a329de322fce989e1e10f4f7_dmesg.log
"
[ 23.564898] ==================================================================
[ 23.565444] BUG: KASAN: use-after-free in io_sq_offload_create+0xcaa/0x11d0
[ 23.565971] Read of size 8 at addr ffff888036377898 by task repro/729
[ 23.566459]
[ 23.566593] CPU: 0 UID: 0 PID: 729 Comm: repro Not tainted 6.11.0-next-20240916-7083504315d6 #1
[ 23.567271] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
[ 23.568066] Call Trace:
[ 23.568252] <TASK>
[ 23.568417] dump_stack_lvl+0xea/0x150
[ 23.568718] print_report+0xce/0x610
[ 23.569001] ? io_sq_offload_create+0xcaa/0x11d0
[ 23.569340] ? kasan_addr_to_slab+0x11/0xb0
[ 23.569651] ? io_sq_offload_create+0xcaa/0x11d0
[ 23.569992] kasan_report+0xcc/0x110
[ 23.570277] ? io_sq_offload_create+0xcaa/0x11d0
[ 23.570621] kasan_check_range+0x3e/0x1c0
[ 23.570917] __kasan_check_read+0x15/0x20
[ 23.571212] io_sq_offload_create+0xcaa/0x11d0
[ 23.571540] ? __pfx_io_sq_offload_create+0x10/0x10
[ 23.571893] ? __pfx___lock_acquire+0x10/0x10
[ 23.572228] ? __this_cpu_preempt_check+0x21/0x30
[ 23.572580] ? lock_acquire.part.0+0x152/0x390
[ 23.572910] ? __this_cpu_preempt_check+0x21/0x30
[ 23.573254] ? lock_release+0x441/0x870
[ 23.573541] ? __pfx_lock_release+0x10/0x10
[ 23.573846] ? trace_lock_acquire+0x139/0x1b0
[ 23.574180] ? debug_smp_processor_id+0x20/0x30
[ 23.574524] ? rcu_is_watching+0x19/0xc0
[ 23.574826] ? __alloc_pages_noprof+0x517/0x710
[ 23.575171] ? __pfx___alloc_pages_noprof+0x10/0x10
[ 23.575526] ? mod_objcg_state+0x42c/0x9c0
[ 23.575838] ? lockdep_hardirqs_on+0x89/0x110
[ 23.576159] ? __sanitizer_cov_trace_switch+0x58/0xa0
[ 23.576534] ? policy_nodemask+0xf9/0x450
[ 23.576835] ? __sanitizer_cov_trace_const_cmp2+0x1c/0x30
[ 23.577220] ? alloc_pages_mpol_noprof+0x35d/0x580
[ 23.577575] ? __pfx_alloc_pages_mpol_noprof+0x10/0x10
[ 23.577950] ? __kmalloc_node_noprof+0x3a3/0x4e0
[ 23.578302] ? __kvmalloc_node_noprof+0x7f/0x240
[ 23.578645] ? alloc_pages_noprof+0xa9/0x180
[ 23.578963] ? __sanitizer_cov_trace_const_cmp8+0x1c/0x30
[ 23.579347] ? io_pages_map+0x244/0x5c0
[ 23.579631] io_uring_setup+0x18df/0x3950
[ 23.579936] ? __pfx_io_uring_setup+0x10/0x10
[ 23.580263] ? __audit_syscall_entry+0x39c/0x500
[ 23.580602] __x64_sys_io_uring_setup+0xa4/0x160
[ 23.580939] x64_sys_call+0x17f5/0x20d0
[ 23.581224] do_syscall_64+0x6d/0x140
[ 23.581498] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 23.581872] RIP: 0033:0x7efd9fa3ee5d
[ 23.582140] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 93 af 1b 00 f7 d8 64 89 01 48
[ 23.583404] RSP: 002b:00007ffdd4400858 EFLAGS: 00000202 ORIG_RAX: 00000000000001a9
[ 23.583938] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007efd9fa3ee5d
[ 23.584430] RDX: 00007efd9fb3f247 RSI: 0000000020000080 RDI: 0000000000005230
[ 23.584927] RBP: 00007ffdd4400860 R08: 00007ffdd44002d0 R09: 00007ffdd4400890
[ 23.585419] R10: 0000000000000000 R11: 0000000000000202 R12: 00007ffdd44009b8
[ 23.585914] R13: 0000000000401730 R14: 0000000000403e08 R15: 00007efd9fcb5000
[ 23.586424] </TASK>
[ 23.586589]
[ 23.586709] The buggy address belongs to the physical page:
[ 23.587094] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x36377
[ 23.587644] flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
[ 23.588103] raw: 000fffffc0000000 ffffea0000d8ddc8 ffffea0000d8ddc8 0000000000000000
[ 23.588636] raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
[ 23.589167] page dumped because: kasan: bad access detected
[ 23.589551]
[ 23.589670] Memory state around the buggy address:
[ 23.590007] ffff888036377780: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[ 23.590514] ffff888036377800: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[ 23.591011] >ffff888036377880: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[ 23.591508] ^
[ 23.591794] ffff888036377900: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[ 23.592292] ffff888036377980: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[ 23.592789] ==================================================================
[ 23.593344] Disabling lock debugging due to kernel taint
"
I hope you find it useful.
Regards,
Yi Lai
---
If you don't need the following environment to reproduce the problem or if you
already have one reproduced environment, please ignore the following information.
How to reproduce:
git clone https://gitlab.com/xupengfe/repro_vm_env.git
cd repro_vm_env
tar -xvf repro_vm_env.tar.gz
cd repro_vm_env; ./start3.sh // it needs qemu-system-x86_64 and I used v7.1.0
// start3.sh will load bzImage_2241ab53cbb5cdb08a6b2d4688feb13971058f65 v6.2-rc5 kernel
// You could change the bzImage_xxx as you want
// Maybe you need to remove line "-drive if=pflash,format=raw,readonly=on,file=./OVMF_CODE.fd \" for different qemu version
You could use below command to log in, there is no password for root.
ssh -p 10023 root@localhost
After login vm(virtual machine) successfully, you could transfer reproduced
binary to the vm by below way, and reproduce the problem in vm:
gcc -pthread -o repro repro.c
scp -P 10023 repro root@localhost:/root/
Get the bzImage for target kernel:
Please use target kconfig and copy it to kernel_src/.config
make olddefconfig
make -jx bzImage //x should equal or less than cpu num your pc has
Fill the bzImage file into above start3.sh to load the target kernel in vm.
Tips:
If you already have qemu-system-x86_64, please ignore below info.
If you want to install qemu v7.1.0 version:
git clone https://github.com/qemu/qemu.git
cd qemu
git checkout -f v7.1.0
mkdir build
cd build
yum install -y ninja-build.x86_64
yum -y install libslirp-devel.x86_64
../configure --target-list=x86_64-softmmu --enable-kvm --enable-vnc --enable-gtk --enable-sdl --enable-usb-redir --enable-slirp
make
make install
On Mon, Sep 09, 2024 at 05:00:36PM +0200, Felix Moessbauer wrote:
> The submit queue polling threads are userland threads that just never
> exit to the userland. When creating the thread with IORING_SETUP_SQ_AFF,
> the affinity of the poller thread is set to the cpu specified in
> sq_thread_cpu. However, this CPU can be outside of the cpuset defined
> by the cgroup cpuset controller. This violates the rules defined by the
> cpuset controller and is a potential issue for realtime applications.
>
> In b7ed6d8ffd6 we fixed the default affinity of the poller thread, in
> case no explicit pinning is required by inheriting the one of the
> creating task. In case of explicit pinning, the check is more
> complicated, as also a cpu outside of the parent cpumask is allowed.
> We implemented this by using cpuset_cpus_allowed (that has support for
> cgroup cpusets) and testing if the requested cpu is in the set.
>
> Fixes: 37d1e2e3642e ("io_uring: move SQPOLL thread io-wq forked worker")
> Cc: [email protected] # 6.1+
> Signed-off-by: Felix Moessbauer <[email protected]>
> ---
> Hi,
>
> that's hopefully the last fix of cpu pinnings of the sq poller threads.
> However, there is more to come on the io-wq side. E.g the syscalls for
> IORING_REGISTER_IOWQ_AFF that can be used to change the affinites are
> not yet protected. I'm currently just lacking good reproducers for that.
> I also have to admit that I don't feel too comfortable making changes to
> the wq part, given that I don't have good tests.
>
> While fixing this, I'm wondering if it makes sense to add tests for the
> combination of pinning and cpuset. If yes, where should these tests be
> added?
>
> Best regards,
> Felix Moessbauer
> Siemens AG
>
> io_uring/sqpoll.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/io_uring/sqpoll.c b/io_uring/sqpoll.c
> index 713be7c29388..b8ec8fec99b8 100644
> --- a/io_uring/sqpoll.c
> +++ b/io_uring/sqpoll.c
> @@ -10,6 +10,7 @@
> #include <linux/slab.h>
> #include <linux/audit.h>
> #include <linux/security.h>
> +#include <linux/cpuset.h>
> #include <linux/io_uring.h>
>
> #include <uapi/linux/io_uring.h>
> @@ -459,10 +460,12 @@ __cold int io_sq_offload_create(struct io_ring_ctx *ctx,
> return 0;
>
> if (p->flags & IORING_SETUP_SQ_AFF) {
> + struct cpumask allowed_mask;
> int cpu = p->sq_thread_cpu;
>
> ret = -EINVAL;
> - if (cpu >= nr_cpu_ids || !cpu_online(cpu))
> + cpuset_cpus_allowed(current, &allowed_mask);
> + if (!cpumask_test_cpu(cpu, &allowed_mask))
> goto err_sqpoll;
> sqd->sq_cpu = cpu;
> } else {
> --
> 2.39.2
>
^ permalink raw reply [flat|nested] 6+ messages in thread