public inbox for [email protected]
 help / color / mirror / Atom feed
* [PATCH] io_thread/x86: don't reset 'cs', 'ss', 'ds' and 'es' registers for io_threads
@ 2021-04-11 15:27 Stefan Metzmacher
  2021-04-14 21:28 ` Stefan Metzmacher
  0 siblings, 1 reply; 44+ messages in thread
From: Stefan Metzmacher @ 2021-04-11 15:27 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Stefan Metzmacher, Jens Axboe, linux-kernel, io-uring

This allows gdb attach to userspace processes using io-uring,
which means that they have io_threads (PF_IO_WORKER), which appear
just like normal as userspace threads.

See the code comment for more details.

Fixes: 4727dc20e04 ("arch: setup PF_IO_WORKER threads like PF_KTHREAD")
Signed-off-by: Stefan Metzmacher <[email protected]>
cc: Linus Torvalds <[email protected]>
cc: Jens Axboe <[email protected]>
cc: [email protected]
cc: [email protected]
---
 arch/x86/kernel/process.c | 49 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 49 insertions(+)

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 9c214d7085a4..72120c4b7618 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -163,6 +163,55 @@ int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long arg,
 	/* Kernel thread ? */
 	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
 		memset(childregs, 0, sizeof(struct pt_regs));
+		/*
+		 * gdb sees all userspace threads,
+		 * including io threads (PF_IO_WORKER)!
+		 *
+		 * gdb uses:
+		 * PTRACE_PEEKUSR, offsetof (struct user_regs_struct, cs)
+		 *  returning with 0x33 (51) to detect 64 bit
+		 * and:
+		 * PTRACE_PEEKUSR, offsetof (struct user_regs_struct, ds)
+		 *  returning 0x2b (43) to detect 32 bit.
+		 *
+		 * GDB relies on that the kernel returns the
+		 * same values for all threads, which means
+		 * we don't zero these out.
+		 *
+		 * Note that CONFIG_X86_64 handles 'es' and 'ds'
+		 * differently, see the following above:
+		 *   savesegment(es, p->thread.es);
+		 *   savesegment(ds, p->thread.ds);
+		 * and the CONFIG_X86_64 version of get_segment_reg().
+		 *
+		 * Linus proposed something like this:
+		 * (https://lore.kernel.org/io-uring/CAHk-=whEObPkZBe4766DmR46-=5QTUiatWbSOaD468eTgYc1tg@mail.gmail.com/)
+		 *
+		 *   childregs->cs = __USER_CS;
+		 *   childregs->ss = __USER_DS;
+		 *   childregs->ds = __USER_DS;
+		 *   childregs->es = __USER_DS;
+		 *
+		 * might make sense (just do it unconditionally, rather than making it
+		 * special to PF_IO_WORKER).
+		 *
+		 * But that doesn't make gdb happy in all cases.
+		 *
+		 * While 32bit userspace on a 64bit kernel is legacy,
+		 * it's still useful to allow 32bit libraries or nss modules
+		 * use the same code as the 64bit version of that library, which
+		 * can use io-uring just fine.
+		 *
+		 * So we better just inherit the values from
+		 * the originating process instead of hardcoding
+		 * values, which would imply 64bit userspace.
+		 */
+		childregs->cs = current_pt_regs()->cs;
+		childregs->ss = current_pt_regs()->ss;
+#ifdef CONFIG_X86_32
+		childregs->ds = current_pt_regs()->ds;
+		childregs->es = current_pt_regs()->es;
+#endif
 		kthread_frame_init(frame, sp, arg);
 		return 0;
 	}
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH] io_thread/x86: don't reset 'cs', 'ss', 'ds' and 'es' registers for io_threads
  2021-04-11 15:27 Stefan Metzmacher
@ 2021-04-14 21:28 ` Stefan Metzmacher
  0 siblings, 0 replies; 44+ messages in thread
From: Stefan Metzmacher @ 2021-04-14 21:28 UTC (permalink / raw)
  To: Linus Torvalds, Jens Axboe; +Cc: linux-kernel, io-uring

Hi Jens, hi Linus,

any comments on that patch?

On a system with 'uname -m -p -i -r'

  5.12.0-rc7 x86_64 x86_64 x86_64

and a ubuntu 20.04 amd64 userspace.

I compiled 3 versions of liburing + the following diff:

diff --git a/examples/io_uring-cp.c b/examples/io_uring-cp.c
index 2a44c30d0089..91722a79b2bd 100644
--- a/examples/io_uring-cp.c
+++ b/examples/io_uring-cp.c
@@ -116,13 +116,16 @@ static void queue_write(struct io_uring *ring, struct io_data *data)
 	io_uring_submit(ring);
 }

-static int copy_file(struct io_uring *ring, off_t insize)
+static int copy_file(struct io_uring *ring, off_t _insize)
 {
+	off_t insize = _insize;
 	unsigned long reads, writes;
 	struct io_uring_cqe *cqe;
 	off_t write_left, offset;
 	int ret;

+again:
+	insize = _insize;
 	write_left = insize;
 	writes = reads = offset = 0;

@@ -176,6 +179,7 @@ static int copy_file(struct io_uring *ring, off_t insize)
 					ret = 0;
 				}
 			}
+			if (ret == -EINTR) { cqe = NULL; ret = 0; }
 			if (ret < 0) {
 				fprintf(stderr, "io_uring_peek_cqe: %s\n",
 							strerror(-ret));
@@ -239,6 +243,7 @@ static int copy_file(struct io_uring *ring, off_t insize)
 		writes--;
 		io_uring_cqe_seen(ring, cqe);
 	}
+	goto again;

 	return 0;
 }

I compiled it in 3 versions:

CFLAGS="-g -m32" CXX=false make
=> file examples/io_uring-cp
examples/io_uring-cp: ELF 32-bit LSB shared object, Intel 80386, version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux.so.2,
BuildID[sha1]=1a5cabd082925497d146a041fd5c5cff6ded75da, for GNU/Linux 3.2.0, with debug_info, not stripped

CFLAGS="-g -mx32" CXX=false make
=> file examples/io_uring-cp
examples/io_uring-cp: ELF 32-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /libx32/ld-linux-x32.so.2,
BuildID[sha1]=a8bf06124c44364a8d7aedfeb3faa736feff6452, for GNU/Linux 3.4.0, with debug_info, not stripped

CFLAGS="-g -m64" CXX=false make
=> file examples/io_uring-cp
examples/io_uring-cp: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2,
BuildID[sha1]=638c682173adad3c09c67af4d1888fbb3b260627, for GNU/Linux 3.2.0, with debug_info, not stripped

With a plain 5.12-rc7 gcc prints the following output:

With -m64:

# gdb -q --batch --ex="set height 0" --ex="thread apply all bt" --pid 8743
[New LWP 8744]
[New LWP 8745]

warning: Selected architecture i386:x86-64 is not compatible with reported target architecture i386

warning: Architecture rejected target-supplied description
syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
38      ../sysdeps/unix/sysv/linux/x86_64/syscall.S: No such file or directory.

Thread 3 (LWP 8745):
#0  0x0000000000000000 in ?? ()
Backtrace stopped: Cannot access memory at address 0x0

Thread 2 (LWP 8744):
#0  0x0000000000000000 in ?? ()
Backtrace stopped: Cannot access memory at address 0x0

Thread 1 (LWP 8743):
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x000055b78731042b in __sys_io_uring_enter2 (fd=5, to_submit=0, min_complete=1, flags=1, sig=0x0, sz=8) at syscall.c:54
#2  0x000055b78730fc52 in _io_uring_get_cqe (ring=0x7ffd91bce680, cqe_ptr=0x7ffd91bce600, data=0x7ffd91bce560) at queue.c:122
#3  0x000055b78730fd19 in __io_uring_get_cqe (ring=0x7ffd91bce680, cqe_ptr=0x7ffd91bce600, submit=0, wait_nr=1, sigmask=0x0) at queue.c:150
#4  0x000055b78730e58c in io_uring_wait_cqe_nr (ring=0x7ffd91bce680, cqe_ptr=0x7ffd91bce600, wait_nr=1) at ../src/include/liburing.h:635
#5  0x000055b78730e5e0 in io_uring_wait_cqe (ring=0x7ffd91bce680, cqe_ptr=0x7ffd91bce600) at ../src/include/liburing.h:655
#6  0x000055b78730ecf2 in copy_file (ring=0x7ffd91bce680, _insize=24) at io_uring-cp.c:232
#7  0x000055b78730eef5 in main (argc=3, argv=0x7ffd91bce858) at io_uring-cp.c:278
[Inferior 1 (process 8743) detached]

With -mx32:

gdb -q --batch --ex="set height 0" --ex="thread apply all bt" --pid 8821
[New LWP 8822]
[New LWP 8823]

warning: Selected architecture i386:x64-32 is not compatible with reported target architecture i386

warning: Architecture rejected target-supplied description
0xf7e66d1d in syscall () from /libx32/libc.so.6

Thread 3 (LWP 8823):
#0  0x00000000 in ?? ()
Backtrace stopped: Cannot access memory at address 0x0

Thread 2 (LWP 8822):
#0  0x00000000 in ?? ()
Backtrace stopped: Cannot access memory at address 0x0

Thread 1 (LWP 8821):
#0  0xf7e66d1d in syscall () from /libx32/libc.so.6
#1  0x5659663d in __sys_io_uring_enter2 (fd=5, to_submit=0, min_complete=1, flags=1, sig=0x0, sz=8) at syscall.c:54
#2  0x56595dbe in _io_uring_get_cqe (ring=0xffc4feb0, cqe_ptr=0xffc4fe38, data=0xffc4fdb0) at queue.c:122
#3  0x56595e9b in __io_uring_get_cqe (ring=0xffc4feb0, cqe_ptr=0xffc4fe38, submit=0, wait_nr=1, sigmask=0x0) at queue.c:150
#4  0x565945d8 in io_uring_wait_cqe_nr (ring=0xffc4feb0, cqe_ptr=0xffc4fe38, wait_nr=1) at ../src/include/liburing.h:635
#5  0x56594634 in io_uring_wait_cqe (ring=0xffc4feb0, cqe_ptr=0xffc4fe38) at ../src/include/liburing.h:655
#6  0x56594dc5 in copy_file (ring=0xffc4feb0, _insize=24) at io_uring-cp.c:232
#7  0x56594ff1 in main (argc=3, argv=0xffc50024) at io_uring-cp.c:278
[Inferior 1 (process 8821) detached]

With -m32:

gdb -q --batch --ex="set height 0" --ex="thread apply all bt" --pid 8831
[New LWP 8832]
[New LWP 8833]
0xf7f16549 in __kernel_vsyscall ()

Thread 3 (LWP 8833):
#0  0x00000000 in ?? ()

Thread 2 (LWP 8832):
#0  0x00000000 in ?? ()

Thread 1 (LWP 8831):
#0  0xf7f16549 in __kernel_vsyscall ()
#1  0xf7e14efb in syscall () at ../sysdeps/unix/sysv/linux/i386/syscall.S:29
#2  0x5657d297 in __sys_io_uring_enter2 (fd=5, to_submit=0, min_complete=1, flags=1, sig=0x0, sz=8) at syscall.c:54
#3  0x5657cb32 in _io_uring_get_cqe (ring=0xff9557b0, cqe_ptr=0xff95573c, data=0xff955698) at queue.c:122
#4  0x5657cbf5 in __io_uring_get_cqe (ring=0xff9557b0, cqe_ptr=0xff95573c, submit=0, wait_nr=1, sigmask=0x0) at queue.c:150
#5  0x5657b5f2 in io_uring_wait_cqe_nr (ring=0xff9557b0, cqe_ptr=0xff95573c, wait_nr=1) at ../src/include/liburing.h:635
#6  0x5657b63f in io_uring_wait_cqe (ring=0xff9557b0, cqe_ptr=0xff95573c) at ../src/include/liburing.h:655
#7  0x5657bcbe in copy_file (ring=0xff9557b0, _insize=24) at io_uring-cp.c:232
#8  0x5657becd in main (argc=3, argv=0xff9558f4) at io_uring-cp.c:278
[Inferior 1 (process 8831) detached]

So the gdb autodetects 'i386/-m32' and warns in all other cases (including the default of -m64)

As a debugged this deeply now, I know (at least printing a backtrace works anyway),
but for any random advanced admin or userspace developer warnings like this:

  warning: Selected architecture i386:x86-64 is not compatible with reported target architecture i386

  warning: Architecture rejected target-supplied description

typically indicate that something in the system is deeply broken.

With the version proposed by Linus:

+		childregs->cs = __USER_CS;
+		childregs->ss = __USER_DS;
+#ifdef CONFIG_X86_32
+		childregs->ds = __USER_DS;
+		childregs->es = __USER_DS;
+#endif

-m64 and -mx32 are fine, but i386/-m32 generates this:

# gdb -q --batch --ex="set height 0" --ex="thread apply all bt" --pid 984
[New LWP 985]
[New LWP 986]
[New LWP 987]

warning: Selected architecture i386 is not compatible with reported target architecture i386:x64-32

warning: Architecture rejected target-supplied description
0xf7f58549 in __kernel_vsyscall ()

Thread 4 (LWP 987):
#0  0x00000000 in ?? ()

Thread 3 (LWP 986):
#0  0x00000000 in ?? ()

Thread 2 (LWP 985):
#0  0x00000000 in ?? ()

Thread 1 (LWP 984):
#0  0xf7f58549 in __kernel_vsyscall ()
#1  0xf7e56efb in syscall () at ../sysdeps/unix/sysv/linux/i386/syscall.S:29
#2  0x5656c297 in __sys_io_uring_enter2 (fd=5, to_submit=0, min_complete=1, flags=1, sig=0x0, sz=8) at syscall.c:54
#3  0x5656bb32 in _io_uring_get_cqe (ring=0xffc42ce0, cqe_ptr=0xffc42c6c, data=0xffc42bc8) at queue.c:122
#4  0x5656bbf5 in __io_uring_get_cqe (ring=0xffc42ce0, cqe_ptr=0xffc42c6c, submit=0, wait_nr=1, sigmask=0x0) at queue.c:150
#5  0x5656a5f2 in io_uring_wait_cqe_nr (ring=0xffc42ce0, cqe_ptr=0xffc42c6c, wait_nr=1) at ../src/include/liburing.h:635
#6  0x5656a63f in io_uring_wait_cqe (ring=0xffc42ce0, cqe_ptr=0xffc42c6c) at ../src/include/liburing.h:655
#7  0x5656acbe in copy_file (ring=0xffc42ce0, _insize=24) at io_uring-cp.c:232
#8  0x5656aecd in main (argc=3, argv=0xffc42e24) at io_uring-cp.c:278
[Inferior 1 (process 984) detached]


With my patch all 3 are fine.

It also feels more natural to me to just keep the values of
the copy_thread() caller.

Do you plan to do something about this before 5.12 final?

metze

Am 11.04.21 um 17:27 schrieb Stefan Metzmacher:
> This allows gdb attach to userspace processes using io-uring,
> which means that they have io_threads (PF_IO_WORKER), which appear
> just like normal as userspace threads.
> 
> See the code comment for more details.
> 
> Fixes: 4727dc20e04 ("arch: setup PF_IO_WORKER threads like PF_KTHREAD")
> Signed-off-by: Stefan Metzmacher <[email protected]>
> cc: Linus Torvalds <[email protected]>
> cc: Jens Axboe <[email protected]>
> cc: [email protected]
> cc: [email protected]
> ---
>  arch/x86/kernel/process.c | 49 +++++++++++++++++++++++++++++++++++++++
>  1 file changed, 49 insertions(+)
> 
> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> index 9c214d7085a4..72120c4b7618 100644
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -163,6 +163,55 @@ int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long arg,
>  	/* Kernel thread ? */
>  	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
>  		memset(childregs, 0, sizeof(struct pt_regs));
> +		/*
> +		 * gdb sees all userspace threads,
> +		 * including io threads (PF_IO_WORKER)!
> +		 *
> +		 * gdb uses:
> +		 * PTRACE_PEEKUSR, offsetof (struct user_regs_struct, cs)
> +		 *  returning with 0x33 (51) to detect 64 bit
> +		 * and:
> +		 * PTRACE_PEEKUSR, offsetof (struct user_regs_struct, ds)
> +		 *  returning 0x2b (43) to detect 32 bit.
> +		 *
> +		 * GDB relies on that the kernel returns the
> +		 * same values for all threads, which means
> +		 * we don't zero these out.
> +		 *
> +		 * Note that CONFIG_X86_64 handles 'es' and 'ds'
> +		 * differently, see the following above:
> +		 *   savesegment(es, p->thread.es);
> +		 *   savesegment(ds, p->thread.ds);
> +		 * and the CONFIG_X86_64 version of get_segment_reg().
> +		 *
> +		 * Linus proposed something like this:
> +		 * (https://lore.kernel.org/io-uring/CAHk-=whEObPkZBe4766DmR46-=5QTUiatWbSOaD468eTgYc1tg@mail.gmail.com/)
> +		 *
> +		 *   childregs->cs = __USER_CS;
> +		 *   childregs->ss = __USER_DS;
> +		 *   childregs->ds = __USER_DS;
> +		 *   childregs->es = __USER_DS;
> +		 *
> +		 * might make sense (just do it unconditionally, rather than making it
> +		 * special to PF_IO_WORKER).
> +		 *
> +		 * But that doesn't make gdb happy in all cases.
> +		 *
> +		 * While 32bit userspace on a 64bit kernel is legacy,
> +		 * it's still useful to allow 32bit libraries or nss modules
> +		 * use the same code as the 64bit version of that library, which
> +		 * can use io-uring just fine.
> +		 *
> +		 * So we better just inherit the values from
> +		 * the originating process instead of hardcoding
> +		 * values, which would imply 64bit userspace.
> +		 */
> +		childregs->cs = current_pt_regs()->cs;
> +		childregs->ss = current_pt_regs()->ss;
> +#ifdef CONFIG_X86_32
> +		childregs->ds = current_pt_regs()->ds;
> +		childregs->es = current_pt_regs()->es;
> +#endif
>  		kthread_frame_init(frame, sp, arg);
>  		return 0;
>  	}
> 


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH] io_thread/x86: don't reset 'cs', 'ss', 'ds' and 'es' registers for io_threads
       [not found] <[email protected]>
@ 2021-05-03 16:05 ` Andy Lutomirski
  2021-05-03 19:14   ` Linus Torvalds
  0 siblings, 1 reply; 44+ messages in thread
From: Andy Lutomirski @ 2021-05-03 16:05 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Stefan Metzmacher, Linus Torvalds, Jens Axboe, linux-kernel,
	io-uring, x86, Linus Torvalds


> On May 3, 2021, at 7:00 AM, Thomas Gleixner <[email protected]> wrote:
> 
> Stefan,
> 
> On Sun, Apr 11 2021 at 17:27, Stefan Metzmacher wrote:
> 
> Can you please CC x86 people on patches which are x86 related?
> 
>> This allows gdb attach to userspace processes using io-uring,
>> which means that they have io_threads (PF_IO_WORKER), which appear
>> just like normal as userspace threads.
> 
> That's not a changelog, really. Please describe what the problem is and
> why the chosen solution is correct.
> 
>> See the code comment for more details.
> 
> The changelog should be self contained.
> 
>> Fixes: 4727dc20e04 ("arch: setup PF_IO_WORKER threads like PF_KTHREAD")
>> Signed-off-by: Stefan Metzmacher <[email protected]>
>> cc: Linus Torvalds <[email protected]>
>> cc: Jens Axboe <[email protected]>
>> cc: [email protected]
>> cc: [email protected]
>> ---
>> arch/x86/kernel/process.c | 49 +++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 49 insertions(+)
>> 
>> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
>> index 9c214d7085a4..72120c4b7618 100644
>> --- a/arch/x86/kernel/process.c
>> +++ b/arch/x86/kernel/process.c
>> @@ -163,6 +163,55 @@ int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long arg,
>>    /* Kernel thread ? */
>>    if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
>>        memset(childregs, 0, sizeof(struct pt_regs));
>> +        /*
>> +         * gdb sees all userspace threads,
>> +         * including io threads (PF_IO_WORKER)!
>> +         *
>> +         * gdb uses:
>> +         * PTRACE_PEEKUSR, offsetof (struct user_regs_struct, cs)
>> +         *  returning with 0x33 (51) to detect 64 bit
>> +         * and:
>> +         * PTRACE_PEEKUSR, offsetof (struct user_regs_struct, ds)
>> +         *  returning 0x2b (43) to detect 32 bit.
>> +         *
>> +         * GDB relies on that the kernel returns the
>> +         * same values for all threads, which means
>> +         * we don't zero these out.
>> +         *
>> +         * Note that CONFIG_X86_64 handles 'es' and 'ds'
>> +         * differently, see the following above:
>> +         *   savesegment(es, p->thread.es);
>> +         *   savesegment(ds, p->thread.ds);
>> +         * and the CONFIG_X86_64 version of get_segment_reg().
>> +         *
>> +         * Linus proposed something like this:
>> +         * (https://lore.kernel.org/io-uring/CAHk-=whEObPkZBe4766DmR46-=5QTUiatWbSOaD468eTgYc1tg@mail.gmail.com/)
>> +         *
>> +         *   childregs->cs = __USER_CS;
>> +         *   childregs->ss = __USER_DS;
>> +         *   childregs->ds = __USER_DS;
>> +         *   childregs->es = __USER_DS;
>> +         *
>> +         * might make sense (just do it unconditionally, rather than making it
>> +         * special to PF_IO_WORKER).
>> +         *
>> +         * But that doesn't make gdb happy in all cases.
>> +         *
>> +         * While 32bit userspace on a 64bit kernel is legacy,
>> +         * it's still useful to allow 32bit libraries or nss modules
>> +         * use the same code as the 64bit version of that library, which
>> +         * can use io-uring just fine.

Whoa there!  Can we take a big step back?

I saw all the hubbub about making io threads visible to gdb.  Fine, but why do we allow gdb to read and write their register files at all?  They *don’t have user state* because they *are not user threads*.  Beyond that, Linux does not really have a concept of a 32-bit thread and a 64-bit thread. I realize that gdb does have this concept, but gdb is *wrong*, and it regularly causes problems when debugging mixed-mode programs or VMs.

Linus, what is the actual effect of allowing gdb to attach these threads?  Can we instead make all the regset ops do:

if (not actually a user thread) return -EINVAL;

Any other solution results in all kinds of nasty questions. For example, kernel threads don’t have FPU state — what happens if gdb tries to access FPU state?  What happens if gdb tries to *allocate* AMX state for an io_uring thread? What happens if the various remote arch_prctl accessors are used?

—Andy

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] io_thread/x86: don't reset 'cs', 'ss', 'ds' and 'es' registers for io_threads
  2021-05-03 16:05 ` [PATCH] io_thread/x86: don't reset 'cs', 'ss', 'ds' and 'es' registers for io_threads Andy Lutomirski
@ 2021-05-03 19:14   ` Linus Torvalds
  2021-05-03 20:15     ` Andy Lutomirski
  2021-05-04  8:39     ` Peter Zijlstra
  0 siblings, 2 replies; 44+ messages in thread
From: Linus Torvalds @ 2021-05-03 19:14 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, Stefan Metzmacher, Jens Axboe,
	Linux Kernel Mailing List, io-uring, the arch/x86 maintainers

On Mon, May 3, 2021 at 9:05 AM Andy Lutomirski <[email protected]> wrote:
>
> Linus, what is the actual effect of allowing gdb to attach these threads?  Can we instead make all the regset ops do:
>
> if (not actually a user thread) return -EINVAL;

I don't think it matters - the end result ends up  being the same, ie
gdb gets confused about whether the (parent) thread is a 32-bit or
64-bit one.

So the basic issue is

 (a) we want the IO threads to look exactly like normal user threads
as far as the kernel is concerned, because we had way too many bugs
due to special cases.

 (b) but that means that they are also visible to user space, and then
gdb has this odd thing where it takes the 64-bit vs 32-bit data for
the whole process from one thread, and picks the worst possible thread
to do it (ie explicitly not even the main thread, so usually the IO
thread!)

That (a) ended up really being critical. The issues with special cases
were just horrendous, both for security issues (ie "make them kernel
threads but carry user credentials" just caused lots of problems) but
also for various just random other state handling issues (signal state
in particular).

So generally, the IO threads are now 100% normal threads - it's
literally just that they never return to user space because they are
always just doing the IO offload on the kernel side.

That part is lovely, but part of the "100% IO threads" really is that
they share the signal struct too, which in turn means that they very
much show up as normal threads. Again, not a problem: they really
_are_ normal threads for all intents and purposes.

But then that (b) issue means that gdb gets confused by them. I
personally think that's just a pure gdb mis-feature, but I also think
that "hey, if we just make the register state look like the main
thread, and unconfuse gdb that way, problem solved".

So I'd actually rather not make these non-special threads any more
special at all. And I strongly suspect that making ptrace() not work
on them will just confuse gdb even more - so it would make them just
unnecessarily special in the kernel, for no actual gain.

Is the right thing to do to fix gdb to not look at irrelevant thread B
when deciding whether thread A is 64-bit or not? Yeah, that seems like
obviously the RightThing(tm) to me.

But at the same time, this is arguably about "regression", although at
the same time it's "gdb doesn't understand new user programs that use
new features, film at 11", so I think that argument is partly bogus
too.

So my personal preference would be:

 - make those threads look even more like user threads, even if that
means giving them pointless user segment data that the threads
themselves will never use

   So I think Stefan's patch is reasonable, if not pretty. Literally
becasue of that "make these threads look even more normal"

 - ALSO fix gdb that is doing obviously garbage stupid things

But I'm obviously not involved in that "ALSO fix gdb" part, and
arguably the kernel hack then makes it more likely that gdb will
continue doing its insane broken thing.

             Linus

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] io_thread/x86: don't reset 'cs', 'ss', 'ds' and 'es' registers for io_threads
  2021-05-03 19:14   ` Linus Torvalds
@ 2021-05-03 20:15     ` Andy Lutomirski
  2021-05-03 20:21       ` Andy Lutomirski
  2021-05-03 20:37       ` Linus Torvalds
  2021-05-04  8:39     ` Peter Zijlstra
  1 sibling, 2 replies; 44+ messages in thread
From: Andy Lutomirski @ 2021-05-03 20:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Thomas Gleixner, Stefan Metzmacher, Jens Axboe,
	Linux Kernel Mailing List, io-uring, the arch/x86 maintainers

On Mon, May 3, 2021 at 12:15 PM Linus Torvalds
<[email protected]> wrote:
> So generally, the IO threads are now 100% normal threads - it's
> literally just that they never return to user space because they are
> always just doing the IO offload on the kernel side.
>
> That part is lovely, but part of the "100% IO threads" really is that
> they share the signal struct too, which in turn means that they very
> much show up as normal threads. Again, not a problem: they really
> _are_ normal threads for all intents and purposes.

I'm a bit confused, though.  All the ptrace register access (AFAICS)
goes through ptrace_check_attach(), which should wait until the tracee
is stopped.  Does the io_uring thread now stop in response to ptrace
stop requests?

>
> But then that (b) issue means that gdb gets confused by them. I
> personally think that's just a pure gdb mis-feature, but I also think
> that "hey, if we just make the register state look like the main
> thread, and unconfuse gdb that way, problem solved".
>
> So I'd actually rather not make these non-special threads any more
> special at all. And I strongly suspect that making ptrace() not work
> on them will just confuse gdb even more - so it would make them just
> unnecessarily special in the kernel, for no actual gain.
>
> Is the right thing to do to fix gdb to not look at irrelevant thread B
> when deciding whether thread A is 64-bit or not? Yeah, that seems like
> obviously the RightThing(tm) to me.
>
> But at the same time, this is arguably about "regression", although at
> the same time it's "gdb doesn't understand new user programs that use
> new features, film at 11", so I think that argument is partly bogus
> too.
>

Fair enough.  But I would really, really rather that gdb starts fixing
its amazingly broken assumptions about bitness.

> So my personal preference would be:
>
>  - make those threads look even more like user threads, even if that
> means giving them pointless user segment data that the threads
> themselves will never use
>
>    So I think Stefan's patch is reasonable, if not pretty. Literally
> becasue of that "make these threads look even more normal"

I think it's reasonable except for the bit about copying the segment
regs.  Can we hardcode __USER_CS, etc, and, when gdb crashes or
otherwise malfunctions for compat programs, we can say that gdb needs
to stop sucking.  In general, I think that piling a bitness hack in
here is a mess, and we're going to have to carry it forward forever
once we do it.

Meanwhile, I am going to put my foot down about one thing: NAK to this
patch until it comes with a selftest.  That selftest needs to test the
cs behavior, but it also needs to read the FPU state from an io_uring
thread, write some FPU state to that thread, and read it back.  And it
needs to not OOPS.  Not breaking ABI is nice and all, but even more
important is not breaking the kernel.

--Andy

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] io_thread/x86: don't reset 'cs', 'ss', 'ds' and 'es' registers for io_threads
  2021-05-03 20:15     ` Andy Lutomirski
@ 2021-05-03 20:21       ` Andy Lutomirski
  2021-05-03 20:37       ` Linus Torvalds
  1 sibling, 0 replies; 44+ messages in thread
From: Andy Lutomirski @ 2021-05-03 20:21 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linus Torvalds, Thomas Gleixner, Stefan Metzmacher, Jens Axboe,
	Linux Kernel Mailing List, io-uring, the arch/x86 maintainers

On Mon, May 3, 2021 at 1:15 PM Andy Lutomirski <[email protected]> wrote:
>
> On Mon, May 3, 2021 at 12:15 PM Linus Torvalds
> <[email protected]> wrote:
> > So generally, the IO threads are now 100% normal threads - it's
> > literally just that they never return to user space because they are
> > always just doing the IO offload on the kernel side.
> >
> > That part is lovely, but part of the "100% IO threads" really is that
> > they share the signal struct too, which in turn means that they very
> > much show up as normal threads. Again, not a problem: they really
> > _are_ normal threads for all intents and purposes.
>
> I'm a bit confused, though.  All the ptrace register access (AFAICS)
> goes through ptrace_check_attach(), which should wait until the tracee
> is stopped.  Does the io_uring thread now stop in response to ptrace
> stop requests?
>
> >
> > But then that (b) issue means that gdb gets confused by them. I
> > personally think that's just a pure gdb mis-feature, but I also think
> > that "hey, if we just make the register state look like the main
> > thread, and unconfuse gdb that way, problem solved".
> >
> > So I'd actually rather not make these non-special threads any more
> > special at all. And I strongly suspect that making ptrace() not work
> > on them will just confuse gdb even more - so it would make them just
> > unnecessarily special in the kernel, for no actual gain.
> >
> > Is the right thing to do to fix gdb to not look at irrelevant thread B
> > when deciding whether thread A is 64-bit or not? Yeah, that seems like
> > obviously the RightThing(tm) to me.
> >
> > But at the same time, this is arguably about "regression", although at
> > the same time it's "gdb doesn't understand new user programs that use
> > new features, film at 11", so I think that argument is partly bogus
> > too.
> >
>
> Fair enough.  But I would really, really rather that gdb starts fixing
> its amazingly broken assumptions about bitness.
>
> > So my personal preference would be:
> >
> >  - make those threads look even more like user threads, even if that
> > means giving them pointless user segment data that the threads
> > themselves will never use
> >
> >    So I think Stefan's patch is reasonable, if not pretty. Literally
> > becasue of that "make these threads look even more normal"
>
> I think it's reasonable except for the bit about copying the segment
> regs.  Can we hardcode __USER_CS, etc, and, when gdb crashes or
> otherwise malfunctions for compat programs, we can say that gdb needs
> to stop sucking.  In general, I think that piling a bitness hack in
> here is a mess, and we're going to have to carry it forward forever
> once we do it.

Actually... if we have your permission, I'd rather do the -EINVAL
thing.  Arguably, if gdb breaks cleanly, that's a win.  This only
affects programs using io_uring, it avoids a kludge, and hopefully it
will encourage gdb to fix their bug.  May we do that instead?

--Andy

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] io_thread/x86: don't reset 'cs', 'ss', 'ds' and 'es' registers for io_threads
  2021-05-03 20:15     ` Andy Lutomirski
  2021-05-03 20:21       ` Andy Lutomirski
@ 2021-05-03 20:37       ` Linus Torvalds
  2021-05-03 21:26         ` Jens Axboe
  1 sibling, 1 reply; 44+ messages in thread
From: Linus Torvalds @ 2021-05-03 20:37 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, Stefan Metzmacher, Jens Axboe,
	Linux Kernel Mailing List, io-uring, the arch/x86 maintainers

On Mon, May 3, 2021 at 1:15 PM Andy Lutomirski <[email protected]> wrote:
>
> On Mon, May 3, 2021 at 12:15 PM Linus Torvalds
> <[email protected]> wrote:
> > So generally, the IO threads are now 100% normal threads - it's
> > literally just that they never return to user space because they are
> > always just doing the IO offload on the kernel side.
> >
> > That part is lovely, but part of the "100% IO threads" really is that
> > they share the signal struct too, which in turn means that they very
> > much show up as normal threads. Again, not a problem: they really
> > _are_ normal threads for all intents and purposes.
>
> I'm a bit confused, though.  All the ptrace register access (AFAICS)
> goes through ptrace_check_attach(), which should wait until the tracee
> is stopped.  Does the io_uring thread now stop in response to ptrace
> stop requests?

Yup. They really are 100% regular threads. Things like ^Z and friends
also stop them now, and the freezer freezes them etc.

And making PTRACE_ATTACH fail just causes gdb to fail.

> Fair enough.  But I would really, really rather that gdb starts fixing
> its amazingly broken assumptions about bitness.

"Preach it, Brother"

> >    So I think Stefan's patch is reasonable, if not pretty. Literally
> > becasue of that "make these threads look even more normal"
>
> I think it's reasonable except for the bit about copying the segment
> regs.  Can we hardcode __USER_CS, etc, and, when gdb crashes or
> otherwise malfunctions for compat programs, we can say that gdb needs
> to stop sucking.

So that was actually my initial suggestion. Stefan really does seem to
care about compat programs.

Any "gdb breaks" would be good to motivate people to fix gdb, but the
thing is, presumably nobody actually wants to touch gdb with a ten
foot pole.

And a "let's break gdb to encourage people to fix it" only works if
people actually _do_ fit it. Which doesn't seem to be happening.

Two lines of kernel code seems to be the better option than hoping for
gdb to be fixed.

               Linus

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] io_thread/x86: don't reset 'cs', 'ss', 'ds' and 'es' registers for io_threads
  2021-05-03 20:37       ` Linus Torvalds
@ 2021-05-03 21:26         ` Jens Axboe
  2021-05-03 21:49           ` Andy Lutomirski
  0 siblings, 1 reply; 44+ messages in thread
From: Jens Axboe @ 2021-05-03 21:26 UTC (permalink / raw)
  To: Linus Torvalds, Andy Lutomirski
  Cc: Thomas Gleixner, Stefan Metzmacher, Linux Kernel Mailing List,
	io-uring, the arch/x86 maintainers

On 5/3/21 2:37 PM, Linus Torvalds wrote:
> On Mon, May 3, 2021 at 1:15 PM Andy Lutomirski <[email protected]> wrote:
>>
>> On Mon, May 3, 2021 at 12:15 PM Linus Torvalds
>> <[email protected]> wrote:
>>> So generally, the IO threads are now 100% normal threads - it's
>>> literally just that they never return to user space because they are
>>> always just doing the IO offload on the kernel side.
>>>
>>> That part is lovely, but part of the "100% IO threads" really is that
>>> they share the signal struct too, which in turn means that they very
>>> much show up as normal threads. Again, not a problem: they really
>>> _are_ normal threads for all intents and purposes.
>>
>> I'm a bit confused, though.  All the ptrace register access (AFAICS)
>> goes through ptrace_check_attach(), which should wait until the tracee
>> is stopped.  Does the io_uring thread now stop in response to ptrace
>> stop requests?
> 
> Yup. They really are 100% regular threads. Things like ^Z and friends
> also stop them now, and the freezer freezes them etc.
> 
> And making PTRACE_ATTACH fail just causes gdb to fail.
> 
>> Fair enough.  But I would really, really rather that gdb starts fixing
>> its amazingly broken assumptions about bitness.
> 
> "Preach it, Brother"

That's actually what the original code did, and the "only" problem with
it was that gdb shits itself and just go into an infinite loop trying to
attach. And yes, that's most certainly a gdb bug, and we entertained a
few options for making that work. One was hiding the threads, but nobody
(myself included) liked that, because then we're special casing
something again, and for no other reason than gdb is buggy.

On principle, I think it's arguably the right change to just -EINVAL the
attach. However, a part of me also finds it very annoying that anyone
attempting to debug any program that uses io_uring will not be able to
do so with a buggy gdb. That's regardless of whether or not you want to
look at the io threads or not, or even if you don't care about debugging
the io_uring side of things. And I'm assuming this will take a while to
get fixed, and then even longer to make its way back to distros.

So... You should just make the call. At least then I can just tell
people that Linus made that decision :-)

>>>    So I think Stefan's patch is reasonable, if not pretty. Literally
>>> becasue of that "make these threads look even more normal"
>>
>> I think it's reasonable except for the bit about copying the segment
>> regs.  Can we hardcode __USER_CS, etc, and, when gdb crashes or
>> otherwise malfunctions for compat programs, we can say that gdb needs
>> to stop sucking.
> 
> So that was actually my initial suggestion. Stefan really does seem to
> care about compat programs.
> 
> Any "gdb breaks" would be good to motivate people to fix gdb, but the
> thing is, presumably nobody actually wants to touch gdb with a ten
> foot pole.
> 
> And a "let's break gdb to encourage people to fix it" only works if
> people actually _do_ fit it. Which doesn't seem to be happening.
> 
> Two lines of kernel code seems to be the better option than hoping for
> gdb to be fixed.

As far as I'm concerned, gdb works "well enough" with io threads as it
stands. Yes, it'll complain a bit, but nothing that prevents you from
attaching to a progrem. If we -EINVAL, then gdb will become useless for
debugging a program that uses io_uring. And I'm not holding my breath
while someone fixes that.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] io_thread/x86: don't reset 'cs', 'ss', 'ds' and 'es' registers for io_threads
  2021-05-03 21:26         ` Jens Axboe
@ 2021-05-03 21:49           ` Andy Lutomirski
  2021-05-03 22:08             ` Linus Torvalds
  0 siblings, 1 reply; 44+ messages in thread
From: Andy Lutomirski @ 2021-05-03 21:49 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Linus Torvalds, Andy Lutomirski, Thomas Gleixner,
	Stefan Metzmacher, Linux Kernel Mailing List, io-uring,
	the arch/x86 maintainers

On Mon, May 3, 2021 at 2:26 PM Jens Axboe <[email protected]> wrote:
>
> On 5/3/21 2:37 PM, Linus Torvalds wrote:
> > On Mon, May 3, 2021 at 1:15 PM Andy Lutomirski <[email protected]> wrote:
> >>
> >> On Mon, May 3, 2021 at 12:15 PM Linus Torvalds
> >> <[email protected]> wrote:
> >>> So generally, the IO threads are now 100% normal threads - it's
> >>> literally just that they never return to user space because they are
> >>> always just doing the IO offload on the kernel side.
> >>>
> >>> That part is lovely, but part of the "100% IO threads" really is that
> >>> they share the signal struct too, which in turn means that they very
> >>> much show up as normal threads. Again, not a problem: they really
> >>> _are_ normal threads for all intents and purposes.
> >>
> >> I'm a bit confused, though.  All the ptrace register access (AFAICS)
> >> goes through ptrace_check_attach(), which should wait until the tracee
> >> is stopped.  Does the io_uring thread now stop in response to ptrace
> >> stop requests?
> >
> > Yup. They really are 100% regular threads. Things like ^Z and friends
> > also stop them now, and the freezer freezes them etc.
> >
> > And making PTRACE_ATTACH fail just causes gdb to fail.
> >
> >> Fair enough.  But I would really, really rather that gdb starts fixing
> >> its amazingly broken assumptions about bitness.
> >
> > "Preach it, Brother"
>
> That's actually what the original code did, and the "only" problem with
> it was that gdb shits itself and just go into an infinite loop trying to
> attach. And yes, that's most certainly a gdb bug, and we entertained a
> few options for making that work. One was hiding the threads, but nobody
> (myself included) liked that, because then we're special casing
> something again, and for no other reason than gdb is buggy.
>
> On principle, I think it's arguably the right change to just -EINVAL the
> attach. However, a part of me also finds it very annoying that anyone
> attempting to debug any program that uses io_uring will not be able to
> do so with a buggy gdb. That's regardless of whether or not you want to
> look at the io threads or not, or even if you don't care about debugging
> the io_uring side of things. And I'm assuming this will take a while to
> get fixed, and then even longer to make its way back to distros.
>
> So... You should just make the call. At least then I can just tell
> people that Linus made that decision :-)
>
> >>>    So I think Stefan's patch is reasonable, if not pretty. Literally
> >>> becasue of that "make these threads look even more normal"
> >>
> >> I think it's reasonable except for the bit about copying the segment
> >> regs.  Can we hardcode __USER_CS, etc, and, when gdb crashes or
> >> otherwise malfunctions for compat programs, we can say that gdb needs
> >> to stop sucking.
> >
> > So that was actually my initial suggestion. Stefan really does seem to
> > care about compat programs.
> >
> > Any "gdb breaks" would be good to motivate people to fix gdb, but the
> > thing is, presumably nobody actually wants to touch gdb with a ten
> > foot pole.
> >
> > And a "let's break gdb to encourage people to fix it" only works if
> > people actually _do_ fit it. Which doesn't seem to be happening.
> >
> > Two lines of kernel code seems to be the better option than hoping for
> > gdb to be fixed.
>
> As far as I'm concerned, gdb works "well enough" with io threads as it
> stands. Yes, it'll complain a bit, but nothing that prevents you from
> attaching to a progrem. If we -EINVAL, then gdb will become useless for
> debugging a program that uses io_uring. And I'm not holding my breath
> while someone fixes that.

To be clear, I'm suggesting that we -EINVAL the PTRACE_GETREGS calls
and such, not the ATTACH.  I have no idea what gdb will do if this
happens, though.

--Andy

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] io_thread/x86: don't reset 'cs', 'ss', 'ds' and 'es' registers for io_threads
  2021-05-03 21:49           ` Andy Lutomirski
@ 2021-05-03 22:08             ` Linus Torvalds
  2021-05-03 22:56               ` Thomas Gleixner
  0 siblings, 1 reply; 44+ messages in thread
From: Linus Torvalds @ 2021-05-03 22:08 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jens Axboe, Thomas Gleixner, Stefan Metzmacher,
	Linux Kernel Mailing List, io-uring, the arch/x86 maintainers

On Mon, May 3, 2021 at 2:49 PM Andy Lutomirski <[email protected]> wrote:
>
> To be clear, I'm suggesting that we -EINVAL the PTRACE_GETREGS calls
> and such, not the ATTACH.  I have no idea what gdb will do if this
> happens, though.

I feel like the likelihood that it will make gdb work any better is
basically zero.

I think we should just do Stefan's patch - I assume it generates
something like four instructions (two loads, two stores) on x86-64,
and it "just works".

Yeah, yeah, it presumably generates 8 instructions on 32-bit x86, and
we could fix that by just using the constant __USER_CS/DS instead (no
loads necessary) since 32-bit doesn't have any compat issues.

But is it worth complicating the patch for a couple of instructions in
a non-critical path?

And I don't see anybody stepping up to say "yes, I will do the patch
for gdb", so I really think the least pain is to just take the very
straightforward and tested kernel patch.

Yes, yes, that also means admitting to ourselves that the gdb
situation isn't likely going to improve, but hey, if nobody in this
thread is willing to work on the gdb side to fix the known issues
there, isn't that the honest thing to do anyway?

            Linus

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] io_thread/x86: don't reset 'cs', 'ss', 'ds' and 'es' registers for io_threads
  2021-05-03 22:08             ` Linus Torvalds
@ 2021-05-03 22:56               ` Thomas Gleixner
  2021-05-03 23:15                 ` Andy Lutomirski
  2021-05-03 23:16                 ` Linus Torvalds
  0 siblings, 2 replies; 44+ messages in thread
From: Thomas Gleixner @ 2021-05-03 22:56 UTC (permalink / raw)
  To: Linus Torvalds, Andy Lutomirski
  Cc: Jens Axboe, Stefan Metzmacher, Linux Kernel Mailing List,
	io-uring, the arch/x86 maintainers

On Mon, May 03 2021 at 15:08, Linus Torvalds wrote:
> On Mon, May 3, 2021 at 2:49 PM Andy Lutomirski <[email protected]> wrote:
>>
>> To be clear, I'm suggesting that we -EINVAL the PTRACE_GETREGS calls
>> and such, not the ATTACH.  I have no idea what gdb will do if this
>> happens, though.
>
> I feel like the likelihood that it will make gdb work any better is
> basically zero.
>
> I think we should just do Stefan's patch - I assume it generates
> something like four instructions (two loads, two stores) on x86-64,
> and it "just works".
>
> Yeah, yeah, it presumably generates 8 instructions on 32-bit x86, and
> we could fix that by just using the constant __USER_CS/DS instead (no
> loads necessary) since 32-bit doesn't have any compat issues.
>
> But is it worth complicating the patch for a couple of instructions in
> a non-critical path?
>
> And I don't see anybody stepping up to say "yes, I will do the patch
> for gdb", so I really think the least pain is to just take the very
> straightforward and tested kernel patch.
>
> Yes, yes, that also means admitting to ourselves that the gdb
> situation isn't likely going to improve, but hey, if nobody in this
> thread is willing to work on the gdb side to fix the known issues
> there, isn't that the honest thing to do anyway?

GDB is one thing. But is this setup actually correct under all
circumstances?

It's all fine that we have lots of blurb about GDB, but there is no
reasoning why this does not affect regular kernel threads which take the
same code path.

Neither is there an answer what happens in case of a signal delivered to
this thread and what any other GDB/ptraced induced poking might cause.

This is a half setup user space thread which is assumed to behave like a
regular kernel thread, but is this assumption actually true?

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] io_thread/x86: don't reset 'cs', 'ss', 'ds' and 'es' registers for io_threads
  2021-05-03 22:56               ` Thomas Gleixner
@ 2021-05-03 23:15                 ` Andy Lutomirski
  2021-05-03 23:16                 ` Linus Torvalds
  1 sibling, 0 replies; 44+ messages in thread
From: Andy Lutomirski @ 2021-05-03 23:15 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Linus Torvalds, Andy Lutomirski, Jens Axboe, Stefan Metzmacher,
	Linux Kernel Mailing List, io-uring, the arch/x86 maintainers



> On May 3, 2021, at 3:56 PM, Thomas Gleixner <[email protected]> wrote:
> 
> On Mon, May 03 2021 at 15:08, Linus Torvalds wrote:
>>> On Mon, May 3, 2021 at 2:49 PM Andy Lutomirski <[email protected]> wrote:
>>> 
>>> To be clear, I'm suggesting that we -EINVAL the PTRACE_GETREGS calls
>>> and such, not the ATTACH.  I have no idea what gdb will do if this
>>> happens, though.
>> 
>> I feel like the likelihood that it will make gdb work any better is
>> basically zero.
>> 
>> I think we should just do Stefan's patch - I assume it generates
>> something like four instructions (two loads, two stores) on x86-64,
>> and it "just works".
>> 
>> Yeah, yeah, it presumably generates 8 instructions on 32-bit x86, and
>> we could fix that by just using the constant __USER_CS/DS instead (no
>> loads necessary) since 32-bit doesn't have any compat issues.
>> 
>> But is it worth complicating the patch for a couple of instructions in
>> a non-critical path?
>> 
>> And I don't see anybody stepping up to say "yes, I will do the patch
>> for gdb", so I really think the least pain is to just take the very
>> straightforward and tested kernel patch.
>> 
>> Yes, yes, that also means admitting to ourselves that the gdb
>> situation isn't likely going to improve, but hey, if nobody in this
>> thread is willing to work on the gdb side to fix the known issues
>> there, isn't that the honest thing to do anyway?
> 
> GDB is one thing. But is this setup actually correct under all
> circumstances?
> 
> It's all fine that we have lots of blurb about GDB, but there is no
> reasoning why this does not affect regular kernel threads which take the
> same code path.
> 
> Neither is there an answer what happens in case of a signal delivered to
> this thread and what any other GDB/ptraced induced poking might cause.
> 
> This is a half setup user space thread which is assumed to behave like a
> regular kernel thread, but is this assumption actually true?
> 
> 

I’m personally concerned about FPU state. No one ever imagined when writing and reviewing the FPU state code that we were going to let ptrace poke the state on a kernel thread.

Now admittedly kernel_execve() magically turns kernel threads into user threads, but, again, I see no evidence that anyone has thought through all the implications of letting ptrace go to town before doing so.

(Is the io_uring thread a kthread style kernel thread?  kthread does horrible, horrible things with the thread stack.)

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] io_thread/x86: don't reset 'cs', 'ss', 'ds' and 'es' registers for io_threads
  2021-05-03 22:56               ` Thomas Gleixner
  2021-05-03 23:15                 ` Andy Lutomirski
@ 2021-05-03 23:16                 ` Linus Torvalds
  2021-05-03 23:19                   ` Linus Torvalds
                                     ` (2 more replies)
  1 sibling, 3 replies; 44+ messages in thread
From: Linus Torvalds @ 2021-05-03 23:16 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andy Lutomirski, Jens Axboe, Stefan Metzmacher,
	Linux Kernel Mailing List, io-uring, the arch/x86 maintainers

On Mon, May 3, 2021 at 3:56 PM Thomas Gleixner <[email protected]> wrote:
>
> It's all fine that we have lots of blurb about GDB, but there is no
> reasoning why this does not affect regular kernel threads which take the
> same code path.

Actual kernel threads don't get attached to by ptrace.

> This is a half setup user space thread which is assumed to behave like a
> regular kernel thread, but is this assumption actually true?

No, no.

It's a *fully set up USER thread*.

Those IO threads used to be kernel threads. That didn't work out for
the reasons already mentioned earlier.

These days they really are fully regular user threads, they just don't
return to user space because they continue to do the IO work that they
were created for.

Maybe instead of Stefan's patch, we could do something like this:

   diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
   index 43cbfc84153a..890f3992e781 100644
   --- a/arch/x86/kernel/process.c
   +++ b/arch/x86/kernel/process.c
   @@ -156,7 +156,7 @@ int copy_thread(unsigned long clone_flags,
unsigned long sp, unsigned long arg,
    #endif

        /* Kernel thread ? */
   -    if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
   +    if (unlikely(p->flags & PF_KTHREAD)) {
                memset(childregs, 0, sizeof(struct pt_regs));
                kthread_frame_init(frame, sp, arg);
                return 0;
   @@ -168,6 +168,17 @@ int copy_thread(unsigned long clone_flags,
unsigned long sp, unsigned long arg,
        if (sp)
                childregs->sp = sp;

   +    /*
   +     * An IO thread is a user space thread, but it doesn't
   +     * return to ret_after_fork(), it does the same kernel
   +     * frame setup to return to a kernel function that
   +     * a kernel thread does.
   +     */
   +    if (unlikely(p->flags & PF_IO_WORKER)) {
   +            kthread_frame_init(frame, sp, arg);
   +            return 0;
   +    }
   +
    #ifdef CONFIG_X86_32
        task_user_gs(p) = get_user_gs(current_pt_regs());
    #endif

does that clarify things and make people happier?

Maybe the compiler might even notice that the

                kthread_frame_init(frame, sp, arg);
                return 0;

part is common code and then it will result in less generated code too.

NOTE! The above is - as usual - COMPLETELY UNTESTED. It looks obvious
enough, and it builds cleanly. But that's all I'm going to guarantee.

It's whitespace-damaged on purpose.

            Linus

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] io_thread/x86: don't reset 'cs', 'ss', 'ds' and 'es' registers for io_threads
  2021-05-03 23:16                 ` Linus Torvalds
@ 2021-05-03 23:19                   ` Linus Torvalds
  2021-05-03 23:27                   ` Stefan Metzmacher
  2021-05-04  0:01                   ` Andy Lutomirski
  2 siblings, 0 replies; 44+ messages in thread
From: Linus Torvalds @ 2021-05-03 23:19 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andy Lutomirski, Jens Axboe, Stefan Metzmacher,
	Linux Kernel Mailing List, io-uring, the arch/x86 maintainers

On Mon, May 3, 2021 at 4:16 PM Linus Torvalds
<[email protected]> wrote:
>
> These days they really are fully regular user threads, they just don't
> return to user space because they continue to do the IO work that they
> were created for.

IOW, you should think of them as "io_uring does an interface clone()
to generate an async thread, it does the work and then exits".

Now, that's the conceptual thing - in reality one thread can do
multiple async things - but it's the mental image you should have.

Don't think of them as kernel threads. Yes, that's how it started, but
that really ended up being very very painful indeed.

            Linus

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] io_thread/x86: don't reset 'cs', 'ss', 'ds' and 'es' registers for io_threads
  2021-05-03 23:16                 ` Linus Torvalds
  2021-05-03 23:19                   ` Linus Torvalds
@ 2021-05-03 23:27                   ` Stefan Metzmacher
  2021-05-03 23:48                     ` Linus Torvalds
  2021-05-04  0:01                   ` Andy Lutomirski
  2 siblings, 1 reply; 44+ messages in thread
From: Stefan Metzmacher @ 2021-05-03 23:27 UTC (permalink / raw)
  To: Linus Torvalds, Thomas Gleixner
  Cc: Andy Lutomirski, Jens Axboe, Linux Kernel Mailing List, io-uring,
	the arch/x86 maintainers

Am 04.05.21 um 01:16 schrieb Linus Torvalds:
> On Mon, May 3, 2021 at 3:56 PM Thomas Gleixner <[email protected]> wrote:
>>
>> It's all fine that we have lots of blurb about GDB, but there is no
>> reasoning why this does not affect regular kernel threads which take the
>> same code path.
> 
> Actual kernel threads don't get attached to by ptrace.
> 
>> This is a half setup user space thread which is assumed to behave like a
>> regular kernel thread, but is this assumption actually true?
> 
> No, no.
> 
> It's a *fully set up USER thread*.
> 
> Those IO threads used to be kernel threads. That didn't work out for
> the reasons already mentioned earlier.
> 
> These days they really are fully regular user threads, they just don't
> return to user space because they continue to do the IO work that they
> were created for.
> 
> Maybe instead of Stefan's patch, we could do something like this:
> 
>    diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
>    index 43cbfc84153a..890f3992e781 100644
>    --- a/arch/x86/kernel/process.c
>    +++ b/arch/x86/kernel/process.c
>    @@ -156,7 +156,7 @@ int copy_thread(unsigned long clone_flags,
> unsigned long sp, unsigned long arg,
>     #endif
> 
>         /* Kernel thread ? */
>    -    if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
>    +    if (unlikely(p->flags & PF_KTHREAD)) {
>                 memset(childregs, 0, sizeof(struct pt_regs));
>                 kthread_frame_init(frame, sp, arg);
>                 return 0;
>    @@ -168,6 +168,17 @@ int copy_thread(unsigned long clone_flags,
> unsigned long sp, unsigned long arg,
>         if (sp)
>                 childregs->sp = sp;
> 
>    +    /*
>    +     * An IO thread is a user space thread, but it doesn't
>    +     * return to ret_after_fork(), it does the same kernel
>    +     * frame setup to return to a kernel function that
>    +     * a kernel thread does.
>    +     */
>    +    if (unlikely(p->flags & PF_IO_WORKER)) {
>    +            kthread_frame_init(frame, sp, arg);
>    +            return 0;
>    +    }
>    +
>     #ifdef CONFIG_X86_32
>         task_user_gs(p) = get_user_gs(current_pt_regs());
>     #endif
> 
> does that clarify things and make people happier?
> 
> Maybe the compiler might even notice that the
> 
>                 kthread_frame_init(frame, sp, arg);
>                 return 0;
> 
> part is common code and then it will result in less generated code too.
> 
> NOTE! The above is - as usual - COMPLETELY UNTESTED. It looks obvious
> enough, and it builds cleanly. But that's all I'm going to guarantee.

I think I also tested something similar, see:

https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=82fcee2774add04fbc0e4755c405e6c0b7467e3a

If I remember correctly gdb showed bogus addresses for the backtraces of the io_threads,
as some regs where not cleared.

The patch I posted shows this instead:

  Thread 2 (LWP 8744):
  #0  0x0000000000000000 in ?? ()
  Backtrace stopped: Cannot access memory at address 0x0

I think that's a saner behavior.

However splitting the if statements might be a good idea to make things
more clear.

Thanks discussing this again!
metze

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] io_thread/x86: don't reset 'cs', 'ss', 'ds' and 'es' registers for io_threads
  2021-05-03 23:27                   ` Stefan Metzmacher
@ 2021-05-03 23:48                     ` Linus Torvalds
  2021-05-04  2:50                       ` Jens Axboe
  2021-05-04  8:22                       ` David Laight
  0 siblings, 2 replies; 44+ messages in thread
From: Linus Torvalds @ 2021-05-03 23:48 UTC (permalink / raw)
  To: Stefan Metzmacher
  Cc: Thomas Gleixner, Andy Lutomirski, Jens Axboe,
	Linux Kernel Mailing List, io-uring, the arch/x86 maintainers

On Mon, May 3, 2021 at 4:27 PM Stefan Metzmacher <[email protected]> wrote:
>
> If I remember correctly gdb showed bogus addresses for the backtraces of the io_threads,
> as some regs where not cleared.

Yeah, so that patch will make the IO thread have the user stack
pointer point to the original user stack, but that stack will
obviously be used by the original thread which means that it will
contain random stuff on it.

Doing a

        childregs->sp = 0;

is probably a good idea for that PF_IO_WORKER case, since it really
doesn't have - or need - a user stack.

Of course, it doesn't really have - or need - any of the other user
registers either, but once you fill in the segment stuff to make gdb
happy, you might as well fill it all in using the same code that the
regular case does.

          Linus

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] io_thread/x86: don't reset 'cs', 'ss', 'ds' and 'es' registers for io_threads
  2021-05-03 23:16                 ` Linus Torvalds
  2021-05-03 23:19                   ` Linus Torvalds
  2021-05-03 23:27                   ` Stefan Metzmacher
@ 2021-05-04  0:01                   ` Andy Lutomirski
  2 siblings, 0 replies; 44+ messages in thread
From: Andy Lutomirski @ 2021-05-04  0:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Thomas Gleixner, Andy Lutomirski, Jens Axboe, Stefan Metzmacher,
	Linux Kernel Mailing List, io-uring, the arch/x86 maintainers

On Mon, May 3, 2021 at 4:16 PM Linus Torvalds
<[email protected]> wrote:
>
> On Mon, May 3, 2021 at 3:56 PM Thomas Gleixner <[email protected]> wrote:
> >
> > It's all fine that we have lots of blurb about GDB, but there is no
> > reasoning why this does not affect regular kernel threads which take the
> > same code path.
>
> Actual kernel threads don't get attached to by ptrace.
>
> > This is a half setup user space thread which is assumed to behave like a
> > regular kernel thread, but is this assumption actually true?
>
> No, no.
>
> It's a *fully set up USER thread*.
>
> Those IO threads used to be kernel threads. That didn't work out for
> the reasons already mentioned earlier.
>
> These days they really are fully regular user threads, they just don't
> return to user space because they continue to do the IO work that they
> were created for.
>
> Maybe instead of Stefan's patch, we could do something like this:
>
>    diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
>    index 43cbfc84153a..890f3992e781 100644
>    --- a/arch/x86/kernel/process.c
>    +++ b/arch/x86/kernel/process.c
>    @@ -156,7 +156,7 @@ int copy_thread(unsigned long clone_flags,
> unsigned long sp, unsigned long arg,
>     #endif
>
>         /* Kernel thread ? */
>    -    if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
>    +    if (unlikely(p->flags & PF_KTHREAD)) {
>                 memset(childregs, 0, sizeof(struct pt_regs));
>                 kthread_frame_init(frame, sp, arg);
>                 return 0;
>    @@ -168,6 +168,17 @@ int copy_thread(unsigned long clone_flags,
> unsigned long sp, unsigned long arg,
>         if (sp)
>                 childregs->sp = sp;
>
>    +    /*
>    +     * An IO thread is a user space thread, but it doesn't
>    +     * return to ret_after_fork(), it does the same kernel
>    +     * frame setup to return to a kernel function that
>    +     * a kernel thread does.
>    +     */
>    +    if (unlikely(p->flags & PF_IO_WORKER)) {
>    +            kthread_frame_init(frame, sp, arg);
>    +            return 0;
>    +    }
>    +
>     #ifdef CONFIG_X86_32
>         task_user_gs(p) = get_user_gs(current_pt_regs());
>     #endif
>
> does that clarify things and make people happier?
>
> Maybe the compiler might even notice that the
>
>                 kthread_frame_init(frame, sp, arg);
>                 return 0;
>
> part is common code and then it will result in less generated code too.
>
> NOTE! The above is - as usual - COMPLETELY UNTESTED. It looks obvious
> enough, and it builds cleanly. But that's all I'm going to guarantee.
>
> It's whitespace-damaged on purpose.

I like this patch considerably more than I liked the previous patch.

FWIW, I have this fixlet sitting around:

https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/kentry&id=1eef07ae5b236112c9a0c5d880d7f9bb13e73761

Your patch fixes the same bug for the specific case of io_uring.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] io_thread/x86: don't reset 'cs', 'ss', 'ds' and 'es' registers for io_threads
  2021-05-03 23:48                     ` Linus Torvalds
@ 2021-05-04  2:50                       ` Jens Axboe
  2021-05-04 11:39                         ` Stefan Metzmacher
  2021-05-12  4:24                         ` Olivier Langlois
  2021-05-04  8:22                       ` David Laight
  1 sibling, 2 replies; 44+ messages in thread
From: Jens Axboe @ 2021-05-04  2:50 UTC (permalink / raw)
  To: Linus Torvalds, Stefan Metzmacher
  Cc: Thomas Gleixner, Andy Lutomirski, Linux Kernel Mailing List,
	io-uring, the arch/x86 maintainers

On 5/3/21 5:48 PM, Linus Torvalds wrote:
> On Mon, May 3, 2021 at 4:27 PM Stefan Metzmacher <[email protected]> wrote:
>>
>> If I remember correctly gdb showed bogus addresses for the backtraces of the io_threads,
>> as some regs where not cleared.
> 
> Yeah, so that patch will make the IO thread have the user stack
> pointer point to the original user stack, but that stack will
> obviously be used by the original thread which means that it will
> contain random stuff on it.
> 
> Doing a
> 
>         childregs->sp = 0;
> 
> is probably a good idea for that PF_IO_WORKER case, since it really
> doesn't have - or need - a user stack.
> 
> Of course, it doesn't really have - or need - any of the other user
> registers either, but once you fill in the segment stuff to make gdb
> happy, you might as well fill it all in using the same code that the
> regular case does.

I tested the below, which is the two combined, with a case that
deliberately has two types of io threads - one for SQPOLL submission,
and one that was created due to async work being needed. gdb attaches
just fine to the creator, with a slight complaint:

Attaching to process 370
[New LWP 371]
[New LWP 372]
Error while reading shared library symbols for /usr/lib/libpthread.so.0:
Cannot find user-level thread for LWP 372: generic error
0x00007f1a74675125 in clock_nanosleep@GLIBC_2.2.5 () from /usr/lib/libc.so.6
(gdb) info threads
  Id   Target Id             Frame 
* 1    LWP 370 "io_uring"    0x00007f1a74675125 in clock_nanosleep@GLIBC_2.2.5 ()
   from /usr/lib/libc.so.6
  2    LWP 371 "iou-sqp-370" 0x00007f1a746a7a9d in syscall () from /usr/lib/libc.so.6
  3    LWP 372 "io_uring"    0x00007f1a74675125 in clock_nanosleep@GLIBC_2.2.5 ()
   from /usr/lib/libc.so.6

(gdb) thread 2
[Switching to thread 2 (LWP 371)]
#0  0x00007f1a746a7a9d in syscall () from /usr/lib/libc.so.6
(gdb) bt
#0  0x00007f1a746a7a9d in syscall () from /usr/lib/libc.so.6
Backtrace stopped: Cannot access memory at address 0x0

(gdb) thread 1
[Switching to thread 1 (LWP 370)]
#0  0x00007f1a74675125 in clock_nanosleep@GLIBC_2.2.5 () from /usr/lib/libc.so.6
(gdb) bt
#0  0x00007f1a74675125 in clock_nanosleep@GLIBC_2.2.5 () from /usr/lib/libc.so.6
#1  0x00007f1a7467a357 in nanosleep () from /usr/lib/libc.so.6
#2  0x00007f1a7467a28e in sleep () from /usr/lib/libc.so.6
#3  0x000055bd41e929ba in main (argc=<optimized out>, argv=<optimized out>)
    at t/io_uring.c:658

which looks very reasonable to me - no backtraces for the io threads, and
no arch complaints.


diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 43cbfc84153a..58987bce90e2 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -156,7 +156,7 @@ int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long arg,
 #endif
 
 	/* Kernel thread ? */
-	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+	if (unlikely(p->flags & PF_KTHREAD)) {
 		memset(childregs, 0, sizeof(struct pt_regs));
 		kthread_frame_init(frame, sp, arg);
 		return 0;
@@ -168,6 +168,12 @@ int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long arg,
 	if (sp)
 		childregs->sp = sp;
 
+	if (unlikely(p->flags & PF_IO_WORKER)) {
+		childregs->sp = 0;
+		kthread_frame_init(frame, sp, arg);
+		return 0;
+	}
+
 #ifdef CONFIG_X86_32
 	task_user_gs(p) = get_user_gs(current_pt_regs());
 #endif


-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* RE: [PATCH] io_thread/x86: don't reset 'cs', 'ss', 'ds' and 'es' registers for io_threads
  2021-05-03 23:48                     ` Linus Torvalds
  2021-05-04  2:50                       ` Jens Axboe
@ 2021-05-04  8:22                       ` David Laight
  1 sibling, 0 replies; 44+ messages in thread
From: David Laight @ 2021-05-04  8:22 UTC (permalink / raw)
  To: 'Linus Torvalds', Stefan Metzmacher
  Cc: Thomas Gleixner, Andy Lutomirski, Jens Axboe,
	Linux Kernel Mailing List, io-uring, the arch/x86 maintainers

From: Linus Torvalds
> Sent: 04 May 2021 00:48
> 
> On Mon, May 3, 2021 at 4:27 PM Stefan Metzmacher <[email protected]> wrote:
> >
> > If I remember correctly gdb showed bogus addresses for the backtraces of the io_threads,
> > as some regs where not cleared.
> 
> Yeah, so that patch will make the IO thread have the user stack
> pointer point to the original user stack, but that stack will
> obviously be used by the original thread which means that it will
> contain random stuff on it.
> 
> Doing a
> 
>         childregs->sp = 0;
> 
> is probably a good idea for that PF_IO_WORKER case, since it really
> doesn't have - or need - a user stack.
> 
> Of course, it doesn't really have - or need - any of the other user
> registers either, but once you fill in the segment stuff to make gdb
> happy, you might as well fill it all in using the same code that the
> regular case does.

Presumably gdb can only read/write the 'user' registers (normally
saved on kernel entry).
Since these will never be loaded it really doesn't matter (to the
kernel) what is returned to gdb or what gdb writes into them.
The same ought to be true of the FP state.

If gdb writes to an FP (etc) register the process doesn't currently
have (eg an AVX512 register) then that is not really different from
doing it to a normal process.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] io_thread/x86: don't reset 'cs', 'ss', 'ds' and 'es' registers for io_threads
  2021-05-03 19:14   ` Linus Torvalds
  2021-05-03 20:15     ` Andy Lutomirski
@ 2021-05-04  8:39     ` Peter Zijlstra
  2021-05-04 15:35       ` Borislav Petkov
  1 sibling, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2021-05-04  8:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Thomas Gleixner, Stefan Metzmacher, Jens Axboe,
	Linux Kernel Mailing List, io-uring, the arch/x86 maintainers,
	linux-toolchains


+ linux-toolchains

On Mon, May 03, 2021 at 12:14:45PM -0700, Linus Torvalds wrote:
> On Mon, May 3, 2021 at 9:05 AM Andy Lutomirski <[email protected]> wrote:
> >
> > Linus, what is the actual effect of allowing gdb to attach these threads?  Can we instead make all the regset ops do:
> >
> > if (not actually a user thread) return -EINVAL;
> 
> I don't think it matters - the end result ends up  being the same, ie
> gdb gets confused about whether the (parent) thread is a 32-bit or
> 64-bit one.
> 
> So the basic issue is
> 
>  (a) we want the IO threads to look exactly like normal user threads
> as far as the kernel is concerned, because we had way too many bugs
> due to special cases.
> 
>  (b) but that means that they are also visible to user space, and then
> gdb has this odd thing where it takes the 64-bit vs 32-bit data for
> the whole process from one thread, and picks the worst possible thread
> to do it (ie explicitly not even the main thread, so usually the IO
> thread!)
> 
> That (a) ended up really being critical. The issues with special cases
> were just horrendous, both for security issues (ie "make them kernel
> threads but carry user credentials" just caused lots of problems) but
> also for various just random other state handling issues (signal state
> in particular).
> 
> So generally, the IO threads are now 100% normal threads - it's
> literally just that they never return to user space because they are
> always just doing the IO offload on the kernel side.
> 
> That part is lovely, but part of the "100% IO threads" really is that
> they share the signal struct too, which in turn means that they very
> much show up as normal threads. Again, not a problem: they really
> _are_ normal threads for all intents and purposes.
> 
> But then that (b) issue means that gdb gets confused by them. I
> personally think that's just a pure gdb mis-feature, but I also think
> that "hey, if we just make the register state look like the main
> thread, and unconfuse gdb that way, problem solved".
> 
> So I'd actually rather not make these non-special threads any more
> special at all. And I strongly suspect that making ptrace() not work
> on them will just confuse gdb even more - so it would make them just
> unnecessarily special in the kernel, for no actual gain.
> 
> Is the right thing to do to fix gdb to not look at irrelevant thread B
> when deciding whether thread A is 64-bit or not? Yeah, that seems like
> obviously the RightThing(tm) to me.
> 
> But at the same time, this is arguably about "regression", although at
> the same time it's "gdb doesn't understand new user programs that use
> new features, film at 11", so I think that argument is partly bogus
> too.
> 
> So my personal preference would be:
> 
>  - make those threads look even more like user threads, even if that
> means giving them pointless user segment data that the threads
> themselves will never use
> 
>    So I think Stefan's patch is reasonable, if not pretty. Literally
> becasue of that "make these threads look even more normal"
> 
>  - ALSO fix gdb that is doing obviously garbage stupid things
> 
> But I'm obviously not involved in that "ALSO fix gdb" part, and
> arguably the kernel hack then makes it more likely that gdb will
> continue doing its insane broken thing.

Anybody on toolchains that can help get GDB fixed?

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] io_thread/x86: don't reset 'cs', 'ss', 'ds' and 'es' registers for io_threads
  2021-05-04  2:50                       ` Jens Axboe
@ 2021-05-04 11:39                         ` Stefan Metzmacher
  2021-05-04 15:53                           ` Linus Torvalds
  2021-05-12  4:24                         ` Olivier Langlois
  1 sibling, 1 reply; 44+ messages in thread
From: Stefan Metzmacher @ 2021-05-04 11:39 UTC (permalink / raw)
  To: Jens Axboe, Linus Torvalds
  Cc: Thomas Gleixner, Andy Lutomirski, Linux Kernel Mailing List,
	io-uring, the arch/x86 maintainers


Am 04.05.21 um 04:50 schrieb Jens Axboe:
> On 5/3/21 5:48 PM, Linus Torvalds wrote:
>> On Mon, May 3, 2021 at 4:27 PM Stefan Metzmacher <[email protected]> wrote:
>>>
>>> If I remember correctly gdb showed bogus addresses for the backtraces of the io_threads,
>>> as some regs where not cleared.
>>
>> Yeah, so that patch will make the IO thread have the user stack
>> pointer point to the original user stack, but that stack will
>> obviously be used by the original thread which means that it will
>> contain random stuff on it.
>>
>> Doing a
>>
>>         childregs->sp = 0;
>>
>> is probably a good idea for that PF_IO_WORKER case, since it really
>> doesn't have - or need - a user stack.
>>
>> Of course, it doesn't really have - or need - any of the other user
>> registers either, but once you fill in the segment stuff to make gdb
>> happy, you might as well fill it all in using the same code that the
>> regular case does.
> 
> I tested the below, which is the two combined, with a case that
> deliberately has two types of io threads - one for SQPOLL submission,
> and one that was created due to async work being needed. gdb attaches
> just fine to the creator, with a slight complaint:
> 
> Attaching to process 370
> [New LWP 371]
> [New LWP 372]
> Error while reading shared library symbols for /usr/lib/libpthread.so.0:
> Cannot find user-level thread for LWP 372: generic error
> 0x00007f1a74675125 in clock_nanosleep@GLIBC_2.2.5 () from /usr/lib/libc.so.6
> (gdb) info threads
>   Id   Target Id             Frame 
> * 1    LWP 370 "io_uring"    0x00007f1a74675125 in clock_nanosleep@GLIBC_2.2.5 ()
>    from /usr/lib/libc.so.6
>   2    LWP 371 "iou-sqp-370" 0x00007f1a746a7a9d in syscall () from /usr/lib/libc.so.6
>   3    LWP 372 "io_uring"    0x00007f1a74675125 in clock_nanosleep@GLIBC_2.2.5 ()
>    from /usr/lib/libc.so.6
> 
> (gdb) thread 2
> [Switching to thread 2 (LWP 371)]
> #0  0x00007f1a746a7a9d in syscall () from /usr/lib/libc.so.6
> (gdb) bt
> #0  0x00007f1a746a7a9d in syscall () from /usr/lib/libc.so.6
> Backtrace stopped: Cannot access memory at address 0x0
> 
> (gdb) thread 1
> [Switching to thread 1 (LWP 370)]
> #0  0x00007f1a74675125 in clock_nanosleep@GLIBC_2.2.5 () from /usr/lib/libc.so.6
> (gdb) bt
> #0  0x00007f1a74675125 in clock_nanosleep@GLIBC_2.2.5 () from /usr/lib/libc.so.6
> #1  0x00007f1a7467a357 in nanosleep () from /usr/lib/libc.so.6
> #2  0x00007f1a7467a28e in sleep () from /usr/lib/libc.so.6
> #3  0x000055bd41e929ba in main (argc=<optimized out>, argv=<optimized out>)
>     at t/io_uring.c:658
> 
> which looks very reasonable to me - no backtraces for the io threads, and
> no arch complaints.
> 
> 
> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> index 43cbfc84153a..58987bce90e2 100644
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -156,7 +156,7 @@ int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long arg,
>  #endif
>  
>  	/* Kernel thread ? */
> -	if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
> +	if (unlikely(p->flags & PF_KTHREAD)) {
>  		memset(childregs, 0, sizeof(struct pt_regs));
>  		kthread_frame_init(frame, sp, arg);
>  		return 0;
> @@ -168,6 +168,12 @@ int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long arg,
>  	if (sp)
>  		childregs->sp = sp;
>  
> +	if (unlikely(p->flags & PF_IO_WORKER)) {
> +		childregs->sp = 0;
> +		kthread_frame_init(frame, sp, arg);
> +		return 0;
> +	}
> +
>  #ifdef CONFIG_X86_32
>  	task_user_gs(p) = get_user_gs(current_pt_regs());
>  #endif

I'm currently testing this (moving things to the end and resetting ->ip = 0 too)

--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -161,7 +161,7 @@ int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long arg,
 #endif

        /* Kernel thread ? */
-       if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+       if (unlikely(p->flags & PF_KTHREAD)) {
                memset(childregs, 0, sizeof(struct pt_regs));
                kthread_frame_init(frame, sp, arg);
                return 0;
@@ -184,6 +184,23 @@ int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long arg,
        if (!ret && unlikely(test_tsk_thread_flag(current, TIF_IO_BITMAP)))
                io_bitmap_share(p);

+       /*
+        * An IO thread is a user space thread, but it doesn't
+        * return to ret_after_fork().
+        *
+        * In order to indicate that to tools like gdb,
+        * we reset the stack and instruction pointers.
+        *
+        * It does the same kernel frame setup to return to a kernel
+        * function that a kernel thread does.
+        */
+       if (!ret && unlikely(p->flags & PF_IO_WORKER)) {
+               childregs->sp = 0;
+               childregs->ip = 0;
+               kthread_frame_init(frame, sp, arg);
+               return 0;
+       }
+
        return ret;
 }

which means the output looks like this:

(gdb) info threads
  Id   Target Id                  Frame
* 1    LWP 4863 "io_uring-cp-for" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  2    LWP 4864 "iou-mgr-4863"    0x0000000000000000 in ?? ()
  3    LWP 4865 "iou-wrk-4863"    0x0000000000000000 in ?? ()
(gdb) thread 3
[Switching to thread 3 (LWP 4865)]
#0  0x0000000000000000 in ?? ()
(gdb) bt
#0  0x0000000000000000 in ?? ()
Backtrace stopped: Cannot access memory at address 0x0

I think "0x0000000000000000 in ?? ()" is a relative sane indication that this thread
will never return to userspace. I'd prefer this over:

> (gdb) thread 2
> [Switching to thread 2 (LWP 371)]
> #0  0x00007f1a746a7a9d in syscall () from /usr/lib/libc.so.6
> (gdb) bt
> #0  0x00007f1a746a7a9d in syscall () from /usr/lib/libc.so.6
> Backtrace stopped: Cannot access memory at address 0x0

which seem to indicate that the syscall returns eventually.

What do you think? Should I post that as v2 if my final testing doesn't find any problem?

Thanks!
metze

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] io_thread/x86: don't reset 'cs', 'ss', 'ds' and 'es' registers for io_threads
  2021-05-04  8:39     ` Peter Zijlstra
@ 2021-05-04 15:35       ` Borislav Petkov
  2021-05-04 15:55         ` Simon Marchi
  0 siblings, 1 reply; 44+ messages in thread
From: Borislav Petkov @ 2021-05-04 15:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Andy Lutomirski, Thomas Gleixner,
	Stefan Metzmacher, Jens Axboe, Linux Kernel Mailing List,
	io-uring, the arch/x86 maintainers, linux-toolchains

On Tue, May 04, 2021 at 10:39:23AM +0200, Peter Zijlstra wrote:
> Anybody on toolchains that can help get GDB fixed?

In the meantime, Tom is looking at fixing this, in case people wanna try
gdb patches or give him a test case or so...

https://sourceware.org/bugzilla/show_bug.cgi?id=27822

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] io_thread/x86: don't reset 'cs', 'ss', 'ds' and 'es' registers for io_threads
  2021-05-04 11:39                         ` Stefan Metzmacher
@ 2021-05-04 15:53                           ` Linus Torvalds
  0 siblings, 0 replies; 44+ messages in thread
From: Linus Torvalds @ 2021-05-04 15:53 UTC (permalink / raw)
  To: Stefan Metzmacher
  Cc: Jens Axboe, Thomas Gleixner, Andy Lutomirski,
	Linux Kernel Mailing List, io-uring, the arch/x86 maintainers

On Tue, May 4, 2021 at 4:39 AM Stefan Metzmacher <[email protected]> wrote:
>
> I'm currently testing this (moving things to the end and resetting ->ip = 0 too)

This part is not right (or at least very questionable):

> +       if (!ret && unlikely(p->flags & PF_IO_WORKER)) {

That testing "ret" is misleading, in my opinion.

If PF_IO_WORKER is set, there is no way we  wouldn't want to do the
kthread_frame_init().

Now, ret can never be non-zero, because PF_IO_WORKER will never have
CLONE_SETTLS set, so this is kind of moot, but it does mean that the
test for 'ret' is just pointless, and makes the code look like it
would care.

For similar reasons, we probably don't want to go down to the whole
io_bitmap_share() case - the IO bitmap only makes sense in user space,
so going through all that code is pointless, but also would make
people think it might be relevant (and we _would_ copy the io bitmap
pointer and increment the ref if the real user thread had one, so we'd
do all that pointless stuff that doesn't actually matter).

So don't move that code down. It's best done right after the register
initialization.

Moving it down to below the setting of 'gs' for the 32-bit case is ok,
though. I think my original patch had it above it, but technically it
makes sense to do it below - that's when all the register state really
is initialized.

As to:

> +               childregs->ip = 0;
> [..]
> which means the output looks like this:
>
> (gdb) info threads
>   Id   Target Id                  Frame
> * 1    LWP 4863 "io_uring-cp-for" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
>   2    LWP 4864 "iou-mgr-4863"    0x0000000000000000 in ?? ()
>   3    LWP 4865 "iou-wrk-4863"    0x0000000000000000 in ?? ()
> (gdb) thread 3
> [Switching to thread 3 (LWP 4865)]
> #0  0x0000000000000000 in ?? ()
> (gdb) bt
> #0  0x0000000000000000 in ?? ()
> Backtrace stopped: Cannot access memory at address 0x0

Yeah, that's probably sensible.

I'm not sure it's a bad idea to show the IO thread as being in the
original system call - that makes perfect sense to me too, but I guess
it could go either way. So I don't think it's wrong to clear the user
space ->ip.

> What do you think? Should I post that as v2 if my final testing doesn't find any problem?

Yes, please, with the above "move the IO thread return up a bit"
comment, please do post a tested version with some nice commit log,
and we can close this issue.

It even looks like gdb will be cleaned up too. Yay. But I think having
that separate test for PF_IO_WORKER is a good idea regardless, since
it just makes clear that an IO worker isn't the same thing as a kernel
thread in that code.

         Linus

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] io_thread/x86: don't reset 'cs', 'ss', 'ds' and 'es' registers for io_threads
  2021-05-04 15:35       ` Borislav Petkov
@ 2021-05-04 15:55         ` Simon Marchi
  2021-05-05 11:29           ` Stefan Metzmacher
  0 siblings, 1 reply; 44+ messages in thread
From: Simon Marchi @ 2021-05-04 15:55 UTC (permalink / raw)
  To: Borislav Petkov, Peter Zijlstra
  Cc: Linus Torvalds, Andy Lutomirski, Thomas Gleixner,
	Stefan Metzmacher, Jens Axboe, Linux Kernel Mailing List,
	io-uring, the arch/x86 maintainers, linux-toolchains

On 2021-05-04 11:35 a.m., Borislav Petkov wrote:
> On Tue, May 04, 2021 at 10:39:23AM +0200, Peter Zijlstra wrote:
>> Anybody on toolchains that can help get GDB fixed?
> 
> In the meantime, Tom is looking at fixing this, in case people wanna try
> gdb patches or give him a test case or so...
> 
> https://sourceware.org/bugzilla/show_bug.cgi?id=27822

Yes, please provide reproducing steps in that bug.  Unlike what was said
in this thread, some people do work on gdb and are willing to fix
things, but they can only do so if they know about the problem.

Simon

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] io_thread/x86: don't reset 'cs', 'ss', 'ds' and 'es' registers for io_threads
  2021-05-04 15:55         ` Simon Marchi
@ 2021-05-05 11:29           ` Stefan Metzmacher
  2021-05-05 21:59             ` Simon Marchi
  0 siblings, 1 reply; 44+ messages in thread
From: Stefan Metzmacher @ 2021-05-05 11:29 UTC (permalink / raw)
  To: Simon Marchi, Borislav Petkov, Peter Zijlstra
  Cc: Linus Torvalds, Andy Lutomirski, Thomas Gleixner, Jens Axboe,
	Linux Kernel Mailing List, io-uring, the arch/x86 maintainers,
	linux-toolchains


Am 04.05.21 um 17:55 schrieb Simon Marchi:
> On 2021-05-04 11:35 a.m., Borislav Petkov wrote:
>> On Tue, May 04, 2021 at 10:39:23AM +0200, Peter Zijlstra wrote:
>>> Anybody on toolchains that can help get GDB fixed?
>>
>> In the meantime, Tom is looking at fixing this, in case people wanna try
>> gdb patches or give him a test case or so...
>>
>> https://sourceware.org/bugzilla/show_bug.cgi?id=27822
> 
> Yes, please provide reproducing steps in that bug.  Unlike what was said
> in this thread, some people do work on gdb and are willing to fix
> things, but they can only do so if they know about the problem.

See https://lore.kernel.org/io-uring/[email protected]/
and https://lore.kernel.org/io-uring/[email protected]/T/#m461f280e8c3d32a49bc7da7bb5e214e90d97cf65

The question is why does inferior_ptid doesn't represent the thread
that was specified by 'gdb --pid PIDVAL'

https://www.samba.org/~metze/strace-uring-fail.txt
used "gdb --pid 1396" and does the following ptrace calls:

# grep ptrace strace-uring-fail.txt

> 18:46:35.319925 ptrace(PTRACE_ATTACH, 1396) = 0 <0.000048>
> 18:46:35.321622 ptrace(PTRACE_ATTACH, 1397) = 0 <0.000059>
> 18:46:35.322813 ptrace(PTRACE_ATTACH, 1398) = 0 <0.003052>
> 18:46:35.327287 ptrace(PTRACE_ATTACH, 1399) = 0 <0.000028>
> 18:46:35.334920 ptrace(PTRACE_GETREGS, 1396, NULL, 0x7ffed6173ea0) = 0 <0.000067>
> 18:46:35.341506 ptrace(PTRACE_SETOPTIONS, 1410, NULL, PTRACE_O_TRACESYSGOOD) = 0 <0.000056>
> 18:46:35.341681 ptrace(PTRACE_SETOPTIONS, 1410, NULL, PTRACE_O_TRACEFORK) = 0 <0.000051>
> 18:46:35.341816 ptrace(PTRACE_SETOPTIONS, 1410, NULL, PTRACE_O_TRACEFORK|PTRACE_O_TRACEVFORKDONE) = 0 <0.000054>
> 18:46:35.341957 ptrace(PTRACE_CONT, 1410, NULL, 0) = 0 <0.000056>
> 18:46:35.345568 ptrace(PTRACE_GETEVENTMSG, 1410, NULL, [1411]) = 0 <0.000081>
> 18:46:35.350541 ptrace(PTRACE_SETOPTIONS, 1410, NULL, PTRACE_O_EXITKILL) = 0 <0.000019>
> 18:46:35.354010 ptrace(PTRACE_SETOPTIONS, 1397, NULL, PTRACE_O_TRACESYSGOOD|PTRACE_O_TRACEFORK|PTRACE_O_TRACEVFORK|PTRACE_O_TRACECLONE|PTRACE_O_TRACEEXEC|PTRACE_O_TRACEVFORKDONE) = 0 <0.000019>
> 18:46:35.415730 ptrace(PTRACE_SETOPTIONS, 1396, NULL, PTRACE_O_TRACESYSGOOD|PTRACE_O_TRACEFORK|PTRACE_O_TRACEVFORK|PTRACE_O_TRACECLONE|PTRACE_O_TRACEEXEC|PTRACE_O_TRACEVFORKDONE) = 0 <0.000076>
> 18:46:35.421076 ptrace(PTRACE_GETREGS, 1412, NULL, 0x7ffed6174980) = 0 <0.000088>
> 18:46:35.429498 ptrace(PTRACE_PEEKUSER, 1397, 8*CS, [NULL]) = 0 <0.000022>
> 18:46:35.429632 ptrace(PTRACE_PEEKUSER, 1397, 8*SS + 24, [NULL]) = 0 <0.000019>
> 18:46:35.429732 ptrace(PTRACE_GETREGSET, 1397, NT_X86_XSTATE, [{iov_base=0x7ffed6174780, iov_len=576}]) = 0 <0.000030>
> 18:46:35.435507 ptrace(PTRACE_GETREGS, 1397, NULL, 0x7ffed6173cb0) = 0 <0.000019>
> 18:46:35.445877 ptrace(PTRACE_PEEKTEXT, 1397, 0x56357e99de00, [0x7f49d572b160]) = 0 <0.000057>
> 18:46:35.446043 ptrace(PTRACE_PEEKTEXT, 1397, 0x7f49d572b168, [0x7f49d572b190]) = 0 <0.000049>
> 18:46:35.447192 ptrace(PTRACE_PEEKTEXT, 1397, 0x7f49d572bbf0, [0x64762d78756e696c]) = 0 <0.000060>
> 18:46:35.447368 ptrace(PTRACE_PEEKTEXT, 1397, 0x7f49d572bbf0, [0x64762d78756e696c]) = 0 <0.000075>
> 18:46:35.447571 ptrace(PTRACE_PEEKTEXT, 1397, 0x7f49d572bbf8, [0x312e6f732e6f73]) = 0 <0.000070>
> 18:46:35.447762 ptrace(PTRACE_PEEKTEXT, 1397, 0x7f49d572bbf8, [0x312e6f732e6f73]) = 0 <0.000067>
> 18:46:35.448658 ptrace(PTRACE_PEEKTEXT, 1397, 0x7f49d572be10, [0x3638782f62696c2f]) = 0 <0.000076>
> 18:46:35.448917 ptrace(PTRACE_PEEKTEXT, 1397, 0x7f49d572be10, [0x3638782f62696c2f]) = 0 <0.000050>
> 18:46:35.449051 ptrace(PTRACE_PEEKTEXT, 1397, 0x7f49d572be18, [0x756e696c2d34365f]) = 0 <0.000045>
> 18:46:35.449173 ptrace(PTRACE_PEEKTEXT, 1397, 0x7f49d572be18, [0x756e696c2d34365f]) = 0 <0.000043>
> 18:46:35.449292 ptrace(PTRACE_PEEKTEXT, 1397, 0x7f49d572be20, [0x696c2f756e672d78]) = 0 <0.000042>
> 18:46:35.449414 ptrace(PTRACE_PEEKTEXT, 1397, 0x7f49d572be20, [0x696c2f756e672d78]) = 0 <0.000048>

ptrace(PTRACE_GETREGS, 1396, ... looks expected to me, but
starting with ptrace(PTRACE_PEEKUSER, 1397, 8*CS, [NULL]) (which triggers the actual problem)
it's unexpected to me why 1397 is used instead of 1396.

1397 is the iou-mgr-1396 iothread.

I hope that helps!
metze

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] io_thread/x86: don't reset 'cs', 'ss', 'ds' and 'es' registers for io_threads
  2021-05-05 11:29           ` Stefan Metzmacher
@ 2021-05-05 21:59             ` Simon Marchi
  2021-05-05 22:11               ` Andy Lutomirski
  2021-05-05 22:21               ` Stefan Metzmacher
  0 siblings, 2 replies; 44+ messages in thread
From: Simon Marchi @ 2021-05-05 21:59 UTC (permalink / raw)
  To: Stefan Metzmacher, Borislav Petkov, Peter Zijlstra
  Cc: Linus Torvalds, Andy Lutomirski, Thomas Gleixner, Jens Axboe,
	Linux Kernel Mailing List, io-uring, the arch/x86 maintainers,
	linux-toolchains

On 2021-05-05 7:29 a.m., Stefan Metzmacher wrote:
> See https://lore.kernel.org/io-uring/[email protected]/
> and https://lore.kernel.org/io-uring/[email protected]/T/#m461f280e8c3d32a49bc7da7bb5e214e90d97cf65
> 
> The question is why does inferior_ptid doesn't represent the thread
> that was specified by 'gdb --pid PIDVAL'

Hi Stefan,

When you attach to PIDVAL (assuming that PIDVAL is a thread-group
leader), GDB attaches to all the threads of that thread group.  The
inferior_ptid global variable is "the thread we are currently working
with", and changes whenever GDB wants to deal with a different thread.

After attaching to all threads, GDB wants to know more about that
process' architecture (that read_description call mentioned in [1]).
The way this is implemented varies from arch to arch, but as you found
out, for x86-64 it consists of peeking into a thread's CS/DS to
determine whether the process is x86-64, x32 or i386.  The thread used
for this - the inferior_ptid value - just happens to be the latest
thread we switched inferior_ptid to (presumably because we iterated on
the thread list to do something and that was the last thread in the
list).  And up to now, this was working under the assumption that
picking any thread of the process would yield the same values for that
purpose.  I don't think it was intentionally done this way, but it
worked and didn't cause any trouble until now.

With io threads, that assumption doesn't hold anymore.  From what I
understand, your v1 patch made it so that the kernel puts the CS/DS
values GDB expects in the io threads (even though they are never
actually used otherwise).  I don't understand how your v2 patch
addresses the problem though.

I don't think it would be a problem on the GDB-side to make sure to
fetch these values from a "standard" thread.  Most likely the thread
group leader, like Tom has proposed in [1].

Thanks,

Simon

[1] https://sourceware.org/bugzilla/show_bug.cgi?id=27822#c0

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] io_thread/x86: don't reset 'cs', 'ss', 'ds' and 'es' registers for io_threads
  2021-05-05 21:59             ` Simon Marchi
@ 2021-05-05 22:11               ` Andy Lutomirski
  2021-05-05 23:12                 ` Borislav Petkov
                                   ` (2 more replies)
  2021-05-05 22:21               ` Stefan Metzmacher
  1 sibling, 3 replies; 44+ messages in thread
From: Andy Lutomirski @ 2021-05-05 22:11 UTC (permalink / raw)
  To: Simon Marchi
  Cc: Stefan Metzmacher, Borislav Petkov, Peter Zijlstra,
	Linus Torvalds, Thomas Gleixner, Jens Axboe,
	Linux Kernel Mailing List, io-uring, the arch/x86 maintainers,
	linux-toolchains

I'm not holding my breath, but:

On Wed, May 5, 2021 at 2:59 PM Simon Marchi <[email protected]> wrote:
>
> On 2021-05-05 7:29 a.m., Stefan Metzmacher wrote:
> > See https://lore.kernel.org/io-uring/[email protected]/
> > and https://lore.kernel.org/io-uring/[email protected]/T/#m461f280e8c3d32a49bc7da7bb5e214e90d97cf65
> >
> > The question is why does inferior_ptid doesn't represent the thread
> > that was specified by 'gdb --pid PIDVAL'
>
> Hi Stefan,
>
> When you attach to PIDVAL (assuming that PIDVAL is a thread-group
> leader), GDB attaches to all the threads of that thread group.  The
> inferior_ptid global variable is "the thread we are currently working
> with", and changes whenever GDB wants to deal with a different thread.
>
> After attaching to all threads, GDB wants to know more about that
> process' architecture (that read_description call mentioned in [1]).

^^^^^^

For what it's worth, this is already fundamentally incorrect.  On
x86_64 Linux, a process *does* *not* *have* an architecture.  Every
task on an x86_64 Linux host has a full 64-bit register state.  The
task can, and sometimes does, change CS using far transfers or other
bizarre techniques, and neither the kernel nor GDB will be notified or
have a chance to take any action in response.  ELF files can be
32-bit, CS:rIP can point at 32-bit code, and system calls can be
32-bit (even from 64-bit code), but *tasks* are not 32-bit.

Now I realize that the ptrace() API is awful and makes life difficult
in several respects for no good reason but, if gdb is ever interested
in fixing its ideas about architecture to understand that all tasks,
even those that think of themselves as "compat", have full 64-bit
state, I would be more than willing to improve the ptrace() API as
needed to make this work well.

Since I'm not holding my breath, please at least keep in mind that
anything you do here is merely a heuristic, cannot be fully correct,
and then whenever gdb determines that a thread group or a thread is
"32-bit", gdb is actually deciding to operate in a degraded mode for
that task, is not accurately representing the task state, and is at
risk of crashing, malfunctioning, or crashing the inferior due to its
incorrect assumptions.  If you have ever attached gdb to QEMU's
gdbserver and tried to debug the early boot process of a 64-bit Linux
kernel, you may have encountered this class of bugs.  gdb works very,
very poorly for this use case.

(To avoid confusion, this is not a universal property of Linux.  arm64
and arm32 tasks on an arm64 Linux host are different and cannot
arbitrarily switch modes.)

--Andy

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] io_thread/x86: don't reset 'cs', 'ss', 'ds' and 'es' registers for io_threads
  2021-05-05 21:59             ` Simon Marchi
  2021-05-05 22:11               ` Andy Lutomirski
@ 2021-05-05 22:21               ` Stefan Metzmacher
  2021-05-05 23:15                 ` Simon Marchi
  1 sibling, 1 reply; 44+ messages in thread
From: Stefan Metzmacher @ 2021-05-05 22:21 UTC (permalink / raw)
  To: Simon Marchi, Borislav Petkov, Peter Zijlstra
  Cc: Linus Torvalds, Andy Lutomirski, Thomas Gleixner, Jens Axboe,
	Linux Kernel Mailing List, io-uring, the arch/x86 maintainers,
	linux-toolchains


Hi Simon,

> When you attach to PIDVAL (assuming that PIDVAL is a thread-group
> leader), GDB attaches to all the threads of that thread group.  The
> inferior_ptid global variable is "the thread we are currently working
> with", and changes whenever GDB wants to deal with a different thread.
> 
> After attaching to all threads, GDB wants to know more about that
> process' architecture (that read_description call mentioned in [1]).
> The way this is implemented varies from arch to arch, but as you found
> out, for x86-64 it consists of peeking into a thread's CS/DS to
> determine whether the process is x86-64, x32 or i386.  The thread used
> for this - the inferior_ptid value - just happens to be the latest
> thread we switched inferior_ptid to (presumably because we iterated on
> the thread list to do something and that was the last thread in the
> list).  And up to now, this was working under the assumption that
> picking any thread of the process would yield the same values for that
> purpose.  I don't think it was intentionally done this way, but it
> worked and didn't cause any trouble until now.
> 
> With io threads, that assumption doesn't hold anymore.

Yes, in 5.12

> From what I understand, your v1 patch made it so that the kernel puts the CS/DS
> values GDB expects in the io threads (even though they are never
> actually used otherwise).  I don't understand how your v2 patch
> addresses the problem though.

v1 did clear everything and tried to keep some selected registers:

  'memset(childregs, 0, sizeof(struct pt_regs));'
  childregs->cs = currenttrgs->cs;
  childregs->ss = currenttrgs->ss;
  childregs->ds = currenttrgs->ds;
  childregs->es = currenttrgs->es;

v2 copies everything and only clears 3 selected registers (I added the last two for
   the PF_IO_WORKER case:

  *childregs = *current_pt_regs();
  childregs->ax = 0;
  ...
  childregs->ip = 0;
  childregs->sp = 0;

So regarding childregs->cs and childregs->ds the effect is the same.

(Also note that on x86_64 ds in handled by savesegment(ds, p->thread.ds);
already instead so the effective problem as only childregs->cs which
is cleared in 5.12, but will be kept with both of my patches.

> I don't think it would be a problem on the GDB-side to make sure to
> fetch these values from a "standard" thread.  Most likely the thread
> group leader, like Tom has proposed in [1].

Ok.

Is it clear now?
metze

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] io_thread/x86: don't reset 'cs', 'ss', 'ds' and 'es' registers for io_threads
  2021-05-05 22:11               ` Andy Lutomirski
@ 2021-05-05 23:12                 ` Borislav Petkov
  2021-05-05 23:22                   ` Andy Lutomirski
  2021-05-06  1:04                 ` Simon Marchi
  2021-05-06  9:47                 ` David Laight
  2 siblings, 1 reply; 44+ messages in thread
From: Borislav Petkov @ 2021-05-05 23:12 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Simon Marchi, Stefan Metzmacher, Peter Zijlstra, Linus Torvalds,
	Thomas Gleixner, Jens Axboe, Linux Kernel Mailing List, io-uring,
	the arch/x86 maintainers, linux-toolchains

On Wed, May 05, 2021 at 03:11:18PM -0700, Andy Lutomirski wrote:
> Since I'm not holding my breath, please at least keep in mind that
> anything you do here is merely a heuristic, cannot be fully correct,
> and then whenever gdb determines that a thread group or a thread is
> "32-bit", gdb is actually deciding to operate in a degraded mode for
> that task, is not accurately representing the task state, and is at
> risk of crashing, malfunctioning, or crashing the inferior due to its
> incorrect assumptions.  If you have ever attached gdb to QEMU's
> gdbserver and tried to debug the early boot process of a 64-bit Linux
> kernel, you may have encountered this class of bugs.  gdb works very,
> very poorly for this use case.

So we were talking about this with toolchain folks today and they gave
me this example:

Imagine you've stopped the target this way:

	<insn><-- stopped here
	<insn>
	<mode changing insn>
	<insn>
	<insn>
	...

now, if you dump rIP and say, rIP + the 10 following insns at the place
you've stopped it, gdb cannot know that 2 insns further into the stream
a

<mode changing insn>

is coming and it should change the disassembly of the insns after that
<mode changing insn> to the new mode. Unless it goes and inspects all
further instructions and disassembles them and analyzes the flow...

So what you can do is 

(gdb) set arch ...

at the <mode changing insn> to the mode you're changing to.

Dunno, maybe I'm missing something but this sounds like without user
help gdb can only assume things.

Good night and good luck.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] io_thread/x86: don't reset 'cs', 'ss', 'ds' and 'es' registers for io_threads
  2021-05-05 22:21               ` Stefan Metzmacher
@ 2021-05-05 23:15                 ` Simon Marchi
  0 siblings, 0 replies; 44+ messages in thread
From: Simon Marchi @ 2021-05-05 23:15 UTC (permalink / raw)
  To: Stefan Metzmacher, Borislav Petkov, Peter Zijlstra
  Cc: Linus Torvalds, Andy Lutomirski, Thomas Gleixner, Jens Axboe,
	Linux Kernel Mailing List, io-uring, the arch/x86 maintainers,
	linux-toolchains

> Is it clear now?
> metze

Yes, thanks.  I had missed that the point of the 2nd patch is to not do
the memset for these threads.  Makes sense now.

Simon

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] io_thread/x86: don't reset 'cs', 'ss', 'ds' and 'es' registers for io_threads
  2021-05-05 23:12                 ` Borislav Petkov
@ 2021-05-05 23:22                   ` Andy Lutomirski
  0 siblings, 0 replies; 44+ messages in thread
From: Andy Lutomirski @ 2021-05-05 23:22 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Andy Lutomirski, Simon Marchi, Stefan Metzmacher, Peter Zijlstra,
	Linus Torvalds, Thomas Gleixner, Jens Axboe,
	Linux Kernel Mailing List, io-uring, the arch/x86 maintainers,
	linux-toolchains

On Wed, May 5, 2021 at 4:12 PM Borislav Petkov <[email protected]> wrote:
>
> On Wed, May 05, 2021 at 03:11:18PM -0700, Andy Lutomirski wrote:
> > Since I'm not holding my breath, please at least keep in mind that
> > anything you do here is merely a heuristic, cannot be fully correct,
> > and then whenever gdb determines that a thread group or a thread is
> > "32-bit", gdb is actually deciding to operate in a degraded mode for
> > that task, is not accurately representing the task state, and is at
> > risk of crashing, malfunctioning, or crashing the inferior due to its
> > incorrect assumptions.  If you have ever attached gdb to QEMU's
> > gdbserver and tried to debug the early boot process of a 64-bit Linux
> > kernel, you may have encountered this class of bugs.  gdb works very,
> > very poorly for this use case.
>
> So we were talking about this with toolchain folks today and they gave
> me this example:
>
> Imagine you've stopped the target this way:
>
>         <insn><-- stopped here
>         <insn>
>         <mode changing insn>
>         <insn>
>         <insn>
>         ...
>
> now, if you dump rIP and say, rIP + the 10 following insns at the place
> you've stopped it, gdb cannot know that 2 insns further into the stream
> a
>
> <mode changing insn>
>
> is coming and it should change the disassembly of the insns after that
> <mode changing insn> to the new mode. Unless it goes and inspects all
> further instructions and disassembles them and analyzes the flow...

That's fine.  x86 machine code is like this.  You also can't
disassemble instructions before rIP accurately either.

>
> So what you can do is
>
> (gdb) set arch ...
>
> at the <mode changing insn> to the mode you're changing to.
>
> Dunno, maybe I'm missing something but this sounds like without user
> help gdb can only assume things.

In the tools/testing/x86/selftests directory, edited slightly for brevity:

$ gdb ./test_syscall_vdso_32
(gdb) b call64_from_32
Breakpoint 1 at 0x80499ef: file thunks_32.S, line 19.
(gdb) display/i $pc
1: x/i $pc
<error: No registers.>
(gdb) r
Starting program:
/home/luto/apps/linux/tools/testing/selftests/x86/test_syscall_vdso_32
...
[RUN]    Executing 6-argument 32-bit syscall via VDSO

Breakpoint 1, call64_from_32 () at thunks_32.S:19
19        mov    4(%esp), %eax
1: x/i $pc
=> 0x80499ef <call64_from_32>:    mov    0x4(%esp),%eax
(gdb) si
22        push    %ecx
1: x/i $pc
=> 0x80499f3 <call64_from_32+4>:    push   %ecx
(gdb)
call64_from_32 () at thunks_32.S:23
23        push    %edx
1: x/i $pc
=> 0x80499f4 <call64_from_32+5>:    push   %edx
(gdb)
call64_from_32 () at thunks_32.S:24
24        push    %esi
1: x/i $pc
=> 0x80499f5 <call64_from_32+6>:    push   %esi
(gdb)
call64_from_32 () at thunks_32.S:25
25        push    %edi
1: x/i $pc
=> 0x80499f6 <call64_from_32+7>:    push   %edi
(gdb)
call64_from_32 () at thunks_32.S:28
28        jmp    $0x33,$1f
1: x/i $pc
=> 0x80499f7 <call64_from_32+8>:    ljmp   $0x33,$0x80499fe
(gdb) info registers
eax            0x80492e8           134517480
ecx            0x3f                63
edx            0x1                 1
ebx            0xf7fc8550          -134445744
esp            0xffffc57c          0xffffc57c
ebp            0xffffc5e8          0xffffc5e8
esi            0x0                 0
edi            0x8049180           134517120
eip            0x80499f7           0x80499f7 <call64_from_32+8>
eflags         0x292               [ AF SF IF ]
cs             0x23                35
ss             0x2b                43
ds             0x2b                43
es             0x2b                43
fs             0x0                 0
gs             0x63                99
(gdb) si
32        call    *%rax
1: x/i $pc
=> 0x80499fe <call64_from_32+15>:    call   *%eax
(gdb) info registers
eax            0x80492e8           134517480

^^^ Should be rax

ecx            0x3f                63
edx            0x1                 1
ebx            0xf7fc8550          -134445744
esp            0xffffc57c          0xffffc57c
ebp            0xffffc5e8          0xffffc5e8
esi            0x0                 0
edi            0x8049180           134517120
eip            0x80499fe           0x80499fe <call64_from_32+15>

^^^ r8, etc are all missing

eflags         0x292               [ AF SF IF ]
cs             0x33                51

^^^ 64-bit!

ss             0x2b                43
ds             0x2b                43
es             0x2b                43
fs             0x0                 0
gs             0x63                99
(gdb) si
poison_regs64 () at test_syscall_vdso.c:35
35    long syscall_addr;
1: x/i $pc
=> 0x80492e8 <poison_regs64>:    dec    %ecx
(gdb) si
36    long get_syscall(char **envp)
1: x/i $pc
=> 0x80492ef <poison_regs64+7>:    dec    %ecx
(gdb) set arch i386:x86-64
warning: Selected architecture i386:x86-64 is not compatible with
reported target architecture i386
Architecture `i386:x86-64' not recognized.
The target architecture is set to "auto" (currently "i386").
(gdb) set arch i386:x86-64:intel
warning: Selected architecture i386:x86-64:intel is not compatible
with reported target architecture i386
Architecture `i386:x86-64:intel' not recognized.
The target architecture is set to "auto" (currently "i386").

I don't know enough about gdb internals to know precisely what failed
here, but this did not work the way it should have.

Sure, ptrace should provide a nice API to figure out that CS == 0x33
means long mode, but gdb could do a whole lot better here.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] io_thread/x86: don't reset 'cs', 'ss', 'ds' and 'es' registers for io_threads
  2021-05-05 22:11               ` Andy Lutomirski
  2021-05-05 23:12                 ` Borislav Petkov
@ 2021-05-06  1:04                 ` Simon Marchi
  2021-05-06 15:11                   ` Andy Lutomirski
  2021-05-06  9:47                 ` David Laight
  2 siblings, 1 reply; 44+ messages in thread
From: Simon Marchi @ 2021-05-06  1:04 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Stefan Metzmacher, Borislav Petkov, Peter Zijlstra,
	Linus Torvalds, Thomas Gleixner, Jens Axboe,
	Linux Kernel Mailing List, io-uring, the arch/x86 maintainers,
	linux-toolchains

On 2021-05-05 6:11 p.m., Andy Lutomirski wrote:
> For what it's worth, this is already fundamentally incorrect.  On
> x86_64 Linux, a process *does* *not* *have* an architecture.  Every
> task on an x86_64 Linux host has a full 64-bit register state.  The
> task can, and sometimes does, change CS using far transfers or other
> bizarre techniques, and neither the kernel nor GDB will be notified or
> have a chance to take any action in response.  ELF files can be
> 32-bit, CS:rIP can point at 32-bit code, and system calls can be
> 32-bit (even from 64-bit code), but *tasks* are not 32-bit.

Thanks for that explanation: I didn't know that "32-bit" tasks had the
same register state as any other task.  I never really looked into it,
but I was assuming that tasks were either 32-bit or 64-bit (based on the
ELF header of the file they exec'd or something) and that 32-bit tasks
had the register state as a task on an i386 machine would have.  And
that PEEKUSER would return the 64-bit register state for 64-bit tasks,
and 32-bit register state for 32-bit tasks.

I looked at how GDB reads registers from a "64-bit" task and a "32-bit"
task (I have to quote now, since I now know it's an abuse of
terminology) side by side.  And indeed, GDB reads a full 64-bit state in
both cases.  For the 32-bit case, it picks the 32-bit values from that
buffer.  For example, to get the eax value it picks the low 4 bytes of
rax (well, ax in user_regs_struct).

So I suppose that if GDB wanted to tell nothing but the truth, it would
present the full 64-bit register state to the user even when debugging a
32-bit program.  But at the end of the day, the typical user debugging a
32-bit program on a 64-bit probably just wants the illusion that they
are on i386.

> Now I realize that the ptrace() API is awful and makes life difficult
> in several respects for no good reason but, if gdb is ever interested
> in fixing its ideas about architecture to understand that all tasks,
> even those that think of themselves as "compat", have full 64-bit
> state, I would be more than willing to improve the ptrace() API as
> needed to make this work well.

Just wondering, do you have specific ptrace shortcomings in mind when
saying this?  As I found above, ptrace lets us read the whole 64-bit
register state.  After that it's up to us to analyze the state of the
program based on its registers and memory.  What more could ptrace give
us?

> Since I'm not holding my breath, please at least keep in mind that
> anything you do here is merely a heuristic, cannot be fully correct,
> and then whenever gdb determines that a thread group or a thread is
> "32-bit", gdb is actually deciding to operate in a degraded mode for
> that task, is not accurately representing the task state, and is at
> risk of crashing, malfunctioning, or crashing the inferior due to its
> incorrect assumptions.  If you have ever attached gdb to QEMU's
> gdbserver and tried to debug the early boot process of a 64-bit Linux
> kernel, you may have encountered this class of bugs.  gdb works very,
> very poorly for this use case.

Yes, that QEMU case comes up often.  I wish that things were better, but
the reality is that this is an edge case, it would require somebody with
that particular itch to scratch to work on GDB to improve that use case.
So as you said, don't hold your breath :).

I completely understand that GDB putting processes in the "32-bit" or
"64-bit" bin is not the right thing to do in general, from a kernel
perspective.  But it converged to this because it's enough for and
useful to the 99.9% of users who debug programs that don't do funky
things.  At least, it's good to know about it in case problems related
to this arise in the future.

Simon

^ permalink raw reply	[flat|nested] 44+ messages in thread

* RE: [PATCH] io_thread/x86: don't reset 'cs', 'ss', 'ds' and 'es' registers for io_threads
  2021-05-05 22:11               ` Andy Lutomirski
  2021-05-05 23:12                 ` Borislav Petkov
  2021-05-06  1:04                 ` Simon Marchi
@ 2021-05-06  9:47                 ` David Laight
  2021-05-06  9:53                   ` David Laight
  2 siblings, 1 reply; 44+ messages in thread
From: David Laight @ 2021-05-06  9:47 UTC (permalink / raw)
  To: 'Andy Lutomirski', Simon Marchi
  Cc: Stefan Metzmacher, Borislav Petkov, Peter Zijlstra,
	Linus Torvalds, Thomas Gleixner, Jens Axboe,
	Linux Kernel Mailing List, io-uring, the arch/x86 maintainers,
	[email protected]

> (To avoid confusion, this is not a universal property of Linux.  arm64
> and arm32 tasks on an arm64 Linux host are different and cannot
> arbitrarily switch modes.)

Although there are patches lurking to change that.
(not from me).

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 44+ messages in thread

* RE: [PATCH] io_thread/x86: don't reset 'cs', 'ss', 'ds' and 'es' registers for io_threads
  2021-05-06  9:47                 ` David Laight
@ 2021-05-06  9:53                   ` David Laight
  0 siblings, 0 replies; 44+ messages in thread
From: David Laight @ 2021-05-06  9:53 UTC (permalink / raw)
  To: 'Andy Lutomirski', 'Simon Marchi'
  Cc: 'Stefan Metzmacher', 'Borislav Petkov',
	'Peter Zijlstra', 'Linus Torvalds',
	'Thomas Gleixner', 'Jens Axboe',
	'Linux Kernel Mailing List', 'io-uring',
	'the arch/x86 maintainers',
	'[email protected]'

> > (To avoid confusion, this is not a universal property of Linux.  arm64
> > and arm32 tasks on an arm64 Linux host are different and cannot
> > arbitrarily switch modes.)
> 
> Although there are patches lurking to change that.
> (not from me).

Actually they may be just to allow 64bit tasks make 32bit system calls.
The code is almost certainly still 54bit.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] io_thread/x86: don't reset 'cs', 'ss', 'ds' and 'es' registers for io_threads
  2021-05-06  1:04                 ` Simon Marchi
@ 2021-05-06 15:11                   ` Andy Lutomirski
  0 siblings, 0 replies; 44+ messages in thread
From: Andy Lutomirski @ 2021-05-06 15:11 UTC (permalink / raw)
  To: Simon Marchi
  Cc: Andy Lutomirski, Stefan Metzmacher, Borislav Petkov,
	Peter Zijlstra, Linus Torvalds, Thomas Gleixner, Jens Axboe,
	Linux Kernel Mailing List, io-uring, the arch/x86 maintainers,
	linux-toolchains

On Wed, May 5, 2021 at 6:04 PM Simon Marchi <[email protected]> wrote:
>
> On 2021-05-05 6:11 p.m., Andy Lutomirski wrote:

> I looked at how GDB reads registers from a "64-bit" task and a "32-bit"
> task (I have to quote now, since I now know it's an abuse of
> terminology) side by side.  And indeed, GDB reads a full 64-bit state in
> both cases.  For the 32-bit case, it picks the 32-bit values from that
> buffer.  For example, to get the eax value it picks the low 4 bytes of
> rax (well, ax in user_regs_struct).
>
> So I suppose that if GDB wanted to tell nothing but the truth, it would
> present the full 64-bit register state to the user even when debugging a
> 32-bit program.  But at the end of the day, the typical user debugging a
> 32-bit program on a 64-bit probably just wants the illusion that they
> are on i386.

True.  I see no reason, especially by default, to show the extra
registers.  On the other hand, if the program switches modes, having
gdb notice would be nice.  And, if gdb handled this correctly, all
this io_uring stuff would be entirely moot.  The made-up register
state of the io_uring thread would have no bearing on the debugging of
other threads.

>
> > Now I realize that the ptrace() API is awful and makes life difficult
> > in several respects for no good reason but, if gdb is ever interested
> > in fixing its ideas about architecture to understand that all tasks,
> > even those that think of themselves as "compat", have full 64-bit
> > state, I would be more than willing to improve the ptrace() API as
> > needed to make this work well.
>
> Just wondering, do you have specific ptrace shortcomings in mind when
> saying this?  As I found above, ptrace lets us read the whole 64-bit
> register state.  After that it's up to us to analyze the state of the
> program based on its registers and memory.  What more could ptrace give
> us?

Two specific issues come to mind:

1. PTRACE_GETREGSET and PTRACE_SETREGSET are terminally broken.  See
the comment above task_user_regset_view() in arch/x86/kernel/ptrace.c.
We need a new version of those APIs that takes an e_machine parameter.
(I don't even see how you can call these APIs safely at all, short of
allocating a buffer with a guard page or intentionally over-allocating
and calculating the maximum possible size of buffer that could be used
in case of a screwup.)

2. There should be an API to either read the descriptor table or to
look up a specific descriptor.  How else are you supposed to know
whether CS.L is set?  (Keep in mind that 0x33 is not necessarily the
only long mode segment that gets used.  Linux on Xen PV has an extra
one.)

--Andy

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] io_thread/x86: don't reset 'cs', 'ss', 'ds' and 'es' registers for io_threads
  2021-05-04  2:50                       ` Jens Axboe
  2021-05-04 11:39                         ` Stefan Metzmacher
@ 2021-05-12  4:24                         ` Olivier Langlois
  2021-05-12 17:44                           ` Linus Torvalds
  1 sibling, 1 reply; 44+ messages in thread
From: Olivier Langlois @ 2021-05-12  4:24 UTC (permalink / raw)
  To: Jens Axboe, Linus Torvalds, Stefan Metzmacher
  Cc: Thomas Gleixner, Andy Lutomirski, Linux Kernel Mailing List,
	io-uring, the arch/x86 maintainers

On Mon, 2021-05-03 at 20:50 -0600, Jens Axboe wrote:
> 
> I tested the below, which is the two combined, with a case that
> deliberately has two types of io threads - one for SQPOLL submission,
> and one that was created due to async work being needed. gdb attaches
> just fine to the creator, with a slight complaint:
> 
> Attaching to process 370
> [New LWP 371]
> [New LWP 372]
> Error while reading shared library symbols for
> /usr/lib/libpthread.so.0:
> Cannot find user-level thread for LWP 372: generic error
> 0x00007f1a74675125 in clock_nanosleep@GLIBC_2.2.5 () from
> /usr/lib/libc.so.6
> (gdb) info threads
>   Id   Target Id             Frame 
> * 1    LWP 370 "io_uring"    0x00007f1a74675125 in
> clock_nanosleep@GLIBC_2.2.5 ()
>    from /usr/lib/libc.so.6
>   2    LWP 371 "iou-sqp-370" 0x00007f1a746a7a9d in syscall () from
> /usr/lib/libc.so.6
>   3    LWP 372 "io_uring"    0x00007f1a74675125 in
> clock_nanosleep@GLIBC_2.2.5 ()
>    from /usr/lib/libc.so.6
> 
> (gdb) thread 2
> [Switching to thread 2 (LWP 371)]
> #0  0x00007f1a746a7a9d in syscall () from /usr/lib/libc.so.6
> (gdb) bt
> #0  0x00007f1a746a7a9d in syscall () from /usr/lib/libc.so.6
> Backtrace stopped: Cannot access memory at address 0x0
> 
> (gdb) thread 1
> [Switching to thread 1 (LWP 370)]
> #0  0x00007f1a74675125 in clock_nanosleep@GLIBC_2.2.5 () from
> /usr/lib/libc.so.6
> (gdb) bt
> #0  0x00007f1a74675125 in clock_nanosleep@GLIBC_2.2.5 () from
> /usr/lib/libc.so.6
> #1  0x00007f1a7467a357 in nanosleep () from /usr/lib/libc.so.6
> #2  0x00007f1a7467a28e in sleep () from /usr/lib/libc.so.6
> #3  0x000055bd41e929ba in main (argc=<optimized out>, argv=<optimized
> out>)
>     at t/io_uring.c:658
> 
> which looks very reasonable to me - no backtraces for the io threads,
> and
> no arch complaints.
> 
I have reported an issue that I have with a user process using io_uring
where when it core dumps, the dump fails to be generated.
https://github.com/axboe/liburing/issues/346

Pavel did comment to my report and he did point out this thread as
possibly a related issue.

I'm far from being 100% convinced that Stefan patch can help but I am
going to give it a try and report back here if it does help.

Greetings,
Olivier


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] io_thread/x86: don't reset 'cs', 'ss', 'ds' and 'es' registers for io_threads
  2021-05-12  4:24                         ` Olivier Langlois
@ 2021-05-12 17:44                           ` Linus Torvalds
  2021-05-12 20:55                             ` Jens Axboe
  0 siblings, 1 reply; 44+ messages in thread
From: Linus Torvalds @ 2021-05-12 17:44 UTC (permalink / raw)
  To: Olivier Langlois
  Cc: Jens Axboe, Stefan Metzmacher, Thomas Gleixner, Andy Lutomirski,
	Linux Kernel Mailing List, io-uring, the arch/x86 maintainers

On Tue, May 11, 2021 at 9:24 PM Olivier Langlois <[email protected]> wrote:
>
> I have reported an issue that I have with a user process using io_uring
> where when it core dumps, the dump fails to be generated.
> https://github.com/axboe/liburing/issues/346

I suspect most kernel developers don't have github notifications
enabled. I know I have them disabled because it would be *way* too
noisy not to.

But maybe Jens does for that libiouring part.

> Pavel did comment to my report and he did point out this thread as
> possibly a related issue.

I don't think this is related. The gdb confusion wouldn't affect core
dump generation.

I don't see why a core-dump shouldn't work from an IO thread these
days - the signal struct and synchronization should all be the same as
for a regular user thread.

That said, I do wonder if we should avoid generating core dumps from
the IO worker thread itself. The IO thread itself should never get a
SIGSEGV/SIGBUS anyway, it should have been turned into -EFAULT.

So maybe the

                if (current->flags & PF_IO_WORKER)
                        goto out;

in kernel/signal.c should be moved up above the do_coredump() logic regardless.

Jens, have you played with core-dumping when there are active io_uring
threads? There's a test-program in that github issue report..

              Linus

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] io_thread/x86: don't reset 'cs', 'ss', 'ds' and 'es' registers for io_threads
  2021-05-12 17:44                           ` Linus Torvalds
@ 2021-05-12 20:55                             ` Jens Axboe
  2021-05-20  4:13                               ` Olivier Langlois
  0 siblings, 1 reply; 44+ messages in thread
From: Jens Axboe @ 2021-05-12 20:55 UTC (permalink / raw)
  To: Linus Torvalds, Olivier Langlois
  Cc: Stefan Metzmacher, Thomas Gleixner, Andy Lutomirski,
	Linux Kernel Mailing List, io-uring, the arch/x86 maintainers

On 5/12/21 11:44 AM, Linus Torvalds wrote:
> On Tue, May 11, 2021 at 9:24 PM Olivier Langlois <[email protected]> wrote:
>>
>> I have reported an issue that I have with a user process using io_uring
>> where when it core dumps, the dump fails to be generated.
>> https://github.com/axboe/liburing/issues/346
> 
> I suspect most kernel developers don't have github notifications
> enabled. I know I have them disabled because it would be *way* too
> noisy not to.
> 
> But maybe Jens does for that libiouring part.
> 
>> Pavel did comment to my report and he did point out this thread as
>> possibly a related issue.
> 
> I don't think this is related. The gdb confusion wouldn't affect core
> dump generation.
> 
> I don't see why a core-dump shouldn't work from an IO thread these
> days - the signal struct and synchronization should all be the same as
> for a regular user thread.
> 
> That said, I do wonder if we should avoid generating core dumps from
> the IO worker thread itself. The IO thread itself should never get a
> SIGSEGV/SIGBUS anyway, it should have been turned into -EFAULT.
> 
> So maybe the
> 
>                 if (current->flags & PF_IO_WORKER)
>                         goto out;
> 
> in kernel/signal.c should be moved up above the do_coredump() logic regardless.

I actually think that's how I originally had it, but Eric had some comment
on that and we moved it. IIRC. I'll dig out the conversation.

> Jens, have you played with core-dumping when there are active io_uring
> threads? There's a test-program in that github issue report..

Yes, I also did that again after the report, and did so again right now
just to verify. I'm not seeing any issues with coredumps being generated
if the app crashes, or if I send it SIGILL, for example... I also just
now tried Olivier's test case, and it seems to dump just fine for me.

I then tried backing out the patch from Stefan, and it works fine with
that reverted too. So a bit puzzled as to what is going on here...

Anyway, I'll check in on that github thread and see if we can narrow
this down.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] io_thread/x86: don't reset 'cs', 'ss', 'ds' and 'es' registers for io_threads
  2021-05-12 20:55                             ` Jens Axboe
@ 2021-05-20  4:13                               ` Olivier Langlois
  2021-05-21  7:31                                 ` Olivier Langlois
  0 siblings, 1 reply; 44+ messages in thread
From: Olivier Langlois @ 2021-05-20  4:13 UTC (permalink / raw)
  To: Jens Axboe, Linus Torvalds
  Cc: Stefan Metzmacher, Thomas Gleixner, Andy Lutomirski,
	Linux Kernel Mailing List, io-uring, the arch/x86 maintainers

Hi Jens,

On Wed, 2021-05-12 at 14:55 -0600, Jens Axboe wrote:
> 
> > Jens, have you played with core-dumping when there are active
> > io_uring
> > threads? There's a test-program in that github issue report..
> 
> Yes, I also did that again after the report, and did so again right now
> just to verify. I'm not seeing any issues with coredumps being
> generated
> if the app crashes, or if I send it SIGILL, for example... I also just
> now tried Olivier's test case, and it seems to dump just fine for me.
> 
> I then tried backing out the patch from Stefan, and it works fine with
> that reverted too. So a bit puzzled as to what is going on here...
> 
> Anyway, I'll check in on that github thread and see if we can narrow
> this down.
> 
I know that my test case isn't conclusive. It is a failed attempt to
capture what my program is doing.

The priority of investigating my core dump issue has substantially
dropped last week because I did solve my primary issue (A buffer leak
in the provided buffers to io_uring during disconnection). My program
did run for days but it did crash morning without any core dump again.
It is a very frustrating situation because it would probably be a bug
trivial to diagnostic and fix but without the core, the logs are opaque
and they just don't give no clue about why the program did crash.

A key characteristic of my program, it is that it generates at least 1
io-worker thread per io_uring instance.

Oddly enough, I am having a hard time recreating a test case that will
generate io-worker threads.

My first attempt was with the github issue test-case. I have kept
tweaking it and I know that I will find the right sequence to get io-
worker threads spawned.

I suspect that once you meet that condition, it might be sufficient to
trigger the core dump generation problem.

I have also tried to run benchmark io_uring with
https://github.com/frevib/io_uring-echo-server/blob/io-uring-feat-fast-poll/benchmarks/benchmarks.md
(If you give it a try, make sure you erase its private out-of-date
liburing copy before compiling it...)
This didn't generate any io-worker thread neither.

In a nutshell here is what my program does for most of its 85-86
sockets:
1. create TCP socket
2. Set O_NONBLOCK to it
3. Call connect()
4. Use IORING_OP_POLL_ADD with POLLOUT to be notified when the
connection completes
5. Once connection is completed, clear the socket O_NONBLOCK flag, use
IORING_OP_WRITE to send a request
6. Submit a IORING_OP_READ with IOSQE_BUFFER_SELECT to read server
reply asynchronously.

Here are 2 more notes about the sequence:
a) If you wonder about the flip-flop about blocking and non-nblocking,
it is because I have adapated existing code to use io_uring. To
minimize the required code change, I left untouched the non-blocking
connection code.
b) If I add IOSQE_ASYNC to the IORING_OP_READ, io_uring will generate a
lot of io-worker threads. I mean a lot... You can see here:
https://github.com/axboe/liburing/issues/349

So what I am currently doing is to tweak my test-case to emulate as
much as possible the described sequence to have some io-worker threads
spawn and then force a core dump to validate that it is the io-worker
thread presence that is causing the core dump generation issue (or
not!)

Quick question to the devs: Is there any example program bundled with
liburing that is creating some io-workers thread in a sure way?

Greetings,
Olivier



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] io_thread/x86: don't reset 'cs', 'ss', 'ds' and 'es' registers for io_threads
  2021-05-20  4:13                               ` Olivier Langlois
@ 2021-05-21  7:31                                 ` Olivier Langlois
  2021-05-25 19:39                                   ` Olivier Langlois
  0 siblings, 1 reply; 44+ messages in thread
From: Olivier Langlois @ 2021-05-21  7:31 UTC (permalink / raw)
  To: Jens Axboe, Linus Torvalds
  Cc: Stefan Metzmacher, Thomas Gleixner, Andy Lutomirski,
	Linux Kernel Mailing List, io-uring, the arch/x86 maintainers

[-- Attachment #1: Type: text/plain, Size: 1940 bytes --]

On Thu, 2021-05-20 at 00:13 -0400, Olivier Langlois wrote:
> I know that my test case isn't conclusive. It is a failed attempt to
> capture what my program is doing.
> 
> The priority of investigating my core dump issue has substantially
> dropped last week because I did solve my primary issue (A buffer leak
> in the provided buffers to io_uring during disconnection). My program
> did run for days but it did crash morning without any core dump
> again.
> It is a very frustrating situation because it would probably be a bug
> trivial to diagnostic and fix but without the core, the logs are
> opaque
> and they just don't give no clue about why the program did crash.
> 
> A key characteristic of my program, it is that it generates at least
> 1
> io-worker thread per io_uring instance.

As I get more familiar with io_uring source code, I have come to
realize that it is practically impossible to not end up with NO io-wq
threads. They are created in from io_uring_install_fd() which is called
for every instances created.

I'm a bit lazy for rebooting my desktop and I am still running 5.11.5
on it. I guess that with this kernel version, the io_uring threads
weren't threads belonging to the user process and arent showing with ps
by searching for a specific process LWPs.

I correctly see all the generated threads when I run the test program
on an up-to-date server (5.12.4).

I have rewritten the whole test program. It has now become an io_uring
multi connection http client (it could make a nice io_uring example
program, IMHO). Still no success in reproducing the problem with it.

So, I am giving up the idea of reproducing the problem with a showcase
program unless I have some clue about how to reproduce it.

However, I can reproduce it at will with my real program. So as Linus
has suggested, I'll investigate by searching where the PF_IO_WORKER is
used.

I'll keep the list updated if I discover something.

Greetings,

[-- Attachment #2: test_io_uring_coredump.cpp --]
[-- Type: text/x-c++src, Size: 12820 bytes --]

/*
 * Test program to reproduce issue generating core dumps while using io_uring.
 *
 * To compile: g++ -pthread test_io_uring_coredump.cpp -luring
 *
 */

#include <liburing.h>
#include <stdio.h>
#include <stdlib.h>   // for abort()
#include <unistd.h>   // for close()
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <string.h>   // for strerror()
#include <memory.h>   // for memset()
#include <arpa/inet.h>
#include <sys/socket.h>
#include <netdb.h>
#include <stdint.h>   // for uint32_t, uint64_t
#include <pthread.h>
#include <sstream>
#include <string>

#define QD 256
#define BGID1 0
#define BGID2 1
#define PORT1 1975
#define PORT2 1976
#define BUFSZ 5*4096

enum {
    IOURING_POLL_ADD,
    IOURING_POLL_REMOVE,
    IOURING_PROV_BUF,
    IOURING_BUF_READ,
    IOURING_WRITE
};

static
const char *Io_Uring_OpTypeToStr(char type)
{
    const char *res;
    switch (type) {
        case IOURING_POLL_ADD:
            res = "IOURING_POLL_ADD";
            break;
        case IOURING_POLL_REMOVE:
            res = "IOURING_POLL_REMOVE";
            break;
        case IOURING_PROV_BUF:
            res = "IOURING_PROV_BUF";
            break;
        case IOURING_BUF_READ:
            res = "IOURING_BUF_READ";
            break;
        case IOURING_WRITE:
            res = "IOURING_WRITE";
            break;
        default:
            res = "Unknown";
    }
    return res;
}

enum { THREAD_MAX_FD = 86, BIO_RING_SZ = 8 };

inline
void *
iouring_build_user_data(char type, int fd, uint32_t clientId)
{
    return (void *)((uint32_t)fd | ((__u64)(clientId & 0x00ffffff) << 32 ) |
                    ((__u64)type << 56));
}

inline
void
iouring_decode_user_data(uint64_t data, char *type, int *fd, uint32_t *clientId)
{
    *type = data >> 56;
    *fd   = data & 0xffffffffU;
    *clientId = (data >> 32) & 0x00ffffffU;
}

/*
 * provideSimpleBuffer()
 */
static int provideSimpleBuffer(struct io_uring *ring, void *addr, int len, int nr, int bgid, int bid, int block)
{
    int res;
    // register buffers for buffer selection
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;

    sqe = io_uring_get_sqe(ring);
    io_uring_prep_provide_buffers(sqe, addr, len, nr, bgid, bid);
    io_uring_sqe_set_data(sqe, iouring_build_user_data(IOURING_PROV_BUF, 0, -1));
    if (block) {
        io_uring_submit(ring);
        io_uring_wait_cqe(ring, &cqe);
        res = cqe->res;
        if (res < 0) {
            errno = -res;
        }
        io_uring_cqe_seen(ring, cqe);
    }
    else {
        res = 0;
    }
    return res;
}

/*
 * initBufSelectRead()
 */
static void initBufSelectRead(struct io_uring *ring, int fd, size_t max_read, int gid, uint32_t clientId)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    io_uring_prep_read(sqe, fd, NULL, max_read, 0);
    io_uring_sqe_set_flags(sqe, IOSQE_BUFFER_SELECT|IOSQE_ASYNC);
    sqe->buf_group = gid;
    io_uring_sqe_set_data(sqe, iouring_build_user_data(IOURING_BUF_READ, fd, clientId));
}

static void
iouring_write(struct io_uring *ring, int fd, const void *addr, int len, uint32_t clientId)
{
  struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
  io_uring_prep_write(sqe, fd, addr, len, 0);
  io_uring_sqe_set_data(sqe, iouring_build_user_data(IOURING_WRITE, fd, clientId));
}

class BufferManager
{
public:
    BufferManager(struct io_uring *ring, int nr, int max_sz, int gid);
    ~BufferManager();
    
    char *getBuffer(int bid) { return m_bufferBase + bid*m_max_sz; }
    void  releaseBuffer(int bid);
    int getMaxSz() const { return m_max_sz; }
    int getGid()   const { return m_gid; }
    
private:
    struct io_uring *m_ring;
    char *m_bufferBase;
    int   m_nr;
    int   m_max_sz;
    int   m_gid;
};

BufferManager::BufferManager(struct io_uring *ring, int nr, int max_sz, int gid)
: m_ring(ring),
  m_bufferBase(nullptr),
  m_nr(nr),
  m_max_sz(max_sz),
  m_gid(gid)
{
    int res = posix_memalign(reinterpret_cast<void **>(&m_bufferBase), 1024, m_nr*m_max_sz);
    if (res) {
        fprintf(stderr, "posix_memalign: (%d) %s", res, strerror(res));
        exit(0);
    }
    if (provideSimpleBuffer(m_ring, m_bufferBase, m_max_sz, m_nr, gid, 0, 1) < 0) {
        fprintf(stderr, "provideSimpleBuffer() failed: (%d) %s\n", errno, strerror(errno));
        exit(0);
    }
}

BufferManager::~BufferManager()
{
    // We could remove the buffer from io_uring here but for the sake of the test we won't
}

/*
 * releaseBuffer()
 */
void BufferManager::releaseBuffer(int bid)
{
    provideSimpleBuffer(m_ring, getBuffer(bid), m_max_sz, 1, m_gid, bid, 0);
}

class HttpClient
{
public:
    HttpClient();
    ~HttpClient();
    
    /*
     * I don't want to write an URL parser for a test.
     */
    void initiate(struct io_uring *ring, const char *host, const char *path, uint32_t clientId);
    /*
     * As long as the read succeeds, we initiate a new read. Eventually, the server
     * is going to close its side of the connection and we will stop submitting operations.
     * When all the clients reach that state, the thread should block into io_uring_enter waiting
     * for new completion that will never come
     */
    void processCompletion(struct io_uring *ring, struct io_uring_cqe *cqe, BufferManager &bufmgr);

private:
    void sendRequest(struct io_uring *ring, uint32_t clientId);
    void recvReply(struct io_uring *ring, struct io_uring_cqe *cqe, BufferManager &bufmgr);
    int m_fd;
    enum { DOWN, CONNECTING, WAIT_REPLY } m_state;
    const char *m_host;
    const char *m_path;
    std::string m_request;
};

HttpClient::HttpClient()
: m_fd(-1),
  m_state(DOWN),
  m_host(nullptr),
  m_path(nullptr)
{
}

HttpClient::~HttpClient()
{
    if (m_fd >= 0)
        close(m_fd);
    m_fd = -1;
}

void HttpClient::initiate(struct io_uring *ring, const char *host, const char *path, uint32_t clientId)
{
    m_host = host;
    m_path = path;

    struct addrinfo hints;
    struct addrinfo *result = nullptr;
    memset(&hints, 0, sizeof(hints));
    hints.ai_socktype = SOCK_STREAM;
    hints.ai_family = AF_INET;
    if (getaddrinfo(m_host, nullptr, &hints, &result) < 0) {
        fprintf(stderr, "getaddinfo() failed: (%d) %s\n", errno, strerror(errno));
        exit(0);
    }
    void *p = NULL;
    struct addrinfo *res = result;

    /* pick the first AF_INET (IPv4) result */

    while (!p && res) {
        switch (res->ai_family) {
            case AF_INET:
                p = &((struct sockaddr_in *)res->ai_addr)->sin_addr;
                break;
        }
        res = res->ai_next;
    }
    struct sockaddr_in sockAddr;
    memset(&sockAddr, 0, sizeof(struct sockaddr_in));
    sockAddr.sin_family = AF_INET;
    sockAddr.sin_addr = *((struct in_addr *)p);
    sockAddr.sin_port = htons(80);
    freeaddrinfo(result);
    m_fd = socket(AF_INET, SOCK_STREAM, 0);
    if (m_fd < 0) {
        fprintf(stderr, "socket() failed: (%d) %s\n", errno, strerror(errno));
        exit(0);
    }
    if (fcntl(m_fd, F_SETFL, O_NONBLOCK) < 0) {
        fprintf(stderr, "fcntl() failed: (%d) %s\n", errno, strerror(errno));
        exit(0);
    }
    if (connect(m_fd, reinterpret_cast<struct sockaddr *>(&sockAddr), sizeof(struct sockaddr_in)) < 0 &&
        errno != EINPROGRESS) {
        fprintf(stderr, "connect() failed: (%d) %s\n", errno, strerror(errno));
        exit(0);
    }
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    io_uring_prep_poll_add(sqe, m_fd, POLLOUT);
    io_uring_sqe_set_data(sqe, iouring_build_user_data(IOURING_POLL_ADD, m_fd, clientId));

    m_state = CONNECTING;
}

/*
 * sendRequest()
 * 
 * GET / HTTP/1.1
 * Host: kernel.org:80
 */
void HttpClient::sendRequest(struct io_uring *ring, uint32_t clientId)
{
    std::ostringstream ost;
    ost << "GET " << m_path << " HTTP/1.1\r\n";
    ost << "Host: " << m_host << ":80\r\n";
    ost << "User-Agent: io_uring coredump test 1.0\r\n\r\n";
    m_request = ost.str();
    iouring_write(ring, m_fd, m_request.c_str(), m_request.size(), clientId);
}

/*
 * recvReply()
 */
void HttpClient::recvReply(struct io_uring *ring, struct io_uring_cqe *cqe, BufferManager &bufmgr)
{
    char type;
    int  fd, bid;
    uint32_t clientId;
    int res = cqe->res;

    iouring_decode_user_data(cqe->user_data, &type, &fd, &clientId);
    bid = cqe->flags >> IORING_CQE_BUFFER_SHIFT;
    char *buf = bufmgr.getBuffer(bid);
    fprintf(stderr, "client %d reply:\n\n%.*s\n", clientId, res, buf);
    
    // returns the buffer back to io_uring.
    provideSimpleBuffer(ring, buf, bufmgr.getMaxSz(), 1, bufmgr.getGid(), bid, 0);
}

/*
 * processCompletion()
 */
void HttpClient::processCompletion(struct io_uring *ring,
                                   struct io_uring_cqe *cqe,
                                   BufferManager &bufmgr)
{
    char type;
    int  fd, bid;
    uint32_t clientId;
    int res = cqe->res;

    iouring_decode_user_data(cqe->user_data, &type, &fd, &clientId);
    switch (type) {
        case IOURING_POLL_ADD:
            if (m_state != CONNECTING) {
                fprintf(stderr, "client %d: unexpected IOURING_POLL_ADD completion\n", clientId);
                abort();
            }
            // clear the O_NONBLOCK flag
            fcntl(m_fd, F_SETFL, 0);
            /*
             * Setup the next read before sending the request.
             * Maybe it is having multiple concurrent operations on the same
             * fd that is triggering the io thread spawn.
             */
            initBufSelectRead(ring, m_fd, bufmgr.getMaxSz(), bufmgr.getGid(), clientId);
            sendRequest(ring, clientId);
            m_state = WAIT_REPLY;
            break;
        case IOURING_POLL_REMOVE:
            fprintf(stderr, "client %d: unexpected IOURING_POLL_REMOVE completion\n", clientId);
            abort();
            break;
        case IOURING_PROV_BUF:
            if (res < 0) {
                fprintf(stderr, "client %d: unexpected IOURING_PROV_BUF failure: (%d) %s\n",
                        clientId, -res, strerror(-res));
                abort();
            }
            break;
        case IOURING_BUF_READ:
            if (res <= 0) {
                fprintf(stderr, "client %d: unexpected IOURING_BUF_READ res: %d\n",
                        clientId, res);
            }
            else {
                recvReply(ring, cqe, bufmgr);
                // Reload the read
                initBufSelectRead(ring, m_fd, bufmgr.getMaxSz(), bufmgr.getGid(), clientId);
            }
            break;
        case IOURING_WRITE:
            fprintf(stderr, "client %d: request tx completed: %d\n", clientId, res);
            break;
    }
}

#define NUMCLIENTS 6

int main(int argc, char *argv[])
{
    int res;
    struct io_uring_params params;
    struct io_uring ring;
    
    memset(&params, 0, sizeof(params));

    if (io_uring_queue_init_params(QD, &ring, &params) < 0) {
        fprintf(stderr, "io_uring_queue_init_params() failed: (%d) %s\n", errno, strerror(errno));
        return res;
    }
    BufferManager bufmgr(&ring, THREAD_MAX_FD*BIO_RING_SZ, BUFSZ, BGID1);

    const char *servers[NUMCLIENTS] = {
        "google.com",
        "facebook.com",
        "twitter.com",
        "www.cloudflare.com",
        "kernel.org",
        "phoronix.com"
    };

    HttpClient clients[NUMCLIENTS];

    /*
     * Give the tester
     */
    sleep(10);
    
    for (uint32_t i = 0; i < NUMCLIENTS; ++i) {
        clients[i].initiate(&ring, servers[i], "/", i);
    }

    while (1) {
        io_uring_submit_and_wait(&ring, 1);
        unsigned completed = 0;
        unsigned head;
        struct io_uring_cqe *cqe;
        io_uring_for_each_cqe(&ring, head, cqe) {
            ++completed;
            char type;
            int  fd, bid;
            uint32_t clientId;
            iouring_decode_user_data(cqe->user_data, &type, &fd, &clientId);
            if (clientId < NUMCLIENTS) {
                clients[clientId].processCompletion(&ring, cqe, bufmgr);
            }
            else {
                /*
                 * The only operation for which we don't assign a client id is IORING_OP_PROVIDE_BUFFERS
                 */
                if (type != IOURING_PROV_BUF) {
                    fprintf(stderr, "Unexpected completion event with no client association: %s\n",
                            Io_Uring_OpTypeToStr(type));
                    abort();
                }
                else if (cqe->res < 0) {
                    fprintf(stderr, "unexpected IOURING_PROV_BUF failure: (%d) %s\n",
                            -cqe->res, strerror(-cqe->res));
                    abort();
                }
            }
        }
        if (completed) {
            io_uring_cq_advance(&ring, completed);
        }
    }
err2:
    io_uring_queue_exit(&ring);
err1:
    return res;
}

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] io_thread/x86: don't reset 'cs', 'ss', 'ds' and 'es' registers for io_threads
  2021-05-21  7:31                                 ` Olivier Langlois
@ 2021-05-25 19:39                                   ` Olivier Langlois
  2021-05-25 19:45                                     ` Olivier Langlois
                                                       ` (2 more replies)
  0 siblings, 3 replies; 44+ messages in thread
From: Olivier Langlois @ 2021-05-25 19:39 UTC (permalink / raw)
  To: Jens Axboe, Linus Torvalds
  Cc: Stefan Metzmacher, Thomas Gleixner, Andy Lutomirski,
	Linux Kernel Mailing List, io-uring, the arch/x86 maintainers

On Fri, 2021-05-21 at 03:31 -0400, Olivier Langlois wrote:
> 
> However, I can reproduce it at will with my real program. So as Linus
> has suggested, I'll investigate by searching where the PF_IO_WORKER is
> used.
> 
> I'll keep the list updated if I discover something.
> 
I think that I am about to stumble into the key to unravel the mystery
of my core dump generation issue. I am going ask you a quick question
and it is very likely to trigger an ahah moment...

To what value is the task_struct mm field is set to for the io-wkr
threads?

If I look in the create_io_thread() function, I can see that CLONE_VM
isn't set...

There are still some fuzzy areas in my io_uring inner design
understanding but I would think that the io-wrk threads must use the
user process mm at some point in order to be able to fill in the user
provided buffers...

This notion appears to be central when creating a coredump...
Only tasks having the same mm than the one receiving the SIGSEGV will
be zapped...

in zap_threads():
		for_each_thread(g, p) {
			if (unlikely(!p->mm))
				continue;
			if (unlikely(p->mm == mm)) {
				lock_task_sighand(p, &flags);
				nr += zap_process(p, exit_code,
							SIGNAL_GROUP_E
XIT);
				unlock_task_sighand(p, &flags);
			}
			break;
		}



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] io_thread/x86: don't reset 'cs', 'ss', 'ds' and 'es' registers for io_threads
  2021-05-25 19:39                                   ` Olivier Langlois
@ 2021-05-25 19:45                                     ` Olivier Langlois
  2021-05-25 19:52                                     ` Jens Axboe
  2021-05-25 20:23                                     ` Linus Torvalds
  2 siblings, 0 replies; 44+ messages in thread
From: Olivier Langlois @ 2021-05-25 19:45 UTC (permalink / raw)
  To: Jens Axboe, Linus Torvalds
  Cc: Stefan Metzmacher, Thomas Gleixner, Andy Lutomirski,
	Linux Kernel Mailing List, io-uring, the arch/x86 maintainers

On Tue, 2021-05-25 at 15:39 -0400, Olivier Langlois wrote:
> This notion appears to be central when creating a coredump...
> Only tasks having the same mm than the one receiving the SIGSEGV will
> be zapped...
> 
> in zap_threads():
>                 for_each_thread(g, p) {
>                         if (unlikely(!p->mm))
>                                 continue;
>                         if (unlikely(p->mm == mm)) {
>                                 lock_task_sighand(p, &flags);
>                                 nr += zap_process(p, exit_code,
>                                                         SIGNAL_GROUP_
> E
> XIT);
>                                 unlock_task_sighand(p, &flags);
>                         }
>                         break;
>                 }

without fully understanding what I am doing... I am tempted to tweak
the condition

(unlikely(p->mm == mm))

for

(unlikely(p->mm == mm || p->flags & PF_IO_WORKER))

and see if it would resolve my coredump problem...



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] io_thread/x86: don't reset 'cs', 'ss', 'ds' and 'es' registers for io_threads
  2021-05-25 19:39                                   ` Olivier Langlois
  2021-05-25 19:45                                     ` Olivier Langlois
@ 2021-05-25 19:52                                     ` Jens Axboe
  2021-05-25 20:23                                     ` Linus Torvalds
  2 siblings, 0 replies; 44+ messages in thread
From: Jens Axboe @ 2021-05-25 19:52 UTC (permalink / raw)
  To: Olivier Langlois, Linus Torvalds
  Cc: Stefan Metzmacher, Thomas Gleixner, Andy Lutomirski,
	Linux Kernel Mailing List, io-uring, the arch/x86 maintainers

On 5/25/21 1:39 PM, Olivier Langlois wrote:
> On Fri, 2021-05-21 at 03:31 -0400, Olivier Langlois wrote:
>>
>> However, I can reproduce it at will with my real program. So as Linus
>> has suggested, I'll investigate by searching where the PF_IO_WORKER is
>> used.
>>
>> I'll keep the list updated if I discover something.
>>
> I think that I am about to stumble into the key to unravel the mystery
> of my core dump generation issue. I am going ask you a quick question
> and it is very likely to trigger an ahah moment...
> 
> To what value is the task_struct mm field is set to for the io-wkr
> threads?
> 
> If I look in the create_io_thread() function, I can see that CLONE_VM
> isn't set...

It definitely should be, what kernel are you looking at? If CLONE_VM
wasn't set, we'd have various issues with requests accessing user mm.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] io_thread/x86: don't reset 'cs', 'ss', 'ds' and 'es' registers for io_threads
  2021-05-25 19:39                                   ` Olivier Langlois
  2021-05-25 19:45                                     ` Olivier Langlois
  2021-05-25 19:52                                     ` Jens Axboe
@ 2021-05-25 20:23                                     ` Linus Torvalds
  2 siblings, 0 replies; 44+ messages in thread
From: Linus Torvalds @ 2021-05-25 20:23 UTC (permalink / raw)
  To: Olivier Langlois
  Cc: Jens Axboe, Stefan Metzmacher, Thomas Gleixner, Andy Lutomirski,
	Linux Kernel Mailing List, io-uring, the arch/x86 maintainers

On Tue, May 25, 2021 at 9:40 AM Olivier Langlois <[email protected]> wrote:
>
> If I look in the create_io_thread() function, I can see that CLONE_VM
> isn't set...

That would indeed be horribly buggy, but I'm not seeing it:

                .flags          = ((lower_32_bits(flags) | CLONE_VM |
                                    CLONE_UNTRACED) & ~CSIGNAL),

Yeah, it has that odd "first create 'flags' without the CLONE_VM, but
that is only used for that

        lower_32_bits(flags)

thing, and then we explicitly add CLONE_VM in there in the actual
kernel_clone_args.

It's because of how 'kernel_thread()' is written (that has historical
reasons), and the oddity is copied from there.

           Linus

^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2021-05-25 20:24 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <[email protected]>
2021-05-03 16:05 ` [PATCH] io_thread/x86: don't reset 'cs', 'ss', 'ds' and 'es' registers for io_threads Andy Lutomirski
2021-05-03 19:14   ` Linus Torvalds
2021-05-03 20:15     ` Andy Lutomirski
2021-05-03 20:21       ` Andy Lutomirski
2021-05-03 20:37       ` Linus Torvalds
2021-05-03 21:26         ` Jens Axboe
2021-05-03 21:49           ` Andy Lutomirski
2021-05-03 22:08             ` Linus Torvalds
2021-05-03 22:56               ` Thomas Gleixner
2021-05-03 23:15                 ` Andy Lutomirski
2021-05-03 23:16                 ` Linus Torvalds
2021-05-03 23:19                   ` Linus Torvalds
2021-05-03 23:27                   ` Stefan Metzmacher
2021-05-03 23:48                     ` Linus Torvalds
2021-05-04  2:50                       ` Jens Axboe
2021-05-04 11:39                         ` Stefan Metzmacher
2021-05-04 15:53                           ` Linus Torvalds
2021-05-12  4:24                         ` Olivier Langlois
2021-05-12 17:44                           ` Linus Torvalds
2021-05-12 20:55                             ` Jens Axboe
2021-05-20  4:13                               ` Olivier Langlois
2021-05-21  7:31                                 ` Olivier Langlois
2021-05-25 19:39                                   ` Olivier Langlois
2021-05-25 19:45                                     ` Olivier Langlois
2021-05-25 19:52                                     ` Jens Axboe
2021-05-25 20:23                                     ` Linus Torvalds
2021-05-04  8:22                       ` David Laight
2021-05-04  0:01                   ` Andy Lutomirski
2021-05-04  8:39     ` Peter Zijlstra
2021-05-04 15:35       ` Borislav Petkov
2021-05-04 15:55         ` Simon Marchi
2021-05-05 11:29           ` Stefan Metzmacher
2021-05-05 21:59             ` Simon Marchi
2021-05-05 22:11               ` Andy Lutomirski
2021-05-05 23:12                 ` Borislav Petkov
2021-05-05 23:22                   ` Andy Lutomirski
2021-05-06  1:04                 ` Simon Marchi
2021-05-06 15:11                   ` Andy Lutomirski
2021-05-06  9:47                 ` David Laight
2021-05-06  9:53                   ` David Laight
2021-05-05 22:21               ` Stefan Metzmacher
2021-05-05 23:15                 ` Simon Marchi
2021-04-11 15:27 Stefan Metzmacher
2021-04-14 21:28 ` Stefan Metzmacher

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox