public inbox for [email protected]
 help / color / mirror / Atom feed
* Re: [PATCH] signal: Set PF_SIGNALED flag for io workers during a group exit
       [not found] <[email protected]>
@ 2021-06-08  3:49 ` Eric W. Biederman
  2021-06-08 17:43   ` Olivier Langlois
  0 siblings, 1 reply; 3+ messages in thread
From: Eric W. Biederman @ 2021-06-08  3:49 UTC (permalink / raw)
  To: Olivier Langlois
  Cc: Oleg Nesterov, Peter Zijlstra (Intel), Marco Elver,
	Peter Collingbourne, Thomas Gleixner, Jens Axboe, linux-kernel,
	io-uring

Olivier Langlois <[email protected]> writes:

> io worker threads are in most regards userspace threads except
> that they never resume userspace. Therefore, they need to explicitly
> handle signals.
>
> On delivering a fatal signal generating a core dump to a thread of
> a group having 1 or more io workers, it is possible for the io_workers
> to exit with pending signals.
>
> One example of this is the io_wqe_worker() function thread in fs/io-wq.c
> This thread can exit the function with pending signals when its
> IO_WQ_BIT_EXIT bit is set.
>
> The consequence of exiting with pending signals is that PF_SIGNALED
> will not be set. This flag is used in exit_mm() to engage into
> the synchronization between do_coredump() and exit_mm().
>
> The purpose of this synchronization is not well documented and all
> that I have found is that it is used to avoid corruption in the core file
> in the section "Deleting a Process Address Space", chapter 9 of the
> Bovet & Cesati book.

We added the check just a little while ago.  I am surprised it shows up
in any book.  What is the Bovett & Cesati book?

The flag PF_SIGNALED today is set in exactly one place, and that
is in get_signal.  The meaning of PF_SIGNALED is that do_group_exit
was called from get_signal.  AKA your task was killed by a signal.

The check in exit_mm() that tests PF_SIGNALED is empirically testing
to see if all of the necessary state is saved on the kernel stack.
That state is the state accessed by fs/binfmt_elf.c:fill_note_info.

The very good description from the original change can be found in
the commit 123cbec460db ("signal: Remove the helper signal_group_exit").

For alpha it is has the assembly function do_switch_stack been called
before your code path was called in the kernel.

Since io_uring does not have a userspace  I don't know if testing
for PF_SIGNALED is at all meaningful to detect values saved on the
stack.

I suspect io_uring is simply broken on architectures that need
extra state saved on the stack, but I haven't looked yet.


> So I am not sure if the synchronizatin MUST be applied to io_workers
> or not but the proposed patch is making sure that it is applied in
> all cases if it is needed.

That patch is definitely wrong.  If anything the check in exit_mm
should be updated.

Can you share which code paths in io_uring exit with a
fatal_signal_pending and don't bother to call get_signal?

I am currently looking to see if the wait for a coredump to read
a threads data can be moved from exit_mm into get_signal.  Even
with that io_uring might need a some additional fixes.

Eric

> Signed-off-by: Olivier Langlois <[email protected]>
> ---
>  kernel/signal.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
>
> diff --git a/kernel/signal.c b/kernel/signal.c
> index f7c6ffcbd044..477bfe55fd3c 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -2925,6 +2925,15 @@ void exit_signals(struct task_struct *tsk)
>  
>  	if (thread_group_empty(tsk) || signal_group_exit(tsk->signal)) {
>  		tsk->flags |= PF_EXITING;
> +		/*
> +		 * It is possible for an io worker thread to reach this
> +		 * function with a pending SIGKILL.
> +		 * Set PF_SIGNALED for proper core dump generation
> +		 * (See exit_mm())
> +		 */
> +		if (tsk->flags & PF_IO_WORKER &&
> +		    signal_group_exit(tsk->signal))
> +			tsk->flags |= PF_SIGNALED;
>  		cgroup_threadgroup_change_end(tsk);
>  		return;
>  	}

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [PATCH] signal: Set PF_SIGNALED flag for io workers during a group exit
  2021-06-08  3:49 ` [PATCH] signal: Set PF_SIGNALED flag for io workers during a group exit Eric W. Biederman
@ 2021-06-08 17:43   ` Olivier Langlois
  2021-06-08 18:44     ` Eric W. Biederman
  0 siblings, 1 reply; 3+ messages in thread
From: Olivier Langlois @ 2021-06-08 17:43 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Oleg Nesterov, Peter Zijlstra (Intel), Marco Elver,
	Peter Collingbourne, Thomas Gleixner, Jens Axboe, linux-kernel,
	io-uring

On Mon, 2021-06-07 at 22:49 -0500, Eric W. Biederman wrote:
> We added the check just a little while ago.  I am surprised it shows up
> in any book.  What is the Bovett & Cesati book?

It is an O'Reilly book named 'Understanding the Linux Kernel'. It is
quite old (2005) and covers kernel 2.6 but surprisingly it did age very
well and it is still very useful today.

For instance, the discussed feature about exit_mm() sync mecanism, when
the book has been published, the struct core_state was not existing and
its members were straight into mm_struct.

I would think that beside this detail, not much has changed since then
in the general concepts... 
> 
> The flag PF_SIGNALED today is set in exactly one place, and that
> is in get_signal.  The meaning of PF_SIGNALED is that do_group_exit
> was called from get_signal.  AKA your task was killed by a signal.
> 
> The check in exit_mm() that tests PF_SIGNALED is empirically testing
> to see if all of the necessary state is saved on the kernel stack.
> That state is the state accessed by fs/binfmt_elf.c:fill_note_info.
> 
> The very good description from the original change can be found in
> the commit 123cbec460db ("signal: Remove the helper
> signal_group_exit").

thx for sharing the link. Help to improve my kernel understanding is
always very appreciated. However, I am clueless about where to look to
retrieve it:
$ git show 123cbec460db --
fatal: bad revision '123cbec460db'

is it supposed to be found in the master branch or this a commit prior
2005?

> 
> For alpha it is has the assembly function do_switch_stack been called
> before your code path was called in the kernel.
> 
> Since io_uring does not have a userspace  I don't know if testing
> for PF_SIGNALED is at all meaningful to detect values saved on the
> stack.
> 
> I suspect io_uring is simply broken on architectures that need
> extra state saved on the stack, but I haven't looked yet.
> 
> 
> > So I am not sure if the synchronizatin MUST be applied to io_workers
> > or not but the proposed patch is making sure that it is applied in
> > all cases if it is needed.
> 
> That patch is definitely wrong.  If anything the check in exit_mm
> should be updated.

with your explanation, if the only purpose of the synchronization is to
make sure that the task stack register is pointing on its kernel stack
to be able to dump its user hardware context, it is not needed for io-
worker tasks since they never exit the kernel.

I think that this is the confirmation that I wanted to get from the
patch code review because this point wasn't 100% clear to me (and I was
having issues getting a core dump generated properly when using
io_uring. I found the culprit of that issue. This isn't the problem
described in this patch but I thought that the current problem despite
not being the one responsible for my coredump issue could still have
some merit).

the doubt was coming from the fact that io-worker tasks are almost
userspace tasks except that they never escape the kernel and the
observation that it is the first type of task that could theoritically
exit from a group_exit without having their PF_SIGNALED bit set.
> 
> Can you share which code paths in io_uring exit with a
> fatal_signal_pending and don't bother to call get_signal?

Yes. One was provided as part of the description of this patch:
io_wqe_worker() (fs/io-wq.c)

and
io_sq_thread() (fs/io_uring.c)
is another example

and possibly all the other future threads created with
create_io_thread() will share the same pattern.

They all enter an while loop and contain 1 or many other break points
beside getting a signal.

imho, it is quite exceptional situation but it could happen that those
threads exit with a pending signal.

Before 5.13, when io-wq had a second type of io thread (io-mgr), there
was a real chance of that happening by a race condition between the io-
mgr setting the wq state bit IO_WQ_BIT_EXIT upon receiving a SIGKILL
and its io-workers threads exiting their while loop.


but with io-mgr removal in 5.13, this risk is now gone...

I have first offered a patch to io_uring to call get_signal() one last
time before exiting:
https://lore.kernel.org/io-uring/[email protected]/T/#t

but discussing with Jens made me realize that it wasn't the right place
to do this because:
1. Interrupts aren't disabled and the task can be preemted even after
this last get_signal() call.
2. If that is something needed, it was the bad place to do it because
you would have to repeat it for all the current existing io_threads and
the future ones too.

but again, after your explanations, this might not be required at all
after all...

Greetings,


^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [PATCH] signal: Set PF_SIGNALED flag for io workers during a group exit
  2021-06-08 17:43   ` Olivier Langlois
@ 2021-06-08 18:44     ` Eric W. Biederman
  0 siblings, 0 replies; 3+ messages in thread
From: Eric W. Biederman @ 2021-06-08 18:44 UTC (permalink / raw)
  To: Olivier Langlois
  Cc: Oleg Nesterov, Peter Zijlstra (Intel), Marco Elver,
	Peter Collingbourne, Thomas Gleixner, Jens Axboe, linux-kernel,
	io-uring

Olivier Langlois <[email protected]> writes:

> On Mon, 2021-06-07 at 22:49 -0500, Eric W. Biederman wrote:
>> We added the check just a little while ago.  I am surprised it shows up
>> in any book.  What is the Bovett & Cesati book?
>
> It is an O'Reilly book named 'Understanding the Linux Kernel'. It is
> quite old (2005) and covers kernel 2.6 but surprisingly it did age very
> well and it is still very useful today.
>
> For instance, the discussed feature about exit_mm() sync mecanism, when
> the book has been published, the struct core_state was not existing and
> its members were straight into mm_struct.
>
> I would think that beside this detail, not much has changed since then
> in the general concepts... 
>> 
>> The flag PF_SIGNALED today is set in exactly one place, and that
>> is in get_signal.  The meaning of PF_SIGNALED is that do_group_exit
>> was called from get_signal.  AKA your task was killed by a signal.
>> 
>> The check in exit_mm() that tests PF_SIGNALED is empirically testing
>> to see if all of the necessary state is saved on the kernel stack.
>> That state is the state accessed by fs/binfmt_elf.c:fill_note_info.
>> 
>> The very good description from the original change can be found in
>> the commit 123cbec460db ("signal: Remove the helper
>> signal_group_exit").
>
> thx for sharing the link. Help to improve my kernel understanding is
> always very appreciated. However, I am clueless about where to look to
> retrieve it:
> $ git show 123cbec460db --
> fatal: bad revision '123cbec460db'
>
> is it supposed to be found in the master branch or this a commit prior
> 2005?

Sigh.  I don't know how that happened.  The commit I posted is something
on my development branch.  That was commit was supposed to be
77f6ab8b7768 ("don't dump the threads that had been already exiting when
zapped.")

>> 
>> For alpha it is has the assembly function do_switch_stack been called
>> before your code path was called in the kernel.
>> 
>> Since io_uring does not have a userspace  I don't know if testing
>> for PF_SIGNALED is at all meaningful to detect values saved on the
>> stack.
>> 
>> I suspect io_uring is simply broken on architectures that need
>> extra state saved on the stack, but I haven't looked yet.
>> 
>> 
>> > So I am not sure if the synchronizatin MUST be applied to io_workers
>> > or not but the proposed patch is making sure that it is applied in
>> > all cases if it is needed.
>> 
>> That patch is definitely wrong.  If anything the check in exit_mm
>> should be updated.
>
> with your explanation, if the only purpose of the synchronization is to
> make sure that the task stack register is pointing on its kernel stack
> to be able to dump its user hardware context, it is not needed for io-
> worker tasks since they never exit the kernel.

Not true.  The registers will be copied into the coredump if PF_SIGNALED
is set.  Similarly if ptrace reads the process's registers.  If those
registers are not at the expected place on the stack then random stack
contents will be copied to userspace resulting in an information leak.

If it was just coredumps I would say clearing PF_SIGNALED or something
equivalent is the way to go.  However since it is also ptrace as well
it looks something io_uring tasks need to handle just to so they are not
a special case.

Just to be technical PF_SIGNALED isn't the synchronization.  The
synchronization is noticing that mm->core_state is set under mmap_lock,
and causing all such tasks to decrement a count.  When the count goes to
zero the coredump is started because it knows all of the tasks that need
to be have stopped and are waiting for their state to be read.

> I think that this is the confirmation that I wanted to get from the
> patch code review because this point wasn't 100% clear to me (and I was
> having issues getting a core dump generated properly when using
> io_uring. I found the culprit of that issue. This isn't the problem
> described in this patch but I thought that the current problem despite
> not being the one responsible for my coredump issue could still have
> some merit).
>
> the doubt was coming from the fact that io-worker tasks are almost
> userspace tasks except that they never escape the kernel and the
> observation that it is the first type of task that could theoritically
> exit from a group_exit without having their PF_SIGNALED bit set.

Which is fine.  The code today considers such tasks as tasks
that have logically exited before the coredump started.

>> Can you share which code paths in io_uring exit with a
>> fatal_signal_pending and don't bother to call get_signal?
>
> Yes. One was provided as part of the description of this patch:
> io_wqe_worker() (fs/io-wq.c)
>
> and
> io_sq_thread() (fs/io_uring.c)
> is another example
>
> and possibly all the other future threads created with
> create_io_thread() will share the same pattern.
>
> They all enter an while loop and contain 1 or many other break points
> beside getting a signal.
>
> imho, it is quite exceptional situation but it could happen that those
> threads exit with a pending signal.
>
> Before 5.13, when io-wq had a second type of io thread (io-mgr), there
> was a real chance of that happening by a race condition between the io-
> mgr setting the wq state bit IO_WQ_BIT_EXIT upon receiving a SIGKILL
> and its io-workers threads exiting their while loop.
>
>
> but with io-mgr removal in 5.13, this risk is now gone...
>
> I have first offered a patch to io_uring to call get_signal() one last
> time before exiting:
> https://lore.kernel.org/io-uring/[email protected]/T/#t
>
> but discussing with Jens made me realize that it wasn't the right place
> to do this because:
> 1. Interrupts aren't disabled and the task can be preemted even after
> this last get_signal() call.
> 2. If that is something needed, it was the bad place to do it because
> you would have to repeat it for all the current existing io_threads and
> the future ones too.
>
> but again, after your explanations, this might not be required at all
> after all...

Eric

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2021-06-08 18:46 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <[email protected]>
2021-06-08  3:49 ` [PATCH] signal: Set PF_SIGNALED flag for io workers during a group exit Eric W. Biederman
2021-06-08 17:43   ` Olivier Langlois
2021-06-08 18:44     ` Eric W. Biederman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox