public inbox for io-uring@vger.kernel.org
* [PATCHSET RFC v2 0/5] Cancel and wait for all requests on exit
@ 2025-03-21 19:24 Jens Axboe
  2025-03-21 19:24 ` [PATCH 1/5] fs: gate final fput task_work on PF_NO_TASKWORK Jens Axboe
                   ` (4 more replies)
  0 siblings, 5 replies; 10+ messages in thread
From: Jens Axboe @ 2025-03-21 19:24 UTC (permalink / raw)
  To: io-uring; +Cc: asml.silence

Hi,

Currently, when a ring is shut down, some cancelations may happen
out-of-line. This means that an application cannot rely on ring exit
guaranteeing that all of its IO has fully completed, nor can anyone
waiting for an application (which has a ring with pending IO) to
terminate assume that all of its requests are done. This has also
manifested itself in testing, which sometimes finds a mount point busy
after a test has exited, because it may take a brief period of time for
things to quiesce and be fully done.

This patchset makes the task wait on the cancelations, if any, when
the io_uring file fd is being put. That has the effect of ensuring that
pending IO has fully completed, and files closed, before the ring exit
returns.

I posted a previous version of this - fundamentally this one is the
same, with the main difference being that rather than inventing our own
type of references for the ring, a basic atomic_long_t is used.
io_uring batches the reference gets and puts on the ring, so this should
not be noticeable. The only potential outlier is setting up a ring
without DEFER_TASKRUN, where running task_work results in an atomic
dec and inc per ring. We can probably do something about that, but I
don't consider it pressing.

The switch away from percpu reference counts is done mostly because
exiting those references will cost us an RCU grace period. That will
noticeably slow down the ring tear down.

The changes can also be found here:

https://git.kernel.dk/cgit/linux/log/?h=io_uring-exit-cancel.2

 fs/file_table.c                |  2 +-
 include/linux/io_uring_types.h |  4 +-
 include/linux/sched.h          |  2 +-
 io_uring/io_uring.c            | 79 +++++++++++++++++++++++-----------
 io_uring/io_uring.h            |  3 +-
 io_uring/msg_ring.c            |  4 +-
 io_uring/refs.h                | 43 ++++++++++++++++++
 io_uring/register.c            |  2 +-
 io_uring/rw.c                  |  2 +-
 io_uring/sqpoll.c              |  2 +-
 io_uring/zcrx.c                |  4 +-
 kernel/fork.c                  |  2 +-
 12 files changed, 111 insertions(+), 38 deletions(-)

-- 
Jens Axboe


* [PATCHSET RFC 0/5] Wait on cancelations at release time
@ 2024-06-04 19:01 Jens Axboe
  2024-06-04 19:01 ` [PATCH 2/5] io_uring: mark exit side kworkers as task_work capable Jens Axboe
  0 siblings, 1 reply; 10+ messages in thread
From: Jens Axboe @ 2024-06-04 19:01 UTC (permalink / raw)
  To: io-uring

Hi,

I've posted this before, but did a bit more work on it and am sending
it out again. The idea is to ensure that we've done any fputs that we
need to when a task using a ring exits, so that we don't leave
references that will get put "shortly afterwards". Currently,
cancelations are done by the ring exit work, which is punted to a
kworker. This means that after the final ->release() on the io_uring fd
has completed, there can still be pending fputs. This can be observed
by running the following script:


#!/bin/bash

DEV=/dev/nvme0n1
MNT=/data
ITER=0

while true; do
	echo loop $ITER
	sudo mount $DEV $MNT
	fio --name=test --ioengine=io_uring --iodepth=2 --filename=$MNT/foo --size=1g --buffered=1 --overwrite=0 --numjobs=12 --minimal --rw=randread --thread=1 --output=/dev/null --eta=never &
	Y=$(($RANDOM % 3))
	X=$(($RANDOM % 10))
	VAL="$Y.$X"
	sleep $VAL
	FIO_PID=$(pidof fio)
	if [ -z "$FIO_PID" ]; then
		((ITER++))
		continue
	fi
	ps -e | grep fio > /dev/null 2>&1
	while [ $? -eq 0 ]; do
		kill -KILL $FIO_PID > /dev/null 2>&1
		echo will wait
		wait > /dev/null 2>&1
		echo done waiting
		ps -e | grep "fio " > /dev/null 2>&1
	done
	sudo umount /data
	if [ $? -ne 0 ]; then
		break
	fi
	((ITER++))
done

which just starts a fio job doing buffered IO against a file on the
mount point, kills it, waits for the task to exit, and then immediately
tries to unmount the filesystem. Currently that will at some point
trigger:

[...]
loop 9
will wait(f=12)
done waiting
umount: /data: target is busy.

as the umount raced with the final fputs on the files being accessed
on the mount point.

There are a few parts to this:

1) Final fput is done via task_work, but for kernel threads, it's done
   via a delayed work queue. Patches 1+2 allow kernel threads to use
   task_work like other threads, so that the fputs can be quiesced per
   task rather than needing to flush a system-wide global pending list
   that can hold pending final releases for any task or file.

2) Patch 3 moves away from percpu reference counts, as those require
   an RCU sync on freeing. As the goal is to move to sync cancelations
   on exit, this can add considerable latency. Outside of that, percpu
   ref counts provide a lot of guarantees and features that io_uring
   doesn't need, and the new approach is faster.

3) Finally, make the cancelations sync. They are still offloaded to
   a kworker, but the task doing ->release() waits for them to finish.

With this, the above test case works fine, as expected.

I'll send patches 1+2 separately, but wanted to get this out for review
and discussion first.

Patches are against current -git, with io_uring 6.10 and 6.11 pending
changes pulled in. You can also find the patches here:

https://git.kernel.dk/cgit/linux/log/?h=io_uring-exit-cancel

 fs/file_table.c                |  2 +-
 include/linux/io_uring_types.h |  4 +-
 include/linux/sched.h          |  2 +-
 io_uring/Makefile              |  2 +-
 io_uring/io_uring.c            | 77 ++++++++++++++++++++++++----------
 io_uring/io_uring.h            |  3 +-
 io_uring/refs.c                | 58 +++++++++++++++++++++++++
 io_uring/refs.h                | 53 +++++++++++++++++++++++
 io_uring/register.c            |  3 +-
 io_uring/rw.c                  |  3 +-
 io_uring/sqpoll.c              |  3 +-
 kernel/fork.c                  |  2 +-
 12 files changed, 182 insertions(+), 30 deletions(-)

-- 
Jens Axboe




Thread overview: 10+ messages
2025-03-21 19:24 [PATCHSET RFC v2 0/5] Cancel and wait for all requests on exit Jens Axboe
2025-03-21 19:24 ` [PATCH 1/5] fs: gate final fput task_work on PF_NO_TASKWORK Jens Axboe
2025-03-21 19:24 ` [PATCH 2/5] io_uring: mark exit side kworkers as task_work capable Jens Axboe
2025-03-21 19:24 ` [PATCH 3/5] io_uring: consider ring dead once the ref is marked dying Jens Axboe
2025-03-21 21:22   ` Pavel Begunkov
2025-03-21 19:24 ` [PATCH 4/5] io_uring: wait for cancelations on final ring put Jens Axboe
2025-03-21 19:24 ` [PATCH 5/5] io_uring: switch away from percpu refcounts Jens Axboe
  -- strict thread matches above, loose matches on Subject: below --
2024-06-04 19:01 [PATCHSET RFC 0/5] Wait on cancelations at release time Jens Axboe
2024-06-04 19:01 ` [PATCH 2/5] io_uring: mark exit side kworkers as task_work capable Jens Axboe
2024-06-05 15:01   ` Pavel Begunkov
2024-06-05 18:08     ` Jens Axboe
