public inbox for [email protected]
* [PATCHSET v4 0/8] Improve async iomap DIO performance
@ 2023-07-20 18:13 Jens Axboe
  2023-07-20 18:13 ` [PATCH 1/8] iomap: cleanup up iomap_dio_bio_end_io() Jens Axboe
                   ` (7 more replies)
  0 siblings, 8 replies; 39+ messages in thread
From: Jens Axboe @ 2023-07-20 18:13 UTC (permalink / raw)
  To: io-uring, linux-xfs; +Cc: hch, andres, david

Hi,

iomap always punts async dio write completions to a workqueue, which has
a cost in terms of efficiency (now you need an unrelated worker to
process it) and latency (now you're bouncing a completion through an
async worker, which is a classic slowdown scenario).

Even for writes that should, in theory, be able to complete inline,
if we race with truncate or need to invalidate pages after completion,
we cannot sanely do that from IRQ context, as the locking types don't
allow for it.

io_uring handles IRQ completions via task_work, and for writes that
don't need to do extra IO at completion time, we can safely complete
them inline from there. This patchset adds IOCB_DIO_DEFER, which an IO
issuer can set to inform the completion side that any extra work that
needs doing for that completion can be punted to a safe task context.

The iomap dio completion will happen in hard/soft irq context, and we
need a saner context to process these completions. IOCB_DIO_DEFER can
be set in a struct kiocb->ki_flags by the issuer. If the completion
side of the iocb handling understands this flag, it can choose to set
a kiocb->dio_complete() handler and just call ki_complete from IRQ
context. The issuer must then ensure that this callback is processed
from a task. io_uring punts IRQ completions to task_work already, so
it's trivial to wire it up to run more of the completion before posting
a CQE. This is good for up to a 37% improvement in throughput/latency
for low queue depth IO; patch 8 has the details.
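
Concretely, the flow ends up looking roughly like this (simplified from
the patches in this series, error handling and io_uring specifics
omitted):

/* issuer, at submission time, for non-polled IO (patch 7) */
kiocb->ki_flags |= IOCB_DIO_DEFER;
kiocb->dio_complete = NULL;

/* iomap dio end_io, called from IRQ context (patch 8) */
iocb->private = dio;
iocb->dio_complete = iomap_dio_deferred_complete;
iocb->ki_complete(iocb, 0);	/* result is ignored for this case */

/* issuer, later, from task_work in task context (patch 7) */
if (kiocb->dio_complete)
	res = kiocb->dio_complete(kiocb->private);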

If we need to do real work at completion time, iomap will clear the
IOMAP_DIO_DEFER_COMP flag.

This work came about when Andres tested low queue depth dio writes
for postgres and compared them to doing sync dio writes, showing that
the async processing slows us down a lot.

Dave, would appreciate your input on whether the logic is now right in
terms of when we can complete inline when IOCB_DIO_DEFER is set!

 fs/iomap/direct-io.c | 154 +++++++++++++++++++++++++++++++++----------
 include/linux/fs.h   |  34 +++++++++-
 io_uring/rw.c        |  27 +++++++-
 3 files changed, 176 insertions(+), 39 deletions(-)

Can also be found in a git branch here:

https://git.kernel.dk/cgit/linux/log/?h=xfs-async-dio.4

Since v3:
- Add two patches for polled IO. One that completes inline if REQ_POLLED
  is still set at completion time, and one that cleans up the iocb->private
  handling and adds comments on why it is only relevant for polled IO.
- Rename IOMAP_DIO_WRITE_FUA to IOMAP_DIO_STABLE_WRITE in conjunction
  with treating fua && vwc the same as !vwc.
- Address review comments from Christoph
- Add comments and expand commit messages, where appropriate.

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 1/8] iomap: cleanup up iomap_dio_bio_end_io()
  2023-07-20 18:13 [PATCHSET v4 0/8] Improve async iomap DIO performance Jens Axboe
@ 2023-07-20 18:13 ` Jens Axboe
  2023-07-21  6:14   ` Christoph Hellwig
  2023-07-21 15:13   ` Darrick J. Wong
  2023-07-20 18:13 ` [PATCH 2/8] iomap: add IOMAP_DIO_INLINE_COMP Jens Axboe
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 39+ messages in thread
From: Jens Axboe @ 2023-07-20 18:13 UTC (permalink / raw)
  To: io-uring, linux-xfs; +Cc: hch, andres, david, Jens Axboe

Make the logic a bit easier to follow:

1) Add a release_bio out path, as everybody needs to touch that, and
   have our bio ref check jump there if it's non-zero.
2) Add a kiocb local variable.
3) Add comments for each of the three conditions (sync, inline, or
   async workqueue punt).

No functional changes in this patch.

Signed-off-by: Jens Axboe <[email protected]>
---
 fs/iomap/direct-io.c | 46 +++++++++++++++++++++++++++++---------------
 1 file changed, 31 insertions(+), 15 deletions(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index ea3b868c8355..0ce60e80c901 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -152,27 +152,43 @@ void iomap_dio_bio_end_io(struct bio *bio)
 {
 	struct iomap_dio *dio = bio->bi_private;
 	bool should_dirty = (dio->flags & IOMAP_DIO_DIRTY);
+	struct kiocb *iocb = dio->iocb;
 
 	if (bio->bi_status)
 		iomap_dio_set_error(dio, blk_status_to_errno(bio->bi_status));
+	if (!atomic_dec_and_test(&dio->ref))
+		goto release_bio;
 
-	if (atomic_dec_and_test(&dio->ref)) {
-		if (dio->wait_for_completion) {
-			struct task_struct *waiter = dio->submit.waiter;
-			WRITE_ONCE(dio->submit.waiter, NULL);
-			blk_wake_io_task(waiter);
-		} else if (dio->flags & IOMAP_DIO_WRITE) {
-			struct inode *inode = file_inode(dio->iocb->ki_filp);
-
-			WRITE_ONCE(dio->iocb->private, NULL);
-			INIT_WORK(&dio->aio.work, iomap_dio_complete_work);
-			queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work);
-		} else {
-			WRITE_ONCE(dio->iocb->private, NULL);
-			iomap_dio_complete_work(&dio->aio.work);
-		}
+	/*
+	 * Synchronous dio, task itself will handle any completion work
+	 * that needs after IO. All we need to do is wake the task.
+	 */
+	if (dio->wait_for_completion) {
+		struct task_struct *waiter = dio->submit.waiter;
+
+		WRITE_ONCE(dio->submit.waiter, NULL);
+		blk_wake_io_task(waiter);
+		goto release_bio;
+	}
+
+	/* Read completion can always complete inline. */
+	if (!(dio->flags & IOMAP_DIO_WRITE)) {
+		WRITE_ONCE(iocb->private, NULL);
+		iomap_dio_complete_work(&dio->aio.work);
+		goto release_bio;
 	}
 
+	/*
+	 * Async DIO completion that requires filesystem level completion work
+	 * gets punted to a work queue to complete as the operation may require
+	 * more IO to be issued to finalise filesystem metadata changes or
+	 * guarantee data integrity.
+	 */
+	WRITE_ONCE(iocb->private, NULL);
+	INIT_WORK(&dio->aio.work, iomap_dio_complete_work);
+	queue_work(file_inode(iocb->ki_filp)->i_sb->s_dio_done_wq,
+			&dio->aio.work);
+release_bio:
 	if (should_dirty) {
 		bio_check_pages_dirty(bio);
 	} else {
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 2/8] iomap: add IOMAP_DIO_INLINE_COMP
  2023-07-20 18:13 [PATCHSET v4 0/8] Improve async iomap DIO performance Jens Axboe
  2023-07-20 18:13 ` [PATCH 1/8] iomap: cleanup up iomap_dio_bio_end_io() Jens Axboe
@ 2023-07-20 18:13 ` Jens Axboe
  2023-07-21  6:14   ` Christoph Hellwig
  2023-07-21 15:16   ` Darrick J. Wong
  2023-07-20 18:13 ` [PATCH 3/8] iomap: treat a write through cache the same as FUA Jens Axboe
                   ` (5 subsequent siblings)
  7 siblings, 2 replies; 39+ messages in thread
From: Jens Axboe @ 2023-07-20 18:13 UTC (permalink / raw)
  To: io-uring, linux-xfs; +Cc: hch, andres, david, Jens Axboe

Rather than gate whether or not we need to punt a dio completion to a
workqueue on whether the IO is a write or not, add an explicit flag for
it. For now we treat them the same; reads always set the flag and async
writes do not.

No functional changes in this patch.

Signed-off-by: Jens Axboe <[email protected]>
---
 fs/iomap/direct-io.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 0ce60e80c901..c654612b24e5 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -20,6 +20,7 @@
  * Private flags for iomap_dio, must not overlap with the public ones in
  * iomap.h:
  */
+#define IOMAP_DIO_INLINE_COMP	(1 << 27)
 #define IOMAP_DIO_WRITE_FUA	(1 << 28)
 #define IOMAP_DIO_NEED_SYNC	(1 << 29)
 #define IOMAP_DIO_WRITE		(1 << 30)
@@ -171,8 +172,10 @@ void iomap_dio_bio_end_io(struct bio *bio)
 		goto release_bio;
 	}
 
-	/* Read completion can always complete inline. */
-	if (!(dio->flags & IOMAP_DIO_WRITE)) {
+	/*
+	 * Flagged with IOMAP_DIO_INLINE_COMP, we can complete it inline
+	 */
+	if (dio->flags & IOMAP_DIO_INLINE_COMP) {
 		WRITE_ONCE(iocb->private, NULL);
 		iomap_dio_complete_work(&dio->aio.work);
 		goto release_bio;
@@ -527,6 +530,9 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		iomi.flags |= IOMAP_NOWAIT;
 
 	if (iov_iter_rw(iter) == READ) {
+		/* reads can always complete inline */
+		dio->flags |= IOMAP_DIO_INLINE_COMP;
+
 		if (iomi.pos >= dio->i_size)
 			goto out_free_dio;
 
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 3/8] iomap: treat a write through cache the same as FUA
  2023-07-20 18:13 [PATCHSET v4 0/8] Improve async iomap DIO performance Jens Axboe
  2023-07-20 18:13 ` [PATCH 1/8] iomap: cleanup up iomap_dio_bio_end_io() Jens Axboe
  2023-07-20 18:13 ` [PATCH 2/8] iomap: add IOMAP_DIO_INLINE_COMP Jens Axboe
@ 2023-07-20 18:13 ` Jens Axboe
  2023-07-21  6:15   ` Christoph Hellwig
  2023-07-20 18:13 ` [PATCH 4/8] iomap: completed polled IO inline Jens Axboe
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 39+ messages in thread
From: Jens Axboe @ 2023-07-20 18:13 UTC (permalink / raw)
  To: io-uring, linux-xfs; +Cc: hch, andres, david, Jens Axboe

Whether we have a write back cache and are using FUA or don't have
a write back cache at all is the same situation. Treat them the same.

This makes the IOMAP_DIO_WRITE_FUA name a bit misleading, as we have
two cases that provide stable writes:

1) Volatile write cache with FUA writes
2) Normal write without a volatile write cache

Rename that flag to IOMAP_DIO_STABLE_WRITE to make that clearer, and
update some of the FUA comments as well.

Signed-off-by: Jens Axboe <[email protected]>
---
 fs/iomap/direct-io.c | 29 +++++++++++++++++------------
 1 file changed, 17 insertions(+), 12 deletions(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index c654612b24e5..9f97d0d03724 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -21,7 +21,7 @@
  * iomap.h:
  */
 #define IOMAP_DIO_INLINE_COMP	(1 << 27)
-#define IOMAP_DIO_WRITE_FUA	(1 << 28)
+#define IOMAP_DIO_STABLE_WRITE	(1 << 28)
 #define IOMAP_DIO_NEED_SYNC	(1 << 29)
 #define IOMAP_DIO_WRITE		(1 << 30)
 #define IOMAP_DIO_DIRTY		(1 << 31)
@@ -222,7 +222,7 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
 /*
  * Figure out the bio's operation flags from the dio request, the
  * mapping, and whether or not we want FUA.  Note that we can end up
- * clearing the WRITE_FUA flag in the dio request.
+ * clearing the STABLE_WRITE flag in the dio request.
  */
 static inline blk_opf_t iomap_dio_bio_opflags(struct iomap_dio *dio,
 		const struct iomap *iomap, bool use_fua)
@@ -236,7 +236,7 @@ static inline blk_opf_t iomap_dio_bio_opflags(struct iomap_dio *dio,
 	if (use_fua)
 		opflags |= REQ_FUA;
 	else
-		dio->flags &= ~IOMAP_DIO_WRITE_FUA;
+		dio->flags &= ~IOMAP_DIO_STABLE_WRITE;
 
 	return opflags;
 }
@@ -276,11 +276,13 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
 		 * Use a FUA write if we need datasync semantics, this is a pure
 		 * data IO that doesn't require any metadata updates (including
 		 * after IO completion such as unwritten extent conversion) and
-		 * the underlying device supports FUA. This allows us to avoid
-		 * cache flushes on IO completion.
+		 * the underlying device either supports FUA or doesn't have
+		 * a volatile write cache. This allows us to avoid cache flushes
+		 * on IO completion.
 		 */
 		if (!(iomap->flags & (IOMAP_F_SHARED|IOMAP_F_DIRTY)) &&
-		    (dio->flags & IOMAP_DIO_WRITE_FUA) && bdev_fua(iomap->bdev))
+		    (dio->flags & IOMAP_DIO_STABLE_WRITE) &&
+		    (bdev_fua(iomap->bdev) || !bdev_write_cache(iomap->bdev)))
 			use_fua = true;
 	}
 
@@ -560,12 +562,15 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 
 		       /*
 			* For datasync only writes, we optimistically try
-			* using FUA for this IO.  Any non-FUA write that
-			* occurs will clear this flag, hence we know before
-			* completion whether a cache flush is necessary.
+			* using STABLE_WRITE for this IO. Stable writes are
+			* either FUA with a write cache, or a normal write to
+			* a device without a volatile write cache. For the
+			* former, Any non-FUA write that occurs will clear this
+			* flag, hence we know before completion whether a cache
+			* flush is necessary.
 			*/
 			if (!(iocb->ki_flags & IOCB_SYNC))
-				dio->flags |= IOMAP_DIO_WRITE_FUA;
+				dio->flags |= IOMAP_DIO_STABLE_WRITE;
 		}
 
 		/*
@@ -627,10 +632,10 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		iomap_dio_set_error(dio, ret);
 
 	/*
-	 * If all the writes we issued were FUA, we don't need to flush the
+	 * If all the writes we issued were stable, we don't need to flush the
 	 * cache on IO completion. Clear the sync flag for this case.
 	 */
-	if (dio->flags & IOMAP_DIO_WRITE_FUA)
+	if (dio->flags & IOMAP_DIO_STABLE_WRITE)
 		dio->flags &= ~IOMAP_DIO_NEED_SYNC;
 
 	WRITE_ONCE(iocb->private, dio->submit.poll_bio);
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 4/8] iomap: completed polled IO inline
  2023-07-20 18:13 [PATCHSET v4 0/8] Improve async iomap DIO performance Jens Axboe
                   ` (2 preceding siblings ...)
  2023-07-20 18:13 ` [PATCH 3/8] iomap: treat a write through cache the same as FUA Jens Axboe
@ 2023-07-20 18:13 ` Jens Axboe
  2023-07-21  6:16   ` Christoph Hellwig
                     ` (2 more replies)
  2023-07-20 18:13 ` [PATCH 5/8] iomap: only set iocb->private for polled bio Jens Axboe
                   ` (3 subsequent siblings)
  7 siblings, 3 replies; 39+ messages in thread
From: Jens Axboe @ 2023-07-20 18:13 UTC (permalink / raw)
  To: io-uring, linux-xfs; +Cc: hch, andres, david, Jens Axboe

Polled IO is only allowed for conditions where task completion is safe
anyway, so we can always complete it inline. This cannot easily be
checked with a submission side flag, as the block layer may clear the
polled flag and turn it into a regular IO instead. Hence we need to
check this at completion time. If REQ_POLLED is still set, then we know
that this IO was successfully polled, and is completing in task context.

Signed-off-by: Jens Axboe <[email protected]>
---
 fs/iomap/direct-io.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 9f97d0d03724..c3ea1839628f 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -173,9 +173,19 @@ void iomap_dio_bio_end_io(struct bio *bio)
 	}
 
 	/*
-	 * Flagged with IOMAP_DIO_INLINE_COMP, we can complete it inline
+	 * Flagged with IOMAP_DIO_INLINE_COMP, we can complete it inline.
+	 * Ditto for polled requests - if the flag is still at completion
+	 * time, then we know the request was actually polled and completion
+	 * is called from the task itself. This is why we need to check it
+	 * here rather than flag it at issue time.
 	 */
-	if (dio->flags & IOMAP_DIO_INLINE_COMP) {
+	if ((dio->flags & IOMAP_DIO_INLINE_COMP) || (bio->bi_opf & REQ_POLLED)) {
+		/*
+		 * For polled IO, we need to clear ->private as it points to
+		 * the bio being polled for. The completion side uses it to
+		 * know if a given request has been found yet or not. For
+		 * non-polled IO, ->private isn't applicable.
+		 */
 		WRITE_ONCE(iocb->private, NULL);
 		iomap_dio_complete_work(&dio->aio.work);
 		goto release_bio;
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 5/8] iomap: only set iocb->private for polled bio
  2023-07-20 18:13 [PATCHSET v4 0/8] Improve async iomap DIO performance Jens Axboe
                   ` (3 preceding siblings ...)
  2023-07-20 18:13 ` [PATCH 4/8] iomap: completed polled IO inline Jens Axboe
@ 2023-07-20 18:13 ` Jens Axboe
  2023-07-21  6:18   ` Christoph Hellwig
  2023-07-21 15:35   ` Darrick J. Wong
  2023-07-20 18:13 ` [PATCH 6/8] fs: add IOCB flags related to passing back dio completions Jens Axboe
                   ` (2 subsequent siblings)
  7 siblings, 2 replies; 39+ messages in thread
From: Jens Axboe @ 2023-07-20 18:13 UTC (permalink / raw)
  To: io-uring, linux-xfs; +Cc: hch, andres, david, Jens Axboe

iocb->private is only used for polled IO, where the completer will
find the bio to poll through that field.

Assign it when we're submitting a polled bio, and get rid of the
dio->poll_bio indirection.

Signed-off-by: Jens Axboe <[email protected]>
---
 fs/iomap/direct-io.c | 13 +++++--------
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index c3ea1839628f..cce9af019705 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -42,7 +42,6 @@ struct iomap_dio {
 		struct {
 			struct iov_iter		*iter;
 			struct task_struct	*waiter;
-			struct bio		*poll_bio;
 		} submit;
 
 		/* used for aio completion: */
@@ -64,12 +63,14 @@ static struct bio *iomap_dio_alloc_bio(const struct iomap_iter *iter,
 static void iomap_dio_submit_bio(const struct iomap_iter *iter,
 		struct iomap_dio *dio, struct bio *bio, loff_t pos)
 {
+	struct kiocb *iocb = dio->iocb;
+
 	atomic_inc(&dio->ref);
 
 	/* Sync dio can't be polled reliably */
-	if ((dio->iocb->ki_flags & IOCB_HIPRI) && !is_sync_kiocb(dio->iocb)) {
-		bio_set_polled(bio, dio->iocb);
-		dio->submit.poll_bio = bio;
+	if ((iocb->ki_flags & IOCB_HIPRI) && !is_sync_kiocb(iocb)) {
+		bio_set_polled(bio, iocb);
+		WRITE_ONCE(iocb->private, bio);
 	}
 
 	if (dio->dops && dio->dops->submit_io)
@@ -197,7 +198,6 @@ void iomap_dio_bio_end_io(struct bio *bio)
 	 * more IO to be issued to finalise filesystem metadata changes or
 	 * guarantee data integrity.
 	 */
-	WRITE_ONCE(iocb->private, NULL);
 	INIT_WORK(&dio->aio.work, iomap_dio_complete_work);
 	queue_work(file_inode(iocb->ki_filp)->i_sb->s_dio_done_wq,
 			&dio->aio.work);
@@ -536,7 +536,6 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 
 	dio->submit.iter = iter;
 	dio->submit.waiter = current;
-	dio->submit.poll_bio = NULL;
 
 	if (iocb->ki_flags & IOCB_NOWAIT)
 		iomi.flags |= IOMAP_NOWAIT;
@@ -648,8 +647,6 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	if (dio->flags & IOMAP_DIO_STABLE_WRITE)
 		dio->flags &= ~IOMAP_DIO_NEED_SYNC;
 
-	WRITE_ONCE(iocb->private, dio->submit.poll_bio);
-
 	/*
 	 * We are about to drop our additional submission reference, which
 	 * might be the last reference to the dio.  There are three different
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 6/8] fs: add IOCB flags related to passing back dio completions
  2023-07-20 18:13 [PATCHSET v4 0/8] Improve async iomap DIO performance Jens Axboe
                   ` (4 preceding siblings ...)
  2023-07-20 18:13 ` [PATCH 5/8] iomap: only set iocb->private for polled bio Jens Axboe
@ 2023-07-20 18:13 ` Jens Axboe
  2023-07-21  6:18   ` Christoph Hellwig
  2023-07-21 15:48   ` Darrick J. Wong
  2023-07-20 18:13 ` [PATCH 7/8] io_uring/rw: add write support for IOCB_DIO_DEFER Jens Axboe
  2023-07-20 18:13 ` [PATCH 8/8] iomap: support IOCB_DIO_DEFER Jens Axboe
  7 siblings, 2 replies; 39+ messages in thread
From: Jens Axboe @ 2023-07-20 18:13 UTC (permalink / raw)
  To: io-uring, linux-xfs; +Cc: hch, andres, david, Jens Axboe

Async dio completions generally happen from hard/soft IRQ context, which
means that users like iomap may need to defer some of the completion
handling to a workqueue. This is less efficient than having the original
issuer handle it, like we do for sync IO, and it adds latency to the
completions.

Add IOCB_DIO_DEFER, which the issuer can set if it is able to safely
punt these completions to a safe context. If the dio handler is aware
of this flag, assign a callback handler in kiocb->dio_complete and
associated data in kiocb->private. The issuer will then call this handler
with that data from task context.
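
As a rough sketch of the issuer side of this contract (the issuer_*()
functions are placeholders for illustration; the actual io_uring
conversion follows in a later patch):

/* the issuer's ->ki_complete() handler, possibly called in IRQ context */
static void issuer_complete(struct kiocb *kiocb, long res)
{
	/*
	 * If ->dio_complete was assigned, 'res' is not the final result,
	 * it will be fetched from ->dio_complete() in task context instead.
	 */
	if (!kiocb->dio_complete)
		issuer_store_result(kiocb, res);
	issuer_punt_to_task_work(kiocb);
}

/* runs later, from task context */
static void issuer_task_work(struct kiocb *kiocb)
{
	if (kiocb->dio_complete)
		issuer_store_result(kiocb, kiocb->dio_complete(kiocb->private));
	issuer_finish_request(kiocb);
}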

No functional changes in this patch.

Signed-off-by: Jens Axboe <[email protected]>
---
 include/linux/fs.h | 34 ++++++++++++++++++++++++++++++++--
 1 file changed, 32 insertions(+), 2 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6867512907d6..2c589418a078 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -338,6 +338,20 @@ enum rw_hint {
 #define IOCB_NOIO		(1 << 20)
 /* can use bio alloc cache */
 #define IOCB_ALLOC_CACHE	(1 << 21)
+/*
+ * IOCB_DIO_DEFER can be set by the iocb owner, to indicate that the
+ * iocb completion can be passed back to the owner for execution from a safe
+ * context rather than needing to be punted through a workqueue. If this
+ * flag is set, the completion handling may set iocb->dio_complete to a
+ * handler, which the issuer will then call from task context to complete
+ * the processing of the iocb. iocb->private should then also be set to
+ * the argument being passed to this handler. Note that while this provides
+ * a task context for the dio_complete() callback, it should only be used
+ * on the completion side for non-IO generating completions. It's fine to
+ * call blocking functions from this callback, but they should not wait for
+ * unrelated IO (like cache flushing, new IO generation, etc).
+ */
+#define IOCB_DIO_DEFER		(1 << 22)
 
 /* for use in trace events */
 #define TRACE_IOCB_STRINGS \
@@ -351,7 +365,8 @@ enum rw_hint {
 	{ IOCB_WRITE,		"WRITE" }, \
 	{ IOCB_WAITQ,		"WAITQ" }, \
 	{ IOCB_NOIO,		"NOIO" }, \
-	{ IOCB_ALLOC_CACHE,	"ALLOC_CACHE" }
+	{ IOCB_ALLOC_CACHE,	"ALLOC_CACHE" }, \
+	{ IOCB_DIO_DEFER,	"DIO_DEFER" }
 
 struct kiocb {
 	struct file		*ki_filp;
@@ -360,7 +375,22 @@ struct kiocb {
 	void			*private;
 	int			ki_flags;
 	u16			ki_ioprio; /* See linux/ioprio.h */
-	struct wait_page_queue	*ki_waitq; /* for async buffered IO */
+	union {
+		/*
+		 * Only used for async buffered reads, where it denotes the
+		 * page waitqueue associated with completing the read. Valid
+		 * IFF IOCB_WAITQ is set.
+		 */
+		struct wait_page_queue	*ki_waitq;
+		/*
+		 * Can be used for O_DIRECT IO, where the completion handling
+		 * is punted back to the issuer of the IO. May only be set
+		 * if IOCB_DIO_DEFER is set by the issuer, and the issuer must
+		 * then check for presence of this handler when ki_complete is
+		 * invoked.
+		 */
+		ssize_t (*dio_complete)(void *data);
+	};
 };
 
 static inline bool is_sync_kiocb(struct kiocb *kiocb)
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 7/8] io_uring/rw: add write support for IOCB_DIO_DEFER
  2023-07-20 18:13 [PATCHSET v4 0/8] Improve async iomap DIO performance Jens Axboe
                   ` (5 preceding siblings ...)
  2023-07-20 18:13 ` [PATCH 6/8] fs: add IOCB flags related to passing back dio completions Jens Axboe
@ 2023-07-20 18:13 ` Jens Axboe
  2023-07-21  6:19   ` Christoph Hellwig
  2023-07-21 15:50   ` Darrick J. Wong
  2023-07-20 18:13 ` [PATCH 8/8] iomap: support IOCB_DIO_DEFER Jens Axboe
  7 siblings, 2 replies; 39+ messages in thread
From: Jens Axboe @ 2023-07-20 18:13 UTC (permalink / raw)
  To: io-uring, linux-xfs; +Cc: hch, andres, david, Jens Axboe

If the filesystem dio handler understands IOCB_DIO_DEFER, we'll get
a kiocb->ki_complete() callback with kiocb->dio_complete set. In that
case, rather than complete the IO directly through task_work, queue
up an intermediate task_work handler that first processes this
callback and then immediately completes the request.

For XFS, this avoids a punt through a workqueue, which is a lot less
efficient and adds latency to lower queue depth (or sync) O_DIRECT
writes.

Only do this for non-polled IO, as polled IO doesn't need this kind
of deferral as it always completes within the task itself. This then
avoids a check for deferral in the polled IO completion handler.

Signed-off-by: Jens Axboe <[email protected]>
---
 io_uring/rw.c | 27 ++++++++++++++++++++++++---
 1 file changed, 24 insertions(+), 3 deletions(-)

diff --git a/io_uring/rw.c b/io_uring/rw.c
index 1bce2208b65c..f4f700383b4e 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -285,6 +285,14 @@ static inline int io_fixup_rw_res(struct io_kiocb *req, long res)
 
 void io_req_rw_complete(struct io_kiocb *req, struct io_tw_state *ts)
 {
+	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
+
+	if (rw->kiocb.dio_complete) {
+		long res = rw->kiocb.dio_complete(rw->kiocb.private);
+
+		io_req_set_res(req, io_fixup_rw_res(req, res), 0);
+	}
+
 	io_req_io_end(req);
 
 	if (req->flags & (REQ_F_BUFFER_SELECTED|REQ_F_BUFFER_RING)) {
@@ -300,9 +308,11 @@ static void io_complete_rw(struct kiocb *kiocb, long res)
 	struct io_rw *rw = container_of(kiocb, struct io_rw, kiocb);
 	struct io_kiocb *req = cmd_to_io_kiocb(rw);
 
-	if (__io_complete_rw_common(req, res))
-		return;
-	io_req_set_res(req, io_fixup_rw_res(req, res), 0);
+	if (!rw->kiocb.dio_complete) {
+		if (__io_complete_rw_common(req, res))
+			return;
+		io_req_set_res(req, io_fixup_rw_res(req, res), 0);
+	}
 	req->io_task_work.func = io_req_rw_complete;
 	__io_req_task_work_add(req, IOU_F_TWQ_LAZY_WAKE);
 }
@@ -916,6 +926,17 @@ int io_write(struct io_kiocb *req, unsigned int issue_flags)
 	}
 	kiocb->ki_flags |= IOCB_WRITE;
 
+	/*
+	 * For non-polled IO, set IOCB_DIO_DEFER, stating that our handler
+	 * groks deferring the completion to task context. This isn't
+	 * necessary and useful for polled IO as that can always complete
+	 * directly.
+	 */
+	if (!(kiocb->ki_flags & IOCB_HIPRI)) {
+		kiocb->ki_flags |= IOCB_DIO_DEFER;
+		kiocb->dio_complete = NULL;
+	}
+
 	if (likely(req->file->f_op->write_iter))
 		ret2 = call_write_iter(req->file, kiocb, &s->iter);
 	else if (req->file->f_op->write)
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 8/8] iomap: support IOCB_DIO_DEFER
  2023-07-20 18:13 [PATCHSET v4 0/8] Improve async iomap DIO performance Jens Axboe
                   ` (6 preceding siblings ...)
  2023-07-20 18:13 ` [PATCH 7/8] io_uring/rw: add write support for IOCB_DIO_DEFER Jens Axboe
@ 2023-07-20 18:13 ` Jens Axboe
  2023-07-21  6:19   ` Christoph Hellwig
                     ` (2 more replies)
  7 siblings, 3 replies; 39+ messages in thread
From: Jens Axboe @ 2023-07-20 18:13 UTC (permalink / raw)
  To: io-uring, linux-xfs; +Cc: hch, andres, david, Jens Axboe

If IOCB_DIO_DEFER is set, utilize it to set the kiocb->dio_complete
handler and the data for that callback. Rather than punt the completion to a
workqueue, we pass back the handler and data to the issuer and will get a
callback from a safe task context.

Using the following fio job to randomly dio write 4k blocks at
queue depths of 1..16:

fio --name=dio-write --filename=/data1/file --time_based=1 \
--runtime=10 --bs=4096 --rw=randwrite --norandommap --buffered=0 \
--cpus_allowed=4 --ioengine=io_uring --iodepth=$depth

shows the following results before and after this patch:

	Stock	Patched		Diff
=======================================
QD1	155K	162K		+ 4.5%
QD2	290K	313K		+ 7.9%
QD4	533K	597K		+12.0%
QD8	604K	827K		+36.9%
QD16	615K	845K		+37.4%

which shows nice wins all around. If we factored in per-IOP efficiency,
the wins would look even nicer. This becomes apparent as queue depth
rises, where the offloaded workqueue completion path runs out of steam.

Signed-off-by: Jens Axboe <[email protected]>
---
 fs/iomap/direct-io.c | 54 +++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 53 insertions(+), 1 deletion(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index cce9af019705..de86680968a4 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -20,6 +20,7 @@
  * Private flags for iomap_dio, must not overlap with the public ones in
  * iomap.h:
  */
+#define IOMAP_DIO_DEFER_COMP	(1 << 26)
 #define IOMAP_DIO_INLINE_COMP	(1 << 27)
 #define IOMAP_DIO_STABLE_WRITE	(1 << 28)
 #define IOMAP_DIO_NEED_SYNC	(1 << 29)
@@ -132,6 +133,11 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio)
 }
 EXPORT_SYMBOL_GPL(iomap_dio_complete);
 
+static ssize_t iomap_dio_deferred_complete(void *data)
+{
+	return iomap_dio_complete(data);
+}
+
 static void iomap_dio_complete_work(struct work_struct *work)
 {
 	struct iomap_dio *dio = container_of(work, struct iomap_dio, aio.work);
@@ -192,6 +198,31 @@ void iomap_dio_bio_end_io(struct bio *bio)
 		goto release_bio;
 	}
 
+	/*
+	 * If this dio is flagged with IOMAP_DIO_DEFER_COMP, then schedule
+	 * our completion that way to avoid an async punt to a workqueue.
+	 */
+	if (dio->flags & IOMAP_DIO_DEFER_COMP) {
+		/* only polled IO cares about private cleared */
+		iocb->private = dio;
+		iocb->dio_complete = iomap_dio_deferred_complete;
+
+		/*
+		 * Invoke ->ki_complete() directly. We've assigned out
+		 * dio_complete callback handler, and since the issuer set
+		 * IOCB_DIO_DEFER, we know their ki_complete handler will
+		 * notice ->dio_complete being set and will defer calling that
+		 * handler until it can be done from a safe task context.
+		 *
+		 * Note that the 'res' being passed in here is not important
+		 * for this case. The actual completion value of the request
+		 * will be gotten from dio_complete when that is run by the
+		 * issuer.
+		 */
+		iocb->ki_complete(iocb, 0);
+		goto release_bio;
+	}
+
 	/*
 	 * Async DIO completion that requires filesystem level completion work
 	 * gets punted to a work queue to complete as the operation may require
@@ -288,12 +319,17 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
 		 * after IO completion such as unwritten extent conversion) and
 		 * the underlying device either supports FUA or doesn't have
 		 * a volatile write cache. This allows us to avoid cache flushes
-		 * on IO completion.
+		 * on IO completion. If we can't use stable writes and need to
+		 * sync, disable in-task completions as dio completion will
+		 * need to call generic_write_sync() which will do a blocking
+		 * fsync / cache flush call.
 		 */
 		if (!(iomap->flags & (IOMAP_F_SHARED|IOMAP_F_DIRTY)) &&
 		    (dio->flags & IOMAP_DIO_STABLE_WRITE) &&
 		    (bdev_fua(iomap->bdev) || !bdev_write_cache(iomap->bdev)))
 			use_fua = true;
+		else if (dio->flags & IOMAP_DIO_NEED_SYNC)
+			dio->flags &= ~IOMAP_DIO_DEFER_COMP;
 	}
 
 	/*
@@ -319,6 +355,13 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
 		pad = pos & (fs_block_size - 1);
 		if (pad)
 			iomap_dio_zero(iter, dio, pos - pad, pad);
+
+		/*
+		 * If need_zeroout is set, then this is a new or unwritten
+		 * extent. These need extra handling at completion time, so
+		 * disable in-task deferred completion for those.
+		 */
+		dio->flags &= ~IOMAP_DIO_DEFER_COMP;
 	}
 
 	/*
@@ -557,6 +600,15 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		iomi.flags |= IOMAP_WRITE;
 		dio->flags |= IOMAP_DIO_WRITE;
 
+		/*
+		 * Flag as supporting deferred completions, if the issuer
+		 * groks it. This can avoid a workqueue punt for writes.
+		 * We may later clear this flag if we need to do other IO
+		 * as part of this IO completion.
+		 */
+		if (iocb->ki_flags & IOCB_DIO_DEFER)
+			dio->flags |= IOMAP_DIO_DEFER_COMP;
+
 		if (dio_flags & IOMAP_DIO_OVERWRITE_ONLY) {
 			ret = -EAGAIN;
 			if (iomi.pos >= dio->i_size ||
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/8] iomap: cleanup up iomap_dio_bio_end_io()
  2023-07-20 18:13 ` [PATCH 1/8] iomap: cleanup up iomap_dio_bio_end_io() Jens Axboe
@ 2023-07-21  6:14   ` Christoph Hellwig
  2023-07-21 15:13   ` Darrick J. Wong
  1 sibling, 0 replies; 39+ messages in thread
From: Christoph Hellwig @ 2023-07-21  6:14 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, linux-xfs, hch, andres, david

Looks good:

Reviewed-by: Christoph Hellwig <[email protected]>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 2/8] iomap: add IOMAP_DIO_INLINE_COMP
  2023-07-20 18:13 ` [PATCH 2/8] iomap: add IOMAP_DIO_INLINE_COMP Jens Axboe
@ 2023-07-21  6:14   ` Christoph Hellwig
  2023-07-21 15:16   ` Darrick J. Wong
  1 sibling, 0 replies; 39+ messages in thread
From: Christoph Hellwig @ 2023-07-21  6:14 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, linux-xfs, hch, andres, david

Looks good:

Reviewed-by: Christoph Hellwig <[email protected]>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 3/8] iomap: treat a write through cache the same as FUA
  2023-07-20 18:13 ` [PATCH 3/8] iomap: treat a write through cache the same as FUA Jens Axboe
@ 2023-07-21  6:15   ` Christoph Hellwig
  2023-07-21 14:04     ` Jens Axboe
  0 siblings, 1 reply; 39+ messages in thread
From: Christoph Hellwig @ 2023-07-21  6:15 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, linux-xfs, hch, andres, david

On Thu, Jul 20, 2023 at 12:13:05PM -0600, Jens Axboe wrote:
> Whether we have a write back cache and are using FUA or don't have
> a write back cache at all is the same situation. Treat them the same.
> 
> This makes the IOMAP_DIO_WRITE_FUA name a bit misleading, as we have
> two cases that provide stable writes:
> 
> 1) Volatile write cache with FUA writes
> 2) Normal write without a volatile write cache
> 
> Rename that flag to IOMAP_DIO_STABLE_WRITE to make that clearer, and
> update some of the FUA comments as well.

I would have preferred IOMAP_DIO_WRITE_THROUGH, STABLE_WRITES is a flag
we use in file systems and the page cache for cases where the page
can't be touched before writeback has completed, e.g.
QUEUE_FLAG_STABLE_WRITES and SB_I_STABLE_WRITES.

Otherwise this looks good to me:

Reviewed-by: Christoph Hellwig <[email protected]>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/8] iomap: completed polled IO inline
  2023-07-20 18:13 ` [PATCH 4/8] iomap: completed polled IO inline Jens Axboe
@ 2023-07-21  6:16   ` Christoph Hellwig
  2023-07-21 15:19   ` Darrick J. Wong
  2023-07-21 21:43   ` Dave Chinner
  2 siblings, 0 replies; 39+ messages in thread
From: Christoph Hellwig @ 2023-07-21  6:16 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, linux-xfs, hch, andres, david

Looks good:

Reviewed-by: Christoph Hellwig <[email protected]>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 5/8] iomap: only set iocb->private for polled bio
  2023-07-20 18:13 ` [PATCH 5/8] iomap: only set iocb->private for polled bio Jens Axboe
@ 2023-07-21  6:18   ` Christoph Hellwig
  2023-07-21 15:35   ` Darrick J. Wong
  1 sibling, 0 replies; 39+ messages in thread
From: Christoph Hellwig @ 2023-07-21  6:18 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, linux-xfs, hch, andres, david

On Thu, Jul 20, 2023 at 12:13:07PM -0600, Jens Axboe wrote:
> iocb->private is only used for polled IO, where the completer will
> find the bio to poll through that field.
> 
> Assign it when we're submitting a polled bio, and get rid of the
> dio->poll_bio indirection.

Nice, with the current shape of the code poll_bio can indeed go away.

Looks good:

Reviewed-by: Christoph Hellwig <[email protected]>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 6/8] fs: add IOCB flags related to passing back dio completions
  2023-07-20 18:13 ` [PATCH 6/8] fs: add IOCB flags related to passing back dio completions Jens Axboe
@ 2023-07-21  6:18   ` Christoph Hellwig
  2023-07-21 15:48   ` Darrick J. Wong
  1 sibling, 0 replies; 39+ messages in thread
From: Christoph Hellwig @ 2023-07-21  6:18 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, linux-xfs, hch, andres, david

Looks good:

Reviewed-by: Christoph Hellwig <[email protected]>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 7/8] io_uring/rw: add write support for IOCB_DIO_DEFER
  2023-07-20 18:13 ` [PATCH 7/8] io_uring/rw: add write support for IOCB_DIO_DEFER Jens Axboe
@ 2023-07-21  6:19   ` Christoph Hellwig
  2023-07-21 15:50   ` Darrick J. Wong
  1 sibling, 0 replies; 39+ messages in thread
From: Christoph Hellwig @ 2023-07-21  6:19 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, linux-xfs, hch, andres, david

Looks good:

Reviewed-by: Christoph Hellwig <[email protected]>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 8/8] iomap: support IOCB_DIO_DEFER
  2023-07-20 18:13 ` [PATCH 8/8] iomap: support IOCB_DIO_DEFER Jens Axboe
@ 2023-07-21  6:19   ` Christoph Hellwig
  2023-07-21 16:01   ` Darrick J. Wong
  2023-07-21 22:05   ` Dave Chinner
  2 siblings, 0 replies; 39+ messages in thread
From: Christoph Hellwig @ 2023-07-21  6:19 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, linux-xfs, hch, andres, david

Looks good:

Reviewed-by: Christoph Hellwig <[email protected]>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 3/8] iomap: treat a write through cache the same as FUA
  2023-07-21  6:15   ` Christoph Hellwig
@ 2023-07-21 14:04     ` Jens Axboe
  2023-07-21 15:55       ` Darrick J. Wong
  0 siblings, 1 reply; 39+ messages in thread
From: Jens Axboe @ 2023-07-21 14:04 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: io-uring, linux-xfs, andres, david

On 7/21/23 12:15 AM, Christoph Hellwig wrote:
> On Thu, Jul 20, 2023 at 12:13:05PM -0600, Jens Axboe wrote:
>> Whether we have a write back cache and are using FUA or don't have
>> a write back cache at all is the same situation. Treat them the same.
>>
>> This makes the IOMAP_DIO_WRITE_FUA name a bit misleading, as we have
>> two cases that provide stable writes:
>>
>> 1) Volatile write cache with FUA writes
>> 2) Normal write without a volatile write cache
>>
>> Rename that flag to IOMAP_DIO_STABLE_WRITE to make that clearer, and
>> update some of the FUA comments as well.
> 
> I would have preferred IOMAP_DIO_WRITE_THROUGH, STABLE_WRITES is a flag
> we use in file systems and the page cache for cases where the page
> can't be touched before writeback has completed, e.g.
> QUEUE_FLAG_STABLE_WRITES and SB_I_STABLE_WRITES.

Good point, it does confuse terminology with stable pages for writes.
I'll change it to WRITE_THROUGH, that is more descriptive for this case.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/8] iomap: cleanup up iomap_dio_bio_end_io()
  2023-07-20 18:13 ` [PATCH 1/8] iomap: cleanup up iomap_dio_bio_end_io() Jens Axboe
  2023-07-21  6:14   ` Christoph Hellwig
@ 2023-07-21 15:13   ` Darrick J. Wong
  1 sibling, 0 replies; 39+ messages in thread
From: Darrick J. Wong @ 2023-07-21 15:13 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, linux-xfs, hch, andres, david

On Thu, Jul 20, 2023 at 12:13:03PM -0600, Jens Axboe wrote:
> Make the logic a bit easier to follow:
> 
> 1) Add a release_bio out path, as everybody needs to touch that, and
>    have our bio ref check jump there if it's non-zero.
> 2) Add a kiocb local variable.
> 3) Add comments for each of the three conditions (sync, inline, or
>    async workqueue punt).
> 
> No functional changes in this patch.
> 
> Signed-off-by: Jens Axboe <[email protected]>

Thanks for deindentifying this,
Reviewed-by: Darrick J. Wong <[email protected]>

--D

> ---
>  fs/iomap/direct-io.c | 46 +++++++++++++++++++++++++++++---------------
>  1 file changed, 31 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index ea3b868c8355..0ce60e80c901 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -152,27 +152,43 @@ void iomap_dio_bio_end_io(struct bio *bio)
>  {
>  	struct iomap_dio *dio = bio->bi_private;
>  	bool should_dirty = (dio->flags & IOMAP_DIO_DIRTY);
> +	struct kiocb *iocb = dio->iocb;
>  
>  	if (bio->bi_status)
>  		iomap_dio_set_error(dio, blk_status_to_errno(bio->bi_status));
> +	if (!atomic_dec_and_test(&dio->ref))
> +		goto release_bio;
>  
> -	if (atomic_dec_and_test(&dio->ref)) {
> -		if (dio->wait_for_completion) {
> -			struct task_struct *waiter = dio->submit.waiter;
> -			WRITE_ONCE(dio->submit.waiter, NULL);
> -			blk_wake_io_task(waiter);
> -		} else if (dio->flags & IOMAP_DIO_WRITE) {
> -			struct inode *inode = file_inode(dio->iocb->ki_filp);
> -
> -			WRITE_ONCE(dio->iocb->private, NULL);
> -			INIT_WORK(&dio->aio.work, iomap_dio_complete_work);
> -			queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work);
> -		} else {
> -			WRITE_ONCE(dio->iocb->private, NULL);
> -			iomap_dio_complete_work(&dio->aio.work);
> -		}
> +	/*
> +	 * Synchronous dio, task itself will handle any completion work
> +	 * that needs after IO. All we need to do is wake the task.
> +	 */
> +	if (dio->wait_for_completion) {
> +		struct task_struct *waiter = dio->submit.waiter;
> +
> +		WRITE_ONCE(dio->submit.waiter, NULL);
> +		blk_wake_io_task(waiter);
> +		goto release_bio;
> +	}
> +
> +	/* Read completion can always complete inline. */
> +	if (!(dio->flags & IOMAP_DIO_WRITE)) {
> +		WRITE_ONCE(iocb->private, NULL);
> +		iomap_dio_complete_work(&dio->aio.work);
> +		goto release_bio;
>  	}
>  
> +	/*
> +	 * Async DIO completion that requires filesystem level completion work
> +	 * gets punted to a work queue to complete as the operation may require
> +	 * more IO to be issued to finalise filesystem metadata changes or
> +	 * guarantee data integrity.
> +	 */
> +	WRITE_ONCE(iocb->private, NULL);
> +	INIT_WORK(&dio->aio.work, iomap_dio_complete_work);
> +	queue_work(file_inode(iocb->ki_filp)->i_sb->s_dio_done_wq,
> +			&dio->aio.work);
> +release_bio:
>  	if (should_dirty) {
>  		bio_check_pages_dirty(bio);
>  	} else {
> -- 
> 2.40.1
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 2/8] iomap: add IOMAP_DIO_INLINE_COMP
  2023-07-20 18:13 ` [PATCH 2/8] iomap: add IOMAP_DIO_INLINE_COMP Jens Axboe
  2023-07-21  6:14   ` Christoph Hellwig
@ 2023-07-21 15:16   ` Darrick J. Wong
  1 sibling, 0 replies; 39+ messages in thread
From: Darrick J. Wong @ 2023-07-21 15:16 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, linux-xfs, hch, andres, david

On Thu, Jul 20, 2023 at 12:13:04PM -0600, Jens Axboe wrote:
> Rather than gate whether or not we need to punt a dio completion to a
> workqueue on whether the IO is a write or not, add an explicit flag for
> it. For now we treat them the same; reads always set the flag and async
> writes do not.
> 
> No functional changes in this patch.
> 
> Signed-off-by: Jens Axboe <[email protected]>

Looks good,
Reviewed-by: Darrick J. Wong <[email protected]>

--D

> ---
>  fs/iomap/direct-io.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index 0ce60e80c901..c654612b24e5 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -20,6 +20,7 @@
>   * Private flags for iomap_dio, must not overlap with the public ones in
>   * iomap.h:
>   */
> +#define IOMAP_DIO_INLINE_COMP	(1 << 27)
>  #define IOMAP_DIO_WRITE_FUA	(1 << 28)
>  #define IOMAP_DIO_NEED_SYNC	(1 << 29)
>  #define IOMAP_DIO_WRITE		(1 << 30)
> @@ -171,8 +172,10 @@ void iomap_dio_bio_end_io(struct bio *bio)
>  		goto release_bio;
>  	}
>  
> -	/* Read completion can always complete inline. */
> -	if (!(dio->flags & IOMAP_DIO_WRITE)) {
> +	/*
> +	 * Flagged with IOMAP_DIO_INLINE_COMP, we can complete it inline
> +	 */
> +	if (dio->flags & IOMAP_DIO_INLINE_COMP) {
>  		WRITE_ONCE(iocb->private, NULL);
>  		iomap_dio_complete_work(&dio->aio.work);
>  		goto release_bio;
> @@ -527,6 +530,9 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  		iomi.flags |= IOMAP_NOWAIT;
>  
>  	if (iov_iter_rw(iter) == READ) {
> +		/* reads can always complete inline */
> +		dio->flags |= IOMAP_DIO_INLINE_COMP;
> +
>  		if (iomi.pos >= dio->i_size)
>  			goto out_free_dio;
>  
> -- 
> 2.40.1
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/8] iomap: completed polled IO inline
  2023-07-20 18:13 ` [PATCH 4/8] iomap: completed polled IO inline Jens Axboe
  2023-07-21  6:16   ` Christoph Hellwig
@ 2023-07-21 15:19   ` Darrick J. Wong
  2023-07-21 21:43   ` Dave Chinner
  2 siblings, 0 replies; 39+ messages in thread
From: Darrick J. Wong @ 2023-07-21 15:19 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, linux-xfs, hch, andres, david

On Thu, Jul 20, 2023 at 12:13:06PM -0600, Jens Axboe wrote:
> Polled IO is only allowed for conditions where task completion is safe
> anyway, so we can always complete it inline. This cannot easily be
> checked with a submission side flag, as the block layer may clear the
> polled flag and turn it into a regular IO instead. Hence we need to
> check this at completion time. If REQ_POLLED is still set, then we know
> that this IO was successfully polled, and is completing in task context.
> 
> Signed-off-by: Jens Axboe <[email protected]>
> ---
>  fs/iomap/direct-io.c | 14 ++++++++++++--
>  1 file changed, 12 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index 9f97d0d03724..c3ea1839628f 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -173,9 +173,19 @@ void iomap_dio_bio_end_io(struct bio *bio)
>  	}
>  
>  	/*
> -	 * Flagged with IOMAP_DIO_INLINE_COMP, we can complete it inline
> +	 * Flagged with IOMAP_DIO_INLINE_COMP, we can complete it inline.
> +	 * Ditto for polled requests - if the flag is still at completion
> +	 * time, then we know the request was actually polled and completion

Glad you added the comment here pointing out that REQ_POLLED must
*still* be set after the bio has been executed, because that was the
only question I had about this patch.

> +	 * is called from the task itself. This is why we need to check it
> +	 * here rather than flag it at issue time.
>  	 */
> -	if (dio->flags & IOMAP_DIO_INLINE_COMP) {
> +	if ((dio->flags & IOMAP_DIO_INLINE_COMP) || (bio->bi_opf & REQ_POLLED)) {
> +		/*
> +		 * For polled IO, we need to clear ->private as it points to
> +		 * the bio being polled for. The completion side uses it to
> +		 * know if a given request has been found yet or not. For
> +		 * non-polled IO, ->private isn't applicable.

Thanks for the clarifying note here too.

Reviewed-by: Darrick J. Wong <[email protected]>

--D


> +		 */
>  		WRITE_ONCE(iocb->private, NULL);
>  		iomap_dio_complete_work(&dio->aio.work);
>  		goto release_bio;
> -- 
> 2.40.1
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 5/8] iomap: only set iocb->private for polled bio
  2023-07-20 18:13 ` [PATCH 5/8] iomap: only set iocb->private for polled bio Jens Axboe
  2023-07-21  6:18   ` Christoph Hellwig
@ 2023-07-21 15:35   ` Darrick J. Wong
  2023-07-21 15:37     ` Jens Axboe
  1 sibling, 1 reply; 39+ messages in thread
From: Darrick J. Wong @ 2023-07-21 15:35 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, linux-xfs, hch, andres, david

On Thu, Jul 20, 2023 at 12:13:07PM -0600, Jens Axboe wrote:
> iocb->private is only used for polled IO, where the completer will
> find the bio to poll through that field.
> 
> Assign it when we're submitting a polled bio, and get rid of the
> dio->poll_bio indirection.

IIRC, the only time iomap actually honors HIPRI requests from the iocb
is if the entire write can be satisfied with a single bio -- no zeroing
around, no dirty file metadata, no writes past EOF, no unwritten blocks,
etc.  Right?

There was only ever going to be one assign to dio->submit.poll_bio,
which means the WRITE_ONCE isn't going to overwrite some non-NULL value.
Correct?

All this does is remove the indirection like you said.

If the answers are {yes, yes} then I understand the HIPRI mechanism
enough to say

Reviewed-by: Darrick J. Wong <[email protected]>

--D


> Signed-off-by: Jens Axboe <[email protected]>
> ---
>  fs/iomap/direct-io.c | 13 +++++--------
>  1 file changed, 5 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index c3ea1839628f..cce9af019705 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -42,7 +42,6 @@ struct iomap_dio {
>  		struct {
>  			struct iov_iter		*iter;
>  			struct task_struct	*waiter;
> -			struct bio		*poll_bio;
>  		} submit;
>  
>  		/* used for aio completion: */
> @@ -64,12 +63,14 @@ static struct bio *iomap_dio_alloc_bio(const struct iomap_iter *iter,
>  static void iomap_dio_submit_bio(const struct iomap_iter *iter,
>  		struct iomap_dio *dio, struct bio *bio, loff_t pos)
>  {
> +	struct kiocb *iocb = dio->iocb;
> +
>  	atomic_inc(&dio->ref);
>  
>  	/* Sync dio can't be polled reliably */
> -	if ((dio->iocb->ki_flags & IOCB_HIPRI) && !is_sync_kiocb(dio->iocb)) {
> -		bio_set_polled(bio, dio->iocb);
> -		dio->submit.poll_bio = bio;
> +	if ((iocb->ki_flags & IOCB_HIPRI) && !is_sync_kiocb(iocb)) {
> +		bio_set_polled(bio, iocb);
> +		WRITE_ONCE(iocb->private, bio);
>  	}
>  
>  	if (dio->dops && dio->dops->submit_io)
> @@ -197,7 +198,6 @@ void iomap_dio_bio_end_io(struct bio *bio)
>  	 * more IO to be issued to finalise filesystem metadata changes or
>  	 * guarantee data integrity.
>  	 */
> -	WRITE_ONCE(iocb->private, NULL);
>  	INIT_WORK(&dio->aio.work, iomap_dio_complete_work);
>  	queue_work(file_inode(iocb->ki_filp)->i_sb->s_dio_done_wq,
>  			&dio->aio.work);
> @@ -536,7 +536,6 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  
>  	dio->submit.iter = iter;
>  	dio->submit.waiter = current;
> -	dio->submit.poll_bio = NULL;
>  
>  	if (iocb->ki_flags & IOCB_NOWAIT)
>  		iomi.flags |= IOMAP_NOWAIT;
> @@ -648,8 +647,6 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  	if (dio->flags & IOMAP_DIO_STABLE_WRITE)
>  		dio->flags &= ~IOMAP_DIO_NEED_SYNC;
>  
> -	WRITE_ONCE(iocb->private, dio->submit.poll_bio);
> -
>  	/*
>  	 * We are about to drop our additional submission reference, which
>  	 * might be the last reference to the dio.  There are three different
> -- 
> 2.40.1
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 5/8] iomap: only set iocb->private for polled bio
  2023-07-21 15:35   ` Darrick J. Wong
@ 2023-07-21 15:37     ` Jens Axboe
  0 siblings, 0 replies; 39+ messages in thread
From: Jens Axboe @ 2023-07-21 15:37 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: io-uring, linux-xfs, hch, andres, david

On 7/21/23 9:35 AM, Darrick J. Wong wrote:
> On Thu, Jul 20, 2023 at 12:13:07PM -0600, Jens Axboe wrote:
>> iocb->private is only used for polled IO, where the completer will
>> find the bio to poll through that field.
>>
>> Assign it when we're submitting a polled bio, and get rid of the
>> dio->poll_bio indirection.
> 
> IIRC, the only time iomap actually honors HIPRI requests from the iocb
> is if the entire write can be satisfied with a single bio -- no zeroing
> around, no dirty file metadata, no writes past EOF, no unwritten blocks,
> etc.  Right?
> 
> There was only ever going to be one assign to dio->submit.poll_bio,
> which means the WRITE_ONCE isn't going to overwrite some non-NULL value.
> Correct?
> 
> All this does is remove the indirection like you said.
> 
> If the answers are {yes, yes} then I understand the HIPRI mechanism
> enough to say

Correct, yes to both. For multi bio or not a straight overwrite, iomap
disables polling.

> Reviewed-by: Darrick J. Wong <[email protected]>

Thanks!

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 6/8] fs: add IOCB flags related to passing back dio completions
  2023-07-20 18:13 ` [PATCH 6/8] fs: add IOCB flags related to passing back dio completions Jens Axboe
  2023-07-21  6:18   ` Christoph Hellwig
@ 2023-07-21 15:48   ` Darrick J. Wong
  2023-07-21 15:53     ` Jens Axboe
  1 sibling, 1 reply; 39+ messages in thread
From: Darrick J. Wong @ 2023-07-21 15:48 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, linux-xfs, hch, andres, david

On Thu, Jul 20, 2023 at 12:13:08PM -0600, Jens Axboe wrote:
> Async dio completions generally happen from hard/soft IRQ context, which
> means that users like iomap may need to defer some of the completion
> handling to a workqueue. This is less efficient than having the original
> issuer handle it, like we do for sync IO, and it adds latency to the
> completions.
> 
> Add IOCB_DIO_DEFER, which the issuer can set if it is able to safely
> punt these completions to a safe context. If the dio handler is aware
> of this flag, assign a callback handler in kiocb->dio_complete and
> associated data in kiocb->private. The issuer will then call this handler
> with that data from task context.
> 
> No functional changes in this patch.
> 
> Signed-off-by: Jens Axboe <[email protected]>
> ---
>  include/linux/fs.h | 34 ++++++++++++++++++++++++++++++++--
>  1 file changed, 32 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 6867512907d6..2c589418a078 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -338,6 +338,20 @@ enum rw_hint {
>  #define IOCB_NOIO		(1 << 20)
>  /* can use bio alloc cache */
>  #define IOCB_ALLOC_CACHE	(1 << 21)
> +/*
> + * IOCB_DIO_DEFER can be set by the iocb owner, to indicate that the
> + * iocb completion can be passed back to the owner for execution from a safe
> + * context rather than needing to be punted through a workqueue. If this
> + * flag is set, the completion handling may set iocb->dio_complete to a
> + * handler, which the issuer will then call from task context to complete
> + * the processing of the iocb. iocb->private should then also be set to
> + * the argument being passed to this handler. Note that while this provides

Who should be setting iocb->private?  Can I suggest rewording this to:

"If this flag is set, the bio completion handling may set
iocb->dio_complete to a handler function and iocb->private to context
information for that handler.  The issuer should call the handler with
that context information from task context to complete the processing of
the iocb."

Assuming I've understood what this does from the next patch? :)

> + * a task context for the dio_complete() callback, it should only be used
> + * on the completion side for non-IO generating completions. It's fine to
> + * call blocking functions from this callback, but they should not wait for
> + * unrelated IO (like cache flushing, new IO generation, etc).
> + */
> +#define IOCB_DIO_DEFER		(1 << 22)

Sorry to nitpick names here, but "defer" feels a little vague to me.
Defer what?  And to whom?

This flag means "defer iocb completion to the caller", right?  If so,
wouldn't this be better named IOCB_DIO_CALLER_COMP?

>  /* for use in trace events */
>  #define TRACE_IOCB_STRINGS \
> @@ -351,7 +365,8 @@ enum rw_hint {
>  	{ IOCB_WRITE,		"WRITE" }, \
>  	{ IOCB_WAITQ,		"WAITQ" }, \
>  	{ IOCB_NOIO,		"NOIO" }, \
> -	{ IOCB_ALLOC_CACHE,	"ALLOC_CACHE" }
> +	{ IOCB_ALLOC_CACHE,	"ALLOC_CACHE" }, \
> +	{ IOCB_DIO_DEFER,	"DIO_DEFER" }
>  
>  struct kiocb {
>  	struct file		*ki_filp;
> @@ -360,7 +375,22 @@ struct kiocb {
>  	void			*private;
>  	int			ki_flags;
>  	u16			ki_ioprio; /* See linux/ioprio.h */
> -	struct wait_page_queue	*ki_waitq; /* for async buffered IO */
> +	union {
> +		/*
> +		 * Only used for async buffered reads, where it denotes the
> +		 * page waitqueue associated with completing the read. Valid
> +		 * IFF IOCB_WAITQ is set.
> +		 */
> +		struct wait_page_queue	*ki_waitq;
> +		/*
> +		 * Can be used for O_DIRECT IO, where the completion handling
> +		 * is punted back to the issuer of the IO. May only be set
> +		 * if IOCB_DIO_DEFER is set by the issuer, and the issuer must
> +		 * then check for presence of this handler when ki_complete is
> +		 * invoked.

Might want to reiterate in the comment that kiocb.private should be
passed as @data.

--D

> +		 */
> +		ssize_t (*dio_complete)(void *data);
> +	};
>  };
>  
>  static inline bool is_sync_kiocb(struct kiocb *kiocb)
> -- 
> 2.40.1
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 7/8] io_uring/rw: add write support for IOCB_DIO_DEFER
  2023-07-20 18:13 ` [PATCH 7/8] io_uring/rw: add write support for IOCB_DIO_DEFER Jens Axboe
  2023-07-21  6:19   ` Christoph Hellwig
@ 2023-07-21 15:50   ` Darrick J. Wong
  2023-07-21 15:53     ` Jens Axboe
  1 sibling, 1 reply; 39+ messages in thread
From: Darrick J. Wong @ 2023-07-21 15:50 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, linux-xfs, hch, andres, david

On Thu, Jul 20, 2023 at 12:13:09PM -0600, Jens Axboe wrote:
> If the filesystem dio handler understands IOCB_DIO_DEFER, we'll get
> a kiocb->ki_complete() callback with kiocb->dio_complete set. In that
> case, rather than complete the IO directly through task_work, queue
> up an intermediate task_work handler that first processes this
> callback and then immediately completes the request.
> 
> For XFS, this avoids a punt through a workqueue, which is a lot less
> efficient and adds latency to lower queue depth (or sync) O_DIRECT
> writes.
> 
> Only do this for non-polled IO, as polled IO doesn't need this kind
> of deferral as it always completes within the task itself. This then
> avoids a check for deferral in the polled IO completion handler.
> 
> Signed-off-by: Jens Axboe <[email protected]>

Seems pretty obvious to me, though I'm famous for not being an
experienced io_uring user yet...

Reviewed-by: Darrick J. Wong <[email protected]>

--D

> ---
>  io_uring/rw.c | 27 ++++++++++++++++++++++++---
>  1 file changed, 24 insertions(+), 3 deletions(-)
> 
> diff --git a/io_uring/rw.c b/io_uring/rw.c
> index 1bce2208b65c..f4f700383b4e 100644
> --- a/io_uring/rw.c
> +++ b/io_uring/rw.c
> @@ -285,6 +285,14 @@ static inline int io_fixup_rw_res(struct io_kiocb *req, long res)
>  
>  void io_req_rw_complete(struct io_kiocb *req, struct io_tw_state *ts)
>  {
> +	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
> +
> +	if (rw->kiocb.dio_complete) {
> +		long res = rw->kiocb.dio_complete(rw->kiocb.private);
> +
> +		io_req_set_res(req, io_fixup_rw_res(req, res), 0);
> +	}
> +
>  	io_req_io_end(req);
>  
>  	if (req->flags & (REQ_F_BUFFER_SELECTED|REQ_F_BUFFER_RING)) {
> @@ -300,9 +308,11 @@ static void io_complete_rw(struct kiocb *kiocb, long res)
>  	struct io_rw *rw = container_of(kiocb, struct io_rw, kiocb);
>  	struct io_kiocb *req = cmd_to_io_kiocb(rw);
>  
> -	if (__io_complete_rw_common(req, res))
> -		return;
> -	io_req_set_res(req, io_fixup_rw_res(req, res), 0);
> +	if (!rw->kiocb.dio_complete) {
> +		if (__io_complete_rw_common(req, res))
> +			return;
> +		io_req_set_res(req, io_fixup_rw_res(req, res), 0);
> +	}
>  	req->io_task_work.func = io_req_rw_complete;
>  	__io_req_task_work_add(req, IOU_F_TWQ_LAZY_WAKE);
>  }
> @@ -916,6 +926,17 @@ int io_write(struct io_kiocb *req, unsigned int issue_flags)
>  	}
>  	kiocb->ki_flags |= IOCB_WRITE;
>  
> +	/*
> +	 * For non-polled IO, set IOCB_DIO_DEFER, stating that our handler
> +	 * groks deferring the completion to task context. This isn't
> +	 * necessary or useful for polled IO as that can always complete
> +	 * directly.
> +	 */
> +	if (!(kiocb->ki_flags & IOCB_HIPRI)) {
> +		kiocb->ki_flags |= IOCB_DIO_DEFER;
> +		kiocb->dio_complete = NULL;
> +	}
> +
>  	if (likely(req->file->f_op->write_iter))
>  		ret2 = call_write_iter(req->file, kiocb, &s->iter);
>  	else if (req->file->f_op->write)
> -- 
> 2.40.1
 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 6/8] fs: add IOCB flags related to passing back dio completions
  2023-07-21 15:48   ` Darrick J. Wong
@ 2023-07-21 15:53     ` Jens Axboe
  0 siblings, 0 replies; 39+ messages in thread
From: Jens Axboe @ 2023-07-21 15:53 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: io-uring, linux-xfs, hch, andres, david

On 7/21/23 9:48 AM, Darrick J. Wong wrote:
> On Thu, Jul 20, 2023 at 12:13:08PM -0600, Jens Axboe wrote:
>> Async dio completions generally happen from hard/soft IRQ context, which
>> means that users like iomap may need to defer some of the completion
>> handling to a workqueue. This is less efficient than having the original
>> issuer handle it, like we do for sync IO, and it adds latency to the
>> completions.
>>
>> Add IOCB_DIO_DEFER, which the issuer can set if it is able to safely
>> punt these completions to a safe context. If the dio handler is aware
>> of this flag, assign a callback handler in kiocb->dio_complete and
>> associated data io kiocb->private. The issuer will then call this handler
>> with that data from task context.
>>
>> No functional changes in this patch.
>>
>> Signed-off-by: Jens Axboe <[email protected]>
>> ---
>>  include/linux/fs.h | 34 ++++++++++++++++++++++++++++++++--
>>  1 file changed, 32 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>> index 6867512907d6..2c589418a078 100644
>> --- a/include/linux/fs.h
>> +++ b/include/linux/fs.h
>> @@ -338,6 +338,20 @@ enum rw_hint {
>>  #define IOCB_NOIO		(1 << 20)
>>  /* can use bio alloc cache */
>>  #define IOCB_ALLOC_CACHE	(1 << 21)
>> +/*
>> + * IOCB_DIO_DEFER can be set by the iocb owner, to indicate that the
>> + * iocb completion can be passed back to the owner for execution from a safe
>> + * context rather than needing to be punted through a workqueue. If this
>> + * flag is set, the completion handling may set iocb->dio_complete to a
>> + * handler, which the issuer will then call from task context to complete
>> + * the processing of the iocb. iocb->private should then also be set to
>> + * the argument being passed to this handler. Note that while this provides
> 
> Who should be setting iocb->private?  Can I suggest rewording this to:
> 
> "If this flag is set, the bio completion handling may set
> iocb->dio_complete to a handler function and iocb->private to context
> information for that handler.  The issuer should call the handler with
> that context information from task context to complete the processing of
> the iocb."
> 
> Assuming I've understood what this does from the next patch? :)

Yep this is definitely better - thanks, I'll update it!
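
To make the contract concrete, the issuer side then ends up looking
roughly like this (simplified sketch only, not the literal io_uring
code - the issuer_* helpers are made-up names):

	/* at submission time: opt in to having the completion handed back */
	static void issuer_submit(struct kiocb *kiocb)
	{
		kiocb->ki_flags |= IOCB_DIO_DEFER;
		kiocb->dio_complete = NULL;
		/* ... call ->write_iter() etc ... */
	}

	/* ->ki_complete() handler, may be invoked from IRQ context */
	static void issuer_ki_complete(struct kiocb *kiocb, long res)
	{
		/*
		 * If the fs assigned kiocb->dio_complete, don't run it
		 * here - just punt the request to task_work.
		 */
		issuer_punt_to_task_work(kiocb, res);
	}

	/* task_work callback, runs in the submitting task */
	static void issuer_task_work(struct kiocb *kiocb, long res)
	{
		if (kiocb->dio_complete)
			res = kiocb->dio_complete(kiocb->private);
		/* ... post the final completion with res ... */
	}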

>> + * a task context for the dio_complete() callback, it should only be used
>> + * on the completion side for non-IO generating completions. It's fine to
>> + * call blocking functions from this callback, but they should not wait for
>> + * unrelated IO (like cache flushing, new IO generation, etc).
>> + */
>> +#define IOCB_DIO_DEFER		(1 << 22)
> 
> Sorry to nitpick names here, but "defer" feels a little vague to me.
> Defer what?  And to whom?
> 
> This flag means "defer iocb completion to the caller", right?  If so,
> wouldn't this be better named IOCB_DIO_CALLER_COMP?

That is probably better indeed. Naming is hard! CALLER_COMP or
ISSUER_COMP would be better and more descriptive. I'll go with your
suggestion.

>>  /* for use in trace events */
>>  #define TRACE_IOCB_STRINGS \
>> @@ -351,7 +365,8 @@ enum rw_hint {
>>  	{ IOCB_WRITE,		"WRITE" }, \
>>  	{ IOCB_WAITQ,		"WAITQ" }, \
>>  	{ IOCB_NOIO,		"NOIO" }, \
>> -	{ IOCB_ALLOC_CACHE,	"ALLOC_CACHE" }
>> +	{ IOCB_ALLOC_CACHE,	"ALLOC_CACHE" }, \
>> +	{ IOCB_DIO_DEFER,	"DIO_DEFER" }
>>  
>>  struct kiocb {
>>  	struct file		*ki_filp;
>> @@ -360,7 +375,22 @@ struct kiocb {
>>  	void			*private;
>>  	int			ki_flags;
>>  	u16			ki_ioprio; /* See linux/ioprio.h */
>> -	struct wait_page_queue	*ki_waitq; /* for async buffered IO */
>> +	union {
>> +		/*
>> +		 * Only used for async buffered reads, where it denotes the
>> +		 * page waitqueue associated with completing the read. Valid
>> +		 * IFF IOCB_WAITQ is set.
>> +		 */
>> +		struct wait_page_queue	*ki_waitq;
>> +		/*
>> +		 * Can be used for O_DIRECT IO, where the completion handling
>> +		 * is punted back to the issuer of the IO. May only be set
>> +		 * if IOCB_DIO_DEFER is set by the issuer, and the issuer must
>> +		 * then check for presence of this handler when ki_complete is
>> +		 * invoked.
> 
> Might want to reiterate in the comment that kiocb.private should be
> passed as @data.

OK, will do.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 7/8] io_uring/rw: add write support for IOCB_DIO_DEFER
  2023-07-21 15:50   ` Darrick J. Wong
@ 2023-07-21 15:53     ` Jens Axboe
  0 siblings, 0 replies; 39+ messages in thread
From: Jens Axboe @ 2023-07-21 15:53 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: io-uring, linux-xfs, hch, andres, david

On 7/21/23 9:50 AM, Darrick J. Wong wrote:
> On Thu, Jul 20, 2023 at 12:13:09PM -0600, Jens Axboe wrote:
>> If the filesystem dio handler understands IOCB_DIO_DEFER, we'll get
>> a kiocb->ki_complete() callback with kiocb->dio_complete set. In that
>> case, rather than complete the IO directly through task_work, queue
>> up an intermediate task_work handler that first processes this
>> callback and then immediately completes the request.
>>
>> For XFS, this avoids a punt through a workqueue, which is a lot less
>> efficient and adds latency to lower queue depth (or sync) O_DIRECT
>> writes.
>>
>> Only do this for non-polled IO, as polled IO doesn't need this kind
>> of deferral as it always completes within the task itself. This then
>> avoids a check for deferral in the polled IO completion handler.
>>
>> Signed-off-by: Jens Axboe <[email protected]>
> 
> Seems pretty obvious to me, though I'm famous for not being an
> experienced io_uring user yet...
> 
> Reviewed-by: Darrick J. Wong <[email protected]>

Thanks, keyword here is "yet" ;-)

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 3/8] iomap: treat a write through cache the same as FUA
  2023-07-21 14:04     ` Jens Axboe
@ 2023-07-21 15:55       ` Darrick J. Wong
  2023-07-21 16:03         ` Jens Axboe
  0 siblings, 1 reply; 39+ messages in thread
From: Darrick J. Wong @ 2023-07-21 15:55 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Christoph Hellwig, io-uring, linux-xfs, andres, david

On Fri, Jul 21, 2023 at 08:04:19AM -0600, Jens Axboe wrote:
> On 7/21/23 12:15 AM, Christoph Hellwig wrote:
> > On Thu, Jul 20, 2023 at 12:13:05PM -0600, Jens Axboe wrote:
> >> Whether we have a write back cache and are using FUA or don't have
> >> a write back cache at all is the same situation. Treat them the same.
> >>
> >> This makes the IOMAP_DIO_WRITE_FUA name a bit misleading, as we have
> >> two cases that provide stable writes:
> >>
> >> 1) Volatile write cache with FUA writes
> >> 2) Normal write without a volatile write cache
> >>
> >> Rename that flag to IOMAP_DIO_STABLE_WRITE to make that clearer, and
> >> update some of the FUA comments as well.
> > 
> > I would have preferred IOMAP_DIO_WRITE_THROUGH, STABLE_WRITES is a flag
> > we use in file systems and the page cache for cases where the page
> > can't be touched before writeback has completed, e.g.
> > QUEUE_FLAG_STABLE_WRITES and SB_I_STABLE_WRITES.
> 
> Good point, it does confuse terminology with stable pages for writes.
> I'll change it to WRITE_THROUGH, that is more descriptive for this case.

+1 for the name change.

With IOMAP_DIO_WRITE_THROUGH,
Reviewed-by: Darrick J. Wong <[email protected]>

--D


Separately: At some point, the definition for IOMAP_DIO_DIRTY needs to
grow a type annotation:

#define IOMAP_DIO_DIRTY		(1U << 31)

due to (apparently) triggering UBSAN because "1" on its own is a signed
constant.  If this series goes through my tree then I'll add a trivial
patch fixing all of this ... unless you'd rather do it yourself as a
patch 9?

--D

> -- 
> Jens Axboe
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 8/8] iomap: support IOCB_DIO_DEFER
  2023-07-20 18:13 ` [PATCH 8/8] iomap: support IOCB_DIO_DEFER Jens Axboe
  2023-07-21  6:19   ` Christoph Hellwig
@ 2023-07-21 16:01   ` Darrick J. Wong
  2023-07-21 16:30     ` Jens Axboe
  2023-07-21 22:05   ` Dave Chinner
  2 siblings, 1 reply; 39+ messages in thread
From: Darrick J. Wong @ 2023-07-21 16:01 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, linux-xfs, hch, andres, david

On Thu, Jul 20, 2023 at 12:13:10PM -0600, Jens Axboe wrote:
> If IOCB_DIO_DEFER is set, utilize that to set kiocb->dio_complete handler
> and data for that callback. Rather than punt the completion to a
> workqueue, we pass back the handler and data to the issuer and will get a
> callback from a safe task context.
> 
> Using the following fio job to randomly dio write 4k blocks at
> queue depths of 1..16:
> 
> fio --name=dio-write --filename=/data1/file --time_based=1 \
> --runtime=10 --bs=4096 --rw=randwrite --norandommap --buffered=0 \
> --cpus_allowed=4 --ioengine=io_uring --iodepth=$depth
> 
> shows the following results before and after this patch:
> 
> 	Stock	Patched		Diff
> =======================================
> QD1	155K	162K		+ 4.5%
> QD2	290K	313K		+ 7.9%
> QD4	533K	597K		+12.0%
> QD8	604K	827K		+36.9%
> QD16	615K	845K		+37.4%

Nice!

> which shows nice wins all around. If we factored in per-IOP efficiency,
> the wins look even nicer. This becomes apparent as queue depth rises,
> as the offloaded workqueue completions run out of steam.
> 
> Signed-off-by: Jens Axboe <[email protected]>
> ---
>  fs/iomap/direct-io.c | 54 +++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 53 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index cce9af019705..de86680968a4 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -20,6 +20,7 @@
>   * Private flags for iomap_dio, must not overlap with the public ones in
>   * iomap.h:
>   */
> +#define IOMAP_DIO_DEFER_COMP	(1 << 26)

IOMAP_DIO_CALLER_COMP, to go with IOCB_CALLER_COMP?

>  #define IOMAP_DIO_INLINE_COMP	(1 << 27)
>  #define IOMAP_DIO_STABLE_WRITE	(1 << 28)
>  #define IOMAP_DIO_NEED_SYNC	(1 << 29)
> @@ -132,6 +133,11 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio)
>  }
>  EXPORT_SYMBOL_GPL(iomap_dio_complete);
>  
> +static ssize_t iomap_dio_deferred_complete(void *data)
> +{
> +	return iomap_dio_complete(data);
> +}
> +
>  static void iomap_dio_complete_work(struct work_struct *work)
>  {
>  	struct iomap_dio *dio = container_of(work, struct iomap_dio, aio.work);
> @@ -192,6 +198,31 @@ void iomap_dio_bio_end_io(struct bio *bio)
>  		goto release_bio;
>  	}
>  
> +	/*
> +	 * If this dio is flagged with IOMAP_DIO_DEFER_COMP, then schedule
> +	 * our completion that way to avoid an async punt to a workqueue.
> +	 */
> +	if (dio->flags & IOMAP_DIO_DEFER_COMP) {
> +		/* only polled IO cares about private cleared */
> +		iocb->private = dio;
> +		iocb->dio_complete = iomap_dio_deferred_complete;
> +
> +		/*
> +		 * Invoke ->ki_complete() directly. We've assigned out

"We've assigned our..."

> +		 * dio_complete callback handler, and since the issuer set
> +		 * IOCB_DIO_DEFER, we know their ki_complete handler will
> +		 * notice ->dio_complete being set and will defer calling that
> +		 * handler until it can be done from a safe task context.
> +		 *
> +		 * Note that the 'res' being passed in here is not important
> +		 * for this case. The actual completion value of the request
> +		 * will be gotten from dio_complete when that is run by the
> +		 * issuer.
> +		 */
> +		iocb->ki_complete(iocb, 0);
> +		goto release_bio;
> +	}
> +
>  	/*
>  	 * Async DIO completion that requires filesystem level completion work
>  	 * gets punted to a work queue to complete as the operation may require
> @@ -288,12 +319,17 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>  		 * after IO completion such as unwritten extent conversion) and
>  		 * the underlying device either supports FUA or doesn't have
>  		 * a volatile write cache. This allows us to avoid cache flushes
> -		 * on IO completion.
> +		 * on IO completion. If we can't use stable writes and need to

"If we can't use writethrough and need to sync..."

> +		 * sync, disable in-task completions as dio completion will
> +		 * need to call generic_write_sync() which will do a blocking
> +		 * fsync / cache flush call.
>  		 */
>  		if (!(iomap->flags & (IOMAP_F_SHARED|IOMAP_F_DIRTY)) &&
>  		    (dio->flags & IOMAP_DIO_STABLE_WRITE) &&
>  		    (bdev_fua(iomap->bdev) || !bdev_write_cache(iomap->bdev)))
>  			use_fua = true;
> +		else if (dio->flags & IOMAP_DIO_NEED_SYNC)
> +			dio->flags &= ~IOMAP_DIO_DEFER_COMP;
>  	}
>  
>  	/*
> @@ -319,6 +355,13 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>  		pad = pos & (fs_block_size - 1);
>  		if (pad)
>  			iomap_dio_zero(iter, dio, pos - pad, pad);
> +
> +		/*
> +		 * If need_zeroout is set, then this is a new or unwritten
> +		 * extent. These need extra handling at completion time, so

"...then this is a new or unwritten extent, or dirty file metadata have
not been persisted to disk."

> +		 * disable in-task deferred completion for those.
> +		 */
> +		dio->flags &= ~IOMAP_DIO_DEFER_COMP;
>  	}
>  
>  	/*
> @@ -557,6 +600,15 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  		iomi.flags |= IOMAP_WRITE;
>  		dio->flags |= IOMAP_DIO_WRITE;
>  
> +		/*
> +		 * Flag as supporting deferred completions, if the issuer
> +		 * groks it. This can avoid a workqueue punt for writes.
> +		 * We may later clear this flag if we need to do other IO
> +		 * as part of this IO completion.
> +		 */
> +		if (iocb->ki_flags & IOCB_DIO_DEFER)
> +			dio->flags |= IOMAP_DIO_DEFER_COMP;
> +

With those comment clarifications added,

Reviewed-by: Darrick J. Wong <[email protected]>

--D

>  		if (dio_flags & IOMAP_DIO_OVERWRITE_ONLY) {
>  			ret = -EAGAIN;
>  			if (iomi.pos >= dio->i_size ||
> -- 
> 2.40.1
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 3/8] iomap: treat a write through cache the same as FUA
  2023-07-21 15:55       ` Darrick J. Wong
@ 2023-07-21 16:03         ` Jens Axboe
  0 siblings, 0 replies; 39+ messages in thread
From: Jens Axboe @ 2023-07-21 16:03 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Christoph Hellwig, io-uring, linux-xfs, andres, david

On 7/21/23 9:55 AM, Darrick J. Wong wrote:
> On Fri, Jul 21, 2023 at 08:04:19AM -0600, Jens Axboe wrote:
>> On 7/21/23 12:15 AM, Christoph Hellwig wrote:
>>> On Thu, Jul 20, 2023 at 12:13:05PM -0600, Jens Axboe wrote:
>>>> Whether we have a write back cache and are using FUA or don't have
>>>> a write back cache at all is the same situation. Treat them the same.
>>>>
>>>> This makes the IOMAP_DIO_WRITE_FUA name a bit misleading, as we have
>>>> two cases that provide stable writes:
>>>>
>>>> 1) Volatile write cache with FUA writes
>>>> 2) Normal write without a volatile write cache
>>>>
>>>> Rename that flag to IOMAP_DIO_STABLE_WRITE to make that clearer, and
>>>> update some of the FUA comments as well.
>>>
>>> I would have preferred IOMAP_DIO_WRITE_THROUGH, STABLE_WRITES is a flag
>>> we use in file systems and the page cache for cases where the page
>>> can't be touched before writeback has completed, e.g.
>>> QUEUE_FLAG_STABLE_WRITES and SB_I_STABLE_WRITES.
>>
>> Good point, it does confuse terminology with stable pages for writes.
>> I'll change it to WRITE_THROUGH, that is more descriptive for this case.
> 
> +1 for the name change.
> 
> With IOMAP_DIO_WRITE_THROUGH,
> Reviewed-by: Darrick J. Wong <[email protected]>

Thanks, I did make that change.

> Separately: At some point, the definition for IOMAP_DIO_DIRTY needs to
> grow a type annotation:
> 
> #define IOMAP_DIO_DIRTY		(1U << 31)
> 
> due to (apparently) triggering UBSAN because "1" on its own is a signed
> constant.  If this series goes through my tree then I'll add a trivial
> patch fixing all of this ... unless you'd rather do it yourself as a
> patch 9?

Ah yes. I can add a patch for that and send out a v5. Will run the usual
testing on it with that patch added, then ship it out. Risk of conflict
with io_uring changes is pretty small, so would be fine to stage through
your tree.
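
Something along these lines for that patch 9, I'd expect (sketch):

	-#define IOMAP_DIO_DIRTY		(1 << 31)
	+#define IOMAP_DIO_DIRTY		(1U << 31)

and presumably the rest of the private IOMAP_DIO_* flags get the same
(1 << n) -> (1U << n) treatment while in there, so they stay consistent.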

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 8/8] iomap: support IOCB_DIO_DEFER
  2023-07-21 16:01   ` Darrick J. Wong
@ 2023-07-21 16:30     ` Jens Axboe
  0 siblings, 0 replies; 39+ messages in thread
From: Jens Axboe @ 2023-07-21 16:30 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: io-uring, linux-xfs, hch, andres, david

On 7/21/23 10:01 AM, Darrick J. Wong wrote:
>> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
>> index cce9af019705..de86680968a4 100644
>> --- a/fs/iomap/direct-io.c
>> +++ b/fs/iomap/direct-io.c
>> @@ -20,6 +20,7 @@
>>   * Private flags for iomap_dio, must not overlap with the public ones in
>>   * iomap.h:
>>   */
>> +#define IOMAP_DIO_DEFER_COMP	(1 << 26)
> 
> IOMAP_DIO_CALLER_COMP, to go with IOCB_CALLER_COMP?

Yep, already made that change in conjunction with the other rename.

>>  #define IOMAP_DIO_INLINE_COMP	(1 << 27)
>> +	/*
>> +	 * If this dio is flagged with IOMAP_DIO_DEFER_COMP, then schedule
>> +	 * our completion that way to avoid an async punt to a workqueue.
>> +	 */
>> +	if (dio->flags & IOMAP_DIO_DEFER_COMP) {
>> +		/* only polled IO cares about private cleared */
>> +		iocb->private = dio;
>> +		iocb->dio_complete = iomap_dio_deferred_complete;
>> +
>> +		/*
>> +		 * Invoke ->ki_complete() directly. We've assigned out
> 
> "We've assigned our..."

Fixed.

>> @@ -288,12 +319,17 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>>  		 * after IO completion such as unwritten extent conversion) and
>>  		 * the underlying device either supports FUA or doesn't have
>>  		 * a volatile write cache. This allows us to avoid cache flushes
>> -		 * on IO completion.
>> +		 * on IO completion. If we can't use stable writes and need to
> 
> "If we can't use writethrough and need to sync..."

Fixed.

>> @@ -319,6 +355,13 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>>  		pad = pos & (fs_block_size - 1);
>>  		if (pad)
>>  			iomap_dio_zero(iter, dio, pos - pad, pad);
>> +
>> +		/*
>> +		 * If need_zeroout is set, then this is a new or unwritten
>> +		 * extent. These need extra handling at completion time, so
> 
> "...then this is a new or unwritten extent, or dirty file metadata have
> not been persisted to disk."

Fixed.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/8] iomap: completed polled IO inline
  2023-07-20 18:13 ` [PATCH 4/8] iomap: completed polled IO inline Jens Axboe
  2023-07-21  6:16   ` Christoph Hellwig
  2023-07-21 15:19   ` Darrick J. Wong
@ 2023-07-21 21:43   ` Dave Chinner
  2023-07-22  3:10     ` Jens Axboe
  2023-07-22 16:54     ` Jens Axboe
  2 siblings, 2 replies; 39+ messages in thread
From: Dave Chinner @ 2023-07-21 21:43 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, linux-xfs, hch, andres

On Thu, Jul 20, 2023 at 12:13:06PM -0600, Jens Axboe wrote:
> Polled IO is only allowed for conditions where task completion is safe
> anyway, so we can always complete it inline. This cannot easily be
> checked with a submission side flag, as the block layer may clear the
> polled flag and turn it into a regular IO instead. Hence we need to
> check this at completion time. If REQ_POLLED is still set, then we know
> that this IO was successfully polled, and is completing in task context.
> 
> Signed-off-by: Jens Axboe <[email protected]>
> ---
>  fs/iomap/direct-io.c | 14 ++++++++++++--
>  1 file changed, 12 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index 9f97d0d03724..c3ea1839628f 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -173,9 +173,19 @@ void iomap_dio_bio_end_io(struct bio *bio)
>  	}
>  
>  	/*
> -	 * Flagged with IOMAP_DIO_INLINE_COMP, we can complete it inline
> +	 * Flagged with IOMAP_DIO_INLINE_COMP, we can complete it inline.
> +	 * Ditto for polled requests - if the flag is still at completion
> +	 * time, then we know the request was actually polled and completion
> +	 * is called from the task itself. This is why we need to check it
> +	 * here rather than flag it at issue time.
>  	 */
> -	if (dio->flags & IOMAP_DIO_INLINE_COMP) {
> +	if ((dio->flags & IOMAP_DIO_INLINE_COMP) || (bio->bi_opf & REQ_POLLED)) {

This still smells wrong to me. Let me see if I can work out why...

<spelunk!>

When we set up the IO in iomap_dio_bio_iter(), we do this:

        /*
         * We can only poll for single bio I/Os.
         */
        if (need_zeroout ||
            ((dio->flags & IOMAP_DIO_WRITE) && pos >= i_size_read(inode)))
                dio->iocb->ki_flags &= ~IOCB_HIPRI;

The "need_zeroout" covers writes into unwritten regions that require
conversion at IO completion, and the latter check is for writes
extending EOF. i.e. this covers the cases where we have dirty
metadata for this specific write and so may need transactions or
journal/metadata IO during IO completion.

The only case it doesn't cover is clearing IOCB_HIPRI for O_DSYNC IO
that may require a call to generic_write_sync() in completion. That
is, if we aren't using FUA, will not have IOMAP_DIO_INLINE_COMP set,
but still do polled IO.

I think this is a bug. We don't want to be issuing more IO in
REQ_POLLED task context during IO completion, and O_DSYNC IO
completion for non-FUA IO requires a journal flush and that can
issue lots of journal IO and wait on it in completion process.

Hence I think we should only be setting REQ_POLLED in the cases
where IOCB_HIPRI and IOMAP_DIO_INLINE_COMP are both set.  If
IOMAP_DIO_INLINE_COMP is set on the dio, then it doesn't matter what
context we are in at completion time or whether REQ_POLLED was set
or cleared during the IO....

That means the above check should be:

        /*
         * We can only poll for single bio I/Os that can run inline
	 * completion.
         */
        if (need_zeroout ||
	    (iocb_is_dsync(dio->iocb) && !use_fua) ||
            ((dio->flags & IOMAP_DIO_WRITE) && pos >= i_size_read(inode)))
                dio->iocb->ki_flags &= ~IOCB_HIPRI;

or if we change the logic such that calculate IOMAP_DIO_INLINE_COMP
first:

	if (!(dio->flags & IOMAP_DIO_INLINE_COMP))
		dio->iocb->ki_flags &= ~IOCB_HIPRI;

Then we don't need to care about polled IO on the completion side at
all at the iomap layer because it doesn't change the completion
requirements at all...

Cheers,

Dave.
-- 
Dave Chinner
[email protected]

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 8/8] iomap: support IOCB_DIO_DEFER
  2023-07-20 18:13 ` [PATCH 8/8] iomap: support IOCB_DIO_DEFER Jens Axboe
  2023-07-21  6:19   ` Christoph Hellwig
  2023-07-21 16:01   ` Darrick J. Wong
@ 2023-07-21 22:05   ` Dave Chinner
  2023-07-22  3:12     ` Jens Axboe
  2 siblings, 1 reply; 39+ messages in thread
From: Dave Chinner @ 2023-07-21 22:05 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, linux-xfs, hch, andres

On Thu, Jul 20, 2023 at 12:13:10PM -0600, Jens Axboe wrote:
> If IOCB_DIO_DEFER is set, utilize that to set kiocb->dio_complete handler
> and data for that callback. Rather than punt the completion to a
> workqueue, we pass back the handler and data to the issuer and will get a
> callback from a safe task context.
....
> @@ -288,12 +319,17 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>  		 * after IO completion such as unwritten extent conversion) and
>  		 * the underlying device either supports FUA or doesn't have
>  		 * a volatile write cache. This allows us to avoid cache flushes
> -		 * on IO completion.
> +		 * on IO completion. If we can't use stable writes and need to
> +		 * sync, disable in-task completions as dio completion will
> +		 * need to call generic_write_sync() which will do a blocking
> +		 * fsync / cache flush call.
>  		 */
>  		if (!(iomap->flags & (IOMAP_F_SHARED|IOMAP_F_DIRTY)) &&
>  		    (dio->flags & IOMAP_DIO_STABLE_WRITE) &&
>  		    (bdev_fua(iomap->bdev) || !bdev_write_cache(iomap->bdev)))
>  			use_fua = true;
> +		else if (dio->flags & IOMAP_DIO_NEED_SYNC)
> +			dio->flags &= ~IOMAP_DIO_DEFER_COMP;
>  	}
>  
>  	/*
> @@ -319,6 +355,13 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>  		pad = pos & (fs_block_size - 1);
>  		if (pad)
>  			iomap_dio_zero(iter, dio, pos - pad, pad);
> +
> +		/*
> +		 * If need_zeroout is set, then this is a new or unwritten
> +		 * extent. These need extra handling at completion time, so
> +		 * disable in-task deferred completion for those.
> +		 */
> +		dio->flags &= ~IOMAP_DIO_DEFER_COMP;
>  	}

I don't think these are quite right. They miss the file extension
case that I pointed out in an earlier patch (i.e. where IOCB_HIPRI
gets cleared).

Fundamentally, I don't like have three different sets of logic which
all end up being set/cleared for the same situation - polled bios
and defered completion should only be used in situations where
inline iomap completion can be run.

IOWs, I think the iomap_dio_bio_iter() code needs to first decide
whether IOMAP_DIO_INLINE_COMP can be set, and if it cannot be set,
we then clear both IOCB_HIPRI and IOMAP_DIO_DEFER_COMP, because
neither should be used for an IO that can not do inline completion.

i.e. this all comes down to something like this:

-	/*
-	 * We can only poll for single bio I/Os.
-	 */
-	if (need_zeroout ||
-	    ((dio->flags & IOMAP_DIO_WRITE) && pos >= i_size_read(inode)))
-		dio->iocb->ki_flags &= ~IOCB_HIPRI;
+	/*
+	 * We can only do inline completion for pure overwrites that
+	 * don't require additional IO at completion. This rules out
+	 * writes that need zeroing or extent conversion, extend
+	 * the file size, or issue journal IO or cache flushes
+	 * during completion processing.
+	 */
+	if (need_zeroout ||
+	    ((dio->flags & IOMAP_DIO_NEED_SYNC) && !use_fua) ||
+	    ((dio->flags & IOMAP_DIO_WRITE) && pos >= i_size_read(inode)))
+		dio->flags &= ~IOMAP_DIO_INLINE_COMP;
+
+	/*
+	 * We can only used polled for single bio IOs or defer
+	 * completion for IOs that will run inline completion.
+	 */
+	if (!(dio->flags & IOMAP_DIO_INLINE_COMP) {
+		dio->iocb->ki_flags &= ~IOCB_HIPRI;
+		dio->flags &= ~IOMAP_DIO_DEFER_COMP;
+	}

This puts the iomap inline completion decision logic all in one
place in the submission code and clearly keys the fast path IO
completion cases to the inline completion paths.

Cheers,

Dave.
-- 
Dave Chinner
[email protected]

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/8] iomap: completed polled IO inline
  2023-07-21 21:43   ` Dave Chinner
@ 2023-07-22  3:10     ` Jens Axboe
  2023-07-22 23:05       ` Dave Chinner
  2023-07-22 16:54     ` Jens Axboe
  1 sibling, 1 reply; 39+ messages in thread
From: Jens Axboe @ 2023-07-22  3:10 UTC (permalink / raw)
  To: Dave Chinner; +Cc: io-uring, linux-xfs, hch, andres

On 7/21/23 3:43 PM, Dave Chinner wrote:
> On Thu, Jul 20, 2023 at 12:13:06PM -0600, Jens Axboe wrote:
>> Polled IO is only allowed for conditions where task completion is safe
>> anyway, so we can always complete it inline. This cannot easily be
>> checked with a submission side flag, as the block layer may clear the
>> polled flag and turn it into a regular IO instead. Hence we need to
>> check this at completion time. If REQ_POLLED is still set, then we know
>> that this IO was successfully polled, and is completing in task context.
>>
>> Signed-off-by: Jens Axboe <[email protected]>
>> ---
>>  fs/iomap/direct-io.c | 14 ++++++++++++--
>>  1 file changed, 12 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
>> index 9f97d0d03724..c3ea1839628f 100644
>> --- a/fs/iomap/direct-io.c
>> +++ b/fs/iomap/direct-io.c
>> @@ -173,9 +173,19 @@ void iomap_dio_bio_end_io(struct bio *bio)
>>  	}
>>  
>>  	/*
>> -	 * Flagged with IOMAP_DIO_INLINE_COMP, we can complete it inline
>> +	 * Flagged with IOMAP_DIO_INLINE_COMP, we can complete it inline.
>> +	 * Ditto for polled requests - if the flag is still at completion
>> +	 * time, then we know the request was actually polled and completion
>> +	 * is called from the task itself. This is why we need to check it
>> +	 * here rather than flag it at issue time.
>>  	 */
>> -	if (dio->flags & IOMAP_DIO_INLINE_COMP) {
>> +	if ((dio->flags & IOMAP_DIO_INLINE_COMP) || (bio->bi_opf & REQ_POLLED)) {
> 
> This still smells wrong to me. Let me see if I can work out why...
> 
> <spelunk!>
> 
> When we set up the IO in iomap_dio_bio_iter(), we do this:
> 
>         /*
>          * We can only poll for single bio I/Os.
>          */
>         if (need_zeroout ||
>             ((dio->flags & IOMAP_DIO_WRITE) && pos >= i_size_read(inode)))
>                 dio->iocb->ki_flags &= ~IOCB_HIPRI;
> 
> The "need_zeroout" covers writes into unwritten regions that require
> conversion at IO completion, and the latter check is for writes
> extending EOF. i.e. this covers the cases where we have dirty
> metadata for this specific write and so may need transactions or
> journal/metadata IO during IO completion.
> 
> The only case it doesn't cover is clearing IOCB_HIPRI for O_DSYNC IO
> that may require a call to generic_write_sync() in completion. That
> is, if we aren't using FUA, will not have IOMAP_DIO_INLINE_COMP set,
> but still do polled IO.
> 
> I think this is a bug. We don't want to be issuing more IO in
> REQ_POLLED task context during IO completion, and O_DSYNC IO
> completion for non-FUA IO requires a journal flush and that can
> issue lots of journal IO and wait on it in completion process.
> 
> Hence I think we should only be setting REQ_POLLED in the cases
> where IOCB_HIPRI and IOMAP_DIO_INLINE_COMP are both set.  If
> IOMAP_DIO_INLINE_COMP is set on the dio, then it doesn't matter what
> context we are in at completion time or whether REQ_POLLED was set
> or cleared during the IO....
> 
> That means the above check should be:
> 
>         /*
>          * We can only poll for single bio I/Os that can run inline
> 	 * completion.
>          */
>         if (need_zeroout ||
> 	    (iocb_is_dsync(dio->iocb) && !use_fua) ||
>             ((dio->flags & IOMAP_DIO_WRITE) && pos >= i_size_read(inode)))
>                 dio->iocb->ki_flags &= ~IOCB_HIPRI;

Looks like you are right, it would not be a great idea to handle that
off polled IO completion. It'd work just fine, but anything generating
more IO should go to a helper. I'll make that change.

> or if we change the logic such that calculate IOMAP_DIO_INLINE_COMP
> first:
> 
> 	if (!(dio->flags & IOMAP_DIO_INLINE_COMP))
> 		dio->iocb->ki_flags &= ~IOCB_HIPRI;
> 
> Then we don't need to care about polled IO on the completion side at
> all at the iomap layer because it doesn't change the completion
> requirements at all...

That still isn't true, because you can still happily issue as polled IO
and get it cleared and now have an IRQ based completion. This would work
for most cases, but eg xfs dio end_io handler will grab:

spin_lock(&ip->i_flags_lock);

if the inode got truncated. Maybe that can't happen because we did
inode_dio_begin() higher up? Still seems saner to check for the polled
flag at completion to me...

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 8/8] iomap: support IOCB_DIO_DEFER
  2023-07-21 22:05   ` Dave Chinner
@ 2023-07-22  3:12     ` Jens Axboe
  0 siblings, 0 replies; 39+ messages in thread
From: Jens Axboe @ 2023-07-22  3:12 UTC (permalink / raw)
  To: Dave Chinner; +Cc: io-uring, linux-xfs, hch, andres

On 7/21/23 4:05 PM, Dave Chinner wrote:
> On Thu, Jul 20, 2023 at 12:13:10PM -0600, Jens Axboe wrote:
>> If IOCB_DIO_DEFER is set, utilize that to set kiocb->dio_complete handler
>> and data for that callback. Rather than punt the completion to a
>> workqueue, we pass back the handler and data to the issuer and will get a
>> callback from a safe task context.
> ....
>> @@ -288,12 +319,17 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>>  		 * after IO completion such as unwritten extent conversion) and
>>  		 * the underlying device either supports FUA or doesn't have
>>  		 * a volatile write cache. This allows us to avoid cache flushes
>> -		 * on IO completion.
>> +		 * on IO completion. If we can't use stable writes and need to
>> +		 * sync, disable in-task completions as dio completion will
>> +		 * need to call generic_write_sync() which will do a blocking
>> +		 * fsync / cache flush call.
>>  		 */
>>  		if (!(iomap->flags & (IOMAP_F_SHARED|IOMAP_F_DIRTY)) &&
>>  		    (dio->flags & IOMAP_DIO_STABLE_WRITE) &&
>>  		    (bdev_fua(iomap->bdev) || !bdev_write_cache(iomap->bdev)))
>>  			use_fua = true;
>> +		else if (dio->flags & IOMAP_DIO_NEED_SYNC)
>> +			dio->flags &= ~IOMAP_DIO_DEFER_COMP;
>>  	}
>>  
>>  	/*
>> @@ -319,6 +355,13 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>>  		pad = pos & (fs_block_size - 1);
>>  		if (pad)
>>  			iomap_dio_zero(iter, dio, pos - pad, pad);
>> +
>> +		/*
>> +		 * If need_zeroout is set, then this is a new or unwritten
>> +		 * extent. These need extra handling at completion time, so
>> +		 * disable in-task deferred completion for those.
>> +		 */
>> +		dio->flags &= ~IOMAP_DIO_DEFER_COMP;
>>  	}
> 
> I don't think these are quite right. They miss the file extension
> case that I pointed out in an earlier patch (i.e. where IOCB_HIPRI
> gets cleared).
> 
> Fundamentally, I don't like have three different sets of logic which
> all end up being set/cleared for the same situation - polled bios
> and defered completion should only be used in situations where
> inline iomap completion can be run.
> 
> IOWs, I think the iomap_dio_bio_iter() code needs to first decide
> whether IOMAP_DIO_INLINE_COMP can be set, and if it cannot be set,
> we then clear both IOCB_HIPRI and IOMAP_DIO_DEFER_COMP, because
> neither should be used for an IO that can not do inline completion.
> 
> i.e. this all comes down to something like this:
> 
> -	/*
> -	 * We can only poll for single bio I/Os.
> -	 */
> -	if (need_zeroout ||
> -	    ((dio->flags & IOMAP_DIO_WRITE) && pos >= i_size_read(inode)))
> -		dio->iocb->ki_flags &= ~IOCB_HIPRI;
> +	/*
> +	 * We can only do inline completion for pure overwrites that
> +	 * don't require additional IO at completion. This rules out
> +	 * writes that need zeroing or extent conversion, extend
> +	 * the file size, or issue journal IO or cache flushes
> +	 * during completion processing.
> +	 */
> +	if (need_zeroout ||
> +	    ((dio->flags & IOMAP_DIO_NEED_SYNC) && !use_fua) ||
> +	    ((dio->flags & IOMAP_DIO_WRITE) && pos >= i_size_read(inode)))
> +		dio->flags &= ~IOMAP_DIO_INLINE_COMP;
> +
> +	/*
> +	 * We can only used polled for single bio IOs or defer
> +	 * completion for IOs that will run inline completion.
> +	 */
> +	if (!(dio->flags & IOMAP_DIO_INLINE_COMP) {
> +		dio->iocb->ki_flags &= ~IOCB_HIPRI;
> +		dio->flags &= ~IOMAP_DIO_DEFER_COMP;
> +	}
> 
> This puts the iomap inline completion decision logic all in one
> place in the submission code and clearly keys the fast path IO
> completion cases to the inline completion paths.

I do like the suggestion of figuring out the inline part, and then
clearing HIPRI if the iocb was marked for polling and we don't have the
inline flag set. That makes it easier to follow rather than juggling two
sets of logic.

I'll make that change.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/8] iomap: completed polled IO inline
  2023-07-21 21:43   ` Dave Chinner
  2023-07-22  3:10     ` Jens Axboe
@ 2023-07-22 16:54     ` Jens Axboe
  1 sibling, 0 replies; 39+ messages in thread
From: Jens Axboe @ 2023-07-22 16:54 UTC (permalink / raw)
  To: Dave Chinner; +Cc: io-uring, linux-xfs, hch, andres

On 7/21/23 3:43 PM, Dave Chinner wrote:
> On Thu, Jul 20, 2023 at 12:13:06PM -0600, Jens Axboe wrote:
>> Polled IO is only allowed for conditions where task completion is safe
>> anyway, so we can always complete it inline. This cannot easily be
>> checked with a submission side flag, as the block layer may clear the
>> polled flag and turn it into a regular IO instead. Hence we need to
>> check this at completion time. If REQ_POLLED is still set, then we know
>> that this IO was successfully polled, and is completing in task context.
>>
>> Signed-off-by: Jens Axboe <[email protected]>
>> ---
>>  fs/iomap/direct-io.c | 14 ++++++++++++--
>>  1 file changed, 12 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
>> index 9f97d0d03724..c3ea1839628f 100644
>> --- a/fs/iomap/direct-io.c
>> +++ b/fs/iomap/direct-io.c
>> @@ -173,9 +173,19 @@ void iomap_dio_bio_end_io(struct bio *bio)
>>  	}
>>  
>>  	/*
>> -	 * Flagged with IOMAP_DIO_INLINE_COMP, we can complete it inline
>> +	 * Flagged with IOMAP_DIO_INLINE_COMP, we can complete it inline.
>> +	 * Ditto for polled requests - if the flag is still at completion
>> +	 * time, then we know the request was actually polled and completion
>> +	 * is called from the task itself. This is why we need to check it
>> +	 * here rather than flag it at issue time.
>>  	 */
>> -	if (dio->flags & IOMAP_DIO_INLINE_COMP) {
>> +	if ((dio->flags & IOMAP_DIO_INLINE_COMP) || (bio->bi_opf & REQ_POLLED)) {
> 
> This still smells wrong to me. Let me see if I can work out why...
> 
> <spelunk!>
> 
> When we set up the IO in iomap_dio_bio_iter(), we do this:
> 
>         /*
>          * We can only poll for single bio I/Os.
>          */
>         if (need_zeroout ||
>             ((dio->flags & IOMAP_DIO_WRITE) && pos >= i_size_read(inode)))
>                 dio->iocb->ki_flags &= ~IOCB_HIPRI;
> 
> The "need_zeroout" covers writes into unwritten regions that require
> conversion at IO completion, and the latter check is for writes
> extending EOF. i.e. this covers the cases where we have dirty
> metadata for this specific write and so may need transactions or
> journal/metadata IO during IO completion.
> 
> The only case it doesn't cover is clearing IOCB_HIPRI for O_DSYNC IO
> that may require a call to generic_write_sync() in completion. That
> is, if we aren't using FUA, will not have IOMAP_DIO_INLINE_COMP set,
> but still do polled IO.
> 
> I think this is a bug. We don't want to be issuing more IO in
> REQ_POLLED task context during IO completion, and O_DSYNC IO
> completion for non-FUA IO requires a journal flush and that can
> issue lots of journal IO and wait on it in completion process.
> 
> Hence I think we should only be setting REQ_POLLED in the cases
> where IOCB_HIPRI and IOMAP_DIO_INLINE_COMP are both set.  If
> IOMAP_DIO_INLINE_COMP is set on the dio, then it doesn't matter what
> context we are in at completion time or whether REQ_POLLED was set
> or cleared during the IO....
> 
> That means the above check should be:
> 
>         /*
>          * We can only poll for single bio I/Os that can run inline
> 	 * completion.
>          */
>         if (need_zeroout ||
> 	    (iocb_is_dsync(dio->iocb) && !use_fua) ||
>             ((dio->flags & IOMAP_DIO_WRITE) && pos >= i_size_read(inode)))
>                 dio->iocb->ki_flags &= ~IOCB_HIPRI;

Looks like you are right, it would not be a great idea to handle that
off polled IO completion. It'd work just fine, but anything generating
more IO should go to a helper. I'll make that change.

> or if we change the logic such that calculate IOMAP_DIO_INLINE_COMP
> first:
> 
> 	if (!(dio->flags & IOMAP_DIO_INLINE_COMP))
> 		dio->iocb->ki_flags &= ~IOCB_HIPRI;
> 
> Then we don't need to care about polled IO on the completion side at
> all at the iomap layer because it doesn't change the completion
> requirements at all...

That still isn't true, because you can still happily issue as polled IO
and get it cleared and now have an IRQ based completion. This would work
for most cases, but eg xfs dio end_io handler will grab:

spin_lock(&ip->i_flags_lock);

if the inode got truncated. Maybe that can't happen because we did
inode_dio_begin() higher up? Still seems saner to check for the polled
flag at completion to me...

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/8] iomap: completed polled IO inline
  2023-07-22  3:10     ` Jens Axboe
@ 2023-07-22 23:05       ` Dave Chinner
  2023-07-24 22:35         ` Jens Axboe
  0 siblings, 1 reply; 39+ messages in thread
From: Dave Chinner @ 2023-07-22 23:05 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, linux-xfs, hch, andres

On Fri, Jul 21, 2023 at 09:10:57PM -0600, Jens Axboe wrote:
> On 7/21/23 3:43 PM, Dave Chinner wrote:
> > On Thu, Jul 20, 2023 at 12:13:06PM -0600, Jens Axboe wrote:
> >> Polled IO is only allowed for conditions where task completion is safe
> >> anyway, so we can always complete it inline. This cannot easily be
> >> checked with a submission side flag, as the block layer may clear the
> >> polled flag and turn it into a regular IO instead. Hence we need to
> >> check this at completion time. If REQ_POLLED is still set, then we know
> >> that this IO was successfully polled, and is completing in task context.
> >>
> >> Signed-off-by: Jens Axboe <[email protected]>
> >> ---
> >>  fs/iomap/direct-io.c | 14 ++++++++++++--
> >>  1 file changed, 12 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> >> index 9f97d0d03724..c3ea1839628f 100644
> >> --- a/fs/iomap/direct-io.c
> >> +++ b/fs/iomap/direct-io.c
> >> @@ -173,9 +173,19 @@ void iomap_dio_bio_end_io(struct bio *bio)
> >>  	}
> >>  
> >>  	/*
> >> -	 * Flagged with IOMAP_DIO_INLINE_COMP, we can complete it inline
> >> +	 * Flagged with IOMAP_DIO_INLINE_COMP, we can complete it inline.
> >> +	 * Ditto for polled requests - if the flag is still at completion
> >> +	 * time, then we know the request was actually polled and completion
> >> +	 * is called from the task itself. This is why we need to check it
> >> +	 * here rather than flag it at issue time.
> >>  	 */
> >> -	if (dio->flags & IOMAP_DIO_INLINE_COMP) {
> >> +	if ((dio->flags & IOMAP_DIO_INLINE_COMP) || (bio->bi_opf & REQ_POLLED)) {
> > 
> > This still smells wrong to me. Let me see if I can work out why...
> > 
> > <spelunk!>
> > 
> > When we set up the IO in iomap_dio_bio_iter(), we do this:
> > 
> >         /*
> >          * We can only poll for single bio I/Os.
> >          */
> >         if (need_zeroout ||
> >             ((dio->flags & IOMAP_DIO_WRITE) && pos >= i_size_read(inode)))
> >                 dio->iocb->ki_flags &= ~IOCB_HIPRI;
> > 
> > The "need_zeroout" covers writes into unwritten regions that require
> > conversion at IO completion, and the latter check is for writes
> > extending EOF. i.e. this covers the cases where we have dirty
> > metadata for this specific write and so may need transactions or
> > journal/metadata IO during IO completion.
> > 
> > The only case it doesn't cover is clearing IOCB_HIPRI for O_DSYNC IO
> > that may require a call to generic_write_sync() in completion. That
> > is, if we aren't using FUA, will not have IOMAP_DIO_INLINE_COMP set,
> > but still do polled IO.
> > 
> > I think this is a bug. We don't want to be issuing more IO in
> > REQ_POLLED task context during IO completion, and O_DSYNC IO
> > completion for non-FUA IO requires a journal flush and that can
> > issue lots of journal IO and wait on it in completion process.
> > 
> > Hence I think we should only be setting REQ_POLLED in the cases
> > where IOCB_HIPRI and IOMAP_DIO_INLINE_COMP are both set.  If
> > IOMAP_DIO_INLINE_COMP is set on the dio, then it doesn't matter what
> > context we are in at completion time or whether REQ_POLLED was set
> > or cleared during the IO....
> > 
> > That means the above check should be:
> > 
> >         /*
> >          * We can only poll for single bio I/Os that can run inline
> > 	 * completion.
> >          */
> >         if (need_zeroout ||
> > 	    (iocb_is_dsync(dio->iocb) && !use_fua) ||
> >             ((dio->flags & IOMAP_DIO_WRITE) && pos >= i_size_read(inode)))
> >                 dio->iocb->ki_flags &= ~IOCB_HIPRI;
> 
> Looks like you are right, it would not be a great idea to handle that
> off polled IO completion. It'd work just fine, but anything generating
> more IO should go to a helper. I'll make that change.
> 
> > or if we change the logic such that calculate IOMAP_DIO_INLINE_COMP
> > first:
> > 
> > 	if (!(dio->flags & IOMAP_DIO_INLINE_COMP))
> > 		dio->iocb->ki_flags &= ~IOCB_HIPRI;
> > 
> > Then we don't need to care about polled IO on the completion side at
> > all at the iomap layer because it doesn't change the completion
> > requirements at all...
> 
> That still isn't true, because you can still happily issue as polled IO
> and get it cleared and now have an IRQ based completion. This would work
> for most cases, but eg xfs dio end_io handler will grab:
> 
> spin_lock(&ip->i_flags_lock);
> 
> if the inode got truncated. Maybe that can't happen because we did
> inode_dio_begin() higher up?

Yes, truncate, hole punch, etc block on inode_dio_wait() with the
i_rwsem held which means it blocks new DIO submissions and waits
until all in-flight DIO has completed before the truncate operation starts.
inode_dio_complete() does not get called until after the filesystem
->endio completion has run, so there's no possibility of
truncate-like operations actually racing with DIO completion at
all...

> Still seems saner to check for the polled
> flag at completion to me...

I disagree. If truncate (or anything that removes extents or reduces
inode size) is running whilst DIO to that range is still in
progress, we have a use-after-free situation that will cause data
and/or filesystem corruption. It's just not a safe thing to allow,
so we prevent it from occurring at a high level in the filesystem
and the result is that low level IO code just doesn't need to
care about races with layout/size changing operations...

-Dave.
-- 
Dave Chinner
[email protected]

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/8] iomap: completed polled IO inline
  2023-07-22 23:05       ` Dave Chinner
@ 2023-07-24 22:35         ` Jens Axboe
  0 siblings, 0 replies; 39+ messages in thread
From: Jens Axboe @ 2023-07-24 22:35 UTC (permalink / raw)
  To: Dave Chinner; +Cc: io-uring, linux-xfs, hch, andres

On 7/22/23 5:05 PM, Dave Chinner wrote:
> On Fri, Jul 21, 2023 at 09:10:57PM -0600, Jens Axboe wrote:
>> On 7/21/23 3:43 PM, Dave Chinner wrote:
>>> On Thu, Jul 20, 2023 at 12:13:06PM -0600, Jens Axboe wrote:
>>>> Polled IO is only allowed for conditions where task completion is safe
>>>> anyway, so we can always complete it inline. This cannot easily be
>>>> checked with a submission side flag, as the block layer may clear the
>>>> polled flag and turn it into a regular IO instead. Hence we need to
>>>> check this at completion time. If REQ_POLLED is still set, then we know
>>>> that this IO was successfully polled, and is completing in task context.
>>>>
>>>> Signed-off-by: Jens Axboe <[email protected]>
>>>> ---
>>>>  fs/iomap/direct-io.c | 14 ++++++++++++--
>>>>  1 file changed, 12 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
>>>> index 9f97d0d03724..c3ea1839628f 100644
>>>> --- a/fs/iomap/direct-io.c
>>>> +++ b/fs/iomap/direct-io.c
>>>> @@ -173,9 +173,19 @@ void iomap_dio_bio_end_io(struct bio *bio)
>>>>  	}
>>>>  
>>>>  	/*
>>>> -	 * Flagged with IOMAP_DIO_INLINE_COMP, we can complete it inline
>>>> +	 * Flagged with IOMAP_DIO_INLINE_COMP, we can complete it inline.
>>>> +	 * Ditto for polled requests - if the flag is still at completion
>>>> +	 * time, then we know the request was actually polled and completion
>>>> +	 * is called from the task itself. This is why we need to check it
>>>> +	 * here rather than flag it at issue time.
>>>>  	 */
>>>> -	if (dio->flags & IOMAP_DIO_INLINE_COMP) {
>>>> +	if ((dio->flags & IOMAP_DIO_INLINE_COMP) || (bio->bi_opf & REQ_POLLED)) {
>>>
>>> This still smells wrong to me. Let me see if I can work out why...
>>>
>>> <spelunk!>
>>>
>>> When we set up the IO in iomap_dio_bio_iter(), we do this:
>>>
>>>         /*
>>>          * We can only poll for single bio I/Os.
>>>          */
>>>         if (need_zeroout ||
>>>             ((dio->flags & IOMAP_DIO_WRITE) && pos >= i_size_read(inode)))
>>>                 dio->iocb->ki_flags &= ~IOCB_HIPRI;
>>>
>>> The "need_zeroout" covers writes into unwritten regions that require
>>> conversion at IO completion, and the latter check is for writes
>>> extending EOF. i.e. this covers the cases where we have dirty
>>> metadata for this specific write and so may need transactions or
>>> journal/metadata IO during IO completion.
>>>
>>> The only case it doesn't cover is clearing IOCB_HIPRI for O_DSYNC IO
>>> that may require a call to generic_write_sync() in completion. That
>>> is, if we aren't using FUA, will not have IOMAP_DIO_INLINE_COMP set,
>>> but still do polled IO.
>>>
>>> I think this is a bug. We don't want to be issuing more IO in
>>> REQ_POLLED task context during IO completion, and O_DSYNC IO
>>> completion for non-FUA IO requires a journal flush and that can
>>> issue lots of journal IO and wait on it in completion process.
>>>
>>> Hence I think we should only be setting REQ_POLLED in the cases
>>> where IOCB_HIPRI and IOMAP_DIO_INLINE_COMP are both set.  If
>>> IOMAP_DIO_INLINE_COMP is set on the dio, then it doesn't matter what
>>> context we are in at completion time or whether REQ_POLLED was set
>>> or cleared during the IO....
>>>
>>> That means the above check should be:
>>>
>>>         /*
>>>          * We can only poll for single bio I/Os that can run inline
>>> 	 * completion.
>>>          */
>>>         if (need_zeroout ||
>>> 	    (iocb_is_dsync(dio->iocb) && !use_fua) ||
>>>             ((dio->flags & IOMAP_DIO_WRITE) && pos >= i_size_read(inode)))
>>>                 dio->iocb->ki_flags &= ~IOCB_HIPRI;
>>
>> Looks like you are right, it would not be a great idea to handle that
>> off polled IO completion. It'd work just fine, but anything generating
>> more IO should go to a helper. I'll make that change.
>>
>>> or if we change the logic such that calculate IOMAP_DIO_INLINE_COMP
>>> first:
>>>
>>> 	if (!(dio->flags & IOMAP_DIO_INLINE_COMP))
>>> 		dio->iocb->ki_flags &= ~IOCB_HIPRI;
>>>
>>> Then we don't need to care about polled IO on the completion side at
>>> all at the iomap layer because it doesn't change the completion
>>> requirements at all...
>>
>> That still isn't true, because you can still happily issue as polled IO
>> and get it cleared and now have an IRQ based completion. This would work
>> for most cases, but eg xfs dio end_io handler will grab:
>>
>> spin_lock(&ip->i_flags_lock);
>>
>> if the inode got truncated. Maybe that can't happen because we did
>> inode_dio_begin() higher up?
> 
> Yes, truncate, hole punch, etc block on inode_dio_wait() with the
> i_rwsem held which means it blocks new DIO submissions and waits
> until all in-flight DIO has completed before the truncate operation starts.
> inode_dio_complete() does not get called until after the filesystem
> ->endio completion has run, so there's no possibility of
> truncate-like operations actually racing with DIO completion at
> all...

OK so that part is all fine at least. This means we'll never hit that
lock, as we know we'll be within the range.

>> Still seems saner to check for the polled
>> flag at completion to me...
> 
> I disagree. If truncate (or anything that removes extents or reduces
> inode size) is running whilst DIO to that range is still in
> progress, we have a use-after-free situation that will cause data
> and/or filesystem corruption. It's just not a safe thing to allow,
> so we prevent it from occurring at a high level in the filesystem
> and the result is that low level IO code just doesn't need to
> care about races with layout/size changing operations...

I looked at the rest of the completion side, to ensure that writes would
be sane. And one thing that is problematic _without_ checking for POLLED
is invalidating inode pages post IO completion. That can happen if
someone is doing buffered IO on the same file, which arguably is silly
and not really supported, but we need to do it and it needs task context
to do so.

So I think I'll just simplify things a bit - get rid of the inline
completion for writes, and only have them be doable via CALLER_COMP
(which was DEFER_COMP before). Then we always have task context. This
doesn't change the other side, we'll still only do them for the cases
outlined. But it does mean that non-CALLER_COMP can't ever set
INLINE_COMP, and that latter flag just goes away.
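
IOWs the completion side decision then becomes roughly this (sketch of
the plan, not the final code):

	if (dio->wait_for_completion) {
		/* sync dio - just wake the submitting task */
	} else if (!(dio->flags & IOMAP_DIO_WRITE)) {
		/* reads can always complete inline */
	} else if (dio->flags & IOMAP_DIO_CALLER_COMP) {
		/*
		 * Writes flagged by the issuer: set iocb->dio_complete
		 * and iocb->private, call ->ki_complete(), and let the
		 * issuer run dio_complete() from task context.
		 */
	} else {
		/* everything else gets punted to the workqueue */
	}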

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 1/8] iomap: cleanup up iomap_dio_bio_end_io()
  2023-07-24 22:55 [PATCHSET v6 0/8] Improve async iomap DIO performance Jens Axboe
@ 2023-07-24 22:55 ` Jens Axboe
  0 siblings, 0 replies; 39+ messages in thread
From: Jens Axboe @ 2023-07-24 22:55 UTC (permalink / raw)
  To: io-uring, linux-xfs; +Cc: hch, andres, david, djwong, Jens Axboe

Make the logic a bit easier to follow:

1) Add a release_bio out path, as everybody needs to touch that, and
   have our bio ref check jump there if it's non-zero.
2) Add a kiocb local variable.
3) Add comments for each of the three conditions (sync, inline, or
   async workqueue punt).

No functional changes in this patch.

Reviewed-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
---
 fs/iomap/direct-io.c | 46 +++++++++++++++++++++++++++++---------------
 1 file changed, 31 insertions(+), 15 deletions(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index ea3b868c8355..0ce60e80c901 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -152,27 +152,43 @@ void iomap_dio_bio_end_io(struct bio *bio)
 {
 	struct iomap_dio *dio = bio->bi_private;
 	bool should_dirty = (dio->flags & IOMAP_DIO_DIRTY);
+	struct kiocb *iocb = dio->iocb;
 
 	if (bio->bi_status)
 		iomap_dio_set_error(dio, blk_status_to_errno(bio->bi_status));
+	if (!atomic_dec_and_test(&dio->ref))
+		goto release_bio;
 
-	if (atomic_dec_and_test(&dio->ref)) {
-		if (dio->wait_for_completion) {
-			struct task_struct *waiter = dio->submit.waiter;
-			WRITE_ONCE(dio->submit.waiter, NULL);
-			blk_wake_io_task(waiter);
-		} else if (dio->flags & IOMAP_DIO_WRITE) {
-			struct inode *inode = file_inode(dio->iocb->ki_filp);
-
-			WRITE_ONCE(dio->iocb->private, NULL);
-			INIT_WORK(&dio->aio.work, iomap_dio_complete_work);
-			queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work);
-		} else {
-			WRITE_ONCE(dio->iocb->private, NULL);
-			iomap_dio_complete_work(&dio->aio.work);
-		}
+	/*
+	 * Synchronous dio, the task itself will handle any completion
+	 * work needed after IO. All we need to do is wake the task.
+	 */
+	if (dio->wait_for_completion) {
+		struct task_struct *waiter = dio->submit.waiter;
+
+		WRITE_ONCE(dio->submit.waiter, NULL);
+		blk_wake_io_task(waiter);
+		goto release_bio;
+	}
+
+	/* Read completion can always complete inline. */
+	if (!(dio->flags & IOMAP_DIO_WRITE)) {
+		WRITE_ONCE(iocb->private, NULL);
+		iomap_dio_complete_work(&dio->aio.work);
+		goto release_bio;
 	}
 
+	/*
+	 * Async DIO completion that requires filesystem level completion work
+	 * gets punted to a work queue to complete as the operation may require
+	 * more IO to be issued to finalise filesystem metadata changes or
+	 * guarantee data integrity.
+	 */
+	WRITE_ONCE(iocb->private, NULL);
+	INIT_WORK(&dio->aio.work, iomap_dio_complete_work);
+	queue_work(file_inode(iocb->ki_filp)->i_sb->s_dio_done_wq,
+			&dio->aio.work);
+release_bio:
 	if (should_dirty) {
 		bio_check_pages_dirty(bio);
 	} else {
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2023-07-24 22:55 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-07-20 18:13 [PATCHSET v4 0/8] Improve async iomap DIO performance Jens Axboe
2023-07-20 18:13 ` [PATCH 1/8] iomap: cleanup up iomap_dio_bio_end_io() Jens Axboe
2023-07-21  6:14   ` Christoph Hellwig
2023-07-21 15:13   ` Darrick J. Wong
2023-07-20 18:13 ` [PATCH 2/8] iomap: add IOMAP_DIO_INLINE_COMP Jens Axboe
2023-07-21  6:14   ` Christoph Hellwig
2023-07-21 15:16   ` Darrick J. Wong
2023-07-20 18:13 ` [PATCH 3/8] iomap: treat a write through cache the same as FUA Jens Axboe
2023-07-21  6:15   ` Christoph Hellwig
2023-07-21 14:04     ` Jens Axboe
2023-07-21 15:55       ` Darrick J. Wong
2023-07-21 16:03         ` Jens Axboe
2023-07-20 18:13 ` [PATCH 4/8] iomap: completed polled IO inline Jens Axboe
2023-07-21  6:16   ` Christoph Hellwig
2023-07-21 15:19   ` Darrick J. Wong
2023-07-21 21:43   ` Dave Chinner
2023-07-22  3:10     ` Jens Axboe
2023-07-22 23:05       ` Dave Chinner
2023-07-24 22:35         ` Jens Axboe
2023-07-22 16:54     ` Jens Axboe
2023-07-20 18:13 ` [PATCH 5/8] iomap: only set iocb->private for polled bio Jens Axboe
2023-07-21  6:18   ` Christoph Hellwig
2023-07-21 15:35   ` Darrick J. Wong
2023-07-21 15:37     ` Jens Axboe
2023-07-20 18:13 ` [PATCH 6/8] fs: add IOCB flags related to passing back dio completions Jens Axboe
2023-07-21  6:18   ` Christoph Hellwig
2023-07-21 15:48   ` Darrick J. Wong
2023-07-21 15:53     ` Jens Axboe
2023-07-20 18:13 ` [PATCH 7/8] io_uring/rw: add write support for IOCB_DIO_DEFER Jens Axboe
2023-07-21  6:19   ` Christoph Hellwig
2023-07-21 15:50   ` Darrick J. Wong
2023-07-21 15:53     ` Jens Axboe
2023-07-20 18:13 ` [PATCH 8/8] iomap: support IOCB_DIO_DEFER Jens Axboe
2023-07-21  6:19   ` Christoph Hellwig
2023-07-21 16:01   ` Darrick J. Wong
2023-07-21 16:30     ` Jens Axboe
2023-07-21 22:05   ` Dave Chinner
2023-07-22  3:12     ` Jens Axboe
2023-07-24 22:55 [PATCHSET v6 0/8] Improve async iomap DIO performance Jens Axboe
2023-07-24 22:55 ` [PATCH 1/8] iomap: cleanup up iomap_dio_bio_end_io() Jens Axboe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox